# <center> <font color=#B40404>Recommendation Systems Project</center>
    
## <center> <font color=#B40404>Submitted by Utathya Ghosh</center>
    
### <center> <font color=#B40404>Batch - AIML Online Jan 21-A</center>

# <center>PROJECT BASED</center>
# <center>TOTAL SCORE - 60</center>

DOMAIN: Smartphone, Electronics

CONTEXT: India is the second largest smartphone market in the world after China. About 134 million smartphones were sold across India 
in 2017, and sales are estimated to rise to about 442 million in 2022. India ranked second across Asia Pacific in the average time 
smartphone users spend on the mobile web. The combination of very high sales volumes and average smartphone consumer behaviour has 
made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they 
are buying a product vs. 15% who turn to social media. If a seller succeeds in presenting smartphones that match a user's behaviour/choice at the 
right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system 
based on an individual consumer's behaviour or choice.

DATA DESCRIPTION:

    • author : name of the person who gave the rating
    • country : country the person who gave the rating belongs to
    • date : date of the rating
    • domain: website from which the rating was taken
    • extract: rating content
    • language: language in which the rating was given
    • product: name of the product/mobile phone for which the rating was given
    • score: average rating for the phone
    • score_max: highest rating given for the phone
    • source: source from where the rating was taken
    *Data source: 

PROJECT OBJECTIVE: We will build a recommendation system using popularity based and collaborative filtering methods to recommend 
mobile phones to a user which are most popular and personalised respectively.

Steps and tasks: [ Total Score: 60 points]

In [1]:
# Load required packages
import pandas as pd
import numpy as np
import os
from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise import KNNWithMeans
from surprise import accuracy
from collections import defaultdict
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

#### 1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps. 
###### • Merge the provided CSVs into one data-frame.

In [2]:
# Function to extract names of all files from OS
def find_csv_filenames(path_to_dir, suffix=".csv" ):
    filenames = os.listdir(path_to_dir)
    return([filename for filename in filenames if filename.endswith( suffix )])

In [3]:
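# Load all files and join them into one dataframe
# (each CSV is read once and concatenated in a single call instead of growing the frame in a loop)
csv_dir = os.path.abspath('')
df_master = pd.concat(
    [pd.read_csv(os.path.join(csv_dir, f)) for f in find_csv_filenames(csv_dir)])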
    
# Reset index of the combined dataframe
df_master.reset_index(drop=True, inplace=True)

# Display combined dataframe basic information
df_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1415133 entries, 0 to 1415132
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 118.8+ MB


###### • Check a few observations and shape of the data-frame.

In [4]:
# View random records
df_master.sample(30)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
291533,/cellphones/samsung-galaxy-note-4/,12/11/2014,es,es,Amazon,amazon.es,10.0,10.0,Llevo bastante tiempo utilizando productos de ...,Encarni Torres,"Samsung Galaxy Note 4 - Smartphone de 5.7"" (25..."
1115835,/cellphones/sony-ericsson-xperia-x10/,2/19/2011,ru,ru,Yandex,market.yandex.ru,2.0,10.0,Всем доброго дня!,,Sony Ericsson Xperia X10
425363,/cellphones/lenovo-vibe-k4-note/,4/5/2016,en,in,Amazon,amazon.in,10.0,10.0,Very good phone in this price I love it,Sanjeev Kumar Shukla,"Lenovo Vibe K4 Note (Black, 16GB)"
871490,/cellphones/sony-ericsson-xperia-neo-v/,28/11/2012,ru,ru,Yandex,market.yandex.ru,10.0,10.0,"??????.??????????????, ?????????�???? ???? ???...",smpoytaht,Sony Ericsson Xperia neo V
966370,/cellphones/nokia-asha-300/,6/15/2012,fr,fr,Amazon,amazon.fr,8.0,10.0,tactile pas terrible pas de reglage sensibilit...,sa dav972,Nokia Asha 300 Téléphone Portable Bluetooth/Wi...
495842,/cellphones/wiko-rainbow/,9/17/2016,it,it,Amazon,amazon.it,6.0,10.0,Mi ha accompagnato per un anno e mezzo ed ha s...,Nello Tortorelli,"WIKO Rainbow Smartphone, Dual SIM, Fucsia"
1093444,/cellphones/nokia-c5/,12/31/2011,ru,ru,Yandex,market.yandex.ru,8.0,10.0,в целом телефоном отличный очень доволен покуп...,,Nokia C5-00
259747,/cellphones/elephone-p8000/,9/11/2015,de,de,Amazon,amazon.de,10.0,10.0,Das Handy ist eine sehr gute Wahl und wir könn...,Gonzalez Jose und Carola,Elephone P8000 4G FDD-LTE Android 5.1 Smartpho...
882344,/cellphones/apple-iphone-4s/,17/1/2012,en,gb,Amazon,amazon.co.uk,10.0,10.0,really pleased with the phone - thanks,mb,Apple iPhone 4S 16GB Smartphone - Black - Voda...
1170019,/cellphones/samsung-e1360/,10/13/2009,ru,ru,Yandex,market.yandex.ru,10.0,10.0,"В общем, для своей цены - совершенно замечател...",,Samsung E1360


In [5]:
# Display the shape and size
print("Shape of Master dataset: ", df_master.shape)
print("Size of Master dataset: ", df_master.size)

Shape of Master dataset:  (1415133, 11)
Size of Master dataset:  15566463


###### • Round off scores to the nearest integers.

In [6]:
# View unique values to gauge the format
print(np.unique(df_master["score"]))
print(np.unique(df_master["score_max"]))

[0.2 0.4 0.6 ... nan nan nan]
[10. nan nan ... nan nan nan]


###### <font color=#B40404>The 'score' column has many null values. Because the models built later depend on this column, imputing it would risk adding bias and error. We therefore drop the records where 'score' is null and impute any remaining nulls in 'score_max' with 0.

In [7]:
# Dropping Null values in score
df_master.dropna(axis=0, subset=['score'], inplace=True)

In [8]:
# Rounding to the nearest integer with np.around, then converting float to int32
df_master["score_max"] = np.around(df_master["score_max"].replace({np.nan: 0})).astype("int32")
df_master["score"] = np.around(df_master["score"]).astype("int32")
df_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1351644 entries, 0 to 1415132
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   phone_url  1351644 non-null  object
 1   date       1351644 non-null  object
 2   lang       1351644 non-null  object
 3   country    1351644 non-null  object
 4   source     1351644 non-null  object
 5   domain     1351644 non-null  object
 6   score      1351644 non-null  int32 
 7   score_max  1351644 non-null  int32 
 8   extract    1332699 non-null  object
 9   author     1291038 non-null  object
 10  product    1351643 non-null  object
dtypes: int32(2), object(9)
memory usage: 113.4+ MB
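One detail of the rounding step is worth knowing: `np.around` sends exact halves to the nearest even integer, so a score of 8.5 would become 8 rather than 9 if such values occur in the data. The sketch below uses only the standard library to contrast that rule (Python's built-in `round` follows it as well) with a half-up alternative. `round_half_up` and the sample scores are hypothetical and not part of the notebook.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, sending exact halves upwards (hypothetical helper)."""
    return math.floor(x + 0.5)

# Illustrative scores only, not taken from the dataset
scores = [0.5, 1.5, 2.5, 8.5, 9.4]

# Built-in round() sends exact halves to the nearest even integer
half_even = [round(s) for s in scores]

# The helper sends exact halves up instead
half_up = [round_half_up(s) for s in scores]

print(half_even)
print(half_up)
```

Which rule is preferable depends on how the ratings should be interpreted; the notebook's results use the half-to-even behaviour.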


###### • Check for missing values. Impute the missing values if there is any.

###### <font color=#B40404>We have already noted that there are many null values. The nulls in the 'author' and 'product' columns are the most troublesome: these two columns will become our user and item lists, and missing values there may introduce biases and errors. Hence we drop all records with a null 'author' or 'product' and impute the nulls in every other column.

In [9]:
# Dropping records with null values in author and product
df_master.dropna(axis=0, subset=['author', 'product'], inplace=True)

# Imputing remaining null values to ' '
df_master = df_master.replace({np.nan: ' '})

# Viewing basic information of dataframe
df_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1291038 entries, 0 to 1415132
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   phone_url  1291038 non-null  object
 1   date       1291038 non-null  object
 2   lang       1291038 non-null  object
 3   country    1291038 non-null  object
 4   source     1291038 non-null  object
 5   domain     1291038 non-null  object
 6   score      1291038 non-null  int32 
 7   score_max  1291038 non-null  int32 
 8   extract    1291038 non-null  object
 9   author     1291038 non-null  object
 10  product    1291038 non-null  object
dtypes: int32(2), object(9)
memory usage: 108.3+ MB


###### • Check for duplicate values and remove them if there is any.

In [10]:
# Remove Duplicate values
df_master = df_master.loc[~df_master.duplicated(keep = 'first')]

# Display basic information
df_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1286361 entries, 0 to 1415132
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   phone_url  1286361 non-null  object
 1   date       1286361 non-null  object
 2   lang       1286361 non-null  object
 3   country    1286361 non-null  object
 4   source     1286361 non-null  object
 5   domain     1286361 non-null  object
 6   score      1286361 non-null  int32 
 7   score_max  1286361 non-null  int32 
 8   extract    1286361 non-null  object
 9   author     1286361 non-null  object
 10  product    1286361 non-null  object
dtypes: int32(2), object(9)
memory usage: 108.0+ MB


###### • Keep only 1000000 data samples. Use random state=612.

In [11]:
# Keep only 1000000 records
df_tel = df_master.sample(1000000, random_state=612).copy().reset_index(drop=True)

###### • Drop irrelevant features. Keep features like Author, Product, and Score.

###### <font color=#B40404>We drop all features except for: -
- "Author"  --- Changed to 'user'
- "Product" --- Changed to 'item'
- "Score"   --- Changed to 'rating'
- "country" --- We will keep this for the time being for a particular purpose, post which it will be promptly dropped

In [12]:
# Keeping only relevant columns
df_tel = df_tel[["country", "author", "product", "score"]].copy()

# Renaming columns to more meaningful names
df_tel.columns = ["country", "user", "item", "rating"]

# Display dataframe
df_tel.head()

Unnamed: 0,country,user,item,rating
0,ar,VIRTUALKIO2008,LG MG220,8
1,it,Piero Montagno,"Nokia 130 Telefono Cellulare, Display da 1.8"",...",6
2,us,itsthemom,Samsung Galaxy S7 32GB (Verizon),6
3,in,Nisar Ahmed,"OnePlus 3 (Graphite, 64 GB)",8
4,gb,Milosav Pesic,Nokia Sony Ericsson C510 Black Mobile Phone Si...,10


###### <font color=#B40404>Before proceeding further we will also remove all leading and trailing white spaces from 'user' and 'item' columns and convert all users to lower case

In [13]:
# Stripping leading and trailing whitespace and converting users to lower case
df_tel['user'] = df_tel['user'].str.strip().str.lower()
df_tel['item'] = df_tel['item'].str.strip()

# Viewing dataframe
df_tel

Unnamed: 0,country,user,item,rating
0,ar,virtualkio2008,LG MG220,8
1,it,piero montagno,"Nokia 130 Telefono Cellulare, Display da 1.8"",...",6
2,us,itsthemom,Samsung Galaxy S7 32GB (Verizon),6
3,in,nisar ahmed,"OnePlus 3 (Graphite, 64 GB)",8
4,gb,milosav pesic,Nokia Sony Ericsson C510 Black Mobile Phone Si...,10
...,...,...,...,...
999995,us,aaron,Casio G'zOne Ravine 2 C781 Verizon Black,10
999996,gb,uhn,HTC 10 Sim Free Smartphone - Glacier Silver,10
999997,us,grigor,"BLU Vivo XL Smartphone - 5.5"" 4G LTE - GSM Unl...",10
999998,de,einer kundin,"Huawei P9 Lite Dual-SIM Smartphone, 13,2 cm (5...",10


#### 2. Answer the following questions
###### • Identify the most rated features.

In [14]:
# Display the most rated items
df_tel.groupby("item").count().sort_values("user", ascending=False)

item,country,user,rating
"Lenovo Vibe K4 Note (White,16GB)",4030,4030,4030
"Lenovo Vibe K4 Note (Black, 16GB)",3405,3405,3405
"OnePlus 3 (Graphite, 64 GB)",3187,3187,3187
"OnePlus 3 (Soft Gold, 64 GB)",2746,2746,2746
Huawei P8lite zwart / 16 GB,2104,2104,2104
...,...,...,...
Nokia E6-00 Zwart,1,1,1
"Nokia E6-00 - Móvil libre (pantalla táctil de 2,46"" 640 x 480, 8 GB de capacidad, teclado QWERTZ alemán, S.O....",1,1,1
Nokia E6 Sim Free Mobile Phone - White,1,1,1
Nokia E6 Sim Free Mobile Phone - Silver,1,1,1


###### • Identify the users with most number of reviews.

In [15]:
df_tel.groupby("user").count().sort_values("item", ascending=False)

user,country,item,rating
amazon customer,59929,59929,59929
cliente amazon,14947,14947,14947
e-bit,6609,6609,6609
client d'amazon,5976,5976,5976
amazon kunde,3690,3690,3690
...,...,...,...
henry g.,1,1,1
henry fuentes,1,1,1
henry franz,1,1,1
henry f.,1,1,1


###### • Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final dataset.

In [16]:
# Build a new dataframe with users and item with more than 50 ratings
df_tel50 = df_tel[df_tel.groupby("user")["item"].transform('count')>50].copy()
df_tel50 = df_tel50[df_tel50.groupby("item")["user"].transform('count')>50].copy()

# Display shape of new dataframe
df_tel50.shape

(68452, 4)
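Because the user filter runs before the item filter, removing items can leave some users with 50 or fewer ratings in the final dataset. If both conditions must hold at the same time, one option is to repeat the two filters until nothing changes. The sketch below shows the idea on plain `(user, item)` pairs with a toy threshold; `filter_min_counts` and the toy data are assumptions for illustration.

```python
from collections import Counter

def filter_min_counts(pairs, min_count):
    """Keep (user, item) pairs whose user and item both appear more than
    min_count times, repeating until the set of pairs stops shrinking."""
    pairs = list(pairs)
    while True:
        user_counts = Counter(user for user, _ in pairs)
        item_counts = Counter(item for _, item in pairs)
        kept = [(u, i) for u, i in pairs
                if user_counts[u] > min_count and item_counts[i] > min_count]
        if len(kept) == len(pairs):
            return kept
        pairs = kept

# Toy data with a threshold of 1 instead of 50
toy = [("ann", "p1"), ("ann", "p2"), ("bob", "p1"),
       ("bob", "p2"), ("cat", "p1"), ("cat", "p3")]
result = filter_min_counts(toy, min_count=1)
print(result)
```

In the toy data a single pass would keep `("cat", "p1")` even though "cat" has only one rating left once "p3" is removed; the repeated pass drops it.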

#### 3. Build a popularity based model and recommend top 5 mobile phones.

In [17]:
# Display top 5 mobile phones by rating in the world
# (the 'rating' column is selected before averaging so that non-numeric columns are not passed to mean())
print("Top 5 mobile phones in the world are :")
print(list(df_tel50.groupby("item")["rating"].mean().sort_values(ascending=False).head(5).index))

# Display top 5 mobile phones by rating by country
for cntry in np.unique(df_tel50["country"]):
    print("\nTop 5 mobile phones in ", cntry, " are :")
    print(list(
        df_tel50[df_tel50["country"] == cntry].groupby("item")["rating"].mean().sort_values(ascending=False).head(5).index))
    
# Drop country column as it has no more use
df_tel50.drop("country", axis=1, inplace=True)

Top 5 mobile phones in the world are :
['Samsung Galaxy Note5', 'Apple iPhone 7 4,7" 32 GB', 'Apple iPhone 6s 4,7" 128 GB', 'Smartphone Asus ZenFone 3 ZE552KL', 'Samsung Galaxy S7 32GB (T-Mobile)']

Top 5 mobile phones in  au  are :
['Motorola Moto G 3rd Gen - Black']

Top 5 mobile phones in  be  are :
['Samsung Galaxy S7 Edge wit / 32 GB', 'Samsung Galaxy S7 zwart / 32 GB', 'Samsung Galaxy S7 Edge zwart / 32 GB', 'Huawei P8 Lite wit / 16 GB', 'Samsung Galaxy S7 Edge goud / 32 GB']

Top 5 mobile phones in  br  are :
['Smartphone Asus ZenFone 3 ZE552KL', 'Smartphone LG G2 D805', 'Smartphone Motorola Moto Z Play XT1635', 'Smartphone Samsung Galaxy S7 SM-G930 32GB', 'Smartphone Apple iPhone 6 16GB']

Top 5 mobile phones in  ca  are :
['Lumia Microsoft Nokia Lumia 640 RM-1073 Unlocked Phone (Black)', 'Samsung Galaxy Note 3/S5 USB 3.0 5-Feet Data Cable, Non-Retail Packaging']

Top 5 mobile phones in  de  are :
['Apple iPhone 7 4,7" 32 GB', 'Apple iPhone 6s 4,7" 128 GB', 'Apple iPhone 6s Plu
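Ranking by mean rating alone treats an item with a handful of ratings in a country the same as one with hundreds. A common refinement is a damped mean, which pulls items with few ratings towards the global average. The sketch below is one possible version using only the standard library; `damped_mean_ranking`, the `damping` value and the toy data are assumptions, not part of the notebook.

```python
from collections import defaultdict

def damped_mean_ranking(ratings, damping=5.0, top_n=5):
    """Rank items by a damped mean: (sum + damping * global_mean) / (count + damping)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item, rating in ratings:
        sums[item] += rating
        counts[item] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    scores = {item: (sums[item] + damping * global_mean) / (counts[item] + damping)
              for item in sums}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy data: phone_a has a single perfect rating, phone_b has many high ratings
toy = [("phone_a", 10)] + [("phone_b", 9)] * 20 + [("phone_c", 4)] * 10
ranking = damped_mean_ranking(toy)
print(ranking)
```

With a plain mean, `phone_a` would come first on the strength of one rating; the damped mean places the consistently well-rated `phone_b` ahead of it.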

#### 4. Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch (Note: in case you're building it from scratch, you can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using kNNWithMeans from surprise. You can try both user-based and item-based models.

###### <font color=#B40404>First using SVD from surprise to build a collaborative filtering model.

In [18]:
# Building a reader
reader = Reader(rating_scale=(0, 10))

# Creating our data from df_tel50 dataframe
data = Dataset.load_from_df(df_tel50[["user", "item", "rating"]], reader)

# Initialising model with SVD from surprise
svd = SVD(verbose=True, n_epochs=10)

# Creating trainset and testset from data
trainset, testset = train_test_split(data, test_size=.2)

# Training model svd
svd.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x15f800036d0>

###### <font color=#B40404>Second using kNNWithMeans from surprise to build a user-based collaborative filtering model.

In [19]:
# Use kNNWithMeans from surprise to build a user-based collaborative filter model
algo_user = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo_user.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x15f832114c0>

###### <font color=#B40404>Third using kNNWithMeans from surprise to build an item-based collaborative filtering model.

In [20]:
# Use kNNWithMeans from surprise to build an item-based collaborative filter model
algo_item = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo_item.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x15f8321b3a0>

#### 5. Evaluate the collaborative model. Print RMSE value.

In [21]:
# Get prediction for svd based collaborative filtering model
# run the trained model against the testset
svd_pred = svd.test(testset)

# get RMSE for svd based collaborative filtering model
print("SVD-based Model : Test Set")
accuracy.rmse(svd_pred, verbose=True)

# Get prediction for user-based collaborative filtering model
# run the trained model against the testset
usr_pred = algo_user.test(testset)

# get RMSE for user-based collaborative filtering model
print("\nUser-based Model : Test Set")
accuracy.rmse(usr_pred, verbose=True)

# Get prediction for item-based collaborative filtering model
# run the trained model against the testset
itm_pred = algo_item.test(testset)

# get RMSE for item-based collaborative filtering model
print("\nItem-based Model : Test Set")
accuracy.rmse(itm_pred, verbose=True)

SVD-based Model : Test Set
RMSE: 2.5794

User-based Model : Test Set
RMSE: 2.6529

Item-based Model : Test Set
RMSE: 2.6543


2.654309141347851
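For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings. The sketch below computes it by hand on plain `(true_rating, estimated_rating)` pairs, which in the notebook would correspond to the `r_ui` and `est` fields of each prediction. The `rmse` helper and the toy pairs are illustrative assumptions.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true_r - est) ** 2 for true_r, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy pairs on the notebook's 0 to 10 scale
toy_pairs = [(10, 8), (6, 8), (8, 8), (2, 6)]
value = rmse(toy_pairs)
print(round(value, 4))
```

Because the errors are squared before averaging, a few badly missed ratings (such as the 2 predicted as 6 here) dominate the result.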

#### 6. Predict score (average rating) for test users. 

###### <font color=#B40404>First using SVD from surprise.

In [22]:
# Extract the user id and estimated rating from every prediction
ls_id = [pred.uid for pred in svd_pred]
ls_score = [pred.est for pred in svd_pred]

# Creating dataframe to display score
df_svd_score = pd.DataFrame({"user": ls_id, "rating": ls_score})

# Casting the estimated ratings to float32
df_svd_score["rating"] = df_svd_score["rating"].astype("float32")

# Displaying scores
print("Score (average rating) for all test users : ", df_svd_score.groupby("user").mean().mean())
print("\nScore (average rating) for individual test users : -")
df_svd_score.groupby("user").mean()

Score (average rating) for all test users :  rating    8.259321
dtype: float32

Score (average rating) for individual test users : -


user,rating
????????,8.923452
??????????,8.959071
????????????,9.333807
???????????? ????????,8.702247
???????????? ??????????????,8.436412
...,...
сергей,8.923686
татьяна,8.962490
юлия,9.306425
юрий,8.854916


###### <font color=#B40404>Second using user-based kNNWithMeans from surprise.

In [23]:
# Extract the user id and estimated rating from every prediction
ls_id = [pred.uid for pred in usr_pred]
ls_score = [pred.est for pred in usr_pred]

# Creating dataframe to display score
df_usr_score = pd.DataFrame({"user": ls_id, "rating": ls_score})

# Casting the estimated ratings to float32
df_usr_score["rating"] = df_usr_score["rating"].astype("float32")

# Displaying scores
print("Score (average rating) for all test users : ", df_usr_score.groupby("user").mean().mean())
print("\nScore (average rating) for individual test users : -")
df_usr_score.groupby("user").mean()

Score (average rating) for all test users :  rating    8.223465
dtype: float32

Score (average rating) for individual test users : -


user,rating
????????,9.063495
??????????,8.962032
????????????,9.469950
???????????? ????????,10.000000
???????????? ??????????????,7.859644
...,...
сергей,8.791846
татьяна,8.987241
юлия,9.399129
юрий,9.092880


###### <font color=#B40404>Third using item-based kNNWithMeans from surprise.

In [24]:
# Extract the user id and estimated rating from every prediction
ls_id = [pred.uid for pred in itm_pred]
ls_score = [pred.est for pred in itm_pred]

# Creating dataframe to display score
df_itm_score = pd.DataFrame({"user": ls_id, "rating": ls_score})

# Casting the estimated ratings to float32
df_itm_score["rating"] = df_itm_score["rating"].astype("float32")

# Displaying scores
print("Score (average rating) for all test users : ", df_itm_score.groupby("user").mean().mean())
print("\nScore (average rating) for individual test users : -")
df_itm_score.groupby("user").mean()

Score (average rating) for all test users :  rating    8.215114
dtype: float32

Score (average rating) for individual test users : -


user,rating
????????,9.097410
??????????,8.967369
????????????,9.461138
???????????? ????????,10.000000
???????????? ??????????????,7.859644
...,...
сергей,8.854269
татьяна,8.960356
юлия,9.320087
юрий,9.079974


#### 7. Report your findings and inferences.

###### <font color=#B40404>Based on the RMSE scores, SVD (2.5794) performs slightly better than both kNNWithMeans models (2.6529 user-based, 2.6543 item-based) on this test split. We will confirm this later with cross validation.
    
###### <font color=#B40404>However, all three have an RMSE of around 2.6 on a 0 to 10 scale, which in my view is quite high. Still, if the model is used in a scenario where predicting the exact rating is not critical, it is fast and does a good job considering how sparse the initial matrix was. To give the client an idea, about 23K out of 24K cells were empty when building the matrix factorisation table. In that respect our model has done well.
    
###### <font color=#B40404>Looking at the average predicted ratings, we can also assume that most people have given very high ratings. This could reflect a bias towards rating only when satisfied, which does not represent true feedback. However, since this model will be used to recommend phones to other users, this bias does not play spoilsport.
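The sparsity mentioned above can be measured as the share of user-item cells that hold no rating. The sketch below shows one way to compute it from plain `(user, item)` pairs; the `sparsity` helper and the toy data are assumptions for illustration and do not reproduce the notebook's figures.

```python
def sparsity(ratings):
    """Fraction of empty cells in the user-item matrix implied by (user, item) pairs."""
    users = {user for user, _ in ratings}
    items = {item for _, item in ratings}
    filled = len(set(ratings))  # distinct (user, item) cells that hold a rating
    return 1.0 - filled / (len(users) * len(items))

# Toy data: 3 users and 3 items give 9 cells, 4 of which are filled
toy = [("ann", "p1"), ("ann", "p2"), ("bob", "p1"), ("cat", "p3")]
value = sparsity(toy)
print(value)
```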

#### 8. Try and recommend top 5 products for test users.

In [25]:
# Function to get the top-n predicted items for each user (n defaults to 10)
def get_top_n(predictions, n=10):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return(top_n)

###### <font color=#B40404>Using SVD from surprise to recommend 5 products for each user, since the collaborative filtering model with SVD gave the lowest RMSE.

In [26]:
# Recommending top 5 products for each user
top_n = get_top_n(svd_pred, n=5)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print('\n', uid, [iid for (iid, _) in user_ratings])


 e-bit ['Smartphone Asus ZenFone 3 ZE552KL', 'Smartphone Asus ZenFone 3 ZE552KL', 'Smartphone Asus ZenFone 3 ZE552KL', 'Smartphone Asus ZenFone 3 ZE552KL', 'Smartphone Asus ZenFone 3 ZE552KL']

 amazon customer ['Motorola Moto G 3rd Generation LTE UK SIM-Free Smartphone - White', 'Motorola Moto G 3rd Generation LTE UK SIM-Free Smartphone - White', 'Motorola Moto G 3rd Generation LTE UK SIM-Free Smartphone - White', 'Motorola Moto G 3rd Generation LTE UK SIM-Free Smartphone - White', 'Motorola Moto G 3rd Generation LTE UK SIM-Free Smartphone - White']

 jay ['Samsung Galaxy S7 zwart / 32 GB', 'Samsung Galaxy S7 goud, roze / 32 GB', 'Sony Xperia Z3 UK SIM-Free Smartphone - Copper', 'Samsung Galaxy S3 mini (GT-I8200) Smartphone (4 Zoll (10,2 cm) Touch-Display, 8 GB Speicher, Android 4.2) blau', 'Motorola Moto E 2nd Generation (3G, White)']

 maria ['Lenovo Motorola Moto G Smartphone, 4,5 pollici display HD, processore Qualcomm, memoria 16GB, MicroSIM, Android 4.3 OS, fotocamera da 5 MP, 

 jackie ['LG Dare', 'Apple iPhone 5 Unlocked Cellphone, 32GB, Black']

 andy ['Honor 7 Smartphone (13,2 cm (5,2 Zoll) Touchscreen, 16GB interner Speicher, Android OS) grau', 'Motorola Moto G 4G SIM-Free Smartphone - Black (8GB) - Discontinued by manufacturer', 'Microsoft Nokia Lumia 630 Single-SIM Smartphone (4,5 Zoll (11,4 cm) Touch-Display, 8 GB Speicher, Windows 8.1) gelb', 'Huawei Nexus 6P unlocked smartphone, 32GB Gold (US Warranty)', 'Samsung Galaxy S4 mini Smartphone (10,9 cm (4,3 Zoll) Touch-Display, 8 GB Speicher, Android 4.2) schwarz']

 fabian ['Sony Xperia Z3 Compact Smartphone (11,7 cm (4,6 Zoll) HD-TRILUMINOS-Display, 2,5 GHz-Quad-Core-Prozessor, 20,7 Megapixel-Kamera, Android 4.4) meergrün']

 philip ['APPLE iPhone 7 - Silver, 32 GB', 'MICROSOFT Lumia 650 - 16 GB, Black', 'MICROSOFT Lumia 650 - 16 GB, Black', 'MICROSOFT Lumia 650 - 16 GB, Black', 'Microsoft Lumia 550 Smartphone (4,7 Zoll (11,9 cm) Touch-Display, 8 GB Speicher, Windows 10) schwarz']

 ?????????????? ['Son
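Several users in the output above are shown the same product more than once, which happens when the test set holds more than one prediction for the same user and item. A possible variant of `get_top_n` keeps only the best estimate per item before taking the top n. The sketch below works on plain tuples shaped like surprise predictions `(uid, iid, true_r, est, details)`; `get_top_n_unique` and the toy predictions are hypothetical.

```python
from collections import defaultdict

def get_top_n_unique(predictions, n=5):
    """Return the n highest-estimated distinct items for each user."""
    best = defaultdict(dict)
    for uid, iid, true_r, est, _ in predictions:
        # Keep only the highest estimate seen for each (user, item) pair
        if iid not in best[uid] or est > best[uid][iid]:
            best[uid][iid] = est
    return {uid: sorted(items.items(), key=lambda pair: pair[1], reverse=True)[:n]
            for uid, items in best.items()}

# Toy predictions: the first two rows repeat the same user and item
toy_predictions = [
    ("e-bit", "phone_a", 10, 9.5, {}),
    ("e-bit", "phone_a", 8, 9.5, {}),
    ("e-bit", "phone_b", 9, 9.1, {}),
    ("jay", "phone_c", 7, 8.0, {}),
]
top = get_top_n_unique(toy_predictions, n=2)
print(top)
```

Note that this still ranks items taken from the test set, so it suggests products the user has already rated; recommending unseen products would need predictions for user-item pairs outside the data.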

#### 9. Try cross validation techniques to get better results.

###### <font color=#B40404>Cross validation on SVD from surprise, evaluating by rmse.

In [27]:
# Set parameters for cross validation on SVD model
parameters = {'n_factors': [20, 50, 80],
              'reg_all': [0.04, 0.06],
              'n_epochs': [10, 20, 30],
              'lr_all': [0.002, 0.005, 0.01]}

# Collect one result row per parameter combination
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first)
rows = []
for nf in parameters['n_factors']:
    for ra in parameters['reg_all']:
        for ne in parameters['n_epochs']:
            for la in parameters['lr_all']:

                # Cross validate an SVD model with the current parameters
                gridsvd = SVD(n_factors=nf, reg_all=ra, n_epochs=ne, lr_all=la)
                ls_rmse = cross_validate(gridsvd, data, measures=['RMSE', 'MAE'], cv=3)['test_rmse']
                rows.append({'n_factors': nf, 'reg_all': ra, 'n_epochs': ne, 'lr_all': la,
                             'test_rmse': ls_rmse, 'mean_test_rmse': np.mean(ls_rmse)})

# Build the analysis dataframe
df_svd_analysis = pd.DataFrame(rows, columns=['n_factors', 'reg_all', 'n_epochs', 'lr_all', 'test_rmse', 'mean_test_rmse'])

# Displaying results
df_svd_analysis

Unnamed: 0,n_factors,reg_all,n_epochs,lr_all,test_rmse,mean_test_rmse
0,20,0.04,10,0.002,"[2.5706896694017938, 2.579327800211746, 2.5699...",2.573336
1,20,0.04,10,0.005,"[2.5924294588297934, 2.5702711336452335, 2.566...",2.57631
2,20,0.04,10,0.01,"[2.589401741892979, 2.6114616536935285, 2.5750...",2.591983
3,20,0.04,20,0.002,"[2.578473435336548, 2.554974930288214, 2.58542...",2.572959
4,20,0.04,20,0.005,"[2.5793983454343654, 2.5926310222757505, 2.610...",2.594079
5,20,0.04,20,0.01,"[2.666912759198316, 2.634583733850986, 2.66786...",2.656454
6,20,0.04,30,0.002,"[2.564954084322474, 2.576475882144381, 2.57015...",2.570527
7,20,0.04,30,0.005,"[2.614275565818265, 2.6004239487526597, 2.6069...",2.607209
8,20,0.04,30,0.01,"[2.661292688940282, 2.679344560958338, 2.67340...",2.671349
9,20,0.06,10,0.002,"[2.5706185037879457, 2.5778419775907278, 2.569...",2.572745


###### <font color=#B40404>Cross validation on kNNWithMeans from surprise, evaluating by RMSE. I have kept item-based collaborative filtering out of the cross validation steps because it took a lot of time and provided very little reduction in error.

In [28]:
# Set parameters for cross validation on kNNWithMeans model
parameters = {'name': ['MSD', 'cosine'],
              'user_based': [True]}

# Collect one result row per parameter combination
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first)
rows = []
for nm in parameters['name']:
    for ub in parameters['user_based']:
        for k in range(380, 530, 20):
            # Cross validate a kNNWithMeans model with the current parameters
            # Note: the 'min_k' column stores the k passed to KNNWithMeans, not its min_k argument
            gridknn = KNNWithMeans(k=k, sim_options={'name': nm, 'user_based': ub}, verbose=False)
            ls_rmse = cross_validate(gridknn, data, measures=['RMSE'], cv=3)['test_rmse']
            rows.append({'name': nm, 'user_based': ub, 'min_k': k,
                         'test_rmse': ls_rmse, 'mean_test_rmse': np.mean(ls_rmse)})

# Build the analysis dataframe
df_knn_analysis = pd.DataFrame(rows, columns=['name', 'user_based', 'min_k', 'test_rmse', 'mean_test_rmse'])

# Displaying results
df_knn_analysis

Unnamed: 0,name,user_based,min_k,test_rmse,mean_test_rmse
0,MSD,True,380,"[2.6015907592332606, 2.597817268354545, 2.5988...",2.599409
1,MSD,True,400,"[2.5885050402345033, 2.599681944927374, 2.6178...",2.602017
2,MSD,True,420,"[2.593713827846625, 2.5916015778490533, 2.5969...",2.594093
3,MSD,True,440,"[2.609416961311682, 2.604160388887175, 2.59183...",2.601803
4,MSD,True,460,"[2.6117382091657144, 2.601461201622084, 2.6047...",2.605979
5,MSD,True,480,"[2.602441384615055, 2.605005841661893, 2.60215...",2.6032
6,MSD,True,500,"[2.5857142654861507, 2.6094406523783893, 2.600...",2.598672
7,MSD,True,520,"[2.600482928201366, 2.598213808178769, 2.60359...",2.600764
8,cosine,True,380,"[2.5905326315195434, 2.5895438412591982, 2.592...",2.590699
9,cosine,True,400,"[2.573273957483209, 2.601368855529584, 2.60776...",2.594135


#### 10. In what business scenario should you use popularity based Recommendation Systems?

###### <font color=#B40404>We should use popularity based Recommendation Systems when there are too many cold start issues, that is, when we do not have much information on the ratings given by users on our platform. Such a system looks at popularity at a macro level and simply recommends the most popular products. When we do not have enough information on likes and dislikes at a user level, this becomes a viable option.

#### 11. In what business scenario should you use CF based Recommendation Systems?

###### <font color=#B40404>This type of recommendation system makes predictions of what might interest a person based on the taste of many other users. It assumes that if person X likes item A, and person Y likes item A and item B, then person X might like item B as well.
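The "X likes A, Y likes A and B" reasoning can be sketched in a few lines. The example below is a deliberately simplified neighbour-based recommender over sets of liked items, not the kNNWithMeans or SVD models used in this notebook; `recommend_from_neighbours` and the toy data are assumptions.

```python
def recommend_from_neighbours(likes, target):
    """Suggest items liked by users who share at least one liked item with target."""
    target_items = likes[target]
    scores = {}
    for other, other_items in likes.items():
        if other == target:
            continue
        overlap = len(target_items & other_items)
        if overlap == 0:
            continue
        # Each unseen item gets credit proportional to the shared taste
        for item in other_items - target_items:
            scores[item] = scores.get(item, 0) + overlap
    # Highest score first, ties broken alphabetically
    return sorted(scores, key=lambda item: (-scores[item], item))

likes = {
    "X": {"A"},
    "Y": {"A", "B"},
    "Z": {"C"},
}
suggestions = recommend_from_neighbours(likes, "X")
print(suggestions)
```

User Z shares nothing with X, so item C is never suggested; only Y's taste contributes.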

###### <font color=#B40404>The business should use CF based Recommendation Systems when we already have a healthy amount of data about users' historical choices, the list of items is too large for users to glance through quickly, and we have the technical capability to run Recommendation Systems. This would improve user retention, as customers would keep discovering relevant items. It is useful not only from an ROI perspective: even when customers do not purchase, we can still share relevant items with them and improve their engagement with the business. Healthy engagement from users creates an ecosystem that aligns customer needs with business objectives.

###### <font color=#B40404>The latent characteristics would also create baskets which could be used to further develop business strategies.

#### 12. What other possible methods can you think of which can further improve the recommendation for different users?

###### <font color=#B40404>Generally the best Recommendation Systems are a hybrid of a few types of systems. In our example, since we have so few ratings, it would be useful to serve certain items or users with popularity based recommendations. The recommendations should still contain a high proportion of collaborative filtering based results, which is our model in this example. To improve recommendations further, we should create a database of the technical specs of the phones in question and use it for content based filtering. Content based filtering could play a more prominent role when the user makes a specific search, or it could draw on their previous choices. Furthermore, we could process the 'extract' column and see whether we can recommend items based on what has been said about a different item. For example, if the sentiment towards a particular item's battery life is good, then a user who has rated that item highly would probably also like other items with matching sentiment.
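One simple way to combine popularity and collaborative filtering is a weighted blend whose weight depends on how much is known about the user. The sketch below is only one possible design; `hybrid_score`, the linear weighting and the `full_trust_at` threshold are all assumptions chosen for illustration.

```python
def hybrid_score(cf_score, popularity_score, n_user_ratings, full_trust_at=50):
    """Blend a collaborative filtering score with a popularity score.

    The weight on the CF score grows linearly with the number of ratings
    the user has given and is capped at 1.
    """
    weight = min(n_user_ratings / full_trust_at, 1.0)
    return weight * cf_score + (1.0 - weight) * popularity_score

# A new user relies on popularity, an active user on collaborative filtering
new_user = hybrid_score(cf_score=6.0, popularity_score=9.0, n_user_ratings=0)
half_way = hybrid_score(cf_score=6.0, popularity_score=9.0, n_user_ratings=25)
active_user = hybrid_score(cf_score=6.0, popularity_score=9.0, n_user_ratings=100)
print(new_user, half_way, active_user)
```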
    
###### <font color=#B40404>Lastly, our own model can be improved further by going through all the item descriptions and user handles. There seem to be a lot of garbage values here, so much more detailed processing of these categorical variables should improve our model.
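As one example of such processing, the most active "users" found earlier ('amazon customer', 'cliente amazon', "client d'amazon", 'amazon kunde') look like shared placeholder handles rather than individuals. A first clean-up step could drop them before modelling. The sketch below assumes that interpretation; `GENERIC_HANDLES`, `drop_generic_users` and the toy rows are hypothetical.

```python
# Placeholder handles seen among the top reviewers (assumed to be shared accounts)
GENERIC_HANDLES = {"amazon customer", "cliente amazon", "client d'amazon", "amazon kunde"}

def drop_generic_users(rows):
    """Remove (user, item, rating) rows whose user is a shared placeholder handle."""
    return [row for row in rows if row[0].strip().lower() not in GENERIC_HANDLES]

# Toy rows for illustration
toy_rows = [
    ("Amazon Customer", "OnePlus 3 (Graphite, 64 GB)", 8),
    ("nisar ahmed", "OnePlus 3 (Graphite, 64 GB)", 8),
    ("cliente amazon ", "LG MG220", 6),
]
cleaned = drop_generic_users(toy_rows)
print(cleaned)
```

Whether to drop these rows or keep them for the popularity model only is a judgment call, since they still carry valid item ratings.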