# NAME: Lalitha Krishnamurthy


## Module: Recommendation Systems

## Project Domain:  Smartphone, Electronics

# CONTEXT:
India is the second largest smartphone market globally after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to increase to about 442 million in 2022. India ranked second in the average time spent on the mobile web by smartphone users across Asia Pacific. The combination of very high sales volumes and average smartphone consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they are buying a product, versus 15% who turn to social media. If a seller succeeds in publishing smartphones based on a user's behaviour or choice at the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on an individual consumer's behaviour or choice.

# DATA DESCRIPTION:

• author : name of the person who gave the rating 
• country : country the person who gave the rating belongs to
• date : date of the rating
• domain: website from which the rating was taken
• extract: rating content
• language: language in which the rating was given
• product: name of the product/mobile phone for which the rating was given
• score: average rating for the phone
• score_max: highest rating given for the phone
• source: source from where the rating was taken  

# PROJECT OBJECTIVE:
We will build a recommendation system using popularity based and collaborative filtering methods to recommend
mobile phones to a user which are most popular and personalised respectively. 
## Steps to the project:

### 1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps:
  
A. Merge all the provided CSVs into one data-frame. 

B. Explore, understand the Data and share at least 2 observations. 

C. Round off scores to the nearest integers.

D. Check for missing values. Impute the missing values, if any. 


E. Check for duplicate values and remove them, if any.

F. Keep only 1 Million data samples. Use random state=612. 

G. Drop irrelevant features. Keep features like Author, Product, and Score. 

### 2. Answer the following questions:
   
A. Identify the most rated features. [3 Marks]

B. Identify the users with most number of reviews. [3 Marks]

C. Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final  dataset. [4 Marks] 
       
### 3. Build a popularity based model and recommend top 5 mobile phones. 
### 4. Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch (Note: In case you're building it from scratch, you can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using kNNWithMeans from surprise. You can try both user-based and item-based models. 
### 5. Evaluate the collaborative model. Print RMSE value. 
### 6. Predict score (average rating) for test users. 
### 7. Report your findings and inferences.  
### 8. Try and recommend top 5 products for test users. 
### 9. Try other techniques (Example: cross validation) to get better results. 
### 10. In what business scenario you should use popularity based Recommendation Systems ? 
### 11. In what business scenario you should use CF based Recommendation Systems ? 
### 12. What other possible methods can you think of which can further improve the recommendation for different users ?
  

# SOLUTION

### 1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps:

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from collections import defaultdict
from surprise import SVD
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
import os
import glob
from os import listdir

# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

### Q1 A - Merge all the provided CSVs into one data-frame.

In [2]:
# Pattern matching every review CSV in the project folder
pattern = os.path.join(r"C:\\AIML\\RecommendationSystem\\Project", "phone_user_review_file*.csv")

# List of files that match the pattern
all_filenames = glob.glob(pattern)

# Combine all matched files into one data frame
df = pd.concat([pd.read_csv(f, encoding='latin-1') for f in all_filenames])
df.head(5)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


In [3]:
print("shape of merged dataframe -",df.shape) 
print("size of merged dataframe -",df.size)   

shape of merged dataframe - (1415133, 11)
size of merged dataframe - 15566463


In [4]:
df.columns

Index(['phone_url', 'date', 'lang', 'country', 'source', 'domain', 'score',
       'score_max', 'extract', 'author', 'product'],
      dtype='object')

### Q1 B - Explore, understand the Data and share at least 2 observations.

In [5]:
#Check the datatypes
df.dtypes

phone_url     object
date          object
lang          object
country       object
source        object
domain        object
score        float64
score_max    float64
extract       object
author        object
product       object
dtype: object

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 129.6+ MB


In [7]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
score,1351644.0,8.00706,2.616121,0.2,7.2,9.2,10.0,10.0
score_max,1351644.0,10.0,0.0,10.0,10.0,10.0,10.0,10.0


In [8]:
#Find the minimum and maximum ratings
print('Minimum Score is: %d' %(df.score.min()))
print('Maximum Score is: %d' %(df.score.max()))


Minimum Score is: 0
Maximum Score is: 10


#### Observation:

**The `score` and `score_max` columns are float64; all other columns are of object type.**
**The main columns to consider are product, author and score.**
**The scores given by users range from 0.2 to 10 (the `%d` format above truncates the minimum to 0), and `score_max` is always 10.**


### Q1 C - Round off scores to the nearest integers.

In [9]:
df['score']=df.score.round()
df['score']

0         10.0
1         10.0
2          6.0
3          9.0
4          4.0
          ... 
163832     2.0
163833    10.0
163834     2.0
163835     8.0
163836     2.0
Name: score, Length: 1415133, dtype: float64

### Q1 D - Check for missing values. Impute the missing values, if any.

In [10]:
# Checking for null values in every column
def missing_cols(df):
    '''Print each column that has missing values together with its count.'''
    total = 0
    for col in df.columns:
        missing_vals = df[col].isnull().sum()
        total += missing_vals
        if missing_vals != 0:
            print(f"{col} => {missing_vals}")
    if total == 0:
        print("No missing values left")
missing_cols(df)

score => 63489
score_max => 63489
extract => 19361
author => 63202
product => 1


In [11]:
# Drop every row that has a missing value in any column (rows are removed rather than imputed)
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

In [12]:
missing_cols(df)

No missing values left
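
The task asks for imputation, whereas the cells above drop incomplete rows. As a minimal sketch of one alternative, assuming a made-up toy frame (the values and the `"Unknown"` placeholder are illustrative assumptions, not part of the dataset), a missing score could be filled with the median score of the same product and a missing author with a placeholder label:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "author":  ["a1", None, "a3", "a4"],
    "product": ["P1", "P1", "P2", "P2"],
    "score":   [8.0, np.nan, 6.0, np.nan],
})

# Fill a missing score with the median score of the same product
product_median = toy.groupby("product")["score"].transform("median")
toy["score"] = toy["score"].fillna(product_median)

# Fall back to the overall median if a product has no scores at all
toy["score"] = toy["score"].fillna(toy["score"].median())

# Replace a missing author with a placeholder label (hypothetical choice)
toy["author"] = toy["author"].fillna("Unknown")

print(toy)
```

Whether imputed ratings should feed a recommender is a judgment call, since they are not real user opinions; dropping the rows, as done above, avoids that question.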


### Q1 E - Check for duplicate values and remove them, if any.

In [13]:
boolean = df.duplicated().any()
print(boolean)

True


In [14]:
#dropping the duplicates
df.drop_duplicates(inplace=True)
df.shape

(1271437, 11)

### Q1 F - Keep only 1 Million data samples. Use random state=612.

In [15]:
df1=df.sample(n = 1000000, random_state = 612)
df1.shape

(1000000, 11)

### Q1 G - Drop irrelevant features. Keep features like Author, Product, and Score.

In [16]:
df2=df1[['author','product','score']].copy()
df2.head(5)

Unnamed: 0,author,product,score
490352,KHILESH KUMAR VERMA,"Lenovo Vibe K5 (Gold, VoLTE update)",10.0
100384,Evyta,Samsung Galaxy S6,8.0
1116116,VanRaZor,Sony Ericsson K810i,8.0
435849,ruga,Sony Xperia Z2 (Black),6.0
16706,einer Kundin,"Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,...",10.0


### Q2 A - Identify the most rated features.

In [68]:
#df2['product'].value_counts(ascending=False).reset_index().head(1)
print("Most rated feature")
print("**************************")
df2.groupby('product')['score'].count().sort_values(ascending=False).head(1) 

Most rated feature
**************************


product
Lenovo Vibe K4 Note (White,16GB)    4109
Name: score, dtype: int64

### Q2 B - Identify the users with most number of reviews.

In [69]:
print("Most rated User")
print("**************")
df2['author'].value_counts(ascending=False).reset_index().head(5)

Most rated User
**************


Unnamed: 0,index,author
0,Amazon Customer,60408
1,Cliente Amazon,15051
2,e-bit,6651
3,Client d'Amazon,6087
4,Amazon Kunde,3683


### Q2 C - Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final dataset.

In [70]:
# extracting authors who gave greater than 50 ratings
df_author_50 = pd.DataFrame(columns=['author', 'a_count'])
df_author_50['author']=df2['author'].value_counts().index.tolist() 
df_author_50['a_count'] = list(df2['author'].value_counts() > 50)

In [75]:
df_author_50=df_author_50[df_author_50['a_count']==True]

In [76]:
# extracting product that got more than 50 ratings
df_product_50 = pd.DataFrame(columns=['product', 'p_count'])
df_product_50['product']=df2['product'].value_counts().index.tolist() 
df_product_50['p_count'] = list(df2['product'].value_counts() > 50)

In [77]:
df_product_50=df_product_50[df_product_50['p_count']==True]

In [78]:
df_product_50

Unnamed: 0,product,p_count
0,"Lenovo Vibe K4 Note (White,16GB)",True
1,"Lenovo Vibe K4 Note (Black, 16GB)",True
2,"OnePlus 3 (Graphite, 64 GB)",True
3,"OnePlus 3 (Soft Gold, 64 GB)",True
4,Huawei P8lite zwart / 16 GB,True
...,...,...
4468,Motorola C350,True
4469,Motorola Nexus 6 - Smartphone libre Android (p...,True
4470,Nokia 301 Sim Free Mobile Phone - Black (disco...,True
4471,Apple iPhone 4S 64GB,True


In [79]:
# selecting data rows where product is having more than 50 ratings and user having given more than 50 ratings.  
final_df = df2[df2['product'].isin(df_product_50['product']) & df2['author'].isin(df_author_50['author'])] 
final_df

Unnamed: 0,author,product,score
16706,einer Kundin,"Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,...",10.0
827335,Cliente Amazon,"Samsung E1200 Telefono Cellulare, Nero [Italia]",10.0
903079,ÐÐ½Ð´ÑÐµÐ¹,LG P920 Optimus 3D,10.0
72394,Amazon Customer,"OnePlus 3 (Graphite, 64 GB)",2.0
296651,Amazon Customer,"Asus Zenfone Max ZC550KL-6A076IN (Black, 3GB, ...",10.0
...,...,...,...
289698,Cliente Amazon,BQ Aquaris E5 HD - Smartphone libre Android (p...,2.0
137565,Giuseppe,"Samsung Galaxy A5 2016 Smartphone LTE, 16GB, Nero",10.0
108444,Frank,Sony Xperia XZ zwart / 32 GB,9.0
234008,e-bit,Smartphone Motorola Moto X 2Âª GeraÃ§Ã£o XT109...,6.0


In [80]:
print("Shape of the data frame with product and user more who has more than 50 rating",final_df.shape)

Shape of the data frame with product and user more who has more than 50 rating (111595, 3)
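
The same selection can be written more compactly. The sketch below is one possible approach on invented toy data, with the threshold lowered to 1 so the effect is visible; `toy`, `min_ratings` and `filtered` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy ratings, invented for illustration
toy = pd.DataFrame({
    "author":  ["u1", "u1", "u2", "u2", "u3"],
    "product": ["A",  "B",  "A",  "B",  "A"],
    "score":   [10.0, 8.0, 6.0, 9.0, 7.0],
})
min_ratings = 1  # the notebook uses 50

# Number of ratings per author and per product, aligned to every row
author_counts = toy.groupby("author")["score"].transform("count")
product_counts = toy.groupby("product")["score"].transform("count")

# Keep rows whose author AND product both exceed the threshold
filtered = toy[(author_counts > min_ratings) & (product_counts > min_ratings)]
print(filtered.shape)
```

On data without missing scores this should select the same rows as the `value_counts` approach above, without building the intermediate helper frames.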


### Q3 - Build a popularity based model and recommend top 5 mobile phones.

In [17]:
# Mean score per product
score_mean_count = pd.DataFrame(df2.groupby('product')['score'].mean())

# Number of ratings each product received
score_mean_count['score_counts'] = pd.DataFrame(df2.groupby('product')['score'].count())

# Recommend the 5 mobile phones with the highest mean score, breaking ties by the number of ratings
score_mean_count.sort_values(by=['score','score_counts'], ascending=[False,False]).head()

Unnamed: 0_level_0,score,score_counts
product,Unnamed: 1_level_1,Unnamed: 2_level_1
Samsung Galaxy Note5,10.0,156
Motorola Smartphone Motorola Moto X Desbloqueado Preto Android 4.2.2 CÃ¢mera 10MP e Frontal 2MP MemÃ³ria Interna de 16GB GSM,10.0,142
Samsung Smartphone Dual Chip Samsung Galaxy SIII Duos Desbloqueado Claro Azul Android 4.1 3G/Wi-Fi CÃ¢mera 5MP,10.0,137
Nokia Smartphone Nokia Lumia 520 Desbloqueado Oi Preto Windows Phone 8 CÃ¢mera 5MP 3G Wi-Fi MemÃ³ria Interna 8G GPS,10.0,136
Motorola Smartphone Motorola Moto G Dual Chip Desbloqueado TIM Android 4.3 Tela 4.5 8GB 3G Wi-Fi CÃ¢mera 5MP - Preto,10.0,129


#### Observation:

**The above result set displays the top 5 popular mobile phones, ranked by the mean score given by customers and then by the number of ratings.**
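
Ranking on the mean score first can favour products that have only a handful of ratings. A minimal sketch of one common refinement, assuming an invented toy frame and a hypothetical `min_count` threshold, is to require a minimum number of ratings before a product is eligible:

```python
import pandas as pd

# Toy ratings, invented for illustration
toy = pd.DataFrame({
    "product": ["A", "A", "A", "B", "C", "C"],
    "score":   [9.0, 10.0, 8.0, 10.0, 7.0, 9.0],
})
min_count = 2  # hypothetical threshold

# Mean score and number of ratings per product
stats = toy.groupby("product")["score"].agg(mean_score="mean", score_counts="count")

# Keep only products with enough ratings, then rank them
popular = (
    stats[stats["score_counts"] >= min_count]
    .sort_values(by=["mean_score", "score_counts"], ascending=[False, False])
)
top = popular.head(5)
print(top)
```

Product `B` has a perfect mean but a single rating, so it is excluded here; the right threshold depends on the data and is a design choice.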

### Q4 - Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch.

In [30]:
from surprise import KNNWithMeans
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
import pandas as pd
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

In [31]:
#Taking random 5000 records for the dataset
final=df2.sample(n = 5000, random_state = 612)
final.head(5)

Unnamed: 0,author,product,score
469127,amit kumar gupta,InFocus M810 (Gold),10.0
656058,oscar,Sony Xperia L - Smartphone libre Android (pant...,10.0
663360,kookie,"Samsung Galaxy S4, Brown 16GB (Verizon Wireless)",8.0
134171,e-bit,Smartphone Asus ZenFone 3 ZE520KL,10.0
580430,Sarah Schanz,"5,0 Zoll CUBOT S208 IPS OGS Screen 3G Android ...",8.0


In [32]:
reader = Reader(rating_scale=(1, 10))
# Note: surprise reads the columns as (user, item, rating). Passing 'product' first
# means products are treated as users (uid) and authors as items (iid) in this model.
data = Dataset.load_from_df(final[['product', 'author', 'score']], reader)
trainset, testset = train_test_split(data, test_size=.15)
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x22736360730>

In [33]:
test_pred = algo.test(testset)
#test_pred.iloc[:2]

In [34]:
test_pred

[Prediction(uid='Ð¡Ð¼Ð°Ñ\x80Ñ\x82Ñ\x84Ð¾Ð½ NOKIA 5230 XpressMusic White Red', iid='Ñ\x8eÐ»ÐµÑ\x87ÐºÐ°', r_ui=8.0, est=8.04729411764706, details={'was_impossible': False}),
 Prediction(uid='Apple iPhone 4 16 GB', iid='bover992', r_ui=8.0, est=8.04729411764706, details={'was_impossible': False}),
 Prediction(uid='HTC Windows Phone 8S - Smartphone libre (pantalla tÃ¡ctil de 4" 480 x 800, cÃ¡mara 5 Mp, 4 GB, 2 procesadores de 1 GHz, 512 MB de RAM, S.O. Windows Phone 8), dominÃ³ (negro y blanco)', iid='DEIV87', r_ui=10.0, est=8.04729411764706, details={'was_impossible': False}),
 Prediction(uid='Samsung A950 (Verizon Wireless)', iid='Oy Vey', r_ui=2.0, est=7.552968075905458, details={'was_impossible': False}),
 Prediction(uid='Samsung GALAXY S III Mini', iid='mackers1986', r_ui=8.0, est=8.04729411764706, details={'was_impossible': False}),
 Prediction(uid='Motorola Moto G 3rd Generation (White, 16GB)', iid='Aravind', r_ui=8.0, est=7.268052386719137, details={'was_impossible': False}),
 Pred

### Build a collaborative filtering model using kNNWithMeans from surprise using Item based model

In [20]:
# Read dataset.
reader = Reader(rating_scale=(1, 10))
final_df_knnI = Dataset.load_from_df(final,reader = reader)

In [21]:
trainset_I, testset_I = train_test_split(final_df_knnI, test_size=.50)

In [22]:
# Use user_based true/false to switch between user-based or item-based collaborative filtering
algo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})


In [23]:
algo.fit(trainset_I)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x22737e72d60>

In [24]:
# run the  model against the testset
test_pred_I = algo.test(testset_I)

In [49]:
test_pred_I

[Prediction(uid='Andrea', iid='Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco', r_ui=10.0, est=10, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='LJS', iid='Samsung Stratosphere I405 4G LTE CDMA Android Slider Phone, Black (Verizon)', r_ui=8.0, est=8.0412, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='JD', iid='Binatone The Brick GSM Sim-Free Mobile Phone', r_ui=8.0, est=8.0412, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Marco', iid='Samsung Galaxy S5 Smartphone, Display 5.1 Pollici, Processore Quad-Core 2,5 GHz, RAM 2GB, Memoria Fotocamera 16MP, Android 4.4, Nero [Germania]', r_ui=10.0, est=8.0412, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='joga_99', iid='Sony Ericsson T600', r_ui=8.0, est=8.0412, details={'was_impossible': True, 'reas

### Build a collaborative filtering model using kNNWithMeans from surprise using User based model

In [51]:
# Read dataset.
reader = Reader(rating_scale=(1, 10))
final_df_knnII = Dataset.load_from_df(final,reader = reader)

In [52]:
trainset_II, testset_II = train_test_split(final_df_knnII, test_size=.15)

In [53]:
# Use user_based true/false to switch between user-based or item-based collaborative filtering
algo = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo.fit(trainset_II)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x22736ef9520>

In [54]:
# we can now query for specific predictions
uid = 'Frances DeSimone'  # raw user id
iid = 'Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce.'  # raw item id

In [None]:
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)
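
For a user-based KNNWithMeans model, the estimate behind such a prediction is the user's own mean rating plus a similarity-weighted average of the neighbours' mean-centred ratings for the item. The function below is a simplified sketch of that idea, not surprise's implementation; the function name, the input format and the numbers are assumptions made for illustration.

```python
def knn_with_means_estimate(user_mean, neighbours):
    """Estimate a rating from (similarity, neighbour_rating, neighbour_mean) triples."""
    # Weighted sum of how far each neighbour's rating is from that neighbour's mean
    weighted = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    total_sim = sum(sim for sim, _, _ in neighbours)
    if total_sim == 0:
        # No usable neighbours: fall back to the user's own mean
        return user_mean
    return user_mean + weighted / total_sim

# Hypothetical user with mean rating 7.0 and two neighbours who rated the item
estimate = knn_with_means_estimate(7.0, [(0.8, 9.0, 7.0), (0.2, 6.0, 8.0)])
print(round(estimate, 2))
```

When the user or the item is unknown to the training set, no neighbours are available, which is why many of the predictions shown in this notebook collapse to a single default value.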

In [55]:
# run the trained model against the testset
test_pred_II = algo.test(testset_II)

In [56]:
#6. Predict score (average rating) for test users
test_pred_II

[Prediction(uid='Ð¡ÐµÑ\x80Ð³ÐµÐ¹', iid='Ð\x9cÐ¾Ð±Ð¸Ð»Ñ\x8cÐ½Ñ\x8bÐ¹ Ñ\x82ÐµÐ»ÐµÑ\x84Ð¾Ð½ Samsung J700H/DS Galaxy J7 Duos Black (SM-J700HZKDSEK)', r_ui=10.0, est=8.021176470588236, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Onur', iid='Sony Xperia Z C6603 White', r_ui=8.0, est=8.021176470588236, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Nika3487', iid='HTC Ð\x9cÐ¾Ð±Ð¸Ð»Ñ\x8cÐ½Ñ\x8bÐ¹ Ñ\x82ÐµÐ»ÐµÑ\x84Ð¾Ð½ HTC One Dual Sim 32 Ð³Ð±', r_ui=8.0, est=8.021176470588236, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Daan', iid='HTC Wildfire S Black', r_ui=5.0, est=8.021176470588236, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Amy J Anderson', iid='Samsung Galaxy S4 i9500 Factory Unlocked Cellphone, International Version, 16GB, Black', r_ui=10.0, est=8.021176470588236, details={'was_impossible': Tr

### Q5 - Evaluate the collaborative model. Print RMSE value.

In [58]:
# get RMSE
print("RMSE for KNN Item based model : Test Set", accuracy.rmse(test_pred_I, verbose=True))
print("RMSE for KNN User based model : Test Set", accuracy.rmse(test_pred_II, verbose=True))
print("RMSE for SVD : Test Set", accuracy.rmse(test_pred, verbose=True))
print("RMSE for KNN User Based is better than KNN model for Item")

RMSE: 2.6045
RMSE for KNN Item based model : Test Set 2.6045052656408787
RMSE: 2.5038
RMSE for KNN User based model : Test Set 2.503763588701938
RMSE: 2.5753
RMSE for SVD : Test Set 2.575255088478619
RMSE for KNN User Based is better than KNN model for Item
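
RMSE is the square root of the mean squared difference between the actual rating (`r_ui`) and the estimated rating (`est`). A minimal sketch of the computation, using hypothetical rating pairs rather than the notebook's predictions:

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    squared_errors = [(actual - est) ** 2 for actual, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (r_ui, est) pairs for illustration
pairs = [(8.0, 8.0), (10.0, 8.0), (2.0, 6.0)]
print(round(rmse(pairs), 4))
```

Because the errors are squared before averaging, a few large misses (such as a rating of 2 predicted as 6) dominate the result.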


### Q6 - Predict score (average rating) for test users.

In [40]:
testset_I=pd.DataFrame(testset_I,columns=['author','product','score'])
test_user_avg_score = pd.DataFrame(testset_I.groupby("author")['score'].mean())
test_user_avg_score.sort_values(by='score',ascending=False)

Unnamed: 0_level_0,score
author,Unnamed: 1_level_1
#,10.0
Yaririchard,10.0
Sambini,10.0
Sam Bryant,10.0
Sally89,10.0
...,...
Stefan,1.0
Vehemoth,1.0
bagdes,1.0
cs_gve,1.0
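
The table above averages the actual scores in the test set. To average the model's estimates instead, the `est` values can be grouped by user. The sketch below uses a stand-in tuple with the same field names as surprise's `Prediction` objects and invented values; with the notebook's objects, a list such as `test_pred_I` could be passed in the same way.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field names as surprise's Prediction tuples
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

# Hypothetical predictions for illustration
preds = [
    Prediction("Andrea", "Phone A", 10.0, 9.0, {}),
    Prediction("Andrea", "Phone B", 8.0, 7.0, {}),
    Prediction("Marco", "Phone A", 6.0, 8.0, {}),
]

def mean_estimate_per_user(predictions):
    """Average the estimated rating (est) over each user's test predictions."""
    estimates = defaultdict(list)
    for p in predictions:
        estimates[p.uid].append(p.est)
    return {uid: sum(vals) / len(vals) for uid, vals in estimates.items()}

avg_pred = mean_estimate_per_user(preds)
print(avg_pred)
```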


### Q7 - Report your findings and inferences.

1. The merged dataset has 1415133 rows and 11 columns.

2. Missing values were found in score (63489), score_max (63489), author (63202), extract (19361) and product (1). Rows with missing values were dropped rather than imputed.

3. We removed the duplicate rows and the irrelevant features, keeping only score, author and product for further analysis.

4. As per the guidance, we kept only 1000000 records for further analysis.

5. We identified the most rated product, Lenovo Vibe K4 Note (White,16GB), with 4109 ratings. Many products have the highest mean score of 10.

6. We identified the users with the most reviews and the products with the most reviews.

7. We built a popularity based model and displayed the top 5 mobile phones.

8. We built a collaborative filtering model using SVD.

9. We predicted the estimated ratings of items for the test users.

10. We recommended the top 5 items and their estimated ratings for the test users.

11. The collaborative filtering model using SVD from surprise gave RMSE: 2.57

12. The collaborative filtering model using kNNWithMeans from surprise with the item based model gave RMSE: 2.60 (evaluated on a 50% test split, whereas the other two models used a 15% test split, so the values are not directly comparable)

13. The collaborative filtering model using kNNWithMeans from surprise with the user based model gave RMSE: 2.50





 

### Q8 - Try and recommend top 5 products for test users.

In [42]:
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [43]:
# 8. Try and recommend top 5 products for test users
top_n = get_top_n(test_pred_I, n=5)

In [44]:
top_n

defaultdict(list,
            {'Andrea': [('Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco',
               10)],
             'LJS': [('Samsung Stratosphere I405 4G LTE CDMA Android Slider Phone, Black (Verizon)',
               8.0412)],
             'JD': [('Binatone The Brick GSM Sim-Free Mobile Phone', 8.0412)],
             'Marco': [('Huawei P8 Lite Smartphone, Display 5" IPS, Processore Octa-Core 1.5 GHz, Memoria Interna da 16 GB, 2 GB RAM, Fotocamera 13 MP, monoSIM, Android 5.0, Bianco [Italia]',
               10),
              ('Samsung Galaxy S5 Smartphone, Display 5.1 Pollici, Processore Quad-Core 2,5 GHz, RAM 2GB, Memoria Fotocamera 16MP, Android 4.4, Nero [Germania]',
               8.0412),
              ('Samsung Galaxy J3 2016 Sim Free Mobile Phone - White',
               8.0412)],
             'joga_99': [('Sony Ericsson T600', 8.0412)],
             'Dexter3219': [('Samsung Galaxy A

In [45]:
# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

Andrea ['Huawei P8 lite Smartphone, Display 5.0" IPS, Dual Sim, Processore Octa-Core, Memoria 16 GB, Fotocamera 13 MP, Android 5.0, Bianco']
LJS ['Samsung Stratosphere I405 4G LTE CDMA Android Slider Phone, Black (Verizon)']
JD ['Binatone The Brick GSM Sim-Free Mobile Phone']
Marco ['Huawei P8 Lite Smartphone, Display 5" IPS, Processore Octa-Core 1.5 GHz, Memoria Interna da 16 GB, 2 GB RAM, Fotocamera 13 MP, monoSIM, Android 5.0, Bianco [Italia]', 'Samsung Galaxy S5 Smartphone, Display 5.1 Pollici, Processore Quad-Core 2,5 GHz, RAM 2GB, Memoria Fotocamera 16MP, Android 4.4, Nero [Germania]', 'Samsung Galaxy J3 2016 Sim Free Mobile Phone - White']
joga_99 ['Sony Ericsson T600']
Dexter3219 ['Samsung Galaxy A5 (2016)']
Antonello Cosanti ['HTC One Smartphone con Display 4.7 Pollici, Fotocamera Ultrapixel, 32 GB, Processore Quad Core da 1.7 GHz, 2 GB...']
Simon ['Honor 6 Smartphone (5 Zoll, Touchscreen, Octa-Core, 3GB RAM, 16GB ROM, 13MP Hauptkamera, 5MP Frontkamera, LTE CAT6, Android 4.4, 

antonio carlos peres da c ['Asus Smartphone Asus ZenFone 5 Dual Chip Desbloqueado Android 4.4 Tela...']
OSU_Engineer ['LG Optimus F3 (T-Mobile)']
Cristian ['Meizu M3S 16GB Gris libre']
Jean P Villejoint ['Samsung Galaxy Note, i717 16GB Unlocked GSM 4G LTE 8MP Camera Smartphone with S Pen Stylus - White']
chris153 ['Siemens S65']
ÐÐµÐ»Ð¾Ð³Ð»Ð°Ð·Ð¾Ð² ÐÐ»ÐµÐ³ ['Lenovo IdeaPhone S920']
margaret murphy ['Sony Xperia E3 4G UK SIM-Free Smartphone - Black']
Luis3110  ['Samsung Galaxy S7 32GB (T-Mobile)']
CARLOSLIK ['Sony Ericsson W960']
Xxbarthezxx ['Samsung GT I8000 Omnia II Rose noire Windows Mobile 6.5 Professional']
Sascha Von Rhein ['Samsung Galaxy S5 Smartphone (5,1 Zoll (12,9 cm) Touch-Display, 16 GB Speicher, Android 4.4) electric blue']
Bagpiper ['Nokia 7650']
Jakob&Simone ['Alcatel One Touch Easy db']
Bruh01 ['Smartphone Samsung Galaxy Win Duos Dual Chip Desbloqueado Android 4.1 Tela 8GB 3G Wi- Fi CÃ¢mera 5MP - Branco']
lazac ['Sim Free Apple iPhone SE 16GB Mobile Phone - Rose Gold

Ð¯ÐºÐ¾Ð²ÐµÐ½ÐºÐ¾ Ð¸Ð³Ð¾ÑÑ Ð²Ð¸ÐºÑÐ¾ÑÐ¾Ð²Ð¸ÑÑ ['Nokia 5230 XpressMusic']
Un anonyme ['Motorola RAZR V3']
cheyenne ['Samsung Galaxy S7 goud, roze / 32 GB']
Thorsten ['Motorola W270 schwarz Handy']
Ken97  ['Samsung Galaxy S6 edge+ 64GB (AT&T)']
Maria ['Apple iPhone SE 16GB Black']
Bencilal John ['Lenovo VIBE P1m (White, 16 GB)']
Fabio ['Samsung Galaxy Young Smartphone, Blu [Germania]']
Francymad ['Asus ZenFone 4.5 Smartphone, Storage 8 GB, Nero [Italia]']
Jani.1984 ['HTC One']
James F. ['SAMSUNG Galaxy Nexus Titanium Silver 3G Unlocked GSM Android Smart Phone w/ Android 4.0 / 5 MP Camera / 16GB Internal Memory / NFC (GT-i9250)']
herbreteau ['XEPTIO Etui Samsung Galaxy Note 3 N9000 N9002 N9005 (Wifi / LTE / 4G) blanc 32/64 GB Ultra Slim Cuir Style avec stand...']
Jason Blake ['Alcatel OneTouch Idol 3 Global Unlocked 4G LTE Smartphone, 5.5 HD IPS Display, 16GB (GSM - US Warranty)']
Gabriele H. ['Huawei P8 Lite (2017) Smartphone, 13,2 cm (5,2 Zoll) Display, LTE (4G), Android, 12,0 Mega
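
`get_top_n` above ranks the items that appear in the test set, which each user has already rated. A recommendation would normally rank items the user has not rated yet. The sketch below shows that idea with a hypothetical catalogue and a stand-in scoring function (`fake_predict` plays the role of a trained model's estimate); all names and values are assumptions.

```python
def recommend_unseen(user, all_items, rated_items, predict, n=5):
    """Rank items the user has not rated by predicted score and keep the top n."""
    candidates = [item for item in all_items if item not in rated_items]
    scored = [(item, predict(user, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical catalogue, rating history and scores for illustration
all_items = ["Phone A", "Phone B", "Phone C", "Phone D"]
rated_by_user = {"Phone A"}
fake_scores = {"Phone A": 9.5, "Phone B": 7.0, "Phone C": 8.5, "Phone D": 6.0}

def fake_predict(user, item):
    # Stand-in for a trained model's estimate
    return fake_scores[item]

top = recommend_unseen("Andrea", all_items, rated_by_user, fake_predict, n=2)
print(top)
```

With surprise, `trainset.build_anti_testset()` produces the user-item pairs that are absent from the training set; passing them to `algo.test` and then to `get_top_n` follows the same pattern.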

### Q9 - Try other techniques (Example: cross validation) to get better results.

In [40]:
from surprise.model_selection import cross_validate
from surprise.model_selection import KFold
from surprise import SVD


kf = KFold(n_splits=10)

algo_c = SVD()

for trainset_c, testset_c in kf.split(data):

    # train and test algorithm.
    algo_c.fit(trainset_c)
    predictions = algo_c.test(testset_c)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 2.7019
RMSE: 2.7975
RMSE: 2.6740
RMSE: 2.7255
RMSE: 2.7443
RMSE: 2.7360
RMSE: 2.7329
RMSE: 2.7422
RMSE: 2.7324
RMSE: 2.7174
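
The per-fold values are easier to compare with the single-split results once they are summarised. A small sketch that aggregates the ten RMSE values printed above:

```python
from statistics import mean, stdev

# RMSE of each fold, copied from the output above
fold_rmse = [2.7019, 2.7975, 2.6740, 2.7255, 2.7443,
             2.7360, 2.7329, 2.7422, 2.7324, 2.7174]

print(f"mean RMSE: {mean(fold_rmse):.4f}")
print(f"std  RMSE: {stdev(fold_rmse):.4f}")
```

The mean across folds is roughly 2.73, which is higher than the 2.58 obtained for SVD on the single split, so cross validation here gives a more stable estimate of the error rather than a better score. The `cross_validate` helper imported above can run the folds and report these summaries in one call.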


### Q10 - In what business scenario you should use popularity based Recommendation Systems ?

1. As the name suggests, a popularity based recommendation system works with the trend: it recommends the items that are most popular right now. For example, if a product is bought by most new users, the system is likely to suggest that item to a user who has just signed up.

2. This makes it suitable when nothing is known about the user, for example a new visitor with no rating history, because it needs no user-specific data.

3. The main drawback of a popularity based recommendation system is that it offers no personalization: even when the behaviour of the user is known, every user gets the same recommendations.

4. Recommendation systems are important and valuable tools for companies like Amazon and Netflix, which are both known for their personalized customer experiences. Such companies collect and analyze demographic data from customers and combine it with information from previous purchases and product ratings.

### Q11 - In what business scenario you should use CF based Recommendation Systems ?

1. This type of recommendation system makes predictions of what might interest a person based on the taste of many other users. It assumes that if person X likes Snickers, and person Y likes Snickers and Milky Way, then person X might like Milky Way as well.

2. Most websites like Amazon, YouTube, and Netflix use collaborative filtering as a part of their sophisticated recommendation systems. You can use this technique to build recommenders that give suggestions to a user on the basis of the likes and dislikes of similar users.

3. In the more general sense, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.

4. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data, including sensing and monitoring data, such as in mineral exploration and environmental sensing over large areas or multiple sensors; financial data, such as financial service institutions that integrate many financial sources; and electronic commerce and web applications, where the focus is on user data.


### Q12 - What other possible methods can you think of which can further improve the recommendation for different users ?

1. Conversion rate from recommendations – Conversion rate for people who clicked on recommendations.

2. GMV/1000 recommendations – This is often the hardest to understand at first, but it generally means the average revenue per 1000 recommendations that comes from people purchasing products that they’ve found in recommender boxes.

3. CTRs – While obvious, it is a very important metric to consider. One thing to look out for, though: widgets in unfavorable locations (e.g., footer, sidebar) can skew these numbers (and, in fact, all others as well). This should be kept in mind during evaluation.

4. % of revenue through recommendations – It is one of the most often used metrics. This simply means revenues through recommendations / total revenues.

5. Number of products viewed – The number of products viewed by people who are actively using recommendations during their sessions. While more browsing can mean that users are having a hard time finding what they’re looking for, a study by Wolfgang Digital concluded that the time spent on the site by a user and the number of pages they open correlate positively with conversions.
