# Project - Recommendation Systems
## by HARI SAMYNAATH S

### User Defined functions / classes and library initiations

In [1]:
# dependencies
import pandas as pd
import os, re
from joblib import Parallel, delayed
from surprise import Dataset,Reader
from surprise.model_selection import train_test_split
from surprise.model_selection.validation import cross_validate
from surprise.model_selection.search import GridSearchCV
from surprise import SVD, KNNWithMeans
from surprise import accuracy
import numpy as np

In [2]:
def nulsCount(df):
    """summarise missing/unexpected values"""
    
    d2=pd.DataFrame(columns=["NULL","NAN","BLANKS","UNEXP"])
    try:
        d2["NULL"] = df.isnull().sum().astype('uint32') # check for null values
        d2["NAN"]=df.isna().sum().astype('uint32') # check for NaN (isna is an alias of isnull, so this always equals NULL)
        d2["BLANKS"]=df.isin([""," "]).sum().astype('uint32') # check for blanks
        d2["UNEXP"]=df.isin(["-","?",".","NA","N/A","nan","Unknown","unknown","UNKNOWN"]).sum().astype('uint32') # check for other unexpected values
    except Exception: # skip the remaining checks rather than fail on unexpected column types
        pass
    d2=d2.loc[(d2["NULL"]!=0) | (d2["NAN"]!=0) | (d2["BLANKS"]!=0) | (d2["UNEXP"]!=0)] # shortlist for the missing values
    
    # convert to percentages
    d2["NULL %"] = d2["NULL"].mul(100/df.shape[0]).round(2)
    d2["NAN %"] = d2["NAN"].mul(100/df.shape[0]).round(2)
    d2["BLANKS %"] = d2["BLANKS"].mul(100/df.shape[0]).round(2)
    d2["UNEXP %"] = d2["UNEXP"].mul(100/df.shape[0]).round(2)
    
    # rearrange
    d2=d2[["NULL","NULL %","NAN","NAN %","BLANKS","BLANKS %","UNEXP","UNEXP %"]]
    
    if d2.shape[0]==0:
        return
    else:     
        return d2

**DOMAIN:** Smartphone, Electronics<br>
**CONTEXT:** India is the second largest market globally for smartphones after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to increase to about 442 million in 2022. India ranked second in the average time spent on mobile web by smartphone users across Asia Pacific. The combination of very high sales volumes and the average smartphone consumer behaviour has made India a very attractive market for foreign vendors. As per consumer behaviour, 97% of consumers turn to a search engine when they are buying a product vs. 15% who turn to social media. If a seller succeeds in publishing smartphones based on the user’s behaviour/choice at the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on the individual consumer’s behaviour or choice.<br>

**DATA DESCRIPTION:**<br>
• author : name of the person who gave the rating<br>
• country : country the person who gave the rating belongs to<br>
• date : date of the rating<br>
• domain: website from which the rating was taken from<br>
• extract: rating content<br>
• language: language in which the rating was given<br>
• product: name of the product/mobile phone for which the rating was given<br>
• score: average rating for the phone<br>
• score_max: highest rating given for the phone<br>
• source: source from where the rating was taken<br>

**PROJECT OBJECTIVE:** We will build a recommendation system using popularity-based and collaborative filtering methods to recommend mobile phones to a user which are most popular and personalised respectively.

**Steps and tasks:**<br>
1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps.<br>
A. Merge all the provided CSVs into one data-frame.<br>

In [3]:
# lets list down all csv files in the local directory
files=[]
for file in sorted(os.listdir()): # collect all *.csv files in the work directory
    # skip temporary/lock files (names with '~' in the first or second position)
    if re.search(r"^~|^.~",file) is None and re.search(r"\.csv$",file) is not None:
        files.append(file)
files

['phone_user_review_file_1.csv',
 'phone_user_review_file_2.csv',
 'phone_user_review_file_3.csv',
 'phone_user_review_file_4.csv',
 'phone_user_review_file_5.csv',
 'phone_user_review_file_6.csv']
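
The listing step can also be written with `glob`, which avoids hand-written regular expressions. The following is a minimal sketch and not the notebook's method: the helper name `list_review_csvs` is hypothetical, it assumes that temporary/lock files start with `~`, and the demo file names only imitate the ones listed above.

```python
import glob
import os
import tempfile

def list_review_csvs(folder="."):
    """Return the sorted CSV file names in `folder`, skipping names that start with '~'."""
    paths = glob.glob(os.path.join(folder, "*.csv"))
    names = [os.path.basename(p) for p in paths]
    return sorted(n for n in names if not n.startswith("~"))

# demo on a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["phone_user_review_file_2.csv", "phone_user_review_file_1.csv",
                 "~lock.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_review_csvs(tmp)

print(found)
```

Sorting the names makes the merge order reproducible, since the order returned by the file system is not guaranteed.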

In [4]:
# read every file, then merge once (avoids re-copying the growing frame on each iteration)
ratings=pd.concat([pd.read_csv(file,encoding_errors='ignore') for file in files],ignore_index=True)

In [5]:
# review data set
print("Shape :",ratings.shape)
display(ratings.head())

Shape : (1415133, 11)


Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


**Steps and tasks:**<br>
1. B. Explore, understand the Data and share at least 2 observations.

In [6]:
print("The ratings were given in a total of %d languages as follows\n"%(ratings.lang.nunique()))
print(sorted(ratings.lang.unique()))

The ratings were given in a total of 22 languages as follows

['ar', 'cs', 'da', 'de', 'en', 'es', 'fi', 'fr', 'he', 'hu', 'id', 'it', 'ja', 'ko', 'nl', 'no', 'pl', 'pt', 'ru', 'sv', 'tr', 'zh']


In [7]:
print("The ratings were given from %d different countries as follows\n"%(ratings.country.nunique()))
print(sorted(ratings.country.unique()))

The ratings were given from 42 different countries as follows

['ae', 'ar', 'au', 'be', 'br', 'ca', 'ch', 'cl', 'cn', 'co', 'cz', 'de', 'dk', 'ec', 'es', 'fi', 'fr', 'gb', 'hu', 'id', 'il', 'in', 'it', 'jp', 'kr', 'mx', 'nl', 'no', 'nz', 'pe', 'pl', 'pt', 'ru', 'se', 'sg', 'tr', 'tw', 'ua', 'us', 'uy', 've', 'za']


In [8]:
print("The ratings were sourced from %d domains"%(ratings.source.nunique()))

The ratings were sourced from 331 domains


In [9]:
print("The scores range from %.1f to %.1f"%(ratings.score.min(),ratings.score.max()))

The scores range from 0.2 to 10.0


In [10]:
OEMS=ratings["phone_url"].apply(lambda x: str(x).strip().split(sep='/')[2].split(sep='-')[0])
print("The products are expected to be from %d OEs\n"%OEMS.nunique())
print(sorted(OEMS.unique()))

The products are expected to be from 166 OEs

['acer', 'alcatel', 'amazon', 'amoi', 'amplicom', 'anycool', 'apple', 'archos', 'asus', 'at', 'audiovox', 'auro', 'bang', 'beafon', 'benefon', 'benq', 'binatone', 'bird', 'black', 'blackberry', 'blackview', 'blu', 'bluechip', 'bosch', 'bq', 'casio', 'cat', 'caterpillar', 'cect', 'celkon', 'cubot', 'curitel', 'danger', 'dell', 'denon', 'doogee', 'dopod', 'doro', 'elephone', 'emporia', 'ericsson', 'eten', 'firefly', 'fly', 'fujitsu', 'garmin', 'geemarc', 'general', 'gigabyte', 'gionee', 'google', 'gresso', 'haier', 'hisense', 'hitachi', 'hp', 'htc', 'huawei', 'hummer', 'hyundai', 'i', 'iball', 'iconbit', 'idroid', 'inew', 'infocus', 'innostream', 'inq', 'intermec', 'intex', 'iocean', 'jcb', 'jiayu', 'jolla', 'just5', 'karbonn', 'kazam', 'kyocera', 'latte', 'lava', 'leagoo', 'lenovo', 'lg', 'marshall', 'maxon', 'meitu', 'meizu', 'micromax', 'microsoft', 'mitac', 'mitsubishi', 'mobiado', 'mobistel', 'motorola', 'motorolla', 'mysaga', 'nec', 'ne

A few OEM names are split incorrectly (for example, "at" and "t" end up separated), so the count above should be treated as indicative only.

There are more sources (331) than product OEMs (166), which indicates that many ratings come from third-party sellers and review sites rather than from the manufacturers themselves.

**Steps and tasks:**<br>
1. C. Round off scores to the nearest integers

In [11]:
# round off
ratings.score=ratings['score'].round().astype(int,errors='ignore')

In [12]:
# review rounded values
sorted(ratings.score.unique())

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, nan]

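NaNs are present, so the column cannot be typecast to integers; because of errors='ignore' the cast is silently skipped and the scores remain floats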
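
If integer scores are wanted before the missing values are imputed, pandas' nullable integer dtype is one option. This is a small sketch on made-up values, not a change to the notebook's flow; the series `scores` is hypothetical.

```python
import numpy as np
import pandas as pd

# made-up scores, including one missing value
scores = pd.Series([9.2, np.nan, 4.0, 0.2])

# "Int64" (capital I) is pandas' nullable integer dtype: NaN becomes <NA>
rounded = scores.round().astype("Int64")
print(rounded)
```

The cast works here because every non-missing value is already integral after `round()`.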

**Steps and tasks:**<br>
1. D. Check for missing values. Impute the missing values, if any

In [13]:
#lets check for missing values
nulsCount(ratings)

Unnamed: 0,NULL,NULL %,NAN,NAN %,BLANKS,BLANKS %,UNEXP,UNEXP %
score,63489,4.49,63489,4.49,0,0.0,0,0.0
score_max,63489,4.49,63489,4.49,0,0.0,0,0.0
extract,19361,1.37,19361,1.37,0,0.0,109,0.01
author,63202,4.47,63202,4.47,0,0.0,1888,0.13
product,1,0.0,1,0.0,0,0.0,0,0.0


In [14]:
# review the unexpected values in extract column
ratings.loc[ratings['extract'].isin(["-","?",".","NA","N/A","Unknown"])].head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
47004,/cellphones/apple-iphone-6s/,9/16/2015,pt,br,Ofertou,shopping.ofertou.com,10.0,10.0,.,DAVI DA SILVA LIMA,Apple iPhone 6S 16GB
58721,/cellphones/apple-iphone-se/,5/7/2017,ru,ru,Связной,svyaznoy.ru,,,?,aminatdin@gmail.com,Apple iPhone SE 16GB (розовое золото)
60478,/cellphones/apple-iphone-se/,5/17/2016,ru,ru,Связной,svyaznoy.ru,,,?,aminatdin@gmail.com,Apple iPhone SE 16GB (серый космос)
82730,/cellphones/samsung-galaxy-s6-edge-sm-g925f/,5/23/2016,pt,br,Ofertou,shopping.ofertou.com,10.0,10.0,-,e-bit,Samsung Galaxy S6 Edge SM-G925 32GB
83421,/cellphones/samsung-galaxy-s6-edge-sm-g925f/,11/27/2015,it,it,Dooyoo,dooyoo.it,8.0,10.0,.,ernestogue,Samsung Galaxy S6 Edge G925F 64GB


In [15]:
# review the unexpected values in author column
ratings.loc[ratings['author'].isin(["-","?",".","NA","N/A","Unknown"])].head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
5252,/cellphones/samsung-galaxy-s6-edgeplus/,9/15/2015,fr,fr,Amazon,amazon.fr,6.0,10.0,Qu’écrire de plus que ce qui a déjà été écrit ...,.,Samsung Galaxy S6 Edge Plus Smartphone débloqu...
15277,/cellphones/samsung-galaxy-s7-edge/,3/1/2017,de,de,Amazon,amazon.de,6.0,10.0,"Schön vepackt und auf dem Papier alles, was ma...",.,"Samsung Galaxy S7 EDGE Smartphone (5,5 Zoll (1..."
32768,/cellphones/samsung-galaxy-s7-789999/,4/25/2017,nl,be,KIESKEURIG,kieskeurig.be,8.0,10.0,"erg mooi en goed toestel, eerst alles goed uit...",.,Samsung Galaxy S7 zwart / 32 GB
32769,/cellphones/samsung-galaxy-s7-789999/,4/25/2017,nl,nl,KIESKEURIG,kieskeurig.nl,8.0,10.0,"erg mooi en goed toestel, eerst alles goed uit...",.,"Samsung Galaxy S7 goud, roze / 32 GB"
51238,/cellphones/apple-iphone-7/,11/12/2016,en,gb,Very,very.co.uk,10.0,10.0,Amazing !! I like this very much !! Ease to us...,Unknown,"Apple iPhone 7, 32Gb - Rose Gold"


Lets also delete the characters left behind by UTF-8 decoding errors (shown as ?) and any other non-alphanumeric characters that carry no meaning,<br>
and treat the resulting empty author names and extracts as missing values for further imputation

In [16]:
# generate a list of special characters EXCEPT whitespace
splchr=[chr(i) for i in range(0,32)] # upto before whitespace ascii 32
splchr.extend([chr(i) for i in range(33,48)]) # upto before 0 ascii 48
splchr.extend([chr(i) for i in range(58,65)]) # upto before A ascii 65
splchr.extend([chr(i) for i in range(91,97)]) # upto before a ascii 97
splchr.extend([chr(i) for i in range(123,192)]) # rest
splchr.extend([chr(215),chr(247)]) # additional
splchr=''.join(splchr) # create a word of splchr

In [17]:
# delete non alphanumeric characters
# re.escape keeps ']', '\' and '^' literal inside the character class
pattern=re.compile('[%s]'%re.escape(splchr))
ratings["author"]=ratings["author"].apply(lambda x: pattern.sub('',str(x)).strip())
ratings["extract"]=ratings["extract"].apply(lambda x: pattern.sub('',str(x)).strip())
ratings["product"]=ratings["product"].apply(lambda x: pattern.sub('',str(x)).strip())
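
To see what this character stripping does to individual values, here is a self-contained sketch on a few strings. The helper names `build_strip_pattern` and `clean` are hypothetical; the character ranges are the ones listed in the cell above, and `re.escape` is used so that characters such as `]`, `\` and `^` stay literal inside the character class.

```python
import re

def build_strip_pattern():
    """Compile a character class of control, punctuation and symbol characters."""
    codes = (list(range(0, 32)) + list(range(33, 48)) + list(range(58, 65))
             + list(range(91, 97)) + list(range(123, 192)) + [215, 247])
    chars = "".join(chr(c) for c in codes)
    return re.compile("[%s]" % re.escape(chars))

STRIP = build_strip_pattern()

def clean(text):
    """Remove the unwanted characters and trim surrounding whitespace."""
    return STRIP.sub("", str(text)).strip()

print(clean("R. Craig"))
print(clean('Samsung Galaxy S8 (64GB) 5.8"'))
print(clean("a\\b[c]^d"))
```

Note that `str(x)` turns a missing value into the text `nan`, which is why `"nan"` appears in the replacement lists used for imputation.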

Lets impute the missing & unexpected values in the author and extract columns with meaningful replacements

In [18]:
ratings.loc[ratings['author'].isin(["-",".","NA","N/A","nan","Unknown","unknown","UNKNOWN"]),"author"]="ANONYMOUS"
ratings.loc[ratings['author'].isna(),"author"]="ANONYMOUS"
ratings.loc[ratings['author'].isin([""," "]),"author"]="ANONYMOUS"

In [19]:
ratings.loc[ratings['extract'].isin(["-",".","NA","N/A","nan","Unknown","unknown","UNKNOWN"]),"extract"]="no_comments"
ratings.loc[ratings['extract'].isna(),"extract"]="no_comments"
ratings.loc[ratings['extract'].isin([""," "]),"extract"]="no_comments"

In [20]:
ratings.loc[ratings['product'].isin(["-",".","NA","N/A","nan","Unknown","unknown","UNKNOWN"]),"product"]="no_description"
ratings.loc[ratings['product'].isna(),"product"]="no_description"
ratings.loc[ratings['product'].isin([""," "]),"product"]="no_description"

In [21]:
# review the imputed no_description values in product
ratings.loc[ratings["product"]=="no_description"]

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
802795,/cellphones/samsung-galaxy-s-iii/,1/22/2014,de,de,Amazon,amazon.de,10.0,10.0,Bestes Smartphone was ich bisher hatte öafkdö...,ANONYMOUS,no_description


lets impute the phone model from phone_url

In [22]:
key=ratings.loc[ratings["product"]=="no_description"].index
replacement=ratings.values[key,0][0].strip().split('/')[-2]
ratings.loc[ratings["product"]=="no_description","product"]=replacement
ratings.iloc[key]

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
802795,/cellphones/samsung-galaxy-s-iii/,1/22/2014,de,de,Amazon,amazon.de,10.0,10.0,Bestes Smartphone was ich bisher hatte öafkdö...,ANONYMOUS,samsung-galaxy-s-iii


In [23]:
# review the missing values
nulsCount(ratings)

Unnamed: 0,NULL,NULL %,NAN,NAN %,BLANKS,BLANKS %,UNEXP,UNEXP %
score,63489,4.49,63489,4.49,0,0.0,0,0.0
score_max,63489,4.49,63489,4.49,0,0.0,0,0.0


Lets impute the score & score_max columns phone by phone, using the product information to group the records

In [24]:
for samp in ratings["product"].sample(10):
    print(samp)

Samsung S5830 Galaxy Ace Black
LG U880
Lenovo Vibe K4 Note White16GB
Asus ZenFone 6 Smartphone Storage 16 GB Nero Italia
Nokia E52
Apple iPhone 5s Silver 16GB
Nokia 113 SIM Free Mobile Phone  Black discontinued by manufacturer
Microsoft Lumia 950 DualSIM Smartphone 52 Zoll 132 cm TouchDisplay 32 GB Speicher Windows 10 schwarz
Lenovo S60 Grey Dual SIM
LG G2  VS980  32GB Android Smartphone  Verizon  GSM  White Certified Refurbished


The product column consists of detailed descriptions of the product,<br>
hence lets extract the phone name from the phone_url column

In [25]:
# extract phone name
ratings["phone"]=ratings.phone_url.apply(lambda x: x.strip().split('/')[-2])
ratings.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product,phone
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8,samsung-galaxy-s8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone the phone is sleek and smooth a...,james0923,Samsung Galaxy S8,samsung-galaxy-s8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel Nice heft Processors still slugg...,R Craig,Samsung Galaxy S8 64GB G950U 58 4G LTE Unlocke...,samsung-galaxy-s8
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.0,10.0,Never disappointed One of the reasons Ive been...,Buster2020,Samsung Galaxy S8 64GB ATT,samsung-galaxy-s8
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,Ive now found that im in a group of people tha...,S Ate Mine,Samsung Galaxy S8,samsung-galaxy-s8


In [26]:
# before imputing score & score_max by phone model, lets review the numeric-only records in text fields
# flag the records left with only digits in the product field (a side effect of deleting non-alphanumeric characters)
ratings["prod_flag"]=ratings["product"].apply(lambda x: re.sub(' ','',x).isdecimal())
# impute those with the phone name
ratings.loc[ratings["prod_flag"],"product"]=ratings.loc[ratings["prod_flag"],"phone"]

In [27]:
# flag the records left with only digits in the author field (a side effect of deleting non-alphanumeric characters)
ratings["auth_flag"]=ratings["author"].apply(lambda x: re.sub(' ','',x).isdecimal())
# lets impute those author names
ratings.loc[ratings["auth_flag"],"author"]="ANONYMOUS"

In [28]:
# flag the records left with only digits in the extract field (a side effect of deleting non-alphanumeric characters)
ratings["ext_flag"]=ratings["extract"].apply(lambda x: re.sub(' ','',x).isdecimal())
# lets impute those reviews
ratings.loc[ratings["ext_flag"],"extract"]="no_comments"

In [29]:
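def modelwise_imputer(ratings_phone):
    """fill missing score with the phone's mean score and missing score_max with its highest score"""
    # assign the result back instead of fillna(inplace=True) on a column selection
    ratings_phone['score']=ratings_phone['score'].fillna(ratings_phone['score'].mean())
    ratings_phone['score_max']=ratings_phone['score_max'].fillna(ratings_phone['score'].max())
    return ratings_phone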

In [30]:
# impute model wise details: one parallel task per phone model
phonelist=sorted(set(ratings["phone"]))
print("Total %d tasks planned using Parallel jobs"%len(phonelist))
imputed_ratings=Parallel(n_jobs=-1,pre_dispatch='1.5*n_jobs',verbose=1
                        )(delayed(modelwise_imputer
                                 )(ratings[ratings["phone"]==phone].copy()
                                  ) for phone in phonelist)

Total 5556 tasks planned using Parallel jobs


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  39 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 189 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done 439 tasks      | elapsed:   27.1s
[Parallel(n_jobs=-1)]: Done 789 tasks      | elapsed:   47.0s
[Parallel(n_jobs=-1)]: Done 1239 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 1789 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 2439 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 3189 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 4039 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 4989 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 5556 out of 5556 | elapsed:  5.3min finished


In [31]:
# concat the parallel output results
ratings_imputed=pd.concat(imputed_ratings)
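
The same per-phone imputation can be expressed without one task per phone by using `groupby(...).transform`, which is usually much faster than thousands of filtered copies. This is a sketch on a made-up frame rather than the notebook's data; the column names follow the notebook, while `demo` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# made-up data: phone "c" has no scores at all
demo = pd.DataFrame({
    "phone":     ["a", "a", "a", "b", "b", "c"],
    "score":     [8.0, np.nan, 4.0, 10.0, np.nan, np.nan],
    "score_max": [10.0, np.nan, 10.0, 10.0, np.nan, np.nan],
})

by_phone = demo.groupby("phone")["score"]
filled = demo.copy()
# missing score -> mean score of the same phone
filled["score"] = demo["score"].fillna(by_phone.transform("mean"))
# missing score_max -> highest score of the same phone
filled["score_max"] = demo["score_max"].fillna(by_phone.transform("max"))
print(filled)
```

Phones whose records all lack a score stay missing, because the group mean and maximum are themselves NaN.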

In [32]:
# review the final result shape
ratings_imputed.shape

(1415133, 15)

In [33]:
# review the missing values
nulsCount(ratings_imputed)

Unnamed: 0,NULL,NULL %,NAN,NAN %,BLANKS,BLANKS %,UNEXP,UNEXP %
score,57,0.0,57,0.0,0,0.0,0,0.0
score_max,57,0.0,57,0.0,0,0.0,0,0.0


These 57 missing values remain because every record of those phones has a missing score, so mean() and max() have no reference values to work with.<br>
Lets review a sample of it

In [34]:
pick=ratings_imputed.loc[ratings_imputed["score"].isna(),"phone"].sample(1)
ratings.loc[ratings["phone"]==pick.values[0]]

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product,phone,prod_flag,auth_flag,ext_flag
1043022,/cellphones/lg-dlite/,6/13/2011,en,us,HandCellPhone,handcellphone.com,,,I like my new phone but can not figure out how...,suzette,LG dLIte,lg-dlite,False,False,False


The above example shows a phone model for which no ratings are available at all,<br>
hence lets drop those records

In [35]:
ratings_imputed.dropna(axis=0,inplace=True)

In [36]:
# review the missing values
nulsCount(ratings_imputed)

nulsCount returns nothing, hence no missing or unexpected values remain and the data set is imputed completely

**Steps and tasks:**<br>
1. E. Check for duplicate values and remove them, if any.

In [37]:
#check count of duplicate records
ratings_imputed.duplicated().sum()

7415

In [38]:
drop_index=ratings_imputed.loc[ratings_imputed.duplicated()].index
ratings_imputed.drop(drop_index,axis=0,inplace=True)
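
For reference, pandas offers `drop_duplicates` for the same purpose. A minimal sketch on a made-up frame (the names `demo`, `n_dupes` and `deduped` are hypothetical):

```python
import pandas as pd

# made-up data: the second row repeats the first one exactly
demo = pd.DataFrame({
    "author":  ["a", "a", "b"],
    "product": ["p", "p", "p"],
    "score":   [8, 8, 8],
})

n_dupes = int(demo.duplicated().sum())                # rows that repeat an earlier row
deduped = demo.drop_duplicates().reset_index(drop=True)  # keeps the first occurrence
print(n_dupes, deduped.shape)
```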

In [39]:
#review count of duplicate records
ratings_imputed.duplicated().sum()

0

All duplicates are dropped.<br><br>
Since the imputer used the mean() function, the imputed scores have decimal values; lets round them off

In [40]:
# round off
ratings_imputed.score=ratings_imputed['score'].round().astype(int,errors='ignore')

In [41]:
ratings_imputed.score.unique()

array([ 2, 10,  6,  8,  4,  5,  1,  9,  7,  3,  0])

**Steps and tasks:**<br>
1. F. Keep only 1 Million data samples. Use random state=612.

Before selecting samples, lets remove records with a score of 0: they fall outside the 1 to 10 rating scale used for the recommender later on and add no value to recommender systems

In [42]:
shortlist=ratings_imputed.loc[ratings_imputed["score"]!=0]
sorted(shortlist.score.unique()) # review scores

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [43]:
# confirm data size
shortlist.shape

(1407619, 15)

In [44]:
# select 1Million samples
shortlist=shortlist.sample(n=1000000,random_state=612)
shortlist.shape # review shape

(1000000, 15)

**Steps and tasks:**<br>
1. G. Drop irrelevant features. Keep features like Author, Product, and Score.

In [45]:
shortlist.columns # review columns

Index(['phone_url', 'date', 'lang', 'country', 'source', 'domain', 'score',
       'score_max', 'extract', 'author', 'product', 'phone', 'prod_flag',
       'auth_flag', 'ext_flag'],
      dtype='object')

In [46]:
shortlist.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product,phone,prod_flag,auth_flag,ext_flag
285852,/cellphones/lg-g4/,7/28/2015,nl,nl,KIESKEURIG,kieskeurig.nl,8,10.0,Ik heb dit toestel nu een paar maanden en hij ...,lndmn,LG G4 H815 wit 32 GB Overzicht,lg-g4,False,False,False
292121,/cellphones/alcatel-onetouch-20-04/,1/30/2017,it,it,Amazon,amazon.it,10,10.0,Lho acquistato per mio suocero leggero tasti e...,LUCA C,Alcatel One Touch 2004G Telefono Cellulare GSM...,alcatel-onetouch-20-04,False,False,False
591361,/cellphones/sony-xperia-m2/,8/7/2014,pt,br,Americanas,americanas.com.br,8,10.0,comprei o aparelho há um mes e estou muito sat...,fabiocastro99,Sony Smartphone Sony Xperia M2 Preto Android 4...,sony-xperia-m2,False,False,False
887149,/cellphones/apple-iphone-4s/,26/1/2012,de,de,Ciao,ciao.de,6,10.0,Liebe Ciao Leserinnen und Ciao Leser ich mchte...,Yetiritter,Apple iPhone 4S 8GB,apple-iphone-4s,False,False,False
1182939,/cellphones/lg-kp500/,5/29/2009,en,us,Newegg,newegg.com,8,10.0,Cant wait to open it,Dakotah J,LG Mobile KP500 Black Unlocked GSM Bar phones ...,lg-kp500,False,False,False


let us retain the columns score, extract, author and product

In [47]:
# reduce the 1Million records with only relevant features
df=shortlist[['author','product','score','extract']].copy()
df.reset_index(drop=True,inplace=True)

In [48]:
df.shape

(1000000, 4)

In [49]:
df.head()

Unnamed: 0,author,product,score,extract
0,lndmn,LG G4 H815 wit 32 GB Overzicht,8,Ik heb dit toestel nu een paar maanden en hij ...
1,LUCA C,Alcatel One Touch 2004G Telefono Cellulare GSM...,10,Lho acquistato per mio suocero leggero tasti e...
2,fabiocastro99,Sony Smartphone Sony Xperia M2 Preto Android 4...,8,comprei o aparelho há um mes e estou muito sat...
3,Yetiritter,Apple iPhone 4S 8GB,6,Liebe Ciao Leserinnen und Ciao Leser ich mchte...
4,Dakotah J,LG Mobile KP500 Black Unlocked GSM Bar phones ...,8,Cant wait to open it


**Steps and tasks:**<br>
2. Answer the following questions.<br>
A. Identify the most rated <del><i>features</i></del> products

In [50]:
# the most rated product in terms of frequency of ratings
freq=df["product"].value_counts()
print("The %s cellphone have been rated the most frequently at %d times"%(freq.index[0],freq.iloc[0])) # iloc: positional access on a label-indexed Series
print("The mean score secured by the above model is %d"%(df.loc[df["product"]==freq.index[0],"score"].mean()))

The Lenovo Vibe K4 Note White16GB cellphone have been rated the most frequently at 3759 times
The mean score secured by the above model is 7


In [51]:
# the most rated product in terms of highest mean score
toprated=pd.DataFrame(df.groupby(by="product")["score"].mean())
toprated["count"]=df.groupby(by="product")["score"].count()
toprated.reset_index(inplace=True)
toprated.set_index('product',inplace=True)
toprated.sort_values(by=["score","count"],ascending=False).head(10) # displaying top 10 for simplicity

product,score,count
Samsung Galaxy Note5,10.0,147
Motorola Smartphone Motorola Moto G Dual Chip Desbloqueado TIM Android 43 Tela 45 8GB 3G WiFi Câmera 5MP Preto,10.0,129
Motorola Smartphone Motorola Moto X Desbloqueado Preto Android 422 Câmera 10MP e Frontal 2MP Memória Interna de 16GB GSM,10.0,125
Samsung Smartphone Dual Chip Samsung Galaxy SIII Duos Desbloqueado Claro Azul Android 41 3GWiFi Câmera 5MP,10.0,124
Samsung Smartphone Dual Chip Samsung Galaxy SIII Duos Desbloqueado Claro Azul Android 41 3GWiFi Cmera 5MP,10.0,118
Samsung Smartphone Galaxy Win Duos Branco Desbloqueado Dual Chip Câmera 5MP Processador Quad Core 12 Ghz Android 41 3G Wi Fi e Memória 8GB,10.0,113
Motorola Smartphone Motorola Novo Moto G DTV Colors Dual Chip XT 1069 Desbloqueado Android 44 Tela 5 16GB 3G WiFi Câmera de 8MP Preto,10.0,110
Nokia Smartphone Nokia Lumia 520 Desbloqueado Oi Preto Windows Phone 8 Câmera 5MP 3G WiFi Memória Interna 8G GPS,10.0,107
Apple iPhone 4S Branco 8GB Apple,10.0,99
LG 514 Optimus One P503,10.0,47


**Steps and tasks:**<br>
2. B. Identify the users with most number of reviews.

In [52]:
# reviews are provided in the extract column
# extracts with no_comments are not counted as reviews by the user, as they are imputed values
# exclude the ANONYMOUS author as it is an imputed value
# lets display the top 10 users with the maximum number of reviews
df.loc[(df["extract"]!="no_comments"
       ) & ~(df["author"].str.upper().isin(["ANONYMOUS"])),"author"].value_counts().head(10)

Amazon Customer    54620
Cliente Amazon     13723
ebit                5944
Client dAmazon      5450
Amazon Kunde        3410
einer Kundin        1837
einem Kunden        1337
Александр            706
David                681
Marco                656
Name: author, dtype: int64

**Steps and tasks:**<br>
2. C. Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final dataset.

In [53]:
# shortlist users (authors) with more than 50 ratings
# ensure not to include ANONYMOUS author as it is imputed value
top_users=df.loc[~df["author"].str.upper().isin(["ANONYMOUS"]),"author"].value_counts() 
top_users=list(top_users.loc[top_users>50].index)

In [54]:
# shortlist products with more than 50 ratings
top_products=df["product"].value_counts()
top_products=list(top_products.loc[top_products>50].index)

In [55]:
# short list records matching above lists
top_records=df.loc[(df["author"].isin(top_users)) & (df["product"].isin(top_products))]
top_records.head()

Unnamed: 0,author,product,score,extract
31,Amazon Customer,Samsung Galaxy Grand Prime Dual Sim Factory Un...,8,Really good phone for the money
32,carlo,elephone P8000 Smartphone 4G FDDLTE 64bit MTK6...,6,Prodotto buono Batteria super ma ha un proble...
76,Amazon Customer,Apple iPhone 6 Plus 128GB Factory Unlocked GSM...,10,Phone as good as brand new
79,Amazon Customer,OnePlus 3T Soft Gold 6GB RAM 64GB memory,10,OP3T is way too cool and premium for its price...
87,Cliente Amazon,Motorola Moto G 3 Generación Carcasa Oficial ...,8,El artículo llegó un día antes y cumple mis es...


In [56]:
# shape of top_records
top_records.shape

(98732, 4)
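
The filtering logic above can be wrapped in a small helper and checked on a toy frame. This is a sketch under assumptions: the helper name `filter_active` and the demo values are hypothetical, and the threshold is lowered to 2 so that the toy data shows an effect.

```python
import pandas as pd

def filter_active(df, user_col, item_col, min_ratings):
    """Keep rows whose user AND item each have more than `min_ratings` ratings in `df`."""
    user_counts = df[user_col].value_counts()
    item_counts = df[item_col].value_counts()
    keep_users = user_counts[user_counts > min_ratings].index
    keep_items = item_counts[item_counts > min_ratings].index
    return df[df[user_col].isin(keep_users) & df[item_col].isin(keep_items)]

demo = pd.DataFrame({
    "author":  ["u1", "u1", "u1", "u2", "u2", "u3"],
    "product": ["p1", "p1", "p2", "p1", "p2", "p1"],
    "score":   [8, 6, 10, 4, 9, 7],
})
subset = filter_active(demo, "author", "product", min_ratings=2)
print(subset.shape)
```

The counts are taken on the full data, as in the notebook, so a product that passes the threshold overall may still have only a handful of ratings inside the filtered subset.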

**Steps and tasks:**<br>
3. Build a popularity based model and recommend top 5 mobile phones.

In [57]:
# model building and shortlist
popular5=top_records.groupby('product')['score'].mean().sort_values(ascending=False).head(5)

# display results
print("The following mobile phones")
for i,key in enumerate(popular5.index):
    print(i+1,key,sep=') ',end='\n')
print("are the popular ones with ratings as",list(popular5),"respectively")

The following mobile phones
1) Sony Ericsson W200i
2) Samsung Smartphone Samsung Galaxy Win Duos Dual Chip Desbloqueado Android 41
3) Samsung Galaxy S3 mini I8190 Smartphone 102 cm 4 Zoll AMOLED Display DualCore 1GHz 1GB RAM 5 Megapixel Kamera Android 41 sapphireblack
4) Huawei Vision U8850
5) Virgin HTC Desire 510 A11 Blue Virgin Mobile
are the popular ones with ratings as [10.0, 10.0, 10.0, 10.0, 10.0] respectively
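
Ranking by the plain mean may let products with very few ratings in the filtered subset reach the top with a perfect 10.0. One common alternative, shown here as a swapped-in technique and not the notebook's model, is a weighted (shrunken) mean that pulls products with few ratings toward the global mean. The frame `demo` and the prior weight `m` are hypothetical.

```python
import pandas as pd

# made-up ratings: product A has a single 10, product B has four 9s
demo = pd.DataFrame({
    "product": ["A", "B", "B", "B", "B", "C", "C"],
    "score":   [10, 9, 9, 9, 9, 5, 5],
})

stats = demo.groupby("product")["score"].agg(["mean", "count"])
global_mean = demo["score"].mean()
m = 3  # assumed prior weight: acts like m extra ratings at the global mean

# weighted rating = (count * mean + m * global_mean) / (count + m)
stats["weighted"] = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
ranking = stats.sort_values("weighted", ascending=False)
print(ranking)
```

With the plain mean product A would rank first; with the weighted mean product B does, because its score rests on more ratings.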


**Steps and tasks:**<br>
4. Build a collaborative filtering model using SVD.<br>
You can use SVD from surprise or build it from scratch<br>
(Note: In case you’re building it from scratch you can limit your data points to 5000 samples if you face memory issues).<br>
Build a collaborative filtering model using kNNWithMeans from surprise.<br>
You can try both user-based and item-based model.

In [58]:
# export pandas dataframe to surprise data type
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(top_records[["author","product","score"]],reader)
# split train test data
trainset, testset = train_test_split(data, test_size=.25,random_state=612)

In [59]:
%%time
# Collaborative Filtering based recommender using Singular Value Decomposition
svd_cf = SVD()
svd_cf.fit(trainset)

CPU times: user 2.83 s, sys: 0 ns, total: 2.83 s
Wall time: 2.83 s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f59809aea30>

In [60]:
%%time
# user-user Collaborative Filtering based recommender using KNN with Means
knn_user = KNNWithMeans(sim_options={ 'user_based': True})
knn_user.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.
CPU times: user 250 ms, sys: 0 ns, total: 250 ms
Wall time: 248 ms


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f598923c580>

In [61]:
%%time
# item-item Collaborative Filtering based recommender using KNN with Means
knn_item = KNNWithMeans(sim_options={ 'user_based': False})
knn_item.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.
CPU times: user 23.1 s, sys: 164 ms, total: 23.2 s
Wall time: 23.1 s


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f598923c070>
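
For orientation, the prediction rules behind the three fitted models can be written out as follows, using the notation of the Surprise documentation: $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are latent factor vectors, $\mu_u$ and $\mu_i$ are user and item mean ratings, and $N_i^k(u)$ is the set of the $k$ nearest neighbours of user $u$ that rated item $i$ (analogously $N_u^k(i)$ for items).

```latex
% SVD
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

% KNNWithMeans, user-based (user_based = True)
\hat{r}_{ui} = \mu_u +
  \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)\,(r_{vi} - \mu_v)}
       {\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)}

% KNNWithMeans, item-based (user_based = False)
\hat{r}_{ui} = \mu_i +
  \frac{\sum_{j \in N_u^k(i)} \operatorname{sim}(i, j)\,(r_{uj} - \mu_j)}
       {\sum_{j \in N_u^k(i)} \operatorname{sim}(i, j)}
```

The item-based variant needs an item-by-item similarity matrix, which is consistent with the longer fitting time seen above when there are many more products than users.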

**Steps and tasks:**<br>
5. Evaluate the collaborative model. Print RMSE value.

In [62]:
%%time
print("SVD CF")
accuracy.rmse(svd_cf.test(testset));

SVD CF
RMSE: 2.6718
CPU times: user 155 ms, sys: 8 ms, total: 163 ms
Wall time: 160 ms


2.6718174625773377

In [63]:
%%time
print("KNN with Means CF (user-user)")
accuracy.rmse(knn_user.test(testset));

KNN with Means CF (user-user)
RMSE: 2.6970
CPU times: user 2.24 s, sys: 7.93 ms, total: 2.24 s
Wall time: 2.24 s


2.697022910184829

In [64]:
%%time
print("KNN with Means CF (item-item)")
accuracy.rmse(knn_item.test(testset));

KNN with Means CF (item-item)
RMSE: 2.7027
CPU times: user 2min 12s, sys: 41.4 ms, total: 2min 12s
Wall time: 2min 12s


2.702675972271247
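
The RMSE reported above is the square root of the mean squared difference between actual and predicted scores. A minimal NumPy sketch on made-up numbers (the function `rmse` and the sample values are hypothetical, not taken from the test set):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# errors are 1, 0 and -3, so the RMSE is sqrt((1 + 0 + 9) / 3)
value = rmse([10, 8, 2], [9, 8, 5])
print(round(value, 4))
```

Because the errors are squared, a few large misses weigh more heavily than many small ones.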

The KNNWithMeans item-item collaborative filtering model took roughly 90 times longer to fit than the user-user model (23.1 s vs 248 ms) and about 2 min 12 s to predict (vs 2.24 s), yet its RMSE (2.7027) is no better than the user-user RMSE (2.6970), hence lets not use it for further study<br>

The KNNWithMeans user-user collaborative filtering recommender has a higher RMSE (2.6970) than the SVD algorithm (2.6718). Hence lets stick to the latter

**Steps and tasks:**<br>
6. Predict score (average rating) for test users.

In [65]:
# predict
pred=svd_cf.test(testset)

In [66]:
# lets display the results in a pandas dataframe for readability
result=pd.DataFrame(testset,columns=["user","item","score"])
result["predicted_score"]=[rec.est for rec in pred] # predicted ratings
result["was_impossible"]=[rec.details.get('was_impossible') for rec in pred] # was_impossible
result.sample(20)

Unnamed: 0,user,item,score,predicted_score,was_impossible
23703,ebit,Smartphone Asus ZenFone 3 ZE520KL,10.0,9.277715,False
9704,Amazon Customer,Lenovo Vibe K5 Gold VoLTE update,8.0,7.254228,False
17327,Cliente Amazon,Huawei P9 Lite Smartphone 52 Full hd 3 GB RAM ...,10.0,9.125428,False
79,Amazon Customer,Lenovo PHAB Plus Tablet 68 inch 32GB WiFi LTE ...,6.0,6.268403,False
14002,Amazon Customer,Lenovo Used Lenovo Zuk Z1 Space Grey 64GB,2.0,5.19953,False
14500,Amazon Customer,Lenovo Used Lenovo Zuk Z1 Space Grey 64GB,2.0,5.19953,False
9027,Amazon Customer,Nokia 130 Dual SIM Black,6.0,7.312241,False
9950,Nicole,LG Optimus L90 D415 4G GSM Android Smartphone ...,10.0,7.706359,False
15991,Алексей,Samsung Galaxy A3 2016,8.0,8.931976,False
22085,Nico,Blackberry PRIV STV1004 Smartphone 137 cm 54 Z...,4.0,7.793073,False


**Steps and tasks:**<br>
7. Report your findings and inferences

In [67]:
result[result["was_impossible"]]

Unnamed: 0,user,item,score,predicted_score,was_impossible


None of the predictions was flagged as impossible, so every test record received an estimate from the model.
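An empty result here is expected for a biased SVD model. As far as the Surprise documentation describes it, the estimate is the global mean plus the user bias, the item bias and the dot product of the latent factors, with the terms of an unseen user or item treated as zero, and the result is clipped to the rating scale. The sketch below mimics that rule in plain Python; the function `svd_estimate`, its arguments and the 1-10 scale are assumptions for illustration, not the library's API.

```python
def svd_estimate(global_mean, user_bias=None, item_bias=None,
                 user_factors=None, item_factors=None,
                 rating_scale=(1.0, 10.0)):
    """Biased matrix-factorisation estimate with fallbacks for unseen ids.

    Pass None for the bias/factors of a user or item that is not in the
    training set; the corresponding terms are then simply left out.
    """
    est = global_mean
    if user_bias is not None:
        est += user_bias
    if item_bias is not None:
        est += item_bias
    if user_factors is not None and item_factors is not None:
        est += sum(p * q for p, q in zip(user_factors, item_factors))
    low, high = rating_scale
    return min(high, max(low, est))  # clip to the rating scale

# Hypothetical numbers, chosen only to show the three situations.
known = svd_estimate(8.0, 0.5, 1.0, [0.2, 0.1], [1.0, 2.0])
clipped = svd_estimate(8.0, 1.5, 1.0, [0.2, 0.1], [1.0, 2.0])
cold_start = svd_estimate(8.0)  # unseen user and unseen item
print(known, clipped, cold_start)
```

The clipping step would also explain why many of the highest estimates in the result table are exactly 10.0.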

In [68]:
# top ten predicted scores across all test users
result.sort_values(by=['predicted_score'],ascending=False).head(10)

Unnamed: 0,user,item,score,predicted_score,was_impossible
11551,Alessio,Lenovo Motorola Moto G LTE Smartphone Display ...,8.0,10.0,False
9720,Gianluca,Asus ZE551ML2A760WW Smartphone ZenFone 2 Delux...,8.0,10.0,False
11864,Massimo,Lenovo Motorola Moto G Smartphone 45 pollici d...,8.0,10.0,False
2683,Massimo,Asus ZenFone 2 Laser 55 Smartphone 16 GB Dual ...,10.0,10.0,False
5321,Марина,Samsung Galaxy Note5,10.0,10.0,False
12164,Andrea,Lenovo Motorola Moto G Smartphone 45 pollici d...,10.0,10.0,False
20764,Marco,Sony Xperia Z3 Compact Smartphone 16 GB Nero I...,10.0,10.0,False
12340,Marco,Microsoft Nokia Lumia 630 SingleSIM Smartphone...,10.0,10.0,False
12576,Paolo,Lenovo Motorola Moto G 4G 2 Generazione Smartp...,8.0,10.0,False
2440,Andrea,Lenovo Motorola Moto G Smartphone 45 pollici d...,10.0,10.0,False


The top 10 predictions all correspond to items that actually received good ratings (8 or 10), so the model predicts fairly well at the top end.

In [69]:
# let's cross-check for poor predictions: highest predicted score, lowest actual score
result.sort_values(by=['predicted_score','score'],ascending=[False,True]).head(10)

Unnamed: 0,user,item,score,predicted_score,was_impossible
11475,James,APPLE iPhone 7 Plus Black 128 GB,1.0,10.0,False
3000,Amazon Kunde,Microsoft Lumia 640 XL DualSIM LTE Smartphone ...,2.0,10.0,False
3739,Дмитрий,Nokia 101 Premium Black,2.0,10.0,False
4663,AmazonKunde,Sony Xperia Z1 Compact Smartphone 43 Zoll 109 ...,2.0,10.0,False
7961,Amazon Kunde,Microsoft Lumia 640 XL DualSIM LTE Smartphone ...,2.0,10.0,False
17904,Peter,Samsung Galaxy J5 Smartphone 5 Zoll 127 cm Tou...,2.0,10.0,False
21057,Marco,WIKO Fever 4G Smartphone 16 GB Dual SIM Bianco,2.0,10.0,False
1093,Massimiliano,Apple iPhone 5C 16GB sbloccato blu ciano,4.0,10.0,False
2157,Andrea,Asus ZenFone 2 Laser Smartphone Display da 5 1...,4.0,10.0,False
2878,Daniel,Samsung Galaxy Note 3 Smartphone 145 cm 57 Zol...,4.0,10.0,False


These are cases where the model predicted the maximum score of 10 for items the user actually rated between 1 and 4; misses of this kind contribute heavily to the RMSE.
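To see how much such misses weigh, the squared error of each record can be listed next to its share of the total. The sketch below uses a few (user, score, predicted_score) rows copied from the two tables above; the variable names are illustrative, and five rows are of course not representative of the whole test set.

```python
# (user, actual score, predicted score) copied from the tables above
rows = [
    ("James", 1.0, 10.0),
    ("Amazon Kunde", 2.0, 10.0),
    ("Massimiliano", 4.0, 10.0),
    ("Massimo", 10.0, 10.0),
    ("Alessio", 8.0, 10.0),
]

# squared error per record, worst first
errors = sorted(
    ((user, (score - pred) ** 2) for user, score, pred in rows),
    key=lambda pair: pair[1],
    reverse=True,
)
total_squared_error = sum(err for _, err in errors)

for user, err in errors:
    print(f"{user:<14} squared error {err:5.1f}  share {err / total_squared_error:.0%}")
```

Sorting the full `result` frame by the same squared error would surface the worst records directly, instead of relying on the two-key sort used in the cell above.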

**Steps and tasks:**<br>
8. Try and recommend top 5 products for test users.

In [70]:
test_users=sorted(set(result["user"])) #obtain user list
recommendations=[]
for user in test_users:
    record=[user]
    # obtain the (up to) 5 items with the highest predicted score for this user
    record.extend(list(result[result["user"]==user].sort_values(by=["predicted_score"],ascending=False).head(5).item))
    recommendations.append(record)
recoms=pd.DataFrame(recommendations,columns=["user","rec_1","rec_2","rec_3","rec_4","rec_5"])

# display recommendations
recoms.sample(20)

Unnamed: 0,user,rec_1,rec_2,rec_3,rec_4,rec_5
28,AmazonKunde,Samsung Galaxy Note 3 Smartphone 145 cm 57 Zol...,Sony Xperia Z1 Compact Smartphone 43 Zoll 109 ...,Samsung Galaxy S4 Active Smartphone 127 cm 5 Z...,Samsung Galaxy S4 Active Smartphone 127 cm 5 Z...,Samsung Galaxy Note II N7100 Smartphone 16GB 1...
545,stefano,Lenovo Motorola Moto G LTE Smartphone Display ...,LG D802 G2 Smartphone 16 GB Nero Italia,Lenovo Motorola Moto G 4G 2 Generazione Smartp...,Huawei P8 lite Smartphone Display 50 IPS Dual ...,Samsung A500 Galaxy A5 Smartphone 16 GB Nero I...
153,Giorgio,Asus ZE551ML2A760WW Smartphone ZenFone 2 Delux...,Huawei P9 Lite Smartphone LTE Display 52 FHD P...,Lenovo Motorola Moto G Smartphone Display HD 4...,Lenovo Motorola Moto G 4G 3 Generazione Smartp...,Meizu M2 Note Smartphone 55 Full HD 4G 13MPX D...
285,Melissa,Samsung Galaxy S6 SMG920F Factory Unlocked Cel...,Samsung Convoy 3 SCHU680 Rugged 3G Cell Phone ...,LG Nexus 5X Unlocked Smartphone White 32GB US...,Motorola Moto E 2nd Generation Locked Cellphon...,Verizon LG Ally VS740 3G WiFi Camera Android S...
446,delicate,Nokia E51 Cep Telefonu,Nokia E51 Cep Telefonu,Nokia 6288 Cep Telefonu,Nokia N95 8 GB Cep Telefonu,Nokia 5800 XpressMusic Cep Telefonu
132,Filippo,Huawei P9 Lite Smartphone LTE Display 52 FHD P...,Lenovo Motorola Moto G Smartphone Display HD 4...,Samsung Galaxy S7 Smartphone 32 GB Nero,Honor 7 Smartphone 4G Display Full HD 52 Polli...,LG D855 G3 Smartphone 32 GB Nero Metallico Italia
271,Martin,Huawei P9 grijs zwart 32 GB,APPLE iPhone 7 Silver 32 GB,HUAWEI P9 Lite 16 GB Black,HUAWEI P9 Lite 16 GB Black,HTC Desire X Smartphone 1 GHz DualCore Prozess...
327,Qantas,HTC One X,Huawei Ascend Mate,Huawei Ascend Mate,HTC Windows Phone 8X,Samsung Galaxy Express I8730
368,Sarah,Huawei P9 grijs zwart 32 GB,Samsung Verizon Samsung Alias 2 U750 No Contra...,Honor 7 Smartphone débloqué 4G Ecran 52 pouces...,Samsung Galaxy S II i9100 DualCore Smartphone ...,Samsung B2100 Solid Extreme Sim Free Mobile Phone
53,Arun,Asus Zenfone Max ZC550KL6A076IN Black 3GB 32GB,SAMSUNG Galaxy J7 6 New 2016 Edition White 16 GB,OnePlus 3 Graphite 64 GB,Nokia Microsoft Lumia 640 Windows 81 Phone 4G ...,OnePlus X Onyx 16GB
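The sampled table shows the same product more than once for some users (for example `Nokia E51 Cep Telefonu` for user `delicate`), because the test set can hold several ratings of one item under one user name. A possible variant that keeps only distinct items per user is sketched below in plain Python; the function `top_n_per_user` and the estimate values are hypothetical.

```python
from collections import defaultdict

def top_n_per_user(predictions, n=5):
    """Return {user: [item, ...]} with the n best distinct items per user."""
    best = defaultdict(dict)
    for user, item, est in predictions:
        # keep the highest estimate when an item occurs more than once
        if est > best[user].get(item, float("-inf")):
            best[user][item] = est
    return {
        user: [item for item, _ in
               sorted(items.items(), key=lambda kv: kv[1], reverse=True)[:n]]
        for user, items in best.items()
    }

# (user, item, predicted_score) triples; the scores are made up
toy_predictions = [
    ("sozer", "Nokia E51 Cep Telefonu", 9.1),
    ("sozer", "Nokia N95 8 GB Cep Telefonu", 8.7),
    ("sozer", "Nokia N95 8 GB Cep Telefonu", 8.7),
    ("sozer", "Nokia 6131 Cep Telefonu", 8.2),
    ("osmntyr", "Nokia E65 Cep Telefonu", 7.9),
]
toy_recs = top_n_per_user(toy_predictions, n=5)
print(toy_recs)
```

Users with fewer than five distinct test items simply get a shorter list, which matches the empty cells that appear in the recommendation tables.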


**Steps and tasks:**<br>
9. Try other techniques (Example: cross validation) to get better results.

In [71]:
# for reference
print("SVD CF")
print("RMSE (testset)  %.4f"%accuracy.rmse(svd_cf.test(testset),verbose=False))

SVD CF
RMSE (testset)  2.6718


In [72]:
# lets obtain the cross validation scores for the same algorithm
svd_CV = cross_validate(svd_cf, data, measures=[u'rmse'], cv=10, return_train_measures=True,
               n_jobs=-1, pre_dispatch=u'2*n_jobs', verbose=True)

Evaluating RMSE of algorithm SVD on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    2.7130  2.6662  2.6260  2.6825  2.6873  2.6580  2.6615  2.6746  2.6890  2.7087  2.6767  0.0243  
RMSE (trainset)   2.3226  2.3223  2.3277  2.3196  2.3246  2.3258  2.3235  2.3217  2.3261  2.3164  2.3230  0.0031  
Fit time          5.99    6.13    6.17    6.31    6.12    6.18    6.14    6.09    3.76    3.46    5.64    1.02    
Test time         0.07    0.08    0.08    0.08    0.08    0.08    0.08    0.08    0.06    0.06    0.08    0.01    


In [73]:
# summarise the score
print("SVD CF Cross Validation scores")
print("RMSE (testset)  mean %.4f %s %.4f (one standard deviation over folds)"%(svd_CV.get('test_rmse').mean(),
                                                              chr(177),svd_CV.get('test_rmse').std()))

SVD CF Cross Validation scores
RMSE (testset)  mean 2.6767 ± 0.0243 (one standard deviation over folds)


In [74]:
svd_CV.get('test_rmse').std()

0.024321352557053443
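The spread printed with the cross-validation summary is the standard deviation of the ten fold scores, which is not the same thing as a 95% confidence interval for the mean. A rough interval can be sketched from the fold values in the table above, assuming the folds behave like independent samples (they are not strictly independent, so treat it as an approximation). The constant 2.262 is the two-sided 95% Student-t critical value for 9 degrees of freedom.

```python
import math
import statistics

# test RMSE per fold, copied from the cross_validate output above
fold_rmse = [2.7130, 2.6662, 2.6260, 2.6825, 2.6873,
             2.6580, 2.6615, 2.6746, 2.6890, 2.7087]

mean_rmse = statistics.mean(fold_rmse)
fold_std = statistics.pstdev(fold_rmse)   # population std, like numpy's default .std()
sample_std = statistics.stdev(fold_rmse)  # sample std (ddof=1) for the interval

T_CRIT_95_DF9 = 2.262  # assumption: folds treated as 10 independent samples
half_width = T_CRIT_95_DF9 * sample_std / math.sqrt(len(fold_rmse))

print("mean %.4f, std %.4f" % (mean_rmse, fold_std))
print("approx. 95%% CI for the mean: %.4f to %.4f"
      % (mean_rmse - half_width, mean_rmse + half_width))
```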

In [75]:
# lets hyper tune our SVD CF recommender
param_grid = {'n_factors':[10,50,100,200],
              'n_epochs': [5, 10],
              'biased':[True,False],
              'lr_all': [0.002, 0.005],
              'reg_all': [0.02, 0.1, 0.3, 0.6]}

gs=GridSearchCV(SVD, param_grid, measures=[u'rmse'], cv=10, n_jobs=-1, pre_dispatch=u'2*n_jobs', joblib_verbose=1)

In [76]:
%%time
gs.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   55.5s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  6.0min


CPU times: user 1min 39s, sys: 1.16 s, total: 1min 40s
Wall time: 6min 32s


[Parallel(n_jobs=-1)]: Done 1280 out of 1280 | elapsed:  6.5min finished
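The 1280 tasks in the log follow directly from the grid: every combination of parameter values is fitted once per fold. The sketch below enumerates the same `param_grid` with `itertools.product`; it only counts the combinations and does not call Surprise.

```python
from itertools import product

param_grid = {'n_factors': [10, 50, 100, 200],
              'n_epochs': [5, 10],
              'biased': [True, False],
              'lr_all': [0.002, 0.005],
              'reg_all': [0.02, 0.1, 0.3, 0.6]}

# one dict per parameter combination, in the order of the grid's keys
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

n_folds = 10  # cv=10 in the GridSearchCV call
n_tasks = len(combinations) * n_folds
print(len(combinations), n_tasks)  # → 128 1280
```

Adding one more value to any list multiplies the run time accordingly, so it is worth widening the grid only around the values that came out best.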


In [77]:
# review best parameters
gs.best_params

{'rmse': {'n_factors': 10,
  'n_epochs': 10,
  'biased': True,
  'lr_all': 0.005,
  'reg_all': 0.1}}

In [78]:
# display the best score
gs.best_score

{'rmse': 2.6069445650029994}

The cross-validated RMSE has improved from about 2.68 (2.6767) to about 2.61 (2.6069).<br>
Let us predict the top recommendations for the test users using the best model.

In [79]:
gsSVD=gs.best_estimator['rmse'] # best model from gridsearch
gsSVD.fit(trainset) #fit
pred=gsSVD.test(testset) #predict

# summarise results
result=pd.DataFrame(testset,columns=["user","item","score"])
result["predicted_score"]=[rec.est for rec in pred] # predicted ratings

# create recommendations
test_users=sorted(set(result["user"])) #obtain user list
recommendations=[]
for user in test_users:
    record=[user]
    # obtain the (up to) 5 items with the highest predicted score for this user
    record.extend(list(result[result["user"]==user].sort_values(by=["predicted_score"],ascending=False).head(5).item))
    recommendations.append(record)
recoms=pd.DataFrame(recommendations,columns=["user","rec_1","rec_2","rec_3","rec_4","rec_5"])

# display recommendations
recoms.sample(20)

Unnamed: 0,user,rec_1,rec_2,rec_3,rec_4,rec_5
3,Aaron,APPLE iPhone 7 Silver 32 GB,HTC One M8 Windows 32GB Verizon 4G LTE Smartph...,HTC EVO V 4G Prepaid Android Phone Virgin Mobile,Huawei Ascend G510 Smartphone 114 cm 45 Zoll T...,LG G2 TMobile D801
516,osmntyr,Nokia E65 Cep Telefonu,Nokia N97 mini Cep Telefonu,,,
506,miatamania,Nokia N70 Cep Telefonu SILVER BLACK,Nokia E51 Cep Telefonu,Samsung G810 Cep Telefonu,Samsung G810 Cep Telefonu,Samsung G400 Cep Telefonu
276,Matt,Huawei W1 Stainless Steel Classic Smartwatch w...,Sim Free Apple iPhone 5S 16GB Mobile Phone Sp...,Honor Huawei Honor 6X Dual Camera Unlocked Sma...,Lenovo Motorola Moto E 2a Generazione Smartpho...,Sim Free Samsung Galaxy S7 Edge Mobile Phone ...
266,Marina,Honor 7 Smartphone débloqué 4G Ecran 52 pouces...,Lenovo Moto G4 Smartphone libre Android 55 Fu...,Sony Ericsson Xperia mini pro Smartphone 76 cm...,Nokia Lumia 925,Meizu M3S 16GB Gris libre
374,Simon,Sim Free Motorola Moto G 4th Generation Mobile...,Sim Free Apple iPhone 5S 16GB Mobile Phone Sp...,Huawei P8 lite Smartphone Display 50 IPS Dual ...,Samsung Galaxy Note 3,Sony Xperia S White
464,francisco,Smartphone Motorola Moto X Play XT1563 16GB,Motorola Moto G 3 Generación Carcasa Oficial ...,Sony Ericsson W380,Samsung Galaxy S7 Edge Factory Unlocked Phone ...,Nokia 2610
564,Анатолий,Apple iPhone 5s 16GB серебристый,LG KP500 Cookie Gold Limited Edition,Sony Xperia J черный,Samsung Galaxy S4 GTI9500 16GB белый,HTC Desire 620G Dual SIM белоголубой
52,Anônimo,Sony Smartphone Sony Xperia L Preto Android 41...,Sony Smartphone Sony Xperia L Preto Android 41...,Samsung Smartphone Samsung Galaxy Win Duos Dua...,Samsung Smartphone Samsung Galaxy Win Duos Dua...,Samsung Galaxy S III Mini Desbloqueado Vivo Me...
544,sozer,Nokia E51 Cep Telefonu,Nokia N95 8 GB Cep Telefonu,Nokia N95 8 GB Cep Telefonu,Nokia 6131 Cep Telefonu,Nokia 6131 Cep Telefonu


In [80]:
# lets obtain the cross validation scores for the same algorithm
svd_CV = cross_validate(gsSVD, data, measures=[u'rmse'], cv=10, return_train_measures=True,
               n_jobs=-1, pre_dispatch=u'2*n_jobs', verbose=False)
# summarise the score
print("SVD CF Cross Validation scores for best model")
print("RMSE (testset)  mean %.4f %s %.4f (one standard deviation over folds)"%(svd_CV.get('test_rmse').mean(),
                                                              chr(177),svd_CV.get('test_rmse').std()))

SVD CF Cross Validation scores for best model
RMSE (testset)  mean 2.6080 ± 0.0202 (one standard deviation over folds)


Cross-validation gives a more reliable estimate of the RMSE than a single train/test split.<br>
Grid search improved the SVD estimator: the cross-validated RMSE dropped from 2.6767 to 2.6080.

**Steps and tasks:**<br>
10. In what business scenario you should use popularity based Recommendation Systems ?

A popularity-based recommender refers only to global ratings and is not personalised to the user.<br>
While this frees the recommender from needing user profile data, it also means it cannot offer user-specific recommendations.<br>
This method suits best when<br>
 i)  an anonymous user (with all data sharing disabled) approaches the business, or<br>
 ii) the business is new and has no user profile database yet.<br>
In both cases the customer need not be left without recommendations, but can be offered the globally popular items.
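A minimal sketch of such a recommender is given below: it ranks items by mean rating and ignores items with too few ratings, so that a single 10 does not top the list. The function `popular_items`, the `min_ratings` threshold and the toy ratings are assumptions for illustration only.

```python
from collections import defaultdict

def popular_items(ratings, n=5, min_ratings=2):
    """Rank items by mean rating, ignoring items with too few ratings."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum of scores, count]
    for _user, item, score in ratings:
        totals[item][0] += score
        totals[item][1] += 1
    ranked = [(item, total / count, count)
              for item, (total, count) in totals.items()
              if count >= min_ratings]
    # best mean first; ties broken by the number of ratings
    ranked.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return ranked[:n]

# hypothetical (user, item, score) triples
toy_ratings = [
    ("u1", "A", 10.0), ("u2", "A", 8.0),
    ("u3", "B", 10.0),
    ("u1", "C", 6.0), ("u2", "C", 7.0), ("u3", "C", 8.0),
]
top_popular = popular_items(toy_ratings, n=5, min_ratings=2)
print(top_popular)
```

The same list is served to every visitor, which is exactly what makes it usable for anonymous or first-time users.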

**Steps and tasks:**<br>
11. In what business scenario you should use CF based Recommendation Systems ?

A collaborative filtering based recommender exploits the preferences implicit in the history of ratings provided by a user and the characteristics implicit in the history of ratings received by an item.<br>
Hence it needs neither personal details of the user nor product descriptions to generate suggestions.<br>
Yet the system still produces personalised suggestions for each user.<br>
This method suits best when enough ratings from many users across many products are available,<br>
and it mostly fits a single product category (more likely FMCG) use case, such as user-movie ratings, user-gadget ratings, user-personal-care-product ratings, etc.
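The idea can be sketched on a toy rating table, in the spirit of the user-user KNNWithMeans model tried earlier: the estimate is the user's own mean rating plus the similarity-weighted deviations of like-minded neighbours from their means. This is a simplified plain-Python illustration, not Surprise's implementation; the users, items and ratings are made up.

```python
import math

# hypothetical user -> {item: rating} table on a 1-10 scale
ratings = {
    "u1": {"A": 10.0, "B": 8.0, "C": 2.0},
    "u2": {"A": 9.0, "B": 7.0, "C": 1.0, "D": 8.0},
    "u3": {"A": 2.0, "B": 4.0, "C": 10.0, "D": 3.0},
}

def user_mean(user):
    values = ratings[user].values()
    return sum(values) / len(values)

def similarity(a, b):
    """Cosine similarity of mean-centred ratings over co-rated items."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    mean_a, mean_b = user_mean(a), user_mean(b)
    num = sum((ratings[a][i] - mean_a) * (ratings[b][i] - mean_b) for i in common)
    den = (math.sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((ratings[b][i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(user, item, k=2):
    """User mean plus weighted deviations of the k most similar raters of item."""
    candidates = [(similarity(user, other), other)
                  for other in ratings
                  if other != user and item in ratings[other]]
    # keep only positively correlated neighbours, k most similar
    neighbours = sorted(pair for pair in candidates if pair[0] > 0)[-k:]
    if not neighbours:
        return user_mean(user)
    num = sum(sim * (ratings[other][item] - user_mean(other))
              for sim, other in neighbours)
    den = sum(sim for sim, _ in neighbours)
    return user_mean(user) + num / den

pred_d = predict("u1", "D")
print(round(pred_d, 4))
```

Here `u1` rates much like `u2` and opposite to `u3`, so only `u2`'s above-average opinion of item `D` feeds the estimate; no product description or user profile is involved.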

**Steps and tasks:**<br>
12. What other possible methods can you think of which can further improve the recommendation for different users ?

1. There are other methods such as content-based recommender systems, which do not require data from other similar users but do require a detailed description / content of the products.<br>
2. A Market Basket Analysis could also suggest products from other categories that are unrelated to the user's current preference, yet meaningfully connect with the user's needs.<br>
3. Nevertheless, a hybrid of several recommenders, ensembled with or without weights, could customise a business's offerings to the customer.
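As a rough illustration of the content-based idea in point 1, product names can be turned into bag-of-words vectors and compared with cosine similarity. The sketch below is plain Python; the product names are taken from the tables above, while the helper names and the choice of raw word counts (rather than TF-IDF or richer product descriptions) are simplifying assumptions.

```python
import math
from collections import Counter

def vectorise(text):
    """Bag-of-words vector: lower-cased token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def similar_products(liked, catalogue, n=2):
    """Return up to n catalogue names most similar to the liked product name."""
    liked_vec = vectorise(liked)
    scored = [(cosine(liked_vec, vectorise(name)), name)
              for name in catalogue if name != liked]
    scored.sort(reverse=True)
    return [name for score, name in scored[:n] if score > 0]

catalogue = [
    "Samsung Galaxy Note5",
    "Samsung Galaxy A3 2016",
    "Nokia 130 Dual SIM Black",
    "Nokia 101 Premium Black",
    "OnePlus 3 Graphite 64 GB",
]
suggestions = similar_products("Samsung Galaxy Note 3", catalogue, n=2)
print(suggestions)
```

Only the liked product's own text is needed, so this approach still works for a user whose tastes match nobody else in the rating history.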

With reference to the current dataset, the extract column holds the users' review text, which could be used to build an NLP-based user sentiment index, while the product description in the product field could be vectorised for a content-based recommender.<br>
The dataset also contains a ratings data field, which could be used for suggesting newer models to the user.