In [1]:
import pandas as pd
import numpy as np
import pandasql as ps
import random

In [2]:
import progressbar
from time import sleep

### Import Surprise, a library built specifically for developing recommendation engines

In [3]:
from surprise import Dataset
from surprise import Reader
from surprise import SVD,SVDpp
from surprise.model_selection import cross_validate,GridSearchCV

### Import books data and the ratings data

In [4]:
books=pd.read_csv("Cleaned datasets/books.CSV")

In [5]:
ratings=pd.read_csv("Cleaned datasets/ratings2.csv")

In [6]:
books.head()

Unnamed: 0,BookID,GENRE,BOOKTITLE,USERRATINGS,Popularity
0,114530,Fiction,The Curious Incident of the Dog in the Night-T...,3.87,15417.96852
1,133131,Science Fiction,Rainbows End by Vernor Vinge,3.76,218.146718
2,153927,Nonfiction,See No Evil: The True Story of a Ground Soldie...,3.93,64.196614
3,160262,History,Augustus: The Life of Rome's First Emperor by ...,4.03,77.324027
4,133451,Mystery,"An Unquiet Grave (Louis Kincaid, #7) by P.J. P...",4.12,10.45441


In [7]:
books.shape

(1898, 5)

In [8]:
books.drop('Popularity',axis=1,inplace=True)

In [9]:
ratings.shape

(18601, 3)

In [82]:
ratings

Unnamed: 0,BookID,UserID,rating
0,165498,600008,2.7
1,165498,600226,3.0
2,129526,600226,3.0
3,163123,600226,3.0
4,167986,600226,2.9
...,...,...,...
18596,144801,672979,2.9
18597,144801,672979,2.9
18598,137390,673405,2.9
18599,147224,673591,2.4


In [10]:
ratings[ratings.UserID==655325]

Unnamed: 0,BookID,UserID,rating
17653,149888,655325,3.5


In [11]:
ratings.dtypes

BookID      int64
UserID      int64
rating    float64
dtype: object

In [12]:
ratings.BookID=ratings.BookID.astype('object')
ratings.UserID=ratings.UserID.astype('object')

In [13]:
ratings.describe()

Unnamed: 0,rating
count,18601.0
mean,2.851852
std,0.245245
min,1.4
25%,2.7
50%,2.8
75%,3.0
max,4.3


## SVD

Create data object for surprise

- In order to train recommender systems with Surprise, we need to create a Dataset object. A Surprise Dataset object is a dataset that contains the following fields in this order:
    1. The user IDs
    2. The item IDs (in this case the IDs for each book)
    3. The corresponding rating (usually on a scale such as 1–5)

In [14]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['UserID', 'BookID', 'rating']], reader)

Cross-validate the data using SVD

In [15]:
svd = SVD(verbose=True, n_epochs=10)
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.1836  0.1819  0.1806  0.1820  0.0012  
MAE (testset)     0.1397  0.1390  0.1384  0.1390  0.0005  
Fit time          1.41    1.38    1.37    1.39    0.02    
Test time         0.13    0.15    0.15    0.14    0.01    


{'test_rmse': array([0.18356252, 0.1819154 , 0.18056776]),
 'test_mae': array([0.13966084, 0.13895828, 0.13840294]),
 'fit_time': (1.407069444656372, 1.3780779838562012, 1.3719611167907715),
 'test_time': (0.12992143630981445, 0.1472017765045166, 0.1544952392578125)}
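For reference, the two measures reported above can be computed as in the following minimal sketch. The helper names `rmse` and `mae` and the toy pairs are illustrative assumptions, not part of Surprise.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Made-up (true, estimated) pairs, for illustration only.
toy_pairs = [(3.0, 2.8), (2.5, 2.9), (3.5, 3.5)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE and is never smaller than MAE on the same predictions.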

Fit the model on the full trainset

In [16]:
trainset = data.build_full_trainset()
svd.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x184444a83d0>

- **Trainsets** are different from Datasets (`surprise.dataset.Dataset`).
You can think of a Dataset as the raw data, and Trainsets as higher-level
data where useful methods are defined. Also, a Dataset may be comprised of
multiple Trainsets (e.g. when doing cross validation).


- Attributes:
    1. `ur` (`defaultdict` of `list`): The users' ratings. This is a
        dictionary containing lists of tuples of the form `(item_inner_id,
        rating)`. The keys are user inner ids.
    2. `ir` (`defaultdict` of `list`): The items' ratings. This is a
        dictionary containing lists of tuples of the form `(user_inner_id,
        rating)`. The keys are item inner ids.
    3. `n_users`: Total number of users $|U|$.
    4. `n_items`: Total number of items $|I|$.
    5. `n_ratings`: Total number of ratings $|R_{train}|$.
    6. `rating_scale` (tuple): The minimum and maximum rating of the rating
        scale.
    7. `global_mean`: The mean of all ratings $\mu$.
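The attributes listed above can be pictured with a small toy reconstruction. This is only a sketch of the data layout, not Surprise's implementation; the names `raw2inner_user` and `raw2inner_item` are assumptions made for illustration, and the three triples are taken from the first rows of the ratings table.

```python
from collections import defaultdict

# (user, item, rating) triples from the first rows of the ratings table.
raw_ratings = [(600008, 165498, 2.7), (600226, 165498, 3.0), (600226, 129526, 3.0)]

raw2inner_user, raw2inner_item = {}, {}
ur, ir = defaultdict(list), defaultdict(list)

for raw_uid, raw_iid, rating in raw_ratings:
    # Assign consecutive inner ids in order of first appearance.
    u = raw2inner_user.setdefault(raw_uid, len(raw2inner_user))
    i = raw2inner_item.setdefault(raw_iid, len(raw2inner_item))
    ur[u].append((i, rating))  # ratings grouped by user inner id
    ir[i].append((u, rating))  # ratings grouped by item inner id

n_users, n_items, n_ratings = len(ur), len(ir), len(raw_ratings)
global_mean = sum(r for _, _, r in raw_ratings) / n_ratings
```

The point of the inner ids is that the model can index its parameters by small consecutive integers, while the raw ids stay whatever the input data used.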

All the possible pairs of user and book are created

Predict the rating that user 655325 might give to book 174056

In [17]:
svd.predict(uid=655325, iid=174056)

Prediction(uid=655325, iid=174056, r_ui=None, est=3.0899959068181113, details={'was_impossible': False})

- In simpler terms, SVD finds the latent factors behind a matrix. In a recommender system, factorizing the user-book rating matrix yields one matrix of latent user features and one of latent item features. These factors can correspond to things such as the type or age group of a user or the genre of a book, along with other influences on rating behavior that are not apparent from the rating matrix itself. In a broader sense, SVD models rating behavior from a global perspective that takes all users and all items into account.

- Conventional collaborative filtering instead uses either a user-based or an item-based neighborhood model: the predicted rating of an item by a user is determined either by the ratings that user gave to similar items (found, for example, by Pearson correlation) or by the ratings the item received from users similar to the current user. The prediction therefore rests on a very small subset of similar items or similar users and discards the information contained in the other ratings.

- SVD++ extends SVD by also taking implicit feedback into account (which books a user has rated, regardless of the rating value), with the aim of improving the precision and recall of the recommender system.
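To make the latent-factor idea concrete, here is a minimal sketch of the biased matrix factorization rule that SVD-style models use: the estimate is the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and each observed rating nudges those parameters by one stochastic gradient descent step. The function names, the learning rate, the regularization value, and all numbers are illustrative assumptions.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Estimate a rating as global mean + user bias + item bias + factor dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """Apply one stochastic gradient descent update for a single observed rating."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Toy values: two latent factors, made-up parameters.
mu, b_u, b_i = 2.85, 0.0, 0.0
p_u, q_i = [0.1, -0.2], [0.3, 0.1]
before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(3.5, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)
```

After the update the estimate moves from `before` towards the observed rating of 3.5. Repeating such steps over all ratings for several epochs is what the "Processing epoch" lines in the training output correspond to.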

Let's look at the results that SVD++ gives

In [18]:
svdpp = SVDpp()
output = svdpp.fit(data.build_full_trainset())

In [19]:
svdpp.predict(uid='606494',iid='170387').est

2.851852050965002

In [20]:
svd.predict(uid='606494',iid='170387').est

2.851852050965002

In [21]:
ratings[(ratings.UserID==606494) & (ratings.BookID==170387)]

Unnamed: 0,BookID,UserID,rating
4201,170387,606494,3.1


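Both models return exactly 2.851852, which is the mean rating reported by `ratings.describe()` above. The IDs were passed as strings while the dataset stores them as integers, so Surprise treats both the user and the book as unknown and falls back to the global mean. This comparison therefore does not show how SVD and SVD++ differ; the actual rating for this pair is 3.1.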
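When two different models return exactly the same estimate, and that estimate equals the mean rating, it is worth checking the type of the IDs passed to `predict`. The toy sketch below shows the general effect with a plain dictionary lookup; `known_users`, `user_bias`, and `toy_estimate` are hypothetical names and the numbers are made up.

```python
# Raw user IDs as they would be stored when the training data holds integers.
known_users = {606494: 0, 655325: 1}
global_mean = 2.85                # toy value
user_bias = {0: 0.12, 1: -0.05}   # made-up biases keyed by inner id

def toy_estimate(raw_uid):
    """Return global mean plus user bias, or just the global mean for an unknown ID."""
    inner = known_users.get(raw_uid)
    if inner is None:
        return global_mean
    return global_mean + user_bias[inner]

est_int = toy_estimate(606494)    # integer ID: found, bias applied
est_str = toy_estimate('606494')  # string ID: not found, falls back to the mean
```

The string `'606494'` and the integer `606494` are different dictionary keys, so the string lookup silently misses and only the mean is returned.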

Hyperparameter tuning for the learning rate and the regularization term

In [22]:
param_grid = {'lr_all': [.001, .01], 'reg_all' : [.1, .5]}
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse','mae'], cv=3)
gs.fit(data)

print(gs.best_params['rmse'])

{'lr_all': 0.01, 'reg_all': 0.1}
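As a rough illustration of what a grid search enumerates, the sketch below expands the same `param_grid` into every parameter combination. The variable names `combinations` and `n_fits` are assumptions for this example; the search itself is done by `GridSearchCV`.

```python
from itertools import product

param_grid = {'lr_all': [.001, .01], 'reg_all': [.1, .5]}

# Every combination of one value per parameter, as a list of dicts.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# With cv=3, each combination is fitted and scored once per fold.
n_fits = len(combinations) * 3
```

With two values for each of two parameters there are four combinations, so the grid above costs twelve fits; the cost grows multiplicatively with every value added.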


In [23]:
svdpp = SVDpp(lr_all = 0.01 , reg_all=0.1,random_state=1)
output = cross_validate(svdpp, data, verbose=True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.1270  0.1248  0.1299  0.1308  0.1308  0.1287  0.0024  
MAE (testset)     0.0975  0.0962  0.0991  0.0998  0.0978  0.0981  0.0013  
Fit time          7.11    6.88    6.95    7.36    7.31    7.12    0.19    
Test time         0.21    0.22    0.21    0.20    0.16    0.20    0.02    


Let's try the same prediction again

In [24]:
svdpp.predict(uid='606494',iid='170387').est

2.85177743431221

In [25]:
svdpp.predict(uid=655325, iid=172992)

Prediction(uid=655325, iid=172992, r_ui=None, est=2.9132470213954433, details={'was_impossible': False})

In [26]:
svd.predict(uid=655325, iid=172992)

Prediction(uid=655325, iid=172992, r_ui=None, est=2.8727027642597065, details={'was_impossible': False})

In [27]:
ratings[ratings.UserID==655325]

Unnamed: 0,BookID,UserID,rating
17653,149888,655325,3.5


In [28]:
books

Unnamed: 0,BookID,GENRE,BOOKTITLE,USERRATINGS
0,114530,Fiction,The Curious Incident of the Dog in the Night-T...,3.87
1,133131,Science Fiction,Rainbows End by Vernor Vinge,3.76
2,153927,Nonfiction,See No Evil: The True Story of a Ground Soldie...,3.93
3,160262,History,Augustus: The Life of Rome's First Emperor by ...,4.03
4,133451,Mystery,"An Unquiet Grave (Louis Kincaid, #7) by P.J. P...",4.12
...,...,...,...,...
1893,161051,Fiction,The Feast of the Goat by Mario Vargas Llosa,4.24
1894,103196,Classics,The Sound and the Fury by William Faulkner,3.86
1895,149347,Classics,Three Men In A Boat by Jerome K. Jerome,3.90
1896,143688,History,Readings from Voices of a People's History of ...,4.33


Let us find the ratings that a user who bought a Harry Potter book might give to the other Harry Potter books

In [29]:
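# regex=False matches the author name literally ('.' would otherwise match any character)
harry=books[books['BOOKTITLE'].str.contains('J.K. Rowling',regex=False)]['BookID']

In [30]:
books[books['BOOKTITLE'].str.contains('J.K. Rowling',regex=False)]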

Unnamed: 0,BookID,GENRE,BOOKTITLE,USERRATINGS
56,174056,Fantasy,Harry Potter and the Prisoner of Azkaban by J....,4.55
265,101133,Fantasy,Harry Potter and the Half-Blood Prince by J.K....,4.56
373,149888,Fantasy,Harry Potter and the Goblet of Fire by J.K. Ro...,4.55
804,175446,Fantasy,"Harry Potter Collection (Harry Potter, #1-6) b...",4.73
1644,141668,Fantasy,Harry Potter and the Order of the Phoenix by J...,4.48


In [31]:
s={}
pp={}
# bid rather than id, so the built-in id() is not shadowed
for bid in harry:
    s[bid]=svd.predict(uid=655325, iid=bid).est
    pp[bid]=svdpp.predict(uid=655325, iid=bid).est

In [32]:
s,pp

({174056: 3.0899959068181113,
  101133: 2.9691067833742,
  149888: 3.1819095356448197,
  175446: 2.8727027642597065,
  141668: 2.8727027642597065},
 {174056: 3.439951592040258,
  101133: 3.0890667283495548,
  149888: 3.3505699419830672,
  175446: 2.9132470213954433,
  141668: 2.9132470213954433})

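For the one book this user has actually rated (149888, rated 3.5), SVD++ predicts 3.35 while SVD predicts 3.18, so SVD++ is closer to the actual rating in this example. The cross-validation results above point the same way (mean RMSE of 0.1287 for the tuned SVD++ against 0.1820 for SVD), although the two runs used different numbers of folds and epochs.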
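The comparison can be put in numbers for the single book this user has rated. The sketch below only uses values shown in the outputs above (the rating of 3.5 for book 149888 and the two estimates); the variable names are illustrative.

```python
actual = 3.5  # rating user 655325 gave to book 149888, from the ratings table
svd_est = 3.1819095356448197    # SVD estimate for book 149888
svdpp_est = 3.3505699419830672  # SVD++ estimate for book 149888

svd_err = abs(actual - svd_est)
svdpp_err = abs(actual - svdpp_est)
```

A single rating is weak evidence on its own, so the cross-validated RMSE and MAE remain the more reliable basis for comparing the two models.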

## Get recommendations for all users

In [33]:
users=pd.read_csv('Cleaned datasets/users.csv')

In [34]:
userids=users.UserID.unique()

In [35]:
userids=list(userids)

In [36]:
len(userids)

14387

In [37]:
bookids=books.BookID.unique()

In [38]:
bookids=list(bookids)

In [39]:
len(bookids)

1898

Find the top-performing books (user rating above 4 and popularity above 29000) to recommend by default. Note that the query below returns nine books, not ten.

In [40]:
b=pd.read_csv("Cleaned datasets/books.CSV")

In [41]:
q='select BookID from b where USERRATINGS>4 and Popularity>29000 order by Popularity desc'

In [42]:
_=ps.sqldf(q)

In [43]:
Top10Default=list(_['BookID'])

In [44]:
Top10Default

[119228, 150279, 174056, 156940, 149888, 141668, 149428, 168295, 101133]

# Implementation

GenerateRecs: gets recommendations for every user and stores all the info in recs as a key value pair

In [45]:
def generateRecs(userids, model):
    # Maps each user ID to a comma-separated string of recommended book IDs.
    recs = {}
    bar = progressbar.ProgressBar(
        maxval=len(userids),
        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    for i, user in enumerate(userids, start=1):
        recs[user] = ','.join(map(str, getRecommendations(user, model)))
        bar.update(i)
    bar.finish()
    return recs

getRecommendations: Calls the recommendations method and does the final checks on the recommendation count 

In [46]:
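def getRecommendations(userid, model, bids=bookids, thresh=3):
    # Local name differs from the global bookids list to avoid shadowing it.
    recIds=recommendations(userid, model, bids, thresh)
    reclen=len(recIds)
    if reclen<10:
        # Pad with the default list; it holds nine books, so an empty list stays below 10.
        recIds.extend(Top10Default[:10-reclen])
    return recIds[:10]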
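Padding with the default list can introduce duplicates when a default book is already among the model's recommendations. A minimal sketch of a duplicate-free variant follows; `pad_recommendations` is a hypothetical helper, not part of the notebook.

```python
def pad_recommendations(recommended, defaults, k=10):
    """Fill `recommended` up to `k` items from `defaults`, skipping duplicates."""
    result = list(recommended)[:k]
    seen = set(result)
    for book_id in defaults:
        if len(result) >= k:
            break
        if book_id not in seen:
            result.append(book_id)
            seen.add(book_id)
    return result

# 174056 is already recommended, so it is skipped when padding.
padded = pad_recommendations([174056, 101133], [119228, 150279, 174056, 156940], k=4)
```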

recommendations: This is the core process where the user-book relation is calculated

In [47]:
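def recommendations(userid, model, bids, thresh=3):
    _recs={}
    # Shuffle a copy so ties are broken randomly without reordering the caller's list
    bids=random.sample(bids, len(bids))
    for bid in bids:
        rating = predict_rating(userid, bid, model)
        if rating >= thresh:
            _recs[bid]=rating
    # Highest predicted rating first; sorted() is stable, so ties keep the shuffled order
    bookIds=sorted(_recs,key=_recs.get,reverse=True)
    return bookIds[:10]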
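Selecting the highest-rated books can also be sketched without sorting the whole dictionary, using `heapq.nlargest` from the standard library. The helper name `top_k` is an assumption, and the toy estimates are rounded from the SVD++ values shown earlier.

```python
import heapq

def top_k(predicted, k=10, thresh=3):
    """Return the k item ids with the highest predicted rating at or above thresh."""
    eligible = {item: est for item, est in predicted.items() if est >= thresh}
    return heapq.nlargest(k, eligible, key=eligible.get)

# Toy estimates keyed by book ID.
toy = {174056: 3.44, 101133: 3.09, 149888: 3.35, 175446: 2.91}
best = top_k(toy, k=2)
```

Book 175446 falls below the threshold and is dropped before ranking, and only the two best of the remaining three are returned.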

predict_rating: It finds the rating that a user might give to a new book

In [48]:
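def predict_rating(uid,bid,model):
    # Pass the IDs with the type used at training time (integers here); a string ID
    # is treated as unknown and the estimate falls back to the global mean.
    return round(model.predict(uid,bid).est)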

Get recommendations for all the users

In [49]:
recs=generateRecs(userids,svd)



In [53]:
recs

{600003: '109088,105017,108795,169649,146326,119942,167115,106978,158326,126227',
 600008: '133770,104100,131214,168285,131353,106511,145331,139771,116090,174891',
 600011: '171462,105461,176625,160598,107563,103231,144607,101898,155502,150146',
 600020: '137962,176182,100892,148444,155158,109725,174326,113545,128177,175080',
 600028: '172168,179771,151127,175446,169302,160363,127601,140891,162917,102691',
 600031: '100252,135800,163711,122907,100987,125283,129365,172507,103825,127414',
 600038: '101800,112407,138878,110520,142691,166723,124980,171714,174040,173036',
 600043: '126487,105474,173561,135531,148153,174056,125228,140115,131353,123718',
 600046: '136834,125184,145653,156272,157918,109102,113540,120338,130585,151539',
 600052: '138372,127036,117087,168913,148826,173052,147527,104070,103074,128561',
 600057: '165678,170411,150038,149930,110944,109868,114400,103797,127767,163629',
 600061: '169538,149884,144737,135451,173702,101332,150843,143259,152038,107189',
 600068: '126110

In [51]:
recs2=generateRecs(userids,svdpp)



### Convert the recommendation dictionary to a dataframe

In [54]:
recs=pd.DataFrame(recs,index=['Recommendations'])

In [55]:
recs=recs.T

In [56]:
recs

Unnamed: 0,Recommendations
600003,"109088,105017,108795,169649,146326,119942,1671..."
600008,"133770,104100,131214,168285,131353,106511,1453..."
600011,"171462,105461,176625,160598,107563,103231,1446..."
600020,"137962,176182,100892,148444,155158,109725,1743..."
600028,"172168,179771,151127,175446,169302,160363,1276..."
...,...
674975,"106969,167215,103959,168469,119457,123886,1625..."
674984,"163109,126863,126276,161319,137390,141451,1620..."
674986,"104766,173894,125966,137561,109730,122549,1626..."
674992,"100775,159025,164454,117945,135027,138442,1131..."


In [57]:
recs['UserID']=recs.index

In [58]:
# take an explicit copy so later column assignments modify this frame directly
recs=recs[['UserID','Recommendations']].copy()

In [59]:
recs.reset_index(drop=True,inplace=True)

In [60]:
recs

Unnamed: 0,UserID,Recommendations
0,600003,"109088,105017,108795,169649,146326,119942,1671..."
1,600008,"133770,104100,131214,168285,131353,106511,1453..."
2,600011,"171462,105461,176625,160598,107563,103231,1446..."
3,600020,"137962,176182,100892,148444,155158,109725,1743..."
4,600028,"172168,179771,151127,175446,169302,160363,1276..."
...,...,...
14382,674975,"106969,167215,103959,168469,119457,123886,1625..."
14383,674984,"163109,126863,126276,161319,137390,141451,1620..."
14384,674986,"104766,173894,125966,137561,109730,122549,1626..."
14385,674992,"100775,159025,164454,117945,135027,138442,1131..."


In [61]:
recs[['1','2','3','4','5','6','7','8','9','10']] = recs['Recommendations'].str.split(',',expand=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [62]:
recs.drop('Recommendations',axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [63]:
recs.head()

Unnamed: 0,UserID,1,2,3,4,5,6,7,8,9,10
0,600003,109088,105017,108795,169649,146326,119942,167115,106978,158326,126227
1,600008,133770,104100,131214,168285,131353,106511,145331,139771,116090,174891
2,600011,171462,105461,176625,160598,107563,103231,144607,101898,155502,150146
3,600020,137962,176182,100892,148444,155158,109725,174326,113545,128177,175080
4,600028,172168,179771,151127,175446,169302,160363,127601,140891,162917,102691


In [64]:
recs2=pd.DataFrame(recs2,index=['Recommendations'])

In [65]:
recs2=recs2.T

In [66]:
recs2['UserID']=recs2.index

In [67]:
recs2=recs2[['UserID','Recommendations']]

In [68]:
recs2.reset_index(drop=True,inplace=True)

In [69]:
recs2[['1','2','3','4','5','6','7','8','9','10']] = recs2['Recommendations'].str.split(',',expand=True)

In [70]:
recs2

Unnamed: 0,UserID,Recommendations,1,2,3,4,5,6,7,8,9,10
0,600003,"171927,117167,122617,125595,138868,131557,1064...",171927,117167,122617,125595,138868,131557,106490,143335,133803,140031
1,600008,"179947,121497,141257,101912,157611,103171,1327...",179947,121497,141257,101912,157611,103171,132715,160358,135531,179734
2,600011,"120657,169204,119942,120547,133803,153811,1287...",120657,169204,119942,120547,133803,153811,128730,104000,162666,178675
3,600020,"108227,174564,173533,151430,176534,168551,1460...",108227,174564,173533,151430,176534,168551,146079,106408,179514,137773
4,600028,"157537,113432,169538,160474,100899,138514,1161...",157537,113432,169538,160474,100899,138514,116109,124534,176977,152007
...,...,...,...,...,...,...,...,...,...,...,...,...
14382,674975,"162604,168699,179033,140053,138107,125228,1382...",162604,168699,179033,140053,138107,125228,138208,112488,104221,167731
14383,674984,"175410,116109,108659,126519,153244,122911,1546...",175410,116109,108659,126519,153244,122911,154653,179577,110494,122517
14384,674986,"150541,160633,161662,178244,103915,174841,1525...",150541,160633,161662,178244,103915,174841,152557,119216,106258,173702
14385,674992,"169709,139831,105059,123944,102953,166279,1076...",169709,139831,105059,123944,102953,166279,107683,178472,149347,164329


In [71]:
recs2.drop('Recommendations',axis=1,inplace=True)

In [72]:
recs2

Unnamed: 0,UserID,1,2,3,4,5,6,7,8,9,10
0,600003,171927,117167,122617,125595,138868,131557,106490,143335,133803,140031
1,600008,179947,121497,141257,101912,157611,103171,132715,160358,135531,179734
2,600011,120657,169204,119942,120547,133803,153811,128730,104000,162666,178675
3,600020,108227,174564,173533,151430,176534,168551,146079,106408,179514,137773
4,600028,157537,113432,169538,160474,100899,138514,116109,124534,176977,152007
...,...,...,...,...,...,...,...,...,...,...,...
14382,674975,162604,168699,179033,140053,138107,125228,138208,112488,104221,167731
14383,674984,175410,116109,108659,126519,153244,122911,154653,179577,110494,122517
14384,674986,150541,160633,161662,178244,103915,174841,152557,119216,106258,173702
14385,674992,169709,139831,105059,123944,102953,166279,107683,178472,149347,164329


In [83]:
recs.to_csv('Cleaned datasets/CsubmissionV2.csv',index=False)
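The same file layout could also be written straight from the dictionary returned by `generateRecs`, without the intermediate DataFrame steps. The sketch below assumes the recommendations are still in that dictionary form; `write_recs` is a hypothetical helper and the single row is taken from the output shown earlier.

```python
import csv
import io

def write_recs(recs, out):
    """Write {user_id: 'id1,id2,...'} as rows of UserID followed by columns 1..10."""
    writer = csv.writer(out)
    writer.writerow(['UserID'] + [str(n) for n in range(1, 11)])
    for user_id, rec_string in recs.items():
        writer.writerow([user_id] + rec_string.split(','))

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_recs(
    {600003: '109088,105017,108795,169649,146326,119942,167115,106978,158326,126227'},
    buffer)
lines = buffer.getvalue().splitlines()
```

To write to disk instead, the buffer would be replaced by a file opened with `newline=''`, as the `csv` module documentation recommends.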

Compare the recommendations of the SVD and SVD++ models

In [86]:
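def findRecs(userid,dic):
    # dic is a recommendations DataFrame with a UserID column and columns '1'..'10'
    recs=dic
    try:
        rec=recs[recs['UserID']==userid][['1','2','3','4','5','6','7','8','9','10']]
        recommendedBooks=[]
        for i in range(1,11):
            _i=rec[str(i)]
            recommendedBooks.append(_i.values[0])
    except (IndexError, KeyError):
        # No row for this user, or missing columns: fall back to the default list
        print('exception occurred')
        return Top10Default
    return recommendedBooks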

In [87]:
ls=findRecs(662789,recs)

In [88]:
ls2=findRecs(662789,recs2)

In [89]:
ls=list(map(int,ls))

In [90]:
ls2=list(map(int,ls2))

In [91]:
books[books.BookID.isin(ls)].sort_values(by='USERRATINGS',ascending=False)

Unnamed: 0,BookID,GENRE,BOOKTITLE,USERRATINGS
1124,106598,Fantasy,"The Sons of Thestian (The Harmatia Cycle, #1) ...",4.09
1737,162324,Fiction,The Albino Album by Chavisa Woods,4.06
129,115341,Mystery,"The Silkworm (Cormoran Strike, #2) by Robert G...",4.04
1318,121985,Nonfiction,God Is Not Great: How Religion Poisons Everyth...,3.97
1626,110374,Fantasy,"Sky The Blue Fairy (Rainbow Magic, #5) by Dais...",3.83
833,153640,Science Fiction,"In the Garden of Iden (The Company, #1) by Kag...",3.77
78,178061,Classics,The Phantom of the Opera by Jennifer Bassett,3.69
882,108990,Mystery,Here I Stay by Barbara Michaels,3.68
1769,146267,Mystery,The King Is Dead by Ellery Queen,3.59
802,141106,History,Noche de Iesi by Peter Berling,2.3


In [92]:
books[books.BookID.isin(ls2)].sort_values(by='USERRATINGS',ascending=False)

Unnamed: 0,BookID,GENRE,BOOKTITLE,USERRATINGS
1124,106598,Fantasy,"The Sons of Thestian (The Harmatia Cycle, #1) ...",4.09
1737,162324,Fiction,The Albino Album by Chavisa Woods,4.06
129,115341,Mystery,"The Silkworm (Cormoran Strike, #2) by Robert G...",4.04
1318,121985,Nonfiction,God Is Not Great: How Religion Poisons Everyth...,3.97
1626,110374,Fantasy,"Sky The Blue Fairy (Rainbow Magic, #5) by Dais...",3.83
833,153640,Science Fiction,"In the Garden of Iden (The Company, #1) by Kag...",3.77
78,178061,Classics,The Phantom of the Opera by Jennifer Bassett,3.69
882,108990,Mystery,Here I Stay by Barbara Michaels,3.68
1769,146267,Mystery,The King Is Dead by Ellery Queen,3.59
802,141106,History,Noche de Iesi by Peter Berling,2.3
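Rather than comparing the two tables by eye, the overlap between two recommendation lists can be measured directly. The following is a small sketch; `overlap` is a hypothetical helper and the example lists are shortened toy inputs.

```python
def overlap(a, b):
    """Return the shared items and the Jaccard similarity of two recommendation lists."""
    set_a, set_b = set(a), set(b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 1.0
    return shared, jaccard

shared, jaccard = overlap([106598, 162324, 115341], [106598, 115341, 121985])
```

A Jaccard similarity of 1.0 means the two models recommended exactly the same set of books, and 0.0 means they have nothing in common. Applied to this notebook, the call would be `overlap(ls, ls2)`.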


In [80]:
import json
fileName='Cleaned datasets/fin.json'
# Use a context manager so the file is closed after reading
with open(fileName) as f:
    purchaseData = json.load(f)

In [81]:
purchaseData['662789']

[[107709, 'Margherita Dolce Vita by Stefano Benni Fiction'],
 [170816,
  'Mayflower: A Story of Courage, Community and War by Nathaniel Philbrick History'],
 [138514, 'People of the Deer by Farley Mowat Nonfiction'],
 [103794, 'A Little History of the World by E.H. Gombrich History'],
 [101133, 'Harry Potter and the Half-Blood Prince by J.K. Rowling Fantasy'],
 [104235,
  'Witches, Midwives and Nurses: A History of Women Healers by Barbara Ehrenreich History']]

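So, as we can see, the recommendations are fairly satisfactory: the user's purchases span Fiction, History, Nonfiction and Fantasy, and each of these genres appears among the recommended books, so the recommendations match the user's taste to some extent.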
