In [3]:
import pandas as pd
import numpy as np

BookCrossing is a website that allows the free sharing of books by ‘wild-releasing’ them in public places. Among the services the website provides are book recommendations for new and existing users. These recommendations are implemented using a dataset that the website has made freely available. The data we use comes from the site below.

http://www2.informatik.uni-freiburg.de/~cziegler/BX/


There are three datasets:

    1) The BX-Users.csv contains information about 278,858 users.

    2) The BX-Books.csv contains information about 271,360 books.

    3) The BX-Book-Ratings.csv contains information about 1,149,780 ratings, from 105,283 users and 340,556 items.

The ratings dataset contains explicit feedback, where users rate books on a scale of 1-10. It also includes implicit ratings, where a user viewed a book or otherwise interacted with its page but did not rate it; these are given a rating value of 0. The goal of the project is to use this dataset to build a book recommendation engine that returns a top-k list of books to recommend.

At this point we have cleaned the Book-Ratings dataset and used it to try out several different algorithms. So far we have used the Surprise package in Python, which implements several recommendation algorithms. To begin with, we use RMSE as our accuracy metric for the predictions. As the project develops we may switch to an accuracy/precision evaluation metric, or to discounted cumulative gain.
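To make the RMSE metric concrete, here is a minimal sketch of how it is computed from pairs of true and predicted ratings. The function name `rmse` and the example ratings are illustrative assumptions, not part of the dataset or of Surprise.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (true, predicted) pair")
    squared_errors = [(true - pred) ** 2 for true, pred in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings on the 1-10 scale, for illustration only.
example = [(8, 7.0), (5, 6.0), (10, 8.0)]
print(round(rmse(example), 4))  # squared errors 1, 1, 4 -> sqrt(2) -> 1.4142
```

Because the errors are squared before averaging, a single prediction that is off by 2 contributes as much as four predictions that are each off by 1.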

The algorithms we have implemented are:

    1) Normal Predictor

    2) Baseline Only

    3) KNN

    4) SVD




# Data Reading

In [4]:
# on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them;
# it replaces error_bad_lines=False, which was removed in pandas 2.0.
items = pd.read_csv('BX-Books.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
items.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']
users = pd.read_csv('BX-Users.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


In [5]:
items.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [7]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [4]:
# The image links are not needed
items.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'], axis=1, inplace=True)

In [10]:
#Checking size of data
print('Number of Users: ',users.shape[0])
print('Number of Items: ',items.shape[0])
print('Number of Ratings: ',ratings.shape[0])
print('Number of Users in Ratings: ', len(ratings['userID'].unique()))
print('Number of Items in Ratings: ', len(ratings['ISBN'].unique()))

Number of Users:  278858
Number of Items:  271360
Number of Ratings:  433671
Number of Users in Ratings:  77805
Number of Items in Ratings:  185973


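The original data contains ratings of value 0, which indicate implicit feedback. Initially we use only explicit feedback to make our recommendations, so the first step is to remove those ratings. (The rating counts printed above are lower than the 1,149,780 quoted in the introduction, most likely because that cell was last run after this filter had already been applied.)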

In [8]:
#Getting rid of implicit ratings
imp_ratings = ratings[ratings['bookRating'] == 0]
ratings = ratings[ratings['bookRating'] > 0]

Several ISBNs in the ratings dataset do not exist in the books dataset; we filter those out as invalid. We lose some data here, so in the future we could try to find a smarter way to correct the invalid ISBNs.

In [9]:
# All users in the ratings file also appear in the users file.
# The ratings file, however, contains ISBNs that are missing from the books file; filter those out first.
ratings = ratings[ratings['ISBN'].isin(items['ISBN'])]  # Users: 68091 items: 149836

The original dataset is also very sparse. Its density, measured as the number of ratings divided by the number of possible user-item combinations, is only 0.0038%. To increase this, we narrow our focus to books that were rated at least 10 times and users who have rated at least 20 items. This brings the density up to 0.8143%.

In [10]:
density = (float(len(ratings))/(len(np.unique(ratings['userID']))*len(np.unique(ratings['ISBN']))))*100
print("Density in percent: "+str(density) )
print("Users: "+str(len(np.unique(ratings['userID'])))+ " items: "+str(len(np.unique(ratings['ISBN']))))

Density in percent: 0.0037622409872253336
Users: 68091 items: 149836


In [11]:
# To reduce our dataset, remove items that were rated fewer than 10 times
a = ratings.groupby('ISBN').filter(lambda x: len(x) >= 10)
density = (float(len(a))/(len(np.unique(a['userID']))*len(np.unique(a['ISBN']))))*100
print("Density after filtering items: "+str(density))
print("Users: "+str(len(np.unique(a['userID'])))+ " items: "+str(len(np.unique(a['ISBN']))))

Density after filtering items: 0.06499953850402322
Users: 39365 items: 5444


In [12]:
# Remove users who gave fewer than 20 ratings
b = a.groupby('userID').filter(lambda x: len(x) >= 20)
densityu = (float(len(b))/(len(np.unique(b['userID']))*len(np.unique(b['ISBN']))))*100
print("Density after filtering users: "+str(densityu))
print("Users: "+str(len(np.unique(b['userID'])))+ " items: "+str(len(np.unique(b['ISBN']))))

Density after filtering users: 0.8143211405243026
Users: 1117 items: 5356
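The two filters above are each applied once, in sequence, so removing users afterwards can push some books back below the 10-rating threshold (the item count already drops from 5444 to 5356). One possible refinement, sketched here as an assumption rather than as what the notebook does, is to repeat both filters until neither removes any more rows. `density_percent` and `filter_until_stable` are hypothetical helper names, and the toy data uses thresholds of 2 only to keep the example small.

```python
import pandas as pd

def density_percent(df, user_col="userID", item_col="ISBN"):
    """Ratings present, as a percentage of all possible user-item pairs."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    if n_users == 0 or n_items == 0:
        return 0.0
    return 100.0 * len(df) / (n_users * n_items)

def filter_until_stable(df, min_item_ratings=10, min_user_ratings=20,
                        user_col="userID", item_col="ISBN"):
    """Apply the item filter, then the user filter, until nothing changes."""
    while not df.empty:
        before = len(df)
        # Number of ratings each row's item has received
        item_counts = df.groupby(item_col)[item_col].transform("size")
        df = df[item_counts >= min_item_ratings]
        # Number of ratings each row's user has given
        user_counts = df.groupby(user_col)[user_col].transform("size")
        df = df[user_counts >= min_user_ratings]
        if len(df) == before:
            break
    return df

# Toy data, for illustration only
toy = pd.DataFrame({
    "userID": ["A", "A", "B", "B", "C", "C"],
    "ISBN": ["x", "y", "x", "y", "x", "z"],
    "bookRating": [8, 7, 9, 6, 5, 10],
})
filtered = filter_until_stable(toy, min_item_ratings=2, min_user_ratings=2)
print(len(filtered), density_percent(filtered))
```

On the real data this would be called as `filter_until_stable(ratings)`. It keeps fewer rows than the single pass, so the density and RMSE figures reported in this notebook would not carry over unchanged.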


### Our data is ready; we will now bring in functions from the Surprise package

In [51]:
from surprise import SVD, accuracy, SVDpp, NormalPredictor, KNNBaseline, KNNBasic, BaselineOnly
from surprise.model_selection import cross_validate, KFold
from surprise.model_selection import train_test_split
from surprise import Reader, Dataset
from surprise.model_selection import GridSearchCV


In [19]:
# Surprise expects the columns in (user, item, rating) order
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(b[['userID', 'ISBN', 'bookRating']], reader)
# Hold out 25% of the ratings for testing
trainset, testset = train_test_split(data, test_size=0.25)

### Basic Algorithms - Benchmark

Normal Predictor: This algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. Each prediction is drawn from a normal distribution whose mean and standard deviation are estimated from the training set using maximum likelihood estimation. As one would expect, this algorithm does not perform particularly well, but it provides a baseline RMSE against which to compare the other algorithms.

In [39]:
algo = NormalPredictor()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 2.3377
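The idea behind the Normal Predictor can be sketched in a few lines of plain Python. This is an illustration of the technique described above, not Surprise's actual code: the helper names and the training ratings are made up, and clipping the draw to the 1-10 scale is an assumption of this sketch.

```python
import random
import statistics

def fit_normal(train_ratings):
    """Maximum-likelihood estimates of the mean and standard deviation."""
    mu = statistics.fmean(train_ratings)
    sigma = statistics.pstdev(train_ratings)  # population form, i.e. the MLE
    return mu, sigma

def predict_random(mu, sigma, rng, low=1.0, high=10.0):
    """Draw one rating from N(mu, sigma^2) and clip it to the rating scale."""
    return min(high, max(low, rng.gauss(mu, sigma)))

# Hypothetical training ratings, for illustration only
train = [8, 7, 9, 5, 10, 6]
mu, sigma = fit_normal(train)
rng = random.Random(0)  # seeded so the draws are repeatable
samples = [predict_random(mu, sigma, rng) for _ in range(5)]
print(mu, round(sigma, 4))
```

The prediction ignores both the user and the book, which is why this predictor is useful only as a floor for the other algorithms.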


Baseline Only: This algorithm predicts the baseline estimate for a given user and item. The rating is modelled as the sum of the average rating over all books, the bias of user u, and the bias of item i. Again, we do not expect this algorithm to work particularly well, but on our first pass it performs best in terms of RMSE.

In [40]:
algo = BaselineOnly()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 1.5072
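Written as a formula, the baseline estimate described above is the sum of three terms. The symbols are notation chosen for this sketch: $\mu$ is the mean of all training ratings, $b_u$ is the bias of user $u$, and $b_i$ is the bias of item $i$.

$$\hat{r}_{ui} = \mu + b_u + b_i$$

As an illustration with made-up numbers: if $\mu = 7.5$, a user tends to rate half a point below average ($b_u = -0.5$), and a book tends to be rated one point above average ($b_i = 1.0$), the predicted rating is $7.5 - 0.5 + 1.0 = 8.0$.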


# KNN

In [20]:
#KNN Baseline
algo = KNNBaseline()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.6782


1.6781709525910349

In [73]:
#KNN Basic
algo = KNNBasic()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.8841


# SVD

In [25]:
algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x23ba8165a20>

### We are now at the stage of trying out some hyperparameter tuning. We will use the GridSearchCV function in Surprise, which is based on the one of the same name in scikit-learn

In [13]:
param_grid = {'n_epochs':[20,30],  'lr_all':[0.001,0.01],'reg_all':[0.02,0.5]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
gs.fit(data)
print(gs.best_params['rmse'])

{'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.5}
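A grid search evaluates every combination of the listed values, so it helps to see how large the grid is before running it. The short sketch below only enumerates the combinations in `param_grid` with the standard library; it does not call Surprise, and the variable names are illustrative.

```python
from itertools import product

param_grid = {'n_epochs': [20, 30], 'lr_all': [0.001, 0.01], 'reg_all': [0.02, 0.5]}

# Every combination of one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 3
print(len(combinations))             # 2 * 2 * 2 = 8 candidate settings
print(len(combinations) * cv_folds)  # 8 settings * 3 folds = 24 model fits
```

The cost grows multiplicatively with each added parameter value. The best setting reported above also sits at the edge of the grid for `lr_all` and `reg_all`, so a follow-up search with larger values of those two parameters might be worth trying.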


In [35]:
algo = gs.best_estimator['rmse']
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 1.5196


1.519632929419054
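Since the project goal is a top-k list per user, predictions eventually have to be turned into ranked lists. The sketch below groups prediction tuples by user and keeps the k highest estimates. It assumes each prediction is a 5-tuple in the order (user id, item id, true rating, estimate, details), which is the layout of Surprise's prediction objects; `top_k` is a hypothetical helper name and the sample tuples are made up.

```python
from collections import defaultdict

def top_k(predictions, k=10):
    """Map each user to their k highest-estimated (item, estimate) pairs."""
    per_user = defaultdict(list)
    for uid, iid, true_r, est, details in predictions:
        per_user[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:k]
        for uid, items in per_user.items()
    }

# Hypothetical predictions, for illustration only
sample = [
    ("u1", "a", 8, 7.2, {}),
    ("u1", "b", 6, 8.9, {}),
    ("u1", "c", 9, 5.0, {}),
    ("u2", "a", 7, 6.1, {}),
]
print(top_k(sample, k=2))
```

The `predictions` list built from `testset` covers only books each user has already rated. For actual recommendations one would instead predict on unrated user-book pairs, for example those returned by `trainset.build_anti_testset()`.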
