# Building the Recommendation System

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from surprise import Reader, Dataset, SVD, NMF, KNNBasic, accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from collections import defaultdict
import random

plt.style.use('ggplot')
plt.rcParams.update({'font.size': 18})

pd.set_option('display.max_colwidth', -1)

%matplotlib inline

## Building the dataset

In the previous notebook in this repository, we delved into these three tables and fixed a few errors. The code taking care of all those is compiled here.

In [2]:
# BX-Books table
books = pd.read_csv('data/BX-Books.csv', sep=';', encoding='latin-1')
books.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis=1, inplace=True)
books.columns = ['ISBN', 'title', 'author', 'year', 'publisher']
books.loc[books.ISBN=='9627982032', 'author'] = 'Edinburgh Financial Publishing'
books.loc[books.ISBN=='193169656X', 'publisher'] = 'Mundania Press LLC'
books.loc[books.ISBN=='1931696993', 'publisher'] = 'Bantam Books'
books.loc[books.ISBN=='078946697X', 'publisher'] = 'DK Publishing Inc'
books.loc[books.ISBN=='078946697X', 'year'] = 2000
books.loc[books.ISBN=='078946697X', 'author'] = 'Michael Teitelbaum'
books.loc[books.ISBN=='078946697X', 'title'] = 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)'
books.loc[books.ISBN=='0789466953', 'publisher'] = 'DK Publishing Inc'
books.loc[books.ISBN=='0789466953', 'year'] = 2000
books.loc[books.ISBN=='0789466953', 'author'] = 'James Buckley'
books.loc[books.ISBN=='0789466953', 'title'] = 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)'
books.loc[books.ISBN=='2070426769', 'publisher'] = 'Gallimard'
books.loc[books.ISBN=='2070426769', 'year'] = 2003
books.loc[books.ISBN=='2070426769', 'author'] = 'Jean-Marie Gustave Le ClÃ?Â©zio'
books.loc[books.ISBN=='2070426769', 'title'] = "Peuple du ciel, suivi de 'Les Bergers"
books.year = pd.to_numeric(books.year)
books.loc[(books.year > 2004) | (books.year==0), 'year'] = np.NaN

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
# BX-Books-Ratings table
ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', encoding='latin-1')
ratings.columns = ['user_id', 'ISBN', 'rating']
ratings2 = ratings[ratings.ISBN.isin(books.ISBN)]
ratings_exp = ratings2[ratings2.rating != 0]
ratings_imp = ratings2[ratings2.rating == 0]

In [4]:
# BX-Users table
users = pd.read_csv('data/BX-Users.csv', sep=';', encoding='latin-1')
users.columns = ['user_id', 'location', 'age']
users.loc[(users.age < 5) | (users.age > 100), 'age'] = np.NaN
loc_ex = users.location.str.split(',', 2, expand=True)
loc_ex.columns = ['city', 'state', 'country']
users = users.join(loc_ex)
users.drop(columns='location', inplace=True)
users.state.replace(' ', np.NaN, inplace=True)
users.country.replace('', np.NaN, inplace=True)

In [5]:
# Joining the books and ratings tables
books_ratings = ratings_exp.join(books.set_index('ISBN'), on='ISBN', how='left')
books_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383852 entries, 1 to 1149778
Data columns (total 7 columns):
user_id      383852 non-null int64
ISBN         383852 non-null object
rating       383852 non-null int64
title        383852 non-null object
author       383852 non-null object
year         377983 non-null float64
publisher    383852 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 23.4+ MB


### Handling books with multiple editions

During data exploration, we found that books with different editions have different ISBN. In order to not dilute the recommendation engine, we need to group and ascribe a unique identifier to each of those books with multiple editions.

In [6]:
multi_isbn = books_ratings.groupby('title').ISBN.nunique()
multi_isbn.value_counts().sort_values()

18    1     
16    1     
12    1     
15    2     
11    2     
14    5     
10    8     
9     11    
8     28    
7     43    
6     86    
5     181   
4     493   
3     1497  
2     7872  
1     125342
Name: ISBN, dtype: int64

The majority of the books only have one edition. Let's look at one of the titles we saw in the last notebook that had multiple: *Wuthering Heights*.

In [7]:
books_ratings.loc[books_ratings.title=='Wuthering Heights'][:10]

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher
11515,763,553211412,8,Wuthering Heights,Emily Bronte,1983.0,Bantam
13283,1674,812505166,10,Wuthering Heights,Emily Bronte,1989.0,Tor Classics
36107,8680,812505166,5,Wuthering Heights,Emily Bronte,1989.0,Tor Classics
40333,10030,553210211,8,Wuthering Heights,Emily Bronte,1981.0,Bantam Books
55714,11676,1566193087,10,Wuthering Heights,Emily Bronte,1994.0,Dorset Press
59493,11916,439228913,2,Wuthering Heights,Emily Bronte,2003.0,Scholastic Paperbacks
69921,14521,553212583,1,Wuthering Heights,EMILY BRONTE,1983.0,Bantam
73304,15602,553212583,10,Wuthering Heights,EMILY BRONTE,1983.0,Bantam
77035,16634,553212583,8,Wuthering Heights,EMILY BRONTE,1983.0,Bantam
104456,23902,553212583,10,Wuthering Heights,EMILY BRONTE,1983.0,Bantam


Notice that the `author` entry has both upper and lowercase. In order to not end up giving those each a separate identification, we should standardize them in uppercase. We'll do the same with titles.

In [8]:
books_ratings.author = books_ratings.author.str.upper()
books_ratings.title = books_ratings.title.str.upper()

In [9]:
books_ratings.loc[books_ratings.title=='WUTHERING HEIGHTS'][:10]

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher
11515,763,553211412,8,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam
13283,1674,812505166,10,WUTHERING HEIGHTS,EMILY BRONTE,1989.0,Tor Classics
36107,8680,812505166,5,WUTHERING HEIGHTS,EMILY BRONTE,1989.0,Tor Classics
40333,10030,553210211,8,WUTHERING HEIGHTS,EMILY BRONTE,1981.0,Bantam Books
55714,11676,1566193087,10,WUTHERING HEIGHTS,EMILY BRONTE,1994.0,Dorset Press
59493,11916,439228913,2,WUTHERING HEIGHTS,EMILY BRONTE,2003.0,Scholastic Paperbacks
69921,14521,553212583,1,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam
73304,15602,553212583,10,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam
77035,16634,553212583,8,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam
104456,23902,553212583,10,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam


Now we can use `groupby` to collect the books by `title` and `author` and give them a unique identifier using `ngroup`.

In [10]:
books_ratings['book_id'] = books_ratings.groupby(['title', 'author']).ngroup()

In [11]:
books_ratings.loc[books_ratings.title=='WUTHERING HEIGHTS'][:10]

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,book_id
11515,763,553211412,8,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam,136756
13283,1674,812505166,10,WUTHERING HEIGHTS,EMILY BRONTE,1989.0,Tor Classics,136756
36107,8680,812505166,5,WUTHERING HEIGHTS,EMILY BRONTE,1989.0,Tor Classics,136756
40333,10030,553210211,8,WUTHERING HEIGHTS,EMILY BRONTE,1981.0,Bantam Books,136756
55714,11676,1566193087,10,WUTHERING HEIGHTS,EMILY BRONTE,1994.0,Dorset Press,136756
59493,11916,439228913,2,WUTHERING HEIGHTS,EMILY BRONTE,2003.0,Scholastic Paperbacks,136756
69921,14521,553212583,1,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam,136756
73304,15602,553212583,10,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam,136756
77035,16634,553212583,8,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam,136756
104456,23902,553212583,10,WUTHERING HEIGHTS,EMILY BRONTE,1983.0,Bantam,136756


Looking good, but let's check on a common `title` that is likely to have multiple different entries in the `author` column.

In [12]:
books_ratings.loc[books_ratings.title=='SELECTED POEMS'][:10]

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,book_id
20421,3837,081120958X,8,SELECTED POEMS,WILLIAM CARLOS WILLIAMS,1985.0,New Directions Publishing Corporation,89293
52276,11676,0670809179,5,SELECTED POEMS,JOHN ASHBERY,1985.0,Penguin USA,89284
97611,22154,0811201465,8,SELECTED POEMS,K. PATCHEN,1957.0,New Directions Publishing Corporation,89286
103512,23872,0811201465,9,SELECTED POEMS,K. PATCHEN,1957.0,New Directions Publishing Corporation,89286
274347,63956,0517101548,9,SELECTED POEMS,JOHN DONNE,1994.0,Gramercy Books,89285
302598,72214,1550651498,8,SELECTED POEMS,RALPH GUSTAFSON,2001.0,Vehicule Press,89288
320062,76482,0802151027,9,SELECTED POEMS,PABLO NERUDA,1961.0,Grove Press,89287
326486,77819,0679750800,7,SELECTED POEMS,RITA DOVE,1993.0,Vintage Books USA,89289
356406,85884,0156003961,6,SELECTED POEMS,CARL SANDBURG,1996.0,Harvest Books,89280
402730,97050,0333516265,4,SELECTED POEMS,THOMAS HARDY,1989.0,Macmillan,89291


In [13]:
books_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383852 entries, 1 to 1149778
Data columns (total 8 columns):
user_id      383852 non-null int64
ISBN         383852 non-null object
rating       383852 non-null int64
title        383852 non-null object
author       383852 non-null object
year         377983 non-null float64
publisher    383852 non-null object
book_id      383852 non-null int64
dtypes: float64(1), int64(3), object(4)
memory usage: 26.4+ MB


Everything looks good. Let's join it with users into a full dataframe. We'll be minimizing it for the recommendation engine, but we'll use this full dataframe later to pull the real titles back out.

In [14]:
full_df = books_ratings.join(users.set_index('user_id'), on='user_id', how='left')
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383852 entries, 1 to 1149778
Data columns (total 12 columns):
user_id      383852 non-null int64
ISBN         383852 non-null object
rating       383852 non-null int64
title        383852 non-null object
author       383852 non-null object
year         377983 non-null float64
publisher    383852 non-null object
book_id      383852 non-null int64
age          268027 non-null float64
city         383852 non-null object
state        375193 non-null object
country      373372 non-null object
dtypes: float64(2), int64(3), object(7)
memory usage: 38.1+ MB


In [15]:
full_df.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,book_id,age,city,state,country
1,276726,0155061224,5,RITES OF PASSAGE,JUDITH RAE,2001.0,Heinle,85641,,seattle,washington,usa
3,276729,052165615X,3,HELP!: LEVEL 1,PHILIP PROWSE,1999.0,Cambridge University Press,47102,16.0,rijeka,,croatia
4,276729,0521795028,6,THE AMSTERDAM CONNECTION : LEVEL 4 (CAMBRIDGE ENGLISH READERS),SUE LEATHER,2001.0,Cambridge University Press,99565,16.0,rijeka,,croatia
8,276744,038550120X,7,A PAINTED HOUSE,JOHN GRISHAM,2001.0,Doubleday,3186,,torrance,california,usa
16,276747,0060517794,9,LITTLE ALTARS EVERYWHERE,REBECCA WELLS,2003.0,HarperTorch,62205,25.0,iowa city,iowa,usa


We minimize the dataframe into just what we need for the recommendations: `user_id`, `book_id`, and `rating`.

In [68]:
min_df = books_ratings[['user_id', 'book_id', 'rating']]
min_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383852 entries, 1 to 1149778
Data columns (total 3 columns):
user_id    383852 non-null int64
book_id    383852 non-null int64
rating     383852 non-null int64
dtypes: int64(3)
memory usage: 11.7 MB


In [30]:
min_df.head()

Unnamed: 0,user_id,book_id,rating
1,276726,85641,5
3,276729,47102,3
4,276729,99565,6
8,276744,3186,7
16,276747,62205,9


## Building the recommendation algorithm using the Surprise package

My initial attempts at creating a recommendation algorithm were memory-based collaborative filtering. I attempted to create a user-items ratings matrix and calculate pairwise distances. However, hardware limitations proved to be an issue.

Instead, I opted for a model-based approach. [Surprise](https://surprise.readthedocs.io/en/stable/getting_started.html) is a python package that allows matrix factorization through fitted models that are less hardware intensive.

In [69]:
# With Surprise, you first set the ratings scale and load the data
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(min_df, reader)

The matrix factorization models compared here are **Singular Value Decomposition (SVD)** and **Non-negative Matrix Factorization (NMF)**.

In [16]:
nmf = NMF(random_state=23)
cross_validate(nmf, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.4699  2.4709  2.4694  2.4766  2.4757  2.4725  0.0030  
Fit time          51.28   49.44   43.07   48.26   51.07   48.63   2.99    
Test time         1.09    1.01    0.96    1.14    0.63    0.96    0.18    


{'test_rmse': array([2.46986904, 2.47091514, 2.46939632, 2.47659425, 2.47571975]),
 'fit_time': (51.28337121009827,
  49.439143896102905,
  43.06839990615845,
  48.263607025146484,
  51.07368016242981),
 'test_time': (1.0939700603485107,
  1.0058021545410156,
  0.9557077884674072,
  1.1394522190093994,
  0.6257419586181641)}

NMF got an average RMSE of 2.4725.

In [17]:
svd = SVD(random_state=23)
cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.6392  1.6341  1.6348  1.6395  1.6299  1.6355  0.0036  
Fit time          23.58   24.53   23.29   23.38   24.64   23.88   0.58    
Test time         1.07    1.34    0.62    0.63    0.63    0.86    0.29    


{'test_rmse': array([1.639246  , 1.63410458, 1.6348495 , 1.63952095, 1.62993646]),
 'fit_time': (23.58285093307495,
  24.5279541015625,
  23.290915966033936,
  23.375879049301147,
  24.6403329372406),
 'test_time': (1.070680856704712,
  1.3358097076416016,
  0.6227309703826904,
  0.6256387233734131,
  0.6295192241668701)}

SVD got an average RMSE of 1.6355.

SVD looks like the best so let's do a train/test split and tune the parameters.

In [70]:
train, test = train_test_split(data, test_size=0.2, random_state=23)

In [25]:
svd.fit(train)
pred = svd.test(test)

accuracy.rmse(pred)

RMSE: 1.6416


1.641590124187428

The defaults of the following parameters being `GridSearch`ed are 100, 0.005, and 0.02. We're testing above and below the default options.

In [21]:
param_grid = {'n_factors': [60, 80, 100, 120, 140],
              'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.005, 0.01, 0.02, 0.04, 0.08]}

gs_svd = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

In [24]:
%time gs_svd.fit(data)

CPU times: user 1h 27min 6s, sys: 1min 23s, total: 1h 28min 30s
Wall time: 1h 36min 32s


In [26]:
best_svd = gs_svd.best_estimator['rmse']

gs_svd.best_params['rmse']

{'n_factors': 60, 'lr_all': 0.005, 'reg_all': 0.08}

We set the best parameters and can once again do a CV test.

In [27]:
cross_validate(best_svd, data, measures=['rmse'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.6245  1.6232  1.6291  1.6270  1.6237  1.6255  0.0022  
Fit time          17.87   19.95   17.88   33.67   21.58   22.19   5.90    
Test time         0.67    0.83    1.10    1.27    1.10    0.99    0.21    


{'test_rmse': array([1.62454617, 1.62323183, 1.62909116, 1.6269746 , 1.62370824]),
 'fit_time': (17.871932983398438,
  19.950823068618774,
  17.882739305496216,
  33.66665196418762,
  21.57614517211914),
 'test_time': (0.6733641624450684,
  0.8339347839355469,
  1.097783088684082,
  1.2683031558990479,
  1.0975959300994873)}

With the tuned parameters, we can once again fit the model to the training data.

In [71]:
svd_tuned = SVD(n_factors=60, lr_all=0.005, reg_all=0.08, random_state=23)
svd_tuned.fit(train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a296cd978>

In [92]:
pred_tuned = svd_tuned.test(test)
accuracy.rmse(pred_tuned)

RMSE: 1.6320


1.6319695417706765

With the tuned parameters, we see a slight decrease in RMSE from 1.6416 to 1.6320.

## Testing out the recommendations

Let's pull a few random samples are see how good the predictions are based on personal judgement.

In [72]:
min_df.sample(n=1, random_state=23)

Unnamed: 0,user_id,book_id,rating
854966,206691,88501,5


In [75]:
uid = 206691
iid = 88501
check = svd_tuned.predict(uid, iid)
print(f'Estimated Rating: {check.est: .2f}')
print('Real Rating: 5')

Estimated Rating:  6.81
Real Rating: 5


In [95]:
min_df.sample(n=1, random_state=93)

Unnamed: 0,user_id,book_id,rating
502665,121941,75517,10


In [76]:
uid = 121941
iid = 75517

check = svd_tuned.predict(uid, iid)
print(f'Estimated Rating: {check.est: .2f}')
print('Real Rating: 10')

Estimated Rating:  8.63
Real Rating: 10


In [97]:
min_df.sample(n=1, random_state=156)

Unnamed: 0,user_id,book_id,rating
860141,208077,61908,8


In [77]:
uid = 2088077
iid = 61908

check = svd_tuned.predict(uid, iid)
print(f'Estimated Rating: {check.est: .2f}')
print('Real Rating: 8')

Estimated Rating:  7.14
Real Rating: 8


In [99]:
min_df.sample(n=1, random_state=718)

Unnamed: 0,user_id,book_id,rating
550470,131942,95387,8


In [78]:
uid = 131942
iid = 95387

check = svd_tuned.predict(uid, iid)
print(f'Estimated Rating: {check.est: .2f}')
print('Real Rating: 8')

Estimated Rating:  8.15
Real Rating: 8


Overall, the ratings are pretty close. They're not perfect, but the recommendations are fairly accurate. Now let's move on to creating a function that can provide the top recommendations for a user.

## Building the recommendation system

Due to processing power limitations, it was necessary to shrink the dataset when building the full recommendation engine. We will limit the dataset to only users who have rated 100 or more books.

In [31]:
c_user = min_df.user_id.value_counts()
min_df = min_df[min_df.user_id.isin(c_user[c_user >= 100].index)]

In [32]:
min_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103274 entries, 1456 to 1147615
Data columns (total 3 columns):
user_id    103274 non-null int64
book_id    103274 non-null int64
rating     103274 non-null int64
dtypes: int64(3)
memory usage: 3.2 MB


Because we've shrunk the dataset we need to reload it for use by the SVD algorithm.

In [34]:
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(min_df, reader)

In [35]:
def get_top_n(predictions, n=10):
    '''
    Return the top-N recommendation for each user from a set of predictions.

    Arguments:
        predictions: The list of predictions, specifically the object returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user.

    Returns:
        A dict where keys are user ids and values are lists of tuples:
        [(book id, rating estimation), ...] of size n.
    '''

    # Map the predictions to each user
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Sort the predictions for each user and retrieve the n highest
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

The Surprise package's Dataset module allows one to train the data on user/rating pairs that do exist using `build_full_trainset` then create a set of predictions on user/rating pairs that aren't in the training set using `build_anti_testset`.

In [36]:
trainset = data.build_full_trainset()
svd_rec = SVD(n_factors=60, lr_all=0.005, reg_all=0.08, random_state=23)
svd_rec.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a296cd2e8>

In [37]:
testset = trainset.build_anti_testset()
predictions = svd_rec.test(testset)

In [38]:
top_n = get_top_n(predictions, n=10)

Now that we have the list of predictions, let's look at a random entry.

In [59]:
random.seed(77)
random.choice(testset)

(264082, 51905, 7.825454615876213)

In [60]:
top_n[264082]

[(32350, 7.402003268833893),
 (44757, 7.36797269742903),
 (127956, 7.333824161580881),
 (76458, 7.267675698675407),
 (95189, 7.251538907209959),
 (77631, 7.250855803324831),
 (133734, 7.2466541283467105),
 (933, 7.220326452034239),
 (99496, 7.219992980517031),
 (118782, 7.219123305268468)]

Looks like everything's working, but a collection of just numbers doesn't do us much good. We need to convert the book id's back to titles.

In [58]:
def get_reading_rec(user_id, top_n):
    '''
    Returns the predictions in top_n with the correct book titles from the full dataset.
    
    Arguments:
        user_id: the int representing the user you want to create a reading list for
        top_n: the dictionary of user/rating predictions
        
    Returns:
        A dict where the keyword is the book title and the values are the predicted ratings.
    '''
    
    reading_rec = defaultdict(list)
    
    for n in top_n[user_id]:
        book, rating = n
        title = full_df.loc[full_df.book_id==book].title.unique()[0]
        reading_rec[title] = rating
    return reading_rec

Let's take a look at some random entries so see how everything is working now.

In [45]:
random.seed(99)
random.choice(testset)

(132492, 70615, 7.825454615876213)

In [55]:
example = get_reading_rec(132492, top_n)
for book, rating in example.items():
    print(f'{book}: {rating}')

GREEN EGGS AND HAM (I CAN READ IT ALL BY MYSELF BEGINNER BOOKS): 10
HARRY POTTER AND THE PRISONER OF AZKABAN (BOOK 3): 10
HARRY POTTER AND THE GOBLET OF FIRE (BOOK 4): 10
ENDER'S GAME (ENDER WIGGINS SAGA (PAPERBACK)): 10
TO KILL A MOCKINGBIRD: 10
ATLAS SHRUGGED: 10
TUESDAYS WITH MORRIE: AN OLD MAN, A YOUNG MAN, AND LIFE'S GREATEST LESSON: 10
FAHRENHEIT 451: 10
OUTLANDER: 10
GRIFFIN & SABINE: AN EXTRAORDINARY CORRESPONDENCE: 10


In [47]:
random.seed(418)
random.choice(testset)

(104399, 27171, 7.825454615876213)

In [56]:
example = get_reading_rec(104399, top_n)
for book, rating in example.items():
    print(f'{book}: {rating}')

THE WHALE RIDER: 10
THE PICTURE OF DORIAN GRAY (MODERN LIBRARY (PAPERBACK)): 10
INTO THE WILD: 10
THE DA VINCI CODE: 10
GONE FOR GOOD: 10
VERTICAL RUN: 10
LETTERS FOR EMILY: 10
ICE BOUND: A DOCTOR'S INCREDIBLE BATTLE FOR SURVIVAL AT THE SOUTH POLE: 10
THE GOLDEN MEAN: IN WHICH THE EXTRAORDINARY CORRESPONDENCE OF GRIFFIN & SABINE CONCLUDES: 10
HOMICIDAL PSYCHO JUNGLE CAT: A CALVIN AND HOBBES COLLECTION: 10


Two all 10 out of 10's in a row? Let's check a specific number in `top_n` versus the output of `get_reading_rec` to double-check.

In [52]:
top_n[277427]

[(121848, 9.629515536078355),
 (118782, 9.60014398415964),
 (99496, 9.57619222677314),
 (32350, 9.571841484725704),
 (44757, 9.537935247692188),
 (3420, 9.526558068686974),
 (72132, 9.501878239410088),
 (46059, 9.473703203746004),
 (933, 9.464327798325476),
 (36801, 9.459354565936092)]

In [57]:
example = get_reading_rec(277427, top_n)
for book, rating in example.items():
    print(f'{book}: {rating}')

THE TWO TOWERS (THE LORD OF THE RINGS, PART 2): 9.629515536078355
THE SECRET GARDEN: 9.60014398415964
THE AMBER SPYGLASS (HIS DARK MATERIALS, BOOK 3): 9.57619222677314
DUNE (REMEMBERING TOMORROW): 9.571841484725704
GRIFFIN & SABINE: AN EXTRAORDINARY CORRESPONDENCE: 9.537935247692188
A PRAYER FOR OWEN MEANY: 9.526558068686974
MY SISTER'S KEEPER : A NOVEL (PICOULT, JODI): 9.501878239410088
HARRY POTTER AND THE PRISONER OF AZKABAN (BOOK 3): 9.473703203746004
84 CHARING CROSS ROAD: 9.464327798325476
FAHRENHEIT 451: 9.459354565936092


Okay, that's working out. Now let's check to make sure that the recommendations are truly not a pair that's in the full dataset.

In [61]:
dc = full_df.loc[full_df.user_id==277427]
dc.loc[dc.title=='THE SECRET GARDEN']

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,book_id,age,city,state,country


Success! *The Secret Garden* was on the recommendation list but was not a book already rated by this user.

## In Summary

In summary, we used the Book-Crossing dataset to build a functioning recommendation engine using the useful and versatile Surprise package. The engine had an average root mean squared error of 1.6255 and operated well. It should be noted, however, that as this is a recommendation engine based off of user ratings, it will suffer when a user has not put in enough rating data to provide accurate recommendations. So, when using such an engine, initial recommendations should be based on simply popularity until a threshold is met.