<h2> Parameter Tuning and Model Selection </h2>

In this notebook, I tune parameters to find the best models to recommend books from my data.

Due to time constraints and processing power, I limited what I did in this notebook. Parameters I would like to fully tune in the future:
1. Number of clusters (as a function of number of books)
2. Threshold for minimum number of ratings by a user
3. Threshold for minimum number of ratings of a book
4. Outliers to shave off the top of the dataset (although, this might be considered data hacking.)
5. Which recommender algorithm to use
6. Which clustering algorithm to use
7. Number of factors in the SVD algorithm
8. Number of epochs in the SVD algorithm

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import time
import math
import random

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

from surprise.prediction_algorithms.matrix_factorization import SVD
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate

In [2]:
## Loading the data

# df_books comes from my BlurbCleaning notebook and is a table of all the books in the dataset
# that I was able to scrape Blurbs for.
df_books = pd.read_csv('/DataScience/Final Capstone Files/books_with_popularities.csv')
# bert_embeddings is all of the books with BERT vectors
bert_embeddings = pd.read_csv('/DataScience/Final Capstone Files/books_with_blurbs_and_BERT_combined.csv')
bert_embeddings.drop('Unnamed: 0', 1, inplace=True)

# Merging the 2 tables together
df_books = pd.merge(df_books, bert_embeddings)

# df_ratings is the ratings table from the book-crossings dataset
df_ratings = pd.read_csv('/DataScience/BX-CSV-Dump/BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding = "latin-1")
df_ratings.rename(columns={'User-ID': 'User', 'Book-Rating': 'Rating'}, inplace=True)
df_ratings = df_ratings[df_ratings['Rating'] > 0]
df_ratings.reset_index(inplace=True, drop=True)

Before modeling using SVD, I want to try to develop some baseline models to compare my results with. The following 2 cells are naive models. The first cell predicts each rating as just the average rating the user makes. This would not work practically because it would recommend all books evenly to a given user, making it impossible to generate a recommendation.

The second cell predicts each rating as the average rating that the book has received. This gives a pretty terrible RMSE but from a naive perspective would give decent recommendations because it would just recommend books that are highly rated.

In [4]:
## User average baseline
rmses, cv = [], 3
shuffled = df_ratings.sample(frac=1).copy()
for chunk in range(cv):
    test_set = shuffled.iloc[chunk * (len(shuffled) // cv) : (chunk + 1) * (len(shuffled) // cv), :]
    train_set = shuffled[~shuffled.index.isin(test_set.index)]
    RMSE, count = 0, 0
    for rating in range(len(test_set)):
        if len(train_set[train_set.User == test_set.iloc[rating, 0]]) > 0:
            RMSE += (test_set.Rating.iloc[rating] - np.mean(train_set.Rating[train_set.User == test_set.iloc[rating, 0]])) ** 2
            count += 1
    RMSE = (RMSE / count) ** 0.5
    rmses.append(RMSE)
print("Average RMSE:", sum(rmses)/cv)


Average RMSE: 1.6832489522972125


In [35]:
## Book average baseline
rmses, cv = [], 1
shuffled = df_ratings.sample(frac=1).copy()
for chunk in range(cv):
    test_set = shuffled.iloc[chunk * (len(shuffled) // 5) : (chunk + 1) * (len(shuffled) // 5), :]
    train_set = shuffled[~shuffled.index.isin(test_set.index)]
    RMSE, count = 0, 0
    for rating in range(len(test_set)):
        if len(train_set[train_set.ISBN == test_set.iloc[rating, 1]]) > 0:
            RMSE += (test_set.Rating.iloc[rating] - np.mean(train_set.Rating[train_set.ISBN == test_set.iloc[rating, 1]])) ** 2
            count += 1
    RMSE = (RMSE / count) ** 0.5
    rmses.append(RMSE)
print("Average RMSE:", sum(rmses)/cv)

Average RMSE: 1.9600300066510694


Now I will run my full pipeline. This next cell is set up with timestamps to help me optimize runtime for my code and algorithms. A basic outline of the data pipeline:
1. Load Data, set thresholds for outliers for data
2. Cluster BERT vectors using K-Means
3. Set up a test-set for our new clusters and ratings table
4. Alter the train-set to account for repeat ratings by averaging repeats
5. Use SVD algorithm on train data
6. Have SVD predict the ratings for our test-set, and use RMSE to evaluate

In [77]:
start = time.time()

# Parameters
min_user_rating = 20 # Ignore users with less than this many ratings
min_book_ratings = 20 # Ignore books with less than this many ratings
cluster_ratio = 50 # Number of clusters will be the number of unique ISBNS divided by this.
cross_vals = 6 # How many cross_validations to run while tuning
num_epochs = 10 # Number of epochs in our SVD algorithm
num_factors = 20 # Number of factors in our SVD algorithm

user_ids_to_keep = []
for user in df_ratings.User.unique():
    if len(df_ratings[df_ratings.User == user]) >= min_user_rating:
        user_ids_to_keep.append(user) # List of users with 20 or more ratings

# Copy of our ratings with only users that we want
df_ratings2 = df_ratings[df_ratings['User'].isin(user_ids_to_keep)].copy()
df_books2 = df_books[df_books.NumberRatings >= min_book_ratings].copy() # copy of books with at least 20 ratings
df_books2 = df_books2[df_books2.ISBN.isin(df_ratings2.ISBN.unique())]
df_ratings2 = df_ratings2[df_ratings2.ISBN.isin(df_books2.ISBN.unique())]

df_books2.reset_index(drop=True, inplace=True)
df_ratings2.reset_index(drop=True, inplace=True)


num_clusters = len(df_books2) // cluster_ratio
X = df_books2.iloc[:, 9:]  ## Just the vector embeddings for K-Means to cluster. They are 768 elements long

print('Phase 1:', time.time() - start)
start = time.time()

km = KMeans(n_clusters = num_clusters)
clusters = km.fit_predict(X)
isbns = list(df_books2.ISBN)
cluster_df = pd.DataFrame() # This is basically just a dictionary of ISBNs to clusters
cluster_df['ISBN'] = isbns
cluster_df['Cluster'] = clusters

print('Phase 2:', time.time() - start)
start = time.time()

cluster_list = []
for isbn in range(len(df_ratings2)):
    cluster_list.append(cluster_df.Cluster[cluster_df.ISBN == df_ratings2.ISBN[isbn]].iloc[0])
df_ratings2['Cluster'] = cluster_list ## Adding a cluster column to df_ratings

clustered_ratings = pd.DataFrame() ## A new dataframe to work out of to keep separate from df_ratings
clustered_ratings['User'] = df_ratings2.User
clustered_ratings['Cluster'] = df_ratings2.Cluster
clustered_ratings['Rating'] = df_ratings2.Rating

cluster_dict = {} # This is a dictionary where the keys are every user, and the values are dictionaries mapping clusters to ratings.
for user in clustered_ratings.User.unique():
    x = clustered_ratings[clustered_ratings.User == user]
    cluster_dict[user] = {}
    for cluster in x.Cluster.unique():
        cluster_dict[user][cluster] = np.mean(x.Rating[x.Cluster == cluster])

# The above block of code is just a streamlined way to set up clustered_ratings2 below, which is a table that
# accounts for repeated ratings of a cluster by a user by taking the average rating they've made for a cluster and
# removing repeats.
new_ratings = []
for rating in range(len(clustered_ratings)):
    new_ratings.append(cluster_dict[clustered_ratings.iloc[rating, 0]][clustered_ratings.iloc[rating, 1]])
clustered_ratings2 = clustered_ratings.copy()
clustered_ratings2['Rating'] = new_ratings

clustered_ratings2 = clustered_ratings2.drop_duplicates(['User', 'Cluster'])
clustered_ratings = clustered_ratings[clustered_ratings.index.isin(clustered_ratings2.index)]

reader = Reader(rating_scale=(1,10))
avg_rmse = 0

print('Phase 3:', time.time() - start)
start = time.time()

# This is NOT technically cross validation. Next time I revise this project I will update it. Right now it's
# just randomly sampling the test data each iteration, which is still fine.
for cv in range(cross_vals):
    svd = SVD(n_factors=num_factors, n_epochs=num_epochs)
    test_set = clustered_ratings.sample(frac=0.15)
    train_set = clustered_ratings2[~clustered_ratings2.index.isin(test_set.index)].dropna()
    data = Dataset.load_from_df(train_set, reader) # Surprise library has this extra step to load the data
    x = data.build_full_trainset()
    svd.fit(x)
    RMSE = 0
    for i in range(len(test_set)):  # For every rating in the test set
        uid = test_set.User.iloc[i] # User ID
        iid = test_set.Cluster.iloc[i] # Cluster Number
        r_ui = test_set.Rating.iloc[i] # Actual rating
        pred = svd.predict(uid, iid, verbose=False) # Predict the rating for the given user ID and Cluster
        RMSE += (r_ui - pred[3]) ** 2 # Take the difference between the actual and predicted rating
    avg_rmse += (RMSE / len(test_set)) ** 0.5 # Sum all of the differences and divide by the length of the test set then square root to get RMSE

print('Phase 4:', time.time() - start)
    
print('Average RMSE:', avg_rmse/cross_vals)

Phase 1: 119.19711899757385
Phase 2: 5.672783136367798
Phase 3: 63.119263887405396
Phase 4: 4.497230052947998
Average RMSE: 1.6644284452746547


In [47]:
# Iterate our model over different min_user_ratings
start = time.time()

# min_user_rating = ?
min_book_ratings = 10
cluster_ratio = 50
cross_vals = 6
num_epochs = 15
num_factors = 50

rmses = []

for min_user_rating in [1, 2, 5, 10, 15, 20, 25, 30, 50, 100]:

    user_ids_to_keep = []

    for user in df_ratings.User.unique():
        if len(df_ratings[df_ratings.User == user]) >= min_user_rating:
            user_ids_to_keep.append(user)

    df_ratings2 = df_ratings[df_ratings['User'].isin(user_ids_to_keep)].copy()
    df_books2 = df_books[df_books.NumberRatings >= min_book_ratings].copy()
    df_books2 = df_books2[df_books2.ISBN.isin(df_ratings2.ISBN.unique())]
    df_ratings2 = df_ratings2[df_ratings2.ISBN.isin(df_books2.ISBN.unique())]


    df_books2.reset_index(drop=True, inplace=True)
    df_ratings2.reset_index(drop=True, inplace=True)

    num_clusters = len(df_books2) // cluster_ratio

    X = df_books2.iloc[:, 9:]

    km = KMeans(n_clusters = num_clusters)
    clusters = km.fit_predict(X)
    isbns = list(df_books2.ISBN)
    cluster_df = pd.DataFrame()
    cluster_df['ISBN'] = isbns
    cluster_df['Cluster'] = clusters

    cluster_list = []
    for isbn in range(len(df_ratings2)):
        cluster_list.append(cluster_df.Cluster[cluster_df.ISBN == df_ratings2.ISBN[isbn]].iloc[0])
    df_ratings2['Cluster'] = cluster_list

    clustered_ratings = pd.DataFrame()
    clustered_ratings['User'] = df_ratings2.User
    clustered_ratings['Cluster'] = df_ratings2.Cluster
    clustered_ratings['Rating'] = df_ratings2.Rating

    cluster_dict = {}
    for user in clustered_ratings.User.unique():
        x = clustered_ratings[clustered_ratings.User == user]
        cluster_dict[user] = {}
        for cluster in x.Cluster.unique():
            cluster_dict[user][cluster] = np.mean(x.Rating[x.Cluster == cluster])

    new_ratings = []
    for rating in range(len(clustered_ratings)):
        new_ratings.append(cluster_dict[clustered_ratings.iloc[rating, 0]][clustered_ratings.iloc[rating, 1]])
    clustered_ratings2 = clustered_ratings.copy()
    clustered_ratings2['Rating'] = new_ratings

    clustered_ratings2 = clustered_ratings2.drop_duplicates(['User', 'Cluster'])
    clustered_ratings = clustered_ratings[clustered_ratings.index.isin(clustered_ratings2.index)]

    reader = Reader(rating_scale=(1,10))
    avg_rmse = 0

    for cv in range(cross_vals):
        svd = SVD(n_factors=num_factors, n_epochs=num_epochs)
        test_set = clustered_ratings.sample(frac=0.15)
        train_set = clustered_ratings2[~clustered_ratings2.index.isin(test_set.index)].dropna()
        data = Dataset.load_from_df(train_set, reader)
        x = data.build_full_trainset()
        svd.fit(x)
        RMSE = 0
        for i in range(len(test_set)):
            uid = test_set.User.iloc[i]
            iid = test_set.Cluster.iloc[i]
            r_ui = test_set.Rating.iloc[i]
            pred = svd.predict(uid, iid, verbose=False)
            RMSE += (r_ui - pred[3]) ** 2
        avg_rmse += (RMSE / len(test_set)) ** 0.5
    rmses.append(avg_rmse/cross_vals)
    
    print(len(clustered_ratings2))
    print(len(df_books2))
print(rmses)
    

107481
5104
90804
5104
74491
5104
59403
5104
51393
5104
44563
5102
39743
5101
37092
5100
25953
5082
13109
4928
[1.7339171099868789, 1.717715697406426, 1.6943699122114901, 1.6787209169260902, 1.6514906238694493, 1.6310497611626171, 1.6201012236863332, 1.6085421408029041, 1.5726427887230348, 1.5384028095144187]


As I Increase the min user rating threshold, the RMSE decreases. By cutting out users with fewer ratings, the SVD algorithm is able to better learn the connections between the clusters. RIght now the best results are with a threshold of 100. I feel uncomfortable cutting out more than that because we will lose too much data.

In [48]:
# Iterating over different min_book_ratings
start = time.time()

min_user_rating = 100
#min_book_ratings = ?
cluster_ratio = 50
cross_vals = 6
num_epochs = 15
num_factors = 50

rmses = []

for min_book_rating in [1, 2, 5, 10, 15, 20, 50]:

    user_ids_to_keep = []

    for user in df_ratings.User.unique():
        if len(df_ratings[df_ratings.User == user]) >= min_user_rating:
            user_ids_to_keep.append(user)

    df_ratings2 = df_ratings[df_ratings['User'].isin(user_ids_to_keep)].copy()
    df_books2 = df_books[df_books.NumberRatings >= min_book_ratings].copy()
    df_books2 = df_books2[df_books2.ISBN.isin(df_ratings2.ISBN.unique())]
    df_ratings2 = df_ratings2[df_ratings2.ISBN.isin(df_books2.ISBN.unique())]


    df_books2.reset_index(drop=True, inplace=True)
    df_ratings2.reset_index(drop=True, inplace=True)

    num_clusters = len(df_books2) // cluster_ratio

    X = df_books2.iloc[:, 9:]

    km = KMeans(n_clusters = num_clusters)
    clusters = km.fit_predict(X)
    isbns = list(df_books2.ISBN)
    cluster_df = pd.DataFrame()
    cluster_df['ISBN'] = isbns
    cluster_df['Cluster'] = clusters

    cluster_list = []
    for isbn in range(len(df_ratings2)):
        cluster_list.append(cluster_df.Cluster[cluster_df.ISBN == df_ratings2.ISBN[isbn]].iloc[0])
    df_ratings2['Cluster'] = cluster_list

    clustered_ratings = pd.DataFrame()
    clustered_ratings['User'] = df_ratings2.User
    clustered_ratings['Cluster'] = df_ratings2.Cluster
    clustered_ratings['Rating'] = df_ratings2.Rating

    cluster_dict = {}
    for user in clustered_ratings.User.unique():
        x = clustered_ratings[clustered_ratings.User == user]
        cluster_dict[user] = {}
        for cluster in x.Cluster.unique():
            cluster_dict[user][cluster] = np.mean(x.Rating[x.Cluster == cluster])

    new_ratings = []
    for rating in range(len(clustered_ratings)):
        new_ratings.append(cluster_dict[clustered_ratings.iloc[rating, 0]][clustered_ratings.iloc[rating, 1]])
    clustered_ratings2 = clustered_ratings.copy()
    clustered_ratings2['Rating'] = new_ratings

    clustered_ratings2 = clustered_ratings2.drop_duplicates(['User', 'Cluster'])
    clustered_ratings = clustered_ratings[clustered_ratings.index.isin(clustered_ratings2.index)]

    reader = Reader(rating_scale=(1,10))
    avg_rmse = 0

    for cv in range(cross_vals):
        svd = SVD(n_factors=num_factors, n_epochs=num_epochs)
        test_set = clustered_ratings.sample(frac=0.15)
        train_set = clustered_ratings2[~clustered_ratings2.index.isin(test_set.index)].dropna()
        data = Dataset.load_from_df(train_set, reader)
        x = data.build_full_trainset()
        svd.fit(x)
        RMSE = 0
        for i in range(len(test_set)):
            uid = test_set.User.iloc[i]
            iid = test_set.Cluster.iloc[i]
            r_ui = test_set.Rating.iloc[i]
            pred = svd.predict(uid, iid, verbose=False)
            RMSE += (r_ui - pred[3]) ** 2
        avg_rmse += (RMSE / len(test_set)) ** 0.5
    rmses.append(avg_rmse/cross_vals)
    
    print(len(clustered_ratings2))
    print(len(df_books2))
print(rmses)
    

13231
4928
13761
4928
13704
4928
13150
4928
13588
4928
13511
4928
12856
4928
[1.522149908291497, 1.5188350891392057, 1.4945708570572227, 1.515255047951058, 1.525059170020427, 1.5295472219978166, 1.502653668808447]


This actually doesn't seem to affect the RMSE much, so I will set this at 5 to try to keep more data.

In [49]:
# Iterate over number of clusters
start = time.time()

min_user_rating = 100
min_book_ratings = 5
#cluster_ratio = ?
cross_vals = 6
num_epochs = 15
num_factors = 50

rmses = []

for cluster_ratio in [5, 10, 25, 50, 75, 100]:

    user_ids_to_keep = []

    for user in df_ratings.User.unique():
        if len(df_ratings[df_ratings.User == user]) >= min_user_rating:
            user_ids_to_keep.append(user)

    df_ratings2 = df_ratings[df_ratings['User'].isin(user_ids_to_keep)].copy()
    df_books2 = df_books[df_books.NumberRatings >= min_book_ratings].copy()
    df_books2 = df_books2[df_books2.ISBN.isin(df_ratings2.ISBN.unique())]
    df_ratings2 = df_ratings2[df_ratings2.ISBN.isin(df_books2.ISBN.unique())]


    df_books2.reset_index(drop=True, inplace=True)
    df_ratings2.reset_index(drop=True, inplace=True)

    num_clusters = len(df_books2) // cluster_ratio

    X = df_books2.iloc[:, 9:]

    km = KMeans(n_clusters = num_clusters)
    clusters = km.fit_predict(X)
    isbns = list(df_books2.ISBN)
    cluster_df = pd.DataFrame()
    cluster_df['ISBN'] = isbns
    cluster_df['Cluster'] = clusters

    cluster_list = []
    for isbn in range(len(df_ratings2)):
        cluster_list.append(cluster_df.Cluster[cluster_df.ISBN == df_ratings2.ISBN[isbn]].iloc[0])
    df_ratings2['Cluster'] = cluster_list

    clustered_ratings = pd.DataFrame()
    clustered_ratings['User'] = df_ratings2.User
    clustered_ratings['Cluster'] = df_ratings2.Cluster
    clustered_ratings['Rating'] = df_ratings2.Rating

    cluster_dict = {}
    for user in clustered_ratings.User.unique():
        x = clustered_ratings[clustered_ratings.User == user]
        cluster_dict[user] = {}
        for cluster in x.Cluster.unique():
            cluster_dict[user][cluster] = np.mean(x.Rating[x.Cluster == cluster])

    new_ratings = []
    for rating in range(len(clustered_ratings)):
        new_ratings.append(cluster_dict[clustered_ratings.iloc[rating, 0]][clustered_ratings.iloc[rating, 1]])
    clustered_ratings2 = clustered_ratings.copy()
    clustered_ratings2['Rating'] = new_ratings

    clustered_ratings2 = clustered_ratings2.drop_duplicates(['User', 'Cluster'])
    clustered_ratings = clustered_ratings[clustered_ratings.index.isin(clustered_ratings2.index)]

    reader = Reader(rating_scale=(1,10))
    avg_rmse = 0

    for cv in range(cross_vals):
        svd = SVD(n_factors=num_factors, n_epochs=num_epochs)
        test_set = clustered_ratings.sample(frac=0.15)
        train_set = clustered_ratings2[~clustered_ratings2.index.isin(test_set.index)].dropna()
        data = Dataset.load_from_df(train_set, reader)
        x = data.build_full_trainset()
        svd.fit(x)
        RMSE = 0
        for i in range(len(test_set)):
            uid = test_set.User.iloc[i]
            iid = test_set.Cluster.iloc[i]
            r_ui = test_set.Rating.iloc[i]
            pred = svd.predict(uid, iid, verbose=False)
            RMSE += (r_ui - pred[3]) ** 2
        avg_rmse += (RMSE / len(test_set)) ** 0.5
    rmses.append(avg_rmse/cross_vals)
    
    print(num_clusters)
print(rmses)
    

2193
1096
438
219
146
109
[1.4926367773395004, 1.5120467439524823, 1.5030748174799058, 1.5014420971467368, 1.5205610450249643, 1.5396184592253543]


Doesn't affect much, but with too few clusters the RMSE starts to rise.

In [126]:
## Final Model
start = time.time()

min_user_rating = 100
min_book_ratings = 5
cluster_ratio = 20
cross_vals = 6
num_epochs = 10
num_factors = 20

user_ids_to_keep = []

for user in df_ratings.User.unique():
    if len(df_ratings[df_ratings.User == user]) >= min_user_rating:
        user_ids_to_keep.append(user)

df_ratings2 = df_ratings[df_ratings['User'].isin(user_ids_to_keep)].copy()
df_books2 = df_books[df_books.NumberRatings >= min_book_ratings].copy()
df_books2 = df_books2[df_books2.ISBN.isin(df_ratings2.ISBN.unique())]
df_ratings2 = df_ratings2[df_ratings2.ISBN.isin(df_books2.ISBN.unique())]


df_books2.reset_index(drop=True, inplace=True)
df_ratings2.reset_index(drop=True, inplace=True)

num_clusters = len(df_books2) // cluster_ratio

X = df_books2.iloc[:, 9:]

km = KMeans(n_clusters = num_clusters)
clusters = km.fit_predict(X)
isbns = list(df_books2.ISBN)
cluster_df = pd.DataFrame()
cluster_df['ISBN'] = isbns
cluster_df['Cluster'] = clusters

cluster_list = []
for isbn in range(len(df_ratings2)):
    cluster_list.append(cluster_df.Cluster[cluster_df.ISBN == df_ratings2.ISBN[isbn]].iloc[0])
df_ratings2['Cluster'] = cluster_list

clustered_ratings = pd.DataFrame()
clustered_ratings['User'] = df_ratings2.User
clustered_ratings['Cluster'] = df_ratings2.Cluster
clustered_ratings['Rating'] = df_ratings2.Rating

cluster_dict = {}
for user in clustered_ratings.User.unique():
    x = clustered_ratings[clustered_ratings.User == user]
    cluster_dict[user] = {}
    for cluster in x.Cluster.unique():
        cluster_dict[user][cluster] = np.mean(x.Rating[x.Cluster == cluster])

new_ratings = []
for rating in range(len(clustered_ratings)):
    new_ratings.append(cluster_dict[clustered_ratings.iloc[rating, 0]][clustered_ratings.iloc[rating, 1]])
clustered_ratings2 = clustered_ratings.copy()
clustered_ratings2['Rating'] = new_ratings

clustered_ratings2 = clustered_ratings2.drop_duplicates(['User', 'Cluster'])
clustered_ratings = clustered_ratings[clustered_ratings.index.isin(clustered_ratings2.index)]

reader = Reader(rating_scale=(1,10))
avg_rmse = 0

for cv in range(cross_vals):
    svd = SVD(n_factors=num_factors, n_epochs=num_epochs)
    test_set = clustered_ratings.sample(frac=0.15)
    train_set = clustered_ratings2[~clustered_ratings2.index.isin(test_set.index)].dropna()
    data = Dataset.load_from_df(train_set, reader)
    x = data.build_full_trainset()
    svd.fit(x)
    RMSE = 0
    for i in range(len(test_set)):
        uid = test_set.User.iloc[i]
        iid = test_set.Cluster.iloc[i]
        r_ui = test_set.Rating.iloc[i]
        pred = svd.predict(uid, iid, verbose=False)
        RMSE += (r_ui - pred[3]) ** 2
    avg_rmse += (RMSE / len(test_set)) ** 0.5
print(avg_rmse/cross_vals)
    

1.49835754983905


In [131]:
isbns = []
for cluster in clustered_ratings2.Cluster:
    isbns.append(df_ratings2.ISBN[df_ratings2.Cluster == cluster].iloc[0])
clustered_ratings2['ISBN'] = isbns

In [132]:
df_books2 = df_books2.iloc[:, :9]
clusters = []
for isbn in df_books2.ISBN:
    clusters.append(df_ratings2.Cluster[df_ratings2.ISBN == isbn].iloc[0])
df_books2['Cluster'] = clusters

In [133]:
clustered_ratings2.to_csv('/DataScience/Final Capstone Files/ModelRatings.csv', index=False)
df_books2.to_csv('/DataScience/Final Capstone Files/ModelBooks.csv', index=False)

<h2> Cluster Analysis

In [142]:
print('Number of books in our condensed dataset for modeling:', len(df_books2))
print('Number of clusters:', len(df_books2.Cluster.unique()))

Number of books in our condensed dataset for modeling: 10969
Number of clusters: 548


Here are some example clusters:

In [146]:
df_books2[df_books2.Cluster == 3]

Unnamed: 0,ISBN,Title,Author,Year,Publisher,Blurb,AverageRating,NumberRatings,PopularityScore,Cluster
84,3257233051,Veronika Deschliesst Zu Sterben / Vernika Deci...,Paolo Coelho,2002,Distribooks,Die Geschichte einer unglücklichen jungen Frau...,8.083333,12,1.13469,3
2353,3257232993,Liebesfluchten,Berhard Schlink,2002,Distribooks,"Flucht in die Liebe, Flucht vor der Liebe – vo...",7.6,5,-0.042973,3
4108,3499101505,Der Richter Und Sein Henker,Friedrich Duerenmatt,2000,Distribooks Inc,ist einer seiner berühmtesten Romane - die Ge...,6.857143,7,-1.49749,3
5077,3404122763,Das fÃ?Â¼nfte Evangelium.,Philipp Vandenberg,1995,LÃ?Â¼bbe,"Die Ehefrau eines Münchner Kunsthändlers, der...",6.0,5,-2.618074,3
6580,3423110066,Fabian Die Geschichte Enes Moralisten,Erich Kaestner,0,Deutscher Taschenbuch Verlag,Erich Kästner kennen viele nur als Autor von K...,7.6,5,-0.042973,3
6583,3596282225,Die Nebel von Avalon. Roman.,Marion Zimmer Bradley,2000,"Fischer (Tb.), Frankfurt","Es ist Morgaine, die Hohepriesterin des Nebelr...",7.363636,11,-0.6308,3
7732,3499230933,Adressat unbekannt.,Kathrine Kressmann Taylor,2002,Rowohlt Tb.,Der Deutsche Martin Schulse und der amerikanis...,7.166667,6,-0.82427,3
7805,3551551936,Harry Potter Und Der Feuerkelch,Joanne K. Rowling,1999,Carlsen Verlag GmbH,Bedrängt von den christlichen Königreichen Spa...,9.6,10,4.54369,3
9129,3442092981,Felidae. Roman.,Akif Pirincci,1989,Goldmann,"Francis, der samtpfotige Klugscheißer, ist neu...",8.095238,21,1.426473,3
9880,3423113456,Bewohnte Frau. Roman.,Gioconda Belli,1991,Dtv,"Lavinia, eine junge, aus wohlhabender Familie ...",7.571429,7,-0.107554,3


Books in German. This highlights a strength and a weakness of clustering using BERT vectors. It picks up on language super well...But do we want to group all books of a given langauge into the same cluster? It will help recommend books to multilingual users in the database, but just because a book is in German doesn't mean it is good. This might not matter though because we recommend the top books in each cluster. 

In [154]:
df_books2[df_books2.Cluster == 11]

Unnamed: 0,ISBN,Title,Author,Year,Publisher,Blurb,AverageRating,NumberRatings,PopularityScore,Cluster
377,0767908473,The Sorcerer's Companion: A Guide to the Magic...,ALLAN ZOLA KRONZEK,2001,Broadway,Who was the real Nicholas Flamel? How did the ...,7.909091,11,0.677142,11
378,0140622063,Scottish Folk and Fairy Tales (Penguin Popular...,Gordon Jarvie,1997,Penguin Books Ltd,"This is a collection of Scottish fairy tales, ...",8.714286,7,2.116343,11
1172,0875424961,The 21 Lessons of Merlyn: A Study in Druid Mag...,Douglas Monroe,1992,Llewellyn Publications,Book Description Publication Date: September 8...,7.4,5,-0.36486,11
1513,0875421180,Wicca: A Guide for the Solitary Practitioner,Scott Cunningham,1988,Llewellyn Publications,"Cunningham presents Wicca as it is today: a,Cu...",8.40625,32,2.701712,11
1552,0836204387,The Calvin and Hobbes Tenth Anniversary Book,Bill Watterson,1995,Andrews McMeel Publishing,"Many moons ago, the magic of , first appeared ...",9.26087,23,5.123927,11
2314,0875421849,Living Wicca: A Further Guide for the Solitary...,Scott Cunningham,1993,Llewellyn Publications,"Selling more than 200,000 copies, Living Wicca...",7.571429,14,-0.145866,11
4105,0345391373,An Incomplete Education,Judy Jones,1995,Ballantine Books,You'll find everything you forgot from school-...,8.333333,9,1.552631,11
4150,0816029091,"Encyclopedia of Gods: Over 2,500 Deities of th...",Michael Jordan,1993,Facts on File,"Since the beginning of time, the same mysterie...",6.8,5,-1.330523,11
4776,0304345350,Dictionary of Superstitions,David Pickering,1996,Sterling Pub Co Inc,An A-Z guide to the origins and meaning of sup...,6.875,8,-1.563117,11
4805,0679833692,Juniper,MONICA FURLONG,1992,Random House Books for Young Readers,"This prequel to ""Wise Child"" recounts the chil...",8.0,5,0.600802,11


This cluster has a lot of books about Wicca, magic, superstitions, Pagan, fairy tales, and a random Calvin and Hobbes book, possibly because it has the word 'magic' in the blurb.

In [157]:
df_books2[df_books2.Cluster == 14]

Unnamed: 0,ISBN,Title,Author,Year,Publisher,Blurb,AverageRating,NumberRatings,PopularityScore,Cluster
418,0312853238,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1992,Tor Books,"Andrew ""Ender"" Wiggin isn't just playing games...",8.888889,9,2.773311,14
716,0886777690,A Thousand Words for Stranger (Daw Book Collec...,Julie E. Czerneda,1997,Daw Books,Sira is on the run. The mysterious Captain Mor...,6.222222,9,-3.085954,14
763,0440418569,"The Amber Spyglass (His Dark Materials, Book 3)",PHILIP PULLMAN,2003,Yearling,"Lyra and Will, the two ordinary children whose...",7.625,8,-0.003536,14
839,0765342294,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,2002,Starscape Books,"Andrew ""Ender"" Wiggin thinks he is playing com...",9.166667,12,3.826672,14
854,0609610597,The Shelters of Stone (Earth's Children Series...,JEAN M. AUEL,2002,Crown,"Jean Auel's fifth novel about Ayla, the Cro-Ma...",7.258621,58,-1.494567,14
953,0812550706,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1994,Tor Books,"Andrew ""Ender"" Wiggin thinks he is playing com...",8.837607,117,5.766546,14
1501,0399136487,Damia (Rowan),Anne McCaffrey,1992,Putnam Pub Group,Unquestionably the most brilliant of the Gwyn-...,8.0,7,0.726407,14
1520,0441135560,Damia (Ace Science Fiction),Anne McCaffrey,1993,ACE Charter,The Rowan's daughter Damia is a handful. Aware...,7.076923,13,-1.410151,14
1691,006447335X,Year of the Griffin,Diana Wynne Jones,2001,HarperTrophy,It is eight years after the tours from offworl...,7.428571,7,-0.385542,14
1724,0812575717,Ender's Shadow,Orson Scott Card,2000,Tor Books,"A COMPANION VOLUME TO ,, ONE THAT EXPANDS AND ...",8.72,25,3.519195,14


This cluster is decidedly sci-fi.

In [161]:
df_books2[df_books2.Cluster == 18]

Unnamed: 0,ISBN,Title,Author,Year,Publisher,Blurb,AverageRating,NumberRatings,PopularityScore,Cluster
139,0451523415,Little Women (Signet Classic),Louisa May Alcott,1988,Signet Classics,"In 1868, in response to a publisher's request ...",8.538462,13,2.338621,18
408,0345356578,A Man Rides Through (Man Rides Through),Stephen Donaldson,1990,Del Rey Books,In the thrilling conclusion to THE MIRROR OF H...,8.0,10,0.859554,18
455,0345348036,The Princess Bride: S Morgenstern's Classic Ta...,WILLIAM GOLDMAN,1987,Del Rey,Once upon a time came a story so full of high ...,8.837838,74,5.212814,18
507,0345384466,The Witching Hour (Lives of the Mayfair Witches),ANNE RICE,1993,Ballantine Books,"On the veranda of a great New Orleans house, n...",7.958763,97,1.519089,18
744,0345438329,Big Stone Gap: A Novel (Ballantine Reader's Ci...,Adriana Trigiani,2001,Ballantine Books,Millions of readers around the world have fal...,7.5,92,-0.572913,18
767,0425188361,"Cerulean Sins: An Anita Blake, Vampire Hunter ...",Laurell K. Hamilton,2003,Berkley Publishing Group,"With her ""New York Times"" bestselling Anita Bl...",8.0,16,1.035006,18
927,0345350499,The Mists of Avalon,MARION ZIMMER BRADLEY,1987,Del Rey,"Here is the magical legend of King Arthur, viv...",8.638554,83,4.47122,18
1062,0553205587,Lord God Made Them All,James Herriot,1982,Bantam Doubleday Dell,The triumphant conclusion to the legendary ser...,9.083333,12,3.619596,18
1175,0141002077,Cherry: A Memoir,Mary Karr,2001,Penguin Books,"From Mary Karr comes this gorgeously written, ...",7.0,15,-1.697137,18
1591,0767900383,Under the Tuscan Sun,Frances Mayes,1997,Broadway,"An enchanting and lyrical look at the life, th...",7.256757,74,-1.592262,18


Old-school, romantic, chivalry, classics, targeting Women

In [175]:
df_books2[df_books2.Cluster == 230]

Unnamed: 0,ISBN,Title,Author,Year,Publisher,Blurb,AverageRating,NumberRatings,PopularityScore,Cluster
411,451155750,The Dead Zone,Stephen King,2004,Signet Book,"Johnny, the small boy who skated at breakneck ...",7.5,38,-0.460884,230
6725,451126661,The Dead Zone,Stephen King,1980,Signet Book,"Johnny , the small boy who skated at breakneck...",8.5,6,1.564743,230
6963,451093380,The Dead Zone,Stephen King,1980,New Amer Library,"Johnny , the small boy who skated at breakneck...",7.375,8,-0.523397,230


Unfortunately, there are some repeats in the dataset. They are considered independent and all 3 have more than 5 ratings. K-Means is pretty good at finding these and throwing them into a cluster together, which overcomes needing to manually go through and combine these. In a future version of this project it would probably be safer to correct this.