For Project 4, I chose to use a dataset from Kaggle containing user ratings for different animes.  This dataset has two files - one file contains information on the anime, such as name, genre, type, episodes, rating, and number of members who follow the anime on the source website, while the other file contains user ratings for the anime.  A rating of -1 indicates that the anime was viewed but not rated.  The purpose of this assignment is to create two different recommender systems and assess them based on various metrics.  I do this via the following steps:

1.  Data Pre-Processing: I import data from Kaggle.  To determine cosine similarity between items, I will use tf-idf on the information in the anime.csv file.  I will need to pre-process the data to convert numerical data into strings that can be assessed with tf-idf.  Given the size of the data, I will also randomly select a subset to create this recommender.  I will split this subset further into a training and testing set.
2.  Global Baseline: As in previous assignments, I will create a global baseline that can be used as a baseline of comparison for the recommender systems.
3.  As mentioned in step 1, I will use tf-idf to determine cosine similarity between items.  I will use this in the assessment metrics, but also to develop a content-based recommender system.
4.  I will use TruncatedSVD to create a collaborative filtering recommender system.
5.  All recommenders will be assessed with RMSE (after recommender development) and with Diversity Score (in a separate section).  I will also develop an improvement for this recommender system with the goal of improving the diversity metric.

# Data Pre-Processing

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as KNN
import matplotlib.pyplot as plt
import kagglehub
import gc
# Authenticate
# kagglehub.login()
path = kagglehub.dataset_download("CooperUnion/anime-recommendations-database")

In [2]:
import os
path_anime = os.path.join(path, 'anime.csv') # os.path.join keeps the paths portable across operating systems
path_ratings = os.path.join(path, 'rating.csv')

In [3]:
anime = pd.read_csv(path_anime, header = 0)
anime

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1,4.15,211
12290,5543,Under World,Hentai,OVA,1,4.28,183
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4,4.88,219
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1,4.98,175


The following is some minor exploratory data analysis on the members and episodes columns of the anime data, which I use to add tags to each anime.  Anime in the 4th quartile of members are tagged "popular", and anime in the 1st quartile of members are tagged "niche".  The episode quartiles are computed from TV anime only: anime with more episodes than the 3rd-quartile value are tagged "Long", and anime with fewer episodes than the 1st-quartile value are tagged "Short".  As the output below shows, the code applies these episode thresholds to every type, so a one-episode movie is also tagged "Short".  The tags are stored in a "combined" column, which will later include genre and type.
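
As a compact illustration of the tagging rule described above, the sketch below applies quartile thresholds to a small made-up table. The toy values and the helper name `tag_by_quartile` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def tag_by_quartile(values, low_tag, high_tag):
    """Label values below the 1st quartile or above the 3rd quartile; others get ''."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    labels = np.select([values > q3, values < q1], [high_tag, low_tag], default="")
    return pd.Series(labels, index=values.index)

# Made-up member counts, not taken from the Kaggle data.
toy = pd.DataFrame({"members": [10, 200, 1500, 9000, 500000]})
toy["tag"] = tag_by_quartile(toy["members"], "niche", "popular")
print(toy)
```

Only values strictly outside the quartile boundaries receive a tag, which mirrors the strict `<` and `>` comparisons used in the notebook cells.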

In [4]:
q1_members = anime['members'].quantile(0.25)
q2_members = anime['members'].quantile(0.50)
q3_members = anime['members'].quantile(0.75)
print(f"{q1_members},{q2_members},{q3_members}")
max_episodes = pd.to_numeric(anime.loc[anime['type'] == 'TV']['episodes'], errors='coerce').max()
q1_episodes = pd.to_numeric(anime.loc[anime['type'] == 'TV']['episodes'], errors='coerce').quantile(0.25)
q2_episodes = pd.to_numeric(anime.loc[anime['type'] == 'TV']['episodes'], errors='coerce').quantile(0.50)
q3_episodes = pd.to_numeric(anime.loc[anime['type'] == 'TV']['episodes'], errors='coerce').quantile(0.75)
print(f"{q1_episodes},{q2_episodes},{q3_episodes}")
print(f"max episodes: {max_episodes}")

225.0,1550.0,9437.0
12.0,24.0,39.0
max episodes: 1818.0


In [5]:
anime['episodes'] = pd.to_numeric(anime['episodes'], errors='coerce')
anime['combined'] = anime.apply(
    lambda row: ('Long' if isinstance(row['episodes'], (int, float)) and row['episodes'] > q3_episodes else 
    'Short' if isinstance(row['episodes'], (int, float)) and row['episodes'] < q1_episodes else
    ''), 
    axis=1
)
anime.loc[anime['members'] > q3_members, 'combined'] = anime.loc[anime['members'] > q3_members, 'combined'] + ' popular'
anime.loc[anime['members'] < q1_members, 'combined'] = anime.loc[anime['members'] < q1_members, 'combined'] + ' niche'

anime

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,combined
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1.0,9.37,200630,Short popular
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.26,793665,Long popular
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.25,114262,Long popular
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.17,673572,popular
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.16,151266,Long popular
...,...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1.0,4.15,211,Short niche
12290,5543,Under World,Hentai,OVA,1.0,4.28,183,Short niche
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4.0,4.88,219,Short niche
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1.0,4.98,175,Short niche


For the sake of this recommender system, ratings of -1 are excluded as they are not true ratings. 

In [6]:
rating = pd.read_csv(path_ratings, header = 0)
rating = rating.loc[rating['rating'] != -1]
rating

Unnamed: 0,user_id,anime_id,rating
47,1,8074,10
81,1,11617,10
83,1,11757,10
101,1,15451,10
153,2,11771,10
...,...,...,...
7813732,73515,16512,7
7813733,73515,17187,9
7813734,73515,22145,10
7813735,73516,790,9


Given the size of this dataset, I chose to select 20% of the users represented in the ratings file to use for this exercise.

In [7]:
random_selection = pd.DataFrame(rating['user_id'].unique()).sample(frac = .2, random_state = 63)
new_rating = rating[rating['user_id'].isin(random_selection[0])]
new_rating

Unnamed: 0,user_id,anime_id,rating
1149,8,269,9
1150,8,355,9
1151,8,6702,10
1152,8,7088,5
1153,8,7593,9
...,...,...,...
7813500,73512,552,7
7813501,73512,656,8
7813502,73512,790,8
7813503,73512,853,10


From the ratings of these users, I sample a further 20% of the rows and split that sample into a training set (80%) and a test set (20%).

In [8]:
df_random = new_rating.sample(frac = .2, random_state = 63) # for the sake of this exercise, going to use only 20% of the dataset due to size
split_size = int(0.8*len(df_random)) # designate split size (80%)
train_df = df_random[:split_size] # split dataset into 80% train and 20% test
test_df = df_random[split_size:]
train_df = pd.DataFrame(train_df)
test_df = pd.DataFrame(test_df)

# Global Baseline

The process for this global baseline is the same as in previous projects.  I take the mean rating across all ratings in the training set.  I pivot the user-anime ratings from long to wide to create a user-anime rating matrix.  I find the user bias for each user by taking the average rating for that user and subtracting the average rating across all ratings in the training set.  I find the anime bias for each anime by taking the average rating for that anime and subtracting the average rating across all ratings in the training set.
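
The baseline described above can be sketched on a tiny made-up matrix. This is a minimal illustration under assumed toy values, not the notebook's own cells; the names `ratings`, `overall_mean`, and `baseline` are chosen only for this example.

```python
import numpy as np
import pandas as pd

# Toy user-by-anime rating matrix; NaN marks an unrated pair (values are made up).
ratings = pd.DataFrame(
    [[8.0, np.nan, 6.0],
     [10.0, 9.0, np.nan],
     [np.nan, 7.0, 5.0]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30], name="anime_id"),
)

overall_mean = np.nanmean(ratings.to_numpy())     # mean over observed ratings only
user_bias = ratings.mean(axis=1) - overall_mean   # one value per user (row)
anime_bias = ratings.mean(axis=0) - overall_mean  # one value per anime (column)

# Broadcast a column vector of user biases against a row vector of anime biases.
values = overall_mean + user_bias.to_numpy()[:, None] + anime_bias.to_numpy()[None, :]
baseline = pd.DataFrame(values, index=ratings.index, columns=ratings.columns)
baseline = baseline.clip(lower=1, upper=10)       # keep predictions on the 1-10 scale
print(baseline.round(2))
```

Building the frame directly from the pivoted matrix's index and columns avoids hard-coding the matrix dimensions.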

In [9]:
#train_df['rating'] = pd.to_numeric(train_df['rating'])
mean = np.nanmean(train_df['rating'].loc[train_df['rating'] != -1]) 
std_dev = np.std(train_df['rating'].loc[train_df['rating'] != -1])
print(mean) # This is the average rating across all user-anime ratings
print(std_dev) # This is the standard deviation across all user-anime ratings

7.791675917292009
1.5919233619333273


In [10]:
# Some user-anime pairs appear to have duplicate ratings; the mean aggregate function is used in these cases
train_df = train_df.pivot_table(index='user_id', columns='anime_id', values='rating',aggfunc='mean') 
train_df

anime_id,1,5,6,7,8,15,16,17,18,19,...,33750,33798,33902,33964,33979,34085,34103,34107,34238,34240
user_id
8,,,,,,,,,,,...,,,,,,,,,,
17,,,7.0,,,,,,,10.0,...,,,,,,,,,,
19,,,,,,,,,,,...,,,,,,,,,,
28,,,,,,,,,,,...,,,,,,,,,,
34,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73497,,,,,,,,,,,...,,,,,,,,,,
73500,,,,,,,,,,,...,,,,,,,,,,
73506,,,,,,,,,,,...,,,,,,,,,,
73511,,,,,,,,,,,...,,,,,,,,,,


In [11]:
test_df = test_df.pivot_table(index='user_id', columns='anime_id', values='rating',aggfunc='mean') 
test_df

anime_id,1,5,6,7,8,15,16,17,18,19,...,33514,33524,33558,33569,33709,33740,33964,34085,34103,34240
user_id
17,,,,,,,,,,,...,,,,,,,,,,
19,,,,,,,,,,,...,,,,,,,,,,
34,,,,,,,,,,,...,,,,,,,,,,
44,,,,,,,,,,,...,,,,,,,,,,
45,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73449,,,,,,,,,,,...,,,,,,,,,,
73485,,,,,,,,,,,...,,,,,,,,,,
73490,,,,,,,,,,,...,,,,,,,,,,
73497,,,,,,,,,,,...,,,,,,,,,,


In [12]:
user_bias = train_df.mean(axis=1) - mean # find user bias by taking mean rating for each user
user_bias = user_bias.fillna(0).reset_index() # I reset index to use the user_id column to set the index of global baseline later on
print(user_bias)
anime_bias = train_df.mean() - mean # find anime bias by taking mean rating for each anime
anime_bias = anime_bias.fillna(0).reset_index() # I reset index to use the anime_id column to set the columns of global baseline later on
print(anime_bias)

       user_id         0
0            8  0.208324
1           17 -0.826764
2           19  0.708324
3           28  1.708324
4           34  0.208324
...        ...       ...
12120    73497  0.041657
12121    73500  0.328324
12122    73506  1.708324
12123    73511  1.208324
12124    73512  1.208324

[12125 rows x 2 columns]
      anime_id         0
0            1  1.039858
1            5  0.721277
2            6  0.641657
3            7 -0.401845
4            8 -0.791676
...        ...       ...
6680     34085 -1.791676
6681     34103 -0.125009
6682     34107  1.208324
6683     34238  1.208324
6684     34240  0.855383

[6685 rows x 2 columns]


In [13]:
global_baseline = pd.DataFrame(np.ones((73513,6685))*mean) # create a matrix of means
global_baseline = global_baseline[global_baseline.index.isin(user_bias['user_id'])] # setting the global baseline index to the users in the training set
user_bias = user_bias.set_index('user_id') # set user_id back as the index of user_bias
global_baseline = global_baseline.transpose() # transpose the matrix so that the same process above can be done with anime_id for the columns
global_baseline = global_baseline.reindex(anime_bias['anime_id'])
#global_baseline = global_baseline.drop(columns=global_baseline.columns.difference(anime_bias['anime_id']))
global_baseline = global_baseline.transpose() # transpose the matrix back so that users are rows and anime are columns
anime_bias = anime_bias.set_index('anime_id') # set anime_id back as the index of anime_bias
global_baseline = global_baseline.fillna(mean) # fill NA values with the mean as the above restructuring reintroduced NA values

In [14]:
# This is an alternative way of doing the above, but for the test global baseline (which is still based on training data)
test_df_T = test_df.transpose()
global_baseline_test = pd.DataFrame(np.ones((9652,4779))*mean)
global_baseline_test = global_baseline_test.set_index(test_df.index)
global_baseline_test = global_baseline_test.transpose()
global_baseline_test = global_baseline_test.set_index(test_df_T.index)
global_baseline_test = global_baseline_test.transpose()
global_baseline_test

anime_id,1,5,6,7,8,15,16,17,18,19,...,33514,33524,33558,33569,33709,33740,33964,34085,34103,34240
user_id
17,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676
19,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676
34,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676
44,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676
45,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73449,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676
73485,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676
73490,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676
73497,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,...,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676,7.791676


In [15]:
print(global_baseline_test.shape)
print(user_bias.shape)  
user_bias_test = user_bias[user_bias.index.isin(global_baseline_test.index)]
user_bias_test

(9652, 4779)
(12125, 1)


user_id,0
17,-0.826764
19,0.708324
34,0.208324
44,-0.253214
45,0.636896
...,...
73449,0.258324
73485,-0.613104
73490,0.494038
73497,0.041657


In [16]:
insert_missing_users = global_baseline_test[~global_baseline_test.index.isin(user_bias_test.index)].index
missing_rows = np.zeros((len(insert_missing_users),1))
missing_rows = pd.DataFrame(missing_rows)
missing_rows = missing_rows.set_index(insert_missing_users)
user_bias_test = pd.concat([user_bias_test, missing_rows]).sort_index()

In [17]:
print(anime_bias.shape)
global_baseline_test_T = global_baseline_test.transpose()
anime_bias_test = anime_bias[anime_bias.index.isin(global_baseline_test_T.index)]
insert_missing_animes = global_baseline_test_T[~global_baseline_test_T.index.isin(anime_bias_test.index)].index
missing_rows = np.zeros((len(insert_missing_animes),1))
missing_rows = pd.DataFrame(missing_rows)
missing_rows = missing_rows.set_index(insert_missing_animes)
anime_bias_test = pd.concat([anime_bias_test, missing_rows]).sort_index()

(6685, 1)


In [18]:
# Checking shape for creating global baseline - user_bias rows should match global_baseline rows and anime_bias rows should match global_baseline cols
print(global_baseline.shape)  
print(user_bias.shape)  
print(anime_bias.shape)  
print(global_baseline_test.shape)  
print(user_bias_test.shape)  
print(anime_bias_test.shape)  

(12125, 6685)
(12125, 1)
(6685, 1)
(9652, 4779)
(9652, 1)
(4779, 1)


In [19]:
global_baseline = global_baseline + user_bias.values.reshape(-1, 1) # add user_bias to global baseline

In [20]:
global_baseline = global_baseline + anime_bias.values.reshape(1, -1) # add anime_bias to global_baseline

In [21]:
global_baseline[global_baseline > 10] = 10 # no ratings above 10
global_baseline[global_baseline < 1] = 1 # no ratings below 1
global_baseline

anime_id,1,5,6,7,8,15,16,17,18,19,...,33750,33798,33902,33964,33979,34085,34103,34107,34238,34240
8,9.039858,8.721277,8.641657,7.598155,7.208324,8.471482,8.661932,7.571960,8.514447,9.174991,...,6.208324,8.208324,8.208324,5.922610,6.208324,6.208324,7.874991,9.208324,9.208324,8.855383
17,8.004770,7.686190,7.606570,6.563067,6.173236,7.436394,7.626845,6.536873,7.479359,8.139903,...,5.173236,7.173236,7.173236,4.887522,5.173236,5.173236,6.839903,8.173236,8.173236,7.820295
19,9.539858,9.221277,9.141657,8.098155,7.708324,8.971482,9.161932,8.071960,9.014447,9.674991,...,6.708324,8.708324,8.708324,6.422610,6.708324,6.708324,8.374991,9.708324,9.708324,9.355383
28,10.000000,10.000000,10.000000,9.098155,8.708324,9.971482,10.000000,9.071960,10.000000,10.000000,...,7.708324,9.708324,9.708324,7.422610,7.708324,7.708324,9.374991,10.000000,10.000000,10.000000
34,9.039858,8.721277,8.641657,7.598155,7.208324,8.471482,8.661932,7.571960,8.514447,9.174991,...,6.208324,8.208324,8.208324,5.922610,6.208324,6.208324,7.874991,9.208324,9.208324,8.855383
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73497,8.873191,8.554611,8.474991,7.431488,7.041657,8.304815,8.495266,7.405294,8.347780,9.008324,...,6.041657,8.041657,8.041657,5.755943,6.041657,6.041657,7.708324,9.041657,9.041657,8.688716
73500,9.159858,8.841277,8.761657,7.718155,7.328324,8.591482,8.781932,7.691960,8.634447,9.294991,...,6.328324,8.328324,8.328324,6.042610,6.328324,6.328324,7.994991,9.328324,9.328324,8.975383
73506,10.000000,10.000000,10.000000,9.098155,8.708324,9.971482,10.000000,9.071960,10.000000,10.000000,...,7.708324,9.708324,9.708324,7.422610,7.708324,7.708324,9.374991,10.000000,10.000000,10.000000
73511,10.000000,9.721277,9.641657,8.598155,8.208324,9.471482,9.661932,8.571960,9.514447,10.000000,...,7.208324,9.208324,9.208324,6.922610,7.208324,7.208324,8.874991,10.000000,10.000000,9.855383


In [22]:
global_baseline_test = global_baseline_test + user_bias_test.values.reshape(-1, 1)
global_baseline_test = global_baseline_test + anime_bias_test.values.reshape(1, -1) # add anime_bias to global_baseline
global_baseline_test[global_baseline_test > 10] = 10 # no ratings above 10
global_baseline_test[global_baseline_test < 1] = 1 # no ratings below 1
global_baseline_test

anime_id,1,5,6,7,8,15,16,17,18,19,...,33514,33524,33558,33569,33709,33740,33964,34085,34103,34240
user_id
17,8.004770,7.686190,7.606570,6.563067,6.173236,7.436394,7.626845,6.536873,7.479359,8.139903,...,5.173236,6.339903,6.573236,6.339903,6.964912,5.173236,4.887522,5.173236,6.839903,7.820295
19,9.539858,9.221277,9.141657,8.098155,7.708324,8.971482,9.161932,8.071960,9.014447,9.674991,...,6.708324,7.874991,8.108324,7.874991,8.500000,6.708324,6.422610,6.708324,8.374991,9.355383
34,9.039858,8.721277,8.641657,7.598155,7.208324,8.471482,8.661932,7.571960,8.514447,9.174991,...,6.208324,7.374991,7.608324,7.374991,8.000000,6.208324,5.922610,6.208324,7.874991,8.855383
44,8.578319,8.259739,8.180119,7.136616,6.746786,8.009944,8.200394,7.110422,8.052908,8.713452,...,5.746786,6.913452,7.146786,6.913452,7.538462,5.746786,5.461071,5.746786,7.413452,8.393844
45,9.468429,9.149849,9.070229,8.026726,7.636896,8.900053,9.090504,8.000532,8.943018,9.603562,...,6.636896,7.803562,8.036896,7.803562,8.428571,6.636896,6.351181,6.636896,8.303562,9.283954
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73449,9.089858,8.771277,8.691657,7.648155,7.258324,8.521482,8.711932,7.621960,8.564447,9.224991,...,6.258324,7.424991,7.658324,7.424991,8.050000,6.258324,5.972610,6.258324,7.924991,8.905383
73485,8.218429,7.899849,7.820229,6.776726,6.386896,7.650053,7.840504,6.750532,7.693018,8.353562,...,5.386896,6.553562,6.786896,6.553562,7.178571,5.386896,5.101181,5.386896,7.053562,8.033954
73490,9.325572,9.006992,8.927372,7.883869,7.494038,8.757196,8.947647,7.857675,8.800161,9.460705,...,6.494038,7.660705,7.894038,7.660705,8.285714,6.494038,6.208324,6.494038,8.160705,9.141097
73497,8.873191,8.554611,8.474991,7.431488,7.041657,8.304815,8.495266,7.405294,8.347780,9.008324,...,6.041657,7.208324,7.441657,7.208324,7.833333,6.041657,5.755943,6.041657,7.708324,8.688716


I calculate the RMSE - the square root of the mean of squared errors over the observed ratings - for the global baseline.  I will later calculate and explain the diversity metric, which requires the item cosine similarity matrix.
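
As a small worked example of this metric, the sketch below computes RMSE while skipping unrated pairs. The helper name `rmse` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error over pairs where the actual rating is observed (not NaN)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    squared_error = (actual - predicted) ** 2   # NaN wherever the actual rating is missing
    return float(np.sqrt(np.nanmean(squared_error)))

# Made-up ratings: two observed pairs with errors of 1 and 2.
actual = np.array([[8.0, np.nan], [np.nan, 6.0]])
predicted = np.array([[7.0, 9.0], [5.0, 8.0]])
print(rmse(actual, predicted))
```

Here the squared errors are 1 and 4, so the result is the square root of their mean, 2.5.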

In [23]:
global_baseline_rmse_train = (((train_df - global_baseline) ** 2).to_numpy())
global_baseline_rmse_train = np.sqrt(np.nanmean(global_baseline_rmse_train))
print(global_baseline_rmse_train) # this is the baseline RMSE on the training set

1.185112871387914


In [24]:
global_baseline_rmse_test = (((test_df - global_baseline_test) ** 2).to_numpy())
global_baseline_rmse_test = np.sqrt(np.nanmean(global_baseline_rmse_test))
print(global_baseline_rmse_test) # this is the baseline RMSE on the test set

1.3150473119177426


# Content Based Recommendation

In this section, I use tf-idf to assess cosine similarity based on genre, type, and the additional features added in the pre-processing step based on episodes and members.  I add all of these features into the anime "combined" column and use TfidfVectorizer on this column.

In [25]:
anime['genre'] = anime['genre'].fillna('') # replace genre with blank if NA

In [26]:
anime['type'] = anime['type'].fillna(' ') # replace type with blank if NA
anime['combined'] = anime['combined'] + ' ' + anime['genre'] + ' ' + anime['type']
print(anime.loc[anime['combined'].isna()])

Empty DataFrame
Columns: [anime_id, name, genre, type, episodes, rating, members, combined]
Index: []


In [27]:
anime

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members,combined
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1.0,9.37,200630,"Short popular Drama, Romance, School, Supernat..."
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64.0,9.26,793665,"Long popular Action, Adventure, Drama, Fantasy..."
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.25,114262,"Long popular Action, Comedy, Historical, Parod..."
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24.0,9.17,673572,"popular Sci-Fi, Thriller TV"
4,9969,Gintama',"Action, Comedy, Historical, Parody, Samurai, S...",TV,51.0,9.16,151266,"Long popular Action, Comedy, Historical, Parod..."
...,...,...,...,...,...,...,...,...
12289,9316,Toushindai My Lover: Minami tai Mecha-Minami,Hentai,OVA,1.0,4.15,211,Short niche Hentai OVA
12290,5543,Under World,Hentai,OVA,1.0,4.28,183,Short niche Hentai OVA
12291,5621,Violence Gekiga David no Hoshi,Hentai,OVA,4.0,4.88,219,Short niche Hentai OVA
12292,6133,Violence Gekiga Shin David no Hoshi: Inma Dens...,Hentai,OVA,1.0,4.98,175,Short niche Hentai OVA


In [28]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(anime['combined'])
print(vectorizer.get_feature_names_out()) # extracted features (words) from the anime "combined" column
print(vectorizer.vocabulary_) # full mapping of words to index
print(X.shape) # shape of tfidf matrix
print(X) # for each anime (indexed by the first number in coordinate), shows the word (indexed by the second number in the coordinate;
# can cross reference with vocabulary_ from above) and the tf-idf score of the word

['action' 'adventure' 'ai' 'arts' 'cars' 'comedy' 'dementia' 'demons'
 'drama' 'ecchi' 'fantasy' 'fi' 'game' 'harem' 'hentai' 'historical'
 'horror' 'josei' 'kids' 'life' 'long' 'magic' 'martial' 'mecha'
 'military' 'movie' 'music' 'mystery' 'niche' 'of' 'ona' 'ova' 'parody'
 'police' 'popular' 'power' 'psychological' 'romance' 'samurai' 'school'
 'sci' 'seinen' 'short' 'shoujo' 'shounen' 'slice' 'space' 'special'
 'sports' 'super' 'supernatural' 'thriller' 'tv' 'vampire' 'yaoi' 'yuri']
{'short': 42, 'popular': 34, 'drama': 8, 'romance': 37, 'school': 39, 'supernatural': 50, 'movie': 25, 'long': 20, 'action': 0, 'adventure': 1, 'fantasy': 10, 'magic': 21, 'military': 24, 'shounen': 44, 'tv': 52, 'comedy': 5, 'historical': 15, 'parody': 32, 'samurai': 38, 'sci': 40, 'fi': 11, 'thriller': 51, 'sports': 48, 'super': 49, 'power': 35, 'space': 46, 'ova': 31, 'slice': 45, 'of': 29, 'life': 19, 'mecha': 23, 'music': 26, 'mystery': 27, 'seinen': 41, 'martial': 22, 'arts': 3, 'vampire': 53, 'sh
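
The vocabulary printed above contains fragments such as 'sci', 'fi', 'slice', 'of', and 'life', because the default tokenizer splits text on every non-word character. The sketch below contrasts that default with a comma-based tokenizer on a few made-up genre strings. The function `split_on_commas` is a hypothetical helper and is not used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up genre strings in the same comma-separated style as anime.csv.
genres = ["Sci-Fi, Thriller", "Slice of Life, Comedy", "Super Power, Action"]

# Default tokenization: multi-word and hyphenated genres are broken apart.
default_vec = TfidfVectorizer()
default_vec.fit(genres)
print(sorted(default_vec.vocabulary_))

def split_on_commas(text):
    """Hypothetical tokenizer: treat each comma-separated genre as one token."""
    return [token.strip() for token in text.split(",") if token.strip()]

# token_pattern=None because the custom tokenizer replaces the default pattern.
genre_vec = TfidfVectorizer(tokenizer=split_on_commas, token_pattern=None)
genre_vec.fit(genres)
print(sorted(genre_vec.vocabulary_))
```

Applying this idea to the "combined" column would also require joining the extra tags and the type with commas rather than spaces, so it is offered only as one possible refinement.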

In [29]:
custom_index = anime['anime_id']
cosine_X = pd.DataFrame(X.toarray(), index=custom_index,columns=vectorizer.get_feature_names_out())
cosine_X = cosine_X.sort_values(by='anime_id')

I can now calculate cosine_similarity based on the above. I also drop the anime from the cosine similarity matrix that do not exist in the training data, and vice versa.  I create a cosine_sim_df_train and cosine_sim_df_test to match the matrix dimensions of the training and testing data in order to later perform the dot product operation.
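
The alignment step can be illustrated with an index intersection, which keeps only the anime present in both the similarity matrix and the ratings, in one consistent order. This is a sketch on made-up values; `sim`, `rated_anime`, and `sim_aligned` are names assumed for this example.

```python
import pandas as pd

# Toy item-item similarity matrix over anime 1, 5, and 6 (made-up values).
sim = pd.DataFrame(
    [[1.0, 0.7, 0.2],
     [0.7, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=[1, 5, 6],
    columns=[1, 5, 6],
)
rated_anime = pd.Index([5, 6, 99])  # anime 99 has ratings but no metadata row

# Keep only anime that appear in both places, for rows and columns alike.
shared = sim.index.intersection(rated_anime)
sim_aligned = sim.loc[shared, shared]
print(sim_aligned)
```

The same `shared` index could then be used to select the matching columns of the rating matrix, so that both operands of the dot product line up.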

In [30]:
cosine_sim_df = cosine_similarity(cosine_X)
cosine_sim_df = pd.DataFrame(cosine_sim_df, index=cosine_X.index, columns=cosine_X.index)

In [31]:
cosine_sim_df_train =  cosine_sim_df[cosine_sim_df.index.isin(train_df.columns.tolist())] # only keep the anime that exist in the training data
# drop similarity-matrix columns for any anime that does not exist in the training data
cosine_sim_df_train = cosine_sim_df_train.drop(columns=cosine_sim_df_train.columns.difference(train_df.columns.tolist()).difference(['anime_id'])) 
cosine_sim_df_train 

anime_id,1,5,6,7,8,15,16,17,18,19,...,33750,33798,33902,33964,33979,34085,34103,34107,34238,34240
anime_id
1,1.000000,0.724109,0.711771,0.297966,0.271988,0.315455,0.308793,0.183792,0.296132,0.165662,...,0.000000,0.000000,0.171856,0.350402,0.000000,0.353976,0.151215,0.000000,0.079170,0.287368
5,0.724109,1.000000,0.502693,0.424785,0.082376,0.169206,0.175569,0.067929,0.221905,0.263359,...,0.049359,0.034245,0.198987,0.357958,0.049359,0.350862,0.375765,0.049359,0.036491,0.291679
6,0.711771,0.502693,1.000000,0.283296,0.228037,0.443197,0.279860,0.258218,0.281553,0.132580,...,0.000000,0.000000,0.000000,0.294740,0.000000,0.364071,0.212448,0.000000,0.111229,0.403736
7,0.297966,0.424785,0.283296,1.000000,0.302630,0.220793,0.220297,0.115785,0.256174,0.487817,...,0.000000,0.000000,0.148667,0.303121,0.000000,0.145199,0.327994,0.000000,0.000000,0.066869
8,0.271988,0.082376,0.228037,0.302630,1.000000,0.547220,0.159757,0.453854,0.139545,0.234225,...,0.000000,0.000000,0.000000,0.111833,0.000000,0.000000,0.080609,0.000000,0.000000,0.085130
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34085,0.353976,0.350862,0.364071,0.145199,0.000000,0.079940,0.093349,0.000000,0.144306,0.060726,...,0.045196,0.031357,0.182205,0.241276,0.045196,1.000000,0.087583,0.045196,0.033413,0.820202
34103,0.151215,0.375765,0.212448,0.327994,0.080609,0.165576,0.072043,0.066472,0.130006,0.563874,...,0.048300,0.189024,0.215966,0.367664,0.048300,0.087583,1.000000,0.048300,0.201423,0.094208
34107,0.000000,0.049359,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.000000,0.071680,0.081897,0.067009,1.000000,0.045196,0.048300,1.000000,0.076382,0.051009
34238,0.079170,0.036491,0.111229,0.000000,0.000000,0.086689,0.077924,0.071898,0.000000,0.000000,...,0.076382,0.298924,0.341531,0.279443,0.076382,0.033413,0.201423,0.076382,1.000000,0.037711


In [32]:
cosine_sim_df_test = cosine_sim_df[cosine_sim_df.index.isin(test_df.columns.tolist())]
cosine_sim_df_test = cosine_sim_df_test.drop(columns=cosine_sim_df_test.columns.difference(test_df.columns.tolist()).difference(['anime_id']))
cosine_sim_df_test

anime_id,1,5,6,7,8,15,16,17,18,19,...,33514,33524,33558,33569,33709,33740,33964,34085,34103,34240
anime_id
1,1.000000,0.724109,0.711771,0.297966,0.271988,0.315455,0.308793,0.183792,0.296132,0.165662,...,0.000000,0.163902,0.256123,0.345618,0.084612,0.000000,0.350402,0.353976,0.151215,0.287368
5,0.724109,1.000000,0.502693,0.424785,0.082376,0.169206,0.175569,0.067929,0.221905,0.263359,...,0.049359,0.121222,0.277330,0.110008,0.103731,0.028501,0.357958,0.350862,0.375765,0.291679
6,0.711771,0.502693,1.000000,0.283296,0.228037,0.443197,0.279860,0.258218,0.281553,0.132580,...,0.000000,0.230274,0.359838,0.279795,0.118875,0.000000,0.294740,0.364071,0.212448,0.403736
7,0.297966,0.424785,0.283296,1.000000,0.302630,0.220793,0.220297,0.115785,0.256174,0.487817,...,0.000000,0.084198,0.221563,0.102305,0.389048,0.000000,0.303121,0.145199,0.327994,0.066869
8,0.271988,0.082376,0.228037,0.302630,1.000000,0.547220,0.159757,0.453854,0.139545,0.234225,...,0.000000,0.107190,0.307768,0.454891,0.294754,0.150591,0.111833,0.000000,0.080609,0.085130
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33740,0.000000,0.028501,0.000000,0.000000,0.150591,0.149726,0.000000,0.124180,0.000000,0.000000,...,0.059657,0.209195,0.266461,0.209122,0.170610,1.000000,0.218256,0.026097,0.157319,0.029454
33964,0.350402,0.357958,0.294740,0.303121,0.111833,0.229712,0.238351,0.092220,0.301255,0.155053,...,0.067009,0.357888,0.622738,0.384238,0.298486,0.218256,1.000000,0.241276,0.367664,0.130700
34085,0.353976,0.350862,0.364071,0.145199,0.000000,0.079940,0.093349,0.000000,0.144306,0.060726,...,0.045196,0.028096,0.148345,0.000000,0.094982,0.026097,0.241276,1.000000,0.087583,0.820202
34103,0.151215,0.375765,0.212448,0.327994,0.080609,0.165576,0.072043,0.066472,0.130006,0.563874,...,0.048300,0.257965,0.448869,0.276958,0.215148,0.157319,0.367664,0.087583,1.000000,0.094208


For the content based recommender, I center the data by subtracting the global baseline.  I do this for both the training and testing data.  Then, NAs are filled with 0.  Columns in the training and testing data that do not exist in the cosine similarity matrix are dropped.

In [33]:
train_df_no_na = train_df - global_baseline
train_df_no_na = train_df_no_na.drop(columns=train_df_no_na.columns.difference(cosine_sim_df_train.columns.tolist()))
train_df_no_na = train_df_no_na.fillna(0)

In [34]:
test_df_no_na = test_df - global_baseline_test
test_df_no_na = test_df_no_na.drop(columns=test_df_no_na.columns.difference(cosine_sim_df_test.columns.tolist()))
test_df_no_na = test_df_no_na.fillna(0)

A matrix of predicted ratings is created by taking the dot product of the centered ratings and the cosine similarity matrix.  The result is then normalized: each anime's column of predictions is divided by the sum of the absolute cosine similarities for that anime.

In [35]:
prediction_train = train_df_no_na.dot(cosine_sim_df_train)
sim_sums = np.abs(cosine_sim_df_train).sum(axis=1)
prediction_train = prediction_train.div(sim_sums, axis=1)

In [36]:
prediction_test = test_df_no_na.dot(cosine_sim_df_test)
sim_sums_test = np.abs(cosine_sim_df_test).sum(axis=1)
prediction_test = prediction_test.div(sim_sums_test, axis=1)
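
As a minimal sketch of the prediction step above, the toy example below applies the same dot product and normalization to made-up data.  The names `sim`, `centered`, and `predicted_offsets`, and all of the values, are illustrative assumptions and do not come from the dataset.

```python
import pandas as pd

# Hypothetical item-item cosine similarities (symmetric, 1.0 on the diagonal).
items = ["a", "b", "c"]
sim = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0]],
    index=items, columns=items,
)

# Hypothetical centered ratings (rating minus baseline); 0 marks an unrated item.
centered = pd.DataFrame(
    [[1.0, 0.0, 0.0],
     [0.0, -1.0, 1.0]],
    index=["u1", "u2"], columns=items,
)

# Each prediction is a similarity-weighted sum of the user's centered ratings.
weighted = centered.dot(sim)

# Normalize each item's column by the total absolute similarity to that item.
sim_sums = sim.abs().sum(axis=0)
predicted_offsets = weighted.div(sim_sums, axis=1)
print(predicted_offsets.round(3))
```

Because the divisor sums the similarities over every item rather than only the items a user has rated, the resulting offsets shrink toward zero for users with few ratings, which may be one reason the predictions stay close to the baseline.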

The global baselines are added back to the predictions to obtain the final predicted rating matrices.  Any values higher than 10 are clipped to a rating of 10, and any values lower than 1 are clipped to a rating of 1.
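
The add-back and clipping steps can be sketched in one expression with `DataFrame.clip`.  This is an illustrative alternative to boolean-mask assignment, and the `offsets` and `baseline` frames below hold made-up values rather than results from the dataset.

```python
import pandas as pd

# Hypothetical predicted offsets and baselines for two users and two items.
offsets = pd.DataFrame([[0.4, -0.2], [-0.3, 0.1]], index=["u1", "u2"], columns=["a", "b"])
baseline = pd.DataFrame([[9.8, 5.0], [1.2, 7.5]], index=["u1", "u2"], columns=["a", "b"])

# Add the baseline back, then bound every prediction to the 1-10 rating scale.
predicted = (offsets + baseline).clip(lower=1, upper=10)
print(predicted)
```

Values already inside the 1-10 range pass through unchanged, and missing values stay missing.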

In [37]:
global_baseline = global_baseline.drop(columns=global_baseline.columns.difference(cosine_sim_df_train.columns.tolist()))
prediction_train = prediction_train + global_baseline
prediction_train

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,33750,33798,33902,33964,33979,34085,34103,34107,34238,34240
8,9.040012,8.721476,8.641771,7.598019,7.208287,8.471255,8.662300,7.571818,8.514502,9.175166,...,6.208324,8.208274,8.208450,5.922654,6.208324,6.208706,7.874926,9.208324,9.208247,8.855631
17,8.002989,7.684474,7.603250,6.562179,6.172842,7.434373,7.626166,6.535815,7.477214,8.140965,...,5.172884,7.170935,7.172634,4.886551,5.172884,5.172467,6.837797,8.172884,8.171718,7.818885
19,9.539421,9.220750,9.141408,8.097736,7.707893,8.971225,9.161331,8.071853,9.013934,9.674701,...,6.708237,8.708234,8.707923,6.421694,6.708237,6.707983,8.374637,9.708237,9.708238,9.355195
28,9.999826,9.999903,9.999837,9.097998,8.708086,9.971319,9.999768,9.071870,9.999808,9.999892,...,7.708324,9.708324,9.708324,7.422476,7.708324,7.708265,9.374856,10.000000,10.000000,9.999933
34,9.039685,8.721006,8.641483,7.597998,7.208261,8.471417,8.661792,7.571903,8.514304,9.174637,...,6.208297,8.208296,8.208198,5.922467,6.208297,6.208060,7.874659,9.208297,9.208297,8.855173
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73497,8.872824,8.554304,8.474681,7.431307,7.041592,8.304713,8.494803,7.404864,8.347505,9.008210,...,6.041657,8.041657,8.041403,5.755743,6.041657,6.041213,7.708374,9.041657,9.041612,8.688402
73500,9.159307,8.841354,8.760798,7.718129,7.327600,8.590676,8.780497,7.691212,8.634287,9.294726,...,6.327954,8.326377,8.327063,6.041447,6.327954,6.329693,7.993676,9.327954,9.325648,8.975074
73506,10.000015,9.999958,10.000142,9.098112,8.708257,9.971478,10.000164,9.071636,9.999948,9.999801,...,7.708324,9.708227,9.708019,7.422373,7.708324,7.708205,9.375073,10.000000,10.000129,10.000083
73511,9.999914,9.721239,9.641542,8.598075,8.208245,9.471368,9.661708,8.571860,9.514348,9.999913,...,7.208324,9.208139,9.208162,6.922557,7.208324,7.208324,8.874938,10.000000,9.999938,9.855329


In [38]:
global_baseline_test = global_baseline_test.drop(columns=global_baseline_test.columns.difference(cosine_sim_df_test.columns.tolist()))
prediction_test = prediction_test + global_baseline_test
prediction_test

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,33514,33524,33558,33569,33709,33740,33964,34085,34103,34240
17,8.003598,7.685799,7.604147,6.561739,6.171778,7.434696,7.625052,6.536016,7.475220,8.137658,...,5.173423,6.336282,6.571715,6.338895,6.963908,5.173070,4.886443,5.172455,6.840146,7.818635
19,9.538980,9.220561,9.140725,8.097319,7.707444,8.970537,9.161344,8.071312,9.013665,9.674312,...,6.708324,7.874450,8.107668,7.874408,8.499321,6.706674,6.421990,6.707325,8.374513,9.354738
34,9.039944,8.721362,8.641772,7.598210,7.208361,8.471541,8.661973,7.571993,8.514514,9.175102,...,6.208324,7.375014,7.608378,7.375016,8.000028,6.208324,5.922660,6.208411,7.875125,8.855480
44,8.577894,8.259129,8.179946,7.135784,6.746636,8.009913,8.200171,7.110543,8.051921,8.712205,...,5.746636,6.912934,7.146271,6.913119,7.537619,5.746296,5.460416,5.746370,7.411869,8.393607
45,9.468131,9.149642,9.069977,8.026192,7.636714,8.899802,9.089883,8.000310,8.942651,9.603235,...,6.636896,7.803147,8.036767,7.803350,8.428248,6.636896,6.350898,6.636205,8.303444,9.283221
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73449,9.089625,8.771125,8.691961,7.648357,7.258307,8.521402,8.710502,7.620228,8.563568,9.226511,...,6.258011,7.424562,7.657772,7.424607,8.049903,6.257044,5.971508,6.257945,7.926557,8.905501
73485,8.213961,7.896874,7.815854,6.772630,6.382838,7.646016,7.833678,6.745136,7.688347,8.350494,...,5.386891,6.551462,6.783774,6.550253,7.176510,5.386085,5.097651,5.383416,7.051622,8.031216
73490,9.326080,9.007464,8.927729,7.884128,7.494572,8.757890,8.948527,7.859016,8.800685,9.461369,...,6.494087,7.660883,7.894450,7.661333,8.285690,6.494096,6.208644,6.494300,8.160784,9.141321
73497,8.873918,8.554985,8.475686,7.432355,7.043042,8.306003,8.495937,7.406143,8.348471,9.009016,...,6.041657,7.208667,7.442550,7.209355,7.834382,6.042039,5.756456,6.041910,7.708833,8.688985


In [39]:
prediction_train[prediction_train > 10] = 10
prediction_train[prediction_train < 1] = 1
prediction_train

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,33750,33798,33902,33964,33979,34085,34103,34107,34238,34240
8,9.040012,8.721476,8.641771,7.598019,7.208287,8.471255,8.662300,7.571818,8.514502,9.175166,...,6.208324,8.208274,8.208450,5.922654,6.208324,6.208706,7.874926,9.208324,9.208247,8.855631
17,8.002989,7.684474,7.603250,6.562179,6.172842,7.434373,7.626166,6.535815,7.477214,8.140965,...,5.172884,7.170935,7.172634,4.886551,5.172884,5.172467,6.837797,8.172884,8.171718,7.818885
19,9.539421,9.220750,9.141408,8.097736,7.707893,8.971225,9.161331,8.071853,9.013934,9.674701,...,6.708237,8.708234,8.707923,6.421694,6.708237,6.707983,8.374637,9.708237,9.708238,9.355195
28,9.999826,9.999903,9.999837,9.097998,8.708086,9.971319,9.999768,9.071870,9.999808,9.999892,...,7.708324,9.708324,9.708324,7.422476,7.708324,7.708265,9.374856,10.000000,10.000000,9.999933
34,9.039685,8.721006,8.641483,7.597998,7.208261,8.471417,8.661792,7.571903,8.514304,9.174637,...,6.208297,8.208296,8.208198,5.922467,6.208297,6.208060,7.874659,9.208297,9.208297,8.855173
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73497,8.872824,8.554304,8.474681,7.431307,7.041592,8.304713,8.494803,7.404864,8.347505,9.008210,...,6.041657,8.041657,8.041403,5.755743,6.041657,6.041213,7.708374,9.041657,9.041612,8.688402
73500,9.159307,8.841354,8.760798,7.718129,7.327600,8.590676,8.780497,7.691212,8.634287,9.294726,...,6.327954,8.326377,8.327063,6.041447,6.327954,6.329693,7.993676,9.327954,9.325648,8.975074
73506,10.000000,9.999958,10.000000,9.098112,8.708257,9.971478,10.000000,9.071636,9.999948,9.999801,...,7.708324,9.708227,9.708019,7.422373,7.708324,7.708205,9.375073,10.000000,10.000000,10.000000
73511,9.999914,9.721239,9.641542,8.598075,8.208245,9.471368,9.661708,8.571860,9.514348,9.999913,...,7.208324,9.208139,9.208162,6.922557,7.208324,7.208324,8.874938,10.000000,9.999938,9.855329


In [40]:
prediction_test[prediction_test > 10] = 10
prediction_test[prediction_test < 1] = 1
prediction_test

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,33514,33524,33558,33569,33709,33740,33964,34085,34103,34240
17,8.003598,7.685799,7.604147,6.561739,6.171778,7.434696,7.625052,6.536016,7.475220,8.137658,...,5.173423,6.336282,6.571715,6.338895,6.963908,5.173070,4.886443,5.172455,6.840146,7.818635
19,9.538980,9.220561,9.140725,8.097319,7.707444,8.970537,9.161344,8.071312,9.013665,9.674312,...,6.708324,7.874450,8.107668,7.874408,8.499321,6.706674,6.421990,6.707325,8.374513,9.354738
34,9.039944,8.721362,8.641772,7.598210,7.208361,8.471541,8.661973,7.571993,8.514514,9.175102,...,6.208324,7.375014,7.608378,7.375016,8.000028,6.208324,5.922660,6.208411,7.875125,8.855480
44,8.577894,8.259129,8.179946,7.135784,6.746636,8.009913,8.200171,7.110543,8.051921,8.712205,...,5.746636,6.912934,7.146271,6.913119,7.537619,5.746296,5.460416,5.746370,7.411869,8.393607
45,9.468131,9.149642,9.069977,8.026192,7.636714,8.899802,9.089883,8.000310,8.942651,9.603235,...,6.636896,7.803147,8.036767,7.803350,8.428248,6.636896,6.350898,6.636205,8.303444,9.283221
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73449,9.089625,8.771125,8.691961,7.648357,7.258307,8.521402,8.710502,7.620228,8.563568,9.226511,...,6.258011,7.424562,7.657772,7.424607,8.049903,6.257044,5.971508,6.257945,7.926557,8.905501
73485,8.213961,7.896874,7.815854,6.772630,6.382838,7.646016,7.833678,6.745136,7.688347,8.350494,...,5.386891,6.551462,6.783774,6.550253,7.176510,5.386085,5.097651,5.383416,7.051622,8.031216
73490,9.326080,9.007464,8.927729,7.884128,7.494572,8.757890,8.948527,7.859016,8.800685,9.461369,...,6.494087,7.660883,7.894450,7.661333,8.285690,6.494096,6.208644,6.494300,8.160784,9.141321
73497,8.873918,8.554985,8.475686,7.432355,7.043042,8.306003,8.495937,7.406143,8.348471,9.009016,...,6.041657,7.208667,7.442550,7.209355,7.834382,6.042039,5.756456,6.041910,7.708833,8.688985


In [41]:
train_df_content = train_df.drop(columns=train_df.columns.difference(cosine_sim_df_train.columns.tolist())) # match dimensions of predictions

In [42]:
test_df_content = test_df.drop(columns=test_df.columns.difference(cosine_sim_df_test.columns.tolist())) # match dimensions of predictions

This content-based recommender yields an RMSE of 1.18 on the training data and an RMSE of 1.31 on the testing data.  Compared to the global baselines, this is only a marginal improvement (0.2% and 0.1%, respectively).

In [43]:
diff = train_df_content.to_numpy(dtype=np.float32, copy=False) - prediction_train.to_numpy(dtype=np.float32, copy=False)
np.square(diff, out=diff)              # in-place square to save memory
content_rmse_train = np.sqrt(np.nanmean(diff))
print(content_rmse_train)

1.182805


In [44]:
diff_test = test_df_content.to_numpy(dtype=np.float32, copy=False) - prediction_test.to_numpy(dtype=np.float32, copy=False)
np.square(diff_test, out=diff_test)              # in-place square to save memory
content_rmse_test = np.sqrt(np.nanmean(diff_test))
print(content_rmse_test)

1.3131765
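
The RMSE cells above rely on `np.nanmean` to skip the cells a user has not rated.  A minimal sketch of that idea on made-up numbers follows; `masked_rmse` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np

def masked_rmse(actual, predicted):
    """RMSE over entries where `actual` is not NaN (unrated cells are skipped)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    squared_error = (actual - predicted) ** 2   # NaN wherever the rating is missing
    return float(np.sqrt(np.nanmean(squared_error)))

# Two observed ratings (8 and 6); the other two cells are unrated.
actual = np.array([[8.0, np.nan], [np.nan, 6.0]])
predicted = np.array([[7.0, 9.0], [5.0, 8.0]])
rmse = masked_rmse(actual, predicted)
print(rmse)
```

Only the two observed cells contribute here (squared errors of 1 and 4), so predictions for unrated items never affect the score.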


# Collaborative Filtering

I also implement a collaborative filtering recommender using TruncatedSVD.  I create a sparse matrix from the centered training data from the previous section, which is used to fit the TruncatedSVD model.  Before the testing data can be transformed, its features have to match the features of the training data, so features need to be both added to and dropped from the testing data.  The training data contains 2,204 features that do not appear in the testing data, while the testing data contains 298 features that do not appear in the training data.  Zeroes are inserted for the missing features in the testing set, while the additional features in the testing set are dropped.

In [45]:
from sklearn.decomposition import TruncatedSVD
np.random.seed(63)
from scipy.sparse import csr_matrix

sparse_ratings = csr_matrix(train_df_no_na)

In [46]:
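matrix = np.random.random((12125,6685))  # uniform random values in [0, 1)
svd = TruncatedSVD(n_components = 1,n_iter=7, random_state = 63)
svd.fit(sparse_ratings)
# Caution: fit_transform() refits the model on `matrix`, replacing the fit on sparse_ratings above,
# so `transformed` and svd.components_ describe the random matrix.
# svd.transform(sparse_ratings) would instead project the ratings with the model fitted above.
transformed = svd.fit_transform(matrix)
print(transformed)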

[[40.86809295]
 [40.53278938]
 [40.39848692]
 ...
 [40.50152635]
 [40.65067217]
 [41.10465325]]
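
In scikit-learn, `fit_transform` refits the estimator on whatever matrix it receives, so the fit and the projection normally use the same data.  The sketch below shows that pattern on a small made-up matrix of centered ratings; `centered`, `svd_sketch`, and `user_factors` are illustrative names, and the values are not taken from the dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical centered ratings (rows = users, columns = items); 0 = unrated.
# Every row is a multiple of the first, so the matrix has rank 1.
centered = np.array([
    [ 1.0,  0.5,  0.0, -1.0],
    [ 2.0,  1.0,  0.0, -2.0],
    [-1.0, -0.5,  0.0,  1.0],
    [ 0.5,  0.25, 0.0, -0.5],
])
sparse_centered = csr_matrix(centered)

svd_sketch = TruncatedSVD(n_components=1, n_iter=7, random_state=63)
# Fit and project in one step, so both use the same ratings matrix.
user_factors = svd_sketch.fit_transform(sparse_centered)   # shape (n_users, 1)
reconstructed = user_factors @ svd_sketch.components_       # rank-1 approximation

print(user_factors.shape, svd_sketch.components_.shape)
print(np.allclose(reconstructed, centered, atol=1e-6))
```

New rows with the same columns can then be projected with `svd_sketch.transform(...)` without refitting.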


In [47]:
print(train_df.columns.difference(test_df.columns.tolist())) # 2204 additional features in train_df
print(test_df.columns.difference(train_df.columns.tolist())) # 298 additional features in test_df

Index([  103,   112,   179,   188,   211,   213,   214,   215,   220,   271,
       ...
       33493, 33544, 33550, 33696, 33750, 33798, 33902, 33979, 34107, 34238],
      dtype='int64', name='anime_id', length=2204)
Index([  244,   333,   515,   604,   625,   956,  1040,  1070,  1076,  1274,
       ...
       32700, 32736, 32756, 32757, 32961, 33032, 33037, 33205, 33236, 33709],
      dtype='int64', name='anime_id', length=298)


In [48]:
additional_features = train_df.columns.difference(test_df.columns.tolist())
features_to_insert = anime_bias.loc[additional_features] + mean
features_to_insert[1] = 0 
test_df_no_na_svd = pd.DataFrame([features_to_insert[1].values] * len(test_df), columns=additional_features)
test_df_no_na_svd = test_df_no_na_svd.set_index(test_df.index)
test_df_no_na_svd = pd.concat([test_df_no_na,test_df_no_na_svd], axis = 1)
test_df_no_na_svd = test_df_no_na_svd.drop(columns=test_df_no_na_svd.columns.difference(train_df.columns.tolist()))
test_df_no_na_svd

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,33493,33544,33550,33696,33750,33798,33902,33979,34107,34238
17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73449,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
73485,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
73490,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
73497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [132]:
print(train_df.columns.difference(test_df_no_na_svd.columns.tolist())) # empty: no training features are missing from the aligned testing data
print(test_df_no_na_svd.columns.difference(train_df.columns.tolist())) # empty: no extra features remain in the aligned testing data

Index([], dtype='int64', name='anime_id')
Index([], dtype='int64', name='anime_id')
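
The same column alignment can be sketched with `DataFrame.reindex`, which adds and drops columns in one call.  This is an illustrative alternative on made-up frames, not the approach used in the cells above; `train`, `test`, and `test_aligned` are hypothetical names.

```python
import pandas as pd

# Hypothetical centered ratings with partially overlapping item columns.
train = pd.DataFrame([[0.5, -1.0, 0.0]], index=["u1"], columns=[1, 5, 6])
test = pd.DataFrame([[1.5, 0.0, -0.5]], index=["u9"], columns=[5, 6, 7])

# reindex keeps the training columns in order, fills columns the test set
# lacks with 0, and drops test-only columns such as item 7.
test_aligned = test.reindex(columns=train.columns, fill_value=0.0)

print(test_aligned)
print(test_aligned.columns.equals(train.columns))
```

After alignment the two frames share identical columns, which is what `transform` on a fitted model requires.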


In [49]:
transformed_test = svd.transform(test_df_no_na_svd)

In [50]:
print(svd.singular_values_) # previously did this with n_components = 100; first value was two orders of magnitude higher than the second value

[4501.83716819]


In [51]:
X_approx = np.dot(transformed, svd.components_) 

train_df_T = train_df.transpose()
train_df_T.index
X_approx_df = pd.DataFrame(X_approx)
X_approx_df = X_approx_df.set_index(train_df.index)
X_approx_df = X_approx_df.transpose()
X_approx_df = X_approx_df.set_index(train_df_T.index)
X_approx_df = X_approx_df.transpose()

X_approx_df = X_approx_df + global_baseline
X_approx_df[X_approx_df > 10] = 10 
X_approx_df[X_approx_df < 1] = 1 
X_approx_df

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,33750,33798,33902,33964,33979,34085,34103,34107,34238,34240
8,9.539438,9.223175,9.139996,8.103282,7.709666,8.971114,9.164192,8.072235,9.014317,9.674896,...,6.706563,8.708350,8.707230,6.426660,6.706907,6.705937,8.375888,9.706695,9.712375,9.355880
17,8.500252,8.183969,8.100820,7.064050,6.670465,7.931927,8.124984,7.033043,7.975128,8.635707,...,5.667388,7.669160,7.668049,5.387436,5.667729,5.666767,7.336691,8.667518,8.673151,8.316686
19,10.000000,9.717407,9.634270,8.597478,8.203905,9.465373,9.658421,8.566487,9.508573,10.000000,...,7.200838,9.202604,9.201498,6.920868,7.201178,7.200219,8.870133,10.000000,10.000000,9.850129
28,10.000000,10.000000,10.000000,9.595025,9.201471,10.000000,10.000000,9.564057,10.000000,10.000000,...,8.198419,10.000000,10.000000,7.918420,8.198757,8.197803,9.867700,10.000000,10.000000,10.000000
34,9.538060,9.221790,9.138621,8.101889,7.708283,8.969735,9.162806,8.070855,9.012938,9.673516,...,6.705188,8.706970,8.705854,6.425269,6.705531,6.704564,8.374506,9.705319,9.710984,9.354499
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73497,9.378807,9.062571,8.979350,7.942718,7.549056,8.810483,9.003593,7.911612,8.853689,9.514268,...,6.545915,8.547724,8.546591,6.266082,6.546263,6.545282,8.215273,9.546048,9.551797,9.195260
73500,9.655342,9.339059,9.255910,8.219141,7.825555,9.087017,9.280074,8.188133,9.130218,9.790797,...,6.822478,8.824250,8.823140,6.542527,6.822819,6.821857,8.491781,9.822608,9.828241,9.471776
73506,10.000000,10.000000,10.000000,9.598752,9.205169,10.000000,10.000000,9.567748,10.000000,10.000000,...,8.202094,10.000000,10.000000,7.922139,8.202435,8.201474,9.871396,10.000000,10.000000,10.000000
73511,10.000000,10.000000,10.000000,9.100595,8.706999,9.968456,10.000000,9.069574,10.000000,10.000000,...,7.703913,9.705690,9.704576,7.423978,7.704255,7.703290,9.373224,10.000000,10.000000,10.000000


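The global baseline for the testing data also has to be adjusted to cover the same set of features: columns are added for the features missing from the testing data, and columns for features absent from the training data are dropped.  The anime bias values plus the mean rating are used to fill in the added columns.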

In [52]:
repeated = pd.DataFrame([features_to_insert[0].values] * len(global_baseline_test), columns=[f'{i}' for i in range(len(features_to_insert))])
repeated = repeated.set_index(global_baseline_test.index)
repeated = repeated.transpose()
repeated = repeated.set_index(features_to_insert.index)
repeated = repeated.transpose()
global_baseline_test_svd = pd.concat([global_baseline_test,repeated], axis = 1)
global_baseline_test_svd = global_baseline_test_svd.drop(columns=global_baseline_test_svd.columns.difference(test_df_no_na_svd.columns.tolist()))
global_baseline_test_svd

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,33493,33544,33550,33696,33750,33798,33902,33979,34107,34238
17,8.004770,7.686190,7.606570,6.563067,6.173236,7.436394,7.626845,6.536873,7.479359,8.139903,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0
19,9.539858,9.221277,9.141657,8.098155,7.708324,8.971482,9.161932,8.071960,9.014447,9.674991,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0
34,9.039858,8.721277,8.641657,7.598155,7.208324,8.471482,8.661932,7.571960,8.514447,9.174991,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0
44,8.578319,8.259739,8.180119,7.136616,6.746786,8.009944,8.200394,7.110422,8.052908,8.713452,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0
45,9.468429,9.149849,9.070229,8.026726,7.636896,8.900053,9.090504,8.000532,8.943018,9.603562,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73449,9.089858,8.771277,8.691657,7.648155,7.258324,8.521482,8.711932,7.621960,8.564447,9.224991,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0
73485,8.218429,7.899849,7.820229,6.776726,6.386896,7.650053,7.840504,6.750532,7.693018,8.353562,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0
73490,9.325572,9.006992,8.927372,7.883869,7.494038,8.757196,8.947647,7.857675,8.800161,9.460705,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0
73497,8.873191,8.554611,8.474991,7.431488,7.041657,8.304815,8.495266,7.405294,8.347780,9.008324,...,4.0,5.0,5.0,1.0,6.0,8.0,8.0,6.0,9.0,9.0


In [53]:
X_approx_test = np.dot(transformed_test, svd.components_) 

# label rows with the test users and columns with the aligned anime ids
X_approx_df_test = pd.DataFrame(X_approx_test, index=test_df.index, columns=test_df_no_na_svd.columns)

X_approx_df_test = X_approx_df_test + global_baseline_test_svd
X_approx_df_test[X_approx_df_test > 10] = 10 
X_approx_df_test[X_approx_df_test < 1] = 1 
X_approx_df_test

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,33493,33544,33550,33696,33750,33798,33902,33979,34107,34238
17,8.003865,7.685281,7.605667,6.562152,6.172328,7.435489,7.625935,6.535967,7.478453,8.138997,...,3.999097,4.999094,4.999096,1.000000,5.999097,7.999099,7.999093,5.999097,8.999087,8.999093
19,9.539449,9.220867,9.141250,8.097742,7.707914,8.971074,9.161522,8.071551,9.014038,9.674582,...,3.999593,4.999591,4.999592,1.000000,5.999592,7.999593,7.999591,5.999593,8.999588,8.999591
34,9.039888,8.721308,8.641687,7.598185,7.208354,8.471512,8.661963,7.571991,8.514477,9.175021,...,4.000030,5.000030,5.000030,1.000030,6.000030,8.000030,8.000030,6.000030,9.000030,9.000030
44,8.578038,8.259457,8.179839,7.136332,6.746504,8.009663,8.200111,7.110141,8.052627,8.713171,...,3.999720,4.999719,4.999719,1.000000,5.999720,7.999720,7.999718,5.999720,8.999717,8.999719
45,9.468219,9.149638,9.070019,8.026513,7.636684,8.899843,9.090292,8.000321,8.942808,9.603352,...,3.999790,4.999790,4.999790,1.000000,5.999790,7.999791,7.999789,5.999790,8.999788,8.999789
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73449,9.089578,8.770996,8.691378,7.647872,7.258043,8.521202,8.711651,7.621680,8.564167,9.224711,...,3.999721,4.999720,4.999721,1.000000,5.999721,7.999721,7.999719,5.999721,8.999718,8.999720
73485,8.216227,7.897637,7.818032,6.774499,6.384686,7.647851,7.838290,6.748327,7.690815,8.351359,...,3.997804,4.997796,4.997801,1.000000,5.997802,7.997807,7.997792,5.997803,8.997778,8.997794
73490,9.325861,9.007282,8.927660,7.884161,7.494328,8.757485,8.947937,7.857964,8.800450,9.460994,...,4.000288,5.000289,5.000289,1.000292,6.000289,8.000288,8.000290,6.000288,9.000292,9.000290
73497,8.873544,8.554965,8.475343,7.431845,7.042011,8.305168,8.495620,7.405647,8.348133,9.008677,...,4.000352,5.000353,5.000352,1.000356,6.000352,8.000351,8.000354,6.000352,9.000356,9.000353


After adding back the global baseline, this recommender yields a training RMSE of 1.27 and a testing RMSE of 1.31.  Compared to the content-based recommender (1.18 and 1.31), this is a decrease in performance on the training data (RMSE is approximately 8% higher), while performance on the testing data is essentially unchanged.  This may be due to the number of dropped features.

In [54]:
diff = train_df.to_numpy(dtype=np.float32, copy=False) - X_approx_df.to_numpy(dtype=np.float32, copy=False)
np.square(diff, out=diff)              # in-place square to save memory
collab_rmse_train = np.sqrt(np.nanmean(diff))
print(collab_rmse_train)

1.2737987


In [55]:
X_approx_df_test_rmse = X_approx_df_test.drop(columns=X_approx_df_test.columns.difference(test_df.columns.tolist()))
test_df_rmse = test_df.drop(columns=test_df.columns.difference(X_approx_df_test.columns.tolist()))
diff_test = test_df_rmse.to_numpy(dtype=np.float32, copy=False) - X_approx_df_test_rmse.to_numpy(dtype=np.float32, copy=False)
np.square(diff_test, out=diff_test)              # in-place square to save memory
collab_rmse_test = np.sqrt(np.nanmean(diff_test))
print(collab_rmse_test)

1.3115469


# Diversity Score
Diversity score is a metric that measures how varied a user's recommendations are.  It can be quantified using the cosine similarity between recommended items: to find the diversity score for a user's recommendations, we average the cosine similarity between the recommended items and subtract this average from 1.  The average diversity score across all users' recommendations is then used as the recommender system's diversity score.  A diversity score of 1 can be interpreted as no similarity between any of the items in a user's recommendations.  For the sake of this exercise, we consider each user's 10 highest predicted ratings as their recommendations.  To ensure already-rated items are not included in the recommendations, the recommendation matrix is created by removing the items each user has already rated.
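
As a minimal sketch of this metric, the hypothetical helper `diversity_score` below computes 1 minus the mean cosine similarity over distinct pairs of recommended items, using a made-up similarity matrix.  It normalizes by the n(n-1) ordered pairs, which is an assumption of this sketch; the cells in this section divide by 100 for 10 items, so their numbers are on a slightly different scale.

```python
import numpy as np
import pandas as pd

def diversity_score(recommended, sim):
    """1 minus the mean cosine similarity over distinct pairs of recommended items."""
    block = sim.loc[recommended, recommended].to_numpy()
    n = len(recommended)
    off_diagonal_sum = block.sum() - np.trace(block)   # drop each item's similarity to itself
    return 1.0 - off_diagonal_sum / (n * (n - 1))      # n * (n - 1) ordered pairs

# Hypothetical item-item cosine similarities.
items = ["a", "b", "c"]
sim = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0]],
    index=items, columns=items,
)
score = diversity_score(["a", "b", "c"], sim)
print(round(score, 4))
```

A system-level score would then be the mean of this value across all users.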

Diversity Score for Global Baseline

For the global baseline, we can see that the diversity score is approximately 0.82.  

In [158]:
recommendation_content = train_df.isna() * global_baseline # remove already rated items
recommendations_list = {}

for user_id in recommendation_content.index:
    # Get and sort the user's predictions
    sorted_user_predictions = recommendation_content.loc[user_id].sort_values(ascending=False) # sort all predicted ratings
    recommendations_list[user_id] = sorted_user_predictions.head(10).index.tolist() # take top ten rated items as the recommendation and store in dict

diversity_score = []
for user_id in recommendations_list:
    total_similarity = 0
    for anime in recommendations_list[user_id]:
        total_similarity += cosine_sim_df_train.loc[recommendations_list[user_id], anime].sum() - 1 # sum up cosine similarity, subtract itself
    diversity_score.append(1 - (total_similarity/100)) # store diversity score
        
print(np.array(diversity_score).mean()) # take average diversity score

0.8232272975656724


However, the approach outlined above has a limitation: the top 10 recommendations may be tied in score with other, non-recommended items.  For instance, if a user has 20 animes with a predicted score of 10, only half are included in the recommendation list.  To improve the diversity score and provide the user with more diverse anime recommendations, I instead create a temporary candidate list of items whose predicted ratings are at least one standard deviation above the user's average rating (the user's bias plus the mean).  If this threshold is 10 or higher, I instead use the user's bias plus the mean as the threshold, so that candidate items are at or above the user's average rating.  Then, I randomly select 10 items from this list.  With this method, the diversity score of the recommender system on the training data improves from 0.82 to 0.83.

In [159]:
import random # global baseline improvement, training
random.seed(63)
recommendations_list_diversity = {}

for user_id in recommendation_content.index:
        # Get and sort the user's predictions
    sorted_user_predictions = pd.DataFrame(recommendation_content.loc[user_id].sort_values(ascending=False))
    cap = 0
    cap = user_bias.loc[user_id][0] + mean + std_dev
    if cap >= 10:
        cap = user_bias.loc[user_id][0] + mean
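    recommendations_list_diversity[user_id] = sorted_user_predictions[sorted_user_predictions[user_id] >= (cap)].index.tolist() 
    if len(recommendations_list_diversity[user_id]) >= 10:
        recommendations_list_diversity[user_id] = random.sample(recommendations_list_diversity[user_id],10)
    else: # fewer than 10 candidates: random.sample would raise ValueError, so fall back to the top 10
        recommendations_list_diversity[user_id] = sorted_user_predictions[user_id].head(10).index.tolist()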
    
diversity_score_improved = []
for user_id in recommendations_list_diversity:
    
    total_similarity = 0
    for anime in recommendations_list_diversity[user_id]:
        total_similarity += cosine_sim_df_train.loc[recommendations_list_diversity[user_id], anime].sum() - 1 # subtract itself
    diversity_score_improved.append(1 - (total_similarity/100))
        
print(np.array(diversity_score_improved).mean())

0.830227764254512
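
The threshold-then-sample idea can be sketched as a small function.  `sample_recommendations` is a hypothetical helper, and the ratings and thresholds below are made up for illustration; it also falls back to the top-k items when too few candidates clear the threshold.

```python
import random
import pandas as pd

def sample_recommendations(predictions, threshold, k=10, rng=None):
    """Pick k items at random from those scoring >= threshold.

    Falls back to the top-k items by score when fewer than k items qualify.
    """
    rng = rng or random.Random(63)
    candidates = predictions[predictions >= threshold].index.tolist()
    if len(candidates) >= k:
        return rng.sample(candidates, k)
    return predictions.sort_values(ascending=False).head(k).index.tolist()

# Hypothetical predicted ratings for one user.
predictions = pd.Series({"a": 9.5, "b": 9.5, "c": 9.0, "d": 7.0, "e": 6.5})
picked = sample_recommendations(predictions, threshold=9.0, k=2)    # sampled from a, b, c
fallback = sample_recommendations(predictions, threshold=9.8, k=2)  # nothing qualifies: top 2
print(picked, fallback)
```

Sampling from a wider pool of well-rated candidates trades a little predicted rating for variety, which is the intent of the improvement described above.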


Applying the above to the testing data, the global baseline has a diversity score of 0.82 for the top 10 recommendations and a diversity score of 0.84 with the random-selection method, an approximate 1.9% improvement.

In [161]:
recommendation_content = test_df.isna() * global_baseline_test # repeat process for global baseline, testing
recommendations_list = {}

for user_id in recommendation_content.index:
    # Get and sort the user's predictions
    sorted_user_predictions = recommendation_content.loc[user_id].sort_values(ascending=False) # sort all predicted ratings
    recommendations_list[user_id] = sorted_user_predictions.head(10).index.tolist() # take top ten rated items as the recommendation and store in dict

diversity_score = []
for user_id in recommendations_list:
    total_similarity = 0
    for anime in recommendations_list[user_id]:
        total_similarity += cosine_sim_df_test.loc[recommendations_list[user_id], anime].sum() - 1 # sum up cosine similarity, subtract itself
    diversity_score.append(1 - (total_similarity/100)) # store diversity score
        
print(np.array(diversity_score).mean()) # take average diversity score

0.823274565610261


In [162]:
import random # global baseline improvement, testing
random.seed(63)
recommendations_list_diversity = {}

for user_id in recommendation_content.index:
        # Get and sort the user's predictions
    sorted_user_predictions = pd.DataFrame(recommendation_content.loc[user_id].sort_values(ascending=False))
    cap = 0
    cap = user_bias_test.loc[user_id][0] + mean + std_dev
    if cap >= 10:
        cap = user_bias_test.loc[user_id][0] + mean
    recommendations_list_diversity[user_id] = sorted_user_predictions[sorted_user_predictions[user_id] >= (cap)].index.tolist() 
    if len(recommendations_list_diversity[user_id]) >= 10:
        recommendations_list_diversity[user_id] = random.sample(recommendations_list_diversity[user_id],10)
    else: 
        sorted_user_predictions = recommendation_content.loc[user_id].sort_values(ascending=False)
        recommendations_list_diversity[user_id] = sorted_user_predictions.head(10).index.tolist() 

    
diversity_score_improved = []
for user_id in recommendations_list_diversity:
    
    total_similarity = 0
    for anime in recommendations_list_diversity[user_id]:
        total_similarity += cosine_sim_df_test.loc[recommendations_list_diversity[user_id], anime].sum() - 1 # subtract itself
    diversity_score_improved.append(1 - (total_similarity/100))

print(np.array(diversity_score_improved).mean())

0.8392523649083112


Diversity Score for Content Based Filtering

I repeat the above process for the content-based recommender on both the training and testing data.  On the training data, the diversity score is 0.75 with the original top-ten method and 0.83 with the randomized method (a relative improvement of approximately 10%).  On the testing data, the score rises from 0.81 to 0.84 (a relative improvement of approximately 3%).
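
The per-user calculation repeated in the cells of this section can be read as one small function.  The sketch below is for illustration only: the name `intra_list_diversity` and the toy similarity matrix are hypothetical and do not appear in the notebook.  It keeps the notebook's convention of dividing the summed off-diagonal similarity by the squared list length (100 for ten items) and, like the notebook, assumes every item has a similarity of 1 with itself.

```python
import numpy as np
import pandas as pd

def intra_list_diversity(items, sim_df):
    """Return 1 - (sum of pairwise similarities, self-pairs excluded) / k**2."""
    k = len(items)
    block = sim_df.loc[items, items].to_numpy()
    # Remove the diagonal, i.e. each item's similarity with itself
    off_diagonal = block.sum() - np.trace(block)
    return 1 - off_diagonal / k**2

# Toy 3-item similarity matrix (made-up values, not taken from the dataset)
toy_sim = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)
score = intra_list_diversity(["a", "b", "c"], toy_sim)
print(round(score, 4))
```

A score near 1 means the recommended items have little in common under the tf-idf similarity, while a score near 0 means they are nearly interchangeable.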

In [47]:
recommendation_content = train_df.isna() * prediction_train # repeat process for Content Based Filtering model, training data
recommendations_list = {}

for user_id in recommendation_content.index:
    # Get and sort the user's predictions
    sorted_user_predictions = recommendation_content.loc[user_id].sort_values(ascending=False)
    recommendations_list[user_id] = sorted_user_predictions.head(10).index.tolist() 

diversity_score = []
for user_id in recommendations_list:
    total_similarity = 0
    for anime in recommendations_list[user_id]:
        total_similarity += cosine_sim_df_train.loc[recommendations_list[user_id], anime].sum() - 1 # subtract itself
    diversity_score.append(1 - (total_similarity/100))
        
print(np.array(diversity_score).mean())

0.7537173116277938


In [48]:
import random # Content based filtering improvement, training
random.seed(63)
recommendations_list_diversity = {}

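for user_id in recommendation_content.index:
    # Get and sort the user's predictions
    sorted_user_predictions = pd.DataFrame(recommendation_content.loc[user_id].sort_values(ascending=False))
    # Threshold for an acceptable recommendation: user bias + mean + one standard deviation
    cap = user_bias.loc[user_id].iloc[0] + mean + std_dev
    if cap >= 10:
        # threshold would reach the top of the rating scale, so drop the standard deviation
        cap = user_bias.loc[user_id].iloc[0] + mean
    recommendations_list_diversity[user_id] = sorted_user_predictions[sorted_user_predictions[user_id] >= cap].index.tolist()
    if len(recommendations_list_diversity[user_id]) >= 10:
        recommendations_list_diversity[user_id] = random.sample(recommendations_list_diversity[user_id], 10)
    else:
        # random.sample raises ValueError with fewer than ten items, so fall back to the plain top ten
        recommendations_list_diversity[user_id] = sorted_user_predictions.head(10).index.tolist()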
    
diversity_score_improved = []
for user_id in recommendations_list_diversity:
    
    total_similarity = 0
    for anime in recommendations_list_diversity[user_id]:
        total_similarity += cosine_sim_df_train.loc[recommendations_list_diversity[user_id], anime].sum() - 1 # subtract itself
    diversity_score_improved.append(1 - (total_similarity/100))
        
print(np.array(diversity_score_improved).mean())

0.8299880781385994


In [148]:
recommendation_content = test_df.isna() * prediction_test # content based filtering, testing
recommendations_list = {}

for user_id in recommendation_content.index:
        # Get and sort the user's predictions
    sorted_user_predictions = recommendation_content.loc[user_id].sort_values(ascending=False)
    recommendations_list[user_id] = sorted_user_predictions.head(10).index.tolist() 

diversity_score = []
for user_id in recommendations_list:
    total_similarity = 0
    for anime in recommendations_list[user_id]:
        total_similarity += cosine_sim_df_test.loc[recommendations_list[user_id], anime].sum() - 1 # subtract itself
    diversity_score.append(1 - (total_similarity/100))
        
print(np.array(diversity_score).mean())

0.8114396547892405


In [156]:
import random # content based filtering improvement, testing
random.seed(63)
recommendations_list_diversity = {}

for user_id in recommendation_content.index:
    # Get and sort the user's predictions
    sorted_user_predictions = pd.DataFrame(recommendation_content.loc[user_id].sort_values(ascending=False))
    # Threshold for an acceptable recommendation: user bias + mean + one standard deviation
    cap = user_bias_test.loc[user_id].iloc[0] + mean + std_dev
    if cap >= 10:
        # threshold would reach the top of the rating scale, so drop the standard deviation
        cap = user_bias_test.loc[user_id].iloc[0] + mean
    recommendations_list_diversity[user_id] = sorted_user_predictions[sorted_user_predictions[user_id] >= cap].index.tolist()
    if len(recommendations_list_diversity[user_id]) >= 10:
        # enough acceptable items: sample ten of them at random
        recommendations_list_diversity[user_id] = random.sample(recommendations_list_diversity[user_id], 10)
    else:
        # too few acceptable items: fall back to the plain top ten
        sorted_user_predictions = recommendation_content.loc[user_id].sort_values(ascending=False)
        recommendations_list_diversity[user_id] = sorted_user_predictions.head(10).index.tolist()

    
diversity_score_improved = []
for user_id in recommendations_list_diversity:
    
    total_similarity = 0
    for anime in recommendations_list_diversity[user_id]:
        total_similarity += cosine_sim_df_test.loc[recommendations_list_diversity[user_id], anime].sum() - 1 # subtract itself
    diversity_score_improved.append(1 - (total_similarity/100))

print(np.array(diversity_score_improved).mean())

0.8357659647127614


Diversity Score for Collaborative Filtering

I repeat the above process for the collaborative filtering recommender on both the training and testing data.  On the training data, the diversity score is 0.81 with the original top-ten method and 0.84 with the randomized method (a relative improvement of approximately 3.5%).  On the testing data, the score is 0.83 under both methods (0.829 versus 0.831), so the attempted improvement has essentially no effect.
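
As a quick check on the percentages quoted for collaborative filtering, the relative change can be computed directly from the scores printed by the cells in this subsection.  The helper name `relative_improvement` is hypothetical and only serves this worked example.

```python
def relative_improvement(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100

# Diversity scores printed by the collaborative filtering cells
train_gain = relative_improvement(0.8117003598112876, 0.8404500283104703)
test_gain = relative_improvement(0.8292129607936944, 0.8307860669248841)
print(f"training: {train_gain:.1f}%  testing: {test_gain:.1f}%")
```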

In [87]:
recommendation_collab = train_df.isna() * pd.DataFrame(X_approx_df) # repeat process for Collaborative Filtering model, training data
recommendations_list_collab = {}

for user_id in recommendation_collab.index:
        # Get and sort the user's predictions
    sorted_user_predictions = recommendation_collab.loc[user_id].sort_values(ascending=False)
    recommendations_list_collab[user_id] = sorted_user_predictions.head(10).index.tolist() 

diversity_score_collab = []
for user_id in recommendations_list_collab:
    total_similarity = 0
    for anime in recommendations_list_collab[user_id]:
        total_similarity += cosine_sim_df_train.loc[recommendations_list_collab[user_id], anime].sum() - 1 # subtract itself
    diversity_score_collab.append(1 - (total_similarity/100))
    
print(np.array(diversity_score_collab).mean())

0.8117003598112876


In [88]:
import random # collaborative filtering improvement, training
random.seed(63)
recommendations_list_diversity_collab = {}

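for user_id in recommendation_collab.index:
    # Get and sort the user's predictions
    sorted_user_predictions = pd.DataFrame(recommendation_collab.loc[user_id].sort_values(ascending=False))
    # Threshold for an acceptable recommendation: user bias + mean + one standard deviation
    cap = user_bias.loc[user_id].iloc[0] + mean + std_dev
    if cap >= 10:
        # threshold would reach the top of the rating scale, so drop the standard deviation
        cap = user_bias.loc[user_id].iloc[0] + mean
    recommendations_list_diversity_collab[user_id] = sorted_user_predictions[sorted_user_predictions[user_id] >= cap].index.tolist()
    if len(recommendations_list_diversity_collab[user_id]) >= 10:
        recommendations_list_diversity_collab[user_id] = random.sample(recommendations_list_diversity_collab[user_id], 10)
    else:
        # random.sample raises ValueError with fewer than ten items, so fall back to the plain top ten
        recommendations_list_diversity_collab[user_id] = sorted_user_predictions.head(10).index.tolist()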
    
diversity_score_improved_collab = []
for user_id in recommendations_list_diversity_collab:
    
    total_similarity = 0
    for anime in recommendations_list_diversity_collab[user_id]:
        total_similarity += cosine_sim_df_train.loc[recommendations_list_diversity_collab[user_id], anime].sum() - 1 # subtract itself
    diversity_score_improved_collab.append(1 - (total_similarity/100))
        
print(np.array(diversity_score_improved_collab).mean())

0.8404500283104703


In [155]:
recommendation_collab = test_df_rmse.isna() * pd.DataFrame(X_approx_df_test) # collaborative filtering diversity, testing
recommendations_list_collab = {}

for user_id in recommendation_collab.index:
        # Get and sort the user's predictions
    sorted_user_predictions = recommendation_collab.loc[user_id].sort_values(ascending=False)
    recommendations_list_collab[user_id] = sorted_user_predictions.head(10).index.tolist() 

diversity_score_collab = []
for user_id in recommendations_list_collab:
    total_similarity = 0
    for anime in recommendations_list_collab[user_id]:
        total_similarity += cosine_sim_df_test.loc[recommendations_list_collab[user_id], anime].sum() - 1 # subtract itself
    diversity_score_collab.append(1 - (total_similarity/100))
    
print(np.array(diversity_score_collab).mean())

0.8292129607936944


In [157]:
import random # collaborative filtering diversity improvement, testing
random.seed(63)
recommendations_list_diversity_collab = {}

for user_id in recommendation_collab.index:
    # Get and sort the user's predictions
    sorted_user_predictions = pd.DataFrame(recommendation_collab.loc[user_id].sort_values(ascending=False))
    # Threshold for an acceptable recommendation: user bias + mean + one standard deviation
    cap = user_bias_test.loc[user_id].iloc[0] + mean + std_dev
    if cap >= 10:
        # threshold would reach the top of the rating scale, so drop the standard deviation
        cap = user_bias_test.loc[user_id].iloc[0] + mean
    recommendations_list_diversity_collab[user_id] = sorted_user_predictions[sorted_user_predictions[user_id] >= cap].index.tolist()
    if len(recommendations_list_diversity_collab[user_id]) >= 10:
        # enough acceptable items: sample ten of them at random
        recommendations_list_diversity_collab[user_id] = random.sample(recommendations_list_diversity_collab[user_id], 10)
    else:
        # too few acceptable items: fall back to the plain top ten
        sorted_user_predictions = recommendation_collab.loc[user_id].sort_values(ascending=False)
        recommendations_list_diversity_collab[user_id] = sorted_user_predictions.head(10).index.tolist()

    
diversity_score_improved = []
for user_id in recommendations_list_diversity_collab:
    
    total_similarity = 0
    for anime in recommendations_list_diversity_collab[user_id]:
        total_similarity += cosine_sim_df_test.loc[recommendations_list_diversity_collab[user_id], anime].sum() - 1 # subtract itself
    diversity_score_improved.append(1 - (total_similarity/100))

print(np.array(diversity_score_improved).mean())

0.8307860669248841


# Conclusions

In this exercise, the best-performing system with respect to RMSE was the global baseline recommender.  With the original method for assigning recommendations, the global baseline also had the best diversity score on the training dataset, at 0.82, while the collaborative filtering model had the best diversity score on the testing dataset, at 0.83.  With the improved method for assigning recommendations, the collaborative filtering model had the best diversity score on the training dataset, at 0.84, and the global baseline had the best diversity score on the testing dataset, also at 0.84.  However, because the differences in diversity score between the collaborative filtering and global baseline recommenders are marginal, I consider the global baseline recommender the best-performing model overall.

Regarding diversity score improvements, the improved approach does seem to have potential for enhancing diversity.  This is best demonstrated by the content-based filtering model on the training data, where the diversity score rose by 0.08 (from 0.75 to 0.83, roughly 10% in relative terms).  However, this method is likely inconsistent because it selects at random among the subset of acceptable recommendations, so results can change with the random seed.  A potential additional improvement would be to build the top 10 recommendations sequentially, selecting at each step the item with the lowest cosine similarity to the items already selected.  For example, the first item can be selected based on highest predicted rating.  The next item can be selected among the subset of acceptable recommendations, with lowest cosine similarity as the selection criterion.  This process can be continued until 10 items are selected in total.  If this were implemented for a streaming platform, user feedback could be requested whenever users choose to watch an anime from the recommended list, to evaluate the appropriateness of these diverse recommendations.  A click and watch-through for an anime on the recommendation list could also be considered positive feedback, while a click in and then out (i.e. watching only a low percentage of the total anime duration) could be considered negative feedback.  It is also worth considering whether users want to see diverse recommendations at all, potentially through a user setting.  For an environment as low-stakes (i.e. for entertainment only) as an anime streaming platform, non-diverse recommendations may not be a problem if the user does not want diversity in their media.
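
The sequential selection idea can be sketched as a greedy re-ranking step.  This is a minimal sketch rather than part of the notebook: the function name `greedy_diverse_top_k`, the `threshold` argument (standing in for the `cap` used in the cells above), and the toy ratings and similarities are all hypothetical.  It also assumes one specific reading of "lowest cosine similarity to the already selected subset", namely the candidate whose maximum similarity to any selected item is smallest; averaging the similarities would be an equally reasonable choice.

```python
import pandas as pd

def greedy_diverse_top_k(predicted, sim_df, threshold, k=10):
    """Start from the top-rated acceptable item, then repeatedly add the
    acceptable item least similar to the items already selected."""
    candidates = predicted[predicted >= threshold].sort_values(ascending=False)
    if candidates.empty:
        return []
    selected = [candidates.index[0]]
    remaining = list(candidates.index[1:])
    while remaining and len(selected) < k:
        # Highest similarity of each remaining item to anything already chosen
        max_sim = sim_df.loc[remaining, selected].max(axis=1)
        next_item = max_sim.idxmin()
        selected.append(next_item)
        remaining.remove(next_item)
    return selected

# Toy data (made-up values): "a" and "b" are near-duplicates, "d" is rated too low
labels = ["a", "b", "c", "d"]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.3],
     [0.9, 1.0, 0.2, 0.4],
     [0.1, 0.2, 1.0, 0.5],
     [0.3, 0.4, 0.5, 1.0]],
    index=labels, columns=labels,
)
toy_pred = pd.Series([9.5, 9.0, 8.5, 6.0], index=labels)
picks = greedy_diverse_top_k(toy_pred, toy_sim, threshold=8.0, k=2)
print(picks)
```

Unlike random sampling, this selection is deterministic for a given set of predictions, so the resulting diversity score would not depend on the random seed.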

For overall improvements, I think the cosine similarity between items needs to be re-assessed.  Additional data containing anime descriptions could be imported to give tf-idf richer text to work with.  The recommender systems could also treat each type of media (e.g. TV vs. movie) differently, as these fulfill different purposes for users.
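
To illustrate how description text could feed the same tf-idf and cosine similarity pipeline, the sketch below uses three made-up placeholder descriptions.  They are not taken from any dataset, and the `stop_words="english"` setting is an assumption for this example rather than the configuration used earlier in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder descriptions, invented for this example only
descriptions = [
    "space pirates battle across the galaxy",
    "pirates sail the galaxy in space ships",
    "high school romance and friendship",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(descriptions)  # one row per description
sim = cosine_similarity(matrix)             # 3 x 3 item-item similarity

# The two space-themed descriptions share vocabulary; the third shares none
print(sim.round(2))
```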