Methods 6 - Recommendation Algorithms
-------------------------------------
13.3.2020  
Mathematics and Methods in Machine Learning and Neural Networks    
Helsinki Metropolia University of Applied Sciences

The aim of this exercise is to create a recommendation engine for anime content. The methods used and compared are K-Nearest Neighbors (KNN) and Singular Value Decomposition (SVD).

In [18]:
import pandas as pd
from surprise import Reader
from surprise import KNNBasic
from surprise import SVD
from surprise import Dataset
from collections import defaultdict

The data contains quoted strings, so we set `quotechar` explicitly.

In [19]:
url_anime = r'../input/anime-recommendations-database/anime.csv'

anime = pd.read_csv(url_anime, 
                    sep = ',', 
                    quotechar='"')
anime['genre'] = anime['genre'].str.split(", ")
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"[Action, Adventure, Drama, Fantasy, Magic, Mil...",TV,64,9.26,793665
2,28977,Gintama°,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.25,114262
3,9253,Steins;Gate,"[Sci-Fi, Thriller]",TV,24,9.17,673572
4,9969,Gintama&#039;,"[Action, Comedy, Historical, Parody, Samurai, ...",TV,51,9.16,151266
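
To illustrate why the quote character matters when loading `anime.csv`, here is a minimal sketch using only the standard library `csv` module instead of pandas. The raw line is an assumed reconstruction of the first row shown above: the genre field contains commas, so it has to be wrapped in double quotes to remain a single field.

```python
import csv
import io

# Assumed raw form of the first row shown above: the genre field contains
# commas, so it is wrapped in double quotes in the file.
raw = (
    "anime_id,name,genre,type,episodes,rating,members\n"
    '32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630\n'
)

reader = csv.DictReader(io.StringIO(raw), delimiter=",", quotechar='"')
rows = list(reader)

# The quoted field survives as one value...
genre_field = rows[0]["genre"]
# ...and can then be split into a list, mirroring .str.split(", ").
genres = genre_field.split(", ")

print(len(rows[0]), genres)
```

Without the quoting, the same line would be split into ten fields instead of seven.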


We discard all empty ratings, denoted by `-1`.

In [20]:
url_rating = r'../input/anime-recommendations-database/rating.csv'

# Unrated entries are stored as -1; reading them as NaN lets dropna() remove them
rating = pd.read_csv(url_rating,
                     sep=',',
                     index_col=None,
                     na_values='-1')
rating = rating.dropna()
rating.head()

Unnamed: 0,user_id,anime_id,rating
47,1,8074,10.0
81,1,11617,10.0
83,1,11757,10.0
101,1,15451,10.0
153,2,11771,10.0


The data consists of two files: one with information about each anime, the other with the ratings given by users. Let's join the two tables.

In [21]:
df = anime.merge(rating, on='anime_id')

Now we have all the data in the same dataframe, each line representing one rating.
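
The join can be pictured with a small pure-Python sketch on made-up rows (the titles and values below are hypothetical). It mimics the default behaviour of `merge`: an inner join on `anime_id` that yields one row per rating. Because both tables have a `rating` column, pandas renames them with the default suffixes, so `rating_x` is the anime's average rating and `rating_y` is the individual user's rating.

```python
# Toy stand-ins for the two tables (illustrative values, not real data).
anime = [
    {"anime_id": 1, "name": "Title A", "rating": 8.5},  # average rating
    {"anime_id": 2, "name": "Title B", "rating": 7.0},
]
rating = [
    {"user_id": 10, "anime_id": 1, "rating": 9.0},  # one user's rating
    {"user_id": 11, "anime_id": 1, "rating": 6.0},
    {"user_id": 10, "anime_id": 3, "rating": 5.0},  # anime_id 3 has no match
]

# Index the anime table by its key, then emit one joined row per rating.
anime_by_id = {row["anime_id"]: row for row in anime}
joined = []
for r in rating:
    a = anime_by_id.get(r["anime_id"])
    if a is None:
        continue  # inner join: ratings without a matching anime are dropped
    joined.append({
        "anime_id": a["anime_id"],
        "name": a["name"],
        "rating_x": a["rating"],   # from the left (anime) table
        "user_id": r["user_id"],
        "rating_y": r["rating"],   # from the right (rating) table
    })

for row in joined:
    print(row)
```

An anime that nobody rated (`Title B` here) also disappears from the result, which is why the merged frame can contain fewer titles than `anime.csv`.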

In [22]:
print("dataframe size:", len(df))
print("Animes:", len(df['anime_id'].value_counts()))
print("Users:", len(df['user_id'].value_counts()))

dataframe size: 6337239
Animes: 9926
Users: 69600


The amount of data is large. Without trimming the data set, the KNN algorithm's memory usage is unacceptable, resulting in a `MemoryError`.

We filter by the number of ratings per user, keeping only ratings from the most active reviewers. By trial and error, a threshold of 500 ratings was found to make the dataset small enough to run on a home computer. Execution time at that threshold is still painfully long, so we raise it to 1500.

SVD does not hog memory the way KNN does, but we run it on the same trimmed dataset so that running times can be compared.

In [24]:
# Filtering method inspired by
# https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b
min_ratings_per_user = 1500
filter_user = df['user_id'].value_counts() >= min_ratings_per_user
filter_user = filter_user[filter_user].index.tolist()

df = df[(df['user_id'].isin(filter_user))].dropna()

print('after filtering users by number of ratings:')
print("dataframe size:", len(df))
print("animes:", len(df['anime_id'].value_counts()))
print("users:", len(df['user_id'].value_counts()))

after filtering users by number of ratings:
dataframe size: 67781
animes: 8420
users: 35


The training set is now significantly smaller (67781 ratings from 35 users), while still covering a diverse selection of anime to recommend (8420 of the original 9926 titles).

In [25]:
df

Unnamed: 0,anime_id,name,genre,type,episodes,rating_x,members,user_id,rating_y
332,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,13954,6.0
449,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,17033,10.0
1336,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,49662,8.0
1466,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,57620,10.0
1689,32281,Kimi no Na wa.,"[Drama, Romance, School, Supernatural]",Movie,1,9.37,200630,65840,10.0
...,...,...,...,...,...,...,...,...,...
6337226,17833,Pink no Curtain,"[Hentai, Slice of Life]",OVA,1,3.61,138,65836,7.0
6337228,10368,Teleclub no Himitsu,[Hentai],OVA,2,4.67,148,65836,5.0
6337229,9352,Tenshi no Habataki Jun,[Hentai],OVA,1,4.33,201,53698,6.0
6337235,5543,Under World,[Hentai],OVA,1,4.28,183,49503,4.0


In [26]:
# Construct reader
reader = Reader(rating_scale=(1, 10))

# Generate surprise Dataset
data = Dataset.load_from_df(df[['user_id', 'anime_id', 'rating_y']], reader)

In [27]:
# Set all data as training set
trainset = data.build_full_trainset()

In [28]:
%%time
# Build and train KNN
sim_options = { 'user_based': False } # item based

knn = KNNBasic(sim_options=sim_options)
knn.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.
CPU times: user 10.5 s, sys: 3.22 s, total: 13.8 s
Wall time: 13.6 s


<surprise.prediction_algorithms.knns.KNNBasic at 0x7fb28a15a860>

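The output above shows that `KNNBasic` used the MSD (mean squared difference) similarity. As a rough illustration of the idea, and not of the library's actual implementation, the sketch below computes an item-to-item similarity of the form 1 / (msd + 1) over the users who rated both items, and then predicts a rating as the similarity-weighted mean of the user's ratings on the most similar items. The toy ratings, the function names and the choice `k=2` are assumptions made for this example.

```python
# Toy ratings: ratings[item][user] = rating (hypothetical data).
ratings = {
    "A": {"u1": 10, "u2": 8, "u3": 6},
    "B": {"u1": 9, "u2": 8},
    "C": {"u1": 2, "u3": 9},
}

def msd_similarity(item_i, item_j):
    """1 / (msd + 1), where msd is the mean squared rating difference
    over the users who rated both items; 0 if there are no such users."""
    common = ratings[item_i].keys() & ratings[item_j].keys()
    if not common:
        return 0.0
    msd = sum((ratings[item_i][u] - ratings[item_j][u]) ** 2
              for u in common) / len(common)
    return 1.0 / (msd + 1.0)

def predict(user, item, k=2):
    """Similarity-weighted mean of the user's ratings on the k items
    most similar to `item` (item-based neighbourhood)."""
    neighbors = [
        (msd_similarity(item, other), r[user])
        for other, r in ratings.items()
        if other != item and user in r
    ]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return None  # no usable neighbours in this toy version
    return sum(sim * r for sim, r in neighbors) / total_sim

sim_ab = msd_similarity("A", "B")   # items rated almost identically
sim_bc = msd_similarity("B", "C")   # items rated very differently
estimate = predict("u3", "B")
print(sim_ab, sim_bc, estimate)
```

Because item A is far more similar to B than C is, the estimate for user `u3` on item B stays close to that user's rating of A.
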
In [29]:
%%time
# Build and train SVD
svd = SVD()
svd.fit(trainset)

CPU times: user 8.37 s, sys: 4.92 ms, total: 8.37 s
Wall time: 8.37 s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fb252e02eb8>

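The `SVD` algorithm in Surprise is, according to its documentation, a matrix factorization model that estimates a rating as the global mean plus a user bias, an item bias and the dot product of a user and an item factor vector, trained by stochastic gradient descent. The sketch below is a minimal pure-Python version of that idea on hypothetical toy data; the hyperparameter values and all names are assumptions for this example and it is not the library's code.

```python
import math
import random

random.seed(0)

# Toy (user, item, rating) triples; hypothetical data, not from the dataset.
triples = [("u1", "A", 10), ("u1", "B", 9), ("u2", "A", 8),
           ("u2", "C", 3), ("u3", "C", 9)]

# Illustrative hyperparameters (assumed values chosen for this toy example).
n_factors, lr, reg, n_epochs = 2, 0.005, 0.02, 200

mu = sum(r for _, _, r in triples) / len(triples)   # global mean rating
users = sorted({u for u, _, _ in triples})
items = sorted({i for _, i, _ in triples})
b_u = {u: 0.0 for u in users}                        # user biases
b_i = {i: 0.0 for i in items}                        # item biases
p = {u: [random.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
q = {i: [random.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

def predict(u, i):
    """Estimate = global mean + user bias + item bias + dot(q_i, p_u)."""
    return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

def rmse():
    """Root mean squared error on the training triples."""
    return math.sqrt(
        sum((r - predict(u, i)) ** 2 for u, i, r in triples) / len(triples))

rmse_before = rmse()
for _ in range(n_epochs):
    for u, i, r in triples:
        err = r - predict(u, i)
        # Gradient steps on the regularized squared error.
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        for f in range(n_factors):
            puf, qif = p[u][f], q[i][f]
            p[u][f] += lr * (err * qif - reg * puf)
            q[i][f] += lr * (err * puf - reg * qif)
rmse_after = rmse()
print(round(rmse_before, 3), round(rmse_after, 3))
```

Once trained, a prediction costs only a few additions and one short dot product, whereas a neighbourhood method has to search for neighbours for every requested pair.
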
In [30]:
# All (user, anime) pairs that are NOT in the training set
testset = trainset.build_anti_testset()

# The function below is copied from the Surprise documentation at
# http://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-top-n-recommendations-for-each-user

def get_top_n(predictions, n=3):

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
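
To make the behaviour of `get_top_n` concrete, the following sketch repeats the function so that it runs on its own and feeds it a handful of hypothetical predictions. Plain tuples of the form `(uid, iid, true_r, est, details)` are used in place of Surprise's prediction objects, which unpack the same way in the loop.

```python
from collections import defaultdict

def get_top_n(predictions, n=3):
    """Same logic as the function above, repeated so this sketch runs alone."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

# Hypothetical predictions: (uid, iid, true_r, est, details).
predictions = [
    (1, 101, None, 7.2, {}),
    (1, 102, None, 9.1, {}),
    (1, 103, None, 8.4, {}),
    (2, 101, None, 6.0, {}),
    (2, 103, None, 6.5, {}),
]

top = get_top_n(predictions, n=2)
print(dict(top))
# {1: [(102, 9.1), (103, 8.4)], 2: [(103, 6.5), (101, 6.0)]}
```

Each user keeps only the `n` items with the highest estimated rating, ordered from best to worst.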

In [31]:
# Print the recommended items for each user.
# The argument max_lines limits the number of printed lines. 0 = no limit.
def print_top_n(top_n, max_lines=0):
    remaining = max_lines
    for uid, user_ratings in top_n.items():
        # Look up each recommended anime's name by its id.
        print(uid, [df.query('anime_id == ' + str(iid))['name'].values[0] for (iid, _) in user_ratings])
        if max_lines > 0:
            remaining -= 1
            if remaining == 0:
                return

In [32]:
%%time
predictions = knn.test(testset)
knn_top_n = get_top_n(predictions, n=3)

CPU times: user 9min 39s, sys: 331 ms, total: 9min 39s
Wall time: 9min 39s


In [33]:
%%time
predictions = svd.test(testset)
svd_top_n = get_top_n(predictions, n=3)

CPU times: user 3.44 s, sys: 69.4 ms, total: 3.51 s
Wall time: 3.44 s


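SVD is significantly faster than KNN: producing predictions for the anti-testset took 9 min 39 s with KNN but only 3.44 s with SVD (wall time), and training took 13.6 s versus 8.37 s.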
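
The prediction step is the expensive part because the anti-testset contains every (user, anime) pair that is not in the training set. The sketch below builds such a set for hypothetical toy data and then applies the same counting formula to the sizes printed after filtering. It assumes that each (user, anime) pair occurs at most once in the ratings, which the notebook does not verify.

```python
# Toy training ratings as (user, item) pairs; hypothetical data.
rated = {("u1", "A"), ("u1", "B"), ("u2", "A"), ("u3", "C")}
users = sorted({u for u, _ in rated})
items = sorted({i for _, i in rated})

# Every (user, item) pair the user has NOT rated: the pairs to predict.
anti = [(u, i) for u in users for i in items if (u, i) not in rated]

# The size follows directly from the counts.
expected_size = len(users) * len(items) - len(rated)
print(len(anti), expected_size)

# Same formula with the counts printed after filtering
# (assumes no duplicate (user, anime) pairs).
n_users, n_items, n_ratings = 35, 8420, 67781
n_predictions = n_users * n_items - n_ratings
print(n_predictions)  # 226919
```

Under that assumption both algorithms had to produce roughly 227 thousand estimates, so the difference in wall time reflects the cost of a single prediction.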

In [34]:
print("Recommendations by KNN:")
print_top_n(knn_top_n, 20)

Recommendations by KNN:
13954 ['Tokimeki Tonight', 'Sougen no Ko Tenguri', 'Perrine Monogatari']
17033 ['Futatsu no Kurumi', 'Pokemon the Movie XY&amp;Z: Volcanion to Karakuri no Magiana', 'Ojamajo Doremi OVA']
49662 ['Soukou Kihei Votoms: Red Shoulder Document - Yabou no Roots', 'gdgd Fairies 2 Episode 0', 'Dededen']
57620 ['Slow Step', 'Blue Dragon: Tenkai no Shichi Ryuu', 'gdgd Fairies 2 Episode 0']
65840 ['Pokemon XY&amp;Z', 'Detective Conan Movie 20: The Darkest Nightmare', 'Go! Princess Precure']
67348 ['Happiness Charge PreCure! Movie: Ningyou no Kuni no Ballerina', 'Time Bokan Series: Yattodetaman', 'Suite Precure♪ Movie: Torimodose! Kokoro ga Tsunaku Kiseki no Melody♪']
1530 ['Midnight Eye: Gokuu', 'Detective Conan: Black History 2', 'Paul no Miracle Daisakusen']
7345 ['Detective Conan Movie 20: The Darkest Nightmare', 'Dr. Slump Movie 08: Arale-chan Hoyoyo!! Tasuketa Same ni Tsurerarete...', 'Hibike! Euphonium Movie: Kitauji Koukou Suisougaku-bu e Youkoso']
9032 ['Crusher Joe

In [35]:
print("Recommendations by SVD:")
print_top_n(svd_top_n, 20)

Recommendations by SVD:
13954 ['Planetes', 'Great Teacher Onizuka', 'Kara no Kyoukai 7: Satsujin Kousatsu (Kou)']
17033 ['Ginga Eiyuu Densetsu', 'Evangelion: 2.0 You Can (Not) Advance', 'Evangelion: 1.0 You Are (Not) Alone']
49662 ['Death Note', 'Chihayafuru 2', 'Skip Beat!']
57620 ['Tsumiki no Ie', 'Nodame Cantabile Finale', 'Daicon Opening Animations']
65840 ['Monster', 'Hotaru no Haka', 'Baccano!']
67348 ['Gintama&#039;', 'Fullmetal Alchemist: Brotherhood', 'Monster']
1530 ['Code Geass: Hangyaku no Lelouch R2', 'Kingdom 2nd Season', 'Gin no Saji 2nd Season']
7345 ['Kuroko no Basket 3rd Season', 'Kimi no Na wa.', 'Noragami Aragoto']
9032 ['Gintama&#039;', 'Ginga Eiyuu Densetsu', 'Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare']
11536 ['Monster', 'Gintama', 'Kara no Kyoukai 5: Mujun Rasen']
12431 ['Monster', 'Kara no Kyoukai 7: Satsujin Kousatsu (Kou)', 'Gintama&#039;']
22434 ['Monster', 'Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare', 'Toki wo Kakeru Shoujo']
23247 ['Monst