## Anime recommendations assignment
Marja Satukangas 2.3.2021

Task:

Load the public domain Anime dataset from the original location
(https://www.kaggle.com/CooperUnion/anime-recommendations-database/version/1).

This assignment is exploratory in nature. Your task is to explore the applicability of scikit-surprise for building a recommendation engine for the Anime dataset.

The questions of interest include:
1. What kind of preprocessing is necessary for the ratings dataset?
2. How do the recommendation algorithms (e.g. KNN and SVD) perform with a data set of this magnitude? Do you encounter hardware limitations? If yes, how can you circumvent some of the limitations to be able to carry on with the experiment?
3. Can you combine the information in the two files in a meaningful way to have the recommender display the titles of the recommended movies?

## Answers to the questions

1. In the ratings dataset, rating values range from -1 to 10. A rating of -1 means that the user watched the anime but did not rate it. I removed the rows with a rating of -1, which leaves explicit ratings from 1 to 10.

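2. I ran the KNN algorithm (`KNNBasic`) and encountered hardware limitations. I worked around this by setting `'user_based': False` in `sim_options`, so that similarities are computed between items rather than between users. The item-item similarity matrix is smaller than the user-user one because the data contains far fewer anime than users.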

3. In the last cell I wrote a `show_recommendations` function: the user gives the name of an anime and the system returns the titles of similar anime.
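
To make the hardware limitation in answer 2 more concrete, here is a rough back-of-the-envelope sketch. It assumes the similarity matrix is stored as a dense n x n array of 8-byte floats (an assumption about the storage layout, not a measurement), and it uses the largest `user_id` and `anime_id` reported by `ratings.describe()` as upper bounds on the number of users and items. The helper name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(n: int, bytes_per_entry: int = 8) -> float:
    """Size in GiB of a dense n x n matrix with the given entry width."""
    return n * n * bytes_per_entry / 2**30


# Upper bounds taken from ratings.describe(): the largest user_id and anime_id.
# The number of distinct users and items can be at most this large.
n_users_max = 73516
n_items_max = 34519

user_based_gib = dense_matrix_gib(n_users_max)
item_based_gib = dense_matrix_gib(n_items_max)

print(f"user-user matrix: up to {user_based_gib:.1f} GiB")
print(f"item-item matrix: up to {item_based_gib:.1f} GiB")
```

Under these assumptions a single user-user matrix would not fit in the memory of a typical laptop, while the item-item matrix is several times smaller, and smaller still if many anime IDs are unused.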

In [32]:
import pandas as pd
import numpy as np
from collections import defaultdict
from surprise import Reader
from surprise import KNNBasic
from surprise import Dataset
from surprise.model_selection import cross_validate

In [33]:
anime = pd.read_csv("C:/Users/Marja/Downloads/anime.csv")
ratings = pd.read_csv("C:/Users/Marja/Downloads/rating.csv")
df = pd.DataFrame(ratings)
df.tail(20)

,user_id,anime_id,rating
7813717,73515,11759,8
7813718,73515,11837,9
7813719,73515,12031,8
7813720,73515,12113,10
7813721,73515,12115,10
7813722,73515,12293,8
7813723,73515,12413,9
7813724,73515,12445,8
7813725,73515,12461,7
7813726,73515,12967,7


In [60]:
df_anime = pd.DataFrame(anime)
df_anime.head()

,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [34]:
ratings.describe()

,user_id,anime_id,rating
count,7813737.0,7813737.0,7813737.0
mean,36727.96,8909.072,6.14403
std,20997.95,8883.95,3.7278
min,1.0,1.0,-1.0
25%,18974.0,1240.0,6.0
50%,36791.0,6213.0,7.0
75%,54757.0,14093.0,9.0
max,73516.0,34519.0,10.0


In [35]:
# Drop rows where the user watched the anime but gave no rating (-1)
ratings_df = df[df['rating'] != -1].copy()
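
The filtering step can be tried out on a small toy frame. The sketch below uses made-up values with the same column names as `rating.csv`. The second step, keeping only users with at least `min_ratings` explicit ratings, is an assumption added for illustration: it is one possible way to shrink the data further if memory is still a problem, and it was not part of the original experiment.

```python
import pandas as pd

# Toy stand-in for rating.csv (hypothetical values, same column names).
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "anime_id": [10, 20, 30, 10, 20, 10],
    "rating":   [-1, 8, 9, 7, -1, 10],
})

# Step 1: drop "watched but not rated" rows, as in the cell above.
rated = toy[toy["rating"] != -1].copy()

# Step 2 (assumption, not in the original notebook): keep only users with
# at least `min_ratings` explicit ratings to reduce the data size further.
min_ratings = 2
ratings_per_user = rated.groupby("user_id")["rating"].transform("count")
active = rated[ratings_per_user >= min_ratings]

print(active)
```

Only user 1 remains in `active` here, because users 2 and 3 each have a single explicit rating left after step 1.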

In [36]:
ratings_df

,user_id,anime_id,rating
47,1,8074,10
81,1,11617,10
83,1,11757,10
101,1,15451,10
153,2,11771,10
...,...,...,...
7813732,73515,16512,7
7813733,73515,17187,9
7813734,73515,22145,10
7813735,73516,790,9


In [37]:
ratings_df['rating'].unique()

array([10,  8,  6,  9,  7,  3,  5,  4,  1,  2], dtype=int64)

In [38]:
ratings_df['user_id'].unique()

array([    1,     2,     3, ..., 73514, 73515, 73516], dtype=int64)

In [39]:
ratings_df.isna().sum()

user_id     0
anime_id    0
rating      0
dtype: int64

In [40]:
# Construct reader
reader = Reader(rating_scale=(1, 10))

# Generate surprise Dataset
data = Dataset.load_from_df(ratings_df[['user_id', 'anime_id', 'rating']], reader)

In [42]:
# Set all data as training set
trainset = data.build_full_trainset()

# Build and train an algorithm.

sim_options = {
    'user_based': False  # compute similarities between items
}

algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x2054c79d160>
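
The log line above refers to the MSD (mean squared difference) similarity. The following plain-Python sketch shows how such a similarity can be computed for two items, following the formula in the Surprise documentation, 1 / (msd + 1), where msd is the mean squared difference over the users who rated both items. The function `msd_similarity` and the toy ratings are hypothetical and only illustrate the idea; this is not the library's implementation.

```python
def msd_similarity(ratings_a: dict, ratings_b: dict, min_support: int = 1) -> float:
    """MSD-based similarity between two items.

    ratings_a and ratings_b map user id -> rating for each item.
    Returns 0.0 when fewer than `min_support` users rated both items.
    """
    common = ratings_a.keys() & ratings_b.keys()
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_a[u] - ratings_b[u]) ** 2 for u in common) / len(common)
    return 1.0 / (msd + 1.0)


# Hypothetical ratings: two widely rated items and one item rated by one user.
popular_a = {1: 9, 2: 8, 3: 10, 4: 7}
popular_b = {1: 8, 2: 9, 3: 9, 4: 8}
obscure = {3: 10}

print(msd_similarity(popular_a, popular_b))               # 0.5
print(msd_similarity(popular_a, obscure))                 # 1.0
print(msd_similarity(popular_a, obscure, min_support=2))  # 0.0
```

Note that a single shared user with identical ratings already yields the maximum similarity of 1.0. Surprise's `sim_options` accepts a `min_support` key that sets the similarity to zero when there are too few common raters, which may be worth trying when rarely rated items dominate the neighbor lists.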

In [46]:
# Sample prediction
user_id = 73510
item_id = 22145

pred = algo.predict(user_id, item_id, verbose=True)

user: 73510      item: 22145      r_ui = None   est = 7.54   {'actual_k': 40, 'was_impossible': False}
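
The estimate above is, roughly, a similarity-weighted average over the nearest neighbors that the user has rated (`actual_k` reports how many were used). A minimal sketch of that idea, with a hypothetical helper `knn_basic_estimate` and made-up similarity/rating pairs, could look like this. It is an illustration of the formula, not the library's code.

```python
def knn_basic_estimate(neighbors, k=40):
    """Similarity-weighted average of the k most similar neighbors' ratings.

    neighbors: list of (similarity, rating) pairs for items the user has rated.
    """
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    # Only neighbors with positive similarity contribute to the estimate.
    used = [(sim, rating) for sim, rating in top_k if sim > 0]
    total_sim = sum(sim for sim, _ in used)
    if total_sim == 0:
        raise ValueError("no neighbor with positive similarity")
    return sum(sim * rating for sim, rating in used) / total_sim


# Hypothetical (similarity, rating) pairs, for illustration only.
example = [(1.0, 9), (0.5, 6), (0.5, 6), (0.0, 1)]

print(knn_basic_estimate(example))       # 7.5
print(knn_basic_estimate(example, k=1))  # 9.0
```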


In [55]:
# get_neighbors expects an inner ID (not a raw anime_id) and returns inner IDs
neighbors = algo.get_neighbors(784, k=10)
neighbors

[366, 2774, 3374, 3376, 3384, 3385, 3525, 4222, 4243, 4434]

In [69]:
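def show_recommendations(movie):
    """Return the titles of the 10 nearest neighbors of the given anime title."""
    recommendations = []
    # Look up the raw anime_id in the anime table (df holds the ratings, not the titles)
    raw_id = df_anime.loc[df_anime['name'] == movie, 'anime_id'].item()
    # Surprise works with inner IDs, so convert before and after get_neighbors
    inner_id = trainset.to_inner_iid(raw_id)
    neighbors = algo.get_neighbors(inner_id, k=10)
    for i in neighbors:
        neighbor_raw_id = trainset.to_raw_iid(i)
        recommendations.append(df_anime.loc[df_anime['anime_id'] == neighbor_raw_id, 'name'].item())
    return recommendations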

In [70]:
recommendations = show_recommendations('Fullmetal Alchemist: Brotherhood')
recommendations

['Super Bikkuriman',
 'Hello Kitty no Papa Nante Daikirai',
 'Hello Kitty no Suteki na Kyoudai',
 'Hello Kitty no Minna no Mori wo Mamore!',
 'Susie-chan to Marvy',
 'Sobakasu Pucchi',
 'Hulu Xiongdi',
 'Ahiru no Pekkle no Suieitaikai wa Oosawagi',
 'Qin Shiming Yue Zhi: Zhu Zi Bai Jia',
 'Qin Shiming Yue Zhi: Ye Jin Tianming']
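
Some titles in `anime.csv` contain HTML entities (for example `Gintama&#039;` in the `df_anime.head()` output), so an exact-match lookup by title can fail for a name typed with a plain apostrophe. A possible variation, sketched below under the assumption that decoding the entities once up front is acceptable, builds plain dictionaries for the ID-to-title and title-to-ID lookups. The names `id_to_name`, `name_to_id` and `titles_for` are hypothetical.

```python
import html

import pandas as pd

# Toy stand-in for anime.csv (two rows copied from the df_anime.head() output).
toy_anime = pd.DataFrame({
    "anime_id": [9253, 9969],
    "name": ["Steins;Gate", "Gintama&#039;"],
})

# Decode HTML entities once, then build plain dicts for both directions.
clean_names = toy_anime["name"].map(html.unescape)
id_to_name = dict(zip(toy_anime["anime_id"].tolist(), clean_names.tolist()))
name_to_id = {name: anime_id for anime_id, name in id_to_name.items()}


def titles_for(raw_ids):
    """Translate raw anime IDs to titles, skipping IDs missing from the table."""
    return [id_to_name[i] for i in raw_ids if i in id_to_name]


print(name_to_id["Gintama'"])
print(titles_for([9969, 9253, 123456]))
```

Dictionary lookups also avoid scanning the whole DataFrame for every neighbor, and skipping unknown IDs keeps the function from raising when a rated anime has no row in `anime.csv`.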