# Anime Recommendation System using Nearest Neighbors

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# Load the datasets

* Two datasets will be loaded into dataframes.
* The datasets can be downloaded from https://www.kaggle.com/CooperUnion/anime-recommendations-database

In [2]:
anime = pd.read_csv('datasets/anime.csv')
rating = pd.read_csv('datasets/rating.csv')

### anime.csv

* anime_id - myanimelist.net's unique id identifying an anime.
* name - full name of anime.
* genre - comma separated list of genres for this anime.
* type - movie, TV, OVA, etc.
* episodes - how many episodes in this show. (1 if movie).
* rating - average rating out of 10 for this anime.
* members - number of community members that are in this anime's "group".

### rating.csv

* user_id - non identifiable randomly generated user id.
* anime_id - the anime that this user has rated.
* rating - rating out of 10 this user has assigned (-1 if the user watched it but didn't assign a rating).

In [3]:
print('anime.csv (shape):',anime.shape)
print('rating.csv (shape):',rating.shape)

anime.csv (shape): (12294, 7)
rating.csv (shape): (7813737, 3)


In [4]:
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [5]:
rating.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [6]:
# checking for null values

anime.isnull().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [7]:
# filling all anime without rating with 0

anime.fillna({'rating':0},inplace=True)

Exploratory data analysis is in the other notebook (Anime Recommendation using Pearson r correlation).

# Collaborative Filtering using Nearest Neighbors

<br>

```
* In this recommendation system, we will use the collaborative filtering technique.
* With this technique, the system recommends the anime whose ratings across users are closest to the 
  ratings of the user's chosen anime.
* For example, I watched 10 anime and gave each of them a rating. My friend has watched one anime from my 
  list and asks me to recommend three more. I would recommend the three anime whose ratings are closest 
  to the rating I gave the anime that my friend watched.
```

### Process

<br>

```
* Remove anime with a low count of ratings and users who gave a low count of ratings
* Construct the rating matrix
* Convert the rating matrix to a CSR matrix to save memory
* Fit the nearest neighbors model on the CSR rating matrix
* Retrieve the ten nearest neighbors
* Output ten recommended anime
```
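The steps above can be sketched end to end on a tiny, made-up dataset. This is only an illustration: the `toy_*` names and the ratings are hypothetical, and the low-count filtering step is skipped because the data is so small.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy ratings: 4 users and 4 anime (ids 10, 20, 30, 40).
toy_rating = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'anime_id': [10, 20, 30, 10, 20, 40, 30, 40, 10, 20],
    'rating':   [9, 8, 2, 10, 9, 1, 9, 8, 8, 9],
})

# Rating matrix: one row per anime, one column per user, 0 where unrated.
toy_matrix = toy_rating.pivot_table(
    index='anime_id', columns='user_id', values='rating').fillna(0)

# Sparse copy of the matrix, then fit the nearest neighbors model on it.
toy_csr = csr_matrix(toy_matrix.values)
toy_model = NearestNeighbors(metric='cosine').fit(toy_csr)

# Query with the row of anime 10; the closest hit is anime 10 itself.
query = toy_matrix.loc[[10]].values
distances, indices = toy_model.kneighbors(query, n_neighbors=3)

# Map matrix positions back to anime ids and drop the query anime.
neighbor_ids = toy_matrix.index[indices[0]][1:]
print(list(neighbor_ids))
```

Anime 10 and 20 are rated highly by the same users, so anime 20 comes out as the closest neighbor.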

<br>

![collaborative-filtering](images/collaborative-filtering.png)

### Remove anime with low count of ratings and users who gave low count of ratings

* We will only consider popular anime (rating count over 250) and active users (more than 100 ratings across different anime).
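As a compact illustration of the same filtering idea, the sketch below applies both thresholds in one pass with `groupby(...).transform('count')`. The toy data and the small threshold values are made up so the effect is visible; the notebook itself uses 250 and 100.

```python
import pandas as pd

# Hypothetical toy ratings.
toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 3],
    'anime_id': [10, 20, 30, 10, 20, 10],
    'rating':   [8, 7, 9, 6, -1, 10],
})

MIN_ANIME_RATINGS = 2  # illustrative only; the notebook uses 250
MIN_USER_RATINGS = 1   # illustrative only; the notebook uses 100

# For every row, how many ratings its anime and its user have in total.
# Both counts come from the unfiltered data, as in the notebook.
anime_counts = toy.groupby('anime_id')['rating'].transform('count')
user_counts = toy.groupby('user_id')['rating'].transform('count')

# Keep rows whose anime AND user both exceed their thresholds.
kept = toy[(anime_counts > MIN_ANIME_RATINGS) & (user_counts > MIN_USER_RATINGS)]
print(kept)
```

Note that `count` includes the `-1` rows (watched but not rated), which matches how the counts are computed in the cells that follow.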

In [8]:
anime_rating_count = rating.groupby(by='anime_id').count()['rating'].reset_index().rename(columns={'rating':'rating_count'})
anime_rating_count['rating_count'].describe()

count    11200.000000
mean       697.655089
std       2028.627749
min          1.000000
25%          5.000000
50%         51.500000
75%        385.250000
max      39340.000000
Name: rating_count, dtype: float64

In [9]:
filtered_anime = anime_rating_count[anime_rating_count['rating_count']>250]

In [10]:
# anime with over 250 rating count

filtered_anime.head()

Unnamed: 0,anime_id,rating_count
0,1,15509
1,5,6927
2,6,11077
3,7,2629
4,8,413


In [11]:
user_rating_count = rating.groupby(by='user_id').count()['rating'].reset_index().rename(columns={'rating':'rating_count'})
user_rating_count['rating_count'].describe()

count    73515.000000
mean       106.287656
std        153.086558
min          1.000000
25%         18.000000
50%         57.000000
75%        136.000000
max      10227.000000
Name: rating_count, dtype: float64

In [12]:
# users who gave over 100 ratings to different anime

filtered_user = user_rating_count[user_rating_count['rating_count']>100]

In [13]:
filtered_user.head()

Unnamed: 0,user_id,rating_count
0,1,153
4,5,467
6,7,343
10,11,112
12,13,174


In [14]:
filtered_rating_anime = rating[rating['anime_id'].isin(filtered_anime['anime_id'])]
filtered_rating = filtered_rating_anime[filtered_rating_anime['user_id'].isin(filtered_user['user_id'])]

In [15]:
# this dataset now contains popular anime and users with high rating counts

filtered_rating.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


### Construct Rating Matrix

* We will construct the matrix with a pivot table, using anime ids as the index and user ids as the columns.

In [16]:
# we can see that most of the values are zero since most users have not rated every anime

rating_matrix = filtered_rating.pivot_table(index='anime_id',columns='user_id',values='rating').fillna(0)
print(rating_matrix.shape)
rating_matrix.head()

(3318, 24676)


anime_id / user_id,1,5,7,11,13,14,17,21,29,35,...,73494,73495,73499,73500,73502,73503,73504,73507,73510,73515
1,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,9.0,0.0,0.0,...,0.0,10.0,9.0,0.0,0.0,9.0,10.0,9.0,0.0,10.0
5,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,9.0,0.0,0.0,...,0.0,8.0,0.0,0.0,0.0,7.0,10.0,8.0,0.0,10.0
6,0.0,8.0,0.0,0.0,-1.0,0.0,7.0,0.0,0.0,0.0,...,9.0,-1.0,9.0,0.0,0.0,9.0,9.0,9.0,0.0,10.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,7.0,0.0,0.0,9.0,0.0,7.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
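The matrix above keeps the `-1` placeholder ("watched but not rated") as if it were a real score, which is visible in the output. One possible variation, which is not what this notebook does, is to treat `-1` as missing before pivoting. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; -1 means "watched but not rated".
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 2],
    'anime_id': [10, 20, 10, 20],
    'rating':   [9, -1, -1, 7],
})

# Turn the -1 placeholder into NaN so it is not treated as a score.
explicit = toy.assign(rating=toy['rating'].replace(-1, np.nan))

# Pivot as before; unrated cells (including the former -1s) become 0.
explicit_matrix = explicit.pivot_table(
    index='anime_id', columns='user_id', values='rating').fillna(0)
print(explicit_matrix)
```

With this variation, an anime or user whose ratings are all `-1` would drop out of the matrix entirely, so the resulting shape and neighbors could differ from the ones shown in this notebook.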


### Convert rating matrix to csr matrix to save memory

In [17]:
from scipy.sparse import csr_matrix
csr_rating_matrix =  csr_matrix(rating_matrix.values)
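To get a feel for why the CSR format saves memory, the sketch below compares the footprint of a mostly-zero dense array with its CSR copy. The array is randomly generated for illustration and is not the notebook's rating matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical mostly-zero matrix: 1000 x 2000 with at most 20000 non-zero cells.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 2000))
rows = rng.integers(0, 1000, size=20000)
cols = rng.integers(0, 2000, size=20000)
dense[rows, cols] = rng.integers(1, 11, size=20000)

sparse = csr_matrix(dense)

# The dense array stores every cell; CSR stores only the non-zero values
# plus their column indices and one row pointer per row.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
density = sparse.nnz / (dense.shape[0] * dense.shape[1])

print(f'density: {density:.4f}')
print(f'dense:  {dense_bytes} bytes')
print(f'sparse: {sparse_bytes} bytes')
```

The same three attributes (`data`, `indices`, `indptr`) can be inspected on `csr_rating_matrix` to estimate its size.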

In [18]:
print(csr_rating_matrix)

  (0, 4)	-1.0
  (0, 7)	9.0
  (0, 12)	10.0
  (0, 14)	10.0
  (0, 15)	10.0
  (0, 16)	-1.0
  (0, 21)	10.0
  (0, 32)	9.0
  (0, 33)	10.0
  (0, 36)	10.0
  (0, 39)	9.0
  (0, 40)	8.0
  (0, 44)	7.0
  (0, 45)	8.0
  (0, 49)	10.0
  (0, 51)	8.0
  (0, 52)	10.0
  (0, 55)	8.0
  (0, 56)	9.0
  (0, 62)	10.0
  (0, 64)	9.0
  (0, 65)	-1.0
  (0, 67)	-1.0
  (0, 69)	7.0
  (0, 71)	-1.0
  :	:
  (3317, 24343)	9.0
  (3317, 24345)	-1.0
  (3317, 24359)	10.0
  (3317, 24383)	-1.0
  (3317, 24385)	10.0
  (3317, 24391)	8.0
  (3317, 24392)	-1.0
  (3317, 24403)	8.0
  (3317, 24423)	-1.0
  (3317, 24426)	8.0
  (3317, 24429)	8.0
  (3317, 24444)	-1.0
  (3317, 24450)	9.0
  (3317, 24459)	8.0
  (3317, 24468)	7.0
  (3317, 24469)	8.0
  (3317, 24471)	-1.0
  (3317, 24480)	9.0
  (3317, 24493)	-1.0
  (3317, 24541)	8.0
  (3317, 24546)	-1.0
  (3317, 24557)	8.0
  (3317, 24579)	10.0
  (3317, 24583)	8.0
  (3317, 24631)	9.0


### Fit the matrix into nearest neighbor

* We are using the unsupervised nearest neighbors algorithm.
* This algorithm finds the k nearest data points, which will be the recommended anime to watch.
* We use cosine distance (1 - cosine similarity) as the metric for the algorithm.
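As a small worked example of the metric: cosine distance compares the direction of two rating vectors and ignores their magnitude, so it is 0 for vectors pointing the same way and 1 for vectors with no overlap. The vectors and the helper `cosine_distance` below are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical rating vectors over three users.
a = np.array([[10.0, 8.0, 0.0]])
b = np.array([[5.0, 4.0, 0.0]])   # same direction as a, half the magnitude
c = np.array([[0.0, 0.0, 9.0]])   # rated by a different user only

def cosine_distance(u, v):
    """1 - cosine similarity of two vectors (illustrative helper)."""
    u, v = u.ravel(), v.ravel()
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance(a, b))  # approximately 0: same direction
print(cosine_distance(a, c))  # approximately 1: no overlap

# NearestNeighbors with metric='cosine' ranks by the same quantity.
model = NearestNeighbors(metric='cosine').fit(np.vstack([b, c]))
dist, idx = model.kneighbors(a, n_neighbors=2)
print(idx, dist)
```

Because magnitude is ignored, an anime rated by few users can still be a close neighbor of a very popular one if the same users rated both in similar proportions.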

In [19]:
from sklearn.neighbors import NearestNeighbors

recommender = NearestNeighbors(metric='cosine')
# fit the csr matrix to the algorithm
recommender.fit(csr_rating_matrix)

NearestNeighbors(metric='cosine')

### Retrieve ten nearest neighbors

In [20]:
# getting the anime_id of the user's anime

user_anime = anime[anime['name']=='Bleach']
user_anime

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
582,269,Bleach,"Action, Comedy, Shounen, Super Power, Supernat...",TV,366,7.95,624055


In [21]:
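# take the single anime_id as a scalar (calling int() on a Series is deprecated in recent pandas)
user_anime_index = np.where(rating_matrix.index==user_anime['anime_id'].iloc[0])[0][0]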

# this index is from rating matrix not from the anime dataset
user_anime_index

223

In [22]:
# getting the ratings based on the index

user_anime_ratings = rating_matrix.iloc[user_anime_index]
user_anime_ratings

user_id
1        0.0
5        3.0
7        0.0
11       0.0
13       0.0
        ... 
73503    0.0
73504    0.0
73507    0.0
73510    0.0
73515    8.0
Name: 269, Length: 24676, dtype: float64

In [23]:
# we need to convert this into 2d array (with only 1 row) since the algorithm does not accept 1d array

user_anime_ratings_reshaped = user_anime_ratings.values.reshape(1,-1)
user_anime_ratings_reshaped

array([[0., 3., 0., ..., 0., 0., 8.]])

In [24]:
# query the model with the ratings; it returns the distances and indices of the 11 nearest neighbors
# note that these indices are positions in the rating matrix

distances, indices = recommender.kneighbors(user_anime_ratings_reshaped,n_neighbors=11)

In [25]:
# indices of nearest neighbors (based on rating matrix)

indices

array([[ 223,   10, 1731, 1878, 2041, 1503, 2262,  914, 1543, 2559, 1244]],
      dtype=int64)

In [26]:
# distances of nearest neighbors to the user's anime

distances

array([[7.83817455e-14, 4.22190466e-01, 4.50830928e-01, 4.53959739e-01,
        4.56147847e-01, 4.68925622e-01, 4.72050450e-01, 4.79615356e-01,
        4.80876928e-01, 4.85142220e-01, 4.86467624e-01]])

###  Output ten recommended anime

In [27]:
# the returned indices are positions in the rating matrix; use them to look up the anime ids (the matrix index)
# these anime are the nearest neighbors
# we exclude the first element since the first nearest neighbor is the anime itself

nearest_neighbors_indices = rating_matrix.iloc[indices[0]].index[1:]

In [28]:
nearest_neighbors = pd.DataFrame({'anime_id': nearest_neighbors_indices})
pd.merge(nearest_neighbors,anime,on='anime_id',how='left')

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,20,Naruto,"Action, Comedy, Martial Arts, Shounen, Super P...",TV,220,7.81,683297
1,6702,Fairy Tail,"Action, Adventure, Comedy, Fantasy, Magic, Sho...",TV,175,8.22,584590
2,8247,Bleach Movie 4: Jigoku-hen,"Action, Comedy, Shounen, Super Power, Supernat...",Movie,1,7.75,94074
3,9919,Ao no Exorcist,"Action, Demons, Fantasy, Shounen, Supernatural",TV,25,7.92,583823
4,4835,Bleach Movie 3: Fade to Black - Kimi no Na wo ...,"Action, Comedy, Shounen, Super Power, Supernat...",Movie,1,7.66,122373
5,11757,Sword Art Online,"Action, Adventure, Fantasy, Game, Romance",TV,25,7.83,893100
6,1535,Death Note,"Mystery, Police, Psychological, Supernatural, ...",TV,37,8.71,1013917
7,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
8,16498,Shingeki no Kyojin,"Action, Drama, Fantasy, Shounen, Super Power",TV,25,8.54,896229
9,2889,Bleach Movie 2: The DiamondDust Rebellion - Mo...,"Action, Adventure, Shounen, Supernatural",Movie,1,7.6,134739
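The lookup, query, and merge cells above could be wrapped into a single helper. The sketch below is one way to do it; the function name `recommend`, its parameters, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recommend(name, anime_df, matrix, model, k=10):
    """Return up to k anime whose rating vectors are closest to `name`."""
    match = anime_df.loc[anime_df['name'] == name, 'anime_id']
    if match.empty or match.iloc[0] not in matrix.index:
        raise KeyError(f'{name!r} is not in the rating matrix')
    anime_id = match.iloc[0]
    # Query with a single-row 2D array, as the model expects.
    query = matrix.loc[[anime_id]].values
    k = min(k, len(matrix) - 1)
    _, idx = model.kneighbors(query, n_neighbors=k + 1)
    # Drop the query anime by id instead of assuming it is always first.
    neighbor_ids = [i for i in matrix.index[idx[0]] if i != anime_id][:k]
    return pd.DataFrame({'anime_id': neighbor_ids}).merge(
        anime_df, on='anime_id', how='left')

# Hypothetical toy data standing in for `anime`, `rating_matrix` and `recommender`.
toy_anime = pd.DataFrame({'anime_id': [10, 20, 30], 'name': ['A', 'B', 'C']})
toy_matrix = pd.DataFrame(
    [[9.0, 10.0, 0.0], [8.0, 9.0, 0.0], [0.0, 1.0, 9.0]],
    index=pd.Index([10, 20, 30], name='anime_id'),
    columns=pd.Index([1, 2, 3], name='user_id'),
)
toy_model = NearestNeighbors(metric='cosine').fit(csr_matrix(toy_matrix.values))

result = recommend('A', toy_anime, toy_matrix, toy_model, k=2)
print(result)
```

With the notebook's objects, the equivalent call would be `recommend('Bleach', anime, rating_matrix, recommender)`.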


# Saving the model

In [29]:
import pickle

In [30]:
# use a context manager so the file handle is closed after writing
with open('output/nearest_neighbor_recommender.pickle','wb') as f:
    pickle.dump(recommender,f)

In [31]:
from scipy.sparse import save_npz, load_npz
import json

csr_rating_matrix_open = load_npz('output/csr_rating_matrix.npz')

with open('output/rating_matrix_anime_id.json') as f:
    anime_id = json.load(f)
with open('output/rating_matrix_user_id.json') as f:
    user_id = json.load(f)

In [32]:
rating_matrix_open = pd.DataFrame(csr_rating_matrix_open.toarray().T,index=anime_id['anime_id'],columns=user_id['user_id'])

In [33]:
rating_matrix.equals(rating_matrix_open)

True
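The files loaded above are not written anywhere in this notebook. The sketch below shows one way they might have been produced, using a temporary folder and a toy matrix. It assumes the sparse matrix was stored in user-by-anime orientation, since the loading cell transposes it back with `.T`; that orientation and the toy data are assumptions, not something the notebook confirms.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, save_npz, load_npz

# Hypothetical toy stand-in for `rating_matrix` (anime rows, user columns).
toy_matrix = pd.DataFrame(
    [[9.0, 0.0, 8.0], [0.0, 7.0, -1.0]],
    index=pd.Index([10, 20], name='anime_id'),
    columns=pd.Index([1, 2, 3], name='user_id'),
)

out_dir = tempfile.mkdtemp()  # stand-in for the notebook's output/ folder
npz_path = os.path.join(out_dir, 'csr_rating_matrix.npz')
anime_path = os.path.join(out_dir, 'rating_matrix_anime_id.json')
user_path = os.path.join(out_dir, 'rating_matrix_user_id.json')

# Assumption: save the transposed (user-by-anime) matrix plus both id lists.
save_npz(npz_path, csr_matrix(toy_matrix.values.T))
with open(anime_path, 'w') as f:
    json.dump({'anime_id': toy_matrix.index.tolist()}, f)
with open(user_path, 'w') as f:
    json.dump({'user_id': toy_matrix.columns.tolist()}, f)

# Round trip, mirroring the loading cells above.
loaded = load_npz(npz_path)
with open(anime_path) as f:
    anime_id = json.load(f)
with open(user_path) as f:
    user_id = json.load(f)
restored = pd.DataFrame(loaded.toarray().T,
                        index=anime_id['anime_id'], columns=user_id['user_id'])

print(np.array_equal(restored.values, toy_matrix.values))
```

Storing the ids next to the sparse matrix is what allows the labelled dataframe to be rebuilt, since the `.npz` file holds only the numeric values.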