# Recommender System: Anime Recommendation using Content Based Filtering (Kaggle Dataset)

<p align="center">
    <img src="https://cdn.myanimelist.net/s/common/uploaded_files/1444014275-106dee95104209bb9436d6df2b6d5145.jpeg"
         width=800 
    >
</p>

## MyAnimeList Database 2020
> Recommendation data from 320,000 users and 16,000 anime at myanimelist.net

This dataset contains information about 17,562 anime and the preferences of 325,772 different users. In particular, it contains:

- The anime list per user, including entries marked dropped, completed, plan to watch, currently watching, and on hold.
- Ratings given by users to the anime they have watched completely.
- Information about each anime, such as genre, stats, and studio.
- HTML files with anime information for data scraping. These files contain reviews, synopses, staff information, anime statistics, genres, etc.

## Import Libraries

In [5]:
import numpy as np
import pandas as pd

## Import Dataset

In [6]:
df_anime = pd.read_csv('dataset/anime_data.csv')
df_rating = pd.read_csv('dataset/user_rating.csv')

## Data Cleansing

In [7]:
df_anime.info()
print('duplicated data:',df_anime.duplicated().sum())
print('missing value:',df_anime.isna().sum().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17562 entries, 0 to 17561
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   anime_id    17562 non-null  int64 
 1   anime_name  17562 non-null  object
 2   genres      17562 non-null  object
dtypes: int64(1), object(2)
memory usage: 411.7+ KB
duplicated data: 0
missing value: 0


In [8]:
df_rating.info()
print('duplicated data:',df_rating.duplicated().sum())
print('missing value:',df_rating.isna().sum().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359 entries, 0 to 358
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   user_id   359 non-null    int64
 1   anime_id  359 non-null    int64
 2   rating    359 non-null    int64
dtypes: int64(3)
memory usage: 8.5 KB
duplicated data: 0
missing value: 0


- The data types look fine in both datasets.
- Neither dataset has missing values or duplicated rows.

## Data Preprocessing

In [9]:
df_anime.head()

,anime_id,anime_name,genres
0,1,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space"
1,5,Cowboy Bebop: Tengoku no Tobira,"Action, Drama, Mystery, Sci-Fi, Space"
2,6,Trigun,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen"
3,7,Witch Hunter Robin,"Action, Mystery, Police, Supernatural, Drama, ..."
4,8,Bouken Ou Beet,"Adventure, Fantasy, Shounen, Supernatural"


We need to create an Item Feature Matrix, so we convert the "genres" column into one column per genre.

In [10]:
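# Collect every distinct genre name across all anime
genres = set()
for i in df_anime.genres:
    for j in i.split(','):
        genres.add(j.strip())  # strip the space that follows each comma
genres = sorted(genres)  # sorted() already returns a list

print('Every genre in dataset:', genres)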

Every genre in dataset: ['Action', 'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi', 'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei', 'Kids', 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music', 'Mystery', 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai', 'School', 'Sci-Fi', 'Seinen', 'Shoujo', 'Shoujo Ai', 'Shounen', 'Shounen Ai', 'Slice of Life', 'Space', 'Sports', 'Super Power', 'Supernatural', 'Thriller', 'Unknown', 'Vampire', 'Yaoi', 'Yuri']


### Create Item Feature Matrix

In [11]:
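for i in genres:
    # 1 if the genre name occurs in the anime's genre string, else 0.
    # Note: `i in j` is a substring test, so 'Shoujo' also matches 'Shoujo Ai'
    # and 'Shounen' also matches 'Shounen Ai'.
    isIn = []
    for j in df_anime.genres:
        isIn.append(1 if i in j else 0)
    df_anime[i] = isIn
df_IFM = df_anime.drop(columns='genres')  # Item Feature Matrix
df_IFM.head()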

,anime_id,anime_name,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri
0,1,Cowboy Bebop,1,1,0,1,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
1,5,Cowboy Bebop: Tengoku no Tobira,1,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
2,6,Trigun,1,1,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,7,Witch Hunter Robin,1,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
4,8,Bouken Ou Beet,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
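
The `i in j` test in the loop above compares against the raw genre string, so a title tagged only "Shoujo Ai" also receives a 1 in the "Shoujo" column. If exact matches are wanted, one option is to split the string into tokens first. The following is a minimal sketch on a made-up frame (`toy` and its rows are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'anime_id': [1, 2, 3],
    'genres': ['Action, Shounen', 'Shoujo Ai, Romance', 'Shoujo,Comedy'],
})

# One row per (anime, genre) token, with surrounding whitespace removed
tokens = toy['genres'].str.split(',').explode().str.strip()

# One-hot encode the exact tokens, then collapse back to one row per anime
one_hot = pd.get_dummies(tokens).groupby(level=0).max().astype(int)

toy_IFM = toy[['anime_id']].join(one_hot)
print(toy_IFM)
```

With this approach the second row gets a 1 under "Shoujo Ai" only, while the third row gets a 1 under "Shoujo".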


Next, we attach each user's ratings to the Item Feature Matrix by merging it with the rating dataset (df_rating) on anime_id.

In [12]:
df_merge = df_rating.merge(df_anime, on='anime_id').drop(columns=['genres'])  # inner join on the shared key
df_merge.head()

,user_id,anime_id,rating,anime_name,Action,Adventure,Cars,Comedy,Dementia,Demons,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri
0,1,37521,9,Vinland Saga,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,8246,7,Naruto: Shippuuden Movie 4 - The Lost Tower,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
2,1,36949,8,Shokugeki no Souma: San no Sara - Tootsuki Res...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,38408,8,Boku no Hero Academia 4th Season,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
4,1,34599,8,Made in Abyss,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
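
Because `merge` performs an inner join by default, a rating whose anime_id is missing from the anime table would be dropped without any message. A quick sanity check can be sketched with the `validate` and `indicator` parameters of `DataFrame.merge`; the two frames below are hypothetical stand-ins for df_rating and df_anime:

```python
import pandas as pd

# Hypothetical toy frames standing in for df_rating and df_anime
ratings = pd.DataFrame({'user_id': [1, 1, 2],
                        'anime_id': [10, 20, 99],
                        'rating': [9, 7, 8]})
anime = pd.DataFrame({'anime_id': [10, 20, 30],
                      'anime_name': ['A', 'B', 'C']})

# validate raises MergeError if anime_id is not unique on the right side;
# indicator adds a _merge column showing where each row was found
checked = ratings.merge(anime, on='anime_id', how='left',
                        validate='many_to_one', indicator=True)

unmatched = checked[checked['_merge'] == 'left_only']
print(len(unmatched), 'rating(s) have no matching anime')
```

In this toy case the rating for anime_id 99 is reported as unmatched.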


## Recommendation System

### Content Based Filtering

Now we multiply each row of the Item Feature Matrix by the rating the user gave that anime.

In [13]:
df_IFM_rating = df_merge.copy() 
for i in genres:
    df_IFM_rating[i] = df_IFM_rating[i]*df_IFM_rating['rating']
df_IFM_rating.head()

,user_id,anime_id,rating,anime_name,Action,Adventure,Cars,Comedy,Dementia,Demons,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri
0,1,37521,9,Vinland Saga,9,9,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,8246,7,Naruto: Shippuuden Movie 4 - The Lost Tower,7,0,0,7,0,0,...,0,0,0,7,0,0,0,0,0,0
2,1,36949,8,Shokugeki no Souma: San no Sara - Tootsuki Res...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,38408,8,Boku no Hero Academia 4th Season,8,0,0,8,0,0,...,0,0,0,8,0,0,0,0,0,0
4,1,34599,8,Made in Abyss,0,8,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
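
The column-by-column loop can also be written as a single vectorized call. This is a small sketch using `DataFrame.mul` on a hypothetical two-genre frame (`toy` and `toy_genres` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame with two genre columns
toy = pd.DataFrame({'rating': [9, 7],
                    'Action': [1, 0],
                    'Comedy': [1, 1]})
toy_genres = ['Action', 'Comedy']

weighted = toy.copy()
# Multiply every genre column by the row's rating (axis=0 aligns on rows)
weighted[toy_genres] = toy[toy_genres].mul(toy['rating'], axis=0)
print(weighted)
```

Each 1 in a genre column is replaced by that row's rating, and each 0 stays 0.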


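Next, we create a User Feature Vector for each user and collect the vectors in a new dataset, the User Feature Matrix (df_UFM). Each vector holds the user's rating-weighted genre totals divided by their overall sum, so its entries add up to 1.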

In [14]:
userFeatureVectors = [] #User Feature Vector for each users
for i in df_IFM_rating.user_id.unique():
    ifmRatingGenresOnly = df_IFM_rating[df_IFM_rating.user_id == i][genres]
    userFeatureVectors.append(ifmRatingGenresOnly.sum() / ifmRatingGenresOnly.sum().sum())
df_UFM = pd.DataFrame(userFeatureVectors,index=df_IFM_rating.user_id.unique()).reset_index().rename(columns={'index':'user_id'})
df_UFM #User Feature Matrix

,user_id,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,...,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Unknown,Vampire,Yaoi,Yuri
0,1,0.092336,0.049494,0.0,0.104577,0.0,0.004524,0.085418,0.015966,0.047898,...,0.018361,0.0,0.038318,0.0471,0.046567,0.019691,0.0,0.002129,0.0,0.0
1,2,0.123752,0.035429,0.0,0.076347,0.0,0.008483,0.062874,0.035928,0.017465,...,0.008483,0.0,0.017964,0.0499,0.066866,0.017465,0.0,0.008982,0.0,0.0
2,4,0.051884,0.043488,0.0,0.089343,0.0,0.008611,0.114747,0.008396,0.063509,...,0.047147,0.0,0.001507,0.015285,0.056189,0.005382,0.0,0.00732,0.0,0.0
3,7,0.062519,0.042178,0.0,0.113072,0.0,0.013162,0.072689,0.030512,0.059527,...,0.057733,0.0,0.0,0.010171,0.063117,0.007777,0.0,0.002393,0.0,0.0
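
The same normalization can be sketched with `groupby`, which makes the arithmetic easy to follow on small numbers. The rows below are hypothetical rating-weighted values, not data from the notebook:

```python
import pandas as pd

# Hypothetical rating-weighted rows for two users
toy = pd.DataFrame({'user_id': [1, 1, 2],
                    'Action': [9, 0, 5],
                    'Comedy': [9, 7, 0]})
toy_genres = ['Action', 'Comedy']

# Sum the weighted genre scores per user, then divide each row by its total
totals = toy.groupby('user_id')[toy_genres].sum()
toy_UFM = totals.div(totals.sum(axis=1), axis=0)
print(toy_UFM)
```

For user 1 the totals are 9 (Action) and 16 (Comedy), so the vector is 9/25 = 0.36 and 16/25 = 0.64. Note that a user whose totals sum to zero would end up with NaN entries under this division.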


Now we can build the Inferred Movie Rankings (df_IMR) from the User Feature Matrix (df_UFM) created above. Each score is the dot product of a user's feature vector and an anime's genre vector.

In [15]:
animeScores = []
for i in df_UFM.index:
    animeScores.append((df_UFM[genres].iloc[i] * df_IFM[genres]).T.sum())
df_IMR = pd.DataFrame(animeScores,index=df_UFM.user_id)
df_IMR.columns = df_IFM.anime_name
df_IMR #Inferred Movie Rankings

user_id,Cowboy Bebop,Cowboy Bebop: Tengoku no Tobira,Trigun,Witch Hunter Robin,Bouken Ou Beet,Eyeshield 21,Hachimitsu to Clover,Hungry Heart: Wild Striker,Initial D Fourth Stage,Monster,...,SK∞: Crazy Rock Jam,Kyoukai Senki,D_Cide Traumerei,Tsuki to Laika to Nosferatu,Wan Jie Shen Zhu 3rd Season,Daomu Biji Zhi Qinling Shen Shu,Mieruko-chan,Higurashi no Naku Koro ni Sotsu,Yama no Susume: Next Summit,Scarlet Nexus
1,0.364023,0.235764,0.480841,0.263438,0.260777,0.352049,0.250399,0.278073,0.22991,0.177222,...,0.142895,0.098457,0.283662,0.034327,0.097392,0.121873,0.158329,0.119745,0.172432,0.140234
2,0.340319,0.26497,0.439122,0.299401,0.218563,0.316866,0.171657,0.201597,0.236527,0.221058,...,0.094311,0.137226,0.244012,0.050898,0.052894,0.138723,0.169661,0.188124,0.120259,0.141218
4,0.322497,0.225619,0.358235,0.299247,0.198924,0.178471,0.384284,0.173735,0.173305,0.187083,...,0.09085,0.070183,0.311948,0.030355,0.106997,0.13563,0.156943,0.121206,0.179978,0.115393
7,0.309602,0.173796,0.335926,0.246485,0.191146,0.201914,0.38289,0.197128,0.158839,0.152857,...,0.113072,0.069399,0.26563,0.021538,0.101705,0.124738,0.189949,0.119653,0.212982,0.122046
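
Since every score is a dot product, the whole table can also be computed as one matrix product. This is a minimal sketch on hypothetical profiles and item features (`toy_UFM`, `toy_IFM`, and the anime names A, B, C are made up for illustration):

```python
import pandas as pd

toy_genres = ['Action', 'Comedy']
# Hypothetical user profiles (rows sum to 1) and item features
toy_UFM = pd.DataFrame([[0.36, 0.64], [1.0, 0.0]],
                       index=pd.Index([1, 2], name='user_id'),
                       columns=toy_genres)
toy_IFM = pd.DataFrame({'anime_name': ['A', 'B', 'C'],
                        'Action': [1, 0, 1],
                        'Comedy': [0, 1, 1]})

# (users x genres) @ (genres x anime) -> (users x anime) score matrix
scores = toy_UFM.to_numpy() @ toy_IFM[toy_genres].to_numpy().T
toy_IMR = pd.DataFrame(scores, index=toy_UFM.index,
                       columns=toy_IFM['anime_name'])
print(toy_IMR)
```

Anime C carries both genres, so it scores 0.36 + 0.64 = 1.0 for user 1. This also shows a property of the scoring: an anime tagged with more genres can only score the same or higher, never lower.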


## Summary

These are the recommendation results. For each user, the code below sorts the inferred scores in descending order and shows the top 20 anime the system recommends.

In [16]:
for i in df_IMR.index:
    print('User:', i)
    # The row for user i is a Series named i; rename that column to 'score'
    top = df_IMR.loc[i].sort_values(ascending=False).reset_index().rename(columns={i: 'score'})
    display(top.head(20))

User: 1


,anime_name,score
0,Battle Athletess Daiundoukai (TV),0.609101
1,InuYasha Movie 2: Kagami no Naka no Mugenjo,0.604843
2,InuYasha Movie 4: Guren no Houraijima,0.604843
3,InuYasha Movie 1: Toki wo Koeru Omoi,0.604843
4,InuYasha Movie 3: Tenka Hadou no Ken,0.604843
5,Aoki Densetsu Shoot!,0.569452
6,Trinity Seven: Nanatsu no Taizai to Nana Madoushi,0.566525
7,Trinity Seven,0.566525
8,Ani*Kuri15,0.562533
9,Saber Marionette J,0.552422


User: 2


,anime_name,score
0,InuYasha Movie 1: Toki wo Koeru Omoi,0.518463
1,InuYasha Movie 3: Tenka Hadou no Ken,0.518463
2,InuYasha Movie 2: Kagami no Naka no Mugenjo,0.518463
3,InuYasha Movie 4: Guren no Houraijima,0.518463
4,GetBackers,0.51497
5,Ueki no Housoku,0.513972
6,Trinity Seven,0.512974
7,Trinity Seven: Nanatsu no Taizai to Nana Madoushi,0.512974
8,Saber Marionette J,0.508982
9,Battle Athletess Daiundoukai (TV),0.499002


User: 4


,anime_name,score
0,InuYasha Movie 4: Guren no Houraijima,0.646717
1,InuYasha Movie 3: Tenka Hadou no Ken,0.646717
2,InuYasha Movie 1: Toki wo Koeru Omoi,0.646717
3,InuYasha Movie 2: Kagami no Naka no Mugenjo,0.646717
4,Mai-HiME,0.609257
5,Kamikaze Kaitou Jeanne,0.586868
6,Cardcaptor Sakura,0.580194
7,Wagamama☆Fairy Mirumo de Pon!,0.580194
8,Zero no Tsukaima: Princesses no Rondo Picture ...,0.579333
9,Fushigi Yuugi,0.573089


User: 7


,anime_name,score
0,Zero no Tsukaima: Princesses no Rondo Picture ...,0.65181
1,InuYasha Movie 1: Toki wo Koeru Omoi,0.630272
2,InuYasha Movie 3: Tenka Hadou no Ken,0.630272
3,InuYasha Movie 2: Kagami no Naka no Mugenjo,0.630272
4,InuYasha Movie 4: Guren no Houraijima,0.630272
5,Trinity Seven,0.626383
6,Trinity Seven: Nanatsu no Taizai to Nana Madoushi,0.626383
7,Gakusen Toshi Asterisk,0.590488
8,Mai-HiME,0.585103
9,Negima!?,0.579719


## References

1. [Anime Recommendation Database 2020 - Kaggle](https://www.kaggle.com/hernan4444/anime-recommendation-database-2020)
2. [Split by comma and strip whitespace in Python - Stack Overflow](https://stackoverflow.com/questions/4071396/split-by-comma-and-strip-whitespace-in-python)
3. [Python: if-else in one line – (A Ternary operator)](https://thispointer.com/python-if-else-in-one-line-a-ternary-operator/)