# ⛩️🌸🍥☯🍜 ANIMEndation System

Hello there! In this notebook, we'll be going over different methods of building a recommendation system for animes. We'll start off by building feature-based recommendation systems, followed by collaborative filtering. Otakus out there, let's all have fun and learn together!

# Modules

First, let's import the modules that we'll be using in this notebook. For those wondering what the `surprise` module is, it is a library for collaborative filtering, a technique that identifies patterns and similarities in user behavior to make personalized recommendations. We'll go over this at the very end.

In [8]:
#import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

from surprise import Dataset, Reader, NormalPredictor, BaselineOnly
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split
from surprise.prediction_algorithms import SVD, KNNBasic, KNNBaseline
from surprise import accuracy
from tqdm import tqdm
from statistics import mean

# Dataset

We have quite a big dataset here, extracted from [myanimelist.net](http://myanimelist.net), with multiple tables containing different information. myanimelist.net is said to be primarily used by westerners, so it's important to note that the recommendations here will probably fit western viewers better. Anyway, let's take a look at our first file, *anime.csv*.

## anime.csv

In [9]:
anime = pd.read_csv("../data/anime.csv")
anime.head(5)

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,...,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,...,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Trigun,トライガン,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,...,50229.0,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,...,2182.0,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",Beet the Vandel Buster,冒険王ビィト,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,...,312.0,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0


In [10]:
#check variables and missing values
anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17562 entries, 0 to 17561
Data columns (total 35 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MAL_ID         17562 non-null  int64 
 1   Name           17562 non-null  object
 2   Score          17562 non-null  object
 3   Genres         17562 non-null  object
 4   English name   17562 non-null  object
 5   Japanese name  17562 non-null  object
 6   Type           17562 non-null  object
 7   Episodes       17562 non-null  object
 8   Aired          17562 non-null  object
 9   Premiered      17562 non-null  object
 10  Producers      17562 non-null  object
 11  Licensors      17562 non-null  object
 12  Studios        17562 non-null  object
 13  Source         17562 non-null  object
 14  Duration       17562 non-null  object
 15  Rating         17562 non-null  object
 16  Ranked         17562 non-null  object
 17  Popularity     17562 non-null  int64 
 18  Members        17562 non-n

Wow! We have quite a comprehensive dataset here. Each row represents a unique anime with a unique `MAL_ID`. The dataset contains 35 variables, ranging from the producer of the anime to scores given by viewers. We'll have to look into them one by one to see if they require any form of preprocessing.

## anime_with_synopsis.csv

Next in line we have the table that contains a synopsis for each anime. Let's take a look.

In [11]:
synopsis = pd.read_csv("../data/anime_with_synopsis.csv")
synopsis.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,sypnopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...


Yep, we can see that each `MAL_ID` corresponds to a synopsis (stored in a column that the dataset spells `sypnopsis`). This could contain information that we could utilize in our recommendation system, so let's join this variable into the main `anime.csv` table.

In [12]:
#drop Name, Score and Genres
synopsis = synopsis.drop(['Name', 'Score', 'Genres'], axis=1)
#merge the two tables
anime = anime.merge(synopsis, how='outer', on='MAL_ID')

anime.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1,sypnopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,...,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0,"In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,...,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0,"other day, another bounty—such is the life of ..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Trigun,トライガン,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,...,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0,"Vash the Stampede is the man with a $$60,000,0..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,...,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0,ches are individuals with special powers like ...
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",Beet the Vandel Buster,冒険王ビィト,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,...,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0,It is the dark century and the people are suff...
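
If you want to double check a join like this one, here is a minimal sketch on made-up toy tables (the frames `toy_anime` and `toy_synopsis` and their placeholder texts are hypothetical, not rows from the dataset). It uses a left join instead of the outer join above, purely for illustration, together with the `validate` and `indicator` options of `merge` to confirm that each `MAL_ID` matches at most one synopsis and to count the animes that end up without one.

```python
import pandas as pd

# toy tables (hypothetical): three animes, only two of them have a synopsis
toy_anime = pd.DataFrame({
    "MAL_ID": [1, 5, 6],
    "Name": ["Anime A", "Anime B", "Anime C"],
})
toy_synopsis = pd.DataFrame({
    "MAL_ID": [1, 6],
    "sypnopsis": ["placeholder text A", "placeholder text C"],
})

# validate raises an error if MAL_ID is duplicated on either side,
# indicator adds a _merge column that tells us where each row came from
merged = toy_anime.merge(
    toy_synopsis, how="left", on="MAL_ID",
    validate="one_to_one", indicator=True,
)

# number of animes that did not find a synopsis
missing_synopsis = int((merged["_merge"] == "left_only").sum())
print(merged)
print("animes without a synopsis:", missing_synopsis)
```

With a left join the row count of the main table can never grow, which makes it a bit easier to reason about than an outer join.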


# Basic Data Preprocessing

These two tables will work for now because we're first going to build a feature-based recommendation system. The idea behind feature-based recommendation is simple. The recommendation system takes into account the different features of each anime and calculates the distance between animes. For example, suppose we look at only one variable, `Score`: an anime that scored 9 is closer to an anime that scored 10 than to one that scored 7. Pretty simple, right? The only difference is that we take multiple variables into account, which increases the number of dimensions in which the distance is calculated. Anyway, let's preprocess the data that we have here.

## Episodes

First, let's remove those anime with `Unknown` number of episodes at the very beginning. How do animes have `Unknown` number of episodes? Doesn't make much sense to me, so let's kick them out.

In [13]:
#select those with Episode not NA
anime = anime[anime['Episodes']!="Unknown"]

## MAL_ID

Just in case, let's also double check that there are no duplicate values in `MAL_ID`. Probably should've checked this earlier, but hey! No duplicate values :).

In [14]:
anime['MAL_ID'].duplicated().any()

False

Since we just removed the animes with `Unknown` numbers of `Episodes` from the dataset, let's reset the index in order and also assign each anime a new anime index (considering that many `MAL_ID` are now missing).

In [15]:
#reset the index to a sequential order and drop the original index
anime = anime.reset_index(drop=True)

#many MAL_IDs are now missing, so let's assign each anime a new index given its order
anime['new_anime_index'] = anime.index

anime.shape

(17046, 37)

Now, we're left with 17046 animes in total. This should be enough for us to make great recommendations for all of the otakus out there.

# General Recommendations

Alrighty! Let's make some recommendations now. We'll start with the simplest and most straightforward method, based on popularity and rating. Now you might be wondering "isn't that just ranking animes using these two variables?". Yes! You're absolutely correct. But guess what! This can be pretty effective, because these animes are the most popular and best-rated for a reason. Especially for people who don't watch anime, these recommendations could be a great start.

We can derive the popularity and rating of each anime from the score variables in our dataset. The rating is the average score (the `Score` column), and the popularity is the total number of scores, summed over the 10 `Score-N` count columns. But first, we have to replace the *Unknown* values with 0 and change the datatypes to numerical.

In [16]:
#replace unknown with 0
score_cols = ['Score', 'Score-10','Score-9','Score-8','Score-7','Score-6','Score-5','Score-4','Score-3','Score-2','Score-1']
anime[score_cols] = anime[score_cols].replace('Unknown', 0)

#change datatype
anime[score_cols] = anime[score_cols].apply(pd.to_numeric)

#find total number of scores
anime['no_of_scores'] = anime['Score-10'] + anime['Score-9'] + anime['Score-8'] + anime['Score-7'] + anime['Score-6'] + anime['Score-5'] + anime['Score-4'] + anime['Score-3'] + anime['Score-2'] + anime['Score-1']

## Most Popular Anime by Average Score

Now we sort!

In [17]:
top10_score = anime.sort_values('Score', ascending=False).iloc[:10][['English name', 'Japanese name', 'Score', 'no_of_scores']]
top10_score

Unnamed: 0,English name,Japanese name,Score,no_of_scores
3964,Fullmetal Alchemist:Brotherhood,鋼の錬金術師 FULLMETAL ALCHEMIST,9.19,1438767.0
15740,Attack on Titan Final Season,進撃の巨人 The Final Season,9.17,288274.0
5672,Steins;Gate,STEINS;GATE,9.11,989905.0
14828,Attack on Titan Season 3 Part 2,進撃の巨人 Season3 Part.2,9.1,728435.0
9892,Gintama Season 4,銀魂°,9.1,161812.0
6462,Hunter x Hunter,HUNTER×HUNTER（ハンター×ハンター）,9.1,1026866.0
5995,Gintama Season 2,銀魂',9.08,162866.0
739,Legend of the Galactic Heroes,銀河英雄伝説,9.07,58655.0
7249,Gintama:Enchousen,銀魂' 延長戦,9.04,113662.0
9865,A Silent Voice,聲の形,9.0,940843.0


The top-rated animes (on average) are listed above. We can see that many are very famous animes, including *Attack on Titan*, *Gintama* and *Hunter x Hunter*. *Fullmetal Alchemist:Brotherhood* came out on top with the highest rating. And yes! Please watch it! It's a crazy good anime!

## Most Popular Anime by Popularity

We do the same thing for popularity (measured by `no_of_scores`).

In [18]:
top10_score_count = anime.sort_values('no_of_scores', ascending=False).iloc[:10][['English name', 'Japanese name', 'Score', 'no_of_scores']]
top10_score_count

Unnamed: 0,English name,Japanese name,Score,no_of_scores
1389,Death Note,デスノート,8.63,1826691.0
7437,Attack on Titan,進撃の巨人,8.48,1791099.0
6602,Sword Art Online,ソードアート・オンライン,7.25,1574372.0
10420,One Punch Man,ワンパンマン,8.57,1478645.0
3964,Fullmetal Alchemist:Brotherhood,鋼の錬金術師 FULLMETAL ALCHEMIST,9.19,1438767.0
11147,My Hero Academia,僕のヒーローアカデミア,8.11,1305467.0
10,Naruto,ナルト,7.91,1268593.0
8633,Tokyo Ghoul,東京喰種-トーキョーグール-,7.81,1260191.0
11269,Your Name.,君の名は。,8.96,1190759.0
8135,"No Game, No Life",ノーゲーム・ノーライフ,8.2,1137376.0


These animes accumulated the largest number of scores, which is an indication of their popularity. *Death Note*, *Attack on Titan* and *Sword Art Online* took the top three spots.

# Content-based Recommendation

Okay! Now that we have a general sense of the top most popular animes by `Score` (rating) and `no_of_scores` (popularity), let's get more technical. Let's create a recommendation system that is based off of the different features of the anime. In the database, we have different information about the anime's `Genres`, `Producers`, `Episodes` and `Duration`. We'll be using these features to make recommendations.

In [19]:
#extract the anime features
anime_features = anime[['new_anime_index', 'Name', 'Type', 'Episodes', 'Aired', 'Premiered', 'Producers', 'Licensors', 'Studios', 'Source', 'Duration' ,'Genres']]
anime_features.head()

Unnamed: 0,new_anime_index,Name,Type,Episodes,Aired,Premiered,Producers,Licensors,Studios,Source,Duration,Genres
0,0,Cowboy Bebop,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,Bandai Visual,"Funimation, Bandai Entertainment",Sunrise,Original,24 min. per ep.,"Action, Adventure, Comedy, Drama, Sci-Fi, Space"
1,1,Cowboy Bebop: Tengoku no Tobira,Movie,1,"Sep 1, 2001",Unknown,"Sunrise, Bandai Visual",Sony Pictures Entertainment,Bones,Original,1 hr. 55 min.,"Action, Drama, Mystery, Sci-Fi, Space"
2,2,Trigun,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,Victor Entertainment,"Funimation, Geneon Entertainment USA",Madhouse,Manga,24 min. per ep.,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen"
3,3,Witch Hunter Robin,TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,"TV Tokyo, Bandai Visual, Dentsu, Victor Entert...","Funimation, Bandai Entertainment",Sunrise,Original,25 min. per ep.,"Action, Mystery, Police, Supernatural, Drama, ..."
4,4,Bouken Ou Beet,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,"TV Tokyo, Dentsu",Unknown,Toei Animation,Manga,23 min. per ep.,"Adventure, Fantasy, Shounen, Supernatural"


Based on my domain knowledge of animes, some of these features are less important or overlap with another feature. Let's drop the less useful ones.

* **Premiered**: Somewhat coincides with the Aired date, with many unknown values
* **Producers**: Not a very important variable. Anime viewers tend to be more concerned about the production studio.
* **Licensors**: Also not a very important variable for the viewers
* **Duration**: Somewhat coincides with the episode variable. An anime with 1 episode tends to be a movie with a duration of at least 1 hour. Most other animes are around 20-25 mins per episode. So it's not a particularly important variable.

In [20]:
#drop the useless variables
anime_features = anime_features.drop(['Premiered', 'Producers', 'Licensors', 'Duration'], axis=1)
anime_features.head()

Unnamed: 0,new_anime_index,Name,Type,Episodes,Aired,Studios,Source,Genres
0,0,Cowboy Bebop,TV,26,"Apr 3, 1998 to Apr 24, 1999",Sunrise,Original,"Action, Adventure, Comedy, Drama, Sci-Fi, Space"
1,1,Cowboy Bebop: Tengoku no Tobira,Movie,1,"Sep 1, 2001",Bones,Original,"Action, Drama, Mystery, Sci-Fi, Space"
2,2,Trigun,TV,26,"Apr 1, 1998 to Sep 30, 1998",Madhouse,Manga,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen"
3,3,Witch Hunter Robin,TV,26,"Jul 2, 2002 to Dec 24, 2002",Sunrise,Original,"Action, Mystery, Police, Supernatural, Drama, ..."
4,4,Bouken Ou Beet,TV,52,"Sep 30, 2004 to Sep 29, 2005",Toei Animation,Manga,"Adventure, Fantasy, Shounen, Supernatural"


Now we're left with 6 feature variables (plus `new_anime_index` and `Name`). Let's take a deeper dive into each of them and see if they require any data preprocessing.

In [21]:
anime_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17046 entries, 0 to 17045
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   new_anime_index  17046 non-null  int64 
 1   Name             17046 non-null  object
 2   Type             17046 non-null  object
 3   Episodes         17046 non-null  object
 4   Aired            17046 non-null  object
 5   Studios          17046 non-null  object
 6   Source           17046 non-null  object
 7   Genres           17046 non-null  object
dtypes: int64(1), object(7)
memory usage: 1.0+ MB


## Anime Type

There's nothing really special about the anime `Type`. It's a categorical variable that takes on seven values. We just have to encode it.

In [22]:
anime_features.Type.unique()

array(['TV', 'Movie', 'OVA', 'Special', 'ONA', 'Music', 'Unknown'],
      dtype=object)

## Episodes

`Episodes` is an interesting one. Considering that we have over 17000 animes in our dataset, the number of episodes will inevitably vary. Based on my knowledge of animes, we can group these into a few categories.

In [23]:
anime_features.Episodes.unique()

array(['26', '1', '52', '145', '24', '74', '220', '178', '12', '22', '69',
       '25', '4', '94', '5', '3', '13', '23', '43', '6', '50', '47', '51',
       '49', '39', '8', '7', '75', '62', '14', '44', '45', '64', '101',
       '27', '161', '2', '153', '70', '78', '42', '11', '167', '150',
       '366', '9', '16', '38', '48', '10', '76', '40', '20', '37', '41',
       '112', '224', '180', '296', '358', '63', '276', '46', '54', '15',
       '21', '35', '124', '86', '102', '36', '67', '291', '110', '29',
       '55', '201', '142', '109', '34', '136', '32', '73', '114', '19',
       '195', '58', '155', '96', '103', '113', '104', '192', '191', '203',
       '56', '500', '80', '172', '65', '117', '28', '61', '30', '148',
       '128', '100', '17', '243', '92', '105', '79', '31', '1787', '53',
       '33', '130', '18', '97', '193', '115', '170', '66', '330', '108',
       '68', '119', '95', '137', '60', '77', '72', '127', '99', '373',
       '300', '163', '91', '88', '154', '156', '694', '8

In [24]:
#change datatype
anime_features['Episodes'] = pd.to_numeric(anime_features['Episodes'].copy())

Typically, animes are released by season, in January, April, July and October. Most individual seasons last three months with approximately 12 episodes. Some run longer, to about 24 episodes (which spans two seasons, or 6 months). So, we'll be categorizing the animes based on these time periods.

In [25]:
#find the maximum number of episode
max_episodes = anime_features['Episodes'].max()

#put episodes into categories
anime_features['Episodes'] = anime_features['Episodes'].copy().replace(1, "One Episode")
anime_features['Episodes'] = anime_features['Episodes'].copy().replace(range(2,11), "Short")
anime_features['Episodes'] = anime_features['Episodes'].copy().replace(range(11,15), "3-months")
anime_features['Episodes'] = anime_features['Episodes'].copy().replace(range(15,21), "4-5-months")
anime_features['Episodes'] = anime_features['Episodes'].copy().replace(range(21,27), "6-months")
anime_features['Episodes'] = anime_features['Episodes'].copy().replace(range(27,51), "Long")
anime_features['Episodes'] = anime_features['Episodes'].copy().replace(range(51,max_episodes+1), "Very Long")

anime_features.Episodes.unique()

array(['6-months', 'One Episode', 'Very Long', '3-months', 'Short',
       'Long', '4-5-months'], dtype=object)
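
As a side note, the same grouping can also be written as a single binning step. Below is a minimal sketch using `pd.cut` on a small made-up series of episode counts (the names `episodes`, `bins`, `labels` and `episode_category` are my own choices for illustration). The bin edges are assumed to mirror the ranges used in the cell above, with each bin including its right edge.

```python
import pandas as pd

# toy episode counts (hypothetical), one per anime
episodes = pd.Series([1, 5, 12, 18, 24, 26, 40, 51, 1787])

# bin edges: (0, 1], (1, 10], (10, 14], (14, 20], (20, 26], (26, 50], (50, inf)
bins = [0, 1, 10, 14, 20, 26, 50, float("inf")]
labels = ["One Episode", "Short", "3-months", "4-5-months",
          "6-months", "Long", "Very Long"]

# each value is assigned the label of the interval it falls into
episode_category = pd.cut(episodes, bins=bins, labels=labels)
print(episode_category.tolist())
```

One upside of this version is that it doesn't need the maximum number of episodes, since the last bin is open-ended.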

## Aired Time

Next, `Aired`. We'll extract the year value from `Aired` and then categorize the year based on how old it is. To be honest, I struggled a bit trying to extract the year value, so I ended up using error handling. The code works but it's not pretty, so if you have a better solution, please help me out.

In [26]:
from datetime import datetime

#date formats to try in order, eg. "Jan 01, 2000", "Jan, 2000" and "2000"
date_formats = ["%b %d, %Y", "%b, %Y", "%Y"]

lstAired = []

for date in anime_features.Aired:

    #separate the date and keep only the start of the range
    start = date.split(' to ')[0]

    for fmt in date_formats:
        try:
            #append the year value to the list
            lstAired.append(datetime.strptime(start, fmt).year)
            break
        except ValueError:
            continue
    else:
        #no format matched: append unknown values to the list as they are
        lstAired.append(start)
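
For anyone looking for an alternative to the date parsing above, here is one possible sketch using a regular expression on a few made-up strings that imitate the formats seen in `Aired` (the series `toy_aired` is hypothetical). It assumes that the first four-digit number in the start date is always the year, which seems reasonable because day numbers have at most two digits, but I have not checked it against every row in the dataset.

```python
import pandas as pd

# toy values (hypothetical) covering the formats handled above
toy_aired = pd.Series([
    "Apr 3, 1998 to Apr 24, 1999",
    "Sep 1, 2001",
    "Jan, 2000",
    "2005",
    "Sep 10, 2021 to ?",
    "Unknown",
])

# keep only the start of the range, then grab the first four-digit number
start = toy_aired.str.split(" to ").str[0]
year = start.str.extract(r"(\d{4})")[0]

# rows without a year become missing values in a nullable integer column
year = pd.to_numeric(year).astype("Int64")
print(year.tolist())
```

Note that unknown dates come out as missing values here rather than the string `Unknown`, so any later step that checks for `'Unknown'` would need to check for missing values instead.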

The year value is then categorized into eras: everything before 2000 is grouped together, and later years fall into five-year intervals.

In [27]:
#categorize the animes' aired year
lstEra = []

for t in lstAired:
    
    if t == 'Unknown':
        lstEra.append("Unknown")
        continue
        
    if t < 2000:
        lstEra.append("Very Old")
    elif t < 2005:
        lstEra.append("Old")
    elif t < 2010:
        lstEra.append("Modern")
    elif t < 2015:
        lstEra.append("Recent")
    else: 
        lstEra.append("New")

anime_features['Era'] = lstEra
anime_features

Unnamed: 0,new_anime_index,Name,Type,Episodes,Aired,Studios,Source,Genres,Era
0,0,Cowboy Bebop,TV,6-months,"Apr 3, 1998 to Apr 24, 1999",Sunrise,Original,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Very Old
1,1,Cowboy Bebop: Tengoku no Tobira,Movie,One Episode,"Sep 1, 2001",Bones,Original,"Action, Drama, Mystery, Sci-Fi, Space",Old
2,2,Trigun,TV,6-months,"Apr 1, 1998 to Sep 30, 1998",Madhouse,Manga,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Very Old
3,3,Witch Hunter Robin,TV,6-months,"Jul 2, 2002 to Dec 24, 2002",Sunrise,Original,"Action, Mystery, Police, Supernatural, Drama, ...",Old
4,4,Bouken Ou Beet,TV,Very Long,"Sep 30, 2004 to Sep 29, 2005",Toei Animation,Manga,"Adventure, Fantasy, Shounen, Supernatural",Old
...,...,...,...,...,...,...,...,...,...
17041,17041,Kitarou Tanjou: Gegege no Nazo,Movie,One Episode,Unknown,Unknown,Manga,"Comedy, Demons, Supernatural, Shounen",Unknown
17042,17042,"The Sun, Moon and Stars",Music,One Episode,"Jan 18, 2021",Unknown,Other,Music,New
17043,17043,Mahoutsukai no Yome: Nishi no Shounen to Seira...,OVA,Short,"Sep 10, 2021 to ?",Studio Kafka,Manga,"Slice of Life, Magic, Fantasy, Shounen",New
17044,17044,SK∞: Crazy Rock Jam,Special,One Episode,"Mar 14, 2021",Bones,Original,"Comedy, Sports",New


## Source

The `Source` of the anime is also just another categorical variable, so all we have to do is encode it.

In [28]:
anime_features.Source.unique()

array(['Original', 'Manga', 'Light novel', 'Game', 'Visual novel',
       '4-koma manga', 'Novel', 'Unknown', 'Other', 'Picture book',
       'Web manga', 'Music', 'Radio', 'Book', 'Card game',
       'Digital manga'], dtype=object)

## Studios

The values of `Studios` (and `Genres`) are structured almost like lists. Many animes are created by more than one studio (or belong to more than one genre). This may seem hard to encode, but we can actually use the `MultiLabelBinarizer` in sklearn to encode this type of data. First, we have to break each `Studios` (`Genres`) value down into a list.

In [29]:
#break studios into lists within a list
studio_breakdown = []

for studio in anime_features.Studios:
    studio_lst = studio.split(', ')
    studio_breakdown.append(studio_lst)

In [30]:
studio_breakdown[0:20]

[['Sunrise'],
 ['Bones'],
 ['Madhouse'],
 ['Sunrise'],
 ['Toei Animation'],
 ['Gallop'],
 ['J.C.Staff'],
 ['Nippon Animation'],
 ['A.C.G.T.'],
 ['Madhouse'],
 ['Studio Pierrot'],
 ['Trans Arts'],
 ['Toei Animation'],
 ['Studio Comet'],
 ['Gonzo'],
 ['Madhouse'],
 ['Gonzo'],
 ['Sunrise'],
 ['Studio Deen'],
 ['Gainax', 'Tatsunoko Production']]

## Genres

We do the same for `Genres`.

In [31]:
#break genres into lists within a list
genres_breakdown = []

for genre in anime_features.Genres:
    genre_lst = genre.split(', ')
    genres_breakdown.append(genre_lst)

In [32]:
genres_breakdown[0:5]

[['Action', 'Adventure', 'Comedy', 'Drama', 'Sci-Fi', 'Space'],
 ['Action', 'Drama', 'Mystery', 'Sci-Fi', 'Space'],
 ['Action', 'Sci-Fi', 'Adventure', 'Comedy', 'Drama', 'Shounen'],
 ['Action', 'Mystery', 'Police', 'Supernatural', 'Drama', 'Magic'],
 ['Adventure', 'Fantasy', 'Shounen', 'Supernatural']]

And then, we can apply the lists to the `MultiLabelBinarizer`.

In [33]:
#encode studio
mlb_studio = MultiLabelBinarizer()

studio_breakdown_series = pd.Series(studio_breakdown)

studio_encoded = pd.DataFrame(mlb_studio.fit_transform(studio_breakdown_series),
                   columns=mlb_studio.classes_,
                   index=studio_breakdown_series.index)

studio_encoded.head()

Unnamed: 0,10Gauge,1IN,2:10 AM Animation,33 Collective,3xCube,81 Produce,8bit,A-1 Pictures,A-Line,A-Real,...,foodunited.,helo.inc,iDRAGONS Creative Studio,ixtl,l-a-unch・BOX,monofilmo,pH Studio,production doA,teamKG,ufotable
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
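
To make it easier to see what `MultiLabelBinarizer` is doing, here is a tiny sketch on three hand-picked studio lists taken from the breakdown above (the variable names `toy_studios`, `toy_mlb` and `toy_encoded` are my own). Each distinct studio becomes one column, and a row gets a 1 in every column whose studio appears in its list.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# three animes: two with a single studio, one with two studios
toy_studios = [["Sunrise"], ["Bones"], ["Gainax", "Tatsunoko Production"]]

toy_mlb = MultiLabelBinarizer()
toy_encoded = pd.DataFrame(
    toy_mlb.fit_transform(toy_studios),
    columns=toy_mlb.classes_,  # one column per studio, sorted alphabetically
)
print(toy_encoded)
```

The third row ends up with two 1s, one for each of its studios, which is exactly the multi-label behaviour that a plain one-hot encoding of the raw string would miss.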


We concatenate the encoded `Studios` variable back to our main dataset and we do the same thing for `Genres`.

In [34]:
anime_features = pd.concat([anime_features, studio_encoded], axis=1)

In [35]:
#encode genre
mlb_genre = MultiLabelBinarizer()

genre_breakdown_series = pd.Series(genres_breakdown)

genre_encoded = pd.DataFrame(mlb_genre.fit_transform(genre_breakdown_series),
                   columns=mlb_genre.classes_,
                   index=genre_breakdown_series.index)

#concatenate anime feature dataframe with the genre encoded dataframe
anime_features = pd.concat([anime_features, genre_encoded], axis=1)

Awesome! With both `Studios` and `Genres` encoded, all that's left to do is encode the remaining variables.

In [36]:
#encode the remaining variables
cat_variables = anime_features[['Type','Episodes', 'Source', 'Era']]
cat_dummies = pd.get_dummies(cat_variables)

#drop the original variables
anime_features = anime_features.drop(['Type','Episodes', 'Aired', 'Studios', 'Source', 'Era', 'Genres'], axis=1)
anime_features = pd.concat([anime_features, cat_dummies], axis=1)

Perfect! Now we have processed all of the features.

In [37]:
anime_features.head()

Unnamed: 0,new_anime_index,Name,10Gauge,1IN,2:10 AM Animation,33 Collective,3xCube,81 Produce,8bit,A-1 Pictures,...,Source_Radio,Source_Unknown,Source_Visual novel,Source_Web manga,Era_Modern,Era_New,Era_Old,Era_Recent,Era_Unknown,Era_Very Old
0,0,Cowboy Bebop,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,False,False,False,True
1,1,Cowboy Bebop: Tengoku no Tobira,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,True,False,False,False
2,2,Trigun,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,False,False,False,True
3,3,Witch Hunter Robin,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,True,False,False,False
4,4,Bouken Ou Beet,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,True,False,False,False


Before we move on to build our recommendation system, we want to remove `new_anime_index` and `Name` from our dataset. We only want the features to remain. For now, we'll have to identify each anime through its row index.

In [38]:
#drop new_anime_index and Name
content_variables = anime_features.drop(['Name', 'new_anime_index'], axis=1)

content_variables.head()

Unnamed: 0,10Gauge,1IN,2:10 AM Animation,33 Collective,3xCube,81 Produce,8bit,A-1 Pictures,A-Line,A-Real,...,Source_Radio,Source_Unknown,Source_Visual novel,Source_Web manga,Era_Modern,Era_New,Era_Old,Era_Recent,Era_Unknown,Era_Very Old
0,0,0,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,False,False,False,True
1,0,0,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,True,False,False,False
2,0,0,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,False,False,False,True
3,0,0,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,True,False,False,False
4,0,0,0,0,0,0,0,0,0,0,...,False,False,False,False,False,False,True,False,False,False


# Get Recommendation Function

With our feature table all preprocessed and settled, we can use it to find the cosine similarity between each and every pair of rows. Using the `cosine_similarity` function from `sklearn`, we can easily build a matrix that holds the cosine similarity between all pairs of animes.

In [39]:
#find cosine similarity
cosine_sim_content = cosine_similarity(content_variables, content_variables)

cosine_sim_content.shape

(17046, 17046)
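
For intuition, the cosine similarity of two rows is their dot product divided by the product of their lengths. Here is a minimal NumPy sketch of that calculation on three made-up binary feature rows (the helper `cosine_sim_matrix` and the array `toy_features` are hypothetical names, and this is only meant to illustrate the formula, not to replace the sklearn function).

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    X_unit = X / norms       # scale every row to length 1
    return X_unit @ X_unit.T # dot products of unit rows

# toy one-hot style features (hypothetical): rows 0 and 1 share two features,
# row 2 shares nothing with the others
toy_features = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])

toy_sim = cosine_sim_matrix(toy_features)
print(toy_sim.round(3))
```

Rows that share many features score close to 1, rows with nothing in common score 0, and every row has a similarity of 1 with itself.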

Then, we build a series with the anime `Name` as the index and the row position as the value. This series lets us look up an anime's row in the similarity matrix by its name, which we need in order to extract the top n closest animes by cosine similarity.

In [40]:
#Construct a reverse map from anime titles to row indices
indices = pd.Series(anime_features.index, index=anime_features['Name']).drop_duplicates()
indices

Name
Cowboy Bebop                                                    0
Cowboy Bebop: Tengoku no Tobira                                 1
Trigun                                                          2
Witch Hunter Robin                                              3
Bouken Ou Beet                                                  4
                                                            ...  
Kitarou Tanjou: Gegege no Nazo                              17041
The Sun, Moon and Stars                                     17042
Mahoutsukai no Yome: Nishi no Shounen to Seiran no Kishi    17043
SK∞: Crazy Rock Jam                                         17044
Wan Jie Shen Zhu 3rd Season                                 17045
Length: 17046, dtype: int64

Finally, we build the function that takes in an *anime title* and the *cosine similarity matrix*. The function will output the top 10 most similar animes to the anime title given using the cosine similarity matrix.

In [41]:
# Function that takes in anime title and similarity matrix as input and outputs most similar animes
def get_recommendations(title, cosine_sim):

    # Get the index of the input anime
    idx = indices[title]

    # Get the similarity scores of all animes with that anime
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the animes based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar animes, sim_scores[0] would be the anime itself
    sim_scores = sim_scores[1:11]

    # Get the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar animes
    return anime_features['Name'].iloc[anime_indices]

We can test this function by feeding in *Shingeki no Kyojin* (*Attack on Titan*), and we can see that the results look quite sensible. The different seasons and spin-offs of *Attack on Titan* took 8 out of the 10 most similar spots. If you're a huge *Attack on Titan* fan, maybe you can try out *Zetsuen no Tempest* and *Akame ga Kill!*, which also landed in the top 10.

In [42]:
get_recommendations('Shingeki no Kyojin', cosine_sim_content)

9363                     Shingeki no Kyojin Season 2
13163                    Shingeki no Kyojin Season 3
14828             Shingeki no Kyojin Season 3 Part 2
7077                              Zetsuen no Tempest
15740           Shingeki no Kyojin: The Final Season
16533                  Shingeki no Kyojin: Chronicle
7867                          Shingeki no Kyojin OVA
8039                 Shingeki no Kyojin: Ano Hi Kara
8612                                  Akame ga Kill!
9002     Shingeki no Kyojin Movie 1: Guren no Yumiya
Name: Name, dtype: object
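
One small caveat with the function above: it assumes that the queried anime always sorts into the first position. If another anime happens to have exactly the same features, that may not hold. Here is a hedged sketch of a variant that removes the queried title by name instead (the function `top_n_similar` and the toy names and matrix are hypothetical, and it assumes that anime names are unique).

```python
import numpy as np
import pandas as pd

def top_n_similar(title, names, sim_matrix, n=10):
    """Return the n most similar titles with their scores, excluding the query."""
    # map each name to its row position (assumes names are unique)
    idx_by_name = pd.Series(range(len(names)), index=names)
    idx = idx_by_name[title]

    # label the similarity row with names and drop the queried title itself
    scores = pd.Series(sim_matrix[idx], index=names).drop(title)
    return scores.nlargest(n)

# toy data (hypothetical): three animes and their pairwise similarities
toy_names = ["Anime A", "Anime B", "Anime C"]
toy_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

toy_recs = top_n_similar("Anime A", toy_names, toy_sim, n=2)
print(toy_recs)
```

Returning the scores alongside the names also makes it easier to judge how close each recommendation actually is.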

# Recommendation Based on Synopsis

That was pretty simple, right? Next, let's do something slightly more complicated. Let's build another similarity matrix using the synopsis of each anime. But how does that work?

In [43]:
#take a copy so that assigning to a column later doesn't trigger a SettingWithCopyWarning
anime_synopsis = anime[['new_anime_index', 'Name', 'English name', 'Japanese name', 'sypnopsis']].copy()
anime_synopsis.head()

Unnamed: 0,new_anime_index,Name,English name,Japanese name,sypnopsis
0,0,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,"In the year 2071, humanity has colonized sever..."
1,1,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,"other day, another bounty—such is the life of ..."
2,2,Trigun,Trigun,トライガン,"Vash the Stampede is the man with a $$60,000,0..."
3,3,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),ches are individuals with special powers like ...
4,4,Bouken Ou Beet,Beet the Vandel Buster,冒険王ビィト,It is the dark century and the people are suff...


In [44]:
anime_synopsis.shape

(17046, 5)

We'll use the **TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer** to vectorize the synopsis. Essentially, it is a technique that weighs the importance of each word in a document based on its frequency in the document and its rarity in the corpus.

In [45]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
anime_synopsis['sypnopsis'] = anime_synopsis['sypnopsis'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(anime_synopsis['sypnopsis'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  anime_synopsis['sypnopsis'] = anime_synopsis['sypnopsis'].fillna('')


(17046, 44477)
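
To make the weighting concrete, here's a minimal sketch of TF-IDF computed by hand on a hypothetical three-sentence corpus. It assumes the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. It skips stop-word removal and uses plain whitespace tokenization, so it's an illustration of the idea rather than a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the synopsis column
docs = [
    "pirates sail the sea",
    "ninja village and the ninja exam",
    "pirates fight the ninja",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed inverse document frequency (assumed to match scikit-learn's default)
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_vector(tokens):
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    # L2-normalize so that every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the first sentence, "sail" appears in only one document and ends up with a larger weight than "the", which appears in all three.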

We can then use this TF-IDF matrix to build another cosine similarity matrix. Notice that this time we derive it with the `linear_kernel()` function, which computes plain dot products. `TfidfVectorizer` L2-normalizes every row by default, so each synopsis vector already has length 1 and the dot product of two rows equals their cosine similarity. `cosine_similarity()` would return the same values but repeats the normalization first, so `linear_kernel()` is slightly cheaper here. For vectors that are not already normalized, stick with `cosine_similarity()`.

In [46]:
# Compute the cosine similarity matrix (the TF-IDF rows are already L2-normalized, so linear_kernel gives the same result as cosine_similarity while skipping the redundant normalization)
cosine_sim_synopsis = linear_kernel(tfidf_matrix, tfidf_matrix)
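
Here's a small sketch of why a plain dot product is enough once the vectors have unit length. The two vectors are arbitrary numbers chosen for illustration, and the helper functions are hypothetical stand-ins written in plain Python, not scikit-learn calls.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector lengths
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def l2_normalize(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]
u_unit, v_unit = l2_normalize(u), l2_normalize(v)

print(cosine(u, v))          # cosine similarity of the raw vectors
print(dot(u_unit, v_unit))   # plain dot product of the unit-length vectors
print(dot(u, v))             # dot product of the raw vectors: not a cosine
```

The first two numbers agree, while the third does not, which is why `linear_kernel()` only doubles as cosine similarity on normalized input.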

Finally, we can reuse the function we built for the feature-based recommendations, except this time we feed in the cosine similarity matrix based on synopsis. If we search *Sword Art Online*, we can see that 8 of the top 10 most similar anime are part of the SAO franchise. This makes sense because the synopses of the SAO anime are likely to contain similar keywords. The two exceptions are *.hack//The Movie: Sekai no Mukou ni* and *Hana Ichi Monme*.

In [47]:
get_recommendations('Sword Art Online', cosine_sim_synopsis)

16816    Sword Art Online: Progressive Movie - Hoshi Na...
11072                Sword Art Online Movie: Ordinal Scale
8538                                   Sword Art Online II
8193                       Sword Art Online: Extra Edition
16007    Sword Art Online: Alicization - War of Underwo...
6517                   .hack//The Movie: Sekai no Mukou ni
15531    Sword Art Online: Alicization - War of Underworld
15985    Sword Art Online: Alicization - War of Underwo...
13919                                      Hana Ichi Monme
13554        Sword Art Online Alternative: Gun Gale Online
Name: Name, dtype: object

# Collaborative Filtering

Now that we're done with content-based filtering, let's move on to **collaborative filtering**. For content-based recommendation, our goal was to recommend anime based on distance and similarity. Here, our goal is instead to predict the rating a user would give to an anime (kind of like regression). We'll be using the `surprise` module along with the *rating_complete.csv* file, which contains the ratings given by each *user_id* to each *anime_id*. Since we re-indexed the anime earlier, let's map the original *anime_id* in the file to the corresponding *new_anime_index*.

In [48]:
#anime_id and new_anime_index
anime_index_df = anime[['MAL_ID', 'new_anime_index']]
anime_index_df.head()

Unnamed: 0,MAL_ID,new_anime_index
0,1,0
1,5,1
2,6,2
3,7,3
4,8,4


In [50]:
anime_rating = pd.read_csv("../data/rating_complete.csv")
anime_rating.head()

Unnamed: 0,user_id,anime_id,rating
0,0,430,9
1,0,1004,5
2,0,3010,7
3,0,570,7
4,0,2762,9


In [51]:
#merge dataframe
anime_rating = anime_rating.merge(anime_index_df, how='left', left_on='anime_id', right_on='MAL_ID')
anime_rating.head()

Unnamed: 0,user_id,anime_id,rating,MAL_ID,new_anime_index
0,0,430,9,430.0,401.0
1,0,1004,5,1004.0,906.0
2,0,3010,7,3010.0,2739.0
3,0,570,7,570.0,533.0
4,0,2762,9,2762.0,2538.0


We'll drop the redundant columns (*anime_id* and *MAL_ID*), which leaves us with the only three columns that we need: *user_id*, *rating* and *new_anime_index*.

In [52]:
anime_rating = anime_rating.drop(['anime_id', 'MAL_ID'], axis=1)
anime_rating.head()

Unnamed: 0,user_id,rating,new_anime_index
0,0,9,401.0
1,0,5,906.0
2,0,7,2739.0
3,0,7,533.0
4,0,9,2538.0


Perfect! Now that the anime IDs are mapped to *new_anime_index*, we can use this data to build our model. But first, let's limit our data to the first 6000 *user_ids*. Using the entire dataset would take a lot of computation time and memory, so let's cut it down first. Maybe one day when I get a better computer, I'll be able to use the entirety of the data XD.

In [53]:
#use only the ratings from the first 6000 user_ids (0 to 5999)
anime_rating_cut = anime_rating[anime_rating['user_id'].isin(list(range(0,6000)))]

Here, we can see that we're down to 931,948 ratings. We can also see that 2 values are missing in *new_anime_index*. This is because we removed some anime (`episode = 0`) at the beginning of this notebook, so the left merge found no match for them. Therefore, we can remove the rows with missing values. We should also change the data type of *new_anime_index* back to `int`.

In [54]:
anime_rating_cut.info()

<class 'pandas.core.frame.DataFrame'>
Index: 931948 entries, 0 to 931947
Data columns (total 3 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   user_id          931948 non-null  int64  
 1   rating           931948 non-null  int64  
 2   new_anime_index  931946 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 28.4 MB


In [55]:
#animes with missing new_anime_index, these animes were removed from the dataframe in the start for having 0 episodes
anime_rating_cut[anime_rating_cut['new_anime_index'].isna()]

Unnamed: 0,user_id,rating,new_anime_index
155111,943,7,
742432,4892,5,


In [56]:
#select the rows where there are no missing value
anime_rating_cut = anime_rating_cut.loc[anime_rating_cut['new_anime_index'].notnull()]
#change data type back to int
anime_rating_cut['new_anime_index'] = anime_rating_cut['new_anime_index'].astype(int)
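
The merge-then-filter steps above can be pictured with a plain-Python analogy. The IDs and ratings here are hypothetical, and the dictionary lookup only stands in for what the left merge does; it isn't how pandas implements it. It also hints at why *new_anime_index* turned into a float column: the default integer dtype in pandas can't hold missing values, so the column is upcast to `float64` as soon as an ID has no match.

```python
# Hypothetical lookup from the original MAL_ID to the notebook's new index
mal_to_new = {1: 0, 5: 1, 6: 2}

# (user_id, anime_id, rating) rows; anime_id 999 stands for an anime that was dropped earlier
ratings = [(0, 1, 9), (0, 999, 7), (1, 6, 5)]

# Analogue of the left merge: IDs without a match map to None (NaN in pandas)
merged = [(user, mal_to_new.get(anime), rating) for user, anime, rating in ratings]

# Analogue of filtering with notnull(): keep only the rows with a known index
cleaned = [row for row in merged if row[1] is not None]

print(merged)
print(cleaned)
```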

# Surprise

With no more missing values, our data is ready for collaborative filtering. We can implement collaborative filtering easily with the `surprise` module. Don't be SURPRISED by the power of it! (LOL)

To use the `surprise` module, we need a `Reader` object that defines the `rating_scale`, and we then load the dataframe into a `Dataset` that `surprise` can handle. We define the `rating_scale` as `(1, 10)` because that is the range of ratings viewers can give to an anime. When loading the data from the dataframe, it's also important to make sure that the columns are ordered as 1) user ID, 2) item ID (the anime, in our case) and 3) rating.

In [57]:
#define rating scale
reader = Reader(rating_scale=(1, 10))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(anime_rating_cut[["user_id", "new_anime_index", "rating"]], reader)

Then, we'll split our data into training set and testing set to facilitate model comparison.

In [58]:
trainset, testset = train_test_split(data, test_size=0.25)

# Model Comparison

Now, we can feed the data into the different models available in the `surprise` module. There are many other models available for use, but we'll stick to the more popular ones. 

## Baseline Models

`NormalPredictor` and `BaselineOnly` are both baseline models that provide a naive prediction strategy based on the statistical properties of the data. `NormalPredictor` predicts a rating for a given user-item pair by drawing a random value from a normal distribution whose mean and standard deviation are estimated from the training data. `BaselineOnly`, on the other hand, predicts the **baseline estimate** for the pair: the global mean rating of the training set plus a **user bias** and an **item bias**. The biases are learned from the training data and represent how much each user or item tends to deviate from the global mean rating.
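
Here's a minimal sketch of the baseline idea (global mean + user bias + item bias) in plain Python. The ratings are hypothetical, the alternating update scheme is a simplified version in the spirit of ALS, and the regularization values `reg_u` and `reg_i` are assumptions for illustration. It is not meant to reproduce the exact numbers `BaselineOnly` would give.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) triples on the 1-10 scale
ratings = [("u1", "a", 9), ("u1", "b", 7), ("u2", "a", 8),
           ("u2", "c", 4), ("u3", "b", 6), ("u3", "c", 3)]

mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
b_user = defaultdict(float)
b_item = defaultdict(float)
reg_u, reg_i = 15, 10  # assumed regularization strengths

for _ in range(10):  # a few alternating passes
    # Update the item biases while holding the user biases fixed
    num, cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        num[i] += r - mu - b_user[u]
        cnt[i] += 1
    for i in num:
        b_item[i] = num[i] / (reg_i + cnt[i])

    # Update the user biases while holding the item biases fixed
    num, cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        num[u] += r - mu - b_item[i]
        cnt[u] += 1
    for u in num:
        b_user[u] = num[u] / (reg_u + cnt[u])


def baseline_predict(user, item):
    # Unknown users or items contribute a bias of 0, so the prediction falls back to the mean
    return mu + b_user.get(user, 0.0) + b_item.get(item, 0.0)


print(baseline_predict("u1", "c"), baseline_predict("new_user", "new_item"))
```

A brand-new user paired with a brand-new anime simply gets the global mean, which is about the best a model with no information can do.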

## KNN-based Models

`KNNBasic` and `KNNBaseline` are KNN-based models. You should have an idea of how these models work if you're familiar with the kNN algorithm. The `KNNBasic` model uses a simple nearest-neighbor approach to make predictions. For a given user-item pair, the model finds the k most similar users (or items) in the training set, based on a similarity measure between their rating vectors. The default measure in `surprise` is MSD (mean squared difference), which is why the logs below say "Computing the msd similarity matrix"; cosine and Pearson similarity are available through `sim_options`. The model then computes a similarity-weighted average of the neighbors' ratings to make a prediction for the user-item pair. The `KNNBaseline` model is similar to `KNNBasic`, but takes a baseline estimate into account (the global mean rating plus the user and item biases). Instead of averaging the neighbors' raw ratings, it averages how far each neighbor's rating deviates from its own baseline estimate and adds that to the baseline estimate of the user-item pair.
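
Here's a rough sketch of a user-based nearest-neighbor prediction with an MSD-style similarity, written in plain Python. The ratings dictionary is hypothetical, and both helper functions are simplified illustrations (no minimum neighbor count, no shrinkage), so treat it as a picture of the idea rather than what `KNNBasic` does internally.

```python
# Hypothetical ratings: user -> {item: rating}
ratings = {
    "u1": {"a": 9, "b": 7, "c": 3},
    "u2": {"a": 8, "b": 7, "c": 4, "d": 9},
    "u3": {"a": 2, "b": 5, "c": 9, "d": 3},
}


def msd_similarity(user_a, user_b):
    """Similarity derived from the mean squared difference over co-rated items."""
    common = ratings[user_a].keys() & ratings[user_b].keys()
    if not common:
        return 0.0
    msd = sum((ratings[user_a][i] - ratings[user_b][i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


def knn_predict(user, item, k=2):
    """Similarity-weighted average of the k nearest neighbors who rated `item`."""
    neighbors = [(msd_similarity(user, other), r[item])
                 for other, r in ratings.items() if other != user and item in r]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return None  # no usable neighbors for this pair
    return sum(sim * rating for sim, rating in neighbors) / total_sim


print(msd_similarity("u1", "u2"), msd_similarity("u1", "u3"))
print(knn_predict("u1", "d"))
```

User `u2` rates almost exactly like `u1`, so `u2`'s 9 for item `d` dominates the prediction, while the dissimilar `u3` barely moves it.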

## Matrix Factorization-based Model

`SVD` (Singular Value Decomposition) is a matrix factorization-based model and is also one of the most popular. Classical singular value decomposition breaks a matrix down into three matrices, but it needs a complete matrix, and a rating matrix is mostly empty. The `SVD` algorithm in `surprise` is the variant popularized during the Netflix Prize: it learns only two matrices, 1) a user-feature matrix and 2) an item-feature matrix, from the ratings that were actually observed. These features are latent variables and the number of them is a hyperparameter that can be decided by you. It's a little hard to express how SVD works in words, so I recommend watching this [youtube video](https://www.youtube.com/watch?v=ZspR5PZemcs&t=840s), which clarifies how matrix factorization works in recommendation systems. The algorithm learns the values of the matrices with stochastic gradient descent by minimizing the sum of squared errors between the predicted ratings and the actual ratings in the training set, with a regularization penalty that helps prevent overfitting. The predicted rating for a given user-item pair is the dot product of the user-feature vector and the item-feature vector, plus the global mean and the user and item biases when `biased=True` (the default).
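
Here's a minimal sketch of this kind of matrix factorization trained with stochastic gradient descent, in plain Python. The six ratings, the learning rate, the regularization strength and the number of epochs are all assumptions chosen for a toy run; it illustrates the update rule and is not the `surprise` implementation.

```python
import math
import random

random.seed(0)

# Hypothetical (user, item, rating) triples with integer IDs
ratings = [(0, 0, 9), (0, 1, 7), (1, 0, 8), (1, 2, 4), (2, 1, 6), (2, 2, 3)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg, n_epochs = 0.01, 0.02, 200  # assumed values for this toy run

mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
b_u = [0.0] * n_users
b_i = [0.0] * n_items
# Latent factor matrices, initialized with small random values
p = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
q = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]


def predict(u, i):
    # Global mean + user bias + item bias + dot product of the latent vectors
    return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))


def rmse():
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))


rmse_before = rmse()
for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - predict(u, i)
        # Move each parameter along its error gradient, shrunk by the regularization term
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        for f in range(n_factors):
            puf, qif = p[u][f], q[i][f]
            p[u][f] += lr * (err * qif - reg * puf)
            q[i][f] += lr * (err * puf - reg * qif)
rmse_after = rmse()

print(rmse_before, rmse_after)
```

The training error drops as the biases and latent vectors adapt. On six ratings that says nothing about generalization, which is what the cross validation below is for.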

We'll use cross validation to get a sense of how the models compare with each other.

In [60]:
#instantiate our models and put them into a dictionary
models = {"Normal Predictor": NormalPredictor(),
          "Baseline Only": BaselineOnly(),
          "KNNBasic": KNNBasic(),
          "KNNBaseline": KNNBaseline(),
          "SVD": SVD()}

#collect one row of results per model
results = []

#loop over each algorithm
for algo_name, algo in tqdm(models.items()):

    #perform 3-fold cross validation (RMSE and MAE are reported by default)
    cv_dict = cross_validate(algo, data, cv=3)

    #average each metric across the folds
    results.append({'model_name': algo_name,
                    'test_rmse': np.mean(cv_dict['test_rmse']),
                    'test_mae': np.mean(cv_dict['test_mae']),
                    'fit_time': np.mean(cv_dict['fit_time']),
                    'test_time': np.mean(cv_dict['test_time'])})

#build the comparison dataframe once, which avoids concatenating onto an empty dataframe
model_comparison_df = pd.DataFrame(results)


 20%|██        | 1/5 [00:12<00:50, 12.60s/it]

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


 40%|████      | 2/5 [00:28<00:43, 14.64s/it]

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


 60%|██████    | 3/5 [05:21<04:43, 141.71s/it]

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


100%|██████████| 5/5 [11:04<00:00, 132.85s/it]


In [61]:
#visualize model comparison results
model_comparison_df

Unnamed: 0,model_name,test_rmse,test_mae,fit_time,test_time
0,Normal Predictor,2.30397,1.824872,0.817205,1.666014
1,Baseline Only,1.249828,0.944354,1.861716,1.854364
2,KNNBasic,1.329197,0.991563,9.459602,86.50042
3,KNNBaseline,1.224685,0.922603,11.117616,89.76752
4,SVD,1.222993,0.919167,7.301846,2.663356


After performing cross validation using our different algorithms, `SVD` achieved the lowest RMSE (1.2230) and MAE (0.9192). `KNNBaseline` came very close (RMSE 1.2247), but it needed about 90 seconds per fold to generate predictions, versus under 3 seconds for `SVD`. The rest of the ranking is not too surprising: the baseline models are very simple, and KNN-based models have trouble when users or anime have few ratings in common (the **cold-start** problem), which is probably why `KNNBasic` did worse than `BaselineOnly`. Let's tune the `SVD` model to see if we can get better results.

# Singular Value Decomposition (SVD)

There are numerous hyperparameters we can play with for SVD. Here is a list of the parameters that we'll set, along with their explanations.

- `n_factors`: This hyperparameter controls the number of latent factors used to represent users and items. Increasing the number of factors can lead to a more expressive model but can also increase the risk of overfitting.
- `n_epochs`: This hyperparameter controls the number of iterations of the stochastic gradient descent algorithm used to train the model. Increasing the number of epochs can improve the accuracy of the model but can also increase the training time.
- `lr_all`: This hyperparameter controls the learning rate for all parameters in the model. A higher learning rate can lead to faster convergence but can also make the model unstable.
- `reg_all`: This hyperparameter controls the regularization strength for all parameters in the model. Regularization helps prevent overfitting by adding a penalty term to the objective function, which encourages the model to have smaller parameter values. Increasing the regularization strength can lead to a simpler model but can also decrease the accuracy.
- `biased`: This hyperparameter controls whether to include bias terms for users and items in the model. Bias terms capture how far the ratings of each user and each item tend to sit from the global average rating, and including them can improve the accuracy of the model by accounting for these systematic deviations.
- `random_state`: This hyperparameter controls the random seed used by the model. Setting a fixed random seed ensures reproducibility of the results.

To save time, I'll only try a limited set of values for each hyperparameter. But feel free to experiment with different values and let me know if you find a set of hyperparameters that performs better.

In [62]:
svd_param_grid = {'n_factors': [150, 200],
                  'n_epochs': [20, 30],
                  'lr_all': [0.005],
                  'reg_all': [0.02, 0.05],
                  'biased': [True],
                  'random_state': [42]}

svd_gs = GridSearchCV(SVD, svd_param_grid, measures=["rmse", "mae"], cv=3, n_jobs=-1, joblib_verbose=3)

svd_gs.fit(data)

# best RMSE score
print(svd_gs.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(svd_gs.best_params["rmse"])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 out of  24 | elapsed:   23.5s remaining:   16.8s


1.1916842593049433
{'n_factors': 200, 'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.05, 'biased': True, 'random_state': 42}


[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:   35.5s finished
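
If you're wondering where the "24 out of 24" in the log comes from, here's a small sketch that expands the same grid with `itertools.product`. The names `param_grid`, `combinations` and `n_folds` are just local names for this illustration; the grid values are copied from the cell above.

```python
from itertools import product

# Same grid as in the search above
param_grid = {'n_factors': [150, 200],
              'n_epochs': [20, 30],
              'lr_all': [0.005],
              'reg_all': [0.02, 0.05],
              'biased': [True],
              'random_state': [42]}

# Every combination of one value per hyperparameter
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

n_folds = 3
print(len(combinations))            # number of candidate settings
print(len(combinations) * n_folds)  # number of model fits across all folds
```

Eight candidate settings times three folds gives 24 fits, so every extra value in the grid multiplies the running time.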


Looking good! Tuning lowered the cross-validated RMSE from about 1.223 to about 1.192. Now that we have these better parameters, let's fit the SVD model on our training dataset.

In [63]:
#SVD model with the best parameters
best_SVD = svd_gs.best_estimator["rmse"]

#train the model with the training dataset
best_SVD.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x14271fc20>

# Testing

Finally, we can use this SVD model to make predictions on our testing dataset.

In [64]:
#uid = user_id, iid = new_anime_index, r_ui = actual rating, est = predicted rating
predictions = best_SVD.test(testset)
predictions

[Prediction(uid=5094, iid=407, r_ui=9.0, est=8.978099091315837, details={'was_impossible': False}),
 Prediction(uid=5122, iid=535, r_ui=10.0, est=9.251076265379842, details={'was_impossible': False}),
 Prediction(uid=2464, iid=2362, r_ui=8.0, est=8.268569884299962, details={'was_impossible': False}),
 Prediction(uid=751, iid=4702, r_ui=8.0, est=6.986396889625195, details={'was_impossible': False}),
 Prediction(uid=3100, iid=1531, r_ui=9.0, est=8.899361658542146, details={'was_impossible': False}),
 Prediction(uid=2566, iid=13553, r_ui=9.0, est=8.554513034123142, details={'was_impossible': False}),
 Prediction(uid=2158, iid=9092, r_ui=8.0, est=7.628330750456886, details={'was_impossible': False}),
 Prediction(uid=4387, iid=13328, r_ui=9.0, est=9.14957783589672, details={'was_impossible': False}),
 Prediction(uid=613, iid=11363, r_ui=8.0, est=7.846650125982564, details={'was_impossible': False}),
 Prediction(uid=3123, iid=8058, r_ui=6.0, est=5.2116232679155, details={'was_impossible': Fa

In case you're wondering what each of these variables represent:

- *uid*: user_id
- *iid*: the anime, identified by its *new_anime_index*
- *r_ui*: actual rating
- *est*: predicted rating

We can then use the `rmse` function from the `accuracy` module of `surprise` to calculate the final RMSE.

In [65]:
#find the rmse of the testing result
accuracy.rmse(predictions, verbose=True)

RMSE: 1.1756


1.175572763042182
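
To show what goes into that number, here's a sketch that computes RMSE (and MAE) by hand on the first nine predictions printed above, with the estimates rounded to four decimals. Nine rows are nowhere near representative of the whole test set, so the result won't match the 1.1756 above; the point is only the formula.

```python
import math

# (actual rating, predicted rating) pairs copied from the first nine predictions above
pairs = [(9.0, 8.9781), (10.0, 9.2511), (8.0, 8.2686), (8.0, 6.9864), (9.0, 8.8994),
         (9.0, 8.5545), (8.0, 7.6283), (9.0, 9.1496), (8.0, 7.8467)]

errors = [actual - est for actual, est in pairs]

# RMSE: square the errors, average them, take the square root
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}")
```

Because the errors are squared before averaging, RMSE punishes the one big miss (8.0 vs 6.9864) more heavily than MAE does.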

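With an RMSE of about 1.18 on a 1-10 scale, the predictions are off by a little more than one rating point on average (in the root-mean-square sense), which is pretty close to the actual ratings. Hooray! We did it!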

# Conclusion

Wow! That was a good one. We covered quite a lot in this notebook, from content-based recommendations to collaborative filtering. There are still many more sophisticated models out there, like neural collaborative filtering. Maybe one day I'll be able to add that to this notebook. Anyways, I hope you learned something from this notebook, and please give it an upvote if you found it interesting. Feel free to comment down below if you have any suggestions or feedback. Cheers <3