# Anime Recommendation System

For this project, I developed a machine learning-powered anime recommendation system. Given the title of an anime, the system suggests 10 other animes to consider watching. It draws on a database of roughly 320,000 users and 16,000 animes and combines two signals: the summary of the anime you entered, and the ratings of users who have watched it. Animes with similar summaries, along with animes those users rated highly, are proposed as recommendations.

The database I used for this project was provided by user HERNAN VALDIVIESO on Kaggle. It contains data from myanimelist.net up to March 20th, 2020. 

https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020

Import all necessary libraries.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

Read in the datasets.

In [2]:
anime = pd.read_csv('/Users/KaryLy/Desktop/datasets/Anime Dataset/anime.csv')
rating_complete = pd.read_csv('/Users/KaryLy/Desktop/datasets/Anime Dataset/rating_complete.csv')
synopsis = pd.read_csv('/Users/KaryLy/Desktop/datasets/Anime Dataset/anime_with_synopsis.csv')

In [3]:
anime.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,English name,Japanese name,Type,Episodes,Aired,Premiered,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,カウボーイビバップ,TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,...,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space",Cowboy Bebop:The Movie,カウボーイビバップ 天国の扉,Movie,1,"Sep 1, 2001",Unknown,...,30043.0,49201.0,49505.0,22632.0,5805.0,1877.0,577.0,221.0,109.0,379.0
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",Trigun,トライガン,TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,...,50229.0,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,...,2182.0,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",Beet the Vandel Buster,冒険王ビィト,TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,...,312.0,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0


The anime dataset contains general information of every anime (17,562 different anime) like genre, stats, studio, etc. This dataframe has the following columns:
- MAL_ID: MyAnimelist ID of the anime. (e.g. 1)
- Name: full name of the anime. (e.g. Cowboy Bebop)
- Score: average score of the anime given from all users in MyAnimelist database. (e.g. 8.78)
- Genres: comma separated list of genres for this anime. (e.g. Action, Adventure, Comedy, Drama, Sci-Fi, Space)
- English name: full name in English of the anime. (e.g. Cowboy Bebop)
- Japanese name: full name in Japanese of the anime. (e.g. カウボーイビバップ)
- Type: TV, movie, OVA, etc. (e.g. TV)
- Episodes: number of episodes. (e.g. 26)
- Aired: broadcast date. (e.g. Apr 3, 1998 to Apr 24, 1999)
- Premiered: season premiere. (e.g. Spring 1998)
- Producers: comma separated list of producers (e.g. Bandai Visual)
- Licensors: comma separated list of licensors (e.g. Funimation, Bandai Entertainment)
- Studios: comma separated list of studios (e.g. Sunrise)
- Source: Manga, Light novel, Book, etc. (e.g. Original)
- Duration: duration of the anime per episode (e.g. 24 min. per ep.)
- Rating: age rating (e.g. R - 17+ (violence & profanity))
- Ranked: position based on the score. (e.g. 28)
- Popularity: position based on the number of users who have added the anime to their list. (e.g. 39)
- Members: number of community members that are in this anime's "group". (e.g. 1251960)
- Favorites: number of users who have the anime as "favorites". (e.g. 61,971)
- Watching: number of users who are watching the anime. (e.g. 105808)
- Completed: number of users who have completed the anime. (e.g. 718161)
- On-Hold: number of users who have the anime on Hold. (e.g. 71513)
- Dropped: number of users who have dropped the anime. (e.g. 26678)
- Plan to Watch: number of users who plan to watch the anime. (e.g. 329800)
- Score-10: number of users who scored 10. (e.g. 229170)
- Score-9: number of users who scored 9. (e.g. 182126)
- Score-8: number of users who scored 8. (e.g. 131625)
- Score-7: number of users who scored 7. (e.g. 62330)
- Score-6: number of users who scored 6. (e.g. 20688)
- Score-5: number of users who scored 5. (e.g. 8904)
- Score-4: number of users who scored 4. (e.g. 3184)
- Score-3: number of users who scored 3. (e.g. 1357)
- Score-2: number of users who scored 2. (e.g. 741)
- Score-1: number of users who scored 1. (e.g. 1580)

In [4]:
rating_complete.head()

Unnamed: 0,user_id,anime_id,rating
0,0,430,9
1,0,1004,5
2,0,3010,7
3,0,570,7
4,0,2762,9


The rating_complete dataset is a subset of animelist.csv. It only considers animes that the user has watched completely (watching_status==2) and scored (score!=0). It contains 57 million ratings applied to 16,872 animes by 310,059 users. This dataframe has the following columns:

- user_id: non-identifiable randomly generated user id.
- anime_id: MyAnimelist ID of the anime that this user has rated.
- rating: rating that this user has assigned.
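
The filter described above can be sketched on a toy frame. This is only an illustration: the column names `watching_status` and `rating` for animelist.csv are assumptions (the text above writes the score condition as `score!=0`), and the rows are invented.

```python
import pandas as pd

# Toy stand-in for animelist.csv; column names are assumed, rows are invented.
animelist = pd.DataFrame({
    "user_id":         [0, 0, 1, 1, 2],
    "anime_id":        [430, 1004, 430, 570, 570],
    "rating":          [9, 0, 7, 5, 8],
    "watching_status": [2, 2, 1, 2, 2],
})

# Keep entries the user completed (status 2) and actually scored (non-zero rating).
completed_and_scored = animelist[
    (animelist["watching_status"] == 2) & (animelist["rating"] != 0)
]

# Reduce to the three columns that rating_complete exposes.
toy_rating_complete = completed_and_scored[
    ["user_id", "anime_id", "rating"]
].reset_index(drop=True)
print(toy_rating_complete)
```

The unscored entry and the entry that is still being watched are both dropped, leaving three rows.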


In [5]:
synopsis.head()

Unnamed: 0,MAL_ID,Name,Score,Genres,sypnopsis
0,1,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space","In the year 2071, humanity has colonized sever..."
1,5,Cowboy Bebop: Tengoku no Tobira,8.39,"Action, Drama, Mystery, Sci-Fi, Space","other day, another bounty—such is the life of ..."
2,6,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen","Vash the Stampede is the man with a $$60,000,0..."
3,7,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",ches are individuals with special powers like ...
4,8,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",It is the dark century and the people are suff...


The synopsis dataset contains the summary of each anime. This dataframe has the following columns:

- MAL_ID: MyAnimelist ID of the anime. (e.g. 1)
- Name: full name of the anime. (e.g. Cowboy Bebop)
- Score: average score of the anime given from all users in MyAnimelist database. (e.g. 8.78)
- Genres: comma separated list of genres for this anime. (e.g. Action, Adventure, Comedy, Drama, Sci-Fi, Space)
- sypnopsis: summary of the anime. (The column name is misspelled in the dataset itself, so the code below uses 'sypnopsis'.)

# Cleaning the Data

I started off by removing unnecessary columns from each of the dataframes.

In [6]:
# Drop the columns that are not needed
anime = anime.drop(columns=['Japanese name', 'Aired', 'Premiered', 'Producers', 'Licensors', 'Studios', 'Source',
'Duration', 'Members', 'Favorites', 'Watching', 'Completed', 'On-Hold', 'Dropped', 'Plan to Watch', 'Score-10', 
'Score-9', 'Score-8', 'Score-7', 'Score-6', 'Score-5', 'Score-4', 'Score-3', 'Score-2', 'Score-1'])
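# Preview the synopsis dataframe without the duplicated columns. The result is not
# assigned, so synopsis itself keeps 'Name', which is used later for lookups.
synopsis.drop(columns=['Name', 'Score', 'Genres'])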

Unnamed: 0,MAL_ID,sypnopsis
0,1,"In the year 2071, humanity has colonized sever..."
1,5,"other day, another bounty—such is the life of ..."
2,6,"Vash the Stampede is the man with a $$60,000,0..."
3,7,ches are individuals with special powers like ...
4,8,It is the dark century and the people are suff...
...,...,...
16209,48481,No synopsis information has been added to this...
16210,48483,ko is a typical high school student whose life...
16211,48488,Sequel to Higurashi no Naku Koro ni Gou .
16212,48491,New Yama no Susume anime.


Then, I removed any rows that have no values for scores, rankings, and ratings. The recommendation system takes ratings into account, so it wouldn't be effective to keep animes that have no data for these columns. I also removed animes that contained adult content.

In [7]:
# Remove rows with 'Unknown' values for 'Score' or 'Ranked', then rows whose 'Rating' marks adult content
anime = anime.where(anime['Score'] != 'Unknown').dropna()
anime = anime.where(anime['Ranked'] != 'Unknown').dropna()
anime = anime.where(~anime['Rating'].str.contains('Hentai')).dropna()
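
As a sanity check on the filtering logic, here is a minimal sketch on invented rows. It uses plain boolean masks rather than the `where(...).dropna()` pattern. When no other values are missing, both approaches select the same rows, but masks leave integer columns such as the ID as integers instead of upcasting them to floats. The toy frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Invented rows that mimic the relevant columns of anime.csv.
toy_anime = pd.DataFrame({
    "MAL_ID": [1, 2, 3, 4],
    "Score":  ["8.78", "Unknown", "6.10", "7.00"],
    "Ranked": ["28.0", "Unknown", "Unknown", "1500.0"],
    "Rating": ["R - 17+ (violence & profanity)", "PG-13 - Teens 13 or older",
               "G - All Ages", "Rx - Hentai"],
})

# A row survives only if it has a score, has a rank, and is not adult content.
keep = (
    (toy_anime["Score"] != "Unknown")
    & (toy_anime["Ranked"] != "Unknown")
    & ~toy_anime["Rating"].str.contains("Hentai", regex=False)
)
filtered = toy_anime[keep]
print(filtered)
```

Only the first row passes all three conditions.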

I also kept only the ratings from users who have posted at least 200 ratings. The more ratings a user has, the more animes they have watched, and thus the more credible their ratings are as a basis for recommendations.

In [8]:
# Get ratings from only users that have posted at least 200 ratings
num_ratings = rating_complete.groupby('user_id')['anime_id'].nunique()
high_num_ratings = num_ratings[num_ratings >= 200]
rating_complete = rating_complete[rating_complete['user_id'].isin(high_num_ratings.index)]
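
The same count-then-filter pattern can be tried on a few invented rows. In this sketch a threshold of 2 stands in for the 200 used above; the data and the `min_ratings` name are hypothetical.

```python
import pandas as pd

# Invented ratings; user 0 rated three titles, user 1 one, user 2 two.
toy_ratings = pd.DataFrame({
    "user_id":  [0, 0, 0, 1, 2, 2],
    "anime_id": [1, 5, 6, 1, 5, 6],
    "rating":   [9, 7, 8, 10, 6, 7],
})
min_ratings = 2  # hypothetical threshold for the toy data

# Count distinct rated titles per user, then keep rows from users at or above the threshold.
counts = toy_ratings.groupby("user_id")["anime_id"].nunique()
active_users = counts[counts >= min_ratings].index
active_ratings = toy_ratings[toy_ratings["user_id"].isin(active_users)]

print(counts.to_dict())
print(sorted(active_ratings["user_id"].unique().tolist()))
```

User 1 has a single rating and is filtered out, while every row from users 0 and 2 is kept.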

I then merged the anime and rating_complete dataframes to make things easier.

In [9]:
# Rename the column so we can merge the anime and rating_complete dataframes
anime.rename(columns={'MAL_ID': 'anime_id'}, inplace=True)
anime_ratings = anime.merge(rating_complete, on='anime_id')
anime_ratings

Unnamed: 0,anime_id,Name,Score,Genres,English name,Type,Episodes,Rating,Ranked,Popularity,user_id,rating
0,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,3,9
1,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,6,6
2,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,19,8
3,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,44,9
4,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,53,10
...,...,...,...,...,...,...,...,...,...,...,...,...
40411511,48456.0,SK∞: Crazy Rock Jam,6.52,"Comedy, Sports",Unknown,Special,1,PG-13 - Teens 13 or older,5799.0,4830.0,338488,8
40411512,48456.0,SK∞: Crazy Rock Jam,6.52,"Comedy, Sports",Unknown,Special,1,PG-13 - Teens 13 or older,5799.0,4830.0,342067,5
40411513,48456.0,SK∞: Crazy Rock Jam,6.52,"Comedy, Sports",Unknown,Special,1,PG-13 - Teens 13 or older,5799.0,4830.0,347462,4
40411514,48456.0,SK∞: Crazy Rock Jam,6.52,"Comedy, Sports",Unknown,Special,1,PG-13 - Teens 13 or older,5799.0,4830.0,348266,5


When I skimmed over the dataframe, I noticed that some values for 'English name' were 'Unknown', so I replaced those values with the values in 'Name'. I also wanted to only recommend animes that were scored at least a 5, so I modified the dataframe to only keep animes with scores of 5 and above.

In [10]:
# Some values for 'English name' are 'Unknown', so replace it with the value in 'Name'
anime_ratings.loc[anime_ratings['English name'] == 'Unknown', 'English name'] = anime_ratings.loc[anime_ratings['English name'] == 'Unknown', 'Name']
# Only keep animes with a score of at least 5
anime_ratings = anime_ratings.where(anime_ratings['Score'].astype(float) >= 5).dropna()
anime_ratings

Unnamed: 0,anime_id,Name,Score,Genres,English name,Type,Episodes,Rating,Ranked,Popularity,user_id,rating
0,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,3.0,9.0
1,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,6.0,6.0
2,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,19.0,8.0
3,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,44.0,9.0
4,1.0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",Cowboy Bebop,TV,26,R - 17+ (violence & profanity),28.0,39.0,53.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...
40411511,48456.0,SK∞: Crazy Rock Jam,6.52,"Comedy, Sports",SK∞: Crazy Rock Jam,Special,1,PG-13 - Teens 13 or older,5799.0,4830.0,338488.0,8.0
40411512,48456.0,SK∞: Crazy Rock Jam,6.52,"Comedy, Sports",SK∞: Crazy Rock Jam,Special,1,PG-13 - Teens 13 or older,5799.0,4830.0,342067.0,5.0
40411513,48456.0,SK∞: Crazy Rock Jam,6.52,"Comedy, Sports",SK∞: Crazy Rock Jam,Special,1,PG-13 - Teens 13 or older,5799.0,4830.0,347462.0,4.0
40411514,48456.0,SK∞: Crazy Rock Jam,6.52,"Comedy, Sports",SK∞: Crazy Rock Jam,Special,1,PG-13 - Teens 13 or older,5799.0,4830.0,348266.0,5.0
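
The two clean-up steps above can be sketched on a small invented frame. The title "Low Scorer" and all values here are made up for illustration, and `Series.mask` is used as one possible alternative to the `.loc` assignment.

```python
import pandas as pd

# Invented rows; "Low Scorer" is a made-up title with a score under the cutoff.
toy = pd.DataFrame({
    "Name":         ["Cowboy Bebop", "SK∞: Crazy Rock Jam", "Low Scorer"],
    "English name": ["Cowboy Bebop", "Unknown", "Unknown"],
    "Score":        ["8.78", "6.52", "4.20"],
})

# Fall back to 'Name' wherever the English title is the placeholder 'Unknown'.
toy["English name"] = toy["English name"].mask(
    toy["English name"] == "Unknown", toy["Name"]
)

# Scores are stored as strings, so convert before comparing against the cutoff.
toy = toy[toy["Score"].astype(float) >= 5]
print(toy)
```

After both steps, no 'Unknown' titles remain and the low-scoring row is gone.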


# TF-IDF Vectorization

I then used TF-IDF vectorization to process the anime plot summaries and computed cosine similarities between them. The result is a cosine similarity matrix (cosine_sim) that can be used to find similar animes based on the content of their plot summaries. The indices Series maps anime names to their corresponding row indices in the matrix.

In [11]:
# TF-IDF vectorization so we can make recommendations based on the anime's plot summaries given in
# the 'sypnopsis' column.
tfv = TfidfVectorizer(stop_words='english')
synopsis['sypnopsis'] = synopsis['sypnopsis'].fillna('')
# Fitting the TF-IDF on the 'sypnopsis' text
tfv_matrix = tfv.fit_transform(synopsis['sypnopsis'])
tfv_matrix.shape
cosine_sim = linear_kernel(tfv_matrix, tfv_matrix)
indices = pd.Series(synopsis.index, index=synopsis['Name']).drop_duplicates()
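
A note on why `linear_kernel` yields cosine similarities here: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and the dot product of two unit-length vectors equals their cosine similarity. The sketch below checks this with NumPy on invented weights; it does not use the real synopsis data.

```python
import numpy as np

# Invented non-negative weights standing in for three TF-IDF rows.
raw = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])

# L2-normalize each row, as TfidfVectorizer does by default (norm='l2').
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Plain dot products between unit rows (what a linear kernel computes).
dot = unit @ unit.T

# Cosine similarity computed from the raw rows via the textbook formula.
norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(norms, norms)

print(np.allclose(dot, cosine))
```

Rows 0 and 1 point in the same direction, so their similarity is 1, while row 2 shares no terms with them and scores 0.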

# Recommendation

Finally, I created a function called get_anime_recommendations. It provides anime recommendations based on a combination of content-based and collaborative filtering approaches.

In [14]:
def get_anime_recommendations(anime_name, num_recommendations=5):
    # Check if the target anime name is in the data
    if anime_name not in indices:
        print('Anime not in data. Enter another anime.')
        return
    
    # Find the index of the given anime_name in the indices Series
    anime_index = indices[anime_name]
    
    # Get the cosine similarity scores for all animes with respect to the given anime
    sim_scores = list(enumerate(cosine_sim[anime_index]))
    
    # Sort the anime similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the indices of the top N similar animes
    top_indices = [i for i, score in sim_scores[1:num_recommendations+1]]
    
    # Get the names of the top recommended animes using the indices
    recommended_animes = synopsis['Name'].iloc[top_indices]
    
    # Filter the DataFrame to get users who watched the target anime and their ratings
    anime_users = anime_ratings[anime_ratings['English name'] == anime_name][['user_id', 'rating']]
    
    # Get every rating made by the users who watched the target anime
    similar_users = anime_ratings[anime_ratings['user_id'].isin(anime_users['user_id'])]
    
    # Group similar users by anime name and calculate the mean rating for each anime
    anime_mean_ratings = similar_users.groupby('English name')['rating'].mean()
    
    # Sort the anime ratings in descending order to get top-rated animes
    top_animes_ratings = anime_mean_ratings.sort_values(ascending=False)
    
    # Filter out the given anime from the recommendations
    top_animes_ratings = top_animes_ratings[top_animes_ratings.index != anime_name]
    
    # Convert Index to Series before concatenation
    top_animes_ratings_series = pd.Series(top_animes_ratings.index)
    
    # Combine the recommendations from content-based and collaborative filtering approaches
    combined_recommendations = pd.concat([recommended_animes, top_animes_ratings_series])
    
    # Exclude anime names that contain the target anime name
    # This makes it so that it does not include any sequels or spin-offs
    # regex=False matches the name literally, so titles with characters like '(' or '?' do not break the filter
    combined_recommendations = combined_recommendations[
        ~combined_recommendations.str.contains(anime_name, case=False, regex=False)
    ]
    
    return combined_recommendations.head(num_recommendations)

# Example: Get recommendations for an anime called 'Jujutsu Kaisen' and get the top 10 recommendations
target_anime_name = 'Jujutsu Kaisen'
recommended_anime = get_anime_recommendations(target_anime_name, num_recommendations=10)

# Display the recommended anime
print(recommended_anime)

15302    Uchida Shungicu no Noroi no One-Piece
5480                                        C³
2375             Shakugan no Shana II (Second)
14516                              Bai Niao Gu
3649                  Gegege no Kitarou (1968)
4175                  Gegege no Kitarou (1971)
4395                  Gegege no Kitarou (1996)
12908                                  Radiant
8898                         Grisaia no Rakuen
0              Fullmetal Alchemist:Brotherhood
dtype: object
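
To make the collaborative part of the function easier to follow, here is a minimal sketch of that step in isolation. The titles "A", "B", and "C" and all ratings are invented. It mirrors the logic above (users who rated the target, then mean rating per title among those users) and is not a drop-in replacement for the function.

```python
import pandas as pd

# Invented ratings: users 1 and 2 rated the target "A"; user 3 did not.
toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 2, 2, 3, 3],
    "English name": ["A", "B", "C", "A", "B", "B", "C"],
    "rating":       [9, 8, 6, 10, 10, 2, 10],
})
target = "A"

# Users who rated the target title.
fans = toy.loc[toy["English name"] == target, "user_id"]

# Everything those users rated, averaged per title, target removed, best first.
by_fans = toy[toy["user_id"].isin(fans)]
ranked = (
    by_fans.groupby("English name")["rating"].mean()
    .drop(target)
    .sort_values(ascending=False)
)
print(ranked)
```

User 3's ratings are ignored because that user never rated the target, so "B" ranks above "C". The ranking uses a plain mean with no minimum number of raters, so a title rated highly by a single user can land at the top.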
