# Anime Recommendation 🐱‍👤🐱‍🏍🤖


<p style="text-align: center"><img src="https://storage.googleapis.com/kaggle-datasets-images/3384322/5890423/cff45124ace014117ada5f9435b5b624/dataset-cover.jpg?t=2023-06-11-13-43-52"></p>

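This notebook recommends anime using **item-based content filtering**: each title is turned into a text representation of its metadata, and titles with similar representations are recommended together.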

### 1. Import packages and data

In [1]:
"""Import Packages"""
import pandas as pd
import numpy as np
import re
import plotly.express as px
import plotly.graph_objects as go 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
import pickle

In [2]:
# Dataset Ref: https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset/data
data = pd.read_csv('anime-dataset-2023.csv')
data.head(3)

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,"Apr 1, 1998 to Sep 30, 1998",...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   anime_id      24905 non-null  int64 
 1   Name          24905 non-null  object
 2   English name  24905 non-null  object
 3   Other name    24905 non-null  object
 4   Score         24905 non-null  object
 5   Genres        24905 non-null  object
 6   Synopsis      24905 non-null  object
 7   Type          24905 non-null  object
 8   Episodes      24905 non-null  object
 9   Aired         24905 non-null  object
 10  Premiered     24905 non-null  object
 11  Status        24905 non-null  object
 12  Producers     24905 non-null  object
 13  Licensors     24905 non-null  object
 14  Studios       24905 non-null  object
 15  Source        24905 non-null  object
 16  Duration      24905 non-null  object
 17  Rating        24905 non-null  object
 18  Rank          24905 non-null  object
 19  Popu

Contents of each column:

|*Column*|*Content*|*Column*|*Content*|
|----------|-----------|----------|-----------|
|**anime_id**| *Unique ID for each anime*.|**Licensors**| *The licensors of the anime (e.g., streaming platforms)*.|
|**Name**| *The name of the anime in its original language*.|**Studios**| *The animation studios that worked on the anime*.|
|**English name**| *The English name of the anime*.|**Source**| *The source material of the anime (e.g., manga, light novel, original)*.|
|**Other name**| *Native name or title of the anime (can be in Japanese, Chinese or Korean)*.|**Duration**| *The duration of each episode*.|
|**Score**| *The score or rating given to the anime*.|**Rating**| *The age rating of the anime*.|
|**Genres**| *The genres of the anime, separated by commas*.|**Rank**| *The rank of the anime based on popularity or other criteria*.|
|**Synopsis**| *A brief description or summary of the anime's plot*.|**Popularity**| *The popularity rank of the anime*.|
|**Type**| *The type of the anime (e.g., TV series, movie, OVA, etc.)*.|**Favorites**| *The number of times the anime was marked as a favorite by users*.|
|**Episodes**| *The number of episodes in the anime*.|**Scored By**| *The number of users who scored the anime*.|
|**Aired**| *The dates when the anime was aired*.|**Members**| *The number of members who have added the anime to their list on the platform*.|
|**Premiered**| *The season and year when the anime premiered*.|**Image URL**| *The URL of the anime's image or poster*.|
|**Status**| *The status of the anime (e.g., Finished Airing, Currently Airing, etc.)*.|**Producers**| *The production companies or producers of the anime*.|

In [4]:
data.describe()

Unnamed: 0,anime_id,Popularity,Favorites,Members
count,24905.0,24905.0,24905.0,24905.0
mean,29776.709014,12265.388356,432.595222,37104.96
std,17976.07629,7187.428393,4353.181647,156825.2
min,1.0,0.0,0.0,0.0
25%,10507.0,6040.0,0.0,209.0
50%,34628.0,12265.0,1.0,1056.0
75%,45240.0,18491.0,18.0,9326.0
max,55735.0,24723.0,217606.0,3744541.0


In [5]:
data['Genres'].value_counts()

Genres
UNKNOWN                                          4929
Comedy                                           2279
Fantasy                                          1341
Hentai                                           1181
Drama                                             624
                                                 ... 
Adventure, Comedy, Drama, Romance, Sci-Fi           1
Boys Love, Comedy, Supernatural                     1
Fantasy, Suspense, Ecchi                            1
Fantasy, Romance, Slice of Life, Supernatural       1
Romance, Suspense                                   1
Name: count, Length: 1006, dtype: int64

In [6]:
data['Score'].value_counts()

Score
UNKNOWN    9213
6.31         80
6.54         80
6.25         79
6.51         79
           ... 
4.05          1
2.9           1
3.03          1
3.65          1
9.0           1
Name: count, Length: 567, dtype: int64

In [7]:
data['Aired'].value_counts()

Aired
Not available                   915
2012 to ?                        76
2011 to ?                        74
2005                             74
2010 to ?                        72
                               ... 
Jun 7, 2017 to May 21, 2021       1
Jun 13, 2023                      1
Dec 30, 2021                      1
Jun 1, 2022                       1
Oct 10, 2001 to Mar 23, 2005      1
Name: count, Length: 15213, dtype: int64

In [8]:
data['Rank'].value_counts()

Rank
UNKNOWN    4612
0.0         187
18804.0       4
12591.0       4
9618.0        4
           ... 
14456.0       1
14699.0       1
805.0         1
55.0          1
599.0         1
Name: count, Length: 15198, dtype: int64

In [9]:
# Replacing unknown score with mean score
scores = data[data['Score'] != 'UNKNOWN']['Score']
scores = scores.astype('float64')
score_mean = scores.mean()
print("Mean Score: ", score_mean)

Mean Score:  6.3808896252867715


In [10]:
data['Score'] = data['Score'].replace('UNKNOWN', score_mean).astype('float64')
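
The two steps above (take the mean of the known scores, then substitute it for `UNKNOWN`) can also be written with `pd.to_numeric`. The sketch below runs on a small made-up Series rather than the real dataset, and it assumes that every non-numeric entry should be treated as missing, which is what `errors='coerce'` does.

```python
import pandas as pd

# Toy stand-in for data['Score']; the values are made up for illustration.
raw_scores = pd.Series(['8.75', 'UNKNOWN', '6.25', 'UNKNOWN', '7.00'])

# Non-numeric entries such as 'UNKNOWN' become NaN.
numeric_scores = pd.to_numeric(raw_scores, errors='coerce')

# mean() skips NaN by default, so only the known scores contribute.
score_mean = numeric_scores.mean()

# Fill the gaps with the mean of the known scores.
filled_scores = numeric_scores.fillna(score_mean)
print(filled_scores.tolist())
```

Mean imputation keeps the column numeric, but in this dataset it gives the 9,213 titles with an unknown score the same value, which is worth remembering when the score is later used as a feature.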

In [11]:
# Replacing unknown rank with NaN
data['Rank'] = data['Rank'].replace('UNKNOWN', np.nan).astype('float64')

In [12]:
# Extract the year in which the anime aired (midpoint year for a date range)
def extract_year(air_col):
    if air_col == "Not available":
        return np.nan
    years = re.findall(r'\b(19\d{2}|20\d{2})\b', air_col)
    if len(years) == 2:
        # Date range: use the midpoint between the start and end years
        return (int(years[0]) + int(years[1])) // 2
    elif len(years) == 1:
        return int(years[0])
    else:
        return np.nan
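
To make the behaviour of `extract_year` concrete, the snippet below repeats the function so that it runs on its own and applies it to a few strings in the formats seen in the `Aired` column. A date range is mapped to the midpoint year, so a long-running series is assigned a year in the middle of its run rather than its start year.

```python
import re

import numpy as np

def extract_year(air_col):
    """Same logic as the notebook function, repeated here to be self-contained."""
    if air_col == "Not available":
        return np.nan
    years = re.findall(r'\b(19\d{2}|20\d{2})\b', air_col)
    if len(years) == 2:
        return (int(years[0]) + int(years[1])) // 2
    elif len(years) == 1:
        return int(years[0])
    return np.nan

samples = [
    "Apr 3, 1998 to Apr 24, 1999",    # range within two adjacent years
    "Oct 10, 2001 to Mar 23, 2005",   # multi-year range
    "2012 to ?",                      # open-ended range
    "Not available",                  # missing value
]
for text in samples:
    print(f"{text!r} -> {extract_year(text)}")
```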

In [13]:
data['Aired'] = data['Aired'].apply(extract_year).astype('Int64')
data['Aired']

0        1998
1        2001
2        1998
3        2002
4        2004
         ... 
24900    2023
24901    2023
24902    2023
24903    2022
24904    2022
Name: Aired, Length: 24905, dtype: Int64

### 2. Interactive visualization

In [14]:
# Count the number of anime titles by type
type_counts = data['Type'].value_counts()

# Bar chart based on type of anime
fig = px.bar(type_counts, x = type_counts.index, y = type_counts.values, color = type_counts.index, labels = {'x': 'Anime Type', 'y': 'Count'}, 
             title='Count of Anime Titles by Type')

fig.show()

In [15]:
# Filter out anime titles with popularity value 0
df_valid_popularity = data[data['Popularity'] > 0]

# Sort the dataframe by popularity rank and select the top 15
top_15_popular = df_valid_popularity.sort_values(by = 'Popularity', ascending = True).head(15)

# Bar chart of top 15 anime based on popularity
fig = px.bar(top_15_popular, x = 'Name', y = 'Popularity',
             labels = {'Name': 'Anime Title', 'Popularity': 'Popularity Rank'},
             title = 'Top 15 Most Popular Animes',
             color = 'Name')
# Note: 'Popularity' is a rank, so a lower number means a more popular anime.
fig.show()

In [16]:
# Sort the dataframe by the number of members who have added the anime to their list
top_15_members = data.sort_values(by = 'Members', ascending = False).head(15)

# Bar chart of the top 15 anime by member count
fig = px.bar(top_15_members, x = 'Name', y = 'Members', labels = {'Members': 'Number of Users', 'Name': 'Anime Title'}, 
             color = 'Name', title = 'Top 15 Animes by Number of Users Who Added the Anime to Their List')

fig.show()

In [17]:
# Creating the count of individual genre
genre_counts = data[data['Genres'] != "UNKNOWN"]['Genres'].apply(lambda x: x.split(', ')).explode().value_counts()

# Bar chart to show the count of anime for each genre
fig = px.bar(genre_counts, x = genre_counts.index, y = genre_counts.values,
             labels = {'x': 'Genre', 'y': 'Count'},
             title = 'Count of Anime Titles by Genre',
             color = genre_counts.index)

fig.show()
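
The genre count above relies on splitting each comma-separated string and then using `explode` to give every genre its own row. A minimal sketch on made-up data (the four strings below are illustrative, not taken from the dataset) shows the intermediate result:

```python
import pandas as pd

# Toy stand-in for data['Genres'].
genres = pd.Series(['Action, Sci-Fi', 'Comedy', 'UNKNOWN', 'Action, Comedy, Drama'])

genre_counts = (
    genres[genres != 'UNKNOWN']   # drop rows without genre information
    .str.split(', ')              # 'Action, Sci-Fi' -> ['Action', 'Sci-Fi']
    .explode()                    # one row per individual genre
    .value_counts()               # count titles per genre
)
print(genre_counts)
```

Because a title with three genres contributes three rows, the counts add up to more than the number of titles.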

In [18]:
data['Licensors'].value_counts()

Licensors
UNKNOWN                            20170
Funimation                           957
Sentai Filmworks                     818
Discotek Media                       275
Aniplex of America                   222
                                   ...  
Funimation, Muse Communication         1
Crunchyroll, Muse Communication        1
Aniplex of America, Crunchyroll        1
Funimation, Travel Compass             1
Bandai Namco Online                    1
Name: count, Length: 265, dtype: int64

In [19]:
# Flatten the comma-separated licensor strings into one list, skipping 'UNKNOWN' rows
licensors_list = [licensor for licensors in data[data['Licensors'] != "UNKNOWN"]['Licensors'].str.split(', ') for licensor in licensors]

# Count the occurrences of each licensor
licensor_counts = pd.Series(licensors_list).value_counts()

# Drop any stray 'Unknown' entries left after splitting (rows equal to 'UNKNOWN' were already skipped above)
filtered_licensor_counts = licensor_counts[licensor_counts.index != 'Unknown']

# Select the top 10 licensors
top_10_licensors = filtered_licensor_counts.head(10)

# Bar plot to show top 10 licensors
fig = px.bar(top_10_licensors, x = top_10_licensors.index, y = top_10_licensors.values, color = top_10_licensors.index)

# Customize the plot
fig.update_layout(
    title = 'Top 10 Anime Licensors',
    xaxis_title = 'Licensors',
    yaxis_title = 'Count',
    xaxis_tickangle = -45
)

# Show the plot
fig.show()

In [20]:
data['Premiered'].value_counts()

Premiered
UNKNOWN        19399
spring 2017       88
fall 2016         83
spring 2018       81
spring 2016       78
               ...  
summer 1962        1
summer 1993        1
summer 2024        1
winter 2025        1
summer 2025        1
Name: count, Length: 244, dtype: int64

In [21]:
# Function to extract the season and year from the premiered string
def extract_season_year(premiered):
    if premiered == 'UNKNOWN':
        return None, None
    else:
        season, year = premiered.split()
        return season, int(year)

# Apply the function to extract the season and year from the "Premiered" column
season_year = data['Premiered'].map(extract_season_year)
premiered_season = season_year.apply(lambda x: x[0])
premiered_Year = season_year.apply(lambda x: x[1])
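
As an alternative to the row-wise function above, the same split can be sketched with the vectorized `Series.str.extract`. The example uses three made-up values and assumes every known entry has the exact form `<season> <4-digit year>`, as in the `value_counts()` output; anything else, including `UNKNOWN`, becomes missing.

```python
import pandas as pd

# Toy stand-in for data['Premiered'].
premiered = pd.Series(['spring 2017', 'UNKNOWN', 'fall 2016'])

# Named groups become column names; rows that do not match yield NaN.
parts = premiered.str.extract(r'^(?P<season>\w+) (?P<year>\d{4})$')

# Convert the year strings to a nullable integer column.
parts['year'] = pd.to_numeric(parts['year']).astype('Int64')
print(parts)
```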


In [22]:
# Filter out None values from premiered_season
filtered_premiered_season = premiered_season.dropna()

# Count the occurrences of each season
season_counts = filtered_premiered_season.value_counts()

# Donut chart showing the number of anime per premiere season
fig = go.Figure(data = go.Pie(
    labels = season_counts.index,
    values = season_counts.values,
    hole = 0.4,  # Add a donut hole in the center
    hoverinfo = 'label+percent',  # Display label and percentage on hover
    textinfo = 'value',  # Display count value as text inside each slice
    textfont = dict(size = 14),  # Set the text font size
    marker = dict(
                  colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'],  # Custom color palette
                  line = dict(color = '#ffffff', width = 2)  # Set the color and width of the slice borders
                 )
))

# Set the title and font style for the plot
fig.update_layout(
    title = 'Distribution of Premiered Seasons',
    title_font = dict(size = 20),
    font = dict(size = 12)
)

fig.show()

In [23]:
# Keep only rows whose 'Premiered' value contains one of the known season names
season_df = data[data['Premiered'].str.contains('|'.join(season_counts.index))]
season_df.head(3)

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,1998,...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,1998,...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,2002,...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...


In [24]:
# Treemap of the anime that premiered in each season
fig = go.Figure(go.Treemap(
                           labels=['Season'] + season_counts.index.tolist() + season_df['Name'].tolist(),
                           parents=[''] + ['Season'] * len(season_counts.index.tolist()) + season_df['Premiered'].apply(lambda x: x.split(' ')).apply(lambda x: x[0]).tolist(),
                           root_color = 'lightblue',
                           hovertemplate='Name: %{label}<br>Season: %{parent}'
                          ))

fig.update_layout(margin = dict(t = 50, l = 5, r = 5, b = 5))

# Set the title
fig.update_layout(
                  title = 'Anime Premiered in Each Season (Treemap)',
                  title_font = dict(size = 20),
                  font = dict(size = 12)
                 )

fig.show()

In [25]:
# Filter out None values from premiered_Year
filtered_premiered_year = premiered_Year.dropna()

# Count the occurrences of each year
year_counts = filtered_premiered_year.value_counts()

# Sort the years in ascending order
sorted_years = sorted(year_counts.index)

# Bar plot to show count of anime based on premiered year
fig = go.Figure(data = go.Bar(
    x = sorted_years,
    y = year_counts[sorted_years],
    marker = dict(color = '#1f97a3'),  # Set the color of the bars
))

# Set the title and axis labels
fig.update_layout(
    title = 'Number of Animes Premiered by Year',
    xaxis_title = 'Year',
    yaxis_title = 'Number of Animes',
    title_font = dict(size = 20),
    font = dict(size = 12)
)

fig.show()

In [26]:
# Filter out missing values from the 'Aired' year column
filtered_aired_year = data['Aired'].dropna()

# Count the occurrences of each year
year_counts = filtered_aired_year.value_counts()

# Sort the years in ascending order
sorted_years = sorted(year_counts.index)

# Bar plot to show count of anime based on aired
fig = go.Figure(data = go.Bar(
    x = sorted_years,
    y = year_counts[sorted_years],
    marker = dict(color = '#1f77a4'),  # Set the color of the bars
))

# Set the title and axis labels
fig.update_layout(
    title = 'Number of Animes Aired by Year',
    xaxis_title = 'Year',
    yaxis_title = 'Number of Animes',
    title_font = dict(size = 20),
    font = dict(size = 12)
)

fig.show()

In [27]:
data['Studios'].value_counts()

Studios
UNKNOWN                                10526
Toei Animation                           834
Sunrise                                  532
J.C.Staff                                385
Shanghai Animation Film Studio           335
                                       ...  
Hananona Studio                            1
OLM, TOHO animation STUDIO                 1
Fever Creations                            1
Studio M2, Miyu Productions                1
Shin-Ei Animation, Miyu Productions        1
Name: count, Length: 1547, dtype: int64

In [28]:
# Count the occurrences of each studio
studio_counts = data['Studios'].value_counts()

# Filter the studio_counts series to exclude 'UNKNOWN'
studio_counts = studio_counts[studio_counts.index != 'UNKNOWN']

# Select the top 10 studios with the highest number of animes
top_studios = studio_counts.head(10)

# Bar plot to show count of anime based on studios
fig = go.Figure(data = go.Bar(
                              x = top_studios.index,
                              y = top_studios.values,
                              marker = dict(color = top_studios.values, colorscale = 'Reds'),  # Set the color of the bars using a colorscale
                              text = top_studios.values,  # Label each bar with its count
                              hovertemplate = 'Studio: %{x}<br>Number of Animes: %{y}<extra></extra>',  # Customize the hover template
                             ))

# Set the title and axis labels
fig.update_layout(
                  title = 'Number of Animes by Studio (Top 10)',
                  xaxis_title = 'Studios',
                  yaxis_title = 'Number of Animes',
                  title_font = dict(size = 20),
                  font = dict(size = 12),
                  plot_bgcolor = 'rgba(0, 0, 0, 0.015)'  # Nearly transparent plot background
                 )

fig.show()

In [29]:
top_studios_df = data[data['Studios'].isin(top_studios.index)]
top_studios_df.head(3)

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,1998,...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,1998,...,Madhouse,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...
3,7,Witch Hunter Robin,Witch Hunter Robin,Witch Hunter ROBIN (ウイッチハンターロビン),7.25,"Action, Drama, Mystery, Supernatural",Robin Sena is a powerful craft user drafted in...,TV,26.0,2002,...,Sunrise,Original,25 min per ep,PG-13 - Teens 13 or older,2764.0,1795,613,42829.0,111931,https://cdn.myanimelist.net/images/anime/10/19...


In [30]:
# Creating the treemap plot to show anime of top 10 studios
fig = go.Figure(go.Treemap(
                           labels=['Studios'] + top_studios.index.tolist() + top_studios_df['Name'].tolist(),
                           parents=[''] + ['Studios'] * len(top_studios.index.tolist()) + top_studios_df['Studios'].tolist(),
                           root_color = 'lightgrey',
                           hovertemplate='Name: %{label}<br>Studio: %{parent}'
                          ))

fig.update_layout(margin = dict(t = 50, l = 10, r = 10, b = 10))

# Set the title
fig.update_layout(
                  title = 'Top 10 Studios with Anime (Treemap)',
                  title_font = dict(size = 20),
                  font = dict(size = 12)
                 )

fig.show()

In [31]:
data['Source'].value_counts()

Source
Original        9622
Manga           4687
Unknown         3689
Game            1232
Visual novel    1107
Other           1008
Light novel      968
Novel            709
Web manga        447
Music            395
4-koma manga     314
Picture book     210
Book             191
Mixed media      162
Web novel         82
Card game         68
Radio             14
Name: count, dtype: int64

In [32]:
# Count the occurrences of each source
source_counts = data['Source'].value_counts()

# Filter the source_counts series to exclude 'Unknown'
source_counts = source_counts[source_counts.index != 'Unknown']

# Horizontal bar plot to show count of anime based on source
fig = go.Figure(data = go.Bar(
    x = source_counts.values,
    y = source_counts.index,
    orientation = 'h',  # Set the orientation to horizontal
    marker = dict(color = source_counts.values, colorscale = 'Viridis'),  # Set the color of the bars using a colorscale
    text = source_counts.values,  # Label each bar with its count
    hovertemplate = 'Source: %{y}<br>Number of Animes: %{x}<extra></extra>',  # Customize the hover template
))

# Set the title and axis labels
fig.update_layout(
    title = 'Number of Animes by Source',
    xaxis_title = 'Number of Animes',
    yaxis_title = 'Source',
    title_font = dict(size = 20),
    font = dict(size = 12)
)

fig.show()

In [33]:
data['Favorites'].value_counts()

Favorites
0         10808
1          2310
2          1202
3           798
4           543
          ...  
5139          1
3004          1
198986        1
76343         1
47235         1
Name: count, Length: 1814, dtype: int64

In [34]:
# Sort the DataFrame by the 'Favorites' column in descending order
sorted_df = data.sort_values('Favorites', ascending = False)

# Select the top 10 most favorited anime
top_favorites = sorted_df.head(10)

# Horizontal bar plot to show count of anime based on favourites
fig = go.Figure(data = go.Bar(
    x = top_favorites['Favorites'],
    y = top_favorites['Name'],
    orientation = 'h',  # Set the orientation to horizontal
    marker = dict(color = '#1f99b4'),  # Set the color of the bars
    text = top_favorites['Favorites'],  # Label each bar with its favorites count
    hovertemplate = 'Anime: %{y}<br>Favorites: %{x}<extra></extra>',  # Customize the hover template
))

# Set the title and axis labels
fig.update_layout(
    title = 'Top 10 Most Favorited Anime',
    xaxis_title = 'Number of Favorites',
    yaxis_title = 'Anime',
    title_font = dict(size = 20),
    font = dict(size = 12)
)

fig.show()

### 3. Similarity matrix

In [35]:
# Combine selected column values into a single text "soup" per anime
def create_soup(x):
    def clean_and_format(element):
        # Convert element to string, remove special characters, and convert to lowercase
        return re.sub(r'[^a-zA-Z0-9\s]', '', str(element)).lower()

    def format_rating(element):
        # Convert element to string, remove spaces, and convert to lowercase
        return re.sub(' ', '', str(element)).lower()

    return ' '.join([
        'score_' + f"{x['Score']:.2f}",
        'popularity_' + str(x['Popularity']),
        'type_' + clean_and_format(x['Type']),
        'episodes_' + str(x['Episodes']),
        'aired_' + str(x['Aired']),
        'rank_' + str(x['Rank']),
        'favorites_' + str(x['Favorites']),
        'status_' + str(x['Status']),
        clean_and_format(x['Producers']),
        clean_and_format(x['Licensors']),
        clean_and_format(x['Studios']),
        clean_and_format(x['Genres']),
        #clean_and_format(x['Synopsis']),
        clean_and_format(x['Source']),
        format_rating(x['Rating']),
    ])

In [36]:
data['soup'] = data.apply(create_soup, axis = 1)
data.head(3)

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL,soup
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,1998,...,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...,score_8.75 popularity_43 type_tv episodes_26.0...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,2001,...,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...,score_8.38 popularity_602 type_movie episodes_...
2,6,Trigun,Trigun,トライガン,8.22,"Action, Adventure, Sci-Fi","Vash the Stampede is the man with a $$60,000,0...",TV,26.0,1998,...,Manga,24 min per ep,PG-13 - Teens 13 or older,328.0,246,15035,356739.0,727252,https://cdn.myanimelist.net/images/anime/7/203...,score_8.22 popularity_246 type_tv episodes_26....
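
Before vectorizing, it helps to see how soup strings such as `score_8.75` are split into tokens. scikit-learn's `CountVectorizer` and `TfidfVectorizer` use the documented default `token_pattern` of `(?u)\b\w\w+\b` after lowercasing. The sketch below imitates that step with the standard `re` module only; the helper `tokenize` is a hypothetical stand-in, and stop-word removal is left out.

```python
import re

# Default token pattern of CountVectorizer/TfidfVectorizer: runs of 2+ word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(soup):
    """Approximate the default vectorizer tokenization: lowercase, then regex findall."""
    return re.findall(DEFAULT_TOKEN_PATTERN, soup.lower())

sample_soup = "score_8.75 popularity_43 type_tv episodes_26.0 r-17+(violence&profanity)"
print(tokenize(sample_soup))
# -> ['score_8', '75', 'popularity_43', 'type_tv', 'episodes_26', '17', 'violence', 'profanity']
```

The decimal point acts as a token boundary, so `score_8.75` and `score_8.38` share the token `score_8`, while the fractional part becomes a separate numeric token. Single-character pieces such as the `0` in `episodes_26.0` or the `r` in the rating are dropped.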


In [37]:
data = data.reset_index()

In [38]:
indices = pd.Series(data.index, index=data['Name']).drop_duplicates()
indices

Name
Cowboy Bebop                           0
Cowboy Bebop: Tengoku no Tobira        1
Trigun                                 2
Witch Hunter Robin                     3
Bouken Ou Beet                         4
                                   ...  
Wu Nao Monu                        24900
Bu Xing Si: Yuan Qi                24901
Di Yi Xulie                        24902
Bokura no Saishuu Sensou           24903
Shijuuku Nichi                     24904
Length: 24905, dtype: int64

In [39]:
# Cosine similarity matrix built from CountVectorizer term counts
count = CountVectorizer(stop_words = 'english')
count_matrix = count.fit_transform(data['soup'].values)
cosine_property = cosine_similarity(count_matrix, count_matrix)

In [40]:
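# Similarity matrix built from TF-IDF vectors; TfidfVectorizer L2-normalizes rows by default,
# so the plain dot product from linear_kernel equals the cosine similarity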
tfidf = TfidfVectorizer(stop_words = 'english')
tfidf_matrix = tfidf.fit_transform(data['soup'].values)
tfidf_prop = linear_kernel(tfidf_matrix, tfidf_matrix)
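
Both matrices above hold pairwise cosine similarities between the soup vectors. As a minimal sketch of what is being computed, the NumPy snippet below works on a made-up 3 x 4 count matrix and uses a hypothetical helper, `cosine_similarity_matrix`. It also shows why a plain dot product is enough once rows have unit length, which is the case for TF-IDF vectors under the default `norm='l2'`.

```python
import numpy as np

# Toy term-count matrix: 3 items (rows) x 4 vocabulary terms (columns).
counts = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
])

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of a dense matrix."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / norms        # assumes no all-zero rows
    return unit_rows @ unit_rows.T    # dot product of unit vectors = cosine

sim = cosine_similarity_matrix(counts)
print(np.round(sim, 3))
```

With 24,905 titles, each dense float64 matrix holds 24,905 x 24,905 values, roughly 5 GB, so the two cells above need a machine with enough memory.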

### 4. Recommendation

In [41]:
# Function to recommend anime for the given anime title 
def get_recommendations(title, cosine_sim, suggest_amount = 15):
    try:
        idx = indices[title]
    except KeyError:
        raise ValueError(f"Anime {title} not found in list.")

    return get_recommendations_by_id(idx, cosine_sim, suggest_amount)


def get_recommendations_by_id(idx, cosine_sim, suggest_amount = 15):

    # Extracting the respective anime similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sorting the similarity scores in descending order
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)

    # Skip the first entry (normally the anime itself) and keep the next suggest_amount entries;
    # slicing past the end of the list is safe, so no explicit clamping is needed
    sim_scores = sim_scores[1 : suggest_amount + 1]

    # Get the anime indices
    anime_indices = [i[0] for i in sim_scores]

    # Map indices to anime_id and name
    anime_ids = data.iloc[anime_indices]['anime_id'].values
    anime_names = data.iloc[anime_indices]['Name'].values
    scores = np.array([i[1] for i in sim_scores])

    anime_link = [f"https://myanimelist.net/anime/{anime_id}/{name}" for anime_id, name in zip(anime_ids, anime_names)]

    # Create a DataFrame
    recommendation_ratings = pd.DataFrame({
        'Anime': anime_names,
        'Score': scores,
        'Rec_Pos': range(1, len(sim_scores) + 1),
        'anime_id': anime_ids,
        'anime_link': anime_link
    })

    return recommendation_ratings
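
The ranking step at the heart of the function can be sketched in isolation with NumPy. The helper `top_k_similar` and the similarity values below are hypothetical. Unlike the function above, this version removes the query item by index instead of assuming it sorts first, which may matter when another title ties with a similarity of 1.0.

```python
import numpy as np

def top_k_similar(sim_row, query_idx, k):
    """Return (index, score) pairs for the k items most similar to query_idx."""
    order = np.argsort(sim_row)[::-1]        # indices sorted by descending score
    order = order[order != query_idx][:k]    # drop the query itself, keep k
    return [(int(i), float(sim_row[i])) for i in order]

# Made-up similarity row for item 0 against five items (including itself).
sim_row = np.array([1.0, 0.2, 0.9, 0.5, 0.7])
print(top_k_similar(sim_row, query_idx=0, k=3))
# -> [(2, 0.9), (4, 0.7), (3, 0.5)]
```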


In [42]:
# Based on Countvectorizer
get_recommendations('One Piece', cosine_property)

Unnamed: 0,Anime,Score,Rec_Pos,anime_id,anime_link
0,Dragon Ball Z,0.613396,1,813,https://myanimelist.net/anime/813/Dragon Ball Z
1,Dragon Ball GT,0.572503,2,225,https://myanimelist.net/anime/225/Dragon Ball GT
2,One Piece: Kinkyuu Kikaku One Piece Kanzen Kou...,0.572503,3,16143,https://myanimelist.net/anime/16143/One Piece:...
3,Toriko,0.566139,4,10033,https://myanimelist.net/anime/10033/Toriko
4,One Piece Film: Z,0.556349,5,12859,https://myanimelist.net/anime/12859/One Piece ...
5,Dragon Ball,0.549125,6,223,https://myanimelist.net/anime/223/Dragon Ball
6,Dragon Ball Super,0.549125,7,30694,https://myanimelist.net/anime/30694/Dragon Bal...
7,One Piece: Episode of Nami - Koukaishi no Nami...,0.543557,8,15323,https://myanimelist.net/anime/15323/One Piece:...
8,One Piece: Episode of Merry - Mou Hitori no Na...,0.543557,9,19123,https://myanimelist.net/anime/19123/One Piece:...
9,One Piece: Episode of Sabo - 3 Kyoudai no Kizu...,0.543557,10,31289,https://myanimelist.net/anime/31289/One Piece:...


In [43]:
# Based on Tfidf vectorizer
get_recommendations('One Piece', tfidf_prop)

Unnamed: 0,Anime,Score,Rec_Pos,anime_id,anime_link
0,One Piece: Kinkyuu Kikaku One Piece Kanzen Kou...,0.353611,1,16143,https://myanimelist.net/anime/16143/One Piece:...
1,Lovely★Complex,0.258388,2,2034,https://myanimelist.net/anime/2034/Lovely★Complex
2,Kochira Katsushikaku Kameari Kouenmae Hashutsu...,0.233473,3,3547,https://myanimelist.net/anime/3547/Kochira Kat...
3,One Piece: Episode of Nami - Koukaishi no Nami...,0.216593,4,15323,https://myanimelist.net/anime/15323/One Piece:...
4,One Piece: Episode of Merry - Mou Hitori no Na...,0.215722,5,19123,https://myanimelist.net/anime/19123/One Piece:...
5,One Piece Film: Z,0.215714,6,12859,https://myanimelist.net/anime/12859/One Piece ...
6,Dragon Ball Z,0.213025,7,813,https://myanimelist.net/anime/813/Dragon Ball Z
7,One Piece: Adventure of Nebulandia,0.210191,8,32051,https://myanimelist.net/anime/32051/One Piece:...
8,One Piece: Episode of Sabo - 3 Kyoudai no Kizu...,0.20983,9,31289,https://myanimelist.net/anime/31289/One Piece:...
9,Tousouchuu: Great Mission,0.206325,10,54040,https://myanimelist.net/anime/54040/Tousouchuu...


In [44]:
with open('anime.pkl', 'wb') as f:
    pickle.dump(data, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(cosine_property, f)
with open('tfidf_sim.pkl', 'wb') as f:
    pickle.dump(tfidf_prop, f)
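
The saved files can later be restored with `pickle.load`. The round trip below is a minimal sketch that uses a tiny made-up matrix and a temporary directory instead of the real files, so it runs without the dataset.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for one of the similarity matrices (made-up values).
similarity = np.array([[1.0, 0.3], [0.3, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'similarity.pkl')
    # Write the object, closing the file as soon as the block ends.
    with open(path, 'wb') as f:
        pickle.dump(similarity, f)
    # Read it back into a new object.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(np.array_equal(similarity, restored))
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.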

### 5. References
1. [Kaggle](https://www.kaggle.com/code/dbdmobile/anime-recommendation-1)
2. [Kaggle](https://www.kaggle.com/code/lachimolalala/anime-recommendation-system/notebook)