Content-based filtering uses information about items or users to generate recommendations. This notebook walks through two such models for anime: a genre-based system and a synopsis-based system. Both use item content to measure how similar two anime are, and recommendations are then made by suggesting the items whose genres or synopses are most similar.

## Load Data

The dataset is the anime-dataset-2023.csv file from https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset/

In [2]:
import pandas as pd

anime = pd.read_csv('anime-dataset-2023.csv')

In [3]:
anime.head()

| Unnamed: 0 | anime_id | Name | English name | Other name | Score | Genres | Synopsis | Type | Episodes | Aired | ... | Studios | Source | Duration | Rating | Rank | Popularity | Favorites | Scored By | Members | Image URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | 8.75 | Action, Award Winning, Sci-Fi | Crime is timeless. By the year 2071, humanity ... | TV | 26.0 | Apr 3, 1998 to Apr 24, 1999 | ... | Sunrise | Original | 24 min per ep | R - 17+ (violence & profanity) | 41.0 | 43 | 78525 | 914193.0 | 1771505 | https://cdn.myanimelist.net/images/anime/4/196... |
| 1 | 5 | Cowboy Bebop: Tengoku no Tobira | Cowboy Bebop: The Movie | カウボーイビバップ 天国の扉 | 8.38 | Action, Sci-Fi | Another day, another bounty—such is the life o... | Movie | 1.0 | Sep 1, 2001 | ... | Bones | Original | 1 hr 55 min | R - 17+ (violence & profanity) | 189.0 | 602 | 1448 | 206248.0 | 360978 | https://cdn.myanimelist.net/images/anime/1439/... |
| 2 | 6 | Trigun | Trigun | トライガン | 8.22 | Action, Adventure, Sci-Fi | Vash the Stampede is the man with a $$60,000,0... | TV | 26.0 | Apr 1, 1998 to Sep 30, 1998 | ... | Madhouse | Manga | 24 min per ep | PG-13 - Teens 13 or older | 328.0 | 246 | 15035 | 356739.0 | 727252 | https://cdn.myanimelist.net/images/anime/7/203... |
| 3 | 7 | Witch Hunter Robin | Witch Hunter Robin | Witch Hunter ROBIN (ウイッチハンターロビン) | 7.25 | Action, Drama, Mystery, Supernatural | Robin Sena is a powerful craft user drafted in... | TV | 26.0 | Jul 3, 2002 to Dec 25, 2002 | ... | Sunrise | Original | 25 min per ep | PG-13 - Teens 13 or older | 2764.0 | 1795 | 613 | 42829.0 | 111931 | https://cdn.myanimelist.net/images/anime/10/19... |
| 4 | 8 | Bouken Ou Beet | Beet the Vandel Buster | 冒険王ビィト | 6.94 | Adventure, Fantasy, Supernatural | It is the dark century and the people are suff... | TV | 52.0 | Sep 30, 2004 to Sep 29, 2005 | ... | Toei Animation | Manga | 23 min per ep | PG - Children | 4240.0 | 5126 | 14 | 6413.0 | 15001 | https://cdn.myanimelist.net/images/anime/7/215... |


### Genre-based 

To measure the similarity between the genres of two anime, we represent each anime's genres as a binary vector, with a value of 1 for every genre the anime has and a value of 0 for every genre it doesn't, and take the cosine similarity of the two vectors.
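As a sketch of why counting shared genres is enough, write $G_a$ and $G_b$ for the genre sets of two anime and $\mathbf{a}$, $\mathbf{b}$ for their binary vectors (these symbols are introduced here for illustration and do not appear in the notebook). Each shared genre contributes $1 \cdot 1$ to the dot product, and each vector's squared norm equals its number of genres, so:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{|G_a \cap G_b|}{\sqrt{|G_a|}\,\sqrt{|G_b|}}
$$

For example, using the rows shown in the `anime.head()` output, Cowboy Bebop (Action, Award Winning, Sci-Fi) and Trigun (Action, Adventure, Sci-Fi) share two genres and have three genres each, giving $2 / (\sqrt{3}\,\sqrt{3}) = 2/3 \approx 0.67$.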

In [4]:
import math

def get_genre_similarity(anime1, anime2):
    # Look up each title's comma-separated genre string and split it into a set of genres
    genre1 = set(anime.loc[anime['Name'] == anime1, 'Genres'].iloc[0].split(', '))
    genre2 = set(anime.loc[anime['Name'] == anime2, 'Genres'].iloc[0].split(', '))

    # For binary vectors, the dot product is the number of shared genres
    shared_genres = genre1 & genre2
    # and each vector's norm is the square root of its number of genres
    return len(shared_genres) / (math.sqrt(len(genre1)) * math.sqrt(len(genre2)))  # cosine similarity

print(get_genre_similarity('One Piece', 'Jujutsu Kaisen'))
print(get_genre_similarity('One Piece', 'Love Hina'))


0.6666666666666667
0.0

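The set-based shortcut used in `get_genre_similarity` can be checked against explicit binary vectors. The following is a minimal sketch, assuming the genre strings from the `anime.head()` output for Cowboy Bebop and Trigun; `to_binary_vector` is a hypothetical helper introduced only for this illustration.

```python
import math

# Genre strings copied from the anime.head() output
genres_a = 'Action, Award Winning, Sci-Fi'.split(', ')  # Cowboy Bebop
genres_b = 'Action, Adventure, Sci-Fi'.split(', ')      # Trigun

# Shared vocabulary so both vectors list genres in the same order
vocabulary = sorted(set(genres_a) | set(genres_b))

def to_binary_vector(genres, vocabulary):
    """Hypothetical helper: 1 where the anime has the genre, 0 otherwise."""
    return [1 if genre in genres else 0 for genre in vocabulary]

vec_a = to_binary_vector(genres_a, vocabulary)
vec_b = to_binary_vector(genres_b, vocabulary)

# Textbook cosine similarity on the explicit vectors
dot_product = sum(x * y for x, y in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(x * x for x in vec_a))
norm_b = math.sqrt(sum(x * x for x in vec_b))
cosine_explicit = dot_product / (norm_a * norm_b)

# Set-based shortcut, mirroring get_genre_similarity
shared = set(genres_a) & set(genres_b)
cosine_shortcut = len(shared) / (math.sqrt(len(genres_a)) * math.sqrt(len(genres_b)))

print(vocabulary)
print(vec_a, vec_b)
print(cosine_explicit, cosine_shortcut)
```

Both printed similarities should agree (about 0.667 for this pair), which is why the notebook never needs to build the vectors explicitly.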

### Synopsis-based

To measure the similarity between the synopses of two anime, we pass each synopsis through a text embedding model and take the cosine similarity of the resulting embeddings. Here we use the bge-large-en-v1.5 model (https://huggingface.co/BAAI/bge-large-en-v1.5), which scores highly on the Hugging Face MTEB embedding leaderboard (https://huggingface.co/spaces/mteb/leaderboard).

In [5]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Load the model from the Hugging Face Hub once, rather than on every call
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5')
model.eval()

def get_embedding_similarity(anime1, anime2):
    # Synopses we want sentence embeddings for
    synopsis1 = anime.loc[anime['Name'] == anime1, 'Synopsis'].iloc[0]
    synopsis2 = anime.loc[anime['Name'] == anime2, 'Synopsis'].iloc[0]

    # Tokenize both synopses as one batch, truncating any that exceed the model's input limit
    encoded_input = tokenizer([synopsis1, synopsis2], padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
        # Perform pooling. In this case, CLS pooling: keep the first token's hidden state.
        sentence_embeddings = model_output[0][:, 0]
    # Normalize embeddings to unit length, so their dot product is already the cosine similarity
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1).numpy()
    return np.dot(sentence_embeddings[0], sentence_embeddings[1])

print(get_embedding_similarity('One Piece', 'Jujutsu Kaisen'))
print(get_embedding_similarity('One Piece', 'Love Hina'))


0.6013926
0.5343566

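Because the embeddings are normalized to unit length, their dot product is already the cosine similarity. This also suggests embedding each synopsis once and reusing the result. The following is a minimal sketch of that idea using made-up 3-dimensional vectors instead of real model output; `l2_normalize`, `dot`, `cosine` and the `anime_a`/`anime_b`/`anime_c` keys are hypothetical names introduced for this illustration.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (the same operation as the normalize step)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy "embeddings": made-up numbers, not output of the bge model
raw = {
    'anime_a': [3.0, 4.0, 0.0],
    'anime_b': [4.0, 3.0, 0.0],
    'anime_c': [0.0, 0.0, 5.0],
}

# Normalize once and cache; afterwards every comparison is a single dot product
unit = {title: l2_normalize(vector) for title, vector in raw.items()}

sim_ab = dot(unit['anime_a'], unit['anime_b'])
sim_ac = dot(unit['anime_a'], unit['anime_c'])

print(sim_ab, cosine(raw['anime_a'], raw['anime_b']))
print(sim_ac)
```

With real embeddings the same pattern would apply to vectors produced by the model: normalize them once, store them, and compare any pair with a dot product rather than re-running the model for every comparison.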

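Both models rate One Piece as more similar to Jujutsu Kaisen than to Love Hina, which is expected. The gap is much wider for the genre-based model (0.67 versus 0.0) than for the synopsis-based model (0.60 versus 0.53).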

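To turn a pairwise similarity into recommendations, one option is to score every other title against a query title and keep the highest-scoring ones. The following is a minimal sketch under that assumption, using a toy catalog built from the five rows of the `anime.head()` output rather than the full dataset; `genre_cosine` and `recommend_by_genre` are hypothetical helpers, not functions from the notebook.

```python
import math

# Toy catalog: titles and genre strings copied from the anime.head() output
catalog = {
    'Cowboy Bebop': 'Action, Award Winning, Sci-Fi',
    'Cowboy Bebop: Tengoku no Tobira': 'Action, Sci-Fi',
    'Trigun': 'Action, Adventure, Sci-Fi',
    'Witch Hunter Robin': 'Action, Drama, Mystery, Supernatural',
    'Bouken Ou Beet': 'Adventure, Fantasy, Supernatural',
}

def genre_cosine(genres1, genres2):
    """Cosine similarity of two binary genre vectors, computed from genre sets."""
    set1, set2 = set(genres1.split(', ')), set(genres2.split(', '))
    return len(set1 & set2) / (math.sqrt(len(set1)) * math.sqrt(len(set2)))

def recommend_by_genre(title, catalog, k=3):
    """Hypothetical helper: return the k titles most similar in genre to `title`."""
    scores = [
        (other, genre_cosine(catalog[title], genres))
        for other, genres in catalog.items()
        if other != title  # never recommend the query itself
    ]
    # Highest similarity first; sorted() is stable, so ties keep catalog order
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

for name, score in recommend_by_genre('Cowboy Bebop', catalog):
    print(f'{name}: {score:.3f}')
```

On this toy catalog, the query Cowboy Bebop ranks its movie first (about 0.816), then Trigun (about 0.667), then Witch Hunter Robin (about 0.289). The same ranking loop would work with the synopsis-based similarity by swapping the scoring function.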
### References

https://developers.google.com/machine-learning/recommendation/content-based/basics