# Similarity model

In this notebook I compute the cosine similarity between anime synopses, representing each synopsis with the TF-IDF weighting model from the scikit-learn package. The spaCy library handles the NLP steps: stop-word removal, tokenization, and lemmatization.

First, import the required packages:

In [1]:
import numpy as np

import pandas as pd

from tqdm import tqdm

import spacy

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
tqdm.pandas() # to show the progressbar in pandas computations

Load the synopsis data:

In [3]:
df = pd.read_csv('archive/anime_with_synopsis.csv', index_col='MAL_ID', dtype=str)

The files are too big to host in this GitHub repository; they are available at https://www.kaggle.com/datasets/hernan4444/anime-recommendation-database-2020

The NLP model is provided by spaCy; here we load the en_core_web_lg model:

In [4]:
nlp = spacy.load('en_core_web_lg')

In [5]:
# The column name is spelled "sypnopsis" in the dataset
sypnopsis = df.sypnopsis.dropna()

Run the spaCy pipeline on each synopsis to build its Doc object:

In [6]:
docs = sypnopsis.progress_apply(nlp)

100%|██████████| 16206/16206 [02:59<00:00, 90.46it/s] 


Define a function that extracts the lemmas from a document, dropping stop words and non-alphabetic tokens:

In [7]:
en_stop_words = spacy.lang.en.stop_words.STOP_WORDS
    
def get_lemmas(doc, stop_words=en_stop_words):
    return ' '.join([
        token.lemma_
        for token in doc
        if token.lemma_ not in stop_words
        and token.lemma_.isalpha()
    ])
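
As a quick illustration of the filter, the sketch below applies the same logic to hand-made tokens. `FakeToken` and the three-word stop list are hypothetical stand-ins (a real run uses spaCy tokens and spaCy's stop-word set), so the lemmas here are typed in by hand rather than produced by a model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the lemma_ attribute is modeled.
FakeToken = namedtuple("FakeToken", "lemma_")


def get_lemmas(doc, stop_words):
    """Join the lemmas that are alphabetic and not stop words."""
    return " ".join(
        token.lemma_
        for token in doc
        if token.lemma_ not in stop_words and token.lemma_.isalpha()
    )


# Tiny illustrative stop list, not spaCy's STOP_WORDS.
toy_stop_words = {"the", "be", "a"}

# Hand-written lemmas for a sentence like "The pilots are fighting 3 robots."
toy_doc = [FakeToken(lemma) for lemma in
           ["the", "pilot", "be", "fight", "3", "robot", "."]]

result = get_lemmas(toy_doc, toy_stop_words)
print(result)
```

The stop words, the number, and the punctuation mark are dropped, and the remaining lemmas are joined into a single space-separated string, which is the input format the vectorizer expects later on.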

Applying it gives the lemmas of each synopsis:

In [8]:
lemmas = docs.apply(get_lemmas)

Create the TF-IDF model, requiring each term to appear in at least 100 synopses and in at most 90% of them. It also uses an n-gram range from 1 to 10:

In [9]:
tfidf = TfidfVectorizer(min_df=100, max_df=0.9, ngram_range=(1, 10))
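
To make the effect of `min_df`, `max_df`, and `ngram_range` concrete, here is a minimal pure-Python sketch of the vocabulary selection step only. It is not scikit-learn's implementation; the helper names `ngrams` and `build_vocabulary`, the toy corpus, and the small thresholds (`min_df=2`, n-grams up to 2) are assumptions chosen so the result is easy to check by hand.

```python
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """Return every contiguous n-gram with n_min <= n <= n_max as a string."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(docs, min_df, max_df, ngram_range):
    """Keep n-grams found in at least min_df docs and at most max_df * len(docs)."""
    doc_freq = Counter()
    for doc in docs:
        # set(): each n-gram counts once per document (document frequency).
        doc_freq.update(set(ngrams(doc.split(), *ngram_range)))
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


toy_docs = [
    "hero pilot giant robot",
    "hero pilot giant robot war",
    "hero join school club",
    "hero join music club",
]

vocabulary = build_vocabulary(toy_docs, min_df=2, max_df=0.9, ngram_range=(1, 2))
print(vocabulary)
```

In this toy corpus "hero" appears in every document and is dropped by `max_df`, while terms seen only once ("war", "school", "music") are dropped by `min_df`. With `min_df=100` on the real data, most long n-grams will probably fall below the threshold, so the upper end of the n-gram range likely contributes few features.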

Fit the model and transform the lemmas:

In [10]:
vec_data = tfidf.fit_transform(lemmas)

Computing the cosine similarity:

In [11]:
cos_sim = cosine_similarity(vec_data)
cos_sim = pd.DataFrame(cos_sim, index=lemmas.index,
                       columns=lemmas.index)

In [12]:
# Keep only the animes with a nonzero similarity to at least one anime
valid_idx = cos_sim.sum() > 0

In [13]:
cos_sim = cos_sim.loc[valid_idx, valid_idx]
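
The two steps above (pairwise cosine similarity, then dropping rows whose similarities sum to zero) can be sketched in plain Python. This is an illustration under stated assumptions, not the scikit-learn code: the `cosine` helper and the toy vectors are hypothetical, and returning 0 for an all-zero vector is an assumption meant to mirror what presumably happens to a synopsis that contains none of the vocabulary terms.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: an all-zero vector gets similarity 0 with everything.
    return dot / norm if norm else 0.0


# Toy non-negative vectors standing in for TF-IDF rows.
vectors = {
    "a": [1.0, 1.0, 0.0],
    "b": [2.0, 2.0, 0.0],      # same direction as "a"
    "c": [0.0, 0.0, 3.0],      # shares no term with "a" or "b"
    "empty": [0.0, 0.0, 0.0],  # no vocabulary term at all
}

sim = {
    i: {j: cosine(u, v) for j, v in vectors.items()}
    for i, u in vectors.items()
}

# Same idea as the filter above: keep rows with a positive total similarity.
valid = [key for key, row in sim.items() if sum(row.values()) > 0]
print(valid)
```

Because TF-IDF values are non-negative, a zero total means every entry in the row is zero, including the self-similarity, so only the empty vector is removed here.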

For the recommendation we do not store all the similarities; we only need the most similar animes for each one. First we count, for each anime, how many others fall in its top 0.1% most similar (above the 99.9th percentile, excluding similarities of 1):

In [14]:
sizes = {
    idx: row[(row > np.percentile(row, 99.9))&(row < 1)].size
    for idx, row in cos_sim.iterrows()
}
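
As a sanity check on what this count measures, the sketch below repeats it on one synthetic row. The row length 16206 is taken from the progress bar above and is only an assumption here, since the zero-similarity filter may have removed some animes; the evenly spaced values are made up, and `percentile` is a hypothetical helper that uses linear interpolation, the default method of `np.percentile`.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation between neighboring ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


n = 16206  # assumed row length (number of synopses processed above)

# Synthetic row of distinct values in [0, 1]; the last one plays the role
# of the self-similarity, which is exactly 1.
row = [i / (n - 1) for i in range(n)]

threshold = percentile(row, 99.9)

# Entries strictly above the 99.9th percentile, excluding the self-similarity.
count = sum(1 for v in row if threshold < v < 1)
print(count)
```

With distinct values the count depends only on the row length: about 0.1% of 16206 entries lie above the threshold, and removing the self-similarity leaves 16.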

Then we take the median value:

In [15]:
size = np.median([*sizes.values()]).astype(int)
size

16

This means that, at the median, 16 animes make up the top 0.1% most similar to a given anime. So we keep the 16 most similar for each one:

In [16]:
similarities = {
    idx: row[~np.isclose(row, 1)].nlargest(size)
    for idx, row in cos_sim.iterrows()
}
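
The selection above can be sketched without pandas as follows. The `top_similar` helper and the toy row are hypothetical, and `rel_tol=1e-5` is chosen only to roughly match the default relative tolerance of `np.isclose`.

```python
import heapq
import math


def top_similar(row, k):
    """Return the k (id, score) pairs with the highest score, skipping scores ~1."""
    candidates = [
        (anime_id, score)
        for anime_id, score in row.items()
        if not math.isclose(score, 1.0, rel_tol=1e-5)
    ]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Toy similarity row for the anime with id 1 (its self-similarity is 1.0).
toy_row = {1: 1.0, 5: 0.42, 6: 0.10, 20: 0.77, 30: 0.0}

top = top_similar(toy_row, k=2)
print(top)
```

Note that filtering on a score close to 1 removes the anime itself, but it would also remove any other anime whose synopsis is a near duplicate.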

Split the result into the IDs and the weights, normalizing the weights of each anime so that they sum to 1:

In [None]:
anime_id = pd.DataFrame({
    key: value.index.values
    for key, value in similarities.items()
}).T

anime_weight = pd.DataFrame({
    key: value.values / value.values.sum()
    for key, value in similarities.items()
}).T
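
The normalization in `anime_weight` divides each similarity by the total of its row, so the weights of one anime's neighbors can be read as proportions. A minimal sketch with made-up scores:

```python
import math

# Hypothetical similarity scores of the top neighbors of one anime.
toy_scores = [0.8, 0.4, 0.4]

total = sum(toy_scores)
toy_weights = [score / total for score in toy_scores]

# The weights keep the ranking of the scores and sum to 1.
print(toy_weights, math.isclose(sum(toy_weights), 1.0))
```

Here the first neighbor gets about half of the total weight and the other two about a quarter each.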

In [None]:
weights = pd.concat((anime_id, anime_weight),
                    keys=('MAL_ID', 'WEIGHT'),
                    axis=1)

In [None]:
weights.to_csv('models/weights.csv')

Get the anime names, falling back to the original name when the English name is unknown:

In [None]:
info = pd.read_csv('archive/anime.csv', index_col='MAL_ID')

In [None]:
unknown = info['English name'] == 'Unknown'
info.loc[unknown, 'English name'] = info.loc[unknown, 'Name']
names = info.loc[weights.index, 'English name'].to_frame()
names = names.reset_index()
names.columns = 'MAL_ID', 'Name'

In [None]:
names.to_csv('data/anime_list.csv', index=False)