# Anime Similarity

>After training our LDA model, we obtained a genre breakdown (a distribution over topics) for each anime.
>
>Using this output, we can measure how similar two anime shows are by comparing their genre breakdowns.
>
>One way of doing this is the *Hellinger* distance, an implementation of which is provided in *gensim*.
>
>Two shows with a **smaller distance** can be regarded as **more similar**, and vice versa.
>
>The following cells compute the distance for each pair of shows.
>
>The results are written to a single file in the JSON Lines format.
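
>As a minimal sketch of what this distance measures, the pure-Python function below computes the Hellinger distance between two dense probability vectors. The name `hellinger_dense` and the example vectors are illustrative assumptions, not part of this project; the actual computation later uses `gensim.matutils.hellinger`.

```python
from math import sqrt

def hellinger_dense(p, q):
    """Hellinger distance between two equal-length probability vectors."""
    if len(p) != len(q):
        raise ValueError('p and q must have the same length')
    squared = sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))
    return sqrt(0.5 * squared)

# Hypothetical genre breakdowns over three topics
show_a = [0.7, 0.2, 0.1]
show_b = [0.6, 0.3, 0.1]
show_c = [0.0, 0.1, 0.9]

print(hellinger_dense(show_a, show_a))  # identical breakdowns give 0.0
print(hellinger_dense(show_a, show_b) < hellinger_dense(show_a, show_c))  # True
```

>For vectors that each sum to 1, the value lies between 0 (identical) and 1 (no overlap at all).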

In [1]:
import json
from os import makedirs

# Create the output directory; do not fail if it already exists
makedirs('lda_distance', exist_ok=True)

## Load LDA Model

In [2]:
from gensim.models import LdaModel

lda_model = LdaModel.load('lda_model/lda_model')

## Read Text Input

In [3]:
from gensim.corpora.dictionary import Dictionary
from lda_helpers import read_lda_input  # Package with helpers

# Read each anime title together with its text, then infer its topic probabilities
title_texts = read_lda_input('lda_input/lda_input.jl', title=True)
title2probs = {title: lda_model[lda_model.id2word.doc2bow(text)] for title, text in title_texts}

## Compute Hellinger Distances

In [4]:
from gensim.matutils import hellinger
from math import comb
from tqdm.notebook import tqdm

bar1 = tqdm(title2probs.items(), desc='Anime Shows', bar_format='{l_bar}{bar}{n_fmt}/{total_fmt}{postfix}')
computed = set()  # Set of already computed Titles
N = len(title2probs)
bar2 = tqdm(total=comb(N,2), desc='Anime Pairs', bar_format='{l_bar}{bar}{n_fmt}/{total_fmt} [Elapsed: {elapsed}, Remaining: {remaining}]')

with open('lda_distance/lda_distance.jl', 'w') as f:
    for title1, probs1 in bar1:
        bar1.set_postfix_str(f'(Working on "{title1}")')
        for title2, probs2 in title2probs.items():
            if title1 == title2 or title2 in computed:
                continue  # Skip self-pairs and pairs that were already written
            # Cast to a built-in float so that json can always serialize it
            dist = float(hellinger(probs1, probs2))
            # Write the pair in both directions, one JSON object per line
            for first, second in ((title1, title2), (title2, title1)):
                record = {
                    'Title 1': first,
                    'Title 2': second,
                    'Distance': dist
                }
                f.write(json.dumps(record) + '\n')
            bar2.update()
        computed.add(title1)
    bar2.close()
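
>The nested loop above skips already-processed titles so that each unordered pair is computed only once. A minimal sketch of the same enumeration with `itertools.combinations` follows; the titles are hypothetical placeholders, and this is an alternative formulation rather than the code used above.

```python
from itertools import combinations
from math import comb

# Hypothetical titles standing in for the keys of title2probs
titles = ['Show A', 'Show B', 'Show C', 'Show D']

# Each unordered pair appears exactly once, in input order
pairs = list(combinations(titles, 2))

print(len(pairs) == comb(len(titles), 2))  # True: 6 pairs for 4 titles
print(pairs[0])  # ('Show A', 'Show B')
```

>This is why the second progress bar is sized with `comb(N, 2)`: it counts unordered pairs, not ordered ones.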

Anime Shows: 4757 total

Anime Pairs: 11312146 total
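
>Once written, a JSON Lines file can be read back one record per line. The sketch below ranks the nearest shows to a given title; it uses an in-memory sample with made-up titles and distances in place of `lda_distance/lda_distance.jl`, and the helper name `closest_shows` is an assumption.

```python
import io
import json

# Made-up records in the same shape as the output file
sample = io.StringIO(
    '{"Title 1": "Show A", "Title 2": "Show B", "Distance": 0.12}\n'
    '{"Title 1": "Show A", "Title 2": "Show C", "Distance": 0.45}\n'
    '{"Title 1": "Show B", "Title 2": "Show C", "Distance": 0.30}\n'
)

def closest_shows(lines, title, k=2):
    """Return up to k (other_title, distance) tuples, nearest first."""
    nearest = {}  # other title -> distance; a dict drops mirrored duplicates
    for line in lines:
        record = json.loads(line)
        if record['Title 1'] == title:
            nearest[record['Title 2']] = record['Distance']
        elif record['Title 2'] == title:
            nearest[record['Title 1']] = record['Distance']
    return sorted(nearest.items(), key=lambda item: item[1])[:k]

print(closest_shows(sample, 'Show C'))  # [('Show B', 0.3), ('Show A', 0.45)]
```

>The helper checks both title fields, so it gives the same result whether each pair is stored once or in both directions.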



