# Anime Similarity

>After training our LDA model, we obtained a genre breakdown (a topic distribution) for each anime.
>
>Using these breakdowns, we can measure how similar two anime shows are.
>
>One way of accomplishing this is by using the [*Hellinger distance*](https://radimrehurek.com/gensim_3.8.3/auto_examples/tutorials/run_distance_metrics.html#hellinger).
>
>Two shows with a **shorter distance** between them can be regarded as **more similar**, and vice versa.
>
>The code below computes the distance for every pair of shows.
>
>For each show, its 5 closest shows are written to a single file in the JSON Lines format.
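
To make the metric concrete, here is a minimal pure-Python sketch of the Hellinger distance for two sparse topic distributions. The helper name `hellinger_sparse` and the three toy distributions are illustrative assumptions, not part of the notebook; the notebook itself relies on `gensim.matutils.hellinger`.

```python
from math import sqrt


def hellinger_sparse(p, q):
    """Hellinger distance between two sparse distributions.

    Each argument is a list of (topic_id, probability) pairs; topics that
    are missing from a list are treated as having probability 0.
    Formula: sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i)) ** 2)
    """
    p, q = dict(p), dict(q)
    topics = set(p) | set(q)
    total = sum((sqrt(p.get(t, 0.0)) - sqrt(q.get(t, 0.0))) ** 2 for t in topics)
    return sqrt(0.5 * total)


# Toy distributions (made up for illustration)
show_a = [(0, 0.7), (1, 0.3)]
show_b = [(0, 0.6), (1, 0.4)]
show_c = [(2, 1.0)]

d_aa = hellinger_sparse(show_a, show_a)  # identical distributions
d_ab = hellinger_sparse(show_a, show_b)  # similar topic mix
d_ac = hellinger_sparse(show_a, show_c)  # no topics in common

print(d_aa, d_ab, d_ac)
```

The distance is 0 for identical distributions and reaches its maximum of 1 when the two distributions share no topics, so `show_b` ranks as closer to `show_a` than `show_c` does.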

## Load LDA Model

In [1]:
from gensim.models import LdaModel

lda_model = LdaModel.load('lda_model/lda_model')

## Read Text Input

In [2]:
from lda_helpers import read_lda_input  # Package with helpers

# Read each show's text together with its title; the titles serve as keys below
title_texts = read_lda_input('lda_input/lda_input.jl', title=True)
title2probs = {title: lda_model[lda_model.id2word.doc2bow(text)] for title, text in title_texts}
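
Each value in `title2probs` is a sparse list of `(topic_id, probability)` pairs, and topics whose probability falls below the model's minimum-probability threshold are left out. The sketch below shows how such a list maps onto a dense vector; the helper `to_dense`, the sample values, and the topic count of 6 are assumptions made up for illustration.

```python
def to_dense(sparse_probs, num_topics):
    """Expand (topic_id, probability) pairs into a dense list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_probs:
        dense[topic_id] = prob
    return dense


# Hypothetical output of lda_model[bow] for one show
example = [(1, 0.62), (4, 0.35)]
dense = to_dense(example, num_topics=6)
print(dense)  # [0.0, 0.62, 0.0, 0.0, 0.35, 0.0]
```

Because low-probability topics are dropped, the listed probabilities can sum to slightly less than 1.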

## Compute Hellinger Distances

In [3]:
from os import makedirs

# Directory for the output file; exist_ok avoids an error when the cell is re-run
makedirs('lda_distance', exist_ok=True)

In [4]:
from collections import defaultdict
from gensim.matutils import hellinger
import json
from math import comb
from tqdm.notebook import tqdm

# Progress bars
N = len(title2probs)
bar1 = tqdm(title2probs.items(), desc='Anime Shows', bar_format='{l_bar}{bar}{n_fmt}/{total_fmt}{postfix}')
bar2 = tqdm(total=comb(N,2),     desc='Anime Pairs', bar_format='{l_bar}{bar}{n_fmt}/{total_fmt} [Elapsed: {elapsed}, Remaining: {remaining}]')
bar3 = tqdm(total=N*5,           desc='Closest 5',   bar_format='{l_bar}{bar}{n_fmt}/{total_fmt}')

# Keep track of pairs' distances
computed = set()  # Titles whose distances to all other titles are already stored
full_dist_map = defaultdict(dict)  # title -> {other title: distance}

with open('lda_distance/lda_distance.jl', 'w') as f:
    for title1, probs1 in bar1:
        bar1.set_postfix_str(f'(Working on "{title1}")')
        
        # Compute distances between title1 and all others
        for title2, probs2 in title2probs.items():
            if title1 == title2 or title2 in computed:
                continue  # No need to compute this
            else:
                dist = hellinger(probs1, probs2)
                full_dist_map[title1][title2] = dist
                full_dist_map[title2][title1] = dist
                bar2.update()
        
        # Write closest 5 to file
        closest_5 = sorted(full_dist_map[title1].items(), key=lambda x: x[1])[:5]
        for i, (title2, dist) in enumerate(closest_5):
            record = {
                'Title 1': title1,
                'Title 2': title2,
                'Similarity Rank': i+1,
                'Distance': dist
            }
            line = json.dumps(record)
            f.write('{}\n'.format(line))
            bar3.update()
        
        computed.add(title1)
        bar1.set_postfix_str('')
    bar2.close()
    bar3.close()

For the run recorded in this notebook, the three progress bars had totals of 4757 anime shows, 11312146 anime pairs, and 23785 closest-5 records.