Finding Similar Anime By Genre

We build an anime finder that matches shows by genre, using a simple binary genre feature and the Jaccard similarity score.
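
As a reminder of the measure itself: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. Below is a minimal sketch on genre lists. The `jaccard` helper and the three sample shows are made up for illustration and are not part of the notebook; treating two empty sets as identical is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical genre lists, for illustration only
show_a = ["Drama", "Romance", "School", "Supernatural"]
show_b = ["Drama", "Romance", "School"]
show_c = ["Action", "Mecha", "Sci-Fi"]

print(jaccard(show_a, show_b))  # 3 shared genres out of 4 distinct -> 0.75
print(jaccard(show_a, show_c))  # no shared genres -> 0.0
```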

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import itertools
import collections
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import jaccard_similarity_score # Jaccard similarity; deprecated in scikit-learn 0.21 and removed in 0.23 (replaced by jaccard_score)

Preprocessing

During preprocessing we split each show's genre string into a list of individual genres, so that each genre can later become a feature.

In [2]:
animes = pd.read_csv("anime.csv") 

animes['genre'] = animes['genre'].fillna('None') # fill missing genres with the placeholder 'None'
animes['genre'] = animes['genre'].apply(lambda x: x.split(', ')) # split each genre string into a list of genres

genre_data = itertools.chain(*animes['genre'].values.tolist()) # flatten the per-show lists into one stream of genres
genre_counter = collections.Counter(genre_data)
genres = pd.DataFrame.from_dict(genre_counter, orient='index').reset_index().rename(columns={'index':'genre', 0:'count'})
genres.sort_values('count', ascending=False, inplace=True)

print(genres)

            genre  count
10         Comedy   4645
4          Action   2845
5       Adventure   2348
6         Fantasy   2309
14         Sci-Fi   2070
0           Drama   2016
9         Shounen   1712
38           Kids   1609
1         Romance   1464
2          School   1220
19  Slice of Life   1220
41         Hentai   1141
3    Supernatural   1037
20          Mecha    944
21          Music    860
11     Historical    806
7           Magic    778
31          Ecchi    637
26         Shoujo    603
23         Seinen    547
16         Sports    543
22        Mystery    495
17    Super Power    465
8        Military    426
12         Parody    408
18          Space    381
27         Horror    369
36          Harem    317
30         Demons    294
24   Martial Arts    265
35       Dementia    240
29  Psychological    229
28         Police    197
34           Game    181
13        Samurai    148
25        Vampire    102
15       Thriller     87
37           Cars     72
33     Shounen Ai     65
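
To make the preprocessing steps concrete, here is the same fill, split, flatten, and count sequence on a toy list. The `raw_genres` values are hypothetical. One thing the sketch makes visible is that the `'None'` placeholder ends up being counted as a genre of its own.

```python
import collections
import itertools

# Hypothetical raw 'genre' values; None stands in for a missing entry
raw_genres = ["Action, Comedy", "Comedy, School", None, "Action"]

# Same steps as the cell above: fill missing values, split, flatten, count
filled = [g if g is not None else "None" for g in raw_genres]
split = [g.split(", ") for g in filled]
counts = collections.Counter(itertools.chain.from_iterable(split))

# Print genres from most to least frequent
for genre, count in counts.most_common():
    print(genre, count)
```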


Feature Extraction
Feature extraction is simple: each show gets a binary vector with one position per genre.

A 1 at a position means the show has the genre whose id (index in genre_map) is that position; a 0 means it does not.

In [3]:
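# Map each genre name to a fixed position (id) in the feature vector
genre_map = {genre: idx for idx, genre in enumerate(genre_counter.keys())}
def extract_feature(genre):
    # Binary vector: 1 where the show has the genre, 0 elsewhere
    feature = np.zeros(len(genre_map), dtype=int)
    feature[[genre_map[g] for g in genre]] = 1
    return feature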
    
anime_feature = pd.concat([animes['name'], animes['genre']], axis=1)
anime_feature['genre'] = anime_feature['genre'].apply(lambda x: extract_feature(x))
print(anime_feature.head(30))

                                                 name  \
0                                      Kimi no Na wa.   
1                    Fullmetal Alchemist: Brotherhood   
2                                            Gintama°   
3                                         Steins;Gate   
4                                            Gintama'   
5   Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...   
6                              Hunter x Hunter (2011)   
7                                Ginga Eiyuu Densetsu   
8   Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...   
9                                 Gintama': Enchousen   
10                               Clannad: After Story   
11                                     Koe no Katachi   
12                                            Gintama   
13                 Code Geass: Hangyaku no Lelouch R2   
14                            Haikyuu!! Second Season   
15                      Sen to Chihiro no Kamikakushi   
16                            S
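
The encoding can be checked by hand on a small vocabulary. The sketch below is a plain-Python illustration of the same idea, not the notebook's code: `vocab`, `encode`, and `decode` are hypothetical names, and the five-genre vocabulary is made up.

```python
# Hypothetical vocabulary; the notebook builds its own from genre_counter
vocab = ["Action", "Comedy", "Drama", "Romance", "School"]
index = {genre: i for i, genre in enumerate(vocab)}

def encode(genres):
    """Return a 0/1 list with a 1 at the position of each genre."""
    vec = [0] * len(index)
    for g in genres:
        vec[index[g]] = 1
    return vec

def decode(vec):
    """Recover the genre names from a 0/1 list."""
    return [g for g, bit in zip(vocab, vec) if bit]

v = encode(["Romance", "Drama", "School"])
print(v)          # [0, 0, 1, 1, 1]
print(decode(v))  # ['Drama', 'Romance', 'School']
```

The order of the input genres does not matter; only the vocabulary order fixes the positions.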

Testing

In [4]:
test_data = anime_feature.take([0, 19, 1, 2, 16, 23, 6, 49, 220, 66])
for idx, anime in test_data.iterrows():
    print('Similar anime like {}:'.format(anime['name']))
    search = anime_feature.drop([idx]) # drop the query anime itself
    search['result'] = search['genre'].apply(lambda x: jaccard_similarity_score(anime['genre'], x))
    search_result = search.sort_values('result', ascending=False)['name'].head(10)
    for res in search_result.values:
        print('\t{}'.format(res))
    print()
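
A note on the score used in the cell above. As far as I understand, when `jaccard_similarity_score` is given two 1-D binary vectors it returns the fraction of positions where they agree, shared zeros included, rather than intersection over union; this is one reason scikit-learn replaced it with `jaccard_score`. The sketch below contrasts the two scores in plain Python. All names (`matching_score`, `jaccard_binary`, `top_k`) and the toy vectors are hypothetical, and returning 1.0 for two all-zero vectors is an assumption.

```python
def matching_score(u, v):
    """Fraction of positions where two 0/1 vectors agree."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard_binary(u, v):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 1.0  # assumption for all-zero input

def top_k(query, catalog, score, k=3):
    """Names of the k catalog entries most similar to query."""
    ranked = sorted(catalog, key=lambda name: score(query, catalog[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical 8-genre vectors
query = [1, 1, 0, 0, 0, 0, 0, 0]
catalog = {
    "A": [1, 1, 1, 0, 0, 0, 0, 0],  # shares both genres, has one extra
    "B": [1, 0, 0, 0, 0, 0, 0, 0],  # shares one genre
    "C": [0, 0, 0, 0, 0, 0, 0, 1],  # shares nothing
}

for name, vec in catalog.items():
    print(name, matching_score(query, vec), round(jaccard_binary(query, vec), 3))
# A 0.875 0.667
# B 0.875 0.5
# C 0.625 0.0

print(top_k(query, catalog, jaccard_binary, k=2))  # ['A', 'B']
```

With many genres and few ones per show, the matching score is dominated by shared zeros: show C scores 0.625 without sharing a single genre, and A and B tie. The set-based Jaccard separates them.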

Similar anime like Kimi no Na wa.:
	Wind: A Breath of Heart (TV)
	Wind: A Breath of Heart OVA
	To Heart 2 Special
	Koi to Senkyo to Chocolate Special
	Koi to Senkyo to Chocolate
	Touka Gettan
	Mizuiro (2003)
	Myself; Yourself
	Air Movie
	Kimikiss Pure Rouge

Similar anime like Code Geass: Hangyaku no Lelouch:
	Code Geass: Hangyaku no Lelouch Special Edition Black Rebellion
	Code Geass: Hangyaku no Lelouch Recaps
	Code Geass: Hangyaku no Lelouch R2 Special Edition Zero Requiem
	Muv-Luv Alternative: Total Eclipse Recap - Climax Chokuzen Special
	Yuusha-Ou GaoGaiGar Final Grand Glorious Gathering
	Kiddy Grade: Ignition
	Code Geass: Boukoku no Akito 1 - Yokuryuu wa Maiorita
	Appleseed Saga Ex Machina
	Firestorm
	Code Geass: Fukkatsu no Lelouch

Similar anime like Fullmetal Alchemist: Brotherhood:
	Fullmetal Alchemist
	Fullmetal Alchemist: The Sacred Star of Milos
	Fullmetal Alchemist: Brotherhood Specials
	Magi: Sinbad no Bouken
	Magi: The Kingdom of Magic
	Dragon Quest: Dai no Daibouken B