Using KNN to recommend anime

In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import seaborn as sns
%matplotlib inline

In [37]:
anime = pd.read_csv("anime.csv")

In [38]:
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


# Data Preprocessing : 

Many animes have unknown number of episodes even if they have similar rating. On top of that many super popular animes such as Naruto Shippuden, Attack on Titan Season 2 were ongoing when the data was collected, thus their number of episodes was considered as "Unknown". For some of my favorite animes I've filled in the episode numbers manually. For the other anime's, I had to make some educated guesses. Changes I've made are :

1. Animes that are grouped under Hentai Categories generally have 1 episode in my experience. So I've filled the unknown values with 1.

2. Animes that are grouped are "OVA" stands for "Original Video Animation". These are generally one/two episode long animes(often the popular ones have 2/3 episodes though), but I've decided to fill the unknown numbers of episodes with 1 again.

3. Animes that are grouped under "Movies" are considered as '1' episode as per the dataset overview goes.

4. For all the other animes with unknown number of episodes, I've filled the na values with the median which is 2.  

In [39]:
anime.loc[(anime["genre"]=="Hentai") & (anime["episodes"]=="Unknown"),"episodes"] = "1"
anime.loc[(anime["type"]=="OVA") & (anime["episodes"]=="Unknown"),"episodes"] = "1"

anime.loc[(anime["type"] == "Movie") & (anime["episodes"] == "Unknown")] = "1"

In [40]:
known_animes = {"Naruto Shippuuden":500, "One Piece":784,"Detective Conan":854, "Dragon Ball Super":86,
                "Crayon Shin chan":942, "Yu Gi Oh Arc V":148,"Shingeki no Kyojin Season 2":25,
                "Boku no Hero Academia 2nd Season":25,"Little Witch Academia TV":25}

In [41]:
for k,v in known_animes.items():    
    anime.loc[anime["name"]==k,"episodes"] = v

In [42]:
anime["episodes"] = anime["episodes"].map(lambda x:np.nan if x=="Unknown" else x)

In [43]:
anime["episodes"].fillna(anime["episodes"].median(),inplace = True)

### Rating 

Many animes have unknown ratings. These were filled with the median of the ratings.

In [44]:
anime["rating"] = anime["rating"].astype(float)

In [45]:
anime["rating"].fillna(anime["rating"].median(),inplace = True)

### Type 

Type category differentiates between movies, music, TV shows(regular anime episodes), OVA/ONA etc. These are categorical variables so I used ```pd.get_dummies``` to convert them to dummy variables.

In [46]:
pd.get_dummies(anime[["type"]]).head()

Unnamed: 0,type_1,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV
0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,1
2,0,0,0,0,0,0,1
3,0,0,0,0,0,0,1
4,0,0,0,0,0,0,1


### Members

Just converted the strings to float.

In [47]:
anime["members"] = anime["members"].astype(float)

In [48]:
# anime_features = pd.concat([anime["genre"].str.get_dummies(sep=","),pd.get_dummies(anime[["type"]]),anime[["rating","episodes"]]],axis=1)

# Feature Selection and Preprocessing


Episode numbers, members and rating are different from categorical variables and very different in values. Rating ranges from 0-10 in the dataset while the episode number can be even 800+ episodes long when it comes to long running popular animes such as One Piece, Naruto etc. So I ended up using ```sklearn.preprocessing.MaxAbsScaler``` as it preserves the sparsity while scaing the values from 0-1.

In [49]:
anime_features = pd.concat([anime["genre"].str.get_dummies(sep=","),pd.get_dummies(anime[["type"]]),anime[["rating"]],anime[["members"]],anime["episodes"]],axis=1)

In [50]:
anime["name"] = anime["name"].map(lambda name:re.sub('[^A-Za-z0-9]+', " ", name))

In [51]:
anime_features.head()

Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,...,type_1,type_Movie,type_Music,type_ONA,type_OVA,type_Special,type_TV,rating,members,episodes
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,9.37,200630.0,1
1,1,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,1,9.26,793665.0,64
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,9.25,114262.0,51
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,9.17,673572.0,24
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,9.16,151266.0,51


In [52]:
anime_features.columns

Index([' Adventure', ' Cars', ' Comedy', ' Dementia', ' Demons', ' Drama',
       ' Ecchi', ' Fantasy', ' Game', ' Harem', ' Hentai', ' Historical',
       ' Horror', ' Josei', ' Kids', ' Magic', ' Martial Arts', ' Mecha',
       ' Military', ' Music', ' Mystery', ' Parody', ' Police',
       ' Psychological', ' Romance', ' Samurai', ' School', ' Sci-Fi',
       ' Seinen', ' Shoujo', ' Shoujo Ai', ' Shounen', ' Shounen Ai',
       ' Slice of Life', ' Space', ' Sports', ' Super Power', ' Supernatural',
       ' Thriller', ' Vampire', ' Yaoi', ' Yuri', '1', 'Action', 'Adventure',
       'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi', 'Fantasy',
       'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei', 'Kids',
       'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music', 'Mystery',
       'Parody', 'Police', 'Psychological', 'Romance', 'Samurai', 'School',
       'Sci-Fi', 'Seinen', 'Shoujo', 'Shounen', 'Slice of Life', 'Space',
       'Sports', 'Super Power', 'Supernat

In [53]:
from sklearn.preprocessing import MaxAbsScaler

In [54]:
max_abs_scaler = MaxAbsScaler()
anime_features = max_abs_scaler.fit_transform(anime_features)

# KNN for finding similar animes

In [55]:
from sklearn.neighbors import NearestNeighbors

In [56]:
nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree').fit(anime_features)

In [57]:
distances, indices = nbrs.kneighbors(anime_features)

# Query examples and helper functions 

In [58]:
def get_index_from_name(name):
    return anime[anime["name"]==name].index.tolist()[0]
    

In [59]:
get_index_from_name("Naruto")

841

In [60]:
distances[841]

array([0.        , 0.1906634 , 1.08585653, 1.4177656 , 1.44782088,
       1.47398665])

In [61]:
indices[841]

array([841, 615, 175, 582, 206, 178], dtype=int64)

Many anime names have not been documented properly and in many cases the names are in Japanese instead of English and the spelling is often different. For that reason I've also created another helper function ```get_id_from_partial_name``` to find out ids of the animes from part of names.

In [62]:
all_anime_names = list(anime.name.values)

In [63]:
def get_id_from_partial_name(partial):
    for name in all_anime_names:
        if partial in name:
            print(name,all_anime_names.index(name))

In [64]:
""" print_similar_query can search for similar animes both by id and by name. """

def print_similar_animes(query=None,id=None):
    if id:
        for id in indices[id][1:]:
            print(anime.ic[id]["name"])
    if query:
        found_id = get_index_from_name(query)
        for id in indices[found_id][1:]:
            print(anime.ic[id]["name"])

# Query Examples 

<b> Naruto </b>
    
Naruto is a shounen manga and Naruto Shippuden, the 1st neighbor is the closest to it as its the second season. Even if Naruto has 220 episodes and Naruto Shippuden has 500 episodes, it appears that the model was able to find out the second season. I've watchd Katekyo Hitman Reborn too and its very similar to Naruto(I'm a big fan of both) in the story structure. Bleach, Dragon Ball Z are also long standing animes with under the same action,comedy, Shounen genre.

![](http://s1.picswalls.com/wallpapers/2015/09/27/naruto-wallpaper_104044786_274.jpg)

In [65]:
print_similar_animes(query="Naruto")

AttributeError: 'DataFrame' object has no attribute 'ic'

<b> Noragami(Stray God).</b>

Noragami is a supernatural manga featuring Japanese Shinto Gods. Noragami Aragoto is the 2nd season. I've not seen the other ones, but they seem to be very highly rated too.

![](http://thisisanothercastle.files.wordpress.com/2014/04/noragami-title2.jpg)

In [None]:
print_similar_animes("Noragami")

<b>Mushishi, Gintama, Fairy Tail</b> 

I just checked Mushishi and Gintama because my sister asked me to, but looks like we are getting pretty good results. Fairy Tail is about a magical guild, the 1st neighbor is Fairy Tail 2014, when the anime was re-made. The other neighbors seem to be magical anime's too.

In [None]:
print_similar_animes("Mushishi")

In [None]:
print_similar_animes("Gintama")

In [None]:
print_similar_animes("Fairy Tail")