# Introduction

Among those in their teens and twenties, watching anime has become a popular pastime.  When an anime series ends, however, it can be difficult to determine which anime to watch next.

__Given a list of users, what is the most recommended show?__

## Loading the data

There are two files associated with this dataset.  The first file is anime.csv and it holds the data that we can use to create our model.  The second file is rating.csv and it'll be used for predicting with our model.

We'll first load anime.csv and designate anime_id as the index.  We'll also sort by index to make later lookups easier.

In [1]:
import numpy as np
import pandas as pd

d = pd.read_csv("anime.csv")

# Use anime_id as the index but keep it as a regular column too.
d = d.set_index('anime_id', drop=False)

d.sort_index(inplace=True)

# Convert the comma-separated genre string to a list.
# Splitting on ', ' avoids leading spaces such as ' Adventure'.
d['genre'] = d['genre'].str.split(', ')

d.head()

| anime_id (index) | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 1 | 1 | Cowboy Bebop | [Action, Adventure, Comedy, Drama, Sci-Fi,... | TV | 26 | 8.82 | 486824 |
| 5 | 5 | Cowboy Bebop: Tengoku no Tobira | [Action, Drama, Mystery, Sci-Fi, Space] | Movie | 1 | 8.4 | 137636 |
| 6 | 6 | Trigun | [Action, Comedy, Sci-Fi] | TV | 26 | 8.32 | 283069 |
| 7 | 7 | Witch Hunter Robin | [Action, Drama, Magic, Mystery, Police, S... | TV | 26 | 7.36 | 64905 |
| 8 | 8 | Beet the Vandel Buster | [Adventure, Fantasy, Shounen, Supernatural] | TV | 52 | 7.06 | 9848 |


Next we collect every genre that appears in the data and add one 0/1 indicator column per genre to the DataFrame.  We then remove the original genre column, since looking up a particular genre inside a column of lists would be inefficient.

In [2]:
# Map each genre to the names of the shows that carry it.
listByGenre = {}
for index, anime in d.iterrows():
    genreList = anime['genre']
    # Rows with a missing genre hold NaN (a float) instead of a list.
    if not isinstance(genreList, list):
        continue
    for genre in genreList:
        listByGenre.setdefault(genre, []).append(anime['name'])

# Add one 0/1 indicator column per genre.
for genre, names in listByGenre.items():
    d[genre] = d['name'].isin(names).astype(int)

del d['genre']
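As an aside, the same kind of indicator columns can be built without explicit loops. The following is a minimal sketch on a made-up three-row DataFrame (the show names and the `toy` variable are illustrative, not part of the dataset), assuming the genre column is still the raw comma-separated string rather than a list:

```python
import pandas as pd

# Hypothetical stand-in for anime.csv; only the columns needed here.
toy = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C"],
    "genre": ["Action, Comedy", "Drama", None],
})

# One 0/1 column per genre; rows with a missing genre get all zeros.
indicators = toy["genre"].str.get_dummies(sep=", ")

# Replace the string column with the indicator columns.
toy = toy.drop(columns="genre").join(indicators)
print(toy)
```

This variant does not produce the `listByGenre` dictionary that the later cells rely on, so it is an alternative for the column-building step only.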

## Determining membership from ratings

Next, we want to see how membership relates to rating.  Is it the case that the higher a show's rating, the larger its member base?

In [3]:
temp = d[['name','rating','members']]
temp = temp.sort_values(['rating'], ascending=False)
temp[:10]

| anime_id (index) | name | rating | members |
|---|---|---|---|
| 33662 | Taka no Tsume 8: Yoshida-kun no X-Files | 10.0 | 13 |
| 30120 | Spoon-hime no Swing Kitchen | 9.6 | 47 |
| 23005 | Mogura no Motoro | 9.5 | 62 |
| 32281 | Kimi no Na wa. | 9.37 | 200630 |
| 33607 | Kahei no Umi | 9.33 | 44 |
| 5114 | Fullmetal Alchemist: Brotherhood | 9.26 | 793665 |
| 28977 | Gintama° | 9.25 | 114262 |
| 26313 | Yakusoku: Africa Mizu to Midori | 9.25 | 53 |
| 9253 | Steins;Gate | 9.17 | 673572 |
| 9969 | Gintama' | 9.16 | 151266 |


In general, the assumption does hold.  However, several of the highest-rated shows have a very small member base.  It's possible that these anime are simply not well known among most anime watchers.
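One way to put a number on the rating/membership relationship is a rank correlation. The sketch below copies five rows from the table above purely to show the mechanics; because they come from the extreme top of the ranking, the value it yields says nothing about the dataset as a whole. On the real data the same two lines would be applied to `d['rating']` and `d['members']`.

```python
import pandas as pd

# Five (rating, members) pairs copied from the top-10 table above.
toy = pd.DataFrame({
    "rating": [10.0, 9.6, 9.37, 9.26, 9.17],
    "members": [13, 47, 200630, 793665, 673572],
})

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rho = toy["rating"].rank().corr(toy["members"].rank())
print(rho)
```

A value near +1 would mean higher-rated shows tend to have more members, and a value near -1 the opposite.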

Thus, when we recommend anime, we'll have to give less weight to shows that are highly rated but have a small member base.
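This notebook handles that with a membership threshold. A different, commonly used option is to shrink each rating toward the overall mean in proportion to how few members a show has. The sketch below illustrates that alternative only; the function name `weighted_rating` and the `min_members=1000` constant are assumptions chosen for the example, and the rows are copied from the tables above.

```python
import pandas as pd

def weighted_rating(df, min_members):
    """Shrink each rating toward the overall mean; fewer members means more shrinkage."""
    overall_mean = df["rating"].mean()
    weight = df["members"] / (df["members"] + min_members)
    return weight * df["rating"] + (1 - weight) * overall_mean

# Rows copied from the tables above.
toy = pd.DataFrame({
    "name": ["Taka no Tsume 8: Yoshida-kun no X-Files",
             "Fullmetal Alchemist: Brotherhood",
             "Steins;Gate",
             "Trigun",
             "Beet the Vandel Buster"],
    "rating": [10.0, 9.26, 9.17, 8.32, 7.06],
    "members": [13, 793665, 673572, 283069, 9848],
})

# min_members is a hypothetical tuning constant, not a value from the dataset.
toy["weighted"] = weighted_rating(toy, min_members=1000)
ranked = toy.sort_values("weighted", ascending=False)
print(ranked[["name", "rating", "members", "weighted"]])
```

With these rows, the show with a perfect 10.0 but only 13 members no longer ranks first, while shows with hundreds of thousands of members keep essentially their original rating.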

We will now determine which show is the most recommended.  We will use two simple criteria:

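* For each genre a user has rated, a recommended show must belong to that genre and have a rating within one standard deviation of the mean rating of the shows that user rated in the genre (that is, between mean - std and mean + std).
* The show must have a member base greater than the average member base of the shows that pass the first criterion.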
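The two criteria can be sketched for a single genre as follows. The helper name `recommend_for_genre`, the toy rows, and the `mean=8.0, std=0.5` values are made up for illustration; in the notebook the mean and standard deviation come from the shows a particular user has rated.

```python
import pandas as pd

def recommend_for_genre(shows, genre, mean, std):
    """Apply both criteria to one genre and return the matching show names."""
    # Criterion 1: in the genre and rated within one std of the mean.
    in_band = shows[(shows[genre] == 1) &
                    (shows["rating"] >= mean - std) &
                    (shows["rating"] <= mean + std)]
    # Criterion 2: more members than the average of the shows that passed.
    popular = in_band[in_band["members"] > in_band["members"].mean()]
    return popular["name"].tolist()

# Hypothetical data: three Action shows and one non-Action show.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating": [8.0, 7.5, 9.5, 7.0],
    "members": [1000, 50, 5000, 300],
    "Action": [1, 1, 1, 0],
})

picks = recommend_for_genre(toy, "Action", mean=8.0, std=0.5)
print(picks)
```

Here shows A and B fall inside the rating band, their average member count is 525, and only A exceeds it.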

In [None]:
import operator
dr = pd.read_csv("rating.csv")

recShows = {show: 0 for show in d['name'].unique()}
for i in dr['user_id'].unique():
    # Keep only shows the user actually scored (rating.csv uses -1 for unrated entries).
    userData = dr[(dr['user_id'] == i) & (dr['rating'] != -1)]
    # .ix was removed from pandas; use .loc and skip ids missing from anime.csv.
    animeData = d.loc[d.index.intersection(userData['anime_id'])]
    for genre in listByGenre:
        genreRatings = animeData.loc[animeData[genre] == 1, 'rating']
        # Skip genres this user has not rated.
        if len(genreRatings) == 0:
            continue
        mean, std = genreRatings.mean(), genreRatings.std()
        # Criterion 1: same genre, rating within one standard deviation of the mean.
        tmp = d[(d['rating'] <= mean + std) &
                (d['rating'] >= mean - std) &
                (d[genre] == 1)]
        # Criterion 2: member base above the average of those shows.
        theMean = tmp['members'].mean()
        tmp = tmp[tmp['members'] > theMean]['name']
        for name in tmp:
            recShows[name] += 1

# Show with the highest recommendation count.
print(max(recShows.items(), key=operator.itemgetter(1))[0])

Note: The current method is inefficient, but it's meant to demonstrate that building an efficient recommendation system is nontrivial.
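As one possible direction for speeding this up, the per-user, per-genre mean and standard deviation can be computed for all users at once with a merge and a groupby instead of a Python loop. The sketch below covers only that statistics step, on made-up stand-ins for the two files; the frame names `anime` and `ratings`, the genre list, and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for anime.csv (with indicator columns) and rating.csv.
anime = pd.DataFrame({
    "anime_id": [1, 2, 3, 4],
    "rating": [8.0, 7.0, 9.0, 6.0],
    "Action": [1, 1, 0, 1],
    "Drama": [0, 1, 1, 0],
})
ratings = pd.DataFrame({
    "user_id": [10, 10, 10, 20, 20],
    "anime_id": [1, 2, 4, 2, 3],
    "rating": [9, -1, 7, 8, 10],
})
genres = ["Action", "Drama"]

# Drop unrated entries, as the loop above does.
rated = ratings[ratings["rating"] != -1]

# Long format: one row per (show, genre) pair the show belongs to.
long = anime.melt(id_vars=["anime_id", "rating"], value_vars=genres,
                  var_name="genre", value_name="flag")
long = long[long["flag"] == 1].drop(columns="flag")

# Attach each user's rated shows, then aggregate the shows' ratings per user and genre.
merged = rated[["user_id", "anime_id"]].merge(long, on="anime_id")
stats = merged.groupby(["user_id", "genre"])["rating"].agg(["mean", "std"])
print(stats)
```

A user with a single rated show in a genre gets a standard deviation of NaN, which makes the rating band empty for that genre, matching the behaviour of the loop.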