# Recommendation System

In this notebook, I explore three kinds of recommendation systems: popularity-based, content-based, and collaborative filtering. The goal is that whenever a customer enters the name of an anime, the system suggests the other titles they are most likely to want to watch.

## Ingest

In [1]:
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
%matplotlib inline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import surprise
from surprise import Dataset
from surprise import Reader
from surprise import SVDpp
from surprise import model_selection
from surprise.model_selection import cross_validate, GridSearchCV

In [5]:
df_rating = pd.read_csv('./input/rating.csv')
df_anime = pd.read_csv('./input/anime.csv')
print("Rating's shape: {}".format(df_rating.shape))
print("Anime's shape: {}".format(df_anime.shape))

Rating's shape: (7813737, 3)
Anime's shape: (12294, 7)


In [7]:
print(df_rating.head())
print(df_rating.rating.unique())

   user_id  anime_id  rating
0        1        20      -1
1        1        24      -1
2        1        79      -1
3        1       226      -1
4        1       241      -1
[-1 10  8  6  9  7  3  5  4  1  2]


A rating of -1 means the user watched the anime but did not rate it.

In [8]:
df_anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


## The Simplest Recommendation System
### Popularity
The most direct recommendation is simply to surface the most popular anime, since most people will probably like them. However, ranking by average rating alone is not a good approach, because it ignores how many people actually rated each title. Therefore, I score every title with a weighted rating instead.

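$$\text{Weighted Rating (WR)} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C$$

where

- $v$ is the number of votes for the title;
- $m$ is the minimum number of votes required to be listed (set below to the 90th percentile of vote counts);
- $R$ is the average rating of the title;
- $C$ is the average rating across all titles.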
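To make the formula concrete, here is a minimal sketch as a plain Python function. The function name `weighted_rating` and the numbers passed to it are illustrative assumptions, not values taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a title's own mean rating R with the global mean C.

    v: number of votes for the title
    m: minimum number of votes required to be listed
    """
    return v * R / (v + m) + m * C / (v + m)


# Illustrative numbers only: same average rating, very different vote counts.
few_votes = weighted_rating(v=50, R=9.0, m=1000, C=6.5)
many_votes = weighted_rating(v=50000, R=9.0, m=1000, C=6.5)
print(round(few_votes, 3), round(many_votes, 3))
```

With few votes the score stays close to C; with many votes it approaches the title's own average R. This is why a highly rated but rarely voted title does not automatically top the ranking.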

### Data Processing

In [9]:
# Remove null values (-1)
df_rating = df_rating[df_rating.rating != -1]
df_rating.shape

(6337241, 3)

In [12]:
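# Attach one row per user rating to each anime.
# 'rating' here is the anime's average rating from anime.csv, not the individual user rating.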
df_anime_rec1 = df_anime.drop(['genre', 'type', 'episodes', 'members'], axis=1)
df_rating_rec1 = df_rating.drop(['rating', 'user_id'], axis=1)
df_anime_rating_total = df_anime_rec1.merge(df_rating_rec1, how='inner', on='anime_id')
df_anime_rating_total.head()

Unnamed: 0,anime_id,name,rating
0,32281,Kimi no Na wa.,9.37
1,32281,Kimi no Na wa.,9.37
2,32281,Kimi no Na wa.,9.37
3,32281,Kimi no Na wa.,9.37
4,32281,Kimi no Na wa.,9.37


In [13]:
# Count the number of user ratings per anime
df_anime_rating_total = df_anime_rating_total.dropna()
df_anime_rating_number = df_anime_rating_total.groupby(['anime_id'], as_index=False)['rating'].count()
df_anime_rating_number = df_anime_rating_number.rename(columns={'rating': 'rating number'})
df_anime_rating_number.head()

Unnamed: 0,anime_id,rating number
0,1,13449
1,5,5790
2,6,9385
3,7,2169
4,8,308


In [14]:
# Merging rating number with total
df_anime_rating_total = df_anime_rating_total.merge(df_anime_rating_number, on='anime_id', how='inner')
df_anime_rating_total.head()

Unnamed: 0,anime_id,name,rating,rating number
0,32281,Kimi no Na wa.,9.37,1961
1,32281,Kimi no Na wa.,9.37,1961
2,32281,Kimi no Na wa.,9.37,1961
3,32281,Kimi no Na wa.,9.37,1961
4,32281,Kimi no Na wa.,9.37,1961


In [15]:
# Keep only one row per anime
df_anime_rating = df_anime_rating_total.drop_duplicates(keep='first')
df_anime_rating = df_anime_rating.reset_index()
df_anime_rating = df_anime_rating.drop(['index'], axis=1)
df_anime_rating.head()

Unnamed: 0,anime_id,name,rating,rating number
0,32281,Kimi no Na wa.,9.37,1961
1,5114,Fullmetal Alchemist: Brotherhood,9.26,21494
2,28977,Gintama°,9.25,1188
3,9253,Steins;Gate,9.17,17151
4,9969,Gintama&#039;,9.16,3115
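The three steps above (merge, count, de-duplicate) can also be done by counting votes per `anime_id` first and merging once, which avoids building a multi-million-row intermediate frame. A minimal sketch on toy data; the two small frames below are invented and only mimic the column names used in this notebook.

```python
import pandas as pd

# Toy stand-ins for df_anime and the filtered df_rating (values are invented).
anime = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "rating": [8.5, 7.0, 6.2],    # average rating per title
})
ratings = pd.DataFrame({
    "user_id": [10, 11, 12, 10, 11],
    "anime_id": [1, 1, 1, 2, 4],  # anime_id 4 has no match in `anime`
    "rating": [9, 8, 10, 7, 5],
})

# Count votes per title, then attach the counts with a single inner merge.
votes = (ratings.groupby("anime_id").size()
         .rename("rating number").reset_index())
anime_votes = (anime.dropna(subset=["rating"])
               .merge(votes, on="anime_id", how="inner"))
print(anime_votes)
```

Titles without any votes (here `anime_id` 3) drop out of the inner merge, just as they do in the merge-then-count version.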


In [16]:
# Compute the weighted rating
warnings.filterwarnings('ignore')
# Weighted Rating(WR) = vR/(v+m) + mC/(v+m)
C = df_anime_rating['rating'].mean()  # mean rating across all titles
m = df_anime_rating['rating number'].quantile(0.9)  # minimum votes: 90th percentile
# .copy() so the new column is added to an independent frame rather than a slice
df_anime_rating_recommend = df_anime_rating[df_anime_rating['rating number'] >= m].copy()
v = df_anime_rating_recommend['rating number']
R = df_anime_rating_recommend['rating']
df_anime_rating_recommend['scoring'] = v*R/(v+m) + m*C/(v+m)
df_anime_rating_recommend.head()

Unnamed: 0,anime_id,name,rating,rating number,scoring
0,32281,Kimi no Na wa.,9.37,1961,8.082243
1,5114,Fullmetal Alchemist: Brotherhood,9.26,21494,9.065249
3,9253,Steins;Gate,9.17,17151,8.938426
4,9969,Gintama&#039;,9.16,3115,8.255402
6,11061,Hunter x Hunter (2011),9.13,7477,8.6614


### Provide Recommendations

In [18]:
df_anime_rating_recommend = df_anime_rating_recommend.sort_values('scoring', ascending=False)
df_anime_rating_recommend.head(10)

Unnamed: 0,anime_id,name,rating,rating number,scoring
1,5114,Fullmetal Alchemist: Brotherhood,9.26,21494,9.065249
3,9253,Steins;Gate,9.17,17151,8.938426
10,4181,Clannad: After Story,9.06,15518,8.817249
13,2904,Code Geass: Hangyaku no Lelouch R2,8.98,21124,8.802825
15,199,Sen to Chihiro no Kamikakushi,8.93,19481,8.743065
19,1575,Code Geass: Hangyaku no Lelouch,8.83,24126,8.683245
6,11061,Hunter x Hunter (2011),9.13,7477,8.6614
39,1535,Death Note,8.71,34226,8.610159
29,2001,Tengen Toppa Gurren Lagann,8.78,16955,8.58133
22,1,Cowboy Bebop,8.82,13449,8.570855


In [20]:
# Show detailed information about the most popular titles
# (DataFrame.append was removed in pandas 2.0, so this cell needs pandas < 2.0)
df_anime_popular = pd.DataFrame({'anime_id':[], 'name':[], 'genre':[], 'type':[], 'episodes':[], 'rating':[], 'members':[]})

df_anime_popular_name = df_anime_rating_recommend.head(10)['name']
for name in df_anime_popular_name:
    df_anime_popular = df_anime_popular.append(df_anime[df_anime['name'] == name])
        
df_anime_popular = df_anime_popular.reset_index()
df_anime_popular = df_anime_popular.drop(['index'], axis=1)
df_anime_popular

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,5114.0,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665.0
1,9253.0,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572.0
2,4181.0,Clannad: After Story,"Drama, Fantasy, Romance, Slice of Life, Supern...",TV,24,9.06,456749.0
3,2904.0,Code Geass: Hangyaku no Lelouch R2,"Action, Drama, Mecha, Military, Sci-Fi, Super ...",TV,25,8.98,572888.0
4,199.0,Sen to Chihiro no Kamikakushi,"Adventure, Drama, Supernatural",Movie,1,8.93,466254.0
5,1575.0,Code Geass: Hangyaku no Lelouch,"Action, Mecha, Military, School, Sci-Fi, Super...",TV,25,8.83,715151.0
6,11061.0,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855.0
7,1535.0,Death Note,"Mystery, Police, Psychological, Supernatural, ...",TV,37,8.71,1013917.0
8,2001.0,Tengen Toppa Gurren Lagann,"Action, Adventure, Comedy, Mecha, Sci-Fi",TV,27,8.78,562962.0
9,1.0,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",TV,26,8.82,486824.0
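`DataFrame.append` was removed in pandas 2.0, so the loop above fails on current pandas. One possible replacement is a single left merge, which keeps the ranking order of the left frame. This is a sketch on invented toy frames; `top` and `catalog` are hypothetical stand-ins for `df_anime_rating_recommend.head(10)` and `df_anime`.

```python
import pandas as pd

catalog = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "genre": ["Action", "Drama", "Comedy"],
})
# Ranked titles, best first (the order matters).
top = pd.DataFrame({"name": ["C", "A"]})

# A left merge preserves the row order of `top` and pulls in the catalog details.
details = top.merge(catalog, on="name", how="left")
details = details[catalog.columns]  # restore the catalog's column order
print(details)
```

Because no empty float-typed frame is involved, `anime_id` keeps its integer dtype as long as every ranked title is found in the catalog.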


## Content-Based Recommendation System
This recommender is based on genres. Whenever a customer enters a title, the system returns the 10 titles most similar to it. I apply TF-IDF to the 'genre' column; a longer introduction to TF-IDF is available in my GitHub notebook (https://github.com/Zhenyu0521/Text-Analysis/blob/master/NLP%20for%20Yelp%20Reviews/NLP_for_Yelp_Reviews.ipynb).

### Data Processing

In [15]:
df_anime.isna().sum()

anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

In [16]:
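# Drop rows with any missing value, then renumber the rows 0..n-1
# so that row positions line up with the TF-IDF matrix built below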
df_anime_cbrs = df_anime.dropna(axis=0)
df_anime_cbrs = df_anime_cbrs.reset_index()
df_anime_cbrs.isna().sum()

index       0
anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

In [22]:
# Term Frequency - Inverse Document Frequency (TF-IDF)
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_anime_cbrs['genre'])
tfidf_matrix

<12017x46 sparse matrix of type '<class 'numpy.float64'>'
	with 39659 stored elements in Compressed Sparse Row format>
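Note that the default tokenizer splits on non-word characters, so multi-word genres are broken apart before the TF-IDF weights are computed. The sketch below contrasts this with a custom tokenizer that keeps each comma-separated genre whole. The three genre strings and the helper `split_genres` are illustrative assumptions, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

genres = ["Sci-Fi, Thriller", "Drama, Slice of Life", "Action, Sci-Fi"]

# Default behaviour (as used above): lowercase, split into word tokens,
# then drop English stop words such as "of".
default_vec = TfidfVectorizer(stop_words="english").fit(genres)
default_vocab = sorted(default_vec.vocabulary_)


def split_genres(text):
    """Treat every comma-separated genre as a single token."""
    return [g.strip() for g in text.split(",")]


# token_pattern=None only silences the warning that the pattern is unused.
genre_vec = TfidfVectorizer(tokenizer=split_genres, lowercase=False,
                            token_pattern=None).fit(genres)
genre_vocab = sorted(genre_vec.vocabulary_)

print(default_vocab)
print(genre_vocab)
```

Whether whole-genre tokens give better recommendations than the default word tokens is not something this notebook evaluates.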

In [23]:
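# Calculate cosine similarities. TF-IDF rows are L2-normalized by default,
# so the plain dot product from linear_kernel equals the cosine similarity.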
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim

array([[1.        , 0.14715318, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.14715318, 1.        , 0.17877808, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.17877808, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        1.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        1.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        1.        ]])
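`linear_kernel` computes plain dot products; it yields cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). A quick check on made-up genre strings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["Action, Adventure, Drama", "Action, Comedy", "Drama, Romance"]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Each TF-IDF row has unit length, so dot product == cosine similarity.
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
same = np.allclose(linear_kernel(X, X), cosine_similarity(X, X))
print(np.allclose(row_norms, 1.0), same)
```

Keep in mind that the full similarity matrix in this notebook is dense: 12017 x 12017 float64 values take a bit over 1 GB of memory.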

### Provide Recommendations

In [20]:
indices = pd.Series(df_anime_cbrs.index, index=df_anime_cbrs['name']).drop_duplicates()

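def get_recommendations_cb(title, cosine_sim=cosine_sim):
    """Return the 10 titles whose genres are most similar to `title`."""
    index = indices[title]
    sim_scores = list(enumerate(cosine_sim[index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry, which is normally the queried title itself
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    # Positions refer to df_anime_cbrs (the frame cosine_sim was built from), not df_anime
    return df_anime_cbrs['name'].iloc[movie_indices]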

In [24]:
get_recommendations_cb('Fullmetal Alchemist: Brotherhood')

200                               Fullmetal Alchemist
1558    Fullmetal Alchemist: The Sacred Star of Milos
402         Fullmetal Alchemist: Brotherhood Specials
879               Tales of Vesperia: The First Strike
4262            Tetsujin 28-gou: Hakuchuu no Zangetsu
1967                 Fullmetal Alchemist: Reflections
101                        Magi: The Kingdom of Magic
268                      Magi: The Labyrinth of Magic
290                       Magi: Sinbad no Bouken (TV)
461                            Magi: Sinbad no Bouken
Name: name, dtype: object
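The function assumes the queried title sorts first, so that the slice `[1:11]` skips it. That may not hold when several titles have identical genres (similarity 1.0), because ties keep their original row order. Below is a sketch of an alternative that excludes the query row explicitly, using NumPy on a small invented similarity matrix; `top_k_similar` is a hypothetical helper.

```python
import numpy as np


def top_k_similar(sim, index, k):
    """Return positions of the k rows most similar to `index`, excluding itself."""
    scores = sim[index].astype(float).copy()
    scores[index] = -np.inf                     # never recommend the query itself
    order = np.argsort(-scores, kind="stable")  # descending; ties keep row order
    return order[:k].tolist()


# Invented 4x4 similarity matrix; rows 2 and 3 have identical genres.
sim = np.array([
    [1.0, 0.2, 0.5, 0.5],
    [0.2, 1.0, 0.0, 0.0],
    [0.5, 0.0, 1.0, 1.0],
    [0.5, 0.0, 1.0, 1.0],
])
print(top_k_similar(sim, index=3, k=2))  # → [2, 0]
```

With the sort-and-slice approach, querying row 3 on this matrix would skip row 2 and return row 3 itself, since row 2 ties at 1.0 and comes first.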

## Collaborative Filtering Recommendation System

### Exploratory Data Analysis

In [25]:
df_rating.head()

Unnamed: 0,user_id,anime_id,rating
47,1,8074,10
81,1,11617,10
83,1,11757,10
101,1,15451,10
153,2,11771,10


In [26]:
data = df_rating['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df_rating.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )

layout = dict(title = 'Distribution of {} anime ratings'.format(df_rating.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))

fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

Because the rating data still has more than six million rows after removing the unrated entries, I use only the first fifty thousand rows to demonstrate the collaborative filtering system.

In [27]:
df = df_rating.iloc[:50000,].reset_index()
df = df.drop(['index'], axis=1)
df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,8074,10
1,1,11617,10
2,1,11757,10
3,1,15451,10
4,2,11771,10
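Taking the first 50,000 rows keeps only the users with the lowest `user_id` values, since the rows appear to be ordered by `user_id`. If a more representative subset is wanted, a random sample is one option. A sketch on toy data; the frame and the `random_state` value are illustrative.

```python
import pandas as pd

# Toy ratings frame ordered by user_id, mimicking the layout of df_rating.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "anime_id": [10, 11, 10, 12, 11, 13, 12, 13],
    "rating": [8, 9, 7, 10, 6, 9, 8, 7],
})

first_rows = toy.iloc[:4]  # only users 1 and 2 are represented
# Random rows drawn from the whole frame; random_state makes the draw reproducible.
sampled = toy.sample(n=4, random_state=0).reset_index(drop=True)
print(sorted(first_rows["user_id"].unique()), len(sampled))
```

A random sample spreads the rows over many users, which also means each user contributes fewer ratings; which trade-off is preferable depends on the experiment.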


### Build up Recommendation Model

In [28]:
df['rating'].unique()

array([10,  8,  6,  9,  7,  3,  5,  4,  1,  2], dtype=int64)

In [None]:
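recmodel = SVDpp()
reader = Reader(rating_scale=(1,10))
# load_from_df expects the columns in the order user, item, rating
df_rating_rec = Dataset.load_from_df(df, reader)
# Estimate the out-of-sample error first; cross_validate refits the model on every fold
cross_validate(recmodel, df_rating_rec, measures=['RMSE', 'MAE'], cv=5, verbose=True)
# Then fit on all the data so that later predictions use the full training set
recmodel.fit(df_rating_rec.build_full_trainset())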
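`cross_validate` reports RMSE and MAE for each fold. As a reminder of what those two numbers measure, here is a minimal NumPy sketch; the true and predicted ratings are made up.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of a miss, in rating points."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


true_ratings = [10, 8, 6, 9]
predicted = [9, 8, 8, 9]
print(rmse(true_ratings, predicted), mae(true_ratings, predicted))
```

Both metrics are on the 1 to 10 rating scale, so an MAE of 0.75 means the predictions are off by three quarters of a rating point on average.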

In [23]:
anime_id = df['anime_id'].unique()
# Take user 1 as an example: predict only for the anime this user has not rated yet
anime_id1 = df.loc[df['user_id'] == 1, 'anime_id']
anime_id_to_pred = np.setdiff1d(anime_id, anime_id1)

In [24]:
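# Each entry is (user id, item id, true rating). The 10 is only a placeholder
# for the unknown true rating; just the estimate 'est' is used afterwards.
testset = [[1, anime_id, 10] for anime_id in anime_id_to_pred]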
user_id1_pred = recmodel.test(testset)
df_pred = pd.DataFrame(user_id1_pred)
df_pred.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,1,1,10,9.48242,{'was_impossible': False}
1,1,5,10,9.251682,{'was_impossible': False}
2,1,6,10,9.039687,{'was_impossible': False}
3,1,7,10,8.474708,{'was_impossible': False}
4,1,8,10,8.249671,{'was_impossible': False}


In [25]:
df_pred = df_pred.rename(columns={'uid': 'user_id', 'iid': 'anime_id', 'est': 'predicted rating'})
df_pred = df_pred.drop(['r_ui', 'details'], axis=1)
df_pred.head()

Unnamed: 0,user_id,anime_id,predicted rating
0,1,1,9.48242
1,1,5,9.251682
2,1,6,9.039687
3,1,7,8.474708
4,1,8,8.249671


In [26]:
df_pred = df_pred.sort_values('predicted rating', ascending=False)
df_pred_anime_id = df_pred.head(10)['anime_id']

df_recommendation = pd.DataFrame({'anime_id':[], 'name':[], 'genre':[], 'type':[], 'episodes':[], 'rating':[], 'members':[]})

for anime_id in df_pred_anime_id:
    df_recommendation = df_recommendation.append(df_anime[df_anime['anime_id'] == anime_id])
        
df_recommendation = df_recommendation.reset_index()
df_recommendation = df_recommendation.drop(['index'], axis=1)
df_recommendation['anime_id'] = df_recommendation['anime_id'].astype('int')
df_recommendation

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855.0
1,19815,No Game No Life,"Adventure, Comedy, Ecchi, Fantasy, Game, Super...",TV,12,8.47,602291.0
2,16894,Kuroko no Basket 2nd Season,"Comedy, School, Shounen, Sports",TV,25,8.58,243325.0
3,6702,Fairy Tail,"Action, Adventure, Comedy, Fantasy, Magic, Sho...",TV,175,8.22,584590.0
4,2904,Code Geass: Hangyaku no Lelouch R2,"Action, Drama, Mecha, Military, Sci-Fi, Super ...",TV,25,8.98,572888.0
5,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572.0
6,4181,Clannad: After Story,"Drama, Fantasy, Romance, Slice of Life, Supern...",TV,24,9.06,456749.0
7,245,Great Teacher Onizuka,"Comedy, Drama, School, Shounen, Slice of Life",TV,43,8.77,268487.0
8,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266.0
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109.0


### Provide Recommendations

In [27]:
def get_recommendations_cf(user_id, num_recommendations):
    """Provide recommendations for specific user with the number they want to show.
    """
    anime_id = df['anime_id'].unique()
    anime_id_user = df.loc[df['user_id'] == user_id, 'anime_id']
    anime_id_to_pred = np.setdiff1d(anime_id, anime_id_user)
    testset = [[user_id, anime_id, 10] for anime_id in anime_id_to_pred]
    user_id_pred = recmodel.test(testset)
    df_pred = pd.DataFrame(user_id_pred)
    
    df_pred = df_pred.sort_values('est', ascending=False)
    df_pred_anime_id = df_pred.head(num_recommendations)['iid']

    df_recommendation = pd.DataFrame({'anime_id':[], 'name':[], 'genre':[], 'type':[], 'episodes':[], 'rating':[], 'members':[]})

    for anime_id in df_pred_anime_id:
        df_recommendation = df_recommendation.append(df_anime[df_anime['anime_id'] == anime_id])
        
    df_recommendation = df_recommendation.reset_index()
    df_recommendation = df_recommendation.drop(['index'], axis=1)
    df_recommendation['anime_id'] = df_recommendation['anime_id'].astype('int')
    return df_recommendation

In [28]:
get_recommendations_cf(1, 15)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855.0
1,19815,No Game No Life,"Adventure, Comedy, Ecchi, Fantasy, Game, Super...",TV,12,8.47,602291.0
2,16894,Kuroko no Basket 2nd Season,"Comedy, School, Shounen, Sports",TV,25,8.58,243325.0
3,6702,Fairy Tail,"Action, Adventure, Comedy, Fantasy, Magic, Sho...",TV,175,8.22,584590.0
4,2904,Code Geass: Hangyaku no Lelouch R2,"Action, Drama, Mecha, Military, Sci-Fi, Super ...",TV,25,8.98,572888.0
5,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572.0
6,4181,Clannad: After Story,"Drama, Fantasy, Romance, Slice of Life, Supern...",TV,24,9.06,456749.0
7,245,Great Teacher Onizuka,"Comedy, Drama, School, Shounen, Slice of Life",TV,43,8.77,268487.0
8,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266.0
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109.0


## Conclusion

### Popularity Recommendation System

In [29]:
df_anime_popular

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,5114.0,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665.0
1,9253.0,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572.0
2,4181.0,Clannad: After Story,"Drama, Fantasy, Romance, Slice of Life, Supern...",TV,24,9.06,456749.0
3,2904.0,Code Geass: Hangyaku no Lelouch R2,"Action, Drama, Mecha, Military, Sci-Fi, Super ...",TV,25,8.98,572888.0
4,199.0,Sen to Chihiro no Kamikakushi,"Adventure, Drama, Supernatural",Movie,1,8.93,466254.0
5,1575.0,Code Geass: Hangyaku no Lelouch,"Action, Mecha, Military, School, Sci-Fi, Super...",TV,25,8.83,715151.0
6,11061.0,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855.0
7,1535.0,Death Note,"Mystery, Police, Psychological, Supernatural, ...",TV,37,8.71,1013917.0
8,2001.0,Tengen Toppa Gurren Lagann,"Action, Adventure, Comedy, Mecha, Sci-Fi",TV,27,8.78,562962.0
9,1.0,Cowboy Bebop,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",TV,26,8.82,486824.0


### Content-Based Recommendation System

In [30]:
get_recommendations_cb('Fullmetal Alchemist: Brotherhood')

200                               Fullmetal Alchemist
1558    Fullmetal Alchemist: The Sacred Star of Milos
402         Fullmetal Alchemist: Brotherhood Specials
879               Tales of Vesperia: The First Strike
4262            Tetsujin 28-gou: Hakuchuu no Zangetsu
1967                 Fullmetal Alchemist: Reflections
101                        Magi: The Kingdom of Magic
268                      Magi: The Labyrinth of Magic
290                       Magi: Sinbad no Bouken (TV)
461                            Magi: Sinbad no Bouken
Name: name, dtype: object

### Collaborative Filtering Recommendation System - User to User

In [31]:
get_recommendations_cf(2, 5)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855.0
1,4181,Clannad: After Story,"Drama, Fantasy, Romance, Slice of Life, Supern...",TV,24,9.06,456749.0
2,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572.0
3,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109.0
4,199,Sen to Chihiro no Kamikakushi,"Adventure, Drama, Supernatural",Movie,1,8.93,466254.0


## Improvement
The best approach is probably to combine these three systems into a single, comprehensive set of recommendations. Finding a sound way to combine them and build a better model will take considerably more time.