### Content-Based Recommender System
In this notebook, we explore content-based recommendation systems and implement a simple one using Python and the pandas library.

In [1]:
import pandas as pd
import numpy as np

In [2]:
movies_df = pd.read_csv('movies.csv')
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
34203,151697,Grand Slam (1967),Thriller
34204,151701,Bloodmoney (2010),(no genres listed)
34205,151703,The Butterfly Circus (2009),Drama
34206,151709,Zero (2015),Drama|Sci-Fi


In [3]:
#preprocessing: copy the parenthesized release year into its own column, then strip it from the title
movies_df['year'] = movies_df.title.str.extract(r'\((\d{4})\)', expand=False)
movies_df['title'] = movies_df.title.str.replace(r'\(\d{4}\)', '', regex=True)
movies_df['title'] = movies_df['title'].str.strip()

In [4]:
movies_df

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
...,...,...,...,...
34203,151697,Grand Slam,Thriller,1967
34204,151701,Bloodmoney,(no genres listed),2010
34205,151703,The Butterfly Circus,Drama,2009
34206,151709,Zero,Drama|Sci-Fi,2015
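The title clean-up above can be checked on a few sample titles. This is a minimal sketch: the `sample` frame and its three titles are made up for illustration and are not read from movies.csv. It anchors the pattern on the parentheses so that a title that itself starts with four digits does not get mistaken for the year, and it passes `regex=True` explicitly, since recent pandas versions treat the `str.replace` pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical sample titles in the same "Title (YYYY)" format as movies.csv
sample = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', '2001: A Space Odyssey (1968)']
})

# Capture only a four-digit number that is wrapped in parentheses
sample['year'] = sample['title'].str.extract(r'\((\d{4})\)', expand=False)

# Remove the parenthesized year from the title, then trim leftover whitespace
sample['title'] = (
    sample['title']
    .str.replace(r'\(\d{4}\)', '', regex=True)
    .str.strip()
)

print(sample)
```

With an unanchored pattern such as `(\d\d\d\d)`, the third title would yield 2001 as its year instead of 1968.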


Next, let's split the values in the genres column into a list of genres to simplify later use.

In [5]:
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995
...,...,...,...,...
34203,151697,Grand Slam,[Thriller],1967
34204,151701,Bloodmoney,[(no genres listed)],2010
34205,151703,The Butterfly Circus,[Drama],2009
34206,151709,Zero,"[Drama, Sci-Fi]",2015



We will use one-hot encoding to convert each movie's list of genres into a vector in which each column corresponds to one possible genre.

In [6]:
movies_df_copy = movies_df.copy()

In [7]:
#set a 1 in the column named after each genre the movie belongs to
for index,row in movies_df.iterrows():
    for genre in row['genres']:
        movies_df_copy.at[index,genre] = 1

In [8]:
movies_df_copy.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,,...,,,,,,,,,,
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,,1.0,,1.0,,...,,,,,,,,,,
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,,,,1.0,,1.0,...,,,,,,,,,,
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,,,,1.0,,1.0,...,,,,,,,,,,
4,5,Father of the Bride Part II,[Comedy],1995,,,,1.0,,,...,,,,,,,,,,


In [9]:
#replace NaN with 0 so every genre column holds a 0/1 indicator
movies_df_copy = movies_df_copy.fillna(0)
movies_df_copy

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34203,151697,Grand Slam,[Thriller],1967,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34204,151701,Bloodmoney,[(no genres listed)],2010,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
34205,151703,The Butterfly Circus,[Drama],2009,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34206,151709,Zero,"[Drama, Sci-Fi]",2015,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
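The row-by-row loop above is easy to follow but can be slow on tens of thousands of movies. As an alternative sketch (not the approach used in this notebook), pandas can build the same kind of indicator matrix in one vectorized step with `Series.str.get_dummies`. The `toy` frame below is a made-up three-row stand-in for movies_df after the genres split.

```python
import pandas as pd

# Hypothetical three-row stand-in for movies_df after the genres split
toy = pd.DataFrame({
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
    'genres': [
        ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
        ['Adventure', 'Children', 'Fantasy'],
        ['Comedy', 'Romance'],
    ],
})

# Re-join each list with '|' so str.get_dummies can emit one 0/1 column per genre
genre_matrix = toy['genres'].str.join('|').str.get_dummies(sep='|')

# Attach the indicator columns to the original frame
toy_encoded = pd.concat([toy, genre_matrix], axis=1)

print(toy_encoded)
```

Two differences from the loop are worth noting: the genre columns come out in alphabetical order rather than in order of first appearance, and they hold integers with no NaN values, so no `fillna(0)` step is needed.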


#### Collaborative Filtering

We begin by creating the input: a list of movies that a user has already rated, together with the rating given to each.

In [17]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ]

In [11]:
InputMovies = pd.DataFrame(userInput)
InputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5



Add the movieId to each of the user's input movies by looking up the titles in movies_df.

In [12]:
input_id = movies_df[movies_df['title'].isin(InputMovies['title'].tolist())]

In [13]:
input_id

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
293,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994
1246,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988
1885,1968,"Breakfast Club, The","[Comedy, Drama]",1985


In [14]:
InputMovies = pd.merge(input_id,InputMovies)

In [15]:
InputMovies

Unnamed: 0,movieId,title,genres,year,rating
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,3.5
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,2.0
2,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,5.0
3,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988,4.5
4,1968,"Breakfast Club, The","[Comedy, Drama]",1985,5.0


In [16]:
InputMovies = InputMovies.drop(['genres','year'],axis=1)
InputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0
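The lookup above relies on every input title matching a title in the catalog exactly; a title that is spelled differently is silently dropped by the inner merge. The sketch below shows one way to surface such misses, assuming small made-up `catalog` and `rated` frames (both hypothetical, not part of this notebook). It uses a left merge with `indicator=True`, which adds a `_merge` column recording where each row was found.

```python
import pandas as pd

# Hypothetical stand-ins for movies_df and the user's input
catalog = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
})
rated = pd.DataFrame({
    'title': ['Jumanji', 'Toy Story', 'Not In Catalog'],
    'rating': [2.0, 3.5, 4.0],
})

# Keep every input row; _merge is 'both' for matches and 'left_only' for misses
matched = pd.merge(rated, catalog, on='title', how='left', indicator=True)

# Titles that did not match anything in the catalog
missing = matched.loc[matched['_merge'] == 'left_only', 'title'].tolist()

# Matched rows only, with movieId restored to an integer type
found = (
    matched[matched['_merge'] == 'both']
    .drop(columns='_merge')
    .assign(movieId=lambda d: d['movieId'].astype(int))
)

print(missing)
print(found)
```

Checking `missing` before continuing makes it obvious when an input title needs to be corrected, for example "Breakfast Club, The" rather than "The Breakfast Club".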


#### The users who have seen the same movies
With the movie IDs in our input, we can now get the subset of users who have watched and rated those movies.

In [None]:
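#assumes ratings_df (with userId, movieId and rating columns) has already been loaded
userSubset = ratings_df[ratings_df['movieId'].isin(InputMovies['movieId'].tolist())]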
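The filtering step can be sketched on a small made-up ratings table. Everything below is an assumption for illustration: the `ratings_toy` frame, its userId, movieId and rating columns, and the `input_ids` list are hypothetical and not taken from the notebook's data. The sketch keeps only ratings of the input movies and then counts how many of those movies each user has rated.

```python
import pandas as pd

# Hypothetical ratings table: one row per (user, movie) rating
ratings_toy = pd.DataFrame({
    'userId':  [10, 10, 11, 12, 12, 12],
    'movieId': [1, 50, 2, 1, 2, 296],
    'rating':  [4.0, 3.0, 2.5, 5.0, 3.0, 4.5],
})

# Hypothetical movie IDs taken from the user's input
input_ids = [1, 2, 296]

# Keep only the ratings that refer to one of the input movies
subset = ratings_toy[ratings_toy['movieId'].isin(input_ids)]

# Count how many input movies each user has rated, most overlap first
overlap = (
    subset.groupby('userId')['movieId']
    .count()
    .sort_values(ascending=False)
)

print(overlap)
```

Users with the largest overlap share the most movies with the input, which makes them natural candidates for comparison in the next step.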