# Content Based Filtering
Content-based filtering (CBF) relies on a profile of the user's preferences and a description of each item. Items are described with keywords, and the user profile records what the user likes or dislikes. In other words, a CBF algorithm recommends items similar to those the user liked in the past: it examines previously rated items and recommends the best-matching ones. For instance, in a content-based movie recommender system, the similarity between movies is calculated from the genres, the actors, the director, and so on (https://www.researchgate.net/publication/324763207_A_Hybrid_Approach_using_Collaborative_filtering_and_Content_based_Filtering_for_Recommender_System).
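To make "similarity between movies based on genres" concrete, here is a minimal sketch that uses Jaccard similarity over genre sets. Jaccard is an assumption chosen for illustration (the notebook itself scores movies against a weighted genre profile instead), and the helper name `jaccard` is hypothetical.

```python
# Genre sets taken from the first rows of the movies dataset
toy_story = {"Adventure", "Animation", "Children", "Comedy", "Fantasy"}
jumanji = {"Adventure", "Children", "Fantasy"}
grumpier_old_men = {"Comedy", "Romance"}


def jaccard(a: set, b: set) -> float:
    """Share of genres two movies have in common: |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # two movies without genres share nothing
    return len(a & b) / len(union)


# Toy Story and Jumanji share 3 of 5 distinct genres
print(jaccard(toy_story, jumanji))           # 0.6
# Toy Story and Grumpier Old Men share only Comedy (1 of 6)
print(jaccard(toy_story, grumpier_old_men))
```

A higher value means the two films overlap more in genre, so Jumanji would be ranked closer to Toy Story than Grumpier Old Men is.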

**Problem:**<br>
Recommend films based on the films a user has already watched and rated.

### 1. Import libraries

In [57]:
import pandas as pd
import numpy as np

### 2. Load dataset

In [58]:
# load film
movies = pd.read_csv("../Dataset/movies.csv")

In [59]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### 3. Data cleansing

We need to clean the data: drop unused columns, remove the year from each title, strip surrounding whitespace, and split the genres into lists (the genres serve as the features of the CBF algorithm).

In [60]:
# remove the "(YYYY)" year from the title; regex=True is required in recent pandas versions
movies['title'] = movies['title'].str.replace(r'\(\d{4}\)', '', regex=True)

# strip leading and trailing whitespace
movies['title'] = movies['title'].str.strip()

#Split genres
movies['genres'] = movies['genres'].str.split('|')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji,"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men,"[Comedy, Romance]"
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II,[Comedy]


In [61]:
# check movies missing values
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [62]:
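# check ratings missing values (ratings is not loaded above; this path is an assumption)
ratings = pd.read_csv("../Dataset/ratings.csv")
ratings.isnull().sum()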

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

### 4. Content-based filtering steps

#### 4.1 One hot encoding to create new dataframe

In [63]:
#Copy the movies dataset
new_movies = movies.copy()

# For each movie, set a 1 in the column of every genre it belongs to
for index, row in movies.iterrows():
    for genre in row['genres']:
        new_movies.at[index, genre] = 1

# Fill the remaining cells (genres the movie does not have) with 0
new_movies = new_movies.fillna(0)
new_movies.head()

Unnamed: 0,movieId,title,genres,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",0.0,0.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
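The nested loop works but is slow on large catalogs. As a hedged alternative, the same one-hot table can be sketched with pandas' built-in `Series.str.get_dummies`, applied to the pipe-delimited genre strings. The frame `movies_demo` is a hypothetical three-row stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical miniature of movies.csv, genres still pipe-delimited
movies_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One column per genre, 1 where the movie has the genre and 0 otherwise
genre_flags = movies_demo["genres"].str.get_dummies(sep="|")

# Attach the flags to the identifying columns
encoded = pd.concat([movies_demo[["movieId", "title"]], genre_flags], axis=1)
print(encoded)
```

If the genres have already been split into lists, joining them back first (`movies['genres'].str.join('|')`) should give the same result. Note that the flags come out as integers rather than the floats produced by `fillna(0)`.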


#### 4.2 User's movie input

In [64]:
# list of user input
usr_input = [
            {'title':'Cutthroat Island', 'rating':4},
            {'title':'Assassins', 'rating':3.5},
            {'title':'Powder', 'rating':4},
            {'title':'Money Train', 'rating':5},
            {'title':'Dracula: Dead and Loving It', 'rating':3.5}
         ]

# save the input to dataframe
usr_movie = pd.DataFrame(usr_input)
usr_movie

Unnamed: 0,rating,title
0,4.0,Cutthroat Island
1,3.5,Assassins
2,4.0,Powder
3,5.0,Money Train
4,3.5,Dracula: Dead and Loving It


In [65]:
# look up the catalog rows (including movieId) for the titles the user rated
input_id = movies[movies['title'].isin(usr_movie['title'].tolist())]

# merge the catalog rows with the user's ratings on the shared title column
usr_final = pd.merge(input_id, usr_movie, on='title')

# drop the genres column, which is not needed in the input dataframe
usr_final = usr_final.drop(columns=['genres'])

In [66]:
usr_final

Unnamed: 0,movieId,title,rating
0,12,Dracula: Dead and Loving It,3.5
1,15,Cutthroat Island,4.0
2,20,Money Train,5.0
3,23,Assassins,3.5
4,24,Powder,4.0
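The inner merge silently drops any title that does not match the catalog exactly (for example because of a typo or a leftover year). A minimal sketch of how such misses could be surfaced, using `merge` with `how="left"` and `indicator=True`; the frames `catalog` and `user_ratings` and the unmatched title are made up for illustration.

```python
import pandas as pd

# Hypothetical slice of the cleaned movie catalog
catalog = pd.DataFrame({
    "movieId": [12, 15, 20],
    "title": ["Dracula: Dead and Loving It", "Cutthroat Island", "Money Train"],
})

# Hypothetical user input; the last title does not exist in the catalog
user_ratings = pd.DataFrame([
    {"title": "Cutthroat Island", "rating": 4.0},
    {"title": "Money Train", "rating": 5.0},
    {"title": "A Title Not In The Catalog", "rating": 3.0},
])

# Keep every user row and record whether a catalog match was found
matched = user_ratings.merge(catalog, on="title", how="left", indicator=True)

# Titles that found no partner in the catalog
missing = matched.loc[matched["_merge"] == "left_only", "title"].tolist()
print(missing)

# Rows that can safely be used to build the user profile
usable = matched.loc[matched["_merge"] == "both"].drop(columns="_merge")
print(usable)
```

Reporting `missing` back to the user is one option; the rest of the pipeline would then run on `usable` only.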


#### 4.3 One hot encoding for user movie input

In [67]:
usr_movies = new_movies[new_movies['movieId'].isin(usr_final['movieId'].tolist())]
usr_movies

Unnamed: 0,movieId,title,genres,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
11,12,Dracula: Dead and Loving It,"[Comedy, Horror]",0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,15,Cutthroat Island,"[Action, Adventure, Romance]",1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,20,Money Train,"[Action, Comedy, Crime, Drama, Thriller]",0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,23,Assassins,"[Action, Crime, Thriller]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,24,Powder,"[Drama, Sci-Fi]",0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


From the table above, only the genre columns are needed; they serve as the features of the CBF algorithm.

In [68]:
#Reset index start with 0
usr_movies = usr_movies.reset_index(drop=True)

#Drop unused columns
usr_movies = usr_movies.drop(columns=['movieId', 'title', 'genres'])
usr_movies

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### 4.4 Assign a weight to each genre

In [69]:
# Build the user's genre profile from the rated films
# The dot product sums, for each genre, the ratings of the films that have that genre
usr_weight = usr_movies.transpose().dot(usr_final['rating'])
usr_weight

Adventure              4.0
Animation              0.0
Children               0.0
Comedy                 8.5
Fantasy                0.0
Romance                4.0
Drama                  9.0
Action                12.5
Crime                  8.5
Thriller               8.5
Horror                 3.5
Mystery                0.0
Sci-Fi                 4.0
War                    0.0
Musical                0.0
Documentary            0.0
IMAX                   0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
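The weights above can be checked by hand: Action appears in Cutthroat Island (4.0), Money Train (5.0) and Assassins (3.5), so its weight is 4.0 + 5.0 + 3.5 = 12.5. The sketch below rebuilds the same dot product from the five rated films; the names `rated`, `profile` and `weights` are illustrative, and genres with a weight of zero are left out for brevity.

```python
import pandas as pd

# Rating and genres of each film the user rated (from the tables above)
rated = {
    "Dracula: Dead and Loving It": (3.5, ["Comedy", "Horror"]),
    "Cutthroat Island": (4.0, ["Action", "Adventure", "Romance"]),
    "Money Train": (5.0, ["Action", "Comedy", "Crime", "Drama", "Thriller"]),
    "Assassins": (3.5, ["Action", "Crime", "Thriller"]),
    "Powder": (4.0, ["Drama", "Sci-Fi"]),
}
genres = ["Adventure", "Comedy", "Romance", "Drama", "Action",
          "Crime", "Thriller", "Horror", "Sci-Fi"]

# One-hot matrix: one row per film, one column per genre
profile = pd.DataFrame(0.0, index=list(rated), columns=genres)
for title, (_, film_genres) in rated.items():
    profile.loc[title, film_genres] = 1.0

# Ratings indexed by title so the dot product aligns rows correctly
user_scores = pd.Series({title: rating for title, (rating, _) in rated.items()})

# genres x films  dot  films  ->  one summed rating per genre
weights = profile.T.dot(user_scores)
print(weights)
```

Because each film contributes its rating to every genre it carries, films with many genres influence the profile more than films with few.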

#### 4.5 Create film recommendation

In [70]:
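# index the one-hot table (new_movies, copied from the movies dataset) by movieId
new_movies = new_movies.set_index('movieId')
# keep only the genre columns; drop(columns=...) replaces the positional axis argument removed in pandas 2.0
new_movies = new_movies.drop(columns=['title', 'genres'])
new_movies.head()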

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [71]:
# Score every movie against the user's genre profile:
# sum the user's weights over the genres the movie has, then divide by the total weight
rec_film = ((new_movies*usr_weight).sum(axis=1))/(usr_weight.sum())

# show the highest-scoring movies first
rec_film.sort_values(ascending=False).head()

movieId
81132    0.872
4719     0.816
7235     0.808
7007     0.752
145      0.752
dtype: float64
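As a worked example of this score, take Heat with genres Action, Crime and Thriller: (12.5 + 8.5 + 8.5) / 62.5 = 0.472, where 62.5 is the sum of all genre weights. The sketch below reproduces that arithmetic for three films from the catalog; the names `weights_demo` and `candidates` are illustrative, and zero-weight genres are omitted.

```python
import pandas as pd

# Non-zero genre weights from step 4.4
weights_demo = pd.Series({
    "Adventure": 4.0, "Comedy": 8.5, "Romance": 4.0, "Drama": 9.0,
    "Action": 12.5, "Crime": 8.5, "Thriller": 8.5, "Horror": 3.5, "Sci-Fi": 4.0,
})

# One-hot genre rows for three candidate films
candidates = pd.DataFrame(
    0.0, index=["Heat", "Sudden Death", "GoldenEye"], columns=weights_demo.index
)
candidates.loc["Heat", ["Action", "Crime", "Thriller"]] = 1.0
candidates.loc["Sudden Death", ["Action"]] = 1.0
candidates.loc["GoldenEye", ["Action", "Adventure", "Thriller"]] = 1.0

# Weighted sum over each film's genres, normalised by the total weight
scores = candidates.mul(weights_demo, axis=1).sum(axis=1) / weights_demo.sum()

# Best match first
ranked = scores.sort_values(ascending=False)
print(ranked)
```

A score of 1.0 would require a film to carry every genre the user has rated, which is why even the best match in the notebook stays below 0.9.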

In [72]:
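# Top 10 film recommendations: sort by score first, otherwise head(10) returns the first ten movieIds
top10 = rec_film.sort_values(ascending=False).head(10)
movies.loc[movies['movieId'].isin(top10.index)]

The recorded output below comes from the unsorted `rec_film.head(10)`, so it lists the first ten movies in the dataset rather than the ten highest-scoring ones.
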
Unnamed: 0,movieId,title,genres
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji,"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men,"[Comedy, Romance]"
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II,[Comedy]
5,6,Heat,"[Action, Crime, Thriller]"
6,7,Sabrina,"[Comedy, Romance]"
7,8,Tom and Huck,"[Adventure, Children]"
8,9,Sudden Death,[Action]
9,10,GoldenEye,"[Action, Adventure, Thriller]"


------