<h1 align=center><font size = 5>CONTENT-BASED FILTERING</font></h1>

**Content-Based Filtering** is a technique that attempts to figure out which aspects of an item a user likes most, and then recommends items that share those aspects. <br>
<br>
Content-based filtering is one type of recommendation system. <br>
<br>
**Recommendation systems** are a collection of algorithms used to recommend items to users based on information taken from the user.  <br>
<br>
In this case, we're going to try to figure out the input user's favorite genres from the movies and ratings given.<br>
<br>
As in other cases, the first step in content-based filtering is to import the libraries needed for analyzing and processing the data.

In [0]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Data Description

In [0]:
movies_df = pd.read_csv('movie.csv', sep=";")
ratings_df = pd.read_csv('ratings.csv', sep=";")

### Movies Data

In [45]:
movies_df

Unnamed: 0,MovieId,Title,Genres
0,0,AADC 2,Drama|Romance
1,1,Gundala,Action|Crime|Drama
2,2,Dilan 1991,Drama|Romance
3,3,Bumi Manusia,Drama|History
4,4,Dua Garis Biru,Drama|Family
5,5,Avengers End Game,Action|Adventure|SciFi
6,6,The Lion King,Action|Adventure|Drama
7,7,Aladdin,Adventure|Family|Fantasy
8,8,Spiderman Far From Home,Action|Adventure|SciFi
9,9,Captain Marvel,Action|Adventure|SciFi


Split the values in the __Genres__ column into a __list of genres__ to simplify future use. This can be achieved by applying the pandas string `split` method to that column.

In [46]:
#Every genre is separated by a | so we simply have to call the split function on |
movies_df['Genres'] = movies_df.Genres.str.split('|')
movies_df

Unnamed: 0,MovieId,Title,Genres
0,0,AADC 2,"[Drama, Romance]"
1,1,Gundala,"[Action, Crime, Drama]"
2,2,Dilan 1991,"[Drama, Romance]"
3,3,Bumi Manusia,"[Drama, History]"
4,4,Dua Garis Biru,"[Drama, Family]"
5,5,Avengers End Game,"[Action, Adventure, SciFi]"
6,6,The Lion King,"[Action, Adventure, Drama]"
7,7,Aladdin,"[Adventure, Family, Fantasy]"
8,8,Spiderman Far From Home,"[Action, Adventure, SciFi]"
9,9,Captain Marvel,"[Action, Adventure, SciFi]"


Genres in a list format aren't optimal for the content-based recommendation technique. We therefore use **one-hot encoding** to convert the list of genres into a vector in which each column corresponds to one possible value of the feature. This encoding is needed to feed categorical data into the computation.

In this case, we store every distinct genre in its own column that contains either 1 or 0: 1 shows that a movie has that genre and 0 shows that it doesn't. Let's also store this dataframe in another variable, since the genre columns won't be important for the first recommendation system.

In [47]:
#Copying the movie dataframe into a new one since we won't need to use the genre information in our first case.
moviesWithGenres_df = movies_df.copy()

#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in movies_df.iterrows():
    for Genre in row['Genres']:
        moviesWithGenres_df.at[index, Genre] = 1
#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()

Unnamed: 0,MovieId,Title,Genres,Drama,Romance,Action,Crime,History,Family,Adventure,SciFi,Fantasy
0,0,AADC 2,"[Drama, Romance]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Gundala,"[Action, Crime, Drama]",1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2,2,Dilan 1991,"[Drama, Romance]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Bumi Manusia,"[Drama, History]",1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,4,Dua Garis Biru,"[Drama, Family]",1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
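
The loop above fills the genre columns one cell at a time. As a hedged alternative, the sketch below uses the pandas method `Series.str.get_dummies`, which builds the same kind of 0/1 columns in a single call. It assumes the genres are still raw pipe-separated strings (that is, before the split step), and `demo_movies`, `genre_flags`, and `demo_encoded` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for movies_df, with genres still pipe-separated
demo_movies = pd.DataFrame({
    "MovieId": [0, 1, 7],
    "Title": ["AADC 2", "Gundala", "Aladdin"],
    "Genres": ["Drama|Romance", "Action|Crime|Drama", "Adventure|Family|Fantasy"],
})

# Split each string on "|" and create one 0/1 column per distinct genre
genre_flags = demo_movies["Genres"].str.get_dummies(sep="|")

# Attach the genre flags to the identifying columns
demo_encoded = pd.concat([demo_movies[["MovieId", "Title"]], genre_flags], axis=1)
print(demo_encoded)
```

Unlike the loop, this produces integer columns rather than floats, and the genre columns come out in alphabetical order rather than in order of first appearance.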


### Ratings Data

In [48]:
ratings_df.head()

Unnamed: 0,UserId,Name,MovieId,Title,Ratings
0,1,Hania,0,AADC 2,3
1,1,Hania,1,Gundala,5
2,1,Hania,2,Dilan 1991,4
3,1,Hania,3,Bumi Manusia,4
4,1,Hania,4,Dua Garis Biru,4


The ratings dataframe holds the ratings that users gave to movies they have already watched. <br>
Its columns are the UserId and Name of the user, the MovieId and Title of the movie, and the Ratings value the user gave to that movie.

## Content Based Filtering

### Input User

This is the data for a new movie reviewer. From this data, **we want to know which other movies this user should watch.**

In [49]:
userInput = [
            {'Title':'AADC 2', 'Ratings':3},
            {'Title':'Dilan 1991', 'Ratings':2},
            {'Title':'Dua Garis Biru', 'Ratings':4},
            {'Title':'Avengers End Game', 'Ratings':5},
            {'Title':'Captain Marvel', 'Ratings':3}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,Ratings,Title
0,3,AADC 2
1,2,Dilan 1991
2,4,Dua Garis Biru
3,5,Avengers End Game
4,3,Captain Marvel


### Add MovieId to input user
The first step after entering the new user's data is to look up the input movies' IDs in the movies dataframe and add them to the input.

We can achieve this by first filtering out the rows that contain the input movies' titles and then merging this subset with the input dataframe.

In [50]:
#Filtering out the movies by title
inputId = movies_df[movies_df['Title'].isin(inputMovies['Title'].tolist())]
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)
inputMovies = inputMovies.drop(columns=['Genres'])
inputMovies

Unnamed: 0,MovieId,Title,Ratings
0,0,AADC 2,3
1,2,Dilan 1991,2
2,4,Dua Garis Biru,4
3,5,Avengers End Game,5
4,9,Captain Marvel,3
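
The merge above relies on pandas joining on the only shared column name, `Title`, and it silently drops any input title that is missing from the catalogue. The sketch below is one possible way to make the join key explicit and to surface unmatched titles using the `on`, `how`, and `indicator` parameters of `DataFrame.merge`. The `catalog` and `rated` frames and the title "Unknown Film" are made-up illustration data, not part of the dataset.

```python
import pandas as pd

# Hypothetical miniature catalogue and user input
catalog = pd.DataFrame({
    "MovieId": [0, 2, 5],
    "Title": ["AADC 2", "Dilan 1991", "Avengers End Game"],
})
rated = pd.DataFrame({
    "Title": ["AADC 2", "Avengers End Game", "Unknown Film"],
    "Ratings": [3, 5, 4],
})

# Left join on an explicit key; the _merge column records where each row came from
merged = rated.merge(catalog, on="Title", how="left", indicator=True)

# Titles the user rated that do not exist in the catalogue
unmatched = merged.loc[merged["_merge"] == "left_only", "Title"].tolist()

# Keep only matched rows and restore the integer MovieId (NaN forced it to float)
matched = (
    merged[merged["_merge"] == "both"]
    .drop(columns="_merge")
    .astype({"MovieId": int})
)

print(unmatched)
print(matched)
```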


To learn the input user's preferences, we first need the subset of movies that the user has rated, taken from the dataframe in which genres are encoded as binary values.

In [51]:
#Filtering out the movies from the input
userMovies = moviesWithGenres_df[moviesWithGenres_df['MovieId'].isin(inputMovies['MovieId'].tolist())]
userMovies

Unnamed: 0,MovieId,Title,Genres,Drama,Romance,Action,Crime,History,Family,Adventure,SciFi,Fantasy
0,0,AADC 2,"[Drama, Romance]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Dilan 1991,"[Drama, Romance]",1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Dua Garis Biru,"[Drama, Family]",1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
5,5,Avengers End Game,"[Action, Adventure, SciFi]",0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
9,9,Captain Marvel,"[Action, Adventure, SciFi]",0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0


We only need the actual genre table, so we clean this up a bit by resetting the index and dropping the MovieId, Title and Genres columns.

In [53]:
#Resetting the index to avoid future issues
userMovies = userMovies.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userGenreTable = userMovies.drop(columns=['MovieId', 'Title', 'Genres'])
userGenreTable

Unnamed: 0,Drama,Romance,Action,Crime,History,Family,Adventure,SciFi,Fantasy
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0


### Input Preference

To learn the user's preferences, we turn each genre into a weight. We multiply each movie's row in the genre table by the rating the user gave that movie, and then sum the resulting table by column. This operation is a dot product between the transposed genre matrix and the ratings vector, so we can accomplish it by calling the pandas `dot` function.

In [54]:
inputMovies['Ratings']

0    3
1    2
2    4
3    5
4    3
Name: Ratings, dtype: int64

In [55]:
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputMovies['Ratings'])
#The user profile
userProfile

Drama        9.0
Romance      5.0
Action       8.0
Crime        0.0
History      0.0
Family       4.0
Adventure    8.0
SciFi        8.0
Fantasy      0.0
dtype: float64

We now have a weight for each of the user's genre preferences. This is known as the User Profile. With it, we can recommend movies that satisfy the user's preferences.
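
To make the dot product concrete, the sketch below checks on a small made-up genre table that "multiply each row by its rating, then sum by column" gives the same result as `transpose().dot(...)`. The names `genre_rows`, `ratings`, `manual_profile`, and `dot_profile` are hypothetical and the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical genre table for three rated movies (rows) and three genres (columns)
genre_rows = pd.DataFrame({
    "Drama":   [1, 1, 0],
    "Romance": [1, 0, 0],
    "Action":  [0, 0, 1],
})
ratings = pd.Series([3, 4, 5])

# Manual route: scale each movie's row by its rating, then add up each genre column
manual_profile = genre_rows.mul(ratings, axis=0).sum(axis=0)

# Dot-product route, as used in the notebook
dot_profile = genre_rows.transpose().dot(ratings)

print(manual_profile)
print(dot_profile)
```

Both routes depend on the genre rows and the ratings being in the same order and sharing the same index, which is why the index was reset earlier.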

### Extracting Genre table

The next step is to extract the genre table from the original dataframe.

In [57]:
#Now let's get the genres of every movie in our original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['MovieId'])
#And drop the unnecessary information
genreTable = genreTable.drop(columns=['MovieId', 'Title', 'Genres'])
genreTable.head()

MovieId,Drama,Romance,Action,Crime,History,Family,Adventure,SciFi,Fantasy
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [58]:
genreTable.shape

(10, 9)

### Weighted Average of Movie

With the input user's profile and the complete list of movies and their genres in hand, the next step is to take the weighted average of every movie's genres based on the profile and recommend the top movies that best satisfy it.

In [59]:
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head()

MovieId
0    0.333333
1    0.404762
2    0.333333
3    0.214286
4    0.309524
dtype: float64
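
As a worked check of the weighted average, the sketch below recomputes the score of a single movie by hand, using the user profile values shown earlier. A movie's score is the sum of the profile weights of its genres divided by the sum of all profile weights. The variable names `profile`, `lion_king`, and `score` are hypothetical.

```python
import pandas as pd

# User profile weights as printed in the notebook
profile = pd.Series({
    "Drama": 9.0, "Romance": 5.0, "Action": 8.0, "Crime": 0.0, "History": 0.0,
    "Family": 4.0, "Adventure": 8.0, "SciFi": 8.0, "Fantasy": 0.0,
})

# One-hot genre vector for The Lion King (Action, Adventure, Drama)
lion_king = pd.Series(0.0, index=profile.index)
lion_king[["Action", "Adventure", "Drama"]] = 1.0

# Weighted average: (8 + 8 + 9) / 42
score = (lion_king * profile).sum() / profile.sum()
print(round(score, 6))  # 0.595238
```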

In [60]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head()

MovieId
6    0.595238
9    0.571429
8    0.571429
5    0.571429
1    0.404762
dtype: float64

### Result and Recommendation

In [62]:
movies_df.loc[movies_df['MovieId'].isin(recommendationTable_df.head(3).keys())]

Unnamed: 0,MovieId,Title,Genres
6,6,The Lion King,"[Action, Adventure, Drama]"
8,8,Spiderman Far From Home,"[Action, Adventure, SciFi]"
9,9,Captain Marvel,"[Action, Adventure, SciFi]"


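We can see that the top 3 recommended movies for this user based on Content-Based Filtering are **The Lion King, Spiderman Far From Home and Captain Marvel**. Note that Captain Marvel is one of the movies the user already rated, because the ranking does not exclude titles the user has already seen.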
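
Because the ranking does not filter out movies the user has already rated, a rated title can appear in the result, and `isin` returns rows in catalogue order rather than score order. The sketch below shows one possible refinement, not the notebook's method: drop rated movies before taking the top 3, and merge so the ranking order is preserved. The scores are copied from the sorted output above; `scores`, `already_rated`, `titles`, and `ranked` are hypothetical names.

```python
import pandas as pd

# Top scores as printed in the notebook, indexed by MovieId
scores = pd.Series(
    {6: 0.595238, 9: 0.571429, 8: 0.571429, 5: 0.571429, 1: 0.404762},
    name="Score",
)
scores.index.name = "MovieId"

# MovieIds the input user has already rated
already_rated = [0, 2, 4, 5, 9]

# Remove rated movies, then keep the three highest-scoring unseen ones
unseen = scores[~scores.index.isin(already_rated)]
top_unseen = unseen.sort_values(ascending=False).head(3)

# Small title lookup for the remaining movies
titles = pd.DataFrame({
    "MovieId": [1, 6, 8],
    "Title": ["Gundala", "The Lion King", "Spiderman Far From Home"],
})

# Merging from the ranked side keeps the rows in score order
ranked = top_unseen.reset_index().merge(titles, on="MovieId")
print(ranked)
```

With this filter, the example user would see The Lion King, Spiderman Far From Home and Gundala instead.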