# <span style="color:red"> <strong> Content Based Recommender System </strong> </span>

Content-based recommender systems generate recommendations from attributes of items, users, or both. User attributes can include age, sex, job type, and other personal information. Item attributes, on the other hand, are descriptive details that distinguish individual items from one another. For movies, these could include the title, cast, description, and genre.

Because they rely on user and item features, content-based recommender systems resemble a traditional machine learning problem more closely than collaborative filtering does. A content-based method uses item or user features to predict a user's action on a given item, for example the rating they would give a movie.
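
As a minimal sketch of how features can drive a prediction, one simple option (not necessarily the approach used later in this notebook) is to average the feature vectors of items a user liked into a profile and score the remaining items by cosine similarity to that profile. The feature matrix, the `liked_items` list, and all variable names below are hypothetical.

```python
import numpy as np

# Hypothetical item feature matrix: rows are items, columns are
# binary genre features [action, comedy, drama]
item_features = np.array([
    [1, 0, 0],   # item 0: action
    [1, 0, 1],   # item 1: action + drama
    [0, 1, 0],   # item 2: comedy
    [0, 0, 1],   # item 3: drama
], dtype=float)

# Items the user has already liked (assumed known from past behaviour)
liked_items = [0, 1]

# User profile = mean feature vector of the liked items
user_profile = item_features[liked_items].mean(axis=0)

# Score every item by cosine similarity to the profile
item_norms = np.linalg.norm(item_features, axis=1)
scores = item_features @ user_profile / (item_norms * np.linalg.norm(user_profile))

# Rank the items the user has not seen yet, best first
unseen = [i for i in range(len(item_features)) if i not in liked_items]
ranked_unseen = sorted(unseen, key=lambda i: scores[i], reverse=True)
print(ranked_unseen)
```

The drama item ranks above the comedy item because it shares a feature with the user's profile, and no information about other users is involved.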

One advantage of content-based recommendation is `user independence`: making recommendations to a user does not require information about other users, unlike collaborative filtering. This makes the content-based approach easier to scale. Another benefit is transparency: the recommender can explain a recommendation in terms of the features it used.

The content-based approach also has drawbacks. One is `over-specialization`: if a user is only interested in specific categories, the recommender will have difficulty recommending items outside that scope, so the user stays within their current circle of items and interests. Content-based approaches also often require domain knowledge to produce relevant item and user features.

Now let's build a content-based recommender in Python, using the MovieLens dataset!

## Load Dataset

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
%matplotlib inline

# Load the movie metadata and the user ratings, then join them on movieId
movie_df  = pd.read_csv("../data/movie.csv")
rating_df = pd.read_csv("../data/rating.csv")
movie_rating = movie_df.merge(rating_df, how="inner", on="movieId")

# Parse the timestamp column and extract the year of each rating
movie_rating["timestamp"] = pd.to_datetime(movie_rating["timestamp"])
movie_rating["year"] = movie_rating["timestamp"].dt.year
movie_rating

,movieId,title,genres,userId,rating,timestamp,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,4.0,1999-12-11 13:36:47,1999
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6,5.0,1997-03-13 17:50:52,1997
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,4.0,1996-06-05 13:37:51,1996
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,4.0,1999-11-25 02:44:47,1999
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4.5,2009-01-02 01:13:41,2009
...,...,...,...,...,...,...,...
20000258,131254,Kein Bund für's Leben (2007),Comedy,79570,4.0,2015-03-30 19:32:59,2015
20000259,131256,"Feuer, Eis & Dosenbier (2002)",Comedy,79570,4.0,2015-03-30 19:48:08,2015
20000260,131258,The Pirates (2014),Adventure,28906,2.5,2015-03-30 19:56:32,2015
20000261,131260,Rentun Ruusu (2001),(no genres listed),65409,3.0,2015-03-30 19:57:46,2015


In [8]:
# Remove movies that have a low number of votes
votes = movie_rating[["movieId", "rating"]].groupby("movieId").count().reset_index()

# Minimum number of votes a movie must have to be included:
# the 85th percentile of the per-movie vote counts
cut_off = np.percentile(votes["rating"].values, 85)
movies = list(votes[votes["rating"] >= int(cut_off)]["movieId"])

# Keep only the ratings that belong to the retained movies
movie_rating = movie_rating[movie_rating["movieId"].isin(movies)]
movie_rating

,movieId,title,genres,userId,rating,timestamp,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,4.0,1999-12-11 13:36:47,1999
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6,5.0,1997-03-13 17:50:52,1997
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,4.0,1996-06-05 13:37:51,1996
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,4.0,1999-11-25 02:44:47,1999
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4.5,2009-01-02 01:13:41,2009
...,...,...,...,...,...,...,...
19991870,116797,The Imitation Game (2014),Drama|Thriller,138148,4.0,2015-03-08 19:53:38,2015
19991871,116797,The Imitation Game (2014),Drama|Thriller,138166,4.0,2015-01-14 15:32:17,2015
19991872,116797,The Imitation Game (2014),Drama|Thriller,138186,5.0,2015-03-29 20:52:28,2015
19991873,116797,The Imitation Game (2014),Drama|Thriller,138231,4.0,2015-01-29 23:12:53,2015
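
To make the vote cut-off concrete, here is a small sketch of the same percentile filter on a toy table. The data and the `toy_` names are made up for illustration, and the 50th percentile is used only so that the effect is visible on three movies.

```python
import numpy as np
import pandas as pd

# Toy ratings table (hypothetical data):
# movie 1 has 5 votes, movie 2 has 2 votes, movie 3 has 1 vote
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 1, 1, 2, 2, 3],
    "rating":  [4.0, 5.0, 3.5, 4.0, 4.5, 3.0, 2.5, 5.0],
})

# Count votes per movie
toy_votes = toy_ratings.groupby("movieId")["rating"].count().reset_index()

# Keep movies whose vote count is at or above the chosen percentile
toy_cut_off = np.percentile(toy_votes["rating"].values, 50)
popular_ids = toy_votes.loc[toy_votes["rating"] >= int(toy_cut_off), "movieId"].tolist()

# Drop the ratings of movies that fall below the cut-off
toy_filtered = toy_ratings[toy_ratings["movieId"].isin(popular_ids)]
print(toy_cut_off, popular_ids, len(toy_filtered))
```

The filter drops whole movies, not individual ratings, so the single rating for movie 3 disappears while every rating of movies 1 and 2 is kept.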


## TF-IDF Approach

The dataset contains seven variables: `movieId`, `title`, `genres`, `userId`, `rating`, `timestamp`, and `year`.
To build a content-based filtering system, we will convert `genres` into a numerical representation so that we can compare movies by the similarity of their genres.

In [9]:
tf_idf = TfidfVectorizer(stop_words="english")
# Represent the genre string of each row with a numerical TF-IDF vector
tf_idf_matrix = tf_idf.fit_transform(movie_rating["genres"])

Now that we have a numerical vector representing the genres of each row, we can measure how similar two movies are by computing their pairwise cosine similarities and storing them in a cosine similarity matrix!

## Calculate Similarity Matrix Based on Genres

In [11]:
cosine_similarity_matrix = cosine_similarity(tf_idf_matrix, tf_idf_matrix)
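
As a quick check of what `cosine_similarity` computes, the sketch below compares it with a hand-written version (dot product divided by the product of the vector norms) on three toy vectors. The values are hypothetical and chosen so that the expected similarities are easy to read off.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy feature vectors (hypothetical values)
toy_vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],   # identical to the first row
    [0.0, 2.0, 0.0],   # shares no non-zero feature with the first row
])

# Cosine similarity by hand: dot product divided by the product of norms
norms = np.linalg.norm(toy_vectors, axis=1, keepdims=True)
manual = (toy_vectors @ toy_vectors.T) / (norms @ norms.T)

# The same quantity from scikit-learn
toy_similarity = cosine_similarity(toy_vectors)
print(np.round(toy_similarity, 2))
```

Identical vectors score 1 and vectors with no shared features score 0. The result has one row and one column per input row, so its size grows quadratically with the number of rows.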

## Make Recommendations

In [None]:
def index_from_genre(df, genre):
    # Positional index of the first row whose genre string matches exactly,
    # so that it lines up with the rows of the similarity matrix
    return np.flatnonzero(df["genres"].values == genre)[0]

def recommendations(genre, df, cosine_similarity_matrix, number_of_recommendations):
    index = index_from_genre(df, genre)
    similarity_scores = list(enumerate(cosine_similarity_matrix[index]))
    similarity_scores_sorted = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Drop the top-ranked entry, which is intended to be the query row itself
    recommendations_indices = [t[0] for t in similarity_scores_sorted[1:(number_of_recommendations + 1)]]
    return df["title"].iloc[recommendations_indices]

recommendations("Drama|Thriller", movie_rating, cosine_similarity_matrix, 10)
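
Because `movie_rating` holds one row per rating, the similarity matrix above also has one row per rating, which can become very large and can return the same title several times. A possible variant, sketched here on toy data, collapses the table to one row per movie before vectorizing and looks movies up by title. The toy frame, `similar_movies`, and the other names are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for movie_rating: one row per rating (hypothetical data)
toy_movie_rating = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3, 4],
    "title": ["Movie A", "Movie A", "Movie B", "Movie C", "Movie C", "Movie D"],
    "genres": ["Drama|Thriller", "Drama|Thriller", "Drama",
               "Comedy", "Comedy", "Crime|Thriller"],
})

# Collapse to one row per movie so the matrix has one row per movie
toy_movies = (toy_movie_rating[["movieId", "title", "genres"]]
              .drop_duplicates("movieId")
              .reset_index(drop=True))

toy_matrix = TfidfVectorizer().fit_transform(toy_movies["genres"])
toy_similarity = cosine_similarity(toy_matrix)

def similar_movies(title, movies, similarity, n):
    """Hypothetical helper: titles of the n movies most similar to `title`."""
    # The index was reset, so the label is also the row position in the matrix
    position = movies.index[movies["title"] == title][0]
    scores = pd.Series(similarity[position], index=movies.index)
    # Remove the query movie explicitly, then sort by descending similarity
    scores = scores.drop(position).sort_values(ascending=False)
    return movies.loc[scores.index[:n], "title"].tolist()

print(similar_movies("Movie A", toy_movies, toy_similarity, 2))
```

Dropping the query movie by its position avoids relying on it being ranked first, which is not guaranteed when several movies share exactly the same genres.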