# Content Based Filtering

In this notebook we work with movie data again, but this dataset is a bit different: it contains no ratings, only a single CSV file with content information about each movie. The aim is to build a simple content-based recommender that finds the top 10 movies to recommend to a user who already likes a given movie.

## Data Description
**id** - ID of the movie title

**title** - Movie Title

**keywords** - Predefined keywords for each movie

**cast** - Entire cast of the movie

**genres** - List of genres corresponding to the movie

**director** - Director of the movie

## Table of Contents

[1. Reading Dataset & Basic Exploration](#Reading-Dataset)

[2. Create Item Representation using various movie features](#Item-Representation)

[3. Finding most similar movies based on content](#similar-movies)

[4. Conclusion](#conclusion)

In [1]:
#Importing required libraries
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## 1. Reading Data & Basic Exploration <a class="anchor" id="Reading-Dataset"></a>

In [2]:
#Reading movie dataset containing detailed movie information
df = pd.read_csv("movie_info.csv")
df.columns

Index(['id', 'title', 'keywords', 'cast', 'genres', 'director'], dtype='object')

**View a sample of the data**

In [3]:
#Eyeballing dataset
df.sample(3).T

| | 3675 | 4789 | 1443 |
|---|---|---|---|
| id | 12498 | 39851 | 13496 |
| title | Sling Blade | Clean | American Outlaws |
| keywords | independent film repair shop southern death th... | addiction recovering drug addict estranged son | sheriff horse outlaw jesse james cole younger |
| cast | Billy Bob Thornton Dwight Yoakam J. T. Walsh J... | Maggie Cheung Nick Nolte Béatrice Dalle J... | Colin Farrell Scott Caan Ali Larter Gabriel Ma... |
| genres | Drama | Drama | Action Western |
| director | Billy Bob Thornton | Olivier Assayas | Les Mayfield |


**Number of unique movie titles**

In [4]:
# unique movies
df['title'].nunique()

4800
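The count matrix built later in this notebook has 4803 rows, while only 4800 titles are unique, so a few titles apparently occur more than once. A minimal sketch of how such repeats could be inspected is shown below. The small `toy_df` frame and its titles are made up for illustration and are not taken from `movie_info.csv`.

```python
import pandas as pd

# Toy stand-in for the movie dataframe (hypothetical titles, not real data)
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["Movie A", "Movie A", "Movie B", "Movie C"],
})

n_rows = len(toy_df)                  # total number of rows
n_unique = toy_df["title"].nunique()  # number of distinct titles

# keep=False marks every occurrence of a repeated title, not only the later ones
dupes = toy_df[toy_df["title"].duplicated(keep=False)]

print(n_rows, n_unique)
print(dupes)
```

Running the same `duplicated(keep=False)` filter on `df` would list the repeated titles, which matters later because a lookup by title can only return one of them.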

## 2. Create Item Representation using various movie features <a class="anchor" id="Item-Representation"></a>

**Select content features from the dataframe**

In [5]:
#list of content features
features = ['keywords','cast','genres','director']

**Fill null values with empty string and combine all the selected features.**

In [6]:
#Filling missing values with empty strings
for feature in features:
    df[feature] = df[feature].fillna('')

In [7]:
#Concatenating strings from each of the content features to get the entire information in 1 column
def combine_features(row):
    return row['keywords'] +" "+row['cast']+" "+row["genres"]+" "+row["director"]

df["combined_features"] = df.apply(combine_features,axis=1)

In [8]:
#Printing a sample of combined features
print ("Combined Features:\n", df["combined_features"].sample(1).values)

Combined Features:
 ['casino monte carlo painting caper independent film Nick Nolte Tchéky Karyo Nutsa Kukhianidze Marc Lavoine Saïd Taghmaoui Crime Drama Thriller Neil Jordan']


**Creating count vectors for all movies**

In [9]:
#Counting the 500 most frequent terms (unigrams and bigrams) across all movies using CountVectorizer
cv = CountVectorizer(ngram_range=(1, 2),max_features=500)
count_matrix = cv.fit_transform(df["combined_features"])

In [10]:
#View the shape of the count matrix (the sparse matrix exposes .shape directly, no need to densify)
count_matrix.shape

(4803, 500)
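To make the count matrix more concrete, here is a small sketch of the same vectorizer settings applied to a toy corpus. The three strings in `toy_docs` are invented stand-ins for `combined_features`, and `max_features=5` is chosen only to keep the output readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for df["combined_features"] (illustrative strings only)
toy_docs = [
    "batman gotham crime Christian Bale Action Christopher Nolan",
    "batman joker crime Christian Bale Action Christopher Nolan",
    "princess tower magic Mandy Moore Animation Nathan Greno",
]

# Same idea as in the notebook: unigrams and bigrams, capped at the most frequent terms
toy_cv = CountVectorizer(ngram_range=(1, 2), max_features=5)
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column; sort the terms by column position
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(toy_vocab)             # the 5 terms that were kept
print(toy_counts.toarray())  # one row per document, one column per kept term
```

Two things are worth noticing. `CountVectorizer` lowercases text by default, so cast and director names become lowercase tokens. Because all features are joined into one string, bigrams can also span feature boundaries (for example the last cast name followed by the first genre).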

**Compute the Cosine Similarity based on the count_matrix**

In [11]:
#Finding cosine similarity between frequency or count vectors for each movie
cosine_sim = cosine_similarity(count_matrix)

In [12]:
#Similarity Matrix Shape
cosine_sim.shape

(4803, 4803)
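Cosine similarity between two count vectors is their dot product divided by the product of their lengths, so it depends on the direction of the vectors rather than on their magnitude. The sketch below checks a hand-written version against `cosine_similarity` on three made-up vectors. The helper `cosine_by_hand` is introduced here for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up count vectors over a 4-term vocabulary
toy = np.array([
    [1, 1, 0, 2],   # movie 0
    [1, 0, 0, 2],   # movie 1: shares most terms with movie 0
    [0, 0, 3, 0],   # movie 2: shares no terms with the others
])

def cosine_by_hand(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine_similarity(toy)  # pairwise similarities, shape (3, 3)

print(sim.round(3))
print(round(cosine_by_hand(toy[0], toy[1]), 3))
```

The matrix is symmetric with ones on the diagonal for these non-zero vectors, and movies that share no terms get a similarity of 0.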

## 3. Finding most similar movies based on content <a class="anchor" id="similar-movies"></a>

In [13]:
#Picking a movie title that the user likes
movie_user_likes = "The Dark Knight Rises"

**Get index of this movie from its title**

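We define two helper functions: one returns the title for a given row index, and the other returns the row index for a given title. We need them to pick the relevant row of the similarity matrix and to translate the results back into titles. Note that the dataset has 4803 rows but only 4800 unique titles, so for a repeated title the lookup returns the first matching row.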

In [14]:
#Get title from index
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

#Get index from title
def get_index_from_title(title):
    return df[df.title == title].index.values[0]

In [15]:
# Get the row index of "The Dark Knight Rises"
movie_index = get_index_from_title(movie_user_likes)
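The title lookup above raises an `IndexError` when the title is not in the dataframe, which can be confusing. One possible variant, sketched here on a toy frame, raises a more descriptive error instead. The function name `find_index` and the toy titles are assumptions made for this example, not part of the notebook.

```python
import pandas as pd

# Toy frame with one repeated title (hypothetical data)
toy_df = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie A"]})

def find_index(frame, title):
    """Return the first row label whose title matches, or raise a KeyError."""
    matches = frame.index[frame["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"No movie titled {title!r}")
    return matches[0]  # for a repeated title this is the first occurrence

print(find_index(toy_df, "Movie A"))
print(find_index(toy_df, "Movie B"))
```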

**Get the similar movies and sort them by similarity (most similar to least similar)**

In [16]:
#Storing list of similar movies
similar_movies =  list(enumerate(cosine_sim[movie_index]))

#Sorting with similarity score
sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)

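**Print the chosen movie followed by its 10 most similar movies**

In [17]:
#The loop prints 11 titles: The Dark Knight Rises itself (it is most similar to itself), then its top 10 similar movies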
i=0
for element in sorted_similar_movies:
    print(get_title_from_index(element[0]))
    i=i+1
    if i>10:
        break

The Dark Knight Rises
Batman Begins
The Dark Knight
The Killer Inside Me
Point Blank
Hitman
The Protector
The Way of the Gun
Bound by Honor
Armored
Harsh Times


**Now let's define a function to get top 10 recommendations**

In [18]:
def get_recommendations(movie_name):
    #Returns the queried movie followed by its 10 most similar movies
    movie_index = get_index_from_title(movie_name)

    #Pair each movie index with its similarity to the queried movie
    similar_movies = list(enumerate(cosine_sim[movie_index]))

    sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)

    #Keep the first 11 entries: the movie itself plus 10 recommendations
    return [get_title_from_index(element[0]) for element in sorted_similar_movies[:11]]

In [19]:
get_recommendations('Tangled')

['Tangled',
 'Pinocchio',
 'Mulan',
 'Roadside Romeo',
 'Aladdin',
 'Arthur and the Invisibles',
 "A Turtle's Tale: Sammy's Adventures",
 'Dinosaur',
 'Rugrats Go Wild',
 'Anastasia',
 'Fantasia']

In [20]:
get_recommendations('Batman Returns')

['Batman Returns',
 'Batman',
 'The Book of Mormon Movie, Volume 1: The Journey',
 'The Grace Card',
 'Shooter',
 'Flicka',
 'The Dark Knight',
 'Mi America',
 'Fantastic 4: Rise of the Silver Surfer',
 'High Heels and Low Lifes',
 'Checkmate']

## 4. Conclusion <a class="anchor" id="conclusion"></a>
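Here we have seen how item features can be used to find similarities between movies and, in turn, to recommend movies. The results look sensible for some titles: the recommendations for 'Tangled' are mostly animated films. They are weaker for others: the list for 'Batman Returns' includes titles such as 'The Book of Mormon Movie, Volume 1: The Journey' and 'Flicka'. As an exercise, try changing the number of terms used by the count vectorizer (`max_features`) and see whether the recommendations improve.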
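As a starting point for this exercise, the sketch below wraps the vectorizing and similarity steps in a helper so that different vocabulary sizes can be compared side by side. The helper name `build_similarity` and the toy strings are assumptions for illustration. On the real data one would pass `df["combined_features"]` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_similarity(texts, max_features):
    """Vectorize texts with unigrams and bigrams, then return (similarity matrix, vocabulary size)."""
    vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=max_features)
    counts = vectorizer.fit_transform(texts)
    return cosine_similarity(counts), len(vectorizer.vocabulary_)

# Illustrative stand-ins for combined feature strings
toy_texts = [
    "batman gotham crime Christian Bale Action Christopher Nolan",
    "batman joker crime Christian Bale Action Christopher Nolan",
    "princess tower magic Mandy Moore Animation Nathan Greno",
]

# max_features=None keeps every term instead of capping the vocabulary
for n_terms in (5, 20, None):
    sim, vocab_size = build_similarity(toy_texts, n_terms)
    print(n_terms, vocab_size, sim[0].round(2))  # similarities of the first text to all texts
```

A small vocabulary keeps only very common terms, while a larger one lets rarer keywords and names contribute, so the ranking of similar movies can shift as `max_features` changes.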