[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ignaziogallo/data-mining/blob/aa20-21/tutorials/recommender-systems/Content-Based-Movie-Recommendation-Example.ipynb)

## Content-Based Recommendations

**Content-based filtering** uses the `features` or `properties` of an **item** to serve recommendations.
* Characteristic information includes:
  * Characteristics of **Items**
    * Keywords
    * Attributes
  * Characteristics of **Users**
    * a specified demographic profile
    * interests specified at registration time
    * the product descriptions of the items bought
    * and so on

### Example: Movie recommendation system based on item content

Characteristics for the **item** `Harry Potter and the Sorcerer’s Stone` might include:
* **Director Name** – Chris Columbus
* **Genres** – Adventure, Fantasy, Family (IMDB)
* **Stars** – Daniel Radcliffe, Rupert Grint, Emma Watson

### Let's Create a Simple Content-Based Recommendations System

#### Getting Started: Loading Libraries

In [57]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### Loading the Dataset
Load the Kaggle dataset <a href = "https://www.kaggle.com/rounakbanik/the-movies-dataset">The Movies Dataset</a> into a pandas DataFrame.

In [16]:
df = pd.read_csv("data/movie_dataset.csv")

Our DataFrame is ready, so let's take a look at it.

In [17]:
df.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [18]:
df.describe()

Unnamed: 0,index,budget,id,popularity,revenue,runtime,vote_average,vote_count
count,4803.0,4803.0,4803.0,4803.0,4803.0,4801.0,4803.0,4803.0
mean,2401.0,29045040.0,57165.484281,21.492301,82260640.0,106.875859,6.092172,690.217989
std,1386.651002,40722390.0,88694.614033,31.81665,162857100.0,22.611935,1.194612,1234.585891
min,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
25%,1200.5,790000.0,9014.5,4.66807,0.0,94.0,5.6,54.0
50%,2401.0,15000000.0,14629.0,12.921594,19170000.0,103.0,6.2,235.0
75%,3601.5,40000000.0,58610.5,28.313505,92917190.0,118.0,6.8,737.0
max,4802.0,380000000.0,459488.0,875.581305,2787965000.0,338.0,10.0,13752.0


In [19]:
print(df.columns.values)

['index' 'budget' 'genres' 'homepage' 'id' 'keywords' 'original_language'
 'original_title' 'overview' 'popularity' 'production_companies'
 'production_countries' 'release_date' 'revenue' 'runtime'
 'spoken_languages' 'status' 'tagline' 'title' 'vote_average' 'vote_count'
 'cast' 'crew' 'director']


Looking at the dataset, you may have noticed that it holds a lot of extra information about each movie. We don't need all of it, so we use only the `keywords`, `cast`, `genres`, `director` and `title` columns as our feature set.

In [20]:
features = ['genres', 'keywords', 'title', 'cast', 'director']

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 24 columns):
index                   4803 non-null int64
budget                  4803 non-null int64
genres                  4775 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4391 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null fl

As you may have noticed, **some columns contain NaN** values (for example, `genres` and `keywords` have fewer than 4803 non-null entries). These would cause problems when we later concatenate the feature columns as strings, so we first clean the data by replacing every NaN in the selected features with an empty string ('').

In [23]:
for feature in features:
    df[feature] = df[feature].fillna('')
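
To see why the missing values matter, here is a minimal sketch on a tiny hypothetical DataFrame (`toy`, not the movie dataset). It assumes only pandas and NumPy: concatenating a string with `NaN` raises a `TypeError`, while the same operation succeeds once `fillna('')` has been applied.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second movie has no keywords.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "keywords": ["space war", np.nan],
})

# NaN is a float, so "str + NaN" fails with a TypeError.
try:
    toy.loc[1, "title"] + " " + toy.loc[1, "keywords"]
    concat_failed = False
except TypeError:
    concat_failed = True

# After filling, every cell is a string and concatenation works.
toy["keywords"] = toy["keywords"].fillna("")
combined = toy["title"] + " " + toy["keywords"]

print(concat_failed)
print(combined.tolist())
```

Note that a filled row simply contributes no words for that feature, so it does not affect the word counts computed later.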

#### Combine features to create movie content

Our next task is to create a function for **combining the values** of these columns into a single string

In [24]:
def combine_features(row):
    return row['title']+' '+row['genres']+' '+row['director']+' '+row['keywords']+' '+row['cast']

We apply the `combine_features` function to each row of the DataFrame and store the combined string in the `combined_features` column.

In [25]:
df['combined_features'] = df.apply(combine_features, axis = 1)

In [26]:
print(df.loc[0, 'combined_features'])

Avatar Action Adventure Fantasy Science Fiction James Cameron culture clash future space war space colony society Sam Worthington Zoe Saldana Sigourney Weaver Stephen Lang Michelle Rodriguez


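#### Count matrix

Now that we have the combined strings, we can feed them to a `CountVectorizer()` object to get the **count matrix**: one row per movie, one column per word in the vocabulary, and each cell holding the number of times that word occurs in the movie's combined string.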

In [27]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(df['combined_features'])
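
As an illustration of what the count matrix contains, the following sketch runs `CountVectorizer` on two short hypothetical strings (the names `docs`, `toy_cv` and `toy_counts` are made up for this example). With the default settings, text is lowercased and split into word tokens, so a name like "James Cameron" would become the two separate tokens `james` and `cameron`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical "combined feature" strings.
docs = [
    "space war space colony",
    "space adventure",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(docs)  # sparse matrix, one row per document

# Recover the column order from the learned vocabulary (word -> column index).
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(vocab)                 # ['adventure', 'colony', 'space', 'war']
print(toy_counts.toarray())  # [[0 1 2 1]
                             #  [1 0 1 0]]
```

The word `space` appears twice in the first string, which is why its column holds a 2 in the first row.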

#### Similarity matrix

Now, we need to obtain the cosine similarity matrix from the count matrix.

The similarity matrix looks like this
<img src="figures/similarity_matrix.png" width="50%">

In [46]:
cosine_sim = cosine_similarity(count_matrix)
print(cosine_sim.shape,"\n", cosine_sim)

(4803, 4803) 
 [[1.         0.09078413 0.11572751 ... 0.         0.         0.        ]
 [0.09078413 1.         0.06537205 ... 0.06052275 0.         0.        ]
 [0.11572751 0.06537205 1.         ... 0.         0.10206207 0.        ]
 ...
 [0.         0.06052275 0.         ... 1.         0.         0.07142857]
 [0.         0.         0.10206207 ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.07142857 0.         1.        ]]
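
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below checks this by hand against `cosine_similarity` for two hypothetical count vectors `a` and `b` (illustrative values, not rows of the movie matrix).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical word-count vectors over the same 4-word vocabulary.
a = np.array([[0, 1, 2, 1]])
b = np.array([[1, 0, 1, 0]])

# Manual formula: dot(a, b) / (||a|| * ||b||)
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# Library result for the same pair.
from_sklearn = cosine_similarity(a, b)[0, 0]

print(round(manual, 4), round(from_sklearn, 4))  # 0.5774 0.5774
```

Because the counts are never negative, the scores lie between 0 (no shared words) and 1 (same direction). This matches the printed matrix above, which is symmetric with ones on the diagonal.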


#### Utility functions

Now, we will define two helper functions to get movie title from movie index and vice-versa.

In [29]:
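# Look up a movie title by its row label in the DataFrame.
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

# Look up the value of the "index" column for the first movie with this title.
# Raises IndexError if no movie has that exact title.
def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]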
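
The title lookup fails with an unhelpful `IndexError` when the title is misspelled or absent. One possible alternative, sketched here on a hypothetical three-movie frame, is to build a title-to-position mapping once and raise a readable error for unknown titles. The names `toy`, `title_to_pos` and `lookup_index` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature catalogue.
toy = pd.DataFrame({"title": ["Avatar", "Spectre", "John Carter"]})

# Map each title to its positional row number; keep the first occurrence
# if a title appears more than once.
title_to_pos = pd.Series(range(len(toy)), index=toy["title"])
title_to_pos = title_to_pos[~title_to_pos.index.duplicated(keep="first")]

def lookup_index(title):
    """Return the row position for `title`, or raise a readable KeyError."""
    if title not in title_to_pos.index:
        raise KeyError(f"Title not found: {title!r}")
    return int(title_to_pos[title])

print(lookup_index("Spectre"))  # 1
```

The lookup is still an exact, case-sensitive match, just like the helper above.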

#### Recommender Engine

Our next steps are:
* Get the title of the **movie that the user currently likes**.
* Find the index of that movie.
* Access the row corresponding to this movie in the similarity matrix. This row holds the **similarity scores** between the current movie and every movie in the dataset (including itself).
* Enumerate the similarity scores in that row to make tuples of movie index and similarity score.
* This converts a row of similarity scores such as `[1 0.5 0.2 0.9]` into `[(0, 1) (1, 0.5) (2, 0.2) (3, 0.9)]`.
* Here, each item has the form (movie index, similarity score).
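
The enumeration step can be tried in isolation on the small example row from the list above. This sketch also shows why the index is kept: once the pairs are reordered by score, the index is the only way to tell which movie each score belongs to.

```python
# Example similarity row for a hypothetical 4-movie catalogue.
scores = [1, 0.5, 0.2, 0.9]

# Pair every score with its position (the movie index).
pairs = list(enumerate(scores))
print(pairs)   # [(0, 1), (1, 0.5), (2, 0.2), (3, 0.9)]

# After sorting by score, the movie index travels with its score.
ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
print(ranked)  # [(0, 1), (3, 0.9), (1, 0.5), (2, 0.2)]
```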

In [47]:
movie_user_likes = "Star Trek Beyond"
# movie_user_likes = "Avatar"
# movie_user_likes = "Aliens"
movie_index = get_index_from_title(movie_user_likes)
# accessing the row corresponding to given movie to find all the similarity scores 
# for that movie and then enumerating over it
similar_movies = list(enumerate(cosine_sim[movie_index])) 

In [56]:
print(similar_movies[:3])

[(0, 0.3086066999241838), (1, 0.06537204504606135), (2, 0.12500000000000003)]


* We will sort the list `similar_movies` according to similarity scores in descending order. 
* Since the most similar movie to a given movie will be itself, we will discard the first element after sorting the movies.

In [38]:
sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)[1:]
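
Dropping the first element with `[1:]` assumes the queried movie itself is ranked first. That may not hold if another movie with a smaller index has an identical feature string and therefore also scores 1.0. A possible alternative, sketched below with NumPy on a hypothetical similarity row (`sim_row` and `query` are illustrative names), is to exclude the query by its index rather than by its position in the ranking.

```python
import numpy as np

# Hypothetical similarities of movie 1 to a 5-movie catalogue;
# movie 3 is an exact duplicate of movie 1 (score 1.0).
sim_row = np.array([0.3, 1.0, 0.0, 1.0, 0.7])
query = 1

# Indices ordered from most to least similar (stable for ties).
order = np.argsort(-sim_row, kind="stable")

# Remove the query movie explicitly, wherever it landed in the ranking.
order = order[order != query]

top3 = order[:3].tolist()
print(top3)  # [3, 4, 0]
```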

Then, we will run a loop to print the **first 10 entries** of the `sorted_similar_movies` list. The output shown below comes from a run with `movie_user_likes = "Aliens"`, one of the commented-out alternatives above.

In [39]:
print("Top 10 similar movies to "+movie_user_likes+" are:\n")
# slice the list so that exactly 10 titles are printed
for element in sorted_similar_movies[:10]:
    print(get_title_from_index(element[0]))

Top 10 similar movies to Aliens are:

The Terminator
Alien
Avatar
The Box
Alien³
Jason X
The One
The Abyss
Terminator 2: Judgment Day
Alien: Resurrection


<div class="alert alert-success">
    
## Practice 
1. Compare the results with those of other recommendation engines.
2. For example, [search Google for similar movies to “Star Trek Beyond”](https://www.google.com/search?q=similar+movies+to+Star+Trek+Beyond).
3. Compare some test results.
</div>

#### Testing the recommender

In [70]:
# INPUT: movie title 
# RETURNS: the top 10 recommended movies
def recommendations(title, cosine_sim = cosine_sim):
    movie_index = get_index_from_title(title)
    similar_movies = list(enumerate(cosine_sim[movie_index])) 
    sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)[1:]
    recommended_movies = []
    for i in range(10):
        recommended_movies.append(get_title_from_index(sorted_similar_movies[i][0]))
    return recommended_movies
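
To experiment without the full dataset, the whole pipeline can be reproduced on a few hypothetical movies. The sketch below is a toy version under stated assumptions: the frame `toy`, the matrix `toy_sim`, and the function `recommend` with its `top_n` parameter are all illustrative and not part of the original notebook. It also returns an empty list for unknown titles instead of raising an error.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue with pre-combined feature strings.
toy = pd.DataFrame({
    "title": ["Space One", "Space Two", "Love Story", "Space Love"],
    "combined_features": [
        "space war alien",
        "space war colony",
        "romance drama love",
        "space romance love",
    ],
})

# Count matrix -> pairwise cosine similarity (4 x 4).
toy_sim = cosine_similarity(
    CountVectorizer().fit_transform(toy["combined_features"])
)

def recommend(title, top_n=2):
    """Return up to `top_n` titles most similar to `title` ([] if unknown)."""
    matches = toy.index[toy["title"] == title]
    if len(matches) == 0:
        return []
    query = matches[0]
    ranked = sorted(enumerate(toy_sim[query]), key=lambda p: p[1], reverse=True)
    # Skip the query movie by index, then keep the best top_n.
    return [toy.loc[i, "title"] for i, _ in ranked if i != query][:top_n]

print(recommend("Space One"))  # ['Space Two', 'Space Love']
```

"Space Two" ranks first because it shares two words (`space`, `war`) with the query, while "Space Love" shares only one.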

In [71]:
recommendations("Star Trek Beyond")

['Star Trek Into Darkness',
 'Star Trek',
 'Guardians of the Galaxy',
 'Avatar',
 'Star Trek: Insurrection',
 'Star Wars: Episode III - Revenge of the Sith',
 'Avengers: Age of Ultron',
 'Star Wars: Clone Wars: Volume 1',
 'Star Trek: Nemesis',
 'Mad Max Beyond Thunderdome']

In [72]:
recommendations("Avatar")

['Guardians of the Galaxy',
 'Aliens',
 'Alien',
 'Zathura: A Space Adventure',
 'Star Trek Into Darkness',
 'Star Trek Beyond',
 'Lockout',
 'Jason X',
 'Star Wars: Clone Wars: Volume 1',
 'Lost in Space']

In [73]:
recommendations("Lost in Space")

['Zathura: A Space Adventure',
 'Titan A.E.',
 'Gravity',
 'Silent Running',
 'Space Dogs',
 'Avatar',
 'Galaxina',
 'Star Trek',
 "Bill & Ted's Excellent Adventure",
 '2001: A Space Odyssey']