## Content-Based Recommendation

In this section, we will implement **content-based recommendation** using the MovieLens dataset.
* We will use the tf-idf vectorizer to extract features from the plot summaries (**NLP techniques**).
* We will use cosine similarity to find similar movies.

### Import MovieLens dataset

The MovieLens dataset contains 100,000 ratings from about 1,000 users on about 1,700 movies. It is available at [GroupLens](https://grouplens.org/datasets/movielens/). We will use the 100k dataset. The code below reads the movie metadata from `datasets/movies_metadata.csv`.

Let's inspect the dataset by looking at the columns and the first few rows.

In [None]:
import pandas as pd

metadata = pd.read_csv('datasets/movies_metadata.csv', low_memory=False)

# keep only the movies that have more votes than 90% of all movies,
# so that movies with very few votes are dropped
quantile = 0.9
metadata = metadata[metadata.vote_count > metadata.vote_count.quantile(quantile)].reset_index()
pd.DataFrame(metadata.columns, columns=['columns']).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
columns,index,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count


In [None]:
metadata.head()

Unnamed: 0,index,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
3,5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,...,1995-12-15,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0
4,8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,...,1995-12-22,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0


### Preprocessing

The `overview` column contains the **plot summary** of a movie, and we will use it to build our content-based recommendation system.

In order to perform machine learning on the plot summaries, we need to **transform** them into vector representations that numeric algorithms can work with. This process is called **feature extraction** or, in this case, simply vectorization, and it is an essential first step toward language-aware analysis. Every plot summary will be transformed from a sequence of words to a point in a **high-dimensional semantic space**. The simplest encoding of semantic space is the **bag-of-words (BOW)** model; another is the tf-idf model.

Both models are **count-based** tables in which each row represents a plot summary and each column represents a word. The BOW model simply counts the number of times a word appears in a document and uses that count as a proxy for the importance of the word in that document. The tf-idf model also takes into account how many documents in the corpus contain the word. This makes it more useful than the BOW model here, because it **downweights** words that appear in many documents and are therefore less informative than words that appear in only a few.

So, to sum up, every plot summary (or document) will be encoded as a single vector whose length is equal to the size of the vocabulary of all the plot summaries (the so-called corpus) and whose entries are count-based weights of the words in that summary. Both representations are **sparse**, meaning that most of the entries in a vector are zero, because most words in the vocabulary do not appear in a given plot summary.

We'll use the tf-idf vectorizer to transform the plot summaries into a matrix of tf-idf features. The vectorizer will **ignore** English stop words, words that occur in **more than 80%** of the movies, and words that occur in **fewer than 2 movies**. This will help us **reduce the noise** in the dataset.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=2)

metadata.overview = metadata.overview.fillna('')
tfidf_model = vectorizer.fit_transform(metadata.overview)
print(f'Matrix contains {tfidf_model.shape[0]} movies and {tfidf_model.shape[1]} words')

Matrix contains 4538 movies and 10083 words


### Inspect the tf-idf model

What does the tf-idf model look like?
Let us inspect the columns for common plot terms such as 'love', 'young', and 'story'. Because the matrix is sparse, expect most of the entries to be zero.

In [None]:
popular_terms = ['life', 'young', 'man', 'film', 'new', 'love', 'story', 'world']
columns = vectorizer.get_feature_names_out()
tfidf_model_df = pd.DataFrame.sparse.from_spmatrix(tfidf_model, columns=columns)
tfidf_model_df[popular_terms].head()

Unnamed: 0,life,young,man,film,new,love,story,world
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.078801
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.093418,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Find similar movies

We'll use a K-Nearest Neighbors (KNN) model to find similar movies. As you might remember from earlier lessons, every KNN model uses a **distance metric** to find the nearest neighbors. In this case we're going to use the **cosine distance**, which is one minus the **cosine similarity**. The cosine similarity of two non-zero vectors is the cosine of the angle between them.

### Cosine Similarity

The cosine similarity of two vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as

$\cos(\mathbf{x},\mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert\mathbf{x}\rVert \, \lVert\mathbf{y}\rVert}$
<br/>
$\phantom{\cos(\mathbf{x},\mathbf{y})} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$

where $\lVert\mathbf{x}\rVert$ and $\lVert\mathbf{y}\rVert$ are the Euclidean norms of $\mathbf{x}$ and $\mathbf{y}$, $n$ is the size of the vocabulary, and $x_i$ and $y_i$ are the tf-idf weights of the $i$th word in the two documents.

For the tf-idf vectors used here, the cosine similarity is:

* a **normalized dot product**.
* **independent of the magnitude** of the vectors.
* **zero** if the two vectors are **orthogonal** (the documents share no terms) and **one** if the two vectors point in the **same direction** (for example, when they are equal).
* **symmetric**: the similarity between A and B is the same as the similarity between B and A.
* **non-negative**, because tf-idf weights are never negative.
* **bounded** between 0 and 1. For general vectors with negative entries, it ranges from -1 to 1.
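The properties listed above can be checked numerically. The following is a minimal sketch with NumPy; the function name `cosine_similarity_np` and the example vectors are assumptions made for this illustration.

```python
import numpy as np


def cosine_similarity_np(x, y):
    """Cosine similarity of two non-zero vectors: dot product over product of norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a
d = np.array([1.0, 0.0, 1.0])  # shares one non-zero component with a

print("same direction:", cosine_similarity_np(a, b))  # magnitude does not matter
print("orthogonal:    ", cosine_similarity_np(a, c))
print("symmetric:     ", cosine_similarity_np(a, d), cosine_similarity_np(d, a))
```

All example vectors have non-negative entries, like tf-idf vectors, so every result falls between 0 and 1.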

We're going to compute the **cosine similarity** between different movies based on their plot summary _signature_, that is, the tf-idf vector representation of the plot summary. The higher the cosine similarity, the more similar the movies are.

scikit-learn provides a function, `cosine_similarity`, to compute the cosine similarity between vectors. We could use that function to compare the plot summaries of the movies directly, but in this case we're going to use a KNN model to find the nearest neighbors of a movie. The KNN model has a parameter called `metric` that we can set to `cosine`. It then ranks neighbors by cosine distance (one minus the cosine similarity), so the nearest neighbors are the most similar movies.
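The relationship between the two approaches can be sketched on a small dense matrix (the numbers are made up for illustration). With `metric='cosine'`, the distances returned by `NearestNeighbors` should equal one minus the values returned by `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Four toy "documents" in a three-word vocabulary (values invented for illustration)
X = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
])

# Similarity of the first row to every row (including itself)
sims = cosine_similarity(X[:1], X)[0]

# Nearest neighbors of the first row, ranked by cosine distance
knn = NearestNeighbors(n_neighbors=4, metric='cosine').fit(X)
distances, indices = knn.kneighbors(X[:1])

print("similarities:   ", np.round(sims, 3))
print("neighbor order: ", indices[0])
print("distances:      ", np.round(distances[0], 3))
```

The query row is its own nearest neighbor (distance 0), which is why a movie shows up first in its own list of neighbors.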


In [None]:
from sklearn.neighbors import NearestNeighbors


def get_content_based_recommendation(title, top_n=10, metric='cosine'):
    # Get the index of the movie that matches the title
    # we'll use that index to locate the row in the tf-idf matrix that corresponds to that movie
    idx = metadata[metadata.title.str.lower() == title.lower()].index[0]

    model = NearestNeighbors(n_neighbors=top_n, metric=metric)
    model.fit(tfidf_model)
    similar_movies = model.kneighbors(tfidf_model[idx], return_distance=False)[0]

    # Return the top_n nearest movies; the queried movie itself is normally the first row (distance 0)
    return metadata.iloc[similar_movies]

In [None]:
get_content_based_recommendation('I am legend')[['title', 'genres', 'vote_average', 'vote_count', 'overview']]

Unnamed: 0,title,genres,vote_average,vote_count,overview
2335,I Am Legend,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name...",6.9,4977.0,Robert Neville is a scientist who was unable t...
2108,Snakes on a Plane,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",5.1,504.0,America is on the search for the murderer Eddi...
1938,Land of the Dead,"[{'id': 27, 'name': 'Horror'}]",6.0,395.0,The world is full of zombies and the survivors...
1531,28 Days Later,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",7.1,1816.0,Twenty-eight days after a killer virus was acc...
2709,Pontypool,"[{'id': 27, 'name': 'Horror'}, {'id': 9648, 'n...",6.6,187.0,When disc jockey Grant Mazzy reports to his ba...
1283,Osmosis Jones,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",6.0,237.0,"A policeman white blood cell, with the help of..."
2353,P.S. I Love You,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",7.0,1011.0,A young widow discovers that her late husband ...
3812,Left Behind,"[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...",3.7,396.0,A small group of survivors are left behind aft...
20,Twelve Monkeys,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",7.4,2470.0,"In the year 2035, convict James Cole reluctant..."
4114,Before We Go,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",6.5,490.0,A woman who is robbed on her way to catch the ...


In [None]:
get_content_based_recommendation('The Matrix')[['title', 'genres', 'vote_average', 'vote_count', 'overview']]


Index of The Matrix is 810


Unnamed: 0,title,genres,vote_average,vote_count,overview
810,The Matrix,"[{'id': 28, 'name': 'Action'}, {'id': 878, 'na...",7.9,9079.0,"Set in the 22nd century, The Matrix tells the ..."
55,Hackers,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",6.2,406.0,"Along with his new friends, a teenager who was..."
3785,Who Am I,"[{'id': 53, 'name': 'Thriller'}]",7.6,430.0,"Benjamin, a young German computer whiz, is inv..."
1851,The Animatrix,"[{'id': 16, 'name': 'Animation'}, {'id': 878, ...",6.9,433.0,Straight from the creators of the groundbreaki...
1557,Commando,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",6.4,753.0,"John Matrix, the former leader of a special co..."
2701,Avatar,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",7.2,12114.0,"In the 22nd century, a paraplegic Marine is di..."
3669,Transcendence,"[{'id': 53, 'name': 'Thriller'}, {'id': 878, '...",5.9,2339.0,Two leading computer scientists work toward th...
3618,The Zero Theorem,"[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name...",5.9,383.0,A computer hacker's goal to discover the reaso...
2261,Live Free or Die Hard,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",6.4,2122.0,"John McClane is back and badder than ever, and..."
2583,Angels & Demons,"[{'id': 53, 'name': 'Thriller'}, {'id': 9648, ...",6.5,2183.0,Harvard symbologist Robert Langdon investigate...
