# CONTENT-BASED RECOMMENDER SYSTEM
### PROJECT OVERVIEW
A content-based recommender system recommends items to a user based on the similarity between a candidate item and the items that user has already interacted with. It relies on the characteristics (content) of the items to make recommendations, hence the name. In this project, I use Natural Language Processing (NLP) techniques to build a content-based recommender that takes a movie title as input and recommends the 20 most similar movies to users who have watched that movie.
### BRIEF THEORETICAL BACKGROUND
The task is to compute the cosine similarity for every pair of movies in the dataset; each value measures how similar the two movies are. Only the overview column of the dataset is used, so the cosine similarity of a pair of movies is a numerical representation of the similarity between their descriptions (overviews).

The project starts with data cleaning and preparation. After cleaning, each movie's overview (one row) is treated as a document, the corpus is the collection of all such documents, and the vocabulary is the set of unique words across the corpus. Each document is then vectorized with the Term Frequency-Inverse Document Frequency (TF-IDF) approach, which measures how important a term is within a document relative to the corpus. TF-IDF is one of the most common of the many text vectorization scoring schemes.
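 - $\text{TF-IDF}(w, d) = \text{TF}(w, d) \times \text{IDF}(w)$, with $\text{TF}(w, d) = n(w)/n$, where $n(w)$ is the number of times the word $w$ appears in document $d$ and $n$ is the total number of words in $d$.
 - $\text{IDF}(w) = \log\big(N/N(w)\big)$, where $N$ is the total number of documents in the corpus and $N(w)$ is the number of documents in which $w$ appears. The logarithm dampens the ratio so that rare words are weighted up without dominating. scikit-learn's `TfidfVectorizer`, used below, applies a smoothed variant by default, $\ln\frac{1+N}{1+N(w)} + 1$, on raw term counts, and then scales each document vector to unit Euclidean length.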
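
As a rough illustration of the two scores, the sketch below computes TF-IDF by hand on a tiny made-up corpus. It assumes the common logarithmic form of IDF, log(N/N(w)), and plain whitespace tokenization; the sentences and the helper names `tf`, `idf` and `tf_idf` are hypothetical and not part of the project code. The numbers will differ from scikit-learn's output, which smooths the IDF and normalizes each vector.

```python
import math

# Toy corpus (hypothetical): each string stands in for one movie overview.
docs = [
    "a marine explores an alien moon",
    "a pirate sails to the edge of the world",
    "an alien invasion threatens the world",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # total number of documents in the corpus


def tf(word, doc_tokens):
    """Term frequency: occurrences of `word` divided by the document length."""
    return doc_tokens.count(word) / len(doc_tokens)


def idf(word):
    """Inverse document frequency, assuming the unsmoothed form log(N / N(w))."""
    n_w = sum(1 for tokens in tokenized if word in tokens)
    return math.log(N / n_w)


def tf_idf(word, doc_tokens):
    return tf(word, doc_tokens) * idf(word)


# "marine" occurs in one document, "alien" in two, so "marine" scores higher
# in the first document even though both words appear there exactly once.
print(tf_idf("marine", tokenized[0]))
print(tf_idf("alien", tokenized[0]))
```

A word that appears in every document gets an IDF of log(1) = 0 under this form, which is one reason library implementations often add smoothing.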
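
The cosine similarity between two document vectors $X$ and $Y$ is defined as follows:
 - $\cos(X, Y) = \dfrac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$, where $X \cdot Y$ is the dot product and $\lVert \cdot \rVert$ is the Euclidean norm. The cosine similarity matrix holds this value for every pair of documents.

Based on the cosine similarity values, a function will be defined that takes a movie title and returns the 20 most similar movies.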
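
The cosine formula can be checked on small vectors. The following is a minimal sketch with NumPy; the vectors and the helper `cosine` are made up for illustration. It also shows that once vectors are scaled to unit length, the plain dot product gives the same value.

```python
import numpy as np


def cosine(x, y):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 4.0, 0.0])  # same direction as x, different length
z = np.array([0.0, 0.0, 3.0])  # orthogonal to x

print(cosine(x, y))  # close to 1: the vectors point the same way
print(cosine(x, z))  # 0: the vectors share no non-zero component

# After scaling to unit length, the dot product alone is the cosine similarity.
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)
print(np.isclose(np.dot(x_unit, y_unit), cosine(x, y)))
```

Because the score depends only on direction, a long overview and a short one that use the same words in the same proportions are still rated as highly similar.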
### DATASET
The data used for this project is The Movie Database (TMDB) dataset (tmdb_5000_movies), downloaded from Kaggle (https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata) in October 2022.

#### STEP 1: LIBRARY AND DATA IMPORTATION

In [1]:
import pandas as pd
df = pd.read_csv('tmdb_5000_movies.csv')
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [2]:
df.shape

(4803, 20)

In [3]:
df['overview'].head()

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

#### STEP 2: DATA CLEANING/PREPARATION

In [4]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
# Define a TF-IDF vectorizer object that removes English stop words such as 'the' and 'a'
tfidf = TfidfVectorizer(stop_words='english')
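
To see what the `stop_words='english'` setting changes, the sketch below fits two vectorizers on a couple of made-up sentences and compares the learned vocabularies. The sentences and the names `with_stop` and `without_stop` are hypothetical, chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up overviews, used only to illustrate stop-word removal.
sample = [
    "The marine is sent to the moon",
    "A pirate sails to the edge of the world",
]

with_stop = TfidfVectorizer().fit(sample)                         # keeps stop words
without_stop = TfidfVectorizer(stop_words='english').fit(sample)  # drops them

# vocabulary_ maps each learned term to its column index in the TF-IDF matrix.
print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```

Words such as 'the' and 'to' occur in almost every overview, so dropping them keeps the vocabulary smaller and focuses the similarity scores on more descriptive terms.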


In [6]:
# Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')


#### STEP 3: COMPUTATION OF TF-IDF MATRIX AND COSINE SIMILARITY MATRIX

In [7]:
#Construction of the TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['overview'])

In [8]:
# Output the shape of tfidf_matrix. The shape (4803, 20978) indicates 4803 overviews and 20978 unique terms in the vocabulary
tfidf_matrix.shape

(4803, 20978)

In [9]:
#Importation of linear_kernel from sklearn
from sklearn.metrics.pairwise import linear_kernel

In [10]:
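# Compute the cosine similarity matrix. linear_kernel returns plain dot products, which equal
# cosine similarities here because TfidfVectorizer scales each row to unit length by default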
cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

In [11]:
cosine_similarity.shape

(4803, 4803)

In [12]:
cosine_similarity[3]

array([0.02499512, 0.        , 0.        , ..., 0.03386366, 0.04275232,
       0.02269198])
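
The use of `linear_kernel` for cosine similarity relies on the TF-IDF rows having unit length. The sketch below checks this on a small made-up corpus by comparing `linear_kernel` with scikit-learn's own `cosine_similarity` function, imported here under the alias `sk_cosine_similarity` so it does not clash with the `cosine_similarity` variable in the notebook. The sentences and the `toy_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

# Made-up overviews, used only for this check.
toy = [
    "Marines fight aliens on a distant moon",
    "Pirates hunt treasure across the seas",
    "Aliens build a colony on a distant moon",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy)

# With the default norm='l2', every non-empty row has Euclidean length 1.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))

# The dot product of unit vectors is their cosine similarity.
dot = linear_kernel(toy_matrix, toy_matrix)
cos = sk_cosine_similarity(toy_matrix, toy_matrix)
print(np.allclose(dot, cos))
```

The shortcut would not hold if the vectorizer were created with `norm=None`. An empty overview produces an all-zero row, which yields a similarity of 0 with every movie under both functions.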

#### STEP 4: PREDICTIONS
Here, I define a function that takes a movie title as input and outputs a list of the 20 most similar movies. I then use it to generate recommendations.

In [13]:
# Construct a reverse map from movie title to row index, keeping the first row for any duplicated title
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]
indices[:20]

title
Avatar                                          0
Pirates of the Caribbean: At World's End        1
Spectre                                         2
The Dark Knight Rises                           3
John Carter                                     4
Spider-Man 3                                    5
Tangled                                         6
Avengers: Age of Ultron                         7
Harry Potter and the Half-Blood Prince          8
Batman v Superman: Dawn of Justice              9
Superman Returns                               10
Quantum of Solace                              11
Pirates of the Caribbean: Dead Man's Chest     12
The Lone Ranger                                13
Man of Steel                                   14
The Chronicles of Narnia: Prince Caspian       15
The Avengers                                   16
Pirates of the Caribbean: On Stranger Tides    17
Men in Black 3                                 18
The Hobbit: The Battle of the Five Armies      19

In [14]:
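# Define the function that takes a movie title as input and outputs the 20 most similar movies
def get_recommendations(title, cosine_similarity=cosine_similarity):
    # Row index of the movie that matches the title
    idx = indices[title]
    # Pair every movie index with its similarity to the input movie
    similarity_scores = list(enumerate(cosine_similarity[idx]))
    # Sort the pairs by similarity score, highest first
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (expected to be the input movie itself) and keep the next 20
    similarity_scores = similarity_scores[1:21]
    movie_indices = [i[0] for i in similarity_scores]
    # Return the titles of the selected movies
    return df['title'].iloc[movie_indices]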

In [15]:
get_recommendations('Avatar')

3604                          Apollo 18
2130                       The American
634                          The Matrix
1341               The Inhabited Island
529                    Tears of the Sun
1610                              Hanna
311        The Adventures of Pluto Nash
847                            Semi-Pro
775                           Supernova
2628                Blood and Chocolate
942                    The Book of Life
1033                           Insomnia
2767                      Birthday Girl
570                              Ransom
1213        Aliens vs Predator: Requiem
2967         E.T. the Extra-Terrestrial
2875                         Two Lovers
3070            Jeff, Who Lives at Home
36      Transformers: Age of Extinction
1013                           Child 44
Name: title, dtype: object
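
The lookup-and-rank step can also be written with NumPy's `argsort`. The sketch below is one possible variant, not the function used above: the helper `recommend`, its parameters, and the four-movie `toy_df` are hypothetical. It excludes the query movie by index rather than by position and raises a `KeyError` when the title is unknown.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up catalogue standing in for the TMDB data.
toy_df = pd.DataFrame({
    "title": ["Space Marines", "Alien Moon", "Pirate Seas", "Ocean Pirates"],
    "overview": [
        "Marines fight aliens on a distant moon",
        "Aliens build a colony on a distant moon",
        "Pirates hunt treasure across the seas",
        "Pirates battle the navy across the seas",
    ],
})
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_df["overview"])
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)


def recommend(title, titles, similarity, top_n=20):
    """Return up to `top_n` titles most similar to `title` (hypothetical helper)."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"Title not found: {title!r}")
    idx = matches[0]  # use the first row if the title appears more than once
    # Indices sorted by descending similarity; a stable sort keeps ties in row order.
    order = np.argsort(-similarity[idx], kind="stable")
    # Drop the query movie by index instead of assuming it is ranked first.
    order = order[order != idx][:top_n]
    return titles.iloc[order]


print(recommend("Space Marines", toy_df["title"], toy_sim, top_n=2))
```

Excluding the query by index guards against the case where another movie with an identical overview ties with the input movie for the top score.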