# Imports

In [1]:
from ast import literal_eval
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Data loading

**'tmdb_5000_credits.csv' contains the following features:**

* movie_id - A unique identifier for each movie.
* title - Title of the movie.
* cast - The names of the lead and supporting actors.
* crew - The names of the director, editor, composer, writer, etc.

**'tmdb_5000_movies.csv' contains the following features:**

* budget - The budget with which the movie was made.
* genres - The genres of the movie (Action, Comedy, Thriller, etc.).
* homepage - A link to the homepage of the movie.
* id - This is in fact the movie_id from the first dataset.
* keywords - The keywords or tags related to the movie.
* original_language - The language in which the movie was made.
* original_title - The title of the movie before translation or adaptation.
* overview - A brief description of the movie.
* popularity - A numeric quantity specifying the movie popularity.
* production_companies - The production house of the movie.
* production_countries - The country in which it was produced.
* release_date - The date on which it was released.
* revenue - The worldwide revenue generated by the movie.
* runtime - The running time of the movie in minutes.
* status - "Released" or "Rumored".
* tagline - Movie's tagline.
* title - Title of the movie.
* vote_average - The average rating the movie received.
* vote_count - The number of votes received.

In [2]:
df_credits = pd.read_csv('input/tmdb-movie-metadata/tmdb_5000_credits.csv')
df_movies = pd.read_csv('input/tmdb-movie-metadata/tmdb_5000_movies.csv')

In [3]:
print('Credits:', df_credits.columns)
print('Movie:', df_movies.columns)

In [4]:
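# Renaming 'movie_id' to 'id' so both DataFrames share a join key, and dropping
# the credits 'title' column, which duplicates the one already in df_movies
df_credits = df_credits.rename(columns={'movie_id': 'id'}).drop(columns='title')

# Merging datasets based on movie id
df_movies = df_movies.merge(df_credits, on='id')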

# **Demographic Filtering**
Before proceeding, it's crucial to establish a metric for rating movies and calculate the score for each film. Once scores are computed, sorting them allows for recommending the highest-rated movies to users.

Utilizing the average ratings of movies as the sole metric may result in unfair evaluations. For instance, a movie with a high average rating but few votes shouldn't rank higher than a film with a lower average rating but significantly more votes.

To address this, IMDb's weighted rating (***wr***) formula is employed:

![Weighted Rating](resources/weighted_rating.png)

where,
* ***v*** is the number of votes for the movie;
* ***m*** is the minimum votes required to be listed in the chart;
* ***R*** is the average rating of the movie; and
* ***C*** is the mean vote across the whole report.
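
Written out, the formula in the image corresponds to the expression used by `weighted_rating()` further down in this notebook:

$$
wr = \frac{v}{v + m} \cdot R + \frac{m}{v + m} \cdot C
$$

As ***v*** grows relative to ***m***, the score approaches the movie's own average ***R***; with few votes it stays close to the global mean ***C***.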

Given that ***v*** (vote_count) and ***R*** (vote_average) are already available, ***C*** can be calculated as:

In [5]:
C = df_movies['vote_average'].mean()
C

Given that the mean rating for all movies is approximately 6 out of 10, the subsequent step involves establishing an optimal value for m, representing the minimum votes necessary for inclusion in the chart. Utilizing the 90th percentile as the cutoff, it's stipulated that for a movie to be featured in the charts, it must amass more votes than at least 90% of the movies within the dataset.

In [6]:
m = df_movies['vote_count'].quantile(0.9)
m

At this point, the movies that meet the criteria for chart inclusion can be selected.

In [7]:
q_movies = df_movies.copy().loc[df_movies['vote_count'] >= m]
q_movies.shape

Observing that 481 movies meet the criteria for inclusion in the list, the next step involves calculating the metric for each qualified movie. This will be accomplished by defining a function named **weighted_rating()** and introducing a new feature called **score**. This score will be calculated by applying the defined function to the DataFrame of qualified movies.

In [8]:
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v / (v + m) * R) + (m / (m + v) * C)

In [9]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
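
To see how the formula behaves, here is a minimal standalone sketch. The function name `weighted_rating_sketch` and the values of `m_demo` and `C_demo` are assumptions chosen for illustration; they are not taken from the dataset.

```python
def weighted_rating_sketch(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, chosen only for illustration
m_demo = 1000   # assumed vote threshold
C_demo = 6.0    # assumed global mean rating

niche = weighted_rating_sketch(v=50, R=9.0, m=m_demo, C=C_demo)
popular = weighted_rating_sketch(v=10_000, R=8.0, m=m_demo, C=C_demo)

print(f"niche:   {niche:.2f}")    # 9.0 average, but only 50 votes
print(f"popular: {popular:.2f}")  # 8.0 average with 10,000 votes
```

With these made-up numbers the sparsely rated movie scores about 6.14 and the heavily rated one about 7.82, so the ordering of the raw averages is reversed.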

Ultimately, the DataFrame will be sorted based on the score feature, and the titles, vote count, vote average, and weighted rating (score) of the top 10 movies will be presented.

In [10]:
q_movies = q_movies.sort_values('score', ascending=False)

# Printing the top 10 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

In [11]:
pop = df_movies.sort_values('budget', ascending=False)

plt.figure(figsize=(12,4))
plt.barh(pop['title'].head(6),pop['budget'].head(6), align='center', color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel("Budget in hundreds of millions")
plt.title("The most expensive films to produce")

In [12]:
pop = df_movies.sort_values('popularity', ascending=False)

plt.figure(figsize=(12,4))
plt.barh(pop['title'].head(6),pop['popularity'].head(6), align='center', color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel("Calculated value indicating popularity")
plt.title("The most popular movies")

It's important to note that demographic recommenders offer a general list of recommended movies to all users, without considering individual interests and tastes. This limitation prompts the transition to a more refined system: Content-Based Filtering.

# Content Based Filtering

In this type of recommender system, the content of each movie - like its summary, cast, crew, keywords, and tagline - is used to find similar movies. Then, those similar movies are recommended because they're likely to match what a user enjoys.

![Similarities](resources/similarities.png)

## **Plot description based Recommender**

We'll calculate similarity scores between all pairs of movies based on their plot descriptions and make recommendations based on those scores. The plot descriptions are provided in the **overview** feature of our dataset.

Each overview first needs to be converted into a numeric vector. This is achieved by computing Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview.

Term Frequency (TF) is the relative frequency of a word in a document: the number of times the term appears divided by the total number of terms in the document. Inverse Document Frequency (IDF) reflects how rare a term is across the corpus: the logarithm of the ratio of the total number of documents to the number of documents containing the term. The importance of a word to a document is then calculated as TF multiplied by IDF.

This process results in a matrix where each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie. The weighting reduces the influence of words that occur frequently across plot overviews, and therefore their influence on the final similarity score.
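
The textbook definitions above can be sketched in plain Python. The tokenized `docs` below are hypothetical, and the helpers `tf`, `idf`, and `tfidf` are illustrative names. Note that sklearn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized overviews (not taken from the dataset)
docs = [
    ["hero", "saves", "city"],
    ["hero", "fights", "villain"],
    ["chef", "opens", "restaurant"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # Log of (total documents / documents containing `term`);
    # assumes `term` occurs in at least one document
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

common = tfidf("hero", docs[0], docs)  # "hero" appears in 2 of 3 documents
rare = tfidf("city", docs[0], docs)    # "city" appears in 1 of 3 documents
print(common, rare)
```

Both words occur once in the first document, yet "city" receives the higher weight because it appears in fewer documents.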

In [13]:
#Define a TF-IDF Vectorizer Object; removing all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replacing NaN with an empty string
df_movies['overview'] = df_movies['overview'].fillna('')

tfidf_matrix = tfidf.fit_transform(df_movies['overview'])
tfidf_matrix.shape

Upon examination, it's noted that over 20,000 different words were utilized to describe the 4,800 movies in our dataset.

With this matrix available, the computation of a similarity score can now be undertaken. Several options exist for this task, such as the Euclidean, Pearson, and cosine similarity scores. There isn't a definitive answer to which score is optimal; different metrics perform better in different scenarios. Therefore, it's often advisable to experiment with various metrics.

For this analysis, the cosine similarity will be employed to calculate a numeric quantity denoting the similarity between two movies. This choice is made because cosine similarity is independent of magnitude and is relatively straightforward and quick to compute. Mathematically, it is defined as follows:

![Cosine Similarity](resources/cosine_similarity.png)

Because **TfidfVectorizer** L2-normalizes each row by default, the dot product of two rows is already their cosine similarity. Consequently, **linear_kernel()** from sklearn will be used instead of **cosine_similarity()**, since it skips the redundant normalization and is faster.

In [14]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
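
The claim that a plain dot product equals the cosine similarity for unit-length vectors can be checked by hand. This is a small sketch with made-up vectors `a` and `b`; the helper names are illustrative and not part of sklearn.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(a):
    # Scale the vector to unit Euclidean length
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical vectors standing in for two rows of a term matrix
a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

cos_ab = cosine(a, b)
dot_of_unit_vectors = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_of_unit_vectors)
```

The two printed values agree up to floating-point rounding, which is why the normalization step can be skipped once the rows are already unit length.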

Next, a function will be defined that takes a movie title as input and returns the 10 most similar movies. This requires a reverse mapping from movie titles to DataFrame indices, that is, a way to look up the index of a movie in the metadata DataFrame from its title.

In [15]:
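# Keeping the first row per title so that indices[title] returns a single position
indices = pd.Series(df_movies.index, index=df_movies['title']).loc[lambda s: ~s.index.duplicated()]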

The recommendation function will now be defined, following these steps:

1. Obtain the index of the movie based on its title.
2. Retrieve the list of cosine similarity scores for that specific movie with all other movies, and convert it into a list of tuples. Each tuple will contain the position (index) of the movie and its corresponding similarity score.
3. Sort the list of tuples based on the similarity scores (the second element).
4. Extract the top 10 elements from this sorted list. Disregard the first element as it corresponds to the movie itself (the most similar movie to any movie is the movie itself).
5. Return the titles associated with the indices of the top elements.

In [16]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies, skipping the movie itself
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df_movies['title'].iloc[movie_indices]
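
The same steps can be traced on a toy example without pandas. The titles and the similarity matrix below are made up, and `top_similar` is a hypothetical helper. As a small variation on the function above, it excludes the query movie by position instead of dropping the first sorted element, which also holds up if another movie happens to tie at similarity 1.0.

```python
# Hypothetical titles and a hand-written symmetric similarity matrix
titles = ["Alpha", "Beta", "Gamma", "Delta"]
sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
]
title_to_idx = {t: i for i, t in enumerate(titles)}

def top_similar(title, n=2):
    idx = title_to_idx[title]                       # step 1: title -> row position
    scored = list(enumerate(sim[idx]))              # step 2: (position, score) pairs
    scored = [p for p in scored if p[0] != idx]     # exclude the query movie by position
    scored.sort(key=lambda p: p[1], reverse=True)   # step 3: highest score first
    return [titles[i] for i, _ in scored[:n]]       # steps 4-5: keep n, map to titles

print(top_similar("Alpha"))  # ['Gamma', 'Beta']
```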

In [17]:
get_recommendations('The Dark Knight Rises')
get_recommendations('The Avengers')

The current system has effectively identified movies with similar plot descriptions. However, the quality of recommendations leaves room for improvement. For instance, "The Dark Knight Rises" retrieves all Batman movies, whereas it's more plausible that individuals who enjoyed that movie are inclined to appreciate other films by Christopher Nolan. This nuance cannot be captured by the existing system.

## **Credits, Genres and Keywords Based Recommender**

The quality of the recommender would likely improve with richer metadata. A recommender will be built based on the following metadata: the top 3 actors, the director, related genres, and the movie plot keywords.

To accomplish this, the task involves extracting the three most important actors, the director, the genres, and the keywords associated with each movie from the *cast*, *crew*, *genres*, and *keywords* features. Presently, the data is in the form of "stringified" lists, necessitating conversion into a structured and usable format.

In [18]:
# Parsing the stringified features into their corresponding python objects
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df_movies[feature] = df_movies[feature].apply(literal_eval)
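
As a standalone illustration of what the parsing step does, the sketch below applies `literal_eval` to a single string. The value of `raw` is a hypothetical cell shaped like the stringified lists described above, not a row copied from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value, shaped like the stringified lists described above
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)                 # str -> list of dicts
names = [item["name"] for item in parsed]  # pull out just the names

print(type(raw).__name__, "->", type(parsed).__name__)
print(names)
```

`literal_eval` only evaluates Python literals (strings, numbers, lists, dicts, and so on), which makes it a safer choice than `eval` for data read from a file.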

In [19]:
# Extracting the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [20]:
# Returning the first 3 elements of the list, or the entire list if it has fewer than 3.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names

    return []

In [21]:
# Defining new director, cast, genres and keywords features that are in a suitable form.
df_movies['director'] = df_movies['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df_movies[feature] = df_movies[feature].apply(get_list)

In [22]:
# Converting all strings to lower case and stripping spaces from names
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Checking if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [23]:
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df_movies[feature] = df_movies[feature].apply(clean_data)
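
One plausible reason for removing spaces is to keep a vectorizer from treating a shared first name as common ground between two different people. The sketch below illustrates this with hypothetical input strings; `squash` is an illustrative helper that mirrors the string branch of `clean_data`.

```python
def squash(name):
    # Lowercase and remove spaces so a full name becomes one token
    return name.lower().replace(" ", "")

# Hypothetical names that share a first name
raw_names = ["Chris Evans", "Chris Pratt"]

tokens_raw = [set(n.lower().split()) for n in raw_names]
tokens_squashed = [{squash(n)} for n in raw_names]

print(tokens_raw[0] & tokens_raw[1])            # shared token: {'chris'}
print(tokens_squashed[0] & tokens_squashed[1])  # no shared token: set()
```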

Now, a "metadata soup" can be created, consisting of a string containing all the metadata intended to be fed to the vectorizer, namely keywords, actors, director, and genres.

In [24]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [25]:
df_movies['soup'] = df_movies.apply(create_soup, axis=1)

The subsequent steps follow a similar approach to our plot description-based recommender. One significant difference lies in employing **CountVectorizer()** instead of TF-IDF. This decision is based on the intention to avoid diminishing the importance of an actor or director who has acted in or directed a relatively large number of movies. This adjustment seems more intuitive in achieving our goal.

In [26]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df_movies['soup'])

In [27]:
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
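
To make the count-based similarity concrete, here is a plain-Python sketch that uses `collections.Counter` in place of `CountVectorizer`. The soups, the placeholder names in them, and the helper `count_cosine` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical soups in the same "keywords cast director genres" layout
soups = {
    "Movie A": "heist dream janedoe directorx action",
    "Movie B": "space time johnroe directorx adventure",
    "Movie C": "wedding family alexpoe directory comedy",
}

def count_cosine(text_a, text_b):
    # Raw token counts: a token is not down-weighted for appearing in many soups
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

ab = count_cosine(soups["Movie A"], soups["Movie B"])  # share the director token
ac = count_cosine(soups["Movie A"], soups["Movie C"])  # share no tokens
print(ab, ac)
```

With raw counts, a shared director contributes the same amount to the similarity no matter how many other movies that director appears in, which is the behaviour the paragraph above argues for.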

In [28]:
# Resetting the index of our main DataFrame and constructing the reverse mapping as before
df_movies = df_movies.reset_index()
indices = pd.Series(df_movies.index, index=df_movies['title']).loc[lambda s: ~s.index.duplicated()]

In [29]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

**The model was created based on the model from the Ibtesam Ahmed notebook "Getting Started with a Movie Recommendation System"**  
https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system