# Content-Based Recommender System for Movies: An End-To-End Project

by [Sumit Pokharel](https://github.com/psumitcode), Ritsumeikan Asia Pacific University (last updated: Jan 5, 2023)

**Objective**: To build a recommender system that can recommend 5 similar movies when a user picks 1 movie from the database.

The dataset I used is the [tmdb-movie-dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata), taken from Kaggle.

The final product is a website that can be accessed [via this link](https://cine-recs.com/). More details are in [the deployment section](#4-streamlit-and-deployment).

## Table of Contents
- [1. Preprocessing](#1-preprocessing)

    - [1.1 Importing the data and merging](#1.1-importing-the-data-and-merging)
    - [1.2 Feature extraction](#1.2-feature-extraction)
    - [1.3 Completing the feature extraction: 3 steps](#1.3-completing-the-feature-extraction-3-steps)
        - [Step 1: Getting rid of duplicate and missing data](#step-1-getting-rid-of-duplicate-and-missing-data)
        - [Step 2: Converting some columns into better formats](#step-2-converting-some-columns-into-better-formats)
        - [Step 3: Building a new dataframe from the dataframe above](#step-3-building-a-new-dataframe-from-the-dataframe-above)
<!-- Blank line -->
- [2. Text Vectorization](#2-text-vectorization)

- [3. Creating the Main Function for Recommendation](#3-creating-the-main-function-for-recommendation)

- [4. Streamlit and Deployment](#4-streamlit-and-deployment)

**Keywords:** text vectorization, pandas, numpy, scikit-learn, nltk, streamlit, pickle, render

<a name="1"></a>
## 1. Preprocessing

<a name="1.1"></a>
### 1.1 Importing the data and merging

In [None]:
import pandas as pd
import numpy as np

In [None]:
movies = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_movies.csv")
credits = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_credits.csv")

In [None]:
movies.head()

In [None]:
credits.head()

In [None]:
movies = movies.merge(credits, on = "title")
# Merging the two csv files
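
One caveat with merging on `title`: if two different movies share a title, every matching pair of rows is kept, so the merged frame gains extra rows. The sketch below uses two tiny made-up frames (`movies_toy` and `credits_toy` are hypothetical, not the Kaggle data) to show the effect, and compares it with merging on the id columns, under the assumption that `id` in the movies file and `movie_id` in the credits file refer to the same movie.

```python
import pandas as pd

# Hypothetical toy data: two different movies happen to share the title "Batman"
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Up"]})
credits_toy = pd.DataFrame(
    {"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Up"], "cast": ["a", "b", "c"]}
)

# Merging on title pairs each "Batman" row with both "Batman" credit rows
by_title = movies_toy.merge(credits_toy, on="title")

# Merging on the id columns keeps exactly one row per movie
by_id = movies_toy.merge(
    credits_toy, left_on="id", right_on="movie_id", suffixes=("", "_credits")
)

print(len(by_title))  # 5 rows: 4 for "Batman" (2 x 2) plus 1 for "Up"
print(len(by_id))     # 3 rows
```

This notebook keeps the merge on `title`; the id-based merge is only shown as an alternative worth considering.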

In [None]:
movies.head()
#how it looks after combining two csv files

In [None]:
movies.info()

<a name="1.2"></a>
### 1.2 Feature extraction

**Some unnecessary columns include:**
* `budget`
* `homepage`
* `original_language` may not be very useful here: the movies are mostly in English and the other languages are sparsely represented, so I'll leave it out of this project.
* `original_title`
* `popularity` could be important, but not for this project.
* `production_countries`
* `release_date`: I might think about how to incorporate this once I'm done with the project, but for now I'll skip it.
* `revenue`
* `runtime`
* `spoken_languages`
* `status`
* `tagline` is vague. `overview` is better, so I'm keeping that instead.
* `vote_average`
* `vote_count`
* `movie_id`

**On the other hand, some important columns are:**
* `genres`
* `id`
* `keywords`
* `title`
* `overview` for content similarity
* `cast`
* `crew`

In [None]:
movies = movies[["id", "title", "genres", "cast", "overview", "keywords", "crew"]]
movies.head()

<a name="1.3"></a>
### 1.3 Completing the feature extraction: 3 steps

1. Get rid of duplicate and missing data.

2. Convert some columns into better formats.

3. Build a new dataframe from the dataframe above. The dataframe will have:
    1. `id`
    2. `title`
    3. `tags`\*

\*`tags` will be made by merging `genres`, `cast`, `overview`, `keywords`, and `crew`.

* From `cast`, we'll extract only the top 3 cast names.

* From `crew`, we'll extract only the name of the director.

<a name="1.3.1"></a>
#### Step 1: Getting rid of duplicate and missing data

In [None]:
# step 1a: getting rid of missing data
movies.isnull().sum()

In [None]:
movies.dropna(inplace = True)

In [None]:
# step 1b: getting rid of duplicate data
movies.duplicated().sum()

*No duplicate values, so we're done with step 1.*

<a name="1.3.2"></a>
#### Step 2: Converting some columns into better formats

We will convert the columns `genres`, `cast`, `overview`, `keywords`, and `crew` into better formats.
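To make the conversion concrete, here is a minimal sketch of what happens to one cell. The `raw` string is a made-up example in the same shape as the `genres` column (a string that looks like a list of dictionaries), not a value copied from the dataset.

```python
import ast

# Hypothetical cell value: a *string*, not a list
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# ast.literal_eval safely parses the string into a real list of dicts
parsed = ast.literal_eval(raw)

# Keep only the "name" field of each dict
names = [d["name"] for d in parsed]
print(names)  # ['Action', 'Adventure']
```

Without the parsing step, `raw` is just text, and indexing it with `"name"` would fail because strings only accept integer indices.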

In [None]:
movies.iloc[0].genres

In [None]:
import ast
def convert(obj):
    # Each cell is a string that looks like a list of dicts; ast.literal_eval parses it
    # into real Python objects so that i["name"] can be looked up
    L = []
    for i in ast.literal_eval(obj):
        L.append(i["name"])
    return L

In [None]:
#for genres and keywords
movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)

In [None]:
# I want to collect only the top 3 cast names, so I'll stop once the counter hits 3.
def convert1(obj):
    L = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter < 3:
            L.append(i["name"])
            counter += 1
        else:
            break
    return L

In [None]:
movies['cast'] = movies['cast'].apply(convert1)

In [None]:
movies.head()

In [None]:
# Extracting the director name from within the crew
def extract_director(obj):
    L = []
    for i in ast.literal_eval(obj):
        if i["job"] == "Director":
            L.append(i["name"])
            break
    return L

In [None]:
movies['crew'] = movies['crew'].apply(extract_director)

In [None]:
movies.head()

In [None]:
movies['overview'] = movies['overview'].apply(lambda x : x.split())

In [None]:
movies.head()

We have finally preprocessed `genres`, `cast`, `overview`, `keywords`, and `crew` and converted them into lists. This brings **step 2** to a close.
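As a side note, the "top 3 names" and "first director" extractions can also be written more compactly with slicing and a list comprehension. This is only an alternative sketch: `top_names` and `director` are hypothetical helper names, and the input strings are made-up stand-ins for the `cast` and `crew` cells.

```python
import ast

def top_names(obj, n=3):
    """Return the first n 'name' values from a stringified list of dicts."""
    return [d["name"] for d in ast.literal_eval(obj)[:n]]

def director(obj):
    """Return the first crew member whose job is 'Director' as a 0- or 1-item list."""
    return [d["name"] for d in ast.literal_eval(obj) if d["job"] == "Director"][:1]

# Hypothetical cell values
cast_raw = '[{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]'
crew_raw = '[{"job": "Editor", "name": "E"}, {"job": "Director", "name": "F"}]'

print(top_names(cast_raw))  # ['A', 'B', 'C']
print(director(crew_raw))   # ['F']
```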

<a name="1.3.3"></a>
#### Step 3: Building a new dataframe from the dataframe above
For each row, step 3 concatenates these lists into a single list and then joins that list into a single string. The column holding those strings will be `tags`.


*For recommender systems, spaces inside a name or a multi-word term (such as a genre name) can cause problems: each word becomes a separate token, so the actor '**Christian Bale**' partly matches '**Christian Slater**' through the shared word 'Christian'. Removing the space and changing the name to 'ChristianBale' avoids this and makes the recommendations better.*

***I will remove spaces between words for `genres`, `cast`, `keywords`, and `crew`.***
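The effect of removing the spaces can be checked with a few lines of plain Python. This is a minimal sketch: `tokens` is a hypothetical helper that mimics splitting on whitespace and lowercasing, which is a simplification of what a real vectorizer does.

```python
def tokens(names, join_words=False):
    """Lowercased whitespace tokens for a list of names."""
    if join_words:
        names = [name.replace(" ", "") for name in names]
    return {tok.lower() for name in names for tok in name.split()}

bale = ["Christian Bale"]
slater = ["Christian Slater"]

# With spaces, both actors contribute the token "christian"
shared_before = tokens(bale) & tokens(slater)

# Without spaces, each actor becomes one unique token
shared_after = tokens(bale, join_words=True) & tokens(slater, join_words=True)

print(shared_before)  # {'christian'}
print(shared_after)   # set()
```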

In [None]:
movies['genres'] = movies['genres'].apply(lambda x : [i.replace(" ", "") for i in x])
movies['cast'] = movies['cast'].apply(lambda x : [i.replace(" ", "") for i in x])
# 'overview' was already split into single words, so the column that needs this step is 'keywords'
movies['keywords'] = movies['keywords'].apply(lambda x : [i.replace(" ", "") for i in x])
movies['crew'] = movies['crew'].apply(lambda x : [i.replace(" ", "") for i in x])

In [None]:
movies.head()

In [None]:
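# Concatenating the 5 columns into a new column 'tags'
movies['tags'] = movies['genres'] + movies['cast'] + movies['overview'] + movies['keywords'] + movies['crew']
# .copy() avoids SettingWithCopyWarning in the cells below; reset_index keeps the row labels
# equal to the row positions used by the similarity matrix, in case dropna() removed rows
mov_new = movies[['id', 'title', 'tags']].copy().reset_index(drop = True)
mov_new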

In [None]:
# Finally, converting the list 'tags' into a string
mov_new['tags'] = mov_new['tags'].apply(lambda x : " ".join(x))

In [None]:
#converting every letter into lowercase
mov_new['tags'] = mov_new['tags'].apply(lambda x : x.lower())

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
#importing this library to stem all the different forms of the same words into single words

In [None]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))

    return " ".join(y) #to change into a string again

In [None]:
mov_new['tags'] = mov_new['tags'].apply(stem)
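
To see why stemming helps, here is a small sketch that stems three forms of the same word. The word list is an arbitrary example; the point is only that the different forms collapse to one token, so they count as the same word during vectorization.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Three surface forms of the same word
words = ["loved", "loving", "loves"]
stems = [ps.stem(w) for w in words]

print(stems)
print(len(set(stems)))  # number of distinct stems
```

A stem is not always a dictionary word, which is fine here because the tags are only compared with each other.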

In [None]:
mov_new.head()

<a name="2"></a>
## 2. Text Vectorization

What is our objective? A user picks a movie by name, and we want to recommend the 5 movies that are most similar to it.

For that, we need to calculate similarities between the texts that we have under the column `tags` and find the 5 most similar movies.

To compare texts numerically, we first need to turn them into vectors. For this, we need **text vectorization**.

**Concept**: We convert each movie's `tags` into a vector. Out of roughly 5000 movies, the 5 movie vectors that are closest to the vector of the movie that the user picks will be recommended. For the vectorization, I'll use **CountVectorizer** from the sklearn library.

I shall be ignoring any **stop words**. The vectorization will be done for the rest of the words. Then, I'll compare the vector of the movie picked by the user with all the other vectors and see which movies are the closest.
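The idea behind count-based vectorization can be sketched in plain Python. This is not sklearn's implementation: `bag_of_words` is a hypothetical helper, `STOP_WORDS` is a tiny illustrative set rather than sklearn's English list, and the three documents are made up.

```python
from collections import Counter

STOP_WORDS = {"a", "the", "in", "of"}  # tiny illustrative set, not sklearn's list

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of lowercase strings."""
    tokenized = [[w for w in doc.split() if w not in STOP_WORDS] for doc in docs]
    vocab = sorted({w for words in tokenized for w in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # One position per vocabulary word, holding how often it occurs
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

docs = ["action hero save the city", "hero love story in the city", "space action space"]
vocab, vectors = bag_of_words(docs)

print(vocab)    # ['action', 'city', 'hero', 'love', 'save', 'space', 'story']
print(vectors)  # [[1, 1, 1, 0, 1, 0, 0], [0, 1, 1, 1, 0, 0, 1], [1, 0, 0, 0, 0, 2, 0]]
```

`CountVectorizer` does the same kind of counting with some extras: with `max_features = 5000` it keeps only the 5000 most frequent terms, and its default tokenizer drops single-character tokens.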

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(max_features = 5000, stop_words = 'english')

In [None]:
vectorization = cvec.fit_transform(mov_new['tags']).toarray()
vectorization

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cvec.get_feature_names_out()

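We will not use **`Euclidean distance`**, because it is sensitive to how many words each `tags` string contains. Instead, we will calculate the **`Cosine similarity`**, i.e., the cosine of the angle between two vectors (the cosine distance is 1 minus this value).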
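A small numeric sketch shows the difference between the two measures. The three vectors are made up: `long` has the same word proportions as `short` but three times as many words, while `other` shares only one word with `short`. The helper functions are plain-Python stand-ins, not sklearn calls.

```python
import math

def cosine_similarity_pair(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short = [1, 1, 0]
long = [3, 3, 0]   # same direction as `short`, just a longer text
other = [0, 1, 1]  # shares only one word with `short`

print(cosine_similarity_pair(short, long))   # close to 1.0: same direction
print(cosine_similarity_pair(short, other))  # 0.5
print(euclidean_distance(short, long))       # about 2.83
print(euclidean_distance(short, other))      # about 1.41
```

Euclidean distance ranks `other` as closer to `short` than `long` is, purely because `long` contains more words. Cosine similarity ignores the length and looks only at the direction, which is the behaviour we want for tags of varying length.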

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity = cosine_similarity(vectorization) #calculating the pairwise cosine similarity between the vectors

<a name="3"></a>
## 3. Creating the Main Function for Recommendation

We now create the main function, our workhorse: it fetches the index of the movie entered by the user, sorts all movies from highest to lowest similarity score, and picks out the 5 movies that are the most similar to it.
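The ranking step can be tried out on a toy similarity matrix first. This is a sketch under assumptions: `top_similar` is a hypothetical helper and `toy_similarity` is made-up data. Unlike slicing off the first result, it filters out the picked movie by index, so it does not rely on the movie itself being ranked first when scores are tied.

```python
def top_similar(similarity_row, movie_index, k=5):
    """Indexes of the k most similar items, excluding the item itself."""
    # enumerate keeps each score paired with its original index while sorting
    ranked = sorted(enumerate(similarity_row), key=lambda x: x[1], reverse=True)
    return [i for i, _ in ranked if i != movie_index][:k]

# Hypothetical 4 x 4 similarity matrix (symmetric, 1.0 on the diagonal)
toy_similarity = [
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.4, 0.7, 0.3, 1.0],
]

print(top_similar(toy_similarity[0], 0, k=2))  # [2, 3]
```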

In [None]:
def recommend(movie):
    # fetching the index of the movie entered by the user
    movie_index = mov_new[mov_new['title'] == movie].index[0]
    distance = similarity[movie_index] # similarity scores of this movie against every movie
    
    # sorting the movies from highest similarity scores to lowest similarity scores while preserving their indexes
    # and then picking out 5 movies that are the most similar
    movie_list = sorted(list(enumerate(distance)), reverse = True, key = lambda x : x[1])[1:6]
    
    for i in movie_list:
        print(mov_new.iloc[i[0]].title)

In [None]:
#Checking the recommendation with different movies
recommend('The Dark Knight')

In [None]:
recommend('John Carter')

In [None]:
recommend('The Mummy Returns')

In [None]:
# using pickle to save our data
import pickle
with open('movies_dict.pkl', 'wb') as f:
    pickle.dump(mov_new.to_dict(), f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
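
The round trip can be checked with a small sketch. `movies_dict` below is made-up data in the shape that `DataFrame.to_dict()` produces by default (`{column: {index: value}}`), and the file is written to a temporary directory so nothing is left behind.

```python
import os
import pickle
import tempfile

# Hypothetical data in the default to_dict() shape: {column: {index: value}}
movies_dict = {"id": {0: 1, 1: 2}, "title": {0: "Movie A", 1: "Movie B"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies_dict.pkl")
    # Save the dictionary as bytes
    with open(path, "wb") as f:
        pickle.dump(movies_dict, f)
    # Load it back, as the app would do at start-up
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == movies_dict)  # True
```

Saving a plain dictionary instead of the DataFrame itself means the app can rebuild the frame with `pd.DataFrame(...)` after loading.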

<a name="4"></a>
## 4. Streamlit and Deployment

I used PyCharm for the steps from here on. I used The Movie Database (TMDB) API with my API key to get the JSON for each movie and extracted the poster path from it.
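The poster lookup boils down to reading `poster_path` from the movie's JSON and appending it to the image base URL. The sketch below shows only that step, without any network call. `poster_url` is a hypothetical helper, the sample dictionaries are made up, and it assumes that `poster_path` may be missing or empty for some movies and that it starts with a slash.

```python
IMAGE_BASE = "https://image.tmdb.org/t/p/w500/"

def poster_url(data):
    """Build a poster URL from a movie JSON dict, or return None if no poster is listed."""
    path = data.get("poster_path")
    if not path:
        return None
    # Strip a leading slash so the URL does not contain a double slash
    return IMAGE_BASE + path.lstrip("/")

print(poster_url({"poster_path": "/abc123.jpg"}))
print(poster_url({"poster_path": None}))
```

Returning `None` lets the caller decide what to show when a poster is unavailable, instead of failing on string concatenation.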

The Streamlit code is pasted below:

In [None]:
import streamlit as st
import pickle
import pandas as pd
import requests

API_KEY = "api_key_here"  # placeholder: put your own TMDB API key here

def get_poster(movie_id):
    response = requests.get("https://api.themoviedb.org/3/movie/{}?api_key={}".format(movie_id, API_KEY))
    data = response.json()
    return "https://image.tmdb.org/t/p/w500/" + data['poster_path']

def recommend(movie):
    movie_index = movies[movies['title'] == movie].index[0]
    distance = similarity[movie_index]
    movie_list = sorted(list(enumerate(distance)), reverse = True, key = lambda x : x[1])[1:6]

    recommended_movies = []
    recommended_mov_posters = []
    for i in movie_list:
        movie_id = movies.iloc[i[0]].id
        #we'll get posters here via API
        recommended_movies.append(movies.iloc[i[0]].title)
        recommended_mov_posters.append(get_poster(movie_id))
    return recommended_movies, recommended_mov_posters

movie_list = pickle.load(open('movies_dict.pkl', 'rb'))
movies = pd.DataFrame(movie_list)

similarity = pickle.load(open('similarity.pkl', 'rb'))

st.title('Movie Recommender System')

movie_picked = st.selectbox(
'Pick a movie. We will recommend 5 movies from our library that are the most similar to it.', movies['title'].values)

if st.button('Recommend'):
    names, posters = recommend(movie_picked)
    # one column per recommendation: title on top, poster below
    for col, name, poster in zip(st.columns(5), names, posters):
        with col:
            st.text(name)
            st.image(poster)

The Streamlit code can also be accessed on GitHub [via this link](https://github.com/psumitcode/movie-recommender-system/blob/main/recommender_app.py).

In addition to the code above, I created 4 other files for deployment: `Procfile`, `setup.sh`, `.gitignore`, and `requirements.txt`. `Procfile` is the file that [Heroku](http://www.heroku.com/), a cloud platform for hosting applications, reads to know which command starts the app.

> I had originally used the Heroku platform for hosting; however, since it doesn't have free services anymore, I migrated the app to [Render](https://render.com/).

The recommender app built with this code can be accessed [HERE](https://cine-recs.com/). It uses a custom domain that I set up via `Route 53`, the DNS service by AWS (`A` and `CNAME` records in use).

These files have been uploaded to the [GitHub repository](https://github.com/psumitcode/movie-recommender-system) along with this notebook.