<a href="https://www.kaggle.com/code/shikristin/top-n-movie-recommendation-using-cosine-similarity?scriptVersionId=244333172" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Movie Recommendation System

In [None]:
# there is a compatibility issue between gensim and numpy; if the problem persists, run the following commands to pin compatible versions

# %pip install --upgrade numpy==1.26.0
# %pip install --upgrade gensim==4.3.3


In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import StandardScaler

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # corpus required by WordNetLemmatizer
nltk.download('omw-1.4')
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

from scipy.sparse import hstack, csr_matrix

The main objective of this project is to create **a movie recommendation system** based on the TMDb (The Movie Database) dataset, a comprehensive movie database that provides details such as titles, ratings, release dates, revenue, genres, and much more.
<br><br>The original dataset can be found here: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies/data
<br>It contains up-to-date movie information and has been updated to include movies from 2024.

## Model Development

This recommendation system suggests similar movies based on the ***similarity*** between movies, computed from genres, keywords, ratings, language, etc.

### Data Exploration

In [None]:
# load dataset
df = pd.read_csv('TMDB_movie_dataset_v11.csv')

In [None]:
# display all columns & the head rows of the dataset
pd.set_option('display.max_columns', None)
df.head()

In [None]:
df.info()

There are over 1 million movie records in our dataset, with 23 documented attributes, including title, vote count, revenue, etc.
<br>Most columns are stored as object dtype (conversion needed), and **key features** include:
1. **movie itself**: id, title, genres, runtime, adult, original language, keywords, release year...
2. **rating & popularity**: rating, vote counts, popularity...
3. **profitability**: revenue, budget, production companies & countries...

Since the stakeholders of our recommendation system are audiences rather than investors, our model is not concerned with profitability, so **columns such as revenue and budget should be removed during the data preprocessing stage**.

In [None]:
# check for any missing values
df.isnull().sum()

Noticed a significant number of missing values in columns such as backdrop_path, homepage, poster_path... Since this information is irrelevant to our model building, it is safe to ignore.
<br>
<br>However, **release_date** (17% missing) and **genres** (40% missing) are essential to our model development, so **the missing values need to be handled during the data preprocessing stage**.

In [None]:
# check for any duplicates
df.duplicated().sum()

In [None]:
# relatively small numbers of duplicates, just drop them
df.drop_duplicates(inplace=True)
df.duplicated().sum()

### Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.<br><br>
In this case, I will focus on the **intrinsic attributes** of movies in our dataset, which include title, release date, runtime, adult, original language and genres. I will **extract relevant information**, **binary-encode each feature** and **vectorize each data entry**, so that I can **calculate the cosine similarity** between each movie for recommendation.
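
To make the similarity step concrete, here is a minimal sketch of cosine similarity on binary-encoded vectors, written in plain Python so the arithmetic is visible. The three movies and the five feature columns are hypothetical and not taken from the dataset; the notebook itself imports `cosine_similarity` from scikit-learn for this purpose.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)

# Hypothetical binary encodings: [Action, Comedy, Drama, English, Feature Film]
movie_a = [1, 0, 1, 1, 1]
movie_b = [1, 0, 0, 1, 1]
movie_c = [0, 1, 0, 0, 1]

sim_ab = cosine_sim(movie_a, movie_b)  # share three of the encoded features
sim_ac = cosine_sim(movie_a, movie_c)  # share only the runtime type
print(round(sim_ab, 3), round(sim_ac, 3))
```

Movies that share more encoded features get a score closer to 1, which is what the top-N recommendation ranks on.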


### Feature Selection

In [None]:
df['title'].isnull().sum()

In [None]:
df = df[~df['title'].isnull()]

In [None]:
df['title'].isnull().sum()

In [None]:
# convert release_date to datetime and extract year
df['release_date'] = pd.to_datetime(df['release_date'])
df['release_date'].dt.year.value_counts().sort_index()

Noticed some years are way off the normal timeline: the first movie ever released dates from 1895, and the latest movie in this database should be released in 2024.

In [None]:
# view the rows with release years before 1900
df[df['release_date'].dt.year < 1900].sort_values('release_date')

In [None]:
# # view the rows with release years after 2024
# df[df['release_date'].dt.year > 2024].sort_values('release_date', ascending=False)

In [None]:
df[df['release_date'].dt.year > 2025]['status'].value_counts()

In [None]:
# set the reference date as today
ref_date = pd.Timestamp.now()
ref_date


In [None]:
# filter movies that are released in the future
df[df['release_date'] > ref_date][['status']].value_counts()

Noticed movie records with a release year before 1900 are mostly scripts or plays (e.g. A Farsa de Inês Pereira), which do not fall under the category of 'movie', so they should be trimmed off.
<br><br>Also noticed some movies released after today are marked as 'Released'; they should be trimmed off as well, along with any movie whose status is not **'Released'**.

In [None]:
# drop any rows with a release year earlier than 1900
df = df[df['release_date'].dt.year >= 1900]
# drop any rows with a release date later than today
df = df[df['release_date'] <= ref_date]
# validate
df['release_date'].dt.year.value_counts().sort_index()

Any movie that is not under 'Released' status is not available to watch, and thus should be trimmed off.

In [None]:
df = df[df['status'] == 'Released']
df['status'].value_counts()

Up to this point we still have about 974k valid movie records in our database, which is still a rich resource.

In [None]:
# check runtime of each movie in the dataset
df['runtime'].describe()

The minimum runtime is 0 and the maximum is 14400, so something is off.

In [None]:
df['runtime'].value_counts().sort_index()

In [None]:
df[df['runtime'] == 0].shape[0]

Noticed there are **208230 records** with 0 runtime, they must be removed.

Some movies under 15 minutes are shorts (e.g. Frozen Fever). They still fall under the definition of a movie, but **they should be categorized as shorts during the data preprocessing stage**.

In [None]:
df[df['runtime']>= 360].shape[0]

Noticed there are **1660 records** with runtime greater than 360 minutes (any movie longer than 6 hours is not watchable, in my humble opinion), thus these records need to be trimmed off.

In [None]:
df=df[(df['runtime'] != 0) & (df['runtime'] <= 360)]
df['runtime'].describe()

Noticed some movies have zero revenue, even though this attribute is not relevant to our model building, this can be used to validate our data.

In [None]:
df[df['revenue'] == 0].head()

Noticed many blockbuster hits (Bird Box, Zack Snyder's Justice League) are listed here, so revenue is likely recorded incorrectly for many movies; we should proceed without trimming.

In [None]:
# find movies that have no imdb_id
df[df['imdb_id'].isnull()].sample(10)

Most movies without imdb id have very little information about themselves, thus, we will remove these rows.

In [None]:
# remove any rows with no imdb_id
df = df[df['imdb_id'].notnull()]
df.shape[0]

In [None]:
df[df['adult']==True].head()

Movies that are marked as 'adult' can be pornography, so they will be removed from the movie analysis for user discretion.

In [None]:
# remove any rows with adult 
df = df[df['adult'] == False]
df.shape[0]

Text columns include genres, tagline, overview, and keywords, and they are all important for sub-classifying movies based on similarity.

In [None]:
df['genres'].head()

In [None]:
df['tagline'].head()

In [None]:
df['overview'].head()

In [None]:
df['keywords'].head()

Noticed many movies have extensive genres and keywords, and the different combinations can blur the similarity between each pair, so **more data processing is needed**.

Columns 'tagline' and 'overview' both contain text describing the movie's plot, which can be subjective -- **remove tagline, and tokenize overview in later preprocessing**.

What about production companies and languages? Are they comprehensive as well?

In [None]:
df['production_countries'].value_counts()

In [None]:
df['original_language'].value_counts().head(10)

Noticed every record has only **one** original language, which can be used in model development, and **a total of 10 languages can serve as major categories during the data preprocessing stage.**

In [None]:
df['spoken_languages'].value_counts()

Noticed some movies have multiple spoken languages, which can mean the movie is available in multiple languages.<br><br>
However, there is no guarantee that audiences will see two movies from different cultures as similar just because they are available in the same translation, thus **this column will be removed during the data preprocessing stage**.

In [None]:
df['vote_average'].describe()

In [None]:
# select samples of movie that has rating over 8.0
df_rating = df[df['vote_average'] > 8.0].sample(10, random_state=42)
df_rating.sort_values('vote_average', ascending=False)

Noticed movies with the same rating (vote_average) have very different popularity scores, because popularity ranks how popular movies are; blockbusters may have lower ratings but are definitely more popular content for recommendation.

In [None]:
df.sort_values('popularity', ascending=False).head(10)

In [None]:
df.sort_values('vote_count', ascending=False).head(10)

Noticed the most popular movies do not necessarily have the most votes. This is easy to understand: popular movies are usually released not long before the reference date, while movies with the most votes tend to be ones that have been voted on continually as a quality check.

Both can be used to build our model, but vote counts can be redundant as they provide little information.

In [None]:
# find the most popular production companies
df['production_companies'].value_counts().head(20)

Columns such as revenue, budget, backdrop_path, and homepage are irrelevant to building our recommendation model, thus they should be removed. <br>
But some famous production companies (e.g. BBC, Disney) are mentioned, which can be used for model building **with more preprocessing**.

In [None]:
df.info()

After exploring all columns, some will be removed as they are irrelevant for model building, which include: vote_count (not as informative as rating), status (all 'Released'), revenue, budget, adult (all are False), backdrop_path, homepage, imdb_id (not informative), original_title (redundant with 'title'), poster_path, tagline (not as informative as overview), production_countries, spoken_languages (overlaps with original language and difficult to classify).

In [None]:
df_for_model = df.copy()
df_for_model = df_for_model[['id', 'title', 'runtime','original_language', 'vote_average', 'release_date', 'genres', 'keywords', 'overview', 'production_companies', 'popularity']]
df_for_model.head()

In [None]:
# output a csv file for before data preprocessing
# df_for_model.to_csv('TMDB_movie_dataset_v11_cleaned_for_model.csv', index=False)

### Feature Engineering

1. For numerical columns, keep them as they are.
2. For categorical columns with a single value per cell, I will encode the categories as dummy (one-hot) variables.
3. For categorical columns with multiple values in one cell (e.g. genres), I will parse each cell into a usable list of values and encode each distinct value using a multilabel binarizer.
4. For text columns such as keywords and overview, I will use a TF-IDF matrix to vectorize the text for keyword lookup.
<br><br>***The transformed table will represent each movie as a vector using binary encoding.***
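
As a rough illustration of steps 2 and 3, the sketch below encodes one single-value column and one multi-value column by hand. The category lists and the sample row are made up for the example, and the helper names `one_hot` and `multi_hot` are hypothetical; the notebook relies on `pd.get_dummies` and `MultiLabelBinarizer` for the real encoding.

```python
def one_hot(value, categories):
    """Single-value cell: exactly one position is set to 1."""
    return [1 if value == c else 0 for c in categories]

def multi_hot(cell, classes):
    """Multi-value cell such as 'Action, Drama': several positions can be 1."""
    labels = {part.strip() for part in cell.split(',')}
    return [1 if c in labels else 0 for c in classes]

# Hypothetical category lists and sample row
eras = ['Blockbuster Era', 'Digital Era']
genres = ['Action', 'Comedy', 'Drama']
row = {'era': 'Digital Era', 'genres': 'Action, Drama'}

# Concatenate both encodings into one binary vector for the movie
vector = one_hot(row['era'], eras) + multi_hot(row['genres'], genres)
print(vector)
```

The one-hot part always sums to 1 for a known category, while the multi-hot part can contain several ones, which is why genres get their own rescaling later on.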

In [None]:
# df_for_model = pd.read_csv('TMDB_movie_dataset_v11_cleaned_for_model.csv')
# df_for_model.head()

In [None]:
# check for any missing values
df_for_model.isnull().sum()

In [None]:
# fill missing values with 'unknown'
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)
for col in ['genres', 'keywords', 'overview', 'production_companies']:
    df_for_model[col] = df_for_model[col].fillna('unknown')
# check for any missing values
df_for_model.isnull().sum()

In [None]:
# extract year from release_date, and refactor it into different time periods based on empirical knowledge
def define_era(row):
    if row['release_date'].year < 1927:
        return 'The Silent Era'
    elif row['release_date'].year < 1960:
        return 'Golden Age'
    elif row['release_date'].year < 1980:
        return 'Post-War Era'
    elif row['release_date'].year < 1990:
        return 'Blockbuster Era'
    else:
        return 'Digital Era'


In [None]:
# Ensure release_date is in datetime format
# df_for_model['release_date'] = pd.to_datetime(df_for_model['release_date'], errors='coerce')

# Apply the define_era function
df_for_model['era'] = df_for_model.apply(define_era, axis=1)
df_for_model.head()

In [None]:
df_for_model['era'].value_counts()

In [None]:
# refactor runtime into different types of movies based on empirical knowledge
def define_runtime(row):
    if row['runtime'] < 40:
        return 'Short Film'
    elif row['runtime'] < 60:
        return 'Featurette'
    elif row['runtime'] < 120:
        return 'Feature Film'
    elif row['runtime'] < 180:
        return 'Extended Feature Film'
    else:
        return 'Other'

In [None]:
df_for_model['runtime_type'] = df_for_model.apply(define_runtime, axis=1)
df_for_model.head()

In [None]:
df_for_model['runtime_type'].value_counts()

In [None]:
pd.set_option('display.max_rows', None)
df_for_model['original_language'].value_counts()

In [None]:
df_for_model['original_language'].value_counts().shape

In [None]:
# refactor languages into six major languages and others
def define_language(row):
    if row['original_language']=='en':
        return 'English'
    elif row['original_language']=='ja':
        return 'Japanese'
    elif row['original_language']=='fr':
        return 'French'
    elif row['original_language']=='es':
        return 'Spanish'
    elif row['original_language']=='de':
        return 'German'
    elif row['original_language']=='it':
        return 'Italian'
    else:
        return 'Other'

In [None]:
df_for_model['language'] = df_for_model.apply(define_language, axis=1)
df_for_model.head()

In [None]:
# df_for_model['language'].value_counts()

In [None]:
# find all production companies listed in the dataset
production_companies_list = df_for_model['production_companies'].value_counts().index.tolist()
production_companies_list[:50]
# split each element in the list by comma
production_companies_list = [i.split(', ') for i in production_companies_list]
# flatten the list
production_companies_list = [item for sublist in production_companies_list for item in sublist]
# remove duplicates
production_companies_list = list(set(production_companies_list))
# remove empty strings
production_companies_list = [i for i in production_companies_list if i]
# remove unknown
production_companies_list = [i for i in production_companies_list if i != 'unknown']
len(production_companies_list)

Too many production companies, choose only major ones.

In [None]:
major_production_companies = df_for_model['production_companies'].value_counts().head(20).index.tolist()
major_production_companies

In [None]:
def define_production_company(row):
    for company in major_production_companies:
        if company in row['production_companies']:
            return company
    return 'Other'

In [None]:
df_for_model['production_company'] = df_for_model.apply(define_production_company, axis=1)
df_for_model.head()

In [None]:
df_for_model['production_company'].value_counts()

In [None]:
# use dummy variables to encode era, runtime_type, language, and production company
df_for_model = pd.get_dummies(df_for_model, columns=['era', 'runtime_type', 'language', 'production_company'])

# oh = OneHotEncoder(handle_unknown='ignore')
# oh.fit(df_for_model[['era', 'runtime_type', 'original_language', 'production_company']])
# # transform the categorical variables into one-hot encoded variables

# df_for_model['era'] = oh.transform(df_for_model['era'])
# df_for_model['runtime_type'] = oh.transform(df_for_model['runtime_type'])
# df_for_model['original_language'] = oh.transform(df_for_model['original_language'])
# df_for_model['production_company'] = oh.transform(df_for_model['production_company'])
pd.set_option('display.max_columns', None)
df_for_model.head()

In [None]:
# make a list of genres in each row
genre_l = df_for_model['genres'].apply(lambda x: x.split(', ')).reset_index(drop=True)
type(genre_l)

Noticed multiple genres are associated with the same movie, so we need to collect the unique genres by iterating over each movie's sub-list of genres.

In [None]:
# find all genres in genre_l
genre_l = genre_l.to_list()
gen_lst = []
for i in range(len(genre_l)):
    for j in range(len(genre_l[i])):
        gen_lst.append(genre_l[i][j])
gen_lst = pd.Series(gen_lst)
gen_lst.value_counts()


Noticed there are only 20 genres in total(including unknown), we can encode them all using **multilabel binarizer**.

In [None]:
# encode genres using MultiLabelBinarizer
mlb = MultiLabelBinarizer()
gen_l = pd.Series(genre_l)
genre_encoded = mlb.fit_transform(genre_l)
genre_encoded[:5]

In [None]:
df_genre_encoded = pd.DataFrame(genre_encoded, columns=mlb.classes_)
df_genre_encoded

In [None]:
# Ensure 'genres' column is present before dropping it
if 'genres' in df_for_model.columns:
    df_for_model.drop('genres', axis=1, inplace=True)
df_for_model_encoded = pd.concat([df_for_model.reset_index(drop=True), df_genre_encoded.reset_index(drop=True)], axis=1)
pd.set_option('display.max_columns', None)
df_for_model_encoded.head()

The overview column seems to add little information about the overall context compared to keywords, so it will be removed for computational efficiency.

In [None]:
# remove unnecessary  columns
df_for_model_encoded.drop(['id', 'runtime', 'original_language', 'release_date', 'overview', 'production_companies'], axis=1, inplace=True)
df_for_model_encoded.head()

#### Text columns preprocessing

In [None]:
# put all distinct keywords into a giant list

keyword_l = list(set(keyword for sublist in df_for_model['keywords'].apply(lambda x: x.split(', ')).reset_index(drop=True) for keyword in sublist))
len(keyword_l)

In [None]:
keyword_l[:10]

In [None]:
# show all texts in the column
# pd.set_option('display.max_colwidth', None)
# df_for_model[['title', 'keywords', 'overview']].head(10)

### NLP on Text Columns using word embedding, PCA, and K-Means

**Preprocessing**: Clean and preprocess text columns, including lowercasing, removing punctuation, removing stop words (filler words), lemmatizing (reducing words to their base form), and combining keywords and overview into one column 'combine_text'.

In [None]:
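# Convert all text columns to lowercase and remove punctuation
import re
# the comma is not in this list, so comma-separated keywords stay separable for the TF-IDF token pattern
punctuation = "!\"#$%&'()*+-./:;<=>?@[\\]^_`{|}~"
# character class that matches any single punctuation character
# (escaping the whole string without the brackets would only match that exact sequence)
punctuation_re = re.compile('[' + re.escape(punctuation) + ']')
def clean_text(text):
    text = text.lower()
    text = punctuation_re.sub('', text)  # Remove punctuation
    return text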
df_for_model_encoded['keywords'] = df_for_model_encoded['keywords'].apply(clean_text)
df_for_model_encoded.head()

In [None]:
# remove duplicate words in keywords
def remove_duplicates(text):
    words = text.split()
    unique_words = list(dict.fromkeys(words))
    return ' '.join(unique_words)
df_for_model_encoded['keywords'] = df_for_model_encoded['keywords'].apply(remove_duplicates)


In [None]:
# remove stop words from keywords
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", "would", "yet"]
df_for_model_encoded['keywords'] = df_for_model_encoded['keywords'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_words]))

In [None]:
# use WordNetLemmatizer to lemmatize the keywords
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
df_for_model_encoded['keywords'] = df_for_model_encoded['keywords'].apply(lemmatize_text)


In [None]:
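# recount the distinct keywords after cleaning (uses the cleaned column in df_for_model_encoded)
keyword_l = list(set(keyword for sublist in df_for_model_encoded['keywords'].apply(lambda x: x.split(', ')).reset_index(drop=True) for keyword in sublist))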
len(keyword_l)

In [None]:
# overview_l = list(set(keyword for sublist in df_for_model['overview'].apply(lambda x: x.split(', ')).reset_index(drop=True) for keyword in sublist))
# len(overview_l)

In [None]:
# combine keywords & overview into one column named combine_text, then use tfidf to vectorize the text
# df_for_model_encoded['combine_text'] = df_for_model_encoded['keywords'] + ' ' + df_for_model_encoded['overview']
# df_for_model_encoded['combine_text'].head()

### Word Embedding using TF-IDF

**Vectorization**: Convert words into numerical representations using word embedding

**FastText** and **TF-IDF** are both methods used in Natural Language Processing (NLP) for representing text, but they differ significantly in their approach and capabilities. FastText is a word embedding technique that learns vector representations of words **based on their character n-grams**, while TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that **reflects a word's importance within a document relative to a collection of documents**. <br>
In this case, the goal is to find the most frequent keywords and compare movies by those keywords within the corpus of all cleaned texts in the dataset, so TF-IDF is the better tool for vectorizing the text.
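
The sketch below works through the textbook TF-IDF formula, tf × ln(N / df), on three made-up keyword strings. It is only meant to show how a term that appears everywhere loses weight; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical comma-separated keyword strings, one per movie
docs = ["superhero, rescue", "rescue, space", "space, alien, rescue"]
tokenized = [[t.strip() for t in d.split(',')] for d in docs]

n_docs = len(tokenized)
# document frequency: in how many movies each keyword appears
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = [tfidf(doc) for doc in tokenized]
for w in weights:
    print({t: round(v, 3) for t, v in w.items()})
```

Here 'rescue' occurs in every movie, so its weight drops to zero, while a rare keyword such as 'superhero' keeps a high weight.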

In [None]:
# use tfidf to vectorize the text
tfidf = TfidfVectorizer(max_features=1000, token_pattern=r'[^,]+')
tfidf_matrix = tfidf.fit_transform(df_for_model_encoded['keywords'])
tfidf_matrix.shape

In [None]:
# convert the sparse matrix to a dense matrix
tfidf_matrix_dense = tfidf_matrix.todense()
# convert the dense matrix to a dataframe
df_tfidf = pd.DataFrame(tfidf_matrix_dense, columns=tfidf.get_feature_names_out())
df_tfidf.head()

In [None]:
# output df_tfidf to csv
df_tfidf.to_csv('keyword_embedding.csv', index=False)

In [None]:
# print out all columns name in df_tfidf
keywords_searchup = df_tfidf.columns.tolist()
# output the keywords_searchup to a txt file
with open('keywords_searchup.txt', 'w') as f:
    for item in keywords_searchup:
        f.write("%s\n" % item)

In [None]:
# use PCA to reduce the dimensionality of the tfidf matrix
# set the number of components so that 90% of the variance is explained
pca = PCA(n_components=0.90)
# convert the dense matrix to a numpy array
tfidf_matrix_dense = np.array(tfidf_matrix_dense)
# fit the PCA model to the dense matrix
pca.fit(tfidf_matrix_dense)
pca_matrix = pca.transform(tfidf_matrix_dense)
pca_matrix.shape
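
Passing a fraction such as `n_components=0.90` asks PCA to keep just enough components to cover that share of the variance. The sketch below shows the idea on made-up explained-variance ratios; the helper `components_for_variance` is hypothetical, and scikit-learn's exact handling of ties at the threshold may differ.

```python
def components_for_variance(ratios, target=0.90):
    """Smallest number of leading components whose cumulative ratio reaches the target."""
    cumulative = 0.0
    for count, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count
    return len(ratios)

# Hypothetical explained-variance ratios, sorted from largest to smallest
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
n_kept = components_for_variance(ratios)
print(n_kept)  # cumulative sums: 0.5, 0.75, 0.875, 0.9375
```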


In [None]:
# drop the keywords column from df_for_model_encoded
df_for_model_encoded.drop(['keywords'], axis=1, inplace=True)
df_for_model_encoded.head()

In [None]:
# wrap the pca matrix in a dataframe with one column per component
df_pca = pd.DataFrame(pca_matrix, columns=[f'pca_{i}' for i in range(pca_matrix.shape[1])])
df_pca.head()

In [None]:
# concat the pca matrix with the original dataframe for similarity search
df_for_model_encoded_sim = pd.concat([df_for_model_encoded.reset_index(drop=True), df_pca.reset_index(drop=True)], axis=1)
# df_for_model_encoded_sim.drop('keywords', axis=1, inplace=True)
df_for_model_encoded_sim.head()

In [None]:
# combine the tfidf matrix with the original dataframe
# df_for_model_encoded = pd.concat([df_for_model_encoded.reset_index(drop=True), df_tfidf.reset_index(drop=True)], axis=1)
# df_for_model_encoded.shape

In [None]:
# from gensim.models import Word2Vec

In [None]:
# create a CBOW Word2Vec model 
# model1 = Word2Vec(tokenized_words, vector_size=200, window=5, min_count=1, workers=4)

In [None]:
# # test the model on example keywords, and find the most similar words by cosine similarity
# model1.wv.most_similar('rescue')[:10]

In [None]:
# create a Skip Gram Word2Vec model 
# model2 = Word2Vec(tokenized_words, vector_size=200, window=10, min_count=1, workers=4, sg=1)

In [None]:
# test the model on example keywords, and find the most similar words by cosine similarity
# model2.wv.most_similar('rescue')[:10]

In [None]:
# from gensim.models import FastText

In [None]:
# model3 = FastText(tokenized_words, vector_size=200, window=5, min_count=1, workers=4)

In [None]:
# model3.wv.most_similar('love')[:10]

In [None]:
# word_vectors = [model3.wv[word] for word in processed_words if word in model3.wv]
# dense_matrix = np.array(word_vectors)
# dense_matrix.shape

In [None]:
# no_vec_lst = [word for word in processed_words if word not in model3.wv]
# len(no_vec_lst)

In [None]:
# from sklearn.decomposition import PCA
# pca = PCA(n_components=10)

In [None]:
# reduced_word_vectors = pca.fit(dense_matrix).transform(dense_matrix)

In [None]:
# check whether dimensions have been reduced to 50
# len(reduced_word_vectors[0])

In [None]:
# len(reduced_word_vectors)

In [None]:
# import  the necessary libraries for clustering
# from sklearn.cluster import KMeans

In [None]:
#find the optimal k using the elbow method
# wcss = []
# for k in range(1, 20):
#     kmeans = KMeans(n_clusters=k, random_state=42)
#     kmeans.fit(reduced_word_vectors)
#     wcss.append(kmeans.inertia_)
# plt.plot(range(1, 20), wcss)
# plt.xlabel('Number of clusters')
# plt.ylabel('WCSS')
# plt.title('Elbow Method')
# plt.show()


In [None]:
# k = 10
# kmeans = KMeans(n_clusters=k, random_state=42)
# kmeans.fit(reduced_word_vectors)
# labels = kmeans.labels_

**Evaluation**: use **k-means inertia**, which measures the distance between each data point and its centroid, squares this distance, and sums these squares across all clusters.
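
A small worked example of that definition, using made-up 2-D points and centroids rather than the word vectors from this notebook:

```python
def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Hypothetical data: two tight clusters, each point 1 unit from its centroid
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]

print(inertia(points, centroids, labels))  # 1 + 1 + 1 + 1
```

Because inertia depends on the scale of the data and always shrinks as K grows, it is most useful for comparing different values of K on the same data.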

In [None]:
# find out the inertia of the current kmeans model
# kmeans.inertia_

A good k-means model is one with low inertia AND a low number of clusters (K). Since the inertia is very close to 0 (0.06), we consider this a good k-means clustering, and the 10 labels can successfully group all processed keywords.

Create a dataframe to apply the map_to_word_vector function to each row of the tokenized words.

In [None]:
# tokenized_words_series = pd.Series(tokenized_words)
# df_tokenized_words = pd.DataFrame(tokenized_words_series, columns=['tokenized_words'])
# df_tokenized_words.head()

In [None]:
# map each tokenized word to its corresponding word vectors, if it is not in the word vector matrix, then use a zero vector
# def map_to_word_vector(row):
#     word_vector = np.zeros(200)
#     for word in row:
#         if word in model3.wv:
#             word_vector += model3.wv[word]
#     return word_vector

In [None]:
# df_tokenized_words['word_vector'] = df_tokenized_words['tokenized_words'].apply(map_to_word_vector)
# df_tokenized_words.head()

In [None]:
# make sure each tokenized word find its corresponding word vector
# df_tokenized_words.isna().sum()

In [None]:
# df_tokenized_words.shape

In [None]:
# lables_series = pd.Series(labels)
# lables_series.shape

In [None]:
# df_tokenized_words['label'] = lables_series
# df_tokenized_words.head()

In [None]:
# check if each cluster has an appropriate number of words
# df_tokenized_words['label'].value_counts()

In [None]:
# df_tokenized_words[df_tokenized_words['label']==0].head()

There is some semantic similarity between the words within each cluster, so for now we can proceed.

In [None]:
# transform dataframe by making label as a dummy variable
# df_tokenized_words_getdummies = pd.get_dummies(df_tokenized_words, columns=['label'])
# df_tokenized_words_getdummies.head()

In [None]:
# load the similarity searchup dataset
df_for_model_encoded_sim = pd.read_csv('TMDB_movie_dataset_v11_cleaned_for_model_encoded_sim.csv')
df_for_model_encoded_sim.head()

In [None]:
# set title as index
df_for_model_encoded_sim.set_index('title', inplace=True)
# check the index
df_for_model_encoded_sim.index[:10]

In [None]:
# drop any null values
df_for_model_encoded_sim.dropna(inplace=True)
df_for_model_encoded_sim.shape

In [None]:
# drop any duplicates
df_for_model_encoded_sim.drop_duplicates(inplace=True)
df_for_model_encoded_sim.shape

In [None]:
# find the highest popularity score
df_for_model_encoded_sim['popularity'].describe()

### Assign weighting

Before we proceed, we need to assign weights to different columns so that:
1. Numeric features are standardized to a common scale using min-max scaling or StandardScaler.
2. Binary variables need no further scaling, since each movie can take on only one label per feature.
3. Multi-labeled features (genres), where each movie can have values in several dummy variables of the same feature, are rescaled so that the values sum up to 1 <br>
(e.g. if movie 1 has 2 in genre 1, 3 in genre 2, and 5 in genre 3, <br>
then the values will be converted to 2/(2+3+5)=0.2 in genre 1, 3/(2+3+5)=0.3 in genre 2, and 5/(2+3+5)=0.5 in genre 3)
4. Keyword columns are already transformed, so they are left as they are for now.
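
The two rescaling rules in points 1 and 3 can be sketched in plain Python as follows. The helper names `standardize` and `normalize_row` and the sample values are made up for illustration; the notebook uses `StandardScaler` and a pandas row-wise division for the real data.

```python
import statistics

def standardize(values):
    """Z-score scaling with the population standard deviation (what StandardScaler uses)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def normalize_row(row):
    """Rescale one movie's multi-label values so they sum up to 1."""
    total = sum(row)
    return [v / total for v in row] if total else list(row)

# Worked example from the text: 2, 3 and 5 across three genres
print(normalize_row([2, 3, 5]))
# Binary genre flags: a movie tagged with two genres gets 0.5 in each
print(normalize_row([1, 0, 1]))
# Hypothetical ratings standardized to mean 0 and unit variance
z_scores = standardize([2.0, 4.0, 6.0])
print([round(z, 3) for z in z_scores])
```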

In [None]:
df_for_model_encoded_sim.head()

In [None]:
# standardize numerical features
scaler = StandardScaler()
df_for_model_encoded_sim[['vote_average', 'popularity']] = scaler.fit_transform(df_for_model_encoded_sim[[ 'vote_average', 'popularity']])
df_for_model_encoded_sim.head()

In [None]:
era_columns = [col for col in df_for_model_encoded_sim.columns if 'era_' in col]
runtime_type_columns = [col for col in df_for_model_encoded_sim.columns if 'runtime_type_' in col]
language_columns = [col for col in df_for_model_encoded_sim.columns if 'language_' in col]
production_company_columns = [col for col in df_for_model_encoded_sim.columns if 'production_company_' in col]
# check the columns
print(era_columns)
print(runtime_type_columns)
print(language_columns)
print(production_company_columns)

For binary columns, make sure all values are converted to numeric.

In [None]:
# convert all binary columns to numeric
# df_for_model_encoded_sim[era_columns] = df_for_model_encoded_sim[era_columns].apply(pd.to_numeric)
# df_for_model_encoded_sim[runtime_type_columns] = df_for_model_encoded_sim[runtime_type_columns].apply(pd.to_numeric)
# df_for_model_encoded_sim[language_columns] = df_for_model_encoded_sim[language_columns].apply(pd.to_numeric)
# df_for_model_encoded_sim[production_company_columns] = df_for_model_encoded_sim[production_company_columns].apply(pd.to_numeric)
# check the columns
df_for_model_encoded_sim.info()

In [None]:
# relabel genre columns so that rating in different genres would sum up to 1 
genre_columns = ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western', 'unknown']
df_for_model_encoded_sim[genre_columns] = df_for_model_encoded_sim[genre_columns].div(df_for_model_encoded_sim[genre_columns].sum(axis=1), axis=0)
df_for_model_encoded_sim.head()

In [None]:
# # relabel keywords so that rating in different labels would sum up to 1 
# # (ex.if movie 1 has 2 in label 1, 3 in label 2, and 5 in label 3, then each value will be converted to 2/(2+3+5)=0.2 in label 1, 3/(2+3+5)=0.3 in label 2, and 5/(2+3+5)=0.5 in label 3)
# keyword_columns = [f'label_{i}' for i in range(10)]
# df_for_model_encoded_final[label_columns] = df_for_model_encoded_final[label_columns].div(df_for_model_encoded_final[label_columns].sum(axis=1), axis=0)
# df_for_model_encoded_final.head()

Determining the appropriate weights for different features in a movie dataset involves careful consideration of how much each feature contributes to the overall similarity measure. <br><br>
Here is the proposed weighting:

1. **genres & languages: 4** -- Genres are often the primary factor in movie selection for viewers. They define the storytelling style, themes, and overall expectations of the film, making them a crucial component in determining film similarity and preferences. Viewers also tend to watch movies in their native language (unless they specifically seek out foreign films), so it makes sense to prioritize languages as well.
2. **rating & popularity: 3** -- These features can be highly influential in signaling the quality of a movie, which is what most viewers look for in a recommendation.
3. **runtime & era: 2** -- Runtime can influence a viewer’s choice (e.g., a preference for short films for casual viewing versus feature films for dedicated watching). However, it could be secondary to the content and thematic similarity as a feature in most analyses. Era can help contextualize a film's style, themes, and production values. Movies from the same era might share stylistic features, making this a relevant feature. However, like runtime type, it shouldn’t be as heavily weighted as genres.
4. **keywords: 1** -- Keywords can capture detailed thematic elements and narrative aspects that go beyond simple classifications. However, they are highly subjective as a measure of similarity between movies, especially when genres are already present; as a result, they receive a lower weight.
5. **production company: 0.8** -- This mainly matters to viewers who have strong preferences for particular production companies (e.g., Disney), so it receives the lowest weight.
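
To see why these weights matter for cosine similarity, here is a small sketch with made-up vectors. The feature layout (two genre entries followed by two production-company entries) and the variable names are assumptions for illustration only.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical layout: [genre_1, genre_2, company_1, company_2]
query = np.array([1.0, 0.0, 1.0, 0.0])
same_genre = np.array([1.0, 0.0, 0.0, 1.0])    # same genre, different company
same_company = np.array([0.0, 1.0, 1.0, 0.0])  # different genre, same company

# Weights from the list above: genres 4, production company 0.8
weights = np.array([4.0, 4.0, 0.8, 0.8])

# Unweighted: both candidates are equally similar to the query
unweighted = (cosine(query, same_genre), cosine(query, same_company))
# Weighted: the candidate sharing the genre now ranks clearly higher
weighted = (cosine(query * weights, same_genre * weights),
            cosine(query * weights, same_company * weights))
print(unweighted, weighted)
```

Multiplying a block of columns by a larger constant makes agreement on those columns contribute more to the dot product, which is what pushes the same-genre candidate ahead.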


In [None]:
# apply the feature weights proposed above (keywords keep an implicit weight of 1)
df_for_model_encoded_sim[genre_columns] = df_for_model_encoded_sim[genre_columns]*4
df_for_model_encoded_sim[language_columns] = df_for_model_encoded_sim[language_columns]*4
df_for_model_encoded_sim['vote_average'] = df_for_model_encoded_sim['vote_average']*3
df_for_model_encoded_sim['popularity'] = df_for_model_encoded_sim['popularity']*3
df_for_model_encoded_sim[era_columns] = df_for_model_encoded_sim[era_columns]*2
df_for_model_encoded_sim[runtime_type_columns] = df_for_model_encoded_sim[runtime_type_columns]*2
df_for_model_encoded_sim[production_company_columns] = df_for_model_encoded_sim[production_company_columns]*0.8
df_for_model_encoded_sim.head()


In [None]:
# era_columns = ['era_Digital Era', 'era_Blockbuster Era', 'era_Golden Age', 'era_Post-War Era', 'era_The Silent Era']
# df_for_model_encoded_final[era_columns] = df_for_model_encoded_final[era_columns]*2
# df_for_model_encoded_final.head()

In [None]:
# runtime_columns = ['runtime_type_Epic Length Film', 'runtime_type_Extended Feature Film', 'runtime_type_Feature Film', 'runtime_type_Featurette', 'runtime_type_Short Film']
# df_for_model_encoded_final[runtime_columns] = df_for_model_encoded_final[runtime_columns]*2
# df_for_model_encoded_final.head()

In [None]:
# df_for_model_encoded_final[genre_columns] = df_for_model_encoded_final[genre_columns]*4
# df_for_model_encoded_final.head()

In [None]:
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()

In [None]:
# df_for_model_encoded_norm = scaler.fit_transform(df_for_model_encoded_final.drop('title',axis=1))
# df_norm_df = pd.DataFrame(df_for_model_encoded_norm, columns=[x for x in df_for_model_encoded_final.columns if x not in 'title'])
# df_norm_df.head()

In [None]:
# df_final = pd.concat([df_for_model_encoded_final['title'], df_norm_df], axis=1)
# df_final.head()

In [None]:
# df_final.head()

In [None]:
# set movie title as index
# df_final.set_index('title', inplace=True)

In [None]:
# output the final dataframe to a csv file for a later use
# df_final.to_csv('movie_rec_databse.csv')

In [None]:
# output the final dataframe to a csv file for a later use
df_for_model_encoded_sim.to_csv('movie_rec_databse_2.csv')

## Model Deployment

### Cosine Similarity

In content-based filtering, the most common methods for measuring how similar two items are based on their features are cosine similarity, Euclidean distance, and Jaccard similarity. <br><br>
Here we will choose **cosine similarity** because of its ability to handle sparse data and high-dimensional feature spaces effectively. 
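
As a rough illustration of what cosine similarity measures, the sketch below computes it by hand with NumPy on made-up vectors. It compares direction only, so a vector and a scaled copy of itself score 1 even though their Euclidean distance is non-zero.

```python
import numpy as np

def cosine_sim(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 2.0, 0.0])
b = 2 * a                            # same direction, twice the magnitude
c = np.array([0.0, 3.0, 0.0, 1.0])   # shares no non-zero feature with a

sim_ab = cosine_sim(a, b)            # same direction
sim_ac = cosine_sim(a, c)            # orthogonal vectors
dist_ab = float(np.linalg.norm(a - b))  # Euclidean distance still sees a difference
print(sim_ab, sim_ac, dist_ab)
```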

In [None]:
# return the top n most similar movies from the original dataframe, ranked by their cosine similarity
def get_recommendation():

    movie_name = input("Enter the movie name you are looking for: ").strip().lower().replace(' ', '')
    
    # make sure n is a valid integer between 1 and 20
    while True:
        n_input = input("How many recommendations would you like? (1-20, default is 10): ").strip()
        if n_input == "":
            n = 10
            break
        try:
            n = int(n_input)
            if n < 1 or n > 20:
                print("Please choose a number between 1 and 20.")
            else:
                break
        except ValueError:
            print("Please enter a valid integer.")

    # load the encoded dataframe for cosine similarity calculation
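    df_final = pd.read_csv('/kaggle/input/movie-recommendation-title-for-similarity-csv/movie_rec_databse_2.csv')
    # drop the unnamed index column that to_csv may have written, so it is not treated as a feature
    df_final = df_final.drop(columns=['Unnamed: 0'], errors='ignore')
    df_final.set_index('title', inplace=True)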
    # Standardize movie names in the final dataframe
    df_final.index = df_final.index.str.strip().str.lower().str.replace(' ', '')
    
    # load the original dataset
    df = pd.read_csv('/kaggle/input/tmdb-movies-dataset-2023-930k-movies/TMDB_movie_dataset_v11.csv')
    # drop duplicate values
    df.drop_duplicates(inplace=True)
    # drop null values
    df.dropna(inplace=True)
    # standardize movie names in the original dataset
    df['title'] = df['title'].str.strip().str.lower().str.replace(' ', '')
    
    if movie_name not in df_final.index:
        print(f"No match is available yet. Here are the top {n} trending movies for inspiration:")
        trending_movies = df_final.head(n)
        return trending_movies

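    # Use the first match as the query in case several movies share the same title
    new_df = df_final.loc[[movie_name]].iloc[[0]]
    # Exclude the query movie and remove rows with NaN values
    df_other = df_final.loc[df_final.index != movie_name, :].dropna()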
    # Get the titles of the other movies
    df_titles = df_other.index
    cosine_sim_matrix = cosine_similarity(new_df, df_other)
    cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=[movie_name], columns=df_titles)
    # Get the top n most similar movies
    top_n_similar = cosine_sim_df.T.sort_values(by=movie_name, ascending=False).head(n)
    # Slice out the movies' information from the original dataset by title
    top_n_similar_df = df.loc[df['title'].isin(top_n_similar.index)]
    return top_n_similar_df

# Example usage:
get_recommendation()
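
Because the function above reads its inputs interactively, the ranking step is hard to try out in isolation. The sketch below is a hypothetical, non-interactive helper (`top_n_similar` and its parameters are not part of the notebook) that shows the same idea on a made-up feature matrix using NumPy only.

```python
import numpy as np

def top_n_similar(features, titles, query_title, n=10):
    """Return (title, similarity) pairs for the n movies closest to query_title.

    Hypothetical helper: `features` holds one row per movie and `titles`
    is a list with the matching movie titles.
    """
    features = np.asarray(features, dtype=float)
    query_idx = titles.index(query_title)
    # Scale rows to unit length so that a dot product equals cosine similarity
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[query_idx]
    # Rank from most to least similar and skip the query movie itself
    order = [i for i in np.argsort(-sims, kind='stable') if i != query_idx]
    return [(titles[i], float(sims[i])) for i in order[:n]]

# Made-up titles and feature rows
titles = ['a', 'b', 'c', 'd']
features = [[4, 0, 1],
            [4, 0, 0],
            [0, 4, 1],
            [2, 2, 1]]
result = top_n_similar(features, titles, 'a', n=2)
print(result)
```

Returning the pairs in ranked order keeps the most similar movie first, whereas filtering a dataframe with `isin` returns rows in their original dataset order.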

