# Project: Movie Recommendation System

Dataset used: [Full TMDB Movies Dataset 2024 (1M Movies)](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies)

Attributes
- id: Unique identifier assigned to each movie in the TMDB database.
- title: Title of the movie.
- release_date: Date on which the movie was released.
- status: The status of the movie (e.g., Released, Rumored, Post Production, etc.)
- genres: List of genres associated with the movie.
- original_language: Language in which the movie was originally produced.
- vote_average: Average vote or rating given by viewers. 
- vote_count: Total count of votes received for the movie.
- popularity: Popularity score assigned to the movie by TMDB based on user engagement.
- overview: Brief description or summary of the movie.
- budget: Estimated budget for producing the movie in USD.
- production_companies: List of production companies involved in making the movie.
- production_countries: List of countries involved in the movie production.
- revenue: Total revenue generated by the movie in USD.
- runtime: Total runtime of the movie in minutes.
- tagline: Short, memorable phrase associated with the movie, often used in promotional material.
- adult: Indicates if the movie is suitable only for adult audiences. 
- backdrop_path: URL of the backdrop image for the movie.
- homepage: Official homepage URL of the movie.
- imdb_id: IMDb ID of the movie.
- original_title: Original title of the movie.
- poster_path: URL of the movie poster image. 
- spoken_languages: List of languages spoken in the movie.
- keywords: Keywords associated with the movie. 

## Loading data

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

In [2]:
movie_data = pd.read_csv('../data/csv_files/tmdb_movies.csv')
movie_data

Unnamed: 0,title,vote_average,vote_count,release_date,runtime,original_language,overview,popularity,poster_path,genres
0,Inception (2010),8.364,34495,2010-07-15,148,English,"Cobb, a skilled thief who commits corporate es...",83.952,https://image.tmdb.org/t/p/w300/oYuLEt3zVCKq57...,"Action, Science Fiction, Adventure"
1,Interstellar (2014),8.417,32571,2014-11-05,169,English,The adventures of a group of explorers who mak...,140.241,https://image.tmdb.org/t/p/w300/gEU2QniE6E77NI...,"Adventure, Drama, Science Fiction"
2,The Dark Knight (2008),8.512,30619,2008-07-16,152,English,Batman raises the stakes in his war on crime. ...,130.643,https://image.tmdb.org/t/p/w300/qJ2tW6WMUDux91...,"Drama, Action, Crime, Thriller"
3,Avatar (2009),7.573,29815,2009-12-15,162,English,"In the 22nd century, a paraplegic Marine is di...",79.932,https://image.tmdb.org/t/p/w300/kyeqWdyUXW608q...,"Action, Adventure, Fantasy, Science Fiction"
4,The Avengers (2012),7.710,29166,2012-04-25,143,English,When an unexpected enemy emerges and threatens...,98.082,https://image.tmdb.org/t/p/w300/RYMX2wcKCBAr24...,"Science Fiction, Action, Adventure"
...,...,...,...,...,...,...,...,...,...,...
258910,Ninja 8: Warriors of Fire (1987),1.000,1,1987-01-01,90,English,"The Black Ninja Empire want a ""confidential bl...",0.600,https://image.tmdb.org/t/p/w300/5u8ovwI0Ys9DzE...,Action
258911,Journey To Paradise (2010),1.000,1,2010-12-01,134,English,When a mysterious but attractive stranger appl...,0.709,https://image.tmdb.org/t/p/w300/uQhLo7voNY2HT1...,Romance
258912,Efficiency (2014),8.000,1,2014-06-05,87,English,A pair of irresponsible twin brothers struggle...,0.624,https://image.tmdb.org/t/p/w300/8LO1xNOXfADi4w...,Unknown
258913,L'ultimo amante (1955),6.000,1,1955-11-03,93,Italian,"Maria, a prostitute, meets Cesare in a police ...",1.325,https://image.tmdb.org/t/p/w300/kmQv0ekhDyiYj3...,Drama


In [3]:
movie_data.isnull().sum()

title                0
vote_average         0
vote_count           0
release_date         0
runtime              0
original_language    0
overview             0
popularity           0
poster_path          0
genres               0
dtype: int64

In [4]:
movie_data.dtypes

title                 object
vote_average         float64
vote_count             int64
release_date          object
runtime                int64
original_language     object
overview              object
popularity           float64
poster_path           object
genres                object
dtype: object

In [5]:
# change release_date to datetime type
movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])
movie_data.dtypes

title                        object
vote_average                float64
vote_count                    int64
release_date         datetime64[ns]
runtime                       int64
original_language            object
overview                     object
popularity                  float64
poster_path                  object
genres                       object
dtype: object

In [6]:
movie_data.describe()

Unnamed: 0,vote_average,vote_count,release_date,runtime,popularity
count,258915.0,258915.0,258915,258915.0,258915.0
mean,5.967049,82.045162,1997-06-22 09:23:51.532356096,77.137265,3.159576
min,0.0,1.0,1865-01-01 00:00:00,0.0,0.0
25%,5.0,1.0,1984-05-14 12:00:00,53.0,0.669
50%,6.0,4.0,2008-01-01 00:00:00,87.0,1.304
75%,7.0,13.0,2017-04-01 00:00:00,100.0,2.641
max,10.0,34495.0,2024-07-19 00:00:00,14400.0,2994.357
std,1.843305,663.460065,,55.595337,14.599024


In [7]:
# remove movies with runtime of 0 mins
movie_data = movie_data[movie_data['runtime']>0]
movie_data.describe()

Unnamed: 0,vote_average,vote_count,release_date,runtime,popularity
count,248045.0,248045.0,248045,248045.0,248045.0
mean,5.962869,85.493632,1997-04-01 17:24:08.142071168,80.517628,3.241357
min,0.0,1.0,1865-01-01 00:00:00,1.0,0.0
25%,5.0,1.0,1983-12-07 00:00:00,60.0,0.68
50%,6.0,4.0,2007-11-08 00:00:00,88.0,1.342
75%,7.0,13.0,2017-03-23 00:00:00,100.0,2.728
max,10.0,34495.0,2024-07-19 00:00:00,14400.0,2994.357
std,1.821124,677.63029,,54.351735,14.875884


In [8]:
# find movies with popularity of 0
movie_data[movie_data['popularity']==0]

Unnamed: 0,title,vote_average,vote_count,release_date,runtime,original_language,overview,popularity,poster_path,genres
137384,Peter Griffin Seeks Fitness Advice from Meowsc...,4.0,3,2023-12-03,1,English,Peter Griffin seeks fitness advice and finds h...,0.0,https://image.tmdb.org/t/p/w300/qDeFbp67axAAW5...,"Animation, Comedy"
140301,A Love Story (2022),10.0,3,2022-04-04,3,English,He loves her more than anything in the world.....,0.0,https://image.tmdb.org/t/p/w300/dYd6F7zvtOoyNi...,"Horror, Romance, Drama, Thriller"
159128,Love Behind the Barrier (2022),9.5,2,2022-10-13,8,English,A teenage girl faces her obstacle in her love ...,0.0,https://image.tmdb.org/t/p/w300/8Q35KAdWkHl6PJ...,Drama
159682,AL7O6H (2023),10.0,2,2023-12-27,5,Arabic,"The film talks about two friends, STEAL THE BA...",0.0,https://image.tmdb.org/t/p/w300/newad8N9kKG00f...,"Comedy, Crime"
160446,Doru: Adventure Island (2023),5.5,2,2023-08-30,62,Turkish,Doru and his friends play games to determine w...,0.0,https://image.tmdb.org/t/p/w300/hgHPO3zwa1fbxO...,"Animation, Family"
...,...,...,...,...,...,...,...,...,...,...
258306,Trap (2024),7.0,1,2024-03-16,5,English,1 of 10 people suffer various types of brain d...,0.0,https://image.tmdb.org/t/p/w300/ge1XUksa4VOZcO...,Mystery
258346,Live At Cragg Vale (2014),10.0,1,2014-06-28,63,English,"""The show happened during Midsummer with the b...",0.0,https://image.tmdb.org/t/p/w300/fkBcBHLKJp9YOF...,Music
258484,Nemophila 4th Anniversary -Rizing Nemo- (2023),10.0,1,2023-11-08,123,Japanese,Nemophila's performance at Tokyo Garden Theate...,0.0,https://image.tmdb.org/t/p/w300/omybIBH1NcPixC...,Music
258572,Dead Serious (2024),8.0,1,2024-02-25,120,English,A romantic comedy starring Nollywood veteran N...,0.0,https://image.tmdb.org/t/p/w300/wiQ9QND4rWnpms...,"Comedy, Romance"


In [9]:
# remove movies with popularity of 0
movie_data = movie_data[movie_data['popularity']>0].reset_index(drop=True)

In [10]:
len(movie_data)

247451

## Preprocessing movie title and overview

In [12]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [13]:
def clean_title(title):
    title = re.sub("[^a-zA-Z0-9 ]","", title)
    title = title.lower()
    return title
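
To see what `clean_title` keeps and drops, the short check below reruns the same regex on a few titles. Two of the sample titles appear in outputs in this notebook; `Amélie (2001)` is an assumed example added to show that the ASCII-only character class also strips accented letters, which may matter for non-English titles.

```python
import re

def clean_title(title):
    # Same logic as the notebook's clean_title: keep ASCII letters, digits and spaces
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title.lower()

# "Amélie (2001)" is an illustrative title, not taken from the dataset output
samples = ["Monsters, Inc. (2001)", "WALL·E (2008)", "Amélie (2001)"]
cleaned = {t: clean_title(t) for t in samples}
for original, result in cleaned.items():
    print(f"{original!r} -> {result!r}")
```

If accented characters should survive, one option would be to normalize them to ASCII before applying the regex; that is a design choice the notebook does not make.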

In [14]:
# Function to preprocess and lemmatize text
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Remove non-alphanumeric characters
    text = re.sub("[^a-zA-Z0-9 ]", "", text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize text
    words = word_tokenize(text)
    # Remove stopwords and lemmatize words
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # Join words back into a single string
    text = ' '.join(words)
    return text

In [15]:
movie_data['title_processed'] = movie_data['title'].apply(clean_title)
movie_data.head()

Unnamed: 0,title,vote_average,vote_count,release_date,runtime,original_language,overview,popularity,poster_path,genres,title_processed
0,Inception (2010),8.364,34495,2010-07-15,148,English,"Cobb, a skilled thief who commits corporate es...",83.952,https://image.tmdb.org/t/p/w300/oYuLEt3zVCKq57...,"Action, Science Fiction, Adventure",inception 2010
1,Interstellar (2014),8.417,32571,2014-11-05,169,English,The adventures of a group of explorers who mak...,140.241,https://image.tmdb.org/t/p/w300/gEU2QniE6E77NI...,"Adventure, Drama, Science Fiction",interstellar 2014
2,The Dark Knight (2008),8.512,30619,2008-07-16,152,English,Batman raises the stakes in his war on crime. ...,130.643,https://image.tmdb.org/t/p/w300/qJ2tW6WMUDux91...,"Drama, Action, Crime, Thriller",the dark knight 2008
3,Avatar (2009),7.573,29815,2009-12-15,162,English,"In the 22nd century, a paraplegic Marine is di...",79.932,https://image.tmdb.org/t/p/w300/kyeqWdyUXW608q...,"Action, Adventure, Fantasy, Science Fiction",avatar 2009
4,The Avengers (2012),7.71,29166,2012-04-25,143,English,When an unexpected enemy emerges and threatens...,98.082,https://image.tmdb.org/t/p/w300/RYMX2wcKCBAr24...,"Science Fiction, Action, Adventure",the avengers 2012


In [16]:
movie_data['overview_processed'] = movie_data['overview'].apply(preprocess_text)
movie_data.head()

Unnamed: 0,title,vote_average,vote_count,release_date,runtime,original_language,overview,popularity,poster_path,genres,title_processed,overview_processed
0,Inception (2010),8.364,34495,2010-07-15,148,English,"Cobb, a skilled thief who commits corporate es...",83.952,https://image.tmdb.org/t/p/w300/oYuLEt3zVCKq57...,"Action, Science Fiction, Adventure",inception 2010,cobb skilled thief commits corporate espionage...
1,Interstellar (2014),8.417,32571,2014-11-05,169,English,The adventures of a group of explorers who mak...,140.241,https://image.tmdb.org/t/p/w300/gEU2QniE6E77NI...,"Adventure, Drama, Science Fiction",interstellar 2014,adventure group explorer make use newly discov...
2,The Dark Knight (2008),8.512,30619,2008-07-16,152,English,Batman raises the stakes in his war on crime. ...,130.643,https://image.tmdb.org/t/p/w300/qJ2tW6WMUDux91...,"Drama, Action, Crime, Thriller",the dark knight 2008,batman raise stake war crime help lt jim gordo...
3,Avatar (2009),7.573,29815,2009-12-15,162,English,"In the 22nd century, a paraplegic Marine is di...",79.932,https://image.tmdb.org/t/p/w300/kyeqWdyUXW608q...,"Action, Adventure, Fantasy, Science Fiction",avatar 2009,22nd century paraplegic marine dispatched moon...
4,The Avengers (2012),7.71,29166,2012-04-25,143,English,When an unexpected enemy emerges and threatens...,98.082,https://image.tmdb.org/t/p/w300/RYMX2wcKCBAr24...,"Science Fiction, Action, Adventure",the avengers 2012,unexpected enemy emerges threatens global safe...


In [17]:
import pickle
import gzip

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorize movie titles and overviews using TF-IDF
vectorizer_title = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix_title = vectorizer_title.fit_transform(movie_data['title_processed'])

vectorizer_overview = TfidfVectorizer(ngram_range=(1, 1), max_features=500)
tfidf_matrix_overview = vectorizer_overview.fit_transform(movie_data['overview_processed'])
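
As a rough illustration of what the two vectorizers compute, the sketch below builds TF-IDF vectors by hand for three toy titles and compares them with cosine similarity. It assumes whitespace tokenization and the smoothed IDF `ln((1 + n) / (1 + df)) + 1` with L2 normalization, which is scikit-learn's documented default. The helper names (`ngrams`, `tfidf_vectors`, `cosine`) and the toy corpus are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def ngrams(text, n_min=1, n_max=2):
    # Word n-grams, analogous to ngram_range=(1, 2) used for titles
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

def tfidf_vectors(docs):
    tokenized = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for counts in tokenized for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for counts in tokenized:
        vec = {t: tf * idf[t] for t, tf in counts.items()}
        # Guard against empty documents, which would have a zero norm
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    # Vectors are already L2-normalized, so the dot product is the cosine
    return sum(w * b.get(t, 0.0) for t, w in a.items())

titles = ["barbie 2023", "barbie mariposa 2008", "the dark knight 2008"]
vecs = tfidf_vectors(titles)
sim_barbie = cosine(vecs[0], vecs[1])     # shares the term "barbie"
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms
print(round(sim_barbie, 3), round(sim_unrelated, 3))
```

The overview vectorizer additionally caps its vocabulary with `max_features=500`, which this sketch does not model.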

In [19]:
# Save the TF-IDF vectorizers and matrices
with gzip.open('../data/pkl_files/tfidf_vectorizer_title.pkl.gz', 'wb') as f:
    pickle.dump(vectorizer_title, f)

with gzip.open('../data/pkl_files/tfidf_matrix_title.pkl.gz', 'wb') as f:
    pickle.dump(tfidf_matrix_title, f)

with gzip.open('../data/pkl_files/tfidf_vectorizer_overview.pkl.gz', 'wb') as f:
    pickle.dump(vectorizer_overview, f)

with gzip.open('../data/pkl_files/tfidf_matrix_overview.pkl.gz', 'wb') as f:
    pickle.dump(tfidf_matrix_overview, f)

In [20]:
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import fuzz

In [22]:
# Load the vectorizers and TF-IDF matrices
with gzip.open('../data/pkl_files/tfidf_vectorizer_title.pkl.gz', 'rb') as f:
    vectorizer_title = pickle.load(f)

with gzip.open('../data/pkl_files/tfidf_matrix_title.pkl.gz', 'rb') as f:
    tfidf_matrix_title = pickle.load(f)

with gzip.open('../data/pkl_files/tfidf_vectorizer_overview.pkl.gz', 'rb') as f:
    vectorizer_overview = pickle.load(f)

with gzip.open('../data/pkl_files/tfidf_matrix_overview.pkl.gz', 'rb') as f:
    tfidf_matrix_overview= pickle.load(f)
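
The save and load cells rely on `pickle` writing to and reading from a `gzip` file object. A quick round-trip check like the one below can confirm that what is loaded equals what was saved. It uses a temporary directory and a small dictionary as a stand-in for the fitted vectorizer and sparse matrix, so the file name and payload are hypothetical.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in payload; the notebook pickles fitted vectorizers and sparse matrices
payload = {"vocabulary": {"barbie": 0, "2023": 1}, "shape": (2, 2)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl.gz")
    # Write: pickle.dump streams into the gzip-compressed file
    with gzip.open(path, "wb") as f:
        pickle.dump(payload, f)
    # Read: gzip.open decompresses transparently before pickle.load
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.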

In [23]:
def search_movie_title(df, search_title, k=10):
    
    # Clean the input title
    cleaned_search_title = clean_title(search_title)
    
    # Vectorize the input title
    query_vect_title = vectorizer_title.transform([cleaned_search_title])
    
    # Calculate cosine similarity between the input title and all movie titles
    similarity_title = cosine_similarity(query_vect_title, tfidf_matrix_title).flatten()
    
    # Fuzzy matching to account for minor title variations
    fuzzy_scores = np.array([fuzz.ratio(cleaned_search_title, t) for t in df['title_processed']])
    # Scale each cosine score by a factor between 0.8 and 1.0 based on the fuzzy score
    combined_similarity = similarity_title * (0.8 + 0.2 * (fuzzy_scores / 100))
    
    # Get the indices of the top k similar movies
    top_indices = combined_similarity.argsort()[-k:][::-1]
    
    # Retrieve the top k similar movie titles
    search_movie_results = df.iloc[top_indices]
    search_movie_results = search_movie_results.sort_values(by='release_date', ascending=False)
    return search_movie_results[['title', 'overview', 'runtime', 'vote_average', 'release_date', 'poster_path']]
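
The weighting inside `search_movie_title` scales each cosine score by a factor between 0.8 and 1.0, depending on the fuzzy score. The sketch below works through that arithmetic. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the fuzzy values only approximate what fuzzywuzzy would return, and `fuzzy_ratio` and `combine` are hypothetical helper names.

```python
from difflib import SequenceMatcher

def fuzzy_ratio(a, b):
    # Stand-in for fuzz.ratio: similarity on a 0-100 scale (assumption, not fuzzywuzzy itself)
    return 100 * SequenceMatcher(None, a, b).ratio()

def combine(cosine_score, fuzzy_score):
    # Same formula as the notebook: cosine * (0.8 + 0.2 * fuzzy / 100)
    return cosine_score * (0.8 + 0.2 * fuzzy_score / 100)

# Identical strings: fuzzy score 100, so the cosine score is kept in full
exact = combine(1.0, fuzzy_ratio("barbie 2023", "barbie 2023"))
# Cosine 0.5 with fuzzy score 60: 0.5 * (0.8 + 0.12) = 0.46
partial = combine(0.5, 60)
# Zero cosine similarity stays zero, whatever the fuzzy score
no_overlap = combine(0.0, 90)
print(exact, partial, no_overlap)
```

Because the fuzzy term only multiplies the cosine score, a title with no TF-IDF overlap stays at zero however close the strings are, so the fuzzy score mainly reorders candidates that already share tokens with the query.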


In [24]:
title_search_results_df = pd.DataFrame(search_movie_title(movie_data, "Barbie"))
title_search_results_df

Unnamed: 0,title,overview,runtime,vote_average,release_date,poster_path
211027,Barbie and Me (2023),"An abstract exploration of self-image, societa...",1,10.0,2023-07-27,https://image.tmdb.org/t/p/w300/vnigThZe10z6Wf...
825,Barbie (2023),Barbie and Ken are having the time of their li...,114,7.279,2023-07-19,https://image.tmdb.org/t/p/w300/iuFNMS8U5cb6xf...
181249,Barbie & Bob (2020),A young couple spends the night in a motel roo...,23,8.5,2020-08-23,https://image.tmdb.org/t/p/w300/5l47mSRnDHf2Dl...
128338,Black Barbie (2016),​Black Barbie is a spoken/poetry animation tha...,4,5.5,2016-10-01,https://image.tmdb.org/t/p/w300/uk16r3AlxVA4ty...
53892,Barbie Dreamtopia (2016),"Join Barbie, Chelsea, and her puppy Honey as t...",44,6.559,2016-06-26,https://image.tmdb.org/t/p/w300/ewsAXj1IbpnAn4...
152532,Barbie Boy (2014),"Bobby is a bright, imaginative seven-year-old ...",13,7.3,2014-11-14,https://image.tmdb.org/t/p/w300/bq4Ntj38TfMWDc...
36470,Barbie (2011),Soon-young lives with her mentally handicapped...,97,7.3,2011-10-07,https://image.tmdb.org/t/p/w300/qoFs8y8B7ysgCE...
6618,Barbie Mariposa (2008),"Elina, heroine of the Fairytopia films tells h...",75,6.8,2008-02-26,https://image.tmdb.org/t/p/w300/qsb1OQCNVMAX3K...
5173,Barbie: Fairytopia (2005),Elina is a flower fairy who discovers that her...,70,6.7,2005-03-08,https://image.tmdb.org/t/p/w300/a0VPQHpLNCWWmi...
45839,Barbie (1977),Barbie comes home from shopping. She takes her...,10,6.7,1977-01-01,https://image.tmdb.org/t/p/w300/A1NvddoqyBjaIf...


In [25]:
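# function to find the movies whose overviews are most similar to the best title match
# (the query movie is dropped after the top k are taken, so k-1 rows are typically returned)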

def search_movie_overview(df, movie_title, k=10):
    similar_movie = search_movie_title(df, movie_title, k=1)
    idx = similar_movie.index[0]  # Get the index of the most similar movie

    query_vect_overview = tfidf_matrix_overview[idx]
    similarity_overview = cosine_similarity(query_vect_overview.reshape(1, -1), tfidf_matrix_overview).flatten()
    top_indices = similarity_overview.argsort()[-k:][::-1]
    top_indices = top_indices[top_indices != idx]  # Exclude the original movie
    search_movie_results = df.iloc[top_indices]
    search_movie_results = search_movie_results.sort_values(by='release_date', ascending=False)
    
    return search_movie_results[['title', 'overview', 'runtime', 'vote_average', 'release_date', 'poster_path']]
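
Because the query movie is removed after the top `k` indices are taken, `search_movie_overview` usually returns `k - 1` rows (nine in the output that follows). One possible adjustment, sketched here on a plain list of scores, is to take `k + 1` candidates, drop the query index, and trim to `k`. The helper `top_k_excluding` and the score values are hypothetical.

```python
def top_k_excluding(scores, query_idx, k):
    # Rank all positions by score, highest first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Take one extra candidate so that dropping the query still leaves k results
    candidates = ranked[:k + 1]
    return [i for i in candidates if i != query_idx][:k]

scores = [0.10, 1.00, 0.75, 0.40, 0.90, 0.05]  # position 1 is the query itself
top3 = top_k_excluding(scores, query_idx=1, k=3)
print(top3)
```

The final trim to `k` also covers the case where the query is not among the candidates, for example when several movies tie at the maximum score.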

In [26]:
overview_search_results_df = pd.DataFrame(search_movie_overview(movie_data, "Oppenheimer"))
overview_search_results_df

Unnamed: 0,title,overview,runtime,vote_average,release_date,poster_path
189566,WW1 - War on Two Wheels (2021),A fascinating insight into the role of the bic...,51,7.0,2021-05-07,https://image.tmdb.org/t/p/w300/238pk7JX2dEw4C...
108175,Unbanned: The Legend of AJ1 (2018),Unbanned explores the dynamic life of AJ1 from...,90,6.8,2018-04-22,https://image.tmdb.org/t/p/w300/u0gL9o1b3kgxPV...
165582,Stalin's James Bond (2017),An account of the troubled life of Richard Sor...,53,6.0,2017-12-30,https://image.tmdb.org/t/p/w300/j523NJfe7KRUfB...
66204,Kanche (2015),A love story played out against the backdrop o...,119,4.8,2015-10-23,https://image.tmdb.org/t/p/w300/xKYefD66zwrx2D...
84875,Homely Meals (2014),"""Homely Meals"" is an entertainer starring Atle...",142,5.6,2014-10-03,https://image.tmdb.org/t/p/w300/g7KMc8VZ54uR2m...
5988,The Counterfeiters (2007),The story of Jewish counterfeiter Salomon Soro...,98,7.377,2007-03-22,https://image.tmdb.org/t/p/w300/Af6I9RZF0SIPeN...
65385,The Seven Cervi Brothers (1968),The story of the Cervi family. Rural farmers b...,105,7.1,1968-02-16,https://image.tmdb.org/t/p/w300/jZ07F5Zw5nYrjv...
24481,The Two Marshals (1961),September 1943: in the general confusion a thi...,90,7.1,1961-12-22,https://image.tmdb.org/t/p/w300/321MOyql8LwOYZ...
164891,Pasteur (1935),Guitry reprises his role as Pasteur which he p...,75,6.0,1935-09-20,https://image.tmdb.org/t/p/w300/9Ukm0obA1Exvju...


## Recommend movies based on genre

In [27]:
movie_data[["genres"]]

Unnamed: 0,genres
0,"Action, Science Fiction, Adventure"
1,"Adventure, Drama, Science Fiction"
2,"Drama, Action, Crime, Thriller"
3,"Action, Adventure, Fantasy, Science Fiction"
4,"Science Fiction, Action, Adventure"
...,...
247446,Action
247447,Romance
247448,Unknown
247449,Drama


Users can select multiple genres.

In [28]:
def recommend_movies_by_genres(df, user_genres, k=10):
    # Create a filter mask for movies that match any of the specified genres
    genre_mask = df['genres'].apply(lambda genres: any(genre.lower() in genres.lower() for genre in user_genres))
    
    # Filter movies by genres
    filtered_movies = df[genre_mask]
    
    # Sort by vote_count then vote_average in descending order
    sorted_movies = filtered_movies.sort_values(by=['vote_count', 'vote_average'], ascending=[False, False])
    
    # Get top k movies
    top_movies = sorted_movies.head(k)
    
    # sort by release date
    top_movies = top_movies.sort_values(by='release_date', ascending=False)

    return top_movies[['title', 'overview', 'runtime', 'vote_average', 'release_date', 'poster_path']]
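
The mask in `recommend_movies_by_genres` uses `any`, so a movie qualifies when at least one selected genre appears in its `genres` string. If the multiselect should instead mean "movies that have every selected genre", `all` is the alternative. The sketch below contrasts the two on exact genre tokens rather than substrings. The helper `matches_genres` is hypothetical; the two sample rows reuse genre strings shown earlier in this notebook.

```python
def matches_genres(genres_str, user_genres, require_all=False):
    # Split "Action, Science Fiction" into exact, lower-cased genre tokens
    movie_genres = {g.strip().lower() for g in genres_str.split(",")}
    wanted = [g.lower() for g in user_genres]
    check = all if require_all else any
    return check(g in movie_genres for g in wanted)

rows = {
    "Inception (2010)": "Action, Science Fiction, Adventure",
    "The Dark Knight (2008)": "Drama, Action, Crime, Thriller",
}
selected = ["Action", "Crime"]

# any: at least one selected genre is present
any_match = [t for t, g in rows.items() if matches_genres(g, selected)]
# all: every selected genre is present
all_match = [t for t, g in rows.items() if matches_genres(g, selected, require_all=True)]
print(any_match)
print(all_match)
```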

In [29]:
top_genre_movies = pd.DataFrame(recommend_movies_by_genres(movie_data, ["Animation"]))
top_genre_movies

Unnamed: 0,title,overview,runtime,vote_average,release_date,poster_path
66,Coco (2017),Despite his family’s baffling generations-old ...,105,8.222,2017-10-27,https://image.tmdb.org/t/p/w300/gGEsBPAijhVUFo...
43,Inside Out (2015),"Growing up can be a bumpy road, and it's no ex...",95,7.922,2015-06-09,https://image.tmdb.org/t/p/w300/2H1TmgdfNtsKlU...
51,Up (2009),Carl Fredricksen spent his entire life dreamin...,96,7.949,2009-05-28,https://image.tmdb.org/t/p/w300/vpbaStTMt8qqXa...
69,WALL·E (2008),What if mankind had to leave Earth and somebod...,98,8.078,2008-06-22,https://image.tmdb.org/t/p/w300/hbhFnRzzg6ZDmm...
79,The Incredibles (2004),Bob Parr has given up his superhero days to lo...,115,7.704,2004-10-27,https://image.tmdb.org/t/p/w300/2LqaLgk4Z226Kk...
61,Finding Nemo (2003),"Nemo, an adventurous young clownfish, is unexp...",100,7.824,2003-05-30,https://image.tmdb.org/t/p/w300/ggQ6o8X5984OCh...
72,"Monsters, Inc. (2001)",Lovable Sulley and his wisecracking sidekick M...,92,7.835,2001-11-01,https://image.tmdb.org/t/p/w300/sgheSKxZkttIe8...
97,Shrek (2001),It ain't easy bein' green -- especially if you...,90,7.73,2001-05-18,https://image.tmdb.org/t/p/w300/dyhaB19AICF7TO...
73,Toy Story (1995),"Led by Woody, Andy's toys live happily in his ...",81,7.971,1995-10-30,https://image.tmdb.org/t/p/w300/uXDfjJbdP4ijW5...
76,The Lion King (1994),A young lion prince is cast out of his pride b...,89,8.256,1994-06-24,https://image.tmdb.org/t/p/w300/sKCr78MXSLixwm...


In [30]:
def get_unique_genres(df, genre_column='genres'):
    # Split the genres into a list of lists
    all_genres = df[genre_column].str.split(', ').tolist()
    
    # Flatten the list of lists into a single list of genres
    flat_genres = [genre for sublist in all_genres for genre in sublist]
    
    # Get unique genres using a set
    unique_genres = sorted(set(flat_genres))
    
    # Exclude 'Unknown' genre
    unique_genres = [genre for genre in unique_genres if genre.lower() != 'unknown']
    
    return unique_genres

# Example Usage
unique_genres = get_unique_genres(movie_data)
print("Unique Genres:")
print(unique_genres)

Unique Genres:
['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']


## Recommend movies based on language

Recommend the top 10 movies for the specified language, ranked by vote count and then by average vote. The language is chosen from a drop-down list.

In [31]:
def get_unique_languages(df, language_column='original_language'):
    # Get unique languages using a set
    unique_languages = sorted(set(df[language_column]))
    
    # Exclude 'Unknown' language if it exists
    unique_languages = [lang for lang in unique_languages if not lang.lower().startswith('unknown language')]
    
    return unique_languages

unique_lang_list = get_unique_languages(movie_data)
print("Unique Languages:")
print(unique_lang_list)

Unique Languages:
['Abkhazian', 'Afrikaans', 'Akan', 'Albanian', 'Amharic', 'Arabic', 'Armenian', 'Assamese', 'Aymara', 'Azerbaijani', 'Bambara', 'Bangla', 'Basque', 'Belarusian', 'Bislama', 'Bosnian', 'Bulgarian', 'Burmese', 'Catalan', 'Chinese', 'Cornish', 'Cree', 'Croatian', 'Czech', 'Danish', 'Divehi', 'Dutch', 'Dzongkha', 'English', 'Esperanto', 'Estonian', 'Faroese', 'Filipino', 'Finnish', 'French', 'Fula', 'Galician', 'Georgian', 'German', 'Greek', 'Guarani', 'Gujarati', 'Haitian Creole', 'Hausa', 'Hebrew', 'Hindi', 'Hungarian', 'Icelandic', 'Igbo', 'Indonesian', 'Interlingue', 'Inuktitut', 'Inupiaq', 'Irish', 'Italian', 'Japanese', 'Javanese', 'Kalaallisut', 'Kannada', 'Kashmiri', 'Kazakh', 'Khmer', 'Kinyarwanda', 'Korean', 'Kurdish', 'Kyrgyz', 'Lao', 'Latin', 'Latvian', 'Limburgish', 'Lingala', 'Lithuanian', 'Luxembourgish', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Marathi', 'Mongolian', 'Māori', 'Navajo', 'Nepali', 'North Ndebele', 'Northern Sami', 'Norwegi

The list above supplies the options for the language drop-down.

In [32]:
def recommend_movies_by_language(df, language):
    # Get the movies whose original language matches the specified language
    lang_movies = df[df['original_language'].str.contains(language)]

    # Sort the movies by vote count, then by average vote, in descending order
    sorted_lang_movies = lang_movies.sort_values(by=['vote_count', 'vote_average'], ascending=[False, False])

    # Select the top 10 movies
    top_10_lang_movies = sorted_lang_movies.head(10)

    # sort by release date
    top_10_lang_movies = top_10_lang_movies.sort_values(by='release_date', ascending=False)

    return top_10_lang_movies[['title', 'overview', 'runtime', 'vote_average', 'release_date', 'poster_path']]
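
Both `recommend_movies_by_genres` and `recommend_movies_by_language` rank in two stages: they pick the top rows by `vote_count` (with `vote_average` as the tie-breaker) and then reorder only those rows by `release_date` for display. The pure-Python sketch below mirrors that two-stage sort on toy records. The titles, vote counts, ratings, and the helper name `top_then_chronological` are all made up for illustration.

```python
from datetime import date

# Toy records: (title, vote_count, vote_average, release_date); values are made up
movies = [
    ("A", 500, 7.1, date(2003, 5, 2)),
    ("B", 500, 8.0, date(2019, 5, 30)),
    ("C", 900, 6.9, date(2013, 8, 1)),
    ("D", 20, 9.5, date(2021, 1, 1)),
]

def top_then_chronological(records, k):
    # Stage 1: most-voted first, vote_average as the tie-breaker
    by_votes = sorted(records, key=lambda r: (r[1], r[2]), reverse=True)[:k]
    # Stage 2: newest first, applied only to the selected rows
    return sorted(by_votes, key=lambda r: r[3], reverse=True)

result = [r[0] for r in top_then_chronological(movies, k=3)]
print(result)
```

A highly rated but rarely voted title such as `D` never enters the list, which is a consequence of ranking on `vote_count` first.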

In [33]:
recommend_movies_by_language(movie_data, "Korean")

Unnamed: 0,title,overview,runtime,vote_average,release_date,poster_path
2009,Peninsula (2020),A soldier and his team battle hordes of post-a...,115,6.8,2020-07-15,https://image.tmdb.org/t/p/w300/eeqvAzCccAZOhU...
2687,#Alive (2020),"As a grisly virus rampages a city, a lone man ...",98,7.283,2020-06-24,https://image.tmdb.org/t/p/w300/zqf711LsnQ5CcW...
81,Parasite (2019),"All unemployed, Ki-taek's family takes peculia...",133,8.515,2019-05-30,https://image.tmdb.org/t/p/w300/7IiTTgloJzvGI1...
536,Train to Busan (2016),When a zombie virus pushes Korea into a state ...,118,7.8,2016-07-20,https://image.tmdb.org/t/p/w300/vNVFt6dtcqnI7h...
1351,The Handmaiden (2016),"In 1930s Korea, a swindler and a young woman p...",145,8.246,2016-06-01,https://image.tmdb.org/t/p/w300/8MnMGO3oALkaia...
346,Snowpiercer (2013),In a future where a failed global-warming expe...,127,6.902,2013-08-01,https://image.tmdb.org/t/p/w300/9JPx09Rr0Txq2e...
1940,I Saw the Devil (2010),Kyung-Chul is a dangerous psychopath who kills...,144,7.802,2010-08-12,https://image.tmdb.org/t/p/w300/zp5NrmYp80axIG...
1778,The Host (2006),A teenage girl is captured by a giant mutated ...,119,6.982,2006-07-27,https://image.tmdb.org/t/p/w300/dEDLY3KeghKFzk...
415,Oldboy (2003),"With no clue how he came to be imprisoned, dru...",120,8.274,2003-11-21,https://image.tmdb.org/t/p/w300/pWDtjs568ZfOTM...
1325,Memories of Murder (2003),"During the late 1980s, two detectives in a Sou...",131,8.066,2003-05-02,https://image.tmdb.org/t/p/w300/lp3Qzzq1zzy6To...


## Saving into pickle files

In [35]:
# save the full dataframe used for title search and overview-based recommendations -> gzip compressed pickle file
with gzip.open('../data/pkl_files/movie_overall_data.pkl.gz', 'wb') as f:
    pickle.dump(movie_data, f)

# save dataframe with columns relevant to recommending movies by selected genres -> gzip compressed pickle file
genres_recommendation_columns = ['title', 'overview', 'runtime', 'vote_average', 'release_date', 'poster_path', 'genres', 'vote_count']
genres_recommendation_data = movie_data[genres_recommendation_columns]

with gzip.open('../data/pkl_files/movie_genres_rec_data.pkl.gz', 'wb') as f:
    pickle.dump(genres_recommendation_data, f)

# save dataframe with columns relevant to recommending movies by selected language -> gzip compressed pickle file
language_recommendation_columns = ['title', 'overview', 'runtime', 'vote_average', 'release_date', 'poster_path', 'original_language', 'vote_count']
language_recommendation_data = movie_data[language_recommendation_columns]

with gzip.open('../data/pkl_files/movie_language_rec_data.pkl.gz', 'wb') as f:
    pickle.dump(language_recommendation_data, f)