### CSCN8010 - Fall 2023 - Foundations of Machine Learning Frameworks Assignment 2 - Final Project
#### Project Name : Popcorn Pilot
 - Authors
 - Sam Hussain Hajanajumudeen (8901770)
 - Rohit Khadka (8899399)

<b>1. Clear Problem Statement (Objective, Motivation, Method):</b>

Objective: The project, named "Popcorn Pilot," aims to develop a movie recommendation system. Given a movie name as input, the system will suggest other movies based on various factors such as genre, story, and plot.

Motivation: The motivation behind this project is to enhance the movie-watching experience by providing personalized recommendations, making it easier for users to discover films that align with their preferences.

Method: The recommendation system will be built using machine learning techniques. The model will be trained on a dataset containing information about movies from IMDB, considering features like genre, story details, and plot summaries.

<b>2. Interesting Choice of Problem</b>

The problem of movie recommendation is interesting as it involves applying machine learning to provide personalized suggestions, making it engaging for users.

<b>3. Reasonable (Doable) Choice of Problem</b>

The choice of building a movie recommendation system based on IMDB data is reasonable and feasible, considering the availability of a rich dataset and established methods for building such systems.

<b>4. Review of the Data Source</b>

The dataset for this project will be sourced from IMDB, a reputable and comprehensive database of movies. It's crucial to highlight the characteristics and size of the dataset, ensuring it contains the necessary information for effective model training.

<b>5. Indication of Reference Code</b>

I will refer to relevant machine learning and recommendation system code examples, libraries, and frameworks, for example Python with libraries such as scikit-learn.

<b>Proposal :</b>

I am proposing a project titled "Popcorn Pilot" which involves the development of a movie recommendation system. The objective is to provide users with personalized movie suggestions based on genre, story, and plot details. The motivation behind this project is to enhance the overall movie-watching experience by offering tailored recommendations.

The chosen method for building the recommendation system involves utilizing machine learning techniques. The model will be trained on a dataset sourced from IMDB, a renowned database of movies. This dataset will contain crucial information such as genre, story details, and plot summaries.

The problem of movie recommendation is interesting, as it not only involves applying machine learning concepts but also contributes to user satisfaction by simplifying the process of discovering movies aligned with their preferences. The choice of using IMDB data is reasonable and feasible, considering the availability of a comprehensive dataset and established methods for building recommendation systems. 

To ensure the successful implementation of the project, I plan to review relevant machine learning and recommendation system code examples, leveraging Python and possibly libraries like scikit-learn or TensorFlow.

This project holds the potential to provide a valuable and enjoyable service to movie enthusiasts, simplifying their movie selection process and introducing them to films they might have otherwise missed.


### 1. Downloading the Dataset from Kaggle

In [3]:
import kaggle

In [4]:
!kaggle datasets download -d rounakbanik/the-movies-dataset

Downloading the-movies-dataset.zip to d:\Fall_2023\course\AI_ML_8010\Final Project\project\popcorn-pilot





### 2. Extracting Dataset

In [7]:
!powershell Expand-Archive -Path 'the-movies-dataset.zip' -DestinationPath './content/data'
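
The PowerShell command above only runs on Windows. As a minimal cross-platform sketch, the same extraction can be done with Python's standard `zipfile` module. The helper name `extract_archive` is hypothetical (it is not part of the original notebook); the paths in the usage comment match the cell above.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage in the notebook (same paths as the PowerShell cell):
# extract_archive('the-movies-dataset.zip', './content/data')
```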


#### Installing required packages

In [8]:
!pip install --quiet fastparquet
!pip install --quiet pyarrow




##### Importing Required Packages

In [9]:
%matplotlib inline
import pandas as pd
import numpy as np

from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem.snowball import SnowballStemmer

import pyarrow as pa
import pyarrow.parquet as pq

import warnings
warnings.simplefilter('ignore')

### 3: Data Cleaning & Engineering

>     Information on preparing data for trainable features

#### Utility Functions For Data Cleaning & Engineering

In [10]:
def get_director(x):
    """
    Return the name of the first crew member whose job is 'Director', or NaN if no director is listed.
    """
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

#### Reading dataset and merging them to form master dataset

In [11]:
movies_dataset = pd.read_csv('./content/data/movies_metadata.csv')
credits = pd.read_csv('./content/data/credits.csv')
keywords = pd.read_csv('./content/data/keywords.csv')
links = pd.read_csv('./content/data/links.csv')

In [12]:
# Drop 3 malformed rows whose 'id' field holds a date string instead of an integer ID (they would break the astype('int') cast below).
movies_dataset = movies_dataset.drop([19730, 29503, 35587])

In [13]:
# Extracting Genres of movies from the genres dictionary. If not present, append empty list
movies_dataset['genres'] = movies_dataset['genres'].fillna('[]').apply(
    literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
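
To make the parsing step concrete, here is a small standalone sketch of how a stringified list of dictionaries becomes a plain list of names. The helper `parse_names` is hypothetical and slightly more defensive than the one-liner above: it returns an empty list for malformed input instead of raising.

```python
from ast import literal_eval


def parse_names(raw):
    """Return the 'name' values from a stringified list of dicts, or [] if unusable."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # malformed or non-literal input
    if not isinstance(parsed, list):
        return []
    return [item['name'] for item in parsed
            if isinstance(item, dict) and 'name' in item]


print(parse_names("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"))
# ['Animation', 'Comedy']
print(parse_names('[]'))
# []
```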

In [14]:
# Convert to common data type for primary key in our dataset
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
movies_dataset['id'] = movies_dataset['id'].astype('int')

In [15]:
# Merging movies dataset with credits & keywords to form master dataset
movies_dataset = movies_dataset.merge(credits, on='id')
master_dataset = movies_dataset.merge(keywords, on='id')
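
A note on what these merges do: `DataFrame.merge` defaults to an inner join, so movies without a matching `id` in the other table are dropped, and an `id` that appears more than once in either table produces repeated rows. The toy frames below are made-up illustration data, not rows from the real dataset.

```python
import pandas as pd

movies = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
credits = pd.DataFrame({'id': [1, 2, 2], 'cast': ['x', 'y', 'y']})

# how='inner' is the default: id 3 has no match and is dropped,
# while id 2 appears twice in credits and is therefore repeated
merged = movies.merge(credits, on='id')
print(merged['id'].tolist())

# Dropping duplicate keys first keeps at most one row per movie
deduped = movies.merge(credits.drop_duplicates(subset='id'), on='id')
print(len(deduped))
```

If one row per movie is required, de-duplicating `credits` and `keywords` on `id` before merging is one option.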

In [16]:
master_dataset.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


In [17]:
print(master_dataset.columns)

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew', 'keywords'],
      dtype='object')


In [18]:
links = links[links['tmdbId'].notnull()]['tmdbId'].astype('int')
master_dataset = master_dataset[master_dataset['id'].isin(links)]
print(master_dataset.shape)

(46628, 27)


### 4: Data Cleaning & Engineering

In [19]:
# Parse the cast, crew and keywords columns: they are loaded as strings and must be converted to lists
master_dataset['cast'] = master_dataset['cast'].apply(literal_eval)
master_dataset['crew'] = master_dataset['crew'].apply(literal_eval)
master_dataset['keywords'] = master_dataset['keywords'].apply(literal_eval)

In [20]:
# Extract cast names and keep only the first 3 members so that long cast lists do not dominate the features
master_dataset['cast'] = master_dataset['cast'].apply(
    lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
master_dataset['cast'] = master_dataset['cast'].apply(lambda x: x[:3])

# Extract keyword names; use an empty list when the value is not a list
master_dataset['keywords'] = master_dataset['keywords'].apply(
    lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# Extract the director's name from the crew
master_dataset['director'] = master_dataset['crew'].apply(get_director)

In [21]:
# Lowercase cast names and remove spaces so each person becomes one unique token
master_dataset['cast'] = master_dataset['cast'].apply(
    lambda x: [str.lower(i.replace(" ", "")) for i in x])

# Keep the original director name for display
master_dataset['main_director'] = master_dataset['director']

# Normalize the director name the same way, then repeat it 3 times so it is in proportion with the 3 cast members
master_dataset['director'] = master_dataset['director'].astype(
    'str').apply(lambda x: str.lower(x.replace(" ", "")))
master_dataset['director'] = master_dataset['director'].apply(lambda x: [x, x, x])
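
Why remove the spaces? A word-level vectorizer would otherwise split "Tom Hanks" and "Tom Cruise" into separate words and treat the shared "tom" as evidence of similarity. The short sketch below shows this with a hypothetical helper, `normalize_name`, that mirrors the lambda used above.

```python
def normalize_name(name):
    """Lowercase a name and remove its spaces so it becomes a single token."""
    return name.replace(' ', '').lower()


raw = ['Tom Hanks', 'Tom Cruise']

# Without normalization the two names share the token 'tom'
split_tokens = [set(n.lower().split()) for n in raw]
print(split_tokens[0] & split_tokens[1])
# {'tom'}

# After normalization each person is one distinct token
joined = [normalize_name(n) for n in raw]
print(joined)
# ['tomhanks', 'tomcruise']
```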

In [22]:
# Flatten all keywords into one Series and count how often each keyword occurs across the dataset
s = master_dataset.apply(lambda x: pd.Series(
    x['keywords']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()
print(s[:5])

keyword
woman director      3128
independent film    1942
murder              1314
based on novel       841
musical              734
Name: count, dtype: int64


In [23]:
# Keep only the keywords that occur more than once across the dataset
s = s[s > 1]
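
The cells shown here compute the frequency table `s` but do not show it being applied back to the `keywords` column. Assuming the intent is to drop keywords that occur only once, one way to do that is sketched below in plain Python. The function name `filter_rare_keywords` and the sample lists are illustrative, not part of the original notebook.

```python
from collections import Counter


def filter_rare_keywords(keyword_lists, min_count=2):
    """Drop keywords that appear fewer than min_count times across all movies."""
    counts = Counter(kw for kws in keyword_lists for kw in kws)
    return [[kw for kw in kws if counts[kw] >= min_count]
            for kws in keyword_lists]


movies = [['murder', 'musical'], ['murder', 'heist'], ['musical']]
print(filter_rare_keywords(movies))
# [['murder', 'musical'], ['murder'], ['musical']]
```

In the notebook itself, an equivalent (untested against the real data) would be to keep a keyword `k` only when `k in s`, since `in` on a pandas Series checks its index labels.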

In [24]:
# Create an English Snowball stemmer to reduce keywords to their stems
stemmer = SnowballStemmer('english')

# Stem each keyword, then lowercase it and remove internal spaces so multi-word keywords become single tokens
master_dataset['keywords'] = master_dataset['keywords'].apply(
    lambda x: [stemmer.stem(i) for i in x])
master_dataset['keywords'] = master_dataset['keywords'].apply(
    lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [25]:
master_dataset['keywords'].head(3)

0    [jealousi, toy, boy, friendship, friend, rival...
1    [boardgam, disappear, basedonchildren'sbook, n...
2       [fish, bestfriend, duringcreditssting, oldmen]
Name: keywords, dtype: object

In [26]:
# Creating a soup feature - combination of (keywords, cast, director, genres)
master_dataset['soup'] = master_dataset['keywords'] + \
    master_dataset['cast'] + master_dataset['director'] + \
    master_dataset['genres']

# Modifying by placing single space between all the soup words
master_dataset['soup'] = master_dataset['soup'].apply(lambda x: ' '.join(x))
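
For a single movie, the "soup" construction amounts to concatenating four token lists and joining them with spaces. The sketch below reproduces that for one hand-written example. The function `build_soup`, its `director_weight` parameter, and the sample tokens are assumptions for illustration only.

```python
def build_soup(keywords, cast, director, genres, director_weight=3):
    """Join all tokens into one string, repeating the director token for extra weight."""
    tokens = keywords + cast[:3] + [director] * director_weight + genres
    return ' '.join(tokens)


soup = build_soup(
    keywords=['toy', 'friendship'],
    cast=['tomhanks', 'timallen', 'donrickles', 'jimvarney'],
    director='johnlasseter',
    genres=['Animation', 'Comedy'],
)
print(soup)
# toy friendship tomhanks timallen donrickles johnlasseter johnlasseter johnlasseter Animation Comedy
```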

In [27]:
master_dataset['soup'].head(3)

0    jealousi toy boy friendship friend rivalri boy...
1    boardgam disappear basedonchildren'sbook newho...
2    fish bestfriend duringcreditssting oldmen walt...
Name: soup, dtype: object

In [28]:
print(master_dataset.columns)

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew', 'keywords', 'director',
       'main_director', 'soup'],
      dtype='object')


In [29]:
# Removing unwanted columns from the dataset - these features can be used if you wish to add more features to your recommender system.
# We are not going to use them, so we are removing them.
master_dataset.drop(['adult', 'belongs_to_collection', 'budget', 'homepage', 'original_language', 'production_companies',
                    'production_countries', 'revenue', 'runtime', 'spoken_languages', 'status', 'video'], axis=1, inplace=True)
master_dataset.drop(['overview', 'tagline', 'vote_average', 'vote_count',
                    'cast', 'crew', 'keywords', 'director'], axis=1, inplace=True)
master_dataset.drop(['id', 'imdb_id','original_title','poster_path','genres'],axis=1,inplace=True)

In [30]:
# Keep popularity only where it is a float; other values become NaN and those rows are dropped
master_dataset['popularity'] = master_dataset.apply(
    lambda r: r['popularity'] if type(r['popularity']) == float else np.nan, axis=1)
master_dataset.dropna(inplace=True)

# Drop rows whose director name is empty or a single character
# (rows with a missing director were already removed by the dropna above)
master_dataset['main_director'] = master_dataset.apply(
    lambda r: r['main_director'] if len(r['main_director']) > 1 else np.nan, axis=1)
master_dataset.dropna(inplace=True)
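
The exact-type check above discards any popularity value that is stored as a string, even when the string is a valid number. A possible alternative, shown here on made-up data, is `pd.to_numeric` with `errors='coerce'`. This is a sketch of a different technique, not what the notebook does, and it would keep more rows than the original filter.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    # mixed types, as can happen when a CSV column contains stray text
    'popularity': [21.9, '8.4', 'not-a-number'],
})

# Numeric strings are converted; non-numeric entries become NaN instead of raising
df['popularity'] = pd.to_numeric(df['popularity'], errors='coerce')
df = df.dropna(subset=['popularity'])
print(df['title'].tolist())
```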

In [31]:
# Sorting the whole dataset based on popularity. This will help us to take top X number of movies based on popularity.
master_dataset.sort_values(by=['popularity'], ascending=False, inplace=True)

# Dropping popularity column after sorting based on popularity
master_dataset.drop(['popularity'], axis=1, inplace=True)
master_dataset.dropna(inplace=True)

In [32]:
# Reset index because after sorting, the index values have changed.
master_dataset.reset_index(inplace=True, drop=True)

In [34]:
# For Demo, we will take top 2500 movies, which is hosted online already.
master_dataset = master_dataset[:2500]

# For Tiny-Model, we will take top 1000 movies
# master_dataset = master_dataset[:1000]

# For Extra-Small-Model, we will take top 5000 movies
# master_dataset = master_dataset[:5000]

# For Small-Model, we will take top 10000 movies
# master_dataset = master_dataset[:10000]

# For Medium-Model, we will take top 20000 movies
# master_dataset = master_dataset[:20000]

# For Large-Model, we will take top 30000 movies
# master_dataset = master_dataset[:30000]

# LEAVE ALL THE LINES COMMENTED IF YOU WISH TO TRAIN FULL MOVIES DATASET.

In [35]:
# This is our final dataset which we will be using for training our word and cosine similarity matrix
master_dataset.head()

| | release_date | title | main_director | soup |
|---|---|---|---|---|
| 0 | 2015-06-17 | Minions | Kyle Balda | assist aftercreditssting duringcreditssting ev... |
| 1 | 2014-10-24 | Big Hero 6 | Chris Williams | brotherbrotherrelationship hero talent reveng ... |
| 2 | 2016-02-09 | Deadpool | Tim Miller | antihero mercenari marvelcom superhero basedon... |
| 3 | 2017-04-19 | Guardians of the Galaxy Vol. 2 | James Gunn | sequel superhero basedoncom misfit space outer... |
| 4 | 2009-12-10 | Avatar | James Cameron | cultureclash futur spacewar spacecoloni societ... |


In [36]:
print(master_dataset.shape)

(2500, 4)


### 5: Recommendation Matrix

>     Building the matrix which contains similarity scores between movies based on the features

In [37]:
# 1: Training Word based count vectorizer model

In [38]:
# Create a CountVectorizer over word unigrams and bigrams; min_df=2 keeps only terms that appear in at least 2 movies
count = CountVectorizer(analyzer='word', ngram_range=(
    1, 2), min_df=2, stop_words='english')

# Fit the vectorizer on the soup column and transform it into a sparse count matrix
count_matrix = count.fit_transform(master_dataset['soup'])

In [39]:
print(count_matrix.shape)

(2500, 7277)
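
To see what `ngram_range=(1, 2)` and `min_df=2` do, the toy example below runs the same kind of vectorizer on three hand-written soups (illustrative strings, not rows from the dataset). Each bigram and each of the other single words occurs in only one document, so only the two unigrams shared by the first two soups survive.

```python
from sklearn.feature_extraction.text import CountVectorizer

soups = [
    'toy friendship tomhanks',
    'toy rivalry tomhanks',
    'space hero',
]

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=2)
matrix = vectorizer.fit_transform(soups)

# Only terms found in at least two documents are kept in the vocabulary
print(sorted(vectorizer.vocabulary_))
# One row per soup, one column per surviving term
print(matrix.shape)
```

Note that the third soup ends up as an all-zero row, which means it has zero cosine similarity with every other movie.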


#### Building Cosine Similarity Matrix

In [40]:
# Store the dense cosine similarity matrix as a pyarrow Table so it can be written to Parquet
table = pa.Table.from_pandas(pd.DataFrame(
    cosine_similarity(count_matrix, count_matrix)))
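
For intuition, cosine similarity between two count vectors is their dot product divided by the product of their lengths. The pure-Python sketch below computes it for three made-up soups. It is an illustration of the formula only; the notebook relies on scikit-learn's `cosine_similarity` for the real matrix.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty soup is similar to nothing
    return dot / (norm_a * norm_b)


m1 = Counter('prison drama frankdarabont frankdarabont frankdarabont'.split())
m2 = Counter('prison fantasy frankdarabont frankdarabont frankdarabont'.split())
m3 = Counter('space comedy'.split())

print(round(cosine(m1, m2), 3))  # shared keyword plus a triple-weighted director
# 0.909
print(cosine(m1, m3))            # no tokens in common
# 0.0
```

Because the full matrix is dense and square, its size grows quadratically with the number of movies: 2,500 movies already give 6.25 million similarity scores.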

### 6. Model & Data Export

>     Exporting the trained model & dataset efficiently

We export the model in Parquet format for three reasons, which also make it a reasonable choice for similar projects:

1. Compact storage
2. Good compression ratio
3. Fast reads and writes

In [41]:
# save the Master Dataset
master_dataset.to_parquet('/content/movie_database.parquet', engine='fastparquet',index=False)

In [42]:
# Writing the Matrix table
pq.write_table(table, '/content/model.parquet')

### 7. Inference

>     Loading the trained model to execute Inference

In [43]:
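import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq  # makes pa.parquet available when this section runs in a fresh session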

In [44]:
master_dataset = pd.read_parquet('/content/movie_database.parquet')

In [45]:
master_dataset.head(3)

| | release_date | title | main_director | soup |
|---|---|---|---|---|
| 0 | 2015-06-17 | Minions | Kyle Balda | assist aftercreditssting duringcreditssting ev... |
| 1 | 2014-10-24 | Big Hero 6 | Chris Williams | brotherbrotherrelationship hero talent reveng ... |
| 2 | 2016-02-09 | Deadpool | Tim Miller | antihero mercenari marvelcom superhero basedon... |


In [46]:
table = pa.parquet.read_table('/content/model.parquet').to_pandas()

In [47]:
master_dataset = master_dataset.reset_index()
titles = master_dataset['title']
indices = pd.Series(master_dataset.index, index=master_dataset['title'])

In [48]:
def get_recommendations(movie_id_from_db, movie_db):
    try:
        sim_scores = list(enumerate(movie_db[movie_id_from_db]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        sim_scores = sim_scores[1:15]  # skip rank 0 (the movie itself) and keep the next 14 recommendations

        movie_indices = [i[0] for i in sim_scores]
        output = master_dataset.iloc[movie_indices]
        output.reset_index(inplace=True, drop=True)

        response = []
        for i in range(len(output)):
            response.append({
                'movie_title': output['title'].iloc[i],
                'movie_release_date': output['release_date'].iloc[i],
                'movie_director': output['main_director'].iloc[i],
                'google_link': "https://www.google.com/search?q=" + '+'.join(output['title'].iloc[i].strip().split())
            })
        return response
    except Exception as e:
        print("error: ", e)
        return []
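
The slice `[1:15]` assumes that the query movie always sorts into first place. If another movie has an identical soup (a similarity of 1.0) and a lower index, that movie would be skipped instead. A minimal sketch of a selection that excludes the query by index rather than by rank follows. `top_k_similar` is a hypothetical helper, not part of the original notebook.

```python
def top_k_similar(sim_row, self_index, k=14):
    """Return indices of the k highest scores in sim_row, skipping self_index."""
    ranked = sorted(
        (i for i in range(len(sim_row)) if i != self_index),
        key=lambda i: sim_row[i],
        reverse=True,
    )
    return ranked[:k]


row = [1.0, 0.2, 1.0, 0.7, 0.0]  # similarities of movie 0 to movies 0..4
print(top_k_similar(row, self_index=0, k=3))
# [2, 3, 1]
```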

In [55]:
movie_name = input('Enter a movie Name: ')
movie_name

'The Shawshank Redemption'

In [56]:
movie_index = titles.to_list().index(movie_name)
recommendations = get_recommendations(movie_index, table)
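
`list.index` raises a `ValueError` when the title is not found, and it is sensitive to case and surrounding whitespace. A more forgiving lookup might look like the sketch below. `find_title_index` and the sample list are illustrative assumptions.

```python
def find_title_index(titles, query):
    """Return the position of the first title matching query, ignoring case, or None."""
    wanted = query.strip().casefold()
    for position, title in enumerate(titles):
        if str(title).strip().casefold() == wanted:
            return position
    return None


sample_titles = ['Minions', 'Big Hero 6', 'Deadpool']
print(find_title_index(sample_titles, '  deadpool '))
# 2
print(find_title_index(sample_titles, 'Unknown Movie'))
# None
```

With this helper, the caller can check for `None` and show a friendly message before calling `get_recommendations`.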

In [57]:
print(f"{'Movie Title':<40} | {'Director':<20} | {'Release Date':<15}")
print("-" * 80)
for recommendation in recommendations:
    print(
        f"{recommendation['movie_title']:<40} | {recommendation['movie_director']:<20} | {recommendation['movie_release_date']:<15}")

Movie Title                              | Director             | Release Date   
--------------------------------------------------------------------------------
The Green Mile                           | Frank Darabont       | 1999-12-10     
Siberian Education                       | Gabriele Salvatores  | 2013-02-27     
The Broken Circle Breakdown              | Felix Van Groeningen | 2012-10-09     
Winter Sleep                             | Nuri Bilge Ceylan    | 2014-06-13     
I as in Icarus                           | Henri Verneuil       | 1979-12-12     
Death Warrant                            | Deran Sarafian       | 1990-09-14     
Escape from Alcatraz                     | Don Siegel           | 1979-06-22     
London Boulevard                         | William Monahan      | 2010-11-10     
American Psycho                          | Mary Harron          | 2000-04-13     
Rise of the Footsoldier                  | Julian Gilbey        | 2007-09-07     
Bad Boy Bubby    