## Clustering and Recommendation Systems

#### Madelin Rioux 300227635 & Kendyl Snow 300240323

### Introduction

## The Movies Dataset

### Dataset Information

- **Dataset Name:** Full MovieLens Dataset with TMDB Metadata  
- **Author:** Assembled by a Data Science Career Track student at Springboard (Collaborator Rounak Banik) 
- **Purpose:**   
  This dataset was created to perform Exploratory Data Analysis (EDA) on movie data to tell the story of cinema. It also serves to build various types of recommender systems, including content-based and collaborative filtering models.

### Dataset Shape

#### `movies_metadata.csv`
- **Rows:** 45,466 movies  
- **Columns:** 24 features  

#### `ratings_small.csv`
- **Rows:** 100,004 ratings  
- **Columns:** 4 features  

### Feature List and Descriptions 
#### `movies_metadata.csv`

| Feature Name           | Description                                                               | Type                        |
|------------------------|---------------------------------------------------------------------------|-----------------------------|
| `adult`                | Whether the movie is adult-themed or not                                  | Categorical                 |
| `belongs_to_collection`| Data on collection the movie belongs to (if any)                          | Categorical (JSON/string)   |
| `budget`               | Budget of the movie in USD                                                | Numerical                   |
| `genres`               | List of genres for the movie                                              | Categorical (JSON/string)   |
| `homepage`             | URL of the movie's homepage (if available)                                | Categorical                 |
| `id`                   | TMDB ID of the movie                                                      | Categorical (ID)            |
| `imdb_id`              | IMDB ID of the movie                                                      | Categorical (ID)            |
| `original_language`    | Language in which the movie was originally produced                       | Categorical                 |
| `original_title`       | Original title of the movie                                               | Categorical                 |
| `overview`             | Short summary/description of the movie plot                               | Categorical (Text)          |
| `popularity`           | Popularity score assigned by TMDB                                         | Numerical                   |
| `poster_path`          | URL path to the movie's poster image                                      | Categorical                 |
| `production_companies` | List of companies that produced the movie                                 | Categorical (JSON/string)   |
| `production_countries` | List of countries involved in production                                  | Categorical (JSON/string)   |
| `release_date`         | Official release date                                                     | Categorical (Date)          |
| `revenue`              | Revenue generated by the movie in USD                                     | Numerical                   |
| `runtime`              | Duration of the movie in minutes                                          | Numerical                   |
| `spoken_languages`     | Languages spoken in the movie                                             | Categorical (JSON/string)   |
| `status`               | Release status of the movie (e.g. Released, Post Production)              | Categorical                 |
| `tagline`              | Catchphrase or tagline of the movie                                       | Categorical                 |
| `title`                | Title of the movie                                                        | Categorical                 |
| `video`                | Boolean indicating if it’s a video release                                | Categorical                 |
| `vote_average`         | Average rating score on TMDB                                              | Numerical                   |
| `vote_count`           | Number of user votes received on TMDB                                     | Numerical                   |

#### `ratings_small.csv`

| Feature Name | Description                                  | Type     |
|--------------|----------------------------------------------|----------|
| `userId`     | Unique ID for each user                      | Numerical (int64) |
| `movieId`    | Unique ID for each movie                     | Numerical (int64) |
| `rating`     | Rating given by the user (0.5 to 5.0 scale, in half-star steps) | Numerical (float64) |
| `timestamp`  | Time when the rating was submitted (epoch)   | Numerical (int64) |

### Additional Files

- `keywords.csv`: Plot keywords per movie (as stringified JSON)  
- `credits.csv`: Cast and crew details (stringified JSON)  
- `links.csv`: TMDB and IMDB IDs for movies  
- `links_small.csv`: Smaller subset of links for 9,000 movies  

### Explanation for Choice of Dataset

This dataset provides:

- Movie metadata (budget, revenue, genres, cast, etc.)
- User rating behavior (for recommendation systems)

It is ideal for tasks such as:

- Similarity measures
- Clustering algorithms
- Content-based recommendation systems
- Collaborative filtering recommendation systems

### THIS IS TEMPORARY 

(only so we don't have to add the dataset to GitHub)

In [5]:
import kagglehub
import os

path = kagglehub.dataset_download("rounakbanik/the-movies-dataset")




In [6]:
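import pandas as pd  # needed here because the shared import cell runs later

csv_path_metadata = os.path.join(path, "movies_metadata.csv")
csv_path_ratings = os.path.join(path, "ratings_small.csv")

# Load the raw CSVs; the "_og" frames are kept untouched as the originals
metadata_df_og = pd.read_csv(csv_path_metadata)
ratings_df_og = pd.read_csv(csv_path_ratings)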

### Importations 

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import LocalOutlierFactor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import hamming, cityblock
from difflib import SequenceMatcher
import Levenshtein
import ast
import json


import warnings
warnings.filterwarnings('ignore')

### Reading of the datasets

We will add this back in once finished (when submitting)

In [8]:
# metadata_df_og = pd.read_csv('datasets/movies_metadata.csv')
# ratings_df_og = pd.read_csv('datasets/ratings_small.csv')

### EDA

**MetaData df**

In [9]:
metadata_df_og.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [10]:
metadata_df_og.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


We will apply all our changes to a copy of the original dataframe, so the raw data stays available for comparison.

In [11]:
metadata_df = metadata_df_og.copy()

As we can see, `belongs_to_collection`, `genres`, `spoken_languages`, `production_companies`, and `production_countries` are stored as JSON-like strings. First we will convert them into usable columns.
Reference: https://www.w3schools.com/python/python_json.asp#:~:text=If%20you%20have%20a%20JSON,will%20be%20a%20Python%20dictionary.

In [12]:
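def extract_json(string):
    # Parse a stringified dict and return its 'name' field, or None on failure.
    # Note: swapping quotes breaks on names that contain an apostrophe and on
    # values that are not valid JSON (e.g. a bare None), so those rows also
    # return None.
    try:
        json_dict = json.loads(string.replace("'", '"'))
        return json_dict['name']
    except (AttributeError, TypeError, KeyError, ValueError):
        return None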


In [13]:
metadata_df['belongs_to_collection_list'] = metadata_df['belongs_to_collection'].apply(extract_json)

https://docs.python.org/3/library/ast.html

In [14]:
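def extract_array_json(string):
    # Parse a stringified list of dicts and return each dict's 'name' field.
    # Missing or malformed values yield an empty list.
    try:
        dicts = ast.literal_eval(string)
        return [d['name'] for d in dicts]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []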


In [15]:
metadata_df['genres_list'] = metadata_df['genres'].apply(extract_array_json)
metadata_df['spoken_languages_list'] = metadata_df['spoken_languages'].apply(extract_array_json)
metadata_df['production_companies_list'] = metadata_df['production_companies'].apply(extract_array_json)
metadata_df['production_countries_list'] = metadata_df['production_countries'].apply(extract_array_json)
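
The quote-swapping approach in `extract_json` returns `None` for any collection whose name contains an apostrophe, because the swapped string is no longer valid JSON. A minimal sketch of an alternative follows, assuming the column holds Python-literal dict strings (as the sample rows suggest). `extract_name_literal` is a hypothetical helper name, and the sample string is made up for illustration; neither is part of the notebook.

```python
import ast

def extract_name_literal(value):
    """Return the 'name' field of a stringified Python dict, or None."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # NaN floats, empty cells, and malformed strings end up here
        return None
    if isinstance(parsed, dict):
        return parsed.get('name')
    return None

# Made-up sample with an apostrophe in the name and a bare None value
sample = "{'id': 1, 'name': \"Ocean's Collection\", 'poster_path': None}"
print(extract_name_literal(sample))        # Ocean's Collection
print(extract_name_literal(float('nan')))  # None
```

Applying this in place of `extract_json` would likely change the missing-value count of `belongs_to_collection_list`, so any figures reported later would need to be recomputed.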

In [16]:
metadata_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,belongs_to_collection_list,genres_list,spoken_languages_list,production_companies_list,production_countries_list
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,,Toy Story,False,7.7,5415.0,Toy Story Collection,"[Animation, Comedy, Family]",[English],[Pixar Animation Studios],[United States of America]
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,"[Adventure, Fantasy, Family]","[English, Français]","[TriStar Pictures, Teitler Film, Interscope Co...",[United States of America]
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,Grumpy Old Men Collection,"[Romance, Comedy]",[English],"[Warner Bros., Lancaster Gate]",[United States of America]
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,,"[Comedy, Drama, Romance]",[English],[Twentieth Century Fox Film Corporation],[United States of America]
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,Father of the Bride Collection,[Comedy],[English],"[Sandollar Productions, Touchstone Pictures]",[United States of America]


Now we can remove the JSON columns

In [17]:
metadata_df = metadata_df.drop(['belongs_to_collection', 'genres','spoken_languages','production_companies', 'production_countries'], axis=1)

In [18]:
metadata_df.head()

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,tagline,title,video,vote_average,vote_count,belongs_to_collection_list,genres_list,spoken_languages_list,production_companies_list,production_countries_list
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,,Toy Story,False,7.7,5415.0,Toy Story Collection,"[Animation, Comedy, Family]",[English],[Pixar Animation Studios],[United States of America]
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,"[Adventure, Fantasy, Family]","[English, Français]","[TriStar Pictures, Teitler Film, Interscope Co...",[United States of America]
2,False,0,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,Grumpy Old Men Collection,"[Romance, Comedy]",[English],"[Warner Bros., Lancaster Gate]",[United States of America]
3,False,16000000,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,,"[Comedy, Drama, Romance]",[English],[Twentieth Century Fox Film Corporation],[United States of America]
4,False,0,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,Father of the Bride Collection,[Comedy],[English],"[Sandollar Productions, Touchstone Pictures]",[United States of America]


Now we will ensure that each column has the correct data type.

In [24]:
# convert 'budget', 'popularity', 'id', and 'imdb_id' to numeric; unparseable values become NaN
metadata_df['budget'] = pd.to_numeric(metadata_df['budget'], errors='coerce')
metadata_df['popularity'] = pd.to_numeric(metadata_df['popularity'], errors='coerce')
metadata_df['id'] = pd.to_numeric(metadata_df['id'], errors='coerce')
# note: imdb_id values carry a 'tt' prefix (e.g. tt0114709), so this coercion turns every imdb_id into NaN
metadata_df['imdb_id'] = pd.to_numeric(metadata_df['imdb_id'], errors='coerce')

#converting 'release_date' to datetime
metadata_df['release_date'] = pd.to_datetime(metadata_df['release_date'], errors='coerce')
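
To illustrate what `errors='coerce'` does, here is a small sketch on made-up values that mimic the raw columns. It also shows one possible way to keep the numeric part of `imdb_id` by stripping the `tt` prefix first. The sample frame and the variable names `budget_numeric` and `imdb_numeric` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample values mimicking the raw object-typed columns
sample = pd.DataFrame({
    'budget': ['30000000', '0', '/poster.jpg'],  # last value is not a number
    'imdb_id': ['tt0114709', 'tt0113497', None],
})

# errors='coerce' turns unparseable entries into NaN instead of raising
budget_numeric = pd.to_numeric(sample['budget'], errors='coerce')

# Stripping the literal 'tt' prefix first keeps the numeric part of the ID
imdb_numeric = pd.to_numeric(
    sample['imdb_id'].str.replace('tt', '', regex=False), errors='coerce'
)

print(budget_numeric.tolist())
print(imdb_numeric.tolist())
```

Whether a numeric `imdb_id` is useful at all depends on the later studies; keeping it as a string identifier would also be a reasonable choice.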


Next we split the release date into year and month, which makes it easier to compare movies by release period.

In [25]:
metadata_df['release_year'] = metadata_df['release_date'].dt.year
metadata_df['release_month'] = metadata_df['release_date'].dt.month
metadata_df = metadata_df.drop(columns=['release_date'])
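
Because some release dates are missing, `.dt.year` and `.dt.month` return floats (for example `1995.0`). The sketch below, on made-up dates, shows one option for keeping whole-number years and months: casting to pandas' nullable `Int64` dtype. This is an assumption about what might be convenient later, not a step the notebook performs.

```python
import pandas as pd

# Made-up dates: one valid, one unparseable, one missing
raw_dates = pd.Series(['1995-10-30', 'not a date', None])
dates = pd.to_datetime(raw_dates, errors='coerce')  # bad values become NaT

# NaT forces a float result; the nullable Int64 dtype keeps integers plus <NA>
year = dates.dt.year.astype('Int64')
month = dates.dt.month.astype('Int64')

print(year.tolist())
print(month.tolist())
```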

In [26]:
metadata_df.head()

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,video,vote_average,vote_count,belongs_to_collection_list,genres_list,spoken_languages_list,production_companies_list,production_countries_list,release_year,release_month
0,False,30000000.0,http://toystory.disney.com/toy-story,862.0,,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,False,7.7,5415.0,Toy Story Collection,"[Animation, Comedy, Family]",[English],[Pixar Animation Studios],[United States of America],1995.0,10.0
1,False,65000000.0,,8844.0,,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,False,6.9,2413.0,,"[Adventure, Fantasy, Family]","[English, Français]","[TriStar Pictures, Teitler Film, Interscope Co...",[United States of America],1995.0,12.0
2,False,0.0,,15602.0,,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,False,6.5,92.0,Grumpy Old Men Collection,"[Romance, Comedy]",[English],"[Warner Bros., Lancaster Gate]",[United States of America],1995.0,12.0
3,False,16000000.0,,31357.0,,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,False,6.1,34.0,,"[Comedy, Drama, Romance]",[English],[Twentieth Century Fox Film Corporation],[United States of America],1995.0,12.0
4,False,0.0,,11862.0,,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,False,5.7,173.0,Father of the Bride Collection,[Comedy],[English],"[Sandollar Productions, Touchstone Pictures]",[United States of America],1995.0,2.0


In [20]:
metadata_df.isnull().sum()

adult                             0
budget                            0
homepage                      37684
id                                0
imdb_id                          17
original_language                11
original_title                    0
overview                        954
popularity                        5
poster_path                     386
release_date                     87
revenue                           6
runtime                         263
status                           87
tagline                       25054
title                             6
video                             6
vote_average                      6
vote_count                        6
belongs_to_collection_list    42298
genres_list                       0
spoken_languages_list             0
production_companies_list         0
production_countries_list         0
dtype: int64
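
The raw counts are easier to compare as a share of the 45,466 rows: `belongs_to_collection_list` is missing in about 93.0% of rows, `homepage` in about 82.9%, and `tagline` in about 55.1%. A minimal sketch of that calculation on a tiny made-up frame (the `toy` frame and `missing_share` name are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: 'homepage' is missing in 3 of 4 rows
toy = pd.DataFrame({
    'homepage': [np.nan, np.nan, np.nan, 'http://example.com'],
    'title': ['A', 'B', 'C', 'D'],
})

# The mean of the boolean mask is the fraction of missing values per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be `metadata_df.isnull().mean()`.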

### Ratings dataset

In [5]:
ratings_df_og.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [171]:
ratings_df_og.head()



In [172]:
ratings_df_og.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

### Study 1

**Similarity functions** https://ashukumar27.medium.com/similarity-functions-in-python-aa6dfe721035
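
As one example of a similarity function that fits the list-valued columns built earlier (such as `genres_list`), the sketch below computes the Jaccard index between two genre lists taken from the first two rows shown above. Whether this particular measure is used in the study is an assumption, and `jaccard_similarity` is a hypothetical helper name.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two lists."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty lists share nothing to compare; treat as no similarity
        return 0.0
    return len(set_a & set_b) / len(union)

toy_story = ['Animation', 'Comedy', 'Family']
jumanji = ['Adventure', 'Fantasy', 'Family']

# One shared genre out of five distinct genres
print(jaccard_similarity(toy_story, jumanji))  # 0.2
```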