#### Hi! Choosing a movie is a real struggle for many of us :) That is why most streaming platforms have built-in recommendation systems. These systems aim to predict a user's interests and recommend items they will probably like. Throughout this notebook, we will use clustering to build our own movie recommender.

#### We are going to use the following three datasets:
[Netflix TV Shows and Movies](https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies?datasetId=2178661&sortBy=voteCount)  
[HBO Max TV Shows and Movies](https://www.kaggle.com/datasets/victorsoeiro/hbo-max-tv-shows-and-movies?select=titles.csv)  
[Amazon Prime TV Shows and Movies](https://www.kaggle.com/datasets/victorsoeiro/amazon-prime-tv-shows-and-movies?select=titles.csv)  

# Step 1: Import Required Libraries
First, let's import the necessary libraries for the project.

In [45]:
import pandas as pd
import numpy as np

# warnings 
import warnings
warnings.filterwarnings('ignore')


# sklearn
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Step 2: Import the Data (the three datasets listed above)

In [46]:
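# NOTE: judging by the df.head() output, 'credits.csv' holds cast and crew rows
# (person_id, name, character, role) rather than title metadata, so its rows
# do not line up with the title-level columns of the other files.
df_netflix = pd.read_csv('credits.csv')
df_amazon = pd.read_csv('title.csv')
df_hbo = pd.read_csv('titles.csv')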

In [47]:
df = pd.concat([df_netflix, df_amazon, df_hbo], axis=0)

In [48]:
df.head()

Unnamed: 0,person_id,id,name,character,role,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,3748.0,tm84618,Robert De Niro,Travis Bickle,ACTOR,,,,,,,,,,,,,,
1,14658.0,tm84618,Jodie Foster,Iris Steensma,ACTOR,,,,,,,,,,,,,,
2,7064.0,tm84618,Albert Brooks,Tom,ACTOR,,,,,,,,,,,,,,
3,3739.0,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR,,,,,,,,,,,,,,
4,48933.0,tm84618,Cybill Shepherd,Betsy,ACTOR,,,,,,,,,,,,,,


# Step 3: Data Cleaning and Preprocessing

In [49]:
df_movies = df.drop_duplicates()

In [50]:
# Drop unnecessary columns
df_movies.drop(['description', 'age_certification'], axis=1, inplace=True)

In [51]:
df.head()

Unnamed: 0,person_id,id,name,character,role,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,3748.0,tm84618,Robert De Niro,Travis Bickle,ACTOR,,,,,,,,,,,,,,
1,14658.0,tm84618,Jodie Foster,Iris Steensma,ACTOR,,,,,,,,,,,,,,
2,7064.0,tm84618,Albert Brooks,Tom,ACTOR,,,,,,,,,,,,,,
3,3739.0,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR,,,,,,,,,,,,,,
4,48933.0,tm84618,Cybill Shepherd,Betsy,ACTOR,,,,,,,,,,,,,,


##### Working with the production_countries column

In [52]:
df['production_countries']

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
         ...  
3289    ['PR']
3290    ['PA']
3291        []
3292        []
3293    ['US']
Name: production_countries, Length: 90966, dtype: object

In [53]:

# 1. Remove unwanted characters from the 'production_countries' column
# .str.replace() strips the '[' and ']' characters and any single quotes.
# regex=True makes .str.replace() treat the pattern as a regular expression.
# Note: square brackets are special in regex (they delimit a character set), so literal ones are escaped with a backslash.
df_movies['production_countries'] = df_movies['production_countries'].str.replace(r"[\[\]']", '', regex=True)

# 2. Extract the first country from the cleaned 'production_countries' column
# The .str.split(',') splits the string into a list using commas as the delimiter, then .str[0] selects the first element.
# This creates a new column 'lead_prod_country' that represents the primary production country of each movie
df_movies['lead_prod_country'] = df_movies['production_countries'].str.split(',').str[0]

# 3. Calculate the number of countries involved in the production of each movie
# .str.split(',') splits the 'production_countries' string by commas, and .str.len() counts the elements of the resulting list.
# Caveat: an empty list ('[]') has become '' by now, and ''.split(',') still yields one element, so such rows are counted as 1.
df_movies['prod_countries_cnt'] = df_movies['production_countries'].str.split(',').str.len()

# 4. Replace any empty values in the 'lead_prod_country' column with NaN (Not a Number)
# This step uses the .replace() method to convert any empty strings ('') to np.nan (missing values)
# Handling missing data with NaN is important for accurate data analysis and prevents errors in downstream processing
df_movies['lead_prod_country'] = df_movies['lead_prod_country'].replace('', np.nan)
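
The string-stripping steps above treat each value as plain text. As an alternative, here is a minimal sketch that parses the list-like strings with `ast.literal_eval`, so an empty list counts as 0 countries. The sample series `raw` and the helper `parse_list` are hypothetical names made up for illustration; this is not the method used in the notebook.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw 'production_countries' strings
raw = pd.Series(["['US', 'GB']", "['PR']", "[]", np.nan])


def parse_list(value):
    """Turn a string such as "['US', 'GB']" into a Python list; leave NaN untouched."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value


parsed = raw.apply(parse_list)

# First country, or NaN when the list is empty or the value is missing
lead_country = parsed.apply(lambda v: v[0] if isinstance(v, list) and v else np.nan)

# Number of countries: an empty list counts as 0, a missing value stays NaN
country_cnt = parsed.apply(lambda v: len(v) if isinstance(v, list) else np.nan)

print(lead_country.tolist())
print(country_cnt.tolist())
```

The same helper would work for the `genres` column, since it has the same list-as-string layout.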


In [54]:
df_movies['lead_prod_country']

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
3289     PR
3290     PA
3291    NaN
3292    NaN
3293     US
Name: lead_prod_country, Length: 90933, dtype: object

##### Working with genres

In [55]:
df_movies['genres']

0                        NaN
1                        NaN
2                        NaN
3                        NaN
4                        NaN
                ...         
3289    ['romance', 'music']
3290              ['comedy']
3291              ['comedy']
3292              ['comedy']
3293       ['documentation']
Name: genres, Length: 90933, dtype: object

In [56]:
# 1. Remove unwanted characters from the 'genres' column
# .str.replace() strips the '[' and ']' characters and any single quotes from the 'genres' column.
# This leaves a plain comma-separated string that is easier to analyze and manipulate.
# Note: square brackets are special in regex (they delimit a character set), so literal ones are escaped with a backslash (\).
df_movies['genres'] = df_movies['genres'].str.replace(r"[\[\]']", '', regex=True)

# 2. Extract the first genre from the cleaned 'genres' column
# The .str.split(',') splits the 'genres' string by commas, and .str[0] selects the first element of the resulting list
# This creates a new column 'main_genre' that represents the primary genre of each movie
df_movies['main_genre'] = df_movies['genres'].str.split(',').str[0]

# 3. Replace any empty values in the 'main_genre' column with NaN (Not a Number)
# This step uses the .replace() method to convert any empty strings ('') to np.nan, indicating missing data
# Handling missing data with NaN is important for accurate data analysis and prevents errors in downstream processing
df_movies['main_genre'] = df_movies['main_genre'].replace('', np.nan)

In [57]:
df_movies['main_genre']

0                 NaN
1                 NaN
2                 NaN
3                 NaN
4                 NaN
            ...      
3289          romance
3290           comedy
3291           comedy
3292           comedy
3293    documentation
Name: main_genre, Length: 90933, dtype: object

In [58]:
#  Drop unnecessary columns 'genres' and 'production_countries' from the DataFrame
# The .drop() method with 'axis=1' removes specified columns, as they are no longer needed after extracting the main genre and production country count
df_movies.drop(['genres', 'production_countries'], axis=1, inplace=True)

### Drop missing values

In [59]:
df_movies.shape

(90933, 18)

In [60]:
df_movies.isnull().sum()

person_id             13132
id                        0
name                  13132
character             22904
role                  13132
title                 77801
type                  77801
release_year          77801
runtime               77801
seasons               88831
imdb_id               78794
imdb_score            79194
imdb_votes            79215
tmdb_popularity       78380
tmdb_score            80146
lead_prod_country     78732
prod_countries_cnt    77801
main_genre            78063
dtype: int64

In [61]:
# Drop rows with any missing values to clean the dataset
df_movies.dropna(inplace=True)

# Set the 'title' column as the DataFrame index
df_movies.set_index('title', inplace=True)

# Drop the 'id' and 'imdb_id' columns as they are not needed for further analysis
df_movies.drop(['id', 'imdb_id'], axis=1, inplace=True)


In [62]:
df_movies.shape

(0, 15)
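
The empty result follows from how the frame was built. The null counts above show 13132 rows without `person_id` and 77801 rows without `title`, which together account for all 90933 rows, so every row has at least one missing value and `dropna()` removes them all. A minimal sketch with made-up toy frames (`toy_credits` and `toy_titles` are hypothetical, not the Kaggle files) reproduces the effect and shows one possible workaround, assuming only title-level rows are wanted:

```python
import pandas as pd

# Hypothetical toy frames standing in for a credits file and a titles file
toy_credits = pd.DataFrame({"id": ["tm1", "tm1"], "name": ["Actor A", "Actor B"]})
toy_titles = pd.DataFrame(
    {"id": ["tm1", "tm2"], "title": ["Movie 1", "Movie 2"], "imdb_score": [8.2, 6.5]}
)

# Row-wise concat of frames with different columns pads the gaps with NaN
toy = pd.concat([toy_credits, toy_titles], axis=0)

# Every row now lacks either 'name' or 'title', so dropna() keeps nothing
all_dropped = toy.dropna()

# Restricting dropna() to title-level columns keeps the title rows
titles_only = toy.dropna(subset=["title", "imdb_score"]).drop(columns=["name"])

print(all_dropped.shape)
print(titles_only.shape)
```

Loading only title-level files in the first place would avoid the problem at the source; which files those are depends on the local copies of the datasets.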

# Encoding Categorical Features:

In [63]:
# Create dummy variables for categorical columns ('type', 'lead_prod_country', 'main_genre')
dummies = pd.get_dummies(df_movies[['type', 'lead_prod_country', 'main_genre']], drop_first=True)

# Concatenate the dummy variables with the original DataFrame
df_movies_dum = pd.concat([df_movies, dummies], axis=1)

# Drop the original categorical columns after creating dummy variables
df_movies_dum.drop(['type', 'lead_prod_country', 'main_genre'], axis=1, inplace=True)
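
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame (`toy` is hypothetical). For each categorical column, the first category in sorted order gets no column of its own and is represented by all of the remaining dummy columns being false.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame(
    {
        "type": ["MOVIE", "SHOW", "MOVIE"],
        "main_genre": ["comedy", "drama", "romance"],
    }
)

# 'MOVIE' and 'comedy' are the first categories, so they get no column
dummies = pd.get_dummies(toy, drop_first=True)

print(dummies.columns.tolist())
```

For distance-based clustering, dropping one level makes that level the implicit baseline; keeping all levels (`drop_first=False`) is also a reasonable choice here.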

# Scaling (MinMaxScaler):

In [64]:
if not df_movies_dum.empty:
    scaler = MinMaxScaler()
    df_scaled = scaler.fit_transform(df_movies_dum)
    df_scaled = pd.DataFrame(df_scaled, columns=df_movies_dum.columns)
else:
    print("DataFrame is empty. Cannot apply MinMaxScaler.")
    # You can either return, skip scaling, or handle it based on your needs

DataFrame is empty. Cannot apply MinMaxScaler.
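
For reference, `MinMaxScaler` rescales each column to the range [0, 1] via `(x - min) / (max - min)`. The sketch below checks this on a made-up array (`X` is hypothetical, loosely shaped like a runtime and a score column). Scaling matters for DBSCAN because `eps` is a distance threshold, and unscaled columns with large ranges would dominate it.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: column 0 resembles a runtime, column 1 a score
X = np.array([[90.0, 6.0], [120.0, 8.0], [150.0, 7.0]])

# Library result
scaled = MinMaxScaler().fit_transform(X)

# Same computation by hand, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled)
print(np.allclose(scaled, manual))
```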


In [65]:
# # Apply MinMaxScaler to scale the data for model training
# scaler = MinMaxScaler()
# df_scaled = scaler.fit_transform(df_movies_dum)
# df_scaled = pd.DataFrame(df_scaled, columns=df_movies_dum.columns)

# # Display the scaled DataFrame

# df_scaled

<a id="5"></a> <br>
# Step 4: DBSCAN

###### Run a loop to find the best eps and min_samples values

In [66]:
# Define the range of epsilon (eps) and minimum samples (min_samples) parameters for DBSCAN
eps_array = [0.2, 0.5, 1]  # List of different epsilon values (the maximum distance between two samples for one to be considered as in the neighborhood of the other)
min_samples_array = [5, 10, 30]  # List of different min_samples values (the number of samples in a neighborhood for a point to be considered as a core point)

# Iterate over each combination of eps and min_samples
for eps in eps_array:
    for min_samples in min_samples_array:
        # Initialize and fit the DBSCAN model with the current parameters
        clusterer = DBSCAN(eps=eps, min_samples=min_samples).fit(df_scaled)
        
        # Retrieve the cluster labels from the fitted model
        cluster_labels = clusterer.labels_
        
        # Check if the algorithm found only one cluster or marked all points as noise (-1 label for noise)
        if len(set(cluster_labels)) == 1:
            continue  # Skip this combination as it does not provide meaningful clusters
        
        # Calculate the silhouette score to evaluate the quality of the clustering
        silhouette_avg = silhouette_score(df_scaled, cluster_labels)
        
        # Print the current parameters, number of clusters, and the silhouette score
        print("For eps =", eps,
              "For min_samples =", min_samples,
              "Count clusters =", len(set(cluster_labels)),
              "The average silhouette_score is :", silhouette_avg)


NameError: name 'df_scaled' is not defined
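
Since `df_scaled` was never created, the search above cannot run on the movie data. The sketch below shows the same kind of parameter search on synthetic data from `make_blobs`, which is an assumption standing in for `df_scaled`. It also differs from the loop above in one respect: noise points (label `-1`) are left out before counting clusters and computing the silhouette score, so noise is not scored as if it were a cluster. The parameter grid is arbitrary and chosen for the toy data only.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for df_scaled (an assumption, not the movie data)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X = MinMaxScaler().fit_transform(X)

results = []
for eps in [0.05, 0.1, 0.2]:
    for min_samples in [5, 10]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

        # Leave out noise points before evaluating the clustering
        mask = labels != -1
        n_clusters = len(set(labels[mask]))
        if n_clusters < 2:
            continue  # the silhouette score needs at least 2 clusters

        score = silhouette_score(X[mask], labels[mask])
        results.append((eps, min_samples, n_clusters, score))

# Best-scoring combinations first
results.sort(key=lambda r: r[3], reverse=True)
for eps, min_samples, n_clusters, score in results:
    print(f"eps={eps} min_samples={min_samples} clusters={n_clusters} silhouette={score:.3f}")
```

A high silhouette score on the non-noise points can hide a large share of points labelled as noise, so it is worth checking `mask.mean()` as well before picking parameters.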

# DBSCAN With Best Hyperparameters (eps=1, min_samples=5)

In [67]:
dbscan_model = DBSCAN(eps=1, min_samples=5).fit(df_scaled)
print("For eps =", 1,
      "For min_samples =", 5,
      "Count clusters =", len(set(dbscan_model.labels_)),
      "The average silhouette_score is :", silhouette_score(df_scaled, dbscan_model.labels_))

NameError: name 'df_scaled' is not defined

##### Save clusters for recommendations

In [68]:
df_movies['dbscan_clusters'] = dbscan_model.labels_

NameError: name 'dbscan_model' is not defined

In [69]:
df_movies['dbscan_clusters'].value_counts()

KeyError: 'dbscan_clusters'

<a id="6"></a> <br>
# Step 5: Movie Recommendation Function

#### With the cluster labels in place, we can recommend movies from the same cluster as a movie you like, looked up by its name

In [70]:
import random

def recommend_movie(movie_name: str):
    # Case-insensitive, literal substring match against the titles in the index
    # (regex=False keeps characters such as '(' or '+' in the query from being read as regex syntax)
    mask = df_movies.index.str.lower().str.contains(movie_name.lower(), regex=False, na=False)

    # Find the movies that match the input name, without adding a helper column to df_movies
    movie = df_movies[mask]

    if not movie.empty:
        # Get the cluster label of the input movie
        cluster = movie['dbscan_clusters'].values[0]

        # Get all movies in the same cluster
        cluster_movies = df_movies[df_movies['dbscan_clusters'] == cluster]

        # If there are at least 5 movies in the cluster, randomly select 5
        if len(cluster_movies) >= 5:
            recommended_movies = random.sample(list(cluster_movies.index), 5)
        else:
            # If fewer than 5, return all the movies in the cluster
            recommended_movies = list(cluster_movies.index)

        # Print the recommended movies
        print('--- We can recommend you these movies ---')
        for m in recommended_movies:
            print(m)
    else:
        print('Movie not found in the database.')
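
As a variation on the function above, the following sketch returns a list instead of printing, leaves the queried title out of its own recommendations, and returns nothing for titles that DBSCAN labelled as noise (`-1`). The frame `toy`, the function name `recommend_titles`, and its `n` and `seed` parameters are all hypothetical and only serve the illustration.

```python
import random

import numpy as np
import pandas as pd

# Hypothetical toy frame: titles in the index, one cluster label per title
toy = pd.DataFrame(
    {"dbscan_clusters": [0, 0, 0, 1, -1]},
    index=pd.Index(["Alpha", "Alpha Returns", "Beta", "Gamma", "Delta"], name="title"),
)


def recommend_titles(frame, movie_name, n=5, seed=None):
    """Return up to n titles from the cluster of the first title matching movie_name."""
    lowered = frame.index.str.lower()
    mask = np.asarray(lowered.str.contains(movie_name.lower(), regex=False, na=False), dtype=bool)
    if not mask.any():
        return []

    # Use the first matching title as the query
    pos = int(np.argmax(mask))
    query = frame.index[pos]
    cluster = frame["dbscan_clusters"].iloc[pos]
    if cluster == -1:
        return []  # noise points have no meaningful cluster neighbours

    # All other titles that share the query's cluster
    same_cluster = frame.index[(frame["dbscan_clusters"] == cluster).to_numpy()]
    candidates = [t for t in same_cluster if t != query]

    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))


print(recommend_titles(toy, "alpha", seed=0))
print(recommend_titles(toy, "delta"))
```

Returning a list keeps the function easy to reuse from another front end, where printing would not be useful.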


### 🎉 Now we can enter any movie name and get up to 5 movies that our model recommends

In [71]:
s = input('Input movie name: ')

print("\n\n")
recommend_movie(s)




Movie not found in the database.


In [72]:
s = input('Input movie name: ')

print("\n\n")
recommend_movie(s)




Movie not found in the database.


In [73]:
s = input('Input movie name: ')

print("\n\n")
recommend_movie(s)




Movie not found in the database.


# Streamlit App (save the df_movies dataset for it)

In [74]:
# 'title' is the index of df_movies, so the index must be written; index=False would drop the titles
df_movies.to_csv("clustered_movies.csv")
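
Because `title` is the index of `df_movies`, whether the index is written decides whether the titles survive in the CSV. A minimal round-trip sketch on a made-up frame (`toy` is hypothetical, and an in-memory buffer replaces the file) shows how an app could read the titles back as the index:

```python
import io

import pandas as pd

# Hypothetical toy frame with titles in the index
toy = pd.DataFrame(
    {"dbscan_clusters": [0, 1]},
    index=pd.Index(["Alpha", "Beta"], name="title"),
)

# Write with the index (the default), so the 'title' column is kept
buffer = io.StringIO()
toy.to_csv(buffer)

# Read it back and restore 'title' as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="title")

print(restored)
```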