<a href="https://colab.research.google.com/github/okweipeng/building-a-movie-recommender-system/blob/main/Building_an_Movie_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Project Task:**

Build a recommender system using content-based filtering: implement a system that suggests documents or items (e.g., books, movies, or articles) based on their content similarity to the user's preferences or past interactions.

In this case, the items are movies, recommended based on their content similarity to the user's preferences or past interactions.

**Files to be used via GitHub:**

Source: https://github.com/reisanar/datasets/blob/master/HollywoodMovies.csv

The dataset is HollywoodMovies.csv, provided by Reisanar (GitHub user).

#### **Imports to be used for this Project**

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
# Convert text-based features into numeric values using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# Compute the cosine similarity between movies from their TF-IDF vectors
from sklearn.metrics.pairwise import cosine_similarity
# Standardize numeric features so they share a comparable scale (optional)
from sklearn.preprocessing import StandardScaler

# Optional imports, not required for the task (kept for visual exploration on future revisits)
import matplotlib.pyplot as plt
import seaborn as sns

Here we load the necessary libraries.

#### **Read in the dataset from GitHub using the URL**

In [None]:
# URL of the raw CSV file on GitHub
url = "https://raw.githubusercontent.com/reisanar/datasets/master/HollywoodMovies.csv"

# Read the CSV at that URL into a DataFrame
df = pd.read_csv(url)

The Hollywood movies dataset, provided by Reisanar (GitHub user), is read directly from its raw GitHub URL with pandas.

#### **Understanding the dataset**

In [None]:
# Display the first 5 rows of the dataset
df.head()

Unnamed: 0,Movie,LeadStudio,RottenTomatoes,AudienceScore,Story,Genre,TheatersOpenWeek,OpeningWeekend,BOAvgOpenWeekend,DomesticGross,ForeignGross,WorldGross,Budget,Profitability,OpenProfit,Year
0,Spider-Man 3,Sony,61.0,54.0,Metamorphosis,Action,4252.0,151.1,35540.0,336.53,554.34,890.87,258.0,345.3,58.57,2007
1,Shrek the Third,Paramount,42.0,57.0,Quest,Animation,4122.0,121.6,29507.0,322.72,476.24,798.96,160.0,499.35,76.0,2007
2,Transformers,Paramount,57.0,89.0,Monster Force,Action,4011.0,70.5,17577.0,319.25,390.46,709.71,150.0,473.14,47.0,2007
3,Pirates of the Caribbean: At World's End,Disney,45.0,74.0,Rescue,Action,4362.0,114.7,26302.0,309.42,654.0,963.42,300.0,321.14,38.23,2007
4,Harry Potter and the Order of the Phoenix,Warner Bros,78.0,82.0,Quest,Adventure,4285.0,77.1,17998.0,292.0,647.88,939.89,150.0,626.59,51.4,2007


**Check the number of rows and columns**

In [None]:
# Check the number of rows and columns (before preprocessing)
df.shape

(970, 16)

#### **Copy the dataset before removing unnecessary columns/preprocessing (may be needed for further data exploration); hidden cell**

In [None]:
# Keep a copy of the original df in case we want to explore it further (before column removal)
new_df = df.copy()

#### **Dropping unused columns**

In [None]:
# Drop the columns that are not needed for the recommender
df = df.drop(columns=['TheatersOpenWeek', 'OpeningWeekend', 'BOAvgOpenWeekend', 'DomesticGross', 'ForeignGross', 'WorldGross', 'Profitability', 'OpenProfit', 'Year'])

#### **Dataset Preprocessing**

**Check for null/missing values in the dataset being used**

In [None]:
# Check for missing/null values in the relevant columns
print("\nNull Values:")
print(df.isnull().sum())


Null Values:
Movie               0
LeadStudio          9
RottenTomatoes     57
AudienceScore      63
Story             329
Genre             279
Budget             73
dtype: int64


In [None]:
"""
Fills in missing data so no blanks are left in the dataset.
Uses 'Unknown' for missing text and the column mean for missing numbers (e.g. ratings or budget).
"""

# Replace missing/null values instead of leaving them blank
df = df.assign(
    LeadStudio=df['LeadStudio'].fillna('Unknown'),
    RottenTomatoes=df['RottenTomatoes'].fillna(df['RottenTomatoes'].mean()),
    AudienceScore=df['AudienceScore'].fillna(df['AudienceScore'].mean()),
    Story=df['Story'].fillna('Unknown'),
    Genre=df['Genre'].fillna('Unknown'),
    Budget=df['Budget'].fillna(df['Budget'].mean())
)

**Filling missing values with 'Unknown' for text columns**

*Explanation:* For the text columns (LeadStudio, Story, and Genre), I filled missing/null values with 'Unknown' instead of leaving them blank or using an empty string. This keeps missing data visible rather than treating it as if it were real content, so the missing entries in this dataset remain recognizable.

This approach makes it easier to identify and handle these missing values later in the process, contributing to a more consistent and manageable dataset.

**Filling missing values with the mean for numeric columns**

*Explanation:* For the numeric columns (RottenTomatoes, AudienceScore, and Budget), missing values are filled with the column mean. This leaves each column's mean unchanged, although it slightly reduces the column's variance.

The mean is often used because it represents the average value, making it a reasonable estimate for missing data. Filling missing values with the mean helps prevent errors caused by incomplete data while keeping the rest of the data intact.
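As a minimal sketch of the two fill strategies described above, the following uses a small made-up table. The column names mirror the dataset, but the values are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Genre": ["Action", None, "Comedy"],
    "Budget": [100.0, np.nan, 50.0],
})

filled = toy.assign(
    # Text column: mark missing entries explicitly
    Genre=toy["Genre"].fillna("Unknown"),
    # Numeric column: use the mean of the observed values, (100 + 50) / 2
    Budget=toy["Budget"].fillna(toy["Budget"].mean()),
)

print(filled)
```

The column mean is the same before and after filling, which is the sense in which mean imputation keeps the average intact.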

**Check the dataset for duplicate rows (if any)**

In [None]:
# Select the rows that are exact duplicates of an earlier row
duplicates = df[df.duplicated()]
# Show the duplicate rows (to be removed in the next step)
duplicates

Unnamed: 0,Movie,LeadStudio,RottenTomatoes,AudienceScore,Story,Genre,Budget
966,The Call,TriStar,43.0,66.0,Unknown,Unknown,13.0


**Drop the duplicated row**

In [None]:
# Dropping the duplicate
df = df.drop_duplicates()

**Note:** The output above shows a row that duplicates an earlier one, so the duplicated row is removed.

In [None]:
# Verify that the duplicate shown earlier was dropped (there should now be no duplicates)
duplicate_rows = df[df.duplicated()]
print(f"Number of duplicate rows now (after dropped): {len(duplicate_rows)}")

Number of duplicate rows now (after dropped): 0


**Review the dataset after removing unused columns and duplicates**

In [None]:
# The first rows of the remaining data we will work with
df.head()

Unnamed: 0,Movie,LeadStudio,RottenTomatoes,AudienceScore,Story,Genre,Budget
0,Spider-Man 3,Sony,61.0,54.0,Metamorphosis,Action,258.0
1,Shrek the Third,Paramount,42.0,57.0,Quest,Animation,160.0
2,Transformers,Paramount,57.0,89.0,Monster Force,Action,150.0
3,Pirates of the Caribbean: At World's End,Disney,45.0,74.0,Rescue,Action,300.0
4,Harry Potter and the Order of the Phoenix,Warner Bros,78.0,82.0,Quest,Adventure,150.0


In [None]:
# Recheck the shape (after dropping unused columns and duplicates)
df.shape

(969, 7)

**Note:** The shape shown above reflects the updated number of rows and columns.

### **Scaling numeric values**

In [None]:
"""
Standardize the numeric columns so they are on a comparable scale (mean 0, standard deviation 1).
This keeps columns with large raw ranges from dominating columns with small ranges.
"""

# Create the scaler
scaler = StandardScaler()

# Columns to standardize (so that differently ranged values have equal impact)
numerical_features = ['RottenTomatoes','Budget']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

Standardizing numerical features puts different values, such as RottenTomatoes and Budget, on the same scale: each column is shifted to a mean of 0 and rescaled to a standard deviation of 1, so no column dominates simply because its raw values are larger.

By scaling these features to a similar range, each one can have a comparable impact on the model we are implementing (the movie recommendation system), with the aim of producing more accurate recommendations. Of the two scaled columns, only RottenTomatoes is later added to the combined features.
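To make the scaling step concrete, here is a small sketch on made-up numbers (the values are hypothetical, not taken from the dataset). It compares `StandardScaler` with the same calculation done by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales (made up for illustration).
X = np.array([[20.0, 10.0],
              [50.0, 100.0],
              [80.0, 190.0]])

scaled = StandardScaler().fit_transform(X)

# The same calculation by hand: (x - column mean) / column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), as does np.std by default.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled.round(3))
print(np.allclose(scaled, manual))
```

After scaling, both columns have mean 0 and standard deviation 1, even though their raw ranges differ by an order of magnitude.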

### **Combining the utilized features**

In [None]:
# Combine the features used for recommendation into a single text field per movie
df['combined_features'] = df['Genre'] + ' ' + df['Story'] + ' ' + df['LeadStudio'] + ' ' + df['RottenTomatoes'].astype(str)

**Several factors play a role in determining which movies are recommended to the user**

**Genre:**

Captures the movie's genre, which helps match movies of similar genres.

**Story:**

Captures the story type (e.g. Quest or Rescue), which lets the model match movies with similar plots.

**LeadStudio:**

Provides information on the production studio.

**RottenTomatoes (scaled):**

Allows the system to take the Rotten Tomatoes rating into account when matching movies. (AudienceScore is not part of the combined features.)

**Note:**

The year is not included as a factor in determining whether a movie is relevant to the user's history or recently watched movie.

This gives a wider span of movies that are relevant to the user's preferences for recommendation.
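As a rough sketch of what the combined text looks like and how scikit-learn breaks it into terms, the example below builds one hypothetical row (the score `0.3912` is a made-up stand-in for a standardized RottenTomatoes value) and runs it through the analyzer of a `TfidfVectorizer` configured with the same `stop_words` and `ngram_range` as in this notebook. `min_df` is left out because it only applies when fitting on a whole corpus.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical single row; the score is a made-up standardized value.
toy = pd.DataFrame({
    "Genre": ["Action"],
    "Story": ["Quest"],
    "LeadStudio": ["Sony"],
    "RottenTomatoes": [0.3912],
})

combined = (toy["Genre"] + " " + toy["Story"] + " " + toy["LeadStudio"]
            + " " + toy["RottenTomatoes"].astype(str))

# build_analyzer() returns the function the vectorizer uses to turn text into terms
analyzer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2)).build_analyzer()
tokens = analyzer(combined.iloc[0])

print(combined.iloc[0])
print(tokens)
```

One thing this sketch suggests: with the default token pattern, the number is treated as text and split at the decimal point, so two movies only share a score term when the digits match exactly. This is an observation about scikit-learn's default tokenizer, not something the notebook itself claims.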

#### **TF-IDF Extraction**

In [None]:
"""
Creates a TF-IDF vectorizer to convert text data into numerical form, weighting words and word pairs by how distinctive they are
"""

# Create a TF-IDF vectorizer with specified parameters for better extraction
tfidf_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), min_df=5)

# Fit and transform the combined features into a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(df['combined_features'])
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")

TF-IDF Matrix Shape: (969, 271)


This step turns movie details such as genre and story into numbers, extracting meaningful features from text. TF-IDF (Term Frequency-Inverse Document Frequency) gives higher weight to words and phrases that are distinctive for a movie and lower weight to those that appear in many movies, while English stopwords are ignored. By looking at both single words and word pairs (ngram_range=(1,2)), it creates a numerical representation of each movie based on its combined features, and min_df=5 keeps only terms that appear in at least 5 movies.

This helps identify which movies are similar to each other, so that well-founded recommendations can be made.
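The effect of `ngram_range` and `min_df` can be seen on a tiny made-up corpus. The strings below are hypothetical, and `min_df=2` is used instead of the notebook's `min_df=5` only because the toy corpus has three documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical combined-feature strings (made up for illustration).
docs = [
    "Action Quest Sony",
    "Action Quest Disney",
    "Comedy Love Sony",
]

# min_df=2 keeps only terms that occur in at least 2 documents.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=2)
matrix = vectorizer.fit_transform(docs)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(matrix.shape)
```

Terms that occur in a single document, such as "disney" or "comedy love", are dropped, which is why the number of columns in the TF-IDF matrix is much smaller than the number of distinct words and word pairs in the text.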

#### **Computing the Cosine Similarity**

In [None]:
"""
Calculates the similarity between all movies based on their combined features using cosine similarity.
Then converts the similarity values into a DataFrame so that scores can be viewed and looked up by movie title
"""

#Calculate cosine similarity between all movies based on their combined features
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

#Convert the cosine similarity matrix into a df for easy access
cosine_sim_df = pd.DataFrame(cosine_sim, index=df['Movie'], columns=df['Movie'])

Cosine similarity measures how similar each movie is to every other movie using the combined features (genre, story, lead studio, and the scaled Rotten Tomatoes score). It compares the movies' TF-IDF vectors and measures the angle between them: because TF-IDF values are non-negative, the score ranges from 0 (no terms in common) to 1 (vectors pointing in the same direction).

A higher similarity score means the movies are more alike. This makes it easy to see which movies are most similar, helping to recommend movies that are alike based on their descriptions.
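For intuition, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The sketch below uses three made-up vectors (hypothetical stand-ins for TF-IDF rows) and checks the hand calculation against `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF-like vectors (made up for illustration).
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 2.0])

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Pairwise similarities between all three vectors, as in the notebook's matrix
sim_matrix = cosine_similarity(np.vstack([a, b, c]))

print(cosine(a, b))
print(sim_matrix.round(4))
```

Vectors `a` and `b` share one term and get a score between 0 and 1, while `a` and `c` share nothing and get 0. The diagonal is 1 because every vector is identical to itself.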

#### **Defining Recommendation Function**

In [None]:
"""
Defines a function that recommends similar movies based on cosine similarity scores.
It checks that the movie is in the dataset, then finds and returns the top N most similar movies.
"""

def recommend_movies(movie_title, cosine_sim_df, top_n=5):
    # Fail early with a clear message if the title is unknown
    if movie_title not in cosine_sim_df.index:
        raise ValueError(f"Movie title '{movie_title}' not found in the dataset.")
    # Similarity of every movie to the requested one, sorted from highest to lowest
    similar_scores = cosine_sim_df[movie_title]
    similar_movies = similar_scores.sort_values(ascending=False)
    # Exclude the requested movie itself from the results
    similar_movies = similar_movies.drop(movie_title)

    # Return the top_n most similar movies
    return similar_movies.head(top_n)

The function recommends movies that are similar to a given movie, based on the combined features (genre, story, lead studio, and the scaled Rotten Tomatoes score).

First, it checks that the movie title is in the dataset and raises a ValueError if it is not. Then it looks up the similarity scores of that movie against all others and sorts them from most similar to least similar.

The function excludes the original movie from the list and returns the top-recommended movies with their similarity scores.

This helps users find movies that are closely related to the one they like or are interested in.
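The behaviour described above can be tried on a tiny hand-written similarity table. In this sketch the function body repeats the notebook's logic so the example runs on its own; the titles and scores are hypothetical.

```python
import pandas as pd

def recommend_movies(movie_title, cosine_sim_df, top_n=5):
    # Same logic as the notebook's function, repeated so this sketch is self-contained
    if movie_title not in cosine_sim_df.index:
        raise ValueError(f"Movie title '{movie_title}' not found in the dataset.")
    scores = cosine_sim_df[movie_title].sort_values(ascending=False)
    return scores.drop(movie_title).head(top_n)

# Hypothetical titles and similarity scores (made up for illustration).
titles = ["Movie A", "Movie B", "Movie C"]
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=titles, columns=titles,
)

top = recommend_movies("Movie A", toy_sim, top_n=2)
print(top)

# An unknown title raises ValueError instead of returning an empty result
try:
    recommend_movies("Movie Z", toy_sim)
except ValueError as err:
    print(err)
```

The requested movie never appears in its own recommendations, and the remaining titles come back ordered by score.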

#### **Testing the recommender system using content-based filtering**

**For this test, the user's past interaction is represented by the movie assigned to movie_title; recommendations are based on content similarity to it**

In [None]:
# Example movie title to test the recommendation system
movie_title = "Harry Potter and the Order of the Phoenix"
# Get the top 5 movies by cosine similarity score
recommended_movies = recommend_movies(movie_title, cosine_sim_df)
# Show the top 5 movie recommendations
recommended_movies

Movie,Harry Potter and the Order of the Phoenix
Harry Potter and the Half-Blood Prince,0.739794
Harry Potter and the Deathly Hallows Part 1,0.735445
Fool's Gold,0.730149
Inkheart,0.730149
The Informant!,0.579787


In [None]:
# Another movie to test
movie_title = "Fool's Gold"
# Get the top 5 movies by cosine similarity score
recommended_movies = recommend_movies(movie_title, cosine_sim_df, top_n=5)
# Show the top 5 movie recommendations
recommended_movies

Movie,Fool's Gold
Harry Potter and the Half-Blood Prince,0.803709
Harry Potter and the Deathly Hallows Part 1,0.798985
Inkheart,0.793231
Harry Potter and the Order of the Phoenix,0.730149
I Am Legend,0.532426


**Summary of movie recommendation testing:**

For each tested movie title, the output lists its top 5 most similar movies, each with a similarity score indicating how closely it matches the input movie. The similarity score reflects resemblance in the combined features: genre, story, lead studio, and the scaled Rotten Tomatoes score. The year of release was dropped earlier and does not contribute.

The recommendation system ranks movies from highest to lowest similarity. Further down the list the scores decrease, signifying that those movies share fewer attributes with the input movie. This behavior shows that the system prioritizes the movies most relevant to the user's past selections while still surfacing weaker matches as the scores decline.

**Revisit the list of movies before proceeding to the next steps**

In [None]:
# A pivot table of AudienceScore indexed by movie (the default aggregation is the mean if a title appears more than once)
movie_lists = pd.pivot_table(df,values=['AudienceScore'],index='Movie')

# Sort by the audience score of each movie, highest first
sorted_movie_lists = movie_lists.sort_values(by='AudienceScore', ascending=False)

# In Colab, the output can be converted to an interactive table (button at the top right) to browse the sorted list of movies
sorted_movie_lists

Movie,AudienceScore
The Dark Knight,96.0
Warrior,93.0
The King's Speech,93.0
Inception,93.0
50/50,93.0
...,...
Jonah Hex,24.0
The Devil Inside,22.0
The Haunting of Molly Hartley,22.0
Stone,20.0


**User Testing (recommendations for a movie chosen from the dataset)**

**Note:**

Using the list of movies above, you can specify a movie yourself and see the recommendations along with their similarity scores.

The lookup was previously case-sensitive; the input is now normalized, so the movie name can be written in any letter case (with or without surrounding spaces) and a recommendation will still be provided.

In [None]:
# Read the user's input and normalize it (trim whitespace, lowercase)
movie_name = input("User Input; Provide an movie for further recommendations!: ").strip().lower()

# Normalize the dataset titles the same way so that any letter case matches
df['normalized_movie'] = df['Movie'].str.strip().str.lower()

# Check whether the movie exists in the dataset being used
if movie_name not in df['normalized_movie'].values:
    print(f"Sorry, the movie '{movie_name}' is not included in the dataset.")
else:
    # Find the original title and recommend movies
    movie_example_original = df[df['normalized_movie'] == movie_name].iloc[0]['Movie']
    recommended_movies = recommend_movies(movie_example_original, cosine_sim_df)
    print(f"\nRecommended movies similar to '{movie_example_original}':\n{recommended_movies}")

User Input; Provide an movie for further recommendations!: inkheart

Recommended movies similar to 'Inkheart':
Movie
Harry Potter and the Half-Blood Prince         0.803709
Harry Potter and the Deathly Hallows Part 1    0.798985
Fool's Gold                                    0.793231
Harry Potter and the Order of the Phoenix      0.730149
I Am Legend                                    0.532426
Name: Inkheart, dtype: float64


**Recommendations are ranked by each movie's similarity score (the list is sorted from highest to lowest)**

In the example above, I entered "inkheart"; the output shows similar movies from the dataset, each alongside its cosine similarity score.

**Please See:**

If the movie is not in the dataset, the block of code above will not give a recommendation; instead, it prints a message saying that the movie is not included in the dataset.

For a valid result, please enter a movie from the dataset so that the closest recommendations can be provided along with their cosine similarity scores.
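The case-insensitive lookup used in the cell above could also be wrapped in a small helper. The sketch below is one possible shape for it; the function name `find_title` is hypothetical and not part of the notebook, and the sample titles are a few of those that appear in the outputs above.

```python
import pandas as pd

def find_title(user_input, titles):
    """Return the original title matching user_input (ignoring letter case and
    surrounding whitespace), or None if there is no match."""
    normalized = titles.str.strip().str.lower()
    matches = titles[normalized == user_input.strip().lower()]
    return matches.iloc[0] if not matches.empty else None

# A few titles from the outputs above, used here as sample data.
titles = pd.Series(["Inkheart", "Fool's Gold", "I Am Legend"])

print(find_title("  inkheart ", titles))
print(find_title("FOOL'S GOLD", titles))
print(find_title("not a movie", titles))
```

Returning `None` for an unknown title lets the caller decide whether to print a message, as the notebook does, or to raise an error.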

**This notebook is kept for future reference and revisits to explore the dataset further.**