<a href="https://colab.research.google.com/github/johnjoel2001/lumaa-spring-2025-ai-ml/blob/main/Content_Based_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing Repository**

In [None]:
# Pull the project files from GitHub into the Colab workspace
import os

# Remove Colab's default sample_data folder (-f: no error if it is already gone)
!rm -rf ./sample_data

# Clone the repository unless it is already present in the workspace
repo_name = "lumaa-spring-2025-ai-ml"
git_path = 'https://github.com/johnjoel2001/lumaa-spring-2025-ai-ml.git'
if not os.path.isdir(repo_name):
    !git clone "{git_path}"

# **Importing Dependencies**

In [8]:
# Importing Dependencies
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# **Loading The Dataset**

In [63]:
# Loading the dataset

csv_path = "/content/lumaa-spring-2025-ai-ml/american_movies_500.csv"
df = pd.read_csv(csv_path)
pd.set_option('display.max_colwidth', 50)
df.head() # Printing the first five rows of our dataset




Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,2004,Highwaymen,American,Robert Harmon,"Jim Caviezel, Rhona Mitra, Colm Feore",action,https://en.wikipedia.org/wiki/Highwaymen_(film),A man known only as Rennie (Jim Caviezel) is m...
1,2008,High School Musical 3: Senior Year,American,Kenny Ortega,"Zac Efron, Vanessa Hudgens, Ashley Tisdale, Co...","musical, family",https://en.wikipedia.org/wiki/High_School_Musi...,The film opens with Troy Bolton and the rest o...
2,2004,Million Dollar Baby,American,Clint Eastwood,"Clint Eastwood, Hilary Swank, Morgan Freeman, ...",drama,https://en.wikipedia.org/wiki/Million_Dollar_Baby,"Margaret ""Maggie"" Fitzgerald, a waitress from ..."
3,2016,13 Hours: The Secret Soldiers of Benghazi,American,Michael Bay,James Badge Dale\r\nJohn Krasinski\r\nMax Martini,action,https://en.wikipedia.org/wiki/13_Hours:_The_Se...,"In 2012, Benghazi, Libya is named one of the m..."
4,2001,Black Knight,American,Gil Junger,Martin Lawrence,comedy,https://en.wikipedia.org/wiki/Black_Knight_(film),Jamal Walker (Martin Lawrence) is an everyday ...


# **Checking Columns & Null Values**

In [61]:
df.isnull().sum()

Unnamed: 0,0
Release Year,0
Title,0
Origin/Ethnicity,0
Director,0
Cast,2
Genre,0
Wiki Page,0
Plot,0


In [39]:
df.columns, df.dtypes

(Index(['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast',
        'Genre', 'Wiki Page', 'Plot'],
       dtype='object'),
 Release Year         int64
 Title               object
 Origin/Ethnicity    object
 Director            object
 Cast                object
 Genre               object
 Wiki Page           object
 Plot                object
 dtype: object)

**Observations**
* The only missing values are two entries in `Cast`; the columns used for recommendations, `Title` and `Plot`, have none.
* Both the `Title` and `Plot` columns are stored as object data types.
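
The null check above can also be written as an explicit guard on just the columns the recommender relies on. The following is a minimal sketch on a hypothetical three-row frame (`toy_df`, `required`, and the film names are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset; one Cast entry is missing
toy_df = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Cast": ["Actor X", None, "Actor Y"],
    "Plot": ["A heist goes wrong.", "Two rivals team up.", "A ship is lost at sea."],
})

# Only these columns feed the recommender, so only they must be complete
required = ["Title", "Plot"]
missing = toy_df[required].isnull().sum()
print(missing)

# Drop rows that lack a required field; rows with a missing Cast are kept
usable = toy_df.dropna(subset=required)
print(len(usable), "of", len(toy_df), "rows usable")
```

Restricting `dropna` to `subset=required` keeps rows whose only gap is in a column the pipeline never reads.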


# **Text Vectorization**

In [40]:
def vectorize_text(data):
    """
    Convert text data to TF-IDF vectors:
    - Removes English stop words
    - Ignores terms found in more than 85% of documents (max_df)
      or in fewer than 2 documents (min_df)
    - Captures single words, bigrams, and trigrams
    - Caps the vocabulary at the 10,000 most frequent terms (max_features)
    """
    vectorizer = TfidfVectorizer(
        stop_words='english',
        max_df=0.85,
        min_df=2,
        ngram_range=(1, 3),
        max_features=10000,
    )
    tfidf_matrix = vectorizer.fit_transform(data)

    return vectorizer, tfidf_matrix
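
To see how the document-frequency filters shape the vocabulary, here is a small sketch on a hypothetical three-sentence corpus (`toy_plots` and the other `toy_*` names are illustrative). It reuses the same `stop_words`, `max_df`, `min_df`, and `ngram_range` settings but omits `max_features`, which has no effect at this size:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the movie dataset
toy_plots = [
    "a space pilot fights alien invaders",
    "a space crew fights a rogue robot",
    "two friends open a bakery in paris",
]

toy_vectorizer = TfidfVectorizer(
    stop_words="english",
    max_df=0.85,
    min_df=2,            # keep only terms present in at least two documents
    ngram_range=(1, 3),
)
toy_matrix = toy_vectorizer.fit_transform(toy_plots)

# Only terms shared by two documents survive the min_df filter
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_matrix.shape)

# A document with no surviving terms becomes an all-zero row
print(toy_matrix[2].nnz)
```

On a corpus this small, `min_df=2` removes almost everything, and the third document ends up with no features at all. On the full plot data the same filter mainly discards one-off names and rare n-grams.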

# **Reducing Dimensionality**

In [41]:
def reduce_dimensionality(tfidf_matrix):
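    """
    Apply truncated Singular Value Decomposition (SVD) for dimensionality reduction:
    - Projects the TF-IDF matrix onto 100 components, keeping the
      directions with the largest singular values.
    - Captures latent relationships between words and documents.
    - Returns the fitted SVD and the L2-normalized reduced matrix.
    """
    # Fixed seed so the randomized solver gives the same result on every run
    svd = TruncatedSVD(n_components=100, random_state=42)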
    tfidf_reduced = svd.fit_transform(tfidf_matrix)
    return svd, normalize(tfidf_reduced, norm='l2', axis=1)
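
One way to judge whether a given number of components is enough is to look at `explained_variance_ratio_` on the fitted `TruncatedSVD`. The sketch below uses a hypothetical random matrix in place of the real TF-IDF matrix (`toy_tfidf`, its shape, and `n_components=10` are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Hypothetical stand-in for a TF-IDF matrix: 50 documents, 200 terms, mostly zeros
rng = np.random.default_rng(0)
toy_tfidf = rng.random((50, 200)) * (rng.random((50, 200)) < 0.05)

toy_svd = TruncatedSVD(n_components=10, random_state=0)
toy_reduced = normalize(toy_svd.fit_transform(toy_tfidf), norm="l2", axis=1)

# Share of the input variance captured by the retained components
retained = toy_svd.explained_variance_ratio_.sum()
print(f"Variance retained by 10 components: {retained:.2%}")

# After L2 normalization every non-zero row has unit length
row_norms = np.linalg.norm(toy_reduced, axis=1)
print(row_norms.min(), row_norms.max())
```

If the retained share on the real data turns out to be low, raising `n_components` is the obvious knob to try, at the cost of a larger reduced matrix.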

# **Processing The Query**

In [42]:
def process_query(user_query, vectorizer, svd):
    """
    Processing the user query:
    - Converting query into TF-IDF representation.
    - Applying SVD transformation to align with reduced feature space.
    - Normalizing query vector for accurate similarity comparison.
    """
    query_vec = vectorizer.transform([user_query])
    query_vec = svd.transform(query_vec)
    return normalize(query_vec, norm='l2', axis=1)
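
A query is only useful if at least one of its terms is in the fitted vocabulary. The sketch below, built on a hypothetical four-sentence corpus (`toy_plots`, `embed`, and the sample queries are illustrative), applies the same transform, SVD, and normalize steps and compares a query with known words against one with none:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Hypothetical toy corpus, not taken from the movie dataset
toy_plots = [
    "space pilot fights alien invaders",
    "space crew repairs a broken station",
    "detective hunts a killer in the city",
    "killer robot attacks the city",
]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_svd.fit(toy_vectorizer.fit_transform(toy_plots))

def embed(query):
    """Same three steps as process_query: TF-IDF, SVD projection, L2 normalization."""
    vec = toy_svd.transform(toy_vectorizer.transform([query]))
    return normalize(vec, norm="l2", axis=1)

known = embed("alien space station")
unknown = embed("zzzz qqqq")  # no token appears in the fitted vocabulary

print(np.linalg.norm(known))
print(np.linalg.norm(unknown))
```

An all-zero query vector yields a cosine similarity of 0 against every document, so the ranking it produces is arbitrary. Checking the norm before ranking is a cheap way to detect this case.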

# **Computing Similarity & Recommendation Retrieval**

In [51]:
def get_recommendations(query_vec, tfidf_reduced, df, top_n=5):
    """
    Computing similarity scores and retrieving top recommendations:
    - Uses cosine similarity to find the closest matches
    - Returns top N most similar movies

    """
    similarity_scores = cosine_similarity(query_vec, tfidf_reduced).flatten()
    top_indices = similarity_scores.argsort()[-top_n:][::-1]
    return pd.DataFrame({
        "Title": df.iloc[top_indices]['Title'].values,
        "Similarity Score": similarity_scores[top_indices]
    })
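
Because both the document matrix and the query are L2-normalized, cosine similarity reduces to a plain dot product. The following sketch checks this on hypothetical 2-D vectors (`docs`, `query`, and `top_n = 2` are illustrative) and walks through the same `argsort` slicing used to pick the top results:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Hypothetical 2-D "documents" and query, L2-normalized like the reduced vectors
docs = normalize(np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]))
query = normalize(np.array([[2.0, 1.0]]))

scores = cosine_similarity(query, docs).flatten()
dot_scores = (query @ docs.T).flatten()  # same values for unit-length rows

# argsort is ascending: take the last top_n indices, then reverse them
top_n = 2
top_indices = scores.argsort()[-top_n:][::-1]
print(top_indices)
print(scores[top_indices])
```

The query points between the first two documents but closer to the diagonal one, so index 1 ranks first and index 0 second.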

# **Running the Pipeline**

In [54]:
vectorizer, tfidf_matrix = vectorize_text(df['Plot'])
svd, tfidf_reduced = reduce_dimensionality(tfidf_matrix)
user_query = "I love thrilling action movies set in space, with a comedic twist"
query_vec = process_query(user_query, vectorizer, svd)
top_movies = get_recommendations(query_vec, tfidf_reduced, df)

# **Displaying Results**

In [57]:
# Display results
print("\nTop 5 recommendations:")
display(top_movies)


Top 5 recommendations:


Unnamed: 0,Title,Similarity Score
0,What Love Is,0.438784
1,Get Over It,0.417528
2,Bedtime Stories,0.411893
3,Aloha,0.385455
4,Superman: Unbound,0.385011
