
In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from math import sqrt

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

#1. Load the Dataset (5 marks)

Load the dataset. Display the first 5 rows, check for missing values in the plot_synopsis column, and report basic statistics.
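As a minimal sketch of the column-level check, assuming a small hypothetical DataFrame `toy` in place of movies_w_plot.csv, missing synopses can be counted for that single column and summarised by length:

```python
import pandas as pd

# Hypothetical stand-in for movies_w_plot; the real file is not read here.
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "plot_synopsis": ["A boy uses his toys...", None, "A feud between neighbours..."],
})

# Count missing values in the plot_synopsis column only.
missing_synopses = int(toy["plot_synopsis"].isnull().sum())
print(f"Missing plot_synopsis values: {missing_synopses}")

# Basic statistics on synopsis length (in characters), ignoring missing rows.
length_stats = toy["plot_synopsis"].dropna().str.len().describe()
print(length_stats)
```

The same two calls should apply unchanged to the real frame by swapping `toy` for movies_w_plot.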

In [None]:
movies_w_plot = pd.read_csv('movies_w_plot.csv', encoding='latin-1')
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,12882,1,4.0,1147195252
1,12882,32,3.5,1147195307
2,12882,47,5.0,1147195343
3,12882,50,5.0,1147185499
4,12882,110,4.5,1147195239


In [None]:
movies_w_plot.head()

Unnamed: 0,movieId,title,genres,plot_synopsis,tags
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,A boy called Andy Davis (voice: John Morris) u...,"comedy, fantasy, cult, cute, violence, clever,..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,The film begins in 1869 in the town of Brantfo...,"psychedelic, fantasy"
2,3,Grumpier Old Men (1995),Comedy|Romance,The feud between Max (Walter Matthau) and John...,"revenge, comedy, prank"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"""Friends are the People who let you be yoursel...",revenge
4,5,Father of the Bride Part II (1995),Comedy,The film begins five years after the events of...,"romantic, comedy, fantasy, sentimental"


In [None]:
movies_w_plot.isnull().sum()

Unnamed: 0,0
movieId,0
title,0
genres,0
plot_synopsis,0
tags,0
processed_plot_synopsis,0
cleaned,0
cleaned_synopsis,0


In [None]:
#Summary statistics
print("\n" + "-"*80)
print("DATASET SUMMARY")
print("-"*80)

total_ratings = len(ratings)
num_users = ratings['userId'].nunique()
num_movies = ratings['movieId'].nunique()
min_rating = ratings['rating'].min()
max_rating = ratings['rating'].max()
print(f"Total ratings: {total_ratings:,}")
print(f"Number of unique users: {num_users:,}")
print(f"Number of unique movies: {num_movies:,}")
print(f"Rating scale: {min_rating} to {max_rating}")


# Rating distribution

print("\nRating distribution:")

print(ratings['rating'].value_counts().sort_index())


--------------------------------------------------------------------------------
DATASET SUMMARY
--------------------------------------------------------------------------------
Total ratings: 264,505
Number of unique users: 862
Number of unique movies: 2,500
Rating scale: 0.5 to 5.0

Rating distribution:
rating
0.5     3595
1.0     5543
1.5     5716
2.0    15978
2.5    19017
3.0    44574
3.5    47305
4.0    66481
4.5    30529
5.0    25767
Name: count, dtype: int64


#2. Data Preprocessing (15 marks)

Implement a text preprocessing pipeline for the plot_synopsis column. Choose the steps that suit the data best (e.g. lowercasing, stopword removal, …).

Create a new column, cleaned_synopsis, with the processed text. Show 3 examples of original vs. cleaned plot synopses.
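To make the effect of such a pipeline concrete, here is a small dependency-free sketch. It is an assumption-laden simplification: `TOY_STOP_WORDS` is a hand-picked subset rather than NLTK's English list, and whitespace splitting stands in for `word_tokenize`.

```python
import re

# Illustrative stop-word subset; the notebook itself uses NLTK's full English list.
TOY_STOP_WORDS = {"a", "the", "is", "his", "to", "out", "and", "by"}

def toy_process_text(text):
    """Lowercase, keep letters only, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)
    tokens = text.split()
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

cleaned_example = toy_process_text(
    "A boy called Andy Davis uses his toys to act out a bank robbery."
)
print(cleaned_example)  # boy called andy davis uses toys act bank robbery
```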

In [None]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

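def process_text(text):
    """Lowercase, strip non-letters, tokenize, and drop English stop words."""
    #lowercasing, so that "Toy" and "toy" map to the same token
    text = text.lower()

    #replacing punctuation, digits & special characters with spaces
    text = re.sub(r'[^a-z]', ' ', text)

    #tokenization
    tokens = word_tokenize(text)

    #removing stop words
    tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(tokens)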

In [None]:
movies_w_plot['cleaned_synopsis'] = movies_w_plot['plot_synopsis'].apply(process_text)
for i in range(3):
  print(f"Example {i+1}")
  print("Original Plot:")
  print(movies_w_plot.loc[i, 'plot_synopsis'])
  print("Cleaned:")
  print(movies_w_plot.loc[i, 'cleaned_synopsis'])

Example 1
Original Plot:
A boy called Andy Davis (voice: John Morris) uses his toys to act out a bank robbery. The bank is a cardboard box, the robber is Mr. Potato Head (voice: Don Rickles) assisted by Slinky Dog (voice: Jim Varney), and the bystanders include Bo Peep (voice: Annie Potts) and her sheep. The day is saved by cowboy doll Woody (voice: Tom Hanks) playing the sheriff, with help from Rex the dinosaur (voice: Wallace Shawn). Woody is the only toy who gets to say his own lines because he has a pull-string that makes him say things like "Reach for the sky!" and "You're my favorite deputy!"During the opening credits (soundtrack: Randy Newman's "You've Got a Friend in Me"), Andy takes Woody downstairs to find his mother (voice: Laurie Metcalf) decorating the dining room for his birthday party. He asks if they can leave the decorations up until they move, and his mom agrees. She says the guests will arrive soon and sends him back upstairs to get his baby sister Molly (voice: Hann

#3. Feature Extraction (TF-IDF implementation) (10 marks)

Apply TF-IDF vectorization on the cleaned_synopsis column of the training set. Configure the vectorizer with:

- Maximum of 500 features (max_features=500)
- Minimum document frequency of 2 (min_df=2)

Compute tfidf_matrix, then calculate the cosine similarity between all movies from that matrix.

In [None]:
#================================================================================
#STEP 3: TF-IDF VECTORIZATION
#================================================================================

print("Step 3: TF-IDF vectorization...")

tfidf = TfidfVectorizer(
    max_features = 500,
    min_df = 2,
    stop_words = 'english'
    )

tfidf_matrix = tfidf.fit_transform(movies_w_plot['cleaned_synopsis'])

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Number of TF-IDF features: {len(tfidf.get_feature_names_out())}")

Step 3: TF-IDF vectorization...
TF-IDF matrix shape: (2215, 500)
Number of TF-IDF features: 500


In [None]:
cosine_sim=cosine_similarity(tfidf_matrix)
cosine_sim_data=pd.DataFrame(cosine_sim, index=movies_w_plot['title'], columns=movies_w_plot['title'])
cosine_sim_data.head()

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Sudden Death (1995),GoldenEye (1995),"American President, The (1995)",...,"Hunger Games: Catching Fire, The (2013)","Hobbit: The Desolation of Smaug, The (2013)","Wolf of Wall Street, The (2013)",Her (2013),"Grand Budapest Hotel, The (2014)",Interstellar (2014),X-Men: Days of Future Past (2014),Edge of Tomorrow (2014),Gone Girl (2014),Guardians of the Galaxy (2014)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Toy Story (1995),1.0,0.096229,0.034031,0.050797,0.071822,0.204726,0.035252,0.068023,0.031585,0.065038,...,0.120322,0.117981,0.225527,0.083298,0.118056,0.154476,0.183953,0.10339,0.0637,0.100581
Jumanji (1995),0.096229,1.0,0.033937,0.066074,0.079621,0.182674,0.028981,0.311808,0.02898,0.050945,...,0.235012,0.07769,0.17232,0.058395,0.109464,0.11257,0.104533,0.064019,0.044659,0.355564
Grumpier Old Men (1995),0.034031,0.033937,1.0,0.042866,0.028404,0.049203,0.018637,0.047859,0.01434,0.025594,...,0.041895,0.082751,0.158751,0.021171,0.032958,0.075263,0.062093,0.061575,0.020096,0.02464
Waiting to Exhale (1995),0.050797,0.066074,0.042866,1.0,0.08375,0.086809,0.115479,0.127719,0.018002,0.065799,...,0.082272,0.116258,0.189948,0.267659,0.131468,0.145938,0.100121,0.053217,0.048107,0.036742
Father of the Bride Part II (1995),0.071822,0.079621,0.028404,0.08375,1.0,0.07272,0.038149,0.065363,0.014096,0.074086,...,0.063384,0.05337,0.133928,0.058631,0.09093,0.102923,0.06255,0.053737,0.036579,0.041389


#4. Building Recommender System (15 marks)

Implement a function get_similar_movies (with min_rating=3.5) that:

- Takes a movie title as input (e.g. Grumpier Old Men (1995)) and recommends similar movies.
- Returns the top n most similar movies, excluding the input movie itself.

Build a user profile for userId 359 and, using the content-based recommender, recommend items to that user.

The function should return movie titles, similarity scores, ratings, and genres.
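One possible way to honour min_rating and return ratings and genres alongside the similarity is sketched below. It is not the notebook's implementation: the frames `toy_sim`, `toy_movies`, and `toy_ratings` are hypothetical stand-ins for cosine_sim_data, movies_w_plot, and ratings, and filtering on each movie's mean rating is an assumed interpretation of min_rating.

```python
import pandas as pd

# Hypothetical toy data standing in for cosine_sim_data, movies_w_plot and ratings.
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.6, 0.3],
     [0.9, 1.0, 0.5, 0.2],
     [0.6, 0.5, 1.0, 0.4],
     [0.3, 0.2, 0.4, 1.0]],
    index=toy_titles, columns=toy_titles,
)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": toy_titles,
    "genres": ["Comedy", "Comedy|Romance", "Drama", "Action"],
})
toy_ratings = pd.DataFrame({
    "movieId": [2, 2, 3, 3, 4],
    "rating": [3.0, 3.0, 4.0, 4.5, 5.0],
})

def similar_movies_with_ratings(title, n=10, min_rating=3.5):
    """Top-n similar titles whose mean rating is at least min_rating."""
    if title not in toy_sim.columns:
        return None
    # Similarity to the query title, with the query itself removed.
    scores = toy_sim[title].drop(title).rename("similarity")
    # Mean rating per movie, joined onto titles and genres.
    mean_rating = toy_ratings.groupby("movieId")["rating"].mean().rename("mean_rating")
    details = toy_movies.join(mean_rating, on="movieId").set_index("title")
    result = details.join(scores, how="inner")
    # Assumed interpretation: keep movies whose mean rating reaches min_rating.
    result = result[result["mean_rating"] >= min_rating]
    return (result.sort_values("similarity", ascending=False)
                  .head(n)[["similarity", "mean_rating", "genres"]])

toy_result = similar_movies_with_ratings("Movie A", n=2)
print(toy_result)
```

Here "Movie B" is the most similar title but is dropped because its mean rating (3.0) falls below the threshold, so "Movie C" and "Movie D" are returned.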

In [None]:

#================================================================================
#STEP 5: ITEM-TO-ITEM RECOMMENDATIONS
#================================================================================

def get_similar_movies(title, n=10, min_rating=3.5):
  #NOTE: min_rating is part of the requested signature but is not applied in this function
  if title not in cosine_sim_data.columns:
    print(f"Movie '{title}' not found!")
    return None

  #Drop the query movie by label, then keep the n highest similarities
  sim_scores = cosine_sim_data[title].drop(title).sort_values(ascending=False)
  return sim_scores.head(n)

print("Step 5: Testing item-to-item recommendations...")
print("Movies similar to 'Grumpier Old Men (1995)':")
similar = get_similar_movies('Grumpier Old Men (1995)', 20)
if similar is not None:
  for title, score in similar.items():
    genre = movies_w_plot[movies_w_plot['title'] == title]['genres'].values[0]
    print(f" {score:.3f} - {title} ({genre})")

Step 5: Testing item-to-item recommendations...
Movies similar to 'Grumpier Old Men (1995)':
 0.907 - Grumpy Old Men (1993) (Comedy)
 0.785 - Rushmore (1998) (Comedy|Drama)
 0.778 - Kazaam (1996) (Children|Comedy|Fantasy)
 0.772 - Collateral (2004) (Action|Crime|Drama|Thriller)
 0.766 - Mad Max Beyond Thunderdome (1985) (Action|Adventure|Sci-Fi)
 0.744 - Road Warrior, The (Mad Max 2) (1981) (Action|Adventure|Sci-Fi)
 0.738 - Pi (1998) (Drama|Sci-Fi|Thriller)
 0.733 - Liar Liar (1997) (Comedy)
 0.729 - Vampire in Brooklyn (1995) (Comedy|Horror|Romance)
 0.699 - Get Smart (2008) (Action|Comedy)
 0.693 - Hocus Pocus (1993) (Children|Comedy|Fantasy|Horror)
 0.649 - Jackie Brown (1997) (Crime|Drama|Thriller)
 0.640 - Once Upon a Time in America (1984) (Crime|Drama)
 0.640 - Cabaret (1972) (Drama|Musical)
 0.629 - Cape Fear (1962) (Crime|Drama|Thriller)
 0.624 - Cape Fear (1991) (Thriller)
 0.597 - Strange Days (1995) (Action|Crime|Drama|Mystery|Sci-Fi|Thriller)
 0.579 - Across the Universe 

In [None]:
#================================================================================
#STEP 6: BUILD USER PROFILE
#================================================================================

def build_user_profile(user_id, min_rating=3.5):

  #Get user's high ratings
  user_ratings = ratings[
    (ratings['userId'] == user_id) &
    (ratings['rating'] >= min_rating)
  ]

  if len(user_ratings) == 0:
    print(f"User {user_id} has no ratings >= {min_rating}")
    return None

  #Get positional row indices of liked movies (rows of tfidf_matrix align with rows of movies_w_plot)
  liked_movie_ids = user_ratings['movieId'].tolist()
  liked_indices = np.flatnonzero(movies_w_plot['movieId'].isin(liked_movie_ids).to_numpy())

  if len(liked_indices) == 0:
    return None

  #Average TF-IDF vectors (the sparse mean returns a 1 x n_features matrix)
  user_profile = tfidf_matrix[liked_indices].mean(axis=0)

  #Convert to dense array and flatten
  user_profile = np.asarray(user_profile).flatten()

  return user_profile

print("Step 6: Building user profile...")
user_id = 359
user_profile = build_user_profile(user_id=user_id)

if user_profile is not None:
  print(f"User {user_id} profile created (vector length: {len(user_profile)})")

  #Show which plot terms the user likes (top mean TF-IDF weights)
  feature_names = tfidf.get_feature_names_out()
  term_scores = dict(zip(feature_names, user_profile))
  top_terms = sorted(term_scores.items(), key=lambda x: x[1], reverse=True)[:10]
  print("User's top term preferences (TF-IDF scores):")
  for term, score in top_terms:
    print(f"  {term}: {score:.3f}")

Step 6: Building user profile...
User 359 profile created (vector length: 500)
User's top term preferences (TF-IDF scores):
  tells: 0.052
  man: 0.042
  men: 0.041
  father: 0.036
  home: 0.036
  car: 0.034
  time: 0.033
  finds: 0.032
  says: 0.032
  police: 0.032
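
The profile above is the element-wise mean of the TF-IDF vectors of the movies the user rated at or above min_rating. A minimal numeric sketch, assuming a hypothetical 4 x 3 matrix in place of tfidf_matrix, shows how that mean is formed and how it is then compared with every movie by cosine similarity:

```python
import numpy as np

# Hypothetical TF-IDF rows for four movies over three terms (not real data).
toy_tfidf = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
liked_rows = [0, 1]  # positions of movies the user rated >= min_rating

# User profile = element-wise mean of the liked movies' vectors.
toy_profile = toy_tfidf[liked_rows].mean(axis=0)

# Cosine similarity between the profile and every movie.
norms = np.linalg.norm(toy_tfidf, axis=1) * np.linalg.norm(toy_profile)
toy_scores = toy_tfidf @ toy_profile / norms

print(toy_profile)          # [0.5 0.5 0. ]
print(toy_scores.round(3))
```

The third movie, which mixes both liked terms, scores highest, while the fourth shares no terms with the profile and scores zero.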


In [None]:
#================================================================================
#STEP 7: USER RECOMMENDATIONS
#================================================================================

def recommend_for_user(user_id, n=10, min_rating=3.5):

  #Build user profile
  user_profile = build_user_profile(user_id, min_rating)

  if user_profile is None:
    return None

  #Calculate similarity between user profile and all movies
  #Need to handle sparse matrix properly
  similarities = cosine_similarity(
      user_profile.reshape(1, -1),
      tfidf_matrix.toarray() #Convert to dense for calculation
  ).flatten()

  #Get movies user hasn't rated
  rated_movie_ids = ratings[ratings['userId'] == user_id]['movieId'].tolist()
  unrated_mask = ~movies_w_plot['movieId'].isin(rated_movie_ids)

  #Get top N unrated movies
  movie_scores = pd.Series(similarities, index=movies_w_plot.index)
  recommendations = movie_scores[unrated_mask].nlargest(n)

  #Get movie details
  result = movies_w_plot.loc[recommendations.index, ['title', 'genres']].copy()
  result['similarity_score'] = recommendations.values

  #Heuristic: linearly rescale similarity from [0, 1] to a 1-5 star score (not a learned rating prediction)
  result['predicted_rating'] = (recommendations.values * 4 + 1).clip(1,5)

  return result

print("Step 7: Generating user recommendations...")
recommendations = recommend_for_user(user_id=359, n=10)
if recommendations is not None:
  print("\nTop 10 recommendations for User 359:")
  print("="*80)
  for idx, row in recommendations.iterrows():
    print(f"{row['similarity_score']:.3f} | {row['predicted_rating']:.2f}")
    print(f"{row['title']}")
    print(f"Genres: {row['genres']}")
    print()

Step 7: Generating user recommendations...

Top 10 recommendations for User 359:
0.638 | 3.55
Gone with the Wind (1939)
Genres: Drama|Romance|War

0.616 | 3.46
Wolf of Wall Street, The (2013)
Genres: Comedy|Crime|Drama

0.607 | 3.43
Adjustment Bureau, The (2011)
Genres: Romance|Sci-Fi|Thriller

0.606 | 3.43
North by Northwest (1959)
Genres: Action|Adventure|Mystery|Romance|Thriller

0.602 | 3.41
Prince of Egypt, The (1998)
Genres: Animation|Musical

0.600 | 3.40
Casino (1995)
Genres: Crime|Drama

0.595 | 3.38
Sherlock Holmes (2009)
Genres: Action|Crime|Mystery|Thriller

0.582 | 3.33
Pan's Labyrinth (Laberinto del fauno, El) (2006)
Genres: Drama|Fantasy|Thriller

0.580 | 3.32
Hunchback of Notre Dame, The (1996)
Genres: Animation|Children|Drama|Musical|Romance

0.575 | 3.30
Fear and Loathing in Las Vegas (1998)
Genres: Adventure|Comedy|Drama
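
The second number on each line is a linear rescaling of the similarity score rather than a learned rating prediction. The arithmetic can be checked with a short sketch that reuses the first three similarity values printed above (rounded to three decimals, as displayed):

```python
import numpy as np

# Similarity scores copied from the first three recommendations above.
similarity = np.array([0.638, 0.616, 0.607])

# Linear rescaling used in recommend_for_user: 0 -> 1 star, 1 -> 5 stars.
predicted = (similarity * 4 + 1).clip(1, 5)
print(predicted.round(2))  # [3.55 3.46 3.43]
```

Because the dataset's rating scale runs from 0.5 to 5.0 while this mapping bottoms out at 1, the values are best read as a ranking aid and not as calibrated ratings.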



#5. Analyze and Discuss

- Cleaning the text with lowercasing, punctuation removal, tokenization, and stopword removal affects the quality of recommendations in a content-based recommender system. For example, removing stopwords ensures that common, uninformative words such as "the", "and", and "is" do not dominate the TF-IDF vectors, so the system can focus on meaningful keywords in the plot synopses.

- Compared to the previous CB recommender system that used only genres, the current system based on plot synopses can recommend more novel movies. Genres are broad and few in number, so genre-based recommendations often share a genre with the liked movies without matching the user’s specific interests. TF-IDF over plot synopses provides more detailed content features, allowing the system to suggest movies with similar themes or story elements even when they belong to different genres.