Jake Derby
DSC630-T302
07/31/25
Assignment 9.2

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')

In [3]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
tags.head()

,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [6]:
# I feel like the tags df would complicate my recommendation model so I will delete it. 
# The movies and ratings dfs have a common column in movieId so I'll go ahead and merge them together for simplicity. 

del tags

movies = movies.merge(ratings, on='movieId', how='left')
movies.head()

,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982700.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847435000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1106636000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1510578000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1305696000.0
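
The float userId values (1.0, 5.0, ...) in the merged output are most likely a side effect of how='left': movies that nobody rated are kept with NaN in the rating columns, which forces userId to a float dtype. A minimal sketch on made-up toy frames (toy_movies and toy_ratings are illustrative stand-ins, not MovieLens data) shows the difference from an inner merge:

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (values are made up)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.5, 5.0],
})

# Left merge keeps Movie C even though nobody rated it
left = toy_movies.merge(toy_ratings, on="movieId", how="left")
# Inner merge keeps only movies with at least one rating
inner = toy_movies.merge(toy_ratings, on="movieId", how="inner")

print(left["userId"].dtype)          # dtype after the NaN row is introduced
print(inner["userId"].dtype)         # dtype when every row has a rating
print(left["rating"].isna().sum())   # rows that belong to unrated movies
```

Either merge works for the steps that follow, since NaN ratings are ignored by mean(), count() and pivot_table().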


In [7]:
# Now let's get the average rating for each movie in the merged dataset

average_ratings = pd.DataFrame(movies.groupby('title')['rating'].mean()) # creating a new df called average_ratings to leave the parent df unaltered
average_ratings.head()

title,rating
'71 (2014),4.0
'Hellboy': The Seeds of Creation (2004),4.0
'Round Midnight (1986),3.5
'Salem's Lot (2004),5.0
'Til There Was You (1997),4.0


In [8]:
# Let's also add the count of reviews to put those average ratings into context

average_ratings['num_ratings'] = movies.groupby('title')['rating'].count() # adding the total number of ratings per movie as a new column in this df (the Series aligns on the title index)
average_ratings.head()

title,rating,num_ratings
'71 (2014),4.0,1
'Hellboy': The Seeds of Creation (2004),4.0,1
'Round Midnight (1986),3.5,2
'Salem's Lot (2004),5.0,1
'Til There Was You (1997),4.0,2
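
The mean and the count can also be computed in a single groupby pass using named aggregation. A small sketch on toy data (titles and ratings invented for illustration), which also shows why a perfect average backed by a single rating deserves caution:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B"],
    "rating": [4.0, 3.0, 5.0, 5.0],
})

# Named aggregation: one pass over the groups, two output columns
summary = toy.groupby("title")["rating"].agg(rating="mean", num_ratings="count")
print(summary)
```

Movie B has the higher average here, but it rests on one rating, while Movie A's average is backed by three.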


In [9]:
# To build my recommender, I'll need to pivot the data such that each userId is a row and all the movies in the dataset are the columns
# It will be mostly NaN-filled but that's alright

movies_by_user = movies.pivot_table(index='userId', columns='title', values='rating')
movies_by_user.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1.0,,,,,,,,,,,...,,,,,,,,,4.0,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,
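
To make the shape of this matrix concrete, here is a minimal sketch of the same pivot on toy data (user IDs, titles and ratings are made up). It also shows that pivot_table averages duplicate (user, title) pairs by default, and computes the fraction of empty cells:

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie B"],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Rows become users, columns become titles; unrated pairs are NaN.
# Duplicate (user, title) pairs are averaged by the default aggfunc ("mean").
wide = toy.pivot_table(index="userId", columns="title", values="rating")
print(wide)

# Fraction of empty cells in the user-by-title matrix
sparsity = wide.isna().to_numpy().mean()
print(f"sparsity: {sparsity:.2f}")
```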


In [17]:
# I will now move on to getting the user's input, which we will then correlate pairwise with all other titles to find the most similar movies to recommend.
# Instead of building a master df that holds the pairwise correlations of every movie with every other movie, we will just compute them once the user provides a movie,
# to save on memory and processing power. We'll use Python's difflib library for fuzzy matching when we need it

import difflib
from IPython.display import clear_output

# List all movie titles in the df
all_titles = movies_by_user.columns.tolist()

# This function will prompt the user for a movie title, handle typos or partial matches and then return a valid title that is actually in our dataset
def prompt_movie_title(titles):
    while True:
        clear_output(wait=True) # this will clear the previous iteration's output before each prompt (fixes a UI issue I was having from running a loop in a Jupyter notebook)
        user_input = input("Enter a movie you like: ").strip()
        
        # First, check if they made it easy for us and see if there is an exact match
        if user_input in titles:
            print(f"Using exact match: {user_input}")
            return user_input
        
        # Maybe the movie exists in our dataset but we need to run a case-insensitive match
        lower_map = {t.lower(): t for t in titles}
        if user_input.lower() in lower_map:
            corrected = lower_map[user_input.lower()]
            print(f"Matched by case: {corrected}")
            return corrected
        
        # If those don't work, let's fuzzy match some suggestions
        suggestions = difflib.get_close_matches(user_input, titles, n=5, cutoff=0.4) # give the user up to 5 suggested titles with a similarity cutoff of 0.4
        if suggestions:
            print("\nMovie not found. Did you mean: ")
            for idx, title in enumerate(suggestions, 1):
                print(f"  {idx}. {title}")
            
            choice = input("Enter the number of the correct movie (or press Enter to try again): ").strip()
            if choice.isdigit(): # If they select one of the suggestions, 
                i = int(choice)  # i = their choice selection
                if 1 <= i <= len(suggestions):
                    picked = suggestions[i-1]
                    print(f"Using suggestion: {picked}")
                    return picked
            
            print("– Okay, let's try again.\n") # If they don't select a suggestion, send them back to the beginning of the loop to try it all over again
            continue
        
        # If there are no matches at all
        print(f"No close matches for '{user_input}'. Please try again.\n")
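
The matching logic can be separated from the input() loop so it can be checked without a Jupyter prompt. The sketch below is one possible arrangement, not the notebook's own function: resolve_title and its return shape are hypothetical, and it lowercases both sides before fuzzy matching (an assumption, since difflib compares characters case-sensitively).

```python
import difflib

def resolve_title(query, titles, n=5, cutoff=0.4):
    """Return (match, suggestions); match is None when only suggestions exist."""
    query = query.strip()
    if query in titles:                       # exact match
        return query, []
    lower_map = {t.lower(): t for t in titles}
    if query.lower() in lower_map:            # case-insensitive match
        return lower_map[query.lower()], []
    # Fuzzy match on lowercased titles, then map back to the original casing
    close = difflib.get_close_matches(query.lower(), list(lower_map), n=n, cutoff=cutoff)
    return None, [lower_map[c] for c in close]

toy_titles = ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"]
exact = resolve_title("Heat (1995)", toy_titles)
by_case = resolve_title("toy story (1995)", toy_titles)
fuzzy = resolve_title("Toy Stroy (1995)", toy_titles)
print(exact, by_case, fuzzy, sep="\n")
```

With this split, the interactive loop only has to print the suggestions and read the user's choice.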

In [25]:
# Getting their movie selection 

selected_movie = prompt_movie_title(all_titles)
print(f"\nYou chose: {selected_movie}")


Movie not found. Did you mean: 
  1. Matrix, The (1999)
  2. Paterson
  3. Martin (1977)
  4. Marnie (1964)
  5. Masterminds (2016)
Using suggestion: Matrix, The (1999)

You chose: Matrix, The (1999)


In [26]:
# Now let's use pandas' corrwith function to get a pairwise correlation between this movie selection and all other movies in the dataset. We will use this correlation df
# to sort the top picks to choose from and provide them to the user

import warnings
warnings.filterwarnings('ignore') # Silencing warning messages since there will likely be many NaN values

# mask of users who rated the seed movie
mask = movies_by_user[selected_movie].notnull()

# co_counts[m] = number of users who rated both selected_movie and m
co_counts = movies_by_user[mask] \
                .notnull() \
                .sum(axis=0)

correlations = movies_by_user.corrwith(movies_by_user[selected_movie])
correlations.head()

title
'71 (2014)                                NaN
'Hellboy': The Seeds of Creation (2004)   NaN
'Round Midnight (1986)                    NaN
'Salem's Lot (2004)                       NaN
'Til There Was You (1997)                 NaN
dtype: float64
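
The NaN values above are expected for titles that share too few raters with the selected movie. A small sketch on a hand-made matrix (the ratings are invented, and "Seed" stands in for selected_movie) shows how corrwith and the co-rating count behave, including a perfect correlation that comes from only two shared raters:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Seed":    [5.0, 4.0, 1.0, 2.0, np.nan],
        "Movie B": [4.0, 5.0, 2.0, 1.0, 3.0],
        "Movie C": [np.nan, np.nan, 1.0, 2.0, 5.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

# Users who rated the seed movie
rated_seed = toy["Seed"].notna()
# For each title, how many of those users also rated it
co_counts = toy[rated_seed].notna().sum(axis=0)
# Pearson correlation of every column with the seed, using only shared raters
correlations = toy.corrwith(toy["Seed"])

print(pd.DataFrame({"Correlation": correlations, "CoRatings": co_counts}))
```

Movie C looks like a perfect match, but only two users back that score, which is why the co-rating count is carried along with the correlation.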

In [27]:
# Let's now drop the selected movie from the dataset so we don't get it in our correlations

if selected_movie in correlations.index:
    correlations = correlations.drop(index=selected_movie)

In [28]:
# Now let's combine the correlations, the co-rating counts and the total number of ratings into one table (we will need the counts to make sure we recommend only movies
# that have a substantial number of reviews). The NaNs are removed in the next cell

recommendation = pd.DataFrame({
    'Correlation': correlations,                        # adding the correlations
    'CoRatings'   : co_counts,                          # adding the co-counts which we will filter by
    'TotalRatings': average_ratings['num_ratings']      # lastly adding the total number of ratings, so they can see how popular the movie is in general
})

In [29]:
# Let's clean it up a little bit

recommendation = recommendation.drop(index=selected_movie, errors='ignore').dropna() # never recommend the selected movie itself, and drop movies with no computable correlation

In [30]:
# Let's finalize our recommendation list which we will return to the user. We'll make sure each recommended movie was rated by at least 20 users who also rated the selected movie

recommendation = recommendation[(recommendation['CoRatings'] >= 20)] # only keep if ≥ 20 users co-rated
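
A hard cutoff discards every title below the threshold and treats everything above it equally. A possible alternative, not used in this notebook's code, is to shrink each correlation toward zero when few users back it. The sketch below uses the damping form corr * n / (n + k); the helper name shrink_correlation and the constant k = 10 are assumptions chosen for illustration, and the toy values are made up.

```python
import pandas as pd

def shrink_correlation(corr, n_co, k=10):
    """Damp correlations backed by few co-ratings (k is a hypothetical tuning constant)."""
    return corr * n_co / (n_co + k)

toy = pd.DataFrame(
    {"Correlation": [1.00, 0.80, 0.65], "CoRatings": [2, 40, 120]},
    index=["Movie X", "Movie Y", "Movie Z"],
)
toy["Shrunk"] = shrink_correlation(toy["Correlation"], toy["CoRatings"])
print(toy.sort_values("Shrunk", ascending=False))
```

Movie X's perfect correlation rests on two shared raters, so it falls to the bottom after shrinking, while well-supported titles keep most of their score.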

In [31]:
# As a final touch, let's go ahead and merge in the rest of the movies dataset that has the genres so we can estimate how well the recommender is working

top_recs = recommendation \
    .sort_values('Correlation', ascending=False) \
    .head(10) \
    .reset_index() \
    .rename(columns={'index':'title'}) # making our top_recs df based on recommendation, sorted to get only the top 10

genre_dict = dict(zip(movies['title'], movies['genres']))       # couldn't merge the movies df properly, so I'm just going to make a title:genre dictionary to map from
top_recs['genres'] = top_recs['title'].map(genre_dict)          # mapping the genres to the top_recs df in a new column

top_recs.head(10) # displaying the 10 movie recommendations, along with the actual correlations, movie titles, genres

,title,Correlation,CoRatings,TotalRatings,genres
0,"Cabin in the Woods, The (2012)",0.795494,20.0,22,Comedy|Horror|Sci-Fi|Thriller
1,Zootopia (2016),0.76061,23.0,32,Action|Adventure|Animation|Children|Comedy
2,Life of Pi (2012),0.710196,26.0,31,Adventure|Drama|IMAX
3,1408 (2007),0.702785,20.0,25,Drama|Horror|Thriller
4,X-Men: Days of Future Past (2014),0.695681,26.0,30,Action|Adventure|Sci-Fi
5,Tarzan (1999),0.680988,20.0,24,Adventure|Animation|Children|Drama
6,Scent of a Woman (1992),0.679637,21.0,27,Drama
7,Tommy Boy (1995),0.674887,25.0,50,Comedy
8,Iron Man 3 (2013),0.657786,27.0,32,Action|Sci-Fi|Thriller|IMAX
9,Gravity (2013),0.648136,26.0,32,Action|Sci-Fi|IMAX
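
The title-to-genre dictionary works. An equivalent approach is to reduce the merged frame to one row per title before merging, which may be what the direct merge was missing, since movies holds one row per rating at this point. A sketch on toy frames (made-up titles; toy_movies and toy_recs are stand-ins for movies and top_recs):

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B"],
    "genres": ["Comedy", "Comedy", "Drama|Thriller"],
    "rating": [4.0, 3.0, 5.0],
})
toy_recs = pd.DataFrame({"title": ["Movie B", "Movie A"], "Correlation": [0.9, 0.7]})

# One row per title, so the merge cannot multiply the recommendation rows
genre_lookup = toy_movies[["title", "genres"]].drop_duplicates("title")
with_genres = toy_recs.merge(genre_lookup, on="title", how="left")
print(with_genres)
```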


I built a lightweight movie recommendation system by combining item-based collaborative filtering with a safeguard against data sparsity. First, I loaded and merged the MovieLens ratings and movie metadata files, then computed each film’s average rating and total rating count to capture its overall popularity. Next, I pivoted the merged dataset into a user-item matrix (users as rows, movie titles as columns, and ratings as values) so that any two films could be compared directly through the users who rated both.

Because the user-input code block behaved poorly in the Jupyter interface, I implemented a custom prompt function that accepts free-form text, tries exact and then case-insensitive matching, and falls back to Python’s difflib.get_close_matches to recover from typos. I also added clear_output() calls so that each retry cleanly redraws the suggestions in Jupyter.

When a user selects a movie, the system computes Pearson correlations between that movie’s rating vector and all others, and counts how many users rated each pair in common. By filtering out any pair with fewer than 20 shared raters, we avoid the spurious, perfect matches that arise from only two or three overlapping votes (this issue would likely be less pronounced with a larger dataset). Finally, the remaining titles are sorted by correlation, and the top ten are annotated with genre information and presented to the user.
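
For reuse outside the notebook, the steps in this paragraph could be wrapped in a single function. The sketch below is one possible arrangement under stated assumptions: the name recommend_similar and its parameters are hypothetical, and unlike the notebook it applies the co-rating filter before computing correlations. The toy matrix is made up.

```python
import numpy as np
import pandas as pd

def recommend_similar(ratings_wide, seed, min_co=20, top_n=10):
    """Rank titles by Pearson correlation with `seed`, keeping those with >= min_co shared raters."""
    seed_ratings = ratings_wide[seed]
    rated_seed = seed_ratings.notna()
    # Number of users who rated both the seed and each other title
    co_counts = ratings_wide[rated_seed].notna().sum(axis=0)
    # Only correlate titles that pass the co-rating filter, never the seed itself
    candidates = co_counts[co_counts >= min_co].index.drop(seed, errors="ignore")
    corr = ratings_wide[candidates].corrwith(seed_ratings)
    out = pd.DataFrame({"Correlation": corr, "CoRatings": co_counts[candidates]})
    return out.dropna().sort_values("Correlation", ascending=False).head(top_n)

# Toy user-by-title matrix: 30 users, ratings follow a repeating pattern
n = 30
a = np.tile([1.0, 2.0, 3.0, 4.0, 5.0], n // 5)
toy = pd.DataFrame({
    "Movie A": a,
    "Movie B": 6.0 - a,                                      # opposite taste
    "Movie C": np.tile([2.0, 2.0, 3.0, 4.0, 4.0], n // 5),   # similar taste
    "Movie D": np.where(np.arange(n) < 3, a, np.nan),        # only 3 shared raters
})
result = recommend_similar(toy, "Movie A", min_co=5)
print(result)
```

Movie D matches the seed perfectly on its three shared raters but is excluded by the filter, which is the behavior the paragraph above describes.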

References:

Nair, A. (2019, September 25). How to build your first recommender system using Python & MovieLens dataset. Analytics India Magazine. https://analyticsindiamag.com/deep-tech/how-to-build-your-first-recommender-system-using-python-movielens-dataset/

GroupLens Research Group. (n.d.). MovieLens latest small dataset README. Retrieved July 31, 2025, from https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html 