# 9.2 Exercise: Recommender System

## DSC 630: Predictive Analytics

## August 4th, 2024

## Kenn Wade

# Introduction

## Overview

The objective of this assignment is to build a recommender system using the small MovieLens dataset. Given the title of a movie in the dataset that the user likes, the system returns ten other movies to watch. The analysis covers loading and preprocessing the data, computing movie-to-movie similarity, and generating recommendations.

## Dataset Description

The small MovieLens dataset includes:
1. `movies.csv`: Contains movie IDs, titles, and genres.
2. `ratings.csv`: Contains user IDs, movie IDs, ratings, and timestamps.

## Assignment Instructions

For this assignment, I will:
1. **Data Loading**: Load the MovieLens dataset (movies.csv and ratings.csv).
2. **Data Preprocessing**: Merge the datasets and create a user-item matrix.
3. **Similarity Calculation**: Compute the cosine similarity matrix for the movies.
4. **Recommender System**: Create a function to recommend ten movies based on a given movie title.
5. **Documentation**: Document all steps, processes, and analyses thoroughly.

## Questions to Explore

I will explore the following questions:
1. How can I preprocess the MovieLens data to create a user-item matrix?
2. How can I use cosine similarity to find similar movies?
3. What are the ten most similar movies to a given movie in the dataset?

This analysis will provide insights into the creation and functionality of recommender systems using item-based collaborative filtering.


In [5]:
# Step 1: Data Loading

# Import necessary libraries
import pandas as pd

# Load the MovieLens dataset
movies = pd.read_csv('/Users/kennwade/Downloads/ml-latest-small/movies.csv')
ratings = pd.read_csv('/Users/kennwade/Downloads/ml-latest-small/ratings.csv')

# Display the first few rows of each dataset
print("Movies Data:")
print(movies.head())

print("\nRatings Data:")
print(ratings.head())


Movies Data:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

Ratings Data:
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [6]:
# Step 2: Data Preprocessing

# Merge the movies and ratings dataframes
movie_ratings = pd.merge(ratings, movies, on='movieId')

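# Create a user-item matrix (rows: users, columns: movie titles).
# pivot_table averages the ratings if a user has more than one rating for the same title.
user_movie_matrix = movie_ratings.pivot_table(index='userId', columns='title', values='rating')

# Fill NaN values with 0; here 0 means "not rated", not a rating of zero
user_movie_matrix = user_movie_matrix.fillna(0)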

# Display the user-item matrix
print("User-Item Matrix:")
print(user_movie_matrix.head())


User-Item Matrix:
title   '71 (2014)  'Hellboy': The Seeds of Creation (2004)  \
userId                                                        
1              0.0                                      0.0   
2              0.0                                      0.0   
3              0.0                                      0.0   
4              0.0                                      0.0   
5              0.0                                      0.0   

title   'Round Midnight (1986)  'Salem's Lot (2004)  \
userId                                                
1                          0.0                  0.0   
2                          0.0                  0.0   
3                          0.0                  0.0   
4                          0.0                  0.0   
5                          0.0                  0.0   

title   'Til There Was You (1997)  'Tis the Season for Love (2015)  \
userId                                                               
1             
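
Because unrated movies are filled with 0, most cells in the user-item matrix are zeros. The following is a minimal sketch, on a small made-up ratings table rather than the real files, of how the share of rated cells could be measured before filling. The `toy` frame and the `density` variable are illustrative assumptions and are not part of the assignment code.

```python
import pandas as pd

# Toy stand-in for the merged ratings data (values are made up for illustration)
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title":  ["A (2000)", "B (2001)", "A (2000)", "B (2001)", "C (2002)"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Same pivot as in Step 2, before filling: unrated cells are NaN
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")

# Share of user-movie cells that hold an actual rating
density = toy_matrix.notna().to_numpy().mean()
print(f"Rated cells: {density:.1%}")

# After filling, 0 stands for "not rated", not for a rating of zero
toy_filled = toy_matrix.fillna(0)
print(toy_filled)
```

Running the same `notna()` check on the real matrix before `fillna(0)` would show how sparse it is, which helps when interpreting the similarity scores later.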

In [9]:
# Step 3: Similarity Calculation

from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
movie_similarity = cosine_similarity(user_movie_matrix.T)

# Convert the similarity matrix into a DataFrame
movie_similarity_df = pd.DataFrame(movie_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)

# Display a subset of the similarity matrix to avoid scrolling
print("Movie Similarity Matrix (subset):")
print(movie_similarity_df.iloc[:5, :5])


Movie Similarity Matrix (subset):
title                                    '71 (2014)  \
title                                                 
'71 (2014)                                      1.0   
'Hellboy': The Seeds of Creation (2004)         0.0   
'Round Midnight (1986)                          0.0   
'Salem's Lot (2004)                             0.0   
'Til There Was You (1997)                       0.0   

title                                    'Hellboy': The Seeds of Creation (2004)  \
title                                                                              
'71 (2014)                                                              0.000000   
'Hellboy': The Seeds of Creation (2004)                                 1.000000   
'Round Midnight (1986)                                                  0.707107   
'Salem's Lot (2004)                                                     0.000000   
'Til There Was You (1997)                                               0.00
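
For two rating vectors a and b, cosine similarity is (a · b) / (‖a‖ ‖b‖). The sketch below works this out with NumPy on two hypothetical movie columns. The vectors are made up; they only show one way a score such as 0.707107 (that is, 1/√2) can arise when one movie has a single rating and the other shares that rater plus one more. The actual ratings behind the value in the output above are not shown in this notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical rating columns for two movies across two users (0 = not rated)
movie_x = [4.0, 0.0]   # rated by user 1 only
movie_y = [4.0, 4.0]   # rated by users 1 and 2

sim = cosine(movie_x, movie_y)
print(round(sim, 6))  # 0.707107
```

With non-negative ratings the score ranges from 0 (no user rated both movies) to 1 (the two columns are proportional).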

In [8]:
# Step 4: Recommender System

def recommend_movies(movie_title, similarity_matrix, n_recommendations=10):
    """
    Recommend movies similar to the given movie_title.
    
    Parameters:
    movie_title (str): The title of the movie.
    similarity_matrix (DataFrame): The movie similarity matrix.
    n_recommendations (int): The number of movie recommendations to return.
    
    Returns:
    Series: The recommended movie titles (index) with their similarity scores.
    """
    # Get the similarity scores for the input movie
    similarity_scores = similarity_matrix[movie_title]
    
    # Sort the movies by similarity score
    similar_movies = similarity_scores.sort_values(ascending=False)
    
    # Exclude the input movie by label (it is not guaranteed to sort first
    # when scores tie) and keep the top n recommendations
    recommended_movies = similar_movies.drop(movie_title).iloc[:n_recommendations]
    
    return recommended_movies

# Test the function
movie_title = 'Toy Story (1995)'  # Example movie
recommendations = recommend_movies(movie_title, movie_similarity_df)

print(f"Movies similar to '{movie_title}':")
print(recommendations)


Movies similar to 'Toy Story (1995)':
title
Toy Story 2 (1999)                                   0.572601
Jurassic Park (1993)                                 0.565637
Independence Day (a.k.a. ID4) (1996)                 0.564262
Star Wars: Episode IV - A New Hope (1977)            0.557388
Forrest Gump (1994)                                  0.547096
Lion King, The (1994)                                0.541145
Star Wars: Episode VI - Return of the Jedi (1983)    0.541089
Mission: Impossible (1996)                           0.538913
Groundhog Day (1993)                                 0.534169
Back to the Future (1985)                            0.530381
Name: Toy Story (1995), dtype: float64
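
`recommend_movies` needs the exact title as it appears in the dataset, including the year, and raises a `KeyError` otherwise. A small lookup step can help a user find that exact string. The helper below is a hypothetical addition (the name `find_titles` and the sample list are assumptions for illustration), written in plain Python.

```python
def find_titles(query, titles):
    """Return the titles that contain `query`, ignoring case (hypothetical helper)."""
    query = query.lower()
    return [t for t in titles if query in t.lower()]

# Stand-in for movie_similarity_df.columns
example_titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Jumanji (1995)"]

matches = find_titles("toy story", example_titles)
print(matches)  # ['Toy Story (1995)', 'Toy Story 2 (1999)']
```

In the notebook, `movie_similarity_df.columns` could be passed as `titles`, and one of the returned strings used as the `movie_title` argument.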


## Conclusion and Insights

### Recommender System Process

1. **Data Loading:**
   - I loaded the `movies.csv` and `ratings.csv` files from the small MovieLens dataset using pandas.

2. **Data Preprocessing:**
   - Merged the two datasets on `movieId` to create a single dataframe containing user ratings and movie titles.
   - Created a user-item matrix by pivoting the merged dataframe, with users as rows and movies as columns, and filling missing values with 0.
   - Each column of this matrix is one movie's rating vector, which is the input for calculating similarities between movies.

3. **Similarity Calculation:**
   - Computed the cosine similarity matrix for the movies using the `cosine_similarity` function from the `sklearn.metrics.pairwise` module.
   - Converted the similarity matrix into a DataFrame for easier manipulation and interpretation.
   - Displayed a subset of the similarity matrix for clarity.

4. **Recommender System Function:**
   - Developed a function `recommend_movies` that takes a movie title and the similarity matrix as inputs.
   - The function sorts the movies by similarity score and returns the top ten recommendations, excluding the input movie itself.

### Results and Analysis

#### Example Output:
- For the movie "Toy Story (1995)," the recommender system provided the following ten recommendations:
  1. Toy Story 2 (1999)
  2. Jurassic Park (1993)
  3. Independence Day (a.k.a. ID4) (1996)
  4. Star Wars: Episode IV - A New Hope (1977)
  5. Forrest Gump (1994)
  6. Lion King, The (1994)
  7. Star Wars: Episode VI - Return of the Jedi (1983)
  8. Mission: Impossible (1996)
  9. Groundhog Day (1993)
  10. Back to the Future (1985)

### Questions Explored

1. **How can I preprocess the MovieLens data to create a user-item matrix?**
   - The data was preprocessed by merging the `ratings` and `movies` datasets on `movieId` and pivoting the result into a user-item matrix with the pandas `pivot_table` function. Missing values were filled with 0 because `cosine_similarity` does not accept NaN values, so a 0 in the matrix means "not rated".

2. **How can I use cosine similarity to find similar movies?**
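   - Cosine similarity was calculated with the `cosine_similarity` function from the `sklearn.metrics.pairwise` module, applied to the transposed user-item matrix so that each row is one movie's rating vector. Two movies score close to 1 when the same users rated them in similar proportion, and 0 when no user rated both.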

3. **What are the ten most similar movies to a given movie in the dataset?**
   - The `recommend_movies` function sorts the similarity scores and returns the top ten movies that are most similar to the given movie, excluding the movie itself.

### Key Insights

- **Data Preprocessing:** Proper data preprocessing, including merging datasets and creating a user-item matrix, is crucial for building an effective recommender system.
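- **Cosine Similarity:** Cosine similarity on rating vectors is a simple way to identify movies that the same users tend to rate, but movies with very few ratings can receive high scores from a single shared rater.
- **Functionality:** For "Toy Story (1995)", the function ranks the sequel first and fills the rest of the list with well-known films released between 1977 and 1996, which looks plausible. The recommendations were not evaluated quantitatively, so this demonstrates how item-based collaborative filtering works rather than how accurate it is.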
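
Movies rated by only a handful of users can produce high similarity scores from one or two shared ratings, as the 0.707107 between two little-rated titles in the Step 3 output suggests. One possible mitigation, sketched here under assumptions, is to keep only movies with at least `min_ratings` ratings before computing similarity. The function name, the threshold, and the toy matrix are illustrative and were not part of the analysis above.

```python
import pandas as pd

def filter_by_rating_count(user_item, min_ratings):
    """Keep only movie columns with at least `min_ratings` nonzero entries.

    Assumes the zero-filled layout used in Step 2, where 0 means "not rated".
    """
    counts = (user_item != 0).sum(axis=0)
    return user_item.loc[:, counts >= min_ratings]

# Toy zero-filled user-item matrix (values are made up)
toy = pd.DataFrame(
    {"Popular (1999)": [5.0, 4.0, 3.0], "Obscure (1986)": [0.0, 4.0, 0.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

filtered = filter_by_rating_count(toy, min_ratings=2)
print(list(filtered.columns))  # ['Popular (1999)']
```

A suitable threshold depends on the dataset; a higher value removes more noise but also removes more movies from the pool of possible recommendations.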

### Next Steps

1. **Further Improvement:** Explore additional methods to enhance recommendation accuracy, such as using more advanced collaborative filtering techniques or incorporating content-based filtering.
2. **Deployment:** Consider deploying the recommender system as a web application for easier user interaction and accessibility.
3. **User Feedback:** Collect feedback from users to improve the recommendations and refine the system accordingly.
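
As one example of the further improvement mentioned in the first item, a common refinement of item-based collaborative filtering subtracts each user's mean rating before computing similarity, so that generous and harsh raters become comparable. The sketch below is a simplified variant on made-up data: it fills missing ratings with 0 after centering instead of restricting each pair of movies to the users who rated both. All names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy ratings with NaN for "not rated" (values are made up)
ratings_wide = pd.DataFrame(
    {"A": [5.0, 2.0, 4.0], "B": [4.0, 1.0, np.nan], "C": [1.0, 5.0, 2.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# Subtract each user's mean rating, then treat missing ratings as 0 (neutral)
centered = ratings_wide.sub(ratings_wide.mean(axis=1), axis=0).fillna(0)

# Cosine similarity between the movie columns of the centered matrix
m = centered.to_numpy()
norms = np.linalg.norm(m, axis=0)
adjusted = pd.DataFrame(
    (m.T @ m) / np.outer(norms, norms),
    index=centered.columns,
    columns=centered.columns,
)
print(adjusted.round(3))
```

Unlike the zero-filled version, centered scores can be negative: in this toy data, movies A and C are rated in opposite directions by the same users and receive a negative similarity.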

This analysis showed how a recommender system can be built with item-based collaborative filtering. For the example tested, the system returned plausible recommendations, illustrating how rating data can drive personalized content suggestions.

### References

- MovieLens dataset: [GroupLens](https://grouplens.org/datasets/movielens/)
- Scikit-learn documentation: [Scikit-learn](https://scikit-learn.org/stable/)
