# Project 4: Movie Recommender System

*Fall 2023 | STAT-542 / CS-598*

**Team Members**
| Net ID| Name | Program |
| --- | --- | --- |
| wesleye2 | Wesley Ecoiffier | MCS |
| robbiel2 | Robbie Li | MCS-DS |
| baolong3 | Baolong Truong | MCS-DS |


**Note**: All data files (e.g., `.csv`, `.tsv`) are stored in the `data` folder.

In [1]:
# Define imports and set options
import pandas as pd
import numpy as np
import requests
import warnings
from sklearn.metrics.pairwise import cosine_similarity

pd.set_option("display.float_format", "{:.7f}".format)
warnings.filterwarnings("ignore")

## Preprocessing
1. Fetch required data
2. Clean and format movie data
3. Merge movie and ratings data

In [2]:
# Get movies list and create movies DataFrame

ratings = pd.read_csv('data/ratings.csv')
movies_list_url = "https://liangfgithub.github.io/MovieData/movies.dat?raw=true"

# Fetch the data from the URL
movies_list = requests.get(movies_list_url)

# Split the data into lines and then split each line using "::"
movie_lines = movies_list.text.split('\n')
movie_data = [line.split("::") for line in movie_lines if line]

# Create a DataFrame from the movie data
movies = pd.DataFrame(movie_data, columns=['movie_id', 'title', 'genres'])
movies['movie_id'] = movies['movie_id'].astype(int)

In [3]:
# Count the ratings each movie received and merge the counts with the movies list
# (each column of `ratings` is one movie, labeled "m<movie_id>")
ratings_count = ratings.count(axis=0)
ratings_count = pd.DataFrame({'count': ratings_count})
# Strip the leading "m" from the column label to recover the numeric movie ID
ratings_count['movie_id'] = ratings_count.index
ratings_count['movie_id'] = ratings_count['movie_id'].apply(lambda x: int(x[1:]))
ratings_count = ratings_count.reset_index()
merged = pd.merge(movies, ratings_count, on='movie_id', how='inner')
merged = merged.drop(columns='index')

# Output the merged dataset to a CSV file, excluding index column
merged.to_csv('data/movies_with_ratings_count.csv', index=False)
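
The counting and merging step above can be illustrated on a toy matrix. This is a minimal sketch with made-up values; `toy_ratings`, `toy_movies`, and `toy_merged` are hypothetical names that do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings matrix (hypothetical values): rows are users, columns are movies
toy_ratings = pd.DataFrame(
    {"m1": [5, np.nan, 3], "m2": [np.nan, np.nan, 4], "m10": [1, 2, 3]},
    index=["u1", "u2", "u3"],
)
toy_movies = pd.DataFrame({"movie_id": [1, 2, 10], "title": ["A", "B", "C"]})

# count(axis=0) counts the non-NaN entries in each column, i.e. ratings per movie
counts = toy_ratings.count(axis=0).rename("count").reset_index()

# Strip the leading "m" from the column label to recover the numeric movie ID
counts["movie_id"] = counts["index"].str[1:].astype(int)
counts = counts.drop(columns="index")

# Inner merge keeps only movies present in both tables
toy_merged = toy_movies.merge(counts, on="movie_id", how="inner")
print(toy_merged)
```

Here movie `m2` was rated by a single user, so it ends up with a count of 1.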

In [4]:
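# Create a list of all the genre options

# Split the pipe-separated genre strings into one column per genre
genres = merged["genres"].str.split("|", expand=True)

# Flatten to a single Series (stack drops the None padding) and keep unique values
genres = genres.stack().unique()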

# Sort the list alphabetically
genres = np.sort(genres)

# Output to csv, excluding index column and column headers
genres_df = pd.DataFrame(genres, columns=["Genre"])
genres_df.to_csv("data/genres.csv", index=False, header=False)

## System I: Recommendations by Genre

Get movie recommendations for a specified genre, ranked by number of ratings.

In [5]:
def top_movies_in_genre(df, genre, n = 10):
    # Filter DataFrame for rows with the specified genre
    # (regex=False matches the genre name literally rather than as a pattern)
    genre_df = df[df['genres'].str.contains(genre, regex=False)]

    # Sort DataFrame by count in descending order
    sorted_genre_df = genre_df.sort_values(by='count', ascending=False)

    # Take the top n rows
    top_movies = sorted_genre_df.head(n)

    return top_movies

In [6]:
# Example usage
top_movies_in_genre(merged, "Comedy", 20)

| | movie_id | title | genres | count |
| --- | --- | --- | --- | --- |
| 2651 | 2858 | American Beauty (1999) | Comedy\|Drama | 3428 |
| 1178 | 1270 | Back to the Future (1985) | Comedy\|Sci-Fi | 2583 |
| 1449 | 1580 | Men in Black (1997) | Action\|Adventure\|Comedy\|Sci-Fi | 2538 |
| 2203 | 2396 | Shakespeare in Love (1998) | Comedy\|Romance | 2369 |
| 1107 | 1197 | Princess Bride, The (1987) | Action\|Adventure\|Comedy\|Romance | 2318 |
| 1173 | 1265 | Groundhog Day (1993) | Comedy\|Romance | 2278 |
| 2785 | 2997 | Being John Malkovich (1999) | Comedy | 2241 |
| 346 | 356 | Forrest Gump (1994) | Comedy\|Romance\|War | 2194 |
| 2511 | 2716 | Ghostbusters (1984) | Comedy\|Horror | 2181 |
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 2077 |


## System II: Item-Based Collaborative Filtering (IBCF)

In [7]:
# Normalize the ratings data by centering each row (user) around that user's mean rating
row_means = np.nanmean(ratings, axis=1, keepdims=True)

# Create a matrix where each row mean is repeated along the columns
row_means_matrix = np.tile(row_means, (1, ratings.shape[1]))

# Subtract the row means matrix from the original matrix
R = ratings - row_means_matrix

# Output normalized ratings to CSV
R.to_csv('data/ratings_norm.csv')
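
As a quick check of what row centering does in the presence of missing ratings, here is a minimal sketch on a toy matrix. The values and the names `toy` and `toy_centered` are hypothetical. It uses `DataFrame.sub(..., axis=0)`, an alternative to building the tiled matrix by hand.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values): rows are users, columns are movies
toy = pd.DataFrame(
    {"m1": [5.0, 2.0], "m2": [3.0, np.nan], "m3": [np.nan, 4.0]},
    index=["u1", "u2"],
)

# mean(axis=1) skips NaN by default, so each user's mean uses only rated movies
user_means = toy.mean(axis=1)

# sub(..., axis=0) subtracts one mean per row; missing ratings stay NaN
toy_centered = toy.sub(user_means, axis=0)
print(toy_centered)
```

User `u1` has a mean of 4, so the ratings 5 and 3 become 1 and -1, while the unrated movie remains missing.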

## !!! IMPORTANT !!!

R is much faster than Python at computing the cosine similarity matrix. We therefore create the similarity matrix in R, then resume the rest of this system's data processing in Python.

The `similarity.Rmd` script expects `data/ratings_norm.csv` to exist (it is written by the cell above) and outputs `data/similarity.csv`, which is used in the subsequent steps.

**The following is R code from `similarity.Rmd`; it will not run inside a Python Jupyter notebook.**

*similarity.Rmd*

```
---
title: "p4"
output: html_document
date: "2023-12-08"
---
```


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
# Install coop only if it is not already available
if (!requireNamespace("coop", quietly = TRUE)) install.packages("coop")

R = read.csv("data/ratings_norm.csv")
```


```{r}
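# Drop the first column, which holds the user IDs written as the CSV index
R <- R[, -1]

# Cosine similarity between movie columns, using only users who rated both movies
S <- coop::cosine(as.matrix(R), use = "pairwise.complete.obs")

# Map similarities from [-1, 1] to [0, 1] and round to 7 decimal places
rounded_S <- round(1/2 + 1/2*(S), digits = 7)
```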
```


```{r}
write.csv(rounded_S, file = "data/similarity.csv")
```
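
For readers who want to sanity-check the R output on a small input, the same transformation can be sketched in Python. This is not the code used to produce `similarity.csv`. It assumes that `pairwise.complete.obs` restricts each pair of movies to the users who rated both. `pairwise_cosine` is a hypothetical helper name. The dense matrix products make it suitable only for small matrices.

```python
import numpy as np
import pandas as pd

def pairwise_cosine(R: pd.DataFrame) -> pd.DataFrame:
    """Column-wise cosine similarity over co-rated rows, mapped to [0, 1]."""
    X = R.to_numpy(dtype=float)
    present = ~np.isnan(X)
    X0 = np.where(present, X, 0.0)   # missing entries contribute 0 to sums
    P = present.astype(float)

    # dot[i, j]: sum of x_ui * x_uj over users who rated both i and j
    dot = X0.T @ X0
    # sq[i, j]: sum of x_ui^2 over users who rated both i and j
    sq = (X0 ** 2).T @ P
    denom = np.sqrt(sq * sq.T)

    # Pairs with a zero denominator become NaN instead of raising a warning
    with np.errstate(divide="ignore", invalid="ignore"):
        cos = dot / denom
    return pd.DataFrame(0.5 + 0.5 * cos, index=R.columns, columns=R.columns)

# Toy centered ratings (hypothetical values)
R_toy = pd.DataFrame(
    {"m1": [1.0, -1.0, np.nan], "m2": [1.0, -1.0, 2.0], "m3": [-1.0, 1.0, np.nan]},
    index=["u1", "u2", "u3"],
)
S_toy = pairwise_cosine(R_toy)
print(S_toy.round(7))
```

In this toy case `m1` and `m2` agree perfectly on the two users they share, giving a transformed similarity of 1, while `m1` and `m3` are exact opposites, giving 0.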

In [25]:

# Create a mask counting, for each pair of movies, the users who rated both
# Note: this only needs to be run once on ratings.csv.

try:
    # If data/mask.csv already exists, load it
    mask = pd.read_csv("data/mask.csv")
except FileNotFoundError:
    # Otherwise build the movie-by-movie co-rating counts and cache them
    # (ratings is users x movies, so the transpose goes on the left)
    not_na = ratings.notna().astype(int)
    ratings_mask = not_na.T.dot(not_na)
    ratings_mask.to_csv("data/mask.csv")
    mask = pd.read_csv("data/mask.csv")
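
A co-rating count matrix of this kind has one row and one column per movie. The sketch below illustrates the idea on a toy users-by-movies matrix with made-up values. `toy`, `rated`, and `co_counts` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values): rows are users, columns are movies
toy = pd.DataFrame(
    {"m1": [5, 4, np.nan], "m2": [3, np.nan, 2], "m3": [1, 5, 4]},
    index=["u1", "u2", "u3"],
)

# 1 where a rating exists, 0 where it is missing
rated = toy.notna().astype(int)

# (movies x users) dot (users x movies) gives movies x movies:
# entry [i, j] is the number of users who rated both movie i and movie j
co_counts = rated.T.dot(rated)
print(co_counts)
```

Only `u1` rated both `m1` and `m2`, so that entry is 1. The diagonal holds each movie's own rating count.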

### Check similarity matrix

In [26]:
S = pd.read_csv('data/similarity.csv')
S.set_index('Unnamed: 0', inplace= True)

mask.set_index("Unnamed: 0", inplace= True)
# Zero the diagonal so each movie's similarity to itself is also masked out
np.fill_diagonal(mask.values, 0)

# For every pair of movies rated together by fewer than 3 users, set the similarity to NaN
S[mask < 3] = np.nan

In [28]:
# Test similarity output of selected movies, rounded to 7 decimal places
selected_indices = ['m1', 'm10', 'm100', 'm1510', 'm260', 'm3212']

sample_similarity = S.loc[selected_indices, selected_indices]

print(f'Sample similarity matrix for movies {selected_indices}')
sample_similarity

Sample similarity matrix for movies ['m1', 'm10', 'm100', 'm1510', 'm260', 'm3212']


| | m1 | m10 | m100 | m1510 | m260 | m3212 |
| --- | --- | --- | --- | --- | --- | --- |
| m1 | NaN | 0.5121055 | 0.3919999 | NaN | 0.7411482 | NaN |
| m10 | 0.5121055 | NaN | 0.5474583 | NaN | 0.5343338 | NaN |
| m100 | 0.3919999 | 0.5474583 | NaN | NaN | 0.3296943 | NaN |
| m1510 | NaN | NaN | NaN | NaN | NaN | NaN |
| m260 | 0.7411482 | 0.5343338 | 0.3296943 | NaN | NaN | NaN |
| m3212 | NaN | NaN | NaN | NaN | NaN | NaN |


In [29]:
# Create a function to get the top n (default 30) similar movies for each movie
def top_n(row, n = 30):
    top_n_indices = row.sort_values(ascending=False).index[:n]
    row.loc[~row.index.isin(top_n_indices)] = np.nan
    return row

S = S.apply(top_n, axis=1)
S.to_csv('data/similarity_top_30.csv')
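
The effect of the row-wise top-n filter is easier to see on a small matrix. The following is a minimal sketch with made-up similarities and `n = 2`. `keep_top_n` is a hypothetical variant that uses `Series.nlargest` and returns a new row instead of modifying its input.

```python
import numpy as np
import pandas as pd

def keep_top_n(row: pd.Series, n: int) -> pd.Series:
    # nlargest skips NaN and returns the n largest similarities in the row
    keep = row.nlargest(n).index
    # Everything outside the top n is replaced by NaN
    return row.where(row.index.isin(keep))

ids = ["m1", "m2", "m3", "m4"]
toy_S = pd.DataFrame(
    [[np.nan, 0.9, 0.2, 0.6],
     [0.9, np.nan, 0.4, 0.1],
     [0.2, 0.4, np.nan, 0.8],
     [0.6, 0.1, 0.8, np.nan]],
    index=ids, columns=ids,
)

# Extra keyword arguments to apply are forwarded to keep_top_n
toy_top2 = toy_S.apply(keep_top_n, axis=1, n=2)
print(toy_top2)
```

Each row keeps exactly two neighbors here. For `m1` these are `m2` and `m4`, and the weaker similarity to `m3` is dropped.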

In [30]:
S_top30 = pd.read_csv('data/similarity_top_30.csv')
S_top30.set_index("Unnamed: 0", inplace= True)

In [31]:
def myIBCF(similarity_matrix, newuser, num_recommendations=10):
    # Candidates are the movies the user has not rated yet
    not_rated_indices = newuser[newuser.isna()].index
    df_not_rated = pd.DataFrame(index=not_rated_indices, columns=["Value"])

    for l in df_not_rated.index:
        # Neighbors of movie l that survived the top-30 filter
        Sl = similarity_matrix.loc[l].dropna()

        movie_score_num = 0
        movie_score_denom = 0

        for i in Sl.index:
            # Only neighbors the user has actually rated contribute to the score
            if not np.isnan(newuser[i]):
                movie_score_num += newuser[i] * Sl[i]
                movie_score_denom += Sl[i]
        if movie_score_denom != 0:
            # Prediction is the similarity-weighted average of the user's ratings
            df_not_rated.loc[l] = movie_score_num / movie_score_denom

    return df_not_rated.sort_values(by="Value", ascending=False).head(num_recommendations)
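
The score computed for a single unrated movie is a similarity-weighted average of the user's ratings over that movie's retained neighbors. The sketch below works one prediction through by hand with made-up numbers. `predict_one` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def predict_one(sim_row: pd.Series, user_ratings: pd.Series) -> float:
    """Similarity-weighted average of the user's ratings over usable neighbors."""
    # A neighbor is usable if its similarity is known and the user rated it
    usable = sim_row.notna() & user_ratings.notna()
    if not usable.any():
        return np.nan
    weights = sim_row[usable]
    return float((weights * user_ratings[usable]).sum() / weights.sum())

movies = ["m1", "m2", "m3", "m4"]
# Similarities of the target movie m4 to the others (hypothetical values)
sim_m4 = pd.Series([0.8, 0.2, np.nan, np.nan], index=movies)
# The user rated m1, m2 and m3 but not m4
user = pd.Series([5.0, 1.0, 3.0, np.nan], index=movies)

# (0.8 * 5 + 0.2 * 1) / (0.8 + 0.2) = 4.2
pred = predict_one(sim_m4, user)
print(pred)
```

The rating for `m3` is ignored because its similarity to `m4` is missing, which mirrors how masked or filtered-out neighbors drop out of the score.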

In [32]:
# Test myIBCF function with user u1181
print("Top Recommendations for User u1181")
newuser = ratings.loc['u1181']
myIBCF(S_top30, newuser)

Top Recommendations for User u1181


| | Value |
| --- | --- |
| m3732 | 5.0 |
| m749 | 4.5265592 |
| m3899 | 4.5260659 |
| m3752 | 4.0 |
| m504 | 4.0 |
| m1235 | 4.0 |
| m2793 | 4.0 |
| m2082 | 4.0 |
| m3789 | 4.0 |
| m1914 | 4.0 |
