<a href="https://colab.research.google.com/github/mobius29er/AIML_Class/blob/main/MovieRatingAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Required Discussion 19:1: Building a Recommender System with SURPRISE

This discussion focuses on exploring additional algorithms with the `Surprise` library to generate recommendations. Your goal is to identify the optimal algorithm by minimizing the mean squared error using cross validation. You are also going to select a dataset to use from the [grouplens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to [grouplens](https://grouplens.org/datasets/movielens/) and examine the different datasets available.  Choose one so that it is easy to create the data as expected in `Surprise` with user, item, and rating information.  Then, compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations.  For more information on the algorithms see the documentation for the algorithm package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation and include the results of your cross validation and a basic description of your dataset with your peers.
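One possible way to produce the basic dataset description is sketched below. It assumes the ratings table has the `userId`, `movieId`, and `rating` columns used later in this notebook. The `ratings_demo` frame and the `describe_ratings` helper are hypothetical names introduced for illustration, and the rows are made up rather than taken from MovieLens.

```python
import pandas as pd

# Hypothetical stand-in for ratings.csv: same column names as the MovieLens
# ratings file, but only a handful of made-up rows.
ratings_demo = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

def describe_ratings(df):
    """Return basic size and sparsity figures for a user-item-rating table."""
    n_users = df["userId"].nunique()
    n_items = df["movieId"].nunique()
    n_ratings = len(df)
    return {
        "n_users": n_users,
        "n_items": n_items,
        "n_ratings": n_ratings,
        # Share of the user x item grid that actually holds a rating.
        "density": n_ratings / (n_users * n_items),
        "mean_rating": df["rating"].mean(),
    }

summary = describe_ratings(ratings_demo)
print(summary)
```

Calling `describe_ratings(ratings)` on the real ratings frame should give the same fields for the chosen dataset. A low density is typical for rating data and is one reason the algorithms can differ noticeably in error.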


In [6]:
!pip install "numpy<2.0"
!pip install surprise

Collecting surprise
  Using cached surprise-0.1-py2.py3-none-any.whl.metadata (327 bytes)
Using cached surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Installing collected packages: surprise
Successfully installed surprise-0.1


In [11]:
import pandas as pd
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering, NormalPredictor
from surprise.model_selection import cross_validate, RandomizedSearchCV, train_test_split
from surprise import accuracy
import time
from scipy.stats import randint, uniform
import matplotlib.pyplot as plt

In [2]:
!gdown https://drive.google.com/uc?id=1BRpXZorDwjvZuX3b1wF1_6pHREMkWMwN -O links.csv
!gdown https://drive.google.com/uc?id=1scyn6EGZfYsJtRZCaUIkl9rvgEq25BEM -O movies.csv
!gdown https://drive.google.com/uc?id=1RrmA2bCbz3S8NG5hTHsIfTdbJaSxVRR_ -O ratings.csv
!gdown https://drive.google.com/uc?id=10fC8xmQ5a0hpeXcQXPOYXUV2XsbMMWtv -O tags.csv

Downloading...
From: https://drive.google.com/uc?id=1BRpXZorDwjvZuX3b1wF1_6pHREMkWMwN
To: /content/links.csv
100% 198k/198k [00:00<00:00, 82.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1scyn6EGZfYsJtRZCaUIkl9rvgEq25BEM
To: /content/movies.csv
100% 494k/494k [00:00<00:00, 161MB/s]
Downloading...
From: https://drive.google.com/uc?id=1RrmA2bCbz3S8NG5hTHsIfTdbJaSxVRR_
To: /content/ratings.csv
100% 2.48M/2.48M [00:00<00:00, 129MB/s]
Downloading...
From: https://drive.google.com/uc?id=10fC8xmQ5a0hpeXcQXPOYXUV2XsbMMWtv
To: /content/tags.csv
100% 119k/119k [00:00<00:00, 115MB/s]


In [5]:
# Load ratings
ratings = pd.read_csv("ratings.csv")

# Load movies for mapping IDs to titles
movies = pd.read_csv("movies.csv")

# Build Surprise dataset from the DataFrame.
# load_from_df only needs the rating scale; file-parsing options such as
# line_format and sep apply to load_from_file and are not used here.
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)

In [12]:
# Train/test split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Run multiple algorithms
algos = {
    "SVD": SVD(random_state=42),
    "NMF": NMF(random_state=42),
    "SlopeOne": SlopeOne(),
    "KNNBasic": KNNBasic(),
    "CoClustering": CoClustering(random_state=42)
}

results = {}
predictions = {}

for name, algo in algos.items():
    algo.fit(trainset)
    preds = algo.test(testset)
    rmse = accuracy.rmse(preds, verbose=False)
    results[name] = rmse
    predictions[name] = preds

# These RMSEs come from a single 80/20 hold-out split, not from cross-validation
print("Hold-out RMSEs:", results)

Computing the msd similarity matrix...
Done computing similarity matrix.
Hold-out RMSEs: {'SVD': 0.8807462819979623, 'NMF': 0.9287510909340335, 'SlopeOne': 0.9114276258338462, 'KNNBasic': 0.9578061143849369, 'CoClustering': 0.9543517643640059}


In [13]:
def preds_to_df(preds, label):
    return pd.DataFrame([(p.uid, p.iid, p.est) for p in preds],
                        columns=["userId","movieId",label])

# Convert
svd_df = preds_to_df(predictions["SVD"], "svd_rating")
s1_df  = preds_to_df(predictions["SlopeOne"], "slope_one_rating")

# Merge and build hybrid
hybrid_df = svd_df.merge(s1_df, on=["userId","movieId"])
hybrid_df["hybrid_rating"] = hybrid_df[["svd_rating","slope_one_rating"]].mean(axis=1)

# Add movie titles
hybrid_df = hybrid_df.merge(movies[["movieId","title"]], on="movieId", how="left")

# Format columns like assignment asks
hybrid_df = hybrid_df[["title","userId","hybrid_rating","svd_rating","slope_one_rating"]]

print(hybrid_df.head(10))


                                               title  userId  hybrid_rating  \
0                        Under the Tuscan Sun (2003)     140       3.184288   
1                          Once Were Warriors (1994)     603       4.515343   
2                 Dragon: The Bruce Lee Story (1993)     438       2.781713   
3                                     Arrival (2016)     433       3.424891   
4                  Bad and the Beautiful, The (1952)     474       3.409121   
5                         Sound of Music, The (1965)     304       4.394399   
6                      Not Another Teen Movie (2001)     298       2.130361   
7  Léon: The Professional (a.k.a. The Professiona...     131       3.611973   
8                                  Ghost Ship (2002)     288       2.248211   
9                          Hotel Transylvania (2012)     448       2.779960   

   svd_rating  slope_one_rating  
0    3.412888          2.955687  
1    4.455247          4.575440  
2    3.125288          2.438
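The hybrid column is built but never scored. One way to check whether averaging helps is to keep the true rating alongside each estimate (Surprise stores it on each prediction as `r_ui`) and compute the RMSE of every column against it. The sketch below uses made-up numbers in a hypothetical `demo` frame, with a hypothetical `true_rating` column, so the arithmetic is easy to follow. It is not output from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only; they are not notebook output.
demo = pd.DataFrame({
    "true_rating":      [4.0, 3.0, 5.0, 2.0],
    "svd_rating":       [3.5, 3.0, 4.0, 3.0],
    "slope_one_rating": [4.5, 2.0, 5.0, 2.0],
})
# Same blending rule as the notebook: a plain average of the two estimates.
demo["hybrid_rating"] = demo[["svd_rating", "slope_one_rating"]].mean(axis=1)

def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric columns."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

scores = {
    col: rmse(demo["true_rating"], demo[col])
    for col in ["svd_rating", "slope_one_rating", "hybrid_rating"]
}
print(scores)
```

In this toy example the average scores better than either input because the two models err in different directions. Whether that holds on the MovieLens test set would need to be measured the same way.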

In [15]:
results_df = pd.DataFrame.from_dict(results, orient='index', columns=['RMSE']).sort_values(by="RMSE")
print(results_df)

                  RMSE
SVD           0.880746
SlopeOne      0.911428
NMF           0.928751
CoClustering  0.954352
KNNBasic      0.957806
