### Required Discussion 19:1: Building a Recommender System with SURPRISE

This discussion focuses on exploring additional algorithms with the `Surprise` library to generate recommendations. Your goal is to identify the optimal algorithm by minimizing the mean squared error using cross-validation. You will also select a dataset to use from the [grouplens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to [grouplens](https://grouplens.org/datasets/movielens/) and examine the different datasets available. Choose one whose user, item, and rating information is easy to load in the format `Surprise` expects. Then, compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations. For more information on the algorithms, see the documentation for the algorithm package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.


In [3]:
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate

import pandas as pd

### Installation Notes for the Surprise Library on openSUSE with Python 3.11

The initial installation of `scikit-surprise` failed to import in Jupyter due to a compatibility issue between the library’s compiled C extensions and the version of NumPy installed on the system. Surprise was compiled for NumPy 1.x, but Python 3.11 on openSUSE was using NumPy 2.x, which caused `ImportError: numpy.core.multiarray failed to import`. To resolve this, NumPy was fully uninstalled and replaced with a compatible version (`numpy<2`, specifically 1.26.4). The `scikit-surprise` package was then reinstalled so it could rebuild against the correct NumPy headers. After confirming that Jupyter was using the Python 3.11 environment where these packages were installed, the Surprise library imported successfully.
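
A quick way to spot this kind of mismatch before importing Surprise is to check the major version of the installed NumPy. The sketch below is only illustrative: the helper name `is_numpy_1x` is hypothetical, and it inspects the version string rather than testing the compiled extension itself.

```python
from importlib import metadata


def is_numpy_1x(version: str) -> bool:
    """Return True if a version string such as '1.26.4' has major version 1."""
    major = version.split(".")[0]
    return major == "1"


try:
    installed = metadata.version("numpy")
    status = "looks compatible" if is_numpy_1x(installed) else "is too new"
    print(f"NumPy {installed} {status} for a Surprise build made against NumPy 1.x")
except metadata.PackageNotFoundError:
    print("NumPy is not installed in this environment")
```

Running this in the same Jupyter kernel that will import Surprise also helps confirm that the notebook is using the intended environment.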


In [6]:
ratings_path = "ratings.csv"
movies_path  = "movies.csv"
tags_path    = "tags.csv"
links_path   = "links.csv"

ratings = pd.read_csv(ratings_path)
movies  = pd.read_csv(movies_path)
tags    = pd.read_csv(tags_path)
links   = pd.read_csv(links_path)

ratings.head()


   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [7]:
# The ratings DataFrame has userId, movieId, rating, and timestamp columns.
# I told Surprise what the rating scale is
reader = Reader(rating_scale=(0.5, 5.0))

# Then I built the Surprise dataset from the pandas DataFrame
data = Dataset.load_from_df(
    ratings[['userId', 'movieId', 'rating']],
    reader
)


In [8]:
len(ratings), ratings['userId'].nunique(), ratings['movieId'].nunique()


(100836, 610, 9724)

### Algorithm Comparison

Now that I have confirmed that the output matches the official MovieLens dataset description, it is time to proceed to the **Algorithm Comparison**.

In this section, I will compare the performance of the following algorithms:

- **KNNBasic**
- **SVD**
- **NMF**
- **SlopeOne**
- **CoClustering**

These models will be evaluated using **5-fold cross-validation**, measuring:

- **RMSE** (Root Mean Squared Error): lower is better
- **MAE** (Mean Absolute Error): lower is better
- **Fit time**: how long the model takes to train
- **Test time**: how long the model takes to generate predictions
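
To make the two accuracy metrics concrete, here is a minimal sketch of how RMSE and MAE are computed from pairs of true and predicted ratings. The ratings in the example are made up for illustration and are not taken from the dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up ratings for illustration only (not from MovieLens)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 4.0]

print(f"RMSE = {rmse(actual, predicted):.4f}")
print(f"MAE  = {mae(actual, predicted):.4f}")
```

Because RMSE squares each error before averaging, a single large miss (the 2-star error here) pulls RMSE above MAE.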


In [12]:
from surprise import KNNBasic
from surprise.model_selection import cross_validate

# Define the algorithm
knn = KNNBasic()

# Evaluate with 5-fold cross-validation
knn_results = cross_validate(
    knn,
    data,
    measures=["RMSE", "MAE"],
    cv=5,
    verbose=True
)

knn_results


Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9493  0.9465  0.9528  0.9515  0.9415  0.9483  0.0040  
MAE (testset)     0.7263  0.7248  0.7316  0.7298  0.7218  0.7268  0.0035  
Fit time          0.10    0.12    0.13    0.12    0.11    0.12    0.01    
Test time         1.15    0.99    1.07    1.04    1.04    1.06    0.05    


{'test_rmse': array([0.94928465, 0.94651172, 0.95281797, 0.95145631, 0.94153796]),
 'test_mae': array([0.72627704, 0.72479138, 0.73155286, 0.72978468, 0.72178794]),
 'fit_time': (0.10239982604980469,
  0.11645722389221191,
  0.12970399856567383,
  0.12432336807250977,
  0.11074042320251465),
 'test_time': (1.1450910568237305,
  0.9895656108856201,
  1.0708889961242676,
  1.0380923748016357,
  1.040161371231079)}

### Results: KNNBasic

The KNNBasic algorithm was evaluated using **5-fold cross-validation**.  
This method calculates user–user similarities using Mean Squared Difference (MSD) and predicts ratings based on nearest neighbors.
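
As a rough illustration of the MSD similarity, the sketch below compares two users on the items they have both rated and converts the mean squared difference into a similarity of 1 / (MSD + 1). The function name, the toy ratings, and the choice to return 0 when there are no co-rated items are assumptions made for this example, not a copy of the library's implementation.

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD similarity between two users, each given as a dict of item -> rating."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        # Assumption for this sketch: no co-rated items means zero similarity
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # Adding 1 keeps the denominator positive when the users agree exactly
    return 1.0 / (msd + 1.0)


# Toy users for illustration only
alice = {"m1": 4.0, "m2": 5.0, "m3": 1.0}
bob = {"m1": 4.0, "m2": 4.0, "m4": 3.0}

print(msd_similarity(alice, bob))
```

Users who agree exactly on their co-rated items get a similarity of 1, and the value shrinks toward 0 as their ratings diverge.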

**Performance Summary**

| Metric | Mean | Interpretation |
|--------|------|----------------|
| **RMSE** | **0.9483** | Typical prediction error is ~0.95 stars, with large misses weighted more heavily |
| **MAE** | **0.7268** | Average absolute error is ~0.73 stars |
| **Fit Time** | **0.12 seconds** | Very fast to train |
| **Test Time** | **1.06 seconds** | Slower to predict, since each prediction requires a nearest-neighbor lookup |

**Analysis**

KNNBasic performs reasonably well, but it is not the strongest algorithm for MovieLens-style data.  
This provides a baseline, and the next step is to evaluate more advanced algorithms like **SVD**.


In [13]:
from surprise import SVD
from surprise.model_selection import cross_validate

# Define the algorithm
svd = SVD()

# Evaluate using 5-fold cross-validation
svd_results = cross_validate(
    svd,
    data,
    measures=["RMSE", "MAE"],
    cv=5,
    verbose=True
)

svd_results


Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8781  0.8680  0.8712  0.8748  0.8768  0.8738  0.0037  
MAE (testset)     0.6757  0.6674  0.6690  0.6703  0.6747  0.6714  0.0032  
Fit time          1.21    1.15    1.13    1.12    1.15    1.15    0.03    
Test time         0.09    0.18    0.09    0.08    0.18    0.12    0.05    


{'test_rmse': array([0.87811798, 0.86798325, 0.87122453, 0.87480276, 0.87681915]),
 'test_mae': array([0.67571455, 0.66735411, 0.66899445, 0.67030148, 0.67468266]),
 'fit_time': (1.2057907581329346,
  1.1508398056030273,
  1.130406379699707,
  1.1210880279541016,
  1.151198387145996),
 'test_time': (0.08806252479553223,
  0.1761789321899414,
  0.08850860595703125,
  0.08173274993896484,
  0.18261027336120605)}

### Results: SVD (Singular Value Decomposition)

SVD is a matrix factorization algorithm that learns latent features for both
users and items. It typically performs very well on MovieLens datasets because
it captures deeper patterns in how users rate movies.

Compared to KNNBasic (RMSE ≈ 0.9483, MAE ≈ 0.7268), SVD is clearly more
accurate, with a mean RMSE of 0.8738 and a mean MAE of 0.6714. It takes longer
to train (about 1.15 s per fold versus 0.12 s) but generates predictions much
faster once trained (about 0.12 s versus 1.06 s). This makes SVD a strong
baseline and one of the most reliable algorithms for collaborative filtering
on MovieLens data.
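
To show what "learning latent features" means in practice, here is a minimal sketch of a biased matrix factorization model of the kind Surprise's `SVD` is based on: a rating is predicted as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and the parameters are nudged by stochastic gradient descent. The function names, starting values, learning rate, and regularization strength are illustrative assumptions, not the library's internals.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Predicted rating: global mean + user bias + item bias + factor dot product."""
    dot = sum(p * q for p, q in zip(p_u, q_i))
    return mu + b_u + b_i + dot


def sgd_step(rating, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One regularized SGD update on a single observed rating."""
    err = rating - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of the other vector
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# Illustrative starting point: one user, one item, two latent factors
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.2], [0.3, 0.1]
rating = 5.0

before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(rating, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)

print(f"error before: {rating - before:.4f}, after one step: {rating - after:.4f}")
```

Each update moves the prediction slightly toward the observed rating; repeating this over all ratings for many epochs is what accounts for the longer fit time compared with KNNBasic.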

In [14]:
from surprise import NMF
from surprise.model_selection import cross_validate

# Define the NMF algorithm
nmf = NMF()

# Evaluate using 5-fold cross-validation
nmf_results = cross_validate(
    nmf,
    data,
    measures=["RMSE", "MAE"],
    cv=5,
    verbose=True
)

nmf_results


Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9119  0.9203  0.9342  0.9277  0.9126  0.9213  0.0086  
MAE (testset)     0.6994  0.7049  0.7118  0.7124  0.7008  0.7059  0.0054  
Fit time          2.23    2.14    2.19    2.29    2.23    2.22    0.05    
Test time         0.08    0.18    0.08    0.18    0.07    0.12    0.05    


{'test_rmse': array([0.91188182, 0.92031965, 0.93419075, 0.92773398, 0.91256745]),
 'test_mae': array([0.69938756, 0.70492446, 0.71183981, 0.71237996, 0.70076836]),
 'fit_time': (2.225700616836548,
  2.1405556201934814,
  2.1902778148651123,
  2.2915737628936768,
  2.228145122528076),
 'test_time': (0.08072376251220703,
  0.18477559089660645,
  0.07628488540649414,
  0.17822957038879395,
  0.07028484344482422)}

### Results: NMF (Non-Negative Matrix Factorization)

NMF is a matrix factorization model similar to SVD, but it imposes a
**non-negativity constraint** on all latent factors. This can make the factors
more interpretable, but it also limits the model’s flexibility and typically
reduces accuracy. With a mean RMSE of 0.9213 and a mean MAE of 0.7059, NMF
performs **better than KNNBasic** (RMSE 0.9483) but **worse than SVD**
(RMSE 0.8738), which remains the most accurate model so far. NMF is also
slower to train than both, at about 2.22 s per fold.

In [15]:
from surprise import SlopeOne
from surprise.model_selection import cross_validate

# Define the SlopeOne algorithm
slope = SlopeOne()

# Evaluate using 5-fold cross-validation
slope_results = cross_validate(
    slope,
    data,
    measures=["RMSE", "MAE"],
    cv=5,
    verbose=True
)

slope_results


Evaluating RMSE, MAE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9000  0.9004  0.9013  0.9011  0.9008  0.9007  0.0005  
MAE (testset)     0.6872  0.6866  0.6855  0.6901  0.6913  0.6882  0.0022  
Fit time          3.43    3.47    3.32    3.24    3.52    3.40    0.10    
Test time         5.36    5.08    5.45    4.89    4.97    5.15    0.22    


{'test_rmse': array([0.90004766, 0.90042484, 0.90132997, 0.90105948, 0.90084973]),
 'test_mae': array([0.68720191, 0.68663578, 0.68549187, 0.69010506, 0.69131766]),
 'fit_time': (3.4333155155181885,
  3.4692580699920654,
  3.324516534805298,
  3.23602557182312,
  3.5228271484375),
 'test_time': (5.357722282409668,
  5.0758867263793945,
  5.446192979812622,
  4.886468887329102,
  4.969773530960083)}

### Results: SlopeOne

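SlopeOne is a simple and older collaborative filtering algorithm based on item-item
differences. It predicts a user’s rating by averaging known rating differences
between items. With a mean RMSE of 0.9007 and a mean MAE of 0.6882, it is the
second most accurate model so far, behind SVD. It is also the **slowest
algorithm** evaluated so far, taking about 3.40 s to fit and 5.15 s to test per fold.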
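
The idea of averaging rating differences can be sketched in a few lines. In this simplified version, the prediction for a user and a target item is the user's mean rating plus the average deviation between the target item and each item the user has already rated, where each deviation is averaged over the users who rated both items. The function name and the toy ratings are assumptions for illustration.

```python
def slope_one_predict(ratings, user, item):
    """Predict a rating; `ratings` maps user -> {item: rating}."""
    user_ratings = ratings[user]
    user_mean = sum(user_ratings.values()) / len(user_ratings)

    deviations = []
    for other in user_ratings:
        if other == item:
            continue
        # Rating differences (item minus other) from users who rated both
        diffs = [r[item] - r[other] for r in ratings.values()
                 if item in r and other in r]
        if diffs:
            deviations.append(sum(diffs) / len(diffs))

    if not deviations:
        return user_mean  # nothing to compare against, fall back to the mean
    return user_mean + sum(deviations) / len(deviations)


# Toy ratings for illustration only
ratings = {
    "ann": {"A": 5.0, "B": 3.0},
    "ben": {"A": 3.0, "B": 4.0},
    "cat": {"A": 4.0},
}

print(slope_one_predict(ratings, "cat", "B"))
```

Here ann rates B two stars below A and ben rates it one star above, so the average deviation is -0.5 and the prediction for cat is 3.5. Computing these pairwise deviations across many items is one plausible reason for the long fit and test times seen above.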

In [17]:
from surprise import CoClustering
from surprise.model_selection import cross_validate

# Define the CoClustering algorithm
cocluster = CoClustering()

# Evaluate using 5-fold cross-validation
cocluster_results = cross_validate(
    cocluster,
    data,
    measures=["RMSE", "MAE"],
    cv=5,
    verbose=True
)

cocluster_results


Evaluating RMSE, MAE of algorithm CoClustering on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9446  0.9517  0.9426  0.9542  0.9428  0.9472  0.0048  
MAE (testset)     0.7312  0.7343  0.7316  0.7364  0.7297  0.7326  0.0024  
Fit time          2.23    2.02    2.01    2.00    2.04    2.06    0.09    
Test time         0.08    0.20    0.08    0.07    0.07    0.10    0.05    


{'test_rmse': array([0.94456183, 0.9516643 , 0.942574  , 0.95423142, 0.94283111]),
 'test_mae': array([0.73119772, 0.73429018, 0.73157007, 0.73638349, 0.72971717]),
 'fit_time': (2.2291295528411865,
  2.0185444355010986,
  2.0088515281677246,
  2.0031089782714844,
  2.0404787063598633),
 'test_time': (0.08199572563171387,
  0.2029726505279541,
  0.07540416717529297,
  0.06799936294555664,
  0.06867170333862305)}

### Results: CoClustering

CoClustering groups users and items into co-clusters and predicts ratings based
on these group-level interactions.

**Analysis**

CoClustering performs poorly on the MovieLens dataset. Its mean RMSE (0.9472)
and mean MAE (0.7326) are close to those of KNNBasic (0.9483 and 0.7268), which
makes these two the weakest models in this comparison. Although prediction
time is fast (about 0.10 s per fold), the overall accuracy does not match
**SVD**, **SlopeOne**, or even **NMF**.
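
To put the five runs side by side, the sketch below ranks the algorithms by mean test RMSE. The per-fold values are copied from the cross-validation tables above, rounded to four decimals as printed there; the dictionary layout is just one convenient way to organize them.

```python
from statistics import mean

# Per-fold test RMSE, copied from the cross-validation output tables
fold_rmse = {
    "KNNBasic":     [0.9493, 0.9465, 0.9528, 0.9515, 0.9415],
    "SVD":          [0.8781, 0.8680, 0.8712, 0.8748, 0.8768],
    "NMF":          [0.9119, 0.9203, 0.9342, 0.9277, 0.9126],
    "SlopeOne":     [0.9000, 0.9004, 0.9013, 0.9011, 0.9008],
    "CoClustering": [0.9446, 0.9517, 0.9426, 0.9542, 0.9428],
}

# Sort algorithm names from lowest (best) to highest mean RMSE
ranking = sorted(fold_rmse, key=lambda name: mean(fold_rmse[name]))

for name in ranking:
    print(f"{name:<13} mean RMSE = {mean(fold_rmse[name]):.4f}")
```

In the notebook itself, the same ranking could be computed directly from `knn_results["test_rmse"]`, `svd_results["test_rmse"]`, and the other result dictionaries instead of retyping the values.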