### Required Discussion 19:1: Building a Recommender System with SURPRISE

This discussion focuses on exploring additional algorithms in the `Surprise` library to generate recommendations. Your goal is to identify the optimal algorithm by minimizing the mean squared error using cross-validation. You will also select a dataset from the [GroupLens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to [GroupLens](https://grouplens.org/datasets/movielens/) and examine the different datasets available. Choose one that is easy to load into the user, item, and rating format that `Surprise` expects. Then, compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations. For more information on the algorithms, see the documentation for the prediction algorithms package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.


In [2]:
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate

import pandas as pd

In [3]:
# Load the MovieLens 100k dataset
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file('data/ml-100k/u.data', reader=reader)

# Create a list of algorithms to test
algorithms = {
    'SVD': SVD(),
    'NMF': NMF(),
    'KNNBasic': KNNBasic(),
    'SlopeOne': SlopeOne(),
    'CoClustering': CoClustering()
}

# Dictionary to store results
results = {}

# Perform cross-validation for each algorithm
for name, algo in algorithms.items():
    print(f"Running cross-validation for {name}...")
    cv_results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
    results[name] = {
        'RMSE': cv_results['test_rmse'].mean(),
        'MSE': (cv_results['test_rmse'] ** 2).mean(),  # Square each fold's RMSE to get its MSE, then average
        'MAE': cv_results['test_mae'].mean()
    }

# Create a DataFrame to display results
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values('MSE')  # Sort by MSE since that's our primary metric
print("\nFinal Results:")
print(results_df)

Running cross-validation for SVD...
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9363  0.9386  0.9385  0.9372  0.9336  0.9369  0.0018  
MAE (testset)     0.7400  0.7428  0.7375  0.7383  0.7347  0.7387  0.0027  
Fit time          0.39    0.39    0.40    0.38    0.37    0.38    0.01    
Test time         0.04    0.04    0.04    0.04    0.07    0.05    0.01    
Running cross-validation for NMF...
Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9655  0.9678  0.9626  0.9608  0.9703  0.9654  0.0034  
MAE (testset)     0.7608  0.7613  0.7550  0.7535  0.7622  0.7586  0.0036  
Fit time          0.58    0.59    0.62    0.57    0.56    0.58    0.02    
Test time         0.04    0.07    0.04    0.07    0.04    0.05    0.02    
Running cross-validation for KNNBasic...
Computing the msd similarity 

## Analysis Summary

We evaluated five collaborative filtering algorithms on the MovieLens 100K dataset, which contains 100,000 ratings from 943 users across 1,682 movies. Using 5-fold cross-validation, we compared the performance of the SVD, NMF, KNNBasic, SlopeOne, and CoClustering algorithms. SVD emerged as the best performer with the lowest MSE (0.878) and RMSE (0.937), followed by SlopeOne and CoClustering. KNNBasic, despite its simplicity, showed the highest error rates. All algorithms maintained relatively consistent performance across folds, with SVD showing particularly stable results (RMSE standard deviation of 0.0018 across folds). While the performance differences were modest, with RMSE values ranging from 0.937 to 0.978, the computational requirements varied significantly. SVD combined superior accuracy with efficient processing, while KNNBasic required longer computation times due to similarity matrix calculations. These results suggest that SVD, a matrix factorization approach, offers the best balance of accuracy and efficiency for this recommendation task.
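
The MSE and RMSE figures quoted for SVD can be checked by hand from the per-fold RMSE values printed in the cross-validation output. The sketch below is only an arithmetic check, not part of the original notebook; it compares squaring the mean RMSE with averaging the squared per-fold RMSE values. The two are not identical in general, but with fold-to-fold variation this small they agree to three decimals.

```python
# Per-fold RMSE values for SVD, copied from the cross-validation output above.
svd_fold_rmse = [0.9363, 0.9386, 0.9385, 0.9372, 0.9336]

# Mean RMSE across the five folds.
mean_rmse = sum(svd_fold_rmse) / len(svd_fold_rmse)

# Option 1: square the mean RMSE.
mse_from_mean_rmse = mean_rmse ** 2

# Option 2: square each fold's RMSE (that fold's MSE), then average.
mean_fold_mse = sum(r ** 2 for r in svd_fold_rmse) / len(svd_fold_rmse)

print(round(mean_rmse, 3), round(mse_from_mean_rmse, 3), round(mean_fold_mse, 3))
# prints: 0.937 0.878 0.878
```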

## Analysis Summary

We explored the MovieLens 100K dataset, which captures 100,000 movie ratings from 943 users rating 1,682 different movies on a 1-5 scale. Looking at how different recommendation algorithms handle this data, we found some interesting patterns. We put five popular algorithms through their paces using cross-validation, and SVD really stood out from the pack. It not only achieved the best accuracy with an MSE of 0.878, but it was also remarkably consistent across different data splits. SlopeOne came in as a solid runner-up, while KNNBasic struggled to keep up despite its straightforward approach.

What's particularly interesting is how the computational demands varied. While KNNBasic seems simple on paper, it actually took the longest to run due to all its similarity calculations. SVD, on the other hand, managed to be both fast and accurate. The performance gap between the best and worst algorithms wasn't huge (RMSE ranging from 0.937 to 0.978), but when you're dealing with movie recommendations, these small improvements can make a real difference. Based on these results, SVD looks like the way to go if you want a good balance of accuracy and speed.
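
To see where the similarity cost comes from, here is a rough pure-Python sketch of a mean squared difference (MSD) similarity, the measure named in the KNNBasic log line above. It assumes user-to-user similarity and the form 1 / (MSD + 1) computed over co-rated items. The toy ratings and the helper name `msd_similarity` are hypothetical, and this is an illustration rather than Surprise's actual implementation.

```python
from itertools import combinations

# Hypothetical toy ratings: user -> {item: rating}; not taken from MovieLens.
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 3},
    "u3": {"b": 1, "c": 5},
}

def msd_similarity(r_u, r_v):
    """Return 1 / (MSD + 1) over co-rated items, or 0.0 if there are none."""
    common = r_u.keys() & r_v.keys()
    if not common:
        return 0.0
    msd = sum((r_u[i] - r_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

# One similarity per unordered user pair: n * (n - 1) / 2 computations.
similarities = {
    (u, v): msd_similarity(ratings[u], ratings[v])
    for u, v in combinations(sorted(ratings), 2)
}
n_pairs = len(similarities)
```

With three users this is only three pairs, but the pair count grows quadratically: 943 users give 943 × 942 / 2 = 444,153 pairs, which is consistent with KNNBasic spending noticeable time before it makes any prediction.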