### Building a Recommender system with Surprise

This try-it focuses on exploring additional algorithms with the `Surprise` library to generate recommendations. Your goal is to identify the optimal algorithm by minimizing the mean squared error using cross-validation. You are also going to select a dataset to use from the [grouplens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to grouplens and examine the different datasets available.  Choose one so that it is easy to create the data as expected in `Surprise` with user, item, and rating information.  Then, compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations.  For more information on the algorithms see the documentation for the algorithm package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation and include the results of your cross validation and a basic description of your dataset with your peers.



In [2]:
from surprise import Dataset, Reader, accuracy, KNNBasic, SVD, NMF, SlopeOne, CoClustering
from surprise.model_selection import train_test_split, cross_validate

import pandas as pd
import numpy as np

## Building a Book Recommendation System: A Collaborative Filtering Approach

### Understanding the Problem:

Given a dataset of user IDs, item IDs (books), and ratings, the goal is to predict which books a user might like based on their past preferences and the preferences of similar users.

**Collaborative Filtering:**
This is a popular technique for recommendation systems. It assumes that users who had similar tastes in the past will likely have similar tastes in the future, so a user's unknown rating for a book can be estimated from the ratings of like-minded users.
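
As a rough illustration of that assumption, the sketch below predicts a missing rating as a similarity-weighted average of other users' ratings. The user names, book names, and ratings are made up for the example, and cosine similarity over co-rated books is just one possible similarity measure, not necessarily the one a given library uses by default.

```python
from math import sqrt

# Hypothetical toy data: user -> {book: rating on a 1-5 scale}
ratings = {
    "ann": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob": {"book_a": 5, "book_b": 5, "book_c": 2, "book_d": 5},
    "cat": {"book_a": 1, "book_b": 2, "book_c": 5, "book_d": 1},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the books both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    neighbours = [
        (cosine_similarity(user, other), their_ratings[item])
        for other, their_ratings in ratings.items()
        if other != user and item in their_ratings
    ]
    total_similarity = sum(sim for sim, _ in neighbours)
    if total_similarity == 0:
        return None  # nobody comparable has rated this item
    return sum(sim * rating for sim, rating in neighbours) / total_similarity

sim_ann_bob = cosine_similarity("ann", "bob")
sim_ann_cat = cosine_similarity("ann", "cat")
prediction = predict("ann", "book_d")
print(f"ann~bob: {sim_ann_bob:.3f}, ann~cat: {sim_ann_cat:.3f}, "
      f"predicted rating: {prediction:.2f}")
```

Ann's ratings resemble Bob's more than Cat's, so the prediction for `book_d` is pulled toward Bob's rating of 5 rather than Cat's rating of 1.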

### Data Understanding

**Book Genome Dataset**

- The data comes from GroupLens.
- Dataset URL: [Book Genome Dataset](https://grouplens.org/datasets/book-genome/)
- The raw data includes book metadata (such as titles) along with item IDs, user IDs, and user ratings.

### Data Preparation
**Load the dataset**

In [4]:
book_ratings = pd.read_json('../data/ratings.json', lines=True)

In [5]:
book_ratings.shape

(5152656, 3)

In [6]:
book_ratings.head()

| | item_id | user_id | rating |
|---|---|---|---|
| 0 | 41335427 | 0 | 5 |
| 1 | 41335427 | 1 | 3 |
| 2 | 41335427 | 2 | 5 |
| 3 | 41335427 | 3 | 5 |
| 4 | 41335427 | 4 | 5 |


In [8]:
book_title = pd.read_json('../data/metadata.json', lines=True)

In [9]:
book_title.shape

(9374, 8)

In [10]:
book_title.head()

| | item_id | url | title | authors | lang | img | year | description |
|---|---|---|---|---|---|---|---|---|
| 0 | 16827462 | https://www.goodreads.com/book/show/11870085-t... | The Fault in Our Stars | John Green | eng | https://images.gr-assets.com/books/1360206420m... | 2012 | There is an alternate cover edition .\n"I fel... |
| 1 | 2792775 | https://www.goodreads.com/book/show/2767052-th... | The Hunger Games (The Hunger Games, #1) | Suzanne Collins | eng | https://images.gr-assets.com/books/1447303603m... | 2008 | Winning will make you famous.\nLosing means ce... |
| 2 | 8812783 | https://www.goodreads.com/book/show/7260188-mo... | Mockingjay (The Hunger Games, #3) | Suzanne Collins | eng | https://images.gr-assets.com/books/1358275419m... | 2010 | My name is Katniss Everdeen.\nWhy am I not dea... |
| 3 | 41107568 | https://www.goodreads.com/book/show/22557272-t... | The Girl on the Train | Paula Hawkins | eng | https://images.gr-assets.com/books/1490903702m... | 2015 | Every day the same\nRachel takes the same comm... |
| 4 | 6171458 | https://www.goodreads.com/book/show/6148028-ca... | Catching Fire (The Hunger Games, #2) | Suzanne Collins | eng | https://images.gr-assets.com/books/1358273780m... | 2009 | Sparks are igniting.\nFlames are spreading.\nA... |


In [11]:
# Merge the DataFrames on the 'item_id' column
merged_df = pd.merge(book_ratings, book_title[['item_id', 'title']], on='item_id', how='left')

In [12]:
merged_df.head()

| | item_id | user_id | rating | title |
|---|---|---|---|---|
| 0 | 41335427 | 0 | 5 | Harry Potter and the Half-Blood Prince (Harry ... |
| 1 | 41335427 | 1 | 3 | Harry Potter and the Half-Blood Prince (Harry ... |
| 2 | 41335427 | 2 | 5 | Harry Potter and the Half-Blood Prince (Harry ... |
| 3 | 41335427 | 3 | 5 | Harry Potter and the Half-Blood Prince (Harry ... |
| 4 | 41335427 | 4 | 5 | Harry Potter and the Half-Blood Prince (Harry ... |


In [13]:
merged_df.rating.unique()

array([5, 3, 4, 2, 1])

In [14]:
merged_df.shape

(5152656, 4)

#### Take a small sample of 500K records for ease of processing

In [15]:
book_500k = merged_df.sample(n=500000, random_state=42)

In [16]:
book_500k.to_csv('../data/ratings_500k.csv', index=False)

#### Load the sample into a Surprise `Dataset` (`sf`)

In [18]:
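# Define the Reader and load the dataset
# Surprise expects the columns in this order: user, item, rating
ratings_df = book_500k[['user_id', 'item_id', 'rating']]
# Ratings in this dataset range from 1 to 5
reader = Reader(rating_scale=(1, 5))
sf = Dataset.load_from_df(ratings_df, reader)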

### Modeling

#### Comparing Recommendation Algorithms: KNNBasic, SVD, NMF, SlopeOne, and CoClustering

**Understanding the Algorithms:**

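- **KNNBasic:** A simple nearest-neighbor algorithm that predicts a rating from the ratings given by the k most similar users (or, optionally, the k most similar items).
- **SVD:** A matrix factorization technique, named after Singular Value Decomposition, that represents users and items as vectors of latent factors. Surprise's `SVD` follows the approach popularized by Simon Funk during the Netflix Prize: the factors and bias terms are learned by stochastic gradient descent on the known ratings only.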
- **NMF:** Non-negative Matrix Factorization is another matrix factorization technique that decomposes the rating matrix into non-negative factors.
- **SlopeOne:** A simple algorithm that estimates the rating of an item by a user based on the average difference between ratings of that item and other items rated by the user.
- **CoClustering:** A clustering-based approach that simultaneously clusters users and items, assuming that users in the same cluster tend to prefer items in the same cluster.
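
To make the SlopeOne description concrete, here is a minimal pure-Python sketch of the plain (unweighted) Slope One scheme on made-up ratings. The user and item names are hypothetical, and this is an illustration of the idea rather than Surprise's implementation.

```python
# Hypothetical toy data: user -> {item: rating}
ratings = {
    "u1": {"a": 5, "b": 3, "c": 2},
    "u2": {"a": 3, "b": 4},
    "u3": {"b": 2, "c": 5},
}

def average_deviation(i, j):
    """Mean of (rating of i - rating of j) over users who rated both items."""
    diffs = [r[i] - r[j] for r in ratings.values() if i in r and j in r]
    return sum(diffs) / len(diffs) if diffs else None

def slope_one_predict(user, item):
    """Average, over the items j the user rated, of r_uj + dev(item, j)."""
    estimates = []
    for j, r_uj in ratings[user].items():
        dev = average_deviation(item, j)
        if dev is not None:  # skip items never co-rated with `item`
            estimates.append(r_uj + dev)
    return sum(estimates) / len(estimates) if estimates else None

# u2 has not rated "c":
#   dev(c, a) = -3 (from u1),  dev(c, b) = (-1 + 3) / 2 = 1 (from u1 and u3)
#   estimates = [3 + (-3), 4 + 1] = [0, 5]
prediction = slope_one_predict("u2", "c")
print(prediction)  # 2.5
```

Note that the prediction depends only on pairwise item differences, which is why the method needs no iterative training but does need a deviation value for every pair of co-rated items.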

**Cross-Validation:** Cross-validation for KNNBasic could not be run because of compute constraints, so it is commented out in the model list and excluded from the comparison.
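
One plausible reason KNNBasic is so demanding, offered here as an assumption rather than something this notebook measured, is that a neighborhood model needs a similarity value for every pair of users (or items). Assuming the similarities are held in a dense square matrix of 8-byte floats, the memory grows quadratically. The user counts below are hypothetical, since the number of distinct users in the 500K sample is not reported; 9,374 is the number of rows in the metadata file and is used as a rough stand-in for the number of books.

```python
def dense_similarity_bytes(n, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix (assumes 8-byte floats)."""
    return n * n * bytes_per_value

# Hypothetical user counts, plus the 9,374 books listed in the metadata file
for label, n in [("users", 10_000), ("users", 100_000), ("books", 9_374)]:
    gib = dense_similarity_bytes(n) / 2**30
    print(f"{n:>7,} {label} -> {gib:6.2f} GiB")
```

Under these assumptions, an item-based configuration (`'user_based': False` in `sim_options`) may be more tractable than a user-based one when there are far fewer books than users, though that was not tested here.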

In [19]:
# List of algorithms to evaluate

# Configure KNNBasic with fewer neighbors
# sim_options = {
#    'name': 'cosine',
#    'user_based': True
#}

models = {
   # 'KNNBasic': KNNBasic(k=5, sim_options=sim_options),
    'SVD': SVD(),
    'NMF': NMF(),
    'SlopeOne': SlopeOne(),
    'CoClustering': CoClustering()
}

In [20]:
# Dictionary to store results
results = {}

# Evaluate each algorithm
for name, model in models.items():
    print(f"Cross-validating {name}...")
    cv_results = cross_validate(model, sf, measures=['RMSE', 'MAE'], cv=5, verbose=True)
    results[name] = {
        'RMSE Mean': cv_results['test_rmse'].mean(),
        'RMSE Std': cv_results['test_rmse'].std(),
        'MAE Mean': cv_results['test_mae'].mean(),
        'MAE Std': cv_results['test_mae'].std()
    }

Cross-validating SVD...
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9780  0.9778  0.9801  0.9769  0.9761  0.9778  0.0014  
MAE (testset)     0.7747  0.7751  0.7759  0.7745  0.7732  0.7746  0.0009  
Fit time          6.83    7.28    7.17    7.46    6.92    7.13    0.23    
Test time         0.89    0.81    0.83    0.88    0.93    0.87    0.04    
Cross-validating NMF...
Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1333  1.1336  1.1376  1.1434  1.1335  1.1363  0.0039  
MAE (testset)     0.8893  0.8908  0.8940  0.8988  0.8902  0.8926  0.0035  
Fit time          18.91   20.36   20.63   20.40   18.56   19.77   0.86    
Test time         0.76    0.53    0.52    0.74    0.52    0.61    0.11    
Cross-validating SlopeOne...
Evaluating RMSE, MAE of algorithm SlopeOne on 5 split(s).

      

### Evaluation:

**Metrics:** Root mean squared error (RMSE) and mean absolute error (MAE) are used to evaluate the accuracy of the predicted ratings. Lower values are better for both.
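
For reference, both metrics can be computed by hand. The sketch below uses made-up actual and predicted ratings to show how the two differ: RMSE squares each error before averaging, so a single large miss weighs more heavily than it does in MAE.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean of squared differences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical ratings: three good predictions and one miss of 3 stars
actual = [5, 3, 4, 1]
predicted = [4.0, 3.0, 4.0, 4.0]

print(f"MAE:  {mae(actual, predicted):.4f}")   # 1.0000
print(f"RMSE: {rmse(actual, predicted):.4f}")  # 1.5811
```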

In [21]:
# Collect the per-model summaries into a table, one row per algorithm
results_df = pd.DataFrame(results).T
print(results_df)

              RMSE Mean  RMSE Std  MAE Mean   MAE Std
SVD            0.977785  0.001358  0.774649  0.000882
NMF            1.136273  0.003924  0.892613  0.003471
SlopeOne       1.180892  0.000856  0.893620  0.000816
CoClustering   1.066204  0.002902  0.808641  0.002551
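
The ranking can also be read off programmatically. The sketch below copies the mean scores from the table above into a plain dictionary (so it runs without rerunning cross-validation) and sorts the models by mean RMSE.

```python
# Mean scores copied from the results table above
results = {
    "SVD": {"RMSE Mean": 0.977785, "MAE Mean": 0.774649},
    "NMF": {"RMSE Mean": 1.136273, "MAE Mean": 0.892613},
    "SlopeOne": {"RMSE Mean": 1.180892, "MAE Mean": 0.893620},
    "CoClustering": {"RMSE Mean": 1.066204, "MAE Mean": 0.808641},
}

# Sort model names from lowest (best) to highest mean RMSE
ranking = sorted(results, key=lambda name: results[name]["RMSE Mean"])
best, runner_up = ranking[0], ranking[1]
gap = results[runner_up]["RMSE Mean"] - results[best]["RMSE Mean"]

print("Ranking by RMSE:", ranking)
print(f"{best} beats {runner_up} by {gap:.4f} RMSE")
```

Sorting by `"MAE Mean"` instead gives the same order for these four models.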


### Conclusion

Based on the evaluation metrics (RMSE and MAE), SVD is the most accurate of the four algorithms evaluated on this 500K-rating book sample. It has the lowest mean error on both metrics, ahead of CoClustering, NMF, and SlopeOne, in that order.

#### Key Findings:

- **SVD**: Best accuracy with the lowest RMSE (0.9778) and MAE (0.7746), and a moderate mean fit time of about 7.1 s per fold.
- **NMF**: Higher error rates (RMSE: 1.1363, MAE: 0.8926) with the longest fit times (about 19.8 s per fold, versus about 7.1 s for SVD) but faster test times (0.61 s versus 0.87 s).
- **SlopeOne**: Highest error rates (RMSE: 1.1809, MAE: 0.8936) with the shortest fit times.
- **CoClustering**: Better accuracy than NMF and SlopeOne (RMSE: 1.0662, MAE: 0.8086) but with longer fit times.
  
#### Further Considerations:

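- **Scalability**: Assessing how well each model scales with increasing data size. SVD and CoClustering may handle larger datasets better because their models grow with the number of users and items (latent factors or cluster assignments), whereas SlopeOne and KNNBasic rely on pairwise item or user statistics that grow quadratically.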
- **Computational Resources**: Considering the available computational resources. If resources are limited, SlopeOne’s low computational cost might be beneficial despite its lower accuracy.
- **Real-Time Performance**: Evaluating the test time and whether the model can meet real-time requirements if applicable. SlopeOne and CoClustering offer different trade-offs between accuracy and speed.
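- **Interpretability**: Some models might provide better insights into the data. For instance, CoClustering yields explicit user and item clusters that can be inspected, and the non-negative factors of NMF are often easier to read as additive parts than the signed factors of SVD.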
- **Robustness**: Examining how each model performs under various conditions or with noisy data. Consistency in error rates across folds is an indicator of robustness.
- **Future Updates**: Considering how easily each model can be updated or retrained as new data becomes available. 
- **Business Needs**: Aligning the choice of model with the specific needs and constraints of the business or application, including accuracy requirements, computational constraints, and real-time processing needs.
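
As a small follow-up on the robustness point, the fold-to-fold spread can be checked directly from the per-fold scores. The sketch below uses the RMSE values printed in the cross-validation output for SVD and NMF (the only two models whose per-fold output is shown) and computes the population standard deviation, which is what NumPy's `.std()` returns by default. Because the printed fold values are rounded to four decimals, the results match the reported standard deviations only approximately.

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the cross-validation output
fold_rmse = {
    "SVD": [0.9780, 0.9778, 0.9801, 0.9769, 0.9761],
    "NMF": [1.1333, 1.1336, 1.1376, 1.1434, 1.1335],
}

spread = {}
for name, scores in fold_rmse.items():
    spread[name] = pstdev(scores)              # population std (ddof = 0)
    relative = spread[name] / mean(scores)     # spread as a share of the mean
    print(f"{name}: mean={mean(scores):.4f} std={spread[name]:.4f} "
          f"relative spread={relative:.2%}")
```

For both models the spread is well under 1% of the mean RMSE, so the difference between SVD and NMF is far larger than the variation between folds.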