### Building a Recommender System with Surprise

This try-it explores additional algorithms in the `Surprise` library for generating recommendations. Your goal is to identify the best algorithm by minimizing prediction error under cross-validation; the code in this notebook reports root mean squared error (RMSE) and mean absolute error (MAE). You will also select a dataset from the [GroupLens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to GroupLens and examine the available datasets. Choose one that is easy to shape into the format `Surprise` expects: user, item, and rating information. Then compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms for building your recommendations. For more information on the algorithms, see the documentation for the prediction algorithms package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share your findings with your peers, including the cross-validation results and a basic description of your dataset.
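
The error metrics used in this exercise can be made concrete with a small worked example. The sketch below computes MSE, RMSE, and MAE by hand on four made-up ratings; the function names and numbers are illustrative assumptions and are not part of `Surprise`, which ships its own `accuracy.rmse` and `accuracy.mae`.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up ratings, for illustration only.
true_ratings = [5, 3, 4, 1]
predicted = [4.5, 3.5, 4.0, 2.0]

print(mse(true_ratings, predicted))   # 0.375
print(mae(true_ratings, predicted))   # 0.5
print(rmse(true_ratings, predicted))  # square root of 0.375
```

Because RMSE is the square root of MSE, ranking algorithms by either metric gives the same order.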



In [5]:
from surprise import Dataset, Reader, accuracy, KNNBasic, SVD, NMF, SlopeOne, CoClustering
from surprise.model_selection import train_test_split, cross_validate

import pandas as pd
import numpy as np

### The Data: Book Genome Dataset

The data is the [Book Genome Dataset](https://grouplens.org/datasets/book-genome/), which GroupLens hosts alongside the [MovieLens datasets](https://grouplens.org/datasets/movielens/). The raw data includes titles, authors, user ratings, and book-tag ratings.

In [8]:
book_ratings = pd.read_json('../data/book_dataset/raw/ratings.json', lines=True)

In [9]:
book_ratings.shape

(5152656, 3)
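
Loading about five million rows in a single call can be memory-hungry. As a hedged alternative, `pd.read_json` accepts `chunksize` together with `lines=True` and then yields DataFrames piece by piece. The sketch below uses a three-line in-memory stand-in rather than the real file, and the chunk size and `int8` downcast are assumptions to adapt.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for ratings.json (one JSON object per line).
sample_jsonl = io.StringIO(
    '{"item_id": 41335427, "user_id": 0, "rating": 5}\n'
    '{"item_id": 41335427, "user_id": 1, "rating": 3}\n'
    '{"item_id": 41335427, "user_id": 2, "rating": 5}\n'
)

chunks = []
# With lines=True and chunksize set, read_json returns an iterator of DataFrames.
with pd.read_json(sample_jsonl, lines=True, chunksize=2) as reader:
    for chunk in reader:
        # Small integer ratings fit in int8, which shrinks the column.
        chunk["rating"] = chunk["rating"].astype("int8")
        chunks.append(chunk)

ratings = pd.concat(chunks, ignore_index=True)
print(ratings.dtypes)
```

For the real file, the `io.StringIO` object would be replaced by the path passed to `pd.read_json` in the cell above.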

In [10]:
book_ratings.head()

| | item_id | user_id | rating |
|---|---|---|---|
| 0 | 41335427 | 0 | 5 |
| 1 | 41335427 | 1 | 3 |
| 2 | 41335427 | 2 | 5 |
| 3 | 41335427 | 3 | 5 |
| 4 | 41335427 | 4 | 5 |


In [11]:
book_ratings.to_csv('../data/book_dataset/raw/ratings.csv', index=False)

In [12]:
book_title = pd.read_json('../data/book_dataset/raw/metadata.json', lines=True)

In [13]:
book_title.shape

(9374, 8)

In [14]:
book_title.head()

| | item_id | url | title | authors | lang | img | year | description |
|---|---|---|---|---|---|---|---|---|
| 0 | 16827462 | https://www.goodreads.com/book/show/11870085-t... | The Fault in Our Stars | John Green | eng | https://images.gr-assets.com/books/1360206420m... | 2012 | There is an alternate cover edition .\n"I fel... |
| 1 | 2792775 | https://www.goodreads.com/book/show/2767052-th... | The Hunger Games (The Hunger Games, #1) | Suzanne Collins | eng | https://images.gr-assets.com/books/1447303603m... | 2008 | Winning will make you famous.\nLosing means ce... |
| 2 | 8812783 | https://www.goodreads.com/book/show/7260188-mo... | Mockingjay (The Hunger Games, #3) | Suzanne Collins | eng | https://images.gr-assets.com/books/1358275419m... | 2010 | My name is Katniss Everdeen.\nWhy am I not dea... |
| 3 | 41107568 | https://www.goodreads.com/book/show/22557272-t... | The Girl on the Train | Paula Hawkins | eng | https://images.gr-assets.com/books/1490903702m... | 2015 | Every day the same\nRachel takes the same comm... |
| 4 | 6171458 | https://www.goodreads.com/book/show/6148028-ca... | Catching Fire (The Hunger Games, #2) | Suzanne Collins | eng | https://images.gr-assets.com/books/1358273780m... | 2009 | Sparks are igniting.\nFlames are spreading.\nA... |


In [15]:
# Merge the DataFrames on the 'item_id' column
merged_df = pd.merge(book_ratings, book_title[['item_id', 'title']], on='item_id', how='left')

In [16]:
merged_df.head()

| | item_id | user_id | rating | title |
|---|---|---|---|---|
| 0 | 41335427 | 0 | 5 | Harry Potter and the Half-Blood Prince (Harry ... |
| 1 | 41335427 | 1 | 3 | Harry Potter and the Half-Blood Prince (Harry ... |
| 2 | 41335427 | 2 | 5 | Harry Potter and the Half-Blood Prince (Harry ... |
| 3 | 41335427 | 3 | 5 | Harry Potter and the Half-Blood Prince (Harry ... |
| 4 | 41335427 | 4 | 5 | Harry Potter and the Half-Blood Prince (Harry ... |


In [17]:
merged_df.rating.unique()

array([5, 3, 4, 2, 1])

In [18]:
merged_df.shape

(5152656, 4)
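
The row count is the same before and after the merge (5,152,656), which suggests the left merge did not duplicate any ratings. A sketch of how to check this explicitly, and to count ratings that found no matching title, is shown below on toy frames. The data and the names `ratings`, `titles`, and `n_missing` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for book_ratings and book_title (illustrative data only).
ratings = pd.DataFrame(
    {"item_id": [1, 1, 2, 3], "user_id": [0, 1, 0, 2], "rating": [5, 3, 4, 2]}
)
titles = pd.DataFrame({"item_id": [1, 2], "title": ["Book A", "Book B"]})

# validate="many_to_one" raises a MergeError if item_id is duplicated in
# `titles`; indicator=True adds a `_merge` column showing where each row matched.
merged = pd.merge(
    ratings, titles, on="item_id", how="left",
    validate="many_to_one", indicator=True,
)

# Rows marked "left_only" had no title in the metadata, so their title is NaN.
n_missing = int((merged["_merge"] == "left_only").sum())
print(f"{n_missing} of {len(merged)} ratings have no matching title")
```

In this toy data, the rating for `item_id` 3 has no metadata entry, so it is the one row reported as missing a title.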

In [19]:
merged_df.to_csv('../data/book_ratings.csv', index=False)

### Load the dataset

In [21]:
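# Define the Reader and load the dataset.
# Surprise expects exactly three columns in this order: user, item, rating.
# Here `title` serves as the item ID, so any distinct item_ids that share a
# title are treated as a single item.
a = merged_df[['user_id', 'title', 'rating']]
reader = Reader(rating_scale=(1, 5))
sf = Dataset.load_from_df(a, reader)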
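
With more than five million ratings, cross-validating every algorithm on the full data can be slow, and `KNNBasic` in its default user-based mode builds a dense user-by-user similarity matrix whose size grows quadratically with the number of users. One possible workaround, sketched here as an assumption rather than as part of the original analysis, is to benchmark on a random sample of users. The helper `sample_users` and its parameters are hypothetical.

```python
import pandas as pd

def sample_users(ratings, n_users, min_ratings=1, seed=42):
    """Keep all ratings from a random sample of sufficiently active users."""
    counts = ratings["user_id"].value_counts()
    eligible = counts[counts >= min_ratings].index.to_series()
    n = min(n_users, len(eligible))
    chosen = eligible.sample(n=n, random_state=seed)
    return ratings[ratings["user_id"].isin(chosen)].reset_index(drop=True)

# Toy data for illustration: user 0 has 3 ratings, user 1 has 1, user 2 has 2.
toy = pd.DataFrame(
    {
        "user_id": [0, 0, 0, 1, 2, 2],
        "title": ["A", "B", "C", "A", "B", "C"],
        "rating": [5, 4, 3, 2, 4, 5],
    }
)

subset = sample_users(toy, n_users=1, min_ratings=2)
print(subset)
```

The sampled frame has the same three columns, so it can be passed to `Dataset.load_from_df` in place of `a`. Results on a sample are only an approximation of performance on the full dataset.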

In [22]:
# List of algorithms to evaluate
models = {
    'SVD': SVD(),
    'NMF': NMF(),
    'SlopeOne': SlopeOne(),
    'CoClustering': CoClustering(),
    'KNNBasic': KNNBasic()
}

In [None]:
# Dictionary to store results
results = {}

# Evaluate each algorithm
for name, model in models.items():
    print(f"Cross-validating {name}...")
    cv_results = cross_validate(model, sf, measures=['RMSE', 'MAE'], cv=5, verbose=True)
    results[name] = {
        'RMSE Mean': cv_results['test_rmse'].mean(),
        'RMSE Std': cv_results['test_rmse'].std(),
        'MAE Mean': cv_results['test_mae'].mean(),
        'MAE Std': cv_results['test_mae'].std()
    }

Cross-validating SVD...
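
For readers new to cross-validation, `cv=5` means the ratings are split into five folds; each fold serves once as the test set while the other four are used for training, and the per-fold scores are then averaged. The pure-Python sketch below illustrates that splitting idea. It is not how `Surprise` implements it internally, and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, round-robin.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# With 10 samples and 5 folds, every split trains on 8 and tests on 2.
for train_idx, test_idx in k_fold_indices(10, n_splits=5):
    print(len(train_idx), len(test_idx))
```

Every sample lands in exactly one test fold, so each rating is used for evaluation once across the five runs.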


In [None]:
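# Summarize the cross-validation results: one row per algorithm,
# one column per metric (pandas is already imported as pd above).
results_df = pd.DataFrame(results).T
print(results_df)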
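
Once the table is filled in, the best algorithm can be picked by sorting on mean RMSE. The sketch below shows the idea on a dictionary with the same layout as `results`. The model names and numbers are placeholders invented for illustration, not measured results from this notebook.

```python
import pandas as pd

# Placeholder values only: same structure as `results`, NOT real measurements.
placeholder_results = {
    "ModelA": {"RMSE Mean": 0.95, "RMSE Std": 0.01, "MAE Mean": 0.75, "MAE Std": 0.01},
    "ModelB": {"RMSE Mean": 0.90, "RMSE Std": 0.02, "MAE Mean": 0.70, "MAE Std": 0.01},
    "ModelC": {"RMSE Mean": 1.05, "RMSE Std": 0.01, "MAE Mean": 0.82, "MAE Std": 0.02},
}

summary = pd.DataFrame(placeholder_results).T

# Lower RMSE is better, so sort ascending and take the first row.
ranked = summary.sort_values("RMSE Mean")
best = ranked.index[0]
print(ranked)
print(f"Lowest mean RMSE: {best}")
```

When two algorithms are within about one standard deviation of each other, the difference may not be meaningful, so the `RMSE Std` column is worth reading alongside the mean.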