### Recommendation System

1. Downsampled to about 32k rows using about 5k unique books (this was the practical limit for processing with the `Surprise` algorithms)
2. Trained several baseline models and evaluated using the MAE and RMSE metrics
3. Performed GridSearch to find the best hyperparameters for the top 2 baseline models
4. Chose best optimized model of the 2
5. Used best model to get top-n recommendations for user 
6. Defined helper functions to make my life easier
7. Imported metadata to get info on top-5 recommended books for a random user

Final output was a dataframe containing the title, price, description, book ID and predicted rating for a random user.
Happy with results.


### Next Steps

* Expose the recommender as an API endpoint, possibly in a Flask application, to provide a web interface for trying out the recommendation system.





<hr>

Import libraries

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, FloatType, StringType, LongType
from pyspark.sql.functions import col

from surprise import Reader, Dataset, SVD, NMF, accuracy
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise import KNNBasic, KNNBaseline, KNNWithMeans
from surprise.prediction_algorithms.slope_one import SlopeOne

import sys
sys.path.append("..")
from utils import *

Create SparkSession

In [2]:
spark = SparkSession.builder.master("local[6]").appName('Book Ratings')\
                            .config('spark.executor.memory', '8g')\
                            .config('spark.driver.memory', '4g')\
                            .getOrCreate()
spark



Import Dataset into Spark DF

In [3]:
# Get underlying Spark Context
sc = spark.sparkContext

In [4]:
# define new schema
schema = StructType()\
        .add('bookID', IntegerType(), nullable=False)\
        .add('userID', StringType(), nullable=False)\
        .add('rating', FloatType(), nullable=False)\
        .add('timestamp', LongType(), nullable=False)

In [5]:
# Import data into PySpark DF
books = spark.read.format('csv').schema(schema).load('../data/Books.csv')
books.head()

Row(bookID=1713353, userID='A1C6M8LCIX4M6M', rating=5.0, timestamp=1123804800)

In [6]:
# Verify data types
books.printSchema()

root
 |-- bookID: integer (nullable = true)
 |-- userID: string (nullable = true)
 |-- rating: float (nullable = true)
 |-- timestamp: long (nullable = true)



The data is returned in the format `(int(bookID), str(userID), float(rating), long(timestamp))`.

To parse into a `PySpark` `Rating` object, it is expected to be in the format `(int(userID), int(bookID), float(rating), long(timestamp))`.
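As a rough illustration of the reshaping described above, the sketch below maps string user IDs to integers and reorders each row. The helper name `build_user_index` is hypothetical (it is not part of this notebook or of PySpark), and the rows are copied from outputs shown in this notebook.

```python
# Hypothetical helper (not from this notebook): map each distinct string
# user ID to a dense integer index, in order of first appearance.
def build_user_index(user_ids):
    index = {}
    for uid in user_ids:
        if uid not in index:
            index[uid] = len(index)
    return index


# Rows in the file's (bookID, userID, rating, timestamp) order,
# taken from outputs shown in this notebook.
rows = [
    (1713353, "A1C6M8LCIX4M6M", 5.0, 1123804800),
    (1472933, "A3V9X42L3AI67I", 5.0, 1360627200),
    (1472933, "AG7E2YAPMFJVL", 3.0, 1358640000),
]

user_index = build_user_index(uid for _, uid, _, _ in rows)

# Reorder to (int userID, int bookID, rating, timestamp).
reordered = [(user_index[uid], book, rating, ts) for book, uid, rating, ts in rows]
print(reordered)
```

Keeping the index dictionary around makes it possible to translate integer IDs back to the original user IDs later.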

<hr>

### Check for `NoneType` Values

In [7]:
# Check for None values in the DF
none_count = books.filter(
    (col("bookID").isNull()) |
    (col("userID").isNull()) |
    (col("rating").isNull()) |
    (col("timestamp").isNull())
).count()
has_none_values = none_count > 0

# Print the result
if has_none_values:
    print(f"The DF has at least {none_count} rows with a None value.")
else:
    print("The DF does not have any rows with None values.")



The DF has at least 6671993 rows with a None value.


                                                                                

In [8]:
# Verify what rows with None values look like
none_df = books.filter(
    (col('bookID').isNull()) |
    (col('userID').isNull()) |
    (col('rating').isNull()) |
    (col('timestamp').isNull())
)
none_df.take(10)

[Row(bookID=None, userID='ASS457AQPDIFZ', rating=5.0, timestamp=1409443200),
 Row(bookID=None, userID='A3NMH1KTLG7CWX', rating=5.0, timestamp=1398816000),
 Row(bookID=None, userID='A2LI5026JCXQBA', rating=4.0, timestamp=1398729600),
 Row(bookID=None, userID='AHNMXYVRDN1R9', rating=5.0, timestamp=1394323200),
 Row(bookID=None, userID='A2CAVTNQA2Y3IJ', rating=5.0, timestamp=1384560000),
 Row(bookID=None, userID='A2685NTFXLJJ1T', rating=5.0, timestamp=1377475200),
 Row(bookID=None, userID='A17TBLPM7H401J', rating=4.0, timestamp=1374364800),
 Row(bookID=None, userID='A1840OJGNFSBSN', rating=5.0, timestamp=900460800),
 Row(bookID=None, userID='A3ONKN7GMHG6K2', rating=5.0, timestamp=1400630400),
 Row(bookID=None, userID='A4LSI6PTX23BE', rating=5.0, timestamp=1400284800)]

The rows with missing values represent about 13% of the entire dataset. In the ten rows shown above, only `bookID` is null, which suggests that these IDs could not be parsed as integers under the schema defined earlier.  
Since there isn't a straightforward way to handle the missing values without affecting the recommendations, these rows are simply dropped.
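To see which column is responsible for the missing values, a per-column count is more informative than a single total. The following is a minimal plain-Python sketch of that idea on three rows copied from the outputs above; `null_counts` is a hypothetical helper, and the 44.6M figure is the approximate post-drop row count reported in this notebook.

```python
COLUMNS = ("bookID", "userID", "rating", "timestamp")

# Three rows copied from the outputs above: one complete, two with a null bookID.
rows = [
    (1713353, "A1C6M8LCIX4M6M", 5.0, 1123804800),
    (None, "ASS457AQPDIFZ", 5.0, 1409443200),
    (None, "A3NMH1KTLG7CWX", 5.0, 1398816000),
]


def null_counts(rows, columns=COLUMNS):
    """Hypothetical helper: count None values separately for each column."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            if value is None:
                counts[name] += 1
    return counts


counts = null_counts(rows)
print(counts)

# Share of rows with a missing value, using the counts reported in this notebook
# (6,671,993 rows with nulls; roughly 44.6M rows remaining after the drop).
rows_with_nulls = 6_671_993
rows_remaining = 44_600_000  # approximate
share = rows_with_nulls / (rows_with_nulls + rows_remaining)
print(f"{share:.1%}")
```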

In [9]:
# Drop missing values
books = books.dropna()

# Verify count
# books.count()

After dropping rows with missing values, about 44.6M rows of data remain.

<hr>

### Subset Dataset Based on Book IDs
Due to the large size of the dataset and processing issues, the dataset will be downsampled by book, with a target of about 100,000 rows.

**Rationale**:
Since each book on average received about 17 distinct ratings, downsampling the dataset to about 100,000 rows will require about 5000 unique books.  
This is still a large dataset and should be able to return decent results for our recommendation engine.

Convert DF to RDD for Sampling Book IDs

In [10]:
# Convert DF to RDD 
books_rdd = books.rdd.map(list)
books_rdd.first()

                                                                                

[1713353, 'A1C6M8LCIX4M6M', 5.0, 1123804800]

In [11]:
# Get all book IDs
book_ids = books_rdd.map(lambda x: x[0])

# Verify result
book_ids.first()



1713353

Get all unique bookIDs for mapping.

In [12]:
# Get all unique book IDs
book_ids = book_ids.distinct()

# Count total book titles 
total_books = book_ids.count()

print(f"There are {total_books} total books in the dataset.")







There are 2266596 total books in the dataset.


                                                                                

In [13]:
# Get a Random Sample of 5k books from list of unique books IDs
book_ids_5k = book_ids.sample(withReplacement=False, fraction=0.0008, seed=42)

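In [1]:
# Note: this cell appears to have been run out of order (execution count 1),
# before book_ids_5k was defined, which would explain the NameError below
book_ids_5k.count()

NameError: name 'book_ids_5k' is not defined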

In [14]:
# Collect the sampled IDs to the driver (an action), then broadcast them as a set for use in the filter
broadcasted_book_ids = sc.broadcast(set(book_ids_5k.collect()))

In [15]:
books_5k = books_rdd.filter(lambda x: x[0] in broadcasted_book_ids.value)

In [16]:
books_5k.count()

                                                                                

32515

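After filtering, the dataset has been reduced to just over 32k rows, downsampled by book ID rather than by user ID. Note that `fraction=0.0008` of the 2,266,596 unique book IDs gives an expected sample of roughly 1,800 books rather than 5,000, which is consistent with the 32k rows obtained.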
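The sample size can be sanity-checked with simple arithmetic. This sketch uses only numbers reported in this notebook (total unique books, sampling fraction, filtered row count, and the stated average of about 17 ratings per book); the variable names are illustrative.

```python
total_books = 2_266_596      # unique book IDs reported above
fraction = 0.0008            # sampling fraction passed to sample()
rows_after_filter = 32_515   # row count after filtering

# Expected number of sampled books (sample() draws each ID independently,
# so the actual count varies slightly around this value).
expected_books = total_books * fraction

# Implied average number of ratings per sampled book.
ratings_per_book = rows_after_filter / expected_books

# Books needed to reach the 100,000-row target at about 17 ratings per book.
books_for_target = 100_000 / 17

print(round(expected_books), round(ratings_per_book, 1), round(books_for_target))
```

Under these numbers, reaching the 100,000-row target would need a fraction of roughly 0.0026 instead of 0.0008.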

In [17]:
# Convert first to Spark DF
books_5k_df = books_5k.toDF()

# Then convert to Pandas DF for use with the Surprise library
books_pandas_df = books_5k_df.toPandas()

                                                                                

In [18]:
books_pandas_df.head()

| | _1 | _2 | _3 | _4 |
|---|---|---|---|---|
| 0 | 1472933 | A3V9X42L3AI67I | 5.0 | 1360627200 |
| 1 | 1472933 | AG7E2YAPMFJVL | 3.0 | 1358640000 |
| 2 | 1472933 | ARFK2WYVYT1QQ | 5.0 | 1347321600 |
| 3 | 1472933 | AO93WR5UAFA6F | 5.0 | 1247788800 |
| 4 | 1472933 | A1O5WPLJSI7Y1B | 5.0 | 1233532800 |


### Build Recommender System with Python's Surprise Library

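The data is returned in the format `(bookID, userID, rating, timestamp)`.  
To load it into a `Surprise` `Dataset`, `Dataset.load_from_df` expects exactly three columns in the order `(userID, bookID, rating)`, so the columns are reordered here and the timestamp is left out at load time.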

In [19]:
# Provide column headings to dataframe
books_5k = books_pandas_df.set_axis(['bookID', 'userID', 'rating', 'timestamp'], axis=1)

In [20]:
# Reorder columns to (userID, bookID, rating, timestamp); ratings are out of 5 stars
books_5k = books_5k[['userID', 'bookID', 'rating', 'timestamp']]

In [21]:
type(books_5k)

pandas.core.frame.DataFrame

Now the pandas dataframe is in a format that Surprise's algorithms can use.

In [22]:
# Load the data into a Surprise Dataset
reader = Reader(rating_scale=(1.0, 5.0))
data_5k = Dataset.load_from_df(books_5k[['userID', 'bookID', 'rating']], reader=reader)

### Baseline Models

In [23]:
# Perform train test split
trainset_5k, testset_5k = train_test_split(data_5k, test_size=0.2)

In [24]:
# Configure sim options for KNN based models
sim_options = {
    'name': 'msd',
    'min_support': 15,
    'user_based': False
}
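For intuition about these options, the sketch below computes a mean-squared-difference (MSD) similarity between two items in plain Python, in the `1 / (msd + 1)` form documented by Surprise, with `min_support` acting as the minimum number of common users. It is a simplified illustration and not the library's implementation; the function name and the toy ratings are hypothetical.

```python
def msd_similarity(ratings_a, ratings_b, min_support=1):
    """Item-item MSD similarity over the users who rated both items.

    ratings_a / ratings_b map user ID -> rating for one item each.
    Returns 0.0 when fewer than min_support users rated both items.
    """
    common_users = ratings_a.keys() & ratings_b.keys()
    if not common_users or len(common_users) < min_support:
        return 0.0
    msd = sum((ratings_a[u] - ratings_b[u]) ** 2 for u in common_users) / len(common_users)
    return 1.0 / (msd + 1.0)


# Hypothetical toy ratings for two books (user IDs and values are made up).
book_x = {"u1": 5.0, "u2": 3.0, "u3": 4.0}
book_y = {"u1": 4.0, "u2": 3.0, "u4": 2.0}

sim = msd_similarity(book_x, book_y)                     # two common users
sim_strict = msd_similarity(book_x, book_y, min_support=15)
print(sim, sim_strict)
```

With a sparse sample, a `min_support` of 15 can zero out most item pairs, which is one reason to include smaller values in the later grid search.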

In [25]:
def predict_model(evaluator, *args, trainset=trainset_5k, testset=testset_5k):
    '''Instantiate, fit and test a model; return its name, RMSE and MAE.'''
    # Get name of model class
    name = evaluator.__name__

    # Instantiate model (KNN-based models take sim_options as the first extra argument)
    if "KNN" in name:
        model = evaluator(sim_options=args[0])
    else:
        model = evaluator()

    # Fit trainset
    model.fit(trainset)

    # Predict on testset
    preds = model.test(testset)

    # Compute RMSE
    rmse = accuracy.rmse(preds)

    # Compute MAE
    mae = accuracy.mae(preds)

    return {'Model name': name, "RMSE": rmse, "MAE": mae }

In [26]:
models = [SVD, NMF, SlopeOne, KNNBasic, KNNWithMeans, KNNBaseline]

results = []
for model in models:
    if 'KNN' in str(model):
        res = predict_model(model, sim_options)
    else:
        res = predict_model(model)

    results.append(res)

RMSE: 0.9671
MAE:  0.7124
RMSE: 1.0217
MAE:  0.7863
RMSE: 1.0195
MAE:  0.7800
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0173
MAE:  0.7812
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0157
MAE:  0.7794
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9667
MAE:  0.7134


In [27]:
# Get Results
sorted(results, key=lambda x: x['MAE'])

[{'Model name': 'SVD', 'RMSE': 0.9670625998876182, 'MAE': 0.7123506015288347},
 {'Model name': 'KNNBaseline',
  'RMSE': 0.9667313439494886,
  'MAE': 0.7133589810455935},
 {'Model name': 'KNNWithMeans',
  'RMSE': 1.0156516846043766,
  'MAE': 0.7794415679580358},
 {'Model name': 'SlopeOne',
  'RMSE': 1.019549124127825,
  'MAE': 0.7800115359663153},
 {'Model name': 'KNNBasic',
  'RMSE': 1.0172666608296024,
  'MAE': 0.7812384158730227},
 {'Model name': 'NMF', 'RMSE': 1.0217235718072273, 'MAE': 0.7862632223565881}]

Of the baseline models tested, SVD has the smallest MAE (0.712), followed closely by KNNBaseline (0.713), which has a marginally lower RMSE (0.9667 vs. 0.9671). Both models will now be tuned for optimized results.
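For reference, the two metrics used to rank the models can be written in a few lines of plain Python. This is a sketch of the standard definitions, not Surprise's `accuracy` module, and the (true, predicted) pairs are made-up values for illustration.

```python
import math


def mae(pairs):
    """Mean absolute error over (true_rating, predicted_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


# Hypothetical (true, predicted) rating pairs.
pairs = [(5.0, 4.5), (3.0, 4.0), (4.0, 4.0), (1.0, 3.0)]

print(mae(pairs), rmse(pairs))
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE. This is why two models can swap places depending on which metric is used for sorting.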

### Perform GridSearch CV to find Best Params

#### Tune Parameters for SVD Model

In [28]:
# # Define parameter grid
# param_grid = {
#     "n_epochs": [50, 150, 200, 250],
#     "lr_all": [0.002, 0.005, 0.01, 0.1],
#     "reg_all": [0.1, 0.4, 0.6]}
# gs = GridSearchCV(SVD, param_grid, measures=["mae","rmse"], cv=5)

# # fit on the full dataset (cross-validation splits it internally)
# gs.fit(data_5k)

# # best MAE and RMSE scores
# print(gs.best_score["mae"])
# print(gs.best_score["rmse"])

# # combination of parameters that gave the best MAE score
# print(gs.best_params["mae"])

In [29]:
# Instantiate SVD with the chosen (best) hyperparameters
tuned_model = SVD(n_epochs=200, lr_all=0.002, reg_all=0.1)

In [30]:
# Fit, Predict and Evaluate Best SVD Model
tuned_model.fit(trainset_5k)

svd_preds = tuned_model.test(testset_5k)

accuracy.mae(svd_preds)
accuracy.rmse(svd_preds)


MAE:  0.7102
RMSE: 0.9653


0.9653439229707829

In [31]:
# Run 5-fold cross-validation and print results
cross_validate(tuned_model, data_5k, measures=["MAE", "RMSE"], cv=5, verbose=True)

Evaluating MAE, RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MAE (testset)     0.7092  0.7122  0.7151  0.7086  0.7062  0.7103  0.0031  
RMSE (testset)    0.9679  0.9692  0.9702  0.9657  0.9546  0.9655  0.0057  
Fit time          1.61    1.63    1.71    1.74    1.69    1.67    0.05    
Test time         0.01    0.02    0.01    0.01    0.02    0.01    0.00    


{'test_mae': array([0.70919994, 0.71219496, 0.71511503, 0.70861717, 0.7061788 ]),
 'test_rmse': array([0.96793465, 0.96922687, 0.97018171, 0.96573804, 0.95462441]),
 'fit_time': (1.6094129085540771,
  1.6321840286254883,
  1.7069380283355713,
  1.737342119216919,
  1.687748908996582),
 'test_time': (0.014159202575683594,
  0.015156984329223633,
  0.014237165451049805,
  0.013879060745239258,
  0.01507711410522461)}

#### Tune Parameters for KNN Baseline Model

In [32]:
param_grid = {
    'bsl_options': {
        'method': ['als', 'sgd'],
        'reg': [1, 2],
    },
    'k': [3, 5, 10, 30],
    'sim_options': {
        'name': ['msd', 'cosine'],
        'min_support': [1, 5, 15],
        'user_based': [False],
    },
}
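Before running a grid search it can help to estimate how many model fits it implies. The sketch below counts parameter combinations for the two grids in this notebook, assuming every combination is tried and that nested dictionaries such as `sim_options` and `bsl_options` are expanded as well. `count_combinations` is a hypothetical helper, not a Surprise function.

```python
def count_combinations(grid):
    """Count parameter combinations, recursing into nested option dicts."""
    total = 1
    for value in grid.values():
        if isinstance(value, dict):
            total *= count_combinations(value)
        else:
            total *= len(value)
    return total


# Grids copied from this notebook.
svd_grid = {
    "n_epochs": [50, 150, 200, 250],
    "lr_all": [0.002, 0.005, 0.01, 0.1],
    "reg_all": [0.1, 0.4, 0.6],
}
knn_grid = {
    "bsl_options": {"method": ["als", "sgd"], "reg": [1, 2]},
    "k": [3, 5, 10, 30],
    "sim_options": {
        "name": ["msd", "cosine"],
        "min_support": [1, 5, 15],
        "user_based": [False],
    },
}

cv_folds = 5
svd_fits = count_combinations(svd_grid) * cv_folds
knn_fits = count_combinations(knn_grid) * cv_folds
print(svd_fits, knn_fits)
```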

In [33]:
# gs = GridSearchCV(KNNBaseline, param_grid, measures=["mae", "rmse"], cv=5)

# gs.fit(data_5k)

# # best metric scores
# print(gs.best_score["mae"])
# print(gs.best_score["rmse"])

# # combination of parameters that gave the best MAE score
# print(gs.best_params["mae"])

In [34]:
# Train best KNN model
bsl_options = {'method': 'als', 'reg': 1}
sim_options = {'name': 'msd', 'min_support': 5, 'user_based': False}

tuned_knn = KNNBaseline(k=5, bsl_options=bsl_options, sim_options=sim_options)

tuned_knn.fit(trainset=trainset_5k)

# Predict
knn_preds = tuned_knn.test(testset=testset_5k)

# Evaluate
accuracy.mae(knn_preds)
accuracy.rmse(knn_preds)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
MAE:  0.7133
RMSE: 0.9668


0.966761509388261

In [35]:
# Run 5-fold cross-validation and print results
cross_validate(tuned_knn, data_5k, measures=["MAE", "RMSE"], cv=5, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating MAE, RMSE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MAE (testset)     0.7261  0.7020  0.7163  0.7131  0.7124  0.7140  0.0077  
RMSE (testset)    0.9903  0.9545  0.9714  0.9622  0.9574  0.9671  0.0129  
Fit time          0.07    0.06    0.07    0.06    0.07    0.07    0.00    
Test time         0.01    0.02    0.01    0.01    0.01    0.01    0.00    


{'test_mae': array([0.7261041 , 0.70201646, 0.71627736, 0.7130898 , 0.71239385]),
 'test_rmse': array([0.9903402 , 0.95448092, 0.97138321, 0.9621818 , 0.95736333]),
 'fit_time': (0.06772494316101074,
  0.05996298789978027,
  0.07012081146240234,
  0.06389975547790527,
  0.0705568790435791),
 'test_time': (0.01400899887084961,
  0.015027046203613281,
  0.01474905014038086,
  0.014976024627685547,
  0.013681888580322266)}

### Choose Best Model
Based on the returned metrics, the tuned SVD model outperforms the tuned KNNBaseline model, though only slightly, and cross-validation confirms this.

In 5-fold cross-validation, the tuned SVD model had a mean MAE of 0.7103 and a mean RMSE of 0.9655, while the tuned KNNBaseline model had a mean MAE of 0.7140 and a mean RMSE of 0.9671.

### Get Top-N Recommendations for Users

In [37]:
# Refit the tuned SVD model on the full dataset
train = data_5k.build_full_trainset()
tuned_model.fit(train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x13f9ee870>

In [38]:
testset = train.build_anti_testset()

In [39]:
# predict ratings for all pairs (u, i) that are NOT in the training set.
predictions = tuned_model.test(testset)
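The anti-testset used above is the set of all (user, item) pairs with no known rating, each filled with a placeholder rating (by default the global mean in Surprise). The plain-Python sketch below illustrates the idea on made-up data. It is a simplified stand-in for `build_anti_testset`, and all names and values in it are hypothetical.

```python
def anti_testset(ratings, fill):
    """Return every (user, item, fill) triple whose pair is absent from ratings."""
    users = sorted({user for user, _, _ in ratings})
    items = sorted({item for _, item, _ in ratings})
    seen = {(user, item) for user, item, _ in ratings}
    return [(u, i, fill) for u in users for i in items if (u, i) not in seen]


# Hypothetical known ratings: (user, item, rating).
ratings = [("u1", "b1", 5.0), ("u1", "b2", 3.0), ("u2", "b1", 4.0)]

global_mean = sum(r for _, _, r in ratings) / len(ratings)
unseen = anti_testset(ratings, fill=global_mean)
print(unseen)
```

The result has `n_users * n_items - n_ratings` entries, so it grows quickly with the size of the dataset.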

In [40]:
# Get top 5 recommendations for all users
top_n = get_top_n(predictions=predictions, n=5)
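`get_top_n` is imported from the project's `utils` module, which is not shown here. The sketch below is one plausible shape for such a helper, working on simplified `(user, item, estimate)` tuples rather than Surprise `Prediction` objects. The name `top_n_sketch` and the sample predictions are hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_sketch(predictions, n=5):
    """Group predictions by user and keep the n highest estimated ratings."""
    by_user = defaultdict(list)
    for user, item, estimate in predictions:
        by_user[user].append((item, estimate))
    # nlargest returns the items sorted from highest to lowest estimate.
    return {
        user: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for user, items in by_user.items()
    }


# Hypothetical predictions: (user, item, estimated rating).
predictions = [
    ("u1", "b1", 4.2),
    ("u1", "b2", 4.9),
    ("u1", "b3", 3.1),
    ("u2", "b1", 2.5),
    ("u2", "b3", 4.0),
]

top = top_n_sketch(predictions, n=2)
print(top["u1"])
```

The returned value maps each user to a list of `(item, estimate)` pairs, which matches the shape of the `top_n` lookup shown in this notebook.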

In [42]:
# TEST: Get Top 5 Recommendations from a random user by looking it up in top_n
top_n[get_random_user(trainset=books_5k)]

[(143111639, 5.0),
 (451214471, 5.0),
 (547905580, 5.0),
 (735291551, 5.0),
 (736927964, 5.0)]

### Import Metadata to Get Recommended Book Titles

In [44]:
spark_df = spark.read.json('../data/books_meta.gz')



In [51]:
def get_recs_and_ratings(all_user_recs, trainset) -> dict:
    '''Get the top-n recommendations for a random user as a dict of book IDs and predicted ratings.'''

    # Get a random user and their recommendations and ratings
    rand_user = get_random_user(trainset)
    recs = all_user_recs[rand_user]

    # Split the (book ID, predicted rating) pairs into two lists
    book_ids = [rec[0] for rec in recs]
    ratings = [rec[1] for rec in recs]

    pairs = {
        'books': book_ids,
        'predicted_ratings': ratings
    }

    return pairs

In [53]:
# Get the recommended book IDs and predicted ratings for a random user
recs_and_ratings = get_recs_and_ratings(all_user_recs=top_n, trainset=books_5k)
recs_and_ratings

{'books': [1514491877, 1849633436, 1420928244, 1543461468, 807013153],
 'predicted_ratings': [5.0, 5.0, 5.0, 5.0, 4.991933454076072]}

In [54]:
# Get metadata for user recs
recommendations_metadata = get_metadata(user_recs=recs_and_ratings, spark_session=spark)

                                                                                

In [55]:
recommendations_metadata

| | asin | title | price | description |
|---|---|---|---|---|
| 0 | 1420928244 | Army Field Manual FM 22-100 (The U.S. Army Lea... | Price Not Available. | No Description Available. |
| 1 | 1514491877 | The Essences | $23.99 | Writing this book was a very fun process for m... |
| 2 | 1543461468 | Winning the Staffing Sales Game: The Definitiv... | $19.99 | No Description Available. |
| 3 | 1849633436 | Wolfbaene | $7.81 | My name is Michelle Dennis. I live in Ellenbro... |


In [56]:
# Convert recs and ratings to DF for Merge
recs_and_ratings_df = pd.DataFrame(recs_and_ratings)

In [57]:
recs_and_ratings_df

| | books | predicted_ratings |
|---|---|---|
| 0 | 1514491877 | 5.0 |
| 1 | 1849633436 | 5.0 |
| 2 | 1420928244 | 5.0 |
| 3 | 1543461468 | 5.0 |
| 4 | 807013153 | 4.991933 |


In [58]:
recommendations_metadata

| | asin | title | price | description |
|---|---|---|---|---|
| 0 | 1420928244 | Army Field Manual FM 22-100 (The U.S. Army Lea... | Price Not Available. | No Description Available. |
| 1 | 1514491877 | The Essences | $23.99 | Writing this book was a very fun process for m... |
| 2 | 1543461468 | Winning the Staffing Sales Game: The Definitiv... | $19.99 | No Description Available. |
| 3 | 1849633436 | Wolfbaene | $7.81 | My name is Michelle Dennis. I live in Ellenbro... |


In [61]:
# Left join from the metadata side: only books that have a metadata row are kept
top_5 = recommendations_metadata.merge(recs_and_ratings_df, left_on='asin', right_on='books', how='left')
top_5 = top_5.drop('asin', axis=1)

In [62]:
top_5

| | title | price | description | books | predicted_ratings |
|---|---|---|---|---|---|
| 0 | Army Field Manual FM 22-100 (The U.S. Army Lea... | Price Not Available. | No Description Available. | 1420928244 | 5.0 |
| 1 | The Essences | $23.99 | Writing this book was a very fun process for m... | 1514491877 | 5.0 |
| 2 | Winning the Staffing Sales Game: The Definitiv... | $19.99 | No Description Available. | 1543461468 | 5.0 |
| 3 | Wolfbaene | $7.81 | My name is Michelle Dennis. I live in Ellenbro... | 1849633436 | 5.0 |
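The final table has four rows although five books were recommended, because the merge starts from the metadata and one book has no metadata row. The plain-Python sketch below shows one way to detect such gaps, using the IDs from the outputs above; the variable names are illustrative. One possible explanation, not confirmed by this notebook, is that the nine-digit ID lost a leading zero when book IDs were read as integers.

```python
# Recommended book IDs and predicted ratings, copied from the output above.
recs = {
    "books": [1514491877, 1849633436, 1420928244, 1543461468, 807013153],
    "predicted_ratings": [5.0, 5.0, 5.0, 5.0, 4.991933454076072],
}

# IDs for which the metadata lookup returned a row (from the output above).
metadata_ids = {1420928244, 1514491877, 1543461468, 1849633436}

# Keep every recommendation and flag those without metadata,
# instead of silently dropping them.
rows = [
    {"book": book, "predicted_rating": rating, "has_metadata": book in metadata_ids}
    for book, rating in zip(recs["books"], recs["predicted_ratings"])
]

missing = [row["book"] for row in rows if not row["has_metadata"]]
print(missing)
```

In pandas, a similar effect can be had by merging with `how='right'` so that every recommended book is kept and missing metadata shows up as `NaN`.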
