### Challenges This Week

* With the `Surprise` library, processing time was too long on the large dataset
* Tried `PySpark` to take advantage of its distributed processing:
    - mapping `str` type userIDs to `int` type userIDs was necessary
    - processing time for training/evaluating was extremely long with the ALS algorithm; using a smaller subset of the data was still too slow
    - terrible RMSE after evaluating the baseline model (RMSE: 4.719; previously, with `Surprise`, it was 0.99-1.1)
    - PySpark session stopped abruptly during cross-validation, possibly from running out of memory, even with a smaller subset of the data
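
The `str`-to-`int` mapping is needed because Spark's ALS works with numeric user and item IDs. Below is a minimal sketch of the idea in plain Python, assuming the distinct IDs fit in memory. The helper name `build_id_index` is hypothetical and is not part of PySpark or Surprise; on a Spark DataFrame, `pyspark.ml.feature.StringIndexer` is a commonly used alternative.

```python
def build_id_index(raw_ids):
    """Hypothetical helper: give each distinct string ID a dense integer index,
    in first-seen order."""
    index = {}
    for raw in raw_ids:
        if raw not in index:
            index[raw] = len(index)
    return index


# Two distinct users, one of them appearing twice
user_ids = ["A1C6M8LCIX4M6M", "ASS457AQPDIFZ", "A1C6M8LCIX4M6M"]
user_index = build_id_index(user_ids)
encoded = [user_index[u] for u in user_ids]
print(encoded)  # [0, 1, 0]
```

Keeping the dictionary around makes it possible to translate integer predictions back to the original user IDs.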

### Next Steps

* Return to the `Surprise` library for modeling, since it does not require mapping `str` type userIDs to `int` type and offers more algorithm options
* Will most likely have to use only a fraction of the dataset, rather than all of it, due to processing time


**Note**: The dataset contained 51M+ rows, and about 44.6M rows after dropping rows containing `None` type values.



<hr>

Import libraries

In [2]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, FloatType, StringType, LongType
from pyspark.sql.functions import col

from surprise import Reader, Dataset, SVD, NMF, accuracy
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise import KNNBasic, KNNBaseline, KNNWithMeans
from surprise.prediction_algorithms.slope_one import SlopeOne

Create SparkSession

In [3]:
spark = SparkSession.builder.master("local[6]").appName('Book Ratings')\
                            .config('spark.executor.memory', '8g')\
                            .config('spark.driver.memory', '4g')\
                            .getOrCreate()
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/10 17:14:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Import Dataset into Spark DF

In [4]:
# Get underlying Spark Context
sc = spark.sparkContext

In [5]:
# define new schema
schema = StructType()\
        .add('bookID', IntegerType(), nullable=False)\
        .add('userID', StringType(), nullable=False)\
        .add('rating', FloatType(), nullable=False)\
        .add('timestamp', LongType(), nullable=False)

In [6]:
# Import data into PySpark DF
books = spark.read.format('csv').schema(schema).load('../data/Books.csv')
books.head()

Row(bookID=1713353, userID='A1C6M8LCIX4M6M', rating=5.0, timestamp=1123804800)

In [7]:
# Verify data types
books.printSchema()

root
 |-- bookID: integer (nullable = true)
 |-- userID: string (nullable = true)
 |-- rating: float (nullable = true)
 |-- timestamp: long (nullable = true)



The data is returned in the format `(int(bookID), str(userID), float(rating), long(timestamp))`.

To parse into a `PySpark` `Rating` object, it is expected to be in the format `(int(userID), int(bookID), float(rating), long(timestamp))`.

<hr>

### Check for `NoneType` Values

In [8]:
# Check for None values in the DF
none_count = books.filter(
    (col("bookID").isNull()) |
    (col("userID").isNull()) |
    (col("rating").isNull()) |
    (col("timestamp").isNull())
).count()
has_none_values = none_count > 0

# Print the result
if has_none_values:
    print(f"The DF has at least {none_count} rows with a None value.")
else:
    print("The DF does not have any rows with None values.")



The DF has at least 6671993 rows with a None value.


                                                                                

In [9]:
# Verify what rows with None values look like
none_df = books.filter(
    (col('bookID').isNull()) |
    (col('userID').isNull()) |
    (col('rating').isNull()) |
    (col('timestamp').isNull())
)
none_df.take(10)

[Row(bookID=None, userID='ASS457AQPDIFZ', rating=5.0, timestamp=1409443200),
 Row(bookID=None, userID='A3NMH1KTLG7CWX', rating=5.0, timestamp=1398816000),
 Row(bookID=None, userID='A2LI5026JCXQBA', rating=4.0, timestamp=1398729600),
 Row(bookID=None, userID='AHNMXYVRDN1R9', rating=5.0, timestamp=1394323200),
 Row(bookID=None, userID='A2CAVTNQA2Y3IJ', rating=5.0, timestamp=1384560000),
 Row(bookID=None, userID='A2685NTFXLJJ1T', rating=5.0, timestamp=1377475200),
 Row(bookID=None, userID='A17TBLPM7H401J', rating=4.0, timestamp=1374364800),
 Row(bookID=None, userID='A1840OJGNFSBSN', rating=5.0, timestamp=900460800),
 Row(bookID=None, userID='A3ONKN7GMHG6K2', rating=5.0, timestamp=1400630400),
 Row(bookID=None, userID='A4LSI6PTX23BE', rating=5.0, timestamp=1400284800)]

The rows with missing values represent about 13% of the entire dataset.   
Since there isn't a straightforward way of handling the missing values without affecting the results of the recommendations, the rows will simply be dropped.
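
As a small illustration of what dropping these rows means, here is a plain-Python sketch that uses three rows copied from the outputs above. The helper `has_missing` is hypothetical; in the notebook itself the work is done by Spark's `filter` and `dropna`.

```python
# Three rows taken from the notebook outputs above: (bookID, userID, rating, timestamp)
rows = [
    (1713353, "A1C6M8LCIX4M6M", 5.0, 1123804800),
    (None, "ASS457AQPDIFZ", 5.0, 1409443200),
    (None, "A3NMH1KTLG7CWX", 5.0, 1398816000),
]


def has_missing(row):
    """Hypothetical helper: True if any field in the row is None."""
    return any(value is None for value in row)


missing = sum(has_missing(r) for r in rows)
clean_rows = [r for r in rows if not has_missing(r)]
missing_share = missing / len(rows)

print(f"{missing} of {len(rows)} rows have a missing value ({missing_share:.0%})")
```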

In [10]:
# Drop missing values
books = books.dropna()

# Verify count
# books.count()

After dropping rows with missing values, we still have about 44.6M rows of data left to work with.

<hr>

### Subset Dataset Based on Book Titles
Due to the large size of the dataset and processing issues, the dataset will be downsampled to about 1,000,000 rows based on books.

**Rationale**:
Since each book received about 17 ratings on average, downsampling the dataset to about 1M rows requires about 58,000 unique books (1,000,000 / 17 ≈ 58,800).  
This is still a very large dataset and should be able to return decent results for our recommendation engine.
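
The arithmetic behind the rationale can be checked with a short sketch. The value of `total_books` is the unique-book count printed later in this notebook; the variable names are illustrative only.

```python
target_rows = 1_000_000        # desired size of the downsampled dataset
avg_ratings_per_book = 17      # average stated in the rationale above
total_books = 2_266_596        # unique book IDs, as counted later in the notebook

# Number of books needed to reach the target row count
books_needed = target_rows / avg_ratings_per_book

# Share of all unique books that this represents
sample_fraction = books_needed / total_books

print(round(books_needed))        # 58824
print(round(sample_fraction, 3))  # 0.026
```

This is where the sampling fraction of `0.026` used further down comes from.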

Convert DF to RDD for Mapping UserIDs

In [11]:
# Convert DF to RDD 
books_rdd = books.rdd.map(list)
books_rdd.first()

                                                                                

[1713353, 'A1C6M8LCIX4M6M', 5.0, 1123804800]

In [12]:
# Get all book IDs
book_ids = books_rdd.map(lambda x: x[0])

# Verify result
book_ids.first()



1713353

Get all unique bookIDs for mapping.

In [13]:
# Get all unique book IDs
book_ids = book_ids.distinct()

# Count total book titles 
total_books = book_ids.count()

print(f"There are {total_books} total books in the dataset.")







There are 2266596 total books in the dataset.


                                                                                

In [14]:
# Get a Random Sample of 58,000 books from list of unique books IDs
book_ids_58k = book_ids.sample(withReplacement=False, fraction=0.026, seed=42)

In [15]:
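# Get a smaller random sample (roughly 5,700 books) for faster experiments
# Note: no seed is set here, so this sample changes between runs
book_ids_5k = book_ids.sample(withReplacement=False, fraction=0.0025)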

In [16]:
# book_ids_58k.count()

In [17]:
# Collect the sampled IDs to the driver (an action), then broadcast them as a set for use in the filter
broadcasted_book_ids = sc.broadcast(set(book_ids_58k.collect()))

                                                                                

In [18]:
bb_5k = sc.broadcast(set(book_ids_5k.collect()))

In [19]:
books_5k = books_rdd.filter(lambda x: x[0] in bb_5k.value)

In [20]:
# Get user ids and ratings for subset of 58K books
books_58k = books_rdd.filter(lambda x: x[0] in broadcasted_book_ids.value)

In [21]:
# books_58k.count()

In [22]:
books_5k.count()

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/Users/kylerodriguez/miniconda3/envs/book-recommender/lib/python3.12/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylerodriguez/miniconda3/envs/book-recommender/lib/python3.12/site-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylerodriguez/miniconda3/envs/book-recommender/lib/python3.12/socket.py", line 707, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt


KeyboardInterrupt: 

After successfully filtering, the dataset has been reduced to just over 1M rows, and was downsampled based on the book IDs as opposed to the user IDs. 
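
To make the book-based downsampling concrete, here is a plain-Python sketch of the same two steps on toy data: sample book IDs, then keep every rating that belongs to a sampled book. The toy ratings and the fraction of 0.5 are assumptions for illustration. Each ID is kept independently with the given probability, so the sample size is approximate; Spark's `sample` with a `fraction` likewise does not guarantee an exact count.

```python
import random

# Toy ratings as (bookID, userID, rating); values are made up for illustration
ratings = [
    (101, "u1", 5.0), (101, "u2", 4.0),
    (102, "u1", 3.0),
    (103, "u3", 2.0), (103, "u4", 5.0),
    (104, "u2", 1.0),
]

# Step 1: unique book IDs
unique_books = sorted({book for book, _, _ in ratings})

# Step 2: keep each book independently with probability `fraction`
fraction = 0.5
rng = random.Random(42)
sampled_books = {book for book in unique_books if rng.random() < fraction}

# Step 3: keep all ratings for the sampled books (set membership is O(1))
subset = [row for row in ratings if row[0] in sampled_books]

print(sampled_books)
print(len(subset), "of", len(ratings), "ratings kept")
```

Sampling by book rather than by row keeps every rating for each selected book, so the per-book rating counts in the subset match those in the full data.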

In [None]:
# Convert first to Spark DF
books_58k_df = books_58k.toDF()

# Then convert to Pandas DF for use with the Surprise library
books_58k_df = books_58k_df.toPandas()

                                                                                

In [None]:
books_5k_df = books_5k.toDF()

books_5k = books_5k_df.toPandas()

                                                                                

In [None]:
books_5k.head()

|   | _1 | _2 | _3 | _4 |
|---|---|---|---|---|
| 0 | 6173624 | AIRJKGFFI1POJ | 5.0 | 1519171200 |
| 1 | 6173624 | AOS0X376MV1TN | 2.0 | 1519084800 |
| 2 | 6173624 | AXJQ4MBM9TY03 | 3.0 | 1518652800 |
| 3 | 6173624 | A2XI8CU2845UXM | 5.0 | 1518048000 |
| 4 | 6173624 | A19FZP5PZXAOD3 | 5.0 | 1517616000 |


In [None]:
# Confirm data structure
books_58k_df.iloc[:2]

|   | _1 | _2 | _3 | _4 |
|---|---|---|---|---|
| 0 | 2008343 | A1RB3KF8ZQ43DZ | 4.0 | 1203811200 |
| 1 | 2008343 | A20QI7NG8SFN05 | 3.0 | 1199750400 |


### Build Recommender System with Python's Surprise Library

The data is returned in the format `(bookID, userID, rating, timestamp)`. 
To load it into a `Surprise` `Dataset`, `Dataset.load_from_df` expects three columns in the order `(userID, bookID, rating)`, so the columns are reordered first and the timestamp is left out at load time.

In [None]:
# Provide column headings to dataframe
books = books_58k_df.set_axis(['bookID', 'userID', 'rating', 'timestamp'], axis=1)
books_5k = books_5k.set_axis(['bookID', 'userID', 'rating', 'timestamp'], axis=1)

In [None]:
# Reorder columns: user ID first, then book ID, rating (out of 5 stars) and timestamp
books = books[['userID', 'bookID', 'rating', 'timestamp']]
books_5k = books_5k[['userID', 'bookID', 'rating', 'timestamp']]

books_5k.head(5)

|   | userID | bookID | rating | timestamp |
|---|---|---|---|---|
| 0 | AIRJKGFFI1POJ | 6173624 | 5.0 | 1519171200 |
| 1 | AOS0X376MV1TN | 6173624 | 2.0 | 1519084800 |
| 2 | AXJQ4MBM9TY03 | 6173624 | 3.0 | 1518652800 |
| 3 | A2XI8CU2845UXM | 6173624 | 5.0 | 1518048000 |
| 4 | A19FZP5PZXAOD3 | 6173624 | 5.0 | 1517616000 |


Now the pandas dataframe is in an acceptable format to be used by Surprise's algorithms.

In [None]:
# Import data into a Surprise DF
reader = Reader(rating_scale=(1.0, 5.0))
data = Dataset.load_from_df(books[['userID', 'bookID', 'rating']], reader=reader)

In [None]:
# Import the 5K-book subset into a second Surprise dataset
reader2 = Reader(rating_scale=(1.0, 5.0))
data2 = Dataset.load_from_df(books_5k[['userID', 'bookID', 'rating']], reader=reader2)

### Baseline SVD model

In [None]:
# Perform train test split
trainset, testset = train_test_split(data2, test_size=0.2)

NameError: name 'train_test_split' is not defined

In [31]:
# Define baseline SVD 
model = SVD()

# Train and predict
model.fit(trainset)
preds = model.test(testset)

# Evaluate model
accuracy.rmse(preds)

RMSE: 0.9859


0.9859147129145115

Results from the baseline SVD model give an RMSE of about 0.986, or a typical error of roughly one star on a 1-5 star rating scale. While this is not a terrible score, there is room to improve it for more accurate results.
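
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, while MAE is the mean absolute difference. The sketch below computes both in plain Python on made-up ratings; it is not Surprise's own implementation. Because RMSE squares each error, a few large misses raise it more than they raise MAE, so an RMSE near 1 does not mean every prediction is off by a full star.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up ratings on a 1-5 scale; errors are 1, 0, 1 and 2 stars
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.0, 3.0, 5.0, 3.0]

print(f"RMSE: {rmse(actual, predicted):.4f}")  # RMSE: 1.2247
print(f"MAE:  {mae(actual, predicted):.4f}")   # MAE:  1.0000
```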

### Perform Cross-validation to Randomize Splits

In [32]:
# Run 5-fold cross-validation and print results
cross_validate(model, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9880  0.9869  0.9911  0.9878  0.9887  0.9885  0.0014  
MAE (testset)     0.7244  0.7232  0.7255  0.7239  0.7236  0.7241  0.0008  
Fit time          10.12   10.36   10.75   10.54   13.68   11.09   1.31    
Test time         1.21    1.06    1.17    1.10    0.72    1.05    0.17    


{'test_rmse': array([0.98803048, 0.98694421, 0.99110749, 0.98777696, 0.98870302]),
 'test_mae': array([0.72436212, 0.72322946, 0.72554698, 0.72386566, 0.72362888]),
 'fit_time': (10.117323875427246,
  10.36492109298706,
  10.753936052322388,
  10.540676832199097,
  13.675766944885254),
 'test_time': (1.2062747478485107,
  1.0624489784240723,
  1.166262149810791,
  1.1043531894683838,
  0.7231988906860352)}

Even with a simple cross-validation to rule out randomness in train/test split, the average RMSE score was still ~0.99.
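
As a rough picture of what 5-fold cross-validation does, the sketch below shuffles row indices, splits them into five folds, and uses each fold once as the test set while training on the other four. The function `k_fold_indices` is a hypothetical helper written for illustration; Surprise's `cross_validate` handles this internally and may split differently.

```python
import random


def k_fold_indices(n_rows, n_splits=5, seed=0):
    """Hypothetical helper: yield (train_indices, test_indices) for each fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into n_splits folds of near-equal size
    folds = [indices[i::n_splits] for i in range(n_splits)]

    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_rows=10, n_splits=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: train={len(train_idx)} rows, test={len(test_idx)} rows")
```

Every row appears in exactly one test fold, which is why the averaged score depends less on any single random split.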

### Perform GridSearch CV to find Best Params

In [26]:
# Define parameter grid
param_grid = {
    "n_epochs": [5, 10, 15, 20], 
    "lr_all": [0.002, 0.005, 0.01, 0.1], 
    "reg_all": [0.1, 0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=5)

# fit train data
gs.fit(data)

# best RMSE score
print(gs.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])

0.9837273688883142
{'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.1}


In [35]:
# Define parameter grid
param_grid = {
    "n_epochs": [20, 50, 150],
    "lr_all": [0.002, 0.005, 0.01, 0.1],
    "reg_all": [0.1, 0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=5)

# fit train data
gs.fit(data2)

# best RMSE score
print(gs.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])

0.9915724647475507
{'n_epochs': 150, 'lr_all': 0.002, 'reg_all': 0.1}


In [27]:
# Run best model and perform cross-validation
tuned_model = SVD(n_epochs=20, lr_all=0.01, reg_all=0.1)

# Run 5-fold cross-validation and print results
cross_validate(tuned_model, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9837  0.9837  0.9848  0.9821  0.9846  0.9838  0.0010  
MAE (testset)     0.7194  0.7205  0.7199  0.7200  0.7211  0.7202  0.0006  
Fit time          13.37   11.99   11.00   10.57   12.30   11.85   0.99    
Test time         1.29    0.75    0.76    0.76    0.77    0.87    0.21    


{'test_rmse': array([0.98368286, 0.98374204, 0.98476931, 0.98206963, 0.98459273]),
 'test_mae': array([0.71938808, 0.72050783, 0.71986622, 0.72001344, 0.72113289]),
 'fit_time': (13.3734769821167,
  11.986736059188843,
  11.001637935638428,
  10.574107885360718,
  12.299041032791138),
 'test_time': (1.2938790321350098,
  0.7538900375366211,
  0.7587981224060059,
  0.7577779293060303,
  0.7703983783721924)}

After hyperparameter tuning using GridSearch, the score improved only slightly to ~0.98 on average, after cross-validation.
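
The cost of this small gain can be estimated by counting model fits. The sketch below enumerates the first parameter grid with `itertools.product` and multiplies by the number of folds; it only counts combinations and does not train anything.

```python
from itertools import product

# First parameter grid from the search above
param_grid = {
    "n_epochs": [5, 10, 15, 20],
    "lr_all": [0.002, 0.005, 0.01, 0.1],
    "reg_all": [0.1, 0.4, 0.6],
}
cv_folds = 5

# Every combination of one value per parameter
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)  # 48 240
```

By the same count, the second grid (3 x 4 x 3 values) gives 36 combinations, or 180 fits with 5 folds.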

### Try Neighbors-based Algorithms

In [41]:
# Configure sim options
sim_options = {
    'name': 'msd',
    'min_support': 10,
    'user_based': False
}
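
For context on these options: with `user_based` set to `False`, similarities are computed between books, using the users who rated both. The sketch below shows the mean squared difference (MSD) similarity in plain Python, following the definition in the Surprise documentation as I understand it: the similarity is `1 / (msd + 1)`, and pairs with fewer than `min_support` common users get a similarity of 0. The function and the toy ratings are illustrative, not Surprise's implementation.

```python
def msd_similarity(ratings_i, ratings_j, min_support=1):
    """Illustrative MSD similarity between two items.

    ratings_i and ratings_j map userID -> rating for one item each.
    """
    common_users = ratings_i.keys() & ratings_j.keys()
    if not common_users or len(common_users) < min_support:
        return 0.0

    msd = sum(
        (ratings_i[u] - ratings_j[u]) ** 2 for u in common_users
    ) / len(common_users)
    return 1.0 / (msd + 1.0)


# Toy data: users u1 and u2 rated both books
book_a = {"u1": 5.0, "u2": 3.0, "u3": 4.0}
book_b = {"u1": 4.0, "u2": 3.0, "u4": 2.0}

print(msd_similarity(book_a, book_b, min_support=1))   # 2 common users, msd = 0.5
print(msd_similarity(book_a, book_b, min_support=10))  # 0.0, too few common users
```

On sparse data, a `min_support` of 10 sets the similarity of many book pairs to zero, which leaves the neighbors-based models with fewer usable neighbors.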

In [43]:
knn_basic = KNNBasic(sim_options=sim_options)

knn_basic.fit(trainset=trainset)

knn_basic_preds = knn_basic.test(testset)

accuracy.rmse(knn_basic_preds)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0360


1.0359539018955202

In [42]:
# KNNBaseline model
knn = KNNBaseline(sim_options=sim_options)

# Fit and predict
knn.fit(trainset=trainset)

knn_preds = knn.test(testset=testset)

# Evaluate model 
accuracy.rmse(knn_preds)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9858


0.9857962734509415

In [44]:
knn_means = KNNWithMeans(sim_options=sim_options)

knn_means.fit(trainset=trainset)

means_preds = knn_means.test(testset)

accuracy.rmse(means_preds)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0350


1.035044683405012

Non-Negative Matrix Factorization Model

In [39]:
nmf = NMF()

nmf.fit(trainset)

nmf_preds = nmf.test(testset)

accuracy.rmse(nmf_preds)

RMSE: 1.0565


1.0565001929351474

SlopeOne Model

In [40]:
slope_one = SlopeOne()

slope_one.fit(trainset)

slope_preds = slope_one.test(testset)

accuracy.rmse(slope_preds)

RMSE: 1.0487


1.0486803955595436
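
To compare the models side by side, the sketch below ranks the RMSE values reported in the outputs above, rounded to four decimals. It assumes all six models were evaluated on the same train/test split, which the cell execution order does not guarantee.

```python
# RMSE values copied from the outputs above (rounded to 4 decimals)
results = {
    "SVD": 0.9859,
    "KNNBasic": 1.0360,
    "KNNBaseline": 0.9858,
    "KNNWithMeans": 1.0350,
    "NMF": 1.0565,
    "SlopeOne": 1.0487,
}

# Lower RMSE is better, so sort ascending
ranked = sorted(results.items(), key=lambda item: item[1])

for name, score in ranked:
    print(f"{name:<14}{score:.4f}")
```

`KNNBaseline` and `SVD` are nearly tied at the top, while the other four models all land above 1.03.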

In [32]:
def predict_model(evaluator, trainset=trainset, testset=testset):

    # Get name of evaluator
    name = evaluator.__name__

    # Instantiate model
    model = evaluator()

    # Fit model to trainset
    model.fit(trainset)

    # Predict on test set
    preds = model.test(testset)

    # Compute RMSE (accuracy.rmse also prints the value)
    rmse = accuracy.rmse(preds)

    return {"model": name, "RMSE": rmse}
    

In [33]:
predict_model(SVD)


RMSE: 1.0126


{'model': 'SVD', 'RMSE': 1.0126084945329576}

In [34]:
predict_model(KNNBaseline)

Estimating biases using als...

