# Spark Recommender System

Spark offers a built-in ALS recommender that we will compare against our LightFM model. It does not handle the cold start problem we set out to solve, but it does a good job of using collaborative filtering to make predictions.

The algorithm uses alternating least squares (ALS) to optimize predictions of a user's preference for an item. It takes the original matrix of user-product ratings R, which in our case holds the number of times a user bought a given product, and factorizes it into two lower-rank matrices U and P. When U and P are multiplied back together, the empty ratings are replaced with estimates.

The "alternating" part of ALS comes from the way the algorithm minimizes the least squares error: it fixes one of U and P, optimizes the other, then swaps their roles, repeating for a designated number of iterations. The resulting matrix has ratings filled in for each product, which we can sample from to compare with our LightFM model. We will tune the model with a grid search over its hyperparameters and keep the combination with the lowest root mean squared error (RMSE).
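To make the alternation concrete, here is a minimal NumPy sketch of the idea on a toy matrix. It is not Spark's implementation: the function name `als_factorize`, the `mask` argument, and the toy values are assumptions made for illustration, and the regularization is a plain ridge penalty.

```python
import numpy as np

def als_factorize(R, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Factorize R ~= U @ P.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    P = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)

    for _ in range(n_iters):
        # Fix P and solve a small regularized least squares problem per user.
        for i in range(n_users):
            obs = mask[i]
            if obs.any():
                P_obs = P[obs]
                U[i] = np.linalg.solve(P_obs.T @ P_obs + ridge, P_obs.T @ R[i, obs])
        # Fix U and solve the same kind of problem per item.
        for j in range(n_items):
            obs = mask[:, j]
            if obs.any():
                U_obs = U[obs]
                P[j] = np.linalg.solve(U_obs.T @ U_obs + ridge, U_obs.T @ R[obs, j])
    return U, P

# Toy purchase-count matrix; 0 marks an unobserved user-product pair.
R = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, 3.0],
              [0.0, 2.0, 3.0]])
mask = R > 0

U, P = als_factorize(R, mask)
R_hat = U @ P.T  # dense estimate, including the previously empty cells
observed_rmse = np.sqrt(np.mean((R_hat[mask] - R[mask]) ** 2))
```

Each inner solve only touches the ratings a user (or item) actually has, which is why the empty cells of R never enter the error being minimized yet still receive a value in `R_hat`.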

In [303]:
# import necessary modules
import os
import shutil
import pyspark as ps
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.sql import Row
from pyspark.sql.types import DoubleType

In [304]:
# create spark context
spark = (ps.sql.SparkSession.builder
        .appName("sandbox")
        .getOrCreate()
        )
sc = spark.sparkContext
print(spark.version)

3.0.0


## Read in Data

We will use user clusters as part of user features for our model here. 

In [305]:
# source data from prior step
data_dir = os.path.join("modified_data", "")
file = os.path.join(data_dir, "item_features_clustered.csv")

# options are specified to read in data without error:
# quoted fields may span lines, and embedded quotes are escaped by doubling them
df_user = spark.read.format("csv")\
               .option("multiline", "true")\
               .option("quote", '"')\
               .option("header", "true")\
               .option("escape", '"')\
               .load(file)

In [306]:
# source data from prior step
data_dir = os.path.join("modified_data", "")
file = os.path.join(data_dir, "item_features.csv")

# options are specified to read in data without error:
# quoted fields may span lines, and embedded quotes are escaped by doubling them
df_item = spark.read.format("csv")\
               .option("multiline", "true")\
               .option("quote", '"')\
               .option("header", "true")\
               .option("escape", '"')\
               .load(file)

## Create user and item feature matrices

In [307]:
user_features = df_user.select(df_user['customer_unique_id'], 
                               df_user['product_id'], 
                               df_user['product_category_count'], 
                               df_user['cluster_id'])

In [308]:
user_features.show(4)

+--------------------+--------------------+----------------------+----------+
|  customer_unique_id|          product_id|product_category_count|cluster_id|
+--------------------+--------------------+----------------------+----------+
|7c396fd4830fd0422...|87285b34884572647...|                     1|        12|
|7c396fd4830fd0422...|9abb00920aae319ef...|                     1|        12|
|e781fdcc107d13d86...|87285b34884572647...|                     1|         3|
|3a51803cc0d012c3b...|87285b34884572647...|                     1|         3|
+--------------------+--------------------+----------------------+----------+
only showing top 4 rows



In [309]:
user_features = user_features.sort("customer_unique_id")

In [310]:
item_features = df_item.select(df_item['product_id'], 
                               df_item['product_category_name'], 
                               df_item['avg_price_binned'])

In [311]:
item_features.show(4)

+--------------------+---------------------+----------------+
|          product_id|product_category_name|avg_price_binned|
+--------------------+---------------------+----------------+
|372645c7439f9661f...|       bed_bath_table|   (74.9, 135.0]|
|5099f7000472b634f...|        health_beauty|   (0.849, 39.9]|
|64b488de448a5324c...|           stationery|    (39.9, 74.9]|
|2345a354a6f203360...|            telephony|   (0.849, 39.9]|
+--------------------+---------------------+----------------+
only showing top 4 rows



## Index user and product ids

In [312]:
from pyspark.ml.feature import StringIndexer

# create StringIndexer objects and specify their input and output columns
SI_customer = StringIndexer(inputCol='customer_unique_id', outputCol='customer_index')
SI_product = StringIndexer(inputCol='product_id', outputCol='product_index')

# fit each indexer and transform the data
# note: SI_product is fit separately on each frame, so product_index values agree
# across the two frames only if both rank product_id by frequency identically
user_features = SI_customer.fit(user_features).transform(user_features)
user_features = SI_product.fit(user_features).transform(user_features)
item_features = SI_product.fit(item_features).transform(item_features)

# view the transformed data
user_features.select('customer_unique_id', 'customer_index', 'product_id', 'product_index').show(10)
item_features.select('product_id', 'product_index').show(10)

+--------------------+--------------+--------------------+-------------+
|  customer_unique_id|customer_index|          product_id|product_index|
+--------------------+--------------+--------------------+-------------+
|0000366f3b9a7992b...|       11614.0|372645c7439f9661f...|        380.0|
|0000b849f77a49e4a...|       11615.0|5099f7000472b634f...|       2737.0|
|0000f46a3911fa3c0...|       11616.0|64b488de448a5324c...|       4156.0|
|0000f6ccb0745a6a4...|       11617.0|2345a354a6f203360...|       6611.0|
|0004aac84e0df4da2...|       11618.0|c72e18b3fe2739b8d...|      28356.0|
|0004bd2a26a76fe21...|       11619.0|25cf184645f3fae66...|       6625.0|
|00050ab1314c0e55a...|       11620.0|8cefe1c6f2304e7e6...|       2384.0|
|00053a61a98854899...|        2805.0|62984ea1bba7fcea1...|       5331.0|
|00053a61a98854899...|        2805.0|58727e154e8e85d84...|       1357.0|
|0005e1862207bf6cc...|       11621.0|e24f73b7631ee3fbb...|        801.0|
+--------------------+--------------+--------------------+-------------+
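By default `StringIndexer` orders labels by descending frequency, so the index a product receives depends on the frame the indexer was fit on. The pure-Python sketch below mimics that ordering (most frequent first, ties broken alphabetically) to show why a mapping fit once and reused is the safer pattern. The helper `fit_string_index` and the ids are hypothetical, and this is not Spark's code.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical ids: purchases repeat products, the catalogue lists each once.
purchases = ["p_b", "p_b", "p_a", "p_c", "p_c", "p_c"]
catalogue = ["p_a", "p_b", "p_c"]

purchase_index = fit_string_index(purchases)    # fit on one frame
catalogue_index = fit_string_index(catalogue)   # fit separately on another frame

# Fitting once and reusing the mapping keeps the indices aligned.
shared_index = fit_string_index(purchases)
catalogue_shared = {p: shared_index[p] for p in catalogue}

print(purchase_index["p_c"], catalogue_index["p_c"], catalogue_shared["p_c"])
```

In PySpark the equivalent of the shared mapping would be to keep the fitted model returned by `SI_product.fit(...)` and call its `transform` on both frames. Labels missing from the fitted frame then need handling via the indexer's `handleInvalid` setting.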

In [313]:
from pyspark.sql.types import IntegerType
# convert columns to integer types
user_features = user_features.withColumn("product_category_count",
                                        user_features["product_category_count"].cast(IntegerType()))

## Model Training

In [314]:
# split 80-20
(training, test) = user_features.randomSplit([0.8, 0.2])

In [324]:
# train the recommender with als
als_alg = ALS(maxIter=5, 
              regParam=0.01, 
              userCol='customer_index', 
              itemCol="product_index", 
              ratingCol='product_category_count',
              coldStartStrategy='drop', 
              seed=3)

model = als_alg.fit(training)

# evaluate with the holdout set
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName='rmse',
                                labelCol='product_category_count',
                                predictionCol='prediction')
rmse = evaluator.evaluate(predictions)

print("Root-mean-squared-error = " + str(round(rmse, 3)))

Root-mean-squared-error = 0.26
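For reference, the metric reported above is the square root of the mean squared difference between the held-out counts and the predictions. With `coldStartStrategy='drop'`, rows whose prediction is NaN (users or products unseen during training) are dropped before the metric is computed. The sketch below reproduces that arithmetic on made-up numbers, not on values from this dataset.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error over pairs whose prediction is not NaN."""
    pairs = [(y, y_hat) for y, y_hat in zip(labels, predictions)
             if not math.isnan(y_hat)]
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in pairs) / len(pairs))

# Hypothetical held-out counts and predictions; the last row is a cold-start user.
labels = [1, 1, 2, 1, 3]
predictions = [1.0, 1.0, 1.5, 1.5, float("nan")]

score = rmse(labels, predictions)
print(round(score, 3))
```

Because cold-start rows are excluded, the score only describes users and products that appear in the training split.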


## Generate user and product recommendations

These can be sampled from to output predictions for specific users.

In [325]:
# generate top 5 product recommendations for user
user_recs = model.recommendForAllUsers(5)
user_recs.show(4)

+--------------+--------------------+
|customer_index|     recommendations|
+--------------+--------------------+
|           148|[[508, 7.674963],...|
|           463|[[586, 4.9995255]...|
|           471|[[1687, 9.257822]...|
|           496|[[352, 9.7099085]...|
+--------------+--------------------+
only showing top 4 rows
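Each entry in `recommendations` pairs a product index with a predicted rating. Conceptually, the prediction for a user-product pair is the dot product of that user's and that product's factor vectors, and the top-N list keeps the highest scoring products. The NumPy sketch below illustrates this with small hand-written factors. The function `top_n_items` and the factor values are assumptions for illustration, not output from the fitted model.

```python
import numpy as np

def top_n_items(user_factors, item_factors, n):
    """Return, per user, the n (item_index, score) pairs with the highest scores."""
    scores = user_factors @ item_factors.T          # users x items
    top = np.argsort(-scores, axis=1)[:, :n]        # best items first
    return [[(int(j), float(scores[u, j])) for j in row]
            for u, row in enumerate(top)]

# Hypothetical rank-2 factors for 2 users and 3 items.
user_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
item_factors = np.array([[0.5, 2.0],
                         [3.0, 0.1],
                         [1.0, 1.0]])

recs_sketch = top_n_items(user_factors, item_factors, n=2)
```

Note that nothing in this scoring excludes products a user has already bought, which is consistent with known purchases showing up among the recommendations later in this notebook.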



In [326]:
product_recs = model.recommendForAllItems(10)
product_recs.show()

+-------------+--------------------+
|product_index|     recommendations|
+-------------+--------------------+
|          148|[[560, 3.2099476]...|
|          463|[[301, 10.625377]...|
|          471|[[245, 8.60173], ...|
|          496|[[560, 9.778228],...|
|          833|[[594, 32.10067],...|
|         1088|[[447, 9.255263],...|
|         1238|[[368, 8.5092745]...|
|         1342|[[1007, 14.607211...|
|         1580|[[2087, 7.3240857...|
|         1591|[[823, 8.992891],...|
|         1645|[[665, 6.242186],...|
|         1829|[[512, 6.9178505]...|
|         1959|[[378, 16.812271]...|
|         2122|[[2025, 9.799909]...|
|         2142|[[348, 15.56076],...|
|         2366|[[378, 10.060214]...|
|         2659|[[348, 10.471405]...|
|         2866|[[55, 7.7986393],...|
|         3175|[[55, 8.606258], ...|
|         3749|[[93, 9.776457], ...|
+-------------+--------------------+
only showing top 20 rows



In [327]:
users = user_features.select(als_alg.getUserCol()).distinct().limit(3)
user_subset_recs = model.recommendForUserSubset(users, 10)
user_subset_recs.show(n=4)

+--------------+--------------------+
|customer_index|     recommendations|
+--------------+--------------------+
|         11757|[[1651, 2.7976084...|
|           558|[[7183, 18.795618...|
|          2815|[[352, 3.1491728]...|
+--------------+--------------------+



In [333]:
recs.head()

   customer_index                                    recommendations
0             148  [(1651, 12.715227127075195), (23114, 12.715227...
1             463  [(586, 5.005967140197754), (956, 4.98965263366...
2             471  [(692, 11.821859359741211), (901, 11.619313240...
3             496  [(596, 7.7499518394470215), (508, 6.2985320091...
4             833  [(676, 13.120061874389648), (3776, 12.81623840...


### Parameter Tuning

In [334]:
# Import the required functions
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [336]:
# define the ALS estimator to tune; fitting happens inside the cross validator
als = ALS(maxIter=5,
          regParam=0.01,
          userCol='customer_index',
          itemCol="product_index",
          ratingCol='product_category_count',
          coldStartStrategy='drop')

In [338]:
# Import the requisite packages
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 50, 100, 150]) \
            .addGrid(als.regParam, [.01, .05, .1, .15]) \
            .build()

In [339]:
evaluator = RegressionEvaluator(
           metricName="rmse", 
           labelCol="product_category_count", 
           predictionCol="prediction") 
print ("Num models to be tested: ", len(param_grid))

Num models to be tested:  16


In [340]:
# Build cross validation using CrossValidator
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)
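The grid is the Cartesian product of the candidate values, and k-fold cross-validation fits every combination once per fold, so the amount of training grows quickly. The short sketch below counts the fits implied by the settings above using only the standard library. The variable names are illustrative.

```python
from itertools import product

ranks = [10, 50, 100, 150]
reg_params = [0.01, 0.05, 0.1, 0.15]
num_folds = 5

# One candidate per (rank, regParam) combination, as in the parameter grid.
grid = [{"rank": r, "regParam": reg} for r, reg in product(ranks, reg_params)]

n_candidates = len(grid)
n_cv_fits = n_candidates * num_folds   # one fit per candidate per fold
n_total_fits = n_cv_fits + 1           # plus the final refit of the best candidate

print(n_candidates, n_cv_fits, n_total_fits)
```

With ranks up to 150 this can take a long time, so it can be worth trying a coarser grid first.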

In [None]:
# Fit the cross validator on the training split
model = cv.fit(training)
# Extract the best model from the fitted cross validator
best_model = model.bestModel
# Evaluate the best model on the holdout set
test_predictions = best_model.transform(test)
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)

In [None]:
print("**Best Model**")
# Print "Rank"
print("  Rank:", best_model._java_obj.parent().getRank())
# Print "MaxIter"
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())
# Print "RegParam"
print("  RegParam:", best_model._java_obj.parent().getRegParam())

In [None]:
# Generate n Recommendations for all users
recommendations = best_model.recommendForAllUsers(5)
recommendations.show()

### Re-run model for all users

In [None]:
# refit ALS on all of the data using the best hyperparameters from cross-validation
als_alg = ALS(rank=best_model._java_obj.parent().getRank(),
              maxIter=best_model._java_obj.parent().getMaxIter(),
              regParam=best_model._java_obj.parent().getRegParam(),
              userCol='customer_index',
              itemCol="product_index",
              ratingCol='product_category_count',
              coldStartStrategy='drop')

final_model = als_alg.fit(user_features)

In [None]:
# generate top_n product recommendations for user
nrecommend = 5
user_recs = final_model.recommendForAllUsers(nrecommend)
user_recs.show(4)

In [None]:
recs = user_recs.toPandas()

## Recommender Function

In [None]:
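# Generate pandas dfs for accessing products and users in the recommender function
products = item_features.toPandas()
# user_features_df is used by the function below but is not defined elsewhere in this
# notebook; it is assumed here to be the pandas copy of user_features
user_features_df = user_features.toPandas()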

In [329]:
def user_recommendations(user_id, top_n=3):
    """Print one known purchase and the top_n ALS recommendations for a customer.

    Relies on the pandas frames user_features_df (a pandas copy of user_features),
    products and recs, and on nrecommend, all defined outside the function.
    """
    # recs only holds nrecommend items per user
    if top_n > nrecommend:
        print("Please select up to {} items to recommend".format(nrecommend))
        return

    # all rows that belong to this customer
    user_rows = user_features_df[user_features_df['customer_unique_id'] == user_id]

    print("User: {}\n".format(user_id))
    print("Known positives: ")
    known_like_product = user_rows['product_id'].unique()[0]
    known_like_category = products[products['product_id'] == known_like_product]\
                                            ['product_category_name'].unique()[0]

    print("\t", known_like_product)
    print("\t", known_like_category, "\n")

    customer_index = user_rows['customer_index'].unique()[0]

    print("Top {} Recommendations: \n".format(top_n))
    # list of (product_index, predicted rating) pairs for this customer
    rec_list = list(recs[recs['customer_index'] == customer_index]['recommendations'])[0]

    for n in range(top_n):
        rec_index = rec_list[n][0]
        # look up the product id and category by label rather than by position
        rec_product = products[products['product_index'] == rec_index]\
                              [['product_id', 'product_category_name']].iloc[0]

        print("{}.\n".format(n+1), rec_product['product_id'])
        print(rec_product['product_category_name'])

__Test for customer_id = 'c8ed31310fc440a3f8031b177f9842c3'__

In [295]:
user_recommendations('c8ed31310fc440a3f8031b177f9842c3', top_n=5)

User: c8ed31310fc440a3f8031b177f9842c3

Known positives: 
	 1065e0ebef073787a7bf691924c60eeb
	 construction_tools_construction 

Top 5 Recommendations: 

1.
 89b190a046022486c635022524a974a8
furniture_decor
2.
 ff26009ac6b838dc6cffa2d589cdbefb
furniture_decor
3.
 79d62ec5dd0de230da5f185b478a5ade
auto
4.
 05b515fdc76e888aada3c6d66c201dff
health_beauty
5.
 270516a3f41dc035aa87d220228f844c
health_beauty


__Test for customer_id = '698e1cf81d01a3d389d96145f7fa6df8'__

In [294]:
user_recommendations('698e1cf81d01a3d389d96145f7fa6df8', top_n=5)

User: 698e1cf81d01a3d389d96145f7fa6df8

Known positives: 
	 9571759451b1d780ee7c15012ea109d4
	 auto 

Top 5 Recommendations: 

1.
 9571759451b1d780ee7c15012ea109d4
auto
2.
 837b5c6df9ceb8a9c604e78fde0e60a2
computers_accessories
3.
 f3720bc68555b1bff49b9ffd41b017ac
computers_accessories
4.
 42189544021ccb7369862e7ee218d828
health_beauty
5.
 70c1bce00b24bfd21332f7f8ebe2217f
housewares


__Test for customer_id = '89be58cbdd6ef318e3ed93fdb22be178'__

In [296]:
user_recommendations('89be58cbdd6ef318e3ed93fdb22be178', top_n=5)

User: 89be58cbdd6ef318e3ed93fdb22be178

Known positives: 
	 3fdb534dccf5bc9ab0406944b913787d
	 diapers_and_hygiene 

Top 5 Recommendations: 

1.
 3fdb534dccf5bc9ab0406944b913787d
diapers_and_hygiene
2.
 9571759451b1d780ee7c15012ea109d4
auto
3.
 270516a3f41dc035aa87d220228f844c
health_beauty
4.
 05b515fdc76e888aada3c6d66c201dff
health_beauty
5.
 79d62ec5dd0de230da5f185b478a5ade
auto
