# Part 7: Model-based Collaborative Filtering with Alternating Least Squares (ALS) & Evaluation on sample userid 2043
---


## Foreword on Alternating Least Squares (ALS)
---

- The Alternating Least Squares (ALS) algorithm is a matrix factorization technique that decomposes the user-item interaction matrix (such as a user-item ratings matrix for datasets with explicit feedback) into user and item latent factors whose dot product predicts user ratings. At each iteration it holds one set of factors (user or item) fixed and solves a regularized least-squares problem for the other, which has a closed-form solution, and it alternates between the two in the process of minimizing the loss. In doing so, it learns latent user and item factors that predict the ratings, much like how a regressor learns coefficients that predict the target from the input features.

<img src="yelp_data/matfact_explain.png"/>

- ALS can be imported from the ```implicit``` library for datasets with implicit feedback (e.g. clickthroughs and page views) or from ```pyspark.ml.recommendation``` (which can be used for either implicit or explicit data). Since my data involves user ratings, I will be relying on the PySpark machine learning library.
---
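
To make the alternating step in the foreword concrete, here is a minimal NumPy sketch of explicit-feedback ALS. It is only an illustration under stated assumptions: the function name `als_explicit`, the toy ratings matrix and the hyperparameter values are all hypothetical, and Spark's own implementation differs in its details (for example in how regularization is scaled and how the work is distributed).

```python
import numpy as np

def als_explicit(ratings, observed, rank=2, reg=0.1, n_iter=20, seed=0):
    """Toy ALS for explicit ratings; `observed` is a boolean mask of known entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_f = rng.normal(scale=0.1, size=(n_users, rank))
    item_f = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iter):
        # Item factors fixed: each user's factors are a ridge-regression solution.
        for u in range(n_users):
            seen = observed[u]
            if seen.any():
                v = item_f[seen]
                user_f[u] = np.linalg.solve(v.T @ v + ridge, v.T @ ratings[u, seen])
        # User factors fixed: solve for each item's factors in the same way.
        for i in range(n_items):
            seen = observed[:, i]
            if seen.any():
                w = user_f[seen]
                item_f[i] = np.linalg.solve(w.T @ w + ridge, w.T @ ratings[seen, i])
    return user_f, item_f

# Hypothetical 4-user x 3-item ratings matrix; 0 marks "not rated".
toy = np.array([[5, 4, 0],
                [4, 0, 3],
                [0, 5, 4],
                [5, 4, 0]], dtype=float)
mask = toy > 0
user_f, item_f = als_explicit(toy, mask)
pred = user_f @ item_f.T
rmse = np.sqrt(np.mean((pred[mask] - toy[mask]) ** 2))
print("RMSE on the observed ratings:", round(float(rmse), 3))
```

Each inner update is an ordinary regularized least-squares solve, which is where the "least squares" in the name comes from. The unobserved cells of `pred` are the model's rating predictions for shops a user has not rated yet.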

- Please follow the steps below to download and install all the relevant libraries and dependencies BEFORE running this notebook to avoid encountering any errors.

- First up, navigate to your home directory and create a new directory called ```server```:
    ```cd ~```
    ```mkdir server```
  Make sure that everything below is downloaded into this ```server``` folder.
  
- In order to run spark and pyspark on your local machine, kindly ensure that you already have Java installed with the following command in your Terminal or Windows-equivalent in command prompt: ```java -version``` If nothing comes out of this, navigate to this [link](https://java.com/en/download/help/download_options.xml) to download java for mac or windows. You may need to restart your system after installation for java to take effect.

- Next, navigate to this [link](https://www.oracle.com/java/technologies/javase-jdk8-downloads.html) to download the java development kit and then install it.

- Check if scala is installed by executing this command in your Terminal or Windows-equivalent command prompt: ```scala -version```. If nothing comes out, navigate to this [link](https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz) to download and install scala as well as this [link](https://github.com/sbt/sbt/releases/download/v0.13.17/sbt-0.13.17.tgz) to download and install sbt-0.13.17.tgz.

- Navigate to this [link](https://spark.apache.org/downloads.html) to download Apache Spark. Select the options like the screenshot below and click on "spark-2.4.5-bin-hadoop2.7.tgz" under point 3 to download spark and install it.
<img src="yelp_data/spark_dl.png"/>

- The following should be the directory paths of the software you have downloaded and installed above, where ```HOMEDIRECTORY``` is your home directory's name:
    JDK: ```/Library/Java/JavaVirtualMachines/jdk1.8.0_251.jdk```
    Sbt: ```/Users/HOMEDIRECTORY/server/sbt```
    Scala: ```/Users/HOMEDIRECTORY/server/scala-2.11.12```
    Spark: ```/Users/HOMEDIRECTORY/server/spark-2.4.5-bin-hadoop2.7```

- After all of the above have been installed, set up a ```.bash_profile``` file in your home directory. For Mac users, if you do not already have a ```.bash_profile``` file, navigate to your home directory and create one by executing the following commands:
    ```cd ~```
    ```touch .bash_profile```
    After which, open it with a text editor of your choice and add the following lines of code at the top of the ```.bash_profile``` file, replacing ```HOMEDIRECTORY``` with the name of your home directory:

<img src="yelp_data/spark_bash_profile.png"/>
    
       
- Save and close the ```.bash_profile``` file and execute ```source ~/.bash_profile``` in your Terminal or Windows-equivalent command prompt.

- Completely quit your Terminal and command prompt.

- Now you may proceed to run the rest of the following code.


- ***KINDLY NOTE THAT IF YOU HAVE ENCOUNTERED A CONNECTION REFUSED ERROR OR A JAVA ERROR WHERE IT IS TRYING TO CONNECT TO YOUR IP ADDRESS BUT FAILED WHEN RUNNING ANY PYSPARK-RELATED CELL, KINDLY JUST COPY ALL THE CELLS IN THE NOTEBOOK (HIGHLIGHT THE TOP CELL AND CMD(FOR MAC)/CTRL(FOR WINDOWS) + SHIFT + HIGHLIGHT THE LAST CELL), COPY AND PASTE INTO A FRESH NOTEBOOK AND RUN THEM THERE INSTEAD***

In [None]:
#findspark allows pyspark to be run in jupyter notebook
!pip install pyspark
!pip install findspark

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer, IndexToString
from pyspark.ml import Pipeline as PL
from pyspark.sql.functions import col
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql.functions import max
import pyspark.sql.functions as func
from pyspark.ml import PipelineModel
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, multilabel_confusion_matrix, roc_auc_score
import numpy as np
import pandas as pd
import joblib

In [2]:
#starting the pyspark session
spark = SparkSession.builder.appName('recommendation_system').getOrCreate()

In [3]:
#reading in curated dataset for model-based collaborative filtering (mbcf)
mbcf = spark.read.csv('yelp_data/mbcf.csv', header=True, inferSchema=True)

In [4]:
#taking a first look at the mbcf dataframe
mbcf.show(3,truncate=True)

+-------------------+-------+-------+
|              shops|ratings|userids|
+-------------------+-------+-------+
|hustle-co-singapore|    5.0|    532|
|hustle-co-singapore|    5.0|   1397|
|hustle-co-singapore|    5.0|     80|
+-------------------+-------+-------+
only showing top 3 rows



In [5]:
#remember we want to test the ALS rating predictions on userid 2043, just as we did for content-based filtering by the various models earlier on
mbcf.filter("userids = 2043").show(5)

+--------------------+-------+-------+
|               shops|ratings|userids|
+--------------------+-------+-------+
| hustle-co-singapore|    5.0|   2043|
|benjamin-barker-c...|    4.0|   2043|
|the-coffee-roaste...|    3.0|   2043|
|  koi-cafe-singapore|    5.0|   2043|
|starbucks-singapo...|    4.0|   2043|
+--------------------+-------+-------+
only showing top 5 rows



In [6]:
#this confirmed earlier analyses (in part 1 sub-notebook) that show that userid 2043 rated 980 outlets.
mbcf.filter("userids = 2043").count()

980

In [7]:
#for the ALS recommender algorithm to work, the user and item columns must be numeric. StringIndexer maps each distinct value in every column except 'ratings' to a frequency-ordered index (0.0 = most frequent value) stored as a double. Although userids is already an int, let's play safe and index both columns.
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in list(set(mbcf.columns)-set(['ratings']))]
pipeline = PL(stages=indexer)
transformed = pipeline.fit(mbcf).transform(mbcf)
transformed.show(5)

+-------------------+-------+-------+-----------+-------------+
|              shops|ratings|userids|shops_index|userids_index|
+-------------------+-------+-------+-----------+-------------+
|hustle-co-singapore|    5.0|    532|      323.0|         50.0|
|hustle-co-singapore|    5.0|   1397|      323.0|         56.0|
|hustle-co-singapore|    5.0|     80|      323.0|       1677.0|
|hustle-co-singapore|    5.0|   2073|      323.0|         90.0|
|hustle-co-singapore|    5.0|   2043|      323.0|          0.0|
+-------------------+-------+-------+-----------+-------------+
only showing top 5 rows
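
As a rough illustration of what `StringIndexer` did above, the pandas sketch below mimics its default frequency-ordered behaviour: the most frequent value gets index 0.0, the next one 1.0, and so on. This is why userid 2043, the most prolific rater with 980 ratings, ends up with `userids_index` 0.0. The helper name `frequency_index` and the toy user ids are hypothetical, and tie-breaking between equally frequent values is deliberately not modelled here.

```python
import pandas as pd

def frequency_index(series):
    """Map each distinct value to a float index, most frequent value first."""
    order = series.value_counts().index  # sorted by descending count
    mapping = {label: float(i) for i, label in enumerate(order)}
    return series.map(mapping), mapping

# Hypothetical user ids: 2043 appears most often, then 532, then 80.
users = pd.Series([2043, 2043, 2043, 532, 532, 80])
indexed, mapping = frequency_index(users)
print(mapping)
print(indexed.tolist())
```

Note that the index carries no meaning beyond identity: it says nothing about the size of the original value, only about how often it occurs.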



In [8]:
#the relevant dtypes are of the correct form now!
transformed.dtypes

[('shops', 'string'),
 ('ratings', 'double'),
 ('userids', 'int'),
 ('shops_index', 'double'),
 ('userids_index', 'double')]

In [9]:
#splitting the dataset into training and test sets where test set will be used for evaluation
(training,test)=transformed.randomSplit([0.8, 0.2], 42)

In [10]:
#looking at the rating classes (target) in the training dataset to determine baseline accuracy...
training.groupBy("ratings").count().show()

+-------+-----+
|ratings|count|
+-------+-----+
|    1.0|    9|
|    4.0| 2681|
|    3.0|  331|
|    2.0|   24|
|    5.0| 2658|
+-------+-----+



<ul>
    
- Baseline accuracy is $\frac{2681\ (majority\ class)}{(9+2681+331+24+2658)} = 0.47$ (rounded off to 2 decimal places)
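
The baseline figure above can be reproduced with a few lines of plain Python. The counts are copied from the `groupBy` output; the variable names are illustrative only.

```python
# Class counts copied from the training.groupBy("ratings").count() output above.
train_counts = {1.0: 9, 2.0: 24, 3.0: 331, 4.0: 2681, 5.0: 2658}

# Majority class = rating with the highest count (sorted() is used instead of
# the built-in max because this notebook imports pyspark's max under that name).
majority_class = sorted(train_counts, key=train_counts.get)[-1]
baseline_accuracy = train_counts[majority_class] / sum(train_counts.values())
print(majority_class, round(baseline_accuracy, 2))
```

Any model worth keeping should beat this score, which is what a predictor that always answers "4 stars" would achieve on the training distribution.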

In [11]:
#instantiating the ALS model
#als = ALS(maxIter=20,userCol='userids_index',itemCol='shops_index', ratingCol='ratings',coldStartStrategy='drop')

In [12]:
#instantiating the pipeline for tuning of ALS hyperparameters
#pipeline = PL(stages=[als])

In [13]:
#setting up the paramGrid
#paramGrid = ParamGridBuilder() \
#    .addGrid(als.rank, [50, 100, 200, 300]) \
#    .addGrid(als.regParam, [0.001, 0.01, 0.1, 0.5, 0.9]) \
#    .build()

In [14]:
#setting up the CrossValidator object for hyperparameter tuning
#crossval = CrossValidator(estimator=pipeline,
                        #  estimatorParamMaps=paramGrid,
                        #  evaluator=RegressionEvaluator(predictionCol='prediction',labelCol='ratings',metricName='r2'),
                        #  numFolds=10)

In [15]:
#fitting/training on the train dataset. The following few cells or code have been commented out since this cell took a long time to tune...
#this tuned model yielded best rank of 200 and best regParam of 0.1 as mentioned below...
#cvModel = crossval.fit(training)
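
A quick count shows why the tuning run above was so slow. The sketch below simply enumerates the grid defined earlier and multiplies by the number of folds; the variable names are illustrative.

```python
from itertools import product

# Hyperparameter values copied from the ParamGridBuilder cell above.
ranks = [50, 100, 200, 300]
reg_params = [0.001, 0.01, 0.1, 0.5, 0.9]
num_folds = 10

grid = list(product(ranks, reg_params))
fits_during_cv = len(grid) * num_folds
print(len(grid), "combinations,", fits_during_cv, "ALS fits during cross-validation")
```

Every one of those fits runs up to 20 ALS iterations with ranks as large as 300, and CrossValidator then refits the best combination once more on the full training set. A smaller grid or fewer folds would cut the run time roughly in proportion.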

In [16]:
#making predictions on the test set
#alspredictions = cvModel.transform(test)

In [17]:
#saving alspredictions to csv for retrieval to test prediction performance later on...
#alspredictions.toPandas().to_csv('yelp_data/als_predictions.csv',index=False)

In [18]:
#something went wrong here while saving cvModel.bestModel for retrieval later on. I should have saved cvModel itself rather than cvModel.bestModel: bestModel is only the fitted PipelineModel, and the cross-validation results needed to identify the best rank and regParam (avgMetrics and the param maps) belong to cvModel, not to the PipelineModel...
#cvModel.bestModel.save("cvModel.model")

In [19]:
#loading a pre-fit (not yet fitted) ALS estimator configured with the best rank = 200 and best regParam = 0.1 found by the crossval-fitted cvModel 4 cells above. The tuned cvModel itself could not be reloaded here because it was saved in the wrong format, and re-running the CrossValidator was not an option as it took close to a full day to run...
cvModel = ALS.load("yelp_data/als_rec_prefit.model")

In [20]:
#reading in alspredictions to test performance of ALS on test set.
alspredictions = spark.read.csv('yelp_data/als_predictions.csv', header=True, inferSchema=True)

In [21]:
#checking out the first few rows of the prediction df
alspredictions.show(3)

+--------------------+-------+-------+-----------+-------------+----------+
|               shops|ratings|userids|shops_index|userids_index|prediction|
+--------------------+-------+-------+-----------+-------------+----------+
|the-bao-makers-si...|    4.0|   2043|      148.0|          0.0| 3.9331024|
|the-coffee-shot-s...|    4.0|   2326|      471.0|        124.0| 3.9336052|
|little-farms-cafe...|    4.0|   2043|      496.0|          0.0| 3.8955767|
+--------------------+-------+-------+-----------+-------------+----------+
only showing top 3 rows



In [22]:
#rounding the predictions to a whole number (just like the original or actual ratings) to prepare it for evaluation by MulticlassClassificationEvaluator using f1 score metric
df2 = alspredictions.withColumn("prediction_rounded", func.round(alspredictions["prediction"],0))

In [23]:
#taking a look at the first few rows of the prediction df with the additional rounded prediction column...
df2.show(3)

+--------------------+-------+-------+-----------+-------------+----------+------------------+
|               shops|ratings|userids|shops_index|userids_index|prediction|prediction_rounded|
+--------------------+-------+-------+-----------+-------------+----------+------------------+
|the-bao-makers-si...|    4.0|   2043|      148.0|          0.0| 3.9331024|               4.0|
|the-coffee-shot-s...|    4.0|   2326|      471.0|        124.0| 3.9336052|               4.0|
|little-farms-cafe...|    4.0|   2043|      496.0|          0.0| 3.8955767|               4.0|
+--------------------+-------+-------+-----------+-------------+----------+------------------+
only showing top 3 rows



In [24]:
#checking out the data types of the various columns
df2.dtypes

[('shops', 'string'),
 ('ratings', 'double'),
 ('userids', 'int'),
 ('shops_index', 'double'),
 ('userids_index', 'double'),
 ('prediction', 'double'),
 ('prediction_rounded', 'double')]

In [25]:
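#MulticlassClassificationEvaluator expects double-typed prediction and label columns. prediction_rounded is already a double, so StringIndexer does not change its type here: it re-encodes each distinct rounded value as a frequency-ordered class index (0.0 = most frequent).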
indexer_1 = StringIndexer(inputCol="prediction_rounded", outputCol='prediction_rounded_dbl')

In [26]:
#fitting the indexer on the rounded test-set predictions and adding the indexed column
df2 = indexer_1.fit(df2).transform(df2)

In [27]:
#taking a look at the first few rows of the df with the newly added column
df2.select("shops","ratings","prediction_rounded","prediction_rounded_dbl").show(5,False)

+---------------------------------------+-------+------------------+----------------------+
|shops                                  |ratings|prediction_rounded|prediction_rounded_dbl|
+---------------------------------------+-------+------------------+----------------------+
|the-bao-makers-singapore               |4.0    |4.0               |0.0                   |
|the-coffee-shot-singapore              |4.0    |4.0               |0.0                   |
|little-farms-cafe-singapore            |4.0    |4.0               |0.0                   |
|lam-yeo-coffee-powder-fty-singapore    |5.0    |5.0               |1.0                   |
|nassim-hill-bakery-bistro-bar-singapore|4.0    |4.0               |0.0                   |
+---------------------------------------+-------+------------------+----------------------+
only showing top 5 rows



In [28]:
#indexing the actual ratings in the same way so that both columns hold class indices.
#note: this indexer is fit separately from indexer_1, so a rating only gets the same index in both columns if its frequency rank is the same in both
indexer_2 = StringIndexer(inputCol="ratings", outputCol='ratings_dbl')

In [29]:
#implementing the conversion
df2 = indexer_2.fit(df2).transform(df2)

In [30]:
#checking out the first few rows of the prediction df again
df2.select("shops","ratings_dbl","prediction_rounded","prediction_rounded_dbl").show(5,False)

+---------------------------------------+-----------+------------------+----------------------+
|shops                                  |ratings_dbl|prediction_rounded|prediction_rounded_dbl|
+---------------------------------------+-----------+------------------+----------------------+
|the-bao-makers-singapore               |0.0        |4.0               |0.0                   |
|the-coffee-shot-singapore              |0.0        |4.0               |0.0                   |
|little-farms-cafe-singapore            |0.0        |4.0               |0.0                   |
|lam-yeo-coffee-powder-fty-singapore    |1.0        |5.0               |1.0                   |
|nassim-hill-bakery-bistro-bar-singapore|0.0        |4.0               |0.0                   |
+---------------------------------------+-----------+------------------+----------------------+
only showing top 5 rows
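
One caveat of indexing the two columns separately is worth illustrating. The pandas sketch below uses hypothetical toy data and a hypothetical `frequency_index` helper that mimics frequency-ordered indexing. It shows that when a class has a different frequency rank among the predictions than among the actual ratings, the two independently built mappings disagree and an accuracy computed on the indices no longer matches the accuracy on the original values.

```python
import pandas as pd

def frequency_index(series):
    """Mimic frequency-ordered indexing: most frequent value -> 0.0."""
    order = series.value_counts().index
    return {label: float(i) for i, label in enumerate(order)}

# Hypothetical actual ratings and rounded predictions for 15 rows.
ratings = pd.Series([4.0] * 6 + [5.0] * 4 + [3.0] * 3 + [1.0] * 2)
preds = pd.Series([4.0] * 6 + [5.0] * 4 + [3.0, 3.0, 1.0] + [1.0] * 2)

true_acc = (ratings == preds).mean()

# Independent mappings: rating 3.0 -> 2.0 but prediction 3.0 -> 3.0.
rating_map = frequency_index(ratings)
pred_map = frequency_index(preds)
indexed_acc = (ratings.map(rating_map) == preds.map(pred_map)).mean()

# One shared mapping applied to both columns keeps the classes aligned.
shared_acc = (ratings.map(rating_map) == preds.map(rating_map)).mean()

print(round(true_acc, 3), round(indexed_acc, 3), round(shared_acc, 3))
```

In this notebook the Spark accuracy of 0.9686 corresponds to 956 of 987 rows, whereas the sklearn confusion matrices further down count 960 correct rows. A misalignment of the two rarest rating classes is one possible, unverified explanation for that gap of 4 rows. Fitting a single indexer and reusing it for both columns, or passing the double-typed `ratings` and `prediction_rounded` columns to the evaluator directly, would be ways to rule it out.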



In [80]:
#instantiating the MulticlassClassificationEvaluator with accuracy score as the metric to compare against baseline accuracy first. 
evaluator_b = MulticlassClassificationEvaluator(predictionCol='prediction_rounded_dbl',
                                                labelCol='ratings_dbl',metricName='accuracy')

In [31]:
#instantiating the MulticlassClassificationEvaluator with f1 score as the metric of choice. 
#F1 score was chosen as both precision and recall are important here-we wouldn't want to miss out 
#on good recommendations and neither do we want to recommend something that shouldn't have been recommended.
evaluator_c = MulticlassClassificationEvaluator(predictionCol='prediction_rounded_dbl',
                                                labelCol='ratings_dbl',metricName='f1')

In [81]:
#accuracy is 0.97!
print("Accuracy is ", evaluator_b.evaluate(df2))

Accuracy is  0.9685916919959473


In [32]:
print("F1 score for rating predictions by tuned ALS model-based collaborative filtering: ",evaluator_c.evaluate(df2))

F1 score for rating predictions by tuned ALS model-based collaborative filtering:  0.9820064562401125


<ul>
    
- The tuned ALS model performed really well with a high $F_1$ score of 0.98!

In [33]:
#saving df2 with selected columns for reading in again for other evaluation below
df2.select("shops","ratings","prediction_rounded", "prediction","userids").toPandas().to_csv("yelp_data/mbcf_ALS.csv",index=False)

In [34]:
#reading in mbcf_ALS for other evaluation
mbcf_als = pd.read_csv("yelp_data/mbcf_ALS.csv")
mbcf_als.head(3)

Unnamed: 0,shops,ratings,prediction_rounded,prediction,userids
0,the-bao-makers-singapore,4.0,4.0,3.933102,2043
1,the-coffee-shot-singapore,4.0,4.0,3.933605,2326
2,little-farms-cafe-singapore,4.0,4.0,3.895577,2043


In [35]:
#checking out the value_counts of rounded predictions; ALS predictions are not bounded to the 1-5 rating scale, so rounding produced classes 0.0 and -1.0 that don't exist among the actual ratings. Let's map them to rating 1.
mbcf_als['prediction_rounded'].value_counts()

 4.0    496
 5.0    405
 3.0     55
 0.0     21
-1.0      7
 2.0      2
 1.0      1
Name: prediction_rounded, dtype: int64

In [36]:
#checking out the value_counts of actual ratings
mbcf_als['ratings'].value_counts()

4.0    504
5.0    419
3.0     60
2.0      2
1.0      2
Name: ratings, dtype: int64

In [37]:
#getting the indices of the rows whose rounded predictions are out of range (0.0 or -1.0)
wrong_encoding_lst = mbcf_als[(mbcf_als['prediction_rounded']==0.0) | (mbcf_als['prediction_rounded']==-1.0)]['prediction_rounded'].index

In [38]:
#setting all 0.0 and -1.0 rounded predictions to 1.0 instead
mbcf_als.loc[wrong_encoding_lst,'prediction_rounded'] = 1.0

In [39]:
#confirming that the encoding has been done
mbcf_als['prediction_rounded'].value_counts()

4.0    496
5.0    405
3.0     55
1.0     29
2.0      2
Name: prediction_rounded, dtype: int64
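
An alternative to looking up and overwriting the out-of-range rows by hand is to clip the rounded predictions to the rating scale. The sketch below uses hypothetical raw prediction values; on this test set it would give the same result as the cells above, since no rounded prediction exceeded 5.0.

```python
import pandas as pd

# Hypothetical raw ALS predictions, including values outside the 1-5 scale.
raw = pd.Series([-0.7, 0.2, 3.93, 4.87, 5.6])

rounded = raw.round(0)
# Anything below 1.0 becomes 1.0 and anything above 5.0 becomes 5.0.
clipped = rounded.clip(lower=1.0, upper=5.0)
print(clipped.tolist())
```

Clipping also guards against predictions above 5.0, which the manual replacement of 0.0 and -1.0 would not catch.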

In [40]:
#saving as csv and then re-reading it to use spark evaluator to recalculate the F1 score...
mbcf_als.to_csv('yelp_data/mbcf_ALS_mod.csv',index=False)

In [41]:
#reading in the modified "mbcf_ALS" csv for evaluation of F1 using multiclassclassificationevaluator
df2_mod = spark.read.csv('yelp_data/mbcf_ALS_mod.csv', header=True, inferSchema=True)

In [42]:
#re-indexing both the modified prediction_rounded and the actual ratings columns with StringIndexer
#before passing them to MulticlassClassificationEvaluator again
indexer_3 = StringIndexer(inputCol="ratings", outputCol='ratings_double_pre')

In [43]:
#implementing the conversion
df3_mod = indexer_3.fit(df2_mod).transform(df2_mod)

In [44]:
#instantiating indexer 4 to index the modified prediction_rounded column
indexer_4 = StringIndexer(inputCol="prediction_rounded", outputCol='prediction_rounded_double_pre')

In [45]:
#implementing the conversion
df4_mod = indexer_4.fit(df3_mod).transform(df3_mod)

In [82]:
#instantiating the MulticlassClassificationEvaluator with accuracy score to compare against baseline accuracy first, to see if correcting the out-of-range rating classes has changed the accuracy score
evaluator_c_1 = MulticlassClassificationEvaluator(predictionCol='prediction_rounded_double_pre',
                                                labelCol='ratings_double_pre',metricName='accuracy')

In [46]:
#instantiating the MulticlassClassificationEvaluator with f1 score as the metric of choice. 
#F1 score was chosen as both precision and recall are important here-we wouldn't want to miss out 
#on good recommendations and neither do we want to recommend something that shouldn't have been recommended.
evaluator_d = MulticlassClassificationEvaluator(predictionCol='prediction_rounded_double_pre',
                                                labelCol='ratings_double_pre',metricName='f1')

In [83]:
#accuracy after amendment of rating class encoding is still 0.97!
print("Accuracy is ", evaluator_c_1.evaluate(df4_mod))

Accuracy is  0.9685916919959473


In [47]:
#F1 score remained unchanged!
print("F1 score for slightly-modified rating predictions by tuned ALS model-based collaborative filtering: ",evaluator_d.evaluate(df4_mod))

F1 score for slightly-modified rating predictions by tuned ALS model-based collaborative filtering:  0.9820064562401125


In [48]:
#saving to pandas df for re-reading in to compute other evaluation scores as pandas and sklearn is easier for me to handle at the moment...
df4_mod.toPandas().to_csv('yelp_data/mbcf_ALS_final_corrected.csv',index=False)

In [49]:
#reading in final corrected mbcf_ALS_final_corrected.csv for evaluation of confusion matrix, micro-avg precision and recall
mbcf_als_final_corrected = pd.read_csv('yelp_data/mbcf_ALS_final_corrected.csv')

In [50]:
mbcf_als_final_corrected.prediction_rounded.value_counts()

4.0    496
5.0    405
3.0     55
1.0     29
2.0      2
Name: prediction_rounded, dtype: int64

In [51]:
mbcf_als_final_corrected.ratings.value_counts()

4.0    504
5.0    419
3.0     60
2.0      2
1.0      2
Name: ratings, dtype: int64

In [52]:
#having a rough look at the confusion matrix for model-based collaborative filtering via ALS...
multilabel_confusion_matrix(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded)

array([[[958,  27],
        [  0,   2]],

       [[985,   0],
        [  0,   2]],

       [[927,   0],
        [  5,  55]],

       [[483,   0],
        [  8, 496]],

       [[568,   0],
        [ 14, 405]]])

<ul>
    
- Most of the false positives for rating class 1 come from the correction above, where rounded predictions of 0.0 and -1.0 were mapped to rating 1.0. This is still acceptable since those predictions were already misclassified in the first place, and correcting them did not change the F1 score calculated with Spark's evaluator_d earlier. It should nonetheless be noted as a model limitation/assumption.

In [53]:
#apart from rating class 1.0, the ALS model seemed to be able to predict the rating classes well!
print(classification_report(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded))

              precision    recall  f1-score   support

         1.0       0.07      1.00      0.13         2
         2.0       1.00      1.00      1.00         2
         3.0       1.00      0.92      0.96        60
         4.0       1.00      0.98      0.99       504
         5.0       1.00      0.97      0.98       419

    accuracy                           0.97       987
   macro avg       0.81      0.97      0.81       987
weighted avg       1.00      0.97      0.98       987



## Defining functions for evaluation of model (confusion matrix, micro-average precision, recall)
---

In [54]:
#defining function for obtaining tn, fp, fn, tp for each rating class for feeding into micro-avg precision and recall functions defined below
def cm_spec(y_true,y_pred,rating,state):
    if state=='tn':
        return multilabel_confusion_matrix(y_true,y_pred)[rating-1][0][0]
    elif state=='fp':
        return multilabel_confusion_matrix(y_true,y_pred)[rating-1][0][1]
    elif state=='fn':
        return multilabel_confusion_matrix(y_true,y_pred)[rating-1][1][0]
    else:
        return multilabel_confusion_matrix(y_true,y_pred)[rating-1][1][1]
    

In [55]:
#defining function for obtaining micro-avg precision: total tp over all 5 rating classes divided by total (tp + fp)
def micro_avg_precision(y_true,y_pred):
    tp = sum(cm_spec(y_true,y_pred,rating,'tp') for rating in range(1,6))
    fp = sum(cm_spec(y_true,y_pred,rating,'fp') for rating in range(1,6))
    return tp/(tp+fp)

In [56]:
#defining function for obtaining micro-avg recall: total tp over all 5 rating classes divided by total (tp + fn)
def micro_avg_recall(y_true,y_pred):
    tp = sum(cm_spec(y_true,y_pred,rating,'tp') for rating in range(1,6))
    fn = sum(cm_spec(y_true,y_pred,rating,'fn') for rating in range(1,6))
    return tp/(tp+fn)

In [57]:
#defining function for obtaining micro_avg_f1
def micro_avg_f1(y_true,y_pred):
    return 2 * ((micro_avg_precision(y_true,y_pred) * micro_avg_recall(y_true,y_pred))/(micro_avg_precision(y_true,y_pred) + micro_avg_recall(y_true,y_pred)))
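
The hand-rolled functions above can be cross-checked against scikit-learn, which offers the same micro-averaged metrics through the `average="micro"` argument. The labels below are hypothetical toy data, used only to show that micro-averaged precision, recall and F1 all coincide with plain accuracy when every row has exactly one true class and one predicted class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical actual and predicted rating classes for 10 rows (7 are correct).
y_true = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5]
y_pred = [1, 1, 3, 4, 4, 4, 4, 5, 5, 4]

micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")
micro_f = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)
print(micro_p, micro_r, micro_f, acc)
```

The reason is that every misclassified row counts once as a false positive (for the predicted class) and once as a false negative (for the actual class), so the pooled false positives and false negatives are always equal.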

In [58]:
#function to print out confusion matrix breakdown for each rating class
def confusion_breakdown(y_true,y_pred,rating):
    print("True negatives for rating {}: {}".format(
        rating,multilabel_confusion_matrix(y_true,y_pred)[rating-1][0][0]))
    print("False positives for rating {}: {}".format(
        rating,multilabel_confusion_matrix(y_true,y_pred)[rating-1][0][1]))
    print("False negatives for rating {}: {}".format(
        rating,multilabel_confusion_matrix(y_true,y_pred)[rating-1][1][0]))
    print("True positives for rating {}: {}".format(
        rating,multilabel_confusion_matrix(y_true,y_pred)[rating-1][1][1]))
    return "******************************************"

In [59]:
print(confusion_breakdown(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded,1))
print(confusion_breakdown(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded,2))
print(confusion_breakdown(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded,3))
print(confusion_breakdown(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded,4))
print(confusion_breakdown(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded,5))

True negatives for rating 1: 958
False positives for rating 1: 27
False negatives for rating 1: 0
True positives for rating 1: 2
******************************************
True negatives for rating 2: 985
False positives for rating 2: 0
False negatives for rating 2: 0
True positives for rating 2: 2
******************************************
True negatives for rating 3: 927
False positives for rating 3: 0
False negatives for rating 3: 5
True positives for rating 3: 55
******************************************
True negatives for rating 4: 483
False positives for rating 4: 0
False negatives for rating 4: 8
True positives for rating 4: 496
******************************************
True negatives for rating 5: 568
False positives for rating 5: 0
False negatives for rating 5: 14
True positives for rating 5: 405
******************************************
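
The counts printed above are enough to work out the averaged metrics by hand. The sketch below copies the true positives, false positives and false negatives from the breakdown and recomputes the micro-averaged precision and recall, along with the macro and weighted $F_1$ averages shown in the classification report. Variable names are illustrative.

```python
# (tp, fp, fn) per rating class, copied from the confusion breakdown printed above.
counts = {1: (2, 27, 0), 2: (2, 0, 0), 3: (55, 0, 5), 4: (496, 0, 8), 5: (405, 0, 14)}

tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)

# Per-class F1 = 2*tp / (2*tp + fp + fn); support = tp + fn (actual rows in the class).
f1 = {k: 2 * t / (2 * t + p + n) for k, (t, p, n) in counts.items()}
support = {k: t + n for k, (t, p, n) in counts.items()}
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[k] * support[k] for k in f1) / sum(support.values())

print(tp, fp, fn)
print(round(micro_precision, 4), round(micro_recall, 4))
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is pulled down by rating class 1, whose 27 false positives outweigh its 2 true positives, while the weighted average barely notices it because that class has only 2 actual rows.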


In [60]:
#can tell it is able to predict correctly most of the rating classes except for the false positives in rating class 1.0 due to the manual class correction earlier.
mbcf_als_final_corrected.ratings.value_counts()

4.0    504
5.0    419
3.0     60
2.0      2
1.0      2
Name: ratings, dtype: int64

In [61]:
print("Tuned ALS yielded micro_avg_precision of ", micro_avg_precision(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded))

Tuned ALS yielded micro_avg_precision of  0.9726443768996961


In [62]:
print("Tuned ALS yielded micro_avg_recall of ", micro_avg_recall(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded))


Tuned ALS yielded micro_avg_recall of  0.9726443768996961


In [63]:
print("Tuned ALS yielded micro_avg_f1 of ", micro_avg_f1(mbcf_als_final_corrected.ratings,mbcf_als_final_corrected.prediction_rounded))

Tuned ALS yielded micro_avg_f1 of  0.9726443768996961


<ul>
    
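- The micro-averaged $F_1$ score computed outside of Spark is 0.97, the same as the accuracy (for single-label multiclass predictions, micro-averaged precision, recall and $F_1$ all equal accuracy). The 0.98 reported by Spark's evaluator is a weighted $F_1$, which is in line with the weighted average $F_1$ of 0.98 in the classification report above.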

In [64]:
#filtering out only userid2043 for comparison with content-based filtering later on
userid2043_mbcf_test_rev = mbcf_als_final_corrected[mbcf_als_final_corrected['userids']==2043][['shops','ratings','prediction_rounded','prediction']]
userid2043_mbcf_test_rev.head()

Unnamed: 0,shops,ratings,prediction_rounded,prediction
0,the-bao-makers-singapore,4.0,4.0,3.933102
2,little-farms-cafe-singapore,4.0,4.0,3.895577
3,lam-yeo-coffee-powder-fty-singapore,5.0,5.0,4.869991
17,stateland-cafe-singapore,5.0,5.0,4.882576
19,coffee-club-singapore-5,5.0,5.0,4.850962


In [65]:
userid2043_mbcf_test_rev.shape

(185, 4)

In [66]:
#saving this df to csv for easier retrieval later on for comparison and fusion with content-based filtering's rating predictions
userid2043_mbcf_test_rev.to_csv('yelp_data/userid2043_mbcf_pred_actual.csv', index=False)

In [67]:
#old cells, kept for ref only
#df2.show(5,False)

In [68]:
#old cells, kept for ref only..
#having a look at how some of the predictions compare with the actual ratings
#df2.filter("userids = 2043").select("shops","ratings","prediction_rounded","prediction").sort("prediction", ascending=False).show(5, False)

In [69]:
#old cells, kept for ref only...
#filtering out only userid2043 for comparison with content-based filtering later on, converting to pandas df for easier manipulation and merging with content-based rating predictions
#userid2043_mbcf_test = df2.filter("userids = 2043").select("shops","ratings","prediction_rounded","prediction").toPandas()

In [70]:
#old cells, kept for ref only...
#taking a look at the first few rows of the prediction vs actual ratings ("ratings") for ALS tuned with CrossValidator
#userid2043_mbcf_test.head()

In [71]:
#old cells, kept for ref only...
#userid2043_mbcf_test.shape

In [72]:
#saving this df to csv for easier retrieval later on for comparison and fusion with content-based filtering's rating predictions
#userid2043_mbcf_test.to_csv('yelp_data/userid2043_mbcf_pred_actual.csv', index=False)

In [73]:
#old method that would have worked had I saved the cvModel up above correctly...
#extracting the index of the best params
#best_model_params_idx = np.argmax(cvModel.avgMetrics)

In [74]:
#old method that would have worked, see above cell
#extracting the tuned hyperparameters of the best model
#best_rank = list(cvModel.getEstimatorParamMaps()[best_model_params_idx].values())[0]
#best_regParam = list(cvModel.getEstimatorParamMaps()[best_model_params_idx].values())[1]

In [75]:
#reading the tuned hyperparameters directly from the loaded ALS estimator instead of slicing the explainParam() text
best_rank = cvModel.getRank()
best_regParam = cvModel.getRegParam()

In [76]:
best_rank

200

In [77]:
best_regParam

0.1

<ul>
    
- The best rank is 200 while best regParam is 0.1. Fed these into the ALS instantiation below and saved it for future use.

In [78]:
#instantiating ALS with the tuned hyperparameters of the best model
#als_rec = ALS(rank=best_rank,regParam=best_regParam,
#              maxIter=20,userCol='userids_index',
#              itemCol='shops_index', ratingCol='ratings',
#              coldStartStrategy='drop')

In [79]:
#dumping the instantiated model above for later use in another jupyter notebook
#als_rec.save("yelp_data/als_rec_prefitted.model")

## Model Summary and Result Interpretation
---

<ul>
    
- Except for rating 1 which contains a number of false positives, the tuned model-based ALS was able to predict the rating classes well!
- Accuracy of 0.97, Micro-Averaged precision of 0.97, Micro-Averaged recall of 0.97, Micro-Averaged $F_1$ of 0.97

## Source(s)
---

- https://java.com/en/download/help/download_options.xml
- https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
- https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
- https://github.com/sbt/sbt/releases/download/v0.13.17/sbt-0.13.17.tgz
- https://spark.apache.org/downloads.html
- https://medium.com/luckspark/installing-spark-2-3-0-on-macos-high-sierra-276a127b8b85