# Part 8: Hybrid Recommender Evaluation on the Test Set for userid 2043
---

- Importing the relevant libraries first...


In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, multilabel_confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
import joblib

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Importing X_legit and y, which contain the shops that userid 2043 rated and the corresponding ratings
---

In [2]:
X_legit = pd.read_csv('yelp_data/xlegit.csv')
X_legit.shape

(980, 19307)

In [3]:
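#squeeze the single-column DataFrame into a Series (read_csv's squeeze=True argument was removed in pandas 2.0)
y = pd.read_csv('yelp_data/y.csv').squeeze('columns')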
y.head()

0    4.0
1    4.0
2    5.0
3    5.0
4    4.0
Name: user_rating, dtype: float64

In [4]:
#split the dataset into train and test sets first
X_train, X_test, y_train, y_test = train_test_split(X_legit, y, test_size=0.2, random_state=42, stratify=y)

In [5]:
y_train.value_counts(normalize=True)

4.0    0.479592
5.0    0.383929
3.0    0.118622
2.0    0.012755
1.0    0.005102
Name: user_rating, dtype: float64

<ul>
    
- The baseline accuracy is 0.48: always predicting the majority class (rating 4.0) would be correct for 48% of the training ratings, the highest proportion among the target classes.

</ul>
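
The 0.48 baseline is the accuracy of a model that always predicts the most frequent training rating. A minimal sketch, with the class proportions copied from the `value_counts` output above and hypothetical variable names:

```python
# Class proportions of y_train, copied from the value_counts(normalize=True) output
train_proportions = {4.0: 0.479592, 5.0: 0.383929, 3.0: 0.118622, 2.0: 0.012755, 1.0: 0.005102}

# A majority-class predictor always outputs the most frequent rating,
# so its expected accuracy equals that rating's share of the data
majority_rating = max(train_proportions, key=train_proportions.get)
baseline_accuracy = train_proportions[majority_rating]

print(majority_rating, round(baseline_accuracy, 2))  # 4.0 0.48
```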

In [6]:
#instantiate scaler since not all of the features are of the same scale, eg. review_count and avg_store_rating
ss = StandardScaler()

In [7]:
#fitting the train and transforming both the train and test sets
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
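
Fitting the scaler on the training set only keeps test-set statistics from leaking into the features. The sketch below mimics that fit/transform split in plain Python on made-up numbers. The helper names are hypothetical, and it uses the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training data only."""
    return mean(train_column), pstdev(train_column)

def transform(column, mu, sigma):
    """Standardize a column with previously learned statistics."""
    return [(x - mu) / sigma for x in column]

# Made-up review counts, for illustration only
train_review_count = [10.0, 20.0, 30.0, 40.0]
test_review_count = [25.0, 50.0]

mu, sigma = fit_scaler(train_review_count)
train_sc = transform(train_review_count, mu, sigma)
test_sc = transform(test_review_count, mu, sigma)  # reuses the training statistics

print(mu, round(sigma, 4))
print([round(v, 4) for v in test_sc])
```

The scaled training column has mean 0 by construction, whereas the test column generally does not, because it is expressed in the training set's units.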

In [8]:
#reading in reconstructed_X_test for userid 2043 from content-based filtering for comparison later on...
userid2043_cb_pred_actual_X_test = pd.read_csv('yelp_data/userid2043_cb_pred_actual.csv')

In [9]:
#checking out the dimensions of the read-in content-based filtering dataset....
userid2043_cb_pred_actual_X_test.shape

(196, 3)

In [10]:
#checking out the first few rows of the content-based filtering dataset....
userid2043_cb_pred_actual_X_test.head(3)

Unnamed: 0,shops,predicted_ratings,actual_ratings
0,shops_got-luck-cafe-singapore,4.0,4.0
1,shops_symmetry-singapore,4.0,4.0
2,shops_the-bread-project-singapore,4.0,4.0


In [11]:
#cleaning the shops column to remove the "shops_" prefix for easier merging later on...
userid2043_cb_pred_actual_X_test['shops'] = userid2043_cb_pred_actual_X_test['shops'].apply(lambda x: x[6:])

In [12]:
#confirming that the change has been made
userid2043_cb_pred_actual_X_test.head(3)

Unnamed: 0,shops,predicted_ratings,actual_ratings
0,got-luck-cafe-singapore,4.0,4.0
1,symmetry-singapore,4.0,4.0
2,the-bread-project-singapore,4.0,4.0
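
Slicing with `x[6:]` assumes every value starts with the six-character `shops_` prefix. A slightly more defensive sketch (the helper name `strip_prefix` is hypothetical) only removes the prefix when it is actually present:

```python
def strip_prefix(name, prefix="shops_"):
    """Remove `prefix` from `name` if present; otherwise return `name` unchanged."""
    return name[len(prefix):] if name.startswith(prefix) else name

print(strip_prefix("shops_symmetry-singapore"))  # symmetry-singapore
print(strip_prefix("symmetry-singapore"))        # symmetry-singapore
```

Such a helper could be passed to `.apply()` in place of the lambda, which also makes the cell safe to re-run.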


In [13]:
#reading in the collaborative filtering dataset for userid2043...
userid2043_mbcf_pred_actual_test = pd.read_csv('yelp_data/userid2043_mbcf_pred_actual.csv')

In [14]:
#checking out the dimensions of the model-based collaborative filtering dataset...
userid2043_mbcf_pred_actual_test.shape

(185, 4)

In [15]:
#checking out the first few rows of the model-based collaborative filtering dataset...
userid2043_mbcf_pred_actual_test.head(3)

Unnamed: 0,shops,ratings,prediction_rounded,prediction
0,the-bao-makers-singapore,4.0,4.0,3.933102
1,little-farms-cafe-singapore,4.0,4.0,3.895577
2,lam-yeo-coffee-powder-fty-singapore,5.0,5.0,4.869991


In [17]:
#let's merge both content-based filtering and model-based collaborative filtering predictions for userid 2043 together!
con_collab_2043_tst = pd.merge(userid2043_cb_pred_actual_X_test,userid2043_mbcf_pred_actual_test,how="left",on='shops')

In [18]:
#checking out the dimensions of the merged dataset...
con_collab_2043_tst.shape

(196, 6)

In [19]:
#keeping only common shops present in both content-based and collaborative filtering...
con_collab_2043_tst.dropna(inplace=True)

In [20]:
#looks like we got 36 outlets in common between content-based and model-based collaborative filtering to work with for userid 2043...
con_collab_2043_tst.shape

(36, 6)
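
A left merge followed by `dropna()` keeps only the shops present in both tables, which is what an inner merge does directly. A small sketch on made-up frames (shop names and values are illustrative only), assuming neither input has missing values of its own:

```python
import pandas as pd

# Made-up content-based and collaborative filtering predictions
cb = pd.DataFrame({"shops": ["a", "b", "c"], "predicted_ratings": [4.0, 5.0, 3.0]})
mbcf = pd.DataFrame({"shops": ["b", "c", "d"], "prediction": [4.9, 2.9, 3.8]})

# Approach used in this notebook: left merge, then drop rows with NaNs
left_then_drop = pd.merge(cb, mbcf, how="left", on="shops").dropna()

# Equivalent result in one step: inner merge
inner = pd.merge(cb, mbcf, how="inner", on="shops")

print(left_then_drop["shops"].tolist())
print(inner["shops"].tolist())
```

The two results contain the same rows. The only difference is the index: the left-merge route keeps the original row labels, which is why the merged table above shows index values such as 1, 3 and 15.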

In [21]:
#checking out the first few rows of the merged dataset that has been trimmed of NaNs...
con_collab_2043_tst.head(3)

Unnamed: 0,shops,predicted_ratings,actual_ratings,ratings,prediction_rounded,prediction
1,symmetry-singapore,4.0,4.0,4.0,4.0,3.922104
3,dean-and-deluca-singapore-4,4.0,4.0,4.0,4.0,3.934483
15,krispy-kreme-singapore-2,4.0,3.0,3.0,3.0,2.93391


In [22]:
#loading decisiontreeclassifier model
loaded_model = joblib.load('yelp_data/dtc_gs_model.sav')

In [23]:
#decisiontreeclassifier for content-based filtering had a test accuracy score of 0.85
loaded_model.best_estimator_.score(X_test_sc,y_test)

0.8469387755102041

<ul>
    
- We suggest taking the baseline score for the hybrid recommender as the average of the content-based (0.48) and collaborative filtering (0.47) baseline accuracies, i.e. 
    
    $\text{Hybrid recommender baseline accuracy} = \frac{0.48 + 0.47}{2} = 0.475 \approx 0.48$

</ul>

<img src="yelp_data/micro-avg_for_dtc.png"/>

<ul>
    
- The above shows that micro-averaged precision and recall on the test set are equal for the DecisionTreeClassifier (user-centered content-based filtering). The corresponding micro-averaged $F_1$ score, i.e. the harmonic mean of micro-averaged precision and recall, is:
    
   DecisionTreeClassifier's micro-averaged $F_1 = 2 \times \frac{\text{micro-averaged precision} \times \text{micro-averaged recall}}{\text{micro-averaged precision} + \text{micro-averaged recall}} = 2 \times \frac{0.85 \times 0.85}{0.85 + 0.85} = 0.85$

</ul>
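
Micro-averaged precision and recall coincide here because every outlet has exactly one true and one predicted rating: each misclassification counts as one false positive (for the predicted class) and one false negative (for the true class). A small pure-Python sketch on made-up labels, with a hypothetical helper name:

```python
def micro_precision_recall(y_true, y_pred):
    """Pool tp, fp and fn over all classes, then compute precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return tp / (tp + fp), tp / (tp + fn)

# Made-up ratings, for illustration only
y_true = [4, 4, 5, 3, 5, 2]
y_pred = [4, 5, 5, 3, 4, 2]

precision, recall = micro_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```

All three printed values are identical, so in this single-label setting the micro-averaged $F_1$ is simply the accuracy.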


<img src="yelp_data/ALS_F1_score.png"/>

<ul>
    
- The above image shows the $F_1$ score of the model-based collaborative filtering (ALS) on the test set: $0.98$

</ul>


In [24]:
#since we are not tuning the models further, let's use the respective models' F1 scores to weight each model's rating predictions!
con_wt = 0.85 / (0.85 + 0.98)
collab_wt = 0.98 / (0.85 + 0.98)

In [25]:
#creating a new column containing the weighted sum of rating predictions from content-based and collaborative filtering
con_collab_2043_tst['final_rating_predictions'] = (con_collab_2043_tst['predicted_ratings']*con_wt) + (con_collab_2043_tst['prediction']*collab_wt)

In [26]:
#checking out the new df with added column...
con_collab_2043_tst.head(3)

Unnamed: 0,shops,predicted_ratings,actual_ratings,ratings,prediction_rounded,prediction,final_rating_predictions
1,symmetry-singapore,4.0,4.0,4.0,4.0,3.922104,3.958285
3,dean-and-deluca-singapore-4,4.0,4.0,4.0,4.0,3.934483,3.964914
15,krispy-kreme-singapore-2,4.0,3.0,3.0,3.0,2.93391,3.429088
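
Each weight is a model's share of the combined $F_1$ scores, so the two weights sum to 1 and the blended prediction stays on the 1 to 5 rating scale. The arithmetic for the `krispy-kreme-singapore-2` row can be checked by hand. The function name `blend` below is hypothetical:

```python
CB_F1, MBCF_F1 = 0.85, 0.98  # test-set F1 scores of the two models, from above

con_wt = CB_F1 / (CB_F1 + MBCF_F1)
collab_wt = MBCF_F1 / (CB_F1 + MBCF_F1)

def blend(cb_rating, mbcf_rating):
    """Weighted sum of the content-based and collaborative rating predictions."""
    return cb_rating * con_wt + mbcf_rating * collab_wt

print(round(con_wt, 4), round(collab_wt, 4))  # 0.4645 0.5355
print(round(blend(4.0, 2.93391), 6))          # 3.429088
```

Because the ALS model has the higher $F_1$, its prediction of 2.93 pulls the blended score below 3.5, which is why this row ends up closer to the actual rating of 3.0 than the content-based prediction of 4.0.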


In [27]:
#rounding the computed final rating predictions to 0 decimal place so that it can be compared to the actual ratings (which are also discrete whole numbers) via the f1 score...
con_collab_2043_tst['final_rating_predictions_rd'] = con_collab_2043_tst['final_rating_predictions'].round(0)

In [28]:
#checking out the first few rows of the df containing the rounded prediction column...
con_collab_2043_tst.head(3)

Unnamed: 0,shops,predicted_ratings,actual_ratings,ratings,prediction_rounded,prediction,final_rating_predictions,final_rating_predictions_rd
1,symmetry-singapore,4.0,4.0,4.0,4.0,3.922104,3.958285,4.0
3,dean-and-deluca-singapore-4,4.0,4.0,4.0,4.0,3.934483,3.964914,4.0
15,krispy-kreme-singapore-2,4.0,3.0,3.0,3.0,2.93391,3.429088,3.0
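
Rounding maps the blended score back onto the discrete 1 to 5 scale. Note that Python's `round()` and NumPy/pandas rounding send exact halves to the nearest even integer, so 2.5 becomes 2 while 3.5 becomes 4. If half-up behavior and a guaranteed 1 to 5 range are preferred, a hypothetical helper (not part of the original notebook) could look like this:

```python
import math

def to_rating(score, lowest=1, highest=5):
    """Round half up, then clip to the valid rating range."""
    rounded = math.floor(score + 0.5)
    return float(min(max(rounded, lowest), highest))

print(round(2.5), round(3.5))   # 2 4  (round-half-to-even)
print(to_rating(2.5))           # 3.0
print(to_rating(3.429088))      # 3.0
print(to_rating(5.2))           # 5.0
```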


In [29]:
#getting a sense of the top 5 recommendations from this hybrid system; seems like the hybrid system's predictions are identical to the actual ratings for the top 5 recommendations!
con_collab_2043_tst[['shops','actual_ratings','final_rating_predictions_rd','final_rating_predictions']].sort_values('final_rating_predictions',ascending=False).head()

Unnamed: 0,shops,actual_ratings,final_rating_predictions_rd,final_rating_predictions
170,d-good-cafe-singapore-3,5.0,5.0,4.953647
127,mr-teh-tarik-eating-house-singapore-5,5.0,5.0,4.944909
114,jewel-coffee-singapore-13,5.0,5.0,4.941013
98,meidi-ya-singapore-2,5.0,5.0,4.93996
192,isle-eating-house-singapore,5.0,5.0,4.939903


In [30]:
#however, the hybrid system has low precision (0.33) for rating 2: only 1 of the 3 outlets it predicted as rating 2 was actually rated 2...but at least it performed well for the other 3 rating classes (3,4,5)
print(classification_report(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd']))

              precision    recall  f1-score   support

         2.0       0.33      1.00      0.50         1
         3.0       1.00      1.00      1.00         4
         4.0       1.00      0.94      0.97        16
         5.0       1.00      0.93      0.97        15

    accuracy                           0.94        36
   macro avg       0.83      0.97      0.86        36
weighted avg       0.98      0.94      0.96        36



In [31]:
#appears there were no rating 1 in the 36 rows of common outlets between content-based filtering and collaborative filtering..
con_collab_2043_tst[con_collab_2043_tst['actual_ratings']==1.0]

Unnamed: 0,shops,predicted_ratings,actual_ratings,ratings,prediction_rounded,prediction,final_rating_predictions,final_rating_predictions_rd


## Defining functions for evaluation of model
---

In [32]:
#defining function for obtaining tn, fp, fn, tp for each rating class for feeding into micro-avg precision and recall functions defined below
#labels are passed explicitly so that index rating-1 always maps to the right class, even for a rating (e.g. 1) that is absent from y_true and y_pred;
#indexing an unlabelled matrix with rating-2 would turn rating 1 into index -1 and silently count rating 5 twice
RATING_LABELS = [1.0, 2.0, 3.0, 4.0, 5.0]

def cm_spec(y_true,y_pred,rating,state):
    cm = multilabel_confusion_matrix(y_true,y_pred,labels=RATING_LABELS)[rating-1]
    if state=='tn':
        return cm[0][0]
    elif state=='fp':
        return cm[0][1]
    elif state=='fn':
        return cm[1][0]
    else:
        return cm[1][1]
    

In [33]:
#defining function for obtaining micro-avg precision: total tp / (total tp + total fp) over all rating classes
def micro_avg_precision(y_true,y_pred):
    tp = sum(cm_spec(y_true,y_pred,rating,'tp') for rating in range(1,6))
    fp = sum(cm_spec(y_true,y_pred,rating,'fp') for rating in range(1,6))
    return tp/(tp+fp)

In [34]:
#defining function for obtaining micro-avg recall: total tp / (total tp + total fn) over all rating classes
def micro_avg_recall(y_true,y_pred):
    tp = sum(cm_spec(y_true,y_pred,rating,'tp') for rating in range(1,6))
    fn = sum(cm_spec(y_true,y_pred,rating,'fn') for rating in range(1,6))
    return tp/(tp+fn)

In [35]:
#defining function for obtaining micro_avg_f1 (harmonic mean of micro-avg precision and micro-avg recall)
def micro_avg_f1(y_true,y_pred):
    precision = micro_avg_precision(y_true,y_pred)
    recall = micro_avg_recall(y_true,y_pred)
    return 2 * ((precision * recall)/(precision + recall))

In [36]:
#function to print out confusion matrix breakdown for each rating class
def confusion_breakdown(y_true,y_pred,rating):
    print("True negatives for rating {}: {}".format(
        rating,cm_spec(y_true,y_pred,rating,'tn')))
    print("False positives for rating {}: {}".format(
        rating,cm_spec(y_true,y_pred,rating,'fp')))
    print("False negatives for rating {}: {}".format(
        rating,cm_spec(y_true,y_pred,rating,'fn')))
    print("True positives for rating {}: {}".format(
        rating,cm_spec(y_true,y_pred,rating,'tp')))
    return "******************************************"

In [37]:
print(confusion_breakdown(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd'],2))
print(confusion_breakdown(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd'],3))
print(confusion_breakdown(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd'],4))
print(confusion_breakdown(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd'],5))

True negatives for rating 2: 33
False positives for rating 2: 2
False negatives for rating 2: 0
True positives for rating 2: 1
******************************************
True negatives for rating 3: 32
False positives for rating 3: 0
False negatives for rating 3: 0
True positives for rating 3: 4
******************************************
True negatives for rating 4: 20
False positives for rating 4: 0
False negatives for rating 4: 1
True positives for rating 4: 15
******************************************
True negatives for rating 5: 21
False positives for rating 5: 0
False negatives for rating 5: 1
True positives for rating 5: 14
******************************************


In [42]:
#multi-class accuracy = correctly predicted ratings (sum of true positives over all classes) / number of outlets;
#adding the true negatives of each one-vs-rest matrix would instead give the average per-class binary accuracy, which overstates performance
print("Hybrid recommender yielded accuracy: ", round((1+4+15+14)/36, 4))

Hybrid recommender yielded accuracy:  0.9444


In [38]:
print("Hybrid recommender yielded micro-averaged precision: ", round(micro_avg_precision(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd']), 4))


Hybrid recommender yielded micro-averaged precision:  0.9444


In [39]:
print("Hybrid recommender yielded micro-averaged recall: ", round(micro_avg_recall(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd']), 4))

Hybrid recommender yielded micro-averaged recall:  0.9444


In [40]:
print("Hybrid recommender yielded micro_avg_f1 of ", round(micro_avg_f1(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd']), 4))

Hybrid recommender yielded micro_avg_f1 of  0.9444


In [41]:
#precision, recall, f1 of all rating classes show good performance except for rating 2, whose precision is pulled down to 0.33 by 2 false positives against a single true positive...
print(classification_report(con_collab_2043_tst['actual_ratings'],con_collab_2043_tst['final_rating_predictions_rd']))

              precision    recall  f1-score   support

         2.0       0.33      1.00      0.50         1
         3.0       1.00      1.00      1.00         4
         4.0       1.00      0.94      0.97        16
         5.0       1.00      0.93      0.97        15

    accuracy                           0.94        36
   macro avg       0.83      0.97      0.86        36
weighted avg       0.98      0.94      0.96        36
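
The `macro avg` and `weighted avg` rows can be reproduced from the per-class counts. Macro averaging takes the plain mean over the four classes, so the single rating 2 outlet weighs as much as the sixteen rating 4 outlets, while weighted averaging weights each class by its support. A sketch using the true positive, false positive and false negative counts from the confusion breakdown above (variable names are illustrative):

```python
# (tp, fp, fn) per rating class, copied from the confusion breakdown
counts = {2.0: (1, 2, 0), 3.0: (4, 0, 0), 4.0: (15, 0, 1), 5.0: (14, 0, 1)}

precision = {c: tp / (tp + fp) for c, (tp, fp, fn) in counts.items()}
recall = {c: tp / (tp + fn) for c, (tp, fp, fn) in counts.items()}
support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}
n = sum(support.values())  # 36 outlets in total

# Macro average: unweighted mean over classes
macro_precision = sum(precision.values()) / len(counts)
macro_recall = sum(recall.values()) / len(counts)

# Weighted average: each class weighted by its support
weighted_precision = sum(precision[c] * support[c] for c in counts) / n
weighted_recall = sum(recall[c] * support[c] for c in counts) / n

print(round(macro_precision, 2), round(macro_recall, 2))        # 0.83 0.97
print(round(weighted_precision, 2), round(weighted_recall, 2))  # 0.98 0.94
```

The support-weighted recall is the share of all outlets predicted correctly, which is why it matches the `accuracy` row.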



## Hybrid Model Evaluation Result Interpretation
---

<ul>
    
- The hybrid recommender performed well in predicting the ratings of userid 2043: 34 of the 36 ratings were predicted correctly (accuracy, and hence micro-averaged $F_1$, of 0.94). Both errors are false positives for rating 2, i.e. outlets that userid 2043 actually rated 4 and 5 were predicted as rating 2, so on this sample the hybrid model risks leaving well-liked outlets out of the top recommendations. We could only evaluate on the 36 outlets common to content-based and collaborative filtering, and the 1 outlet that userid 2043 happened to give a rating of 1.0 was not among them, so how the hybrid model handles rating 1 remains untested. We therefore cannot rule out that it would wrongly include outlets that a user would have rated poorly among the top recommendations...

</ul>

<ul>
    
- Because ALS is a pure regressor, we improvised by manually rounding its continuous predictions into discrete rating classes so that an $F_1$ score could be obtained. It offers no `.predict_proba()` or `decision_function()` to score the individual rating classes (the Spark ALS documentation mentions no such scoring methods), so the micro-averaged ROC AUC cannot be computed and we will not use the ROC AUC metric here.
    
</ul>

<ul>
    
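- Accuracy of 0.94 (34 of the 36 ratings predicted correctly). Because each outlet has exactly one rating, micro-averaged precision, micro-averaged recall and micro-averaged $F_1$ all equal the accuracy, i.e. 0.94 each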
    
</ul>

<ul>
    
- Let's now move on to producing some actual recommendations (in the next sub-notebook)!
    
</ul>