# Part 9: Hybrid Recommender Trial for userid 2552
---

- Please follow the steps below to download and install all the relevant libraries and dependencies BEFORE running this notebook to avoid encountering any errors.

- First up, navigate to your home directory and create a new directory called ```server```:
    ```cd ~```
    ```mkdir server```
  Everything downloaded in the steps below should go into this ```server``` folder.
  
- To run Spark and PySpark on your local machine, first check that Java is installed by running ```java -version``` in your Terminal (or Command Prompt on Windows). If nothing comes out of this, navigate to this [link](https://java.com/en/download/help/download_options.xml) to download Java for Mac or Windows. You may need to restart your system after installation for Java to take effect.

- Next, navigate to this [link](https://www.oracle.com/java/technologies/javase-jdk8-downloads.html) to download the java development kit and then install it.

- Check whether Scala is installed by running ```scala -version``` in your Terminal (or Command Prompt on Windows). If nothing comes out, download and install Scala 2.11.12 from this [link](https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz) and sbt 0.13.17 from this [link](https://github.com/sbt/sbt/releases/download/v0.13.17/sbt-0.13.17.tgz).

- Navigate to this [link](https://spark.apache.org/downloads.html) to download Apache Spark. Select the same options as in the screenshot below, then click on "spark-2.4.5-bin-hadoop2.7.tgz" under point 3 to download Spark, and install it.
<img src="yelp_data/spark_dl.png"/>

- The following should be the directory paths of the software you have downloaded and installed above, where ```HOMEDIRECTORY``` is your home directory's name:
    JDK: ```/Library/Java/JavaVirtualMachines/jdk1.8.0_251.jdk```
    Sbt: ```/Users/HOMEDIRECTORY/server/sbt```
    Scala: ```/Users/HOMEDIRECTORY/server/scala-2.11.12```
    Spark: ```/Users/HOMEDIRECTORY/server/spark-2.4.5-bin-hadoop2.7```

- After all of the above have been installed, set up a ```.bash_profile``` file in your home directory. For Mac users, if you do not already have a ```.bash_profile``` file, navigate to your home directory and create one by executing the following commands:
    ```cd ~```
    ```touch .bash_profile```
    After which, open it with a text editor of your choice and add the following lines of code at the top of the ```.bash_profile``` file, replacing ```HOMEDIRECTORY``` with the name of your home directory:

<img src="yelp_data/spark_bash_profile.png"/>
    
       
- Save and close the ```.bash_profile``` file and execute ```source ~/.bash_profile``` in your Terminal or Windows-equivalent command prompt.

- Completely quit your Terminal and command prompt.

- Now you may proceed to run the rest of the following code.


- ***Note: if you encounter a "connection refused" error, or a Java error where it tries and fails to connect to your IP address, when running any PySpark-related cell, copy all the cells in the notebook (select the top cell, then Cmd (Mac) / Ctrl (Windows) + Shift + select the last cell), paste them into a fresh notebook, and run them there instead.***
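
Before launching the notebook, it can help to confirm that the tools installed above are visible to Python. The sketch below is only a convenience check under stated assumptions: the variable names `JAVA_HOME`, `SCALA_HOME` and `SPARK_HOME` are guesses at what a typical `.bash_profile` for this setup exports (the actual lines are in the screenshot above and are not reproduced here), and `check_prerequisites` is a hypothetical helper, not part of any library.

```python
import os
import shutil

# Hypothetical defaults: adjust to match what your .bash_profile really exports.
ASSUMED_TOOLS = ("java", "scala", "sbt")
ASSUMED_ENV_VARS = ("JAVA_HOME", "SCALA_HOME", "SPARK_HOME")


def check_prerequisites(tools=ASSUMED_TOOLS, env_vars=ASSUMED_ENV_VARS):
    """Return {name: resolved path or value, or None if missing}."""
    report = {}
    for tool in tools:
        # shutil.which looks the executable up on PATH, like the shell would.
        report[tool] = shutil.which(tool)
    for var in env_vars:
        report[var] = os.environ.get(var)
    return report


for name, value in check_prerequisites().items():
    status = value if value else "NOT FOUND"
    print(f"{name:12s} {status}")
```

Anything reported as `NOT FOUND` suggests that the corresponding install step or `.bash_profile` entry needs another look before `findspark.init()` is called.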

In [1]:
#this is required to allow pyspark to run in a jupyter notebook
#!pip install findspark

In [2]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer, IndexToString
from pyspark.ml import Pipeline as PL
from pyspark.sql.functions import col

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier
import numpy as np
import pandas as pd

from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
import re



In [3]:
#creating a new spark session
newspark = SparkSession.builder.appName('hybrid_rec').getOrCreate()

In [4]:
#reading in prepped dataset for model-based collaborative filtering recommendation
mbcf = newspark.read.csv('yelp_data/mbcf.csv', header=True, inferSchema=True)

In [5]:
#checking out the first few rows of the mbcf df
mbcf.show(3,truncate=True)

+-------------------+-------+-------+
|              shops|ratings|userids|
+-------------------+-------+-------+
|hustle-co-singapore|    5.0|    532|
|hustle-co-singapore|    5.0|   1397|
|hustle-co-singapore|    5.0|     80|
+-------------------+-------+-------+
only showing top 3 rows



In [6]:
#finding the max userid so that one knows what userid to assign to the new user of interest, i.e. the new user's userid should be max(userid) + 1
#(if uncommenting, first run: from pyspark.sql.functions import max)
#mbcf.select(max("userids")).show()

In [7]:
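#binding a second name to the spark mbcf df for experimentation/trial (not a copy, but Spark dfs are immutable, so later transformations return new dfs and leave mbcf untouched)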
mbcf_try = mbcf

In [8]:
#checking out how many rows the spark mbcf df has.
mbcf.count()

7076

In [9]:
#checking out the columns...
mbcf_try.columns

['shops', 'ratings', 'userids']

In [10]:
#new user (userid 2552; me)'s arbitrary ratings given to 10 outlets in the list of 987
vals = [('tai-chong-coffee-stall-singapore',1.0,2552),
        ('the-stamford-brasserie-singapore',1.0,2552),
        ('nanyang-old-coffee-singapore-3',2.0,2552),
        ('hanis-singapore-2',2.0,2552),
        ('dr-cafe-coffee-singapore',3.0,2552),
        ('muzium-cafe-singapore',3.0,2552),
        ('heavenly-wang-singapore-9',4.0,2552),
        ('paris-baguette-singapore-5',4.0,2552),
        ('food-for-thought-singapore',5.0,2552),
        ('starbucks-singapore-158',5.0,2552),]

<ul>
    
- The cell above holds the arbitrary ratings that the new user (userid 2552, AKA me) gave to 10 outlets from the list of 987. I rated them on my impression of each outlet's name alone, as I seldom visit coffee-drinking places (fun fact: I don't drink coffee!). By right, users should rate only those outlets they have actually dined in/patronized; as such, the recommendations generated here may not be logical: rubbish in, rubbish out...
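
Since the new ratings are typed in by hand, a quick sanity check before appending them can catch typos. The following is a minimal sketch on toy data, not the notebook's actual pipeline: `next_user_id` and `validate_new_ratings` are hypothetical helpers, and the 1 to 5 rating range is taken from the rating brackets used in this project.

```python
def next_user_id(existing_user_ids):
    """New users get max(existing id) + 1."""
    return max(existing_user_ids) + 1


def validate_new_ratings(rows, known_shops, existing_user_ids, low=1.0, high=5.0):
    """Return a list of problems found; an empty list means the rows look fine."""
    problems = []
    seen_shops = set()
    for shop, rating, user_id in rows:
        if shop not in known_shops:
            problems.append(f"unknown shop: {shop}")
        if not low <= rating <= high:
            problems.append(f"rating out of range for {shop}: {rating}")
        if user_id in existing_user_ids:
            problems.append(f"user id already taken: {user_id}")
        if shop in seen_shops:
            problems.append(f"shop rated twice: {shop}")
        seen_shops.add(shop)
    return problems


# Toy data for illustration only (not the real mbcf contents).
known_shops = {"hustle-co-singapore", "starbucks-singapore-158", "hanis-singapore-2"}
existing_ids = {80, 532, 1397}
new_id = next_user_id(existing_ids)
rows = [("starbucks-singapore-158", 5.0, new_id), ("hanis-singapore-2", 2.0, new_id)]
print(new_id, validate_new_ratings(rows, known_shops, existing_ids))
```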

In [11]:
#pyspark's convention for adding new rows to the end of an existing spark dataframe, step 1: build a df of the new rows with the same columns
newRows = newspark.createDataFrame(vals,mbcf_try.columns)

In [12]:
#pyspark's convention for adding new rows to the end of an existing spark dataframe, step 2: union the new rows onto the existing df
mbcf_try = mbcf_try.union(newRows)

In [13]:
#checking out the number of rows of the df upon adding those 10 new rating rows for userid 2552
mbcf_try.count()

7086

In [14]:
#mbcf_try = mbcf only bound a second name to the same df, but union() returned a new df that was then assigned to mbcf_try, so mbcf itself is unchanged
#original mbcf df still has 7076 rows...
mbcf.count()

7076
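
The row counts above differ because `union` returns a new DataFrame rather than modifying the existing one. As a plain-Python analogy only (this is not Spark code), a tuple can stand in for the immutable DataFrame: assignment merely adds a second name, and "appending" builds a new object.

```python
# A tuple stands in for an immutable Spark DataFrame (analogy only).
original = (("hustle-co-singapore", 5.0, 532),)

trial = original                # second name for the same object, not a copy
same_object_before = trial is original

# "Appending" creates a new tuple and rebinds the name trial to it.
trial = trial + (("starbucks-singapore-158", 5.0, 2552),)

print(same_object_before, len(original), len(trial))
```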

In [15]:
#converting df to pandas df for easier manipulation later on...
mbcf_try_pd = mbcf_try.toPandas()

In [16]:
#getting a look again at the outlets and ratings provided by userid2552 so we know which outlets to exclude in recommending outlets to userid2552 later on...
user_item_2552 = mbcf_try_pd[mbcf_try_pd['userids']==2552]
user_item_2552

Unnamed: 0,shops,ratings,userids
7076,tai-chong-coffee-stall-singapore,1.0,2552
7077,the-stamford-brasserie-singapore,1.0,2552
7078,nanyang-old-coffee-singapore-3,2.0,2552
7079,hanis-singapore-2,2.0,2552
7080,dr-cafe-coffee-singapore,3.0,2552
7081,muzium-cafe-singapore,3.0,2552
7082,heavenly-wang-singapore-9,4.0,2552
7083,paris-baguette-singapore-5,4.0,2552
7084,food-for-thought-singapore,5.0,2552
7085,starbucks-singapore-158,5.0,2552


In [17]:
#ALS requires numeric user and item columns, so StringIndexer maps both shops and userids to double-precision indices (userids is already an integer column, but is indexed as well just in case)
indexer_try = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in list(set(mbcf_try.columns)-set(['ratings']))]
pipeline_try = PL(stages=indexer_try)
transformed_try = pipeline_try.fit(mbcf_try).transform(mbcf_try)
transformed_try.show(3)

+-------------------+-------+-------+-------------+-----------+
|              shops|ratings|userids|userids_index|shops_index|
+-------------------+-------+-------+-------------+-----------+
|hustle-co-singapore|    5.0|    532|         50.0|      324.0|
|hustle-co-singapore|    5.0|   1397|         56.0|      324.0|
|hustle-co-singapore|    5.0|     80|       1678.0|      324.0|
+-------------------+-------+-------+-------------+-----------+
only showing top 3 rows



In [18]:
transformed_try.filter("userids = 2552").show()

+--------------------+-------+-------+-------------+-----------+
|               shops|ratings|userids|userids_index|shops_index|
+--------------------+-------+-------+-------------+-----------+
|tai-chong-coffee-...|    1.0|   2552|        100.0|      525.0|
|the-stamford-bras...|    1.0|   2552|        100.0|      348.0|
|nanyang-old-coffe...|    2.0|   2552|        100.0|      369.0|
|   hanis-singapore-2|    2.0|   2552|        100.0|      434.0|
|dr-cafe-coffee-si...|    3.0|   2552|        100.0|      408.0|
|muzium-cafe-singa...|    3.0|   2552|        100.0|      508.0|
|heavenly-wang-sin...|    4.0|   2552|        100.0|      305.0|
|paris-baguette-si...|    4.0|   2552|        100.0|      521.0|
|food-for-thought-...|    5.0|   2552|        100.0|       26.0|
|starbucks-singapo...|    5.0|   2552|        100.0|      479.0|
+--------------------+-------+-------+-------------+-----------+
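
The index values above are not related to the numeric size of the userids (for example, userid 532 maps to 50.0 while userid 80 maps to 1678.0). By default, `StringIndexer` orders labels by descending frequency, so the most frequent label gets index 0. A rough pure-Python sketch of that idea follows; `fit_string_index` is a hypothetical helper, and the alphabetical tie-breaking is an assumption made here for determinism, not a statement about Spark 2.4.5.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Assumption: ties are broken alphabetically on the string form of the label.
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy userids: 532 appears three times, 1397 twice, 80 once.
toy_userids = [532, 1397, 532, 80, 532, 1397]
index_map = fit_string_index(toy_userids)
print(index_map)
```

Because indices depend on frequencies, adding rows for a new user can shift the indices of existing users, which is one reason to map indices back to the original ids before presenting results.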



In [19]:
transformed_2552 = transformed_try.filter("userids = 2552")

In [20]:
#there are 981 unique shops.
transformed_try.select("shops").distinct().count()

981

In [21]:
#rank=300 and regParam=0.1 were the best pair of params found when retuning als with a train test split stratified by userids...
als = ALS(rank=300, regParam=0.1, maxIter=20, seed=42, userCol='userids_index',itemCol='shops_index', ratingCol='ratings',coldStartStrategy='drop')

In [22]:
#loading a previously saved and tuned prefitted ALS model to generate predictions here.
#alsmod = ALS.load("yelp_data/als_rec_prefitted.model")

In [23]:
#fitting ALS on the dataset containing the new user's ratings...
als_model_rec = als.fit(transformed_try)
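
For intuition about what the fitted model holds: ALS is a matrix factorization method, so each user and each shop ends up with a vector of latent factors (of length `rank`), and a predicted rating is the dot product of the two vectors. The sketch below uses made-up rank-3 factors purely for illustration; `predict_rating` is a hypothetical helper, not a PySpark function.

```python
def predict_rating(user_factors, item_factors):
    """Predicted rating = dot product of user and item latent factors."""
    if len(user_factors) != len(item_factors):
        raise ValueError("user and item factors must have the same rank")
    return sum(u * v for u, v in zip(user_factors, item_factors))


# Made-up rank-3 factors (the model above uses rank=300).
toy_user = [1.0, 0.5, 2.0]
toy_shop = [2.0, 1.0, 0.5]
toy_prediction = predict_rating(toy_user, toy_shop)
print(toy_prediction)
```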

In [24]:
#making recommendations with model-based collaborative filtering alone first, requesting all 981 outlets per user so that collaborative filtering and content-based filtering overlap as much as possible in the outlets they generate rating predictions for
recs = als_model_rec.recommendForAllUsers(981).toPandas()

#exploding each user's list of recommendations into one row per (user, recommendation)
nrecs = recs.recommendations.apply(pd.Series) \
            .merge(recs, right_index=True, left_index=True) \
            .drop(["recommendations"], axis=1) \
            .melt(id_vars=['userids_index'], value_name="recommendation") \
            .drop("variable", axis=1) \
            .dropna()
nrecs = nrecs.sort_values('userids_index')

#splitting each (shop index, rating) pair into its own columns
nrecs = pd.concat([nrecs['recommendation'].apply(pd.Series), nrecs['userids_index']], axis=1)
nrecs.columns = ['Shop_index', 'Rating', 'UserID_index']

#mapping the StringIndexer indices back to the original userids and shop names
md = transformed_try.select(transformed_try['userids'], transformed_try['userids_index'], transformed_try['shops'], transformed_try['shops_index'])
md = md.toPandas()
dict1 = dict(zip(md['userids_index'], md['userids']))
dict2 = dict(zip(md['shops_index'], md['shops']))
nrecs['UserID'] = nrecs['UserID_index'].map(dict1)
nrecs['shops'] = nrecs['Shop_index'].map(dict2)
nrecs = nrecs.sort_values('UserID')
nrecs.reset_index(drop=True, inplace=True)

#taking an explicit copy so that adding a column below does not trigger pandas' SettingWithCopyWarning
new = nrecs[['UserID', 'shops', 'Rating']].copy()
new['recommendations'] = list(zip(new.shops, new.Rating))
res = new[['UserID', 'recommendations']]
res_new = res['recommendations'].groupby([res.UserID]).apply(list).reset_index()
res_new.head()

Unnamed: 0,UserID,recommendations
0,0,"[(apartment-coffee-singapore, 4.94189643859863..."
1,1,"[(drips-singapore, 3.950673818588257), (butter..."
2,2,"[(ya-kun-kaya-toast-singapore-22, 3.9299848079..."
3,3,"[(40-hands-singapore, 3.96014666557312), (star..."
4,4,"[(the-stamford-brasserie-singapore, 3.73784589..."


<ul>
    
- The cell above takes a while to run, which could mean slow responses if this is deployed for A/B testing in the future...

In [25]:
#creating a new df for userid2552's collaborative filtering-derived recommendations
collab_rec_2552 = pd.DataFrame(dict(res_new[res_new["UserID"]==2552]['recommendations'].tolist()[0]),index=[0]).T.sort_values(0,ascending=False)
collab_rec_2552.head(20)

Unnamed: 0,0
starbucks-singapore-158,4.76087
food-for-thought-singapore,3.974101
paris-baguette-singapore-5,3.927828
heavenly-wang-singapore-9,3.872818
singapore-swimming-club-singapore,3.699315
starbucks-singapore-160,3.688314
ogopogo-singapore,3.678975
mahota-commune-singapore,3.676981
collective-brewers-singapore,3.671374
gram-singapore-2,3.669931


In [26]:
#creating a list of outlets userid2552 has rated earlier on
rated_2552 = mbcf_try_pd[mbcf_try_pd['userids']==2552]['shops'].tolist()

In [27]:
#filtering out those 10 outlets userid2552 has rated initially from the collaborative filtering recommendation list...
collab_rankedrecs_2552 = collab_rec_2552.loc[[shop for shop in collab_rec_2552.index if shop not in rated_2552],0]
collab_rankedrecs_2552.head()

singapore-swimming-club-singapore    3.699315
starbucks-singapore-160              3.688314
ogopogo-singapore                    3.678975
mahota-commune-singapore             3.676981
collective-brewers-singapore         3.671374
Name: 0, dtype: float64
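
The filtering step above boils down to "drop what the user has already rated, then rank the rest by predicted rating". A minimal pure-Python sketch of the same idea, using a handful of the predictions shown above, might look like this (`rank_unseen` is a hypothetical helper name):

```python
def rank_unseen(predictions, rated, top_n=None):
    """Return (shop, score) pairs for unrated shops, highest score first."""
    unseen = [(shop, score) for shop, score in predictions.items() if shop not in rated]
    unseen.sort(key=lambda pair: pair[1], reverse=True)
    return unseen if top_n is None else unseen[:top_n]


# A few of the collaborative filtering predictions shown above.
sample_predictions = {
    "starbucks-singapore-158": 4.76087,
    "food-for-thought-singapore": 3.974101,
    "ogopogo-singapore": 3.678975,
    "singapore-swimming-club-singapore": 3.699315,
    "starbucks-singapore-160": 3.688314,
}
already_rated = {"starbucks-singapore-158", "food-for-thought-singapore"}
ranked = rank_unseen(sample_predictions, already_rated)
print(ranked)
```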

In [28]:
#organizing the above series column into a df of recommendations and collaborative filtering rating predictions
collab_2552_df = pd.DataFrame({'recommendations':collab_rankedrecs_2552.index,
              'collab_filter_predicted_ratings':collab_rankedrecs_2552})
collab_2552_df.head(3)

Unnamed: 0,recommendations,collab_filter_predicted_ratings
singapore-swimming-club-singapore,singapore-swimming-club-singapore,3.699315
starbucks-singapore-160,starbucks-singapore-160,3.688314
ogopogo-singapore,ogopogo-singapore,3.678975


<ul>
    
- Now, for the content-based filtering aspect...

In [29]:
#reading in the previously prepped df for content-based filtering recommendations...
content_f = pd.read_csv('yelp_data/content_based_df_nouser.csv')

In [30]:
#checking out what the first few rows of the df look like.
content_f.head()

Unnamed: 0,shops,reviews,category_alias,review_count,rating
0,183-rojak-singapore,Opening a rojak stall in Toa Payoh isn't that ...,food,2,3.5
1,1983-a-taste-of-nanyang-singapore-2,"Located in the first basement level at MBS, cl...",singaporean coffee foodstands,4,4.0
2,2am-dessert-bar-singapore,Creative desserts with several layers of flavo...,bars desserts coffee,38,3.5
3,2nd-mini-steamboat-delight-singapore,Fantastic Authentic Place!!!What a treat and a...,cafes,2,4.5
4,365-fruit-juice-and-smoothie-singapore,"Real juice, with real fruits and vegetables, a...",juicebars,1,4.0


In [31]:
#checking out the dimensions of the df...
content_f.shape

(981, 5)

In [32]:
#merging userid2552's info with the df meant for content-based filtering so that content-based filtering can make recommendations via rating predictions for userid 2552 later on...
content_2552 = pd.merge(content_f,user_item_2552,how='left',on='shops')

In [33]:
#checking out the first few rows of which...
content_2552.head(3)

Unnamed: 0,shops,reviews,category_alias,review_count,rating,ratings,userids
0,183-rojak-singapore,Opening a rojak stall in Toa Payoh isn't that ...,food,2,3.5,,
1,1983-a-taste-of-nanyang-singapore-2,"Located in the first basement level at MBS, cl...",singaporean coffee foodstands,4,4.0,,
2,2am-dessert-bar-singapore,Creative desserts with several layers of flavo...,bars desserts coffee,38,3.5,,


In [34]:
#getting dummies for categorical columns...
content_2552_wdummies = pd.get_dummies(content_2552, columns=['shops','category_alias'], drop_first=False)

In [35]:
#setting feature and target
X = content_2552_wdummies.drop(['ratings'], axis=1)
y = content_2552_wdummies['ratings']

In [36]:
#collating dummified columns
shops_cats_list = [col for col in content_2552_wdummies.columns if (col.startswith('shops')) or (col.startswith('category'))]

In [37]:
#extending with review_count, rating and userids
shops_cats_list.extend(['review_count','rating','userids'])

In [38]:
#as tfidf can only work on one column of texts at a time, am separating features as below...
X1 = X['reviews']
X2 = X[shops_cats_list]

In [39]:
#checking out the first few rows of reviews prior to preprocessing...
X1.head(3)

0    Opening a rojak stall in Toa Payoh isn't that ...
1    Located in the first basement level at MBS, cl...
2    Creative desserts with several layers of flavo...
Name: reviews, dtype: object

In [40]:
#Assigning a new variable name to X1 for processing.
rev = X1

In [41]:
#creating customized stop words' list
cust_stop_words = [word for word in stop_words.ENGLISH_STOP_WORDS]

In [42]:
#adding on to the above list based on preliminary word cloud EDA
cust_stop_words.extend(["wa","ha","just","ve","did","got","quite"]) #includes lemmatization leftovers such as "wa" (from "was") and "ha" (from "has")

In [43]:
#preprocessing text in reviews by defining a function to do so
lemm = WordNetLemmatizer()

def text_processer(raw_text):
    # Function to convert a raw string of text to a string of words
    # The input is a single string (a raw unprocessed text), and
    # the output is a single string (a preprocessed text)

    # 1. Remove http urls wrapped in parentheses (raw string avoids an invalid-escape warning).
    review_text = re.sub(r"\(http.+\)", " ", raw_text)

    # 2. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Convert to lower case, split into individual words.
    words = letters_only.lower().split()

    # 4. Lemmatize words.
    lemmed_words = [lemm.lemmatize(i) for i in words]

    # 5. Remove stop words.
    meaningful_words = [w for w in lemmed_words if w not in cust_stop_words]

    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)

In [44]:
#showing what the processed reviews look like
rev_processed = pd.Series([text_processer(text) for text in rev])
rev_processed[:3]

0    opening rojak stall toa payoh isn t easy tough...
1    located basement level mb close sw entrance es...
2    creative dessert layer flavor enjoyed looking ...
dtype: object
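
One detail of step 1 in the function above is worth knowing: the pattern `\(http.+\)` is greedy, so when a review contains a parenthesised URL followed by another closing parenthesis later on, everything in between is removed as well. The sketch below contrasts it with a lazy variant (`.+?`), which is offered only as a possible alternative and is not what the notebook uses; the sample sentence is made up.

```python
import re

GREEDY = re.compile(r"\(http.+\)")   # pattern used in text_processer
LAZY = re.compile(r"\(http.+?\)")    # possible alternative: stop at the first ")"

sample = "Great (http://example.com/menu) coffee (really) good"

greedy_tokens = GREEDY.sub(" ", sample).split()
lazy_tokens = LAZY.sub(" ", sample).split()

print(greedy_tokens)
print(lazy_tokens)
```

With the greedy pattern, the words between the URL and the last closing parenthesis are lost; whether that matters depends on how often reviews contain both.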

In [45]:
#using tfidf vectorizer to convert the reviews into term frequency columns...
tvec_naive = TfidfVectorizer(stop_words = cust_stop_words)  #instantiating TfidfVectorizer with customized stop words

X1_tvec_naive = tvec_naive.fit_transform(rev_processed).todense()   #fitting tvec and transforming the processed reviews
X1_tvec_naive_df = pd.DataFrame(X1_tvec_naive, columns = tvec_naive.get_feature_names())  #converting it into a dataframe for easy lookup.
X1_tvec_naive_df.shape #checking how array's dimension has changed with the vectorization.

(981, 18216)
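
To make the 18,216 columns less of a black box, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. It is a simplification (whitespace tokenisation only, toy documents), and `tfidf` is a hypothetical helper rather than the library's implementation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with smoothed-idf weights and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


toy_docs = ["rojak stall toa payoh", "coffee stall basement", "creative dessert coffee"]
vocab, rows = tfidf(toy_docs)
print(vocab)
print([round(w, 3) for w in rows[0]])
```

Terms shared across documents (such as "stall" here) receive a lower weight than terms unique to one document.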

In [46]:
#combining tvec-df with the rest of the features for rating prediction for userid 2552 later on...
X_legit = pd.concat([X1_tvec_naive_df,X2], axis=1)
X_legit.shape

(981, 19498)

In [47]:
#adding back the column of ratings so that the train and test sets below can be carved out of a single df (it is dropped again from the features)
X_legit['ratings'] = y

In [48]:
#creating X_train manually for userid 2552
X_train_2552 = X_legit[X_legit['userids']==2552].drop(['ratings','userids'],axis=1)

In [49]:
#creating y_train manually for userid 2552
y_train_2552 = X_legit[X_legit['userids']==2552]['ratings']

In [50]:
#creating X_test manually for userid 2552 which contains all outlets that have not been rated by userid 2552
X_test_2552 = X_legit[X_legit['userids']!=2552].drop(['ratings','userids'],axis=1)

In [51]:
#instantiate scaler since not all of the features are of the same scale, eg. review_count and rating
ss= StandardScaler()

In [52]:
#fitting the train and transforming both the train and test sets
X_train_2552_sc = ss.fit_transform(X_train_2552)
X_test_2552_sc = ss.transform(X_test_2552)
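
A point worth keeping in mind here: the scaler is fitted on just the 10 training rows, and the one-hot columns of the 971 unrated shops are all zero in those rows. The sketch below mimics standardisation with population statistics and leaves constant columns unscaled, which is my understanding of how `StandardScaler` treats zero-variance features; `fit_scaler` and `transform` are hypothetical helpers on toy numbers.

```python
import math


def fit_scaler(rows):
    """Compute per-column mean and population standard deviation."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        std = math.sqrt(var)
        # Constant column: divide by 1.0 instead of 0.0.
        stds.append(std if std > 0 else 1.0)
    return means, stds


def transform(rows, means, stds):
    """Standardise rows using statistics learned from the training rows."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


# Columns: review_count, one-hot flag for a shop the user has NOT rated.
toy_train = [[2.0, 0.0], [4.0, 0.0]]
toy_test = [[6.0, 1.0]]
means, stds = fit_scaler(toy_train)
train_sc = transform(toy_train, means, stds)
test_sc = transform(toy_test, means, stds)
print(train_sc, test_sc)
```

The test row's one-hot flag takes a value the model never saw during training, which is part of why predictions for unrated shops rest mostly on the other features.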

In [53]:
#this is to try if tuning the model every time a user feeds ratings into the recommender would produce a better outcome
#params = {
#    'max_depth':[5,7,10],
#    'min_samples_split':[10, 15, 20],
#    'min_samples_leaf':[3, 4, 5],
#    'class_weight':['balanced']
#    } 


#dtc_gs = GridSearchCV(DecisionTreeClassifier(), params, cv = 2, verbose = 1, n_jobs = -1)
#dtc_gs.fit(X_train_2552_sc, y_train_2552)

In [54]:
#print('Gridsearch best score: ', dtc_gs.best_score_)
#print('Gridsearch best estimator: ', dtc_gs.best_estimator_)
#print('Gridsearch best params: ',dtc_gs.best_params_)

<ul>
    
- GridSearchCV gave a terrible accuracy score of 0.2 when the user provides only 2 ratings per rating bracket (10 ratings in total), as that caps the number of cross-validation folds at 2: a 50/50 split, tested only twice for each hyperparameter combination searched. To give a representative picture of all rating brackets (1 - 5), the user would need to input at least 10 ratings per bracket, or 50 ratings in total, which would be unduly onerous for the user!
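
The fold limit described above follows from simple counting: for every fold to contain at least one example of every rating, the number of folds cannot exceed the size of the smallest rating bracket. A minimal sketch (`max_stratified_folds` is a hypothetical helper):

```python
from collections import Counter


def max_stratified_folds(labels):
    """Largest k such that each fold can hold at least one sample of every class."""
    return min(Counter(labels).values())


# Two ratings in each bracket, as provided by userid 2552.
ratings_2552 = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0]
print(max_stratified_folds(ratings_2552))

# Ten ratings per bracket would allow up to ten folds.
print(max_stratified_folds([r for r in (1, 2, 3, 4, 5) for _ in range(10)]))
```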

In [55]:
#loading a previously tuned content-based filtering model for one user (userid 2043) here to see the kind of rating predictions it can generate for a different user, userid 2552...
#loaded_model = joblib.load('xgb_model.sav')

In [56]:
#learning rate, max depth, and n_estimators' values were obtained from the previously tuned xgb model - xgb_model.sav
xgb = XGBClassifier(learning_rate=0.5, max_depth=9, n_estimators=200, random_state=42)

In [57]:
#training a fresh model with those params (not the loaded one) on the dataset containing the new user, userid 2552's ratings.
xgb.fit(X_train_2552_sc, y_train_2552)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bytree=1, gamma=0, learning_rate=0.5, max_delta_step=0,
              max_depth=9, min_child_weight=1, missing=None, n_estimators=200,
              n_jobs=1, nthread=None, objective='multi:softprob',
              random_state=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=None, silent=True, subsample=1)

In [58]:
#stacking X_test_2552 as first step in regenerating the shops column for predictions
trial = X_test_2552.stack()

In [59]:
#creating loop to re-generate original X_test_2552 order of shops
index_lst = []
outlets_lst = []
for n in range(len(trial.index)):
    if trial.index[n][1].startswith('shops_') and trial[n]!=0:
        index_lst.append(str(trial.index[n][0]))
        outlets_lst.append(trial.index[n][1])
index_lst = [int(x) for x in index_lst]
reconstructed_X_test_2552 = pd.DataFrame({'shops':outlets_lst}, index=index_lst)
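
The loop above walks through every cell of the stacked frame, which can be slow with roughly 19,000 columns. Assuming each row has exactly one `shops_` dummy set to 1 (true for one-hot encoded shop names), a vectorised alternative is to take the column label of the row-wise maximum with `idxmax`. The sketch below demonstrates this on a toy frame; it is an alternative under that assumption, not the notebook's method.

```python
import pandas as pd

# Toy stand-in for X_test_2552: three shop dummies plus one other feature.
toy = pd.DataFrame(
    {
        "shops_a-cafe": [1, 0, 0],
        "shops_b-cafe": [0, 0, 1],
        "shops_c-cafe": [0, 1, 0],
        "review_count": [2, 38, 4],
    },
    index=[0, 3, 7],
)

shop_cols = [c for c in toy.columns if c.startswith("shops_")]

# For each row, idxmax returns the label of the dummy column holding the 1.
recovered = toy[shop_cols].idxmax(axis=1).str[len("shops_"):]
print(recovered)
```

The original row index is preserved, and the prefix is stripped in the same step.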

In [60]:
#generating content-based filtering rating predictions for userid 2552
rating_predictions = xgb.predict(X_test_2552_sc)

In [61]:
#adding new column of rating predictions into the reconstructed X_test_2552
reconstructed_X_test_2552['predicted_ratings']=rating_predictions

In [62]:
#giving the reconstructed df a more easily understood name for distinction from the collaborative filtering df dealt with above
content_2552_df = reconstructed_X_test_2552
content_2552_df.shape

(971, 2)

In [63]:
#checking out the dimensions of the collaborative filtering df
collab_2552_df.shape

(971, 2)

In [64]:
#ensuring that the content-based filtering df for userid 2552 has no null values
sum(content_2552_df.isnull().sum())

0

In [65]:
#ensuring that the collaborative filtering-based df for userid 2552 has no null values
sum(collab_2552_df.isnull().sum())

0

In [66]:
#having a look at some of the ratings predicted by content-based filtering for userid 2552. With only 10 training ratings these predictions are unlikely to be reliable, so let's see if the collaborative filtering can compensate for this shortcoming...
content_2552_df.head()

Unnamed: 0,shops,predicted_ratings
0,shops_183-rojak-singapore,2.0
1,shops_1983-a-taste-of-nanyang-singapore-2,2.0
2,shops_2am-dessert-bar-singapore,1.0
3,shops_2nd-mini-steamboat-delight-singapore,1.0
4,shops_365-fruit-juice-and-smoothie-singapore,3.0


In [67]:
#trimming off the shops' prefixes so that they can eventually be merged with the collaborative filtering df
content_2552_df['shops'] = content_2552_df['shops'].apply(lambda x: x[6:])

In [68]:
#shops' prefixes have been trimmed off!
content_2552_df.head(3)

Unnamed: 0,shops,predicted_ratings
0,183-rojak-singapore,2.0
1,1983-a-taste-of-nanyang-singapore-2,2.0
2,2am-dessert-bar-singapore,1.0


In [69]:
#renaming the column of rating predictions to distinguish from collaborative filtering's prediction column later on when both dfs are merged.
content_2552_df.rename(columns={'predicted_ratings':'content_filter_predicted_ratings'},inplace=True)

In [70]:
#checking out the renamed column...
content_2552_df.head(3)

Unnamed: 0,shops,content_filter_predicted_ratings
0,183-rojak-singapore,2.0
1,1983-a-taste-of-nanyang-singapore-2,2.0
2,2am-dessert-bar-singapore,1.0


In [71]:
#checking out the first few rows of the collaborative filtering df for userid 2552
collab_2552_df.head()

Unnamed: 0,recommendations,collab_filter_predicted_ratings
singapore-swimming-club-singapore,singapore-swimming-club-singapore,3.699315
starbucks-singapore-160,starbucks-singapore-160,3.688314
ogopogo-singapore,ogopogo-singapore,3.678975
mahota-commune-singapore,mahota-commune-singapore,3.676981
collective-brewers-singapore,collective-brewers-singapore,3.671374


In [72]:
#renaming collaborative filtering df's recommendations' column so that it can be merged with the content-based filtering df.
collab_2552_df.rename(columns={'recommendations':'shops'},inplace=True)

In [73]:
#checking out the renamed column
collab_2552_df.head(3)

Unnamed: 0,shops,collab_filter_predicted_ratings
singapore-swimming-club-singapore,singapore-swimming-club-singapore,3.699315
starbucks-singapore-160,starbucks-singapore-160,3.688314
ogopogo-singapore,ogopogo-singapore,3.678975


In [74]:
#resetting the index in the collaborative filtering df so that the index is numerical again
collab_2552_df.reset_index(drop=True,inplace=True)

In [75]:
#checking out the reset index of the collaborative filtering df
collab_2552_df.head(3)

Unnamed: 0,shops,collab_filter_predicted_ratings
0,singapore-swimming-club-singapore,3.699315
1,starbucks-singapore-160,3.688314
2,ogopogo-singapore,3.678975


In [76]:
#merging both content-based filtering and collaborative filtering dfs to prepare to make hybrid recommendations for userid 2552
content_collab_2552_df = pd.merge(content_2552_df,collab_2552_df,how='inner',on='shops')

In [77]:
#looking at the first few rows of the combined df for userid 2552
content_collab_2552_df.head(3)

Unnamed: 0,shops,content_filter_predicted_ratings,collab_filter_predicted_ratings
0,183-rojak-singapore,2.0,2.856616
1,1983-a-taste-of-nanyang-singapore-2,2.0,2.912037
2,2am-dessert-bar-singapore,1.0,3.637525


In [78]:
#as mentioned in the previous sub-notebook on this hybrid recommender's evaluation, the following are the content-based and collaborative filtering's ratings' weights
con_wt = 0.97 / (0.97 + 1.0)
collab_wt = 1.0 / (0.97 + 1.0)

In [79]:
#feature engineering to add hybrid recommender's rating predictions into the combined df by multiplying the respective rating predictions by weights based on both models' f1 scores derived from prior evaluation and summing them up to yield hybrid predictions
content_collab_2552_df['final_weighted_rating_predictions'] = (content_collab_2552_df['content_filter_predicted_ratings']*con_wt) + (content_collab_2552_df['collab_filter_predicted_ratings']*collab_wt)

In [80]:
#checking out the first few rows of the final hybrid df
content_collab_2552_df.head(3)

Unnamed: 0,shops,content_filter_predicted_ratings,collab_filter_predicted_ratings,final_weighted_rating_predictions
0,183-rojak-singapore,2.0,2.856616,2.43483
1,1983-a-taste-of-nanyang-singapore-2,2.0,2.912037,2.462963
2,2am-dessert-bar-singapore,1.0,3.637525,2.338845
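
The weighted rating above is a normalised weighted average of the two models' predictions, with weights proportional to their F1 scores (0.97 for content-based filtering and 1.0 for collaborative filtering, as used in the cells above). A small sketch reproducing the first and third rows of the table, with `hybrid_rating` as a hypothetical helper:

```python
CONTENT_F1 = 0.97   # weight source used above for content-based filtering
COLLAB_F1 = 1.0     # weight source used above for collaborative filtering


def hybrid_rating(content_pred, collab_pred, content_score=CONTENT_F1, collab_score=COLLAB_F1):
    """Weighted average of the two predictions, weights normalised to sum to 1."""
    total = content_score + collab_score
    return (content_pred * content_score + collab_pred * collab_score) / total


row_0 = hybrid_rating(2.0, 2.856616)   # 183-rojak-singapore
row_2 = hybrid_rating(1.0, 3.637525)   # 2am-dessert-bar-singapore
print(round(row_0, 6), round(row_2, 6))
```

Because the two weights are nearly equal (about 0.49 and 0.51), the hybrid score is close to a plain average, so a content-based prediction of 5.0 can lift an outlet above others with higher collaborative scores.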


In [81]:
#top 5 coffee-drinking outlet recommendations for userid 2552 (me!) based on my ratings given rather randomly to 10 of the outlets earlier on...
content_collab_2552_df.sort_values('final_weighted_rating_predictions',ascending=False).head()

Unnamed: 0,shops,content_filter_predicted_ratings,collab_filter_predicted_ratings,final_weighted_rating_predictions
243,empire-cafe-singapore,5.0,3.656636,4.318089
255,fine-palate-cafe-singapore,5.0,3.65295,4.316218
100,brunetti-singapore,5.0,3.65009,4.314767
63,baristart-coffee-sentosa-southern-islands,5.0,3.649774,4.314606
352,jcone-jipangyi-singapore,5.0,3.646433,4.31291


In [82]:
top_5_recs = content_collab_2552_df[['shops','final_weighted_rating_predictions']].sort_values('final_weighted_rating_predictions',ascending=False).head()
top_5_recs

Unnamed: 0,shops,final_weighted_rating_predictions
243,empire-cafe-singapore,4.318089
255,fine-palate-cafe-singapore,4.316218
100,brunetti-singapore,4.314767
63,baristart-coffee-sentosa-southern-islands,4.314606
352,jcone-jipangyi-singapore,4.31291


In [83]:
top_5_recs.reset_index(drop=True,inplace=True)

In [84]:
top_5_recs

Unnamed: 0,shops,final_weighted_rating_predictions
0,empire-cafe-singapore,4.318089
1,fine-palate-cafe-singapore,4.316218
2,brunetti-singapore,4.314767
3,baristart-coffee-sentosa-southern-islands,4.314606
4,jcone-jipangyi-singapore,4.31291


In [85]:
top_5_recs.loc[4,'shops']

'jcone-jipangyi-singapore'

In [86]:
#summing content-based filtering's predicted ratings for the outlets that userid 2552 (me!) has not rated; all-1.0 predictions would sum to 971, whereas the sum here is 2773.0, so the predictions do vary...
content_collab_2552_df['content_filter_predicted_ratings'].sum()

2773.0

In [87]:
content_collab_2552_df.content_filter_predicted_ratings.value_counts()

2.0    258
1.0    214
5.0    210
3.0    163
4.0    126
Name: content_filter_predicted_ratings, dtype: int64

In [88]:
content_collab_2552_df.shape

(971, 4)

## Interpretation of trial's results
---

<ul>
    
- Interestingly enough, content-based filtering, trained on only the 10 outlets I rated, had to predict ratings for all 971 outlets that I have not rated, so its predictions (spread across ratings 1 to 5 in the value counts above) are unlikely to be reliable...fortunately, I have the model-based collaborative filtering as a backup to compensate where content-based filtering falls short, in this hybrid recommender...

</ul>

<ul>
    
- This hybrid recommender has shown that having the user provide an even distribution of ratings (2 each of ratings 1-5) for 10 different outlets he/she has visited from the get-go, out of 980+ outlets (basically training on only about 1% of the data and testing on 99%), is not a viable solution for a recommender that relies solely on a model-based content-based filtering system. The model-based collaborative filtering with ALS managed to salvage things a little by at least providing a variation of ratings, suggesting that it is indeed robust to missing data, namely the 971 outlets not rated by userid 2552.

</ul>    
    
<ul>

- As mentioned earlier, the 10 initial ratings I provided were arbitrary and perhaps rather random (based not on actual experience of visiting the outlets, but purely on my impression of their names). As such, it may not be surprising that the user ratings predicted by the content-based filtering algorithm do not make sense. As part of the future plans for this project, I could deploy this properly on a platform for actual coffee lovers to try out, and validate it with A/B testing.
    
</ul>

## Models' Summary
---

<ul>

- <font size='3'>__Content-based Filtering (baseline accuracy: 0.48)__</font>:
    
    </ul>

|<center><font size='2'>Model<center>|<center><font size='2'>Accuracy<center>|<center><font size='2'>Micro-Average<br>Precision<center>|<center><font size='2'>Micro-Average<br>Recall<center>|<center><font size='2'>Micro-Average<br>$F_1$ score<center>|<center><font size='2'>Micro-Average<br>ROC AUC<center>|<center><font size='2'>Prevalence-Weighted<br>ROC AUC<center>|
|---|---|---|---|---|---|---|
|<center><font size='1'>*Logistic Regression<br>with<br>TfidfVectorizer*<center>|<center>0.81<center>|<center>0.81<center>|<center>0.81<center>|<center>0.81<center>|<center>0.88<center>|<center>0.71<center>|
|<center><font size='1'>*Logistic Regression<br>with<br>TfidfVectorizer<br>with<br>PCA*<center>|<center>0.50<center>|<center>0.50<center>|<center>0.50<center>|<center>0.50<center>|<center>0.65<center>|<center>0.57<center>|
|<center><font size='1'>***XGB Classifier<br>with<br>TfidfVectorizer (chosen)***<center>|<center>***0.97***<center>|<center>***0.97***<center>|<center>***0.97***<center>|<center>***0.97***<center>|<center>***1.0***<center>|<center>***1.0***<center>|
|<center><font size='1'>Decision Tree Classifier<br>with<br>TfidfVectorizer<center>|<center>0.85<center>|<center>0.85<center>|<center>0.85<center>|<center>0.85<center>|<center>0.94<center>|<center>0.90<center>|
|<center><font size='1'>*Decision Tree Classifier<br>with<br>TfidfVectorizer<br>with<br>PCA*<center>|<center>0.43<center>|<center>0.43<center>|<center>0.43<center>|<center>0.43<center>|<center>0.62<center>|<center>0.47<center>|
|<center><font size='1'>*Random Forest Classifier<br>with<br>TfidfVectorizer*<center>|<center>0.61<center>|<center>0.61<center>|<center>0.61<center>|<center>0.61<center>|<center>0.92<center>|<center>0.81<center>|



<ul>
    
- <font size='3'>__Model-based Collaborative Filtering (baseline accuracy: 0.47)__</font>:
    
    </ul>

|<center><font size='2'>Model<center>|<center><font size='2'>Accuracy<center>|<center><font size='2'>Micro-Average<br>Precision<center>|<center><font size='2'>Micro-Average<br>Recall<center>|<center><font size='2'>Micro-Average<br>$F_1$ score<center>|
|---|---|---|---|---|
|<center><font size='1'>*Alternating Least Squares (ALS)*<center>|<center>1.0<center>|<center>1.0<center>|<center>1.0<center>|<center>1.0<center>|

<ul>
    
- <font size='3'>__Hybrid Recommender (baseline accuracy: 0.48)__</font>:
    
    </ul>

|<center><font size='2'>Model<center>|<center><font size='2'>Accuracy<center>|<center><font size='2'>Micro-Average<br>Precision<center>|<center><font size='2'>Micro-Average<br>Recall<center>|<center><font size='2'>Micro-Average<br>$F_1$ score<center>|
|---|---|---|---|---|
|<center><font size='1'>*Hybrid Recommender<br>(ALS and XGB Classifier)*<center>|<center>1.0<center>|<center>1.0<center>|<center>1.0<center>|<center>1.0<center>|
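
A note on reading the tables above: accuracy and the micro-average precision, recall and $F_1$ score are identical in every row. This is expected for single-label multiclass problems, where every misclassification counts as exactly one false positive (for the predicted class) and one false negative (for the true class). The sketch below illustrates this with made-up labels; the function name `micro_scores` and the sample ratings are hypothetical and are not taken from this notebook's data.

```python
def micro_scores(y_true, y_pred):
    """Return (accuracy, micro-precision, micro-recall) for single-label data."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true/false positives and false negatives over all rating classes.
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, tp / (tp + fp), tp / (tp + fn)


# Hypothetical ratings: 5 of the 8 predictions are correct.
y_true = [1, 2, 3, 4, 5, 5, 3, 1]
y_pred = [1, 2, 3, 5, 5, 4, 3, 2]
accuracy, precision, recall = micro_scores(y_true, y_pred)
print(accuracy, precision, recall)  # 0.625 0.625 0.625
```

Since micro-average precision and recall are equal, the micro-average $F_1$ score (their harmonic mean) takes the same value.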

## Model Improvements and Current Limitations
---

<ul>
    
- The major limitation of the earlier phase of the project was that the content-based filtering was trained and tuned on a single userid's ratings, which may not be representative of the vast majority even though that userid rated a considerable number of outlets. In this extension, the content-based filtering was trained on 110 userids, and time was spent tuning an XGB Classifier properly in an attempt to mitigate this under-representation. Training data was restricted, somewhat arbitrarily, to users who have rated at least 10 outlets (only 110 userids out of 2552). The rest rated fewer than 10 outlets, and a significant number rated only 1-2; including them would make the train_test_split and cross-validation problematic, since it is important to stratify the splitting by userid in the evaluation stage.
<ul>
    - Indeed, the XGB Classifier did not disappoint, with near-perfect scores of 0.97 - 1.0 across its performance metrics.
    </ul>
    
</ul>
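
The restriction to users with at least 10 rated outlets can be sketched as follows. This is a minimal illustration in plain Python rather than the notebook's actual preprocessing; the function name `filter_active_users` and the `(userid, outlet, rating)` tuple layout are assumptions made for the example.

```python
from collections import Counter


def filter_active_users(ratings, min_ratings=10):
    """Keep only the rows of users who rated at least `min_ratings` outlets.

    `ratings` is a list of (userid, outlet, rating) tuples (hypothetical layout).
    """
    counts = Counter(userid for userid, _, _ in ratings)
    return [row for row in ratings if counts[row[0]] >= min_ratings]


# Hypothetical data: user 1 rated 10 outlets, user 2 rated only 2.
sample = [(1, f"outlet_{i}", 3.0) for i in range(10)]
sample += [(2, "outlet_0", 5.0), (2, "outlet_1", 1.0)]

kept = filter_active_users(sample)
print(len(kept), {userid for userid, _, _ in kept})  # 10 {1}
```

With a pandas DataFrame, the same filter could be written as `df.groupby('userid').filter(lambda g: len(g) >= 10)`, assuming a column named `userid`.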
    
<ul>
    
- To evaluate ALS' rating predictions on the same grounds as the content-based filtering, which I evaluated with classification metrics (since the ratings fall into a finite set of discrete classes), ALS' continuous rating predictions were rounded to the nearest whole number, and those falling into invalid rating classes such as -1.0 or 0.0 were manually recoded to the nearest valid value, 1.0. Even though these predictions were already misclassified and correcting them did not alter the spark evaluator-computed $F_1$ score (still 0.98 after corrections), it should be noted that the ALS predictions were "tweaked". There should be better, more reliable ways to convert the continuous output into discrete rating classes, but for now I am making do with this and qualifying it as an "assumption" under which my evaluation of the ALS component holds true. One way of improving this is to incorporate a logistic/sigmoid function ($f(x)$) into the matrix factorization loss function, so that the predicted output is automatically squeezed into probabilities between 0 and 1 for each discrete rating class: 

</ul>

<img src="yelp_data/extn_matfac.png"/>

<ul>
    
- The above extension of matrix factorization can then be adapted and extended to more complex algorithms like neural networks, which are used for near state-of-the-art recommenders.
    
</ul>
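
For reference, the rounding and clamping of ALS predictions described earlier in this section can be sketched as follows. This is a minimal illustration in plain Python rather than the notebook's pyspark code; the function name `discretize_ratings` and the sample predictions are hypothetical. Note that Python's built-in `round` rounds exact halves to the nearest even integer (for example, 2.5 becomes 2).

```python
def discretize_ratings(predictions, low=1, high=5):
    """Round continuous predictions and clamp them into the valid rating range."""
    return [min(high, max(low, round(p))) for p in predictions]


# Hypothetical ALS outputs, including values outside the 1-5 rating scale.
raw_predictions = [-1.2, 0.3, 2.6, 4.4, 5.9]
print(discretize_ratings(raw_predictions))  # [1, 1, 3, 4, 5]
```

With a pandas Series, the equivalent would be `series.round().clip(1, 5)`.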

<ul>    
    
- Another limitation is that this hybrid model only takes into account users' explicit preferences in the form of ratings and reviews. Implicit feedback such as clickthrough data or page views could be incorporated to supplement the hybrid recommender, although such data is difficult, if not impossible, to obtain from the Yelp API token and from scraping Yelp's web pages with BeautifulSoup alone.
    
</ul>    
    
<ul>

- Yet another possible limitation is that ```TfidfVectorizer``` was not tuned: a naive (default) vectorizer was fitted instead, because the reviews were not the only input feature but were combined with numerical feature columns for the content-based filtering. This made it difficult, if not impossible, to tune the vectorizer, for example by limiting the ```max_features``` hyperparameter. One possible improvement is to use an [Auto-encoder](https://towardsdatascience.com/creating-a-hybrid-content-collaborative-movie-recommender-using-deep-learning-cc8b431618af), which leverages a neural network to compress the high-dimensional tfidf matrix down to its crucial components.

</ul>        

<ul>
    
- Consider mean normalization for NaN or null ratings (where the user has not rated the outlet). Instead of eliminating them, mean-normalize the ratings so that the NaNs become zeros, and then add the average outlet rating back to these zeros to yield the predicted ratings for the previously null values, as described [here](https://www.coursera.org/lecture/machine-learning/implementational-detail-mean-normalization-Adk8G). This could add value by generating rating predictions for users who opt not to rate any outlets at all when using this hybrid recommender system, and therefore partially address the cold-start problem common to collaborative filtering systems.
    
</ul>
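
The mean normalization idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the notebook's pipeline: the function names, the `None`-for-unrated convention, the fallback to the global mean for outlets with no ratings, and the tiny ratings matrix are all hypothetical.

```python
def outlet_means(ratings):
    """Per-outlet mean over observed ratings; `None` marks 'not rated'."""
    observed_all = [r for row in ratings for r in row if r is not None]
    global_mean = sum(observed_all) / len(observed_all)
    means = []
    for j in range(len(ratings[0])):
        observed = [row[j] for row in ratings if row[j] is not None]
        # Assumption: outlets with no ratings fall back to the global mean.
        means.append(sum(observed) / len(observed) if observed else global_mean)
    return means


def mean_normalize(ratings, means):
    """Subtract each outlet's mean; unrated entries become 0.0."""
    return [[r - means[j] if r is not None else 0.0 for j, r in enumerate(row)]
            for row in ratings]


def add_means_back(deviations, means):
    """Turn predicted deviations back into ratings on the original scale."""
    return [d + m for d, m in zip(deviations, means)]


# Rows are users, columns are outlets (hypothetical ratings).
ratings = [[5, None, 3],
           [4, 2, None],
           [None, 4, 1]]
means = outlet_means(ratings)
normalized = mean_normalize(ratings, means)

# A user with no ratings has a predicted deviation of 0 for every outlet,
# so the prediction collapses to the outlet averages.
new_user_prediction = add_means_back([0.0, 0.0, 0.0], means)
print(means, new_user_prediction)
```

In this sketch a brand-new user simply receives each outlet's average rating, which is a modest but non-empty starting point for recommendations.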

<ul>
    
- The scraped data may be incomplete, or of less than optimal quality, because the HTML structure of Yelp's web pages makes it hard to ensure that every userid is aligned with its rating. Some pages contain an extra rating and review with no userid attached: an old review by a user who has since posted an updated review right above it. It is difficult to write BeautifulSoup code that detects this and scrapes the relevant details accordingly. Fortunately, this scenario is not very common on Yelp. More data, and data of higher quality, would go a long way towards improving this hybrid recommender system.
    
    </ul>
        
<ul>
    
- Lastly, this dataset is static, and any deployed app will need to be updated regularly to remain relevant.
    
</ul>

## Conclusions and Future Plans
---

<ul>
    
- Given more time to tune certain algorithms, such as DecisionTreeClassifier with tfidf and PCA or RandomForestClassifier with tfidf, they might outperform the chosen XGB Classifier with only tfidf. For some of these models, the tuned hyperparameters fall on either edge of the search space provided in the parameter grid. This suggests that tuning a few more times, each time extending the hyperparameter values beyond the previous search grid, could eventually yield a more optimal model for each of the algorithms that were not chosen. Furthermore, ```PCA``` was not tuned. Tuning it could help reduce the dimensionality of the data and the model variance while retaining all features, which is crucial since all the outlets are coded as features and should not be removed: there must be a predicted rating for every outlet, so that we do not miss outlets the user could be interested in further down the list of recommendations (beyond the top 5 or 10). Perhaps the ```PCA``` should be fitted on just ```X_train``` instead of ```X_train_sc```: I casually experimented with the former near the end of the capstone period, using some multiclass ROC AUC plots and scores, and it produced better micro-average and prevalence-weighted ROC AUCs. That said, there has been [advice online](https://stackoverflow.com/questions/39685740/calculate-sklearn-roc-auc-score-for-multi-class) that multiclass classification is better assessed by its usual method, the confusion matrix.

</ul>
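
One way to spot hyperparameters that landed on the edge of their search range, as discussed above, is a small check like the one below. This is a hedged sketch rather than code from the notebook: the function name `params_on_grid_edge`, the example grid and the example best parameters are hypothetical, and the check only makes sense for orderable (numeric) hyperparameters.

```python
def params_on_grid_edge(best_params, param_grid):
    """Report which tuned values sit at the lower or upper end of their grid."""
    on_edge = {}
    for name, values in param_grid.items():
        ordered = sorted(values)
        best = best_params[name]
        # A single-value grid has no meaningful edge, so skip it.
        if len(ordered) > 1 and best in (ordered[0], ordered[-1]):
            on_edge[name] = "lower" if best == ordered[0] else "upper"
    return on_edge


# Hypothetical grid and search result.
param_grid = {"max_depth": [3, 5, 7], "n_estimators": [100, 200, 300]}
best_params = {"max_depth": 7, "n_estimators": 200}
print(params_on_grid_edge(best_params, param_grid))  # {'max_depth': 'upper'}
```

With scikit-learn, `best_params` could come from the `best_params_` attribute of a fitted `GridSearchCV`. Any hyperparameter flagged here is a candidate for a wider grid in the next tuning round.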


<ul>
    
- I built a Flask app around this hybrid recommendation system and ran it successfully in a local virtual environment, as shown in the screenshots below. However, it takes 15 minutes or more on average to generate the recommendations. As for actual deployment on a platform like Heroku, the pyspark component is a lot more intractable and there are few, if any, resources online on deploying pyspark apps on Heroku. I was therefore not able to deploy the full hybrid recommendation system, and wound up deploying only the content-based filtering component supported by the XGB Classifier (further screenshots below). I will keep a lookout for updates on the various deployment platforms, stackoverflow and pyspark deployment to see if the pyspark ALS component can be incorporated without incurring unnecessary billing, since the [AWS S3 Bucketeer](https://elements.heroku.com/addons/bucketeer) add-on that complements deployment of spark applications on Heroku is not free. Please click [here](http://sg-coffee-recommender.herokuapp.com/) for the content-based filtering app deployed on Heroku. 

- Flask app:

<img src="yelp_data/Flask_app_frontpage_1.png"/>
<img src="yelp_data/Flask_app_frontpage_2.png"/>
<img src="yelp_data/Flask_app_outcome.png"/>
    
- Heroku App:

<img src="yelp_data/heroku_app.png"/>
<img src="yelp_data/heroku_app_recs.png"/>
    
</ul>


<ul>

- All in all, this capstone project is "bootstrapp-ish", as almost every component was cobbled together from its 'raw materials': from the data-scraping phase, to constructing the content-based filtering model and combining it with a more-or-less well-established collaborative filtering model, to aggregating rating predictions from both to yield the final hybrid prototype. There is no tried-and-tested method, gold standard or novice-friendly guide out there for constructing a full hybrid recommendation system. Although this hybrid system cannot be compared to commercial ones like Netflix's, it offers beginners a peek into the nuts and bolts of creating a simple hybrid recommender from scratch. 

</ul>


## Sources
---

- https://java.com/en/download/help/download_options.xml
- https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
- https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
- https://github.com/sbt/sbt/releases/download/v0.13.17/sbt-0.13.17.tgz
- https://spark.apache.org/downloads.html
- https://medium.com/luckspark/installing-spark-2-3-0-on-macos-high-sierra-276a127b8b85
- https://towardsdatascience.com/creating-a-hybrid-content-collaborative-movie-recommender-using-deep-learning-cc8b431618af
- https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-1-knn-item-based-collaborative-filtering-637969614ea
- https://www.coursera.org/lecture/machine-learning/implementational-detail-mean-normalization-Adk8G
- https://stackoverflow.com/questions/39685740/calculate-sklearn-roc-auc-score-for-multi-class
- https://elements.heroku.com/addons/bucketeer
- http://sg-coffee-recommender.herokuapp.com/
