Capstone Project: Recommender for Local Coffee-Drinking Places

(Data obtained from Yelp)
(Author: Mr Jason Chia)

Background and Business Problem

Deciding where to get one's favorite cup of coffee is increasingly difficult against the backdrop of a burgeoning food and beverage (FnB) industry. With a younger generation of owners entering the scene in recent years, we are seeing a growing number of artisanal and creative FnB outlets touting all sorts of innovative menu items, leaving typical coffee consumers like you utterly spoilt for choice! Beyond these up-and-coming outfits, don't forget the quaint but reliably good sources of quality coffee such as traditional hawker centres and food courts: there are more than enough options for coffee drinkers who prefer a taste of nostalgia too!

Overview of Recommendation Systems and Project Objective

Therefore, I am taking it upon myself to make your life that much easier by building a recommendation system that recommends coffee-drinking places based on reviews and ratings! Instead of poring over every outlet's reviews and checking individual ratings on Yelp one by one, all the way to the 1000th outlet, before deciding which to settle for, users can simply rate a couple of outlets they have been to before, click a button, and BAM! Out come the top 5 recommendations to check out right away!

Recommendation systems fall into two broad types. Collaborative Filtering predicts a user's preferences from those of other similar users (hence the collaborative aspect) and ranks the predicted preferences, for example predicted ratings, to generate recommendations. Content-based Filtering recommends items that are similar, in terms of item characteristics such as item category, to items the user once liked or rated highly.

Collaborative Filtering allows for cross-recommendations, where items dissimilar to anything the user has liked or rated highly can still be recommended because users with similar tastes rated them highly. However, it faces the cold-start problem: it is hard to predict the preferences of new users who have never rated anything, or to recommend new items that no user has rated yet.

Collaborative Filtering can be further sub-divided into Memory-based and Model-based. Memory-based Collaborative Filtering relies purely on similarities between users' past rating patterns, measured with a pairwise metric such as Jaccard distance or cosine similarity. Model-based Collaborative Filtering instead fits an algorithm to the user-item interactions; it tends to have slightly higher bias but lower variance, and so generalizes better to out-of-sample data.

An example of a model-based collaborative filtering algorithm is Alternating Least Squares (ALS), a matrix factorization technique that decomposes the user-item interaction matrix (such as the user-item ratings matrix for datasets with explicit feedback) into user and item latent factors whose dot product predicts user ratings. At each iteration it fixes one set of latent factors (user or item) and solves a regularized least-squares problem for the other, alternating between the two to minimize the squared error between predicted and observed ratings.
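For reference, the regularized squared-error objective that ALS is usually described as minimizing can be written in a common textbook form as below, where $\Omega$ is the set of observed (user, item) pairs, $x_u$ and $y_i$ are the user and item latent factors, and $\lambda$ is a regularization weight. The notation is an assumption for illustration, and the exact regularization used by the pyspark implementation may differ.

$$\min_{X,Y} \sum_{(u,i) \in \Omega} \left(r_{ui} - x_u^\top y_i\right)^2 + \lambda \left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)$$

A minimal pure-Python sketch of the alternating updates follows, restricted to a single latent factor so that each least-squares solve reduces to one division. The toy ratings and the function names are hypothetical; the project itself uses pyspark's ALS.

```python
def regularized_loss(ratings, user_f, item_f, lam):
    """Squared error on observed ratings plus an L2 penalty on all factors."""
    err = sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())
    reg = lam * (sum(x * x for x in user_f.values())
                 + sum(y * y for y in item_f.values()))
    return err + reg


def als_rank1(ratings, lam=0.1, n_iters=10):
    """Rank-1 ALS on a dict {(user, item): rating}; returns factors and loss history."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    history = []
    for _ in range(n_iters):
        # Fix the item factors; each user's ridge problem has a closed-form solution.
        for u in users:
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in rated)
            den = lam + sum(item_f[i] ** 2 for i, _ in rated)
            user_f[u] = num / den
        # Fix the user factors; solve for each item in the same way.
        for i in items:
            raters = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in raters)
            den = lam + sum(user_f[u] ** 2 for u, _ in raters)
            item_f[i] = num / den
        history.append(regularized_loss(ratings, user_f, item_f, lam))
    return user_f, item_f, history


# Hypothetical toy ratings: three users, three outlets, some pairs unobserved.
toy_ratings = {
    ("u1", "a"): 5, ("u1", "b"): 4,
    ("u2", "a"): 4, ("u2", "c"): 2,
    ("u3", "b"): 5, ("u3", "c"): 3,
}
user_f, item_f, history = als_rank1(toy_ratings)
predicted_u1_c = user_f["u1"] * item_f["c"]  # prediction for an unobserved pair
```

Because each half-step exactly minimizes the objective over one set of factors while the other is held fixed, the recorded loss never increases from one iteration to the next.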

As ALS will be imported from the pyspark machine learning library, please follow the instructions at the top of the Part 7 and Part 9 notebooks to download and install the necessary libraries and dependencies (Java, Scala, Spark and pyspark) and to configure them, so as to avoid errors when running pyspark-related code in some of the notebooks.

Note: if you encounter a connection refused error, or a Java error where it tries and fails to connect to your IP address, when running any pyspark-related cell, copy all the cells in the notebook (highlight the top cell, then Cmd (Mac) / Ctrl (Windows) + Shift + highlight the last cell), paste them into a fresh notebook and run them there instead.

User-centered Content-based Filtering builds a model for each user that predicts that user's rating for every item from item characteristics such as item category and item review count, learning that user's coefficients in the process. A potential shortfall of Content-based Filtering is that it is unlikely to provide cross-recommendations. Collaborative Filtering can compensate for this in a hybrid recommendation system, which is what I will be building in this project, albeit a simple version.

Ideally, content-based filtering should have one model per user. For the purpose of a simple demonstration in this 6-7 week capstone project, however, I will build and tune only one model for the user-centered content-based filtering component of the hybrid system, for a sample user (userid 2043) who happened to have rated a vast majority of the coffee-drinking outlets in the dataset. This could introduce bias, since the content-based model will only represent this particular user's tastes and preferences. As we will see later, this is where a hybrid recommendation system's value shines through: the collaborative filtering component compensates for this bias because it takes other similar users' ratings into account.

Beyond the capstone presentation, more work was done. In particular, the XGB Classifier was properly tuned and trained not just on one prominent userid (userid 2043) but on 110 of them. These were chosen because they had rated at least 10 outlets; 10 is an arbitrary threshold intended to avoid errors at the train_test_split stage (where the split is stratified by userid) and at the cross-validation stage. In addition, ALS was re-tuned on datasets split with stratification by userid. The new models outperformed the earlier models chosen in this capstone project. More details on this work can be found in the various alternative notebooks in the repo.
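The filtering and per-user splitting idea described above can be sketched in plain Python as follows. This is only an approximation of a userid-stratified split such as the one the notebooks obtain from train_test_split; the function names, the 20% test fraction and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict


def filter_active_users(ratings, min_ratings=10):
    """Keep only rows from users with at least `min_ratings` ratings."""
    counts = defaultdict(int)
    for user_id, _, _ in ratings:
        counts[user_id] += 1
    return [row for row in ratings if counts[row[0]] >= min_ratings]


def split_per_user(ratings, test_fraction=0.2, seed=42):
    """Split each user's rows separately so every user appears in both sets."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)
    train, test = [], []
    for user_id in sorted(by_user):
        rows = by_user[user_id][:]
        rng.shuffle(rows)
        n_test = max(1, round(len(rows) * test_fraction))
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Hypothetical rows of (userid, outlet, rating).
toy_rows = (
    [(1, f"outlet_{k}", 4) for k in range(12)]
    + [(2, f"outlet_{k}", 3) for k in range(10)]
    + [(3, f"outlet_{k}", 5) for k in range(2)]  # too few ratings, dropped
)
active_rows = filter_active_users(toy_rows)
train_rows, test_rows = split_per_user(active_rows)
```

A user with only one or two ratings cannot be represented in both the training and test sets, which is why such users are removed before splitting.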

Approach

Data scraping: The data comprises a list of 987 coffee-drinking places in Singapore obtained through Yelp's API (using an API token), along with 6,292 reviews, 7,076 ratings and the reviewers' userids scraped from Yelp's website using BeautifulSoup. (details inside Part 1 notebook)

Data Cleaning, Feature Engineering, Preprocessing, Exploratory Data Analysis (EDA) and generation of other datasets for manipulation in subsequent notebooks (details inside Part 1 notebook or alternate notebook)

Content-based Filtering and Evaluation (Micro-Average Precision, Recall, F1, ROC AUC, and Prevalence-Weighted ROC AUC) (details inside Parts 2 - 6 notebooks):
1. Modeling with Logistic Regression and Tfidf vectorization
2. Modeling with Logistic Regression and Tfidf vectorization and PCA
3. Modeling with Decision Tree Classifier and Tfidf vectorization
4. Modeling with XGB Classifier and Tfidf vectorization (performed best)
5. Modeling with Decision Tree Classifier and Tfidf vectorization and PCA
6. Modeling with Random Forest Classifier and Tfidf vectorization
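As a small illustration of the micro-averaged metrics used for evaluation, the sketch below pools true positives, false positives and false negatives over all rating classes before computing precision, recall and F1. For single-label multiclass predictions these three values coincide with accuracy, which is why those columns can be expected to match in a results table. The function name and the toy labels are hypothetical.

```python
def micro_average_scores(y_true, y_pred):
    """Pool TP/FP/FN over every class, then compute precision, recall and F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this class, but the truth was another
            elif truth == label:
                fn += 1  # truth was this class, but another was predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical rating classes (1-5 stars) for eight outlets.
true_classes = [1, 2, 3, 4, 5, 5, 3, 2]
pred_classes = [1, 2, 3, 5, 5, 4, 3, 1]
scores = micro_average_scores(true_classes, pred_classes)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled precision and recall share the same numerator and denominator.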

Collaborative Filtering with ALS and Evaluation (Micro-Average Precision, Recall, F1) (details inside Part 7 notebook or alternate notebook)

Hybrid Recommendation System Evaluation (Micro-Average Precision, Recall, F1) (details inside Part 8 notebook or alternate notebook)
1. Combination of Content-based and Collaborative Filtering by taking weighted sum of ratings from both as the final rating predictions
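A minimal sketch of the weighted-sum combination is given below. The equal weighting of 0.5, the function names and the toy predictions are assumptions for illustration; the weights actually used are documented in the Part 8 notebooks.

```python
def hybrid_rating(content_pred, collab_pred, content_weight=0.5):
    """Blend two predicted ratings; the weights sum to 1."""
    if not 0.0 <= content_weight <= 1.0:
        raise ValueError("content_weight must lie between 0 and 1")
    return content_weight * content_pred + (1.0 - content_weight) * collab_pred


def top_n_hybrid(content_preds, collab_preds, n=5, content_weight=0.5):
    """Rank outlets that both components scored by their blended rating."""
    shared = content_preds.keys() & collab_preds.keys()
    blended = {
        outlet: hybrid_rating(content_preds[outlet], collab_preds[outlet],
                              content_weight)
        for outlet in shared
    }
    # Highest blended rating first; ties broken alphabetically by outlet name.
    return sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical predicted ratings from each component for one user.
content_based = {"outlet_a": 4.0, "outlet_b": 3.0, "outlet_c": 5.0}
collaborative = {"outlet_a": 5.0, "outlet_b": 4.0, "outlet_d": 2.0}
recommendations = top_n_hybrid(content_based, collaborative, n=5)
```

Only outlets scored by both components are ranked in this sketch; how to handle outlets covered by just one component is a separate design choice.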

Simple Trial of Hybrid Recommendation System: 10 arbitrary ratings will be fed into the system to see whether it produces sensible recommendations (details inside Part 9 notebook)

Model Improvements and Current Limitations (details inside Part 9 notebook)

Conclusions and Future Plans (details inside Part 9 notebook)

Models' Summary

Content-based Filtering (baseline accuracy: 0.48):

| Model | Accuracy | Micro-Average Precision | Micro-Average Recall | Micro-Average $F_1$ score | Micro-Average ROC AUC | Prevalence-Weighted ROC AUC |
| --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression with TfidfVectorizer | 0.81 | 0.81 | 0.81 | 0.81 | 0.88 | 0.71 |
| Logistic Regression with TfidfVectorizer with PCA | 0.50 | 0.50 | 0.50 | 0.50 | 0.65 | 0.57 |
| Decision Tree Classifier with TfidfVectorizer | 0.85 | 0.85 | 0.85 | 0.85 | 0.94 | 0.90 |
| XGB Classifier with TfidfVectorizer (chosen) | 0.97 | 0.97 | 0.97 | 0.97 | 1.0 | 1.0 |
| Decision Tree Classifier with TfidfVectorizer with PCA | 0.43 | 0.43 | 0.43 | 0.43 | 0.62 | 0.47 |
| Random Forest Classifier with TfidfVectorizer | 0.61 | 0.61 | 0.61 | 0.61 | 0.92 | 0.81 |

Model-based Collaborative Filtering (baseline accuracy: 0.47):

| Model | Accuracy | Micro-Average Precision | Micro-Average Recall | Micro-Average $F_1$ score |
| --- | --- | --- | --- | --- |
| Alternating Least Squares (ALS) | 1.0 | 1.0 | 1.0 | 1.0 |

Hybrid Recommender (baseline accuracy: 0.48):

| Model | Accuracy | Micro-Average Precision | Micro-Average Recall | Micro-Average $F_1$ score |
| --- | --- | --- | --- | --- |
| Hybrid Recommender (ALS and XGB Classifier) | 1.0 | 1.0 | 1.0 | 1.0 |

Flask Implementation and Heroku Deployment

Flask Implementation (Hybrid Recommender)

Heroku Deployment (Content-based Filtering Recommender only)

I built a Flask app around this hybrid recommendation system and ran it successfully in a local virtual environment. However, it takes 15 minutes or more on average to generate the recommendations. As for actual deployment on a platform like Heroku, the pyspark component is a lot more intractable and there are few, if any, resources online on deploying pyspark apps on Heroku. I was therefore unable to deploy the full hybrid recommendation system and wound up deploying only the content-based filtering component backed by the XGB Classifier. The GitHub repo for the Flask app contains the files and folders for the hybrid recommender, as well as the Procfile, requirements.txt, and runtime.txt for the content-based filtering recommender deployed on Heroku.

Model Improvements and Current Limitations

The major limitation of the earlier phase of the project was that the content-based filtering was trained and tuned on a single userid's ratings, which may not be representative of the vast majority even though that userid rated a considerable number of outlets. In this extension, the content-based filtering was trained on 110 userids, and time was spent properly tuning an XGB Classifier in an attempt to mitigate this under-representation issue. Training data was restricted, somewhat arbitrarily, to users who had rated at least 10 outlets. Only 110 of the 2,552 userids qualified; the rest rated fewer than 10 outlets, and a significant number rated only 1-2. Including them would make train_test_split and cross-validation problematic, since it is important to stratify the split by userid at the evaluation stage.

- Indeed, the XGB Classifier did not disappoint, with near-perfect scores of 0.97 - 1.0.

ALS is a regressor, so its output was “tweaked” for classification to align with Content-based Filtering. As a result, some predicted ratings fell into novel classes that do not exist among the actual ratings. A possible improvement is to incorporate a logistic/sigmoid function f(x) into the matrix factorization so that continuous predicted ratings are automatically converted into discrete rating classes that line up with the actual rating classes.

The above extension of matrix factorization can then be adapted and extended to more complex algorithms like neural networks, which are used for near state-of-the-art recommenders.
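One possible reading of the sigmoid idea is sketched below: an unbounded score (for example a latent-factor dot product from a model trained with this squashing built in) is passed through a logistic function, rescaled to the rating range and rounded to the nearest class. The 1-5 star range and the function names are assumptions, and applying this mapping after the fact to ratings from an already trained ALS model would not be equivalent.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def to_rating_class(raw_score, r_min=1, r_max=5):
    """Squash an unbounded score into [r_min, r_max], then round to a class."""
    bounded = r_min + (r_max - r_min) * sigmoid(raw_score)
    return min(r_max, max(r_min, round(bounded)))


# Hypothetical raw scores; every output is a valid class between 1 and 5.
raw_scores = [-100.0, -1.0, 0.0, 1.0, 100.0]
rating_classes = [to_rating_class(s) for s in raw_scores]
```

Because the logistic function is bounded, no score can map outside the existing rating classes, which is the property the tweak described above lacked.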

This project only uses explicit user ratings for coffee-drinking outlets in Singapore. Implicit data such as clickthroughs and page views could be sourced and included to further enhance the hybrid recommender.

TfidfVectorizer and PCA not tuned: TfidfVectorizer is hard to tune because the input features for Content-based Filtering are not just a single reviews column but are combined with numerical features, so it cannot be tuned in a pipeline together with an estimator unless the reviews and the numerical features are routed through separate preprocessing steps, which complicates the process. PCA, however, could be tuned further for more promising results, as the first iteration suggests that the grid search space could be modified further.
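One way the reviews and numerical features could be routed separately inside a single tunable pipeline is scikit-learn's ColumnTransformer, sketched below. This is not the approach taken in the notebooks; the column names, the toy data, the LogisticRegression estimator and the parameter grid are all assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data; the column names are hypothetical, not the project's actual schema.
toy_data = pd.DataFrame({
    "review": [
        "great coffee and friendly staff", "bitter coffee and slow service",
        "lovely latte art cosy place", "burnt espresso rude staff",
        "smooth flat white great vibe", "watery coffee noisy place",
        "excellent brew friendly barista", "stale pastry bitter brew",
    ],
    "review_count": [120, 8, 64, 5, 200, 12, 90, 3],
    "liked": [1, 0, 1, 0, 1, 0, 1, 0],
})

# The text column goes through TF-IDF (a single column name, so the
# vectorizer receives 1-D input); the numeric column is scaled.
features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "review"),
    ("num", StandardScaler(), ["review_count"]),
])
pipeline = Pipeline([("features", features), ("clf", LogisticRegression())])

# TfidfVectorizer settings are addressed via the nested step names.
param_grid = {
    "features__tfidf__ngram_range": [(1, 1), (1, 2)],
    "features__tfidf__max_features": [5, None],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(toy_data[["review", "review_count"]], toy_data["liked"])
predictions = search.predict(toy_data[["review", "review_count"]])
```

With this layout the vectorizer is refit inside each cross-validation fold, so its settings can be tuned alongside the estimator's without manually splitting the dataset.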

Mean normalization of ratings could be considered so that the hybrid system can fall back on mean outlet ratings to rank recommendations for new users who do not provide ratings (in a way, dealing with the cold-start problem of collaborative filtering).
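A small sketch of the mean-normalization idea follows. Each outlet's mean rating is subtracted before factorization, so a new user whose predicted deviation is zero is simply ranked by the outlet means. The function names and the toy ratings are hypothetical.

```python
from collections import defaultdict


def outlet_mean_ratings(ratings):
    """Mean rating per outlet from rows of (userid, outlet, rating)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, outlet, rating in ratings:
        totals[outlet] += rating
        counts[outlet] += 1
    return {outlet: totals[outlet] / counts[outlet] for outlet in totals}


def mean_normalize(ratings, means):
    """Replace each rating with its deviation from the outlet's mean."""
    return [(user, outlet, rating - means[outlet])
            for user, outlet, rating in ratings]


def cold_start_recommendations(means, n=5):
    """For a user with no ratings, rank outlets by mean rating alone."""
    return sorted(means, key=lambda outlet: (-means[outlet], outlet))[:n]


# Hypothetical explicit ratings.
toy_ratings_rows = [
    ("u1", "cafe_a", 5), ("u2", "cafe_a", 3),
    ("u1", "cafe_b", 2), ("u3", "cafe_b", 4),
    ("u2", "cafe_c", 5),
]
means = outlet_mean_ratings(toy_ratings_rows)
normalized = mean_normalize(toy_ratings_rows, means)
new_user_top = cold_start_recommendations(means, n=2)
```

At prediction time the outlet mean would be added back to the factorization's output, which is what lets a brand-new user still receive a ranked list.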

Scraped data quality: a few ratings and userids are not aligned, which affects the quality of recommendations.

Lastly, this dataset is static and any deployed app will need to be updated regularly to remain relevant.

Conclusions and Future Plans

Potentially tune PCA or TfidfVectorizer to further boost content-based filtering performance

Long Term Steps:

https://elements.heroku.com/addons/bucketeer


jasonchia89/GA_Capstone
