# <span style='text-align: center'>Project Proposal</span>

<hr>

## Executive Summary

The Restaurant Segmentation Analysis project is a collaboration between Sitewise Analytics and MDS students Chen Lin, Eric Tsai, Morris Zhao, and Xinru Lu. The project aims to use machine learning methods to determine the factors that drive traffic to a particular location and to identify clusters of similar store locations. In this proposal, we first highlight the problem and the fundamental goals of the project, then discuss the data science techniques we will employ to tackle the problem, and conclude with a rough project timeline.

<hr>

## Introduction

Opening a new location is costly, so restaurant chains looking to expand into a region need to know whether a new store is likely to succeed. Franchise owners therefore need to understand the factors that drive traffic to a location, such as the demographics of the surrounding population and consumer behaviour in the region. With a strong grasp of these factors, owners can plan future expansions and market a new location strategically based on the region’s demographics. The Restaurant Segmentation Analysis project will address this problem by using data from Smoothie King locations in the United States and Subway locations in Canada and the United States to build machine learning data pipelines that Sitewise Analytics can incorporate into its consulting service. By the end of the project, we expect to have human-interpretable machine learning models that cluster similar store locations, which will help Sitewise Analytics clients identify the factors that lead to more customers at those locations.

More specifically, we will be building three separate machine learning pipelines:
1. A supervised machine learning pipeline using data from Smoothie King locations to assign each store to one of five categories:
    - Home
    - Shopping
    - Work
    - Travel
    - Other

    The predictions will be human-interpretable: users will be able to identify the features that drive a store location’s predicted category.

2. An unsupervised machine learning pipeline based on data of US Subway locations that clusters locations by similar features.

3. An unsupervised machine learning pipeline based on data of Canadian Subway locations that clusters locations by similar features.

    The two unsupervised machine learning pipelines will also produce human-interpretable results, including ways to identify the shared features that caused locations to be clustered together.

The final product presented to Sitewise Analytics will be a GitHub repository containing the code for the machine learning data pipelines. The repository will also include documentation for the scripts needed to reproduce the results.

<hr>

## Data Summary

We received three datasets, one for each of three popular restaurant chains: Smoothie King, Subway Canada, and Subway USA. Each dataset consists of three CSV files containing demographic, point-of-interest, and store-specific data respectively, where each row represents a single store location and the columns represent the variables/features of that store.

To better understand the data, we selected several key features and created visualizations of their distributions across categories.

<hr>

## Data Science Techniques

<center><img src='../img/smoothie_category_bar_plot.png' width='600px' style='padding-top:60px'></center>
<span style='padding-left:50px'><center><b>Figure 1</b></center></span>

<center><img src='../img/smoothie_store_density_bar_plot.png' width='600px' style='padding-top:60px'></center>
<span style='padding-left:50px'><center><b>Figure 2</b>: <br>Light Urban, Urban, and Super Urban have very low counts, with fewer than 10 stores each.</center></span>

<center><img src='../img/smoothie_market_size_bar_plot.png' width='600px' style='padding-top:60px'></center>
<span style='padding-left:50px'><center><b>Figure 3</b>: <br> The category with the highest number of stores is Very Large Metro.</center></span>

<center><img src='../img/subway_canada_store_density_bar_plot.png' width='600px' style='padding-top:60px'></center>
<span style='padding-left:50px'><center><b>Figure 4</b>: <br> Rural has the highest count.</center></span>

<center><img src='../img/subway_canada_market_size_bar_plot.png' width='600px' style='padding-top:60px'></center>
<span style='padding-left:50px'><center><b>Figure 5</b>: <br> Very Large Metro has the highest count with 5190 stores, and other categories have similar counts.</center></span>

<center><img src='../img/subway_us_store_density_bar_plot.png' width='600px' style='padding-top:60px'></center>
<span style='padding-left:50px'><center><b>Figure 6</b>: <br> The most common market size is Very Large Metro with 577 stores.</center></span>

<center><img src='../img/subway_us_market_size_bar_plot.png' width='600px' style='padding-top:60px'></center>
<span style='padding-left:50px; padding-bottom: 80px'><center><b>Figure 7</b>: <br> Very Large Metro has the highest count with 577 stores, and the other categories have similar counts.</center></span>

Based on the data described above and the goals for each restaurant chain outlined in the Introduction, our initial approaches and the corresponding evaluation metrics are as follows:

1. For Smoothie King, the combined data covers 796 stores in the US with over 1100 features spanning surrounding population demographics, consumer behaviour, and store location information. Our initial proposed model is an LGBMClassifier that assigns each store to one of the five categories: “Home”, “Shopping”, “Work”, “Travel”, and “Other”.

LGBMClassifier handles large datasets and performs well on high-dimensional data, which makes it a good fit for the Smoothie King dataset. It also typically trains quickly and uses little memory while maintaining relatively high accuracy. Because the dataset is high-dimensional, one planned improvement is to reduce its dimensionality with Principal Component Analysis (PCA) before training the LGBMClassifier.

We will evaluate the model by its accuracy score, the proportion of stores classified into the correct category; higher accuracy suggests better model performance. We will interpret the results with SHAP (SHapley Additive exPlanations) plots, which should show which features drive the category assignment for a given store.
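The accuracy metric described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the labels are made up rather than taken from the Smoothie King data, and the per-category breakdown is an optional addition for illustration, not part of the proposed evaluation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of stores whose predicted category matches the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_category_recall(y_true, y_pred):
    """Share of stores in each true category that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {category: hits[category] / totals[category] for category in totals}


# Made-up labels for six hypothetical stores (not project data).
y_true = ["Home", "Home", "Shopping", "Work", "Travel", "Other"]
y_pred = ["Home", "Shopping", "Shopping", "Work", "Travel", "Home"]

overall = accuracy(y_true, y_pred)                # 4 of 6 stores correct
by_category = per_category_recall(y_true, y_pred)
```

In this toy example the overall accuracy is 4/6, while the breakdown shows that the single "Other" store is misclassified, a detail the overall score alone does not reveal.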

2. For Subway USA, the dataset combines 890 features, covering information similar to that described above for Smoothie King, for approximately 14,000 stores across the US. The main goal is to group the stores into clusters whose members share similar features. Our initial proposed approach is hierarchical clustering, preceded by PCA for dimensionality reduction.

Because Smoothie King and Subway US both operate in the US and share similar features in their datasets, they may share similar segmentations to some extent. The most influential features from the Smoothie King supervised model can therefore serve as a reference for evaluating the PCA step: the more of these important features the PCA result matches, the more appropriate the PCA is likely to be.
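One possible way to make this comparison concrete is to list the features with the largest PCA loadings and measure their overlap with the reference list from the supervised model. The sketch below uses NumPy on synthetic data; the feature names, the reference list, and the variance-weighted loading score are all assumptions made for illustration, not the project's chosen method.

```python
import numpy as np


def top_pca_features(X, feature_names, n_components=1, top_k=3):
    """Return the top_k features ranked by variance-weighted absolute loadings."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - X.mean(axis=0)) / std  # standardize: PCA is scale-sensitive
    # Rows of Vt are the principal axes (loadings) of the standardized data.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    weights = explained[:n_components, None]
    scores = (np.abs(Vt[:n_components]) * weights).sum(axis=0)
    order = np.argsort(scores)[::-1][:top_k]
    return [feature_names[i] for i in order]


def overlap_ratio(pca_features, reference_features):
    """Share of PCA-selected features that also appear in the reference list."""
    return len(set(pca_features) & set(reference_features)) / len(pca_features)


# Synthetic data: three features driven by one latent factor, two pure noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=200),
    latent + 0.1 * rng.normal(size=200),
    latent + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])
# Hypothetical feature names and a hypothetical list of SHAP-ranked features.
feature_names = ["population", "households", "daytime_workers",
                 "median_income", "competitor_count"]
reference = ["population", "households", "median_income"]

pca_top = top_pca_features(X, feature_names, n_components=1, top_k=3)
match = overlap_ratio(pca_top, reference)
```

A higher overlap would suggest that the retained components emphasize the same features the supervised model found important, although other ways of scoring the match are equally plausible.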

During clustering, we will test different linkage criteria, which determine how the distance between clusters is measured. Finally, since ground-truth labels are not available, we will evaluate the clusters with the Silhouette Coefficient, where a higher score suggests better-defined clusters.
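To illustrate how the linkage criterion and the Silhouette Coefficient fit together, here is a minimal pure-Python sketch on six toy points. It is a naive implementation written for readability, not the library-based pipeline the project would likely use, and the points are invented.

```python
import math
from itertools import combinations


def agglomerative(points, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns one cluster label per point."""
    reducers = {
        "single": min,                             # closest pair of members
        "complete": max,                           # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),   # mean over all pairs
    }
    reduce_fn = reducers[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a, b in combinations(range(len(clusters)), 2):
            dists = [math.dist(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]]
            d = reduce_fn(dists)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge b into a (a < b)
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    by_label = {}
    for i, label in enumerate(labels):
        by_label.setdefault(label, []).append(i)
    if len(by_label) < 2:
        raise ValueError("silhouette requires at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in by_label[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in by_label.items() if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two well-separated toy groups (invented coordinates, not store data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
scores = {
    linkage: silhouette_score(points, agglomerative(points, 2, linkage))
    for linkage in ("single", "complete", "average")
}
bad_score = silhouette_score(points, [0, 1, 0, 1, 0, 1])
```

On data this cleanly separated, all three linkage criteria recover the same two groups and score close to 1, while the deliberately mixed labelling scores much lower. On real store data the criteria may disagree, which is why they are worth comparing.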

Additionally, to validate the clusters, one potential approach is to randomly sample a portion of the stores in a cluster and look them up on Google Maps to check whether they actually belong to the same category.
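The spot check could be prepared along the following lines. This is a sketch under stated assumptions: the `store_id`, `lat`, and `lon` keys are hypothetical field names, the coordinates are invented, and the sampling fraction is arbitrary. The link format follows the Google Maps search URL scheme.

```python
import random
from urllib.parse import urlencode


def sample_stores_per_cluster(stores, labels, fraction=0.1, seed=0):
    """Randomly pick a fraction of the stores (at least one) from each cluster."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_cluster = {}
    for store, label in zip(stores, labels):
        by_cluster.setdefault(label, []).append(store)
    sample = {}
    for label, members in by_cluster.items():
        k = max(1, round(len(members) * fraction))
        sample[label] = rng.sample(members, k)
    return sample


def google_maps_url(lat, lon):
    """Build a Google Maps search link for a latitude/longitude pair."""
    query = urlencode({"api": 1, "query": f"{lat},{lon}"})
    return "https://www.google.com/maps/search/?" + query


# Twenty invented stores split evenly across two hypothetical clusters.
stores = [
    {"store_id": i, "lat": 49.0 + i * 0.01, "lon": -123.0 - i * 0.01}
    for i in range(20)
]
labels = [i % 2 for i in range(20)]

sample = sample_stores_per_cluster(stores, labels, fraction=0.2)
links = {
    label: [google_maps_url(s["lat"], s["lon"]) for s in picked]
    for label, picked in sample.items()
}
```

Each cluster then has a short list of links that a reviewer can open by hand to judge whether the sampled stores look alike.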

3. For Subway Canada, the dataset has far fewer features than those of the other two restaurant chains: 53 features, covering information similar to that described above for Smoothie King, for approximately 1800 stores in Canada. PCA is therefore optional. We will take an approach similar to the one for Subway US and evaluate the result.

If the performance is not ideal, we will consider removing the PCA step, rerunning the clustering, and evaluating the result again with the Silhouette Coefficient and a Google Maps check.

<hr>

## Timeline

| **Weekly Schedule**          | **Task**                               | **Objective**                                                                                                                                                  |
|--------------------------|----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Week 1 <br>(5/1 - 5/7)**   | EDA & Proposal                         | Understand the problem, perform initial EDA on the dataset, and propose potential models and approaches to each objective.                                     |
| **Week 2 <br>(5/8 - 5/14)**  | Feature Selection                      | Explore a variety of methods to determine the most important features for each of the three datasets.                                                          |
| **Week 3 <br>(5/15 - 5/21)** | Supervised Model                       | Train and test a supervised classification model on the labeled dataset, and produce a list of the most important features as major indicators.                |
| **Week 4 <br>(5/22 - 5/28)** | Unsupervised Model (part I)            | Train an unsupervised clustering model on one of the unlabeled datasets and apply several potential evaluation metrics.                                        |
| **Week 5 <br>(5/29 - 6/4)**  | Unsupervised Model (part II)           | Train an unsupervised clustering model on the other unlabeled dataset and apply several potential evaluation metrics.                                          |
| **Week 6 <br>(6/5 - 6/11)**  | Model Tuning & Parameter Optimization  | Perform parameter tuning and optimization on the three models.                                                                                                 |
| **Week 7 <br>(6/12 - 6/18)** | Final Presentation                     | Present the final models as well as the list of important indicators from the list of features. Address potential directions and approaches for further study. |
| **Week 8 <br>(6/19 - 6/28)** | Final Product & Reflection             | Draft submission for the final product and report and iterate on feedback before final submission.                                                             |
