### Introduction and Business Problem

##### __Background:__

When traveling in the southern United States, and in Georgia in particular, it is often hard to find restaurants that suit one's needs and tastes. Especially when passing through remote parts of Georgia, it helps to know what kinds of restaurants to expect nearby. In addition, several towns and cities in Georgia are known for authentic restaurants serving specific cuisines, and learning about these specialities before traveling helps avoid missing them. This capstone project focuses on the state of Georgia, USA, and analyzes the categories of restaurants and food options available to travelers and locals in its towns and cities.

##### __Problem statements:__
We will attempt to address the following problem statements in this project, all within the geographical region of Georgia, USA:
1. What categories of restaurants are popular in different parts of the state?
2. Are vegan/vegetarian options available in the state? Where are they located?
3. Where can one find a cafe or quick bite?
4. Are there speciality southern restaurants in the state? Where are they located?
5. Which towns do not have dine-in restaurants, and are therefore potentially not traveler-friendly for dine-in stops?

### Data Sources

To gain insight into the restaurant landscape in Georgia, USA, the first piece of data we require is a list of all towns and cities in the state. This is public data, available from the Georgia government site or from Wikipedia. In this project, we will scrape the table of towns and cities from the Wikipedia page https://en.wikipedia.org/wiki/List_of_municipalities_in_Georgia_(U.S._state). Next, we need the geolocation (latitude and longitude) of every town and city scraped in the first step; for this we will use the Nominatim geocoder, which is backed by OpenStreetMap (OSM). Lastly, we will use Foursquare's crowd-sourced data to get the restaurant and category details of all venues of type “food”. We will limit the venues returned by searching within a radius of 5 km around each town/city.

To summarize, we will pull the following data:
  
1. **List of all towns and cities in Georgia, USA**  
   Source:  https://en.wikipedia.org/wiki/List_of_municipalities_in_Georgia_(U.S._state)  
   Pull table data from the table with class `wikitable sortable`  
   _Example (Name, Type, County):_  
   Alpharetta	City	Fulton  
   Savannah	City	Chatham  
     
2. **Geolocation (latitude and longitude) of each town and city - Using Geocoder Nominatim OSM API**  
   Package: geocoder Nominatim  
   Example:   
   Name	latitude	longitude  
   0	Abbeville, GA	31.992122	-83.306824  
   1	Acworth, GA	34.065933	-84.676880  
     
3. **Restaurants and food venues - From Foursquare crowdsource data, using ‘Venues’ endpoint**  
   Endpoint: Foursquare venue explore endpoint https://api.foursquare.com/v2/venues/explore  
   Parameters: Radius = 5 km, Limit venues per call = 100, section = 'Food'  
   Example:  
      Neighborhood	Neighborhood Latitude	Neighborhood Longitude	Venue	Venue Latitude	Venue Longitude	Venue Category  
      Abbeville, GA	31.992122	-83.306824	Ophelias	31.992405	-83.306903	Diner  
      Abbeville, GA	31.992122	-83.306824	Country Kitchen	31.992640	-83.307734	Food  
      Abbeville, GA	31.992122	-83.306824	Mr B-b-q	31.993763	-83.295976	Food  
      Acworth, GA	34.065933	-84.676880	Henry's Louisiana Grill	34.066011	-84.677728	Cajun / Creole Restaurant  
      Acworth, GA	34.065933	-84.676880	Fusco's via Roma	34.065781	-84.677163	Italian Restaurant  

Combining the data from the three sources above, we get the list of venues, with their categories, for each city/town. We will use this as input to K-means clustering, an unsupervised machine learning model, to build clusters of towns/cities that offer similar categories of restaurants. Once the clusters are formed, we will analyze and label them based on the venue categories within each cluster. Using the clusters and labels, we will attempt to answer all the problem statements.

### Data Analysis

##### Data Analysis - Exploratory, descriptive, and statistical

Using the BeautifulSoup package, the wikitable data listing all municipalities (cities and towns) was loaded into a pandas DataFrame. An initial summary of this DataFrame shows:  

Type | Total
--- | ---
City|409
Consolidated city|2
Town|122
Unified government|6
**Grand total**|**539**
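
As a rough illustration of the scraping step, here is a minimal sketch that parses a `wikitable sortable` table with BeautifulSoup and counts the rows by type. To stay runnable offline it works on a tiny inline HTML sample built from the example rows above; the function name `parse_municipalities` and the sample markup are assumptions made for illustration, not the project's actual code, and the real page layout may differ.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia page. In practice the HTML would be
# downloaded first (for example with requests.get(url).text).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Type</th><th>County</th></tr>
  <tr><td>Alpharetta</td><td>City</td><td>Fulton</td></tr>
  <tr><td>Savannah</td><td>City</td><td>Chatham</td></tr>
</table>
"""


def parse_municipalities(html):
    """Return one dict per data row of the first 'wikitable sortable' table."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: a <table> carrying both classes, whatever else it has.
    table = soup.select_one("table.wikitable.sortable")
    rows = table.find_all("tr")
    header = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if len(cells) == len(header):  # skip malformed or spanning rows
            records.append(dict(zip(header, cells)))
    return records


municipalities = parse_municipalities(SAMPLE_HTML)
type_counts = Counter(record["Type"] for record in municipalities)
print(municipalities[0])
print(type_counts)
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(...)` if a DataFrame is preferred.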


Using the Nominatim geocoder, we loop through each town/city in the list and fetch its geolocation (latitude and longitude).

Name | latitude | longitude
--- | --- | ---
Abbeville, GA|31.992122|-83.306824
Acworth, GA|34.065933|-84.676880
Adairsville, GA|34.368702|-84.934109
Adel, GA|31.137136|-83.423494
Adrian, GA|32.530722|-82.589299
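
The geocoding loop can be sketched as follows. This is a minimal sketch that assumes the `geopy` package is the Nominatim client; the function names, the `user_agent` string, and the town "Nowhere, GA" are hypothetical. The lookup function is passed in as a parameter so that the loop can be tried offline with a stub, and the one-second pause reflects the public Nominatim service's limit of one request per second.

```python
import time
from collections import namedtuple


def geocode_towns(names, geocode, pause_seconds=1.0):
    """Look up each name with `geocode` and collect (name, lat, lon) rows.

    `geocode` is any callable returning an object with .latitude and
    .longitude attributes, or None when the place is not found.
    """
    rows, not_found = [], []
    for i, name in enumerate(names):
        if i and pause_seconds:
            time.sleep(pause_seconds)  # stay polite to the shared service
        location = geocode(name)
        if location is None:
            not_found.append(name)
        else:
            rows.append((name, location.latitude, location.longitude))
    return rows, not_found


def make_nominatim_geocoder(user_agent="ga-restaurants-example"):
    """Build a geopy-backed lookup (assumes geopy is installed; needs network)."""
    from geopy.geocoders import Nominatim

    return Nominatim(user_agent=user_agent).geocode


# Offline stub standing in for the real service, using values from the table above.
Location = namedtuple("Location", "latitude longitude")
FAKE_RESULTS = {
    "Abbeville, GA": Location(31.992122, -83.306824),
    "Acworth, GA": Location(34.065933, -84.676880),
}

rows, not_found = geocode_towns(
    ["Abbeville, GA", "Acworth, GA", "Nowhere, GA"],  # last name is hypothetical
    FAKE_RESULTS.get,
    pause_seconds=0,
)
print(rows)
print(not_found)
```

With network access, `geocode_towns(names, make_nominatim_geocoder())` would run the same loop against the live service.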

Using folium, we visualize all the towns/cities in the state of Georgia:

![GA-CitiesPlotted](GA-CitiesPlotted.png "Georgia Towns and Cities")


Using the Foursquare venue explore API, all venues in the Food category in the vicinity of each of these towns/cities are obtained.  
_Parameters used:_  
+ Maximum venues per town/city = 100  
+ Radius from lat/long = 5 km  
+ Section (category of venue) = Food  
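
To make the request concrete, here is a hedged sketch of how one explore call could be assembled and how its response could be flattened into the venue rows shown earlier. The query parameter names (`client_id`, `client_secret`, `v`, `ll`, `radius`, `limit`, `section`) and the `response` → `groups` → `items` → `venue` layout are written from memory of the legacy v2 API and should be checked against the current Foursquare documentation; the credentials, the version date, and the lowercase `food` value are placeholders and assumptions. No network call is made here.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lon, client_id, client_secret, version="20180605",
                      radius_m=5000, limit=100, section="food"):
    """Assemble the explore URL for one town (radius is given in meters)."""
    params = {
        "client_id": client_id,          # placeholder credentials
        "client_secret": client_secret,
        "v": version,                    # assumed API version date (YYYYMMDD)
        "ll": f"{lat},{lon}",
        "radius": radius_m,
        "limit": limit,
        "section": section,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def extract_venues(town, town_lat, town_lon, payload):
    """Flatten one decoded JSON response into rows of venue details."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append((town, town_lat, town_lon, venue["name"],
                         venue["location"]["lat"], venue["location"]["lng"],
                         categories[0].get("name")))
    return rows


url = build_explore_url(31.992122, -83.306824, "YOUR_ID", "YOUR_SECRET")

# Hand-written payload mimicking the assumed response shape.
sample_payload = {"response": {"groups": [{"items": [{"venue": {
    "name": "Ophelias",
    "location": {"lat": 31.992405, "lng": -83.306903},
    "categories": [{"name": "Diner"}],
}}]}]}}
venue_rows = extract_venues("Abbeville, GA", 31.992122, -83.306824, sample_payload)
print(url)
print(venue_rows)
```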


![GA-Venues-Head10](GA-Venues-head10.png "Georgia Venues")

**There are a total of 13118 venues across 102 distinct venue categories.**

Since the Foursquare data is crowd-sourced, it is prone to errors and misclassifications. It is important to cleanse the data, including handling missing values, correcting category classifications, and deleting incorrect entries.  
A few examples include:
+ Peruvian, Colombian, Argentinian, Brazilian, Latin American, and Venezuelan - need to be reclassified into South American Restaurants
+ South Indian, North Indian, Chaat Place - need to be reclassified into Indian Restaurants
+ Mediterranean, Greek, Turkish, and Middle Eastern - need to be classified together
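
The reclassification rules above can be expressed as a simple lookup table. This is a minimal sketch: the exact category strings (for example "Peruvian Restaurant") are assumed to follow Foursquare's naming, and "Mediterranean Restaurant" is our assumed name for the combined Mediterranean group, since the text only says those categories are classified together.

```python
# Target category -> source categories folded into it (names are assumptions).
CATEGORY_GROUPS = {
    "South American Restaurant": [
        "Peruvian Restaurant", "Colombian Restaurant", "Argentinian Restaurant",
        "Brazilian Restaurant", "Latin American Restaurant", "Venezuelan Restaurant",
    ],
    "Indian Restaurant": [
        "South Indian Restaurant", "North Indian Restaurant", "Chaat Place",
    ],
    "Mediterranean Restaurant": [
        "Greek Restaurant", "Turkish Restaurant", "Middle Eastern Restaurant",
    ],
}

# Invert the groups into a flat old-name -> new-name mapping.
REPLACEMENTS = {old: new for new, olds in CATEGORY_GROUPS.items() for old in olds}


def consolidate(category):
    """Return the consolidated category, or the input if no rule applies."""
    return REPLACEMENTS.get(category, category)


raw = ["Peruvian Restaurant", "Brazilian Restaurant", "Chaat Place",
       "Greek Restaurant", "Diner"]
cleaned = [consolidate(category) for category in raw]
print(cleaned)
print(len(set(raw)), "->", len(set(cleaned)), "distinct categories")
```

With a pandas DataFrame, the same mapping could be applied with `df["Venue Category"].replace(REPLACEMENTS)`.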

After all the cleansing is applied, **78** distinct venue categories remain.

#### Below are the top 10 venue categories

![Top10-VenueCategories](Top10-VenueCategories.png "Georgia Venue Categories - Top10")


### Feature set

The key to our feature set is one-hot encoding of the venue categories. For each town/city, we take the mean of the one-hot columns, that is, the frequency of occurrence of each category among its venues. This gives a numerical score for each of the 78 categories, for each town/city.

Since venue categories beyond the top 10 for a town/city say little about the kind of restaurants it offers, we will derive the 10 most common venue categories for each town/city and use these 10 fields as our feature set.

![FeatureSet](FeatureSet.png "Feature Set - Most common Venue Categories -")
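
The encoding and ranking described in this section can be sketched in plain Python. Averaging one-hot rows per town is the same as dividing each category's count by the town's venue count, which is what the sketch below does. The function names are hypothetical and the five rows are the sample venues from the Data Sources section; with pandas, `pd.get_dummies` followed by a `groupby(...).mean()` would be the usual equivalent.

```python
from collections import Counter, defaultdict

# (town, venue category) pairs taken from the sample rows shown earlier.
venues = [
    ("Abbeville, GA", "Diner"),
    ("Abbeville, GA", "Food"),
    ("Abbeville, GA", "Food"),
    ("Acworth, GA", "Cajun / Creole Restaurant"),
    ("Acworth, GA", "Italian Restaurant"),
]


def category_frequencies(venue_rows):
    """Mean of one-hot columns per town: share of venues in each category."""
    categories = sorted({category for _, category in venue_rows})
    counts = defaultdict(Counter)
    for town, category in venue_rows:
        counts[town][category] += 1
    table = {}
    for town, counter in counts.items():
        total = sum(counter.values())
        # Every town gets a score for every category (0.0 when absent).
        table[town] = {c: counter[c] / total for c in categories}
    return table


def top_categories(frequencies, n=10):
    """Names of the n most frequent categories, ties broken alphabetically."""
    ranked = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, share in ranked[:n] if share > 0]


freq = category_frequencies(venues)
top_abbeville = top_categories(freq["Abbeville, GA"], n=10)
print(freq["Abbeville, GA"])
print(top_abbeville)
```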

At this point, we found 68 towns/cities for which Foursquare has no reported food venues. We remove these 68 towns/cities from our clustering model.
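
Finding the towns without venues amounts to comparing the full town list against the towns that appear in the venue data. A minimal sketch, using a hypothetical town name for the empty case:

```python
# Full town list; "Hypothetical Town, GA" is invented to illustrate the empty case.
all_towns = ["Abbeville, GA", "Acworth, GA", "Hypothetical Town, GA"]
venue_rows = [
    ("Abbeville, GA", "Diner"),
    ("Acworth, GA", "Italian Restaurant"),
]

towns_with_venues = {town for town, _ in venue_rows}

# Preserve the original ordering while splitting the list in two.
towns_without_venues = [t for t in all_towns if t not in towns_with_venues]
towns_to_cluster = [t for t in all_towns if t in towns_with_venues]

print(towns_without_venues)
print(towns_to_cluster)
```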

### Methodology

The dataset for this set of problem statements has no predefined labels or primary categories that a given town/city belongs to. Hence, this is an **unsupervised** learning problem, and the model we select to solve it is **K-means clustering**.  
We use the K-means clustering algorithm to assign all the towns/cities to **10** clusters.
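
To show what the algorithm does, here is a small plain-Python sketch of K-means (Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until the assignments stop changing. It is not the project's code; a library implementation such as scikit-learn's `KMeans` would normally be used. The function names, the two-feature toy vectors, and the deterministic farthest-first initialization (chosen instead of random initialization so the example is repeatable) are all assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centroids(points, k):
    """Deterministic init: start at the first point, then repeatedly add the
    point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iterations=100):
    """Return (labels, centroids); labels are 0-based cluster indices."""
    centroids = farthest_first_centroids(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Toy frequency vectors: (share of fast food, share of cafes) per town.
towns = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = k_means(towns, k=2)
print(labels)
print(centroids)
```

In the project itself the same idea is applied with 10 clusters; the sketch's labels start at 0, whereas the clusters in this report are numbered from 1.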

Here is sample data from some of the clusters generated by our model:

**Cluster 1**

![Cluster1](Cluster1.png "Cluster 1")

**Cluster 5**

![Cluster5](Cluster5.png "Cluster 5")

**Cluster 6**

![Cluster6](Cluster6.png "Cluster 6")

**Cluster 8**

![Cluster8](Cluster8.png "Cluster 8")

**Cluster 10**

![Cluster10](Cluster10.png "Cluster 10")



### Results


Once the clusters are built, we visualize them through:  
1) **A scatter plot, obtained by converting all features and the target into 2-dimensional form**

![Cluster2D](Cluster2D.png "Cluster Map in 2D form")

2) **A folium map showing the clusters and their distribution across the state**

![ClusterMap](ClusterMap.png "Cluster Map")


Reviewing the results from each cluster, the table below summarizes the categories and trends of each cluster of towns/cities.

![Cluster info](Clusterinfo.png "Cluster Info")



### Discussion and Conclusion

This section elaborates on the answers and findings for all the problem statements defined earlier.

1. **What categories of restaurants are popular in different parts of the state?**  
In the 10 clusters we generated, we were able to find unique restaurant categories. The most popular restaurant categories include American, Fast Food, and Pizza Place. They are distributed in clusters 1, 2, 3, and 9.

![Top10-VenueCategories](Top10-VenueCategories.png "Georgia Venue Categories - Top10")


2. **Are vegan/vegetarian options available in the state? Where are they located?**  
Only 31 of the 13118 venues reported by Foursquare users across Georgia are vegetarian/vegan restaurants, which is roughly 0.2% of all venues. This suggests that Georgia is not a vegetarian/vegan-friendly state. Moreover, these few options are concentrated in a handful of cities, including Atlanta, Macon, Athens, Savannah, and Columbus.

![Veg/Vegan](VegVegan.png "Vegetarian/Vegan Restaurants")


3. **Where can one find a cafe or quick bite?**  
Cafes and sandwich/quick-bite places belong to cluster 7 and are distributed thinly across the entire state.

![Cafe](Cafe.png "Cafe and Quick Bites")

4. **Are there speciality southern restaurants in the state? Where are they located?**  
There are 231 restaurants classified as Southern. The results and their distribution suggest that Georgia is a typical southern state, with many American and Southern restaurants well distributed across the entire state.

![Southern](Southern.png "Southern Restaurants")

5. **Which towns do not have dine-in restaurants, and are therefore potentially not traveler-friendly for dine-in stops?**  
There are 68 towns/cities where Foursquare users have not checked in or reported any restaurants. These may be towns/cities to avoid when planning a dine-in stop.

![No Restaurants](None.png "No Restaurants")