# Determining Ideal Locations for New Restaurants in Detroit Metropolitan Area
## Author: Neil Gurram

## Introduction

The city of Detroit, Michigan appears to be on the upswing. Restaurant companies, especially those offering less common fare such as foreign cuisines, may therefore look for a place in the Detroit Metropolitan Area (DMA) to take advantage of its large population and bolster their revenues. More restaurants in the DMA would not only give Detroiters a wider variety of places to eat but could also create jobs for residents who are currently unemployed. Growth in the restaurant industry could in turn support the growth of related industries in the DMA, such as tourism, since the two go hand in hand.

## Business Problem

The goal of this project is to find locations in the DMA that are suitable for starting a restaurant, both to bolster the restaurant industry and improve Detroit and to give restaurants a chance to improve their revenue.

The target audience for this project is not only people interested in opening restaurants in the DMA but also those interested in opening other types of businesses, since similar analyses could be applied to other businesses.

## Data

The relevant pieces of data needed for our project are 1) a list of cities and towns in the DMA, 2) latitude and longitude data for those cities and towns, and 3) data for venues associated with the DMA.

### Cities and Towns in DMA

For the purpose of this project, I will limit the DMA to cities in Macomb, Oakland, and Wayne Counties. I scraped the city names from websites pertaining to cities in [Macomb](https://geographic.org/streetview/usa/mi/macomb/index.html), [Oakland](https://geographic.org/streetview/usa/mi/oakland/index.html), and [Wayne](https://geographic.org/streetview/usa/mi/wayne/index.html) Counties. The information on the three websites was consistent in that the cities were all listed under an unordered list (ul) tag, so the same code could extract the cities for each of the three counties. The modules used in this task were requests and BeautifulSoup.
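The scraping step could look roughly like the sketch below. It is not the project's actual code: the function names `extract_cities` and `scrape_county` are hypothetical, and the selector assumes that every list item under a ul tag is a city, which may need narrowing on the real pages.

```python
import requests
from bs4 import BeautifulSoup

# County index pages cited in the text above.
COUNTY_URLS = {
    "Macomb": "https://geographic.org/streetview/usa/mi/macomb/index.html",
    "Oakland": "https://geographic.org/streetview/usa/mi/oakland/index.html",
    "Wayne": "https://geographic.org/streetview/usa/mi/wayne/index.html",
}


def extract_cities(html):
    """Return the text of every <li> nested under a <ul> in the page.

    Assumption: each list item holds exactly one city name.
    """
    soup = BeautifulSoup(html, "html.parser")
    names = [li.get_text(strip=True) for li in soup.select("ul li")]
    return [name for name in names if name]


def scrape_county(county, url, timeout=10):
    """Download one county page and return (city, county) pairs."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return [(city, county) for city in extract_cities(response.text)]


# Offline check on a hand-written HTML fragment (no network access needed).
sample_html = "<ul><li><a href='#'>Warren</a></li><li> Sterling Heights </li></ul>"
print(extract_cities(sample_html))
```

Calling `scrape_county(name, url)` for each entry of `COUNTY_URLS` would then give one combined list of city and county pairs.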

### Longitudes and Latitudes for Cities and Towns in DMA

I will be using the geopy module to get the latitudes and longitudes for cities and towns in the DMA. I will then store the results in a dataframe containing city, county, latitude, and longitude.

Upon initial inspection of the dataframe, some of the computed latitudes and longitudes were clearly wrong, so we needed to keep only the correct ones. To do this, we first keep only the rows whose latitude is at least 41 and at most 43 and whose longitude is at least -84 and at most -82. A manual check of the remaining rows then shows that all the latitudes and longitudes seem proper. Alternatively, we could have manually replaced the wrong coordinates with the correct ones, but we chose the former method to ensure all data came from the same source.
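As a minimal sketch of this bounding-box check, assuming the dataframe has columns named `City`, `County`, `Latitude`, and `Longitude` (the column names and the sample values are illustrative, not taken from the project):

```python
import pandas as pd

# Bounding box from the text: latitude in [41, 43], longitude in [-84, -82].
LAT_MIN, LAT_MAX = 41.0, 43.0
LON_MIN, LON_MAX = -84.0, -82.0


def filter_to_bounding_box(df):
    """Keep only rows whose coordinates fall inside the DMA bounding box."""
    in_lat = df["Latitude"].between(LAT_MIN, LAT_MAX)
    in_lon = df["Longitude"].between(LON_MIN, LON_MAX)
    return df[in_lat & in_lon].reset_index(drop=True)


# Illustrative rows; the last one mimics a geocoding result that landed
# far outside Michigan even though its latitude alone looks plausible.
cities = pd.DataFrame({
    "City": ["Detroit", "Troy", "Romeo"],
    "County": ["Wayne", "Oakland", "Macomb"],
    "Latitude": [42.3314, 42.6064, 41.9028],
    "Longitude": [-83.0458, -83.1498, 12.4964],
})

filtered = filter_to_bounding_box(cities)
print(filtered)
```

Checking both coordinates matters here: the last row passes the latitude test and is rejected only by the longitude test.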

After filtering, we get a dataframe displayed below.

![City_Coordinates.png](attachment:c196ee7d-3421-45e3-8bb6-c2b82e0ff38a.png)

### Venues in DMA

After getting latitude and longitude information for cities in the DMA, I will be using the FourSquare API to get relevant venue information for the aforementioned cities. I will append the venue data, which consists of the Venue Name, Venue Latitude, Venue Longitude, and Venue Category, to the above dataframe of city information. To get the venues tied to each city, we find all venues within one mile of the city's (latitude, longitude) pair. We limit the results to 200 venues per city.

![Venues_Unfiltered.png](attachment:dd418793-7917-422c-9099-9b3dd9229494.png)

After getting the dataframe, I realized there were 292 Venue Categories in total, many of them not pertinent to restaurants. So I went through all the unique Venue Categories and kept only the ones that pertained to a restaurant. I used my own discretion as to what to consider a restaurant, which resulted in 57 Restaurant Venue Categories. The filtered restaurant data is shown below.

![Venues_Filtered.png](attachment:b0831fb6-9e3d-4fa6-ad04-c1fe79a5bfb4.png)
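The category filter can be sketched as follows. The venue rows and the three-item category set are made up for illustration; the project's actual list contained 57 hand-picked categories.

```python
import pandas as pd

# Hypothetical venue rows using the column names described above.
venues = pd.DataFrame({
    "City": ["Detroit", "Detroit", "Troy", "Troy"],
    "Venue": ["Venue A", "Venue B", "Venue C", "Venue D"],
    "Venue Category": ["Pizza Place", "Park", "Mexican Restaurant", "Gym"],
})

# Stand-in for the hand-picked set of restaurant categories.
restaurant_categories = {"Pizza Place", "Mexican Restaurant", "Fast Food Restaurant"}

# Keep only the rows whose category is in the restaurant set.
is_restaurant = venues["Venue Category"].isin(restaurant_categories)
restaurants = venues[is_restaurant].reset_index(drop=True)
print(restaurants)
```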

Note that the same restaurant could be returned by two different API calls when the one-mile neighborhoods of two cities overlap. However, when processing later we can ignore duplicates as necessary. It is also possible that two rows represent the same restaurant but are not identical in the dataframe. Because there isn't an obvious way to determine from the data whether two such rows refer to the same restaurant, we will keep the data as is.

We have all the relevant data needed now to proceed forward in our project. Any additional manipulations and presentations of data will be specified in future sections, but will all originate from the three types of data presented here.

## Methodology

### Data Visualization

It is important to visualize the data that is in place. First, one can get a count of the number of cities in each county that is considered, as displayed below.

![Cities_Per_County.png](attachment:85abd9d2-c928-4549-a938-656c6e19239c.png)
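A count like this takes one line with pandas. The sketch below assumes `City` and `County` column names and uses a four-row sample rather than the project's data.

```python
import pandas as pd

# Small sample of the city dataframe (column names are assumptions).
cities = pd.DataFrame({
    "City": ["Detroit", "Dearborn", "Troy", "Warren"],
    "County": ["Wayne", "Wayne", "Oakland", "Macomb"],
})

# Number of distinct cities in each county.
cities_per_county = cities.groupby("County")["City"].nunique()
print(cities_per_county)
```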

Then, using Python's Folium package, we can present the cities in the DMA, colored by county.

![Map_Cities_By_County.png](attachment:27860310-da9d-480a-9d29-cc9c7b1fb6cf.png)

### Determining Features

After presenting the geographic visualization of the DMA, we then need to go back to our Venue Data and convert it into something more tractable. Namely, the Venue Category is categorical, but we need an easy way to quantify how similar two cities are by type of venue. This calls for one-hot encoding, where each category gets its own column that is assigned 1 if it matches the row's original venue category and 0 otherwise. The one-hot encoding is shown below.

![One_Hot_Encoding_Unfiltered.png](attachment:596854b9-cde1-4673-aaec-ab4873668baa.png)
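One way to produce such an encoding is `pandas.get_dummies`. The sketch below uses a three-row sample. It assumes no venue category is itself named `City`, since that would clash with the inserted column.

```python
import pandas as pd

venues = pd.DataFrame({
    "City": ["Detroit", "Detroit", "Troy"],
    "Venue Category": ["Pizza Place", "Park", "Pizza Place"],
})

# One column per category: 1 if the venue has that category, 0 otherwise.
one_hot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the city name back as the first column so rows can be grouped later.
one_hot.insert(0, "City", venues["City"])
print(one_hot)
```

Each row contains exactly one 1 among the category columns, because each venue has a single category.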

Furthermore, one needs a score of some sort for each city, representing the proportion of the city's venues that fall into each venue category. To do this, one can group by city, sum each venue category column, and then divide by the total number of venues for that city. This proportion dataframe is shown below. We remove the coordinates from this dataframe as they are not needed for future analysis.

![Proportion_Unfiltered.png](attachment:2cb256c8-a9a0-4e48-b741-00d6ba54aea3.png)
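The group-sum-divide step can be sketched as below, starting from a tiny hand-written one-hot table (the values are illustrative only).

```python
import pandas as pd

# Tiny one-hot table: one row per venue, one column per category.
one_hot = pd.DataFrame({
    "City": ["Detroit", "Detroit", "Troy"],
    "Park": [0, 1, 0],
    "Pizza Place": [1, 0, 1],
})

# Sum each category column per city ...
counts = one_hot.groupby("City").sum()

# ... then divide each row by that city's total number of venues.
totals = counts.sum(axis=1)
proportions = counts.div(totals, axis=0)
print(proportions)
```

Since every venue contributes exactly one 1, this is equivalent to `one_hot.groupby("City").mean()`, and each city's proportions sum to 1.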

In addition, as taught in our Capstone Project Course, we can also get a dataframe, as shown below, of the top ten venue categories for each city. This gives an idea of what kinds of venues are present in the city; notice that this amounts to finding the ten highest venue category proportions for a given city in the previous proportion dataframe. We shall call this dataframe the top ten dataframe.

![Top_Ten_Unfiltered.png](attachment:d60d6d87-d296-4cfc-a7c6-11f1103e5dec.png)
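A sketch of how such a ranking could be built is given below. The helper `top_categories` and the `Rank i` column labels are hypothetical names, and the input is assumed to be a proportion dataframe indexed by city.

```python
import pandas as pd


def top_categories(proportions, n=10):
    """Return the n highest-proportion categories per city, best first.

    `proportions` is indexed by city with one column per venue category.
    Ties are broken arbitrarily by the sort.
    """
    n = min(n, proportions.shape[1])  # cannot rank more columns than exist
    rows = []
    for city, row in proportions.iterrows():
        ranked = row.sort_values(ascending=False).head(n).index.tolist()
        rows.append([city] + ranked)
    columns = ["City"] + [f"Rank {i + 1}" for i in range(n)]
    return pd.DataFrame(rows, columns=columns)


# Illustrative proportions for two cities and three categories.
proportions = pd.DataFrame(
    {"Park": [0.2, 0.5], "Pizza Place": [0.5, 0.1], "Zoo": [0.3, 0.4]},
    index=["Detroit", "Troy"],
)

top_two = top_categories(proportions, n=2)
print(top_two)
```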

Next, one can try to make decisions solely based on the restaurant venues. Perhaps one will decide based on restaurants as the restaurants themselves will provide a better way of scoring a city to determine what kind of city it is. This filtering could provide a different way of determining if two cities are similar. The one-hot encoding is shown below.

![One_Hot_Encoding_Filtered.png](attachment:b04b65e2-4eb2-44b0-92c5-863ce0489133.png)

Going forward in this project, we will use the word unfiltered when discussing the whole data and the word filtered when discussing only venues associated with restaurants. For example, the aforementioned dataframes will be called the unfiltered proportion dataframe and the unfiltered top ten dataframe.

With that said, one can then get the filtered proportion dataframe and filtered top ten dataframe. Both these dataframes are shown below.

![Proportion_Filtered.png](attachment:0d0830f1-9390-45b5-ac42-25717d766580.png)

![Top_Ten_Filtered.png](attachment:55c16157-06a1-4940-a7ca-afea8ffd5c68.png)

### Machine Learning Algorithms Used

With the proportion dataframe, one can view each row as an example of a city's data showing the proportion of all the venues. These examples can then be processed with <em>k-means clustering</em>, an unsupervised learning algorithm that can group examples based on a metric, usually Euclidean distance. In this case, examples are closer together based on the similarity between their proportions. This means we only will consider the venue proportions (and not city name, county name, city coordinates, venue name, and venue coordinates) when clustering.

The question then becomes how many clusters are best suited to group the data examples. One way to decide is to plot the sum of squared errors (SSE) versus the number of clusters and find the point where the graph stops decreasing sharply and starts decreasing slowly. This is the <em>elbow method</em>. The plots for the filtered and unfiltered data are shown below. We only consider two to nine clusters in our plots, given that the number of cities is not too large.

![Elbow_Plots.png](attachment:a637ff02-fd2a-460a-9045-1d838e9bb3a0.png)
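The SSE values behind an elbow plot can be computed as in the sketch below, which uses scikit-learn's `KMeans` and its `inertia_` attribute (the within-cluster sum of squared distances). The data here is synthetic, standing in for a proportion dataframe, and `sse_by_k`, `n_init=10`, and `random_state=0` are illustrative choices rather than the project's settings.

```python
import numpy as np
from sklearn.cluster import KMeans


def sse_by_k(features, k_values, random_state=0):
    """Fit k-means for each k and return {k: sum of squared errors}."""
    sse = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(features)
        sse[k] = model.inertia_  # within-cluster sum of squared distances
    return sse


# Synthetic stand-in: 40 "cities", 5 "categories", each row sums to 1.
rng = np.random.default_rng(0)
features = rng.dirichlet(np.ones(5), size=40)

sse = sse_by_k(features, range(2, 10))
for k, value in sse.items():
    print(k, round(value, 4))
```

Plotting `sse` against `k` and looking for the bend gives the elbow. On real data the bend can be faint, so the final choice of cluster count still involves judgment.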

For the unfiltered data (associated with all venues), the graph does not show much of an elbow. There is, however, a slight one at 6, so we choose 6 as the number of clusters for the unfiltered k-means clustering model. For the filtered data (only associated with restaurants), the graph is more erratic, particularly at 6 and 8 clusters, so at our discretion we also choose 6 as the number of clusters for the filtered k-means clustering model.

## Results

### Considering all Venues

With the unfiltered clustering model, one can then determine the cluster number for each city. This can then be inserted into the unfiltered top ten dataframe as shown below.

![Top_Ten_With_Cluster_Unfiltered.png](attachment:cdaf8206-9f5d-4ccc-9de8-5a5d993225cd.png)
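Attaching cluster numbers to the top ten dataframe could be done as in the following sketch. The data is synthetic, the column name `Cluster` is an assumption, and the approach relies on the proportion and top ten dataframes listing cities in the same row order.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the proportion dataframe: 30 cities, 4 categories.
rng = np.random.default_rng(0)
city_names = [f"City {i}" for i in range(30)]
proportions = pd.DataFrame(
    rng.dirichlet(np.ones(4), size=30),
    columns=["Park", "Pizza Place", "Zoo", "Gym"],
)
proportions.insert(0, "City", city_names)

# Cluster on the venue proportions only, not on the city name.
features = proportions.drop(columns="City")
model = KMeans(n_clusters=6, n_init=10, random_state=0).fit(features)

# Stand-in for the top ten dataframe, with rows in the same city order.
top_ten = pd.DataFrame({"City": city_names})
top_ten.insert(1, "Cluster", model.labels_)

# All cities assigned to one cluster, e.g. Cluster 0.
cluster_0 = top_ten[top_ten["Cluster"] == 0]
print(top_ten["Cluster"].value_counts().sort_index())
```

The cluster numbers themselves are arbitrary labels, so a different `random_state` may produce the same groups under different numbers.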

For each cluster, one can present the dataframe for all examples within that cluster. This is shown in the following clusters below in order from Cluster 0 to Cluster 5.

#### Cluster 0
![Top_Ten_With_Cluster_0_Unfiltered.png](attachment:a245b117-e26b-4c09-922a-272a3c5fc286.png)

#### Cluster 1
![Top_Ten_With_Cluster_1_Unfiltered.png](attachment:5a2ac181-6979-42f7-86dd-8984f9d4fe5c.png)

#### Cluster 2
![Top_Ten_With_Cluster_2_Unfiltered.png](attachment:97ab693b-25a5-4ca4-bbf0-db29ef85b9c1.png)

#### Cluster 3
![Top_Ten_With_Cluster_3_Unfiltered.png](attachment:0795c223-0a5e-4fd3-a5ab-aef2f87a8233.png)

#### Cluster 4
![Top_Ten_With_Cluster_4_Unfiltered.png](attachment:7a6ce5b3-6177-40b9-b321-6e2b1ffc7ec8.png)

#### Cluster 5
![Top_Ten_With_Cluster_5_Unfiltered.png](attachment:3c0fbf0b-6edd-46b3-b0d9-cc349eb48799.png)

In addition, from the unfiltered cluster information, one can plot the cities color-coded by the cluster, as shown below.

![Clustering_Unfiltered.png](attachment:be1e93e7-f8fa-47b6-a943-db13d62e68b3.png)

### Considering only Restaurants

With the filtered clustering model, one can obtain analogously the cluster number for each city and the cluster numbers can be inserted into the filtered top ten dataframe as shown below.

![Top_Ten_With_Cluster_Filtered.png](attachment:2b851da5-8080-4ae8-8499-d1d89e144066.png)

For each cluster, one has dataframes associated with all examples for each cluster, and these dataframes are shown in sequence below from Cluster 0 to Cluster 5.

#### Cluster 0
![Top_Ten_With_Cluster_0_Filtered.png](attachment:c7064b99-6f98-4e13-bd1f-226784576857.png)

#### Cluster 1
![Top_Ten_With_Cluster_1_Filtered.png](attachment:141447d6-caf0-4382-94a7-e0ac0335341e.png)

#### Cluster 2
![Top_Ten_With_Cluster_2_Filtered.png](attachment:6ac9e4db-756b-40cd-9b4d-5ad7039de009.png)

#### Cluster 3
![Top_Ten_With_Cluster_3_Filtered.png](attachment:85f5d753-ce9d-4661-b667-d44d396c50e9.png)

#### Cluster 4
![Top_Ten_With_Cluster_4_Filtered.png](attachment:a3f26ce5-9500-4806-947c-3e8a10642339.png)

#### Cluster 5
![Top_Ten_With_Cluster_5_Filtered.png](attachment:2ecfbac2-e5f6-45dd-b3d3-385a790504c9.png)

Then, from the filtered cluster information, one can plot the cities color-coded by the cluster, as shown below.

![Clustering_Filtered.png](attachment:fda0049a-8cd9-4bbc-8ad6-72b117f641a9.png)

## Discussion

Assume one operates on the assumption that it is good to build a restaurant in an area where other restaurants of similar types are located, as that would eventually attract people to come to the new restaurant, since they would want to try this restaurant's food. Then, based on what type of restaurant a company is looking to start in Detroit, one can determine which cluster is most similar to the restaurant, and then determine which cities may be best to build the restaurant.

The above assumption can be applied similarly to both unfiltered and filtered data. As seen in the unfiltered cluster dataframes in the Results section, clusters 1, 2, 3, and 4 have a preponderance of restaurants in the top ten. To get a further understanding of these clusters, one can then look at the filtered cluster dataframes in the Results section. Within the filtered data, one quickly observes that Cluster 4 seems to favor Fast Food Restaurants, Cluster 5 seems to favor American Restaurants and Pizza Places, and Cluster 2 seems to have more ethnic diversity in its cuisine. The other three clusters are too small to make a generalization.

In addition, one also sees from the unfiltered clustering that clusters 0 and 5 do not seem appropriate for building, since the venues tied to those clusters are not really related to the restaurant business. As seen from the unfiltered top ten dataframe, the venues are more along the lines of Parks for Cluster 0 and Zoos for Cluster 5. This makes some sense, as the locations are either isolated, as in Cluster 0, or in more recreational areas that may not have many restaurants, as in Cluster 5.

Lastly, it is interesting to note that there doesn't seem to be any geographical basis for the clustering, as observed in the maps. So this shows that the various cuisines in DMA are spread out.

## Conclusion

This project attempts to determine where a restaurant can be created in Detroit Metropolitan Area, under the assumption one is looking for similar restaurants to draw customers from. A k-means clustering algorithm was used to determine this similarity measure, and there were two sets of data used for this k-means clustering, namely the filtered dataframes that only consider restaurants and the unfiltered dataframes that consider other venues.

Our project only uses venues within a mile of a specified latitude and longitude of a city to characterize a city; perhaps one needs to consider a larger distance to characterize a city better or should use a smaller distance to avoid noise that may be tied to other cities. Also, the one-hot encoding used for the project seems to include a lot of specific descriptions, so perhaps grouping the specifics into general categories would create a more easy-to-understand model that would give a better characterization of cities within clusters. These two facets could be considered and modified for future research.

In addition, our project assumes that the new restaurant will generate enough profit to stay above water. One may need to consider competition from similar restaurants, which could make a new restaurant infeasible, as well as the operational costs of working in a city, since certain cities are more expensive to work in than others. These and other factors that businesses consider highly important may be examined in future research, and would inevitably lead to new models.