# Coursera Capstone Project Final Report
## The Battle of the Neighbourhoods

## Author: Michiel Schinkel

# Introduction
In today’s society, the ability to use data to make decisions is becoming increasingly important. Data-driven approaches to healthcare, marketing and business have shown great results. Data science gives us the opportunity to learn from past events with unprecedented accuracy and to detect patterns that could otherwise go unnoticed.

Suppose we would like to open a new sandwich place, or any other type of venue, in New York City (NYC). We could pick a location at random and hope for the best. Alternatively, we could make a more informed decision about where to open our new place by looking at data on other venues in NYC and seeing where other sandwich places are located.

This project aims to cluster neighbourhoods based on their venue characteristics, in order to find the place within NYC that may offer the greatest chance of success for a new sandwich place. The clusters are then plotted on a map of NYC.


## Data
Several data sources will be used for this project. First of all, a list of neighbourhoods in NYC is downloaded from the following location:

-	https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
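As a rough illustration of how such a file can be turned into a table of neighbourhoods, the sketch below parses a small inline sample. It assumes the file is a GeoJSON-style feature collection in which each feature carries `borough` and `name` properties and a `[longitude, latitude]` coordinate pair; the sample values and the function name `parse_neighborhoods` are hypothetical.

```python
import json

# Tiny inline sample mimicking the ASSUMED layout of the file.
# Names and coordinates are made up for illustration only.
SAMPLE = """
{"features": [
  {"properties": {"borough": "Borough X", "name": "Example A"},
   "geometry": {"coordinates": [-73.9, 40.8]}},
  {"properties": {"borough": "Borough Y", "name": "Example B"},
   "geometry": {"coordinates": [-74.0, 40.7]}}
]}
"""


def parse_neighborhoods(geojson):
    """Return one dict per neighbourhood with its borough and coordinates."""
    rows = []
    for feature in geojson["features"]:
        # GeoJSON stores points as [longitude, latitude].
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return rows


neighborhoods = parse_neighborhoods(json.loads(SAMPLE))
```

For the real file, the same function could be applied to the result of `json.load` on the downloaded document, provided the layout matches the assumption above.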

Secondly, the geographic coordinates of NYC are obtained through a geolocator from the geopy library.

Finally, we send a GET request to the Foursquare API to obtain the venues within a radius of 500 meters of each NYC neighbourhood. These venues will be used as characteristics of the clusters and to decide upon the best location to start a new sandwich place.
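The request for a single neighbourhood might be assembled roughly as follows. This is only a sketch: the report does not state which endpoint was used, so the legacy v2 `venues/explore` endpoint, the version date, the `limit` value, and the helper name `build_explore_url` are all assumptions. The credentials are placeholders and no request is actually sent here.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lon, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a venue-search URL for one neighbourhood (assumed legacy v2 endpoint)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (illustrative value)
        "ll": f"{lat},{lon}",            # latitude,longitude of the neighbourhood
        "radius": radius,                # search radius in meters, as in the report
        "limit": limit,                  # maximum number of venues (assumed value)
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


url = build_explore_url(40.7, -74.0, "YOUR_ID", "YOUR_SECRET")
```

The resulting URL could then be fetched with any HTTP client, once per neighbourhood.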


## Methods
### Variable selection
All venues within a 500 meter radius of a given neighborhood are obtained through the Foursquare API. These venues serve as features to find clusters of similar neighborhoods. One-hot encoding is used to change these categorical variables into binary variables. A value of 0 means that the venue is not present in the neighborhood; a value of 1 means that it is.
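The encoding step can be sketched with pandas as below. The toy table, the column names and the neighborhood labels are hypothetical; taking the maximum per neighborhood is one way to get the 0/1 presence indicator described above, and is not necessarily how the original notebook aggregated the data.

```python
import pandas as pd

# Hypothetical venue list: one row per venue found near a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C"],
    "Venue Category": ["Sandwich Place", "Park", "Park", "Pool", "Sandwich Place"],
})

# One column per venue category, 1 where the row belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Collapse to one row per neighborhood: 1 if the category occurs at least once.
presence = onehot.groupby("Neighborhood").max()
```

Each row of `presence` is then the binary feature vector of one neighborhood.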

### K-means clustering
For the main analysis, k-means clustering is used. This is an unsupervised machine learning method that groups unlabelled data points into k clusters of closely related points.

On a high level, k-means clustering works as follows. K random points are placed in the feature space and every data point is assigned to the closest of these K points. These are the initial clusters. Then, in an iterative process, the centers of the newly formed clusters become the new K points and all data points are once more assigned to the closest K point. This process is repeated until no points change clusters.
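The loop described above can be written in a few lines of NumPy. This is a minimal sketch rather than the implementation used in the analysis: the function name `kmeans` and the toy data are hypothetical, and the initial K points are drawn from the data points themselves, which is one common way of placing them at random.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: returns (labels, centers) for a float array X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Distance of every point to every center, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop as soon as no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each center to the mean of its points (unchanged if it has none).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers


# Two well-separated toy groups of points.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(X, k=2)
```

On this toy input the first three points end up in one cluster and the last three in the other.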

### Choosing K
The K-means algorithm needs to be told the number of clusters (K) to form. To choose K as objectively as possible, we use an elbow plot. The elbow plot shows the Mean Squared Error (MSE) for every value of K. Naturally, the MSE decreases as K increases. The point at which the decrease in MSE starts to slow down (the elbow) indicates the optimal value of K for this problem.
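The values behind an elbow plot could be computed along these lines. The sketch assumes scikit-learn's `KMeans` and takes the MSE to be the within-cluster sum of squared distances (`inertia_`) divided by the number of points; the synthetic data and the range of K are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups of 30 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

mse_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the closest center;
    # dividing by the number of points gives a mean squared error.
    mse_by_k[k] = model.inertia_ / len(X)
```

Plotting `mse_by_k` against K gives the elbow plot. For this synthetic data the curve drops steeply up to K = 3 and flattens afterwards; real venue data rarely shows such a clean bend.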

## Results
### New York Neighborhoods
The image below shows a geographical map of the neighborhoods of New York.

![image.png](attachment:image.png)

### The optimal value of K
Before we initialize the K-means algorithm, we create an elbow plot to find the optimal value of K. See below:

![image-2.png](attachment:image-2.png)


Although there is no clear elbow in the plot, there seems to be a slowing of the decrease in MSE around K = 6. 

### Cluster visualization
The K-means algorithm is instructed to form 6 clusters of neighborhoods based on the venue data. The outcome of the clustering is presented in the visual map below.

![image-3.png](attachment:image-3.png)

### Cluster characteristics
The frequency with which the venues occur in the different clusters is visualized in the bar chart below.

![image-7.png](attachment:image-7.png)

### Without parks and pools
Since pools and parks greatly influence the bar chart, but only occur in clusters 5 and 6 (which contain only two and one neighborhoods, respectively), we present an additional bar chart without these venues.

![image-4.png](attachment:image-4.png)

### Sandwich places
Since the primary aim of this search is to find the optimal place to open a new sandwich shop, we present data on this particular venue here.

The sandwich places are located in clusters 3, 4 and 5. Most sandwich places are located in cluster 4.

## Discussion
### Where to open a new Sandwich Place
Through this analysis we found three clusters of neighborhoods that currently have sandwich places in them. Cluster 4 contains the most sandwich places. Based on this information, we would open a sandwich place in a neighborhood that is in cluster 4 but does not have a sandwich place yet. The first example in alphabetical order would be Allerton in the Bronx.
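The selection rule described above (same cluster, no sandwich place yet, sorted alphabetically) can be expressed as a simple filter. The table below is entirely hypothetical, including the neighborhood names and column names; it only illustrates the rule and does not reproduce the results of the analysis.

```python
import pandas as pd

# Hypothetical clustering output: one row per neighborhood.
clustered = pd.DataFrame({
    "Neighborhood": ["Delta", "Alpha", "Charlie", "Bravo"],
    "Cluster": [4, 4, 4, 3],
    "Sandwich Place": [0, 1, 0, 0],  # 1 if a sandwich place is present
})

target_cluster = 4
# Keep neighborhoods in the target cluster that have no sandwich place yet.
mask = (clustered["Cluster"] == target_cluster) & (clustered["Sandwich Place"] == 0)
candidates = sorted(clustered.loc[mask, "Neighborhood"])
```

The first entry of `candidates` is then the first suggestion in alphabetical order.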

### Limitations
This analysis is limited to the neighborhoods of New York. We only have limited data through the lite version of the Foursquare API and therefore focused on these neighborhoods only.

### Number of clusters
Another limitation of this analysis is that the ideal number of clusters was somewhat hard to determine from the elbow plot. The current solution has a cluster with just one neighborhood. This is very unbalanced, although it does a great job of clustering pools and parks, which do not occur in any other clusters.

## Conclusion

In this analysis we clustered New York neighborhoods based on the venues within them, in order to find a suitable neighborhood to open a new sandwich place. We conclude that there are several neighborhoods that could potentially support a sandwich place but do not have one yet. Among them is Allerton in the Bronx.