# New York City Neighborhood Suitability for a Business Plan - Healthy Food Store

## Introduction
New York City has 306 neighborhoods in 5 boroughs. While some neighborhoods share similar characteristics, others are unique and well suited to a specific business purpose. For example, a high density of coffee shops or cafes can be a distinguishing feature of neighborhoods with many office buildings. Such a neighborhood would be a good place to open a restaurant aimed at business customers (provided that the restaurant density is not already extremely high). On the other hand, neighborhoods with parks, sports and leisure facilities, grocery stores and schools could be a good place to live for a family with children.

Characteristic features of these neighborhoods can be determined from geolocation data using various statistical or machine learning techniques. In this project, [Foursquare](https://foursquare.com/) venue location data is used to explore New York City neighborhoods using K-Means clustering. Neighborhoods are divided into clusters based on their similarity, i.e. the types of venues they contain and how often each type occurs.

### Business problem
__Problem/question to solve: What are the best candidate neighborhoods to open a store with healthy food?__

Analysis of location data can be used to answer many questions. Here, we look for the best neighborhoods in which to open a shop that sells products for an active lifestyle and a healthy diet: organic products including fresh vegetables and fruit, wholegrain products, protein-rich food, special types of flour, cereals, grains, etc. It is a shop where people can find everything they need for their nutrition.

The project aims to determine the best neighborhood(s) in which to start a new healthy food store, which is of great interest to entrepreneurs willing to start such a business in a new area. The study can support the decision on where to open the store in order to:
- maximize profits
- minimize risks  

A well-chosen location will help the store gain a stable and potentially growing number of target customers. It will also help avoid losses that could arise, for example, from having too few target customers nearby.

Here are some assumptions and considerations:
- Let's assume that people with an active lifestyle use facilities like gyms, pools, other sports facilities or parks. Neighborhoods with these features would be suitable for a healthy food shop.
- There is a high chance that healthy-lifestyle products are already sold in supermarkets and grocery stores. Our candidate neighborhoods shouldn't be rich in these venues. We don't want to add another shop where there are already many shops nearby, because the competition could reduce profits.
- A high abundance of restaurants of different kinds, fast food places, pizza places and similar venues might suggest that the neighborhood is not the best candidate for our business idea. Such neighborhoods might be rich in social and cultural life, and people wouldn't spend their time there looking for healthy products.
- Clustering neighborhoods by the abundance of venues in different categories helps decide whether a neighborhood is a good candidate or not.

## Data
We use data from two sources:
- New York City neighborhood data (available from the [NYU Spatial Data Repository](https://geo.nyu.edu/catalog/nyu_2451_34572)) that contains the following information about every neighborhood:
    - neighborhood name
    - borough name
    - neighborhood latitude
    - neighborhood longitude
- location data obtained from the [Foursquare](https://foursquare.com/) API that includes information about venues and their categories in the respective neighborhood

Both datasets are converted to pandas dataframes for easy manipulation and analysis.
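As an illustration of this conversion step, here is a minimal sketch that flattens a GeoJSON-like structure into a dataframe with the four columns listed above. The JSON layout (a `features` list whose items carry `properties.name`, `properties.borough` and `geometry.coordinates` in longitude, latitude order) is an assumption about the file, and `features_to_frame` and the inline `sample` are hypothetical names used only for this example.

```python
import pandas as pd

# Tiny inline sample shaped like a GeoJSON FeatureCollection.
# The key names below are an ASSUMPTION about the real file's layout.
sample = {
    "features": [
        {"properties": {"name": "Wakefield", "borough": "Bronx"},
         "geometry": {"coordinates": [-73.847201, 40.894705]}},
        {"properties": {"name": "Co-op City", "borough": "Bronx"},
         "geometry": {"coordinates": [-73.829939, 40.874294]}},
    ]
}

def features_to_frame(geojson):
    """Flatten a list of point features into one row per neighborhood."""
    rows = []
    for feature in geojson["features"]:
        # GeoJSON stores points as [longitude, latitude].
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    columns = ["Borough", "Neighborhood", "Latitude", "Longitude"]
    return pd.DataFrame(rows, columns=columns)

neighborhoods = features_to_frame(sample)
print(neighborhoods)
```

For the real file, the same function could be applied to the result of `json.load` on the downloaded dataset, provided the layout matches the assumption above.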

Location data will be used to cluster neighborhoods based on their similarities.

__Examples of data:__

Example of New York City neighborhood data:  

| Borough | Neighborhood | Latitude | Longitude |
| :------ | :----------- | :------- | :-------- |
| Bronx   | Wakefield    |40.894705 |-73.847201 |
| Bronx   | Co-op City   |40.874294 |-73.829939 |
| Bronx   | Eastchester  |40.887556 |-73.827806 |
| Bronx   | Fieldston    |40.895437 |-73.905643 |
| Bronx   | Riverdale    |40.890834 |-73.912585 |

Example of location data:

| Neighborhood | Borough | Neighborhood Latitude | Neighborhood Longitude | Venue            | Venue Latitude | Venue Longitude | Venue Category |
| :----------- | :------ | :-------------------- | :--------------------- | :--------------- | :------------- | :-------------- | :------------- |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Lollipops Gelato | 40.894123      | -73.845892      | Dessert Shop   |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Rite Aid         | 40.896649      | -73.844846      | Pharmacy       |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Carvel Ice Cream | 40.890487      | -73.848568      | Ice Cream Shop |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Walgreens        | 40.896687      | -73.844850      | Pharmacy       |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Dunkin'          | 40.890459      | -73.849089      | Donut Shop     |

The location data contain much more information, but we use only the venues and their categories to cluster neighborhoods. Note: both datasets come in JSON format.
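One way to turn venues and categories into clustering features is to compute, for each neighborhood, the share of its venues that falls into each category. The sketch below does this for the five example rows above; it is one possible encoding, not necessarily the exact one used in the project. Rows are keyed by both borough and neighborhood on the assumption that a neighborhood name may not be unique across boroughs.

```python
import pandas as pd

# The five example venue rows from the table above.
venues = pd.DataFrame({
    "Neighborhood": ["Wakefield"] * 5,
    "Borough": ["Bronx"] * 5,
    "Venue": ["Lollipops Gelato", "Rite Aid", "Carvel Ice Cream",
              "Walgreens", "Dunkin'"],
    "Venue Category": ["Dessert Shop", "Pharmacy", "Ice Cream Shop",
                       "Pharmacy", "Donut Shop"],
})

# Count venues per (borough, neighborhood) and category.
counts = pd.crosstab(
    [venues["Borough"], venues["Neighborhood"]],
    venues["Venue Category"],
)

# Divide each row by its total so every neighborhood's row sums to 1.
profile = counts.div(counts.sum(axis=1), axis=0)
print(profile)
```

Each row of `profile` is then a frequency vector that can be passed to a clustering algorithm. Normalizing by the row total keeps neighborhoods with many venues from dominating those with few.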

## Methods
We will use standard K-Means clustering to group neighborhoods by their similarity, measured in terms of the venue categories present in a neighborhood and their abundance.
To analyze neighborhoods and study the effects of clustering, we will use the following approach:
- cluster all neighborhoods in New York City, irrespective of the boroughs they belong to:
    - use K (the number of clusters) = 5 and K = 10, and compare the results
- cluster neighborhoods within each borough, i.e. take only neighborhoods belonging to one borough at a time:
    - use K = 5 and K = 8, and compare the results
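The clustering runs can be sketched with scikit-learn's `KMeans` as follows. The feature matrix here is synthetic random data standing in for the real neighborhood-by-category frequencies, and the `cluster` helper, the fixed seed and the use of cluster sizes and inertia for comparison are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# SYNTHETIC stand-in: 40 "neighborhoods" x 6 "venue categories",
# each row a frequency vector that sums to 1.
X = rng.dirichlet(np.ones(6), size=40)

def cluster(features, k, seed=0):
    """Run K-Means with k clusters; return labels and inertia."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    return labels, model.inertia_

# Citywide setting from the list above: K = 5 and K = 10.
results = {k: cluster(X, k) for k in (5, 10)}

for k, (labels, inertia) in results.items():
    sizes = np.bincount(labels, minlength=k)
    print(f"K={k}: cluster sizes={sizes.tolist()}, inertia={inertia:.4f}")
```

The per-borough runs would follow the same pattern: filter the feature rows to one borough and call `cluster` with K = 5 and K = 8. Inertia generally falls as K grows, so it is not a sufficient criterion on its own; inspecting which venue categories dominate each cluster is likely more informative for the business question.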

The __goal__ of this approach is to:
- find a reasonable way to cluster neighborhoods
- determine the similarity of neighborhoods within boroughs and among boroughs
- recommend suitable candidate neighborhoods for starting a healthy food store