# New York City Neighborhood Suitability for a Business Plan - Healthy Food Store

## Table of Contents
- [Introduction](#introduction)
    - [Business problem](#business_problem)
- [Data](#data)
- [Methodology](#methodology)
- [Preparation of data and exploratory data analysis](#prep_and_eda)
    - [1. Load required libraries](#load_libraries)
    - [2. New York City neighborhood dataset](#nyc_neighborhood_dataset)
    - [3. New York City venues dataset](#nyc_venues_dataset)
- [Analysis: Clustering of New York City neighborhoods](#clustering)
    - [1. Prepare dataset for clustering](#clustering_prepare)
    - [2. Clustering of all New York City neighborhoods with K=5](#clustering_all_k5)
    - [3. Clustering of all New York City neighborhoods with K=10](#clustering_all_k10)
    - [4. Clustering of New York City neighborhoods within boroughs](#clustering_boroughs)
        - [A. Bronx](#clustering_bronx)
        - [B. Brooklyn](#clustering_brooklyn)
        - [C. Manhattan](#clustering_manhattan)
        - [D. Queens](#clustering_queens)
        - [E. Staten Island](#clustering_staten_island)
- [Results and discussion](#results_discussion)
- [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

New York City has 306 neighborhoods in 5 boroughs. Some neighborhoods share similar characteristics, while others are unique and suit a specific business purpose. For example, a high density of coffee shops and cafes can be a distinguishing feature of neighborhoods with many office buildings. Opening a restaurant for business customers, or a specialty restaurant, could pay off there, provided that careful analysis of the data shows that the restaurant density is not already too high. On the other hand, neighborhoods with parks, sports and leisure facilities, groceries and schools could be a good place to live for a family with children.

Characteristic features of these neighborhoods can be determined from geolocation data using statistical or machine learning techniques. In this project, [Foursquare](https://foursquare.com/) venue location data is used to explore New York City neighborhoods using K-Means clustering. Neighborhoods are divided into clusters based on their similarity, i.e. the types of venues they contain and how often each type occurs.

### Business problem <a name="business_problem"></a>

__Problem/question to solve: What are the best candidate neighborhoods to open a store with healthy food?__

Analysis of location data can be used to answer many questions. Here, we look for the neighborhoods that would be the best candidates for a shop selling products for an active lifestyle and a healthy diet, such as organic products including fresh vegetables and fruit, wholegrain products, protein-rich food, special types of flour, cereals, grains, etc. It's a shop where people can find everything for their nutritional needs. The project aims to determine the best neighborhood(s) to start a new healthy food store.

__Target audience__  
The study's target audience is business owners or contractors who would like to start a successful healthy food store in a new area. The study supports the decision where to open the new store in order to maximize profits and minimize the risks of opening it.

A properly selected location helps to attract a stable and potentially growing number of target customers. It also avoids losses that could arise, e.g., from too few target customers in the area. The study aims to identify suitable locations for the business with respect to the characteristics of the place and of the people who are likely to visit it.

__Assumptions and considerations__
- Let's assume that people with an active lifestyle use facilities like gyms, pools, other sports facilities or parks. Neighborhoods with these features would be suitable for such a healthy food shop.
- Healthy-lifestyle products are very likely also sold in supermarkets and grocery stores, so candidate neighborhoods shouldn't be rich in these venues. Adding another shop where many similar shops are nearby could reduce profits.
- A high abundance of restaurants of different kinds, fast food outlets, pizza places and similar venues might suggest that the neighborhood is not the best candidate for our business idea. Such neighborhoods might be rich in social and cultural life, and people would be less likely to look for healthy products there.
- Clustering neighborhoods by the abundance of venues in different categories makes it possible to decide whether a neighborhood is a good candidate or not.
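
The assumptions above could be turned into a rough screening score. The sketch below is only an illustration and is not part of the analysis in this notebook: the function name `suitability_score`, the category groups and the equal weights are all assumptions chosen for the example.

```python
# Hypothetical category groups; the names follow Foursquare-style categories,
# but the grouping itself is an assumption made for this illustration.
ACTIVE_LIFESTYLE = {'Gym', 'Pool', 'Park', 'Yoga Studio'}
COMPETITORS = {'Supermarket', 'Grocery Store'}
EATING_OUT = {'Pizza Place', 'Fast Food Restaurant', 'Italian Restaurant'}


def suitability_score(frequencies):
    """Return the active-lifestyle share minus competitor and eating-out shares.

    `frequencies` maps a venue category to its relative frequency
    in one neighborhood (values between 0 and 1).
    """
    def share(categories):
        return sum(frequencies.get(category, 0.0) for category in categories)

    return share(ACTIVE_LIFESTYLE) - share(COMPETITORS) - share(EATING_OUT)


# Two made-up neighborhoods
sporty = {'Gym': 0.25, 'Park': 0.25, 'Coffee Shop': 0.5}
food_heavy = {'Pizza Place': 0.5, 'Supermarket': 0.25, 'Gym': 0.25}

print(suitability_score(sporty))      # 0.5
print(suitability_score(food_heavy))  # -0.5
```

A positive score points to a neighborhood that matches the assumptions, a negative score to one dominated by competitors or eating-out venues.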

## Data <a name="data"></a>

We use data from two sources:
- New York City neighborhood data (available from the [NYU Spatial Data Repository](https://geo.nyu.edu/catalog/nyu_2451_34572)) that contains the following information about every neighborhood (the data fields are self-explanatory):
    - neighborhood name
    - borough name
    - neighborhood latitude
    - neighborhood longitude
- location data obtained from the [Foursquare](https://foursquare.com/) API that includes information about venues and their categories in the respective neighborhood

Both datasets are converted to pandas dataframes for easy manipulation and analysis.

Location data will be used to cluster neighborhoods based on their similarities.

__Examples of data:__

__1. New York City neighborhood data:__  

    a. Original JSON data:

        {'type': 'FeatureCollection',
         'totalFeatures': 306,
         'features': [{'type': 'Feature',
           'id': 'nyu_2451_34572.1',
           'geometry': {'type': 'Point',
            'coordinates': [-73.84720052054902, 40.89470517661]},
           'geometry_name': 'geom',
           'properties': {'name': 'Wakefield',
            'stacked': 1,
            'annoline1': 'Wakefield',
            'annoline2': None,
            'annoline3': None,
            'annoangle': 0.0,
            'borough': 'Bronx',
            'bbox': [-73.84720052054902,
             40.89470517661,
             -73.84720052054902,
             40.89470517661]}},
        ...
        }  


    b. Pandas dataframe:

| Borough | Neighborhood | Latitude | Longitude |
| :------ | :----------- | :------- | :-------- |
| Bronx   | Wakefield    |40.894705 |-73.847201 |
| Bronx   | Co-op City   |40.874294 |-73.829939 |
| Bronx   | Eastchester  |40.887556 |-73.827806 |
| Bronx   | Fieldston    |40.895437 |-73.905643 |
| Bronx   | Riverdale    |40.890834 |-73.912585 |

__2. Location data:__

    a. Original JSON data:
        {'meta': {'code': 200, 'requestId': '5e63d0eb78a484001bad525f'},
         'response': {'suggestedFilters': {'header': 'Tap to show:',
           'filters': [{'name': 'Open now', 'key': 'openNow'}]},
          'headerLocation': 'Marble Hill',
          'headerFullLocation': 'Marble Hill, New York',
          'headerLocationGranularity': 'neighborhood',
          'totalResults': 24,
          'suggestedBounds': {'ne': {'lat': 40.88105078329964,
            'lng': -73.90471933917806},
           'sw': {'lat': 40.87205077429964, 'lng': -73.91659997808156}},
          'groups': [{'type': 'Recommended Places',
            'name': 'recommended',
            'items': [{'reasons': {'count': 0,
               'items': [{'summary': 'This spot is popular',
                 'type': 'general',
                 'reasonName': 'globalInteractionReason'}]},
              'venue': {'id': '4b4429abf964a52037f225e3',
               'name': "Arturo's",
               'location': {'address': '5198 Broadway',
                'crossStreet': 'at 225th St.',
                'lat': 40.87441177110231,
                'lng': -73.91027100981574,
                'labeledLatLngs': [{'label': 'display',
                  'lat': 40.87441177110231,
                  'lng': -73.91027100981574}],
                'distance': 240,
                'postalCode': '10463',
                'cc': 'US',
                'city': 'New York',
                'state': 'NY',
                'country': 'United States',
                'formattedAddress': ['5198 Broadway (at 225th St.)',
                 'New York, NY 10463',
                 'United States']},
               'categories': [{'id': '4bf58dd8d48988d1ca941735',
                 'name': 'Pizza Place',
                 'pluralName': 'Pizza Places',
                 'shortName': 'Pizza',
                 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/pizza_',
                  'suffix': '.png'},
                 'primary': True}],
               'delivery': {'id': '72548',
                'url': 'https://www.seamless.com/menu/arturos-pizza-5189-broadway-ave-new-york/72548?affiliate=1131&utm_source=foursquare-affiliat...,
                'provider': {'name': 'seamless',
                 'icon': {'prefix': 'https://fastly.4sqi.net/img/general/cap/',
                  'sizes': [40, 50],
                  'name': '/delivery_provider_seamless_20180129.png'}}},
               'photos': {'count': 0, 'groups': []}},
              'referralId': 'e-0-4b4429abf964a52037f225e3-0'},
              ...
              }  
    
    b. Pandas dataframe used for further processing and subsequent analysis:

| Neighborhood | Borough | Neighborhood Latitude | Neighborhood Longitude | Venue            | Venue Latitude | Venue Longitude | Venue Category |
| :----------- | :------ | :-------------------- | :--------------------- | :--------------- | :------------- | :-------------- | :------------- |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Lollipops Gelato | 40.894123      | -73.845892      | Dessert Shop   |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Rite Aid         | 40.896649      | -73.844846      | Pharmacy       |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Carvel Ice Cream | 40.890487      | -73.848568      | Ice Cream Shop |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Walgreens        | 40.896687      | -73.844850      | Pharmacy       |
| Wakefield    | Bronx   | 40.894705             | -73.847201             | Dunkin'          | 40.890459      | -73.849089      | Donut Shop     |

The location data contains much more information, but we will use only the venues and their categories to cluster neighborhoods.

## Methodology <a name="methodology"></a>

We will use standard K-Means clustering to group neighborhoods by their similarity, measured by which venue categories occur in a neighborhood and how frequent they are.
To analyze neighborhoods and study the effects of clustering, we will use the following approach:
- cluster all neighborhoods in New York City, irrespective of the boroughs they belong to:
    - use K = 5 and K = 10 clusters, and compare the results
- cluster neighborhoods within each borough, i.e. take only neighborhoods belonging to one borough at a time:
    - use K = 5 and K = 8 clusters, and compare the results

The __goal__ of this approach is to:
- find a reasonable way to cluster neighborhoods
- determine the similarity of neighborhoods within boroughs and among boroughs
- recommend proper candidate neighborhoods to start a healthy food store
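
The clustering step described in this section can be sketched on toy data. The frequency matrix below is made up, and the values of K (2 and 3) are chosen to fit six toy neighborhoods; the notebook itself uses K = 5, 8 and 10 on the real data. `n_init` and `random_state` are set only to make the sketch reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency matrix: 6 neighborhoods x 3 venue categories
# (the columns could stand for Gym, Pizza Place and Park)
X = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])

# Fit K-Means for two values of K and keep the cluster assignments
labels_by_k = {}
for k in (2, 3):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels_by_k[k] = model.labels_
    # inertia_ is the within-cluster sum of squared distances
    print(k, model.labels_, round(model.inertia_, 3))
```

Comparing the labels and the inertia for different values of K is one simple way to see how the grouping changes as more clusters are allowed.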

## Preparation of data and exploratory data analysis <a name="prep_and_eda"></a>

### 1. Load required libraries <a name="load_libraries"></a>

Let's import all necessary modules and packages first:

In [1]:
# Import numpy, pandas and requests
import numpy as np
import pandas as pd
import requests

# Import geopy
from geopy.geocoders import Nominatim

# Import matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means for the clustering stage
from sklearn.cluster import KMeans

# Import folium, map rendering library
import folium

# Import package to work with JSON files
import json

# Transform JSON data into a pandas dataframe
# (json_normalize is available at the top level in pandas 1.0 and later)
from pandas import json_normalize

### 2. New York City neighborhood dataset <a name="nyc_neighborhood_dataset"></a>

The New York City neighborhood file, which also contains the latitude and longitude coordinates of every neighborhood, can be downloaded from the [NYU Spatial Data Repository](https://geo.nyu.edu/catalog/nyu_2451_34572). The exported GeoJSON file is stored in [data](data/ny-neighborhoods.json).  

Let's read the JSON data into `ny_data` first:

In [2]:
# File name including path
file = 'data/ny-neighborhoods.json'

with open(file) as data:
    ny_data = json.load(data)

The relevant information is in `features`:

In [3]:
ny_data = ny_data['features']

To work with the data, create a pandas dataframe `ny_hoods`. The dataframe will contain 4 columns:
- Borough
- Neighborhood
- Latitude
- Longitude

In [4]:
# Column names of the neighborhood dataframe
columns = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Loop over neighborhoods and collect one record per neighborhood
rows = []
for data in ny_data:
    borough = data['properties']['borough']
    neighborhood = data['properties']['name']

    # GeoJSON stores point coordinates as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates']

    rows.append({'Borough': borough, 'Neighborhood': neighborhood, 'Latitude': neighborhood_lat, 'Longitude': neighborhood_lon})

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
ny_hoods = pd.DataFrame(rows, columns=columns)
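
As an alternative to the explicit loop, the nested GeoJSON features could also be flattened with `pd.json_normalize`. This is a sketch and not the approach used in this notebook; the single feature below is a shortened copy of the sample shown in the Data section, and the name `hoods` is chosen for the example.

```python
import pandas as pd

# One feature, shortened from the sample GeoJSON in the Data section
features = [{
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}]

# Nested keys become dotted column names such as 'properties.name';
# the coordinates list stays a list in a single column
flat = pd.json_normalize(features)

hoods = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # GeoJSON points are ordered [longitude, latitude]
    'Latitude': flat['geometry.coordinates'].apply(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].apply(lambda c: c[0]),
})
print(hoods)
```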

#### Explore New York City neighborhoods
It's essential to know the data - let's do some quick exploration of the neighborhood dataframe.

View the first few rows of the dataframe:

In [5]:
# View the first ten neighborhoods with their latitude and longitude
ny_hoods.head(10)

|    | Borough   | Neighborhood   | Latitude  | Longitude  |
| :- | :-------- | :------------- | :-------- | :--------- |
| 0  | Bronx     | Wakefield      | 40.894705 | -73.847201 |
| 1  | Bronx     | Co-op City     | 40.874294 | -73.829939 |
| 2  | Bronx     | Eastchester    | 40.887556 | -73.827806 |
| 3  | Bronx     | Fieldston      | 40.895437 | -73.905643 |
| 4  | Bronx     | Riverdale      | 40.890834 | -73.912585 |
| 5  | Bronx     | Kingsbridge    | 40.881687 | -73.902818 |
| 6  | Manhattan | Marble Hill    | 40.876551 | -73.91066  |
| 7  | Bronx     | Woodlawn       | 40.898273 | -73.867315 |
| 8  | Bronx     | Norwood        | 40.877224 | -73.879391 |
| 9  | Bronx     | Williamsbridge | 40.881039 | -73.857446 |


How big is the dataframe? Let's check how many neighborhoods there are:

In [6]:
# Check the size of the dataframe
print('New York City has {} neighborhoods.'.format(ny_hoods.shape[0]))

New York City has 306 neighborhoods.


How many boroughs are there?

In [7]:
# Check the number of boroughs in New York
print('New York neighborhoods belong to {} boroughs.'.format(ny_hoods['Borough'].nunique()))

New York neighborhoods belong to 5 boroughs.


How many neighborhoods belong to each borough?

In [8]:
# Check the numbers of neighborhoods in boroughs
ny_hoods[['Borough', 'Neighborhood']].groupby('Borough').count()

| Borough       | Neighborhood |
| :------------ | :----------- |
| Bronx         | 52           |
| Brooklyn      | 70           |
| Manhattan     | 40           |
| Queens        | 81           |
| Staten Island | 63           |


Some neighborhoods belonging to different boroughs have the same name:

In [9]:
# Get the list of names of duplicated neighborhoods
duplicated_hoods = ny_hoods[ny_hoods.duplicated(subset='Neighborhood')]['Neighborhood']

In [10]:
# View the neighborhoods with the same names
ny_hoods[ny_hoods['Neighborhood'].isin(duplicated_hoods)]

|     | Borough       | Neighborhood | Latitude  | Longitude  |
| :-- | :------------ | :----------- | :-------- | :--------- |
| 115 | Manhattan     | Murray Hill  | 40.748303 | -73.978332 |
| 116 | Manhattan     | Chelsea      | 40.744035 | -74.003116 |
| 140 | Queens        | Sunnyside    | 40.740176 | -73.926916 |
| 175 | Queens        | Bay Terrace  | 40.782843 | -73.776802 |
| 180 | Queens        | Murray Hill  | 40.764126 | -73.812763 |
| 220 | Staten Island | Sunnyside    | 40.61276  | -74.097126 |
| 235 | Staten Island | Bay Terrace  | 40.553988 | -74.139166 |
| 244 | Staten Island | Chelsea      | 40.594726 | -74.18956  |


The neighborhood name alone is therefore not a unique identifier of a neighborhood.
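
A quick way to confirm this is to compare the uniqueness of the name with the uniqueness of the (neighborhood, borough) pair. The sketch below uses four rows taken from the listing of duplicated names; the variable names are chosen for the example.

```python
import pandas as pd

# Rows taken from the listing of neighborhoods with duplicated names
hoods = pd.DataFrame({
    'Borough': ['Manhattan', 'Queens', 'Manhattan', 'Staten Island'],
    'Neighborhood': ['Murray Hill', 'Murray Hill', 'Chelsea', 'Chelsea'],
})

# The name alone repeats across boroughs ...
name_is_unique = hoods['Neighborhood'].is_unique

# ... but no (Neighborhood, Borough) pair occurs twice
pair_is_unique = not hoods.duplicated(subset=['Neighborhood', 'Borough']).any()

print(name_is_unique, pair_is_unique)  # False True
```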

To wrap this up, New York City has 5 boroughs: Bronx, Brooklyn, Manhattan, Queens and Staten Island (see [Boroughs of New York City](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City)):  

<img src="https://upload.wikimedia.org/wikipedia/commons/3/34/5_Boroughs_Labels_New_York_City_Map.svg" width="300" height="300" style="float: left"/>  

Each of the boroughs has tens of neighborhoods. Manhattan has the lowest number of neighborhoods (40) and Queens has the highest number of neighborhoods (81).

#### Map of New York City with neighborhoods

Let's create a map of New York City and add neighborhood markers on top of it. The neighborhood markers are color-coded, as in the image above. That's another way to verify which borough each neighborhood belongs to.  

To create a map centered on New York City, the coordinates of the city are required. Then, use the [folium](https://pypi.org/project/folium/) package to generate the map with markers.

In [11]:
# Get coordinates
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [12]:
# Create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# Color palette to distinguish neighborhoods belonging to different boroughs
colors_boroughs = {'Bronx': 'red', 'Manhattan': 'green', 'Brooklyn': 'yellow', 'Queens': 'orange', 'Staten Island': 'purple'}

# Add markers to map
for lat, lng, borough, neighborhood in zip(ny_hoods['Latitude'], ny_hoods['Longitude'], ny_hoods['Borough'], ny_hoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colors_boroughs[borough],
        fill=True,
        fill_color=colors_boroughs[borough],
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### 3. New York City venues dataset <a name="nyc_venues_dataset"></a>

#### Get the top 100 venues for each neighborhood within a radius of 500 metres

Define Foursquare credentials and version (Note: Credentials are stored in a separate file `credentials.json` that is not version-controlled on GitHub.)

In [13]:
with open('credentials.json') as file:
    data = json.load(file)
    CLIENT_ID = data['id']    # Foursquare ID
    CLIENT_SECRET = data['secret']    # Foursquare Secret
    
VERSION = '20200310'    # Foursquare API version

print('Credentials loaded.')

Credentials loaded.


Define function that extracts the category of a venue:

In [14]:
# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
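
A short usage sketch shows the three cases the function distinguishes. Plain dictionaries stand in for dataframe rows here, and the category entries are shortened from the sample JSON in the Data section; the variable names are chosen for the example.

```python
def get_category_type(row):
    # Same logic as the function above, repeated so the example runs on its own
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Plain dicts stand in for dataframe rows
direct = {'categories': [{'name': 'Pizza Place', 'primary': True}]}
flattened = {'venue.categories': [{'name': 'Pizza Place', 'primary': True}]}
uncategorized = {'categories': []}

print(get_category_type(direct))         # Pizza Place
print(get_category_type(flattened))      # Pizza Place
print(get_category_type(uncategorized))  # None
```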

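Define a function to get nearby venues for neighborhoods. Note that it reads the global variable `LIMIT`, which must be defined before the function is called: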

In [15]:
def getNearbyVenues(names, boroughs, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, borough, lat, lng in zip(names, boroughs, latitudes, longitudes):
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            borough, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Borough',
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return nearby_venues
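
The parsing step of the function can be tried offline, without calling the API. The sketch below is an assumption-laden illustration: `flatten_items` is a hypothetical helper, the response is a trimmed-down dictionary in the shape of the sample JSON from the Data section, and the second venue is made up to show a guard against venues without a category, which the function above does not have.

```python
# Trimmed-down response in the shape of the sample JSON from the Data section
response = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': "Arturo's",
                           'location': {'lat': 40.87441177110231,
                                        'lng': -73.91027100981574},
                           'categories': [{'name': 'Pizza Place'}]}},
                # Made-up venue without a category, to exercise the guard
                {'venue': {'name': 'Unnamed Spot',
                           'location': {'lat': 40.875, 'lng': -73.91},
                           'categories': []}},
            ]
        }]
    }
}


def flatten_items(response, name, borough, lat, lng):
    """Turn one explore response into row tuples, skipping uncategorized venues."""
    items = response['response']['groups'][0]['items']
    rows = []
    for item in items:
        venue = item['venue']
        if not venue['categories']:
            continue  # indexing [0] on an empty list would raise IndexError
        rows.append((name, borough, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     venue['categories'][0]['name']))
    return rows


rows = flatten_items(response, 'Marble Hill', 'Manhattan', 40.876551, -73.91066)
print(rows)
```

Each tuple has the same eight fields, in the same order, as the columns of the dataframe returned by `getNearbyVenues`.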

Get nearby venues for New York City neighborhoods and store them in a pandas dataframe:

In [17]:
LIMIT = 100

ny_venues = getNearbyVenues(names=ny_hoods['Neighborhood'], boroughs=ny_hoods['Borough'], latitudes=ny_hoods['Latitude'], longitudes=ny_hoods['Longitude'])

#### Explore the venues

Check the size of the venues dataframe `ny_venues`:

In [18]:
# Size of the dataframe with venues
ny_venues.shape

(10290, 8)

The dataframe containing venues for all neighborhoods has 10,290 rows and 8 columns.

Look at the first few rows in `ny_venues`:

In [19]:
# View the first ten venues
ny_venues.head(10)

|    | Neighborhood | Borough | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| :- | :----------- | :------ | :-------------------- | :--------------------- | :---- | :------------- | :-------------- | :------------- |
| 0  | Wakefield | Bronx | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop |
| 1  | Wakefield | Bronx | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy |
| 2  | Wakefield | Bronx | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop |
| 3  | Wakefield | Bronx | 40.894705 | -73.847201 | Walgreens | 40.896687 | -73.84485 | Pharmacy |
| 4  | Wakefield | Bronx | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop |
| 5  | Wakefield | Bronx | 40.894705 | -73.847201 | Shell | 40.894187 | -73.845862 | Gas Station |
| 6  | Wakefield | Bronx | 40.894705 | -73.847201 | Cooler Runnings Jamaican Restaurant Inc | 40.898083 | -73.850259 | Caribbean Restaurant |
| 7  | Wakefield | Bronx | 40.894705 | -73.847201 | SUBWAY | 40.890468 | -73.849152 | Sandwich Place |
| 8  | Wakefield | Bronx | 40.894705 | -73.847201 | Pitman Deli | 40.894149 | -73.845748 | Food |
| 9  | Wakefield | Bronx | 40.894705 | -73.847201 | Koss Quick Wash | 40.891281 | -73.849904 | Laundromat |


Is there any venue category called Neighborhood? From previous experience, some venues have the category Neighborhood, which doesn't seem to be a valid venue category. Let's find these venues and drop them from further analysis.

In [20]:
# Check if there are any venues with a suspicious category
ny_venues[ny_venues['Venue Category'] == 'Neighborhood']

|       | Neighborhood | Borough | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| :---- | :----------- | :------ | :-------------------- | :--------------------- | :---- | :------------- | :-------------- | :------------- |
| 1375  | Brighton Beach | Brooklyn | 40.576825 | -73.965094 | Little Russia | 40.57769 | -73.96158 | Neighborhood |
| 1380  | Brighton Beach | Brooklyn | 40.576825 | -73.965094 | Brighton Beach | 40.575518 | -73.962372 | Neighborhood |
| 2547  | Gerritsen Beach | Brooklyn | 40.590848 | -73.930102 | Gerritsen Beach | 40.592377 | -73.925009 | Neighborhood |
| 9961  | Roxbury | Queens | 40.567376 | -73.892138 | Roxbury, NY | 40.566788 | -73.891715 | Neighborhood |
| 10252 | Hammels | Queens | 40.587338 | -73.80553 | Rockaway Beach, NY | 40.585899 | -73.809066 | Neighborhood |


In [21]:
# Create a mask to filter venues that have category Neighborhood
mask = ny_venues['Venue Category'] == 'Neighborhood'

# Remove rows with venues that have category Neighborhood from venues dataframe
ny_venues = ny_venues[~mask]

# Reset index
ny_venues.reset_index(drop=True, inplace=True)

# Check the size of the new dataframe with venues
print('Dataframe with venues has {} rows and {} columns.'.format(ny_venues.shape[0], ny_venues.shape[1]))
print()

# Check the dataframe with venues
ny_venues.head()

Dataframe with venues has 10285 rows and 8 columns.



|    | Neighborhood | Borough | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| :- | :----------- | :------ | :-------------------- | :--------------------- | :---- | :------------- | :-------------- | :------------- |
| 0  | Wakefield | Bronx | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop |
| 1  | Wakefield | Bronx | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy |
| 2  | Wakefield | Bronx | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop |
| 3  | Wakefield | Bronx | 40.894705 | -73.847201 | Walgreens | 40.896687 | -73.84485 | Pharmacy |
| 4  | Wakefield | Bronx | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop |


How many venue categories are there?

In [22]:
print('There are {} unique categories.'.format(ny_venues['Venue Category'].nunique()))

There are 432 unique categories.


What are the most common venue categories? Get the top 20 most frequent venues in New York City:

In [23]:
ny_venues[['Venue', 'Venue Category']].groupby('Venue Category').count().sort_values('Venue', ascending=False).head(20).rename(columns={'Venue': 'Count'})

| Venue Category     | Count |
| :----------------- | :---- |
| Pizza Place        | 439   |
| Italian Restaurant | 308   |
| Coffee Shop        | 294   |
| Deli / Bodega      | 286   |
| Bar                | 222   |
| Bakery             | 222   |
| Chinese Restaurant | 213   |
| Sandwich Place     | 188   |
| Grocery Store      | 184   |
| Mexican Restaurant | 181   |


The most frequent venue category is Pizza Place, followed by Italian Restaurant and Coffee Shop. Corner shops (Deli / Bodega), bars, bakeries, groceries and other types of restaurants are abundant as well.

How many venues do the neighborhoods have?
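
The same kind of ranking can also be obtained with `Series.value_counts`, which counts and sorts in one call. The sketch below uses a small made-up stand-in for `ny_venues`; it is an alternative formulation, not the cell used in this notebook.

```python
import pandas as pd

# Small made-up stand-in for ny_venues
venues = pd.DataFrame({
    'Venue': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Venue Category': ['Pizza Place', 'Coffee Shop', 'Pizza Place',
                       'Bakery', 'Pizza Place', 'Coffee Shop'],
})

# value_counts sorts by frequency in descending order by default
top = venues['Venue Category'].value_counts().head(2)
print(top)
```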

First, let's create a combined column with neighborhood and borough because there are neighborhoods with the same names:

In [24]:
# Create new column that combines neighborhood and borough
ny_venues['Neighborhood Borough'] = ny_venues['Neighborhood'] + ', ' + ny_venues['Borough']

In [25]:
# Check the result
ny_venues.head()

|    | Neighborhood | Borough | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Neighborhood Borough |
| :- | :----------- | :------ | :-------------------- | :--------------------- | :---- | :------------- | :-------------- | :------------- | :------------------- |
| 0  | Wakefield | Bronx | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop | Wakefield, Bronx |
| 1  | Wakefield | Bronx | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy | Wakefield, Bronx |
| 2  | Wakefield | Bronx | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop | Wakefield, Bronx |
| 3  | Wakefield | Bronx | 40.894705 | -73.847201 | Walgreens | 40.896687 | -73.84485 | Pharmacy | Wakefield, Bronx |
| 4  | Wakefield | Bronx | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop | Wakefield, Bronx |


In [26]:
# Get the number of venues for the neighborhoods with the highest number of venues
ny_venues[['Neighborhood Borough', 'Venue']].groupby('Neighborhood Borough').count().sort_values('Venue', ascending=False).head(20).rename(columns={'Venue': 'Count'})

| Neighborhood Borough       | Count |
| :------------------------- | :---- |
| Yorkville, Manhattan       | 100   |
| Chinatown, Manhattan       | 100   |
| Sunnyside Gardens, Queens  | 100   |
| Lenox Hill, Manhattan      | 100   |
| South Side, Brooklyn       | 100   |
| Soho, Manhattan            | 100   |
| Brooklyn Heights, Brooklyn | 100   |
| Lincoln Square, Manhattan  | 100   |
| Little Italy, Manhattan    | 100   |
| Carnegie Hill, Manhattan   | 100   |


How many neighborhoods have more than 50 venues?

In [27]:
print('{} out of {} neighborhoods have more than 50 venues.'.format(ny_venues.groupby('Neighborhood Borough').filter(lambda x: len(x) > 50)['Neighborhood Borough'].nunique(), 
      ny_venues['Neighborhood Borough'].nunique()))

63 out of 304 neighborhoods have more than 50 venues.


Are there any neighborhoods with too few venues?

In [28]:
print('{} out of {} neighborhoods have less than 10 venues.'.format(ny_venues.groupby('Neighborhood Borough').filter(lambda x: len(x) < 10)['Neighborhood Borough'].nunique(), 
      ny_venues['Neighborhood Borough'].nunique()))

58 out of 304 neighborhoods have less than 10 venues.


In [29]:
print('{} out of {} neighborhoods have less than 5 venues.'.format(ny_venues.groupby('Neighborhood Borough').filter(lambda x: len(x) < 5)['Neighborhood Borough'].nunique(), 
      ny_venues['Neighborhood Borough'].nunique()))

21 out of 304 neighborhoods have less than 5 venues.


Let's look at some neighborhoods with only a few venues:

In [30]:
ny_venues[['Neighborhood Borough', 'Venue']].groupby('Neighborhood Borough').count().sort_values('Venue', ascending=True).head(10).rename(columns={'Venue': 'Count'})

| Neighborhood Borough       | Count |
| :------------------------- | :---- |
| Brookville, Queens         | 1     |
| Somerville, Queens         | 1     |
| Port Ivory, Staten Island  | 1     |
| Mill Island, Brooklyn      | 1     |
| Todt Hill, Staten Island   | 1     |
| Grymes Hill, Staten Island | 1     |
| Bayswater, Queens          | 2     |
| Malba, Queens              | 3     |
| Fieldston, Bronx           | 3     |
| Country Club, Bronx        | 3     |


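We could either use all neighborhoods in further analyses or select only those that have more than _N_ venues. Let's go with the first option and use all neighborhoods. Note that the counts above refer to 304 rather than 306 neighborhoods: two neighborhoods have no venues in `ny_venues` and therefore drop out of the analysis.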
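
For reference, the second option, which is not used in this analysis, could be sketched as follows. The dataframe is a made-up stand-in for `ny_venues`, and `min_venues` is a hypothetical threshold.

```python
import pandas as pd

# Made-up stand-in for ny_venues with the combined key column
venues = pd.DataFrame({
    'Neighborhood Borough': ['Soho, Manhattan'] * 3 + ['Todt Hill, Staten Island'],
    'Venue': ['V1', 'V2', 'V3', 'V4'],
})

min_venues = 2  # hypothetical threshold N

# Number of venue rows per neighborhood, aligned with the original rows
venue_counts = venues.groupby('Neighborhood Borough')['Venue'].transform('count')

# Keep only the rows of neighborhoods that reach the threshold
venues_filtered = venues[venue_counts >= min_venues]
print(venues_filtered['Neighborhood Borough'].unique())
```

Filtering like this would remove neighborhoods whose frequency profile rests on only one or two venues, at the cost of leaving them without a cluster assignment.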

## Analysis: Clustering of New York City neighborhoods <a name="clustering"></a>

### 1. Prepare dataset for clustering <a name="clustering_prepare"></a>

##### Create a dataframe with one-hot encoding for the venue categories

In [31]:
# One hot encoding
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']], prefix='', prefix_sep='')

# Add neighborhood + borough column back to dataframe
ny_onehot['Neighborhood Borough'] = ny_venues['Neighborhood Borough'] 

# Move neighborhood column to the first column
fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

# View the result
ny_onehot.head()

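|    | Neighborhood Borough | Accessories Store | Adult Boutique | Afghan Restaurant | African Restaurant | Airport Terminal | American Restaurant | Antique Shop | Arcade | Arepa Restaurant | ... | Warehouse Store | Waste Facility | Waterfront | Weight Loss Center | Whisky Bar | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
| :- | :------------------- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 0  | Wakefield, Bronx | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1  | Wakefield, Bronx | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2  | Wakefield, Bronx | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3  | Wakefield, Bronx | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4  | Wakefield, Bronx | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |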


In [32]:
# Check the size of the new dataframe
ny_onehot.shape

(10285, 433)

##### Calculate average frequency of venues by neighborhood and store the result in `ny_grouped`

In [33]:
ny_grouped = ny_onehot.groupby('Neighborhood Borough').mean().reset_index()
ny_grouped

|    | Neighborhood Borough | Accessories Store | Adult Boutique | Afghan Restaurant | African Restaurant | Airport Terminal | American Restaurant | Antique Shop | Arcade | Arepa Restaurant | ... | Warehouse Store | Waste Facility | Waterfront | Weight Loss Center | Whisky Bar | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
| :- | :------------------- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 0  | Allerton, Bronx | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1  | Annadale, Staten Island | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.181818 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2  | Arden Heights, Staten Island | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3  | Arlington, Staten Island | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.200000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 4  | Arrochar, Staten Island | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 5  | Arverne, Queens | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.055556 | 0.000000 | 0.000000 | 0.000000 |
| 6  | Astoria Heights, Queens | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 7  | Astoria, Queens | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.010000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.010000 | 0.000000 | 0.000000 | 0.000000 |
| 8  | Auburndale, Queens | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.055556 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 9  | Bath Beach, Brooklyn | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |


Check the size of the new grouped dataframe:

In [34]:
ny_grouped.shape

(304, 433)
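
To make the values in `ny_grouped` concrete: averaging one-hot columns per neighborhood yields the share of that neighborhood's venues that fall into each category. The sketch below reproduces this on a small made-up venue list (the neighborhoods and categories are illustrative, not taken from the Foursquare data).

```python
import pandas as pd

# Hypothetical venue list: one row per venue
venues = pd.DataFrame({
    'Neighborhood Borough': ['A, Bronx', 'A, Bronx', 'A, Bronx', 'B, Queens'],
    'Venue Category': ['Pizza Place', 'Pizza Place', 'Park', 'Park'],
})

# One-hot encode the categories and put the neighborhood key back in front
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood Borough', venues['Neighborhood Borough'])

# The mean of the indicators is the relative frequency of each category
grouped = onehot.groupby('Neighborhood Borough').mean().reset_index()
print(grouped)
```

Each row sums to 1 across the category columns. A neighborhood without any returned venues never appears in the venue list, which is consistent with `ny_grouped` having 304 rows although the city has 306 neighborhoods.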

##### Create a pandas dataframe with the top 10 venues for each neighborhood

In [35]:
# Define function to return the top venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [36]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues
columns = ['Neighborhood Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood Borough'] = ny_grouped['Neighborhood Borough']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Allerton, Bronx",Pizza Place,Deli / Bodega,Supermarket,Spa,Donut Shop,Chinese Restaurant,Grocery Store,Martial Arts Dojo,Bakery,Fast Food Restaurant
1,"Annadale, Staten Island",American Restaurant,Pizza Place,Bakery,Train Station,Diner,Liquor Store,Restaurant,Pharmacy,Sports Bar,Park
2,"Arden Heights, Staten Island",Playground,Pharmacy,Coffee Shop,Pizza Place,Yoga Studio,Entertainment Service,Ethiopian Restaurant,Event Service,Event Space,Exhibit
3,"Arlington, Staten Island",Bus Stop,Deli / Bodega,American Restaurant,Coffee Shop,Yoga Studio,Fish & Chips Shop,Event Space,Exhibit,Eye Doctor,Factory
4,"Arrochar, Staten Island",Deli / Bodega,Italian Restaurant,Bus Stop,Pizza Place,Mediterranean Restaurant,Food Truck,Lawyer,Taco Place,Sandwich Place,Liquor Store
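
Note that `return_most_common_venues` always returns `num_top_venues` names, so a neighborhood with only a few distinct categories is padded with categories whose frequency is zero. This may be why entries such as Event Service, Event Space and Exhibit recur in the lower ranks above. A possible variant, sketched here on a made-up row, keeps only categories that actually occur; the helper name `top_nonzero_venues` is hypothetical.

```python
import pandas as pd

def top_nonzero_venues(row, num_top_venues):
    """Return up to num_top_venues category names with a frequency above zero."""
    # Skip the first entry, which holds the neighborhood name
    frequencies = row.iloc[1:].astype(float)
    frequencies = frequencies[frequencies > 0]
    # A stable sort keeps the original column order among ties
    ranked = frequencies.sort_values(ascending=False, kind='mergesort')
    return list(ranked.index[:num_top_venues])

# Hypothetical neighborhood with venues in two categories only
row = pd.Series({
    'Neighborhood Borough': 'Example, Bronx',
    'Bakery': 0.0,
    'Park': 0.75,
    'Pizza Place': 0.25,
    'Yoga Studio': 0.0,
})

print(top_nonzero_venues(row, 10))
```

With this variant the result can be shorter than `num_top_venues`, so the fixed-width table used in the notebook would need to be padded with an explicit placeholder.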


#### Functions
Let's define some functions that make clustering and analysing the results easier.

##### Define function to perform clustering and return the final dataframe
The final dataframe contains:
- Borough
- Neighborhood
- Latitude
- Longitude
- helper column Neighborhood Borough
- Cluster labels
- Common Venues (1st - 10th)

In [37]:
# Function to perform clustering and create the final dataframe with all the necessary information
def cluster_neighborhoods(kclusters, data_grouped, data_neighborhoods, neighborhoods_venues_sorted):
    data_grouped_clustering = data_grouped.drop('Neighborhood Borough', axis=1)

    # Run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(data_grouped_clustering)

    # Check cluster labels generated for each row in the dataframe
    print('Cluster labels for the first ten neighborhoods: {}'.format(kmeans.labels_[0:10]))

    # Check the values of cluster labels
    print('Cluster labels: {}'.format(set(kmeans.labels_)))
    
    # Add clustering labels
    if 'Cluster Labels' in neighborhoods_venues_sorted.columns:
        del neighborhoods_venues_sorted['Cluster Labels']
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

    # Create a pandas dataframe
    data_merged = data_neighborhoods
    # Add the combined column
    data_merged['Neighborhood Borough'] = data_merged['Neighborhood'] + ', ' + data_merged['Borough']

    # Merge dataframes to get a row containing all the information for each neighborhood
    data_merged = data_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood Borough'), on='Neighborhood Borough')

    # Check the size
    print('The size of the final dataframe: {}'.format(data_merged.shape))
    
    return data_merged

##### Define function to create a map with clustered neighborhoods superimposed on top

In [38]:
# Function to create a map with clustered neighborhoods superimposed on top
def create_map_clustered_neighborhoods(latitude, longitude, kclusters, data, zoom_level=10):

    # Create map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=zoom_level)

    # Set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # Add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster in zip(data['Latitude'], data['Longitude'], data['Neighborhood'], data['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)

    return map_clusters
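
One detail worth noting in the function above: the marker color is looked up with `rainbow[cluster-1]`, so cluster 0 wraps around to the last color in the list. Every cluster still gets its own color. The dependency-free sketch below illustrates the same idea with the standard-library `colorsys` module instead of matplotlib's `cm.rainbow`, and indexes the palette directly by cluster label; the helper name `cluster_palette` is an assumption for illustration.

```python
import colorsys

def cluster_palette(kclusters):
    """Return one hex color per cluster, evenly spaced around the hue wheel."""
    palette = []
    for i in range(kclusters):
        # Hue runs over [0, 1); full saturation and value give vivid colors
        r, g, b = colorsys.hsv_to_rgb(i / kclusters, 1.0, 1.0)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(5)
# Index directly by label: cluster 0 gets the first color
print(palette[0], palette[4])
```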

##### Define function to evaluate clusters of neighborhoods

In [39]:
# Function to return the top 5 venue categories within the most common venue categories in a cluster
def calculate_most_common_venues(cluster_label, data_merged):
    print('Number of neighborhoods in cluster with label={}: {}\n'.format(cluster_label, data_merged[data_merged['Cluster Labels'] == cluster_label].shape[0]))
    print('Top 5 category venues for cluster with label={}:\n'.format(cluster_label))

    for common_venue in list(range(6, data_merged.shape[1]-5)):
        print(data_merged.loc[data_merged['Cluster Labels'] == cluster_label, data_merged.columns[[4] + [common_venue]]]\
              .groupby(data_merged.columns[common_venue]).count()\
              .sort_values('Neighborhood Borough', ascending=False).rename(columns={'Neighborhood Borough': 'Count'}).head(5))
        print()
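
The counting done inside `calculate_most_common_venues` can also be expressed with `Series.value_counts`, which may be easier to read. The sketch below applies it to a small made-up merged dataframe; the column names follow the notebook, while the rows and the helper name `top_categories` are illustrative.

```python
import pandas as pd

# Hypothetical merged data: three neighborhoods in cluster 0, one in cluster 1
merged = pd.DataFrame({
    'Neighborhood Borough': ['A, Bronx', 'B, Bronx', 'C, Queens', 'D, Queens'],
    'Cluster Labels': [0, 0, 0, 1],
    '1st Most Common Venue': ['Pizza Place', 'Pizza Place', 'Park', 'Bar'],
})

def top_categories(data_merged, cluster_label, rank_column, n=5):
    """Count how often each category appears in rank_column within one cluster."""
    in_cluster = data_merged['Cluster Labels'] == cluster_label
    return data_merged.loc[in_cluster, rank_column].value_counts().head(n)

counts = top_categories(merged, 0, '1st Most Common Venue')
print(counts)
```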

### 2. Clustering of all New York City neighborhoods with K=5 <a name="clustering_all_k5"></a>

Let's start by clustering all New York City neighborhoods with the K-Means algorithm, with the number of clusters (K) set to 5.

##### Perform the clustering

In [40]:
ny_merged = cluster_neighborhoods(kclusters=5, data_grouped=ny_grouped, data_neighborhoods=ny_hoods, neighborhoods_venues_sorted=neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [2 2 2 0 2 2 2 2 2 2]
Cluster labels: {0, 1, 2, 3, 4}
The size of the final dataframe: (306, 16)


In [41]:
# Drop neighborhoods without any venues
ny_merged = ny_merged.dropna()

In [42]:
# Convert cluster labels back to integer and check the result
ny_merged['Cluster Labels'] = ny_merged['Cluster Labels'].astype('int')
ny_merged.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Neighborhood Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bronx,Wakefield,40.894705,-73.847201,"Wakefield, Bronx",2,Pharmacy,Ice Cream Shop,Dessert Shop,Laundromat,Caribbean Restaurant,Gas Station,Donut Shop,Sandwich Place,Food,Eye Doctor
1,Bronx,Co-op City,40.874294,-73.829939,"Co-op City, Bronx",2,Bus Station,Baseball Field,Restaurant,Park,Pharmacy,Bagel Shop,Grocery Store,Pizza Place,Discount Store,Fast Food Restaurant
2,Bronx,Eastchester,40.887556,-73.827806,"Eastchester, Bronx",2,Caribbean Restaurant,Deli / Bodega,Diner,Chinese Restaurant,Seafood Restaurant,Donut Shop,Bakery,Pizza Place,Platform,Bowling Alley
3,Bronx,Fieldston,40.895437,-73.905643,"Fieldston, Bronx",2,Plaza,River,Bus Station,Filipino Restaurant,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Factory
4,Bronx,Riverdale,40.890834,-73.912585,"Riverdale, Bronx",2,Park,Bus Station,Bank,Playground,Plaza,Gym,Home Service,Baseball Field,Food Truck,Event Space
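
The `dropna` and `astype('int')` steps are needed because the join keeps all 306 neighborhoods while only 304 have venue data: the unmatched rows receive NaN, and the integer labels are upcast to float. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical neighborhoods; 'C, Bronx' has no venue data
hoods = pd.DataFrame({'Neighborhood Borough': ['A, Bronx', 'B, Bronx', 'C, Bronx']})
labels = pd.DataFrame({
    'Neighborhood Borough': ['A, Bronx', 'B, Bronx'],
    'Cluster Labels': [2, 0],
})

# Left join on the key: the unmatched row gets NaN, so the column becomes float
merged = hoods.join(labels.set_index('Neighborhood Borough'), on='Neighborhood Borough')
print(merged['Cluster Labels'].dtype)

# Drop the unmatched row, then restore the integer type
merged = merged.dropna().copy()
merged['Cluster Labels'] = merged['Cluster Labels'].astype('int')
print(merged)
```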


##### Create a map of New York City with clustered neighborhoods

In [43]:
create_map_clustered_neighborhoods(latitude=latitude, longitude=longitude, kclusters=5, data=ny_merged, zoom_level=10)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [44]:
# Calculate the number of neighborhoods in clusters
ny_merged[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count()

Unnamed: 0_level_0,Neighborhood Borough
Cluster Labels,Unnamed: 1_level_1
0,28
1,1
2,271
3,3
4,1


Most of the neighborhoods (271 of the 304 that have venue data) form one big cluster (label 2), and 28 neighborhoods belong to cluster 0. The remaining three clusters each contain fewer than 5 neighborhoods.

Let's look at how the clusters are distributed among boroughs: 

In [45]:
# Calculate the number of neighborhoods in clusters and boroughs
ny_merged[['Borough', 'Cluster Labels', 'Neighborhood']].groupby(['Borough', 'Cluster Labels']).count().rename(columns={'Neighborhood': 'Count'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Count
Borough,Cluster Labels,Unnamed: 2_level_1
Bronx,2,51
Bronx,4,1
Brooklyn,0,1
Brooklyn,2,69
Manhattan,2,40
Queens,0,10
Queens,2,69
Queens,3,2
Staten Island,0,17
Staten Island,1,1


It seems that most New York City neighborhoods are alike. But is that really true? What if five clusters are simply too coarse?

Based on the observations above, the neighborhoods do not seem to be sufficiently differentiated. We could conclude that either:
- the chosen K is too small or
- the neighborhoods in New York cannot be clustered at once.
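
One common way to check whether K is too small is to compare silhouette scores across several values of K. The sketch below does this on synthetic data generated with `make_blobs`; it has not been run on the neighborhood data, and the range of K values is an arbitrary assumption. In the notebook, the venue-frequency matrix passed to `KMeans` would take the place of `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the venue-frequency matrix: three well-separated groups
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means better-separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('Best K by silhouette score:', best_k)
```

If the scores stayed low for every K, that would be in line with the second explanation, namely that the neighborhoods do not separate well in a single clustering.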

### 3. Clustering of all New York City neighborhoods with K=10 <a name="clustering_all_k10"></a>

As a next step, we cluster all New York City neighborhoods with K=10.

##### Perform the clustering

In [46]:
ny_merged = cluster_neighborhoods(kclusters=10, data_grouped=ny_grouped, data_neighborhoods=ny_hoods, neighborhoods_venues_sorted=neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [0 0 0 7 0 4 0 4 4 4]
Cluster labels: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
The size of the final dataframe: (306, 16)


In [47]:
# Drop neighborhoods without any venues
ny_merged = ny_merged.dropna()

In [48]:
# Convert cluster labels back to integer and check the result
ny_merged['Cluster Labels'] = ny_merged['Cluster Labels'].astype('int')
ny_merged.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Neighborhood Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bronx,Wakefield,40.894705,-73.847201,"Wakefield, Bronx",0,Pharmacy,Ice Cream Shop,Dessert Shop,Laundromat,Caribbean Restaurant,Gas Station,Donut Shop,Sandwich Place,Food,Eye Doctor
1,Bronx,Co-op City,40.874294,-73.829939,"Co-op City, Bronx",0,Bus Station,Baseball Field,Restaurant,Park,Pharmacy,Bagel Shop,Grocery Store,Pizza Place,Discount Store,Fast Food Restaurant
2,Bronx,Eastchester,40.887556,-73.827806,"Eastchester, Bronx",0,Caribbean Restaurant,Deli / Bodega,Diner,Chinese Restaurant,Seafood Restaurant,Donut Shop,Bakery,Pizza Place,Platform,Bowling Alley
3,Bronx,Fieldston,40.895437,-73.905643,"Fieldston, Bronx",4,Plaza,River,Bus Station,Filipino Restaurant,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Factory
4,Bronx,Riverdale,40.890834,-73.912585,"Riverdale, Bronx",4,Park,Bus Station,Bank,Playground,Plaza,Gym,Home Service,Baseball Field,Food Truck,Event Space


##### Create a map of New York City with clustered neighborhoods

In [49]:
create_map_clustered_neighborhoods(latitude=latitude, longitude=longitude, kclusters=10, data=ny_merged, zoom_level=10)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [50]:
# Calculate the number of neighborhoods in clusters
ny_merged[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

Unnamed: 0_level_0,Count
Cluster Labels,Unnamed: 1_level_1
0,121
1,1
2,2
3,1
4,153
5,2
6,1
7,21
8,1
9,1


As with K=5, one cluster dominates: about half of the neighborhoods (153 of the 304 with venue data) fall into cluster 4. Another 121 neighborhoods belong to cluster 0 and 21 to cluster 7. The remaining 7 clusters contain only one or two neighborhoods each.
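
The cluster sizes in the table above can be converted to shares of the clustered neighborhoods, which makes the imbalance easier to see:

```python
# Cluster sizes copied from the table above (K=10)
cluster_sizes = {0: 121, 1: 1, 2: 2, 3: 1, 4: 153, 5: 2, 6: 1, 7: 21, 8: 1, 9: 1}

total = sum(cluster_sizes.values())
shares = {label: round(100 * size / total, 1) for label, size in cluster_sizes.items()}

# Clusters holding at most two neighborhoods
tiny = [label for label, size in cluster_sizes.items() if size <= 2]

print(total)
print(shares[4], shares[0], shares[7])
print(tiny)
```

Clusters 4 and 0 together hold about 90% of the neighborhoods, while seven of the ten clusters hold one or two neighborhoods each.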

Judging by the map above, all Manhattan neighborhoods belong to a single cluster (label 4), which suggests that they are all similar. Let's get the numbers:

In [51]:
# Calculate the number of neighborhoods in clusters and boroughs
ny_merged[['Borough', 'Cluster Labels', 'Neighborhood']].groupby(['Borough', 'Cluster Labels']).count().rename(columns={'Neighborhood': 'Count'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Count
Borough,Cluster Labels,Unnamed: 2_level_1
Bronx,0,38
Bronx,4,14
Brooklyn,0,33
Brooklyn,1,1
Brooklyn,4,35
Brooklyn,7,1
Manhattan,4,40
Queens,0,30
Queens,2,1
Queens,3,1


Most of the neighborhoods belong to clusters 0 and 4, so it's not so surprising that most of the neighborhoods within every borough belong to these clusters.

##### Clusters 1, 2, 3, 5, 8 and 9
These clusters contain only one or two neighborhoods each, so we simply list them (cluster 6, which also contains a single neighborhood, is not included in the selection):

In [52]:
ny_merged.loc[ny_merged['Cluster Labels'].isin([1, 2, 3, 5, 8, 9]), ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]].sort_values('Cluster Labels')

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
76,Mill Island,1,Pool,Yoga Studio,Filipino Restaurant,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant
192,Somerville,2,Park,Yoga Studio,Field,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant
203,Todt Hill,2,Park,Yoga Studio,Field,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant
303,Bayswater,3,Park,Playground,Yoga Studio,Field,Event Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant
183,Jamaica Estates,5,Intersection,Eye Doctor,Gym,Indian Restaurant,Dog Run,Yoga Studio,Fast Food Restaurant,Ethiopian Restaurant,Event Service,Event Space
202,Grymes Hill,5,Dog Run,Yoga Studio,Filipino Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm
207,Port Ivory,8,Business Service,Yoga Studio,Fish & Chips Shop,Event Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm
193,Brookville,9,Deli / Bodega,Yoga Studio,Fish & Chips Shop,Event Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm


##### Cluster 0
Cluster 0 contains 121 neighborhoods. We use the function `calculate_most_common_venues` to get its most common venue categories:

In [53]:
calculate_most_common_venues(cluster_label=0, data_merged=ny_merged)

Number of neighborhoods in cluster with label=0: 121

Top 5 category venues for cluster with label=0:

                       Count
1st Most Common Venue       
Pizza Place               25
Deli / Bodega             14
Pharmacy                  11
Caribbean Restaurant       8
Donut Shop                 7

                       Count
2nd Most Common Venue       
Pizza Place               22
Deli / Bodega             10
Fast Food Restaurant       8
Chinese Restaurant         7
Italian Restaurant         5

                       Count
3rd Most Common Venue       
Pizza Place               16
Deli / Bodega              9
Ice Cream Shop             6
Bank                       5
Chinese Restaurant         4

                       Count
4th Most Common Venue       
Chinese Restaurant        10
Donut Shop                10
Pizza Place                8
Pharmacy                   7
Grocery Store              6

                       Count
5th Most Common Venue       
Sandwich Place         

The most common venue category in cluster 0 is Pizza Place. 

##### Cluster 4

In [54]:
calculate_most_common_venues(cluster_label=4, data_merged=ny_merged)

Number of neighborhoods in cluster with label=4: 153

Top 5 category venues for cluster with label=4:

                       Count
1st Most Common Venue       
Italian Restaurant        23
Bar                       13
Coffee Shop               11
Grocery Store              8
Park                       6

                       Count
2nd Most Common Venue       
Coffee Shop               14
Pizza Place                8
Park                       6
Italian Restaurant         6
Deli / Bodega              6

                       Count
3rd Most Common Venue       
American Restaurant        8
Pizza Place                8
Deli / Bodega              7
Bar                        7
Coffee Shop                6

                       Count
4th Most Common Venue       
Pizza Place               10
Bakery                     8
Yoga Studio                7
Chinese Restaurant         7
Coffee Shop                6

                       Count
5th Most Common Venue       
Pizza Place            

The most common venue categories in cluster 4, the largest cluster, are Pizza Place, Deli/Bodega and Italian Restaurant. However, Pizza Place is also very common in cluster 0. In addition, one would expect a well-defined cluster to have at least one category that ranks among the top venues in at least half of its neighborhoods, which is not the case for cluster 4. We can therefore conclude that cluster 4 is too general: it is based on small contributions from many different venue categories.
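
The "at least half of the neighborhoods" criterion can be made explicit by computing, for each category, the fraction of a cluster's neighborhoods that list it anywhere among their top venues. A sketch on made-up data (only three rank columns, hypothetical rows):

```python
import pandas as pd

# Hypothetical cluster with four neighborhoods and their top three venues
cluster = pd.DataFrame({
    '1st Most Common Venue': ['Italian Restaurant', 'Bar', 'Coffee Shop', 'Park'],
    '2nd Most Common Venue': ['Coffee Shop', 'Pizza Place', 'Park', 'Coffee Shop'],
    '3rd Most Common Venue': ['Pizza Place', 'Deli / Bodega', 'Bar', 'Bakery'],
})

# Put all rank columns into one long column; a category appears at most once
# per neighborhood, so its count equals the number of neighborhoods listing it
coverage = cluster.melt()['value'].value_counts() / len(cluster)

print(coverage)
print(coverage[coverage >= 0.5].index.tolist())
```

In this toy cluster, Coffee Shop is listed by three of the four neighborhoods and would count as a characteristic category under the criterion.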

##### Cluster 7

In [55]:
calculate_most_common_venues(cluster_label=7, data_merged=ny_merged)

Number of neighborhoods in cluster with label=7: 21

Top 5 category venues for cluster with label=7:

                       Count
1st Most Common Venue       
Bus Stop                   9
Deli / Bodega              4
Italian Restaurant         3
Beach                      1
Bubble Tea Shop            1

                       Count
2nd Most Common Venue       
Deli / Bodega              6
Bagel Shop                 2
Beach                      2
Bus Stop                   2
Italian Restaurant         2

                       Count
3rd Most Common Venue       
Bus Stop                   4
Deli / Bodega              3
American Restaurant        1
Bank                       1
Café                       1

                       Count
4th Most Common Venue       
Athletics & Sports         2
Pizza Place                2
Home Service               1
Plaza                      1
Playground                 1

                       Count
5th Most Common Venue       
Yoga Studio             

The most common venue categories in cluster 7 are Bus Stop (which probably says little about the character of a neighborhood) and Deli/Bodega.

To sum up, clustering all neighborhoods in a single analysis might not be the best idea. Let's switch to clustering within boroughs.

### 4. Clustering of New York City neighborhoods within boroughs <a name="clustering_boroughs"></a>

Next, we cluster neighborhoods within each borough separately. We use K=5 and K=8, and compare the results. 

##### Prepare datasets

In [56]:
# Create borough column
ny_grouped['Borough'] = ny_grouped['Neighborhood Borough'].str.split(',', expand=True)[1]

# Remove any leading and trailing spaces
ny_grouped['Borough'] = ny_grouped['Borough'].str.strip()

# Check the result
ny_grouped.head()

Unnamed: 0,Neighborhood Borough,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,...,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Borough
0,"Allerton, Bronx",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Bronx
1,"Annadale, Staten Island",0.0,0.0,0.0,0.0,0.0,0.181818,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Staten Island
2,"Arden Heights, Staten Island",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Staten Island
3,"Arlington, Staten Island",0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Staten Island
4,"Arrochar, Staten Island",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Staten Island


In [57]:
# Create borough column
neighborhoods_venues_sorted['Borough'] = neighborhoods_venues_sorted['Neighborhood Borough'].str.split(',', expand=True)[1]

# Remove any leading and trailing spaces
neighborhoods_venues_sorted['Borough'] = neighborhoods_venues_sorted['Borough'].str.strip()

# Check the result
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough
0,0,"Allerton, Bronx",Pizza Place,Deli / Bodega,Supermarket,Spa,Donut Shop,Chinese Restaurant,Grocery Store,Martial Arts Dojo,Bakery,Fast Food Restaurant,Bronx
1,0,"Annadale, Staten Island",American Restaurant,Pizza Place,Bakery,Train Station,Diner,Liquor Store,Restaurant,Pharmacy,Sports Bar,Park,Staten Island
2,0,"Arden Heights, Staten Island",Playground,Pharmacy,Coffee Shop,Pizza Place,Yoga Studio,Entertainment Service,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Staten Island
3,7,"Arlington, Staten Island",Bus Stop,Deli / Bodega,American Restaurant,Coffee Shop,Yoga Studio,Fish & Chips Shop,Event Space,Exhibit,Eye Doctor,Factory,Staten Island
4,0,"Arrochar, Staten Island",Deli / Bodega,Italian Restaurant,Bus Stop,Pizza Place,Mediterranean Restaurant,Food Truck,Lawyer,Taco Place,Sandwich Place,Liquor Store,Staten Island
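
Taking element `[1]` after splitting on `','` works as long as no neighborhood name itself contains a comma. If that assumption ever failed, splitting once from the right with `str.rsplit(',', n=1)` would be more robust. A small sketch (the second key is made up to show the difference):

```python
import pandas as pd

keys = pd.Series(['Allerton, Bronx', 'Example, East, Queens'])

# Splitting from the left takes the second piece, which is wrong for the made-up key
left = keys.str.split(',', expand=True)[1].str.strip()

# Splitting once from the right always isolates the borough
right = keys.str.rsplit(',', n=1).str[-1].str.strip()

print(left.tolist())
print(right.tolist())
```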


#### Functions

##### Define function to return borough coordinates

In [58]:
# Get borough coordinates
# Address is for example: 'Bronx, NY'
def get_borough_coordinates(address):
    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    # geocode returns None when the address cannot be resolved
    if location is None:
        raise ValueError('Could not geocode address: {}'.format(address))
    bor_latitude = location.latitude
    bor_longitude = location.longitude
    print('The geographical coordinates of {} are {}, {}.'.format(address, bor_latitude, bor_longitude))
    return bor_latitude, bor_longitude

#### A. Bronx <a name="clustering_bronx"></a>

##### Get coordinates of the borough

In [59]:
# Get borough coordinates
bronx_latitude, bronx_longitude = get_borough_coordinates('Bronx, NY')

The geographical coordinates of Bronx, NY are 40.8466508, -73.8785937.


##### Prepare datasets for the borough

In [60]:
# Create dataframe for borough
bronx_grouped = ny_grouped[ny_grouped['Borough'] == 'Bronx']
del bronx_grouped['Borough']

# Check the size of the created dataframe
bronx_grouped.shape

(52, 433)

In [61]:
# Create dataframe for borough
bronx_hoods = ny_hoods[ny_hoods['Borough'] == 'Bronx']

# Check the size of the created dataframe
bronx_hoods.shape

(52, 5)

In [62]:
# Create dataframe for borough
bronx_neighborhoods_venues_sorted = neighborhoods_venues_sorted[neighborhoods_venues_sorted['Borough'] == 'Bronx']
del bronx_neighborhoods_venues_sorted['Borough']

# Check the size of the created dataframe
bronx_neighborhoods_venues_sorted.shape

(52, 12)
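
The three filtering steps above are repeated for every borough in the sections that follow. As an alternative sketch, the split could be done in one pass with `groupby`; the toy dataframe and the `per_borough` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical grouped data with a Borough column and one frequency column
grouped = pd.DataFrame({
    'Neighborhood Borough': ['Allerton, Bronx', 'Astoria, Queens', 'Arverne, Queens'],
    'Pizza Place': [0.2, 0.0, 0.1],
    'Borough': ['Bronx', 'Queens', 'Queens'],
})

# One dataframe per borough, without the helper Borough column
per_borough = {
    borough: frame.drop(columns='Borough').reset_index(drop=True)
    for borough, frame in grouped.groupby('Borough')
}

print({borough: frame.shape for borough, frame in per_borough.items()})
```

Each entry of `per_borough` could then be passed to `cluster_neighborhoods` in a loop instead of preparing the boroughs one by one.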

##### Perform clustering with K=5

In [63]:
# Clustering
bronx_merged = cluster_neighborhoods(kclusters=5, data_grouped=bronx_grouped, data_neighborhoods=bronx_hoods, neighborhoods_venues_sorted=bronx_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [1 1 1 1 1 1 1 1 3 1]
Cluster labels: {0, 1, 2, 3, 4}
The size of the final dataframe: (52, 16)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
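
The message above is pandas' `SettingWithCopyWarning`. It is likely triggered because the borough dataframes were created by filtering larger dataframes and `cluster_neighborhoods` then adds columns to them. One way to avoid it, sketched on toy data, is to take an explicit copy when filtering:

```python
import pandas as pd

hoods = pd.DataFrame({
    'Borough': ['Bronx', 'Queens', 'Bronx'],
    'Neighborhood': ['Wakefield', 'Astoria', 'Riverdale'],
})

# .copy() makes the filtered frame independent of the original
bronx = hoods[hoods['Borough'] == 'Bronx'].copy()

# Adding a column now modifies only the copy
bronx['Neighborhood Borough'] = bronx['Neighborhood'] + ', ' + bronx['Borough']

print(bronx)
print(list(hoods.columns))
```

The clustering results shown here are not affected by the warning; the copy only makes explicit which dataframe is being modified.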


##### Create map of the borough with neighborhoods

In [64]:
create_map_clustered_neighborhoods(latitude=bronx_latitude, longitude=bronx_longitude, kclusters=5, data=bronx_merged, zoom_level=12)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [65]:
# Calculate the number of neighborhoods in clusters
bronx_merged[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

Unnamed: 0_level_0,Count
Cluster Labels,Unnamed: 1_level_1
0,1
1,45
2,1
3,3
4,2


Let's have a closer look at cluster 1.

__Cluster 1__  

Get the most common venues:

In [66]:
calculate_most_common_venues(cluster_label=1, data_merged=bronx_merged)

Number of neighborhoods in cluster with label=1: 45

Top 5 category venues for cluster with label=1:

                       Count
1st Most Common Venue       
Pizza Place               11
Pharmacy                   5
Italian Restaurant         4
Bus Station                3
Grocery Store              3

                           Count
2nd Most Common Venue           
Deli / Bodega                  8
Pizza Place                    6
Grocery Store                  3
Latin American Restaurant      3
Supermarket                    2

                       Count
3rd Most Common Venue       
Pizza Place                8
Pharmacy                   3
Deli / Bodega              3
Donut Shop                 2
Bank                       2

                       Count
4th Most Common Venue       
Chinese Restaurant         6
Donut Shop                 4
Supermarket                3
Grocery Store              3
Mexican Restaurant         2

                       Count
5th Most Common Venue    

Venues like Pizza Place, Donut Shop and Deli/Bodega are typical for cluster 1. However, these are not the kinds of venues we are looking for.

__View neighborhoods that don't belong to cluster 1:__

In [67]:
bronx_merged.loc[bronx_merged['Cluster Labels'].isin([0, 2, 3, 4]), bronx_merged.columns[[1] + list(range(5, bronx_merged.shape[1]))]].sort_values('Cluster Labels')

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Williamsbridge,0,Nightclub,Bar,Caribbean Restaurant,Soup Place,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Factory
29,Country Club,2,Sandwich Place,Playground,Yoga Studio,Empanada Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor
3,Fieldston,3,Plaza,River,Bus Station,Filipino Restaurant,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Factory
4,Riverdale,3,Park,Bus Station,Bank,Playground,Plaza,Gym,Home Service,Baseball Field,Food Truck,Event Space
27,Clason Point,3,Park,Bus Stop,Moving Target,Pool,Grocery Store,Boat or Ferry,South American Restaurant,Yoga Studio,Field,Exhibit
28,Throgs Neck,4,Coffee Shop,Bar,American Restaurant,Asian Restaurant,Sports Bar,Pizza Place,Juice Bar,Deli / Bodega,Italian Restaurant,Event Space
39,Edgewater Park,4,Italian Restaurant,Deli / Bodega,Pizza Place,Japanese Restaurant,Asian Restaurant,Donut Shop,Coffee Shop,Park,Fast Food Restaurant,Spa


Since we are looking for a neighborhood in which to open a healthy food store, neighborhoods like Clason Point and Country Club, whose most common venues include a park, a playground and a pool, could be good candidates.

##### Perform clustering with K=8

In [68]:
bronx_merged_8 = cluster_neighborhoods(kclusters=8, data_grouped=bronx_grouped, data_neighborhoods=bronx_hoods, neighborhoods_venues_sorted=bronx_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [7 7 7 6 7 4 6 1 3 1]
Cluster labels: {0, 1, 2, 3, 4, 5, 6, 7}
The size of the final dataframe: (52, 16)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


##### Create map of the borough with neighborhoods

In [69]:
create_map_clustered_neighborhoods(latitude=bronx_latitude, longitude=bronx_longitude, kclusters=8, data=bronx_merged_8, zoom_level=12)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [70]:
bronx_merged_8[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

Unnamed: 0_level_0,Count
Cluster Labels,Unnamed: 1_level_1
0,1
1,10
2,1
3,2
4,7
5,1
6,5
7,25


Let's have a look at clusters 1 and 7.

__Cluster 1__

In [71]:
calculate_most_common_venues(cluster_label=1, data_merged=bronx_merged_8)

Number of neighborhoods in cluster with label=1: 10

Top 5 category venues for cluster with label=1:

                       Count
1st Most Common Venue       
Bus Station                3
Chinese Restaurant         2
Grocery Store              2
Donut Shop                 1
Park                       1

                       Count
2nd Most Common Venue       
Bus Station                3
Grocery Store              2
Baseball Field             1
Bus Stop                   1
Chinese Restaurant         1

                       Count
3rd Most Common Venue       
Pizza Place                2
Bank                       1
Burger Joint               1
Deli / Bodega              1
Donut Shop                 1

                       Count
4th Most Common Venue       
Bank                       1
Breakfast Spot             1
Bus Station                1
Chinese Restaurant         1
Metro Station              1

                       Count
5th Most Common Venue       
Pharmacy                

__Cluster 7__

In [72]:
calculate_most_common_venues(cluster_label=7, data_merged=bronx_merged_8)

Number of neighborhoods in cluster with label=7: 25

Top 5 category venues for cluster with label=7:

                       Count
1st Most Common Venue       
Pizza Place                5
Pharmacy                   4
Donut Shop                 2
Fast Food Restaurant       2
Italian Restaurant         2

                           Count
2nd Most Common Venue           
Deli / Bodega                  5
Pizza Place                    5
Latin American Restaurant      3
Bank                           2
American Restaurant            1

                       Count
3rd Most Common Venue       
Pizza Place                3
Supermarket                2
Chinese Restaurant         2
Pharmacy                   2
Donut Shop                 1

                       Count
4th Most Common Venue       
Chinese Restaurant         4
Donut Shop                 3
Diner                      2
Grocery Store              2
Mexican Restaurant         2

                           Count
5th Most Common Venue

__Examine the neighborhoods belonging to clusters 0, 2, 3, 4, 5 and 6__

In [73]:
bronx_merged_8.loc[bronx_merged_8['Cluster Labels'].isin([0, 2, 3, 4, 5, 6]), bronx_merged_8.columns[[1] + list(range(5, bronx_merged_8.shape[1]))]].sort_values('Cluster Labels')

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Williamsbridge | 0 | Nightclub | Bar | Caribbean Restaurant | Soup Place | Ethiopian Restaurant | Event Service | Event Space | Exhibit | Eye Doctor | Factory |
| 29 | Country Club | 2 | Sandwich Place | Playground | Yoga Studio | Empanada Restaurant | Entertainment Service | Ethiopian Restaurant | Event Service | Event Space | Exhibit | Eye Doctor |
| 35 | Spuyten Duyvil | 3 | Park | Pizza Place | Intersection | Bank | Food | Thai Restaurant | Pharmacy | Tennis Stadium | Tennis Court | Farm |
| 27 | Clason Point | 3 | Park | Bus Stop | Moving Target | Pool | Grocery Store | Boat or Ferry | South American Restaurant | Yoga Studio | Field | Exhibit |
| 40 | Castle Hill | 4 | Pizza Place | Baseball Field | Bank | Pharmacy | Cosmetics Shop | Market | Diner | Yoga Studio | Event Service | Event Space |
| 36 | North Riverdale | 4 | Pizza Place | Italian Restaurant | Bank | Donut Shop | Mexican Restaurant | Chinese Restaurant | Bus Station | Coffee Shop | Sandwich Place | Bagel Shop |
| 11 | Pelham Parkway | 4 | Italian Restaurant | Frozen Yogurt Shop | Pizza Place | Chinese Restaurant | Food | Gas Station | Metro Station | Performing Arts Venue | Gift Shop | Coffee Shop |
| 17 | East Tremont | 4 | Pizza Place | Fast Food Restaurant | Cosmetics Shop | Café | Shoe Store | Deli / Bodega | Mobile Phone Shop | Paella Restaurant | Breakfast Spot | Supermarket |
| 24 | Hunts Point | 4 | Pizza Place | Waste Facility | Gourmet Shop | Grocery Store | BBQ Joint | Café | Restaurant | Farmers Market | Bakery | Spanish Restaurant |
| 33 | Morris Park | 4 | Pizza Place | Burger Joint | Deli / Bodega | Bakery | Donut Shop | Frozen Yogurt Shop | Buffet | Spanish Restaurant | Supermarket | Food |


Several neighborhoods outside clusters 1 and 7 have sport and leisure facilities, for example Country Club (Playground, Yoga Studio), Clason Point (Park, Pool) and Spuyten Duyvil (Park, Tennis Stadium, Tennis Court). Increasing the number of clusters to K=8 helped to separate these neighborhoods from clusters 1 and 7.

#### B. Brooklyn <a name="clustering_brooklyn"></a>

##### Get coordinates of the borough

In [74]:
# Get borough coordinates
brooklyn_latitude, brooklyn_longitude = get_borough_coordinates('Brooklyn, NY')

The geographical coordinates of Brooklyn, NY are 40.6501038, -73.9495823.


##### Prepare datasets for the borough

In [75]:
# Create dataframe for borough
brooklyn_grouped = ny_grouped[ny_grouped['Borough'] == 'Brooklyn']
del brooklyn_grouped['Borough']

# Check the size of the created dataframe
brooklyn_grouped.shape

(70, 433)

In [76]:
# Create dataframe for borough
brooklyn_hoods = ny_hoods[ny_hoods['Borough'] == 'Brooklyn']

# Check the size of the created dataframe
brooklyn_hoods.shape

(70, 5)

In [77]:
# Create dataframe for borough
brooklyn_neighborhoods_venues_sorted = neighborhoods_venues_sorted[neighborhoods_venues_sorted['Borough'] == 'Brooklyn']
del brooklyn_neighborhoods_venues_sorted['Borough']

# Check the size of the created dataframe
brooklyn_neighborhoods_venues_sorted.shape

(70, 12)
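
The three cells above take boolean-mask slices of `ny_grouped`, `ny_hoods` and `neighborhoods_venues_sorted`. When such a slice is modified later, pandas may emit a `SettingWithCopyWarning`. The following is a minimal sketch, not the notebook's own code: it uses a toy frame (`ny_hoods_demo`, a hypothetical stand-in with made-up cluster labels) to show how an explicit `.copy()` and `drop()` keep the borough subset independent of the source frame.

```python
import pandas as pd

# Toy stand-in for ny_hoods; the rows are illustrative only
ny_hoods_demo = pd.DataFrame({
    "Borough": ["Brooklyn", "Brooklyn", "Queens"],
    "Neighborhood": ["Canarsie", "Sea Gate", "Neponsit"],
})

# Explicit copy: later assignments affect only the borough subset
brooklyn_demo = ny_hoods_demo[ny_hoods_demo["Borough"] == "Brooklyn"].copy()
brooklyn_demo["Cluster Labels"] = [0, 2]  # made-up labels

# drop() returns a new frame, so no `del` on a slice is needed
brooklyn_demo = brooklyn_demo.drop(columns="Borough")

print(brooklyn_demo.columns.tolist())
print(ny_hoods_demo.columns.tolist())
```

The source frame keeps its original columns, while the subset can be changed freely.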

##### Perform clustering with K=5

In [78]:
# Clustering
brooklyn_merged = cluster_neighborhoods(kclusters=5, data_grouped=brooklyn_grouped, data_neighborhoods=brooklyn_hoods, neighborhoods_venues_sorted=brooklyn_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [1 1 1 1 1 1 1 1 1 1]
Cluster labels: {0, 1, 2, 3, 4}
The size of the final dataframe: (70, 16)




##### Create map of the borough with neighborhoods

In [79]:
create_map_clustered_neighborhoods(latitude=brooklyn_latitude, longitude=brooklyn_longitude, kclusters=5, data=brooklyn_merged, zoom_level=12)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [80]:
# Calculate the number of neighborhoods in clusters
brooklyn_merged[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

| Cluster Labels | Count |
|---|---|
| 0 | 1 |
| 1 | 66 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |


Let's look at neighborhoods in cluster 1.

__Cluster 1__

In [81]:
calculate_most_common_venues(cluster_label=1, data_merged=brooklyn_merged)

Number of neighborhoods in cluster with label=1: 66

Top 5 category venues for cluster with label=1:

                       Count
1st Most Common Venue       
Bar                        9
Deli / Bodega              6
Pizza Place                6
Caribbean Restaurant       5
Coffee Shop                5

                       Count
2nd Most Common Venue       
Pizza Place               11
Coffee Shop                4
Fast Food Restaurant       4
Deli / Bodega              3
Fried Chicken Joint        3

                       Count
3rd Most Common Venue       
Pizza Place                6
Mexican Restaurant         4
Ice Cream Shop             4
Grocery Store              3
Coffee Shop                3

                       Count
4th Most Common Venue       
Pizza Place                5
Coffee Shop                4
Chinese Restaurant         4
Bakery                     4
Donut Shop                 4

                       Count
5th Most Common Venue       
Pizza Place             

__Neighborhoods not belonging to cluster 1__

In [82]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'].isin([0, 2, 3, 4]), brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]].sort_values('Cluster Labels')

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74 | Canarsie | 0 | Asian Restaurant | Thai Restaurant | Business Service | Gym | Caribbean Restaurant | Home Service | Yoga Studio | Event Space | Exhibit | Eye Doctor |
| 85 | Sea Gate | 2 | Sports Club | Beach | Bus Station | Spa | Yoga Studio | Filipino Restaurant | Event Space | Exhibit | Eye Doctor | Factory |
| 76 | Mill Island | 3 | Pool | Yoga Studio | Filipino Restaurant | Ethiopian Restaurant | Event Service | Event Space | Exhibit | Eye Doctor | Factory | Falafel Restaurant |
| 261 | Paerdegat Basin | 4 | Asian Restaurant | Harbor / Marina | Food | Auto Garage | Home Service | Yoga Studio | Event Service | Event Space | Exhibit | Eye Doctor |


Sport facilities are uncommon among the top venues of cluster 1. The neighborhoods outside cluster 1, by contrast, list venues such as Pool (Mill Island), Gym (Canarsie) or Spa (Sea Gate), which makes them good candidates for a healthy food store.
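
The check for sport-related venues is done by eye above. As a rough sketch of how it could be automated, the snippet below counts exact matches against a list of sport categories. The list `SPORT_CATEGORIES` is an assumption made for illustration, and the demo frame holds only two rows and the first four venue columns from the table above. `Yoga Studio` is deliberately left out because it recurs in the tail of nearly every sparse row, which may be an artifact of ranking categories with no venues (an unverified assumption).

```python
import pandas as pd

# Hypothetical list of sport-related categories; adjust as needed
SPORT_CATEGORIES = ["Gym", "Gym / Fitness Center", "Pool", "Spa", "Sports Club"]

# Two rows abridged from the table above (first four venue columns only)
demo = pd.DataFrame({
    "Neighborhood": ["Canarsie", "Paerdegat Basin"],
    "1st Most Common Venue": ["Asian Restaurant", "Asian Restaurant"],
    "2nd Most Common Venue": ["Thai Restaurant", "Harbor / Marina"],
    "3rd Most Common Venue": ["Business Service", "Food"],
    "4th Most Common Venue": ["Gym", "Auto Garage"],
})

venue_cols = [c for c in demo.columns if c.endswith("Most Common Venue")]

# Exact matching avoids false hits such as "Spa" inside "Spanish Restaurant"
demo["Sport Venues"] = demo[venue_cols].isin(SPORT_CATEGORIES).sum(axis=1)

print(demo[["Neighborhood", "Sport Venues"]])
```

Applied to the full merged dataframe, the same two lines would give one count per neighborhood that can be sorted or filtered.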

##### Perform clustering with K=8

In [83]:
# Clustering
brooklyn_merged_8 = cluster_neighborhoods(kclusters=8, data_grouped=brooklyn_grouped, data_neighborhoods=brooklyn_hoods, neighborhoods_venues_sorted=brooklyn_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [2 2 2 2 4 2 1 2 1 2]
Cluster labels: {0, 1, 2, 3, 4, 5, 6, 7}
The size of the final dataframe: (70, 16)




##### Create map of the borough with neighborhoods

In [84]:
create_map_clustered_neighborhoods(latitude=brooklyn_latitude, longitude=brooklyn_longitude, kclusters=8, data=brooklyn_merged_8, zoom_level=12)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [85]:
# Calculate the number of neighborhoods in clusters
brooklyn_merged_8[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

| Cluster Labels | Count |
|---|---|
| 0 | 1 |
| 1 | 22 |
| 2 | 42 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |


Let's look at neighborhoods in clusters 1 and 2.

__Cluster 1__

In [86]:
calculate_most_common_venues(cluster_label=1, data_merged=brooklyn_merged_8)

Number of neighborhoods in cluster with label=1: 22

Top 5 category venues for cluster with label=1:

                           Count
1st Most Common Venue           
Caribbean Restaurant           5
Deli / Bodega                  4
Bank                           2
Donut Shop                     2
Latin American Restaurant      2

                       Count
2nd Most Common Venue       
Fast Food Restaurant       4
Fried Chicken Joint        3
Grocery Store              2
Pizza Place                2
American Restaurant        1

                       Count
3rd Most Common Venue       
Pizza Place                3
Mexican Restaurant         2
Grocery Store              2
Mobile Phone Shop          1
Supermarket                1

                       Count
4th Most Common Venue       
Donut Shop                 3
Pharmacy                   2
Bank                       1
Food                       1
Print Shop                 1

                        Count
5th Most Common Venue   

__Cluster 2__

In [87]:
calculate_most_common_venues(cluster_label=2, data_merged=brooklyn_merged_8)

Number of neighborhoods in cluster with label=2: 42

Top 5 category venues for cluster with label=2:

                       Count
1st Most Common Venue       
Bar                        9
Pizza Place                5
Coffee Shop                4
Italian Restaurant         4
Chinese Restaurant         2

                       Count
2nd Most Common Venue       
Pizza Place                9
Coffee Shop                4
Bagel Shop                 3
American Restaurant        2
Park                       2

                       Count
3rd Most Common Venue       
Ice Cream Shop             4
Pizza Place                3
Coffee Shop                3
Scenic Lookout             2
Bar                        2

                       Count
4th Most Common Venue       
Bakery                     4
Coffee Shop                4
Pizza Place                4
Sandwich Place             3
Chinese Restaurant         3

                       Count
5th Most Common Venue       
Pizza Place             

__Neighborhoods not belonging to clusters 1 and 2__

In [88]:
brooklyn_merged_8.loc[brooklyn_merged_8['Cluster Labels'].isin([0, 3, 4, 5, 6, 7]), brooklyn_merged_8.columns[[1] + list(range(5, brooklyn_merged_8.shape[1]))]].sort_values('Cluster Labels')

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74 | Canarsie | 0 | Asian Restaurant | Thai Restaurant | Business Service | Gym | Caribbean Restaurant | Home Service | Yoga Studio | Event Space | Exhibit | Eye Doctor |
| 85 | Sea Gate | 3 | Sports Club | Beach | Bus Station | Spa | Yoga Studio | Filipino Restaurant | Event Space | Exhibit | Eye Doctor | Factory |
| 91 | Bergen Beach | 4 | Harbor / Marina | Playground | Athletics & Sports | Hockey Field | Donut Shop | Baseball Field | Food & Drink Shop | Food | Ethiopian Restaurant | Fountain |
| 76 | Mill Island | 5 | Pool | Yoga Studio | Filipino Restaurant | Ethiopian Restaurant | Event Service | Event Space | Exhibit | Eye Doctor | Factory | Falafel Restaurant |
| 261 | Paerdegat Basin | 6 | Asian Restaurant | Harbor / Marina | Food | Auto Garage | Home Service | Yoga Studio | Event Service | Event Space | Exhibit | Eye Doctor |
| 81 | Dyker Heights | 7 | Golf Course | Dance Studio | Burger Joint | Bagel Shop | Food | Filipino Restaurant | Event Space | Exhibit | Eye Doctor | Factory |


Sport facilities are again uncommon in clusters 1 and 2. Some of the neighborhoods outside these two clusters, such as Bergen Beach (Athletics & Sports, Hockey Field, Baseball Field) and Dyker Heights (Golf Course, Dance Studio), could be good candidates for a healthy food store.
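
To see what the larger K changed for Brooklyn, the neighborhoods outside the large clusters can be compared between the two runs. The sketch below uses only the neighborhood names printed in the K=5 and K=8 tables above; the variable names are illustrative.

```python
# Neighborhoods outside cluster 1 for K=5 (from the K=5 table above)
outliers_k5 = {"Canarsie", "Sea Gate", "Mill Island", "Paerdegat Basin"}

# Neighborhoods outside clusters 1 and 2 for K=8 (from the K=8 table above)
outliers_k8 = {
    "Canarsie", "Sea Gate", "Bergen Beach",
    "Mill Island", "Paerdegat Basin", "Dyker Heights",
}

# Neighborhoods that were separated from the large clusters only with K=8
newly_separated = sorted(outliers_k8 - outliers_k5)
print(newly_separated)  # ['Bergen Beach', 'Dyker Heights']
```

The four neighborhoods isolated with K=5 stay isolated with K=8, and two more are split off.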

#### C. Manhattan <a name="clustering_manhattan"></a>

##### Get coordinates of the borough

In [89]:
# Get borough coordinates
manh_latitude, manh_longitude = get_borough_coordinates('Manhattan, NY')

The geographical coordinates of Manhattan, NY are 40.7896239, -73.9598939.


##### Prepare datasets for the borough

In [90]:
# Create dataframe for borough
manhattan_grouped = ny_grouped[ny_grouped['Borough'] == 'Manhattan']
del manhattan_grouped['Borough']

# Check the size of the created dataframe
manhattan_grouped.shape

(40, 433)

In [91]:
# Create dataframe for borough
manhattan_hoods = ny_hoods[ny_hoods['Borough'] == 'Manhattan']

# Check the size of the created dataframe
manhattan_hoods.shape

(40, 5)

In [92]:
# Create dataframe for borough
manhattan_neighborhoods_venues_sorted = neighborhoods_venues_sorted[neighborhoods_venues_sorted['Borough'] == 'Manhattan']
del manhattan_neighborhoods_venues_sorted['Borough']

# Check the size of the created dataframe
manhattan_neighborhoods_venues_sorted.shape

(40, 12)

##### Perform clustering with K=5

In [93]:
# Clustering
manhattan_merged = cluster_neighborhoods(kclusters=5, data_grouped=manhattan_grouped, data_neighborhoods=manhattan_hoods, neighborhoods_venues_sorted=manhattan_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [2 1 0 0 0 0 0 3 1 2]
Cluster labels: {0, 1, 2, 3, 4}
The size of the final dataframe: (40, 16)




##### Create map of the borough with neighborhoods

In [94]:
create_map_clustered_neighborhoods(latitude=manh_latitude, longitude=manh_longitude, kclusters=5, data=manhattan_merged, zoom_level=12)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [95]:
# Calculate the number of neighborhoods in clusters
manhattan_merged[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

| Cluster Labels | Count |
|---|---|
| 0 | 15 |
| 1 | 14 |
| 2 | 6 |
| 3 | 4 |
| 4 | 1 |


__Cluster 0__

In [96]:
calculate_most_common_venues(cluster_label=0, data_merged=manhattan_merged)

Number of neighborhoods in cluster with label=0: 15

Top 5 category venues for cluster with label=0:

                       Count
1st Most Common Venue       
Italian Restaurant         4
Coffee Shop                2
African Restaurant         1
American Restaurant        1
Chinese Restaurant         1

                       Count
2nd Most Common Venue       
Gym / Fitness Center       2
American Restaurant        1
Bakery                     1
Boutique                   1
Café                       1

                       Count
3rd Most Common Venue       
American Restaurant        3
Italian Restaurant         2
Art Gallery                1
Café                       1
Clothing Store             1

                       Count
4th Most Common Venue       
French Restaurant          2
Hotel                      2
Steakhouse                 2
American Restaurant        1
Café                       1

                       Count
5th Most Common Venue       
American Restaurant     

__Cluster 1__

In [97]:
calculate_most_common_venues(cluster_label=1, data_merged=manhattan_merged)

Number of neighborhoods in cluster with label=1: 14

Top 5 category venues for cluster with label=1:

                       Count
1st Most Common Venue       
Italian Restaurant         5
Bar                        2
Café                       2
Pizza Place                2
Coffee Shop                1

                       Count
2nd Most Common Venue       
Coffee Shop                4
Wine Bar                   2
Art Gallery                1
Bakery                     1
Bar                        1

                       Count
3rd Most Common Venue       
Art Gallery                1
Bagel Shop                 1
Bakery                     1
Bar                        1
Bubble Tea Shop            1

                       Count
4th Most Common Venue       
Pizza Place                3
Café                       2
Mexican Restaurant         2
Yoga Studio                2
Bakery                     1

                       Count
5th Most Common Venue       
Deli / Bodega           

__Other neighborhoods__

In [98]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'].isin([2, 3, 4]), manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]].sort_values('Cluster Labels')

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Marble Hill | 2 | Gym | Sandwich Place | Coffee Shop | Yoga Studio | Kids Store | Shopping Mall | Seafood Restaurant | Miscellaneous Shop | Bank | Pharmacy |
| 110 | Roosevelt Island | 2 | Sandwich Place | Coffee Shop | Japanese Restaurant | Supermarket | Bus Line | Soccer Field | School | Scenic Lookout | Residential Building (Apartment / Condo) | Deli / Bodega |
| 115 | Murray Hill | 2 | Sandwich Place | Coffee Shop | American Restaurant | Japanese Restaurant | Hotel | Italian Restaurant | Mediterranean Restaurant | Gym / Fitness Center | Gym | Bagel Shop |
| 125 | Morningside Heights | 2 | Park | Coffee Shop | Bookstore | American Restaurant | Deli / Bodega | Burger Joint | Sandwich Place | Outdoor Sculpture | Salad Place | Frozen Yogurt Shop |
| 127 | Battery Park City | 2 | Park | Coffee Shop | Hotel | Wine Shop | Shopping Mall | Women's Store | Gym | Memorial Site | Men's Store | Sandwich Place |
| 128 | Financial District | 2 | Coffee Shop | American Restaurant | Bar | Pizza Place | Gym | Wine Shop | Cocktail Bar | Food Truck | Hotel | Steakhouse |
| 102 | Inwood | 3 | Café | Mexican Restaurant | Restaurant | Lounge | Pizza Place | Chinese Restaurant | Bakery | Spanish Restaurant | Frozen Yogurt Shop | Park |
| 104 | Manhattanville | 3 | Coffee Shop | Deli / Bodega | Seafood Restaurant | Chinese Restaurant | Park | Mexican Restaurant | Italian Restaurant | American Restaurant | Falafel Restaurant | Climbing Gym |
| 106 | East Harlem | 3 | Mexican Restaurant | Bakery | Thai Restaurant | Latin American Restaurant | Deli / Bodega | Pizza Place | Café | Liquor Store | Sandwich Place | Gym |
| 274 | Tudor City | 3 | Mexican Restaurant | Park | Café | Deli / Bodega | Coffee Shop | Greek Restaurant | Diner | Pizza Place | Gym | Garden |


Neighborhoods in clusters 0 and 1 do not have many sport facilities among their most common venues. Neighborhoods in clusters 2 and 3, in contrast, more often list venues such as Gym, Yoga Studio, Soccer Field or Climbing Gym.
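
The comparison between clusters can also be expressed as a number: the share of neighborhoods per cluster that list at least one fitness venue. The sketch below is illustrative only. The `FITNESS` list is an assumption, and the demo frame contains four rows and the first three venue columns from the table above, so it undercounts venues that appear further down the ranking.

```python
import pandas as pd

# Hypothetical list of fitness-related categories
FITNESS = ["Gym", "Gym / Fitness Center", "Yoga Studio", "Climbing Gym", "Soccer Field"]

# Rows abridged from the table above (three of the ten venue columns)
demo = pd.DataFrame({
    "Neighborhood": ["Marble Hill", "Roosevelt Island", "Inwood", "East Harlem"],
    "Cluster Labels": [2, 2, 3, 3],
    "1st Most Common Venue": ["Gym", "Sandwich Place", "Café", "Mexican Restaurant"],
    "2nd Most Common Venue": ["Sandwich Place", "Coffee Shop", "Mexican Restaurant", "Bakery"],
    "3rd Most Common Venue": ["Coffee Shop", "Japanese Restaurant", "Restaurant", "Thai Restaurant"],
})

venue_cols = [c for c in demo.columns if c.endswith("Most Common Venue")]

# True if any of the listed venue columns holds a fitness category
demo["Has Fitness Venue"] = demo[venue_cols].isin(FITNESS).any(axis=1)

# Mean of a boolean column = share of neighborhoods with a fitness venue
share = demo.groupby("Cluster Labels")["Has Fitness Venue"].mean()
print(share)
```

With all ten venue columns and all neighborhoods included, the same `groupby` would give one share per cluster that can be compared directly.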

#### D. Queens <a name="clustering_queens"></a>

##### Get coordinates of the borough

In [99]:
# Get borough coordinates
queens_latitude, queens_longitude = get_borough_coordinates('Queens, NY')

The geographical coordinates of Queens, NY are 40.7498243, -73.7976337.


##### Prepare datasets for the borough

In [100]:
# Create dataframe for borough
queens_grouped = ny_grouped[ny_grouped['Borough'] == 'Queens']
del queens_grouped['Borough']

# Check the size of the created dataframe
queens_grouped.shape

(81, 433)

In [101]:
# Create dataframe for borough
queens_hoods = ny_hoods[ny_hoods['Borough'] == 'Queens']

# Check the size of the created dataframe
queens_hoods.shape

(81, 5)

In [102]:
# Create dataframe for borough
queens_neighborhoods_venues_sorted = neighborhoods_venues_sorted[neighborhoods_venues_sorted['Borough'] == 'Queens']
del queens_neighborhoods_venues_sorted['Borough']

# Check the size of the created dataframe
queens_neighborhoods_venues_sorted.shape

(81, 12)

##### Perform clustering with K=5

In [103]:
# Clustering
queens_merged = cluster_neighborhoods(kclusters=5, data_grouped=queens_grouped, data_neighborhoods=queens_hoods, neighborhoods_venues_sorted=queens_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [1 0 0 1 1 0 3 0 0 0]
Cluster labels: {0, 1, 2, 3, 4}
The size of the final dataframe: (81, 16)




##### Create map of the borough with neighborhoods

In [104]:
create_map_clustered_neighborhoods(latitude=queens_latitude, longitude=queens_longitude, kclusters=5, data=queens_merged, zoom_level=11)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [105]:
# Calculate the number of neighborhoods in clusters
queens_merged[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

| Cluster Labels | Count |
|---|---|
| 0 | 36 |
| 1 | 38 |
| 2 | 4 |
| 3 | 2 |
| 4 | 1 |


__Cluster 0__

In [106]:
calculate_most_common_venues(cluster_label=0, data_merged=queens_merged)

Number of neighborhoods in cluster with label=0: 36

Top 5 category venues for cluster with label=0:

                       Count
1st Most Common Venue       
Deli / Bodega              8
Chinese Restaurant         4
Pizza Place                3
Japanese Restaurant        2
Bar                        2

                       Count
2nd Most Common Venue       
Italian Restaurant         4
Pizza Place                4
Bank                       3
Playground                 3
Pub                        2

                           Count
3rd Most Common Venue           
Deli / Bodega                  4
Pizza Place                    3
Supermarket                    2
South American Restaurant      2
Diner                          2

                           Count
4th Most Common Venue           
Pizza Place                    5
Bakery                         4
Chinese Restaurant             3
Gym                            2
Latin American Restaurant      2

                       Cou

__Cluster 1__

In [108]:
calculate_most_common_venues(cluster_label=1, data_merged=queens_merged)

Number of neighborhoods in cluster with label=1: 38

Top 5 category venues for cluster with label=1:

                       Count
1st Most Common Venue       
Grocery Store              3
Hotel                      2
Korean Restaurant          2
Donut Shop                 2
Cosmetics Shop             2

                       Count
2nd Most Common Venue       
Department Store           2
Bank                       2
Pharmacy                   2
Fast Food Restaurant       2
Sandwich Place             2

                       Count
3rd Most Common Venue       
Chinese Restaurant         3
Deli / Bodega              3
Mobile Phone Shop          2
Bakery                     2
Bank                       2

                           Count
4th Most Common Venue           
Donut Shop                     3
Fried Chicken Joint            2
Supermarket                    2
Chinese Restaurant             2
Latin American Restaurant      2

                       Count
5th Most Common Venue    

__Neighborhoods outside clusters 0 and 1__

In [109]:
queens_merged.loc[queens_merged['Cluster Labels'].isin([2, 3, 4]), queens_merged.columns[[1] + list(range(5, queens_merged.shape[1]))]].sort_values('Cluster Labels')

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 172 | Breezy Point | 2 | Trail | Beach | Bus Stop | Monument / Landmark | Yoga Studio | Filipino Restaurant | Event Space | Exhibit | Eye Doctor | Factory |
| 179 | Neponsit | 2 | Beach | Yoga Studio | Filipino Restaurant | Event Service | Event Space | Exhibit | Eye Doctor | Factory | Falafel Restaurant | Farm |
| 288 | Roxbury | 2 | Irish Pub | Trail | Deli / Bodega | Beach | Baseball Field | Fast Food Restaurant | Yoga Studio | Field | Event Service | Event Space |
| 302 | Hammels | 2 | Beach | Diner | Bus Station | Dog Run | Café | Gym / Fitness Center | Bus Stop | Fast Food Restaurant | Shoe Store | Deli / Bodega |
| 192 | Somerville | 3 | Park | Yoga Studio | Field | Ethiopian Restaurant | Event Service | Event Space | Exhibit | Eye Doctor | Factory | Falafel Restaurant |
| 303 | Bayswater | 3 | Park | Playground | Yoga Studio | Field | Event Service | Event Space | Exhibit | Eye Doctor | Factory | Falafel Restaurant |
| 193 | Brookville | 4 | Deli / Bodega | Yoga Studio | Fish & Chips Shop | Event Service | Event Space | Exhibit | Eye Doctor | Factory | Falafel Restaurant | Farm |


Some neighborhoods in cluster 0 could be good candidates because of their sport and leisure facilities (Playground and Gym appear among the common venues). Neighborhoods in cluster 1, on the other hand, have mostly non-sport venues. Neighborhoods in clusters 2 and 3, with their beaches, trails and parks, would also make good candidates for our business purpose.

##### Perform clustering with K=8

In [110]:
# Clustering
queens_merged_8 = cluster_neighborhoods(kclusters=8, data_grouped=queens_grouped, data_neighborhoods=queens_hoods, neighborhoods_venues_sorted=queens_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [2 2 2 2 2 2 0 2 2 2]
Cluster labels: {0, 1, 2, 3, 4, 5, 6, 7}
The size of the final dataframe: (81, 16)




##### Create map of the borough with neighborhoods

In [111]:
create_map_clustered_neighborhoods(latitude=queens_latitude, longitude=queens_longitude, kclusters=8, data=queens_merged_8, zoom_level=11)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [112]:
# Calculate the number of neighborhoods in clusters
queens_merged_8[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

| Cluster Labels | Count |
|---|---|
| 0 | 1 |
| 1 | 8 |
| 2 | 66 |
| 3 | 1 |
| 4 | 1 |
| 5 | 2 |
| 6 | 1 |
| 7 | 1 |


Increasing the number of clusters to K=8 did not help: one large cluster (label 2) still contains 66 of the 81 neighborhoods, so we do not analyze these results in more detail.
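
One simple way to quantify this is the share of neighborhoods that fall into the largest cluster. The sketch below uses only the cluster sizes printed in the two Queens count tables above; the helper `largest_cluster_share` is a hypothetical name introduced for illustration.

```python
# Cluster sizes copied from the two Queens count tables above
sizes_k5 = {0: 36, 1: 38, 2: 4, 3: 2, 4: 1}
sizes_k8 = {0: 1, 1: 8, 2: 66, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1}


def largest_cluster_share(sizes):
    """Return the fraction of neighborhoods in the largest cluster."""
    total = sum(sizes.values())
    return max(sizes.values()) / total


for name, sizes in (("K=5", sizes_k5), ("K=8", sizes_k8)):
    print(f"{name}: {largest_cluster_share(sizes):.2f}")
# K=5: 0.47
# K=8: 0.81
```

For Queens, the largest cluster grows from about 47% of the neighborhoods with K=5 to about 81% with K=8, which matches the observation that the larger K did not produce a more useful split.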

#### E. Staten Island <a name="clustering_staten_island"></a>

##### Get coordinates of the borough

In [113]:
# Get borough coordinates
staten_latitude, staten_longitude = get_borough_coordinates('Staten Island, NY')

The geographical coordinates of Staten Island, NY are 40.5834557, -74.1496048.


##### Prepare datasets for the borough

In [114]:
# Create dataframe for borough
staten_grouped = ny_grouped[ny_grouped['Borough'] == 'Staten Island']
del staten_grouped['Borough']

# Check the size of the created dataframe
staten_grouped.shape

(61, 433)

In [115]:
# Create dataframe for borough
staten_hoods = ny_hoods[ny_hoods['Borough'] == 'Staten Island']

# Check the size of the created dataframe
staten_hoods.shape

(63, 5)

In [116]:
# Create dataframe for borough
staten_neighborhoods_venues_sorted = neighborhoods_venues_sorted[neighborhoods_venues_sorted['Borough'] == 'Staten Island']
del staten_neighborhoods_venues_sorted['Borough']

# Check the size of the created dataframe
staten_neighborhoods_venues_sorted.shape

(61, 12)

##### Perform clustering with K=5

In [117]:
# Clustering
staten_merged = cluster_neighborhoods(kclusters=5, data_grouped=staten_grouped, data_neighborhoods=staten_hoods, neighborhoods_venues_sorted=staten_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [1 1 0 1 1 1 1 1 1 1]
Cluster labels: {0, 1, 2, 3, 4}
The size of the final dataframe: (63, 16)




In [119]:
# Drop neighborhoods without any venues; copy so that later assignments do not act on a slice
staten_merged = staten_merged.dropna().copy()

In [120]:
# Convert cluster labels back to integer and check the result
staten_merged['Cluster Labels'] = staten_merged['Cluster Labels'].astype('int')
staten_merged.head()

  


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Neighborhood Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
197,Staten Island,St. George,40.644982,-74.079353,"St. George, Staten Island",1,Clothing Store,Sporting Goods Shop,Italian Restaurant,American Restaurant,Bar,Harbor / Marina,Coffee Shop,Toy / Game Store,Tourist Information Center,Falafel Restaurant
198,Staten Island,New Brighton,40.640615,-74.087017,"New Brighton, Staten Island",0,Bus Stop,Deli / Bodega,Park,Playground,Bowling Alley,Discount Store,Fish & Chips Shop,Exhibit,Eye Doctor,Factory
199,Staten Island,Stapleton,40.626928,-74.077902,"Stapleton, Staten Island",1,Sandwich Place,Mexican Restaurant,Bank,Discount Store,Restaurant,New American Restaurant,Motorcycle Shop,Optical Shop,Fast Food Restaurant,Train Station
200,Staten Island,Rosebank,40.615305,-74.069805,"Rosebank, Staten Island",1,Grocery Store,Mexican Restaurant,Italian Restaurant,Restaurant,Martial Arts Dojo,Beach,Sandwich Place,Eastern European Restaurant,Bar,Liquor Store
201,Staten Island,West Brighton,40.631879,-74.107182,"West Brighton, Staten Island",1,Coffee Shop,Bank,Bar,Italian Restaurant,Pharmacy,American Restaurant,Breakfast Spot,Music Store,Bus Stop,Mexican Restaurant
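
The two cells above drop rows with missing values and cast `Cluster Labels` back to integers. `cluster_neighborhoods` is defined earlier in the notebook and is not shown here, so the following is only a sketch under the assumption that it joins the labels onto the neighborhood list with a left merge. In that case, neighborhoods without venues receive `NaN`, which turns the whole label column into floats. The frames `hoods` and `labels` are toy data.

```python
import pandas as pd

# Toy data: neighborhood "B" has no venues and therefore no cluster label
hoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
labels = pd.DataFrame({"Neighborhood": ["A", "C"], "Cluster Labels": [1, 0]})

# Assumed join: a left merge keeps "B" and fills its label with NaN
merged = hoods.merge(labels, on="Neighborhood", how="left")
print(merged["Cluster Labels"].dtype)  # float64, because of the NaN

# Drop unlabeled rows, then restore the integer dtype
cleaned = merged.dropna().copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype(int)
print(cleaned)
```

This would be consistent with the shapes printed above: 63 neighborhoods in `staten_hoods`, but only 61 rows in `staten_grouped`.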


##### Create map of the borough with neighborhoods

In [121]:
create_map_clustered_neighborhoods(latitude=staten_latitude, longitude=staten_longitude, kclusters=5, data=staten_merged, zoom_level=12)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [122]:
# Calculate the number of neighborhoods in clusters
staten_merged[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

| Cluster Labels | Count |
|---|---|
| 0 | 8 |
| 1 | 46 |
| 2 | 1 |
| 3 | 1 |
| 4 | 5 |


__Cluster 1__

In [123]:
calculate_most_common_venues(cluster_label=1, data_merged=staten_merged)

Number of neighborhoods in cluster with label=1: 46

Top 5 category venues for cluster with label=1:

                       Count
1st Most Common Venue       
Italian Restaurant         7
Pizza Place                6
Deli / Bodega              3
Grocery Store              3
Bus Stop                   2

                       Count
2nd Most Common Venue       
Pizza Place                8
Italian Restaurant         3
Bagel Shop                 2
Bank                       2
Deli / Bodega              2

                       Count
3rd Most Common Venue       
Deli / Bodega              4
Italian Restaurant         3
Bus Stop                   3
Pizza Place                3
Intersection               2

                       Count
4th Most Common Venue       
Train Station              4
Bagel Shop                 3
Baseball Field             3
Pharmacy                   3
Pizza Place                3

                       Count
5th Most Common Venue       
Yoga Studio             

__Other neighborhoods__

In [124]:
staten_merged.loc[staten_merged['Cluster Labels'].isin([0, 2, 3, 4]), staten_merged.columns[[1] + list(range(5, staten_merged.shape[1]))]].sort_values('Cluster Labels')

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 198 | New Brighton | 0 | Bus Stop | Deli / Bodega | Park | Playground | Bowling Alley | Discount Store | Fish & Chips Shop | Exhibit | Eye Doctor | Factory |
| 212 | Oakwood | 0 | Playground | Bus Stop | Lawyer | Bar | Food Truck | Field | Ethiopian Restaurant | Event Service | Event Space | Exhibit |
| 224 | Park Hill | 0 | Bus Stop | Hotel | Coffee Shop | Gym / Fitness Center | Athletics & Sports | Yoga Studio | Event Space | Exhibit | Eye Doctor | Factory |
| 227 | Arlington | 0 | Bus Stop | Deli / Bodega | American Restaurant | Coffee Shop | Yoga Studio | Fish & Chips Shop | Event Space | Exhibit | Eye Doctor | Factory |
| 256 | Randall Manor | 0 | Bus Stop | Bagel Shop | Deli / Bodega | Park | Yoga Studio | Filipino Restaurant | Event Space | Exhibit | Eye Doctor | Factory |
| 285 | Willowbrook | 0 | Bus Stop | Bagel Shop | Deli / Bodega | Pizza Place | Yoga Studio | Field | Event Service | Event Space | Exhibit | Eye Doctor |
| 286 | Sandy Ground | 0 | Bus Stop | Intersection | Market | Home Service | Yoga Studio | Fast Food Restaurant | Ethiopian Restaurant | Event Service | Event Space | Exhibit |
| 305 | Fox Hills | 0 | Bus Stop | Deli / Bodega | Cocktail Bar | Bus Station | Playground | Sandwich Place | Yoga Studio | Field | Event Space | Exhibit |
| 207 | Port Ivory | 2 | Business Service | Yoga Studio | Fish & Chips Shop | Event Service | Event Space | Exhibit | Eye Doctor | Factory | Falafel Restaurant | Farm |
| 202 | Grymes Hill | 3 | Dog Run | Yoga Studio | Filipino Restaurant | Event Service | Event Space | Exhibit | Eye Doctor | Factory | Falafel Restaurant | Farm |


Neighborhoods from cluster 1, the biggest cluster, wouldn't make good candidates for our business idea of a healthy food shop, whereas neighborhoods belonging to cluster 0 would fit better. 

##### Perform clustering with K=8

In [125]:
# Clustering
staten_merged_8 = cluster_neighborhoods(kclusters=8, data_grouped=staten_grouped, data_neighborhoods=staten_hoods, neighborhoods_venues_sorted=staten_neighborhoods_venues_sorted)

Cluster labels for the first ten neighborhoods: [1 0 5 1 1 1 1 6 1 1]
Cluster labels: {0, 1, 2, 3, 4, 5, 6, 7}
The size of the final dataframe: (63, 16)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [126]:
# Drop neighborhoods without any venues
staten_merged_8 = staten_merged_8.dropna()

In [127]:
# Convert cluster labels back to integer and check the result
staten_merged_8['Cluster Labels'] = staten_merged_8['Cluster Labels'].astype('int')
staten_merged_8.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Neighborhood Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
197,Staten Island,St. George,40.644982,-74.079353,"St. George, Staten Island",1,Clothing Store,Sporting Goods Shop,Italian Restaurant,American Restaurant,Bar,Harbor / Marina,Coffee Shop,Toy / Game Store,Tourist Information Center,Falafel Restaurant
198,Staten Island,New Brighton,40.640615,-74.087017,"New Brighton, Staten Island",5,Bus Stop,Deli / Bodega,Park,Playground,Bowling Alley,Discount Store,Fish & Chips Shop,Exhibit,Eye Doctor,Factory
199,Staten Island,Stapleton,40.626928,-74.077902,"Stapleton, Staten Island",1,Sandwich Place,Mexican Restaurant,Bank,Discount Store,Restaurant,New American Restaurant,Motorcycle Shop,Optical Shop,Fast Food Restaurant,Train Station
200,Staten Island,Rosebank,40.615305,-74.069805,"Rosebank, Staten Island",1,Grocery Store,Mexican Restaurant,Italian Restaurant,Restaurant,Martial Arts Dojo,Beach,Sandwich Place,Eastern European Restaurant,Bar,Liquor Store
201,Staten Island,West Brighton,40.631879,-74.107182,"West Brighton, Staten Island",1,Coffee Shop,Bank,Bar,Italian Restaurant,Pharmacy,American Restaurant,Breakfast Spot,Music Store,Bus Stop,Mexican Restaurant


##### Create map of the borough with neighborhoods

In [128]:
create_map_clustered_neighborhoods(latitude=staten_latitude, longitude=staten_longitude, kclusters=8, data=staten_merged_8, zoom_level=12)

##### Examine clusters

Get the number of neighborhoods in clusters:

In [129]:
# Calculate the number of neighborhoods in clusters
staten_merged_8[['Neighborhood Borough', 'Cluster Labels']].groupby('Cluster Labels').count().rename(columns={'Neighborhood Borough': 'Count'})

Cluster Labels,Count
0,1
1,42
2,1
3,5
4,1
5,8
6,2
7,1


Increasing K did not help much - there are still three clusters with at least 5 members, and the largest one still holds 42 of the 61 neighborhoods. Therefore, similarly to Queens, we will not analyze these results further. 
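The imbalance can be summarized with a few numbers. The following is a minimal sketch that re-enters the cluster sizes from the K=8 table above by hand; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Cluster sizes copied from the K=8 Staten Island table above
sizes = pd.Series({0: 1, 1: 42, 2: 1, 3: 5, 4: 1, 5: 8, 6: 2, 7: 1}, name="Count")

total = int(sizes.sum())                   # neighborhoods left after dropna()
largest_share = sizes.max() / total        # fraction held by the biggest cluster
n_large = int((sizes >= 5).sum())          # clusters with at least 5 members
n_singletons = int((sizes == 1).sum())     # clusters holding a single neighborhood

print(f"{total} neighborhoods, largest cluster holds {largest_share:.0%}")
print(f"{n_large} clusters with >= 5 members, {n_singletons} singleton clusters")
```

A largest-cluster share well above one half, combined with several singleton clusters, is the pattern that makes the K=8 result hard to interpret here.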

## Results and discussion <a name="results_discussion"></a>

In this study, two New York City clustering strategies were selected: 1) clustering of all neighborhoods at once, and 2) clustering of neighborhoods within boroughs. In both, two values of K (number of clusters) were chosen. 

__All neighborhoods__  
When all New York City neighborhoods are clustered with __K=5__, 271 out of 306 neighborhoods form one big cluster and another 28 neighborhoods form a second cluster. The remaining 3 clusters each contain fewer than 5 neighborhoods. Clustering with K=5 is therefore not sufficient to segment the neighborhoods adequately - according to the results, most of the neighborhoods (more than 88%) would be similar. 

Clustering of all neighborhoods at once with __K=10__ results in more distinct clusters. As with K=5, there is one big cluster; it contains half of the neighborhoods (153 out of 306) and can be characterized by venues like Pizza Place, Deli/Bodega and Italian Restaurant. The second largest cluster consists of 121 neighborhoods, and its most common venue is Pizza Place. The third largest cluster contains only 21 neighborhoods, and its most abundant venues are Bus Stop and Deli/Bodega. The remaining clusters contain only 1 or 2 neighborhoods.  

While the big clusters don't have any sport facility among their top common venue categories, the neighborhoods outside the large clusters can be described by venues like Yoga Studio, Pool or Park. This suggests that these neighborhoods could be suitable candidates for our business purpose. Because these neighborhoods don't form larger clusters, it's likely that there aren't any other neighborhoods where sport facilities "win". On the other hand, it doesn't necessarily mean that neighborhoods belonging to the large clusters don't have any sport facilities - these can simply be outnumbered by restaurants, fast food places, coffee shops and other much more common venue categories.  
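One way to make this kind of inspection repeatable is to count how many of a neighborhood's top venue categories are sport-related. The sketch below is only an illustration: the `SPORT_CATEGORIES` list is an assumption (the notebook does not define one), and the three rows are copied from the Staten Island K=5 table, truncated to the first five venue columns.

```python
import pandas as pd

# Assumption: which categories count as sport/leisure is a judgment call
SPORT_CATEGORIES = [
    "Yoga Studio", "Gym", "Gym / Fitness Center",
    "Athletics & Sports", "Pool", "Park",
]

# Three rows copied from the Staten Island K=5 table (top five venues only)
top_venues = pd.DataFrame(
    {
        "Neighborhood": ["New Brighton", "Park Hill", "Fox Hills"],
        "1st Most Common Venue": ["Bus Stop", "Bus Stop", "Bus Stop"],
        "2nd Most Common Venue": ["Deli / Bodega", "Hotel", "Deli / Bodega"],
        "3rd Most Common Venue": ["Park", "Coffee Shop", "Cocktail Bar"],
        "4th Most Common Venue": ["Playground", "Gym / Fitness Center", "Bus Station"],
        "5th Most Common Venue": ["Bowling Alley", "Athletics & Sports", "Playground"],
    }
)

# Count sport-related categories among the ranked venue columns of each row
venue_cols = [c for c in top_venues.columns if c.endswith("Most Common Venue")]
top_venues["Sport Venues in Top 5"] = (
    top_venues[venue_cols].isin(SPORT_CATEGORIES).sum(axis=1)
)

print(top_venues[["Neighborhood", "Sport Venues in Top 5"]])
```

Such a count only reflects the ranking of categories, not the absolute number of venues, so it shares the limitation described above: a neighborhood with many gyms can still score zero if restaurants outnumber them.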

Most of the neighborhoods within every borough belong to the two largest clusters. One would expect that there would be a category that covers at least half of the neighborhoods. Therefore, it can be concluded that the largest clusters are too general and are based on smaller contributions from many different venue categories. This leads to the assumption that it would be more beneficial to perform clustering within each borough to reveal patterns in more detail.  


To cluster neighborhoods within individual boroughs, the number of clusters (K) was set to 5 and 8. Let's discuss the boroughs one by one. 

__Bronx (52 neighborhoods).__ With K=5, the 52 neighborhoods in the Bronx were clustered into one big cluster with 45 members and four clusters with one to three members each. Venues like Pizza Place, Donut Shop and Deli/Bodega are typical for the big cluster. Some of the remaining neighborhoods have sport facilities such as Yoga Studio and Pool. In addition, restaurants are not a common venue category here. 
The big cluster with 45 neighborhoods is broken down into smaller clusters when the target number of clusters is set to K=8. The three largest clusters contain 25, 10 and 7 members. However, their most common venues are still venues like Pizza Place, Restaurant and Deli/Bodega. A closer look at the neighborhoods that don't belong to these bigger clusters reveals that they have sport facilities. Examples of such neighborhoods are Country Club, Clason Point and Spuyten Duyvil. Increasing the number of clusters from 5 to 8 has helped to distinguish some of these neighborhoods.

__Brooklyn (70 neighborhoods).__ Clustering of Brooklyn neighborhoods into 5 clusters leads to one big cluster with almost all neighborhoods (66) and four single-member clusters. In general, sport facilities are not very common in neighborhoods belonging to the largest cluster. The common venues mostly include Pizza Place, different types of Restaurants, and Coffee Shops. On the other hand, the neighborhoods not belonging to the large cluster would probably make good candidates for the healthy food shop because of venues like Pool, Gym or Spa.
Using K=8 leads to two clusters with more than 20 members. While sport facilities are not very common in neighborhoods in either of these clusters, the remaining neighborhoods have sport facilities among the top five most abundant venue categories. As with K=5, there are neighborhoods outside these two large clusters that could make good candidates for the healthy food shop because of sport and leisure time places. Examples of such neighborhoods are Sea Gate, Paerdegat Basin and Bergen Beach. 

__Manhattan (40 neighborhoods).__ Because of the lower number of neighborhoods in this borough, only clustering with K=5 was performed. Compared with other boroughs, Manhattan neighborhoods are much more evenly distributed among the clusters. The largest cluster contains 15 neighborhoods with venues such as Restaurant (Italian, American, Indian) and Café/Coffee Shop. Sport-related venues are not very common. A similar observation applies to the second largest cluster with 14 neighborhoods. On the other hand, the third largest cluster (6 neighborhoods) could be described by venue categories like Coffee Shop, Sandwich Place, Park and Gym. Although Coffee Shop and Sandwich Place rank above sport and leisure time places, the neighborhoods belonging to this cluster could be good candidates for our business purpose. 

__Queens (81 neighborhoods).__ Cluster analysis with K=8 doesn't look like an improvement over K=5. Therefore, we will look only at clustering with K=5. Almost all neighborhoods belong to two large clusters, whose neighborhoods have mostly non-sport venues (Deli/Bodega, Pizza Place, Restaurants of different types, Bakery, Donut Shop). In contrast, neighborhoods outside these two clusters could be good candidates because of some sport and leisure time facilities. 

__Staten Island (63 neighborhoods).__ Similarly to Queens, clustering with K=8 is not a significant improvement over clustering with K=5, so we will discuss only the results for K=5. The two largest clusters contain 46 and 8 members. While neighborhoods from the biggest cluster, with venues such as Italian Restaurant, Pizza Place and Deli/Bodega, wouldn't make the best candidates for our business idea of a healthy food shop, neighborhoods belonging to the second largest cluster or to other clusters with venues such as Gym and Yoga Studio would fit much better (examples of such neighborhoods: Grymes Hill, Park Hill). 

## Conclusion <a name="conclusion"></a>

In this study, neighborhoods of New York City have been segmented and clustered in order to identify the best candidates for a business plan - opening a healthy food store in a new area. Neighborhoods were clustered based on the similarity and abundance of venues belonging to different categories using the standard K-Means clustering algorithm. Two approaches were chosen: 1) clustering of all neighborhoods at the same time (with the target number of clusters K=5 and K=10), and 2) clustering of neighborhoods in every borough separately (with K=5 and K=8). The obtained clusters were examined in detail to gain insights into the common features of the neighborhoods. 
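To make "similarity and abundance of venues" concrete, the sketch below shows one common way to turn a venue list into per-neighborhood category frequencies that K-Means can consume. It is an assumption about the preprocessing, not a copy of the notebook's code: the neighborhood names `Hood A` and `Hood B` and their venues are made up for illustration.

```python
import pandas as pd

# Hypothetical venue list: one row per venue found in a neighborhood
venues = pd.DataFrame(
    {
        "Neighborhood": ["Hood A", "Hood A", "Hood A", "Hood A", "Hood B", "Hood B"],
        "Venue Category": [
            "Pizza Place", "Pizza Place", "Deli / Bodega", "Gym",
            "Yoga Studio", "Park",
        ],
    }
)

# One-hot encode the categories, then average per neighborhood so that
# each row holds the relative frequency of every venue category
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
grouped = onehot.groupby("Neighborhood").mean()

print(grouped)
```

Each row of `grouped` sums to 1, so two neighborhoods are "similar" when their mix of categories is similar, regardless of how many venues each one has in total.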

Using all neighborhoods for clustering (that is, irrespective of the boroughs) gives a high-level overview of the neighborhoods. With both K=5 and K=10, the neighborhoods are dominated by one big cluster that can be characterized by the most frequent venue categories. Although we can identify some neighborhoods whose features align with the business plan of starting a healthy food store, it is possible that other suitable neighborhoods remain hidden because their profiles are dominated by venues that are not of interest to us. On the other hand, a further increase of K could help to distinguish the relevant neighborhoods. 
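Whether a larger K actually breaks up the dominant cluster can be checked by tracking the share of rows in the largest cluster as K grows. The following sketch assumes scikit-learn's `KMeans` and uses a random synthetic matrix as a stand-in for the neighborhood-by-category frequency table, so the printed numbers say nothing about New York City; only the procedure is of interest.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the frequency table: 60 rows, 8 categories,
# each row non-negative and summing to 1
X = rng.dirichlet(np.ones(8), size=60)

largest_share = {}
for k in (5, 8, 10, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    counts = np.bincount(labels, minlength=k)        # cluster sizes
    largest_share[k] = counts.max() / len(labels)    # share of biggest cluster

for k, share in largest_share.items():
    print(f"K={k}: largest cluster holds {share:.0%} of rows")
```

If the share of the largest cluster stays high as K increases, as it did for Queens and Staten Island, a larger K alone is unlikely to help and a different feature set may be needed.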

Cluster analysis of neighborhoods within boroughs was performed with K=5 and K=8. In general, the largest clusters within each borough are similar - they mostly include venues like restaurants, pizza places, sandwich places, coffee shops/cafes. We didn't identify any neighborhood cluster that could be defined as a sport/leisure time type of cluster. However, we have found some smaller clusters or individual neighborhoods that do not fall under the common category "Restaurant/Pizza/Coffee". These would be the appropriate candidates to open a healthy food store, based on the criteria defined above. 

The study provides a basic understanding of neighborhood segmentation in New York City and aims to determine the best neighborhoods for the defined business plan. However, other sources of data and more detailed analyses would be required to gain a better understanding of the problem. For example, it would be beneficial to use data on population density in the area or data describing the character of the area (industrial, business, residential). In addition, a different algorithm or a different selection of features could lead to a better understanding of neighborhood similarities and segmentation.