Let's first import all of the necessary libraries. If we miss one, I'll import it later.

In [1]:
import numpy as np
import pandas as pd
import json

from geopy.geocoders import Nominatim

import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium

print('Libraries imported.')

Libraries imported.


Thanks to the Capstone Labs provided on Coursera, we already know how to download our New York dataset.

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


Since I have limited experience with JSON, I'm going to load the downloaded file into a Python dictionary in my Jupyter lab so I can inspect it and determine where to look for the data I want.

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
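
One way to find your way around an unfamiliar JSON document is to print its keys and value types a few levels deep. The sketch below is only an illustration: `describe` is a helper I made up for this purpose, and `sample` is a small stand-in shaped like the real file, not the actual data.

```python
def describe(obj, depth=0, max_depth=2):
    """Return indented lines describing the keys and value types of nested JSON."""
    lines = []
    indent = '  ' * depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append('{}{}: {}'.format(indent, key, type(value).__name__))
            if depth < max_depth:
                lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # only look at the first element; the rest usually share its shape
        lines.append('{}[0]: {} ({} items)'.format(indent, type(obj[0]).__name__, len(obj)))
        if depth < max_depth:
            lines.extend(describe(obj[0], depth + 1, max_depth))
    return lines

# made-up stand-in with the same general shape as newyork_data
sample = {'type': 'FeatureCollection',
          'features': [{'type': 'Feature',
                        'geometry': {'type': 'Point', 'coordinates': [-73.85, 40.89]},
                        'properties': {'name': 'Wakefield', 'borough': 'Bronx'}}]}

print('\n'.join(describe(sample)))
```

In the notebook, `describe(newyork_data)` would give the same kind of overview for the real file.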

The neighborhoods are in the *features* key, so we'll extract the features and assign it to a variable.

In [4]:
neighborhoods_data = newyork_data['features']

# Look at the first feature, which is the Wakefield neighborhood
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

Here I'll create an empty dataframe with the column names we'll want.

In [5]:
# dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# create dataframe
neighborhoods = pd.DataFrame(columns=column_names)

neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Next we'll loop through our json 'features' to fill the dataframe per our columns.

In [6]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for feature in neighborhoods_data:
    borough = feature['properties']['borough']
    neighborhood_name = feature['properties']['name']

    latitude = feature['geometry']['coordinates'][1]
    longitude = feature['geometry']['coordinates'][0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': latitude,
                 'Longitude': longitude})

neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Since I'll only be working with the borough of Manhattan for this exercise, I'm going to filter out all of the other borough information.

In [45]:
manhattan = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [8]:
print('There are {} neighborhoods in our Manhattan dataset.'.format(
        manhattan.shape[0]
        )
     )

There are 40 neighborhoods in our Manhattan dataset.


It's now time to use the Nominatim geocoder to get the coordinates of Manhattan.

In [9]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent='nyc_agent') # unique user agent
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Coordinates of NYC are {}, {}'.format(latitude, longitude))

Coordinates of NYC are 40.7900869, -73.9598295


Now that we have the coordinates, we can create a map centered on Manhattan that shows all the neighborhoods with labels.

In [10]:
# use folium to create NYC map
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add neighborhood markers
for lat, lng, borough, neighborhood in zip(manhattan['Latitude'],
                                           manhattan['Longitude'],
                                           manhattan['Borough'],
                                           manhattan['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)
    
map_manhattan

I'll now use my Foursquare developer credentials to create my GET request. In the next cell, I chose to save the resulting json to a file so that I can look for the data I'll need to parse (as I did with the previous json).

In [12]:
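# CLIENT_ID, CLIENT_SECRET and VERSION must already be defined (credentials are not shown here)
# lat and lng still hold the last neighborhood from the marker loop above
radius = 500
limit = 100
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)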

results = requests.get(url).json()

with open('manhattan.json', 'w') as file:
    json.dump(results, file)
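
As an aside, the query string can also be assembled from a dictionary instead of a long format string, which keeps each parameter next to its name. This is only a sketch: `build_explore_url` is a hypothetical helper, and the credentials and version string below are placeholders, not real values.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL with URL-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# placeholder credentials and version, for illustration only
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 40.876551, -73.91066)
print(example_url)
```

Note that `urlencode` escapes the comma in `ll` as `%2C`, which is the standard encoding for that character.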

With the function used in the Coursera lab, we'll call the GET request and also assign the nearby venues to a dataframe.

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # form our url
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
    
        # only add name, latitude, longitude, venue name for each venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    
    # add venues to a new dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
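
One fragile spot in the function above is `v['venue']['categories'][0]['name']`, which would raise an `IndexError` if a venue ever came back with an empty category list. I haven't verified how often that happens, so treat the following as a defensive sketch; `venue_row` is a hypothetical helper and the `items` list is made up to mimic the response shape used above.

```python
def venue_row(name, lat, lng, item):
    """Flatten one item of the explore response into a tuple for the dataframe."""
    venue = item['venue']
    categories = venue.get('categories') or []
    # fall back to None when a venue has no category attached
    category = categories[0]['name'] if categories else None
    return (name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category)

# made-up items shaped like the entries in response -> groups[0] -> items
items = [
    {'venue': {'name': "Arturo's",
               'location': {'lat': 40.874412, 'lng': -73.910271},
               'categories': [{'name': 'Pizza Place'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 40.876, 'lng': -73.91},
               'categories': []}},
]
venue_rows = [venue_row('Marble Hill', 40.876551, -73.91066, item) for item in items]
print(venue_rows)
```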

In [14]:
manhattan_venues = getNearbyVenues(names=manhattan['Neighborhood'],
                                   latitudes=manhattan['Latitude'],
                                   longitudes=manhattan['Longitude']
                                   )

In [90]:
print(manhattan_venues.shape)
manhattan_venues.head()

(3317, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop
4,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop


Let's see how many venues we get for each neighborhood.

In [72]:
manhattan_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Battery Park City,100,100,100,100,100,100
Carnegie Hill,100,100,100,100,100,100
Central Harlem,44,44,44,44,44,44
Chelsea,100,100,100,100,100,100
Chinatown,100,100,100,100,100,100
Civic Center,100,100,100,100,100,100
Clinton,100,100,100,100,100,100
East Harlem,44,44,44,44,44,44
East Village,100,100,100,100,100,100
Financial District,100,100,100,100,100,100


In the next few cells, I'm going to subset our Manhattan venues, keeping only those whose category falls under Foursquare's 'Outdoors and Recreation' group, plus a few additional categories that may fit what we're looking for.

In [71]:
# create list of all venue types in Manhattan
venue_cat_list = manhattan_venues['Venue Category'].unique().tolist()
venue_cat_list.sort()

In [18]:
# new url that calls just the venue category types
url2 = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
       CLIENT_ID, CLIENT_SECRET, VERSION)

# GET request
get_venues = requests.get(url2).json()

# save new json to file to find what data I'm looking for
with open('venues.json', 'w') as file:
    json.dump(get_venues, file)

In [19]:
# assign just the 'Outdoors and Recreation' data to a variable
outdoors_rec = get_venues['response']['categories'][5]['categories']
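
The index `[5]` depends on the order in which Foursquare returns its top-level categories. A more defensive option is to look the group up by name. The sketch below assumes the group is called 'Outdoors & Recreation' in the response, and `find_category` and `top_level` are made up for illustration.

```python
def find_category(categories, name):
    """Return the sub-categories of the top-level category called `name`."""
    for category in categories:
        if category['name'] == name:
            return category['categories']
    raise KeyError('No top-level category named {!r}'.format(name))

# made-up stand-in for get_venues['response']['categories']
top_level = [
    {'name': 'Food', 'categories': [{'name': 'Pizza Place'}]},
    {'name': 'Outdoors & Recreation',
     'categories': [{'name': 'Park'}, {'name': 'Playground'}]},
]
outdoors = find_category(top_level, 'Outdoors & Recreation')
print([c['name'] for c in outdoors])
```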

In [20]:
# transform json to pandas dataframe
rec_venues = json_normalize(outdoors_rec)

# assign venue category names to a list
rec_venues = rec_venues['name'].tolist()

# list additional venue categories
venue_add = ['Disc Golf', 'Mini Golf', 'Roller Rink', 'Theme Park', 'Zoo', 'Water Park', 'Stadium']

# combine category list
rec_venues.extend(venue_add)

In [21]:
# assign categories in Manhattan venues list and outdoor/rec list to subset
subset = list(set(rec_venues).intersection(venue_cat_list))

In [146]:
# filter for rows that have the venue category we want
manhattan_rec = manhattan_venues[manhattan_venues['Venue Category'].isin(subset)].reset_index(drop=True)

print(manhattan_rec.shape)
manhattan_rec.head(5)

(157, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Chinatown,40.715618,-73.994279,The Crown,40.71596,-73.996717,Roof Deck
1,Washington Heights,40.851903,-73.9369,Highest Natural Point In Manhattan,40.852843,-73.93765,Park
2,Washington Heights,40.851903,-73.9369,Bennett Park,40.852967,-73.937874,Park
3,Washington Heights,40.851903,-73.9369,Chittenden Overlook,40.85503,-73.939362,Scenic Lookout
4,Washington Heights,40.851903,-73.9369,Highbridge Park Pool,40.84911,-73.936839,Pool


We now have a pandas dataframe that lists only the venues in the categories we want, so it's time to set up our data to be analyzed.

In [32]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_rec[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column
manhattan_onehot['Neighborhood'] = manhattan_rec['Neighborhood']

# move neighborhood column to first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Athletics & Sports,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park,Playground,Plaza,Pool,Rock Climbing Spot,Roof Deck,Scenic Lookout,Sculpture Garden,Stables,Trail,Tree,Waterfront
0,Chinatown,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,Washington Heights,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,Washington Heights,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,Washington Heights,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4,Washington Heights,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


Next, we group the rows by neighborhood and take the mean frequency of each category's occurrence.

In [34]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped.head()

Unnamed: 0,Neighborhood,Athletics & Sports,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park,Playground,Plaza,Pool,Rock Climbing Spot,Roof Deck,Scenic Lookout,Sculpture Garden,Stables,Trail,Tree,Waterfront
0,Battery Park City,0.058824,0.0,0.0,0.0,0.058824,0.0,0.0,0.470588,0.117647,0.117647,0.0,0.0,0.0,0.058824,0.058824,0.0,0.0,0.058824,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Harlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.333333,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0
4,Chinatown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
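
To make the numbers in this table concrete: the mean of a one-hot column is simply the share of a neighborhood's venues that fall in that category. Here is a small check in plain Python for Chelsea, whose three recreation venues are one Park, one Pool, and one Scenic Lookout according to the table above. `category_frequencies` is a helper written just for this illustration.

```python
from collections import Counter

def category_frequencies(categories):
    """Share of venues per category, i.e. the mean of each one-hot column."""
    counts = Counter(categories)
    total = len(categories)
    return {category: count / total for category, count in counts.items()}

# Chelsea's three recreation venues, one per category
chelsea = ['Park', 'Pool', 'Scenic Lookout']
freqs = category_frequencies(chelsea)
print({category: round(freq, 6) for category, freq in freqs.items()})
# {'Park': 0.333333, 'Pool': 0.333333, 'Scenic Lookout': 0.333333}
```

Each value is 1/3, which matches the 0.333333 entries in Chelsea's row.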


We'll borrow another function from the Coursera lab to sort each neighborhood's venue categories in descending order of frequency and store them in a new dataframe.

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [179]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Park,Plaza,Playground,Tree,Garden,Athletics & Sports,Scenic Lookout,Sculpture Garden,Roof Deck,Rock Climbing Spot
1,Carnegie Hill,Playground,Waterfront,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park,Plaza
2,Central Harlem,Park,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
3,Chelsea,Park,Scenic Lookout,Pool,Waterfront,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf
4,Chinatown,Roof Deck,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park
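
One thing to keep in mind when reading this table: a neighborhood like Carnegie Hill has a frequency of 1.0 for Playground and 0.0 for everything else, so its 2nd through 10th "most common" venues are just zero-frequency ties. A possible variation, sketched here in plain Python with a hypothetical `top_categories` helper, is to rank only the categories that actually occur.

```python
def top_categories(frequencies, num_top_venues=10):
    """Rank categories by frequency, dropping those that never occur."""
    present = [(cat, freq) for cat, freq in frequencies.items() if freq > 0]
    # sort by frequency (descending), then by name so ties are reproducible
    present.sort(key=lambda pair: (-pair[1], pair[0]))
    return [cat for cat, _ in present[:num_top_venues]]

# Carnegie Hill's row from manhattan_grouped, shortened to four columns
carnegie_hill = {'Park': 0.0, 'Playground': 1.0, 'Plaza': 0.0, 'Waterfront': 0.0}
print(top_categories(carnegie_hill))  # ['Playground']
```

This is not what the notebook does; it only shows how short some of these rankings really are.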


# Cluster neighborhoods

Cluster into five clusters using *k*-means.
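
For reference, *k*-means looks for the cluster assignment that minimizes the within-cluster sum of squared distances. The notation is mine rather than the lab's: $x_i$ is a neighborhood's row of category frequencies, $C_j$ is the set of neighborhoods assigned to cluster $j$, and $\mu_j$ is that cluster's centroid.

$$
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

Here $k = 5$. After fitting, scikit-learn exposes the value of this sum as the `inertia_` attribute of the `KMeans` object.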

In [178]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means on our dataset
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(manhattan_grouped_clustering)

kmeans.labels_[:10]

array([0, 2, 4, 0, 3, 0, 0, 4, 0, 0], dtype=int32)

In [180]:
# add cluster labels to the dataframe
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# we want to add our new information to our original manhattan dataset
manhattan_merged = manhattan

# join datasets so lat/long for each neighborhood is shown
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,,,,,,,,,,,
1,Manhattan,Chinatown,40.715618,-73.994279,3.0,Roof Deck,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park
2,Manhattan,Washington Heights,40.851903,-73.9369,0.0,Park,Plaza,Scenic Lookout,Pool,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf
3,Manhattan,Inwood,40.867684,-73.92121,0.0,Park,Playground,Dog Run,Waterfront,Bike Trail,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
4,Manhattan,Hamilton Heights,40.823604,-73.949688,4.0,Park,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza


In [181]:
# remove rows with cluster labels that are 'NaN'
remove = ['Marble Hill', 'Soho']
manhattan_merged = manhattan_merged[~manhattan_merged['Neighborhood'].isin(remove)]
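
Hard-coding the two names works for this particular run. If the Foursquare results change, a more general option is to drop whichever rows have no cluster label. The frame below is a tiny made-up stand-in for `manhattan_merged`, so this is a sketch rather than the cell I actually ran.

```python
import pandas as pd

# tiny made-up frame with two of the columns of manhattan_merged
merged = pd.DataFrame({
    'Neighborhood': ['Marble Hill', 'Chinatown', 'Inwood'],
    'Cluster Labels': [None, 3.0, 0.0],
})

# drop neighborhoods that received no cluster label, then restore integer labels
labelled = merged.dropna(subset=['Cluster Labels']).copy()
labelled['Cluster Labels'] = labelled['Cluster Labels'].astype(int)
print(labelled)
```

Converting the labels back to integers also avoids the `3.0`-style labels that show up in the popups on the map.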

In [204]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's take a look at our resulting clusters.

In [183]:
cluster_1 = manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
cluster_1

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Washington Heights,Park,Plaza,Scenic Lookout,Pool,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf
3,Inwood,Park,Playground,Dog Run,Waterfront,Bike Trail,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
8,Upper East Side,Plaza,Playground,Sculpture Garden,Park,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf
9,Yorkville,Park,Athletics & Sports,Dog Run,Pool,Playground,Bike Trail,Fountain,Garden,Harbor / Marina,Mini Golf
11,Roosevelt Island,Waterfront,Playground,Dog Run,Scenic Lookout,Park,Bike Trail,Fountain,Garden,Harbor / Marina,Mini Golf
12,Upper West Side,Trail,Dog Run,Garden,Waterfront,Playground,Bike Trail,Fountain,Harbor / Marina,Mini Golf,Park
13,Lincoln Square,Plaza,Park,Roof Deck,Dog Run,Fountain,Playground,Athletics & Sports,Rock Climbing Spot,Pool,Tree
14,Clinton,Dog Run,Rock Climbing Spot,Park,Waterfront,Playground,Bike Trail,Fountain,Garden,Harbor / Marina,Mini Golf
15,Midtown,Plaza,Park,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Waterfront
16,Murray Hill,Plaza,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park,Waterfront


In [184]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Little Italy,Garden,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Harbor / Marina,Mini Golf,Park,Plaza
31,Noho,Garden,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Harbor / Marina,Mini Golf,Park,Plaza


In [185]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Lenox Hill,Playground,Waterfront,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park,Plaza
18,Greenwich Village,Playground,Waterfront,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park,Plaza
30,Carnegie Hill,Playground,Waterfront,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park,Plaza


In [186]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Chinatown,Roof Deck,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park
38,Flatiron,Roof Deck,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Park


In [187]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Hamilton Heights,Park,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
5,Manhattanville,Park,Bike Trail,Waterfront,Playground,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
6,Central Harlem,Park,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
7,East Harlem,Park,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
20,Lower East Side,Park,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
26,Morningside Heights,Park,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza
34,Sutton Place,Park,Waterfront,Playground,Bike Trail,Dog Run,Fountain,Garden,Harbor / Marina,Mini Golf,Plaza


# Results and Discussion

It appears that subsetting our venue dataset to include only outdoors and recreation venues did not help us determine which neighborhoods are the 'healthiest' in this sense. Cluster 1 is our only meaningful cluster. Within each of Clusters 2 through 5, the neighborhoods share, at the very least, the same 5th through 10th most common venue types. For neighborhoods such as Carnegie Hill or Chinatown, the frequency table shows that everything after the first rank has a frequency of zero, so those trailing ranks are tie-break filler rather than real venues. This indicates to me that the neighborhoods in those clusters each have very few venues AND happen to share them.

Let's take a closer look at our remaining cluster to see what it contains.

In [130]:
cluster_1_neighborhoods = cluster_1['Neighborhood'].tolist()
cluster_1_neighborhoods

['Washington Heights',
 'Inwood',
 'Upper East Side',
 'Yorkville',
 'Roosevelt Island',
 'Upper West Side',
 'Lincoln Square',
 'Clinton',
 'Midtown',
 'Murray Hill',
 'Chelsea',
 'East Village',
 'Tribeca',
 'West Village',
 'Manhattan Valley',
 'Gramercy',
 'Battery Park City',
 'Financial District',
 'Civic Center',
 'Midtown South',
 'Turtle Bay',
 'Tudor City',
 'Stuyvesant Town',
 'Hudson Yards']

In [202]:
# subset manhattan_rec with neighborhoods in Cluster 1 and count venues
rec_venue_counts = manhattan_rec[manhattan_rec['Neighborhood'].
                   isin(cluster_1_neighborhoods)].groupby('Neighborhood').count()
rec_venue_counts

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Battery Park City,17,17,17,17,17,17
Chelsea,3,3,3,3,3,3
Civic Center,5,5,5,5,5,5
Clinton,4,4,4,4,4,4
East Village,2,2,2,2,2,2
Financial District,5,5,5,5,5,5
Gramercy,6,6,6,6,6,6
Hudson Yards,6,6,6,6,6,6
Inwood,4,4,4,4,4,4
Lincoln Square,13,13,13,13,13,13


I was curious to find that some neighborhoods in this cluster still have only 1, 2, or 3 venues. Because the algorithm grouped them together nonetheless, it seems that the few venues they do have match the top venue types in the other neighborhoods.

Looking at this data, it's easy to say something like, "Why not just look at the neighborhoods that have the most venues in the outdoors and recreation category?" and I'd have to agree with that sentiment. The clustering certainly didn't make for very meaningful analysis in this case, as we could have simply chosen Battery Park City with its 17 venues. However, it's good to know that there are many other neighborhoods with a substantial number of recreational activities to choose from. Many if not all of these neighborhoods can support a healthy lifestyle in Manhattan.
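
That simpler approach takes only a couple of lines. The sketch below reuses the counts from the first rows of `rec_venue_counts` shown earlier; the Series is typed in by hand here, so it covers only those ten neighborhoods.

```python
import pandas as pd

# counts copied from the first rows of rec_venue_counts shown earlier
venue_counts = pd.Series({
    'Battery Park City': 17, 'Chelsea': 3, 'Civic Center': 5, 'Clinton': 4,
    'East Village': 2, 'Financial District': 5, 'Gramercy': 6,
    'Hudson Yards': 6, 'Inwood': 4, 'Lincoln Square': 13,
})

# the two neighborhoods with the most outdoors and recreation venues
top_two = venue_counts.sort_values(ascending=False).head(2)
print(top_two)
```

On the full data, `manhattan_rec['Neighborhood'].value_counts()` would produce the complete ranking directly, since `value_counts` sorts in descending order by default.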

Just to have a bit more closure, let's take a look at the neighborhood with the most venues, Battery Park City, and see what other context we can gather to indicate its health level.

Here's an excerpt from the Health section of Battery Park City's Wikipedia page (__[Battery Park City](https://en.wikipedia.org/wiki/Battery_Park_City#Health)__):

> The concentration of fine particulate matter, the deadliest type of air pollutant, in Battery Park City and Lower Manhattan is 0.0096 milligrams per cubic metre (9.6×10−9 oz/cu ft), more than the city average. Sixteen percent of Battery Park City and Lower Manhattan residents are smokers, which is more than the city average of 14% of residents being smokers. In Battery Park City and Lower Manhattan, 4% of residents are obese, 3% are diabetic, and 15% have high blood pressure, the lowest rates in the city—compared to the citywide averages of 24%, 11%, and 28% respectively. In addition, 5% of children are obese, the lowest rate in the city, compared to the citywide average of 20%.<br></br>
Ninety-six percent of residents eat some fruits and vegetables every day, which is more than the city's average of 87%. In 2018, 88% of residents described their health as "good," "very good," or "excellent," more than the city's average of 78%. For every supermarket in Battery Park City and Lower Manhattan, there are 6 bodegas.

So we have some bad with the good in Battery Park City. People seem to eat healthier, and there are lower rates of obesity, diabetes, and high blood pressure than elsewhere in the city. This could be aided in some way by the higher number of outdoors and recreation venues. However, there is a higher rate of smoking as well as a higher concentration of air pollutants than the city average. If it were me looking for a new place to live, as a runner, I would have a hard time with the air pollutants and smoking, so it really comes down to personal preference here. Luckily there are plenty of other options from which to choose.