# Capstone Project - The Battle of Neighborhoods

## Introduction

For this project, we have a client who is interested in establishing a pizza shop somewhere in New York. He wants to find an optimal spot where he can find customers and where there are not too many pizza shops around.

His last shop didn't do too well because there were many pizza shops around the block and not many people visited.

The issue is that New York City is known for its restaurants, so finding an optimal place will be difficult. This analysis will use clustering to find an optimal location.

## Data

For this analysis we will be using two sources to get our data:

1. NYU: (https://geo.nyu.edu/catalog/nyu_2451_34572)
2. Foursquare

NYU has a JSON file that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.

We will use Foursquare's API to determine the location of pizza shops in the given boroughs.



## Methodology

Once the data is collected and put into a pandas dataframe, I will:

- Clean the data
- Only take a subset of the boroughs, i.e. Manhattan
- Visualize the rows we have on a map
- Compute some summary statistics and counts

We also need to use Foursquare's data to get the venues:

- Collect data 
- Visualize venues on a map
- Run statistics to see counts of venue categories
- Even though we only care about pizza as a category, other categories might pop up

Machine Learning Algorithm:

The machine learning algorithm I am running is KMeans clustering. I have decided to use 5 as the number of clusters. Each data point is then assigned to one of the clusters.

I am using this algorithm because I want to determine an optimal location for the pizza shop. The cluster where pizza shops are least concentrated will be the best place.
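
As a minimal sketch of the clustering step, assuming scikit-learn is available, the snippet below groups a handful of made-up feature rows and counts how many rows land in each cluster. The feature values, the two-column layout, and `n_clusters=2` are illustrative assumptions and do not come from the project data.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical feature rows: [share of pizza places, share of Italian restaurants]
features = np.array([
    [0.95, 0.05],
    [0.92, 0.08],
    [0.90, 0.10],
    [0.60, 0.40],
    [0.55, 0.45],
])

# fit k-means and read back the cluster assigned to each row
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_

# number of rows assigned to each cluster
cluster_sizes = np.bincount(labels, minlength=2)
print(labels, cluster_sizes)
```

The cluster numbers themselves are arbitrary, so only the grouping of rows is meaningful, not which group is called 0 or 1.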


## Results

Based on the dataframe analysis in the Code/Analysis section, the Cluster 3 (Upper West Side) and Cluster 2 (Morningside Heights) areas are the best places to open a new pizza shop.



## Discussion

If I had more time for this project, I would find an optimal value for k instead of just picking k = 5. Five clusters could be optimal, but I would have run some tests to see whether it was the best choice.
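
One common way to run such a test is to compare silhouette scores across candidate values of k. The sketch below is only an illustration of that idea: it uses synthetic points in place of the neighborhood features, and the range of k values is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in for the neighborhood feature matrix (three loose groups)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# fit k-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# the k with the highest score is the best-separated clustering among the candidates
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On the real data the same loop would be run on `manhattan_grouped_clustering`, and the result would still need a sanity check against the map.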

Some things to note that are mentioned in the code:

- When visualizing the venues on a map, it appears that 1667 is too many points to render
- However, when I took the first 1500 rows I was able to see them on a map
- The analysis was performed on a limited amount of data


## Conclusion

After we ran the machine learning algorithm, we were able to determine which location is optimal for setting up a pizza shop. There is some room for improvement, either in collecting more data or in tuning the parameter k to see which value works best.


## Code/Analysis

In [9]:
import json 

with open('nyu-2451-34572-geojson.json') as json_data:
    newyork_data = json.load(json_data)

Transforming Data into Pandas df

In [11]:
import pandas as pd

neighborhoods_data = newyork_data['features']
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

When converting the JSON file into a dataframe:
- There are 306 rows
- There are 4 columns: Borough, Neighborhood, Latitude and Longitude

The good thing about this dataset is that it already includes the latitude and longitude of each neighborhood, so there is no need to geocode them.

In [13]:
print(neighborhoods.head())
print(neighborhoods.shape)

  Borough Neighborhood   Latitude  Longitude
0   Bronx    Wakefield  40.894705 -73.847201
1   Bronx   Co-op City  40.874294 -73.829939
2   Bronx  Eastchester  40.887556 -73.827806
3   Bronx    Fieldston  40.895437 -73.905643
4   Bronx    Riverdale  40.890834 -73.912585
(306, 4)


Use the geopy library to get the latitude and longitude values of New York City.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

In [60]:
from geopy.geocoders import Nominatim

address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


To make this project more feasible, we will only look at the borough **Manhattan** and the neighborhoods inside it.

In [15]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


Create a map of New York with neighborhoods superimposed on top.


In [71]:
import folium
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Borough'], manhattan_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_newyork)
    
map_newyork

## Foursquare Venues

We will use Foursquare's API to retrieve nearby venues in New York.

In [28]:
import requests

# CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT are assumed to be defined in an earlier cell


def getNearbyVenues(names, latitudes, longitudes, radius=5000, categoryIds=''):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)

        # optionally restrict the search to the given category ids
        if categoryIds != '':
            url = url + '&categoryId={}'.format(categoryIds)

        # make the GET request
        response = requests.get(url).json()
        try:
            results = response['response']['venues']
        except KeyError:
            # report the unexpected response and move on to the next neighborhood
            print(name)
            print(response)
            continue

        # return only relevant information for each nearby venue
        for v in results:
            # skip venues that have no category
            if not v.get('categories'):
                continue
            venues_list.append((
                name,
                lat,
                lng,
                v['name'],
                v['location']['lat'],
                v['location']['lng'],
                v['categories'][0]['name']
            ))

    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Neighborhood',
        'Neighborhood Latitude',
        'Neighborhood Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues

Using Foursquare's API, we were able to find 1667 venues when searching with the pizza place category id. A few of the returned venues carry a different primary category, such as Bagel Shop.
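
To make the parsing step in `getNearbyVenues` easier to follow, here is a small sketch that runs the same extraction on a hand-written response. The `sample_response` dictionary and the `extract_rows` helper are hypothetical; they only mirror the keys that the function above reads (`response`, `venues`, `name`, `location`, `categories`), and a real response carries many more fields.

```python
# hand-written stand-in for a search response (not real API output)
sample_response = {
    "response": {
        "venues": [
            {"name": "Example Pizza",
             "location": {"lat": 40.0, "lng": -73.0},
             "categories": [{"name": "Pizza Place"}]},
            {"name": "Uncategorized Spot",
             "location": {"lat": 40.1, "lng": -73.1},
             "categories": []},
        ]
    }
}


def extract_rows(response, neighborhood):
    """Return (neighborhood, venue, lat, lng, category) tuples, skipping venues without a category."""
    rows = []
    for v in response.get("response", {}).get("venues", []):
        categories = v.get("categories") or []
        if not categories:
            continue
        rows.append((neighborhood, v["name"], v["location"]["lat"],
                     v["location"]["lng"], categories[0]["name"]))
    return rows


rows = extract_rows(sample_response, "Marble Hill")
print(rows)
```

Only the first category of each venue is kept, which is why a venue returned by a pizza search can still show up under another category name.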

In [69]:
#pizza place category id = 4bf58dd8d48988d1ca941735

neighborhoods = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
newyork_venues_pizza = getNearbyVenues(names=neighborhoods['Neighborhood'], latitudes=neighborhoods['Latitude'], longitudes=neighborhoods['Longitude'], radius=1000, categoryIds='4bf58dd8d48988d1ca941735')
newyork_venues_pizza.head()

,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Ray's Pizza Express,40.878901,-73.916558,Pizza Place
2,Marble Hill,40.876551,-73.91066,Sam's Pizza,40.879435,-73.905859,Pizza Place
3,Marble Hill,40.876551,-73.91066,Cafeccino Bakery,40.880068,-73.907064,Bagel Shop
4,Marble Hill,40.876551,-73.91066,Supreme Pizza & Pasta,40.881001,-73.908966,Pizza Place


In [30]:
newyork_venues_pizza.shape

(1667, 7)

For a reason I could not determine, the map would not render all of the points.

Instead, I took only the first 1500 rows to see the pizza venues in New York.
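
Because the venue rows are ordered by neighborhood, taking the first rows leaves out the neighborhoods at the end of the table. A random sample is one possible alternative. The sketch below contrasts the two on a made-up table; the `venues` dataframe and its values are hypothetical stand-ins for `newyork_venues_pizza`.

```python
import pandas as pd

# hypothetical venue table, ordered by neighborhood like the real one
venues = pd.DataFrame({
    "Neighborhood": ["A"] * 6 + ["B"] * 4,
    "Venue": ["venue_{}".format(i) for i in range(10)],
})

# head() keeps the leading rows, so here it only covers neighborhood A
first_rows = venues.head(6)

# sample() draws rows without replacement from the whole table
random_rows = venues.sample(n=6, random_state=0)

print(first_rows["Neighborhood"].unique())
print(random_rows["Neighborhood"].value_counts())
```

In the notebook this would correspond to `newyork_venues_pizza.sample(n=1500, random_state=0)` in place of `head(1500)`.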

In [92]:

sample = newyork_venues_pizza.head(1500)

map_newyork_pizza = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, venue, neighborhood in zip(sample['Venue Latitude'], sample['Venue Longitude'], sample['Venue'], sample['Neighborhood']):
    label = '{}, {}'.format(venue, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_newyork_pizza)  
    
map_newyork_pizza


In [39]:
def addColumn(startDf, columnTitle, dataDf):
    grouped = dataDf.groupby('Neighborhood').count()
    
    for n in startDf['Neighborhood']:
        try:
            startDf.loc[startDf['Neighborhood'] == n,columnTitle] = grouped.loc[n, 'Venue']
        except KeyError:
            # neighborhood has no venues in dataDf
            startDf.loc[startDf['Neighborhood'] == n,columnTitle] = 0

In [41]:
manhattan_grouped = newyork_venues_pizza.groupby('Neighborhood').count()
manhattan_grouped

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Battery Park City,44,44,44,44,44,44
Carnegie Hill,50,50,50,50,50,50
Central Harlem,21,21,21,21,21,21
Chelsea,48,48,48,48,48,48
Chinatown,50,50,50,50,50,50
Civic Center,50,50,50,50,50,50
Clinton,50,50,50,50,50,50
East Harlem,47,47,47,47,47,47
East Village,50,50,50,50,50,50
Financial District,50,50,50,50,50,50


## Analysis of Neighborhoods 

Creating dummy variables for the dataframe, newyork_venues_pizza
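
As a small illustration of what the dummy variables and the later per-neighborhood mean produce, the sketch below applies the same two steps to a made-up table. The `toy` dataframe and its values are assumptions for demonstration only.

```python
import pandas as pd

# made-up venue rows; the real table comes from getNearbyVenues
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Pizza Place", "Pizza Place", "Pizza Place", "Bagel Shop",
                       "Pizza Place", "Italian Restaurant"],
})

# one column per category, with 1 marking the category of each venue
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# averaging the indicator columns gives each category's share within a neighborhood
shares = onehot.groupby("Neighborhood").mean()
print(shares)
```

The resulting values are shares rather than counts, so a neighborhood with 50 venues and one with 20 can end up with very similar rows.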

In [42]:
# one hot encoding
manhattan_onehot = pd.get_dummies(newyork_venues_pizza[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = newyork_venues_pizza['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

,Neighborhood,American Restaurant,Arcade,Bagel Shop,Bakery,Bar,Boat or Ferry,Burger Joint,Café,Cocktail Bar,...,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Italian Restaurant,Kosher Restaurant,Mexican Restaurant,Pizza Place,Sandwich Place,Sports Bar,Turkish Restaurant
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,Marble Hill,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [43]:

manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

,Neighborhood,American Restaurant,Arcade,Bagel Shop,Bakery,Bar,Boat or Ferry,Burger Joint,Café,Cocktail Bar,...,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Italian Restaurant,Kosher Restaurant,Mexican Restaurant,Pizza Place,Sandwich Place,Sports Bar,Turkish Restaurant
0,Battery Park City,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.022727,0.0,0.045455,0.022727,0.0,0.909091,0.0,0.0,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.08,0.0,0.0,0.92,0.0,0.0,0.0
2,Central Harlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.952381,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.020833,0.0,0.020833,0.916667,0.0,0.0,0.0
4,Chinatown,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.14,0.0,0.0,0.84,0.0,0.0,0.0
5,Civic Center,0.02,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.1,0.0,0.0,0.82,0.0,0.02,0.0
6,Clinton,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.14,0.0,0.0,0.76,0.04,0.0,0.0
7,East Harlem,0.021277,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.021277,0.93617,0.0,0.0,0.0
8,East Village,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.1,0.0,0.0,0.86,0.0,0.0,0.0
9,Financial District,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.02,0.0,0.04,0.02,0.0,0.88,0.0,0.0,0.0


In [44]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [47]:
import numpy as np 

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Battery Park City,Pizza Place,Italian Restaurant,Kosher Restaurant,Gourmet Shop,Turkish Restaurant
1,Carnegie Hill,Pizza Place,Italian Restaurant,Turkish Restaurant,Diner,Arcade
2,Central Harlem,Pizza Place,Deli / Bodega,Turkish Restaurant,Diner,Arcade
3,Chelsea,Pizza Place,Deli / Bodega,Mexican Restaurant,Italian Restaurant,Bar
4,Chinatown,Pizza Place,Italian Restaurant,American Restaurant,Diner,Arcade


## Clustering Analysis, KMeans

In [51]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([3, 3, 1, 1, 4, 4, 0, 1, 3, 3], dtype=int32)

In [53]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head()


,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,4,Pizza Place,Bagel Shop,Italian Restaurant,Turkish Restaurant,Diner
1,Manhattan,Chinatown,40.715618,-73.994279,4,Pizza Place,Italian Restaurant,American Restaurant,Diner,Arcade
2,Manhattan,Washington Heights,40.851903,-73.9369,1,Pizza Place,Café,Turkish Restaurant,Diner,Arcade
3,Manhattan,Inwood,40.867684,-73.92121,1,Pizza Place,Turkish Restaurant,Diner,Arcade,Bagel Shop
4,Manhattan,Hamilton Heights,40.823604,-73.949688,1,Pizza Place,Deli / Bodega,Turkish Restaurant,Diner,Arcade


In [59]:

import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters