# Battle of the Neighborhoods

#### Coursera Capstone Final Project - Battle of the Neighborhoods - identifying the best neighborhoods in which to start a new restaurant.

## Introduction

In this project, I will use the Foursquare API to explore neighborhoods in New York City and get the top-rated restaurants in each neighborhood. I will then use these restaurant features to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, I will use the Folium library to visualize the neighborhoods in New York City and their emerging clusters.

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

# import average function
from statistics import mean as avg

print('Libraries imported.')


<a id='item1'></a>

## 1. Downloading and Exploring Dataset

New York City has a total of 5 boroughs and 306 neighborhoods. To segment the neighborhoods and explore them, we will use the dataset provided by NYU GeoSpatial Repository - https://geo.nyu.edu/catalog/nyu_2451_34572  
It contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood. 

For convenience, I will download a copy of the file from an IBM server by running a `wget` command.

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


#### Loading and exploring the data

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [None]:
newyork_data

Taking a quick look at the data, it is clear that all the relevant data is in the *features* key, which is basically a list of the neighborhoods. So, defining a new variable that includes this data.

In [5]:
neighborhoods_data = newyork_data['features']

Taking a look at the first item in this list.

In [6]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transforming the data into a *pandas* dataframe

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe.

Creating an empty dataframe with the required column headers and then filling it up with data one row at a time.

In [7]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [8]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step, keeping the column order defined above
neighborhoods = pd.DataFrame(rows, columns=column_names)
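
As a minimal sketch, the per-feature extraction in the loop above could be factored into a small helper. The name `feature_to_row` and the trimmed `sample_feature` are hypothetical and not part of the original notebook; the point to notice is that GeoJSON lists the longitude before the latitude.

```python
def feature_to_row(feature):
    """Flatten one GeoJSON feature into the four fields used by the dataframe."""
    props = feature['properties']
    # GeoJSON Point coordinates are ordered [longitude, latitude]
    lon, lat = feature['geometry']['coordinates']
    return {
        'Borough': props['borough'],
        'Neighborhood': props['name'],
        'Latitude': lat,
        'Longitude': lon,
    }

# trimmed copy of the first feature in the dataset (hypothetical variable name)
sample_feature = {
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}

row = feature_to_row(sample_feature)
print(row)
```

A list of such rows can be passed directly to `pd.DataFrame` to build the whole table in one step.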

Quickly examining the resulting dataframe.

In [9]:
neighborhoods.head()

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


The dataset has all 5 boroughs and 306 neighborhoods, and there are no missing values.
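
The borough and row counts do not by themselves show that no values are missing. A minimal sketch of an explicit check follows; it uses a three-row stand-in (the hypothetical `sample` dataframe) rather than the full `neighborhoods` dataframe, and the same two calls could be applied to `neighborhoods` directly.

```python
import pandas as pd

# small stand-in for the `neighborhoods` dataframe (values taken from the notebook output)
sample = pd.DataFrame({
    'Borough': ['Bronx', 'Bronx', 'Manhattan'],
    'Neighborhood': ['Wakefield', 'Co-op City', 'Marble Hill'],
    'Latitude': [40.894705, 40.874294, 40.876551],
    'Longitude': [-73.847201, -73.829939, -73.910660],
})

# count missing entries per column; all zeros means nothing is missing
missing_per_column = sample.isna().sum()
print(missing_per_column)

# a single boolean for the whole dataframe
has_missing = bool(sample.isna().any().any())
print('Any missing values:', has_missing)
```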

#### Visualizing the New York City neighborhoods in a map

Using the geopy library to get the latitude and longitude values of New York City.

In [11]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


Creating a map of New York with neighborhoods superimposed on top using the Folium library.

In [12]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

#### For illustration purposes and because of API constraints, the rest of the analysis segments and clusters only the neighborhoods in Manhattan.

Slicing the original dataframe and creating a new dataframe of the Manhattan data.

In [13]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


Getting the geographical coordinates of Manhattan.

In [14]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


#### Visualizing Manhattan with the neighborhoods in it.

In [15]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

## 2. Searching Restaurants in Manhattan

Using Foursquare API to search restaurants in each neighborhood.  
Initialising the required parameters.

In [16]:
CLIENT_ID = '1CK4OGNVAPRT5JBJH3ELOAKFQAEPXHPLDPWQLXARNH0QCOR1' # your Foursquare ID
CLIENT_SECRET = 'P0B4WE3O52G0IDFEMLGKDQ53MQD25HDQTXL0WELTNPEQ54AS' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 1CK4OGNVAPRT5JBJH3ELOAKFQAEPXHPLDPWQLXARNH0QCOR1
CLIENT_SECRET:P0B4WE3O52G0IDFEMLGKDQ53MQD25HDQTXL0WELTNPEQ54AS
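
Hard-coding and printing API credentials makes them visible to anyone who reads the notebook. One possible alternative, shown here only as a sketch, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions for illustration, not something Foursquare or the original notebook defines.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from an environment-style mapping."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

# demonstration with a stand-in mapping instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
CLIENT_ID_DEMO, CLIENT_SECRET_DEMO = load_foursquare_credentials(demo_env)

# print only the non-secret part
print('CLIENT_ID: ' + CLIENT_ID_DEMO)
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read from the real process environment.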


The function below uses the Foursquare API to get the nearby restaurants for a neighborhood and their corresponding ratings. As required by our analysis, the function returns a dataframe with the number of restaurants, the average rating of the top-rated restaurants and the average rating of the lowest-rated restaurants.

In [17]:
def analyzeNearbyRestaurants(names, latitudes, longitudes, radius=500):
    
    nearby_restaurant_analysis_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
                   
        # create the API request URL to get the nearby restaurants
        # (the search radius comes from the function's `radius` argument)
        search_query = 'Restaurant'

        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, search_query, radius)
            
        # make the GET request
        restaurants_list = requests.get(url).json()['response']['venues']
        
        #Get the number of restaurants in the neighborhood
        no_of_restaurants =len(restaurants_list)
        
        ratings_list=[]
        
        for restaurant in restaurants_list:
            restaurant_id = restaurant['id']
            # create the API request URL to explore each restaurant and store its rating
            url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(restaurant_id, CLIENT_ID, CLIENT_SECRET, VERSION)
            result = requests.get(url).json()
            try:
                rating = result['response']['venue']['rating']
                ratings_list.append(rating)
            except KeyError:
                # the venue has no rating (or the response has no venue); skip it
                pass
        
        #Get the average rating of the highest rated 5 and the lowest rated 5 restaurants
        if len(ratings_list)==0:
            ratings_list=[0]
        
        if len(ratings_list)<10:
            ratings_list = sorted(ratings_list + [avg(ratings_list)]*(10-len(ratings_list)))
        else:
            ratings_list = sorted(ratings_list)

        bottom5_avg_rating = avg(ratings_list[0:5])
        top5_avg_rating = avg(ratings_list[-5:])
        

        # return relevant information for each nearby restaurant
        nearby_restaurant_analysis_list.append((
            name, 
            lat, 
            lng, 
            no_of_restaurants, 
            top5_avg_rating, 
            bottom5_avg_rating))

    #store in a dataframe
    nearby_restaurant_analysis = pd.DataFrame(nearby_restaurant_analysis_list)
    nearby_restaurant_analysis.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'No of Restaurants', 
                  'Average Top Rating', 
                  'Average Low Rating']
    
    return nearby_restaurant_analysis
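
The rating summary inside the function can be isolated to see how the padding behaves. The sketch below reproduces the same rule (pad to 10 entries with the mean, then average the 5 highest and 5 lowest) in a hypothetical helper named `summarize_ratings`; the two input ratings are made-up values for illustration.

```python
from statistics import mean as avg

def summarize_ratings(ratings):
    """Return (top5_avg, bottom5_avg) using the same padding rule as the function above."""
    ratings = list(ratings) or [0]          # no ratings at all: treat as a single 0
    if len(ratings) < 10:
        # pad up to 10 entries with the mean of the available ratings
        ratings = ratings + [avg(ratings)] * (10 - len(ratings))
    ratings = sorted(ratings)
    return avg(ratings[-5:]), avg(ratings[:5])

# two made-up ratings: the padded list is [6.0, 7.0 x 8, 8.0]
top5, bottom5 = summarize_ratings([8.0, 6.0])
print(top5, bottom5)

# a neighborhood with no rated restaurants ends up with 0 for both averages
print(summarize_ratings([]))
```

With few ratings, both averages are pulled toward the overall mean, so neighborhoods with sparse data show a narrower gap between the top and low averages than neighborhoods with 10 or more rated restaurants.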


Running the function to collect the restaurant findings for each neighborhood.

In [18]:
#manhattan_restaurants = analyzeNearbyRestaurants(names=manhattan_data['Neighborhood'],
#                                   latitudes=manhattan_data['Latitude'],
#                                   longitudes=manhattan_data['Longitude']
#                                  )

The call above is commented out because its results have already been stored in a CSV file. This file is read into a dataframe.

In [19]:
manhattan_restaurants = pd.read_csv('Manhattan_Restaurants.csv')
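
Because every run of the analysis function issues many API requests, caching its result on disk avoids repeating them. The following is a sketch of one way to do that; `load_or_compute`, `fake_analysis` and the cache file name are hypothetical and stand in for the real function call and `Manhattan_Restaurants.csv`.

```python
import os
import tempfile
import pandas as pd

def load_or_compute(csv_path, compute):
    """Read cached results from csv_path, or call compute() and cache its result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    result = compute()
    result.to_csv(csv_path, index=False)   # index=False avoids writing an extra index column
    return result

# stand-in for the expensive API-backed analysis
def fake_analysis():
    return pd.DataFrame({'Neighborhood': ['Marble Hill'], 'No of Restaurants': [14]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'restaurants_cache.csv')
    first = load_or_compute(path, fake_analysis)    # computes and writes the file
    second = load_or_compute(path, fake_analysis)   # reads the cached file
    print(second)
```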

Checking the shape of the resulting dataframe to get the number of neighborhoods

In [20]:
manhattan_restaurants.shape

(40, 6)

## 3. Analyzing the neighborhoods

The dataframe will give an idea of the number of restaurants and the average high and low ratings of each neighborhood.

In [21]:
manhattan_restaurants

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | No of Restaurants | Average Top Rating | Average Low Rating |
|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | 14 | 7.82 | 7.38 |
| 1 | Chinatown | 40.715618 | -73.994279 | 30 | 7.84 | 6.24 |
| 2 | Washington Heights | 40.851903 | -73.9369 | 30 | 8.04 | 6.92 |
| 3 | Inwood | 40.867684 | -73.92121 | 26 | 7.413333 | 6.753333 |
| 4 | Hamilton Heights | 40.823604 | -73.949688 | 30 | 7.06 | 5.94 |
| 5 | Manhattanville | 40.816934 | -73.957385 | 21 | 7.466667 | 6.666667 |
| 6 | Central Harlem | 40.815976 | -73.943211 | 24 | 7.96 | 6.5 |
| 7 | East Harlem | 40.792249 | -73.944182 | 30 | 7.474286 | 6.497143 |
| 8 | Upper East Side | 40.775639 | -73.960508 | 30 | 8.22 | 5.86 |
| 9 | Yorkville | 40.77593 | -73.947118 | 30 | 8.36 | 5.96 |


## 4. Clustering Neighborhoods

Running *k*-means to group the neighborhoods into 3 clusters.

In [22]:
# set number of clusters
kclusters = 3

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_restaurants[['No of Restaurants','Average Top Rating','Average Low Rating']])

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 0, 2, 0, 2, 2, 0, 0, 0], dtype=int32)
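
*k*-means assigns points by Euclidean distance, so a feature with a wide range (the restaurant count, 0 to 30) can outweigh features with a narrow range (the ratings). A common option, which the original analysis does not use, is to standardize each column first. The sketch below does this with the standard library on the first five rows of the restaurant dataframe; `standardize` is a hypothetical helper, and scikit-learn's `StandardScaler` offers comparable column-wise scaling.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]        # a constant column carries no information
    return [(value - mu) / sigma for value in column]

# first five rows of the restaurant dataframe
no_of_restaurants = [14, 30, 30, 26, 30]
avg_top_rating = [7.82, 7.84, 8.04, 7.413333, 7.06]

scaled_counts = standardize(no_of_restaurants)
scaled_ratings = standardize(avg_top_rating)
print([round(v, 2) for v in scaled_counts])
print([round(v, 2) for v in scaled_ratings])
```

After scaling, both columns are expressed in standard deviations from their own mean, so neither dominates the distance purely because of its units.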

Creating a new dataframe that includes the cluster for each neighborhood.

In [23]:
# add clustering labels
manhattan_restaurants.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_restaurants.head() # checking

|   | Cluster Labels | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | No of Restaurants | Average Top Rating | Average Low Rating |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Marble Hill | 40.876551 | -73.91066 | 14 | 7.82 | 7.38 |
| 1 | 0 | Chinatown | 40.715618 | -73.994279 | 30 | 7.84 | 6.24 |
| 2 | 0 | Washington Heights | 40.851903 | -73.9369 | 30 | 8.04 | 6.92 |
| 3 | 2 | Inwood | 40.867684 | -73.92121 | 26 | 7.413333 | 6.753333 |
| 4 | 0 | Hamilton Heights | 40.823604 | -73.949688 | 30 | 7.06 | 5.94 |


Finally, visualizing the resulting clusters

In [24]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_restaurants['Neighborhood Latitude'], manhattan_restaurants['Neighborhood Longitude'], manhattan_restaurants['Neighborhood'], manhattan_restaurants['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [25]:
manhattan_restaurants.sort_values(by='Cluster Labels')

|   | Cluster Labels | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | No of Restaurants | Average Top Rating | Average Low Rating |
|---|---|---|---|---|---|---|---|
| 19 | 0 | East Village | 40.727847 | -73.982226 | 30 | 7.715 | 6.56 |
| 27 | 0 | Gramercy | 40.73721 | -73.981376 | 30 | 7.7 | 5.92 |
| 31 | 0 | Noho | 40.723259 | -73.988434 | 30 | 7.98 | 6.0 |
| 25 | 0 | Manhattan Valley | 40.797307 | -73.964286 | 30 | 7.551429 | 6.305714 |
| 24 | 0 | West Village | 40.734434 | -74.00618 | 30 | 8.72 | 6.42 |
| 23 | 0 | Soho | 40.722184 | -74.000657 | 30 | 8.16 | 6.02 |
| 22 | 0 | Little Italy | 40.719324 | -73.997305 | 30 | 7.94 | 6.02 |
| 21 | 0 | Tribeca | 40.721522 | -74.010683 | 30 | 8.28 | 5.96 |
| 32 | 0 | Civic Center | 40.715229 | -74.005415 | 30 | 7.92 | 6.36 |
| 38 | 0 | Flatiron | 40.739673 | -73.990947 | 30 | 8.04 | 7.004444 |


## 5. Examining Clusters

Now, examining each cluster and determining the restaurant counts and ratings that distinguish each cluster.

#### Cluster 1 (label 0)

In [26]:
manhattan_restaurants.loc[manhattan_restaurants['Cluster Labels'] == 0, manhattan_restaurants.columns[[1] + list(range(4, manhattan_restaurants.shape[1]))]]

|   | Neighborhood | No of Restaurants | Average Top Rating | Average Low Rating |
|---|---|---|---|---|
| 1 | Chinatown | 30 | 7.84 | 6.24 |
| 2 | Washington Heights | 30 | 8.04 | 6.92 |
| 4 | Hamilton Heights | 30 | 7.06 | 5.94 |
| 7 | East Harlem | 30 | 7.474286 | 6.497143 |
| 8 | Upper East Side | 30 | 8.22 | 5.86 |
| 9 | Yorkville | 30 | 8.36 | 5.96 |
| 10 | Lenox Hill | 30 | 8.24 | 6.2 |
| 12 | Upper West Side | 30 | 7.84 | 6.08 |
| 14 | Clinton | 30 | 7.42 | 5.5 |
| 15 | Midtown | 30 | 7.32 | 5.62 |


#### Cluster 2 (label 1)

In [27]:
manhattan_restaurants.loc[manhattan_restaurants['Cluster Labels'] == 1, manhattan_restaurants.columns[[1] + list(range(4, manhattan_restaurants.shape[1]))]]

|   | Neighborhood | No of Restaurants | Average Top Rating | Average Low Rating |
|---|---|---|---|---|
| 0 | Marble Hill | 14 | 7.82 | 7.38 |
| 11 | Roosevelt Island | 0 | 0.0 | 0.0 |
| 26 | Morningside Heights | 12 | 7.225 | 6.3 |
| 37 | Stuyvesant Town | 5 | 6.9 | 6.9 |


#### Cluster 3 (label 2)

In [28]:
manhattan_restaurants.loc[manhattan_restaurants['Cluster Labels'] == 2, manhattan_restaurants.columns[[1] + list(range(4, manhattan_restaurants.shape[1]))]]

|   | Neighborhood | No of Restaurants | Average Top Rating | Average Low Rating |
|---|---|---|---|---|
| 3 | Inwood | 26 | 7.413333 | 6.753333 |
| 5 | Manhattanville | 21 | 7.466667 | 6.666667 |
| 6 | Central Harlem | 24 | 7.96 | 6.5 |
| 13 | Lincoln Square | 25 | 8.38 | 6.12 |
| 20 | Lower East Side | 25 | 7.946667 | 6.32 |
| 28 | Battery Park City | 23 | 7.66 | 6.26 |
| 39 | Hudson Yards | 25 | 7.26 | 5.76 |
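
A per-cluster average makes the comparison between clusters easier than reading the tables row by row. The sketch below groups a short excerpt of rows copied from the Cluster 2 and Cluster 3 tables; the `excerpt` dataframe is a hypothetical stand-in, and the same `groupby` call could be applied to the full `manhattan_restaurants` dataframe.

```python
import pandas as pd

# rows copied from the Cluster 2 (label 1) and Cluster 3 (label 2) tables
rows = [
    (1, 'Marble Hill', 14, 7.82, 7.38),
    (1, 'Roosevelt Island', 0, 0.0, 0.0),
    (1, 'Morningside Heights', 12, 7.225, 6.3),
    (1, 'Stuyvesant Town', 5, 6.9, 6.9),
    (2, 'Inwood', 26, 7.413333, 6.753333),
    (2, 'Manhattanville', 21, 7.466667, 6.666667),
    (2, 'Central Harlem', 24, 7.96, 6.5),
]
columns = ['Cluster Labels', 'Neighborhood', 'No of Restaurants',
           'Average Top Rating', 'Average Low Rating']
excerpt = pd.DataFrame(rows, columns=columns)

# average each numeric feature within each cluster label
feature_columns = ['No of Restaurants', 'Average Top Rating', 'Average Low Rating']
cluster_means = excerpt.groupby('Cluster Labels')[feature_columns].mean()
print(cluster_means.round(2))
```

Note that Roosevelt Island contributes zeros for all three features because no restaurants were returned for it, which lowers the averages for its cluster.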


On examining the clusters, we can conclude that:
  
**Cluster-1**, which contains most of the neighborhoods, already has a large number of restaurants with fairly high ratings. These neighborhoods would be highly risky for a new business.  
  
**Cluster-3**, which contains a few neighborhoods including Manhattanville and Hudson Yards, has an average number of restaurants with fair ratings.  
  
**Cluster-2**, which includes 4 neighborhoods, has very few restaurants with fair ratings. These neighborhoods are ideal for starting a new restaurant. 

The most ideal neighborhoods for starting a new restaurant are the following:

In [29]:
manhattan_restaurants.loc[manhattan_restaurants['Cluster Labels'] == 1,['Neighborhood']].reset_index(drop=True)

|   | Neighborhood |
|---|---|
| 0 | Marble Hill |
| 1 | Roosevelt Island |
| 2 | Morningside Heights |
| 3 | Stuyvesant Town |
