# Business problem explanation

New York City is the most populous city in the United States, with an estimated 2018 population of 8.4 million. The city is a vibrant metropolis: immigrants from all over the world live and work here and bring some of their culture with them. According to Wikipedia, large numbers of Irish, Italian, Jewish, and ultimately Asian and Hispanic Americans emigrated to New York throughout the 20th century, significantly influencing the culture and image of New York City.
One of the best ways to notice different cultures is to explore the city's gastronomic diversity. Here we will take a look at Chinese restaurants. In particular, the analysis tries to answer the following questions:

- how many Chinese restaurants are there in NYC?
- where are these restaurants located?
- according to a heat map, which borough has the lowest density of Chinese restaurants?
- what neighbourhoods have the lowest number of Chinese restaurants?
- would these neighbourhoods be interesting locations to open a Chinese restaurant?

# Data

To conduct the analysis and solve the business problem, several data sources will be used:
- New York City open data website: https://opendata.cityofnewyork.us/
- Data which contains a list of NYC boroughs, along with their longitude, latitude boundaries available at: https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm
- Data regarding population in NYC boroughs and neighborhoods: https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Neighborhood-Tabulatio/swpk-hqdp/data
- Foursquare list of Chinese restaurants. Using the Foursquare API, I will retrieve the venues around every neighborhood and then check which of them are Chinese restaurants.

In [1]:
# Firstly let's import all the necessary libraries

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In the following steps we will define the location and get a map of New York City.

In [2]:
# Define location (New York City, New York, USA)

address = 'New York City, New York'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City, New York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City, New York are 40.7127281, -74.0060152.


In order to get locations of all NYC boroughs and neighbourhoods we will use a json file containing all the necessary data regarding latitude and longitude.

In [24]:
nyc_center = latitude, longitude
nyc_center

(40.7127281, -74.0060152)

In [3]:
# Load the Boroughs and Neighborhoods data

with open('nyc_boroughs.json') as json_data:
    nyc = json.load(json_data)

neighborhoods_data = nyc['features']
neighborhoods_data[1]

{'type': 'Feature',
 'id': 'nyu_2451_34572.2',
 'geometry': {'type': 'Point',
  'coordinates': [-73.82993910812398, 40.87429419303012]},
 'geometry_name': 'geom',
 'properties': {'name': 'Co-op City',
  'stacked': 2,
  'annoline1': 'Co-op',
  'annoline2': 'City',
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.82993910812398,
   40.87429419303012,
   -73.82993910812398,
   40.87429419303012]}}
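
The record above stores its coordinates in GeoJSON order, longitude first and latitude second, which is easy to swap by accident. As a minimal sketch, a small helper can flatten one feature into a flat record. The function name `feature_to_row` is an assumption for illustration and is not part of the original notebook; the sample feature is a trimmed copy of the output shown above.

```python
def feature_to_row(feature):
    """Flatten one GeoJSON point feature into a flat record."""
    props = feature['properties']
    # GeoJSON stores coordinates as [longitude, latitude]
    lon, lat = feature['geometry']['coordinates']
    return {
        'Borough': props['borough'],
        'Neighborhood': props['name'],
        'Latitude': lat,
        'Longitude': lon,
    }


# Trimmed copy of the feature printed above
sample_feature = {
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.82993910812398, 40.87429419303012]},
    'properties': {'name': 'Co-op City', 'borough': 'Bronx'},
}

row = feature_to_row(sample_feature)
print(row)
```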

Now we will create an empty dataframe into which we will insert the data from the JSON file.

In [4]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [5]:
# Create a dataframe with all boroughs, neighborhoods and coordinates

# collect one record per neighborhood, then build the dataframe in a single step
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append is not available in pandas 2.0 and later
neighborhoods = pd.DataFrame(rows, columns=column_names)

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


In [6]:
# Let's visualize the data we have so far: NYC neighborhoods centers:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

After a brief inspection of the map it is clear that the highest concentration of neighbourhoods is in Manhattan. Now we will split the data frame into five separate data frames, one for each NYC borough.

In [7]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
brooklyn_data = neighborhoods[neighborhoods['Borough'] == 'Brooklyn'].reset_index(drop=True)
bronx_data = neighborhoods[neighborhoods['Borough'] == 'Bronx'].reset_index(drop=True)
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
staten_island_data = neighborhoods[neighborhoods['Borough'] == 'Staten Island'].reset_index(drop=True)

print('Separate dataframes for boroughs are created.')

Separate dataframes for boroughs are created.
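
The five filters above could also be expressed as a single grouping step. The following is only a sketch on a toy stand-in for the `neighborhoods` dataframe (neighborhood names are taken from outputs in this notebook, coordinates are omitted); the dictionary name `borough_frames` is an assumption.

```python
import pandas as pd

# Toy stand-in for the `neighborhoods` dataframe
toy = pd.DataFrame({
    'Borough': ['Bronx', 'Manhattan', 'Brooklyn', 'Brooklyn'],
    'Neighborhood': ['Co-op City', 'Marble Hill', 'Bay Ridge', 'Bensonhurst'],
})

# One dataframe per borough, each with a fresh 0-based index
borough_frames = {
    borough: frame.reset_index(drop=True)
    for borough, frame in toy.groupby('Borough')
}

print(sorted(borough_frames))
print(borough_frames['Brooklyn'])
```

With this layout, `borough_frames['Brooklyn']` plays the role of `brooklyn_data`, and the same code keeps working if the borough names in the source file change.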


In the following steps we will use Foursquare API to get the list of all Chinese restaurants.

In [8]:
# Define Foursquare Credentials and Version

CLIENT_ID = 'GDJB5PEJBOIUKSXRYJSCMQPZQDFKZ0FNDQVESYHD5GPZJE4Q' # your Foursquare ID
CLIENT_SECRET = 'KZNIJVOHJQWU4WZOHVGEKVMHXC42MLKUVAH1GJBVXWJ5FB4R' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
search_query = 'chinese'
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: GDJB5PEJBOIUKSXRYJSCMQPZQDFKZ0FNDQVESYHD5GPZJE4Q
CLIENT_SECRET:KZNIJVOHJQWU4WZOHVGEKVMHXC42MLKUVAH1GJBVXWJ5FB4R
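
Hard-coding and printing API credentials makes them visible to anyone who reads the notebook. One possible alternative, shown here only as a sketch, is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions, not something Foursquare or the original notebook prescribes.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the credentials from a mapping of environment variables (hypothetical names)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first.')
    return client_id, client_secret


def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + '*' * (len(secret) - visible)


# Demonstration with placeholder values, not real credentials
demo_env = {'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
            'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456'}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
```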


Before moving forward with the analysis we will conduct a test on the first neighborhood in the Manhattan dataframe.  

In [9]:
# Get the latitude and longitude of the first neighborhood in the Manhattan dataframe

neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name,
                                                               neighborhood_latitude,
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


In [10]:
# Get the list of Chinese restaurants for Marble Hill

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
search_query = 'chinese restaurant'

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&query={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius,
    search_query,
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=GDJB5PEJBOIUKSXRYJSCMQPZQDFKZ0FNDQVESYHD5GPZJE4Q&client_secret=KZNIJVOHJQWU4WZOHVGEKVMHXC42MLKUVAH1GJBVXWJ5FB4R&v=20180605&ll=40.87655077879964,-73.91065965862981&radius=500&query=chinese restaurant&limit=100'
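
The URL printed above contains an unencoded space in the `query` parameter. As a hedged sketch, the same URL can be assembled with the standard library's `urlencode`, which escapes such characters. The helper name `build_explore_url` and the placeholder credentials `'ID'` and `'SECRET'` are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, query, limit):
    """Build the explore URL with percent-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'query': query,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials; coordinates are those of Marble Hill
demo_url = build_explore_url('ID', 'SECRET', '20180605',
                             40.87655077879964, -73.91065965862981,
                             500, 'chinese restaurant', 100)
print(demo_url)
```

`requests.get` also accepts the parameter dictionary directly through its `params` argument, which performs comparable encoding.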

In [11]:
# Check results of the search

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e87a79447b43d0023144737'},
  'headerLocation': 'Marble Hill',
  'headerFullLocation': 'Marble Hill, New York',
  'headerLocationGranularity': 'neighborhood',
  'query': 'chinese restaurant',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 40.88105078329964,
    'lng': -73.90471933917806},
   'sw': {'lat': 40.87205077429964, 'lng': -73.91659997808156}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b7f2f48f964a520561d30e3',
       'name': 'China Wang',
       'location': {'address': '109 W 225th St',
        'lat': 40.87434742150794,
        'lng': -73.91054009348214,
        'labeledLatLngs': [{'label': 'display',
          'lat': 40.87434742150794,
          'lng': -73.91054009348214}],
        'distance': 245,
        '

In [12]:
# Define a function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [13]:
# Now we are ready to clean the json and structure it into a pandas dataframe.

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,China Wang,Chinese Restaurant,40.874347,-73.91054
1,Yung Cheng Kitchen,Chinese Restaurant,40.877394,-73.906848
2,Golden City Chinese Restaurant,Chinese Restaurant,40.879319,-73.906187


It seems that there are three Chinese restaurants in Marble Hill.

Now we will define a function to go through a dataframe and return all Chinese restaurants in a neighborhood.

In [14]:
# define function to get all Chinese restaurants from Foursquare

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&query={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            search_query,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
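
The function above indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue with an empty category list. The sketch below separates the row extraction into a helper that tolerates this case. The name `extract_venue_rows` is an assumption, and the second sample venue is hypothetical; only the first one comes from the Marble Hill response shown earlier.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-made items mimicking the response shape; the second venue is hypothetical
sample_items = [
    {'venue': {'name': 'China Wang',
               'location': {'lat': 40.87434742150794, 'lng': -73.91054009348214},
               'categories': [{'name': 'Chinese Restaurant'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 40.0, 'lng': -73.0},
               'categories': []}},
]

rows = extract_venue_rows('Marble Hill', 40.87655077879964, -73.91065965862981, sample_items)
for r in rows:
    print(r)
```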

In [15]:
# get Chinese restaurants in Manhattan

manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

print('Dataframes with Chinese restaurants for Manhattan are created.')

Dataframes with Chinese restaurants for Manhattan are created.


In [16]:
# get Chinese restaurants in Brooklyn

brooklyn_venues = getNearbyVenues(names=brooklyn_data['Neighborhood'],
                                   latitudes=brooklyn_data['Latitude'],
                                   longitudes=brooklyn_data['Longitude']
                                  )
print('Dataframes with Chinese restaurants for Brooklyn are created.')

Dataframes with Chinese restaurants for Brooklyn are created.


In [17]:
# get Chinese restaurants in the Bronx

bronx_venues = getNearbyVenues(names=bronx_data['Neighborhood'],
                                   latitudes=bronx_data['Latitude'],
                                   longitudes=bronx_data['Longitude']
                                  )

print('Dataframes with Chinese restaurants for the Bronx are created.')

Dataframes with Chinese restaurants for the Bronx are created.


In [18]:
# get Chinese restaurants in Queens

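# Note: drop() returns a new dataframe and the result is not assigned here,
# so queens_data is unchanged and row 75 is still included in the query below.
queens_data.drop(index=75)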

queens_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude']
                                  )

print('Dataframes with Chinese restaurants for Queens are created.')

Dataframes with Chinese restaurants for Queens are created.


In [19]:
# get Chinese restaurants in Staten Island

staten_island_venues = getNearbyVenues(names=staten_island_data['Neighborhood'],
                                   latitudes=staten_island_data['Latitude'],
                                   longitudes=staten_island_data['Longitude']
                                  )

print('Dataframes with Chinese restaurants for Staten Island are created.')

Dataframes with Chinese restaurants for Staten Island are created.


In [20]:
# Let's check out the new dataframes.

print(manhattan_venues.shape)
print(brooklyn_venues.shape)
print(bronx_venues.shape)
print(queens_venues.shape)
print(staten_island_venues.shape)

(629, 7)
(314, 7)
(193, 7)
(338, 7)
(57, 7)


In [21]:
# Create a dataframe for Chinese restaurants in all of NYC

# stack the five borough dataframes (DataFrame.append is not available in pandas 2.0 and later)
nyc_venues = pd.concat([manhattan_venues, brooklyn_venues, bronx_venues,
                        queens_venues, staten_island_venues])
nyc_venues.shape

(1531, 7)

In [22]:
# Remove duplicate restaurants from the dataframe containing all Chinese restaurants in NYC.
# Note: duplicates are matched on the venue name only, so same-named branches of a chain count once.

nyc_venues = nyc_venues.drop_duplicates('Venue')
nyc_venues.shape

print('There is a total of {} Chinese restaurants in NYC.'.format(nyc_venues.shape[0]))

There is a total of 1247 Chinese restaurants in NYC.
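
Because neighborhood search circles overlap, the same venue can be returned by several queries, which is why duplicates are removed. Matching on the name alone, however, also merges distinct restaurants that happen to share a name. The sketch below compares that with matching on name plus coordinates, using toy rows; the chain name 'Example Wok' and its coordinates are hypothetical.

```python
import pandas as pd

# Toy rows: a hypothetical chain with two branches,
# one of which was returned by two neighborhood queries
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Venue': ['Example Wok', 'Example Wok', 'Example Wok'],
    'Venue Latitude': [40.70, 40.70, 40.80],
    'Venue Longitude': [-73.95, -73.95, -73.90],
})

# Matching on the name keeps a single row for the whole chain
by_name = toy_venues.drop_duplicates('Venue')

# Matching on name and location keeps one row per physical branch
by_name_and_location = toy_venues.drop_duplicates(
    subset=['Venue', 'Venue Latitude', 'Venue Longitude'])

print(len(by_name), len(by_name_and_location))
```

Applied to the real data, the second variant would likely yield a different total than the one reported above, so the two counts should not be mixed.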


Now we will create a heat-map showing the density of Chinese restaurants in NYC.

In [25]:
def boroughs_style(feature):
    return { 'color': 'blue', 'fill': False }

from folium.plugins import HeatMap

map_nyc = folium.Map(location=nyc_center, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_nyc) #cartodbpositron cartodbdark_matter
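# Each restaurant row contributes its neighborhood's center point, so the map shows restaurants per neighborhood
HeatMap(nyc_venues[['Neighborhood Latitude', 'Neighborhood Longitude']]).add_to(map_nyc)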
map_nyc

This heat map demonstrates that the highest density of Chinese restaurants is definitely in Manhattan, which is not surprising. Fewer Chinese restaurants appear on Staten Island and in some parts of Queens and Brooklyn. However, Staten Island and Queens are primarily residential boroughs, so the stakeholders might be more interested in Brooklyn, which is closer to Manhattan.

In [26]:
# Let's define a new, narrower region of interest, which will include the low-restaurant-count parts of Brooklyn.

address_brooklyn = 'Brooklyn, New York'

geolocator = Nominatim(user_agent="ny_explorer")
location_brooklyn = geolocator.geocode(address_brooklyn)
latitude_brooklyn = location_brooklyn.latitude
longitude_brooklyn = location_brooklyn.longitude
print('The geographical coordinates of Brooklyn, New York are {}, {}.'.format(latitude_brooklyn, longitude_brooklyn))


The geographical coordinates of Brooklyn, New York are 40.6501038, -73.9495823.


In [27]:
# Heatmap of the Chinese restaurants in Brooklyn.

brooklyn_center = latitude_brooklyn, longitude_brooklyn

map_brooklyn = folium.Map(location=brooklyn_center, zoom_start=14)
HeatMap(brooklyn_venues[['Neighborhood Latitude', 'Neighborhood Longitude']]).add_to(map_brooklyn)
map_brooklyn

The highest density of Chinese restaurants in Brooklyn is found in neighborhoods close to the Brooklyn Bridge and the Manhattan Bridge. Let's check which neighborhoods have the lowest number of Chinese restaurants.


In [28]:
# Check the number of restaurants in Brooklyn neighborhoods. To do that, we first group
# the results by neighborhood and count the number of venues (Chinese restaurants) in each one.

brooklyn_venues_groupby = brooklyn_venues.groupby('Neighborhood').count()
brooklyn_venues_groupby['Venue'].head()

Neighborhood
Bath Beach            14
Bay Ridge             10
Bedford Stuyvesant     5
Bensonhurst           11
Boerum Hill            7
Name: Venue, dtype: int64

In [29]:
# Let's find out which neighborhoods have the lowest number of Chinese restaurants.

brooklyn_venues_min = brooklyn_venues_groupby.loc[brooklyn_venues_groupby['Venue'] == brooklyn_venues_groupby['Venue'].min()] 
brooklyn_venues_min.head()

print('There are {} neighborhoods with only 1 Chinese restaurant'.format(brooklyn_venues_min.shape[0]))

There are 10 neighborhoods with only 1 Chinese restaurant
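
One caveat with this count: grouping the venue dataframe only produces rows for neighborhoods that returned at least one venue, so a neighborhood with no results would be missing rather than counted as zero, and the minimum can never fall below 1. A minimal sketch of how zero-count neighborhoods could be included is shown below on toy data; 'Neighborhood Z' and the venue names are hypothetical.

```python
import pandas as pd

# Full list of neighborhoods; 'Neighborhood Z' is a hypothetical area with no results
all_neighborhoods = ['Bath Beach', 'Bay Ridge', 'Neighborhood Z']

# Toy stand-in for the venue dataframe
toy_venues = pd.DataFrame({
    'Neighborhood': ['Bath Beach', 'Bath Beach', 'Bay Ridge'],
    'Venue': ['Venue 1', 'Venue 2', 'Venue 3'],
})

# Count venues per neighborhood, then add the missing neighborhoods with a count of 0
counts = (toy_venues.groupby('Neighborhood')['Venue']
          .count()
          .reindex(all_neighborhoods, fill_value=0))

lowest = counts[counts == counts.min()]
print(lowest)
```

In the notebook, `brooklyn_data['Neighborhood']` would be the natural source for the full list, assuming neighborhood names are unique within the borough.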


In [30]:
# define a list containing the names of all neighborhoods with only one Chinese restaurant 

brooklyn_neighborhoods_min = brooklyn_venues_min.index.values.tolist()
brooklyn_neighborhoods_min[0:5]

['Borough Park',
 'Broadway Junction',
 'Crown Heights',
 'Flatlands',
 'Highland Park']

In [31]:
# Create a dataframe with all neighborhoods and coordinates that have only one Chinese restaurant

brooklyn_data_chinese_min = brooklyn_data[brooklyn_data['Neighborhood'].isin(brooklyn_neighborhoods_min)].reset_index(drop=True)


In [32]:
# Let's visualize these neighbourhoods

map_brooklyn_min = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(brooklyn_data_chinese_min['Latitude'], brooklyn_data_chinese_min['Longitude'], brooklyn_data_chinese_min['Borough'], brooklyn_data_chinese_min['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_brooklyn_min)  
    
map_brooklyn_min

Considering all of the data presented above, it would be reasonable to consider Brooklyn as a potential location, especially the Red Hook neighbourhood, which has only one Chinese restaurant and is close to Manhattan.