WHAT IS THE BEST LOCATION IN MANHATTAN TO OPEN A COFFEE STALL?

INTRODUCTION

The question addressed here is: "what is the best neighbourhood of Manhattan to open a coffee stall?"

A good location is a neighbourhood with high foot traffic and few competing coffee shops.

The intended audience is entrepreneurs looking to set up a coffee stall in Manhattan.

They care about this question because Manhattan is already a competitive place for coffee shops; a neighbourhood with high foot traffic and low competition is the one most likely to return a profit.
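
One way to make this criterion concrete is to rank neighbourhoods by foot traffic per competing coffee shop. The sketch below is an illustration only: the column names `avg_ped_2019` and `coffee_shops`, the three rows of figures, and the +1 in the denominator are all assumptions, not results from this analysis.

```python
import pandas as pd

# Hypothetical figures for illustration only; not taken from the datasets used here.
example = pd.DataFrame({
    "neighbourhood": ["A", "B", "C"],
    "avg_ped_2019": [9000, 6000, 3000],  # assumed average pedestrian count
    "coffee_shops": [29, 5, 0],          # assumed number of competing venues
})

# Pedestrians per competitor; the +1 avoids division by zero when there are no coffee shops.
example["score"] = example["avg_ped_2019"] / (example["coffee_shops"] + 1)

# Highest score first: busy areas with little competition rise to the top.
ranked = example.sort_values("score", ascending=False).reset_index(drop=True)
print(ranked)
```

A ratio like this rewards low competition heavily, so a quiet area with no coffee shops can outrank a busy one; a different weighting may suit the final analysis better.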

In [1]:
!conda install lxml --yes

import pandas as pd

!conda install -c conda-forge geopy --yes # install geopy if it is not already available
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # install folium if it is not already available
import folium # map rendering library

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - lxml


The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2020.6.2~ --> pkgs/main::ca-certificates-2020.6.24-0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge::certifi-2020.6.20-py36h9~ --> pkgs/main::certifi-2020.6.20-py36_0
  openssl            conda-forge::openssl-1.1.1g-h516909a_1 --> pkgs/main::openssl-1.1.1g-h7b6447c_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done

DATA

Three datasets will be used:

1) List of Manhattan neighbourhoods from Wikipedia

https://en.wikipedia.org/wiki/List_of_Manhattan_neighborhoods#:~:text=The%20following%20approximate%20definitions%20are,34th%20Street%20and%2059th%20Street.

This page supplies the neighbourhood names and their boundaries as string data:

In [19]:
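# read every HTML table on the page; the first row of each table is used as the header
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_Manhattan_neighborhoods', header=0)
pd.set_option('display.max_rows', 200)
# the first four tables hold the neighbourhood lists for each part of Manhattan
df_upper = data[0]
df_midtown = data[1]
df_between = data[2]
df_downtown = data[3]
# stack the four tables into a single dataframe
frames = [df_upper, df_midtown, df_between, df_downtown]
df_manhattan = pd.concat(frames)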

Examples of Manhattan neighbourhoods

In [20]:
df_manhattan.head()

Unnamed: 0,Name of the neighborhood,Limits south to north and east to west
0,Upper Manhattan,Above 96th Street
1,Marble Hill,Physically located on the mainland
2,Inwood,Above Dyckman Street
3,Fort George (part of Washington Heights),East of Broadway between 181st Street and Dyck...
4,Washington Heights,155th Street to Dyckman Street


Description of Manhattan Neighbourhoods

In [22]:
df_manhattan.describe()

Unnamed: 0,Name of the neighborhood,Limits south to north and east to west
count,85,85
unique,83,85
top,Hudson Yards,40th to 59th Streets; 3rd to 9th Avenues
freq,2,1


2) Pedestrian Traffic data from NYC Open Data

https://data.cityofnewyork.us/api/views/cqsj-cfgu/rows.csv?accessType=DOWNLOAD&bom=true&format=true

The dataset has the following relevant columns:

Borough (e.g. Manhattan) - String
the_geom (lat, lng coordinate) - String
Time series columns (e.g. 1576 for May 2019 at 161st Street in the Bronx) - Integer

For each year from 2007 to 2019, pedestrian counts are taken twice a year, in May and September, in the morning, at midday and in the afternoon.
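
The count columns encode the month, year and time of day in their names (for example `May19_AM` or `Sept18_MD`). A minimal sketch of splitting those names apart follows; the pattern is inferred from the column names visible in the `df_ped.head()` output, and the helper name `parse_count_column` is hypothetical.

```python
import re

# Pattern for count columns such as "May19_AM" or "Sept18_MD" (inferred from the column names).
COLUMN_PATTERN = re.compile(r"^(May|Sept)(\d{2})_(AM|MD|PM)$")

def parse_count_column(name):
    """Return (month, year, period) for a count column, or None for any other column."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    month, yy, period = match.groups()
    return month, 2000 + int(yy), period

columns = ["Borough", "the_geom", "May07_AM", "Sept18_MD", "May19_PM"]
parsed = {name: parse_count_column(name) for name in columns}
print(parsed)
```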

Only the 2019 data will be used, averaged for each neighbourhood.
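
The averaging step could look roughly like the sketch below. It runs on a small stand-in for `df_ped` whose counts are copied from the first rows of the pedestrian data, including the "-" that marks a missing count. The `neighbourhood` labels and the `avg_ped_2019` column name are assumptions made for illustration; the real labels would come from the reverse geocoding step.

```python
import pandas as pd

# Stand-in for df_ped. The neighbourhood labels are hypothetical.
sample = pd.DataFrame({
    "Street_Nam": ["Broad Street", "Broadway", "Columbus Avenue"],
    "neighbourhood": ["Financial District", "Financial District", "Upper West Side"],
    "May19_AM": ["4100", "5049", "-"],
    "May19_PM": ["7302", "11765", "-"],
    "May19_MD": ["1669", "7029", "-"],
})

# Keep only the 2019 count columns.
cols_2019 = [c for c in sample.columns if c.endswith(("19_AM", "19_PM", "19_MD"))]

# Convert text to numbers; "-" and any other non-numeric text becomes NaN.
counts_2019 = sample[cols_2019].apply(
    lambda s: pd.to_numeric(s.astype(str).str.replace(",", "", regex=False), errors="coerce")
)

# Average per counting location, then per neighbourhood.
sample["avg_ped_2019"] = counts_2019.mean(axis=1)
per_hood = sample.groupby("neighbourhood")["avg_ped_2019"].mean()
print(per_hood)
```

Locations with no 2019 counts stay as NaN rather than being treated as zero, so they do not drag a neighbourhood's average down.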

The GeoPy library will be used to assign an area name to each lat and lng value in the pedestrian traffic data using its reverse geocoding function. If a matching neighbourhood cannot be found, one will be assigned manually.
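
A minimal sketch of that lookup is shown below. The `the_geom` values are stored as `POINT (lng lat)`, so they are parsed and reordered before being passed to the geocoder's `reverse` method. The helper names, the user agent string, and the choice of the `neighbourhood` and `suburb` address keys are assumptions; Nominatim does not always return either key, which is where the manual assignment would come in.

```python
import re

def parse_point(wkt):
    """Parse 'POINT (lng lat)' into a (lat, lng) tuple of floats."""
    match = re.match(r"POINT \((-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)\)", wkt)
    if match is None:
        raise ValueError("not a WKT point: " + repr(wkt))
    lng, lat = map(float, match.groups())
    return lat, lng

def reverse_neighbourhood(wkt, geolocator):
    """Reverse geocode a point and return its neighbourhood name, or None if unavailable."""
    location = geolocator.reverse(parse_point(wkt), exactly_one=True)
    if location is None:
        return None
    address = location.raw.get("address", {})
    # Assumption: the area name may appear under either key, or under neither.
    return address.get("neighbourhood") or address.get("suburb")

lat, lng = parse_point("POINT (-74.01155687409947 40.70463665187371)")
print(lat, lng)

# Example call (needs network access, so it is left commented out):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="coffee_stall_explorer")  # hypothetical user agent
# reverse_neighbourhood("POINT (-74.01155687409947 40.70463665187371)", geolocator)
```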

In [49]:
df_ped = pd.read_csv('https://data.cityofnewyork.us/api/views/cqsj-cfgu/rows.csv?accessType=DOWNLOAD&bom=true&format=true')
df_ped = df_ped[(df_ped.Borough=='Manhattan')]

Examples of pedestrian traffic data

In [50]:
df_ped.head()

Unnamed: 0,Borough,the_geom,OBJECTID,Loc,Street_Nam,From_Stree,To_Street,Index,May07_AM,May07_PM,...,Sept17_PM,Sept17_MD,May18_AM,May18_PM,May18_MD,Sept18_PM,Sept18_MD,May19_AM,May19_PM,May19_MD
34,Manhattan,POINT (-74.01155687409947 40.70463665187371),35,35,Broad Street,Beaver Street,South William Street,Y,3469,3992,...,8303,2036,4374,6603,1756,6471,2010,4100,7302,1669
35,Manhattan,POINT (-74.01286204592034 40.70634164448266),36,36,Broadway,Morris Street,Exchange Place,Y,3660,8390,...,12650,8126,5221,19725,6818,10726,9615,5049,11765,7029
36,Manhattan,POINT (-73.98219706247882 40.77181340301184),37,37,Broadway,West 63rd Street,West 64th Street,Y,1611,6764,...,9305,4663,2059,6194,6037,7773,5259,1696,6864,4907
37,Manhattan,POINT (-74.01009312926121 40.715904559004194),38,38,Chambers Street,West Broadway,Greenwich Street,Y,7081,8512,...,9937,3302,8323,8960,3630,10456,3493,3075,6598,2934
38,Manhattan,POINT (-73.97713579908014 40.7796808276313),39,39,Columbus Avenue,West 75th Street,West 76th Street,N,1071,3037,...,3626,2977,1524,3905,3780,3794,2451,-,-,-


Description of pedestrian traffic data

In [51]:
df_ped.describe()

Unnamed: 0,OBJECTID,Loc,May09_AM,May09_PM,May09_MD,May10_AM,May10_PM,May10_MD,Sept10_AM,Sept10_PM,...,May11_MD,Sept11_AM,Sept11_PM,Sept11_MD,May12_AM,May12_PM,May12_MD,Sept15_AM,Sept15_PM,Sept15_MD
count,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0,...,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0,36.0
mean,52.5,52.5,3709.666667,9949.138889,4485.75,4096.194444,11097.277778,4615.277778,4033.916667,10709.888889,...,4989.111111,4318.944444,10446.722222,5140.527778,4120.861111,7514.805556,5268.638889,4492.611111,11652.833333,4686.0
std,10.535654,10.535654,2664.188829,6881.73954,3273.855022,2911.466757,6968.435994,2991.685064,2498.866673,6727.910071,...,3196.10469,3210.427448,5806.36891,3488.050151,2830.395946,4576.350004,3474.177437,2505.843482,5938.56118,3008.181302
min,35.0,35.0,27.0,192.0,64.0,280.0,453.0,201.0,273.0,313.0,...,487.0,233.0,141.0,1515.0,255.0,118.0,1175.0,281.0,694.0,733.0
25%,43.75,43.75,1543.75,5762.25,2240.5,1986.5,6151.75,2591.75,2378.5,6141.5,...,2700.5,1993.25,5882.0,2435.75,2101.0,4280.25,2704.5,2349.0,7205.25,2191.0
50%,52.5,52.5,3424.0,8518.0,3557.5,3907.5,9961.0,3831.0,3693.5,9001.0,...,4216.5,3759.0,10063.5,4401.0,3753.0,6798.0,4333.5,4088.0,11341.5,3625.0
75%,61.25,61.25,4778.75,11090.75,5465.0,5312.0,13007.25,5981.75,4782.75,13997.75,...,6006.5,5812.5,12380.75,6709.0,5608.5,8990.0,6343.5,6306.25,13843.25,6524.0
max,70.0,70.0,12690.0,29526.0,13971.0,13421.0,30544.0,12727.0,10010.0,30103.0,...,14182.0,14456.0,27249.0,15946.0,13645.0,18969.0,13790.0,10197.0,25687.0,11670.0


3) Coffee shop venues for each neighbourhood from the Foursquare API.

The venue search query will be used to find the number of coffee shops within each neighbourhood. The call returns JSON containing venue data such as name, unique id, category and location.

Location will contain address, country, lat, lng, and distance. All fields are strings, except for lat, lng, and distance. Distance is measured in meters. Some venues have their locations hidden for privacy reasons. 

Category is an array of categories that have been applied to a venue. One of them will be marked with a primary field.
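
As an illustration of how the primary category could be read from a venue, here is a small sketch. The `primary_category` helper is hypothetical, and the `venue` dictionary is a trimmed copy of the structure that appears in the example API response later in this section.

```python
def primary_category(venue):
    """Return the name of the venue's primary category, or None if there is none."""
    for category in venue.get("categories", []):
        if category.get("primary"):
            return category.get("name")
    return None

# Trimmed copy of one venue from the example response.
venue = {
    "name": "East One Coffee Roasters",
    "categories": [
        {"id": "4bf58dd8d48988d1e0931735", "name": "Coffee Shop", "primary": True}
    ],
}
print(primary_category(venue))
```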

JSON data will be returned from the Foursquare API call, and the number of coffee venues will be added as a column to the df_manhattan dataframe. The entries under response, venues, name will be counted.
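
The counting step might be sketched as follows. The responses here are hand-made stand-ins with invented venue names, keyed by neighbourhood, so the example runs without calling the API; the `count_venues` helper and the `coffee_shops` column name are assumptions.

```python
import pandas as pd

def count_venues(results):
    """Count the venues listed under response -> venues in a search result."""
    return len(results.get("response", {}).get("venues", []))

# Stand-in responses; in practice each would come from one API call per neighbourhood.
fake_results = {
    "Inwood": {"response": {"venues": [{"name": "Cafe A"}, {"name": "Cafe B"}]}},
    "Marble Hill": {"response": {"venues": []}},
}

df = pd.DataFrame({"Name of the neighborhood": ["Inwood", "Marble Hill"]})

# Add the venue count for each neighbourhood as a new column.
df["coffee_shops"] = df["Name of the neighborhood"].map(
    lambda name: count_venues(fake_results[name])
)
print(df)
```

Note that the count can never exceed the `limit` sent with the request, so a small limit would understate the competition in busy neighbourhoods.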

Initialising Foursquare:

In [43]:
CLIENT_ID = 'DMQJLIDJ0EYXZOY2VVA52MB1A5HHX03WA5S0YF54QQDIVCW5' # your Foursquare ID
CLIENT_SECRET = 'MZTDY3U4BC2AGDMNAFBSOH34SHPVYDVKNYND32WBOETW3A1I' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version

Example Foursquare API call for coffee shops:

In [46]:
LIMIT = 1 # limit of number of venues returned by Foursquare API
NEAR = 'Upper Manhattan'
CATID = '4bf58dd8d48988d1e0931735' # Foursquare category id for coffee shops

# build the request; passing params lets requests handle the URL encoding
url = 'https://api.foursquare.com/v2/venues/search'
params = {
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
    'v': VERSION,
    'near': NEAR,
    'limit': LIMIT,
    'categoryId': CATID}

# make the GET request
results = requests.get(url, params=params).json()
results


{'meta': {'code': 200, 'requestId': '5f42a7907f58eb2f2fe25e78'},
 'response': {'venues': [{'id': '58d933702f91cb026f478e38',
    'name': 'East One Coffee Roasters',
    'location': {'address': '384 Court St',
     'crossStreet': 'at Carroll St',
     'lat': 40.681128035266816,
     'lng': -73.99652634325895,
     'labeledLatLngs': [{'label': 'display',
       'lat': 40.681128035266816,
       'lng': -73.99652634325895}],
     'postalCode': '11231',
     'cc': 'US',
     'neighborhood': 'Carroll Gardens',
     'city': 'Brooklyn',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['384 Court St (at Carroll St)',
      'Brooklyn, NY 11231',
      'United States']},
    'categories': [{'id': '4bf58dd8d48988d1e0931735',
      'name': 'Coffee Shop',
      'pluralName': 'Coffee Shops',
      'shortName': 'Coffee Shop',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
       'suffix': '.png'},
      'primary': True}],
    'delivery