### Introduction

When establishing a location for a restaurant or any other type of business, one important consideration is co-tenancy of complementary establishments.  Traffic generators, such as a movie theatre, complement nearby restaurants, as in "dinner and a movie".

In an economically viable area, there is sure to be some ratio of industries that complement each other.  By analyzing those ratio's, we can find opportunities for unserved markets.

There is something called the Starbucks principle: a Starbucks in the neighborhood is just one indication that the property is a good buy.  Backed up by a study from real estate website Zillow, homes located within a quarter-mile of a Starbucks increased in value by 96% between 1997 and 2014.

Franchises like Starbucks' primary R&D expense is dedicated to gathering information and making predictions on the long-term value and appreciation of a property.  We don't have the real estate research capabilities of Starbucks, we'll use a Starbucks locations as a proxy in order to find the most economically viable areas.

We'll group areas of high density of starbucks locations, and then analyze each of those locations for types of establishment in the surrounding areas to confirm that the location is indeed a dynamic (and presumably economically successful) neighborhood.

Anyone opening a brick-and-mortor business would be interested in using this kind of tool as one indication of competition in the market.

### Data

We're grabbing data from the foursquare API.

We start with a map of all Starbucks locations in NY.  We get this data by looping through through the five boroughs of NY and using the /v2/venues/explore endpoint to search by 'near'.

After we have all relevant Starbucks locations, we'll work to group them by distance into clusters.

We'll further use the /v2/venues/explore endpoint to check all significant locations around each Starbucks in a cluster, group them together, and confirm that each cluster represents a dynamic economic zone with appropriate amenities by finding the surrounding venues and their categories.

### Methodology

In [400]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '' # Foursquare API version

In [95]:
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path

import folium

In [3]:
s_l = []
ids = []
for path in Path('data/ny-starbucks').glob('*'):
    path_in_str = str(path)
    with open(path_in_str) as json_file:
        starbucks = json.load(json_file)
        sb_items = starbucks['response']['groups'][0]['items']
        for loc in sb_items:
            if loc['venue']['id'] not in ids:
                ids.append(loc['venue']['id'])
                s_l.append(loc)
     
len(ids)


363

In [4]:
mpoints = []
for location in s_l:
    #print(location)
    mpoints.append( { 'lat': location['venue']['location']['lat'] ,
                     'lng': location['venue']['location']['lng'] ,
                     'id': location['venue']['id']} )
    #break
    
len(mpoints)
df = pd.DataFrame(mpoints)
df.head()

Unnamed: 0,id,lat,lng
0,4a9efc7af964a520103c20e3,40.661393,-73.840068
1,45ac15a4f964a52061411fe3,40.726219,-73.852728
2,51323ff4e4b006d9cdccfabc,40.737305,-73.877219
3,5432edbe498e591cd8fd9c13,40.702583,-73.818541
4,4bc8e36bfb84c9b613f8193e,40.781967,-73.829718


Here we create a map of all Starbucks returned by searching specifically for Starbucks in NY.  There are 363 stores returned by Foursquare, although, if you check google maps, you'll see that there may be more than that.

In [5]:
latitude = 40.7141667
longitude = -74.0063889

map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for idx, loc in df.iterrows():
    #label = '{}, {}'.format(neighborhood, borough)
    #label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [loc['lat'], loc['lng']],
        radius=5,
        popup=loc.id,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map) 

map

In [6]:
import sklearn.utils
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

sklearn.utils.check_random_state(1000)
clustered = df[['lat','lng']]
clustered = np.nan_to_num(clustered)
clustered = StandardScaler().fit_transform(clustered)

In [7]:
colors = ['red', 'blue', 'yellow', 'green', 'purple', 'orange', 'brown', 'gold', 'olive', 'pumpkin']

In [48]:
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=5).fit(clustered)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df["Clus_Db"]=labels

realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels)) 

map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for idx, loc in df.iterrows():
    #label = '{}, {}'.format(neighborhood, borough)
    #label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [loc['lat'], loc['lng']],
        radius=5,
        popup=str(loc.Clus_Db) + ' ' + str(loc.id),
        color=colors[loc['Clus_Db']],
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map) 

map

Grab each non-red cluster and assign it to a zone.

In [49]:
zones = {}

for idx, loc in df.iterrows():
    if int(loc.Clus_Db) > 0:
        if ( loc.Clus_Db-1 not in zones ):
           zones[loc.Clus_Db-1] = [] 
        zones[loc.Clus_Db-1].append(loc)
        
len(zones.keys())

4

Divide Manhattan up into zones.

In [50]:
# Compute DBSCAN
db = DBSCAN(eps=0.06, min_samples=5).fit(clustered)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df["Clus_Db"]=labels

realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels)) 

map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for idx, loc in df.iterrows():
    #label = '{}, {}'.format(neighborhood, borough)
    #label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [loc['lat'], loc['lng']],
        radius=5,
        popup=str(loc.Clus_Db) + ' ' + str(loc.id),
        color=colors[loc['Clus_Db']],
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map) 

map

In [51]:
len_keys = loc.Clus_Db+len(zones.keys())
for idx, loc in df.iterrows():
    if loc.Clus_Db >= 0:
        key_ = int(loc.Clus_Db)+int(len_keys)-2
        #print(key_)
        if ( key_ not in zones ):
           zones[key_] = [] 
        zones[key_].append(loc)
#print(len_keys)

In [52]:
# Compute DBSCAN
db = DBSCAN(eps=0.25, min_samples=5).fit(clustered)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df["Clus_Db"]=labels

realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels)) 

map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for idx, loc in df.iterrows():
    #label = '{}, {}'.format(neighborhood, borough)
    #label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [loc['lat'], loc['lng']],
        radius=5,
        popup=str(loc.Clus_Db) + ' ' + str(loc.id),
        color=colors[loc['Clus_Db']],
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map) 

map

In [53]:
len_zones = len(zones.keys())
for idx, loc in df.iterrows():
    if loc.Clus_Db in [1,0]:
        key_ = int(loc.Clus_Db)+int(len_zones)
        if ( key_ not in zones ):
           zones[key_] = [] 
        zones[key_].append(loc)

In [54]:
zone_keys = []
for zone_key in zones:
    zone_keys.append(zone_key)
    for loc in zones[zone_key]:
        loc.Clus_Db = zone_key
        
len(zones)
zone_keys

[3, 0, 1, 2, 4, 5, 6, 7, 8, 9]

In [57]:
colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, len(zone_keys)))
colors = ['red', 'blue', 'yellow', 'green', 'purple', 'orange', 'brown', 'gold', 'olive', 'black']
len(colors)

10

In [275]:

map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for zone_key in zones:
    for loc in zones[zone_key]:
        #label = '{}, {}'.format(neighborhood, borough)
        #label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [loc['lat'], loc['lng']],
            radius=5,
            popup=str(loc.Clus_Db) + ' ' + str(loc.id),
            color=colors[loc.Clus_Db],
            fill=True,
            fill_color=colors[loc.Clus_Db],
            fill_opacity=0.5,
            parse_html=False).add_to(map) 

map

We've created a few zones:
0) JFK airport
1) Staten Island
2) the Bronx
3) Brooklyn
4) Lower Manhattan
5) Wall Street
6) Mid Manhattan
7) aross the river from Wall Street in Brooklyn
8) Queens Blvd
9) Broadway + Queens

In [363]:
zone_list_array = [
    {'zone':0, 'name':'JFK airport'},
    {'zone':1, 'name':'Staten Island'},
    {'zone':2, 'name':'the Bronx'},
    {'zone':3, 'name':'East Brooklyn'},
    {'zone':4, 'name':'Lower Manhattan'},
    {'zone':5, 'name':'Wall St'},
    {'zone':6, 'name':'Mid Manhattan'},
    {'zone':7, 'name':'Brooklyn near Wall St'},
    {'zone':8, 'name':'Queens Blvd'},
    {'zone':9, 'name':'Broadway and Queens'},
]

zone_list = pd.DataFrame(zone_list_array).set_index('zone')
#zone_list

Create a function that retrieves venues in our economic zones from Foursquare and caches the results in a json file.

In [226]:
import os.path
import os
#query='Starbucks'
offset=0


def getNearbyVenues(loc_id, lat, long, offset=0, cat=None):
    
    query=''
    LIMIT=100
    radius=250
    
    path = 'data/loc/'+loc_id+'.json'
    if cat is not None:
        path = 'data/loc/'+cat+'/'+loc_id+'.json'
        dir = 'data/loc/'+cat
        if not os.path.exists(dir):
            os.makedirs(dir)
    if Path(path).is_file():
        with open(path) as json_file:
            json_ = json.load(json_file)
            total_results = json_['response']['totalResults']
            return json_
    
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&limit={}&query={}&ll={},{}&offset={}&radius={}&categoryId={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION,  
        LIMIT,
        query,
        lat,
        lng,
        offset,
        radius,
        cat
    )

    results = requests.get(url).json()

    with open(path, 'w') as outfile:  
        json.dump(results, outfile)
        
    return results

In [None]:
Use previously defined function to retrieve all data for each zone.

In [401]:
#RETRIEVE DATA FROM 
venues = {}
ids = {}
cat = {}

for zone_key in zones:
    venues[zone_key] = []
    ids[zone_key] = []
    cat[zone_key] = []
    for loc in zones[zone_key]:
        nearby = getNearbyVenues(loc.id, loc.lat, loc.lng)
        for nearby_loc in nearby['response']['groups'][0]['items']:
            if nearby_loc['venue']['id'] not in ids[zone_key]:
                category = nearby_loc['venue']['categories'][0]['name']
                if category not in cat[zone_key]:
                    cat[zone_key].append(category)
                ids[zone_key].append(nearby_loc['venue']['id'])
                venues[zone_key].append(
                    {'id':nearby_loc['venue']['id'],
                     'lat': location['venue']['location']['lat'],
                     'lng': location['venue']['location']['lng'] ,
                     'name':nearby_loc['venue']['name'],
                     'categoryId':nearby_loc['venue']['categories'][0]['id'],
                     'category':category})


zone_venue_num = []   
for zone_key in ids:
    zone_venue_num.append({'zone': zone_key, '# venues': len(ids[zone_key])})
    
zone_venues = pd.DataFrame(zone_venue_num).sort_values('zone').set_index('zone')


To confirm Foursquare API's limitations, I grabbed all 'coffee shops' in Brooklyn to see if we could get a more comprehensive list.  It did not change the final number of 'coffee shops' in zone 3, so this technique was not used.

We do confirm the limitations of Foursquare's API here:

In [237]:
#AS A TEST, GET ALL ZONE 3 (BROOKLYN) COFFEE SHOPS - 4bf58dd8d48988d1e0931735
#THIS DOESN'T INCREASE LOCATION SIZE

"""
for loc in zones[3]:
    nearby = getNearbyVenues(loc.id, loc.lat, loc.lng, 0, '4bf58dd8d48988d1e0931735')

    for nearby_loc in nearby['response']['groups'][0]['items']:
        if nearby_loc['venue']['id'] not in ids[zone_key]:
            category = nearby_loc['venue']['categories'][0]['name']
            if category not in cat[zone_key]:
                cat[zone_key].append(category)
            ids[zone_key].append(nearby_loc['venue']['id'])
            venues[zone_key].append(
                {'id':nearby_loc['venue']['id'],
                 'lat': location['venue']['location']['lat'],
                 'lng': location['venue']['location']['lng'] ,
                 'name':nearby_loc['venue']['name'],
                 'categoryId':nearby_loc['venue']['categories'][0]['id'],
                 'category':category})

    
#for zone_key in ids:
#    print(zone_key , len(ids[zone_key]))
    
#for zone_key in ids:
#    print(zone_key , len(cat[zone_key]))
    
cat[3]

venues_df = pd.DataFrame(venues[3])
grouped = venues_df.groupby('categoryId').count()
"""
print(None)

None


#### Categories
Get a list of all categories from all zones.

In [379]:
all_categories = []
for cat_key in cat:
    all_categories = all_categories + cat[cat_key]
#all_categories = sort(all_categories)
all_categories = set(all_categories)
len(all_categories)

all_cat_df = pd.DataFrame(all_categories)
all_cat_df.rename(columns={0: 'category'}).set_index('category').sort_index()

print(False)

False


Here, we group venue types together and count how many of each venue category type are in each zone.

In [398]:
def getMainVenueTypes(zone_id):
    venues_df = pd.DataFrame(venues[zone_id])
    grouped = venues_df.groupby('category')['category'].count()

    column_name = zone_list.loc[zone_id]['name']

    venues_df = pd.DataFrame(grouped)
    venues_df.columns = [ column_name ]
    return venues_df.loc[(venues_df[column_name] > 1)]


zone_venu_count = {}
for zone_id in zones:
    zone_venu_count[zone_id] = getMainVenueTypes(zone_id)

compare = pd.merge(zone_venu_count[0], zone_venu_count[1], on='category')
for zone_id in [2,3,4,5,6,7,8,9]:
    compare = pd.merge(compare, zone_venu_count[zone_id], on='category', how='outer')


### Results

We've identified the following zones with the traffic and economy able to support a retail business and the number of venues in its immediate vicinity:

In [272]:
zone_list
zone_venues
pd.merge(zone_list, zone_venues, on='zone')

Unnamed: 0_level_0,name,# venues
zone,Unnamed: 1_level_1,Unnamed: 2_level_1
0,JFK airport,28
1,Staten Island,1
2,the Bronx,84
3,East Brooklyn,65
4,Lower Manhattan,260
5,Wall St,31
6,Mid Manhattan,43
7,Brooklyn near Wall St,23
8,Queens Blvd,59
9,Broadway and Queens,129


Below is a comparison of each of the economic zones and the amenities they offer.

In [399]:
pd.options.display.max_rows = None
display(compare.sort_index().fillna(''))

Unnamed: 0_level_0,JFK airport,Staten Island,the Bronx,East Brooklyn,Lower Manhattan,Wall St,Mid Manhattan,Brooklyn near Wall St,Queens Blvd,Broadway and Queens
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
American Restaurant,,,,,3.0,,,,,
Art Gallery,,,,,,,,2.0,,
Asian Restaurant,,,,,2.0,,,,,2.0
Bagel Shop,,,,,4.0,,,,,2.0
Bakery,,,3.0,,6.0,,,,,3.0
Bank,,,2.0,3.0,3.0,,,,2.0,2.0
Bar,,,,,13.0,,2.0,,3.0,4.0
Brazilian Restaurant,,,,,3.0,,,,,2.0
Bubble Tea Shop,,,,,2.0,,,,,
Bus Station,,,2.0,,,,,,,


### Discussion


#### Limitations OR Advantages of Foursquare Data?

Foursquare uses an algorithm to return results based on feedback from Foursquare users.  This limitation may be expressed as an advantage: there may be more establishments in an area, but, if Foursquare does not return those establishments in the results of a query, it may be that this establishment is under-performing and may be taken out by competition.

#### The outlyer, Staten Island

Staten Island is an outlier with only a single venue in the area of type "Park".  If you look closely on the map, you'll also see a mall.  Further investigation should be done to determine if this is a promising place to open a business.

#### Queens Blvd

Examining this 'by hand', by looking at the map, we can see that Queens Blvd is a bit of an outlier.  All other zones are grouped together closely in a circle, while the Queens Blvd zone is grouped together in a linear fashion.  A quick wikipedia search confirms the obvious hypothesis that this zone is clustered around a major thoroughfare.  Traffic in this area is probably dominated by cars and should be further investigated if the prospective business needs more foot traffic.

### Recommendations
Any business that relies on a dynamic and economically viable neighborhood to return value to shareholders should consider any of these 9 groups after a more thorough analysis, and perhaps can consider the 'Staten Island' zone after careful consideration of their niche.

There are, of course, other types of business - there is a business that is intended to serve a community need while providing the sole proprietor with employment; this type of business may consider cheapest options and ability to bootstrap rather than volume of customers; manufacturing plants and distribution points could find better opportunities on more cost-effective grounds.  These zones do not apply to those businesses.

In addition, a prospective business owner may wish to take in account:
1. demographics
2. psychographics
3. site quality
    + 3a. visibility
    + 3b. access
4. traffic generators

### Conclusion

In this study, we found dynamic economic zones with enough traffic to support a retail business by clustering Starbucks locations and then checking surrounding venues.

Placing a business is more of an art than a science and a single study like this cannot be the sole facet of such a decision.  However, we have identified several regions of heavy foot traffic and provided some insight in the surrounding businesses that could be useful in making a decision.