The Battle of the Neighborhoods: Brooklyn Edition

Part 1: Introduction/Business Problem

Introduction:

Your friend is considering opening a coffee shop in Brooklyn, one of the five boroughs of New York City, NY. Brooklyn is an up-and-coming borough, and your friend wants help choosing a neighborhood where the shop is likely to succeed. They have asked you to apply your expertise in data analytics with Python to select a location with an increased likelihood of success.

Business Problem:

The problem to solve is which neighborhood to recommend for the coffee shop. You should use data analytics with Python to select the best neighborhood, based on neighborhood segmentation and clustering as well as an analysis of the types of venues in each neighborhood.

Target Audience/Who Would Care About It:

The target audience is your friend who is opening the coffee shop, along with any investors or stakeholders involved in the opening. This presentation recommends a neighborhood for the coffee shop and documents the data analysis performed to inform that recommendation. Your friend, investors, and stakeholders will care about the recommendation and the supporting analysis because it lets them make a data-informed decision rather than relying on intuition alone.

Part 2: Data

Dataset:

To solve this business problem, we will use the dataset of New York City neighborhoods and boroughs that was collected and stored as a shapefile by NYU. It is available at https://geo.nyu.edu/catalog/nyu_2451_34572

Example of Dataset Contents:

This dataset covers the 306 neighborhoods in New York City. Each record includes the neighborhood name, borough, latitude, longitude, geometry type, and annotation fields.

What Can Be Extracted from the Dataset:

We can extract each neighborhood's name, borough, latitude, and longitude into a pandas data frame and then filter the data frame to include only the Brooklyn borough.

How Will It Be Used:

The resulting data frame can be combined with Foursquare venue data to analyze each neighborhood and make a recommendation.

Data Importing and Cleaning:

First, let's import all of the necessary libraries. 

In [26]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform nested JSON into a flat pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


Then, let's download the New York JSON data.

In [28]:
# download the New York JSON data
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


Next, let's load and explore the data.

In [29]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

Notice that all the relevant data is under the features key. So, let's define a new variable, neighborhoods_data, that holds this list and then look at its first entry.

In [30]:
neighborhoods_data = newyork_data['features']
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

Next, let's transform the list into a pandas data frame. First, let's create an empty data frame. 

In [32]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then, let's loop through the data, fill the data frame one row at a time, and then look at the first five lines. 

In [33]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
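
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older pandas versions. A version-independent alternative is to collect plain dictionaries first and build the data frame once at the end. The sketch below shows only the collection step in plain Python; the helper name features_to_rows and the trimmed one-feature sample are illustrative and not part of the original notebook. It also makes explicit that GeoJSON stores point coordinates as [longitude, latitude].

```python
def features_to_rows(features):
    """Turn GeoJSON point features into flat dicts (hypothetical helper)."""
    rows = []
    for feature in features:
        # GeoJSON order is [longitude, latitude], so unpack in that order
        lon, lat = feature['geometry']['coordinates']
        props = feature['properties']
        rows.append({'Borough': props['borough'],
                     'Neighborhood': props['name'],
                     'Latitude': lat,
                     'Longitude': lon})
    return rows

# trimmed copy of the first feature shown earlier
sample_features = [{'geometry': {'type': 'Point',
                                 'coordinates': [-73.84720052054902, 40.89470517661]},
                    'properties': {'name': 'Wakefield', 'borough': 'Bronx'}}]

rows = features_to_rows(sample_features)
print(rows[0])
```

On the real data, pd.DataFrame(features_to_rows(neighborhoods_data), columns=column_names) should then produce the same data frame in a single call.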

In [34]:
neighborhoods.head()

  Borough Neighborhood   Latitude  Longitude
0   Bronx    Wakefield  40.894705 -73.847201
1   Bronx   Co-op City  40.874294 -73.829939
2   Bronx  Eastchester  40.887556 -73.827806
3   Bronx    Fieldston  40.895437 -73.905643
4   Bronx    Riverdale  40.890834 -73.912585

Next, let's confirm that the data for all 5 boroughs and 306 neighborhoods made it into the data frame.

In [13]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


Let's use the geopy library to get the latitude and longitude for New York City. In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

In [14]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


Then, let's create a map of New York with neighborhoods superimposed on top.

In [15]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

The business problem being presented is only considering neighborhoods in the Brooklyn borough. So, we need to simplify the map and data frame to only include Brooklyn neighborhoods. 

First, let's slice the original data frame to create a new data frame containing only the Brooklyn data.

In [35]:
brooklyn_data = neighborhoods[neighborhoods['Borough'] == 'Brooklyn'].reset_index(drop=True)
brooklyn_data.head()

    Borough Neighborhood   Latitude  Longitude
0  Brooklyn    Bay Ridge  40.625801 -74.030621
1  Brooklyn  Bensonhurst  40.611009 -73.995180
2  Brooklyn  Sunset Park  40.645103 -74.010316
3  Brooklyn   Greenpoint  40.730201 -73.954241
4  Brooklyn    Gravesend  40.595260 -73.973471

Then, let's get the geographical coordinates of Brooklyn.

In [17]:
address = 'Brooklyn, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Brooklyn are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Brooklyn are 40.6501038, -73.9495823.


As we did with all of New York City, let's visualize the neighborhoods in Brooklyn. 

In [18]:
# create map of Brooklyn using latitude and longitude values
map_brooklyn = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(brooklyn_data['Latitude'], brooklyn_data['Longitude'], brooklyn_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_brooklyn)  
    
map_brooklyn

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

First, let's define the Foursquare API credentials.

In [19]:
CLIENT_ID = '2RBQQAOS5JLFA5XFXJHJH4HYVK0BVLJY1IXCVI234QOZ4QZO' # your Foursquare ID
CLIENT_SECRET = 'VLNLIVI1NUQN1ICJQRLED0DHAE4E4K1ZZTQ2O4IHEQEXNMO4' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 2RBQQAOS5JLFA5XFXJHJH4HYVK0BVLJY1IXCVI234QOZ4QZO
CLIENT_SECRET:VLNLIVI1NUQN1ICJQRLED0DHAE4E4K1ZZTQ2O4IHEQEXNMO4
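
Because a shared notebook exposes anything typed or printed in a cell, one common alternative is to read the credentials from environment variables instead of hard-coding them. The following is a minimal sketch under that assumption; the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET and the helper load_foursquare_credentials are made up for illustration and are not defined by Foursquare or by the original notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read credentials from a mapping of environment variables (names are assumptions)."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        # fail early with a readable message instead of sending an empty key
        raise RuntimeError('Set {} before running the notebook'.format(missing)) from None

# demonstration with a stand-in mapping rather than the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID loaded:', bool(client_id))
```

In the notebook itself, calling load_foursquare_credentials() with no argument would read from the real environment, and the secret would never need to be printed.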


Then, let's explore the first neighborhood in the data frame.

In [37]:
brooklyn_data.loc[0, 'Neighborhood']

'Bay Ridge'

Next, let's get the latitude and longitude values of this neighborhood.

In [38]:
neighborhood_latitude = brooklyn_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = brooklyn_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = brooklyn_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Bay Ridge are 40.625801065010656, -74.03062069353813.


Now, let's get the top 100 venues that are in Bay Ridge within a radius of 500 meters.

First, let's create the GET request URL and store it in a variable named url.

In [39]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=2RBQQAOS5JLFA5XFXJHJH4HYVK0BVLJY1IXCVI234QOZ4QZO&client_secret=VLNLIVI1NUQN1ICJQRLED0DHAE4E4K1ZZTQ2O4IHEQEXNMO4&v=20180605&ll=40.625801065010656,-74.03062069353813&radius=500&limit=100'
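
The URL above is assembled by string formatting. As a sketch of an alternative, the same query string can be built from a dictionary with the standard library, which handles escaping and keeps each parameter next to its name. The helper name build_explore_url and the placeholder credentials are illustrative; the endpoint and parameter names are the ones already used in the cell above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Assemble the venues/explore URL from named parameters (hypothetical helper)."""
    base = 'https://api.foursquare.com/v2/venues/explore'
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll value readable instead of %2C
    return base + '?' + urlencode(params, safe=',')

url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605',
                        40.625801065010656, -74.03062069353813)
print(url)
```

The requests library also accepts such a dictionary directly, as in requests.get(base, params=params), which avoids building the string by hand.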

Send the GET request and examine the results. 

In [41]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f320de3e48ac773fabe1003'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Bay Ridge',
  'headerFullLocation': 'Bay Ridge, Brooklyn',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 83,
  'suggestedBounds': {'ne': {'lat': 40.63030106951066,
    'lng': -74.02470273356597},
   'sw': {'lat': 40.62130106051065, 'lng': -74.03653865351028}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b895827f964a5206c2d32e3',
       'name': 'Pilo Arts Day Spa and Salon',
       'location': {'address': '8412 3rd Ave',
        'lat': 40.62474788273414,
        'lng': -74.03059056940135,
        'labeledLatL

Then, let's define the get_category_type helper, which extracts the name of a venue's first category.

In [42]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # flattened rows use the prefixed column name instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now, let's clean the JSON and structure it into a pandas data frame.

In [43]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()
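
The dotted column names used in this cell, such as venue.location.lat, come from flattening the nested JSON: each level of nesting becomes one segment of the column name, while lists such as categories are left intact. The plain-Python sketch below illustrates that naming scheme only. It is not how pandas implements json_normalize, and the function name flatten is made up for the example. The sample record reuses values from the first venue in the response above.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into one level with dotted keys (illustration only)."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # descend one level and carry the prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists such as 'categories' are kept as-is
    return flat

item = {'venue': {'name': 'Pilo Arts Day Spa and Salon',
                  'categories': [{'name': 'Spa'}],
                  'location': {'lat': 40.62474788273414,
                               'lng': -74.03059056940135}}}
flat_item = flatten(item)
print(sorted(flat_item))
```

This is why get_category_type still has to pick the first element out of the categories list after the flattening step.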

Then, let's print the number of venues returned by Foursquare for that neighborhood.

In [44]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

83 venues were returned by Foursquare.


Let's use Foursquare to examine all neighborhoods in Brooklyn. 

In [45]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
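
One caveat: the list comprehension in getNearbyVenues indexes categories[0] directly, so a venue returned with an empty categories list would raise an IndexError, whereas get_category_type guards against that case. The sketch below shows one possible way to make the row extraction tolerant. The helper name extract_venue_rows is hypothetical, and the second sample venue is made up to exercise the empty-category path.

```python
def extract_venue_rows(name, lat, lng, items):
    """Build one tuple per venue, using None when a venue has no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

sample_items = [
    {'venue': {'name': 'Bagel Boy',
               'location': {'lat': 40.627896, 'lng': -74.029335},
               'categories': [{'name': 'Bagel Shop'}]}},
    # made-up venue with no category, to show the fallback
    {'venue': {'name': 'Uncategorized Example',
               'location': {'lat': 40.0, 'lng': -74.0},
               'categories': []}},
]

venue_rows = extract_venue_rows('Bay Ridge', 40.625801, -74.030621, sample_items)
print(venue_rows[1][-1])
```

Inside getNearbyVenues, the comprehension could be replaced by a call such as venues_list.append(extract_venue_rows(name, lat, lng, results)).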

Then, let's use this function to create a new data frame called brooklyn_venues.

In [46]:
brooklyn_venues = getNearbyVenues(names=brooklyn_data['Neighborhood'],
                                   latitudes=brooklyn_data['Latitude'],
                                   longitudes=brooklyn_data['Longitude']
                                  )



Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker Heights
Gerritsen Beach
Marine Park
Clinton Hill
Sea Gate
Downtown
Boerum Hill
Prospect Lefferts Gardens
Ocean Hill
City Line
Bergen Beach
Midwood
Prospect Park South
Georgetown
East Williamsburg
North Side
South Side
Ocean Parkway
Fort Hamilton
Ditmas Park
Wingate
Rugby
Remsen Village
New Lots
Paerdegat Basin
Mill Basin
Fulton Ferry
Vinegar Hill
Weeksville
Broadway Junction
Dumbo
Homecrest
Highland Park
Madison
Erasmus


Let's check the shape and first few rows of the data frame. 

In [47]:
print(brooklyn_venues.shape)
brooklyn_venues.head()

(2762, 7)


  Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0    Bay Ridge              40.625801              -74.030621   
1    Bay Ridge              40.625801              -74.030621   
2    Bay Ridge              40.625801              -74.030621   
3    Bay Ridge              40.625801              -74.030621   
4    Bay Ridge              40.625801              -74.030621   

                         Venue  Venue Latitude  Venue Longitude  \
0  Pilo Arts Day Spa and Salon       40.624748       -74.030591   
1                    Bagel Boy       40.627896       -74.029335   
2                 Pegasus Cafe       40.623168       -74.031186   
3          Leo's Casa Calamari       40.624200       -74.030931   
4                Cocoa Grinder       40.623967       -74.030863   

   Venue Category  
0             Spa  
1      Bagel Shop  
2  Breakfast Spot  
3     Pizza Place  
4       Juice Bar  

Next, let's check how many venues were returned for each neighborhood.

In [48]:
brooklyn_venues.groupby('Neighborhood').count()

                           Neighborhood Latitude  Neighborhood Longitude  \
Neighborhood                                                               
Bath Beach                                    45                      45   
Bay Ridge                                     83                      83   
Bedford Stuyvesant                            28                      28   
Bensonhurst                                   30                      30   
Bergen Beach                                   5                       5   
Boerum Hill                                   87                      87   
Borough Park                                  24                      24   
Brighton Beach                                44                      44   
Broadway Junction                             18                      18   
Brooklyn Heights                             100                     100   
Brownsville                                   17                      17   
Bushwick    
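
The table repeats the same count in every column because count() tallies non-null values per column. When only the number of venues per neighborhood is needed, a single tally is enough. The sketch below shows the idea in plain Python on a few stand-in rows; the third venue name is made up, and sample_rows is not part of the original notebook.

```python
from collections import Counter

# (neighborhood, venue) pairs standing in for rows of brooklyn_venues
sample_rows = [('Bay Ridge', 'Bagel Boy'),
               ('Bay Ridge', 'Pegasus Cafe'),
               ('Bath Beach', 'Example Venue')]

# one count per neighborhood instead of one count per column
venues_per_neighborhood = Counter(neighborhood for neighborhood, _ in sample_rows)
print(venues_per_neighborhood.most_common())
```

On the real data frame, brooklyn_venues.groupby('Neighborhood').size() should give the equivalent single column of counts.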

Finally, let's find out how many unique categories appear across all the returned venues.

In [49]:
print('There are {} unique categories.'.format(len(brooklyn_venues['Venue Category'].unique())))

There are 286 unique categories.


Now, we have imported and cleaned all necessary data into Python as well as created a map of Brooklyn. We are ready to proceed with analyzing each neighborhood to select the optimal one to recommend to your friend. 