# Introduction - NYC Neighborhood Tool

### Business Need: Find affordable neighborhoods in Manhattan and Brooklyn that have a large number of venues of types a particular user is interested in.

This project creates a tool that can be used to identify neighborhoods in the two New York City boroughs of Manhattan and Brooklyn that might be attractive to people interested in some set of venues, ranking them both by the absolute number of venues, and the number of venues relative to rent in that area. For instance, one person might want to find an affordable neighborhood that has a high proportion of jazz clubs and movie theaters, because they play saxophone and love to go to movies. Another person might want to find a neighborhood with electric vehicle charging stations, good bakeries, and gay bars. A user of this tool can specify any set of FourSquare venue categories they're interested in.

This question is interesting to me because I have two family members who are considering moving to New York City, and they would find this tool useful.

The work is divided into **five major tasks**:
1. Retrieving data from FourSquare
2. Data cleanup
3. Data merging & computation
4. Mapping
5. Machine learning (and more mapping)

The following data is used to accomplish those tasks:
- The latitude and longitude of sections of Manhattan and Brooklyn, used to request data specific to those two boroughs from FourSquare.
- The list of categories and sub-categories retrieved from FourSquare.
- Lists of venues from FourSquare that match the user-specified categories and geographic boundaries.
- Reverse geocoding data from Bing to retrieve missing zip codes.
- The zip code(s) in each neighborhood in the two boroughs.
- The population of each zip code in the two boroughs.
- A geojson file with the boundaries of each zip code.
- The median cost of a 1-bedroom apartment rental in each neighborhood.

The tool generates both text and map data. In addition to a list of neighborhoods sorted by venue density, and a list sorted by venue density and affordability, this tool creates the following maps:
- A **choropleth map** of the neighborhoods, shaded based on their attractiveness value, which takes venue density and median rent into account.
- An overlay of neighborhood boundaries on a **heatmap of venue density**, making it possible to visually compare density of the selected venue categories among the neighborhoods. 
- A **cluster map** that enables visual exploration of clusters of venues, regardless of neighborhood boundaries.

In [1]:
# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_2 = {
    'IAM_SERVICE_ID': 'iam-ServiceId-05670e26-425c-4505-8824-ef9ea437d9c0',
    'IBM_API_KEY_ID': 'MaNADuPAUY6W0HC-8FxqIqpgji1yUUm6WVPlm9ulZIOW',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.ng.bluemix.net/oidc/token',
    'BUCKET': 'nystesting-donotdelete-pr-yczufeevob9sat',
    'FILE': 'ManhattanLatLong.csv'
}

bingMapsKey = "AvDnxBjHKE2Y9SU1HRUpB6-79B0Bk_ldW-LCvqWkLWedsqWrlBX0kvkvl4JurmDO"


In [2]:
# @hidden_cell
# The project token is an authorization token that is used to access IBM Cloud-based project resources like data sources, connections, and used by platform APIs
from project_lib import Project
project = Project(project_id='994f2a2f-d34c-428b-8963-154fd9fac775', project_access_token='p-a6c7ad6d54b86c48f27bb6924d28f925b70b5c08')
pc = project.project_context

In [3]:
# @hidden_cell
# FourSquare API credentials
CLIENT_ID = 'UYWTHYG53L3ENHJ4O50HDEHZKE51X0NPSR3YCXC1I0JCDGRA' # your Foursquare ID
CLIENT_SECRET = 'OPBF0AHVEAL11QB0HZUIWHP2BAUWQLZY4U0ZF10USQQVF0PJ' # your Foursquare Secret
numFourSquareCalls = 0

In [4]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import io
import json

# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

# Enable HTML output in this notebook
from IPython.display import display, HTML

# import necessary packages to work with spatial data in Python
import os
import math
import numpy as np
import geopandas as gpd

In [5]:
### NO
#import matplotlib.pyplot as plt
#import matplotlib.lines as mlines
#from matplotlib.colors import ListedColormap

#from tabulate import tabulate


In [5]:
# Set VERBOSE = True for debug-level logging
VERBOSE = False

## 1. Retrieve data from FourSquare
Several steps are needed in order to retrieve the venue information we're looking for from FourSquare:
- Retrieve and flatten the complete list of FourSquare venue categories  
- Specify which FourSquare venue categories we're investigating  
- Read in CSV files defining latitude and longitude rectangles we'll use to query FourSquare
- Query FourSquare for venues in Manhattan and Brooklyn in the user-specified categories

### 1.1 Retrieve and flatten the complete list of FourSquare venue categories  
The category data from FourSquare is hierarchical, with categories (e.g., Music Venue) that have sub-catgories (e.g. Dance Club), which sometimes have sub-sub-categories (Turkish Dance Club). After retrieving this data, we'll use a **recursive function** to flatten that structure into a non-hierarchical list of category names and category IDs. This will enable us to look up the category ID of any category a user wants to retrieve data about (e.g., figure out that "Turkish Dance Club" ID = 195 without having to worry about the fact that in the FourSquare data, that information is several levels deep: "Music Venue/Dance Club/Turkish Dance Club" ID = 195).

In [6]:
# This recursive function will keep burrowing down, flattening sub(sub-)-categories returned by FourSquare so we
# have a single, flat list of category names and category IDs
def flattenSubCategories(parentCategoryName, subCatSeries, allCategoriesList, nDepth):
    subCatName = subCatSeries['name']
    subCatID = subCatRow['id']
    rowData = [subCatName, parentCategoryName, subCatID]
    allCategoriesList.append(rowData)
    
    subSubCats = subCatSeries['categories']
    for subSubCatRow in subSubCats:
        flattenSubCategories(subCatName, subSubCatRow, allCategoriesList, nDepth+1)

In [7]:
###### Retrieve all categories and sub-categories from FourSquare
VERSION = '20180604'
url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION)
results = requests.get(url).json()
numFourSquareCalls = numFourSquareCalls + 1

# assign relevant part of JSON to categories
categories = results['response']['categories']

# tranform categories into a dataframe
dfCategories = json_normalize(categories)

# Each category in the DF contains sub-categories. We want all the
# categories and sub-categories flattened into one DF.
# Create an empty list. We'll keep appending categories and 
# sub-categories to that list via the recursive function flattenSubCategories, 
# then we'll turn that list into a DF
allCategoriesList = []
nRowCount = 0
for row in dfCategories.itertuples():
    # First, put the parent category into the list
    # Use the category name for both the parent and category names
    rowData = [row.name, row.name, row.id]
    parentCategoryName = row.name;
    allCategoriesList.append(rowData)

    # Make a dictionary of the sub-categories of this category
    subCats = row.categories
    nRowCount = nRowCount + 1

    # Recursively iterate over the sub-categories, adding each one to the list
    # Include the parent category's name
    for subCatRow in subCats:
        flattenSubCategories(parentCategoryName, subCatRow, allCategoriesList, 1)
        
# Now make a Dat of the flattened categories and sub-categories
dfAllCategories = pd.DataFrame(allCategoriesList)
dfAllCategories.columns = ['name', 'parentCategoryName', 'id']

# Set the index of the DF to the category/sub-category name, because we'll
# retrieve IDs by name
dfAllCategories = dfAllCategories.set_index(['name'])

print("There are " + str(dfAllCategories.shape[0]) + " categories and sub-categories in FourSquare")

There are 937 categories and sub-categories in FourSquare


### 1.2 Specify which FourSquare venue categories we're investigating.  
A user of this notebook can define which categories they want to analyze and map. For each category, they should also define what color they want map markers to be for that category, and which icon to use for each category on the Folium maps.

In [12]:
# You can change and/or add categories to explore in categoriesToFind
#categoriesToFind = ['Movie Theater', 'Juice Bar']
#categoriesToFind = ['Jazz Club','Burrito Place']
categoriesToFind = ['Jazz Club']
#categoriesToFind = ['Food', 'Clothing Store']

categoryIDsToFind = []
for aCategory in categoriesToFind:
    categoryIDsToFind.append(dfAllCategories.loc[aCategory]['id'])
    print('{} category\'s ID is {}'.format(aCategory, dfAllCategories.loc[aCategory]['id']))
    
# For each category listed above, define the map marker color and icon to use
# Icons are available from these sites:
#   https://fontawesome.com/icons?d=gallery&m=free
#   https://getbootstrap.com/docs/3.3/components/
# The fields are category, color, icon_prefix, and icon_name
#                    {'category':categoriesToFind[1], 'color':'orange', 'icon_prefix':'glyphicon', 'icon_name':'glyphicon-cutlery'}
categoryMapSpecs = [
                    {'category':categoriesToFind[0], 'color':'black', 'icon_prefix':'fa', 'icon_name':'fa-music'},
#                    {'category':categoriesToFind[1], 'color':'orange', 'icon_prefix':'glyphicon', 'icon_name':'glyphicon-cutlery'}
#                    {'category':categoriesToFind[1], 'color':'red', 'icon_prefix':'fa', 'icon_name':'fa-film'}
                   ]
dfMapMarkers = pd.DataFrame.from_dict(categoryMapSpecs)
dfMapMarkers.set_index('category', inplace=True)

Jazz Club category's ID is 4bf58dd8d48988d1e5931735


### 1.3 Read in CSV files defining latitude and longitude rectangles we'll use to query FourSquare


In [9]:
# Read the file containing the overlapping rectangles for Manhattan
# This file is stored in IBM Cloud Storage and associated with this project
wrapper = io.TextIOWrapper(project.get_file("ManhattanLatLong.csv"), encoding='utf-8')
dfManhattanLatLon = pd.read_csv(wrapper)
print("Manhattan Lat/Lon sample:")
dfManhattanLatLon.head(2)

Manhattan Lat/Lon sample:


Unnamed: 0,SWLAT,SWLON,NELAT,NELON
0,40.702142,-74.020323,40.707819,-74.00277
1,40.708046,-74.020688,40.715301,-73.981388


In [10]:
# Read the file containing the overlapping rectangles for Brookly
wrapper = io.TextIOWrapper(project.get_file("BrooklynLatLong.csv"), encoding='utf-8')
dfBrooklynLatLon = pd.read_csv(wrapper)
print("Brooklyn Lat/Lon sample:")
dfBrooklynLatLon.head(2)

Brooklyn Lat/Lon sample:


Unnamed: 0,SWLAT,SWLON,NELAT,NELON
0,40.586324,-74.049198,40.6,-73.894187
1,40.599,-74.049198,40.615,-73.894187


### 1.4 Query FourSquare for venues in Manhattan and Brooklyn in the user-specified categories

FourSquare can be queried for venues in a geographic rectangle by specifying the Southwest and Northeast corners of the rectangle. But FourSquare will return at most 50 venues for one query, so we have to break our queries up into separate, small geographic slices of Manhattan and Brooklyn. The slicing of latitude and longitudes is done programmatically in the code below.

This slicing into overlapping rectangles will result in some duplicate venues from FourSquare (partly due to their rounding of the lat/lon values); we'll get rid of the duplicate data later on.

In [13]:
# Make an enpty dataframe that will contain all the venues found for 
# all the categories of interest, in all the rectangles in Manhattan and Brooklyn
dfAllVenues = pd.DataFrame()

# Merge the two Lat/Lon CSV-based DataFrames
dfTwoBoroughsLatLon = pd.concat([dfManhattanLatLon, dfBrooklynLatLon])

for aCategory in categoriesToFind:
    # Get the ID of the category
    categoryName = aCategory
    categoryID = dfAllCategories.loc[categoryName]['id']
    print('Retrieving ' + categoryName)
    
    numVenuesRetrieved = 0
    numRoundedVenuesRetrieved = 0

    # Iterate over the rectangles, retrieving a list of venues for the category
    for row in dfTwoBoroughsLatLon.itertuples():
        swLat = row.SWLAT
        swLon = row.SWLON
        neLat = row.NELAT
        neLon = row.NELON
        
        # Every degree of latitude is approximately 69 miles 
        # (a little more at the North and South poles, a little less in the middle,
        # but 69 is good enough for our needs).
        # Because FourSquare will only return 50 venues at a time,
        # we need to request small slices of latitude, because in certain
        # areas of NYC, there are a lot of juice bars (or movie theaters, or whatever).
        # So figure out how many slices we need to limit the vertical size of the 
        # slice to about 1/10th of a mile
        differenceInLatitude = (neLat-swLat)
#        numLatSlices = math.ceil((differenceInLatitude*69) / 0.10)
        numLatSlices = 4
        #print('neLat {} swLat {} differenceInLatitude={} numLatSlices {}'.format(neLat, swLat, differenceInLatitude, numLatSlices))
        
        # Break the rectangle up into numLatSlices rectangles,  and create a list
        # with numLatSlices values for swLat, so that a smaller area gets retrieved from FourSquare.
        # This is necessary because some venue categories have more than the maximum
        # number of venues (50) FourSquare will return from one call.
        swLatList = np.linspace(swLat, neLat, numLatSlices, endpoint=True)
        numSWLats = len(swLatList)
        
        # We also need to slice the longitude, because requesting 1/10th of mile latitude by 2.5 miles longitude
        # still results in more than 50 venues sometimes.
        numLonSlices = 4
        lonList = np.linspace(swLon, neLon, numLonSlices, endpoint=True)
        numLons = len(lonList)-1

        #print(str(numFourSquareCalls) + ' ' + str(swLat))
        for nWhichSWLat in range(numSWLats):
            swLat = swLatList[nWhichSWLat]

            for nWhichSWLon in range(numLons):
                # Make sure the longitudes overlap a little; do this by
                # making swLon a little more negative
                swLon = lonList[nWhichSWLon] - 0.0005
                neLon = lonList[nWhichSWLon+1]
                
                if (VERBOSE):
                    print('{},{} {},{}  {},{}'.format(nWhichSWLat, nWhichSWLon, swLat, swLon, neLat, neLon))

                url = 'https://api.foursquare.com/v2/venues/search?intent=browse&client_id={}&client_secret={}&sw={},{}&ne={},{}&v={}&categoryId={}'\
    .format(CLIENT_ID, CLIENT_SECRET, swLat, swLon, neLat, neLon, VERSION, categoryID)
                numFourSquareCalls = numFourSquareCalls + 1
                #print(url)
                results = requests.get(url).json()
                httpResponseCode = results['meta']['code']
                if (httpResponseCode == 429):
                    print('*** Exceeded FourSquare API call quota after ' + str(numFourSquareCalls) + ' calls ***')

                # assign relevant part of JSON to venues
                venues = results['response']['venues']

                # tranform venues into a dataframe
                dfOneRect = json_normalize(venues)
                dfOneRect['category'] = categoryName

                numVenuesInRect = dfOneRect.shape[0]
                if (numVenuesInRect >= 50):
                    print('Retrieved the maximum number of venues {} for category {} with SW latitude {} numLatSlices {} numLonSlices {}'.format(numVenuesInRect, categoryName, swLat, numLatSlices, numLonSlices))

                numVenuesRetrieved = numVenuesRetrieved + numVenuesInRect
                if (numVenuesRetrieved - numRoundedVenuesRetrieved >= 400):
                    numRoundedVenuesRetrieved = round(numVenuesRetrieved, -2)
                    print("  Retrieved {} venues".format(numRoundedVenuesRetrieved))

                if (VERBOSE):
                    print('Retrieved ' + str(dfOneRect.shape[0]) + ' for category ' + categoryName + ' with SW latitude ' + str(swLat))
                    print()
                    
                # append that dataframe to the one containing all rectangles' venues
                dfAllVenues = pd.concat([dfAllVenues, dfOneRect], sort=False)
        
# Drop the columns other than the ones listed in the next line
dfAllVenues.drop(dfAllVenues.columns.difference(['categories', 'id', 'location.address', 'location.lat', 'location.lng',\
                                                 'location.postalCode', 'name', 'category']), 1, inplace=True)
# Assign column names to dfAllVenues
newColumnNames = ['categories', 'id', 'lat', 'lon', 'name', 'category',  'address', 'zip' ]
dfAllVenues.columns = newColumnNames
dfAllVenues.head()


Retrieving Jazz Club
  Retrieved 400 venues
  Retrieved 800 venues
  Retrieved 1200 venues
  Retrieved 1600 venues
  Retrieved 2000 venues
  Retrieved 2400 venues
  Retrieved 2800 venues


Unnamed: 0,categories,id,lat,lon,name,category,address,zip
0,"[{'id': '4bf58dd8d48988d1e5931735', 'name': 'M...",57513529498e3c78ff9efc26,40.70588,-74.01452,"Cole Auditorium, Greenwich Library",Jazz Club,,
0,"[{'id': '4bf58dd8d48988d1e7931735', 'name': 'J...",50c40f33e4b0fac7b2632883,40.707199,-74.010002,Blue Notes,Jazz Club,,10005.0
1,"[{'id': '4bf58dd8d48988d121941735', 'name': 'L...",4ce3a4aa2f842d43e66aab1c,40.705117,-74.011551,9W2RLW@NY,Jazz Club,Beaver Street,10004.0
2,"[{'id': '4bf58dd8d48988d171941735', 'name': 'E...",4bf5a9d1abdaef3ba718a184,40.704923,-74.014029,"AICH, Nyc",Jazz Club,11 Broadway Lbby 2,10004.0
3,"[{'id': '5032792091d4c4b30a586d5c', 'name': 'C...",42fd3800f964a520ee261fe3,40.704997,-74.012614,Alwan for the Arts,Jazz Club,16 Beaver St Ste 4,10004.0


In [21]:
### DEBUGGING
if (VERBOSE):
    print(categoriesToFind)
    print(dfAllVenues.columns)
    print(len(dfAllVenues))
    numFound = 0
    for ven in dfAllVenues.itertuples():
        venName = ven.name
        for aCat in ven.categories:
            aVen = ven.categories[0]
            df = json_normalize(aCat)
            print('{}, {}'.format(venName, aCat))
            #catName = df['name'].reset_index()
    print(numFound)

In [19]:
# Let's see how many venues we have of each type. There will be many duplicate
# venues, so we'll see this number go down as we clean up the data
tableIntro = HTML('<h3>Here are the number of venues found for each catetory. There are many duplicates that we\'ll remove soon.</h3>')
display(tableIntro)
df = dfAllVenues.groupby(['category']).name.agg('count').to_frame('Number')
df.head()

Unnamed: 0_level_0,Number
category,Unnamed: 1_level_1
Jazz Club,3074


# 2. Data Cleanup
There are several different types of data cleanup needed:
- Removing duplicate venues
- Filtering out irrelevant results
- Looking up missing zip codes

### 2.1 Data Cleanup Step 1: Removing duplicate venues
**We will have received duplicate venues from FourSquare, for two reasons:**
1. Our lat/lon rectangles were slightly overlapped (intentionally) to ensure complete geographic coverage  
2. FourSquare rounds the lat/lon values  

Let's drop the entries with the same venue ID and category name (if we drop based only on venue ID, then a venue that appeared in more than one category we searched for would get dropped from all but one of those categories)

In [20]:
numRowsWithDups = len(dfAllVenues)
dfAllVenues.drop_duplicates(keep='first', inplace=True, subset=['id', 'name']) 
numRowsWithoutDups = len(dfAllVenues)
print('Dropped {} duplicate venues, {} venues remain'.format(numRowsWithDups-numRowsWithoutDups, numRowsWithoutDups))

Dropped 1791 duplicate venues, 1283 venues remain


In [16]:
### NO 
# FourSquare returns venues that are somewhat-related to the categories we
# specified, but those categories are not their main purpose. For instance,
# it returns Madison Square Garden for just about any venue query, because
# that venue is used for so many different purposes. We need to work with
# a manageable number of venues, so let's eliminate ones whose main
# category is not one of the one's we're looking for
dfWrongCategory = dfAllVenues[~dfAllVenues['name'].isin(categoriesToFind)]
print(len(dfWrongCategory))
print(dfAllVenues['name'].unique())

1285
['Cole Auditorium, Greenwich Library' "DJ Eug's Sick Recording Studio"
 '9W2RLW@NY' ... '3SM Studios' 'Manchitas' 'Life of Fire']


### 2.2 Data Cleanup Steps 2 & 3: Filtering out irrelevant results, and looking up missing zip codes
**Two kinds of data cleanup are necessary at this step.**
1. When you query FourSquare for a specific sub-category (or sub-sub-category), such as **'Jazz Clubs'**,
   it returns venues that are sort-of, kind-of, somewhat related, such as **'Rock Music Clubs'** or **'Dance Clubs.'**
   It returns Madison Square Garden for almost every query, because many different types of events
   take place at the Garden, from Knicks basketball games to rock concerts to monster truck shows.
   As a result, a query for **Jazz Clubs** returns several thousand venues in Manhattan
   and Brooklyn, but there aren't several thousand jazz clubs there.
   This tool is designed to help people find specific types of venues, not venues that may once
   in a while host the type of event they're interested in, so we need to 
   filter these less-relevant results out. Another reason to filter them out is that Folium doesn't 
   seem to be able to display a map with 2,500 cluster points.
   
   To eliminate spurious venues, we'll iterate through the list of all venues that FourSquare returned,
   and look at the first category listed for the venue. If that first category, which specifies the
   venue's primary usage, does not match one of the categories we're looking for, we'll discard that venue.  
     
   
2. Some results from FourSquare have lat/lon, but no zip code, and we need each venue's zip code
   in order to figure out which neighborhood it's in.
   
   To fixup records without zip codes, we'll use **reverse geocoding from Bing** to retrieve the missing zip codes

In [21]:
print("Before processing, there are " + str(len(dfAllVenues)) + " venues in our list")

# Make an empty list that we'll build up with venues we're keeping
allVenuesCleanList = []

# Ensure that zip code is a string so that '01234' isn't interpreted as 1234
dfAllVenues.zip = dfAllVenues.zip.astype(str)

# Keep track of what we do
numVenuesDiscarded = 0
numFound = 0
numZipsChanged = 0
numProcessed = 0
numRoundedProcessed = 0

# Iterate over all of the data
for index, row in dfAllVenues.iterrows():
    try:
        # First, see if we even want to keep this row. Discard it if its
        # first category name is not one of the ones we're looking for. It might
        # seem better to match by categoryID, but FourSquare returns the
        # categoryID we queried for, regardless of what a venue's
        # main category is.
        primaryCategory = row['categories'][0]
        dfCategory = json_normalize(primaryCategory)
        catName = dfCategory.iloc[0]['name']
        if (catName in categoriesToFind):
            numFound = numFound + 1
            theCategory = row['category']
            theID = row['id']
            theAddress = row['address']
            theLat = row['lat']
            theLon = row['lon']
            theZip = row['zip']
            theName = row['name']
            
            if (VERBOSE):
                print('{}, {}, {}'.format(theName, theCategory, theAddress))
                
            # Is the record missing its zip code?
            bUpdateZip = theZip == 'nan' or (len(str(theZip)) != 5)
            if (bUpdateZip):
                numZipsChanged = numZipsChanged + 1
                lat = row['lat']
                lon = row['lon']
                
                # Do a reverse geocode call to Bing to get the zip code for this lat/lon
                bingURL = 'https://dev.virtualearth.net/REST/v1/locationrecog/{},{}?key={}&r=.5&distanceUnit=mi&includeEntityTypes=address&output=json'\
                .format(lat,lon,bingMapsKey)

                if (VERBOSE):
                    print('Zip is "{}" for {} {}, {}'.format(row.zip, index, lat, lon))

                results = requests.get(bingURL).json()
                theZip = results['resourceSets'][0]['resources'][0]['addressOfLocation'][0]['postalCode']
                if (VERBOSE):
                    print('Changing zip from "{}" to {} for {}, {}'.format(row.zip, theZip, lat, lon))
        
            # Now that we know we want to keep this venue, and we have its zip code, add it to the list of records we'll keep
            rowData = [theCategory, theID, theAddress, theLat, theLon, theZip, theName]
            allVenuesCleanList.append(rowData)
        
            numProcessed = numProcessed + 1
            if (numProcessed - numRoundedProcessed >= 400):
                numRoundedProcessed = round(numProcessed, -2)
                print("  Processed {} venues".format(numRoundedProcessed))

        else:
            numVenuesDiscarded = numVenuesDiscarded + 1
        
    except Exception as e: 
        pass # doing nothing on exception, we'll filter out rows w/o zip codes later
        print(e)
    
print('{} venues kept, {} venues discarded, {} zip codes fixed, {} unique zip code'.format(numFound, numVenuesDiscarded, numZipsChanged,\
                                                                                          len(dfAllVenues['zip'].unique())))
# Make a new DataFrame from the list, and fixup the column names
dfAllVenuesClean = pd.DataFrame(allVenuesCleanList)
newColumnNames = ['category', 'id', 'address', 'lat', 'lon', 'zip', 'name']
dfAllVenuesClean.columns = newColumnNames

# Let's see how many we have of each type now
df = dfAllVenuesClean.groupby(['category']).name.agg('count').to_frame('Number')
df.head()

Before processing, there are 1283 venues in our list
95 venues kept, 1188 venues discarded, 16 zip codes fixed, 93 unique zip code


Unnamed: 0_level_0,Number
category,Unnamed: 1_level_1
Jazz Club,95


In [26]:
### DEBUGGING
if (VERBOSE):
    dfAllVenues = dfAllVenuesClean
    print(dfAllVenues['zip'].unique())

# 3. Data Merging
**We need to do several tasks to merge the various data we have into what we need:**  
1. Read in a file that maps each zip code to the neighborhood it represents
2. Determine the neighborhood of each venue and remove venues that are outside the neighborhoods we're investigating
3. Read in a file that has the population and average rent of each neighborhood so we can calculate the density of venues in each neighborhood, and compare the cost of living in the different neighborhoods    
4. Group the venue data by borough (Manhattan or Brooklyn) and neighborhood to determine the number of venues in each neighborhood (venue count)
5. Merge the population/rent data with the venue count data
6. Calculate the density of venues in each neighborhood. Density is defined as the number of venues per 10,000 residents of the neighborhood.
7. Calculate the 'attractiveness' of each neighborhood, defined as the venue density divided by the median monthly rent of a 1-bedroom apartment in that neighborhood (`rent / $1,000`, so `$2,500 rent = 2.5`)

### 3.1 Data Merging: Read in Zip Codes-to-Neighborhood file

In [25]:
# Read in the CSV file 'Brooklyn and Manhattan Zip Codes By Neighborhood.csv'.
# I created this file by combining data from several files. The CSV file
# has these columns:
#  Borough: Either Brooklyn or Manhattan
#  SubBoroNumber: An identifier created by New York City to identify sections of each borough
#  Name: The name of the neighborhood(s) (sub-boro)
#  ZipCode: One of the zip codes in the neighborhood(s)
# There are multiples lines for most neighborhoods in the CSV file, because
# most neighborhoods contain more than one zip code

wrapper = io.TextIOWrapper(project.get_file("Brooklyn and Manhattan Zip Codes By Neighborhood.csv"), encoding='utf-8')
dfNeighborhoodZipcodes = pd.read_csv(wrapper)

# Lets convert ZipCode from int to string, so that zips with leading zeroes (e.g., 02123) work.
# NYC doesn't have any zips with leading zeroes, but it's good data management
# to ensure that code will work with any valid data that's out there, not just the 
# particular data you're using at the moment
dfNeighborhoodZipcodes.ZipCode = dfNeighborhoodZipcodes.ZipCode.astype(str)

# Ensure that Borough and Name columns are strings
dfNeighborhoodZipcodes.Borough = dfNeighborhoodZipcodes.Borough.astype(str)
dfNeighborhoodZipcodes.Name = dfNeighborhoodZipcodes.Name.astype(str)

info = HTML("<h4>Let's test our int-to-string conversion by checking which neighborhood the zip code '11201' (a string, not the int 11,201) is in</h4>")
display(info)
# Let's test our int-to-string conversion by checking which neighborhood the 
# zip code "11201" (a string, not the int 11201) is in
print(dfNeighborhoodZipcodes.loc[dfNeighborhoodZipcodes['ZipCode'] == "11201"])

    Borough  SubBoroNumber                           Name ZipCode
2  Brooklyn              2  Brooklyn Heights, Fort Greene   11201


### 3.2 Data merging: Find the neighborhood of each venue  

Now that we have the venues (with zip codes), and a way to look up a neighborhood for a particular zip code, iterate over dfAllVenues, adding the neighborhood to each entry.


In [31]:
# Start with the clean list of venues we created above
dfAllVenues = dfAllVenuesClean

# Make an empty list that we'll build up with the data we keep
allVenuesList = []

# Start by adding the columns that exist in dfNeighborhoodZipcodes to dfAllVenues
dfAllVenues["Borough"] = ""
dfAllVenues["SubBoroNumber"] = 0
dfAllVenues["Neighborhood"] = ""

numRowsProcessed = 0
numVenuesRemoved = 0

for row in dfAllVenues.itertuples():    
    numRowsProcessed = numRowsProcessed + 1

    zipCode = row.zip
    name = row.name
    
    # Get the neighborhood this zip code corresponds to
    dfNeighborhood = dfNeighborhoodZipcodes.loc[dfNeighborhoodZipcodes['ZipCode'] == zipCode]
    if (len(dfNeighborhood) == 0):
        numVenuesRemoved = numVenuesRemoved + 1
    else:
        borough = dfNeighborhood.iloc[0]["Borough"]
        subBoro = dfNeighborhood.iloc[0]["SubBoroNumber"]
        neighborhood = dfNeighborhood.iloc[0]["Name"]
        rowData = [row.category, row.id, row.address, row.lat, row.lon, zipCode, name,\
            borough, subBoro, neighborhood]
        allVenuesList.append(rowData)

venueColumnNames = ['category', 'id', 'address', 'lat', 'lon', 'zip', 'name', 'Borough', 'SubBoroNumber', 'Neighborhood']
dfAllVenues = pd.DataFrame(allVenuesList)
dfAllVenues.columns = venueColumnNames
info = HTML("<h3><ul><li>{} venues processed</li><li>{} removed due to zip codes outside the target areas</li><li>{} remaining unique zip codes</li></ul></h3>"\
      .format(numRowsProcessed, numVenuesRemoved, len(dfAllVenues['zip'].unique())))
display(info)

### 3.3 Data merging: A little more data cleanup - remove venues with zip codes outside the boroughs we're using

Some of the venues FourSquare returned are not in Manhattan or Brooklyn. For instance, FourSquare returns some venues in Queens, which is just to the east of Brooklyn, because the lat/lon rectangles we used in the FourSquare queries for Brooklyn included bits of Queens; the rectangles are, well, rectangular, while the borough borders are not (the official term for those types of borders is 'squiggly'). We can find venues that are not in Brooklyn or Manhattan by looking for venues whose sub-borough number (the numerical version of their neighborhood) is zero, which will be the case for venues whose zip codes didn't map to a neighborhood in the previous step.

In [29]:
### NO  NO NO NO NO

# Clean up the data, removing any rows in dfAllVenues that have SubBoroNumber == 0, because
# that means their zip code did not match up with any of the zip codes in the neighborhoods
# we're including.
numRowsBefore = len(dfAllVenues)

# Get indexes for which column SubBoroNumber is zero
indexNames = dfAllVenues[dfAllVenues['SubBoroNumber'] == 0].index
 
# Delete these row indexes from dataFrame
dfAllVenues.drop(indexNames , inplace=True)
numRowsAfter = len(dfAllVenues)
numRowsRemoved = numRowsBefore - numRowsAfter
print("Removed " + str(numRowsRemoved) + " rows that are in neighborhoods we're not including")


Removed 0 rows that are in neighborhoods we're not including


### 3.3 Data merging: Read Population and Rent Data into a new DataFrame

In [32]:
# Read in a CSV file that has the population of each neighborhood
# The CSV file has these columns:
#  Borough: Either Brooklyn or Manhattan
#  SubBoroNumber: An identifier created by New York City to identify sections of each borough
#  Name: The name of the neighborhood(s) (sub-boro)
#  Population: The number of people living in the neighborhood as of the 2010 census
#  Rent: The median rent for a 1-bedroom apartment
# We can match rows in this data to rows in the zip code file by matching Borough andSubBohoNumber,
# or more directly by matching the neighborhood name

wrapper = io.TextIOWrapper(project.get_file("Brooklyn and Manhattan Population And Rent By Neighborhood.csv"), encoding='utf-8')
dfNeighborhoodPopulation = pd.read_csv(wrapper)
dfNeighborhoodPopulation.head()

Unnamed: 0,Borough,SubBoroNumber,Name,Population,Rent
0,Brooklyn,1,"Williamsburg, Greenpoint",173083,2720
1,Brooklyn,2,"Brooklyn Heights, Fort Greene",99617,2975
2,Brooklyn,3,Bedford Stuyvesant,152985,2200
3,Brooklyn,4,Bushwick,112634,2200
4,Brooklyn,5,"East New York, Starrett City",182896,1375


### 3.4 Data merging: Group venue data by neighborhood

Create a new dataframe that has the borough, sub-boro (neighborhood), and the groupby count of venues in that borough/sub-boro

In [33]:
# Create a new dataframe that has the borough, sub-boro, and the groupby count
# of venues in that borough/sub-boro
dfVenueCount = dfAllVenues.groupby(['Borough', 'SubBoroNumber']).name.agg('count').to_frame('VenueCount')
dfVenueCount.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,VenueCount
Borough,SubBoroNumber,Unnamed: 2_level_1
Brooklyn,1,4
Brooklyn,2,5
Brooklyn,4,6
Brooklyn,6,3
Brooklyn,12,2


#### See if the grouped data looks reasonable; i.e., not too many neighborhoods with no venues, because it's not plausible that half the neighborhoods in Manhattan won't have a movie theater or juice bar or whatever type of venue we're looking for (unless we're looking for a fairly esoteric type of venue).

In [36]:
venueHoods = dfAllVenues['Neighborhood'].unique()
info = HTML("<h3>{} out of {} neighborhoods have venues</h3>".format(len(venueHoods), len(dfNeighborhoodPopulation)))
display(info)

info = HTML("<h4>Here are the neighborhoods with no venues: </h4>")
display(info)
print(dfNeighborhoodPopulation.loc[(~dfNeighborhoodPopulation['Name'].isin(venueHoods))]['Name'])

2               Bedford Stuyvesant
4     East New York, Starrett City
6     Sunset Park, Windsor Terrace
7              Crown Heights North
8     Crown Heights South, Wingate
9         Bay Ridge, Dyker Heights
10        Bensonhurst, Bath Beach 
12    Coney Island, Brighton Beach
15         Brownsville, Ocean Hill
23     Stuyvesant Town, Turtle Bay
Name: Name, dtype: object


### 3.5 Data merging: Merge the population/rent DataFrame with the venue count DataFrame

In [37]:
# Merge the two dataframes: dfNeighborhoodPopulation and dfVenueCount
# They both have a Borough and a SubBoroNumber column, so we'll merge
# on that combination. We'll do a left merge, which will include all
# the columns from dfNeighborhoodPopulation, and add the VenueCount
# column
dfNeighborhoodPopulation = pd.merge(dfNeighborhoodPopulation, dfVenueCount, how='left', on =['Borough', 'SubBoroNumber'])
dfNeighborhoodPopulation.head()

Unnamed: 0,Borough,SubBoroNumber,Name,Population,Rent,VenueCount
0,Brooklyn,1,"Williamsburg, Greenpoint",173083,2720,4.0
1,Brooklyn,2,"Brooklyn Heights, Fort Greene",99617,2975,5.0
2,Brooklyn,3,Bedford Stuyvesant,152985,2200,
3,Brooklyn,4,Bushwick,112634,2200,6.0
4,Brooklyn,5,"East New York, Starrett City",182896,1375,


### 3.6 Data Merging: Calculate the density of venues in each neighborhood. Density is defined as the number of venues per 10,000 residents of the neighborhood.

In [38]:
dfPop = dfNeighborhoodPopulation

# A neighborhood that doesn't have any venues will have NaN for VenueCount.
# Convert those to zero
values = {'VenueCount': 0}
dfPop.fillna(value=values, inplace=True)

# Calculate the number of venues per 10,000 residents of each neighborhood
dfPop["VenueDensity"] = dfNeighborhoodPopulation["VenueCount"] / (dfNeighborhoodPopulation["Population"] / 10000)

tableIntro = HTML('<h3>The 10 neighborhoods with the highest density of venues in the categories {} are:</h3>'.format(str(categoriesToFind)))
display(tableIntro)

# Print the neighborhoods with the highest venue densities
dfPop = dfPop.sort_values('VenueDensity', ascending=False)
dfPop.head(10)

Unnamed: 0,Borough,SubBoroNumber,Name,Population,Rent,VenueCount,VenueDensity
22,Manhattan,5,Midtown Business District,51673,3125,8.0,1.548197
21,Manhattan,4,"Chelsea, Clinton",103245,3700,14.0,1.355998
19,Manhattan,2,"Greenwich Village, Soho",90016,3375,12.0,1.333096
18,Manhattan,1,"Battery Park City, Tribeca",60978,3877,6.0,0.983961
3,Brooklyn,4,Bushwick,112634,2200,6.0,0.532699
27,Manhattan,10,Central Harlem,115723,2010,6.0,0.518479
1,Brooklyn,2,"Brooklyn Heights, Fort Greene",99617,2975,5.0,0.501922
24,Manhattan,7,"West Side, Upper West Side",209084,3150,8.0,0.382621
20,Manhattan,3,"Lower East Side, Chinatown",163277,2950,5.0,0.306228
5,Brooklyn,6,"Park Slope, Carroll Gardens",104709,2555,3.0,0.286508


### 3.7 Data Merging: Calculate the 'attractiveness' of each neighborhood, defined as the venue density divided by the median monthly rent of a 1-bedroom apartment in that neighborhood

In [39]:
# Convert any NaN VenueDensity values to zeroes
values = {'VenueDensity': 0}
dfPop.fillna(value=values, inplace=True)

# Calculate the 'attractiveness' of each neighborhood by dividing the VenueDensity by the
# median rent for a 1-bedroom apartment (in thousands of dollars per month)
dfPop["Attractiveness"] = dfNeighborhoodPopulation["VenueDensity"] / (dfNeighborhoodPopulation["Rent"] / 1000)

# Print the neighborhoods with the highest venue densities
dfPop = dfPop.sort_values('Attractiveness', ascending=False)

tableIntro = HTML('<h3>The 10 neighborhoods with the highest attractiveness ratings in the categories {} are:</h3>'.format(str(categoriesToFind)))
display(tableIntro)

# The Attractiveness value depends on the venue density, which can vary a lot between
# different types of venues. So let's normalize the Attractiveness value so it's
# always between zero and one.
dfPop["Attractiveness"]=(dfPop["Attractiveness"]-dfPop["Attractiveness"].min())/(dfPop["Attractiveness"].max()-dfPop["Attractiveness"].min())
dfPop.head(10)

Unnamed: 0,Borough,SubBoroNumber,Name,Population,Rent,VenueCount,VenueDensity,Attractiveness
22,Manhattan,5,Midtown Business District,51673,3125,8.0,1.548197,1.0
19,Manhattan,2,"Greenwich Village, Soho",90016,3375,12.0,1.333096,0.797281
21,Manhattan,4,"Chelsea, Clinton",103245,3700,14.0,1.355998,0.739743
27,Manhattan,10,Central Harlem,115723,2010,6.0,0.518479,0.520666
18,Manhattan,1,"Battery Park City, Tribeca",60978,3877,6.0,0.983961,0.512278
3,Brooklyn,4,Bushwick,112634,2200,6.0,0.532699,0.488746
1,Brooklyn,2,"Brooklyn Heights, Fort Greene",99617,2975,5.0,0.501922,0.340544
17,Brooklyn,18,"Canarsie, Flatlands",193543,1542,4.0,0.206672,0.270534
24,Manhattan,7,"West Side, Upper West Side",209084,3150,8.0,0.382621,0.245178
5,Brooklyn,6,"Park Slope, Carroll Gardens",104709,2555,3.0,0.286508,0.226345


# 4. Mapping the Data

Geopandas is an extension of the pandas library that adds various geographical capabilities to pandas DataFrames. We'll be using it to combine some geographic data. We have a geojson file that gives the geographic boundaries of zip codes in New York. We need to combine the polygon boundaries of the zip codes that make up each neighborhood to get the geographic boundaries of the neighborhood as a whole. Geopandas provides a process called 'dissolving' that merges polygons, keeping the outer boundaries of the merged polygons, while eliminating any inner boundaries. This will allow us to combine the boundaries of the zip codes that make up each neighborhood, and get the boundaries of the neighborhood. That, in turn, will enable us to draw the neighborhood boundaries on maps.

We'll do these mapping tasks:
1. Computing the geographic boundaries of the neighborhoods in Manhattan and Brooklyn
2. Create a heatmap of venues in those two boroughs
3. Create a cluster map that enables visual exploration of clusters of venues, regardless of neighborhood boundaries
4. Create a choropleth map of the neighborhoods, shaded based on their attractiveness value, which takes venue density and median rent into account

In [40]:
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 300)

### 4.1 Getting geographic boundaries of our neighborhoods
1. Read in a geojson file that contains polygons that define the boundary of each zip code in NYC
1. Turn the geojson data into a geopandas DataFrame
1. Select the columns that will be used to **'dissolve'** the boundaries. The dissolving process will combine the polygons of all the zip codes in each neighborhood, eliminate the inner boundaries, leaving only the outer boundaries, thereby creating a geopandas DataFrame with the outer boundaries of each neighborhood.
1. Drop the rows from the geopandas DataFrame for zip codes that are not in one of the neighborhoods in Manhattan or Brooklyn


In [46]:
import geopandas as gpd

# Read the zip code/boundary file into memory
nyBoundaryFile = project.get_file('NYCZipcodeBoundaries.json')

# Turn the zip code/boundary file into a geopandas dataframe
nyBoundaries = gpd.read_file(nyBoundaryFile)
if (VERBOSE):
    print(nyBoundaries.head())

# select the columns that you with to use for the dissolve and that will be retained
nyBoundaries2 = nyBoundaries[['postalCode', 'borough', 'PO_NAME', 'geometry']]
if (VERBOSE):
    print(nyBoundaries2['PO_NAME'] + "  " + nyBoundaries2['postalCode'])

# Drop the rows from the geopandas DataFrame that are for zip codes we aren't using
rowsBeforeProcessing = nyBoundaries2.shape[0]
nyBoundaries2 = nyBoundaries2[nyBoundaries2.postalCode.isin(dfNeighborhoodZipcodes['ZipCode'])]
rowsAfterProcessing = nyBoundaries2.shape[0]

info = HTML('<h4>Before processing, {} rows in the geopandas DataFrame<br>After processing, {} rows in the geopandas DataFrame</h4>'.format(rowsBeforeProcessing, rowsAfterProcessing))
display(info)

if (VERBOSE):
    nyBoundaries2.head(2)


### Merge the neighborhood column from dfNeighborhoodZipcodes into nyBoundaries2, matching on zip code
1. Do an inner join on the 'ZipCode' column to merge these two DataFrame
1. Dissolve the zip code boundaries to create neighborhood boundaries

In [47]:
# Now merge the neighborhood column from dfNeighborhoodZipcodes into nyBoundaries2, matching on zip code
# First, let's change 'postalCode' to 'ZipCode' since that's what dfNeighborhoodZipcodes calls it
newColumnNames = list(nyBoundaries2.columns)
for idx, item in enumerate(newColumnNames):
   if 'postalCode' in item:
       newColumnNames[idx] = 'ZipCode'
nyBoundaries2.columns = newColumnNames

# Do an inner join to merge the geopandas DataFrame with zip code->boundaries with our zip code->neighborhood file
nyBoundaries2 = pd.merge(nyBoundaries2, dfNeighborhoodZipcodes, how='inner', on =['ZipCode'])

# Now dissolve zip codes' boundaries into neighborhood boundaries
nyBoundaries3 = nyBoundaries2.dissolve(by='Name')

# Turn the Name column from an index back into a regular column we can access
nyBoundaries3.reset_index(inplace=True)

print("Merged the geographic boundaries of " + str(len(nyBoundaries2)) + " zip codes into " + str(len(nyBoundaries3)) + " neighborhoods")

Merged the geographic boundaries of 97 zip codes into 30 neighborhoods


In [48]:
# Make sure the neighborhood names in dfPop and nyBoundaries3 match, because
# that's the column we'll join the geojson data and the attractiveness data
# on for the choropleth map
print('Neighborhood names match? {}'.format(len(dfPop.loc[dfPop['Name'].isin(nyBoundaries3['Name'])]) == len(dfPop)))

Neighborhood names match? True


### 4.2 Create a Heatmap of venues in Manhattan and Brooklyn

In [49]:
# Get the lat/lon of the center of Manhattan & Brooklyn
# I visually determind the 'center' of Manhattan to be about where the Guggenheim Museum is,
# just east of Central Park, so I query Bing to get its lat/lon
bingURL = 'http://dev.virtualearth.net/REST/v1/Locations/{}?maxResults=1&key={}'.format("Guggenheim Museum, NY", bingMapsKey)
results = requests.get(bingURL).json()
manhattanCenter = results['resourceSets'][0]['resources'][0]['geocodePoints'][0]['coordinates']
print("Manhattan center: " + str(manhattanCenter))

# Various online sources pinpoint the center of Brookly as Brooklyn College, which is at
# 2900 Bedford Avenue.
bingURL = 'http://dev.virtualearth.net/REST/v1/Locations/{}?maxResults=1&key={}'.format("2900 Bedford Avenue, Brookly, NY", bingMapsKey)
results = requests.get(bingURL).json()
brooklynCenter = results['resourceSets'][0]['resources'][0]['geocodePoints'][0]['coordinates']
print("Brooklyn center: " + str(brooklynCenter))


Manhattan center: [40.78292465209961, -73.95895385742188]
Brooklyn center: [40.63181, -73.953412]


In [50]:
def neighborhoods_style(feature):
    return { 'color': 'blue', 'fill': False }

In [51]:
%%capture
# NOTE: This takes a long time and outputs many lines.
# The %%capture magic will suppress that output (but unfortunately doesn't make it load any faster)
!conda install -c conda-forge folium

# Choropleth Map of Neighborhood Attractiveness
Make a choropleth map that shows neighborhood boundaries, and colors each neighborhood according to its **'attractiveness'** value. This lets us see at a glance which neighborhoods have the best combinations of venue density and affordable rent. Map pins display the neighborhood name when clicked.

##### The next cell will make the output cell larger so we don't have to scroll the maps so much

In [52]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

<IPython.core.display.Javascript object>

In [53]:
# Convert the Coordinate Reference System (CRS) of the data to 4362, which is
# is wgs84, World Geographic System 1984
gjson = nyBoundaries3.to_crs(epsg='4326').to_json()

In [54]:
import folium
from folium import plugins

# Create a map
mAttractiveness = folium.Map(location=manhattanCenter,
               tiles='Mapbox Bright', width='100%', height='100%', zoom_start=11)
folium.TileLayer('cartodbpositron').add_to(mAttractiveness)

# Convert the Coordinate Reference System (CRS) of the data to 4362, which is
# is wgs84, World Geographic System 1984
#gjson = nyBoundaries3.to_crs(epsg='4326').to_json()
#
## Add the neighborhood boundaries as a map layer
#folium.GeoJson(gjson, style_function=neighborhoods_style, name='geojson').add_to(mapNYC)

# The geojson data in 'gjon' is currently a string. Decode it into a dict
gjsonDict = json.loads(gjson)
# Add chloropleth layer
# We'll join the geojson data with the dfPop data on column 'Name'. The 'key_on' parameter tells the map
# where to find that data in the geojson data; the key_on value MUST start with 'feature' and be in
# JavaScript (dot) notation
folium.Choropleth(
    geo_data=gjsonDict,
    name='Attractiveness',
    data=dfPop,
    columns=['Name', 'Attractiveness'],
    fill_color='YlGn',
#    fill_color='Blues',
    key_on='feature.properties.Name',
    legend_name='Attractiveness'
).add_to(mAttractiveness)

# Add the centroid of each neighborhood to its geopandas DataFrame so we know
# where to put a text label identifying the neighborhood
nyBoundaries3["neighborhoodCentroid"] = nyBoundaries3.centroid
for index, row in nyBoundaries3.iterrows():
    textMarker = folium.Marker(location=[row.neighborhoodCentroid.y, row.neighborhoodCentroid.x], popup=row.Name)
    mAttractiveness.add_child(textMarker)

# Add the neighborhood boundaries as a map layer
folium.GeoJson(gjson, style_function=neighborhoods_style, name='geojson').add_to(mAttractiveness)

<folium.features.GeoJson at 0x7fee7158d710>

In [55]:
mAttractiveness

### Create a heat map showing venue density in each neighborhood

In [56]:
from folium.plugins import HeatMap

# Create a heat map with a marker for each venue, with the
# neighborhood boundaries overlayed on the map
mapNYC = folium.Map(location=manhattanCenter, zoom_start=13, tiles='Mapbox Bright')
folium.TileLayer('cartodbpositron').add_to(mapNYC)

# Add the neighborhood boundaries as a map layer
folium.GeoJson(gjson, style_function=neighborhoods_style, name='geojson').add_to(mapNYC)

# Make sure lat/lon are floats
dfAllVenues['lat'] = dfAllVenues['lat'].astype(float)
dfAllVenues['lon'] = dfAllVenues['lon'].astype(float)

# Make a list of lat/lon points of all the venues
heatMapData = [[row['lat'],row['lon']] for index, row in dfAllVenues.iterrows()]
HeatMap(heatMapData).add_to(mapNYC)

# Add the centroid of each neighborhood to its geopandas DataFrame so we know
# where to put a text label identifying the neighborhood
nyBoundaries3["neighborhoodCentroid"] = nyBoundaries3.centroid
for index, row in nyBoundaries3.iterrows():
    textMarker = folium.Marker(location=[row.neighborhoodCentroid.y, row.neighborhoodCentroid.x], popup=row.Name)
    mapNYC.add_child(textMarker)

mapNYC

# 5. Maching Learning - Clustering

As we can see on the heatmap with neighborhood boundaries, groupings of venues don't necessarily respect those boundaries. There could be a cluster of jazz clubs that are near one another, but are in several different neighborhoods. E.g., a club on the west side of a major street could be right on the eastern edge of neighborhood A, while a club right across the street is on the very western edge of neighborhood B. I'll use maching learning to discover clusters of the venues we're looking for, regardless of neighbhorhood boundaries, and will map those. 

In [57]:
# Do one-hot encoding to enable k-means clustering
venues_onehot = pd.get_dummies(dfAllVenues[['category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
venues_onehot['lat'] = dfAllVenues['lat'] 
venues_onehot['lon'] = dfAllVenues['lon'] 

venues_onehot.head()

Unnamed: 0,Jazz Club,lat,lon
0,1,40.707199,-74.010002
1,1,40.707199,-74.010002
2,1,40.714723,-74.010598
3,1,40.715197,-74.007533
4,1,40.712786,-74.00436


In [58]:
from sklearn.cluster import KMeans 

# set number of clusters
kclusters = 15

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venues_onehot)

In [59]:
# add clustering labels to the dataframe, and add back the venue names
venues_onehot.insert(0, 'ClusterLabel', kmeans.labels_)
venues_onehot['name'] = dfAllVenues['name'] 
venues_onehot['category'] = dfAllVenues['category'] 
venues_onehot.head()


Unnamed: 0,ClusterLabel,Jazz Club,lat,lon,name,category
0,3,1,40.707199,-74.010002,Blue Notes,Jazz Club
1,3,1,40.707199,-74.010002,NY Jazz,Jazz Club
2,3,1,40.714723,-74.010598,The 75 Club,Jazz Club
3,3,1,40.715197,-74.007533,78 Reade,Jazz Club
4,3,1,40.712786,-74.00436,Jazz Club Hub,Jazz Club


In [64]:
### DEBUGGING
pd.value_counts(venues_onehot['ClusterLabel'])

14    12
9     11
6     10
7     9 
5     9 
12    6 
10    6 
3     6 
1     5 
8     4 
0     4 
13    3 
11    3 
4     3 
2     2 
Name: ClusterLabel, dtype: int64

In [66]:
### NOT NEEDED?
# Matplotlib and associated plotting modules
#import matplotlib.cm as cm
#import matplotlib.colors as colors
import folium

from folium.plugins import MarkerCluster

# create map
#map_clusters = folium.Map(location=manhattanCenter, zoom_start=11)
map_clusters = mapNYC

# set color scheme for the clusters
#x = np.arange(kclusters)
#ys = [i + x + (i*x)**2 for i in range(kclusters)]
#colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
#rainbow = [colors.rgb2hex(i) for i in colors_array]

In [42]:
### DEBUGGING
print(len(venues_onehot))

93


In [67]:
### NOT NEEDED?
# Create a list of N MarkerCluster objects
# We'll treat this list like an array, using it to add
# venues to the appropriate marker cluster for the cluster map
numClusters = len(venues_onehot['ClusterLabel'].unique())
mcList = [MarkerCluster] * numClusters
nIndex = 0
for mc in mcList:
    mcList[nIndex] = MarkerCluster()
    nIndex = nIndex + 1    

In [61]:
from folium.plugins import MarkerCluster

# add cluster markers to the map, with a cluster child for each top venue
mc = MarkerCluster()

# iterate over the rows in the DataFrame
for clusterRow in venues_onehot.itertuples():
    lat = clusterRow.lat
    lon = clusterRow.lon
    name = clusterRow.name
    category = clusterRow.category
    clusterNumber = clusterRow.ClusterLabel
#    mc = mcList[clusterNumber]

    markerColor = dfMapMarkers.loc[category].color
    markerIconPrefix = dfMapMarkers.loc[category].icon_prefix
    markerIconName = dfMapMarkers.loc[category].icon_name
    thePopup = folium.Popup(name)
#    mc.add_child(folium.Marker(location=[lat,  lon], popup=thePopup, icon=folium.Icon(color='red', icon='info-sign')))
#    mc.add_child(folium.Marker(location=[lat,  lon], popup=thePopup, icon=folium.Icon(color='red', icon='glyphicon-music')))
    mc.add_child(folium.Marker(location=[lat,  lon], popup=thePopup, icon=folium.Icon(color=markerColor, prefix=markerIconPrefix, icon=markerIconName)))

#for mc in mcList:
mapNYC.add_child(mc)

legend_html = "<div style='position: fixed; bottom: 50px; left: 50px; width: 100px; height: 90px; " +\
"border:2px solid grey; z-index:9999; font-size:14px;'>&nbsp;Venue Legend <br>" +\
"&nbsp; " + categoriesToFind[0] + " &nbsp; <i class='" + dfMapMarkers.loc[categoriesToFind[0]].icon_prefix +\
" " + dfMapMarkers.loc[categoriesToFind[0]].icon_name + "' style='color:" + dfMapMarkers.loc[categoriesToFind[0]].color + "'></i><br></i>" +\
"&nbsp; " + categoriesToFind[0] + " &nbsp; <i class='" + dfMapMarkers.loc[categoriesToFind[0]].icon_prefix +\
" " + dfMapMarkers.loc[categoriesToFind[0]].icon_name + "' style='color:" + dfMapMarkers.loc[categoriesToFind[0]].color + "'></i><br></i>" +\
"</div>"

mapNYC.get_root().html.add_child(folium.Element(legend_html))
mapNYC


In [115]:
mapNYC

In [45]:
### DEBUGGING
dfMapMarkers

Unnamed: 0_level_0,color,icon_name,icon_prefix
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Jazz Club,black,fa-music,fa


### Create a map showing the venue clusters, overlayed on the neighborhood boundaries. This will enable us to see groupings of venues that don't conform to neighborhood boundaries

In [62]:
# Figure out how many venues are in each cluster, and where the centroid of each cluster is
numVenuesInEachCluster = pd.value_counts(venues_onehot['ClusterLabel'])
clusterCenters = kmeans.cluster_centers_

# JSON doesn't support numpy arrays, so we have to convert the centroids to an int array
clusterCentersInt = getattr(clusterCenters, "tolist", lambda: clusterCenters)()

map_clusterCircles = folium.Map(location=manhattanCenter, zoom_start=13, tiles='Mapbox Bright')
folium.TileLayer('cartodbpositron').add_to(map_clusterCircles)
folium.GeoJson(gjson, style_function=neighborhoods_style, name='geojson').add_to(map_clusterCircles)

# Add a circle representing each cluster to the map; the circle's radius
# varies directly with the number of venues in the cluster. The more venues
# in a cluster, the bigger its circle will be
for nWhichCluster in range(0, len(clusterCentersInt)):
    lat = clusterCentersInt[nWhichCluster][1]
    lon = clusterCentersInt[nWhichCluster][2]
    numVenues = numVenuesInEachCluster[nWhichCluster]
    
    # We have to ensure the radius value we send to Folium is a float, not an int,
    # or it will throw a "can't serialize" exception
    theRadius = (numVenues*75).astype(float)
    
    # Also ensure the radius is big enough to be seen
    theRadius = max(theRadius, 100)

    aCircle = folium.Circle(
      location=[lat, lon],
#      popup=str(nWhichCluster),
      radius=theRadius,
#      color='crimson',
      fill=True,
      fill_color='crimson').add_to(map_clusterCircles)
    
map_clusterCircles

### This map is useful, but it would be more useful if the venues were on it, so we could zoom in and see what's in each cluster. Let's add them.

In [63]:
map_clusterVenues = map_clusterCircles

# set color scheme for the clusters
#x = np.arange(kclusters)
#ys = [i + x + (i*x)**2 for i in range(kclusters)]
#colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
#rainbow = [colors.rgb2hex(i) for i in colors_array]

mc = MarkerCluster()

# Add cluster markers to the map, with a cluster child for each venue
# Iterate over the rows in the DataFrame
for clusterRow in venues_onehot.itertuples():
    lat = clusterRow.lat
    lon = clusterRow.lon
    name = clusterRow.name
    category = clusterRow.category
    clusterNumber = clusterRow.ClusterLabel

    # Each marker will use the icon and color specified for its category at the top of this file
    markerColor = dfMapMarkers.loc[category].color
    markerIconPrefix = dfMapMarkers.loc[category].icon_prefix
    markerIconName = dfMapMarkers.loc[category].icon_name
    thePopup = folium.Popup(name)
    mc.add_child(folium.Marker(location=[lat,  lon], popup=thePopup, icon=folium.Icon(color=markerColor, prefix=markerIconPrefix, icon=markerIconName)))

map_clusterVenues.add_child(mc)

# Add a map legend for the various icons
legend_html = "<div style='position: fixed; bottom: 50px; left: 50px; width: 100px; height: 90px; " +\
"border:2px solid grey; z-index:9999; font-size:14px;'>&nbsp;Venue Legend <br>" +\
"&nbsp; " + categoriesToFind[0] + " &nbsp; <i class='" + dfMapMarkers.loc[categoriesToFind[0]].icon_prefix +\
" " + dfMapMarkers.loc[categoriesToFind[0]].icon_name + "' style='color:" + dfMapMarkers.loc[categoriesToFind[0]].color + "'></i><br></i>" +\
"&nbsp; " + categoriesToFind[0] + " &nbsp; <i class='" + dfMapMarkers.loc[categoriesToFind[0]].icon_prefix +\
" " + dfMapMarkers.loc[categoriesToFind[0]].icon_name + "' style='color:" + dfMapMarkers.loc[categoriesToFind[0]].color + "'></i><br></i>" +\
"</div>"
map_clusterVenues.get_root().html.add_child(folium.Element(legend_html))

map_clusterVenues