In [41]:
import requests
import json
import pandas as pd
from geopy.geocoders import Nominatim
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0

# The SoHo of Prague, the Old Town of New York
## *Clustering and comparison of the neighborhoods of Prague and New York*
### Author: Hana Konradova

## 1. Introduction/Business Problem

I will admit that my research question is personally motivated. I lived in New York for a couple of years and recently moved back to my hometown of Prague. I have always been amazed by how different these two cities are, and by how much I could fall in love with both of them despite their huge differences.

Hence, in this analysis, I want to compare and cluster the neighborhoods of Prague and New York and explore to what extent they are comparable, which neighborhoods are most alike in terms of the venues they contain, and what is characteristic of the most famous areas of each city.

Beyond people who are simply curious about how these two cities compare, this analysis might be useful, for example, to anyone moving between the cities who wants to find a neighborhood similar to the one they currently live in, or to any business owner interested in expanding or moving their firm from Prague to New York or vice versa. Another, more general application is urban development. Some neighborhoods are more popular than others, and learning more about their features might help in deciding where further development should focus in order for neighborhoods to become more "livable". This analysis might also serve as a "template" for performing similar comparisons between any other two cities.

## 2. Data

The units of analysis will be neighborhood tabulation areas (NTAs) for New York and cadastral areas for Prague, as they offer a convenient, medium-sized representation of neighborhoods. Areas that are too small would give overly granular results, while with bigger areas such as administrative districts the subtleties of smaller neighborhoods would be averaged out.

Our main data source will be **Foursquare's APIs**, specifically the venues endpoint (see details at <a href="https://developer.foursquare.com/docs/api/venues/search">Foursquare: Search for venues</a>). It will provide us with information about the venues present in each neighborhood. The venues' categories and their distribution within the unit of analysis will be used as features characterizing the neighborhoods, which will allow us to cluster them and measure their similarity. An example of such information is presented below for the first available NYC neighborhood.

In [42]:
!wget -q -O 'ny_ntas.csv' https://data.cityofnewyork.us/api/views/q2z5-ai38/rows.csv?accessType=DOWNLOAD
with open('ny_ntas.csv', 'rb') as csv_data:
    newyork_data = pd.read_csv(csv_data, header = 0)

In [43]:
CLIENT_ID = '<YOURID>' # your Foursquare ID
CLIENT_SECRET = '<YOURSECRET>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [44]:
address = newyork_data.loc[0, 'NTAName']
geolocator = Nominatim(user_agent="hoods_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
results = requests.get(url).json()
# Extract the name of a venue's first listed category (None if it has no category)
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # after json_normalize the column name carries the 'venue.' prefix
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print("Examples of nearby venues found for the {} neighborhood:".format(address))
nearby_venues.head()

Examples of nearby venues found for the Borough Park neighborhood:


Unnamed: 0,name,categories,lat,lng
0,Orchidea,American Restaurant,40.636528,-73.994157
1,GoBo's,Café,40.631571,-73.995369
2,Jentana's Pizza,Pizza Place,40.63741,-73.999225
3,Asia Glatt Kosher,Restaurant,40.636978,-73.994495
4,Sub express,Fast Food Restaurant,40.632984,-73.994216
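
To illustrate how venue categories could become neighborhood features, here is a minimal sketch. It assumes a venue table like the one above with an extra `neighborhood` column. That column, the sample rows, and the helper name `category_profile` are made up for illustration and are not part of the analysis itself. Each neighborhood is described by the share of its venues that fall into each category.

```python
import pandas as pd

# Made-up sample in the shape of the venue table above, plus a
# hypothetical 'neighborhood' column (an assumption for illustration).
venues = pd.DataFrame({
    "neighborhood": ["Borough Park", "Borough Park", "Borough Park", "Murray Hill"],
    "categories": ["Pizza Place", "Restaurant", "Pizza Place", "Restaurant"],
})

def category_profile(df):
    """Return one row per neighborhood with the share of each venue category."""
    # normalize="index" divides each row by its total, so every row sums to 1
    return pd.crosstab(df["neighborhood"], df["categories"], normalize="index")

profile = category_profile(venues)
print(profile)
```

Using shares rather than raw counts keeps neighborhoods with many venues comparable to those with few. Raw counts may be preferable if venue density itself should act as a feature.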


We will also need information about the cities themselves, mainly the names of the neighborhoods, their locations, and the population living in each area, which will serve as an additional feature in the analysis.

For **New York**, we will use two data sets available from NYC Open Data that provide information on NYC NTAs, including their location. However, these locations will be used only for rendering maps, as they contain only the shapes of the areas, not their centers. Approximate centroids will be geocoded using Python's `geopy` package; an example is shown below.

In [45]:
newyork_data.head()

Unnamed: 0,BoroName,the_geom,CountyFIPS,BoroCode,NTACode,NTAName,Shape_Leng,Shape_Area
0,Brooklyn,MULTIPOLYGON (((-73.97604935657381 40.63127590...,47,3,BK88,Borough Park,39247.228028,54005020.0
1,Queens,MULTIPOLYGON (((-73.80379022888246 40.77561011...,81,4,QN51,Murray Hill,33266.904995,52488280.0
2,Queens,MULTIPOLYGON (((-73.8610972440186 40.763664477...,81,4,QN27,East Elmhurst,19816.712293,19726850.0
3,Queens,MULTIPOLYGON (((-73.75725671509139 40.71813860...,81,4,QN07,Hollis,20976.335574,22887770.0
4,Manhattan,MULTIPOLYGON (((-73.94607828674226 40.82126321...,61,1,MN06,Manhattanville,17040.685413,10647080.0


In [46]:
!wget -q -O 'nyc_pop.csv' https://data.cityofnewyork.us/api/views/swpk-hqdp/rows.csv?accessType=DOWNLOAD
with open('nyc_pop.csv', 'rb') as csv_data:
    nyc_pop = pd.read_csv(csv_data, header = 0)
nyc_pop.head()

Unnamed: 0,Borough,Year,FIPS County Code,NTA Code,NTA Name,Population
0,Bronx,2000,5,BX01,Claremont-Bathgate,28149
1,Bronx,2000,5,BX03,Eastchester-Edenwald-Baychester,35422
2,Bronx,2000,5,BX05,Bedford Park-Fordham North,55329
3,Bronx,2000,5,BX06,Belmont,25967
4,Bronx,2000,5,BX07,Bronxdale,34309
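
The two NYC tables share the NTA code (`NTACode` in the shapes file, `NTA Code` in the population file), so population could be attached to each neighborhood with a join. The sketch below is one possible way to do it, not necessarily the approach used later. The rows are made up, the helper name `attach_population` is hypothetical, and keeping only the most recent year per NTA is an assumption.

```python
import pandas as pd

# Made-up rows that mimic the column names of the two NYC tables above.
ntas = pd.DataFrame({
    "NTACode": ["BK88", "QN51"],
    "NTAName": ["Borough Park", "Murray Hill"],
})
pop = pd.DataFrame({
    "NTA Code": ["BK88", "BK88", "QN51"],
    "Year": [2000, 2010, 2000],
    "Population": [100, 120, 200],  # illustrative numbers, not real counts
})

def attach_population(ntas, pop):
    """Left-join each NTA with its most recent population record."""
    # keep one row per NTA code: the one with the highest Year
    latest = pop.sort_values("Year").drop_duplicates("NTA Code", keep="last")
    merged = ntas.merge(
        latest[["NTA Code", "Population"]],
        left_on="NTACode", right_on="NTA Code", how="left",
    )
    return merged.drop(columns="NTA Code")

nyc = attach_population(ntas, pop)
print(nyc)
```

A left join keeps every NTA from the shapes file, so areas without a population record show up as missing values instead of silently disappearing.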


In [47]:
address = 'Borough Park, Brooklyn'

geolocator = Nominatim(user_agent="hoods_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Sample output of geocoding:')
print('The geographical coordinates of Borough Park, Brooklyn are {}, {}.'.format(latitude, longitude))

Sample output of geocoding:
The geographical coordinates of Borough Park, Brooklyn are 40.633993, -73.9968059.
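
Geocoding every neighborhood means repeating the call above many times, and `geopy`'s `geocode` returns `None` when an address is not found, in which case `location.latitude` would fail. The following sketch shows one way to guard against that. The helper `geocode_names` and its parameters are hypothetical. It accepts any callable that behaves like `Nominatim(...).geocode`, which keeps the example runnable without network access.

```python
import time

def geocode_names(names, geocode, suffix="", delay_seconds=1.0):
    """Geocode each name and return {name: (lat, lon)}, or None where not found.

    `geocode` is any callable that behaves like geopy's
    `Nominatim(...).geocode`: it returns an object with `.latitude` and
    `.longitude`, or None when nothing matches the query.
    """
    coords = {}
    for i, name in enumerate(names):
        if i and delay_seconds:
            # pause between requests to respect the public Nominatim rate limit
            time.sleep(delay_seconds)
        location = geocode(name + suffix)
        if location is None:
            coords[name] = None
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords

# Possible usage with the geolocator defined above (requires network access):
# geocode_names(["Borough Park"], geolocator.geocode, suffix=", Brooklyn")
```

Neighborhoods that come back as `None` can then be inspected and retried with a more specific address, such as one that includes the borough.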


For **Prague**, we will use official data from the Czech Statistical Office. The information we need (i.e. the names of the cadastral areas and their populations) is available on the institution's website in an Excel file covering the years 2001-2018 (we will work with 2018). An example of the data in its raw format is shown below. The location data will again be geocoded using the `geopy` package, as shown in the example.

In [48]:
!wget -q -O 'CR_L4_KU.xlsx' https://www.czso.cz/documents/11236/37543548/CR_L4_KU.xlsx/7538afe0-debf-4cd4-9879-1261f252c6ba?version=1.9
with open('CR_L4_KU.xlsx', 'rb') as xlsx_data:
    prague_data = pd.read_excel(xlsx_data, header = [5,6], 
                                dtype = str, nrows = 112)
prague_data.head()

Unnamed: 0_level_0,CT code,Name of cadastral territory,Population,Cadastral area (ha),Population,Cadastral area (ha),Population,Cadastral area (ha),Population,Cadastral area (ha),...,Population,Cadastral area (ha),Population,Cadastral area (ha),Population,Cadastral area (ha),Population,Cadastral area (ha),Population,Cadastral area (ha)
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1,31 December 2001,Unnamed: 3_level_1,31 December 2002,Unnamed: 5_level_1,31 December 2003,Unnamed: 7_level_1,31 December 2004,Unnamed: 9_level_1,...,31 December 2014,Unnamed: 29_level_1,31 December 2015,Unnamed: 31_level_1,31 December 2016,Unnamed: 33_level_1,31 December 2017,Unnamed: 35_level_1,31 December 2018,Unnamed: 37_level_1
0,601527,Běchovice,1692,683.3299,3060,683.3903,3094,683.3909,3157,683.3942,...,2598,683.4935,2611,683.4935,2623,683.4938,2668,683.4935,2694,683.4936
1,602582,Benice,345,277.376,330,277.3802,424,277.3948,436,277.3947,...,626,277.3795,641,277.3795,668,277.3796,699,277.3795,706,277.3793
2,730556,Bohnice,18631,465.8495,18318,465.8508,18170,465.9016,18097,465.8984,...,16920,465.8617,16834,465.8615,16763,465.8616,16835,465.8613,16716,465.8615
3,727873,Braník,18337,440.33,18261,440.3128,18145,440.3069,18079,440.312,...,17815,440.0495,17853,440.05,17814,440.0518,17867,440.0522,17898,440.0517
4,729582,Břevnov,24124,524.4628,24054,524.4682,24044,524.4686,24050,524.2628,...,23946,524.0778,24169,524.0775,24995,524.0779,25357,524.0784,25955,524.0784
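
Because the file is read with a two-row header, the columns form a `MultiIndex`, and the 2018 population sits under the pair (`Population`, `31 December 2018`). The sketch below shows one way the name and 2018 population could be pulled into a flat table. The miniature frame and the helper names `pick_column` and `population_for_year` are made up for illustration, and the integer conversion assumes that no values are missing.

```python
import pandas as pd

# Made-up miniature of the two-row header shown above (values are illustrative).
columns = pd.MultiIndex.from_tuples([
    ("CT code", "Unnamed: 0_level_1"),
    ("Name of cadastral territory", "Unnamed: 1_level_1"),
    ("Population", "31 December 2017"),
    ("Cadastral area (ha)", "Unnamed: 3_level_1"),
    ("Population", "31 December 2018"),
    ("Cadastral area (ha)", "Unnamed: 5_level_1"),
])
raw = pd.DataFrame(
    [["1", "Alpha", "10", "1.5", "11", "1.5"],
     ["2", "Beta", "20", "2.5", "22", "2.5"]],
    columns=columns,
)

def pick_column(frame, top, bottom=None):
    """Return the first column whose header levels match `top` (and `bottom`)."""
    mask = frame.columns.get_level_values(0) == top
    if bottom is not None:
        mask &= frame.columns.get_level_values(1) == bottom
    return frame.loc[:, mask].iloc[:, 0]

def population_for_year(frame, year_label):
    """Return a flat frame with the area name and the population for one year."""
    names = pick_column(frame, "Name of cadastral territory")
    # values were read as strings (dtype=str), so convert them to integers
    population = pick_column(frame, "Population", year_label).astype(int)
    return pd.DataFrame({"name": names, "population": population})

prague_2018 = population_for_year(raw, "31 December 2018")
print(prague_2018)
```

Matching on header labels rather than column positions keeps the selection valid if the file gains columns for further years.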


In [49]:
address = 'Běchovice, Praha'

geolocator = Nominatim(user_agent="hoods_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Sample output of geocoding:')
print('The geographical coordinates of Běchovice, Prague are {}, {}.'.format(latitude, longitude))

Sample output of geocoding:
The geographical coordinates of Běchovice, Prague are 50.0812104, 14.6160257.
