# Compare Toronto and New York 

## Project goal: To find the similarities between Toronto and New York

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)



## Introduction: Business Problem <a name="introduction"></a>

New York City and Toronto are both highly diverse and are the financial capitals of their respective countries. This project compares the neighborhoods of the two cities to determine how similar or dissimilar they are, based on the categories of venues found in each neighborhood.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* Venue data for Toronto
* Venue data for New York
* Total number of distinct venue categories in each of the two cities
* Number of distinct venue categories in each neighborhood of the two cities


We decided to use postcodes to define our neighborhoods.

The following data sources will be needed to extract/generate the required information:
* The latitude and longitude of each area will be obtained using **Google Maps API geocoding** (postcode to coordinates)
* The number of venues in every neighborhood, with their types and locations, will be obtained using the **Foursquare API**

### Neighborhood Candidates

Let's create latitude & longitude coordinates for the centroids of all neighborhoods of the two cities. We will create a grid of cells covering our area of interest, which is approx. 0.5 kilometers centered around each neighbourhood.

Let's first find the latitude & longitude of the neighbourhoods, using a specific, well-known address and the Google Maps geocoding API.
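
To make the 0.5 km area of interest concrete, the following is a minimal sketch (not part of the original notebook) of a haversine distance check. The function names `haversine_m` and `within_radius` and the constant `EARTH_RADIUS_M` are illustrative assumptions; the coordinates are a Toronto centroid and a nearby venue taken from the tables further down.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (common approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(center, point, radius_m=500):
    """True if `point` lies within `radius_m` metres of `center` (both are (lat, lon))."""
    return haversine_m(center[0], center[1], point[0], point[1]) <= radius_m


centroid = (43.753259, -79.329656)   # neighbourhood centroid
venue = (43.751976, -79.332140)      # a venue close to that centroid
print(round(haversine_m(*centroid, *venue)), within_radius(centroid, venue))
```

A check like this can be used to sanity-test that venues returned for a neighbourhood really fall inside the intended radius.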

## 0. Installing the prerequisite libraries

In [1]:
!pip install beautifulsoup4 requests pandas geocoder



## 1. Prepare data for Toronto

### 1.1 Scraping suburb and postcode data

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import geocoder
import time
import json

#### Scraping from Wikipedia

In [3]:
def link_or_text(elem):
    link = elem.select_one('a')
    if link:
        return link.text.strip()
    return elem.text.strip()

In [4]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
pc_data = []

html = requests.get(URL).content
soup = BeautifulSoup(html, 'html.parser')
tbody = soup.select_one('table.wikitable>tbody')

for tr in tbody.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) == 3:
        pc, borough, neighbourhood = [link_or_text(td) for td in tds]
        if borough == 'Not assigned':
            # skip rows whose borough is 'Not assigned'
            continue
        item = dict(zip(['Postcode', 'Borough', 'Neighbourhood'],  [pc, borough, neighbourhood]))
        pc_data.append(item) 

In [5]:
pc_data[:3]

[{'Postcode': 'M3A', 'Borough': 'North York', 'Neighbourhood': 'Parkwoods'},
 {'Postcode': 'M4A',
  'Borough': 'North York',
  'Neighbourhood': 'Victoria Village'},
 {'Postcode': 'M5A',
  'Borough': 'Downtown Toronto',
  'Neighbourhood': 'Harbourfront'}]

In [6]:
pc_df = pd.DataFrame(pc_data)

In [7]:
pc_df.head()

Unnamed: 0,Borough,Neighbourhood,Postcode
0,North York,Parkwoods,M3A
1,North York,Victoria Village,M4A
2,Downtown Toronto,Harbourfront,M5A
3,Downtown Toronto,Regent Park,M5A
4,North York,Lawrence Heights,M6A


In [8]:
pc_df.tail()

Unnamed: 0,Borough,Neighbourhood,Postcode
206,Etobicoke,Kingsway Park South West,M8Z
207,Etobicoke,Mimico NW,M8Z
208,Etobicoke,The Queensway West,M8Z
209,Etobicoke,Royal York South West,M8Z
210,Etobicoke,South of Bloor,M8Z


In [9]:
pc_df = pc_df[pc_df.Borough != 'Not assigned']

In [10]:
pc_df = pc_df[['Postcode','Borough','Neighbourhood']]
pc_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [11]:
pc_df_2 = pc_df.groupby('Postcode')['Neighbourhood'].apply(', '.join).reset_index()
pc_df_2.head()

Unnamed: 0,Postcode,Neighbourhood
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae


In [12]:
pc_df_3 = pc_df[['Postcode','Borough']]


In [13]:
pc_df_2 = pd.merge(pc_df_3, pc_df_2, on='Postcode', how='inner') 
pc_df_2 = pc_df_2.drop_duplicates(subset=['Postcode', 'Borough','Neighbourhood'], keep='first')
pc_df_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
4,M6A,North York,"Lawrence Heights, Lawrence Manor"
6,M7A,Queen's Park,Not assigned


In [14]:
# Assigning to `row` inside iterrows() may only modify a copy, so update the frame through .loc
mask = pc_df_2['Neighbourhood'] == 'Not assigned'
pc_df_2.loc[mask, 'Neighbourhood'] = pc_df_2.loc[mask, 'Borough']


In [15]:
pc_df_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
4,M6A,North York,"Lawrence Heights, Lawrence Manor"
6,M7A,Queen's Park,Queen's Park


In [16]:
pc_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


### 1.2 Retrieving location coordinates

In [17]:
GCP_API_KEY = 'THIS_IS_A_SECRET'

In [18]:
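#@hidden_cell
import os
# Load the real key from an environment variable instead of hard-coding it here
# (the variable name GCP_API_KEY is an assumption; use whatever name you export locally)
GCP_API_KEY = os.environ.get('GCP_API_KEY', GCP_API_KEY)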

In [19]:
latlong_data = []

qry_result = []
for (idx, pc) in pc_df_2['Postcode'].items():
    # pc = item.get('Postcode')
    qry = f'{pc}, Toronto, Ontario'
    for i in range(3):
        r = geocoder.google(qry, key=GCP_API_KEY)
        if r.latlng:
            #print(f'{qry} {r.latlng}')
            qry_result.append([pc, r.latlng[0], r.latlng[1]])
            break
        else:
            time.sleep(1.5)
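
The retry loop above can be factored into a small helper. The following is a sketch under stated assumptions, not the notebook's code: `retry`, `lookup`, `attempts`, `delay_s` and `sleep` are hypothetical names, and `lookup` stands in for any zero-argument callable that wraps the geocoding request. Returning `None` after the last attempt makes failed postcodes visible instead of silently skipping them.

```python
import time


def retry(lookup, attempts=3, delay_s=1.5, sleep=time.sleep):
    """Call lookup() up to `attempts` times; return the first truthy result, else None."""
    for attempt in range(attempts):
        result = lookup()
        if result:
            return result
        if attempt < attempts - 1:
            sleep(delay_s)  # wait before the next attempt, but not after the last one
    return None


# Demonstration with a fake lookup that fails twice and then succeeds
calls = []

def flaky_lookup():
    calls.append(1)
    return [43.7532586, -79.3296565] if len(calls) == 3 else None

print(retry(flaky_lookup, sleep=lambda s: None), len(calls))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.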
    

In [20]:
## `latlongs` is a list of dictionaries with keys Postcode, Latitude and Longtitude
## and we will merge it back into the postcode data later
latlongs = [dict([('Postcode', r[0]), ('Latitude', r[1]), ('Longtitude', r[2])])  for r in  qry_result]

In [21]:
len(latlongs), len(pc_data)

(103, 211)

In [22]:
latlongs[:5]

[{'Postcode': 'M3A', 'Latitude': 43.7532586, 'Longtitude': -79.3296565},
 {'Postcode': 'M4A', 'Latitude': 43.72588229999999, 'Longtitude': -79.3155716},
 {'Postcode': 'M5A', 'Latitude': 43.6542599, 'Longtitude': -79.36063589999999},
 {'Postcode': 'M6A', 'Latitude': 43.718518, 'Longtitude': -79.4647633},
 {'Postcode': 'M7A', 'Latitude': 43.6623015, 'Longtitude': -79.3894938}]

In [23]:
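## Merging the pc_data and latlongs by postcode
## (pairing by position with zip would misalign them: 211 rows vs 103 postcodes)
latlong_by_pc = {d['Postcode']: d for d in latlongs}
for d in pc_data:
    d.update(latlong_by_pc.get(d['Postcode'], {}))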

In [24]:
df_latlongs = pd.DataFrame(latlongs)

In [25]:
df_latlongs

Unnamed: 0,Latitude,Longtitude,Postcode
0,43.753259,-79.329656,M3A
1,43.725882,-79.315572,M4A
2,43.654260,-79.360636,M5A
3,43.718518,-79.464763,M6A
4,43.662301,-79.389494,M7A
5,43.667856,-79.532242,M9A
6,43.806686,-79.194353,M1B
7,43.745906,-79.352188,M3B
8,43.706397,-79.309937,M4B
9,43.657162,-79.378937,M5B


In [26]:
df = pd.merge(pc_df_2, df_latlongs, on='Postcode', how='left')
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longtitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


In [27]:
df2= pd.merge(pc_df, df_latlongs, on='Postcode', how='left')
df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longtitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636
4,M6A,North York,Lawrence Heights,43.718518,-79.464763


### 1.3 Get venue info

In [29]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder; keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET


In [30]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
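
The function above indexes straight into the response and assumes every venue has at least one category. As a hedged sketch (not the notebook's code), the request URL can be built with `urllib.parse.urlencode` and the payload parsed defensively. `build_explore_url` and `extract_venues` are hypothetical helpers, and `sample` is a hand-written payload shaped like the structure the function above reads, not a real API response.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # endpoint used above


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL; urlencode takes care of escaping the parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; category is None if the list is empty."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written sample payload (illustrative only)
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.332140},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorised Place",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]}]}}

url = build_explore_url("ID", "SECRET", "20180605", 43.753259, -79.329656)
venues = extract_venues(sample)
print(url)
print(venues)
```

With this shape, an empty or error response yields an empty list rather than a `KeyError` or `IndexError` in the middle of the loop.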

In [32]:
LIMIT=100

In [33]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longtitude']
                                  )

Parkwoods
Victoria Village
Harbourfront, Regent Park
Lawrence Heights, Lawrence Manor
Queen's Park
Islington Avenue
Rouge, Malvern
Don Mills North
Woodbine Gardens, Parkview Hill
Ryerson, Garden District
Glencairn
Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park
Highland Creek, Rouge Hill, Port Union
Flemingdon Park, Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Downsview North, Wilson Heights
Thorncliffe Park
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
East Birchmount Park, Ionview, Kennedy Park
Bayview Village
CFB Toronto, Downsview East
The D

In [34]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",3,3,3,3,3,3
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",11,11,11,11,11,11
"Alderwood, Long Branch",11,11,11,11,11,11
"Bathurst Manor, Downsview North, Wilson Heights",19,19,19,19,19,19
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",22,22,22,22,22,22
Berczy Park,57,57,57,57,57,57
"Birch Cliff, Cliffside West",4,4,4,4,4,4


In [35]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 275 unique categories.


In [36]:
toronto_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.332140,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
5,Victoria Village,43.725882,-79.315572,Eglinton Ave E & Sloane Ave/Bermondsey Rd,43.726086,-79.313620,Intersection
6,Victoria Village,43.725882,-79.315572,Pizza Nova,43.725824,-79.312860,Pizza Place
7,"Harbourfront, Regent Park",43.654260,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
8,"Harbourfront, Regent Park",43.654260,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
9,"Harbourfront, Regent Park",43.654260,-79.360636,Toronto Cooper Koo Family Cherry St YMCA Centre,43.653191,-79.357947,Gym / Fitness Center


In [39]:
toronto_venues.to_csv('C:\\Users\\OSS\\Downloads\\toronto.csv', sep='\t', encoding='utf-8')

In [41]:
toronto_venues.groupby(['Neighborhood','Venue Category']).count()

Neighborhood,Venue Category,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
"Adelaide, King, Richmond",American Restaurant,3,3,3,3,3
"Adelaide, King, Richmond",Art Gallery,1,1,1,1,1
"Adelaide, King, Richmond",Art Museum,1,1,1,1,1
"Adelaide, King, Richmond",Asian Restaurant,2,2,2,2,2
"Adelaide, King, Richmond",Bakery,2,2,2,2,2
"Adelaide, King, Richmond",Bar,4,4,4,4,4
"Adelaide, King, Richmond",Bookstore,1,1,1,1,1
"Adelaide, King, Richmond",Brazilian Restaurant,1,1,1,1,1
"Adelaide, King, Richmond",Breakfast Spot,2,2,2,2,2
"Adelaide, King, Richmond",Building,1,1,1,1,1


In [42]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.030000,...,0.000000,0.0,0.020000,0.00000,0.000000,0.000000,0.000000,0.010000,0.00000,0.0
1,Agincourt,0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.00000,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.00000,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.00000,0.0
4,"Alderwood, Long Branch",0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.00000,0.0
5,"Bathurst Manor, Downsview North, Wilson Heights",0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.00000,0.052632,0.000000,0.000000,0.000000,0.00000,0.0
6,Bayview Village,0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.00000,0.0
7,"Bedford Park, Lawrence Manor East",0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.045455,...,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.00000,0.0
8,Berczy Park,0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.017544,0.00000,0.000000,0.000000,0.000000,0.000000,0.00000,0.0
9,"Birch Cliff, Cliffside West",0.000000,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.00000,0.0
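
One-hot encoding followed by a per-neighborhood mean amounts to the relative frequency of each venue category within a neighborhood. The plain-Python sketch below illustrates that equivalence; it is not the notebook's code, `category_frequencies` is a hypothetical helper, and the input is the first few (neighborhood, category) pairs from the venue table above.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs from the first rows of the venue table
venues = [
    ("Parkwoods", "Park"),
    ("Parkwoods", "Food & Drink Shop"),
    ("Victoria Village", "Hockey Arena"),
    ("Victoria Village", "Coffee Shop"),
    ("Victoria Village", "Portuguese Restaurant"),
    ("Victoria Village", "Intersection"),
    ("Victoria Village", "Pizza Place"),
]


def category_frequencies(venues):
    """Map each neighborhood to {category: share of that neighborhood's venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    freqs = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        freqs[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return freqs


freqs = category_frequencies(venues)
print(freqs["Parkwoods"])
```

Categories that never occur in a neighborhood are simply absent here, whereas the grouped DataFrame stores them as explicit zeros.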


## 2. Prepare data for New York

In [47]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [48]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [49]:
neighborhoods_data = newyork_data['features']

In [50]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
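
Note that GeoJSON points store coordinates as `[longitude, latitude]`, the reverse of the usual "lat, lon" order. A minimal sketch of turning one feature into a row, using a trimmed copy of the feature shown above (`feature_to_row` is a hypothetical helper, not part of the notebook):

```python
# Trimmed copy of the first feature in the dataset
feature = {
    "geometry": {"type": "Point", "coordinates": [-73.84720052054902, 40.89470517661]},
    "properties": {"name": "Wakefield", "borough": "Bronx"},
}


def feature_to_row(feature):
    """Extract borough, neighborhood name and coordinates from one GeoJSON feature."""
    lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order is [longitude, latitude]
    props = feature["properties"]
    return {
        "Borough": props["borough"],
        "Neighborhood": props["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


row = feature_to_row(feature)
print(row)
```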

In [51]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [52]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)

In [53]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [54]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [56]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


In [57]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,...,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [58]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,...,Vietnamese Restaurant,Volleyball Court,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.02,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.02,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.01,0.03
2,Central Harlem,0.0,0.0,0.0,0.065217,0.043478,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0
4,Chinatown,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Civic Center,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.03
6,Clinton,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.03,0.0,0.0,0.0
7,East Harlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,East Village,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.02,...,0.02,0.0,0.0,0.0,0.0,0.05,0.01,0.0,0.0,0.0
9,Financial District,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.01,0.0


In [59]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


## Methodology <a name="methodology"></a>

In this project we will compare Toronto and New York by their neighbourhoods, using venue data across different categories.

For each neighbourhood, we compare the number of venues in each category and use k-means clustering to group the neighbourhoods, to see whether there are significant differences between the two cities.
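
Clustering neighbourhoods from both cities together requires that every neighbourhood be described by the same set of category columns, since each city yields a different list of venue categories. One possible way to do this, sketched in plain Python, is to take the union of categories and fill missing ones with zero. This is an assumption about how the comparison could be set up, not the notebook's method; `align_features` is a hypothetical helper and the New York frequencies are made-up illustrative values.

```python
def align_features(city_a, city_b, label_a="Toronto", label_b="New York"):
    """Return (categories, rows) where every row uses the same category order.

    city_a and city_b map neighbourhood name -> {category: frequency}.
    """
    categories = sorted(
        {cat for city in (city_a, city_b) for freqs in city.values() for cat in freqs}
    )

    def to_rows(city, label):
        return [
            ((label, name), [freqs.get(cat, 0.0) for cat in categories])
            for name, freqs in city.items()
        ]

    return categories, to_rows(city_a, label_a) + to_rows(city_b, label_b)


toronto = {"Parkwoods": {"Park": 0.5, "Food & Drink Shop": 0.5}}
new_york = {"Marble Hill": {"Yoga Studio": 0.25, "Park": 0.75}}  # made-up values

categories, rows = align_features(toronto, new_york)
print(categories)
for key, vector in rows:
    print(key, vector)
```

The resulting vectors all have the same length and can be fed to a clustering step, with the city label kept alongside each row for comparison afterwards.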


## 3. k-means Clustering

#### Define a function that assigns each datapoint to a cluster

In [None]:
import numpy as np

colors_map = np.array(['b', 'r'])
def assign_members(x1, x2, centers):
    # Euclidean distance from every point to each of the two centers
    compare_to_first_center = np.sqrt(np.square(np.array(x1) - centers[0][0]) + np.square(np.array(x2) - centers[0][1]))
    compare_to_second_center = np.sqrt(np.square(np.array(x1) - centers[1][0]) + np.square(np.array(x2) - centers[1][1]))
    # True where a point is closer to the second center than to the first
    class_of_points = compare_to_first_center > compare_to_second_center
    colors = colors_map[class_of_points.astype(int)]
    return colors, class_of_points

print('assign_members function defined!')

#### Define a function that updates the centroid of each cluster

In [None]:
# update means
def update_centers(x1, x2, class_of_points):
    center1 = [np.mean(np.array(x1)[~class_of_points]), np.mean(np.array(x2)[~class_of_points])]
    center2 = [np.mean(np.array(x1)[class_of_points]), np.mean(np.array(x2)[class_of_points])]
    return [center1, center2]

print('update_centers function defined!')
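
The two functions above are the two halves of one k-means iteration: assign each point to its nearest center, then move each center to the mean of its members. The following plain-Python sketch shows the full loop without NumPy. It is an illustration rather than the notebook's code: `kmeans_step` and `kmeans` are hypothetical names, it accepts any number of centers instead of exactly two, and the points are toy values.

```python
import math


def kmeans_step(points, centers):
    """One iteration: assign points to the nearest center, then recompute centers."""
    labels = []
    for x, y in points:
        distances = [math.hypot(x - cx, y - cy) for cx, cy in centers]
        labels.append(distances.index(min(distances)))  # ties go to the first center

    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            mean_x = sum(p[0] for p in members) / len(members)
            mean_y = sum(p[1] for p in members) / len(members)
            new_centers.append((mean_x, mean_y))
        else:
            new_centers.append(old_center)  # leave an empty cluster's center unchanged
    return labels, new_centers


def kmeans(points, centers, max_iter=100):
    """Repeat kmeans_step until the centers stop moving or max_iter is reached."""
    labels = []
    for _ in range(max_iter):
        labels, new_centers = kmeans_step(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


points = [(0, 0), (0, 1), (10, 10), (10, 11)]   # toy data
labels, centers = kmeans(points, [(0, 0), (10, 10)])
print(labels, centers)
```

Stopping when the centers no longer change is the usual convergence test; a tolerance on the movement could be used instead when the data are floating-point.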

#### Define a function that plots the data points along with the cluster centroids

In [None]:
import matplotlib.pyplot as plt

def plot_points(centroids=None, colors='g', figure_title=None):
    # plot the figure; x1 and x2 are read from the notebook's global scope
    fig = plt.figure(figsize=(15, 10))  # create a figure object
    ax = fig.add_subplot(1, 1, 1)
    
    centroid_colors = ['bx', 'rx']
    if centroids:
        for (i, centroid) in enumerate(centroids):
            ax.plot(centroid[0], centroid[1], centroid_colors[i], markeredgewidth=5, markersize=20)
    plt.scatter(x1, x2, s=500, c=colors)
    
    # define the ticks
    xticks = np.linspace(-6, 8, 15, endpoint=True)
    yticks = np.linspace(-6, 6, 13, endpoint=True)

    # fix the horizontal axis
    ax.set_xticks(xticks)
    ax.set_yticks(yticks)

    # add tick labels
    xlabels = xticks
    ax.set_xticklabels(xlabels)
    ylabels = yticks
    ax.set_yticklabels(ylabels)

    # style the ticks
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
    ax.tick_params('both', length=2, width=1, which='major', labelsize=15)
    
    # add labels to axes
    ax.set_xlabel('x1', fontsize=20)
    ax.set_ylabel('x2', fontsize=20)
    
    # add title to figure
    ax.set_title(figure_title, fontsize=24)

    plt.show()

print('plot_points function defined!')