# Canada's boroughs and neighbourhoods

In this notebook a dataset of Canada's boroughs and neighbourhoods is created.

## 1. Creating borough/neighbourhood dataset

In [47]:
import pandas as pd
import requests
import folium
import numpy as np
import json
import matplotlib.cm as cm
import matplotlib.colors as colors
import codecs

Wikipedia has a page listing the Canadian postal codes that start with M (the Toronto area). The page contains several tables; the one with boroughs and neighbourhoods is the first (index 0).

In [2]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
# The borough/neighborhood table is the first one
df = data[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Rows whose borough is 'Not assigned' are removed.

df = df[df.Borough != 'Not assigned']
df.head()

Where the neighbourhood is 'Not assigned', the borough name is used as the neighbourhood.

In [4]:
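# A boolean mask with .loc writes to df itself. The chained form
# df.iloc[idx]['Neighbourhood'] = ... assigns to a temporary copy and leaves
# df unchanged; iloc also expects positions, not the index labels kept after filtering.
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']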

Grouping the dataset

In [5]:
res = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(','.join).reset_index()
res.to_csv('canada.csv', index=False)
res.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M1B,Scarborough,"Rouge,Malvern"
2,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
3,M1E,Scarborough,"Guildwood,Morningside,West Hill"
4,M1G,Scarborough,Woburn
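
To see what the grouping step does on a small scale, here is a minimal sketch on a three-row excerpt shaped like the Wikipedia table. The `toy` and `grouped` names are only for illustration and are not part of the notebook.

```python
import pandas as pd

# Three rows shaped like the source table; two of them share postcode M1B.
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (Postcode, Borough); neighbourhood names are joined with commas.
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .apply(','.join)
       .reset_index()
)
print(grouped)
```

The two M1B rows collapse into a single row whose neighbourhood is `Rouge,Malvern`, which is the shape of the output above.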


Printing the dataset shape

In [6]:
print("Rows: ", res.shape[0])
print("Attributes: ", res.shape[1])

Rows:  180
Attributes:  3


## 2. Getting geolocation data

Geolocation data will be linked to the boroughs/neighbourhoods dataset.

In [7]:
import geocoder

A function to get the geolocation data for each postcode.

In [8]:
def getgeo(postcode):
    lat_lng_coords = None
    i = 0
    while(lat_lng_coords is None and i < 10):
        g = geocoder.google('{}, Toronto, Ontario'.format(postcode))
        lat_lng_coords = g.latlng
        print(lat_lng_coords)
        i += 1
    return lat_lng_coords

In [9]:
getgeo('M1B')

None
None
None
None
None
None
None
None
None
None


Testing shows that geocoder is not working: all ten attempts return None.

Therefore, the given geolocation dataset will be used. First it has to be downloaded.

In [10]:
!wget -O geodata.csv https://cocl.us/Geospatial_data

--2019-03-22 20:31:11--  https://cocl.us/Geospatial_data
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-03-22 20:31:17--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.25.197, 107.152.24.197
Connecting to ibm.box.com (ibm.box.com)|107.152.25.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-03-22 20:31:18--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/p

What do we have in the geolocation dataset?

In [11]:
geo = pd.read_csv('geodata.csv')
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Adjusting column names prior to merging with the boroughs/neighbourhoods dataset.

In [12]:
geo.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
geo.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merging the two datasets on the postcode.

In [13]:
df_geo = res.merge(geo, on='Postcode')
df_geo.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [14]:
df_geo.to_csv('geo_toronto.csv', index=False)
len(df_geo)

103
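
The grouped dataset had 180 rows, while the merged one has 103. `DataFrame.merge` performs an inner join by default, so postcodes that have no entry in the geolocation file are dropped. A minimal sketch of how the dropped postcodes could be listed, using hypothetical miniature versions of `res` and `geo` (the `_toy` frames are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical miniature stand-ins for `res` and `geo`.
res_toy = pd.DataFrame({
    'Postcode': ['M1A', 'M1B', 'M1C'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough'],
})
geo_toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge with indicator=True keeps every row of res_toy and adds a
# `_merge` column that records whether a match was found in geo_toy.
check = res_toy.merge(geo_toy, on='Postcode', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

Running the same check on the real `res` and `geo` would show which postcodes account for the difference in row counts.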

## 3. Exploring and clustering boroughs and neighbourhoods

Getting latitude and longitude for Toronto

In [2]:
from geopy.geocoders import Nominatim
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [3]:
df_geo = pd.read_csv('geo_toronto.csv')

Creating a map for Toronto with neighbourhoods superimposed

In [5]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for idx in range(len(df_geo)):
    row = df_geo.iloc[idx]
    label = '{}, {}'.format(row.Neighbourhood, row.Borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row.Latitude, row.Longitude],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Getting venues for each location

In [15]:
def getNearbyVenues(lat, long, radius=500, limit=100):
    # CLIENT_ID and CLIENT_SECRET are globals; they must be defined before the first call.
    VERSION = '20180604'  # Foursquare API version date
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius,
        limit
    )
    results = requests.get(url).content
    # The venue items are nested under response -> groups[0] -> items.
    return json.loads(results)['response']['groups'][0]['items']
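
The function above indexes straight into `['response']['groups'][0]['items']`, which raises an exception if a request fails or returns no groups. A minimal sketch of a more defensive extraction step follows. The helper name `extract_items` and the shape of the failed payload are assumptions made for illustration; only the `response -> groups -> items` nesting comes from the notebook.

```python
def extract_items(payload):
    """Return the venue items from an explore-style payload, or [] if absent."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])

# A payload with the nesting used in the notebook.
ok = {'response': {'groups': [{'items': [{'venue': {'name': "Wendy's"}}]}]}}
# A hypothetical error payload without any groups.
failed = {'meta': {'code': 400}, 'response': {}}

print(len(extract_items(ok)), len(extract_items(failed)))
```

With a helper like this, a location whose request fails would simply contribute an empty venue list instead of stopping the loop.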

In [17]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [48]:
venues = []
for idx in range(len(df_geo)):
    row = df_geo.iloc[idx]
    lat = row['Latitude']
    long = row['Longitude']
    r = getNearbyVenues(lat, long)
    venues.append(r)

In [80]:
df_geo['venues'] = venues

In [81]:
df_geo.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,venues
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,"[{'reasons': {'count': 0, 'items': [{'summary'..."
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,"[{'reasons': {'count': 0, 'items': [{'summary'..."
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,"[{'reasons': {'count': 0, 'items': [{'summary'..."
3,M1G,Scarborough,Woburn,43.770992,-79.216917,"[{'reasons': {'count': 0, 'items': [{'summary'..."
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,"[{'reasons': {'count': 0, 'items': [{'summary'..."


In [82]:
df_geo.iloc[0]['venues']

[{'reasons': {'count': 0,
   'items': [{'summary': 'This spot is popular',
     'type': 'general',
     'reasonName': 'globalInteractionReason'}]},
  'venue': {'id': '4bb6b9446edc76b0d771311c',
   'name': "Wendy's",
   'location': {'crossStreet': 'Morningside & Sheppard',
    'lat': 43.80744841934756,
    'lng': -79.19905558052072,
    'labeledLatLngs': [{'label': 'display',
      'lat': 43.80744841934756,
      'lng': -79.19905558052072}],
    'distance': 387,
    'cc': 'CA',
    'city': 'Toronto',
    'state': 'ON',
    'country': 'Canada',
    'formattedAddress': ['Toronto ON', 'Canada']},
   'categories': [{'id': '4bf58dd8d48988d16e941735',
     'name': 'Fast Food Restaurant',
     'pluralName': 'Fast Food Restaurants',
     'shortName': 'Fast Food',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/fastfood_',
      'suffix': '.png'},
     'primary': True}],
   'photos': {'count': 0, 'groups': []}},
  'referralId': 'e-0-4bb6b9446edc76b0d771311c-0'}]
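
Each item carries a `distance` field (387 for this venue), which appears to be the distance in metres from the queried point. As a rough sanity check, the sketch below computes the great-circle distance between the M1B coordinates from the geolocation dataset and the venue coordinates shown above. The `haversine_m` helper and the mean Earth radius of 6,371 km are assumptions for illustration, not part of the notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres, assuming a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))

centre = (43.806686, -79.194353)                  # M1B, from the geolocation dataset
venue = (43.80744841934756, -79.19905558052072)   # Wendy's, from the response above
d = haversine_m(*centre, *venue)
print(round(d))
```

The result should land within a few metres of the reported value and inside the 500 m search radius.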

In [83]:
venues = []
for i in range(len(df_geo)):
    row = df_geo.iloc[i]
    n = len(row['venues'])
    n = n if n < 100 else 100
    for j in range(n):
        venue = row['venues'][j]['venue']
        name = venue['name']
        lat = venue['location']['lat']
        long = venue['location']['lng']
        cat = venue['categories'][0]['name']
        venues.append({
            'Postcode': row['Postcode'],
            'name': name,
            'lat': lat,
            'long': long,
            'cat': cat
        })

In [84]:
df_venues = pd.DataFrame(venues)
df_venues.head()

Unnamed: 0,Postcode,cat,lat,long,name
0,M1B,Fast Food Restaurant,43.807448,-79.199056,Wendy's
1,M1C,Bar,43.782533,-79.163085,Royal Canadian Legion
2,M1E,Pizza Place,43.767697,-79.189914,Swiss Chalet Rotisserie & Grill
3,M1E,Electronics Store,43.765309,-79.191537,G & G Electronics
4,M1E,Spa,43.766,-79.191,Marina Spa
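
The nested loop above flattens the response by hand. As an alternative, `pd.json_normalize` can flatten nested dictionaries into dotted column names. This is a sketch only: it assumes pandas 1.0 or later (where `json_normalize` is available at the top level) and uses a trimmed copy of the single item shown earlier.

```python
import pandas as pd

# A trimmed copy of the item returned for M1B.
items = [{
    'venue': {
        'name': "Wendy's",
        'location': {'lat': 43.80744841934756, 'lng': -79.19905558052072},
        'categories': [{'name': 'Fast Food Restaurant'}],
    }
}]

# Nested dictionaries become dotted columns such as 'venue.location.lat';
# lists (here 'venue.categories') are kept as-is in a single column.
flat = pd.json_normalize(items)
flat['cat'] = flat['venue.categories'].apply(lambda c: c[0]['name'] if c else None)
print(flat[['venue.name', 'venue.location.lat', 'venue.location.lng', 'cat']])
```

The `if c else None` guard also covers venues with an empty category list, which the hand-written loop would not handle.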


In [85]:
mrg = df_geo.merge(df_venues, on='Postcode')
del mrg['venues']
mrg.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,cat,lat,long,name
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,Fast Food Restaurant,43.807448,-79.199056,Wendy's
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,Bar,43.782533,-79.163085,Royal Canadian Legion
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Pizza Place,43.767697,-79.189914,Swiss Chalet Rotisserie & Grill
3,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Electronics Store,43.765309,-79.191537,G & G Electronics
4,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Spa,43.766,-79.191,Marina Spa


In [86]:
ct = pd.crosstab(mrg['Postcode'], mrg['cat'])

At this point we have the number of venues in each location, broken down by category.

In [87]:
ct.head()

cat,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
M1B,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1C,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1E,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1G,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1H,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
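
Because the real table is wide and mostly zeros, the effect of `pd.crosstab` is easier to see on a small example. The sketch below uses four venue rows shaped like `mrg` (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Four venue rows: one in M1B and three in M1E, two of which are spas.
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1E', 'M1E', 'M1E'],
    'cat': ['Fast Food Restaurant', 'Pizza Place', 'Spa', 'Spa'],
})

# Rows are postcodes, columns are categories, cells are venue counts.
counts = pd.crosstab(toy['Postcode'], toy['cat'])
print(counts)
```

Each postcode becomes one row of counts, and these rows are the feature vectors passed to the clustering step.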


Clustering the locations with k-means

In [88]:
from sklearn.cluster import KMeans
kclusters = 8
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ct.values)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
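
The first ten labels are all 1, which by itself says little about how the locations are spread across the eight clusters. A minimal sketch of counting cluster sizes with `np.bincount` follows. The `labels` array here is hypothetical; in the notebook it would be `kmeans.labels_`, with `minlength=kclusters`.

```python
import numpy as np

# Hypothetical labels for ten locations and four clusters.
labels = np.array([1, 1, 1, 0, 1, 2, 1, 1, 3, 1])
n_clusters = 4

# sizes[k] is the number of locations assigned to cluster k;
# minlength makes sure empty clusters show up as zeros.
sizes = np.bincount(labels, minlength=n_clusters)
for cluster_id, size in enumerate(sizes):
    print(cluster_id, size)
```

A very uneven distribution, with one cluster holding most locations, would suggest that raw venue counts are dominated by a few busy postcodes.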

In [89]:
complete = df_geo.merge(ct, on='Postcode')
complete['cluster'] = kmeans.labels_
complete.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,venues,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,cluster
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353,"[{'reasons': {'count': 0, 'items': [{'summary'...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,"[{'reasons': {'count': 0, 'items': [{'summary'...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,"[{'reasons': {'count': 0, 'items': [{'summary'...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,M1G,Scarborough,Woburn,43.770992,-79.216917,"[{'reasons': {'count': 0, 'items': [{'summary'...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,"[{'reasons': {'count': 0, 'items': [{'summary'...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Preparing a map that colours each location by its cluster

In [90]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for i in range(len(complete)):
    row = complete.iloc[i]
    label = folium.Popup(row.Borough + ' Cluster ' + str(row.cluster), parse_html=True)
    folium.CircleMarker(
        [row.Latitude, row.Longitude],
        radius=5,
        popup=label,
        color=rainbow[row.cluster-1],
        fill=True,
        fill_color=rainbow[row.cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters