# Segmenting and Clustering Neighborhoods in Toronto

This notebook was developed by <b>Jorge Quintero Bermejo</b> for the week 3 task of the IBM Data Capstone Project.

## 1. In section 1, the dataframe with the neighborhoods of Toronto is built following Coursera instructions.

##### Importing Libraries:

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium
import folium # plotting library

from bs4 import BeautifulSoup

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

print('Libraries imported.')

Libraries imported.


##### Transform the data in the table on the Wikipedia page into a pandas dataframe

In [2]:
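from io import StringIO

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
tab = str(soup.table)  # HTML of the first table on the page

# recent pandas versions deprecate passing a raw HTML string, so wrap it in a file-like object
dfs = pd.read_html(StringIO(tab))
df = dfs[0]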
df.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


##### Data transformation

In [3]:
# Ignore cells with a borough that is Not assigned
df_aux = df[df['Borough']!='Not assigned']

# Combining the neighbourhoods with same Postalcode
df1 = df_aux.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df1.reset_index(inplace=True)

# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
df1['Neighbourhood'] = np.where(df1['Neighbourhood'] == 'Not assigned',df1['Borough'], df1['Neighbourhood'])

df1.rename(columns={'Postal Code':'Postalcode'},inplace=True)

df1.head(12)

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
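
The three cleaning rules above can be tried on a small in-memory frame. This is only a minimal sketch: the rows in `raw` are made up for illustration (the `M9Z` code is hypothetical), and `clean_postal_codes` is a hypothetical helper name that does not appear in the notebook.

```python
import numpy as np
import pandas as pd


def clean_postal_codes(df):
    """Apply the three cleaning rules to a raw postal-code table."""
    # Rule 1: drop rows whose borough is 'Not assigned'.
    out = df[df['Borough'] != 'Not assigned']
    # Rule 2: one row per postal code, neighbourhoods joined with ', '.
    out = (out.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
              .agg(', '.join)
              .reset_index())
    # Rule 3: a 'Not assigned' neighbourhood falls back to the borough name.
    out['Neighbourhood'] = np.where(out['Neighbourhood'] == 'Not assigned',
                                    out['Borough'], out['Neighbourhood'])
    return out


# Illustrative rows only; 'M9Z' is a made-up postal code.
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M9Z'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})
cleaned = clean_postal_codes(raw)
print(cleaned)
```

Selecting the `Neighbourhood` column before aggregating keeps the join limited to the one column that should be concatenated.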


##### Dataframe shape

In [4]:
df1.shape

(103, 3)

## 2. In section 2, latitude and longitude are added to the dataframe with the neighborhoods of Toronto.

##### Getting coordinates: latitude and longitude of all postal codes

In [5]:
df_coordinates = pd.read_csv('https://cocl.us/Geospatial_data')
df_coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


##### Merging 2 datasets: 'df1' (postal code, borough and neighbourhood) and 'df_coordinates' (postal code, latitude and longitude)

In [6]:
df_coordinates.rename(columns={'Postal Code':'Postalcode'},inplace=True)
df2 = pd.merge(df1,df_coordinates,on='Postalcode')
df2.head(12)

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
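
`pd.merge` defaults to an inner join, so a postal code that is missing from the coordinates file would be dropped without any warning. The sketch below shows one way to check for that, using `how='left'`, `indicator=True` and `validate`. The frames are toy data: the `M0X` row and its borough are made up for illustration and do not come from the notebook.

```python
import pandas as pd

# Illustrative rows; 'M0X' is a made-up postal code with no coordinates.
neighbourhoods = pd.DataFrame({
    'Postalcode': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Nowhere'],
})
coordinates = pd.DataFrame({
    'Postalcode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every neighbourhood row, indicator=True adds a '_merge'
# column, and validate raises MergeError if a postal code is duplicated.
merged = pd.merge(neighbourhoods, coordinates, on='Postalcode',
                  how='left', indicator=True, validate='one_to_one')

unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postalcode'].tolist()
print('Postal codes without coordinates:', unmatched)
```

If `unmatched` is empty, the inner join used in the notebook loses no rows.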


## 3. In section 3, the neighborhoods of Toronto are clustered.

##### Choosing only boroughs that contain 'Toronto' in their names

In [7]:
toronto_data = df2[df2['Borough'].str.contains('Toronto',regex=False)]
toronto_data.head(10)

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
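
As a small illustration of the filter above, the sketch below applies `str.contains` to a toy Series. The `None` entry is an assumption added to show the `na` parameter. The cleaned dataframe in this notebook has no missing boroughs, so `na=False` would be purely defensive here.

```python
import pandas as pd

# Toy data; the None entry is added only to demonstrate the na parameter.
boroughs = pd.Series(['Downtown Toronto', 'North York', None, 'East Toronto'])

# regex=False treats 'Toronto' as a literal substring;
# na=False makes missing values count as "no match" instead of NaN.
mask = boroughs.str.contains('Toronto', regex=False, na=False)
print(boroughs[mask].tolist())
```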


##### Get coordinates: latitude and longitude of Toronto.

In [8]:
address = 'Toronto'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


##### Create map of Toronto with the samples

In [9]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The rendered map can be viewed at https://eu-gb.dataplatform.cloud.ibm.com/analytics/notebooks/v2/3cbde2a1-de69-4172-b9a0-ddc8468186b0/view?access_token=3070c64b57ed62343b195051f9ea8215dc218331f77bb3ce65be1077d206516c

##### Check the number of boroughs

In [10]:
toronto_neighbours = toronto_data['Borough'].value_counts().to_frame()
print('There are ', len(toronto_neighbours),'types of boroughs.')
toronto_neighbours

There are  4 types of boroughs.


Unnamed: 0,Borough
Downtown Toronto,19
Central Toronto,9
West Toronto,6
East Toronto,5


##### Since there are 4 boroughs, K-means clustering is run with k equal to the number of boroughs

In [11]:
# set the number of clusters to the number of boroughs
k = len(toronto_neighbours)

# cluster on the coordinates only
toronto_clustering = toronto_data.drop(columns=['Postalcode', 'Borough', 'Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)

# add the cluster label generated for each row as the first column
toronto_data.insert(0, 'Cluster Labels', kmeans.labels_)


In [12]:
toronto_data.head(12)

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighbourhood,Latitude,Longitude
2,3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,3,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,3,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,0,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,3,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,3,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,1,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,3,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
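
Setting k to the number of boroughs is one possible choice. A common sanity check is to compare the K-means inertia (the within-cluster sum of squared distances) for several values of k. The sketch below does this on a handful of coordinates copied from the tables in this notebook. It is not part of the original analysis, and with so few points the numbers are only illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# (latitude, longitude) pairs copied from the notebook's tables; a small subset only.
coords = np.array([
    [43.676357, -79.293031],  # The Beaches
    [43.679557, -79.352188],  # The Danforth West, Riverdale
    [43.659526, -79.340923],  # Studio District
    [43.669005, -79.442259],  # Dufferin, Dovercourt Village
    [43.661608, -79.464763],  # High Park, The Junction South
    [43.651571, -79.484450],  # Runnymede, Swansea
    [43.728020, -79.388790],  # Lawrence Park
    [43.711695, -79.416936],  # Roselawn
    [43.704324, -79.388790],  # Davisville
    [43.654260, -79.360636],  # Regent Park, Harbourfront
    [43.657162, -79.378937],  # Garden District, Ryerson
    [43.650571, -79.384568],  # Richmond, Adelaide, King
])

# Fit K-means for k = 1..6 and record the inertia of each fit.
inertias = {}
for n_clusters in range(1, 7):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(coords)
    inertias[n_clusters] = model.inertia_

for n_clusters, inertia in inertias.items():
    print(f'k={n_clusters}: inertia={inertia:.6f}')
```

A value of k after which the inertia stops dropping sharply is often taken as a reasonable number of clusters.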


##### Draw map with each sample in colour based on the cluster label

In [13]:
map_clusters = folium.Map(location=[latitude,longitude],zoom_start=12)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood'], toronto_data['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The rendered map can be viewed at https://eu-gb.dataplatform.cloud.ibm.com/analytics/notebooks/v2/3cbde2a1-de69-4172-b9a0-ddc8468186b0/view?access_token=3070c64b57ed62343b195051f9ea8215dc218331f77bb3ce65be1077d206516c

##### Analyzing the relationship between cluster labels and boroughs

In [14]:
# There are 4 boroughs in the selected dataframe
toronto_neighbours

Unnamed: 0,Borough
Downtown Toronto,19
Central Toronto,9
West Toronto,6
East Toronto,5


In [15]:
# The index of these 4 boroughs has the following order.
toronto_neighbours.index

Index(['Downtown Toronto', 'Central Toronto', 'West Toronto', 'East Toronto'], dtype='object')

##### Dataframes of each cluster label and its borough:

In [16]:
toronto_data[toronto_data['Cluster Labels'] == 0]

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighbourhood,Latitude,Longitude
19,0,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,0,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
47,0,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
54,0,M4M,East Toronto,Studio District,43.659526,-79.340923
100,0,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558


In [17]:
toronto_data[toronto_data['Cluster Labels'] == 1]

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighbourhood,Latitude,Longitude
25,1,M6G,Downtown Toronto,Christie,43.669542,-79.422564
31,1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
37,1,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975
43,1,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191
69,1,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763
75,1,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325
81,1,M6S,West Toronto,"Runnymede, Swansea",43.651571,-79.48445


In [18]:
toronto_data[toronto_data['Cluster Labels'] == 2]

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighbourhood,Latitude,Longitude
61,2,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
62,2,M5N,Central Toronto,Roselawn,43.711695,-79.416936
67,2,M4P,Central Toronto,Davisville North,43.712751,-79.390197
68,2,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307
73,2,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
79,2,M4S,Central Toronto,Davisville,43.704324,-79.38879
83,2,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
86,2,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


In [19]:
toronto_data[toronto_data['Cluster Labels'] == 3]

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighbourhood,Latitude,Longitude
2,3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,3,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,3,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
20,3,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,3,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
30,3,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
36,3,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752
42,3,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576
48,3,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817


##### Check how well the cluster labels match the boroughs, given that the clustering is based on the distance between samples.

- Cluster 0: East Toronto
- Cluster 1: West Toronto
- Cluster 2: Central Toronto
- Cluster 3: Downtown Toronto

In [20]:
# The cluster labels happen to run in the reverse order of toronto_neighbours.index
# (cluster 3 = Downtown, ..., cluster 0 = East), hence the k-1-i lookup.
for i in range(k):
    borough = toronto_neighbours.index[i]
    toronto_area_df = toronto_data[toronto_data['Cluster Labels'] == k-1-i]
    number_out_area = len(toronto_area_df[toronto_area_df['Borough'] != borough])
    print(f'Cluster {k-1-i}: {number_out_area} postal code(s) belong to a borough other than {borough}.')
    number_in_area = len(toronto_area_df[toronto_area_df['Borough'] == borough])
    print(f'Cluster {k-1-i}: {number_in_area} postal code(s) belong to {borough}.')

Cluster 3: 1 postal code(s) belong to a borough other than Downtown Toronto.
Cluster 3: 18 postal code(s) belong to Downtown Toronto.
Cluster 2: 0 postal code(s) belong to a borough other than Central Toronto.
Cluster 2: 8 postal code(s) belong to Central Toronto.
Cluster 1: 1 postal code(s) belong to a borough other than West Toronto.
Cluster 1: 6 postal code(s) belong to West Toronto.
Cluster 0: 0 postal code(s) belong to a borough other than East Toronto.
Cluster 0: 5 postal code(s) belong to East Toronto.
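
The loop above depends on the cluster labels happening to run in the reverse order of `toronto_neighbours.index`, which is not guaranteed for a different `random_state` or dataset. A contingency table gives the same comparison without that assumption. The sketch below is one possible alternative, not the notebook's method. It rebuilds the label/borough pairs from the rows displayed in the In [16] to In [19] outputs, so it covers only the rows shown there.

```python
import pandas as pd

# Label/borough pairs for the rows displayed in the In [16]-[19] outputs.
sample = pd.DataFrame({
    'Cluster Labels': [0] * 5 + [1] * 7 + [2] * 8 + [3] * 10,
    'Borough': (['East Toronto'] * 5
                + ['Downtown Toronto'] + ['West Toronto'] * 6
                + ['Central Toronto'] * 8
                + ['Downtown Toronto'] * 10),
})

# Rows are cluster labels, columns are boroughs, cells are row counts.
table = pd.crosstab(sample['Cluster Labels'], sample['Borough'])
print(table)

# The most frequent borough in each cluster, without assuming any label order.
majority = table.idxmax(axis=1)
print(majority)
```

Off-diagonal counts in `table`, such as the single Downtown Toronto row in cluster 1, are the postal codes whose cluster does not match their borough.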
