# Segmenting and Clustering Neighborhoods in Toronto

by Jianxu Shi   
Data Scientist in Training

*This notebook completes the second peer-graded assignment in the course "Applied Data Science Capstone".
The goal is to segment and cluster the neighborhoods of the city of Toronto. The process consists of five steps.
First, a list of Canadian postal codes, along with their associated boroughs and neighborhoods, is downloaded from Wikipedia.
Second, the data is pre-processed into a dataframe. Third, coordinates are obtained for each neighborhood using API calls.
Fourth, the neighborhoods in Toronto are explored and clustered. Last, the neighborhoods and clusters are visualized on a map.*

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import folium
!pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/ec/be/5ab8abdd8663c0386ec2dd595a5bc0e23330a0549b8a91e32f38c20845b6/lxml-4.4.1-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
Installing collected packages: lxml
Successfully installed lxml-4.4.1


In [23]:
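# step 1: download the postal code table from Wikipedia
from io import StringIO

html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# read_html returns a list of DataFrames, one per table found on the page;
# wrapping the text in StringIO avoids the literal-string deprecation in newer pandas
df = pd.read_html(StringIO(html))
print(df[0].head(10))

# save the table to CSV for future use
file = 'list_of_postal_codes_canada.csv'
df[0].to_csv(file, index=False)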

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
5      M5A  Downtown Toronto       Regent Park
6      M6A        North York  Lawrence Heights
7      M6A        North York    Lawrence Manor
8      M7A      Queen's Park      Not assigned
9      M8A      Not assigned      Not assigned
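
Step 1 takes `df[0]`, which assumes the postal code table is the first table on the page. A more defensive variant, shown here only as a minimal sketch, selects the table by its column names instead. The helper `pick_table` is hypothetical (it is not part of pandas or of the original notebook), and the toy DataFrames stand in for the list that `pd.read_html` returns.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")


# Toy stand-ins for the list of tables returned by pd.read_html
tables = [
    pd.DataFrame({"Unrelated": [1, 2]}),
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(postal)
```

If the page layout changes and no table matches, the helper raises a `ValueError` rather than silently processing the wrong table.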


In [30]:
# step 2: process the data into a desired dataframe
df_pc = pd.read_csv(file)

# set the column names, drop rows where Borough is 'Not assigned',
# and copy the Borough name into Neighborhood where it is 'Not assigned'
df_pc.columns = ['PostalCode', 'Borough', 'Neighborhood']
df_pc = df_pc[df_pc.Borough != 'Not assigned'].copy()  # copy avoids SettingWithCopyWarning below
df_pc.loc[df_pc['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df_pc['Borough']

# merge neighborhoods that share a PostalCode (assuming they share a Borough as well)
df_merged = df_pc.groupby(['PostalCode', 'Borough'], sort=False).agg(', '.join)
df_pc = df_merged.reset_index()

df_pc.head(10)

  PostalCode           Borough                      Neighborhood
0        M3A        North York                         Parkwoods
1        M4A        North York                  Victoria Village
2        M5A  Downtown Toronto         Harbourfront, Regent Park
3        M6A        North York  Lawrence Heights, Lawrence Manor
4        M7A      Queen's Park                      Queen's Park
5        M9A         Etobicoke                  Islington Avenue
6        M1B       Scarborough                    Rouge, Malvern
7        M3B        North York                   Don Mills North
8        M4B         East York   Woodbine Gardens, Parkview Hill
9        M5B  Downtown Toronto          Ryerson, Garden District
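
To make the three cleaning operations in step 2 easier to follow, here is a small self-contained sketch that applies the same logic to a four-row toy table. The toy rows and the variable names `raw`, `assigned`, `missing`, and `merged` are illustrative assumptions, not data or names from the notebook.

```python
import pandas as pd

# Toy input mimicking the raw table: one unassigned row, one postal code
# split over two rows, and one borough without a neighborhood
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. drop rows without a borough (copy so later assignment is unambiguous)
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# 2. where the neighborhood is missing, fall back to the borough name
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# 3. join neighborhoods that share a postal code into one comma-separated string
merged = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

Passing `sort=False` keeps the postal codes in their order of first appearance, which is why the notebook output starts with M3A rather than M1B.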


In [31]:
# print the shape of the postal code dataframe
print(df_pc.shape)

(103, 3)
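
The shape alone does not confirm that the cleaning worked as intended. As a possible follow-up, the sketch below checks that each postal code appears once and that no 'Not assigned' placeholder survives. The function `check_postal_frame` is a hypothetical helper, and the two small frames are made-up test inputs; in the notebook it would be called as `check_postal_frame(df_pc)`.

```python
import pandas as pd


def check_postal_frame(df):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper for illustration; expects the columns
    PostalCode, Borough, and Neighborhood.
    """
    problems = []
    if df["PostalCode"].duplicated().any():
        problems.append("duplicate postal codes")
    if (df[["Borough", "Neighborhood"]] == "Not assigned").any().any():
        problems.append("'Not assigned' values remain")
    if df.isna().any().any():
        problems.append("missing values")
    return problems


# Made-up inputs: one clean frame and one with two deliberate defects
clean = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})
dirty = pd.DataFrame({
    "PostalCode": ["M5A", "M5A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

print(check_postal_frame(clean))
print(check_postal_frame(dirty))
```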
