<font size=4><b>Coursera Capstone Project Notebook</b></font>

This notebook will be used for the completion of the Coursera Capstone Project in Python.

In [141]:
import numpy as np
import pandas as pd

The following packages must be installed in order to complete this project:

In [142]:
from geopy.geocoders import Nominatim #for getting location data
from sklearn.cluster import KMeans #for performing kmeans clustering
import folium #for visualizing on a world map
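
Before running the imports above, it can help to confirm that the packages are actually available. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `required` list are illustrative assumptions based on the imports in this notebook (note that scikit-learn is imported under the name `sklearn`).

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Module names taken from the import statements in this notebook.
required = ["numpy", "pandas", "geopy", "sklearn", "folium"]
print("Missing packages:", missing_packages(required))
```

An empty list means every import in the cells above should succeed; anything listed would need to be installed first.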

In [143]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


Load and view our dataset

In [144]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

In [145]:
data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


We need to remove all rows with an unassigned borough from our dataset.

In [146]:
data = data[data['Borough'] != 'Not assigned']

In [147]:
data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
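
The filter above is a boolean mask on the Borough column. As a small sketch on a hand-built sample (the `sample` and `assigned` names are assumptions for illustration; the rows are copied from values shown in this notebook), note that a row such as M7A survives the filter because its borough is assigned even though its neighbourhood is not:

```python
import pandas as pd

# Illustrative sample built from rows that appear in this notebook.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose Borough is assigned; reset_index renumbers the rows from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Resetting the index is optional here; the notebook keeps the original row labels, which is why its filtered table starts at index 2.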


We need to combine rows that share the same postcode.

In [148]:
neighborhoods_in_pc = {}
for postcode in data['Postcode'].unique():
    pc_data = data[data['Postcode'] == postcode]
    neighborhoods_in_pc[postcode] = pc_data['Neighbourhood'].unique()
neighborhoods_in_pc

{'M3A': array(['Parkwoods'], dtype=object),
 'M4A': array(['Victoria Village'], dtype=object),
 'M5A': array(['Harbourfront', 'Regent Park'], dtype=object),
 'M6A': array(['Lawrence Heights', 'Lawrence Manor'], dtype=object),
 'M7A': array(['Not assigned'], dtype=object),
 'M9A': array(['Islington Avenue'], dtype=object),
 'M1B': array(['Rouge', 'Malvern'], dtype=object),
 'M3B': array(['Don Mills North'], dtype=object),
 'M4B': array(['Woodbine Gardens', 'Parkview Hill'], dtype=object),
 'M5B': array(['Ryerson', 'Garden District'], dtype=object),
 'M6B': array(['Glencairn'], dtype=object),
 'M9B': array(['Cloverdale', 'Islington', 'Martin Grove', 'Princess Gardens',
        'West Deane Park'], dtype=object),
 'M1C': array(['Highland Creek', 'Rouge Hill', 'Port Union'], dtype=object),
 'M3C': array(['Flemingdon Park', 'Don Mills South'], dtype=object),
 'M4C': array(['Woodbine Heights'], dtype=object),
 'M5C': array(['St. James Town'], dtype=object),
 'M6C': array(['Humewood-Cedarvale'
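
The same per-postcode grouping can also be written with `groupby` instead of an explicit loop. This is a sketch on a small hand-built sample rather than the full dataset; the names `sample` and `neighbourhoods_by_postcode` are assumptions for illustration.

```python
import pandas as pd

# Illustrative sample built from rows that appear in this notebook.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A", "M6A", "M6A"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor"],
})

# sort=False keeps postcodes in order of first appearance, as the loop does.
# unique() drops repeated neighbourhood names within a postcode.
grouped = sample.groupby("Postcode", sort=False)["Neighbourhood"].agg(
    lambda names: names.unique().tolist()
)
neighbourhoods_by_postcode = grouped.to_dict()
print(neighbourhoods_by_postcode)
```

Unlike the loop, this version stores plain Python lists rather than NumPy arrays, which are simpler to join into strings later.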

Create a new dataframe with one row per postcode, where the Neighbourhood column holds every neighbourhood in that postcode.

In [149]:
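fixed_data = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])
# Convert the dict views to lists so pandas receives plain sequences
fixed_data['Postcode'] = list(neighborhoods_in_pc.keys())
fixed_data['Neighbourhood'] = list(neighborhoods_in_pc.values())
fixed_data.head() # Borough stays empty until it is filled in a later cell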

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,,[Parkwoods]
1,M4A,,[Victoria Village]
2,M5A,,"[Harbourfront, Regent Park]"
3,M6A,,"[Lawrence Heights, Lawrence Manor]"
4,M7A,,[Not assigned]


Now we need to find the borough for each postcode.

In [150]:
boroughs = []
for pc in fixed_data['Postcode']:
    pc_data = data[data['Postcode'] == pc] #only rows for this postcode
    boroughs.append(pc_data['Borough'].unique()) #array of distinct boroughs
fixed_data['Borough'] = boroughs

In [151]:
fixed_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,[North York],[Parkwoods]
1,M4A,[North York],[Victoria Village]
2,M5A,[Downtown Toronto],"[Harbourfront, Regent Park]"
3,M6A,[North York],"[Lawrence Heights, Lawrence Manor]"
4,M7A,[Queen's Park],[Not assigned]


For rows without an assigned neighbourhood, the neighbourhood should be set to the borough name.

In [152]:
for index, row in fixed_data.iterrows():
    if (row['Neighbourhood'][0] == 'Not assigned'):
        # row['Borough'] is a one-element array, so take its first item;
        # the array in 'Neighbourhood' is modified in place
        row['Neighbourhood'][0] = row['Borough'][0]

In [153]:
fixed_data.head(25)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,[North York],[Parkwoods]
1,M4A,[North York],[Victoria Village]
2,M5A,[Downtown Toronto],"[Harbourfront, Regent Park]"
3,M6A,[North York],"[Lawrence Heights, Lawrence Manor]"
4,M7A,[Queen's Park],[Queen's Park]
5,M9A,[Etobicoke],[Islington Avenue]
6,M1B,[Scarborough],"[Rouge, Malvern]"
7,M3B,[North York],[Don Mills North]
8,M4B,[East York],"[Woodbine Gardens, Parkview Hill]"
9,M5B,[Downtown Toronto],"[Ryerson, Garden District]"


Clean up the Borough column by converting each array to a plain string

In [154]:
fixed_boroughs = []
for borough in fixed_data['Borough']:
    b = ','.join(borough)
    fixed_boroughs.append(b)
fixed_data['Borough'] = fixed_boroughs

Clean up the Neighbourhood column by joining each array into a comma-separated string

In [155]:
fixed_neighbourhoods = []
for neighborhood in fixed_data['Neighbourhood']:
    if len(neighborhood) == 1:
        n = neighborhood[0]
    else:
        n = ",".join(str(x) for x in neighborhood)
    fixed_neighbourhoods.append(n)
fixed_data['Neighbourhood'] = fixed_neighbourhoods

In [156]:
fixed_data.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"
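
The cleanup steps above (fall back to the borough name, combine rows per postcode, flatten to strings) can also be expressed compactly with `Series.mask` and `groupby(...).agg`. This is a sketch under assumptions, run on a small hand-built sample rather than the full Wikipedia table; `sample` and `tidy` are illustrative names.

```python
import pandas as pd

# Illustrative sample built from rows that appear in this notebook.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A", "M1B", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park",
                "Scarborough", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Not assigned",
                      "Rouge", "Malvern"],
})

# Where the neighbourhood is unassigned, fall back to the borough name.
sample["Neighbourhood"] = sample["Neighbourhood"].mask(
    sample["Neighbourhood"] == "Not assigned", sample["Borough"]
)

# One row per postcode: first borough, neighbourhoods joined with commas.
tidy = (
    sample.groupby("Postcode", sort=False)
    .agg({"Borough": "first", "Neighbourhood": ",".join})
    .reset_index()
)
print(tidy)
```

Taking the first borough per postcode relies on the assumption that a postcode never spans more than one borough.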


Assumptions: I am assuming that each postcode can contain multiple neighbourhoods and that no neighbourhood is listed twice within a postcode. I am also assuming that each postcode belongs to exactly one borough; a borough may span several postcodes, as North York does in the tables above.
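
The one-borough-per-postcode assumption can be checked directly rather than taken on trust. The sketch below counts distinct boroughs per postcode on a small hand-built sample; `sample`, `boroughs_per_postcode`, and `violations` are illustrative names, and the same check could be pointed at the filtered `data` frame.

```python
import pandas as pd

# Illustrative sample built from rows that appear in this notebook.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto", "Downtown Toronto"],
})

# Count distinct boroughs for each postcode; any count above 1 breaks the assumption.
boroughs_per_postcode = sample.groupby("Postcode")["Borough"].nunique()
violations = boroughs_per_postcode[boroughs_per_postcode > 1]
assert violations.empty, f"Postcodes with several boroughs: {list(violations.index)}"
```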

In [160]:
fixed_data.shape

(103, 3)

103 rows and 3 columns.