# Part I: Segmenting and Clustering Neighborhoods in Toronto

We start by downloading the list of tables on the page and checking that the table at index 0 is the one we need.

In [1]:
import pandas as pd
import numpy as np

link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

df = pd.read_html(link)[0]
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
8,M8A,Not assigned,Not assigned
9,M9A,Downtown Toronto,Queen's Park
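
`pd.read_html` returns a list with every table found on the page, so taking position 0 assumes the page layout stays the same. The sketch below shows one possible way to pick the table by its column names instead. `find_table` is a hypothetical helper (not part of pandas), and the two frames are made-up stand-ins for the downloaded list.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names.

    Hypothetical helper for illustration; it is not a pandas function.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")


# Stand-ins for the list returned by pd.read_html (made-up data).
tables = [
    pd.DataFrame({"Province": ["Ontario"], "Prefix": ["M"]}),
    pd.DataFrame({
        "Postcode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Victoria Village"],
    }),
]

postcodes = find_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(postcodes.shape)
```

`pd.read_html` also accepts a `match` argument that keeps only tables containing a given text, which may be enough when a distinctive header such as `Postcode` is known in advance.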


Next, the column *Postcode* is renamed to *PostalCode*, as requested.

In [2]:
df.rename(columns={'Postcode':'PostalCode'}, inplace=True)
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
8,M8A,Not assigned,Not assigned
9,M9A,Downtown Toronto,Queen's Park


Then we group the rows by PostalCode (and Borough) and join all Neighbourhoods that share a PostalCode into a single, comma-separated row.

In [3]:
df = df.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
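
To see what the group-and-join step does, here is a small sketch on three rows copied from the preview above (the `toy` and `grouped` names are only for illustration). M6A appears twice, so its two neighbourhoods should collapse into one row.

```python
import pandas as pd

# Three rows taken from the preview: M6A is listed once per neighbourhood.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
})

# One row per (PostalCode, Borough); neighbourhood names joined with ", ".
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(grouped)
```

Because `groupby` sorts by the group keys by default, the result is also ordered by PostalCode, which is why the row positions differ from those in the original table.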

We check which rows have a Borough name but no Neighbourhood, so that the Borough name can stand in for the Neighbourhood. Only Queen's Park qualifies.

In [4]:
df2=df[df['Neighbourhood']=='Not assigned']
df2[df2['Borough']!='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood
120,M7A,Queen's Park,Not assigned


Then we replace those 'Not assigned' Neighbourhood values with the Borough name from the same row.

In [5]:
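# Row-wise fallback; assigning back avoids a chained inplace call that newer pandas versions warn about
df['Neighbourhood'] = df['Neighbourhood'].mask(df['Neighbourhood'] == 'Not assigned', df['Borough'])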
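
The fallback is row-wise: a missing Neighbourhood takes the Borough of its own row, so rows where both are 'Not assigned' stay unchanged. A minimal sketch on made-up rows, written with a boolean mask and `.loc` as one possible formulation:

```python
import pandas as pd

# Made-up rows covering the three cases: borough only, neither, both.
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M8A", "M3A"],
    "Borough": ["Queen's Park", "Not assigned", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

# Where the neighbourhood is missing, copy the borough from the same row.
missing = toy["Neighbourhood"] == "Not assigned"
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]
print(toy)
```

Only the M7A row gains a usable name; the M8A row still reads 'Not assigned' in both columns and is left for the later clean-up step.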

And we check that the row with index 120 has been changed properly.

In [6]:
df.iloc[120]

PostalCode                M7A
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 120, dtype: object

After that, we can drop the rows that still have no Neighbourhood, since they carry no useful data, and reset the index.

In [7]:
df.drop(df.loc[df['Neighbourhood']=='Not assigned'].index, inplace=True)
df.reset_index(inplace=True, drop=True)
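
The same clean-up can also be written as a boolean filter that returns a new frame instead of modifying the original in place. A short sketch on made-up rows (the `toy` and `cleaned` names are illustrative):

```python
import pandas as pd

# Made-up rows: two postal codes without any assignment, two with one.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M8A", "M4A"],
    "Borough": ["Not assigned", "North York", "Not assigned", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Victoria Village"],
})

# Keep only rows with a real neighbourhood, then renumber the index from 0.
cleaned = toy[toy["Neighbourhood"] != "Not assigned"].reset_index(drop=True)
print(cleaned)
```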

Our DataFrame now has this shape:

In [8]:
df.shape

(103, 3)

## Using Geocode

We first download the CSV file and import it into the *df_geodata* DataFrame.

In [9]:
df_geodata=pd.read_csv("https://cocl.us/Geospatial_data")

Then we check its format.

In [10]:
df_geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We need to rename the Postal Code column to match our naming

In [11]:
df_geodata.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

Now we can simply merge both DataFrames into one, on PostalCode.

In [12]:
df = pd.merge(df, df_geodata, on='PostalCode', how='outer')

In [13]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


And we check that we have 2 more columns and no extra rows:

In [14]:
df.shape

(103, 5)
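
Comparing shapes confirms the row count, but an outer merge can also be asked to report unmatched keys directly. The sketch below uses the `validate` and `indicator` options of `pd.merge` on two rows copied from the output above. It is an illustrative check, not part of the original notebook, and the `areas` and `coords` names are made up.

```python
import pandas as pd

# Two postal codes from the merged output, split back into two frames.
areas = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from.
merged = pd.merge(areas, coords, on="PostalCode", how="outer",
                  validate="one_to_one", indicator=True)

# Rows not marked "both" exist in only one of the two inputs.
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

An empty `unmatched` frame means every postal code was found on both sides, so the outer merge added no rows and left no coordinates missing.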