Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes and load it into a pandas dataframe.

In [24]:
import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

The postal codes are in the first table returned by read_html.

In [25]:
postal_codes = tables[0]
postal_codes.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Drop the rows whose Borough is 'Not assigned' and rename the 'Postal code' column to 'PostalCode'.

In [26]:
postal_codes = postal_codes[postal_codes['Borough'] != 'Not assigned'].reset_index(drop=True)

postal_codes = postal_codes.rename(columns={'Postal code': 'PostalCode'})

postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
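
The filter and rename can also be chained into one expression. The following is a minimal sketch on a toy frame. The column names match the scraped table, but the three rows and the names `df` and `cleaned` are illustrative only.

```python
import pandas as pd

# Toy frame with the same column names as the scraped table (rows are illustrative).
df = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

cleaned = (
    df[df["Borough"] != "Not assigned"]             # keep only assigned boroughs
    .rename(columns={"Postal code": "PostalCode"})  # drop the space from the name
    .reset_index(drop=True)                         # renumber rows from 0
)
print(cleaned)
```

Resetting the index last keeps the row labels contiguous after the filter, which matters later when rows are addressed by index.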


To check for duplicate postal codes, count the unique values in each column.

In [27]:
postal_codes.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,103,103,103
unique,103,10,98
top,M5T,North York,Downsview
freq,1,24,4
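
Comparing `count` with `unique` works, but the same question can be asked more directly. This is a small sketch on a made-up series (the duplicated `M4A` is deliberate and does not come from the real data); `is_unique` and `duplicated` are standard pandas methods.

```python
import pandas as pd

# Illustrative values only: "M4A" is repeated on purpose.
codes = pd.Series(["M3A", "M4A", "M5A", "M4A"], name="PostalCode")

has_duplicates = not codes.is_unique
# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = sorted(set(codes[codes.duplicated(keep=False)]))

print(has_duplicates, repeated)
```

On the real dataframe the equivalent check would be `postal_codes['PostalCode'].is_unique`.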


There are no duplicate postal codes: all 103 values are unique, so neighborhoods are already grouped by postal code.
Next, check whether the Neighborhood column contains any NaN or 'Not assigned' values.

In [28]:
postal_codes['Neighborhood'].isna().sum()

0

In [29]:
(postal_codes['Neighborhood']=='Not assigned').sum()

0
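
Both counts are zero here, so nothing needs fixing. If they were not, one possible policy (an assumption, not something this notebook does) would be to fall back to the borough name. A sketch on illustrative rows:

```python
import pandas as pd

# Illustrative rows: two neighborhoods are missing in different ways.
df = pd.DataFrame({
    "PostalCode": ["M3A", "M7A", "M9A"],
    "Borough": ["North York", "Queen's Park", "Etobicoke"],
    "Neighborhood": ["Parkwoods", "Not assigned", None],
})

# Treat both NaN and the literal string 'Not assigned' as missing.
missing = df["Neighborhood"].isna() | (df["Neighborhood"] == "Not assigned")

# Assumed fallback policy: reuse the borough name as the neighborhood.
df.loc[missing, "Neighborhood"] = df.loc[missing, "Borough"]
print(df)
```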

Postal codes that cover several neighborhoods list them separated by ' / '. Replace each separator with a comma.

In [30]:
postal_codes['Neighborhood'] = postal_codes['Neighborhood'].str.replace(' /', ',', regex=False)
postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
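
If the spacing around the slash is not guaranteed to be consistent, a regular expression that absorbs optional whitespace is a little more robust. The sketch below uses a made-up series (`Malvern/Rouge` without spaces is an invented edge case) and then verifies that no slash remains.

```python
import pandas as pd

# Illustrative values; the last entry has no spaces around the slash.
s = pd.Series(["Parkwoods", "Regent Park / Harbourfront", "Malvern/Rouge"])

# \s*/\s* matches a slash together with any surrounding whitespace.
joined = s.str.replace(r"\s*/\s*", ", ", regex=True)

# Sanity check: no separator should survive the replacement.
any_slash_left = joined.str.contains("/", regex=False).any()
print(joined.tolist(), any_slash_left)
```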


In [31]:
postal_codes.shape

(103, 3)

The first part of the project is complete.
Let's add coordinates to the dataframe.

In [35]:
postal_codes['Latitude'] = 0.000000
postal_codes['Longitude'] = 0.000000

postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,0.0,0.0
1,M4A,North York,Victoria Village,0.0,0.0
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",0.0,0.0
3,M6A,North York,"Lawrence Manor, Lawrence Heights",0.0,0.0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",0.0,0.0
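
A placeholder of 0.0 is itself a valid coordinate, so a lookup that never ran cannot be told apart from a real value. One alternative, sketched here as a design option rather than what the notebook does, is to initialise with NaN and list the unresolved rows afterwards. The frame and the coordinate values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"PostalCode": ["M3A", "M4A"]})

# NaN marks "not looked up yet" unambiguously.
df["Latitude"] = np.nan
df["Longitude"] = np.nan

# Pretend only the first lookup succeeded (made-up coordinates).
df.at[0, "Latitude"] = 43.75
df.at[0, "Longitude"] = -79.33

unresolved = df[df["Latitude"].isna() | df["Longitude"].isna()]
print(unresolved["PostalCode"].tolist())
```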


Use the geocoder package to look up the coordinates of each postal code and store them in the dataframe.

In [36]:
import geocoder

# Query the ArcGIS provider once per postal code and store the result.
for index, row in postal_codes.iterrows():
    postal_code = row['PostalCode']
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
    # g.lat and g.lng may be None if the lookup fails
    postal_codes.at[index, 'Latitude'] = g.lat
    postal_codes.at[index, 'Longitude'] = g.lng
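
Remote geocoding services can fail intermittently, so a retry wrapper may be worth having. The sketch below is hypothetical: `geocode_with_retry` and `fake_lookup` are names invented for illustration, the coordinates are made up, and the wrapper assumes only that the lookup result exposes `.lat` and `.lng` (as used in the loop above) and that they are None on failure. The stand-in lookup lets the example run offline.

```python
import time
from types import SimpleNamespace


def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it yields coordinates or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result.lat is not None and result.lng is not None:
            return result.lat, result.lng
        time.sleep(delay_seconds)  # pause before the next attempt
    return None, None  # every attempt failed


# Offline stand-in for a real provider: fails once, then succeeds.
calls = {"n": 0}


def fake_lookup(query):
    calls["n"] += 1
    if calls["n"] == 1:
        return SimpleNamespace(lat=None, lng=None)
    return SimpleNamespace(lat=43.75, lng=-79.33)  # made-up coordinates


lat, lng = geocode_with_retry(fake_lookup, "M3A, Toronto, Ontario")
print(lat, lng, calls["n"])
```

In the notebook, `geocoder.arcgis` could be passed as `lookup` in place of `fake_lookup`, with a non-zero `delay_seconds` to avoid hammering the service.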

In [37]:
postal_codes.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939
5,M9A,Etobicoke,Islington Avenue,43.667481,-79.528953
6,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913
7,M3B,North York,Don Mills,43.7489,-79.35722
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.707193,-79.311529
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529
