# Q1: Cleaning the data

In [1]:
import numpy as np
import pandas as pd
import requests

Read the table from Wikipedia using the `read_html` function in `pandas`. This assumes the first table on the page (index `0`) is always the table of postal codes:

In [2]:
page_tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", header=0)
postal_code_table = page_tables[0]
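The index-`0` assumption can be made less fragile by choosing the table by its column names instead. The following is a minimal sketch, not what this notebook does: `pick_table` is a hypothetical helper, and `example_tables` is made-up stand-in data for the list that `pd.read_html` returns.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table has columns {sorted(required)}")

# Made-up miniature stand-in for the list returned by pd.read_html.
example_tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({"Postal code": ["M1A", "M3A"],
                  "Borough": ["Not assigned", "North York"],
                  "Neighborhood": ["Not assigned", "Parkwoods"]}),
]
picked = pick_table(example_tables, ["Postal code", "Borough", "Neighborhood"])
```

With the real page, `pick_table(page_tables, [...])` would raise a `ValueError` if the layout changed, rather than silently returning the wrong table.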

Rename the `Postal code` column to `PostalCode`:

In [3]:
postal_code_table.rename(columns={'Postal code': 'PostalCode'}, inplace=True)

Remove postal codes that are not assigned to a borough:

In [4]:
postal_code_table = postal_code_table[postal_code_table['Borough'] != "Not assigned"].reset_index(drop=True)

Set the neighborhood name to the borough name if the neighborhood name is not assigned:

In [5]:
postal_code_table.loc[postal_code_table['Neighborhood'] == "Not assigned", ["Neighborhood"]] = postal_code_table["Borough"]
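To see the fill in isolation, here is a small sketch on made-up rows (`toy`, `M9Z`, and `Example Borough` are illustrative, not from the Wikipedia table). It uses a named boolean mask on both sides of the assignment, which is one equivalent way to write the same step.

```python
import pandas as pd

# Illustrative rows only; the second row has no neighborhood name.
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M9Z"],
    "Borough": ["North York", "Example Borough"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Rows whose neighborhood is missing take the borough name instead.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]
```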

Duplicate postal codes are already combined into a single row on the wiki, but the neighborhood names are separated by a slash (`/`) rather than a comma. We just replace the slashes with commas here:

In [6]:
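# regex=False treats " / " as a literal string, independent of the pandas version's default
postal_code_table['Neighborhood'] = postal_code_table["Neighborhood"].str.replace(" / ", ", ", regex=False)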
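As a quick illustration of the replacement on its own, the sketch below applies it to a two-element series. The slash-separated input is an assumption about how the wiki formats the cell; passing `regex=False` makes the pattern a plain string match.

```python
import pandas as pd

# Assumed raw form of a combined cell, plus a cell with a single name.
names = pd.Series(["Regent Park / Harbourfront", "Parkwoods"])

# Literal replacement of " / " with ", "; single names are left unchanged.
cleaned = names.str.replace(" / ", ", ", regex=False)
```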

In [7]:
postal_code_table

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [8]:
postal_code_table.shape

(103, 3)

# Q2: Geocoding the boroughs

Using `geocoder` doesn't work reliably: it returns `None` for many postal codes, so this attempt is left commented out:

In [9]:
#import geocoder

#def geocode_func(postal_code):
#    g = geocoder.osm('{}, Toronto, Ontario'.format(postal_code))
#    return g.latlng

#postal_code_table["Latitude"] = 0
#postal_code_table["Longitude"] = 0
#postal_code_table = postal_code_table.astype({'Latitude': 'float64'})
#postal_code_table = postal_code_table.astype({'Longitude': 'float64'})
#postal_code_table[["Latitude", "Longitude"]] = geocode_func(postal_code_table["PostalCode"])
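One possible way to cope with intermittent `None` results is to look up each postal code individually and retry a few times. This is only a sketch under assumptions: `latlng_with_retry` is a hypothetical helper, and `fake_lookup` stands in for a real call such as `geocoder.osm(query).latlng` so the example runs without network access. Retrying may not resolve the failures seen above.

```python
import pandas as pd

def latlng_with_retry(lookup, query, attempts=3):
    """Call lookup(query) up to `attempts` times; return (lat, lng) or (None, None)."""
    for _ in range(attempts):
        result = lookup(query)
        if result is not None:
            return tuple(result)
    return (None, None)

# Hypothetical stand-in for a geocoder call: fails once, then answers.
calls = {"count": 0}
def fake_lookup(query):
    calls["count"] += 1
    return None if calls["count"] == 1 else [43.75, -79.33]

codes = pd.DataFrame({"PostalCode": ["M3A"]})

# Look up one postal code per row instead of passing the whole column at once.
coords = codes["PostalCode"].apply(
    lambda code: latlng_with_retry(fake_lookup, f"{code}, Toronto, Ontario")
)
codes["Latitude"] = [pair[0] for pair in coords]
codes["Longitude"] = [pair[1] for pair in coords]
```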

Instead, use the given geospatial data `.csv` file, reading it with `pandas.read_csv`:

In [10]:
geocode_table = pd.read_csv("https://cocl.us/Geospatial_data")
geocode_table.set_index("Postal Code", drop=True, inplace=True)
geocode_table

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
...,...,...
M9N,43.706876,-79.518188
M9P,43.696319,-79.532242
M9R,43.688905,-79.554724
M9V,43.739416,-79.588437


Join the postal code location data to our boroughs table on the `PostalCode` column:

In [11]:
postal_code_table = postal_code_table.join(geocode_table, on="PostalCode")
postal_code_table

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
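Because `join` keeps every row of the left table, a postal code with no match in the location data would end up with `NaN` coordinates rather than raising an error. The sketch below shows one way to check for that on made-up data (`M0X` and `Example Borough` are illustrative; the `M3A` coordinates are taken from the output above).

```python
import pandas as pd

codes = pd.DataFrame({"PostalCode": ["M3A", "M0X"],
                      "Borough": ["North York", "Example Borough"]})
coords = pd.DataFrame({"Postal Code": ["M3A"],
                       "Latitude": [43.753259],
                       "Longitude": [-79.329656]}).set_index("Postal Code")

# Left join: values of the PostalCode column are matched against the index of coords.
joined = codes.join(coords, on="PostalCode")

# Postal codes that found no coordinates in the location table.
unmatched = joined.loc[joined["Latitude"].isna(), "PostalCode"].tolist()
```

An empty `unmatched` list on the real tables would indicate that every postal code received a latitude and longitude.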
