## Scraping Toronto Postcodes from Wikipedia using BeautifulSoup

In [18]:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
import xml

Request the Wikipedia page with `urllib`, then cache the HTML locally so later cells can run without downloading it again.

In [19]:
url = 'http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
req = urllib.request.urlopen(url)
article = req.read().decode()
with open('postcode.html', 'w', encoding='utf-8') as fo:
    fo.write(article)
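
The caching step can be made conditional so that the download only happens when no local copy exists. The following is a minimal sketch under stated assumptions: `fetch_cached` and its `opener` parameter are hypothetical names introduced here for illustration (the `opener` argument only exists so the function can be exercised without network access), not part of the original notebook.

```python
import os
import urllib.request


def fetch_cached(url, path, opener=urllib.request.urlopen):
    """Return the page text, downloading it only if `path` does not exist yet.

    `opener` is a hypothetical hook; by default it is urllib's urlopen.
    """
    if not os.path.exists(path):
        # First run: download the page and store a local copy.
        with opener(url) as resp:
            text = resp.read().decode('utf-8')
        with open(path, 'w', encoding='utf-8') as fo:
            fo.write(text)
    # Always read from the cached copy so later runs skip the network.
    with open(path, encoding='utf-8') as fi:
        return fi.read()
```

With this helper, `article = fetch_cached(url, 'postcode.html')` would replace both the download and the later read of the local file.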

Read the cached file, then find the sortable tables in the HTML.

In [20]:
with open('postcode.html', encoding='utf-8') as fi:  # close the file after reading
    article = fi.read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')
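
To illustrate how the `class_` filter behaves, here is a small sketch on a made-up HTML snippet (the snippet is an assumption standing in for the cached Wikipedia page). `class_='sortable'` matches any table whose class list contains `sortable`, even when other classes are present.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table page, standing in for the cached Wikipedia HTML.
html = """
<table class="wikitable sortable"><tr><th>Postal Code</th></tr><tr><td>M1A</td></tr></table>
<table class="navbox"><tr><td>navigation</td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')
# Only the first table carries the 'sortable' class, so only it is returned.
tables = soup.find_all('table', class_='sortable')
first_cell = tables[0].find('td').get_text(strip=True)
```

`find_all` returns a list-like `ResultSet`, so an individual table is reached by index, as in `tables[0]`.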

Convert the matched tables to an HTML string and let pandas parse it. `read_html` returns a list of dataframes, so take the first.

In [21]:
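from io import StringIO  # newer pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(tables[0])))[0]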

Remove rows whose Borough is 'Not assigned'.

In [22]:
df = df[df.Borough != 'Not assigned']
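
The filter works through a boolean mask. A small sketch on made-up rows (the sample values are assumptions in the same shape as the scraped table) shows the effect, along with an optional `reset_index` for renumbering rows afterwards, which the notebook itself does not do.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table.
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

# Boolean mask: True for the rows to keep.
kept = sample[sample['Borough'] != 'Not assigned']
# Optionally renumber the remaining rows from 0.
kept = kept.reset_index(drop=True)
```

Without the `reset_index` call, the surviving rows keep their original index labels, which is why the filtered dataframe's index need not start at 0.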

In [23]:
df.shape

(103, 3)

In [24]:
df.head()

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Attempted to use Geocoder, but it seemed to get stuck in an infinite loop, so I used the provided CSV of coordinates instead.
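
If the hang came from a loop that keeps retrying until a lookup returns a result (an assumption, since the Geocoder code is not shown here), capping the number of attempts is one way to avoid waiting forever. This is only a sketch: `lookup_with_retries` and the `lookup` callable are hypothetical names, and no real geocoding API is called.

```python
def lookup_with_retries(lookup, postal_code, max_attempts=5):
    """Call `lookup(postal_code)` until it returns a value or attempts run out.

    `lookup` is a hypothetical callable returning coordinates or None.
    """
    for _ in range(max_attempts):
        result = lookup(postal_code)
        if result is not None:
            return result
    # Give up instead of looping forever; the caller can fall back to the CSV.
    return None
```

A `None` return then signals that the fallback data source should be used for that postal code.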

In [43]:
postcode_df = pd.read_csv("https://cocl.us/Geospatial_data")

In [44]:
postcode_df.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Merge the coordinates dataframe with the scraped dataframe on 'Postal Code'. The default inner join keeps only postal codes present in both.

In [48]:
df = postcode_df.merge(df, on='Postal Code')
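
Because an inner join silently drops postal codes that appear in only one dataframe, it can be worth checking for unmatched rows. The sketch below uses made-up miniature dataframes (`coords`, `areas`, and the `M9Z` row are assumptions for illustration) and pandas' `indicator=True` option, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes.
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],
    'Latitude': [43.806686, 43.784535, 0.0],
    'Longitude': [-79.194353, -79.160497, 0.0],
})
areas = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern, Rouge', 'Rouge Hill, Port Union, Highland Creek'],
})

# Default merge is an inner join: only codes present in both frames survive.
inner = coords.merge(areas, on='Postal Code')

# An outer join with indicator=True labels each row's origin in '_merge'.
check = coords.merge(areas, on='Postal Code', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']['Postal Code'].tolist()
```

An empty `unmatched` list on the real data would confirm that every postal code found a partner, consistent with the row count staying the same before and after the merge.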

In [59]:
df.head()

|   | Postal Code | Latitude | Longitude | Borough | Neighborhood |
|---|---|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 | Scarborough | Malvern, Rouge |
| 1 | M1C | 43.784535 | -79.160497 | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | 43.763573 | -79.188711 | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | 43.770992 | -79.216917 | Scarborough | Woburn |
| 4 | M1H | 43.773136 | -79.239476 | Scarborough | Cedarbrae |


In [60]:
df.shape

(103, 5)