In [4]:
!conda install -y html5lib


In [5]:
import lxml
import html5lib
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Scrape the Data Table
This section scrapes the postal code table from Wikipedia.
Beautiful Soup extracts the table markup, which is then passed to pandas to create the dataframe.

In [13]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
request = requests.get(url)
soup = BeautifulSoup(request.content, 'html.parser')
tables = soup.find_all('table')
# read_html returns a list of DataFrames, so the postcode table is df[0]
df = pd.read_html(str(tables[0]), header=0)


In [14]:
df[0].head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## Clean the Data
- Ignore rows whose borough is "Not assigned".
- This is done by building a boolean mask and keeping only the rows where the mask is False.

In [15]:
na_rows = df[0].Borough == "Not assigned"
# .copy() gives an independent frame, so later assignments do not target a view of df[0]
f0 = df[0][~na_rows].copy()


- If a neighbourhood is "Not assigned", use the borough name instead.


In [16]:
pd.set_option('mode.chained_assignment', None)
nb_rows = f0.Neighbourhood == "Not assigned"
f0.loc[nb_rows, 'Neighbourhood'] = f0.loc[nb_rows, 'Borough']
pd.set_option('mode.chained_assignment', 'warn')
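
Both cleaning steps can also be written without toggling the warning option. The following is a minimal sketch on a small made-up frame, not the scraped table; the names `raw`, `cleaned`, and `unassigned` are hypothetical, and `Series.mask` is swapped in for the `.loc` assignment used above.

```python
import pandas as pd

# Toy stand-in for the scraped table (values are illustrative, not the real data).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Drop rows with no borough; .copy() makes the result an independent frame.
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is unassigned, substitute the borough name.
unassigned = cleaned["Neighbourhood"] == "Not assigned"
cleaned["Neighbourhood"] = cleaned["Neighbourhood"].mask(unassigned, cleaned["Borough"])

print(cleaned)
```

Because `cleaned` is an explicit copy, assigning to one of its columns should not trigger the chained-assignment warning.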

- Combine rows that share a postcode into a single row, with the neighbourhoods comma-separated.

First we use the groupby function to group rows with the same postcode.
We create an empty dataframe with the same columns, ready to receive the grouped data.

In [17]:
grouped = f0.groupby('Postcode')
ndf = pd.DataFrame(columns=f0.columns)


We loop over each group, extracting the postcode, the Borough, and joining the Neighbourhood names.
Then we append to our new dataframe.


In [18]:
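for postcode, group in grouped:
    # Build one row per postcode: first borough, neighbourhoods joined by commas
    g = {
        'Postcode': postcode,
        'Borough': group.Borough.iloc[0],
        'Neighbourhood': ",".join(group['Neighbourhood'].tolist()),
    }
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    ndf = pd.concat([ndf, pd.DataFrame([g])], ignore_index=True)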


Finally, we print the shape of the new dataframe.

In [19]:
print(ndf.shape)


(103, 3)
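
The grouping loop in the cells above can also be expressed as a single aggregation. This is a sketch on a small made-up frame, assuming the same three column names; `toy` and `combined` are hypothetical names, and this is an alternative formulation rather than the method the notebook uses.

```python
import pandas as pd

# Toy rows (illustrative only): two neighbourhoods share the postcode M5A.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per postcode: keep the first borough and join the neighbourhood names.
combined = (
    toy.groupby("Postcode", as_index=False)
       .agg({"Borough": "first", "Neighbourhood": ",".join})
)

print(combined)
```

Like the loop, this keeps the first borough seen for each postcode, and `groupby` sorts the result by postcode by default.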
