## 1. Import Libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np

## 2. Read wikipedia page

In [2]:
page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
page

<Response [200]>

## 3. Extract and process table

1. Extract the table using Beautiful Soup. The `tbody` element of the first table has 578 children: the table rows, interleaved with newline strings.

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')
tables = soup.find_all('table')

In [4]:
len(tables[0].tbody.contents)

578
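The count of 578 is larger than the number of table rows because `contents` also includes the newline strings between the `tr` tags. The sketch below shows the same effect on a hypothetical three-row table (the `html` string is made up for illustration and is not taken from the Wikipedia page).

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the Wikipedia page.
html = """<table><tbody><tr><th>Postcode</th><th>Borough</th></tr>
<tr><td>M1A</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td></tr>
</tbody></table>"""

soup_small = BeautifulSoup(html, 'html.parser')
tbody = soup_small.find('table').tbody

# .contents holds every direct child: row tags AND the newline strings.
n_children = len(tbody.contents)
# find_all('tr') returns only the row tags.
n_rows = len(tbody.find_all('tr'))

print(n_children, n_rows)
```

Here each of the three rows is followed by one newline string, so the child count is twice the row count.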

2. Import pandas

In [5]:
import pandas as pd

3. Extract the information from the table. For each child of `tbody` that is not a bare newline, collect the stripped text of its non-newline children into a list. The first list is the header row; the remaining lists are converted into a pandas DataFrame, which has 288 rows, as the shape below confirms.

In [6]:
all_zipcds = []

In [7]:
for i in tables[0].tbody.children:
    if i != '\n':
        #print(i.contents)
        zipcds = []
        for j in i.children:
            if j!='\n':
                #print(j.text)
                zipcds.append(j.text.strip())
        all_zipcds.append(zipcds)

In [8]:
df_res = pd.DataFrame(all_zipcds[1:], columns=['Postcode', 'Borough', 'Neighbourhood'])
df_res.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [9]:
df_res.shape

(288, 3)
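An alternative to filtering out newline strings by hand is to ask Beautiful Soup for the row and cell tags directly. The following is a minimal sketch on a hypothetical two-row table (the `html` string and the names `rows`, `header`, `body` and `df_small` are illustrative assumptions, not part of the notebook).

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the Wikipedia page.
html = """<table><tbody><tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody></table>"""

table = BeautifulSoup(html, 'html.parser').find('table')

# One list of stripped cell texts per <tr>; header cells are <th>, data cells <td>.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in table.find_all('tr')
]

# The first row is the header, the rest are data rows.
header, body = rows[0], rows[1:]
df_small = pd.DataFrame(body, columns=header)
print(df_small)
```

Because `find_all` returns only tags, no comparison against `'\n'` is needed.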

4. Drop the rows whose borough is not assigned. This removes 77 rows and retains 211.

In [10]:
df_res_dropped = df_res[df_res['Borough'] != 'Not assigned']

In [11]:
df_res_dropped.shape

(211, 3)
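The number of dropped rows can also be read off the boolean mask itself rather than by comparing shapes. A small sketch on a hypothetical three-row frame (`df_toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data in the same layout as df_res.
df_toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# True for rows to keep, False for rows to drop.
assigned = df_toy['Borough'] != 'Not assigned'

# .copy() makes the filtered frame independent of df_toy.
df_kept = df_toy[assigned].copy()
n_dropped = int((~assigned).sum())

print(n_dropped, df_kept.shape)
```

Note that the kept rows retain their original index labels, just as in the notebook's output.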

5. Set the neighbourhood to the borough name wherever the neighbourhood is not assigned. Only one row meets this criterion: postcode M7A, whose borough and neighbourhood both read Queen's Park afterwards. The frame is copied first so that the assignment does not act on a slice of `df_res`, which would trigger pandas' `SettingWithCopyWarning`.

In [12]:
df_res_dropped = df_res_dropped.copy()  # independent copy, so the assignment below does not target a slice of df_res
df_res_dropped.loc[df_res_dropped['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df_res_dropped.loc[df_res_dropped['Neighbourhood'] == 'Not assigned', 'Borough']


In [13]:
df_res_dropped[df_res_dropped['Postcode'] == 'M7A']

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 8 | M7A | Queen's Park | Queen's Park |


In [14]:
df_res_dropped.shape

(211, 3)
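The same replacement can be written with `Series.mask`, which swaps in values from another Series wherever a condition holds. This is a sketch of an alternative, not the notebook's method, and `df_toy` is hypothetical data:

```python
import pandas as pd

# Hypothetical toy data in the same layout as df_res_dropped.
df_toy = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Where the neighbourhood is 'Not assigned', take the borough instead.
missing = df_toy['Neighbourhood'] == 'Not assigned'
df_toy['Neighbourhood'] = df_toy['Neighbourhood'].mask(missing, df_toy['Borough'])

print(df_toy)
```

Computing the mask once and reusing it avoids repeating the comparison on both sides of the assignment.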

6. Use groupby to join the neighbourhoods that share the same postcode and borough into a single comma-separated string.

In [15]:
df_res_grouped = pd.DataFrame(df_res_dropped.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(lambda x: ', '.join(x)))

In [16]:
df_res_grouped.reset_index(inplace=True)

Look at the combined neighbourhoods, sampling every third row:

In [17]:
df_res_grouped.loc[1:103:3, :]

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 4 | M1H | Scarborough | Cedarbrae |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 10 | M1P | Scarborough | Dorset Park, Scarborough Town Centre, Wexford ... |
| 13 | M1T | Scarborough | Clarks Corners, Sullivan, Tam O'Shanter |
| 16 | M1X | Scarborough | Upper Rouge |
| 19 | M2K | North York | Bayview Village |
| 22 | M2N | North York | Willowdale South |
| 25 | M3A | North York | Parkwoods |
| 28 | M3H | North York | Bathurst Manor, Downsview North, Wilson Heights |


In [18]:
df_res_grouped.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')
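The grouping step can be checked on a small example. The sketch below uses `agg` with `', '.join` followed by `reset_index()`, which should be equivalent to the `apply` and `reset_index(inplace=True)` cells above; `df_toy` is hypothetical data, with neighbourhood names borrowed from the output shown earlier.

```python
import pandas as pd

# Hypothetical toy data: three neighbourhoods share postcode M1C.
df_toy = pd.DataFrame({
    'Postcode': ['M1C', 'M1C', 'M1C', 'M1H'],
    'Borough': ['Scarborough'] * 4,
    'Neighbourhood': ['Highland Creek', 'Rouge Hill', 'Port Union', 'Cedarbrae'],
})

# Join the neighbourhood names within each (Postcode, Borough) group,
# then turn the group keys back into ordinary columns.
df_joined = (
    df_toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)

print(df_joined)
```

Rows within each group keep their original order, so the joined string lists the neighbourhoods in the order they appeared in the table.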

## 4. Save to pickled file

In [19]:
df_res_grouped.to_pickle('toronto.pkl')
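A pickled DataFrame can be read back with `pd.read_pickle`. The sketch below round-trips a hypothetical one-row frame through a temporary directory (the file name `toy.pkl` and `df_toy` are illustrative assumptions) so that it does not touch `toronto.pkl`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame in the same layout as df_res_grouped.
df_toy = pd.DataFrame({
    'Postcode': ['M1C'],
    'Borough': ['Scarborough'],
    'Neighbourhood': ['Highland Creek, Rouge Hill, Port Union'],
})

# Write to a temporary location, then load the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    df_toy.to_pickle(path)
    df_loaded = pd.read_pickle(path)

# equals() compares values, index, columns and dtypes.
round_trip_ok = df_loaded.equals(df_toy)
print(round_trip_ok)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.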

#### The final dataframe has 103 rows.

In [20]:
df_res_grouped.shape

(103, 3)