# Parsing Data: Toronto Neighbourhoods

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests

The Python package Beautiful Soup is used to parse the scraped page. In the following code, the table rows are loaded into a pandas data frame and then transformed into the required format.

In [91]:
# Download the Wikipedia page and parse its HTML
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url).text
soup = BeautifulSoup(page, 'lxml')

data = []
my_table = soup.find('table', {'class':'wikitable sortable'})
table_body = my_table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
my_df = pd.DataFrame(data, columns = ['PostalCode', 'Borough', 'Neighbourhood'])

# Drop the first row, which is empty because the header row holds <th> cells rather than <td> cells
my_df = my_df.iloc[1:]

# See the head of the data frame
my_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
10,M8A,Not assigned,Not assigned
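
As a side note, the empty first row comes from the table's header row, which contains `<th>` cells and therefore no `<td>` cells. A minimal sketch of an alternative, assuming a tiny hand-written table (the HTML below is hypothetical, not the real page) and the standard-library `html.parser` instead of `lxml`, skips such rows while collecting the data:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the Wikipedia table, for illustration only
html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})

data = []
for row in table.find_all('tr'):
    cells = [cell.text.strip() for cell in row.find_all('td')]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        data.append(cells)

toy_df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
```

With this variant the index starts at 0 and no row has to be dropped afterwards, so the row labels would differ from the preview shown in this notebook.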


The preview of the scraped data frame shows rows where no borough is assigned. The following code drops those rows, replaces each unassigned neighbourhood with the name of its borough, and combines the neighbourhoods that share a postal code into a single comma-separated row.

In [92]:
# Get the index labels of the rows where no borough is assigned
indices = my_df[my_df['Borough'] == 'Not assigned'].index
# Delete these rows from the data frame
my_df.drop(indices, inplace=True)
# Where the neighbourhood is not assigned, use the borough name instead
# (a single .loc call avoids chained assignment, which may fail to update my_df)
unassigned = my_df['Neighbourhood'] == 'Not assigned'
my_df.loc[unassigned, 'Neighbourhood'] = my_df.loc[unassigned, 'Borough']
# Collect the values for each postal code into lists and transform the data frame to the desired format
my_df = my_df.groupby('PostalCode').agg(pd.Series.tolist)
my_df['Borough'] = my_df['Borough'].str[0]
my_df['Neighbourhood'] = my_df['Neighbourhood'].apply(lambda x: ', '.join(map(str, x)))
my_df.reset_index(level=['PostalCode'], inplace=True)
# Double check the result
my_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
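
The three cleaning steps can also be tried on a small hand-made frame. The sketch below uses hypothetical rows that mirror the scraped preview, and it swaps the list-based aggregation for pandas named aggregation, which is one possible alternative rather than the approach used in this notebook:

```python
import pandas as pd

# Hypothetical rows mirroring the scraped preview, for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows where no borough is assigned
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. Where the neighbourhood is not assigned, fall back to the borough name
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

# 3. Keep one row per postal code, joining its neighbourhoods with commas
result = (
    toy.groupby('PostalCode', as_index=False)
       .agg(Borough=('Borough', 'first'),
            Neighbourhood=('Neighbourhood', ', '.join))
)
print(result)
```

Here `M5A` ends up with the combined value `Harbourfront, Regent Park`, and `M7A` takes `Queen's Park` as its neighbourhood because none was assigned.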


The shape of the final data frame is the following:

In [93]:
my_df.shape

(103, 3)
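
The shape alone does not confirm that the cleaning worked as intended. A minimal sketch of a few extra checks follows; `check_clean` is a hypothetical helper, not part of the notebook, and the two small frames are made-up examples:

```python
import pandas as pd

def check_clean(df):
    """Return a list of problems found in a cleaned postal-code frame (hypothetical helper)."""
    problems = []
    if df['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    if (df['Borough'] == 'Not assigned').any():
        problems.append('unassigned boroughs')
    if (df['Neighbourhood'] == 'Not assigned').any():
        problems.append('unassigned neighbourhoods')
    return problems

# Made-up frames: one clean, one with two deliberate problems
good = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})
bad = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Not assigned'],
})

print(check_clean(good))
print(check_clean(bad))
```

Calling `check_clean(my_df)` on the final data frame should return an empty list if every postal code appears once and no placeholder values remain.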