# Setup / Package imports
Let's get the necessary packages first :) We will need
* requests to fetch the webpage
* pandas for the data handling

In [1]:
import requests
import pandas as pd
import numpy as np


# Getting the data ready
## Read the webpage

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
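response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the page could not be fetched
html_data = response.text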

## Extract and prepare the dataframe
This will be done by
* selecting only the relevant table
* renaming the columns
* dropping "not assigned" boroughs
* checking that we do not have "not assigned" neighborhoods (which should be the case after dropping unassigned boroughs)

In [3]:
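from io import StringIO

# wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
webpage_data = pd.read_html(StringIO(html_data))
# we are only interested in the first table
df = webpage_data[0]
# switch the column names to American English spelling
df.rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'}, inplace=True)
# drop rows whose borough is "Not assigned"
df.drop(df[df.Borough == 'Not assigned'].index, inplace=True)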
# peek into the data
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [4]:
# check whether any neighborhoods are still unassigned
unassigned_neighborhoods = df.loc[df['Neighborhood'] == "Not assigned"]
if unassigned_neighborhoods.empty:
    print('No unassigned neighborhoods. OK')
else:
    raise ValueError("There are {} unassigned neighborhood entries".format(unassigned_neighborhoods.shape[0]))

No unassigned neighborhoods. OK
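
The renaming, filtering, and check above can also be tried offline, without fetching the page. The following is only a minimal sketch: the `sample` frame and its four rows are made up for illustration and merely mimic the column layout of the scraped table, and the method-chained style is one possible alternative to the in-place calls used in this notebook.

```python
import pandas as pd

# Hypothetical sample rows (not scraped data) with the same column layout.
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Rename the columns, then keep only rows with an assigned borough.
cleaned = (
    sample
    .rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'})
    .loc[lambda d: d['Borough'] != 'Not assigned']
)

# Count neighborhoods that are still marked as unassigned after the filter.
n_unassigned = int((cleaned['Neighborhood'] == 'Not assigned').sum())

print(cleaned)
print('Unassigned neighborhoods left:', n_unassigned)
```

Because the chain returns a new frame, `sample` stays untouched, which makes it easy to re-run the cell while experimenting.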


To finish the web scraping and data preparation, we print the total number of rows remaining in our dataframe:

In [5]:
row_total = df.shape[0]
print('There are {} rows in the prepared dataframe.'.format(row_total))

There are 103 rows in the prepared dataframe.
