# WEEK 3 - ASSIGNMENT

## PART A - Scraping and Pre-processing Data
In this section we scrape and pre-process the Canadian postal codes that begin with M (the Toronto area), as listed on Wikipedia.


In [1]:
# Load packages
import pandas as pd
import requests

We fetch the page with requests and then use the pandas function "pd.read_html" to parse its tables into dataframes. The first table on the page holds the postal codes.

In [2]:
# Page link
link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Make request to retrieve page
r = requests.get(link)

# Parse the first table in the page into a pandas dataframe
df = pd.read_html(r.text)[0]
df.head(3)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods


We now pre-process the data in five steps.

1. Rename columns

In [3]:
df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df.head(3)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods


2. Drop rows whose borough is Not assigned

In [4]:
df = df[df['Borough'] != 'Not assigned']
df.head(3)

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


3. Postal codes that cover several neighbourhoods need no extra handling: the parsed table already holds one row per postal code, with the neighbourhoods separated by commas.
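
Had the parsed table contained one row per neighbourhood instead, the rows could be combined with a group-by. The following is a minimal sketch on hypothetical toy data (the `raw` and `merged` names are illustrative and not part of the notebook):

```python
import pandas as pd

# Hypothetical toy data: M5A appears twice, once per neighbourhood
raw = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park', 'Harbourfront'],
})

# Collapse duplicate postal codes into one row each,
# joining their neighbourhoods with ", "
merged = (
    raw.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(merged)
```

Passing `sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.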



4. If a row has a borough but a Not assigned neighbourhood, the neighbourhood is set to the borough.

In [5]:
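# Select rows that have a borough but no assigned neighbourhood
filter1 = df['Borough'] != 'Not assigned'
filter2 = df['Neighbourhood'] == 'Not assigned'
# Fall back to the borough name for those rows
df.loc[filter1 & filter2, 'Neighbourhood'] = df['Borough']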
df.reset_index(inplace=True)
df.head(3)

Unnamed: 0,index,PostalCode,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
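
The Wikipedia table may contain no rows that trigger this rule, so the effect is easier to see on a small example. This is a sketch under stated assumptions: the `toy` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data covering all three cases
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M7A', 'M1A'],
    'Borough': ['North York', "Queen's Park", 'Not assigned'],
    'Neighbourhood': ['Parkwoods', 'Not assigned', 'Not assigned'],
})

# Step 2: drop rows without a borough (copy to avoid chained-assignment warnings)
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Step 4: where the neighbourhood is missing, fall back to the borough
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

# drop=True discards the old index instead of keeping it as an 'index' column
toy = toy.reset_index(drop=True)
print(toy)
```

Using `reset_index(drop=True)` is an optional variation: it avoids the extra `index` column that appears in the output above.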


5. Use the .shape attribute to print the number of rows in the dataframe.

In [6]:
print('Number of rows in dataframe: ', df.shape[0])
df

Number of rows in dataframe:  103


Unnamed: 0,index,PostalCode,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...,...
98,160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,165,M4Y,Downtown Toronto,Church and Wellesley
100,168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
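
Beyond the row count, a few quick checks can confirm that the pre-processing steps held. The helper below is a hypothetical sketch (`check_postal_codes` is not part of the notebook) and assumes the column names used above:

```python
import pandas as pd

def check_postal_codes(frame: pd.DataFrame) -> dict:
    """Return the outcome of basic sanity checks on the cleaned dataframe."""
    return {
        # Each postal code should appear on exactly one row
        'unique_codes': bool(frame['PostalCode'].is_unique),
        # No unassigned boroughs or neighbourhoods should remain
        'boroughs_assigned': bool((frame['Borough'] != 'Not assigned').all()),
        'neighbourhoods_assigned': bool((frame['Neighbourhood'] != 'Not assigned').all()),
    }

# Hypothetical two-row sample standing in for the real dataframe
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})
checks = check_postal_codes(sample)
print(checks)
```

In the notebook, the same function could be called as `check_postal_codes(df)`.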


Finally, we save the dataframe to a CSV file for future use.

In [7]:
df.to_csv('canada_postal_codes.csv', index=False)
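
To illustrate what `index=False` does, the sketch below writes a small hypothetical dataframe to an in-memory buffer rather than to disk and reads it back. The `sample`, `buffer`, and `restored` names are assumptions made for this example.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real dataframe
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})

# Write to an in-memory text buffer; index=False omits the row index,
# so no "Unnamed: 0" column appears when the file is read back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind the buffer and load the CSV again
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

With a real file, `pd.read_csv('canada_postal_codes.csv')` would load the data saved above in the same way.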