### My Capstone project

This notebook documents part 1 of the IBM capstone project on data science.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

# Scraping the postal codes from Wikipedia

We will use Beautiful Soup to scrape the contents of the following page:

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

We want the following properties:
+ The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
+ Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
+ More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
+ If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
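
The properties listed above can be checked mechanically once the dataframe is built. The following is a minimal sketch, not part of the original assignment: the helper name `check_requirements` and the toy rows are hypothetical and only illustrate what each rule means.

```python
import pandas as pd

EXPECTED_COLUMNS = ['PostalCode', 'Borough', 'Neighborhood']

def check_requirements(df):
    """Return the list of violated requirements (empty if df is clean)."""
    if list(df.columns) != EXPECTED_COLUMNS:
        return ['unexpected columns']
    problems = []
    if (df['Borough'] == 'Not assigned').any():
        problems.append('unassigned borough present')
    if df['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    if (df['Neighborhood'] == 'Not assigned').any():
        problems.append('unassigned neighborhood present')
    return problems

# Hypothetical raw rows that break every rule once.
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Hypothetical rows that already satisfy every rule.
clean = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront, Regent Park', "Queen's Park"],
})

print(check_requirements(raw))
print(check_requirements(clean))
```

Running a check like this after each cleaning step makes it easy to see which rules still need work.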

In [3]:
# Getting the webpage locally with wget
!wget 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

--2021-01-20 08:56:56--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 91.198.174.192, 2620:0:862:ed1a::1
Connecting to en.wikipedia.org (en.wikipedia.org)|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54661 (53K) [text/html]
Saving to: ‘List_of_postal_codes_of_Canada:_M.2’


2021-01-20 08:57:02 (31.1 KB/s) - ‘List_of_postal_codes_of_Canada:_M.2’ saved [54661/54661]



In [4]:
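# Open the downloaded file and parse it with Beautiful Soup.
# Note: wget saved this run's download with a '.2' suffix because earlier copies
# already existed; the unsuffixed name below refers to the first download.
wikiDoc = 'List_of_postal_codes_of_Canada:_M'
with open(wikiDoc, 'rt', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')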

In [5]:
# Search through the tables for the one with the headings we want.
tables = soup.find_all('table', class_='sortable')

# These lists collect the table columns before they are used to create a pandas dataframe.
PostalCode, Borough, Neighborhood = [], [], []

# Parsing the table
for tr in tables[0].find_all('tr'):
    
    tds = tr.find_all('td')    
    
    if not tds:
        continue
    
    line = [td.text.strip() for td in tds]
    
    PostalCode.append(line[0])
    Borough.append(line[1])
    Neighborhood.append(line[2])

pCodes = {'PostalCode':PostalCode, 'Borough':Borough, 'Neighborhood':Neighborhood}

# Creating a pandas dataframe from the pCodes dictionary
pCodes_df = pd.DataFrame(pCodes)

pCodes_df

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
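
To make the parsing step easier to follow, here is a small self-contained sketch of the same idea on an inline HTML snippet. The snippet is a hypothetical stand-in for the Wikipedia markup, not a copy of it, and this variant collects rows as dictionaries and skips any row that does not have exactly three cells.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table; the real page has many more rows.
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A </td><td>Not assigned </td><td>Not assigned </td></tr>
  <tr><td>M3A </td><td>North York </td><td>Parkwoods </td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, 'html.parser')
table = soup_demo.find('table', class_='sortable')

columns = ['PostalCode', 'Borough', 'Neighborhood']
rows = []
for tr in table.find_all('tr'):
    # Header rows contain <th> cells only, so they yield an empty list here.
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) != len(columns):
        continue
    rows.append(dict(zip(columns, cells)))

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, which is an alternative to filling three separate lists.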


# Cleaning the data frame

### Only process the cells that have an assigned borough

In [6]:
# Ignore cells with a borough that is 'Not assigned', then reset the index.
pCodes_df.drop(pCodes_df[pCodes_df.Borough == 'Not assigned'].index, inplace=True)
pCodes_df.reset_index(drop=True, inplace=True)
pCodes_df

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [7]:
pCodes_df.shape

(103, 3)

### Merging rows that share a postal code

In [8]:
# More than one neighborhood can exist in one postal code area. 
# For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. 
# These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

# Counting the number of unique postal codes to see whether we have duplicates that we should merge.
pCodes_df.PostalCode.nunique()

103

This is not the case here: there are 103 unique postal codes for 103 rows, so no postal code appears twice.
I suspect that the wiki page has been edited since the assignment was written, so that each postal code now occupies a single row.
There is nothing to merge.
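
For reference, had the table still contained repeated postal codes, the merge could be done with a `groupby`. This is a minimal sketch on hypothetical toy rows (the M5A example from the requirements), not on the scraped data.

```python
import pandas as pd

# Hypothetical rows in which M5A is listed twice.
dup = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Group rows sharing a postal code and borough, then join their neighborhoods.
# sort=False keeps the groups in order of first appearance.
merged = (
    dup.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)

print(merged)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.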

### Dealing with 'Not assigned' neighborhoods

In [9]:
# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
# Finding lines with an assigned borough but unassigned neighborhood

In [80]:
pCodes_df[pCodes_df.Neighborhood == 'Not assigned']

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |


In [81]:
# This returns an empty df, meaning that we do not have neighborhoods with a 'Not assigned' value. Probably due to wiki page edits.
pCodes_df.shape

(103, 3)
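
If such rows had been present, the replacement rule could be applied with a boolean mask. The sketch below uses hypothetical toy rows to show the idea; it is not needed for the current version of the page.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighborhood.
sample = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Select the rows whose neighborhood is unassigned and copy the borough over.
unassigned = sample['Neighborhood'] == 'Not assigned'
sample.loc[unassigned, 'Neighborhood'] = sample.loc[unassigned, 'Borough']

print(sample)
```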

In [82]:
# We have 103 entries left in the table

# Exporting dataframe to CSV

In [83]:
pCodes_df.to_csv('postCodes.csv')
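
One thing to keep in mind when reloading the file: `to_csv` writes the index by default, so reading the CSV back produces an extra `Unnamed: 0` column. The sketch below shows the difference on a hypothetical two-row dataframe, using in-memory buffers instead of files; passing `index=False` is one possible way to avoid the extra column.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the cleaned dataframe.
df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread = pd.read_csv(with_index)

# With index=False only the three data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

print(list(reread.columns))
print(list(reread_clean.columns))
```

An alternative is to keep the export as it is and pass `index_col=0` to `pd.read_csv` when loading the file.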