In [14]:
import urllib.request # to open URLs

In [15]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url) # open the url 
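
Some servers respond more reliably when the request identifies its client, and closing the connection explicitly is good hygiene. The sketch below is one possible variant of the fetch step; the helper names `build_request` and `fetch_html` and the User-Agent string are hypothetical and not part of the original notebook.

```python
import urllib.request

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def build_request(url, user_agent="postal-code-notebook/0.1 (hypothetical)"):
    """Return a Request that carries a User-Agent header (value is an assumption)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes; the connection is closed on exit."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# html = fetch_html(URL)  # uncomment to actually hit the network
```

The returned bytes can be passed to BeautifulSoup in the same way as the `page` object above.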

__Next, we import BeautifulSoup from the "Beautiful Soup" library, which lets us parse and work with the HTML we fetched from the Wikipedia page:__

In [16]:
from bs4 import BeautifulSoup

In [17]:
# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

To get an idea of the structure of the underlying HTML in our web page, we can print it with __Soup’s prettify__ function (only the first 1000 characters are shown here):

In [20]:
print(soup.prettify()[0:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XrKgOwpAMNEAA6jL3OkAAADV","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":955176509,"wgRevisionId":955176509,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toronto","Ontario

Here is the important part for us:

The data sits in an HTML __table tag__ with the class "wikitable sortable".

Scroll further down the full HTML to see how the table is built: each row opens with a __tr__ tag and closes with the matching __/tr__ tag.

The top row of headers uses __th__ tags, while the data rows beneath it, one per postal code, use __td__ tags. These __td__ tags are where we will tell Python to extract our data from.
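
To make the tag structure concrete, here is a small offline sketch. The HTML snippet is a hand-written stand-in for the real page (an assumption, not copied from Wikipedia), and it uses Python's built-in "html.parser" instead of "lxml" so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page's table (not the real Wikipedia markup)
html = """
<table class="wikitable sortable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

toy_soup = BeautifulSoup(html, "html.parser")
toy_table = toy_soup.find("table", class_="wikitable sortable")

rows = toy_table.find_all("tr")                       # every row, header included
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
data_rows = [r for r in rows if r.find_all("td")]     # rows that contain td cells

print(headers)
print(len(rows), len(data_rows))
```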

In [21]:
# let's see the title of the web page
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

Let's look at our table.

First, we ask Beautiful Soup to retrieve every __table__ tag on the page and store the results in a list called all_tables:

In [27]:
all_tables = soup.find_all('table')
# or just select the table whose class is "wikitable sortable";
# this skips the other tables on the page, which we don't need
table = soup.find('table', class_ = 'wikitable sortable')
# let's see the first 100 characters of the table
print(table.prettify()[0:100])


<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postalcode
   </th>
   <th>
    Borou


There are 3 columns in our table that we want to scrape the data from,
__so we will set up 3 empty lists (A, B, C) to store our data in.__

* We know that the table is set up in rows (starting with 'tr' tags), with the data sitting within 'td' tags in each row. We aren't too worried about the header row with its 'th' elements, as we already know what each column represents from looking at the table.
* To start with, we use the Beautiful Soup 'find_all' function again and set it to look for the string 'tr'. We then set up a FOR loop that walks through the rows in that list, one by one.

* Within the loop, we use find_all again to search each row for 'td' tags. We store these in a variable called 'cells' and then check that there are exactly 3 items in it, which skips the header row.

* If there are, we use the find(text=True) option to extract the content string from each 'td' element in that row and append it to the A-C lists we created at the start of this step.

In [28]:
A = []
B = []
C = []

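for row in table.find_all('tr'):       # loop over every row of the table
    cells = row.find_all('td')         # data cells; the header row has none
    if len(cells) == 3:
        # take the first text node of each cell and drop the trailing newline
        A.append(cells[0].find(text=True).rstrip('\n'))
        B.append(cells[1].find(text=True).rstrip('\n'))
        C.append(cells[2].find(text=True).rstrip('\n'))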
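
One thing to watch: find(text=True) returns only the first text node of a cell, and returns None for a completely empty cell. As a hedged alternative, the sketch below uses get_text(strip=True), which copes with empty cells and with text wrapped in nested tags such as links. The helper name `extract_rows` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_rows(table, n_cols=3):
    """Return one tuple of stripped cell texts per data row with n_cols td cells."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) == n_cols:  # skips the header row, which has th cells only
            rows.append(tuple(cell.get_text(strip=True) for cell in cells))
    return rows


# Hand-written sample (not the real page): an empty cell and a linked borough
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M5A</td><td><a href="#">Downtown Toronto</a></td><td>Regent Park / Harbourfront</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
extracted = extract_rows(sample_table)
print(extracted)
```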
              

We’ll import pandas and build a DataFrame with it, assigning each of the lists A-C to a column named after the corresponding column of the source table: PostalCode, Borough, Neighborhood.

In [29]:
import pandas as pd

pd.set_option('display.max_columns', None) # to see all the columns
pd.set_option('display.max_rows', None)

df = pd.DataFrame(A, columns=['PostalCode'])
df['Borough'] = B
df['Neighborhood'] = C
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
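
The three columns can also be supplied in a single call by passing a dictionary to the DataFrame constructor. This is a minimal sketch with toy stand-ins for the lists A, B and C (the values below are illustrative, not the full scraped data).

```python
import pandas as pd

# Toy stand-ins for the scraped lists A, B, C
A = ["M1A", "M3A"]
B = ["Not assigned", "North York"]
C = ["", "Parkwoods"]

# Keys become column names; dictionary order sets the column order
toy_df = pd.DataFrame({"PostalCode": A, "Borough": B, "Neighborhood": C})
print(toy_df.shape)
```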


Let's remove the rows whose Borough is 'Not assigned'

In [30]:
import numpy as np
# assign the result back: inplace=True on a column selection may not update df in newer pandas
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)
df.dropna(subset=['Borough'], inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
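
The same filtering can be expressed with a boolean mask, without going through NaN. A minimal sketch on a toy frame (the rows are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A"],
    "Borough": ["Not assigned", "Not assigned", "North York"],
    "Neighborhood": ["", "", "Parkwoods"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```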


Let's replace the '/' separator between neighborhood names with ','

In [32]:
# df=df.groupby(["PostalCode", "Borough"], as_index=False)
df['Neighborhood'] = df['Neighborhood'].str.split(pat = "/")
df['Neighborhood'] = df['Neighborhood'].apply(', '.join)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park , Harbourfront"
3,M6A,North York,"Lawrence Manor , Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
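
Splitting on "/" and joining with ", " keeps the space that preceded the slash, which is why the output reads "Regent Park , Harbourfront". If that stray space is unwanted, one option is a regular-expression replace that swallows the whitespace around the slash. A sketch on a toy Series (values are illustrative):

```python
import pandas as pd

toy = pd.Series(
    ["Parkwoods", "Regent Park / Harbourfront", "A/B / C"],
    name="Neighborhood",
)

# Replace a slash plus any surrounding whitespace with a comma and one space
cleaned = toy.str.replace(r"\s*/\s*", ", ", regex=True)
print(cleaned.tolist())
```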


For rows where Neighborhood is "Not assigned", make the value the same as Borough

In [33]:
# in case any row has "Not assigned" in the Neighborhood column;
# assigning to the row yielded by iterrows() would not change df, so use .loc
not_assigned = df["Neighborhood"] == "Not assigned"
df.loc[not_assigned, "Neighborhood"] = df.loc[not_assigned, "Borough"]
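
The same rule can also be written without a loop using Series.mask, which replaces values wherever a condition holds. A minimal sketch on a toy frame (the rows are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Where Neighborhood is "Not assigned", take the value from Borough instead
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)
print(toy)
```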

Let's look at the shape 

In [34]:
df.shape

(103, 3)