# Task 1: Web Scraping a Wikipedia table

We use `BeautifulSoup` to scrape the table of [Postal codes in Toronto](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) from Wikipedia.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

We first make a GET request to obtain the page's HTML, which we then pass to a `BeautifulSoup` object.

In [2]:
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'lxml')
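The request above assumes the download succeeds. As a minimal sketch of a more defensive fetch, the helper below adds a timeout and fails loudly on an HTTP error status. The name `fetch_html` and the 10-second default are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the raw bytes of the page at `url`.

    `fetch_html` and its default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status rather than parsing an error page.
    response.raise_for_status()
    return response.content
```

With this helper, the parsing step could read `soup = BeautifulSoup(fetch_html(URL), 'lxml')`.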

With the parsed page held in `soup`, we find the table with class `wikitable sortable` (identified by inspecting the page's HTML) and split it into rows by searching for `tr` tags.

In [3]:
table = soup.find('table', attrs={'class':'wikitable sortable'})
table_data = table.find_all('tr')
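Passing the whole string `'wikitable sortable'` matches the `class` attribute exactly, so the lookup would miss the table if the page listed the classes in another order. The sketch below illustrates this on a made-up HTML snippet (not the real page) and shows a CSS selector as one possible alternative; `html.parser` is used only so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the classes appear in a different order than on the real page.
html = """
<table class="sortable wikitable">
  <tr><th>Postal Code</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""
sample_soup = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: misses the reordered classes.
exact = sample_soup.find('table', attrs={'class': 'wikitable sortable'})

# CSS selector: matches any table carrying both classes, in any order.
by_selector = sample_soup.select_one('table.wikitable.sortable')

sample_rows = by_selector.find_all('tr')
print(exact is None, len(sample_rows))
```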

To construct our dataframe, we first build a list of dictionaries, one per table row; the dictionary keys will become the columns of the dataframe.

In [4]:
# Initialise an empty list for the neighbourhoods.

neighbourhoods = []

For later use, we collect the header names in a list called `headers` by finding the `th` tags within the first row of `table_data` (i.e., the header row). Converted to text, the headers contain line break characters, which we remove along with any spaces (so `Postal Code` becomes `PostalCode`).

In [5]:
# Collect the header names in a list

headers = [th.text.replace('\n','').replace(' ', '') for th in table_data[0].find_all('th')]
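To make the effect of the two `replace` calls concrete, here is a small sketch on a made-up header row resembling the one on the page. It also shows `get_text(strip=True)` for comparison, which trims only the surrounding whitespace and therefore keeps the space inside a header name.

```python
from bs4 import BeautifulSoup

# Made-up header row; each header ends with a line break, as on the page.
header_html = (
    "<table><tr><th>Postal Code\n</th><th>Borough\n</th>"
    "<th>Neighbourhood\n</th></tr></table>"
)
header_row = BeautifulSoup(header_html, 'html.parser').find('tr')

# Same cleaning as above: drop line breaks, then drop spaces.
cleaned = [th.text.replace('\n', '').replace(' ', '') for th in header_row.find_all('th')]

# For comparison: strip=True trims leading/trailing whitespace only.
stripped = [th.get_text(strip=True) for th in header_row.find_all('th')]

print(cleaned)   # ['PostalCode', 'Borough', 'Neighbourhood']
print(stripped)  # ['Postal Code', 'Borough', 'Neighbourhood']
```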

We then scan through the remaining rows of `table_data`. For each row we create an empty dictionary and pair `headers` with the row's `td` elements using `zip`, so that each header name becomes a key and the corresponding table entry becomes its value (with line break characters again removed). Each dictionary is then appended to `neighbourhoods`.

In [6]:
for row in table_data[1:]:  # skip the header row
    neighbourhood = dict()
    for header, td in zip(headers, row.find_all('td')):
        neighbourhood[header] = td.text.replace('\n','')
    neighbourhoods.append(neighbourhood)
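The same row-to-dictionary step can also be written as a nested comprehension. The sketch below runs on a made-up two-row table rather than the real page, and uses `get_text(strip=True)` instead of `replace('\n','')`, a substitution that additionally trims surrounding spaces.

```python
from bs4 import BeautifulSoup

# Made-up table with the same shape as the scraped one.
html = """
<table>
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
<tr><td>M4A
</td><td>North York
</td><td>Victoria Village
</td></tr>
</table>
"""
rows = BeautifulSoup(html, 'html.parser').find_all('tr')
column_names = ['PostalCode', 'Borough', 'Neighbourhood']

# One dictionary per data row: header name -> cell text.
records = [
    {name: td.get_text(strip=True) for name, td in zip(column_names, row.find_all('td'))}
    for row in rows[1:]  # skip the header row
]
print(records)
```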

We can now convert this list into a dataframe by passing it to the `pandas.DataFrame` constructor.

In [7]:
df = pd.DataFrame(neighbourhoods)
df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|------------|---------|---------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
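Because `zip` stops at the shorter of its inputs, a row with fewer cells than headers would produce a dictionary with missing keys. The sketch below, on made-up records with one deliberately short row, suggests how `pandas.DataFrame` would handle that case: the missing entry becomes `NaN`, and the `columns` argument fixes the column order.

```python
import pandas as pd

records = [
    {'PostalCode': 'M3A', 'Borough': 'North York', 'Neighbourhood': 'Parkwoods'},
    {'PostalCode': 'M4A', 'Borough': 'North York'},  # hypothetical short row
]

# Passing `columns` pins the column order explicitly.
sample_df = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighbourhood'])

# Count the entries pandas filled with NaN for the missing key.
missing = int(sample_df['Neighbourhood'].isna().sum())
print(sample_df.shape, missing)
```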


The first rows show that some postal codes have no associated borough (`Not assigned`), so we collect the indices of those rows, drop them from the dataframe, and reset the index.

In [8]:
empty_rows = df[df['Borough']=='Not assigned'].index
df.drop(empty_rows, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|------------|---------|---------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
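An equivalent way to express the same filtering, sketched here on a small made-up dataframe, is to keep the rows that pass a boolean mask instead of dropping rows in place. This leaves the original dataframe untouched, which can make a cell safer to re-run.

```python
import pandas as pd

# Made-up data mirroring the first rows of the scraped table.
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep rows whose borough is assigned, then renumber the index from 0.
assigned = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```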


Because the original table could contain duplicate postal codes, we compare the number of rows in the dataframe to the number of unique postal codes:

In [9]:
print(f"There are {df.shape[0]} rows in the table, and {df['PostalCode'].nunique()} unique postal codes")

There are 103 rows in the table, and 103 unique postal codes
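Comparing the two counts tells us whether duplicates exist, but not which rows they are. As a minimal sketch on made-up data with one deliberately repeated code, `Series.duplicated(keep=False)` marks every occurrence of a repeated value so the offending rows can be inspected.

```python
import pandas as pd

# Made-up data: 'M5A' appears twice on purpose.
sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park', 'Harbourfront'],
})

# True if any postal code occurs more than once.
has_duplicates = not sample_df['PostalCode'].is_unique

# keep=False flags all occurrences of a repeated code, not just the later ones.
repeated = sample_df[sample_df['PostalCode'].duplicated(keep=False)]
print(has_duplicates)
print(repeated)
```

On the scraped dataframe the two counts already agree, so the equivalent of `repeated` would be empty.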


There do not seem to be any duplicate postal codes. But are there any neighbourhoods that are not assigned?

In [10]:
print(f"The number of neighbourhoods that are not assigned: {df[df['Neighbourhood']=='Not assigned'].shape[0]}")

The number of neighbourhoods that are not assigned: 0


So, as a sanity check: what size is the dataframe?

In [11]:
df.shape

(103, 3)