# Scraping Toronto Neighbourhood Data

**Data Source:** [Wikipedia Page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

Import required libraries.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

print('Imports complete.')

Imports complete.


Fetch the page's HTML and parse it into a BeautifulSoup object, then find the first table on the page and save it to a variable.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_raw = requests.get(url).text  # The parser name belongs to BeautifulSoup, not requests.get
soup = BeautifulSoup(html_raw, 'html.parser')
html_table = soup.find('table')
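
A slightly more defensive version of this step is sketched below. It is only a sketch under a few assumptions: the helper names `fetch_html` and `first_table` are hypothetical, the `User-Agent` value is a placeholder to replace with your own details, and the 10-second timeout is an arbitrary choice. The demonstration at the end runs on a small inline HTML string so it does not need network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP error codes."""
    # Placeholder User-Agent (an assumption); replace with your own details.
    headers = {'User-Agent': 'toronto-neighbourhoods-notebook/0.1 (example@example.com)'}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # Fail loudly on 4xx/5xx instead of parsing an error page
    return response.text


def first_table(html):
    """Parse HTML and return the first <table> element."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        raise ValueError('No <table> element found in the page.')
    return table


# Offline demonstration on a made-up fragment (not the real Wikipedia markup).
sample_html = '<html><body><table><tr><th>Postal Code</th></tr></table></body></html>'
sample_table = first_table(sample_html)
```

With network access, `first_table(fetch_html(url))` would stand in for the three lines above.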

## Scraping table data

Pull headers into a list to use as column names.

In [3]:
headers = []
for header in html_table.find_all('th'):
    # get_text() also works when the cell contains nested tags, where .string is None
    headers.append(header.get_text(strip=True))
headers

['Postal Code', 'Borough', 'Neighbourhood']

Iterate over the remaining rows of the table (skipping the header row) and store each row's values as a sub-list of one overall list.

In [4]:
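table_data = []
for row in html_table.find_all('tr')[1:]:  # Skip the header row
    row_data = []  # Reset list to hold each row's data

    for cell in row.find_all('td'):
        # get_text() also works when the cell contains nested tags, where .string is None
        row_data.append(cell.get_text(strip=True))

    table_data.append(row_data)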
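
One caveat when reading cell text: BeautifulSoup's `.string` attribute is `None` whenever a tag has more than one child, for example a cell that mixes a link with plain text, and calling `.strip()` on `None` raises `AttributeError`. The sketch below uses a small made-up HTML fragment (not the real Wikipedia markup) to show the difference between `.string` and `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; the second cell mixes a link with plain text.
sample_html = """
<table>
  <tr><td>M5A
</td><td><a href="#">Regent Park</a>, Harbourfront
</td></tr>
</table>
"""

row = BeautifulSoup(sample_html, 'html.parser').find('tr')
cells = row.find_all('td')

# .string is only set when the cell has a single child; otherwise it is None.
string_values = [cell.string for cell in cells]

# get_text() joins every text fragment in the cell, stripping whitespace from each.
text_values = [cell.get_text(strip=True) for cell in cells]
```

Because `strip=True` strips each fragment before joining, words split across adjacent tags can run together; passing a separator, as in `cell.get_text(' ', strip=True)`, avoids that.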

Convert the list of rows to a pandas DataFrame and view the first few rows.

In [5]:
table = pd.DataFrame(table_data, columns=headers)
table.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


## Data Cleaning

We now need to clean the data for further use. For our purposes, we do not require the postal codes that are not assigned to any borough.

**Note:**
---
- If Borough is 'Not assigned', the row is dropped.
- If Neighbourhood is 'Not assigned', Neighbourhood is set to the Borough value.

In [6]:
table.shape # Check initial shape

(180, 3)

Replace all 'Not assigned' cells with NA.

In [7]:
table.replace('Not assigned', pd.NA, inplace=True)
table.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Drop all rows without a value for Borough.

In [8]:
table.dropna(axis='index', subset=['Borough'], inplace=True)
table.shape # Check new number of rows

(103, 3)

Check if there are any rows with missing values in Neighbourhood.

In [9]:
table[table['Neighbourhood'].isnull()]

Unnamed: 0,Postal Code,Borough,Neighbourhood
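
The check returns no rows, so the second cleaning rule needs no action on this data. Had any rows turned up, one way to apply the rule is `fillna` with the Borough column. The sketch below runs on a made-up three-row frame (the `M9A` row is invented for illustration) and applies both rules end to end.

```python
import pandas as pd

# Made-up sample: one unassigned row, one complete row, one row missing only its neighbourhood.
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M9A'],
    'Borough': ['Not assigned', 'North York', 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

cleaned = (
    sample.replace('Not assigned', pd.NA)  # Mark unassigned cells as missing
    .dropna(subset=['Borough'])            # Rule 1: drop rows with no borough
    .reset_index(drop=True)
)

# Rule 2: where Neighbourhood is missing, fall back to the row's Borough.
cleaned['Neighbourhood'] = cleaned['Neighbourhood'].fillna(cleaned['Borough'])
```

`fillna` aligns the two columns by index, so each missing Neighbourhood takes the Borough from its own row.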


Reset the index and save the table to CSV for further processing.

In [10]:
table.reset_index(drop=True, inplace=True)
table.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [11]:
table.to_csv('../data/toronto_neighbourhoods.csv', index=False)
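
Several Neighbourhood values contain commas, so it is worth confirming that they survive the trip through CSV. By default `to_csv` quotes any field containing the delimiter, and `read_csv` reverses that. The sketch below checks this on a made-up two-row frame, writing to an in-memory buffer rather than to the real output file.

```python
import io

import pandas as pd

# Made-up sample; the second neighbourhood value contains a comma.
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park, Harbourfront'],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV text back to confirm the columns and values are intact.
restored = pd.read_csv(io.StringIO(csv_text))
```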