# capstone_week3_part1

First part of the "Segmenting and Clustering Neighborhoods in Toronto" assignment. The task for this part is to scrape Toronto neighbourhood data from Wikipedia and wrangle it into a pandas dataframe with a predefined format.

The Toronto neighbourhood data with postal codes seems to be available at: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

<b> Install and import required dependencies </b>

In [1]:
# Import required dependencies
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re

<b> Get the data from the site </b>

The page with the postal code and neighbourhood data seems to contain a single table. Let's request the page and pass its text content to BeautifulSoup for parsing.

In [2]:
# Set url
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Query the page html; fail early on a hung connection or an HTTP error
page = requests.get(url, timeout=30)
page.raise_for_status()
# Import the page into bs4
soup = BeautifulSoup(page.text, 'html.parser')

First, let's extract the table and its cells from the HTML.

In [3]:
# Get table and cells
table = soup.find('table')
cells = table.find_all('td')

Now we can prepare the dataframe with the columns PostalCode, Borough and Neighborhood.

In [4]:
# Setup dataframe
toronto_neighbourhoods = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood'])

<b> Parsing the html for neighbourhood data</b>

Looking at the cell content, the postal codes always seem to be in bold, so we can get the postal code by finding the b tag. The borough and the neighbourhoods seem to sit inside a span tag, which we can extract in the same way and read the text content from. Some postal codes don't seem to have been assigned, so we can simply skip those.
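To make the cell layout concrete, here is a minimal sketch that runs the same `find` and `get_text` calls on a tiny hand-written table. The HTML snippet is an assumption that mimics the structure described above; it is not copied from the live page, whose markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the assumed cell layout: <b> holds the postal
# code, <span> holds the borough and its neighbourhoods.
sample_html = """
<table>
  <tr>
    <td><p><b>M1A</b><br><span>Not assigned</span></p></td>
    <td><p><b>M5A</b><br><span>Downtown Toronto(Regent Park / Harbourfront)</span></p></td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_cells = sample_soup.find('table').find_all('td')

# Pair each postal code with the raw span text of the same cell
pairs = [(cell.find('b').get_text(), cell.find('span').get_text())
         for cell in sample_cells]

# Keep only the cells that have a borough assigned
assigned = [pair for pair in pairs if pair[1] != 'Not assigned']
print(assigned)
```

Note that `find` returns `None` when a tag is missing, so a cell that does not follow this layout would need a guard before calling `get_text`.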

Since the neighbourhood data is always surrounded by parentheses, we can first get the borough by using a regex to cut out everything from the opening parenthesis onwards. To extract the neighbourhoods, we use another regex to find the parenthesised text, map the matches through a lambda function that strips the parentheses, and join the results. Finally, we replace the / separators with commas.
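The two regex steps can be tried in isolation on a single string. The sketch below uses the same patterns as the notebook; the helper name `split_cell_text` and the sample strings are assumptions made for illustration only.

```python
import re

def split_cell_text(text):
    """Split 'Borough(Hood A / Hood B)' into (borough, 'Hood A, Hood B')."""
    # Drop everything from the first '(' to the end to keep only the borough
    borough = re.sub(r'\(.*\).*', '', text)
    # Grab the parenthesised part, strip the parentheses, then turn ' /' into ','
    neighbourhoods = ','.join(
        match.replace('(', '').replace(')', '')
        for match in re.findall(r'\(.*\)', text)
    ).replace(' /', ',')
    return borough, neighbourhoods

# Hypothetical cell texts in the assumed 'Borough(Neighbourhoods)' shape
multi = split_cell_text('Downtown Toronto(Regent Park / Harbourfront)')
single = split_cell_text('North York(Parkwoods)')
print(multi)
print(single)
```

Because `.*` is greedy, a text with several parenthesised groups is matched as one span from the first `(` to the last `)`; that is fine for the shape assumed here but worth knowing if the page layout changes.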

After cleaning up the data, we can add it to the dataframe.

In [5]:
# Go through the cells, clean the text and collect one row per postal code
rows = []
for cell in cells:
    postal_tag = cell.find('b')
    span_tag = cell.find('span')
    # Skip cells that do not follow the expected <b>/<span> layout
    if postal_tag is None or span_tag is None:
        continue
    postal = postal_tag.get_text()
    borough_and_neighbourhoods = span_tag.get_text()

    # Ignore postal codes that have no borough or neighbourhood
    if borough_and_neighbourhoods == 'Not assigned':
        continue

    # Get the borough by cutting out the neighbourhoods
    borough = re.sub(r'\(.*\).*', '', borough_and_neighbourhoods)
    # Join the neighbourhoods by comma and clean the text of parentheses
    neighbourhoods = ','.join(
        map(
            lambda match: match.replace('(', '').replace(')', ''),
            re.findall(r'\(.*\)', borough_and_neighbourhoods)
        )
    ).replace(' /', ',')

    rows.append({'PostalCode': postal, 'Borough': borough, 'Neighborhood': neighbourhoods})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
toronto_neighbourhoods = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])

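Clean up the Borough names. In a few cells the borough text comes out run together with extra address text, so we map those values to readable names.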

In [6]:
toronto_neighbourhoods['Borough'] = toronto_neighbourhoods['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})

# Check the data
toronto_neighbourhoods

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto Business | Enclave of M4L |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
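The borough clean-up relies on `Series.replace` with a dictionary, which by default swaps whole cell values rather than substrings. A minimal sketch on a small hand-made frame (the sample rows are assumptions, chosen to reuse two of the mappings from the notebook):

```python
import pandas as pd

# Small hypothetical frame with one clean and two run-together borough names
sample = pd.DataFrame({
    'Borough': ['North York', 'EtobicokeNorthwest', 'East YorkEast Toronto'],
})

# Keys are matched against the whole cell value; unmatched values are kept as-is
sample['Borough'] = sample['Borough'].replace({
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
})

cleaned = sample['Borough'].tolist()
print(cleaned)
```

A value that is not listed in the dictionary, such as `North York` here, passes through unchanged, so the mapping only needs to cover the malformed names.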


In [7]:
# Export the dataframe to a CSV file for easier use in the next part of the assignment
toronto_neighbourhoods.to_csv('toronto_postal_codes.csv', index=False)
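As a quick illustration of why `index=False` is passed here, the sketch below round-trips a small hand-made frame through an in-memory CSV buffer. The sample rows and the use of `io.StringIO` instead of a file on disk are assumptions made to keep the example self-contained.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# Without the index: the CSV holds exactly the three data columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

# With the index (the default): reading back yields an extra unnamed column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```

Writing without the index means the next part of the assignment can load the file with a plain `pd.read_csv` call and get the same three columns back.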

<b> Print out the dataframe shape </b>

In [8]:
toronto_neighbourhoods.shape

(103, 3)