# Segment and cluster Toronto neighbourhoods
### Author: Kazimierz Hermaszewski
This notebook does the following:
- Scrapes Toronto's neighbourhood data from Wikipedia
    - Web data is scraped using the requests, BeautifulSoup and pandas packages
- Geocodes the neighbourhoods

## Scrape Toronto's neighbourhood data from Wikipedia

In [16]:
# install beautiful soup to scrape webpages
!pip install beautifulsoup4

In [17]:
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import geocoder
import errno

In [18]:
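# get webpage content
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'
try:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # treat HTTP error status codes as failures
except requests.exceptions.RequestException as e:
    print(f'Error: {e}')
    raise  # stop here so later cells never run without a valid response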

When building the table below, these assumptions were made:
- If there is no assigned borough, the table row is skipped.
- If there is no assigned neighbourhood but a borough exists, the borough is assigned as the neighbourhood.
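The two assumptions above can also be sketched directly in pandas. This is a minimal illustration on hypothetical sample rows, not the notebook's own scraping code: the `raw` and `cleaned` names and the sample values are invented for the example, and it matches `'Not assigned'` exactly rather than as a substring.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped Wikipedia table.
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: skip rows that have no assigned borough.
cleaned = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 2: where the neighbourhood is unassigned, fall back to the borough.
unassigned = cleaned['Neighbourhood'] == 'Not assigned'
cleaned.loc[unassigned, 'Neighbourhood'] = cleaned.loc[unassigned, 'Borough']

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

The first sample row is dropped, and the last one ends up with its borough name in the `Neighbourhood` column.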

In [19]:
# create and read the soup object
soup = BeautifulSoup(response.text, 'html.parser')
table_list = []
table = soup.find('table')

# find and transform the table rows
for row in table.find_all('tr'):  # find_all is the current name for findAll
    cell = {}
    if row.find('th') is not None:
        continue
    
    postalcode = row.find('td')
    borough = postalcode.find_next_sibling()
    neighbourhood = borough.find_next_sibling()
    
    if 'Not assigned' in borough.text.rstrip(): # skip rows if no assigned borough
        continue
    else: 
        cell['PostalCode'] = postalcode.text.rstrip()
        cell['Borough'] = borough.text.rstrip()

        if 'Not assigned' in neighbourhood.text.rstrip(): # if no assigned neighbourhood, set neighbourhood to borough
            cell['Neighbourhood'] = borough.text.rstrip()
        else:
            cell['Neighbourhood'] = neighbourhood.text.rstrip()

    table_list.append(cell)

# build the dataframe   
df = pd.DataFrame(table_list)

# clean up borough names that picked up extra fused text during scraping
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})
df

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [20]:
df.shape

(103, 3)
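A shape check confirms the row count but not that the cleaning rules held. The sketch below shows one way to verify them, assuming each postal code should appear only once. The `check_cleaned` helper and the sample frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def check_cleaned(frame):
    """Return a list of problems found in a cleaned neighbourhood table."""
    problems = []
    # No 'Not assigned' placeholder should survive the cleaning step.
    for column in ('Borough', 'Neighbourhood'):
        leftover = frame[column].str.contains('Not assigned', regex=False, na=False)
        if leftover.any():
            problems.append(f"'Not assigned' left in {column}")
    # Assumption: one row per postal code.
    if frame['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    return problems

# Hypothetical sample with the same columns as df.
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})
print(check_cleaned(sample))
```

An empty list means no problems were found. Calling `check_cleaned(df)` on the real table would run the same checks there.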

## Geocode the neighbourhoods

In [21]:
# get coordinates from the provided CSV file as geocoder was unreliable
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'

# pandas can read the CSV straight from the URL, so no separate request is needed
df_g = pd.read_csv(url)

# attach the coordinates by postal code, then drop the duplicate key column
df = pd.merge(df, df_g, left_on='PostalCode', right_on='Postal Code')
df.drop(columns=['Postal Code'], inplace=True)
df
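`pd.merge` performs an inner join by default, so any postal code missing from the coordinates file would be dropped without warning. The sketch below shows one way to surface such rows with a left join and `indicator=True`. The sample frames, the `Latitude` and `Longitude` column names, the coordinate values and the `M9Z` code are all placeholders assumed for illustration, not data from the real CSV.

```python
import pandas as pd

# Hypothetical neighbourhood rows; 'M9Z' is a made-up code with no coordinates.
neighbourhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Example'],
})

# Hypothetical coordinates; column names and values are assumed placeholders.
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.75, 43.72],
    'Longitude': [-79.33, -79.31],
})

# A left join keeps every neighbourhood row; '_merge' records where each row matched.
merged = neighbourhoods.merge(
    coords, how='left', left_on='PostalCode', right_on='Postal Code', indicator=True
)

# Postal codes that found no coordinates.
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
merged = merged.drop(columns=['Postal Code', '_merge'])

print(missing)
print(merged)
```

If `missing` is empty, the inner join used in the cell above loses nothing; otherwise the listed codes need coordinates from another source.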