## Scraping Toronto Neighbourhoods

This notebook scrapes Toronto postal codes, boroughs, and neighbourhoods from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and stores them in a pandas DataFrame.

In [13]:
#loading libraries
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq 

import numpy as np
import pandas as pd

In [14]:
#this function takes a url, downloads the page and returns its html parsed with Beautiful Soup
def read_page(my_url):

    #Reading the url
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    #parsing the html page
    page_soup = soup(page_html,"html.parser")
    
    return page_soup
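
A possible variation on the download step, shown only as a sketch: some sites reject requests that carry the default urllib user agent, so it can help to send an explicit one and to set a timeout. The names `build_request` and `fetch_html` and the user agent string are hypothetical and are not part of this notebook.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="toronto-neighbourhoods-notebook/0.1"):
    # Hypothetical helper: attach an explicit User-Agent header to the request
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    # The with-block closes the connection even if read() raises
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

The bytes returned by `fetch_html` could be passed to `soup(page_html, "html.parser")` in the same way as in `read_page`.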

In [15]:
#specifying url and reading
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
postal_code_soup = read_page(wiki_url)

In [16]:
#scraping the postal code table from the page loaded above

#finding and converting the table to a data list
table = postal_code_soup.find("table",{"class":"wikitable sortable"})
table_rows = table.find_all('tr')

#creating empty list to store data
data = []

#looping through all table rows
for tr in table_rows:
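    #collecting the td cells of the row; rows that only contain th cells (headers) give an empty list
    td = tr.find_all('td')
    #stripping whitespace from each cell text and skipping empty cells
    row = [i.text.strip() for i in td if i.text.strip()]
    #if the row is non-empty then it is appended to data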
    if row:
        data.append(row)

In [17]:
#creating dataframe from data
df = pd.DataFrame(data, columns=["postal_code", "borough", "neighbourhoods"])
df.head()

Unnamed: 0,postal_code,borough,neighbourhoods
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [18]:
#checking for duplicate postal codes
df[df.duplicated(['postal_code'],keep=False)]

Unnamed: 0,postal_code,borough,neighbourhoods


No duplicates!
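
To show what this check would return if duplicates did exist, here is a minimal sketch on made-up data. The `toy` frame and its values are illustrative assumptions, not rows from the scraped table.

```python
import pandas as pd

# Hypothetical toy data with one repeated postal code
toy = pd.DataFrame({
    "postal_code": ["X1A", "X2B", "X1A"],
    "borough": ["East", "West", "East"],
})

# keep=False flags every member of a duplicate group, not only the later repeats
dupes = toy[toy.duplicated(["postal_code"], keep=False)]
print(dupes)
```

With `keep=False`, both `X1A` rows are returned, so an empty result means every postal code appears exactly once.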

Some postal codes are not assigned to any borough. I will remove those rows.

In [19]:
df = df[df['borough']!='Not assigned']

Finally, I will give each neighbourhood its own row. Currently, the neighbourhoods for each postal code are stored in a single comma-separated string.

In [20]:
neighbourhoods = df['neighbourhoods'].str.split(',')

In [21]:
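#assign will make a list of the different neighbourhoods separated by comma
#explode will divide the list into multiple rows
#the second assign strips the space left after each comma
df = (df.assign(neighbourhoods = df['neighbourhoods'].str.split(','))
         .explode('neighbourhoods')
         .assign(neighbourhoods = lambda d: d['neighbourhoods'].str.strip())
         .reset_index(drop=True))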

In [22]:
df.head()

Unnamed: 0,postal_code,borough,neighbourhoods
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park
3,M5A,Downtown Toronto,Harbourfront
4,M6A,North York,Lawrence Manor
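
The split-and-explode step can be tried in isolation on a small frame. This is a sketch under stated assumptions: the `toy` frame and its values are made up, and the extra strip step is added here to drop the space that follows each comma.

```python
import pandas as pd

# Hypothetical toy data, not taken from the scraped table
toy = pd.DataFrame({
    "postal_code": ["X1A", "X2B"],
    "neighbourhoods": ["Alpha, Beta", "Gamma"],
})

exploded = (
    toy.assign(neighbourhoods=toy["neighbourhoods"].str.split(","))  # string -> list
       .explode("neighbourhoods")                                    # one row per list item
       .assign(neighbourhoods=lambda d: d["neighbourhoods"].str.strip())
       .reset_index(drop=True)
)
print(exploded)
```

The postal code `X1A` is repeated once per neighbourhood, which is the same long format that is saved to the CSV.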


In [23]:
#saving to csv
df.to_csv('toronto_neighbourhoods.csv',index=False)