# Web scraping Toronto neighborhoods data #
This notebook was created for Part I of the 'Segmenting and Clustering Neighborhoods in Toronto' assignment. Here we build the code to scrape the table of postal codes from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and transform it into a pandas dataframe.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# it's good practice to identify ourselves
headers = {"user-agent": "Webscraper for IBM Data Science Capstone"}
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", headers=headers)

# check for valid status; raise an error instead of calling exit(),
# which would shut down the notebook kernel
if page.status_code != requests.codes.ok:
    raise RuntimeError(f"Request was not successful, status code: {page.status_code}")
    
# Parse page using BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [3]:
# print scraped page title
print(soup.title.text)

List of postal codes of Canada: M - Wikipedia


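Inspecting the page in a browser shows that the information we need is inside a table element with class 'wikitable sortable jquery-tablesorter'. The 'jquery-tablesorter' part is added by JavaScript in the browser, so in the HTML that requests receives the class is just 'wikitable sortable', which is what we search for.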

In [4]:
# get table data
table = soup.find("table", {"class":"wikitable sortable"})
# print first row of the table (column headers)
print(table.find("tr"))

<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
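The class lookup above can be tried offline on a tiny table. This is only a sketch: the HTML snippet and the names `html_demo`, `soup_demo`, `by_attrs`, `by_css`, and `by_browser_class` are made up for illustration and are not part of the scraped page. It assumes, as is typical for Wikipedia's sortable tables, that 'jquery-tablesorter' is only added to the class attribute by JavaScript in the browser.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the raw HTML as a plain HTTP client would see it
html_demo = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
soup_demo = BeautifulSoup(html_demo, "html.parser")

# match on the exact class string, as in the cell above
by_attrs = soup_demo.find("table", {"class": "wikitable sortable"})

# CSS selector: matches a table carrying both classes, regardless of order
by_css = soup_demo.select_one("table.wikitable.sortable")

# the class string seen in the browser is not present in the raw HTML
by_browser_class = soup_demo.find(
    "table", {"class": "wikitable sortable jquery-tablesorter"}
)

print(by_attrs.find("th").text)
print(by_browser_class)
```

The CSS selector form may be the more forgiving choice if the page ever lists the classes in a different order or adds further classes to the table.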


Now we extract the data we need from the table: PostalCode, Borough, and Neighborhood. We only process rows that have an assigned borough and skip rows whose borough is 'Not assigned'. If a row has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the borough name. Since more than one neighborhood can exist in one postal code area, we combine all of its neighborhoods into one row, separated by commas.

In [13]:
# initialize empty dictionary for data from table (to deal with multiple neighborhoods in the same postal code area)
data = {}

# get all table rows
trs = table.find_all("tr")

# process row data
for tr in trs[1:]:
    entry = tr.find_all("td")
    # if borough is not assigned, skip the row
    if entry[1].text.strip() == 'Not assigned':
        continue
    # key is a tuple of postal code, borough
    key = (entry[0].text.strip(), entry[1].text.strip())
    val = entry[2].text.strip()
    # if neighborhood is not assigned set it to borough name
    if val == 'Not assigned':
        val = entry[1].text.strip()
    if key not in data:
        data[key] = [val]
    # if multiple neighborhoods exist, append to the list of neighborhoods 
    else:
        data[key].append(val) 

# populate dataframe (build all rows first instead of growing the dataframe one row at a time)
rows = [(code, borough, ", ".join(names)) for (code, borough), names in data.items()]
df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
    
df.head(10)

   PostalCode           Borough                      Neighborhood
0         M3A        North York                         Parkwoods
1         M4A        North York                  Victoria Village
2         M5A  Downtown Toronto         Harbourfront, Regent Park
3         M6A        North York  Lawrence Heights, Lawrence Manor
4         M7A      Queen's Park                      Queen's Park
5         M9A         Etobicoke                  Islington Avenue
6         M1B       Scarborough                    Rouge, Malvern
7         M3B        North York                   Don Mills North
8         M4B         East York   Woodbine Gardens, Parkview Hill
9         M5B  Downtown Toronto          Ryerson, Garden District
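
The three rules applied above (skip unassigned boroughs, fall back to the borough name, join neighborhoods per postal code) could also be expressed with pandas operations instead of a dictionary. The following is a minimal sketch on a few hypothetical sample rows, not the approach used in this notebook; the names `sample_rows`, `raw`, `assigned`, and `grouped` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the Wikipedia table
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M7A", "Queen's Park", "Not assigned"),
]
raw = pd.DataFrame(sample_rows, columns=["PostalCode", "Borough", "Neighborhood"])

# 1. drop rows without an assigned borough
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# 2. where the neighborhood is not assigned, use the borough name instead
assigned["Neighborhood"] = assigned["Neighborhood"].where(
    assigned["Neighborhood"] != "Not assigned", assigned["Borough"]
)

# 3. one row per (postal code, borough), neighborhoods joined with ", "
#    sort=False keeps the groups in order of first appearance
grouped = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

On real data this would presumably start from a dataframe of the raw table rows rather than a hand-written list; the dictionary loop above remains the method this notebook actually uses.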


In [15]:
# save to csv file to use later
df.to_csv('neighborhoods.csv', index=False)
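
As a quick check of what `index=False` does, here is a small round-trip sketch. It uses an in-memory buffer and a one-row hypothetical dataframe (`sample`, `buffer`, and `restored` are illustrative names) rather than the real `neighborhoods.csv`, so it does not touch the file written above.

```python
import io

import pandas as pd

# Hypothetical one-row dataframe with the same columns as df
sample = pd.DataFrame(
    {"PostalCode": ["M3A"], "Borough": ["North York"], "Neighborhood": ["Parkwoods"]}
)

# write without the index, then read the CSV text back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```

Without `index=False`, the row index would be written as an extra unnamed first column, which pandas typically reads back under the name `Unnamed: 0`.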

In [16]:
df.shape

(103, 3)