## Get the Toronto postal codes and associated neighborhood information

This notebook downloads the Toronto postal code table from Wikipedia. The page is fetched with requests and parsed with BeautifulSoup from the bs4 library. The data is then loaded into a pandas data frame and cleaned up.

In [2]:
# import the necessary libraries 
import numpy as np 
import pandas as pd 
from bs4 import BeautifulSoup
import requests

## Download and parse the webpage

In [114]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

response = requests.get(url)
page_html = response.text
soup = BeautifulSoup(page_html, 'html.parser')
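
The download step can also be wrapped so that a failed request raises an error instead of silently passing an error page to the parser. This is only a sketch: the helper name `fetch_soup` and the 10-second default timeout are assumptions for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not in the notebook)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup(url)` with the Wikipedia URL above would then stand in for the three lines in the cell.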

The table that we want uses the class "wikitable sortable" in the table tag. From there we need to get each row, which is specified with the tr tag. 

In [116]:
table1 = soup.find(class_="wikitable sortable").find("tbody").find_all("tr")
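
To see what this lookup returns without hitting the network, the same calls can be tried on a small inline table. The HTML below is a made-up stand-in for the Wikipedia page, and the names `sample_html`, `sample_soup`, `rows`, and `cells_per_row` are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up table that mimics the structure described above
sample_html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
table = sample_soup.find("table", class_="wikitable sortable")
if table is None:
    # find() returns None when nothing matches, so fail with a clear message
    raise ValueError("no table with class 'wikitable sortable' found")

rows = table.find_all("tr")
# The header row contains only th cells, so its td list is empty
cells_per_row = [[cell.get_text().strip() for cell in row.find_all("td")]
                 for row in rows]
print(cells_per_row)
```

Checking for `None` before chaining further calls gives a readable error if the page layout ever changes.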


This cell does the main cleanup of the data. It loops through each row in the table and collects the td cells. The first row is the header row, which holds th cells rather than td cells, so its list of td cells is empty and the row is skipped. Rows whose borough is "Not assigned" are skipped as well. If the neighborhood is listed as "Not assigned", the neighborhood is given the borough's value instead. Finally, the collected values are assembled into a data frame.

In [107]:
# Collect one list per output column
Borough = []
Neighborhood = []
PostalCode = []
for row in table1:
    tmp1 = row.find_all("td")
    # The header row has no td cells, so tmp1 is empty and the row is skipped;
    # rows whose borough is "Not assigned" are skipped as well
    if tmp1 != [] and tmp1[1].get_text().strip() != "Not assigned":
        PostalCode.append(tmp1[0].get_text().strip())
        Borough.append(tmp1[1].get_text().strip())
        # Compare the cell text, not the Tag object, so the fallback can trigger
        if tmp1[2].get_text().strip() == "Not assigned":
            Neighborhood.append(tmp1[1].get_text().strip())
        else:
            Neighborhood.append(tmp1[2].get_text().strip())
data = {'PostalCode' : PostalCode, 'Borough' : Borough, 'Neighborhood' : Neighborhood}
df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
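
The two cleanup rules can also be expressed directly in pandas once the raw cell text is in a data frame. The sketch below is an alternative to the loop, not the notebook's method, and it runs on three hand-written sample rows rather than scraped data; `raw`, `cleaned`, and `unnamed` are illustrative names.

```python
import pandas as pd

# Hand-written sample rows standing in for the scraped cell text
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: drop rows whose borough is "Not assigned"
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighborhood is "Not assigned", fall back to the borough
unnamed = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[unnamed, "Neighborhood"] = cleaned.loc[unnamed, "Borough"]

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Filtering first and filling second mirrors the order of the checks in the loop.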




Take a peek at the data to make sure that it looks generally as expected. 

In [117]:
print(df.head())

  PostalCode           Borough                                 Neighborhood
0        M3A        North York                                    Parkwoods
1        M4A        North York                             Victoria Village
2        M5A  Downtown Toronto                    Regent Park, Harbourfront
3        M6A        North York             Lawrence Manor, Lawrence Heights
4        M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government


## Test the postal code 

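Check the data frame to make sure that there is exactly one row per postal code. To do this, group it by postal code, count the number of rows in each group, and then tally how often each group size occurs. If every postal code is unique, the only group size in the result is 1.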

In [112]:
df.groupby('PostalCode').size().value_counts()

1    103
dtype: int64
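
A more direct check is also possible with `Series.is_unique` and `Series.duplicated`. The sketch below uses a small made-up frame containing a deliberate duplicate to show what a failure would look like; `sample`, `codes_are_unique`, and `duplicates` are illustrative names.

```python
import pandas as pd

# Made-up frame with one postal code repeated on purpose
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M4A"],
    "Borough": ["North York", "North York", "North York"],
})

# True only when no postal code appears more than once
codes_are_unique = sample["PostalCode"].is_unique

# keep=False marks every member of a duplicated group, not just the repeats
duplicates = sample[sample["PostalCode"].duplicated(keep=False)]
print(codes_are_unique)
print(duplicates)
```

Listing the offending rows makes it easier to see which postal codes need attention when the check fails.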

## Shape of the data frame

Display the shape of the data frame, which gives the number of rows and columns.

In [64]:
df.shape

(103, 3)