# Toronto Clustering - ML project

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract its table of postal codes and load the data into a pandas dataframe.

To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
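
The rules above can be sketched on a small hand-made table. This is a minimal illustration rather than the notebook's own code: the `clean_postal_table` helper and the sample rows are assumptions made up for the example, and the column names follow the list above.

```python
import pandas as pd


def clean_postal_table(raw):
    """Apply the cleaning rules to a raw postal-code table (hypothetical helper)."""
    # Rule 1: drop rows whose borough is 'Not assigned'
    df = raw[raw["Borough"] != "Not assigned"].copy()

    # Rule 2: a 'Not assigned' neighborhood falls back to the borough name
    missing = df["Neighborhood"] == "Not assigned"
    df.loc[missing, "Neighborhood"] = df.loc[missing, "Borough"]

    # Rule 3: merge rows that share a postal code, joining neighborhoods with a comma
    return (
        df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )


# Made-up sample rows, for illustration only
sample = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

cleaned = clean_postal_table(sample)
print(cleaned)
print(cleaned.shape)
```

On this sample the M1A row is dropped, the two M5A rows collapse into one, and the M7A neighborhood takes the borough name, leaving two rows and three columns.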



In [5]:
import pandas as pd
import numpy as np
import requests
pd.set_option('display.max_rows', None)

In [6]:
# !pip install bs4

In [7]:
# import the library we use to open URLs
import urllib.request
from bs4 import BeautifulSoup

In [8]:
# specify which URL/web page we are going to be scraping
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

In [9]:
# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

In [10]:
# uncomment the following line to check the page HTML content
# print(soup.prettify())

The data we want sits in an HTML _table_ tag with a class identifier of “wikitable sortable”. We’ll make a note of that for later use.

Scroll down a little to see how the table is made up and you’ll see that the rows start and end with _tr_ and _/tr_ tags.

The top row of headers has _th_ tags, while the data rows beneath, one for each neighbourhood entry, have _td_ tags. It’s from these _td_ tags that we will tell Python to extract our data.
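
To make the structure concrete, here is a small sketch that parses a made-up HTML fragment with the same layout. The fragment and the variable names are assumptions for illustration. It uses Python's built-in `html.parser` instead of `lxml`, so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment that mimics the structure described above
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# "html.parser" ships with Python, so no lxml install is needed for this sketch
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    # the header row holds only th cells, so it has no td cells and is skipped
    if len(cells) == 3:
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Filtering on the number of _td_ cells is what keeps the header row out of the result, since that row contains only _th_ cells.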



In [11]:
# check the page title
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

In [12]:
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML and store in 'all_tables' variable
all_tables=soup.find_all("table")

In [13]:
# the table I am interested in has the class 'wikitable sortable'
right_table=soup.find('table', class_='wikitable sortable')
# right_table


In [14]:
# Data extraction
A=[]
B=[]
C=[]

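# walk every table row; header rows have no td cells and are skipped
for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        # take the first text node of each cell and trim surrounding whitespace
        A.append(cells[0].find(string=True).strip())
        B.append(cells[1].find(string=True).strip())
        C.append(cells[2].find(string=True).strip())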

In [15]:
# Create a dataframe with the scraped columns
df = pd.DataFrame(A,columns=['Postal Code'])
df['Borough']= B
df['Neighbourhood']= C

df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [16]:
# Get only rows where Borough is assigned
df = df[df['Borough'] != 'Not assigned']
df.reset_index(drop=True,inplace=True)
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [17]:
# check whether any postal code is listed more than once
grouped = df.groupby(by="Postal Code").count()
repeated = grouped[grouped["Borough"] > 1]
if repeated.empty:
    print("No repetitions. df is ready to go")
else:
    print("Found repetitions")
    # a bare expression inside a branch is not displayed, so print it explicitly
    print(repeated)

No repetitions. df is ready to go


In [18]:
# where the neighbourhood is 'Not assigned', use the borough name instead
idx = df[df['Neighbourhood'] == 'Not assigned'].index
df.loc[idx,'Neighbourhood'] = df.loc[idx,'Borough']
df.head(5)

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [19]:
# check dataframe shape 
df.shape

(103, 3)

In [20]:
# install the geocoder library 
# !pip install geocoder

In [None]:
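import geocoder  # geocoding client used to look up coordinates

MAX_ATTEMPTS = 5  # assumed retry limit, so a failing lookup cannot loop forever

latitude = []
longitude = []

for postal_code in df['Postal Code']:
    lat_lng_coords = None
    # retry a bounded number of times until the service returns coordinates
    for _ in range(MAX_ATTEMPTS):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        if lat_lng_coords is not None:
            break
    if lat_lng_coords is None:
        # keep the lists aligned with df even when every attempt fails
        lat_lng_coords = (np.nan, np.nan)

    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])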

In [None]:
# Add the new columns to the dataframe
df['Latitude'] = latitude
df['Longitude'] = longitude
df.head()