# Segmenting and Clustering Neighborhoods in Toronto

### 1. Scrape the Wikipedia page listing Toronto postal codes to obtain the data on boroughs and neighborhoods, then transform the data into a pandas DataFrame.

For web scraping we'll use the BeautifulSoup package.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Fetch the page and pass its HTML to BeautifulSoup, together with the name of the parser (we'll use the lxml parser).

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page=requests.get(url).text

In [3]:
soup = BeautifulSoup(wiki_page, 'lxml')
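The request above does not check whether the download succeeded. A minimal sketch of a more defensive version follows; the helper names `fetch_html` and `make_soup`, the injectable `get` parameter, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, get=requests.get, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    The helper name, the injectable `get` callable and the timeout value
    are assumptions made for this sketch.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


def make_soup(html, parser="lxml"):
    """Parse HTML text; the 'lxml' parser requires the lxml package."""
    return BeautifulSoup(html, parser)
```

With these helpers, cells 2 and 3 would reduce to `soup = make_soup(fetch_html(url))`, and a failed request would raise an exception instead of silently parsing an error page.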

Find the table containing the postal codes. There are several tables on this page; the one we need has the class "wikitable sortable".

In [4]:
table = soup.find('table', class_="wikitable sortable")
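To illustrate how the class lookup behaves, the sketch below runs the same `find` call on a made-up HTML snippet with two tables. The snippet and the `html.parser` backend are assumptions chosen so the example runs without network access or lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the Wikipedia page (illustration only).
sample_html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A class_ string containing a space is matched against the exact value of
# the class attribute, so only the second table qualifies.
sample_table = sample_soup.find("table", class_="wikitable sortable")
first_header = sample_table.find("th").text
print(first_header)
```

Because `find` returns only the first match, this approach relies on the postal-code table being the first one on the page with that exact class string.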

Get all the rows of the table. The first row is the table header, so we need to skip it.

In [5]:
tr = table.find_all('tr')
print(tr[0])

<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>


In [6]:
print(tr[1])

<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>


In [7]:
print(len(tr))

289
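The header row is built from `<th>` cells while data rows use `<td>` cells, which is why the first row has to be handled separately. A small sketch on a made-up table (the HTML and the `html.parser` backend are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up table mimicking the layout printed above, trailing newlines included.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""
rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")

# Column names come from the <th> cells of the first row.
header = [th.text.strip() for th in rows[0].find_all("th")]

# Slicing off the first row leaves only rows made of <td> cells;
# strip() removes the trailing newline inside the last cell.
records = [[td.text.strip() for td in row.find_all("td")] for row in rows[1:]]
```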


Create an empty pandas DataFrame. Set PostalCode as the index to make lookups and insertions by postal code easier.

In [8]:
df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood'])
df.set_index('PostalCode', inplace=True)
df

| PostalCode | Borough | Neighborhood |
|---|---|---|
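With PostalCode as the index, membership tests, row insertion, and cell updates can all be addressed by the postal code itself. A minimal sketch with one made-up row (the name `demo` and the value "Sample Area" are illustrative):

```python
import pandas as pd

demo = pd.DataFrame(columns=["PostalCode", "Borough", "Neighborhood"])
demo = demo.set_index("PostalCode")

# Assigning to a label that does not exist yet appends a new row.
demo.loc["M3A"] = ["North York", "Parkwoods"]

# Membership tests against the index use the postal code directly.
has_m3a = "M3A" in demo.index
has_m4a = "M4A" in demo.index

# An existing cell can be read and updated by (label, column).
demo.loc["M3A", "Neighborhood"] = demo.loc["M3A", "Neighborhood"] + ", Sample Area"
```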


The following code populates the dataframe. It iterates over the table rows and extracts the cell text from each one. Rows whose borough is not assigned are skipped. If the borough is assigned but the neighborhood is not, the neighborhood is set to the borough's name. If the postal code is already in the dataframe, the new neighborhood is appended to the existing neighborhood string for that code, separated by a comma. We assume that all rows sharing a postal code belong to the same borough.

In [9]:
#Iterate over the rows, skipping the header
for row in tr[1:]:
    td = row.find_all('td')
    code = td[0].text.strip()
    if td[1].text.strip() == 'Not assigned':
        continue
    else:
        borough = td[1].text.strip()
       
    if td[2].text.strip() == 'Not assigned':
        neigh = borough
    else:
        neigh = td[2].text.strip()
    
    #Check if the row with this postal code already exists in the table.
    if code in df.index:
        oldNeigh = df.loc[code, 'Neighborhood']
        neigh = oldNeigh + ", " + neigh
        df.loc[code, 'Neighborhood'] = neigh
    else:
        df.loc[code] = [borough, neigh]

#reset the index, so the table is in the original form.
df = df.reset_index()
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
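The same three rules can also be expressed with vectorized pandas operations instead of a row-by-row loop. The sketch below is an alternative formulation, not the notebook's method, and it runs on a few made-up raw rows; the names `raw`, `assigned`, and `grouped` are illustrative.

```python
import pandas as pd

# Made-up raw rows in the same shape as the scraped table.
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: drop rows whose borough is not assigned.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the borough's name.
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# Rule 3: merge rows sharing a postal code, joining neighborhoods with ", ".
# sort=False keeps postal codes in order of first appearance, like the loop.
grouped = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
```

One difference from the loop: grouping on both PostalCode and Borough would produce two rows if a postal code ever appeared under two boroughs, whereas the loop keeps the first borough it sees.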


In [10]:
df.shape

(103, 3)
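Beyond checking the shape, a few quick assertions can confirm that the cleaning rules held. The helper `check_cleaned` below is hypothetical and is shown on a made-up frame; on the real data it could be called as `check_cleaned(df)`.

```python
import pandas as pd


def check_cleaned(frame):
    """Return a list of problems found in a cleaned postal-code frame."""
    problems = []
    if frame["PostalCode"].duplicated().any():
        problems.append("duplicate postal codes")
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned boroughs")
    if (frame["Neighborhood"] == "Not assigned").any():
        problems.append("unassigned neighborhoods")
    return problems


# Made-up frame that satisfies all three rules.
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods", "Victoria Village"],
    }
)
print(check_cleaned(sample))
```

An empty list means no rule was violated.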