# Segmenting and Clustering Neighborhoods in Toronto

## Part 1 - Table of postal codes to dataframe

In [1]:
# Install required library
!pip3 install beautifulsoup4

You are using pip version 19.0.3, however version 20.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
# import necessary libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


### Getting all info from website
The Beautiful Soup library lets us parse the Wikipedia page and extract its contents. To view the parsed HTML in an easier-to-read, indented form, we can call "prettify".

In [3]:
my_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Download the page and parse its HTML
r = requests.get(my_url).text
soup = BeautifulSoup(r, "html.parser")
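
As a minimal sketch of what "prettify" does, the snippet below parses a tiny hand-written table instead of the real page, so it runs offline. The names `sample_html`, `sample_soup` and `pretty` are illustrative and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Wikipedia HTML).
sample_html = "<table><tr><th>Postal code</th></tr><tr><td>M3A</td></tr></table>"

sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the markup as a string with each tag on its own line,
# indented according to its nesting depth.
pretty = sample_soup.prettify()
print(pretty)
```

On the real page, `print(soup.prettify())` would produce the same kind of indented listing, only much longer.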

Since we are interested in the postal code table, we locate it with "find", which returns the first "table" element on the page. In the table's HTML, every row is a "tr" element and the header cells are "th" elements; the "text" attribute printed below strips the tags and leaves only the cell contents.

In [7]:
tab = soup.find('table')
tab.text

"\n\nPostal code\n\nBorough\n\nNeighborhood\n\n\nM1A\n\nNot assigned\n\n\n\n\nM2A\n\nNot assigned\n\n\n\n\nM3A\n\nNorth York\n\nParkwoods\n\n\nM4A\n\nNorth York\n\nVictoria Village\n\n\nM5A\n\nDowntown Toronto\n\nRegent Park / Harbourfront\n\n\nM6A\n\nNorth York\n\nLawrence Manor / Lawrence Heights\n\n\nM7A\n\nDowntown Toronto\n\nQueen's Park / Ontario Provincial Government\n\n\nM8A\n\nNot assigned\n\n\n\n\nM9A\n\nEtobicoke\n\nIslington Avenue\n\n\nM1B\n\nScarborough\n\nMalvern / Rouge\n\n\nM2B\n\nNot assigned\n\n\n\n\nM3B\n\nNorth York\n\nDon Mills\n\n\nM4B\n\nEast York\n\nParkview Hill / Woodbine Gardens\n\n\nM5B\n\nDowntown Toronto\n\nGarden District / Ryerson\n\n\nM6B\n\nNorth York\n\nGlencairn\n\n\nM7B\n\nNot assigned\n\n\n\n\nM8B\n\nNot assigned\n\n\n\n\nM9B\n\nEtobicoke\n\nWest Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale\n\n\nM1C\n\nScarborough\n\nRouge Hill / Port Union / Highland Creek\n\n\nM2C\n\nNot assigned\n\n\n\n\nM3C\n\nNorth York\n\nDon Mills\n

Using "find_all" we can get all the rows in the table, and "len" gives the number of rows.

In [8]:
ro = tab.find_all('tr')
nrows = len(ro)
nrows

181
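
To illustrate how "find_all" walks rows and cells, here is a small sketch on a hand-written three-row table. The HTML and the names `sample_tab`, `sample_rows`, `header_cells` and `body_cells` are assumptions made for the example; the real table is assumed to follow the same `tr`/`th`/`td` layout.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table mimicking the layout of the postal code table.
sample_html = """
<table>
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td></td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_tab = BeautifulSoup(sample_html, "html.parser").find("table")
sample_rows = sample_tab.find_all("tr")

# The header row holds <th> cells; every other row holds <td> cells.
header_cells = [th.get_text(strip=True) for th in sample_rows[0].find_all("th")]
body_cells = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in sample_rows[1:]
]

print(len(sample_rows))
print(header_cells)
print(body_cells)
```

Reading cells through `find_all("td")` is one alternative to splitting the row text on newlines; it does not depend on how many newlines separate the cells.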

The first step is to extract the table header, so we use "tr.text" to get the first row as a string.

In [9]:
title = tab.tr.text
title

'\nPostal code\n\nBorough\n\nNeighborhood\n'

Next we get each header name: "split" breaks the string into a list, as shown below.

In [10]:
title = title.split('\n')
title

['', 'Postal code', '', 'Borough', '', 'Neighborhood', '']

In [11]:
h1 = title[1]
h2 = title[3]
h3 = title[5]
h3

'Neighborhood'
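
The header names land at indices 1, 3 and 5 because every newline is a split point, so leading, trailing and doubled newlines yield empty strings. The sketch below reproduces this with the header string from the output above and shows, as an assumption-free alternative, how to drop the empty strings instead of relying on positions. The names `header_text`, `parts`, `by_index` and `non_empty` are illustrative.

```python
# Header text copied from the output of tab.tr.text above.
header_text = '\nPostal code\n\nBorough\n\nNeighborhood\n'

parts = header_text.split('\n')

# Positional access, as in the notebook: the names sit at the odd indices.
by_index = [parts[1], parts[3], parts[5]]

# Alternative: keep only the non-empty pieces, regardless of position.
non_empty = [p for p in parts if p]

print(parts)
print(by_index)
print(non_empty)
```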

In order to get the desired dataframe, we perform the following steps:
    1. Get the Postal Code, Borough and Neighborhood data (using "split").
    2. For postal codes with more than one neighborhood, use "replace" to swap the "/" separator for a comma, so that all neighborhoods appear in the same row separated by commas.
    3. For neighborhoods marked Not assigned, use "replace" to substitute the borough name.
    4. Skip rows whose borough is Not assigned (using an if statement).
 

In [12]:
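data = []
for i in range(1, nrows):  # skip the header row at index 0
    cells = ro[i].text.split('\n')
    data_PC = cells[1]  # postal code
    data_B = cells[3]   # borough
    data_N = cells[5]   # neighborhood(s)
    data_N = data_N.replace('/', ',')  # separate neighborhoods with commas
    data_N = data_N.replace('Not assigned', data_B)  # fall back to the borough name
    if data_B != 'Not assigned':  # drop rows without a borough
        data.append((data_PC, data_B, data_N))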
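
The four steps can also be sketched as a standalone function applied to row strings, which makes the logic easy to try without downloading the page. The function name `parse_row` and the sample strings are hypothetical; the strings follow the shape of the row text shown earlier, and the last one (a borough with a Not assigned neighborhood) is invented to exercise step 3.

```python
def parse_row(row_text):
    """Return (postal_code, borough, neighborhood), or None if the borough is unassigned."""
    cells = row_text.split('\n')
    postal_code, borough, neighborhood = cells[1], cells[3], cells[5]
    if borough == 'Not assigned':
        return None  # step 4: skip rows without a borough
    neighborhood = neighborhood.replace('/', ',')               # step 2
    neighborhood = neighborhood.replace('Not assigned', borough)  # step 3
    return postal_code, borough, neighborhood


# Sample row strings in the same shape as ro[i].text (illustrative values).
sample_row_texts = [
    '\nM1A\n\nNot assigned\n\n\n',
    '\nM3A\n\nNorth York\n\nParkwoods\n',
    '\nM5A\n\nDowntown Toronto\n\nRegent Park / Harbourfront\n',
    "\nM7A\n\nQueen's Park\n\nNot assigned\n",  # hypothetical row for step 3
]

parsed = [parse_row(t) for t in sample_row_texts]
kept = [p for p in parsed if p is not None]
print(kept)
```

Replacing `'/'` alone leaves a space before each comma ("Regent Park , Harbourfront"); replacing `' / '` with `', '` instead would avoid that, at the cost of changing the output shown in this notebook.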


Finally we transform the list into a dataframe and get the number of rows and columns using "shape".

In [13]:
df = pd.DataFrame(data, columns = [h1,h2,h3])
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |


In [14]:
print(df.shape)

(103, 3)
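
Beyond "shape", a few quick checks can confirm that the cleaning steps did what was intended. The sketch below runs them on a small hand-built dataframe so that it is self-contained; the name `sample` and its three rows are assumptions, and the same expressions could be applied to `df`.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as df.
sample = pd.DataFrame(
    [
        ("M3A", "North York", "Parkwoods"),
        ("M4A", "North York", "Victoria Village"),
        ("M5A", "Downtown Toronto", "Regent Park , Harbourfront"),
    ],
    columns=["Postal code", "Borough", "Neighborhood"],
)

# No row should keep an unassigned borough after filtering.
no_unassigned = not (sample["Borough"] == "Not assigned").any()

# Each postal code is expected to appear once.
unique_codes = sample["Postal code"].is_unique

n_rows, n_cols = sample.shape
print(no_unassigned, unique_codes, (n_rows, n_cols))
```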
