# Segmenting and Clustering Neighborhoods in Toronto: Cleaning the Data

First, we import the modules we need.

In [1]:
import requests
!pip install bs4
from bs4 import BeautifulSoup
import pandas as pd



Next, we fetch the postal code table from the Wikipedia page and parse it.

In [2]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
res = requests.get(URL)
# name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(res.content, 'html.parser')
dataf = {}
boroughs = []

Then we clean the data according to the assignment:
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
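
To make the three rules concrete, here is a minimal pandas sketch on a few hand-written rows. The sample rows and the names `raw` and `cleaned` are illustrative assumptions and not part of the assignment; the notebook's own cleaning works directly on the scraped HTML instead.

```python
import pandas as pd

# Hypothetical sample rows mimicking the shape of the Wikipedia table.
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: drop rows whose borough is not assigned.
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 3: an unassigned neighborhood takes the borough's name.
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

# Rule 2: merge neighborhoods that share a postal code and borough.
cleaned = (
    cleaned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(cleaned)
```

Rule 3 is applied before the merge so that a borough name filled in for a missing neighborhood is joined like any other value.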

In [3]:
i = 0
# skip the header row of the table
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th','td'])
    postcode = str(data[0])[4:-5]
    borough = str(data[1])[4:-5]
    neighborhood = str(data[2])[4:-5].replace('\n','')
    #only if borough is assigned
    if borough != 'Not assigned':
        #if a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
        if neighborhood == 'Not assigned':
            neighborhood = borough
        #cleaning from html tags
        if '<a ' in borough:
            borough = borough[borough.find('>')+1:borough.find('<',borough.find('>'))]
        if '<a ' in neighborhood:
            neighborhood = neighborhood[neighborhood.find('>')+1:neighborhood.find('<',neighborhood.find('>'))]
        
        if [postcode,borough] not in boroughs:
            
            dataf[i] = [postcode,borough,neighborhood]
            boroughs.append([postcode,borough])
            i += 1
        # several neighborhoods in one borough on one postcode
        else:
            
            for data in dataf.values():
                if data[0] == postcode and data[1] == borough:
                    data[2] = data[2] + f', {neighborhood}'
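
The slicing `str(data[0])[4:-5]` relies on every cell being wrapped in exactly `<td>` and `</td>`, and links need a separate clean-up step. A possible alternative, shown here only as a sketch, is BeautifulSoup's `get_text(strip=True)`, which returns the cell text without nested tags. The HTML snippet below is a made-up stand-in for the Wikipedia markup, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical table row, loosely modelled on the Wikipedia markup.
html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M5A</td><td><a href="/wiki/Downtown_Toronto">Downtown Toronto</a></td>
      <td><a href="/wiki/Regent_Park">Regent Park</a>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", class_="wikitable").find_all("tr")[1:]

for row in rows:
    # get_text(strip=True) drops nested tags and surrounding whitespace.
    postcode, borough, neighborhood = (
        cell.get_text(strip=True) for cell in row.find_all(["th", "td"])
    )
    print(postcode, borough, neighborhood)
```

One caveat: with `strip=True`, a cell that contains several separate text fragments has them concatenated without a separator, so real pages may still need a check.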


Let's take a look at what we got.

In [4]:
# this dictionary will become a DataFrame
dataf

{0: ['M3A', 'North York', 'Parkwoods'],
 1: ['M4A', 'North York', 'Victoria Village'],
 2: ['M5A', 'Downtown Toronto', 'Harbourfront, Regent Park'],
 3: ['M6A', 'North York', 'Lawrence Heights, Lawrence Manor'],
 4: ['M7A', "Queen's Park", "Queen's Park"],
 5: ['M9A', 'Etobicoke', 'Islington Avenue'],
 6: ['M1B', 'Scarborough', 'Rouge, Malvern'],
 7: ['M3B', 'North York', 'Don Mills North'],
 8: ['M4B', 'East York', 'Woodbine Gardens, Parkview Hill'],
 9: ['M5B', 'Downtown Toronto', 'Ryerson, Garden District'],
 10: ['M6B', 'North York', 'Glencairn'],
 11: ['M9B',
  'Etobicoke',
  'Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park'],
 12: ['M1C', 'Scarborough', 'Highland Creek, Rouge Hill, Port Union'],
 13: ['M3C', 'North York', 'Flemingdon Park, Don Mills South'],
 14: ['M4C', 'East York', 'Woodbine Heights'],
 15: ['M5C', 'Downtown Toronto', 'St. James Town'],
 16: ['M6C', 'York', 'Humewood-Cedarvale'],
 17: ['M9C',
  'Etobicoke',
  'Bloordale Gardens, Eringate,

In the dictionary `dataf`, each key holds one row, so passing it to **pandas** directly yields a DataFrame with rows and columns swapped relative to what we want. We therefore *transpose* the DataFrame and assign column names.

In [5]:
# each dict key becomes a column, so transpose to get one row per postal code
df = pd.DataFrame(dataf).transpose()
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
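
As a side note, the transpose can be avoided. A minimal sketch, assuming a dictionary shaped like `dataf` (only the first three rows are repeated here), uses `pd.DataFrame.from_dict` with `orient="index"` so that each key is read as a row label:

```python
import pandas as pd

# A few rows copied from the dictionary shown earlier.
dataf = {
    0: ["M3A", "North York", "Parkwoods"],
    1: ["M4A", "North York", "Victoria Village"],
    2: ["M5A", "Downtown Toronto", "Harbourfront, Regent Park"],
}

# orient="index" treats each key as a row label, so no transpose is needed.
df = pd.DataFrame.from_dict(
    dataf, orient="index", columns=["PostalCode", "Borough", "Neighborhood"]
)
print(df.shape)
```

Both routes should give the same table; this one sets the column names in the same call.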


In [6]:
df.shape

(103, 3)
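
Beyond the shape, it can be worth checking that the cleaning rules actually hold. The helper `find_problems` below is a hypothetical sketch, not part of the assignment, and it runs on a small hand-written sample in place of the real `df`:

```python
import pandas as pd


def find_problems(df: pd.DataFrame) -> list[str]:
    """Return a list of rule violations found in a cleaned table."""
    problems = []
    if (df["Borough"] == "Not assigned").any():
        problems.append("unassigned borough")
    if (df["Neighborhood"] == "Not assigned").any():
        problems.append("unassigned neighborhood")
    # The cleaning loop merges rows by (postcode, borough), so pairs should be unique.
    if df.duplicated(["PostalCode", "Borough"]).any():
        problems.append("duplicate postal code/borough pair")
    return problems


# Hypothetical sample standing in for the cleaned DataFrame.
sample = pd.DataFrame(
    [
        ["M3A", "North York", "Parkwoods"],
        ["M5A", "Downtown Toronto", "Harbourfront, Regent Park"],
        ["M7A", "Queen's Park", "Queen's Park"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

print(find_problems(sample))
```

An empty list means none of the three checks found a violation.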