# Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import pandas as pd
import numpy as np

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
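The cleaning rules in the list above can be sketched on a small toy table. This is only an illustration under stated assumptions: the rows in `raw` are made-up sample values shaped like the Wikipedia table, and the names `raw`, `cleaned`, and `combined` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Toy rows shaped like the Wikipedia table (illustrative values only)
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: keep only rows that have an assigned borough
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the borough's name
unassigned = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighborhood"] = cleaned.loc[unassigned, "Borough"]

# Rule 3: one row per postal code, neighborhoods joined with a comma
combined = (
    cleaned.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)
print(combined)
```

The notebook itself applies the first two rules while looping over the scraped cells and the third with a separate `groupby`; the sketch only shows the same rules expressed as dataframe operations.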

In [2]:
# Define column names
column_names = ['PostalCode', 'Borough', 'Neighborhood']

Fetch the Wikipedia table and sort its cell values into a nested list (`jagged`), one inner list per table row.

In [3]:
# Download the Wikipedia page and parse its HTML
import requests
from bs4 import BeautifulSoup

page_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(page_html, 'lxml')

# find the postal code table in the HTML page and collect its cells
My_table = soup.find('table', {'class': 'wikitable sortable'})
MyTableCells = My_table.find_all('td')
# variables for extraction
My_list = []
jagged = []

# counter for list items (every 3 items will be a row)
# starting with 0, reset at 2 
x = 0

# walk the cells, building one [PostalCode, Borough, Neighborhood] row per three cells
for i in MyTableCells:
    if x == 2:
        My_list.append(i.text.strip('\n'))
        
        # if Borough is not "not assigned", add to the jagged array
        if My_list[1] != 'Not assigned':
            if My_list[2] == 'Not assigned':
                My_list[2] = My_list[1]
            jagged.append(My_list)
        My_list = []
        x = 0
    else:
        My_list.append(i.text.strip('\n'))
        x += 1

print('Extraction is complete.')

Extraction is complete.
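The counter-based loop above groups every three consecutive cells into one row. A minimal sketch of the same idea using list slicing follows; the `cells` list is hypothetical sample data standing in for the scraped `td` texts, and `chunk_rows` is a helper introduced here for illustration only.

```python
# Hypothetical flat cell texts in reading order (three cells per table row)
cells = [
    "M1A", "Not assigned", "Not assigned\n",
    "M3A", "North York", "Parkwoods\n",
    "M7A", "Queen's Park", "Not assigned\n",
]


def chunk_rows(flat, width=3):
    """Group a flat list into consecutive rows of `width` items."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]


rows = []
for postal_code, borough, neighborhood in chunk_rows([c.strip() for c in cells]):
    if borough == "Not assigned":
        continue  # skip rows without an assigned borough
    if neighborhood == "Not assigned":
        neighborhood = borough  # fall back to the borough name
    rows.append([postal_code, borough, neighborhood])

print(rows)
```

Slicing avoids the manual counter and its reset, at the cost of assuming the cell count is an exact multiple of three.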


Build the working dataframe (`toronto_df_wb`) from the extracted rows.

In [4]:
toronto_df_wb = pd.DataFrame(jagged, columns=column_names)
toronto_df_wb.head(10)

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Next we need to bundle the neighborhoods by PostalCode, so I'm building a new dataframe (`new_df`) for that.

In [5]:
# Bundle neighborhoods up: one comma-separated string per PostalCode
new_df = (
    toronto_df_wb.groupby('PostalCode')['Neighborhood']
    .apply(', '.join)
    .to_frame()
)

# Let's see what we've got
new_df.head()

PostalCode,Neighborhood
M1B,"[Rouge, Malvern]"
M1C,"[Highland Creek, Rouge Hill, Port Union]"
M1E,"[Guildwood, Morningside, West Hill]"
M1G,[Woburn]
M1H,[Cedarbrae]


I'm merging our existing dataframe (`toronto_df_wb`) with the bundled neighborhoods dataframe (`new_df`).
Because `toronto_df_wb` has several rows for some PostalCode values, the merge produces duplicate rows, so those duplicates are dropped afterwards.

In [6]:
dff = pd.merge(toronto_df_wb[['PostalCode','Borough']], new_df[['Neighborhood']], left_on='PostalCode', right_index=True)
toronto_df = dff.loc[dff.astype(str).drop_duplicates().index]
toronto_df.head(15)

,PostalCode,Borough,Neighborhood
0,M3A,North York,[Parkwoods]
1,M4A,North York,[Victoria Village]
2,M5A,Downtown Toronto,"[Harbourfront, Regent Park]"
4,M6A,North York,"[Lawrence Heights, Lawrence Manor]"
6,M7A,Queen's Park,[Queen's Park]
7,M9A,Etobicoke,[Islington Avenue]
8,M1B,Scarborough,"[Rouge, Malvern]"
10,M3B,North York,[Don Mills North]
11,M4B,East York,"[Woodbine Gardens, Parkview Hill]"
13,M5B,Downtown Toronto,"[Ryerson, Garden District]"
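The merge-then-deduplicate step can be tried on a toy table. This is a sketch under assumptions: the three sample rows and the names `wb`, `bundled`, `merged`, and `deduped` are hypothetical, and it deduplicates on the `PostalCode` column directly rather than on the stringified rows used in the cell above.

```python
import pandas as pd

# Toy working dataframe with a repeated postal code (illustrative values only)
wb = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Queen's Park"],
})

# Neighborhoods bundled per postal code (the index is PostalCode)
bundled = wb.groupby("PostalCode")["Neighborhood"].agg(", ".join).to_frame()

# Attach the bundled names to every original row ...
merged = pd.merge(
    wb[["PostalCode", "Borough"]],
    bundled,
    left_on="PostalCode",
    right_index=True,
)

# ... then keep only the first row for each postal code
deduped = merged.drop_duplicates(subset="PostalCode").reset_index(drop=True)
print(deduped)
```

Deduplicating on `PostalCode` assumes each postal code maps to a single borough. The `reset_index(drop=True)` call is optional and only closes the gaps left in the row labels.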


### We've got a nice and clean dataframe now.

In [7]:
toronto_df.shape

(103, 3)
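Since `.shape` is an attribute holding a `(rows, columns)` tuple, the row count can be unpacked from it directly. A small sketch with a hypothetical two-row dataframe `df`:

```python
import pandas as pd

# Hypothetical two-row dataframe, used only to show how .shape unpacks
df = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})

n_rows, n_cols = df.shape  # attribute access, no parentheses
print(f"{n_rows} rows, {n_cols} columns")
```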