
<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

</font>
</div>

Before we get the data and start exploring it, let's import the libraries that we will need.

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

<a id='item1'></a>

## 1. Download and Explore Dataset

We will download neighbourhood data for Toronto from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. We start by fetching the contents of the webpage into a variable and checking the HTTP status code (200 means the request succeeded).

In [3]:
# The variable holds the page address, so name it accordingly
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
result = requests.get(url)
result.status_code

200

Initialize a BeautifulSoup object with the text of the webpage.

In [4]:
soup = BeautifulSoup(result.text, 'html.parser')

Get the text of the first table on the webpage, which contains the data we need.

In [18]:
table_text = soup.find_all('table')[0].get_text()
# for line in table_text.split('\n'):
#     print(line)
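
The flattened text returned by `get_text()` only works if every cell ends up on its own line. As a hedged alternative, the table can be walked row by row instead. The sketch below runs on a small hand-written HTML snippet rather than the live page; `sample_html` and the helper `table_rows` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written stand-in for the first table on the page.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_rows(html):
    """Return the first table as a list of rows, each a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are treated alike here.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

rows = table_rows(sample_html)
print(rows)
```

Each inner list already holds one complete row, so no line counting is needed to group cells.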

Create an empty dataframe with the columns 'PostalCode', 'Borough' and 'Neighbourhood'.

In [6]:
column_names = ['PostalCode', 'Borough', 'Neighbourhood'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighbourhood


Populate the dataframe from the table text. Each non-empty line holds one cell, so every three consecutive non-empty lines form one row.

In [7]:
# Collect one dict per table row, then build the dataframe in a single call.
# (DataFrame.append was removed in pandas 2.0.)
rows = []
i = 0
for line in table_text.split('\n'):
    if len(line.strip()) == 0:
        continue
    i = i + 1
    if i == 1:
        postalCode = line.strip()
    elif i == 2:
        borough = line.strip()
    elif i == 3:
        neighborhood = line.strip()
        rows.append({'PostalCode': postalCode,
                     'Borough': borough,
                     'Neighbourhood': neighborhood})
        i = 0
neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


Print the shape of the dataframe.

In [8]:
neighborhoods.shape

(289, 3)

Remove the first row of the table, which contains the column names. We will also remove the rows whose borough is 'Not assigned'.

In [9]:
neighborhoods = neighborhoods[(neighborhoods['PostalCode'] != 'Postcode') & (neighborhoods['Borough'] != 'Not assigned')]
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


Print the shape of the dataframe again to check how many rows were removed.

In [10]:
neighborhoods.shape

(211, 3)

If a row has a borough but its neighbourhood is 'Not assigned', we will use the borough name as the neighbourhood.

In [11]:
neighborhoods = neighborhoods.copy()
not_assigned = neighborhoods['Neighbourhood'] == 'Not assigned'
neighborhoods.loc[not_assigned, 'Neighbourhood'] = neighborhoods.loc[not_assigned, 'Borough']
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
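
None of the first five rows has a 'Not assigned' neighbourhood, so the effect of this step is not visible in the output above. The sketch below shows one way to do the replacement on a toy dataframe; the rows in `toy` are hypothetical and chosen only to illustrate the case.

```python
import pandas as pd

# Hypothetical rows; the second one has a borough but no neighbourhood name.
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Boolean mask of the rows to fix, then copy the borough into those rows only.
missing = toy["Neighbourhood"] == "Not assigned"
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]
print(toy)
```

Because both sides of the assignment use the same mask, pandas aligns the values by index and rows that already have a neighbourhood are left untouched.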


We will check that every 'PostalCode' maps to exactly one borough. The function below prints a message only for postal codes that have more than one borough.

In [12]:
def myFuncToCheckUniquenessOfBorough(df):
    uniqueCount = len(df['Borough'].unique())
    if(uniqueCount != 1):
        print(df['Borough'], " is not unique")
neighborhoods.groupby(['PostalCode']).apply(myFuncToCheckUniquenessOfBorough)
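
The same check can be written without a custom function by counting distinct boroughs per postal code. This is a minimal sketch on hypothetical toy data; `toy`, `borough_counts` and `conflicting` are names introduced for illustration.

```python
import pandas as pd

# Hypothetical sample: M5A appears twice but with the same borough.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Number of distinct boroughs for each postal code.
borough_counts = toy.groupby("PostalCode")["Borough"].nunique()

# Postal codes that map to more than one borough (empty when the data is consistent).
conflicting = borough_counts[borough_counts > 1]
print(conflicting)
```

An empty `conflicting` series means every postal code belongs to a single borough.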

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. In all such cases we combine the rows into a single row, with the neighborhoods separated by commas, and assign the result to a new dataframe.

In [13]:
def myFunc(df):
    # Join every neighbourhood in this postal-code group into one comma-separated string
    newNeighbourhood = ",".join(df['Neighbourhood'])
    d = {"PostalCode": [df['PostalCode'].iloc[0]],
         "Borough": [df['Borough'].iloc[0]],
         "Neighbourhood": [newNeighbourhood]}
    return pd.DataFrame(d)

neighborhoods_grouped = neighborhoods.groupby(['PostalCode']).apply(myFunc)
neighborhoods_grouped.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,PostalCode,Borough,Neighbourhood
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M1B,0,M1B,Scarborough,"Rouge,Malvern"
M1C,0,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,0,M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,0,M1G,Scarborough,Woburn
M1H,0,M1H,Scarborough,Cedarbrae
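
As a hedged alternative to `apply` with a custom function, the comma-joined string can be produced with a plain aggregation, which avoids the extra index level shown above. The sketch uses hypothetical toy rows; `toy` and `combined` are illustrative names, and grouping by both 'PostalCode' and 'Borough' assumes each postal code has a single borough.

```python
import pandas as pd

# Hypothetical sample with one postal code that spans two neighbourhoods.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group on both key columns, join the neighbourhood names, and turn the
# group keys back into ordinary columns.
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
       .agg(",".join)
       .reset_index()
)
print(combined)
```

The result already has a plain integer index and the three expected columns, so no further clean-up is needed in this variant.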


Drop the column named 'PostalCode'. The same values are already stored in the index, and the duplicate column would clash with it when the index is reset.

In [14]:
neighborhoods_grouped.drop(columns=['PostalCode'], inplace=True)
neighborhoods_grouped.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Borough,Neighbourhood
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
M1B,0,Scarborough,"Rouge,Malvern"
M1C,0,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,0,Scarborough,"Guildwood,Morningside,West Hill"
M1G,0,Scarborough,Woburn
M1H,0,Scarborough,Cedarbrae


Reset the index so that 'PostalCode' becomes a regular column again.

In [15]:
neighborhoods_grouped.reset_index(inplace = True)
neighborhoods_grouped.head()

Unnamed: 0,PostalCode,level_1,Borough,Neighbourhood
0,M1B,0,Scarborough,"Rouge,Malvern"
1,M1C,0,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,0,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,0,Scarborough,Woburn
4,M1H,0,Scarborough,Cedarbrae


Remove the column named 'level_1', which was created from the inner index level when the index was reset.

In [16]:
neighborhoods_grouped.drop(columns=['level_1'], inplace=True)
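
The three clean-up steps (drop the duplicate column, reset the index, drop 'level_1') can also be chained into one expression. The sketch below rebuilds a small frame with the same two-level index by hand; `grouped` and `tidy` are hypothetical names, and the two rows are copied from the output shown earlier.

```python
import pandas as pd

# Hand-built frame shaped like the groupby/apply result:
# a (PostalCode, 0) index plus a duplicate PostalCode column.
idx = pd.MultiIndex.from_tuples([("M1B", 0), ("M1G", 0)], names=["PostalCode", None])
grouped = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1G"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighbourhood": ["Rouge,Malvern", "Woburn"],
    },
    index=idx,
)

tidy = (
    grouped.drop(columns=["PostalCode"])       # remove the duplicate column
           .reset_index(level=1, drop=True)    # discard the inner 0 level
           .reset_index()                      # PostalCode becomes a column
)
print(tidy)
```

Discarding the inner level with `drop=True` means the 'level_1' column is never created in the first place.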

Display the first 20 rows of the dataframe.

In [17]:
neighborhoods_grouped.head(20)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
