# Exploring and Clustering Neighbourhoods in Toronto - IBM Data Science Capstone

This notebook works through IBM's Data Science Capstone Project on Coursera.

In [94]:
import pandas as pd
import numpy as np
print('Hello Capstone Project Course')

Hello Capstone Project Course


### Web Scraping

In [95]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [96]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html = urlopen(url)

In [97]:
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup

In [None]:
# Finding the table

In [116]:
table = soup.find('table')
table

<table class="wikitable">
<tbody><tr>
<th>Postal code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park / Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor / Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park / Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern / Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3B
</td>
<td>North York
</td>
<td>Don Mills
</td></tr>
<tr>
<td>M4B
</td>
<td>East York
<

In [117]:
# finding all the rows in the table
table_rows = table.find_all('tr')

In [118]:
# Making a row list

row_list = []

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    row_list.append(row)
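The loop above also visits the header row, which contains only `<th>` cells and therefore contributes an empty list. A minimal sketch of a variant that skips such rows and strips whitespace while parsing is shown below. The inline `sample_html` string and the `sample_*` names are hypothetical stand-ins for the live Wikipedia page, and the built-in `html.parser` is used only so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row sample mirroring the structure of the Wikipedia table.
sample_html = """
<table class="wikitable">
<tr><th>Postal code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

sample_rows = []
for tr in sample_table.find_all("tr"):
    # get_text(strip=True) removes the trailing newline inside each cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so its list is empty
        sample_rows.append(cells)

print(sample_rows)
```

With this approach the resulting dataframe would have no empty first row, so the later step that drops row 0 would not be needed.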

In [189]:
# Importing this into a Data Frame

df = pd.DataFrame(row_list)

In [190]:
df

Unnamed: 0,0,1,2
0,,,
1,M1A\n,Not assigned\n,\n
2,M2A\n,Not assigned\n,\n
3,M3A\n,North York\n,Parkwoods\n
4,M4A\n,North York\n,Victoria Village\n
...,...,...,...
176,M5Z\n,Not assigned\n,\n
177,M6Z\n,Not assigned\n,\n
178,M7Z\n,Not assigned\n,\n
179,M8Z\n,Etobicoke\n,Mimico NW / The Queensway West / South of Bloo...


In [191]:
# Strip the trailing newline from every cell in the three columns
for i in range(3):
    df[i] = df[i].str.strip('\n')

In [122]:
df

Unnamed: 0,0,1,2
0,,,
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
...,...,...,...
176,M5Z,Not assigned,
177,M6Z,Not assigned,
178,M7Z,Not assigned,
179,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


### Assignment requirements for the dataframe:
The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
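The first three requirements can be sketched on a small, made-up sample. The column names follow the assignment text; the `raw` data, in which M5A appears twice, is a hypothetical illustration rather than the scraped table.

```python
import pandas as pd

# Hypothetical sample in which postal code M5A is listed twice.
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M3A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Parkwoods"],
})

# Keep only the rows that have an assigned borough.
assigned = raw[raw["Borough"] != "Not assigned"]

# Combine rows sharing a postal code, joining their neighborhoods with commas.
# sort=False keeps the postal codes in order of first appearance.
combined = (
    assigned
    .groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
print(combined.shape)
```

Grouping on both `PostalCode` and `Borough` assumes that a postal code never spans two boroughs; if it could, grouping on `PostalCode` alone and aggregating `Borough` separately would be the safer choice.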

In [192]:
# Need to rename the columns with the table headers
soup.find_all('th')

[<th>Postal code
 </th>,
 <th>Borough
 </th>,
 <th>Neighborhood
 </th>,
 <th class="navbox-title" style="font-size:110%"><a href="/wiki/Postal_codes_in_Canada" title="Postal codes in Canada">Canadian postal codes</a>
 </th>]

In [193]:
# Collect the text of every header cell on the page
t_headers = [t_head.text for t_head in soup.find_all('th')]
t_headers

['Postal code\n', 'Borough\n', 'Neighborhood\n', 'Canadian postal codes\n']

In [194]:
# Limiting it to what is required
t_headers = t_headers[:3]
t_headers

['Postal code\n', 'Borough\n', 'Neighborhood\n']

In [195]:
#Cleaning it and saving the names to header_name
header_name = []
for name in t_headers:
    header_name.append(name.strip('\n'))
    
header_name

['Postal code', 'Borough', 'Neighborhood']

In [196]:
df.columns = header_name

In [197]:
df

Unnamed: 0,Postal code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
...,...,...,...
176,M5Z,Not assigned,
177,M6Z,Not assigned,
178,M7Z,Not assigned,
179,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [198]:
# replace field that's entirely space (or empty) with NaN

df = df.replace(r'^\s*$', np.nan, regex=True)
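As a rough alternative, the newline stripping and the empty-to-NaN replacement can be done together in two vectorised steps. The `sample` frame below is a hypothetical two-row stand-in for the scraped data, and the sketch assumes every column holds strings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw scraped rows.
sample = pd.DataFrame([
    ["M1A\n", "Not assigned\n", "\n"],
    ["M3A\n", "North York\n", "Parkwoods\n"],
])

# Strip surrounding whitespace from every column in one pass.
cleaned = sample.apply(lambda col: col.str.strip())

# Turn cells that are empty (or whitespace only) into NaN.
cleaned = cleaned.replace(r"^\s*$", np.nan, regex=True)

print(cleaned)
```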

In [199]:
df

Unnamed: 0,Postal code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
...,...,...,...
176,M5Z,Not assigned,
177,M6Z,Not assigned,
178,M7Z,Not assigned,
179,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


In [200]:
# Drop the first row, which is empty because the table's header row contains no <td> cells
df.drop(0, axis = 0, inplace = True)

In [201]:
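# Keep only rows with an assigned borough; .copy() avoids a SettingWithCopyWarning on later column assignments
df_final = df[df['Borough'] != 'Not assigned'].copy()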
df_final

Unnamed: 0,Postal code,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront
6,M6A,North York,Lawrence Manor / Lawrence Heights
7,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
161,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
166,M4Y,Downtown Toronto,Church and Wellesley
169,M7Y,East Toronto,Business reply mail Processing CentrE
170,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


### More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [204]:
# The Wikipedia table already lists each postal code only once, so no rows need combining; only the ' /' separators are replaced with commas
df_final['Neighborhood'] = df_final['Neighborhood'].str.replace(' /', ',')

In [203]:
df_final

Unnamed: 0,Postal code,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
161,M8X,Etobicoke,"The Kingsway, Montgomery Road , Old Mill North"
166,M4Y,Downtown Toronto,Church and Wellesley
169,M7Y,East Toronto,Business reply mail Processing CentrE
170,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
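The output above still shows a stray space before one comma ("Montgomery Road , Old Mill North"). One plausible cause, assumed here rather than verified against the page, is irregular spacing around a slash in the source. A minimal sketch that normalises any whitespace around the separator with a regular expression:

```python
import pandas as pd

# Hypothetical values; the second one has a double space before a slash.
neighborhoods = pd.Series([
    "Regent Park / Harbourfront",
    "The Kingsway / Montgomery Road  / Old Mill North",
])

# Replace a slash and any whitespace around it with a comma and one space.
normalized = neighborhoods.str.replace(r"\s*/\s*", ", ", regex=True)

print(normalized.tolist())
```

Passing `regex=True` explicitly matters because the default for `Series.str.replace` differs between pandas versions.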


### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [205]:
df_final[df_final['Neighborhood'] == 'Not assigned']

Unnamed: 0,Postal code,Borough,Neighborhood


In [206]:
# There are no cells with a Not assigned neighborhood

In [207]:
df_final[df_final['Neighborhood'].isnull()]

Unnamed: 0,Postal code,Borough,Neighborhood


In [208]:
# Nor are there any cells with a missing value for a neighborhood
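Neither check finds anything in the current table, so no fallback is needed here. If a future version of the page did contain such rows, the requirement could be met along the lines of this sketch; the `sample` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one 'Not assigned' neighborhood and one missing value.
sample = pd.DataFrame({
    "Borough": ["Queen's Park", "North York", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Parkwoods", np.nan],
})

# Flag neighborhoods that are either missing or explicitly 'Not assigned'.
missing = sample["Neighborhood"].isna() | (sample["Neighborhood"] == "Not assigned")

# Fall back to the borough name for the flagged rows.
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]

print(sample)
```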

In [209]:
df_final.shape

(103, 3)