# Segmenting and Clustering Neighborhoods in Toronto (part 1)

In [57]:
#!conda install -c conda-forge beautifulsoup4 --yes # Install Beautiful Soup4 package for Web Scraping
#!conda install -c conda-forge html5lib --yes # Install html5lib package for HTML5 parsing

from bs4 import BeautifulSoup # Tool for web scraping
import html5lib # HTML5 parser to parse the Wikipedia page
import requests # Downloads the HTML page we want to scrape

import pandas as pd # Library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Download and Explore the data
Toronto has a total of 11 assigned boroughs and 103 postal codes (each postal code can encompass one or more neighborhoods). To segment the neighborhoods and explore them, we need a dataset that contains the boroughs, the postal codes and the neighborhoods. We will scrape the following Wikipedia page to collect this information: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

### Load and explore the data
First, let's download the content of the HTML page to be scraped.

In [58]:
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text # Fetch the html content

Notice that all the relevant data is inside a **div** whose _id_ is 'mw-content-text'. Let's get the first **table** tag inside that **div** using the BeautifulSoup package for web scraping.

In [68]:
soup = BeautifulSoup(html, "html5lib") # Using a HTML5 parser
table = soup.find('div', id='mw-content-text').table
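
The lookup above can be tried on a small inline document. This is only a sketch: `sample_html` is made-up markup standing in for the real page, and it uses Python's built-in `html.parser` instead of `html5lib` so that it runs without extra installs. It also adds a guard for the case where the **div** or the **table** is missing, which the notebook cell does not check.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the Wikipedia page (illustrative markup only)
sample_html = """
<div id="other"><table><tr><td>ignore me</td></tr></table></div>
<div id="mw-content-text">
  <p>Intro text</p>
  <table><tr><th>Postcode</th></tr><tr><td>M1A</td></tr></table>
  <table><tr><td>second table</td></tr></table>
</div>
"""

# Built-in parser; the notebook itself uses html5lib
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Find the div by its id, then fail loudly if the layout is not what we expect
content = sample_soup.find("div", id="mw-content-text")
if content is None or content.table is None:
    raise ValueError("expected table not found; the page layout may have changed")

# content.table is a shortcut for content.find("table"): the first table inside the div
sample_table = content.table
first_cells = [cell.get_text(strip=True) for cell in sample_table.find_all(["th", "td"])]
print(first_cells)
```

The table in the other **div** and the second table inside 'mw-content-text' are both skipped, because the search starts at the matching **div** and stops at its first **table**.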

Let's see the first two rows...

In [70]:
print(table.find_all('tr')[0].prettify())
print(table.find_all('tr')[1].prettify())

<tr>
 <th>
  Postcode
 </th>
 <th>
  Borough
 </th>
 <th>
  Neighbourhood
 </th>
</tr>

<tr>
 <td>
  M1A
 </td>
 <td>
  Not assigned
 </td>
 <td>
  Not assigned
 </td>
</tr>



Every row of the table (including the header) is inside a **tr** tag, so now we will store all the **tr** tags in a list.

In [71]:
rows = table.find_all('tr') # List of all the rows in the Toronto Postal Code table
rows[:3]

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>]

The next task is to transform these DOM elements into a pandas dataframe. So let's start by creating an empty dataframe.

In [72]:
# define the dataframe columns
colum_names = ['PostalCode', 'Borough', 'Neighborhood']

# instantiate the dataframe
df = pd.DataFrame(columns = colum_names) 

Then let's loop through the rows, collect each one as a dictionary, and build the dataframe from the collected records.

In [73]:
# Start from the second row, since the first row contains the column header names
records = []
for tr in rows[1:]:
    row = {}
    for i, cell in enumerate(tr.find_all('td')): # Loop through every cell in the row
        row[colum_names[i]] = cell.text.strip() # Store the cell text under its column name
    records.append(row)

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
df = pd.DataFrame(records, columns=colum_names)

Quickly examine the resulting dataframe.

In [74]:
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### Process and Clean the DataFrame
Keep only the rows that have an assigned borough. Drop the rows whose borough is Not assigned.

In [75]:
df = df[df['Borough'] != 'Not assigned']
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
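
A minimal sketch of the same boolean-mask filter on a hand-built frame shows why the index in the output starts at 2. The names `toy`, `mask`, `assigned` and `assigned_reset` are illustrative; the four rows are copied from the first rows of the scraped table.

```python
import pandas as pd

# Four rows copied from the head of the scraped table
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# True for rows to keep, False for rows to drop
mask = toy["Borough"] != "Not assigned"
assigned = toy[mask]

# Filtering keeps the original row labels, so the index now starts at 2
print(assigned.index.tolist())

# Optional: renumber the rows from 0 if the old labels are not needed
assigned_reset = assigned.reset_index(drop=True)
print(assigned_reset)
```

The notebook does not reset the index at this point, and it does not need to, because the later `groupby(...).reset_index()` step produces a fresh 0-based index anyway.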


Combine the neighborhoods that belong to the same postal code into one row, with the neighborhoods separated by commas.


In [76]:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
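
To see the grouping step in isolation, here is a small sketch on three rows taken from the scraped table (the frame name `toy` and the result name `combined` are illustrative). It uses `agg(", ".join)`, which for this purpose should behave the same as the `apply(lambda x: ', '.join(x))` call in the cell above.

```python
import pandas as pd

# M5A appears twice, once per neighborhood; M3A appears once
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on both key columns, join the neighborhood names of each group,
# then turn the group keys back into ordinary columns
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

Note that `groupby` sorts the group keys by default, which is why the combined dataframe is ordered by postal code rather than by the order in which rows were scraped.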


If a row has a borough but its neighborhood is Not assigned, then the neighborhood is set to the same value as the borough.

In [77]:
cond = df['Neighborhood'] == 'Not assigned'
df.loc[cond, 'Neighborhood'] = df.loc[cond, 'Borough']
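
The same replacement can be sketched on a two-row frame. The postal code `M9Z` and the borough name `Example Borough` are made up for illustration, since no such row appears in the output shown here. The sketch uses `Series.mask` as an alternative to the `.loc` assignment in the cell above; both should give the same result.

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M3A", "M9Z"],  # M9Z is a made-up code for illustration
    "Borough": ["North York", "Example Borough"],  # second value is hypothetical
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# True where the neighborhood is missing
cond = toy["Neighborhood"] == "Not assigned"

# Where cond is True, take the value from Borough; elsewhere keep the neighborhood
toy["Neighborhood"] = toy["Neighborhood"].mask(cond, toy["Borough"])
print(toy)
```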

Print the shape (number of rows and columns) of the final dataframe.

In [78]:
df.shape

(103, 3)
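
Beyond the row count, a few quick checks can confirm that the cleaning steps did what they were meant to. The helper below is a hypothetical sketch, not part of the notebook: `check_clean` and the two small frames are illustrative names and data.

```python
import pandas as pd

def check_clean(frame):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # After grouping, each postal code should appear exactly once
    if frame["PostalCode"].duplicated().any():
        problems.append("duplicate postal codes")
    # No placeholder values should remain in either column
    if (frame[["Borough", "Neighborhood"]] == "Not assigned").any().any():
        problems.append("'Not assigned' values remain")
    return problems

good = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})
bad = pd.DataFrame({
    "PostalCode": ["M5A", "M5A"],
    "Borough": ["Downtown Toronto", "Not assigned"],
    "Neighborhood": ["Harbourfront", "Regent Park"],
})

print(check_clean(good))
print(check_clean(bad))
```

On the real data, one would call `check_clean(df)` after the last cleaning cell and expect an empty list.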