# Segmenting and Clustering Neighborhoods in Toronto (part 2)

In [50]:
#!conda install -c conda-forge beautifulsoup4 --yes # Install Beautiful Soup4 package for Web Scraping
#!conda install -c conda-forge html5lib --yes # Install html5lib package for HTML5 parsing

from bs4 import BeautifulSoup # Tool for web scraping
import html5lib # HTML5 parser to parse the Wikipedia page
import requests # To download the HTML page we want to scrape

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Download and Explore the data
Toronto has a total of 11 assigned boroughs and 103 postal codes (each postal code can cover one or more neighborhoods). To segment and explore the neighborhoods, we need a dataset that contains the boroughs, the postal codes and the neighborhoods. We will scrape the following Wikipedia page to collect this information: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

### Load and explore the data
First, let's download the HTML content of the page to be scraped.

In [51]:
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text # Fetch the html content

Notice that all the relevant data is inside a **div** whose _id_ is 'mw-content-text'. Let's get the first **table** tag inside that **div** using the BeautifulSoup package for web scraping.

In [52]:
soup = BeautifulSoup(html, "html5lib") # Using an HTML5 parser
table = soup.find('div', id='mw-content-text').table

Every row of the table (including the header) is inside a **tr** tag, so we now store all the **tr** tags in a list.

In [53]:
rows = table.find_all('tr') # List of all the rows in the Toronto Postal Code table
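
To see what these calls return, here is a minimal sketch on a tiny hand-written HTML fragment. The fragment and its values are made up for illustration and only mimic the structure described above. The sketch also swaps in Python's built-in `html.parser` for `html5lib`, so it runs without the extra package.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the structure of the scraped page
sample_html = """
<div id="mw-content-text">
  <table>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </table>
</div>
"""

# Assumption: the built-in parser is used here instead of html5lib
sample_soup = BeautifulSoup(sample_html, "html.parser")

# First table inside the div with id 'mw-content-text'
sample_table = sample_soup.find("div", id="mw-content-text").table

# All rows, header included
sample_rows = sample_table.find_all("tr")

# The header row holds th cells, the data rows hold td cells
header_cells = [th.text.strip() for th in sample_rows[0].find_all("th")]
data_cells = [[td.text.strip() for td in tr.find_all("td")] for tr in sample_rows[1:]]
```

The header row contains only **th** cells, which is why the later loop can skip it with `rows[1:]` and read just the **td** cells.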

The next task is to transform these DOM elements into a pandas dataframe. So let's start by creating an empty dataframe.

In [62]:
# define the dataframe columns
colum_names = ['PostalCode', 'Borough', 'Neighborhood']

# instantiate the dataframe
df = pd.DataFrame(columns = colum_names) 

Then let's loop through the rows, collect each one as a dictionary, and fill the dataframe from the collected records.

In [63]:
# Starting from the second row since the first row contains the column header names
records = []
for tr in rows[1:]:
    row = {}
    for i, cell in enumerate(tr.find_all('td')): # Loop through every cell in the row
        row[colum_names[i]] = cell.text.strip() # Store the stripped text of the td tag under its column name
    records.append(row)

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the records in one step
df = pd.DataFrame(records, columns=colum_names)

### Process and Clean the DataFrame
Keep only the rows that have an assigned borough. Drop the rows whose borough is Not assigned.

In [64]:
df = df[df['Borough'] != 'Not assigned']

Combine the neighborhoods that belong to a given postal code area into one row, with the neighborhoods separated by commas.


In [65]:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()
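
As a small illustration of this grouping step, the sketch below runs the same pattern on a made-up three-row dataframe in which one postal code appears twice. The sample values are chosen only for illustration, and `.agg(', '.join)` is used as an equivalent shorthand for the lambda.

```python
import pandas as pd

# Made-up sample: postal code M5A appears twice, once per neighborhood
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Victoria Village"],
})

# One row per (PostalCode, Borough) pair, neighborhoods joined with ', '
combined = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Note that `groupby` sorts the group keys by default, so the result is ordered by postal code rather than by the original row order.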

If a row has a borough but a Not assigned neighborhood, then the neighborhood is set to the name of the borough.

In [66]:
cond = df['Neighborhood'] == 'Not assigned'
df.loc[cond, 'Neighborhood'] = df.loc[cond, 'Borough']
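
The boolean-mask assignment can be checked on a two-row example. This is only a sketch with illustrative sample rows, not the scraped data.

```python
import pandas as pd

# Illustrative sample: the first row has a borough but no assigned neighborhood
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M4A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Victoria Village"],
})

# True only for rows whose neighborhood is missing
mask = sample["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only
sample.loc[mask, "Neighborhood"] = sample.loc[mask, "Borough"]
print(sample)
```

Because both sides of the assignment are selected with the same mask, pandas aligns them on the index and leaves the other rows untouched.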

Print the shape (number of rows and columns) of the final dataframe.

In [67]:
df.shape

(103, 3)

### Adding Latitude and Longitude information
Let's download the geospatial data and save it as a CSV file called geo_coordinates.csv.

In [18]:
!wget -q -O 'geo_coordinates.csv' http://cocl.us/Geospatial_data
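
The `!wget` line depends on IPython shell magics and on `wget` being installed. Where that is not the case, a plain-Python download is one possible alternative. The sketch below uses only the standard library, and `download_file` is a hypothetical helper name introduced here, not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_file(url, destination):
    """Save the resource at `url` to `destination` and return the path.

    Hypothetical helper: a stand-in for the wget call, not the notebook's method.
    """
    destination = Path(destination)
    urlretrieve(url, str(destination))  # Writes the response body to disk
    return destination


# Example call (performs a network request, so it is left commented out):
# download_file("http://cocl.us/Geospatial_data", "geo_coordinates.csv")
```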

Now that the data is downloaded, let's read it into a *pandas* dataframe.

In [68]:
geo_coor_df = pd.read_csv('geo_coordinates.csv')
geo_coor_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now let's add the spatial coordinates to the previous dataframe

In [74]:
# Use the postal code as the index in both dataframes, then perform an inner join
df_toronto = df.set_index('PostalCode').join(geo_coor_df.set_index('Postal Code'), how = 'inner').reset_index()
df_toronto.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
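
The index-based join used here can also be written with `DataFrame.merge`, which matches on differently named key columns directly. The following is a minimal sketch on a small sample: the coordinates are copied from the output shown earlier, while the `M9Z` row is made up to show that an inner join drops keys without a match on both sides.

```python
import pandas as pd

# Left side: postal codes with boroughs (M9Z is a made-up code with no coordinates)
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})

# Right side: coordinates keyed by 'Postal Code' (M1E has no match on the left)
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

merged = (
    left.merge(
        right,
        left_on="PostalCode",
        right_on="Postal Code",
        how="inner",              # Keep only postal codes present in both frames
        validate="one_to_one",    # Raise if a postal code is duplicated on either side
    )
    .drop(columns="Postal Code")  # The right-hand key column is redundant after the merge
)
print(merged)
```

The `validate` argument is optional. It is included here as a cheap guard, since a duplicated postal code on either side would otherwise silently multiply rows.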


Checking dimensions...

In [73]:
df_toronto.shape

(103, 5)