# Segmenting and Clustering Neighborhoods in Toronto - Part 1

## Introduction

In this assignment, you will explore the neighborhoods in Toronto and group them into clusters. In this first part, we download the dataset and load it into a dataframe, which stores the data to be explored in the following parts.


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download Dataset</a>
 
</font>
</div>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# BeautifulSoup: library for parsing and scraping HTML pages
#!conda install -c conda-forge beautifulsoup4 --yes  # uncomment this line if you haven't completed 
from bs4 import BeautifulSoup as bs

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download Dataset

Starting from the Wikipedia page *List of postal codes of Canada: M*:
1. Download the HTML file at the given link:
     https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

2. Save the file locally

3. Open the file and iterate through its HTML elements to extract the postal codes using the `BeautifulSoup` library

Toronto is divided into boroughs, each of which contains several neighborhoods. In order to segment the neighborhoods and explore them, we will essentially need a dataset that contains the boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

We download the file from the Wikipedia website with a simple `wget` command and save it locally, so that the data can then be read from disk.
So let's go ahead and do that.

In [3]:
!wget -q -O 'canada_postal_code_list_from_wikipedia.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('HTML Postal Code page downloaded!')

HTML Postal Code page downloaded!
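
The `wget` call above relies on a shell command being available in the notebook environment. As a minimal sketch of a portable alternative, the same download could be done with Python's standard-library `urllib`. The function name `download_page` and the `User-Agent` string are assumptions made for illustration; they are not part of the original lab.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def download_page(url, destination, user_agent="toronto-postcodes-notebook/0.1"):
    """Fetch `url`, save the raw bytes to `destination`, and return the byte count."""
    # Hypothetical User-Agent value; some sites reject requests that send none.
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        payload = response.read()
    Path(destination).write_bytes(payload)
    return len(payload)


# Usage (requires network access), mirroring the file name used in this notebook:
# download_page(
#     "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M",
#     "canada_postal_code_list_from_wikipedia.html",
# )
```

Writing the bytes unchanged keeps the saved file identical to what the server sent, so the parsing step can decide on the encoding.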


#### Load the data

Next, let's load the data.

In [4]:
# Read the saved HTML file and parse it
with open("canada_postal_code_list_from_wikipedia.html", encoding="utf-8") as fp:
    soup = bs(fp, 'lxml')

# Get the first table in the page, which holds the postal codes
tagTable = soup.table
# Get the table body
body = tagTable.tbody
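
To make the table-navigation step concrete, here is a minimal sketch that runs on a small hand-written HTML fragment rather than on the downloaded page. The fragment is an assumption about the page layout (a header row of `th` cells followed by rows of `td` cells), the `demo_` names are chosen so that they do not overwrite the notebook's own variables, and the built-in `html.parser` backend is used so that the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the layout the notebook expects:
# a header row of 'th' cells followed by data rows of 'td' cells.
demo_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")
demo_body = demo_soup.table.tbody        # first table in the document, then its body
demo_rows = demo_body.find_all("tr")

# Header cells give the column names; stripped_strings yields the cell texts of a row.
demo_header = [th.get_text(strip=True) for th in demo_rows[0].find_all("th")]
demo_first_row = list(demo_rows[1].stripped_strings)

print(demo_header)
print(demo_first_row)
```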

#### Transform the data into a *pandas* dataframe

The task is essentially transforming this HTML data into a *pandas* dataframe.
So let's start by creating an empty dataframe with just the column names.

In [5]:
# Define the dataframe columns
# get the table column names -> all 'th' tags in the first 'tr' row of the body
colTab = body.tr.find_all('th')
# keep the text of the first three header cells
colNames = [th.get_text(strip=True) for th in colTab[:3]]

# instantiate the dataframe
postcode_df = pd.DataFrame(columns=colNames)
postcode_df

Unnamed: 0,Postcode,Borough,Neighbourhood


Then let's loop through the data and fill the dataframe one row at a time.

In [7]:
# extract all 'tr' rows except the first one (column names)
codesTab = body.find_all('tr')[1:]

rows = []
for code in codesTab:
    # collect the text of the three cells: postcode, borough, neighbourhood
    tabc = ["", "", ""]
    for i, value in enumerate(code.stripped_strings):
        tabc[i] = value
    postcode, borough, neighbourhood = tabc

    # ignore rows whose borough is 'Not assigned'
    if borough == 'Not assigned':
        continue
    # if no neighbourhood is assigned, use the borough name instead
    if neighbourhood == 'Not assigned':
        neighbourhood = borough
    rows.append({'Postcode': postcode,
                 'Borough': borough,
                 'Neighbourhood': neighbourhood})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
postcode_df = pd.DataFrame(rows, columns=colNames)

# Combine rows with same postal code into one row with the neighborhoods separated with a comma 
df = postcode_df.groupby('Postcode', as_index=False).agg({'Borough':'first', 'Neighbourhood':', '.join})
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
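
The `groupby(...).agg(...)` call above is the step that merges rows sharing a postal code. A minimal sketch on a few hand-written toy rows (not read from the scraped page) shows the effect: `'first'` keeps one borough per group, while `', '.join` concatenates the neighbourhood names of the group.

```python
import pandas as pd

# Toy rows for illustration: two rows share the postal code M5A.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# One row per postal code: first borough, neighbourhoods joined with a comma.
combined = toy.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(combined)
```

Here the two `M5A` rows collapse into a single row whose `Neighbourhood` value is `Harbourfront, Regent Park`.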


In [8]:
print ("Toronto postal codes dataframe dimensions = ", df.shape)

Toronto postal codes dataframe dimensions =  (103, 3)
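
Beyond checking the shape, a few quick checks can confirm that the cleaning rules held. The helper below is a hypothetical addition, not part of the original lab; it assumes the column names `Postcode`, `Borough`, and `Neighbourhood` used in this notebook.

```python
import pandas as pd


def check_postcode_frame(frame):
    """Return a list of problems found in a cleaned postal-code dataframe."""
    problems = []
    if frame["Postcode"].duplicated().any():
        problems.append("duplicate postal codes")
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned boroughs")
    if (frame["Neighbourhood"] == "Not assigned").any():
        problems.append("unassigned neighbourhoods")
    return problems


# Toy frame for illustration; in the notebook you would pass `df` instead.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})
print(check_postcode_frame(sample))
```

An empty list means that none of the three checks found a problem.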


### Thank you for this lab!

This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson/) and [Polong Lin](https://www.linkedin.com/in/polonglin/). I hope you found this lab interesting and educational. Feel free to contact us if you have any questions!

This notebook is part of a course on **Coursera** called *Applied Data Science Capstone*. If you accessed this notebook outside the course, you can take this course online by clicking [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2).

<hr>

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).