## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

This assignment requires the following:

<ol>
    <li>Create a new Notebook for this assignment
    <li>Write code to scrape the Wikipedia table of postal codes and transform the data into a pandas dataframe
    <li>The dataframe must meet these requirements:
        <ul><li>It consists of three columns: PostalCode, Borough, and Neighborhood
            <li>Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned
            <li>More than one neighborhood can exist in one postal code area. Combine these rows into one row with the neighborhoods separated by a comma, as shown in row 11 of the assignment table
            <li>If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
            <li>Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making
            <li>In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe
        </ul>
    <li>Get the latitude and longitude of each Toronto postal code and add them to the neighborhood dataframe
    <li>Explore and cluster neighborhoods in Toronto
        <ul><li>Reduce the data to only boroughs that contain the word Toronto
            <li>Replicate the same analysis done with the New York City data
            <li>Add enough Markdown cells to explain what you decided to do and to report any observations you make
            <li>Generate maps to visualize your neighborhoods and how they cluster together
        </ul>
    <li>Submit a link to your Notebook on your GitHub repository.
</ol>

In [1]:
## Install and Import Required Libraries

!pip install beautifulsoup4
!pip install lxml
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

import random # library for random number generation

import requests # library to handle HTTP requests
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

# transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize

from bs4 import BeautifulSoup # HTML parser used for scraping
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from sklearn.cluster import KMeans # k-means clustering

import folium # plotting library for maps
import matplotlib.cm as cm
import matplotlib.colors as colors

# libraries for displaying images and HTML
from IPython.display import HTML, Image, display_html

print('Folium installed')
print('Libraries imported.')


## 2. Convert the Wikipedia Table of Canadian Postal Codes into a Dataframe

Steps:
<ul>
    <li>Scrape the table from the Wikipedia page of Canadian postal codes
    <li>Transform the HTML data into a pandas dataframe
</ul>
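
As a dependency-light illustration of what the scrape-and-transform step does, the sketch below parses a small inline table with Python's standard-library `HTMLParser` (a substitute for BeautifulSoup, used only here) and builds a dataframe from the rows. The `TableParser` class and the `sample_html` string are hypothetical names introduced for this example; the notebook itself uses BeautifulSoup and `pd.read_html` on the live page.

```python
from html.parser import HTMLParser

import pandas as pd


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical miniature of the postal-code table (not fetched from Wikipedia)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)

# The first parsed row holds the column names; the rest are data rows
header, *body = parser.rows
df_sample = pd.DataFrame(body, columns=header)
print(df_sample)
```

In practice `pd.read_html` does this work in a single call, which is why the notebook relies on it.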

In [2]:
# Use BeautifulSoup to scrape the table from the Wikipedia URL
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
response.raise_for_status() # stop early if the page could not be fetched
soup = BeautifulSoup(response.text, 'lxml')
print(soup.title)

# soup.table is the first <table> on the page, which holds the postal codes
tab = str(soup.table)

display_html(tab, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>


Postal Code,Borough,Neighborhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,Not assigned
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"


In [3]:
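# Convert HTML table to pandas dataframe
from io import StringIO

# read_html returns a list with one dataframe per <table>;
# wrapping the markup in StringIO avoids the deprecation warning
# that newer pandas versions raise for literal HTML strings
df_init = pd.read_html(StringIO(tab))
df_base = df_init[0]

df_base.head()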

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


## 3. Preprocess and Clean Dataframe

The dataframe must meet these requirements:
<ul>
    <li>It consists of three columns: PostalCode, Borough, and Neighborhood
    <li>Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned
    <li>More than one neighborhood can exist in one postal code area. Combine these rows into one row with the neighborhoods separated by a comma, as shown in row 11 of the assignment table
    <li>If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
    <li>Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making
    <li>In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe
</ul>
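
The cleaning rules above can be checked on a tiny hand-made dataframe before running them on the full table. The sketch below is illustrative only: the rows in `raw` are hypothetical and use a long format in which one postal code (M5A) appears on two rows, which is one way the source table can be laid out. The "Not assigned" neighborhood is filled in before grouping, so the placeholder never ends up inside a joined string.

```python
import pandas as pd

# Hypothetical rows, for illustration only
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "North York", "Downtown Toronto",
                    "Downtown Toronto", "Downtown Toronto"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park",
                         "Harbourfront", "Not assigned"],
    }
)

# Keep only rows with an assigned borough
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# A missing neighborhood falls back to the borough name
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# One row per postal code, with neighborhoods joined by ", "
clean = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(clean)
print(clean.shape)
```

Passing `sort=False` keeps the postal codes in their order of first appearance, matching the order of the source table.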

In [4]:
# Drop rows whose Borough is 'Not assigned'
df_not_assigned = df_base[df_base.Borough != 'Not assigned']

# Combine neighborhoods that share the same postal code and borough
df_toronto = df_not_assigned.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df_toronto.reset_index(inplace=True)

# Replace neighborhoods that are 'Not assigned' with the name of the borough
df_toronto['Neighborhood'] = np.where(df_toronto['Neighborhood'] == 'Not assigned',df_toronto['Borough'], df_toronto['Neighborhood'])

# Rename 'Postal Code' to 'PostalCode' to match the required column name
df_toronto.rename(columns={'Postal Code':'PostalCode'},inplace=True)

df_toronto

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [5]:
# Shape of dataframe
df_toronto.shape

(103, 3)

First submission point in the Notebook
___

## 4. Get Latitude and Longitude and Add to Dataframe

Use the Geocoder package or the Geospatial_data CSV file to add latitude and longitude to the Toronto neighborhood dataframe. This notebook uses the CSV file.
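
When joining coordinates onto the neighborhood table, it can help to verify that every postal code found a match. The sketch below is one way to do that, assuming a left join with pandas' `validate` and `indicator` options. The first two coordinate rows are copied from the notebook output, while `M9Z` is a hypothetical code with no coordinates, included to show how a gap would surface.

```python
import pandas as pd

neighborhoods = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M9Z"],  # M9Z is hypothetical
        "Borough": ["North York", "North York", "Etobicoke"],
        "Neighborhood": ["Parkwoods", "Victoria Village", "Hypothetical"],
    }
)
coordinates = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# how="left" keeps every neighborhood row, matched or not;
# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column that records where each row came from
merged = neighborhoods.merge(
    coordinates,
    on="PostalCode",
    how="left",
    validate="one_to_one",
    indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

The default inner join used by `pd.merge` silently drops unmatched rows, so comparing row counts before and after the merge is a lighter alternative to this check.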

In [6]:
# Import the Geospatial_data CSV file (merged into the neighborhood dataframe in the next cell)

df_location = pd.read_csv('https://cocl.us/Geospatial_data')
df_location.rename(columns={'Postal Code':'PostalCode'},inplace=True)

df_location.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
# Inner join on PostalCode: only postal codes present in both dataframes are kept
df_combined = pd.merge(df_toronto,df_location,on='PostalCode')

df_combined

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


Second submission point in the Notebook
___

## 5. Explore and Cluster Neighborhoods in Toronto

Steps:
<ul>
    <li>Reduce the data to only boroughs that contain the word Toronto
    <li>Replicate the same analysis done with the New York City data
    <li>Add enough Markdown cells to explain what you decided to do and to report any observations you make
    <li>Generate maps to visualize your neighborhoods and how they cluster together
</ul>
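
The number of clusters is a free choice in k-means. One common heuristic, sketched below as an assumption rather than as part of the assignment, is to fit the model for several values of k and compare the inertia (the within-cluster sum of squared distances), looking for the point where adding clusters stops helping much. The coordinates are the ten latitude/longitude pairs shown in the notebook output; `points` and `inertia` are names introduced for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude pairs copied from the notebook output
points = np.array([
    [43.654260, -79.360636],
    [43.662301, -79.389494],
    [43.657162, -79.378937],
    [43.651494, -79.375418],
    [43.676357, -79.293031],
    [43.644771, -79.373306],
    [43.657952, -79.387383],
    [43.669542, -79.422564],
    [43.650571, -79.384568],
    [43.669005, -79.442259],
])

# Fit k-means for k = 1..6 and record the inertia of each fit
inertia = {}
for n_clusters in range(1, 7):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    inertia[n_clusters] = model.inertia_

for n_clusters, value in inertia.items():
    print(n_clusters, f"{value:.6f}")
```

Clustering on raw degrees treats one degree of latitude and one degree of longitude as equal distances, which is only an approximation at Toronto's latitude, so the resulting clusters are best read as a rough spatial grouping.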

In [8]:
# Create a new dataframe with only Boroughs which contain 'Toronto'

df_reduced = df_combined[df_combined['Borough'].str.contains('Toronto',regex=False)]

df_reduced

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


In [9]:
# Visualize only the boroughs that contain 'Toronto' using Folium

map_toronto = folium.Map(location=[43.651070,-79.347015], zoom_start=11, width='75%', height='75%')

# Add one circle marker per postal code, labeled with its neighborhoods and borough
for lat,lng,borough,neighborhood in zip(df_reduced['Latitude'],df_reduced['Longitude'],df_reduced['Borough'],df_reduced['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

# Display map
map_toronto

In [10]:
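# Cluster neighborhoods using k-means on latitude and longitude

k = 5
toronto_clustering = df_reduced.drop(columns=['PostalCode','Borough','Neighborhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)

# Work on an explicit copy so that adding a column to the filtered dataframe
# does not trigger a SettingWithCopyWarning
df_reduced = df_reduced.copy()
df_reduced.insert(0, 'Cluster Labels', kmeans.labels_)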

df_reduced

Unnamed: 0,Cluster Labels,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,0,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,0,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,3,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,0,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


In [11]:
# Visualize a second map with the neighborhoods colored by cluster

map_clusters = folium.Map(location=[43.651070,-79.347015], zoom_start=11, width='75%', height='75%')

# Set colors for clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add color markers
for lat, lon, neighborhood, cluster in zip(df_reduced['Latitude'], df_reduced['Longitude'], df_reduced['Neighborhood'], df_reduced['Cluster Labels']):
    label = 'Cluster {}, {}'.format(cluster, neighborhood)
    label = folium.Popup(label, parse_html=True)
    # cluster-1 maps label 0 to the last color through negative indexing
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

# Display map
map_clusters

Third submission point in the Notebook
___

## 6. Screenshots of Toronto Borough Maps

The maps above may not render on GitHub. They should look like the following screenshots:

Map 1 = Toronto Inner Boroughs

![Map1_Toronto_Boroughs.png](Map1_Toronto_Boroughs.png)

Map 2 = Toronto Inner Boroughs (Clustered)

![Map2_Toronto_Boroughs_Clustered.png](Map2_Toronto_Boroughs_Clustered.png)