# Week 3

## Segmenting and Clustering Neighborhoods in Toronto

### Instructions

#### This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, then transforms that data into a pandas dataframe like the one shown below:

#### In this project, I explore, segment, and cluster the neighborhoods of Toronto, Canada. Toronto neighborhood data is not readily available in a structured format, but the Wikipedia page linked above has all the information needed to explore and cluster the neighborhoods.

#### All 3 tasks of scraping, cleaning, and clustering are implemented in this notebook for ease of evaluation.

#### Once the data is in a structured format, I will replicate the analysis that was done on another dataset (the New York dataset).

### Creating the dataframe similar to below
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 of the table below.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making. 
* In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
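
The cleaning rules in the list above can be sketched on a tiny hand-made table. This is only an illustration under stated assumptions: the sample rows are hypothetical values shaped like the Wikipedia table (not the scraped data), and `clean_postal_codes` is a helper name introduced here, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample rows shaped like the Wikipedia table (not the scraped data)
raw = pd.DataFrame({
    "PostalCode":   ["M1A", "M5A", "M5A", "M7A"],
    "Borough":      ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

def clean_postal_codes(df):
    """Apply the three cleaning rules to a raw postal code table."""
    # Rule 1: keep only rows with an assigned borough
    out = df[df["Borough"] != "Not assigned"].copy()
    # Rule 2: a neighborhood that is "Not assigned" takes the name of its borough
    missing = out["Neighborhood"] == "Not assigned"
    out.loc[missing, "Neighborhood"] = out.loc[missing, "Borough"]
    # Rule 3: one row per postal code, neighborhoods joined with a comma
    out = (out.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
              .agg(", ".join)
              .reset_index())
    return out

cleaned = clean_postal_codes(raw)
print(cleaned)
print(cleaned.shape)
```

Filling in the missing neighborhoods before grouping keeps the comma join from ever seeing a "Not assigned" value.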

<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1588464000000&hmac=JOuB1OoX2V8d-sYkNligqHYnrtbWLMxh9JZjrx2roTE" alt="drawing" width="450" align="left"/>

##### Installing and Importing libraries

In [1]:
# Install and import
print("Installing libraries . . .")
print(". . . ")
print("")

!pip install beautifulsoup4
!pip install lxml
!pip install folium
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.display import HTML
from IPython.display import display_html
    
# transforming a json file into a pandas dataframe
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0

# Folium
import folium # plotting library

# BeautifulSoup
from bs4 import BeautifulSoup

# Scikit-learn
from sklearn.cluster import KMeans

# Matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors

print("")
print('Importing . . .')
print(". . .")
print('Libraries imported!')

Installing libraries . . .
. . . 


Importing . . .
. . .
Libraries imported!


##### Data downloading, scraping, and wrangling

This part of the notebook scrapes the data from Wikipedia with Python's BeautifulSoup library. The page title is printed to confirm that the scrape succeeded, followed by the table of postal codes of Toronto, Canada.

In [2]:
# Import 
from IPython.display import display_html

# Data
source   = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup     = BeautifulSoup(source, 'lxml')
table    = str(soup.table)

# Check title
print("HTML Title")
print(soup.title)
print("")

# Display table
print("HTML Table")
display_html(table, raw = True)

HTML Title
<title>List of postal codes of Canada: M - Wikipedia</title>

HTML Table


Postal code,Borough,Neighborhood
M1A,Not assigned,
M2A,Not assigned,
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Regent Park / Harbourfront
M6A,North York,Lawrence Manor / Lawrence Heights
M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
M8A,Not assigned,
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,Malvern / Rouge


##### Converting HTML table to Pandas DataFrame for cleaning and preprocessing

In [3]:
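# Dataframes
from io import StringIO

# Wrap the HTML string in StringIO; newer pandas versions warn when given a literal HTML string
dfs = pd.read_html(StringIO(table))
df  = dfs[0]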

df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


##### Data cleaning and preprocessing

In [4]:
# Drop rows where 'Borough' = "Not assigned"
df1 = df[df.Borough != 'Not assigned']

# Combine Neighborhoods with same Postalcode
df2 = df1.groupby(['Postal code', 'Borough'], sort = False).agg(', '.join)

df2.reset_index(inplace = True)

# Replace Neighborhood name that = "Not assigned" -> names of Borough
df2['Neighborhood'] = np.where(df2['Neighborhood'] == 'Not assigned', df2['Borough'], df2['Neighborhood'])

df2

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Malvern / Rouge
7,M3B,North York,Don Mills
8,M4B,East York,Parkview Hill / Woodbine Gardens
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [5]:
# DataFrame Shape
df2.shape

(103, 3)

##### Importing csv file containing Latitudes and Longitudes for various neighborhoods in Toronto, Canada.

In [129]:
# Latitude and Longitude Data
# DataFrame3 (df3) contains the Latitude and Longitude of each postal code
df3 = pd.read_csv('https://cocl.us/Geospatial_data')

df3.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


##### Merging the two tables to get the Latitudes and Longitudes of the neighborhoods in Toronto, Canada.

In [134]:
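# Merge data
# DataFrame3 contains Latitude and Longitude
# DataFrame4 contains both df2 and df3
# NOTE: join aligns rows on the index, not on the postal code, so the coordinates
# in the output are paired by row position. Merging on the postal code, e.g.
# df2.merge(df3, left_on = 'Postal code', right_on = 'Postal Code'), would match them.
df4 = df2.join(df3)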

# Rename column
df4.rename(columns = {'Postal code': 'Postcode'}, inplace = True)

df4.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Postcode.1,Latitude,Longitude
0,M3A,North York,Parkwoods,M1B,43.806686,-79.194353
1,M4A,North York,Victoria Village,M1C,43.784535,-79.160497
2,M5A,Downtown Toronto,Regent Park / Harbourfront,M1E,43.763573,-79.188711
3,M6A,North York,Lawrence Manor / Lawrence Heights,M1G,43.770992,-79.216917
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,M1H,43.773136,-79.239476
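
`DataFrame.join` aligns rows on the index, so the output above pairs each postal code with whatever coordinates sit at the same row position (M3A appears next to the coordinates of M1B). A key-based merge avoids this. The sketch below uses a few rows copied from the outputs in this notebook; `codes`, `coords`, and `merged` are placeholder names introduced for illustration, not the notebook's dataframes.

```python
import pandas as pd

# A few rows copied from the notebook outputs; variable names are placeholders
codes = pd.DataFrame({
    "Postcode":     ["M3A", "M1B"],
    "Borough":      ["North York", "Scarborough"],
    "Neighborhood": ["Parkwoods", "Malvern / Rouge"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M3A"],
    "Latitude":    [43.806686, 43.784535, 43.753259],
    "Longitude":   [-79.194353, -79.160497, -79.329656],
})

# Match rows by postal code instead of by row position
merged = codes.merge(
    coords.rename(columns={"Postal Code": "Postcode"}),
    on="Postcode",
    how="left",             # keep every cleaned postal code
    validate="one_to_one",  # raise if a postal code repeats on either side
)
print(merged)
```

With `how="left"` the row order of the cleaned table is preserved, and `validate` turns a silent duplication into an explicit error.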


### From here, the data goes through clustering and plotting of the neighborhoods of Toronto, Canada that contain Toronto in their Borough

#### Retrieving all the rows from the DataFrame which contain Toronto in their Borough
##### Borough = a town or district which is an administrative unit. 

In [135]:
# Data
# DataFrame4 contains both df2 and df3
df5 = df4[df4['Borough'].str.contains('Toronto', regex = False)] # regex = False: match 'Toronto' as a plain substring, not a regular expression

df5

Unnamed: 0,Postcode,Borough,Neighborhood,Postcode.1,Latitude,Longitude
2,M5A,Downtown Toronto,Regent Park / Harbourfront,M1E,43.763573,-79.188711
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,M1H,43.773136,-79.239476
9,M5B,Downtown Toronto,"Garden District, Ryerson",M1N,43.692657,-79.264848
15,M5C,Downtown Toronto,St. James Town,M1W,43.799525,-79.318389
19,M4E,East Toronto,The Beaches,M2K,43.786947,-79.385975
20,M5E,Downtown Toronto,Berczy Park,M2L,43.75749,-79.374714
24,M5G,Downtown Toronto,Central Bay Street,M2R,43.782736,-79.442259
25,M6G,Downtown Toronto,Christie,M3A,43.753259,-79.329656
30,M5H,Downtown Toronto,Richmond / Adelaide / King,M3K,43.737473,-79.464763
31,M6H,West Toronto,Dufferin / Dovercourt Village,M3L,43.739015,-79.506944
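
`Series.str.contains` interprets its pattern as a regular expression unless `regex` is the boolean `False`; a quoted string such as `'False'` is truthy and leaves regex matching switched on. The short sketch below shows the difference. The borough names come from the outputs above, while "Stn A PO Boxes" is a hypothetical value added only to make the contrast visible.

```python
import pandas as pd

boroughs = pd.Series(["Downtown Toronto", "North York", "East Toronto", "Scarborough"])

# regex=False requests a plain substring match
mask = boroughs.str.contains("Toronto", regex=False)
toronto_boroughs = boroughs[mask].tolist()
print(toronto_boroughs)

# With regex matching, "." matches any character; with regex=False it is a literal dot
names = pd.Series(["St. James Town", "Stn A PO Boxes"])  # second value is hypothetical
as_regex   = names.str.contains("St.", regex=True).tolist()
as_literal = names.str.contains("St.", regex=False).tolist()
print(as_regex, as_literal)
```

For a pattern like `'Toronto'`, which has no special characters, both modes select the same rows, so the choice only matters once the pattern contains characters such as `.` or `(`.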


#### Visualizing the Neighborhoods of the DataFrame above using Folium
##### *If map is not visible, refer to README.md*

In [143]:
# Location source - Google search "map of toronto location lat and long"
map_of_toronto = folium.Map(location = [43.651070, -79.347015], zoom_start = 10) # zoom_start - Initial zoom level for the map.

# For Loop
for lat, lon, borough, neighborhood in zip(df5['Latitude'], df5['Longitude'], df5['Borough'], df5['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html = True)
    
    # A circle of a fixed size with radius specified in pixels
    folium.CircleMarker([lat, lon],
                        radius       = 5,
                        popup        = label,
                        color        = 'Blue',
                        fill         = True,
                        fill_color   = '#3186cc',
                        fill_opacity = 0.7,
                        parse_html   = False).add_to(map_of_toronto)
    
# Display map
map_of_toronto

#### Using K-means Clustering to cluster the neighborhoods
##### *If map is not visible, refer to README.md*

In [153]:
# Number of clusters
k = 5

# Cluster on the coordinates only (positional axis arguments to drop() no longer work in pandas 2.x)
toronto_clustering = df5[['Latitude', 'Longitude']]
kmeans             = KMeans(n_clusters = k, random_state = 0).fit(toronto_clustering)

# Cluster label assigned to each row
kmeans.labels_

# Adds the labels as the first column of df5.
# Run once only: inserting a column that already exists raises an error on a second run.
# df5.insert(0, 'Cluster Labels', kmeans.labels_)

array([0, 0, 2, 4, 4, 4, 4, 4, 3, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2,
       2, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 3, 1, 1, 1, 1, 1], dtype=int32)
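
To make the labels above easier to interpret, here is a simplified NumPy sketch of the assign-and-update loop behind k-means. It is an illustration only, not scikit-learn's implementation (which adds smarter initialization and convergence checks). `kmeans_sketch` is a hypothetical helper, and the six coordinate pairs are copied from the outputs in this notebook.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Simplified k-means (Lloyd's algorithm): assign, then update, repeated."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random (fancy indexing returns a copy)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# (Latitude, Longitude) pairs copied from the notebook outputs:
# the first three lie further east, the last three further west
points = np.array([
    [43.806686, -79.194353],
    [43.784535, -79.160497],
    [43.763573, -79.188711],
    [43.737473, -79.464763],
    [43.739015, -79.506944],
    [43.782736, -79.442259],
])

labels, centers = kmeans_sketch(points, k=2)
print(labels)
print(centers)
```

The label numbers themselves are arbitrary; only which points share a label is meaningful, which is why cluster numbering can differ between runs or settings.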

In [162]:
# Drop duplicate columns
# Source = https://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns
df5 = df5.loc[:,~df5.columns.duplicated()]

# Display DataFrame
df5

Unnamed: 0,Cluster Labels,Postcode,Borough,Neighborhood,Latitude,Longitude
2,4,M5A,Downtown Toronto,Regent Park / Harbourfront,43.763573,-79.188711
4,4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.773136,-79.239476
9,2,M5B,Downtown Toronto,"Garden District, Ryerson",43.692657,-79.264848
15,0,M5C,Downtown Toronto,St. James Town,43.799525,-79.318389
19,0,M4E,East Toronto,The Beaches,43.786947,-79.385975
20,0,M5E,Downtown Toronto,Berczy Park,43.75749,-79.374714
24,0,M5G,Downtown Toronto,Central Bay Street,43.782736,-79.442259
25,0,M6G,Downtown Toronto,Christie,43.753259,-79.329656
30,3,M5H,Downtown Toronto,Richmond / Adelaide / King,43.737473,-79.464763
31,1,M6H,West Toronto,Dufferin / Dovercourt Village,43.739015,-79.506944


In [165]:
# Create map
# Location source - Google search "map of toronto location lat and long"
map_clusters = folium.Map(location = [43.651070, -79.347015], zoom_start = 10)

# Set color scheme for clusters
x = np.arange(k)
y = [i + x + (i*x)**2 for i in range(k)]

colors_array = cm.rainbow(np.linspace(0, 1, len(y))) # Colormaps
rainbow      = [colors.rgb2hex(i) for i in colors_array]

# Map Markers
markers = []

for lat, lon, neighborhood, cluster in zip(df5['Latitude'], df5['Longitude'], df5['Neighborhood'], df5['Cluster Labels']):
    label = folium.Popup('{}, Cluster {}'.format(neighborhood, cluster), parse_html = True)
    
    folium.CircleMarker([lat, lon],
                        radius       = 5,
                        popup        = label,
                        color        = rainbow[cluster - 1],
                        fill         = True,
                        fill_color   = rainbow[cluster - 1],
                        fill_opacity = 0.7).add_to(map_clusters)
    
# Display Map
map_clusters