# Segmenting and Clustering in Toronto
This notebook is a submission for the Week 3 assignment of the IBM Data Science Professional Certificate capstone. The data is scraped from the Wikipedia page listing Canadian postal codes that begin with M, which cover Toronto.

## Section 1: Scraping the data, creating the dataframe, and cleaning the data.
### Scraping the data from the website.

In [74]:
import pandas as pd

# Site URL
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Read every HTML table on the page; read_html returns a list of DataFrames
df = pd.read_html(url, header=0)

In [75]:
len(df)

3

pandas found three tables at this URL. On inspection, the one we want is the first (index 0).
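Selecting by position is brittle if the page layout changes. A minimal sketch of a more defensive selection, assuming the wanted table is the only one that carries all three expected column names; the helper `pick_table` and the small stand-in tables are hypothetical and serve only as an illustration:

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table has all of the columns {sorted(required)}")


# Stand-in for the list that pd.read_html returns (hypothetical data).
tables = [
    pd.DataFrame({"Province": ["Ontario"], "Prefix": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": ["Not assigned", "Parkwoods"],
    }),
]

postal_table = pick_table(tables, ["Postal Code", "Borough", "Neighborhood"])
print(postal_table.shape)
```

With the real scrape, the same call would take the list returned by `pd.read_html` in place of `tables`.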

In [76]:
df[0].head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


We define our dataframe to be the first table.

In [77]:
df = df[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [78]:
df.shape

(180, 3)

### Cleaning and formatting the data.

In [79]:
df.groupby('Borough').count()

Borough,Postal Code,Neighborhood
Central Toronto,9,9
Downtown Toronto,19,19
East Toronto,5,5
East York,5,5
Etobicoke,12,12
Mississauga,1,1
North York,24,24
Not assigned,77,77
Scarborough,17,17
West Toronto,6,6


There are 77 rows whose borough is `Not assigned`. We drop them from the dataframe.

In [80]:
df = df[df['Borough'] != 'Not assigned']

df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [81]:
# Reset the index to start at 0.
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [35]:
# The shape of the resulting dataframe
df.shape

(103, 3)

## Section 2: Adding latitude and longitude to the dataframe

In [40]:
# import the geospatial data csv

geodata = pd.read_csv('Geospatial_Coordinates.csv')
geodata.shape

(103, 3)

This matches the size of the dataframe. Good! Let's join the two tables. First we check that the Postal Code IDs match up.

In [68]:
lista=list(df['Postal Code'].sort_values())
listb=list(geodata['Postal Code'].sort_values())
lista==listb

True

The postal codes match!
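The sorted-list comparison only answers yes or no. As a hedged sketch, a set difference also reports which codes are missing on either side; the two small frames below are hypothetical stand-ins for `df` and `geodata`:

```python
import pandas as pd

# Hypothetical stand-ins for df and geodata.
left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M5A"]})
right = pd.DataFrame({"Postal Code": ["M5A", "M3A", "M9Z"]})

left_codes = set(left["Postal Code"])
right_codes = set(right["Postal Code"])

# Codes present in one table but not the other.
missing_from_right = sorted(left_codes - right_codes)
missing_from_left = sorted(right_codes - left_codes)

# Sets hide repeated codes, so duplicates are checked separately.
has_duplicates = bool(
    left["Postal Code"].duplicated().any()
    or right["Postal Code"].duplicated().any()
)

print(missing_from_right, missing_from_left, has_duplicates)
```

When both difference lists are empty and there are no duplicates, the two tables hold exactly the same postal codes.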

In [82]:
# Join the original dataframe and the geodata dataframe.
df = df.join(geodata.set_index('Postal Code'), on='Postal Code')
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [86]:
df

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


This appears to be what we want for problem 2 of this assignment.
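One way to confirm that no row ended up without coordinates, sketched with `DataFrame.merge` rather than the `join` call used above. The `validate` and `indicator` arguments are standard pandas parameters; the sample frames are hypothetical and reuse two rows from the output shown earlier:

```python
import pandas as pd

codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# validate raises MergeError if either side repeats a postal code;
# indicator adds a _merge column recording where each row came from.
merged = codes.merge(coords, on="Postal Code", how="left",
                     validate="one_to_one", indicator=True)

# Rows that found no partner in coords, and the count of missing coordinates.
unmatched = merged[merged["_merge"] != "both"]
missing_values = int(merged[["Latitude", "Longitude"]].isna().sum().sum())
print(len(unmatched), missing_values)
```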

## Section 3: Cluster the neighborhoods

In [113]:
import numpy as np

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

Let's focus our attention on boroughs containing the name `Toronto`.

In [107]:
toronto_boroughs = ['Downtown Toronto', 'East Toronto', 'West Toronto', 'Central Toronto']
df_toronto=df.loc[df['Borough'].isin(toronto_boroughs)]
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
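The explicit list works well for four boroughs. As an alternative sketch, the same filter can be written as a substring match, assuming every borough of interest contains the literal text `Toronto`; the sample frame is hypothetical:

```python
import pandas as pd

# Hypothetical sample with the same Borough column as df.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M5A", "M4E", "M9A"],
    "Borough": ["North York", "Downtown Toronto", "East Toronto", "Etobicoke"],
})

# Keep rows whose borough name contains the literal text "Toronto".
mask = sample["Borough"].str.contains("Toronto", regex=False)
toronto_only = sample[mask].reset_index(drop=True)
print(toronto_only["Borough"].tolist())
```

The substring version picks up any new borough with `Toronto` in its name automatically, whereas the explicit list makes the selection easier to audit.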


### Create map and markers

In [89]:
# Latitude and Longitude of Toronto
latitude = 43.6532
longitude = -79.3832

In [112]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12, tiles='Stamen Toner')

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'],
                                           df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Define Foursquare credentials

In [93]:
CLIENT_ID = 'JSN5MX1DKF5XI3CXVZADJMU5LZE5FMLT2COF00LRJDFMFWIK' # your Foursquare ID
CLIENT_SECRET = 'ZST1WYPJCG2J2LGQGUER23BPAC1OMF1BKYC4WQKSSRD3WC1T' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
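Hard-coding keys in a shared notebook exposes them to every reader. A minimal sketch of reading them from the environment instead; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions chosen for this example, not part of the Foursquare API:

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the credentials from environment variables (names are assumptions)."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first"
        )
    return client_id, client_secret


# Demonstration with a plain dict standing in for os.environ.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id)
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read the real environment, so the keys never appear in a saved cell.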