# <center>Toronto Neighbourhood (Segmenting and Cluseter)</center>

#### This notebook is created to explore and cluster the neighbourhoods in Toronto.
#### It consists of below sections: 
### Section 1: Read Wikipedia Page and Create Dataframe
> ###### (1) Create Notebood
> ##### (2) Scrape Wikipedia Page
> ##### (3) Create Dataframe with three columns (PostCode, Borough and Neighbourhood)
> ##### (4) Those records are ignored where Borough is not assigned
> ##### (5) PostCodes are unique, and Neighbourhoods are combined for each PostCode in one cell, separated by coma(,)
> ##### (6) For the records where Borough is assigned but Neighbourhood is not assigned, in such scenarion Neighbourhood is assigned with the same value as Borough.
> ##### (7) In the end of Part1 Number of rows and columns are displayed for Data Frame
### Section 2: Add Latitude and Longitude to Dataframe
> ###### (1) Get data file with Postal Codes and latitude and longitude
> ###### (2) Merge tables to add Latitude and Longitude to Postal Codes and Neighbourhood data
### Section 3: Explore and Clustering the Toronto Neighbourhood
> ###### (1) Select the records from Dataframe where Borough contains Toronto.
> ###### (2) Create a Map of Toronto and Mark Boroughs on the map
> ###### (3) Explore the Neighbourhoods with Foursquare API
> ###### (4) Cluster the Boroughs based on Latitude and Longitude

## Section 1
#### (1) Notebook Created
Importing required libraries

In [1]:
# Importing the required libraries
import requests
import json
import random
#import geocoder
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from IPython.display import display_html 
from pandas.io.json import json_normalize
!conda install -c conda-forge folium=0.5.0 --yes 
import folium
from sklearn.cluster import KMeans
import matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    certifi-2019.9.11          |           py36_0         147 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    altair-3.2.0               |           py36_0         770 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    ca-certificates-2019.9.11  |       hecc5488_0         144 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.2 MB

The following NEW packages will be 

#### (2) Scraping Wikipedia Page

In [2]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
BS = BeautifulSoup(website_url, 'lxml')
FSA_table = str(BS.table)
display_html(FSA_table, raw=True)

Postcode,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M5A,Downtown Toronto,Regent Park
M6A,North York,Lawrence Heights
M6A,North York,Lawrence Manor
M7A,Queen's Park,Not assigned
M8A,Not assigned,Not assigned


#### (3) Read data and create dataframe

In [3]:
T_data=pd.read_html(FSA_table)
Toronto_df=T_data[0]
#Toronto_df

#### (4) Ignore records where Borough is not assigned

In [4]:
#Clean-up drop 'Not assigned' Borough
Toronto_df.drop(Toronto_df[Toronto_df['Borough'] == 'Not assigned'].index, inplace=True)
#Reseting index
Toronto_df=Toronto_df.reset_index(drop=True)
Toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


#### (5) Grouping all the Neighborhoods which are under same postal code Sepatrated by coma(,), Postal codes are unique.

In [5]:
Toronto_df=Toronto_df.groupby(['Postcode', 'Borough'], sort=False).agg(lambda x: ','.join(x))
Toronto_df.reset_index(inplace=True)

#### (6) Assign value of Borough to Neighbourhood in case Neighbourhood is not assigned

In [6]:
Toronto_df['Neighbourhood'] = np.where(Toronto_df['Neighbourhood'] == "Not assigned", Toronto_df['Borough'], Toronto_df['Neighbourhood'])
#update the Column name for Postal codes to PostalCodes as advised in requirements
Toronto_df.rename(columns={'Postcode':'PostalCode'}, inplace=True)
Toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


#### (7) Number of Rows and Columns in Data Frame

In [7]:
Toronto_df.shape

(103, 3)

## Section 2
#### In this section we will add latitude and longitude to the dataframe so that same can be used for clustering and plotting
> #### (1) Get data file with Postal Codes and latitude and longitude
> #### (2) Merge tables to add Latitude and Longitude to Postal Codes and Neighbourhood data
> In order to merge two data frames (Toronto_df and Lat_Long_df) identifiers for Postal Code will be alligned. Then two tables will be merged in order to produce new dataframe with fields (PostalCode, Borough, Neighbourhood, Latitude, Longitude)

> #### As geocoder is not working CSV file is being used to get Latitude and Longitude

#### (1) Get data file and create dataframe with Latitude and longitudes

In [8]:
Lat_Long_df = pd.read_csv('http://cocl.us/Geospatial_data')
Lat_Long_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Change the name of Field Postal Code in order to align it to Toronto_df. It is required for merging two tables.

In [9]:
Lat_Long_df.rename(columns={'Postal Code' : 'PostalCode'}, inplace=True)
Lat_Long_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### (2) Merge two tables to produce table with fileds (PostalCode, Borough, Neighbourhood, Latitude, Longitude)

In [10]:
Toronto_ll_df = pd.merge(Toronto_df, Lat_Long_df, on='PostalCode')
Toronto_ll_df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


## Section 3
#### In this section we will explore Toronto neighbourhood and plot the clusters on map
> ###### (1) Select the records from Dataframe where Borough contains Toronto.
> ###### (2) Create a Map of Toronto and Mark Boroughs on the map
> ###### (3) Explore the Neighbourhoods with Foursquare API
> ###### (4) Cluster the Neighbourhoods based on Latitude and Longitude

#### (1) Select records which contain Toronto in Borough

In [11]:
Toronto_ll_df = Toronto_ll_df[Toronto_ll_df['Borough'].str.contains('Toronto', regex=False)]
Toronto_ll_df = Toronto_ll_df.reset_index(drop=True)
Toronto_ll_df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


#### (2) Create map of Toronto

 we will use mean of Latitude in Toronto_ll_df as Latitude and mean of Longitude in Toronto_ll_df as Map Latitude and Longitude

In [12]:
Latitude = Toronto_ll_df['Latitude'].mean()
Longitude =  Toronto_ll_df['Longitude'].mean()
print ('Toronto Latitude : ', Latitude)
print ('Toronto Longitude : ', Longitude)

Toronto Latitude :  43.66726218421053
Toronto Longitude :  -79.38988323421052


Create the map of Toronto

In [13]:
Toronto_Map = folium.Map(location=[Latitude, Longitude], zoom_start=12)
#adding marker
for lat, long, label in zip(Toronto_ll_df['Latitude'], Toronto_ll_df['Longitude'], Toronto_ll_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=True        
    ).add_to(Toronto_Map)
Toronto_Map

#### (4) Exploring Toronto using Foursquare API

In [14]:
# The code to be removed by Watson Studio.

CLIENT_ID = 'ZOWL1P3X0ZFVGKRRCIOXJ25WCCSOKZM1KZCRVUMGICLDBWW4' # your Foursquare ID
CLIENT_SECRET = 'WJ5JZ5SM1QWQVPTZYSN4UQFKUDE104NC5AS0FALWDBONRKRI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version 

In [15]:
# We will define a function to explore Neighbourhoods as same procedure is to be applied for all the neighbourhoods
def exp_neighbourhood(Neighbourhood_name, latitude, longitude, radius=500, limit=100):
    venues_list=[]
    for name, lat, long in zip(Neighbourhood_name, latitude, longitude):
        #print(Neighbourhood_name)
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            long,
            radius,
            limit
        )
    #Make Get request
    results=requests.get(url).json()['response']['groups'][0]['items']
    
    #Add to Venue list
    venues_list.append([(
        name, 
        lat, 
        long, 
        v['venue']['name'], 
        v['venue']['location']['lat'],
        v['venue']['location']['lng'],
        v['venue']['categories'][0]['name']
    )for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood',
                             'Neighbourhood Latitude',
                             'Neighbourhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category'
                            ]
    return(nearby_venues)

In [16]:
Toronto_Venues=exp_neighbourhood(Neighbourhood_name = Toronto_ll_df['Neighbourhood'],
                                 latitude = Toronto_ll_df['Latitude'], 
                                 longitude = Toronto_ll_df['Longitude']
                                )
Toronto_Venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,Rorschach Brewing Co.,43.663483,-79.319824,Brewery
1,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,Leslieville Farmers Market,43.664901,-79.319784,Farmers Market
2,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,The Sidekick,43.664484,-79.325162,Comic Shop
3,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,Chino Locos,43.664653,-79.325584,Burrito Place
4,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,Queen Margherita Pizza,43.664685,-79.324164,Pizza Place


In [17]:
Toronto_Venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Business Reply Mail Processing Centre 969 Eastern,16,16,16,16,16,16


Explore various categories of venues

In [18]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_Venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighborhood'] = Toronto_Venues['Neighbourhood'] 

# move neighborhood column to the first column
cols=list(Toronto_onehot.columns.values)
cols.pop(cols.index('Neighborhood'))
Toronto_onehot=Toronto_onehot[['Neighborhood']+cols]

# rename Neighborhood for Neighbourhood so that future merge works
Toronto_onehot.rename(columns = {'Neighborhood': 'Neighbourhood'}, inplace = True)
Toronto_onehot

Unnamed: 0,Neighbourhood,Auto Workshop,Brewery,Burrito Place,Butcher,Comic Shop,Farmers Market,Fast Food Restaurant,Garden,Garden Center,Light Rail Station,Park,Pizza Place,Restaurant,Skate Park,Spa,Yoga Studio
0,Business Reply Mail Processing Centre 969 Eastern,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Business Reply Mail Processing Centre 969 Eastern,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,Business Reply Mail Processing Centre 969 Eastern,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,Business Reply Mail Processing Centre 969 Eastern,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Business Reply Mail Processing Centre 969 Eastern,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
5,Business Reply Mail Processing Centre 969 Eastern,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
6,Business Reply Mail Processing Centre 969 Eastern,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
7,Business Reply Mail Processing Centre 969 Eastern,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
8,Business Reply Mail Processing Centre 969 Eastern,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
9,Business Reply Mail Processing Centre 969 Eastern,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### (4) Clustering of Neighbourhood

In [19]:
Toronto_ll_df.columns

Index(['PostalCode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')

Create lables using kmeans

In [20]:
k=5
Toronoto_Clustering = Toronto_ll_df.drop(['PostalCode', 'Borough', 'Neighbourhood'],axis=1)
kmeans=KMeans(n_clusters = k, random_state=0).fit(Toronoto_Clustering)
kmeans.labels_


array([0, 0, 0, 4, 0, 0, 1, 0, 2, 0, 1, 4, 0, 1, 4, 0, 4, 3, 3, 3, 3, 2,
       3, 1, 2, 3, 1, 2, 3, 1, 3, 0, 0, 0, 0, 0, 0, 4], dtype=int32)

Insert cluster labeles to dataframe

In [21]:
Toronto_ll_df.insert(0, 'ClusterLabel', kmeans.labels_)
Toronto_ll_df

Unnamed: 0,ClusterLabel,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,0,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
1,0,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
2,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,1,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,0,M5H,Downtown Toronto,"Adelaide,King,Richmond",43.650571,-79.384568
8,2,M6H,West Toronto,"Dovercourt Village,Dufferin",43.669005,-79.442259
9,0,M5J,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.640816,-79.381752


In [22]:
Latitude = Toronto_ll_df['Latitude'].mean()
Longitude =  Toronto_ll_df['Longitude'].mean()
print ('Toronto Latitude : ', Latitude)
print ('Toronto Longitude : ', Longitude)

Toronto Latitude :  43.66726218421053
Toronto Longitude :  -79.38988323421052


Clustering on Map

In [23]:
Cluster_Map = folium.Map(location = [Latitude, Longitude], zoom_start=11)

#Set colour scheme for clusters
x=np.arange(k)
y=[i+x+(i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0,1,len(y)))
rainbow=[colors.rgb2hex(i) for i in colors_array]

#add markers to the map
for lat, long, neighbourhood, cluster in zip(Toronto_ll_df['Latitude'], 
                                             Toronto_ll_df['Longitude'], 
                                             Toronto_ll_df['Neighbourhood'], 
                                             Toronto_ll_df['ClusterLabel']
                                            ):
    label = folium.Popup('Cluster' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7
    ).add_to(Cluster_Map)
Cluster_Map