# Segmenting and Clustering Neighborhoods in Toronto
This is part of the "Applied Data Science Capstone" project.

In this notebook we will merge the Postcode & geographical location data we acquired earlier with info about venues in Toronto.

These data will be used to cluster Toronto neighbourhoods and gain insight into the commercial life of Toronto. 

## Importing relevant packages

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import folium # map making library
import requests # fetching http 
from sklearn.cluster import KMeans # clustering

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

## Foursquare secrets
Below are fields that you must fill with your personal CLIENT_ID & CLIENT_SECRET in order to use the Foursquare API.

In [2]:
CLIENT_ID = 'NONE'
CLIENT_SECRET = 'NONE'
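
To avoid pasting secrets into the notebook itself, the two values could be read from environment variables. This is a minimal sketch; the helper `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions made for illustration, not something the Foursquare API requires.

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret
```

With both variables set, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would replace the two assignments above.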

## Loading data
Below we read the CSV we built before, containing the Postcode, Bourough, Neighbourhood, Latitude & Longitude for each neighbourhood in Toronto.

In [3]:
df = pd.read_csv('../input/Toronto Postcodes and coordinates.csv',
                 usecols=['Postcode', 'Bourough', 'Neighbourhood', 'Latitude', 'Longitude'])
df.head()

Unnamed: 0,Postcode,Bourough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752435,-79.329268
1,M4A,North York,Victoria Village,43.730417,-79.31334
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.72327,-79.451603
4,M7A,Queen's Park,Queen's Park,43.661072,-79.390895


Below is a script from one of the labs that uses the Foursquare explore API to create a DataFrame with info about venues.

The function receives three lists (or pandas.Series objects) containing the name, latitude and longitude of each location around which we want to look for venues.

The function then queries Foursquare and parses the response into a DataFrame.

In [4]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100, version='20180605'):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            version, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return nearby_venues
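
The parsing step of getNearbyVenues can be tried offline, without API credentials. The sketch below uses a hand-written mock that mirrors only the fields the function reads. The helper name `parse_venue_items` and the second mock venue are hypothetical. Unlike the original, it falls back to `None` when a venue has an empty category list instead of raising an `IndexError`.

```python
def parse_venue_items(name, items):
    """Flatten Foursquare-style 'items' into (name, venue, lat, lng, category) rows."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when no category is listed rather than indexing an empty list
        category = categories[0]['name'] if categories else None
        rows.append((name,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written mock; the second venue is invented to exercise the fallback
mock_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorised Place',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]
rows = parse_venue_items('M3A', mock_items)
```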

In [5]:
nearby_venues_df = getNearbyVenues(names=df.Postcode, latitudes=df.Latitude, longitudes=df.Longitude)
nearby_venues_df.head()

Unnamed: 0,Postcode,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,KFC,43.754387,-79.333021,Fast Food Restaurant
2,M3A,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,Wigmore Park,43.731023,-79.310771,Park
4,M4A,Memories of Africa,43.726602,-79.312427,Grocery Store


We now have two DataFrames:
* df - contains info about the names, postcodes and geographic coordinates of neighbourhoods in Toronto
* nearby_venues_df - contains a list of venues, their type and geographic coordinates

Both DataFrames contain a Postcode column so we can merge them later if we need to.
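
As a small illustration of how the shared Postcode column ties the two tables together, here is a sketch on toy data (a few rows copied from the outputs above; the frame names `hoods` and `venues` are placeholders):

```python
import pandas as pd

hoods = pd.DataFrame({'Postcode': ['M3A', 'M4A'],
                      'Neighbourhood': ['Parkwoods', 'Victoria Village']})
venues = pd.DataFrame({'Postcode': ['M3A', 'M3A', 'M4A'],
                       'Venue': ['Brookbanks Park', 'KFC', 'Wigmore Park']})

# One row per venue, with the neighbourhood name attached via the shared key
merged = venues.merge(hoods, on='Postcode', how='left')
print(merged)
```

A left merge keeps every venue row even if its postcode were missing from the neighbourhood table.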

## Clustering neighbourhoods based on venues they contain
In this section, we'll use KMeans to cluster Toronto neighbourhoods into groups with similar venues.

First, we'll generate a "one hot" encoding scheme for the Venue Category column.

We will then group the one hot encoded DataFrame based on the Postcode and calculate a mean for each column.

Based on this summary of the one-hot encoded table, we'll generate KMeans clusters.
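
To make the encoding step concrete, here is a sketch on five venue rows taken from the sample output above. The names `toy`, `one_hot` and `freq` are placeholders, and `dtype=float` is passed only so the result is numeric regardless of pandas version. After grouping, each cell holds the share of a postcode's venues that fall in that category.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M3A', 'M3A', 'M3A', 'M4A', 'M4A'],
    'Venue Category': ['Park', 'Fast Food Restaurant', 'Food & Drink Shop',
                       'Park', 'Grocery Store'],
})

# One indicator column per category, then the per-postcode mean of each column
one_hot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='', dtype=float)
one_hot['Postcode'] = toy['Postcode']
freq = one_hot.groupby('Postcode').mean()
print(freq)
```

Because every venue has exactly one category, each row of `freq` sums to 1, so KMeans compares neighbourhoods by their category mix rather than by their raw venue counts.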

In [6]:
category_oh_df = pd.get_dummies(nearby_venues_df[['Venue Category']], prefix="", prefix_sep="")
category_oh_df['Postcode'] = nearby_venues_df['Postcode']

In [7]:
cat_score = category_oh_df.groupby('Postcode').mean()
cat_score.head()

Postcode,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Basketball Court,Basketball Stadium,Beer Bar,Beer Store,Belgian Restaurant,Bike Shop,Bistro,Board Shop,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Bus Station,...,Soccer Field,Soup Place,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Stadium,Steakhouse,Storage Facility,Strip Club,Summer Camp,Supermarket,Supplement Shop,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cat_score)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 2, 1, 1, 1, 1, 1, 2, 1], dtype=int32)

In [11]:
clusters = pd.DataFrame({'Postcode': cat_score.index, 
                         'Cluster': kmeans.labels_})
clusters.head()

Unnamed: 0,Postcode,Cluster
0,M1C,0
1,M1E,2
2,M1G,2
3,M1H,1
4,M1J,1


In [12]:
df = pd.merge(left=df, right=clusters, on='Postcode')

In [13]:
df.head()

Unnamed: 0,Postcode,Bourough,Neighbourhood,Latitude,Longitude,Cluster
0,M3A,North York,Parkwoods,43.752435,-79.329268,2
1,M4A,North York,Victoria Village,43.730417,-79.31334,2
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264,1
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.72327,-79.451603,1
4,M7A,Queen's Park,Queen's Park,43.661072,-79.390895,1


## Utility function for generating maps
The make_map function takes a DataFrame and returns a folium map.
Each row in the DataFrame is added to the map using its Latitude and Longitude columns.
Additionally, each marker can be given a label from the column named by label_col (default Postcode).
Lastly, the color of each marker is set based on the value in the Cluster column.

In [14]:
def make_map(df, label_col='Postcode'):
    latitude, longitude = 43.7, -79.4 # Toronto coordinates

    # create map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

    # set color scheme: one Set1 color per cluster label (labels run from 0 to max),
    # computed from df itself rather than from the global kclusters
    n_colors = int(df['Cluster'].max()) + 1
    colors_array = cm.Set1(np.linspace(0, 1, n_colors))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map; note that rainbow[cluster-1] below maps
    # cluster 0 to the last color in the list
    for lat, lon, poi, cluster in zip(df['Latitude'], df['Longitude'], df[label_col], df['Cluster']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)
    
    return map_clusters
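
One detail of make_map worth spelling out is the lookup `rainbow[cluster-1]`. Because Python lists accept negative indices, cluster 0 ends up with the last color in the list rather than the first. The sketch below uses color names in place of hex codes; the four names follow the k=4 map described in this notebook, and the helper `color_for` is hypothetical.

```python
# Color names stand in for the hex codes make_map builds for k=4
rainbow = ['red', 'purple', 'brown', 'grey']

def color_for(cluster, palette):
    # Same lookup make_map uses: negative indexing sends cluster 0 to palette[-1]
    return palette[cluster - 1]

assignment = {c: color_for(c, rainbow) for c in range(4)}
print(assignment)
```

This is why cluster 0 appears grey on the map while cluster 1 takes the first color in the list.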

In [15]:
make_map(df)

Based on k=4 clustering, we can see that most of the neighbourhoods fall into either cluster 1 (red) or 2 (purple).

Clusters 0 (grey) and 3 (brown) seem to have very few members.

Let's see what kind of venues are common in each type of cluster:

In [21]:
nearby_venues_df = nearby_venues_df.merge(on='Postcode', right=clusters)
nearby_venues_df.head()

Unnamed: 0,Postcode,Venue,Venue Latitude,Venue Longitude,Venue Category,Cluster
0,M3A,Brookbanks Park,43.751976,-79.33214,Park,2
1,M3A,KFC,43.754387,-79.333021,Fast Food Restaurant,2
2,M3A,Variety Store,43.751974,-79.333114,Food & Drink Shop,2
3,M4A,Wigmore Park,43.731023,-79.310771,Park,2
4,M4A,Memories of Africa,43.726602,-79.312427,Grocery Store,2


## Summarizing clusters
Based on the k=4 clustering, we've built a map representing the geographic distribution of the clustered neighbourhoods.

We will now look at the most common types of businesses in each cluster.

In order to automate this analysis I wrote a small function that prints the top businesses (by count) in each cluster:

In [49]:
def cluster_summery(df, top=10):
    for i in range(df.Cluster.nunique()):
        data_slice = df.loc[df.Cluster == i]
        summery = data_slice.groupby('Venue Category').count().sort_values(by='Cluster', ascending=False)
        print(f'Cluster #{i} summery:')
        print(summery.iloc[0:top, 1:2], '\n')
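
For comparison, the same kind of summary can be sketched with `value_counts`, which avoids selecting a column by position. This is an alternative on invented toy data, not the function used in the rest of the notebook; the name `top_categories` is hypothetical.

```python
import pandas as pd

# Invented rows, just enough to have two clusters with different leaders
toy = pd.DataFrame({
    'Venue Category': ['Park', 'Park', 'Coffee Shop', 'Coffee Shop', 'Coffee Shop', 'Bar'],
    'Cluster': [0, 0, 1, 1, 1, 1],
})

def top_categories(venues, top=10):
    """Map each cluster label to a Series of its most frequent venue categories."""
    return {cluster: group['Venue Category'].value_counts().head(top)
            for cluster, group in venues.groupby('Cluster')}

summary = top_categories(toy, top=1)
for cluster, counts in summary.items():
    print(f'Cluster #{cluster}:', counts.to_dict())
```

Iterating over `groupby('Cluster')` visits only the labels that actually occur, so a gap in the label sequence would not produce an empty slice.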

In [50]:
cluster_summery(nearby_venues_df, top=5)

Cluster #0 summery:
                Venue
Venue Category       
Bar                 1
History Museum      1 

Cluster #1 summery:
                    Venue
Venue Category           
Coffee Shop           204
Café                  107
Restaurant             72
Bar                    55
Italian Restaurant     52 

Cluster #2 summery:
                            Venue
Venue Category                   
Park                           26
Grocery Store                   4
Coffee Shop                     4
Brewery                         3
Construction & Landscaping      3 

Cluster #3 summery:
                  Venue
Venue Category         
Home Service          2
Business Service      1 



Looking at the two main clusters (1 and 2), cluster 1 looks like a commercial area with a high density of coffee shops, cafés, restaurants and bars. On the other hand, the most common venue in cluster 2 is a park, followed by grocery stores and coffee shops, which suggests a more residential area.

## Redoing the experiment
Let's now repeat the analysis but with k=8. 

This choice will add degrees of freedom to the clustering algorithm and might result in differing conclusions.
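
One quick way to see how the choice of k changes the picture is to compare cluster sizes for each value. The sketch below runs on synthetic data: `toy_scores` is a random stand-in for cat_score (rows of category frequencies that sum to 1), so the sizes it prints say nothing about Toronto. `n_init=10` is set explicitly only to keep behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for cat_score: 30 rows of 5 category frequencies
toy_scores = rng.dirichlet(np.ones(5), size=30)

sizes_by_k = {}
for k in (4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_scores)
    # Number of rows assigned to each cluster label 0..k-1
    sizes_by_k[k] = np.bincount(labels, minlength=k).tolist()
    print(k, sizes_by_k[k])
```

Applied to cat_score, the same loop would show at a glance whether a larger k mostly splits the big clusters or just produces more tiny ones.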

In [52]:
# set number of clusters
kclusters = 8

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cat_score)

clusters_2 = pd.DataFrame({'Postcode': cat_score.index, 
                           'Cluster': kmeans.labels_})

In [56]:
df2 = df.drop('Cluster', axis=1)
df2.head()

Unnamed: 0,Postcode,Bourough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752435,-79.329268
1,M4A,North York,Victoria Village,43.730417,-79.31334
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.72327,-79.451603
4,M7A,Queen's Park,Queen's Park,43.661072,-79.390895


In [57]:
df2 = pd.merge(left=df2, right=clusters_2, on='Postcode')
df2.head()

Unnamed: 0,Postcode,Bourough,Neighbourhood,Latitude,Longitude,Cluster
0,M3A,North York,Parkwoods,43.752435,-79.329268,0
1,M4A,North York,Victoria Village,43.730417,-79.31334,6
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264,2
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.72327,-79.451603,2
4,M7A,Queen's Park,Queen's Park,43.661072,-79.390895,2


In [58]:
make_map(df2)

In [60]:
nearby_venues_df_2 = nearby_venues_df.drop('Cluster', axis=1)
nearby_venues_df_2 = pd.merge(left=nearby_venues_df_2, right=clusters_2, on='Postcode')
nearby_venues_df_2.head()

Unnamed: 0,Postcode,Venue,Venue Latitude,Venue Longitude,Venue Category,Cluster
0,M3A,Brookbanks Park,43.751976,-79.33214,Park,0
1,M3A,KFC,43.754387,-79.333021,Fast Food Restaurant,0
2,M3A,Variety Store,43.751974,-79.333114,Food & Drink Shop,0
3,M4A,Wigmore Park,43.731023,-79.310771,Park,6
4,M4A,Memories of Africa,43.726602,-79.312427,Grocery Store,6


In [62]:
cluster_summery(nearby_venues_df_2, top=5)

Cluster #0 summery:
                      Venue
Venue Category             
Park                     22
Pizza Place              15
Grocery Store            11
Fast Food Restaurant     11
Pharmacy                 10 

Cluster #1 summery:
                  Venue
Venue Category         
Home Service          2
Business Service      1 

Cluster #2 summery:
                    Venue
Venue Category           
Coffee Shop           206
Café                  103
Restaurant             71
Bar                    53
Italian Restaurant     51 

Cluster #3 summery:
                      Venue
Venue Category             
Park                      3
Athletics & Sports        1
Convenience Store         1
Gym / Fitness Center      1
Harbor / Marina           1 

Cluster #4 summery:
                Venue
Venue Category       
Park                2
Locksmith           1 

Cluster #5 summery:
                Venue
Venue Category       
Music Venue         1 

Cluster #6 summery:
                        

In this run, we again have two clusters that encompass most of the businesses.

Cluster 0 (grey) contains many parks, followed by pizza places and grocery stores, indicating a residential area.

Cluster 2 (blue) again leads with coffee shops, cafés and restaurants.

## Summary
Based on our KMeans clustering, we've identified two types of neighbourhoods in Toronto:

* Commercial neighbourhoods have a prevalence of coffee shops and restaurants

* Residential neighbourhoods have a prevalence of parks, grocery shops and fast food places