# Assignment: Segmenting and Clustering Neighborhoods in Toronto
### part 3 of 3
*by Miguel Rozsas*


## abstract
In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to learn to be agile and refine the skill of learning new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

### Start by importing the relevant libraries and the file saved in the previous session

In [1]:
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes       
import requests # library to handle requests

## Import the DataFrame saved in part 2 of this assignment.
file= 'Coursera-Capstone_2of3.csv'
df_merged= pd.read_csv(file)
df_merged.drop ('Unnamed: 0', axis=1, inplace=True)
df_merged.head ()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


### Explore and cluster the neighborhoods in Toronto.

Let's start with the necessary libraries.

In [2]:

#!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming a JSON file into a pandas dataframe
# (pandas.io.json.json_normalize is deprecated since pandas 1.0; the top-level function is the supported one)
from pandas import json_normalize


#! pip install folium==0.5.0
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

interest= "Toronto, Ontario"
geolocator = Nominatim(user_agent="toronto_explorer")
toronto = geolocator.geocode(interest)
print('The coordinates of Toronto are {}, {}.'.format(toronto.latitude, toronto.longitude))

Folium installed
Libraries imported.
The coordinates of Toronto are 43.6534817, -79.3839347.


Toronto's map

In [3]:
map_Toronto = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=12)
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format("neighbourhoods, boroughs and postal code in Toronto") 
map_Toronto.get_root().html.add_child(folium.Element(title_html))

# adding markers to map
for lat, lon, borough, neighbourhood, postal in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Borough'], df_merged['Neighbourhood'], df_merged['PostalCode']):
    label = '{}, {} - {}'.format(neighbourhood, borough, postal)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=4,
        popup=label,
        color='blue',
        fill=True
        ).add_to(map_Toronto)  
    
map_Toronto

### Toronto's venues by neighbourhood
Use Foursquare to get the venues in each neighbourhood.

In [4]:
# initialize my Foursquare credentials
CLIENT_ID = 'FZDPFCII3LOEUDNI2GETXZM2T2AFWPKWZMDP4VX5X3OK4DFH' # my Foursquare ID
CLIENT_SECRET = 'L314JHHIHHYXUSGEXJU1UZ5JSXOWNHACRNWAAD5EWPFHJO5Q' # my Foursquare Secret
ACCESS_TOKEN = 'YJ55TLGLG5XTRKKQ0PZZ1RVVWVZ4EK4BWJLS2ZCDA100YDVJ' # my FourSquare Access Token
VERSION = '20180604'
LIMIT = 30
print ("Foursquare credentials are set.")

Foursquare credentials are set.
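
The cell above keeps the credentials in the notebook itself. A minimal sketch of one alternative, assuming the values are exported as environment variables before the notebook starts, is shown below. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical, chosen only for this illustration.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_KEYS = {
    "CLIENT_ID": "FOURSQUARE_CLIENT_ID",
    "CLIENT_SECRET": "FOURSQUARE_CLIENT_SECRET",
}

def load_credentials(environ=os.environ):
    """Return the credentials found in `environ`, raising if any is missing."""
    missing = [env for env in ENV_KEYS.values() if not environ.get(env)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[env] for name, env in ENV_KEYS.items()}

# Demonstration with a fake environment so the sketch runs anywhere.
fake_env = {
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
}
creds = load_credentials(fake_env)
print(sorted(creds))
```

In the notebook, `CLIENT_ID = creds["CLIENT_ID"]` would then replace the literal string, and the secret never ends up in a shared file.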


Define a function that retrieves the venues around the latitude and longitude of each row of the merged dataframe from above and returns a similar dataframe with the venue and category columns added to it.

In [5]:
def getNearbyVenues(df):
    # radius around the lat,lng
    radius=500

    # iterate over the input data frame using the lat, long of each Neighbourhood as an argument to Foursquare API to get the venues at that location.

    # define the base Foursquare API URL 
    base_url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}'

    venues_list=[]
    for i, r in df.iterrows ():
        postalCode= r['PostalCode']
        borough= r['Borough']
        nbhd= r['Neighbourhood']
        b_lat= r['Latitude'] 
        b_lng= r['Longitude']   
        #print("DEBUG: ", postalCode, nbhd)
        
        # create the API request URL
        url= base_url.format(CLIENT_ID, CLIENT_SECRET, VERSION, b_lat, b_lng, radius)
            
        # GET request
        result = requests.get(url)
        try:
            items= result.json()["response"]['groups'][0]['items']
        except KeyError:
            print ("KeyError: response has no groups or items. Check: ")
            print (result.json()["response"])
            quit ()

        # iterate over Foursquare data, get the venue name and category
        # (the loop variable is named `item` so it does not shadow the outer row index `i`)
        for item in items:
            v= item['venue']       
            venue= (postalCode, borough, nbhd, b_lat, b_lng, v['name'], v['location']['lat'], v['location']['lng'], v['categories'][0]['name'])
            venues_list.append(venue)

    # return venues_list as a dataframe
    nearby= pd.DataFrame(data= venues_list, columns= ['PostalCode', 'Borough', 'Neighborhood', 'B_Lat', 'B_Long', 'Venue', 'V_Lat', 'V_Long', 'Category'])
    return(nearby)
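
The parsing step inside getNearbyVenues can be tried without calling the API. The sketch below is only an illustration: `extract_venues` is a hypothetical helper, the payload is hand-written and contains just the keys that getNearbyVenues reads, and the "Unknown" placeholder for venues without a category is my own assumption, not Foursquare behaviour.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories") or []
        # A venue with no category gets a placeholder instead of raising IndexError.
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows

# Hand-written payload; "Nameless Spot" is a made-up venue with no category.
sample = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Brookbanks Park",
                   "location": {"lat": 43.751976, "lng": -79.33214},
                   "categories": [{"name": "Park"}]}},
        {"venue": {"name": "Nameless Spot",
                   "location": {"lat": 43.75, "lng": -79.33},
                   "categories": []}},
    ]}]}
}
print(extract_venues(sample))
print(extract_venues({"response": {}}))
```

Returning an empty list for a response without groups lets the caller decide whether to skip the neighbourhood or stop, instead of ending the whole session.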

In [10]:
print ("This may take a while...be patient")
torontoVenues= getNearbyVenues(df_merged)
print ("...done.")

This may take a while...be patient
...done.


Now our dataframe torontoVenues has the borough data from 'https://cocl.us/Geospatial_data' (part 2) and the venue names, coordinates and categories from Foursquare. Nice.

In [11]:
torontoVenues.head ()

Unnamed: 0,PostalCode,Borough,Neighborhood,B_Lat,B_Long,Venue,V_Lat,V_Long,Category
0,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,North York,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,M4A,North York,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,M4A,North York,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


In [12]:
torontoVenues.shape

(1329, 9)

There are 1329 venues in our Toronto dataset.

### Exploring the data.
As the question is open, there is no specific task, and questions just pop into my mind in no particular order.

So, the first one I can think of is: how many distinct categories are there in each borough?

Which borough is the most diverse? Which is the least?


In [13]:
torontoCategGroupBy= torontoVenues.groupby ('Borough')['Category']
torontoCategGroupBy.nunique().nlargest (100)

Borough
Downtown Toronto    147
North York           90
West Toronto         73
Central Toronto      60
East Toronto         59
Scarborough          53
East York            47
Etobicoke            38
York                 16
Mississauga          12
Name: Category, dtype: int64

So, Downtown Toronto has the greatest number of distinct venue categories, and Mississauga has the fewest.
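
To see what the groupby and nunique combination counts, here is a small sketch on toy data (the boroughs "A" and "B" and their venues are invented for the example, they are not Foursquare results).

```python
import pandas as pd

# Toy data, not the real Foursquare results.
toy = pd.DataFrame({
    "Borough": ["A", "A", "A", "B", "B"],
    "Category": ["Park", "Café", "Park", "Bank", "Bank"],
})

# Number of distinct categories per borough, most diverse first.
diversity = toy.groupby("Borough")["Category"].nunique().sort_values(ascending=False)
print(diversity)
print("most diverse:", diversity.idxmax(), "| least diverse:", diversity.idxmin())
```

Repeated categories inside a borough count once, so borough "A" scores 2 even though it has three venues.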

Let's explore a particular borough, say North York. What is the most common venue in North York?

In [14]:
NorthYorkDF= torontoVenues[torontoVenues['Borough']=='North York']
NorthYorkDF['Category'].value_counts ()

Coffee Shop           17
Clothing Store        10
Pizza Place            8
Park                   7
Restaurant             7
                      ..
Deli / Bodega          1
Salon / Barbershop     1
Food Court             1
Convenience Store      1
Department Store       1
Name: Category, Length: 90, dtype: int64

So, Coffee Shop is the most common venue category in North York. Nice.

Let's list all 17 coffee shops in North York.

In [15]:
NorthYorkDF[NorthYorkDF['Category']=='Coffee Shop'][['Venue', 'Neighborhood', 'V_Lat', 'V_Long']]

Unnamed: 0,Venue,Neighborhood,V_Lat,V_Long
4,Tim Hortons,Victoria Village,43.725517,-79.313103
42,Tim Hortons,"Lawrence Manor, Lawrence Heights",43.719427,-79.467995
146,Tim Hortons,Don Mills,43.722897,-79.339117
156,Delimark Cafe,Don Mills,43.727536,-79.339547
349,Starbucks,"Bathurst Manor, Wilson Heights, Downsview North",43.755797,-79.440471
352,Tim Hortons,"Bathurst Manor, Wilson Heights, Downsview North",43.754767,-79.44325
449,Starbucks,"Fairview, Henry Farm, Oriole",43.77799,-79.344091
454,Aroma Espresso Bar,"Fairview, Henry Farm, Oriole",43.7777,-79.344652
465,Tim Hortons,"Fairview, Henry Farm, Oriole",43.774993,-79.346303
466,Tim Hortons,"Fairview, Henry Farm, Oriole",43.777964,-79.344715


And how many distinct venue coordinates are there in North York?

In [16]:
ct= torontoVenues[torontoVenues['Borough']=='North York']
ct.groupby (['V_Lat', 'V_Long']).count ()

V_Lat,V_Long,PostalCode,Borough,Neighborhood,B_Lat,B_Long,Venue,Category
43.705747,-79.442378,1,1,1,1,1,1,1
43.707170,-79.442658,1,1,1,1,1,1,1
43.707420,-79.443126,1,1,1,1,1,1,1
43.709031,-79.444053,1,1,1,1,1,1,1
43.709111,-79.443930,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...
43.801685,-79.363938,1,1,1,1,1,1,1
43.803664,-79.363905,1,1,1,1,1,1,1
43.804515,-79.366138,1,1,1,1,1,1,1
43.805455,-79.364186,1,1,1,1,1,1,1


There are 198 distinct coordinates in North York alone.
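
The count itself is not printed by the cell above, only the grouped table. A minimal sketch of two ways to get the number directly, on toy coordinates invented for the example:

```python
import pandas as pd

# Toy venues; the first and third rows share the same coordinates.
toy = pd.DataFrame({
    "Venue": ["Tim Hortons", "Starbucks", "Tim Hortons"],
    "V_Lat": [43.7255, 43.7558, 43.7255],
    "V_Long": [-79.3131, -79.4405, -79.3131],
})

# Option 1: drop repeated coordinate pairs and count what is left.
n_distinct = len(toy[["V_Lat", "V_Long"]].drop_duplicates())

# Option 2: ask the groupby object how many groups it has.
n_groups = toy.groupby(["V_Lat", "V_Long"]).ngroups

print(n_distinct, n_groups)
```

Both give the number of distinct (latitude, longitude) pairs, which is the quantity quoted in the text.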

### Show boroughs in Toronto's map
Let's create a map of Toronto showing the boroughs.

We need to assign each borough a unique color from the chosen colormap (cm.rainbow).

I will add an additional column, Borough_ID, to our torontoVenues dataframe. This column is a unique ID for the borough, and it is used as an index into the color map table.

Let's create 2 auxiliary lists:

1. borough_list: a simple list of the unique borough names
2. borough_IDs: another simple list of integers, one per borough in the list above.

In [18]:
borough_IDs= [i for i in range (0, len (torontoVenues.Borough.unique()))]
borough_list= torontoVenues.Borough.unique()

Now, using the 2 lists above, we can add the additional Borough_ID column to our dataframe.

In [19]:
torontoVenues['Borough_ID']= torontoVenues['Borough'].replace(to_replace=borough_list, value=borough_IDs, inplace=False)
torontoVenues.head ()

Unnamed: 0,PostalCode,Borough,Neighborhood,B_Lat,B_Long,Venue,V_Lat,V_Long,Category,Borough_ID
0,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park,0
1,M3A,North York,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop,0
2,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena,0
3,M4A,North York,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant,0
4,M4A,North York,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop,0
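
The two auxiliary lists and the replace call can also be collapsed into one step. The sketch below uses `pd.factorize`, which numbers values in order of first appearance, on a toy series of borough names; it is an alternative I am suggesting, not what the notebook ran.

```python
import pandas as pd

# Toy column of borough names, in the order the rows appear.
boroughs = pd.Series(
    ["North York", "North York", "Downtown Toronto", "North York", "Etobicoke"]
)

# codes: one integer per row; uniques: the borough behind each integer.
codes, uniques = pd.factorize(boroughs)
print(codes.tolist())
print(list(uniques))
```

Because the IDs follow first appearance, they match the order that `unique()` returns, so the resulting column should index the color table the same way.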


Plot the venues in each neighbourhood, using the Borough_ID as an index to a unique color, together with the borough center in the same color.

In [20]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# n= number of Boroughs
n= len (torontoVenues.Borough.unique())

# create map
map_Toronto = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=12)
title= "Venues by Neighborhood in Toronto"
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(title) 
map_Toronto.get_root().html.add_child(folium.Element(title_html))

# set color scheme for the boroughs (one color per borough)
x = np.arange(n)
ys = [i + x + (i*x)**2 for i in range(n)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add venues to the map
for name, category, bid, v_lat, v_lon, in zip(torontoVenues['Venue'], torontoVenues['Category'], torontoVenues['Borough_ID'], torontoVenues['V_Lat'], torontoVenues['V_Long']):
    label = '{}: {}'.format(category, name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [v_lat, v_lon],
        radius=3,
        popup=label,
        color=rainbow[bid],
        fill=True,
        fill_color=rainbow[bid],
        fill_opacity=0.6).add_to(map_Toronto)

# add the borough centers to the map
for borough, nbhd, bid, pc, b_lat, b_lon, in zip(torontoVenues['Borough'], torontoVenues['Neighborhood'], torontoVenues['Borough_ID'], torontoVenues['PostalCode'], torontoVenues['B_Lat'], torontoVenues['B_Long']):
    label = '{}: {} ({})'.format(borough, nbhd, pc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [b_lat, b_lon],
        radius=5,
        popup=label,
        color=rainbow[bid],
        fill=True,
        fill_color=rainbow[bid],
        fill_opacity=0.7).add_to(map_Toronto)
print ("done")

done


In [21]:
map_Toronto
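
The only thing the map needs from the color scheme is a list with one distinct hex color per borough. As a dependency-free illustration of that idea, the sketch below spaces hues evenly with the standard library `colorsys` module. This is a swapped-in technique: it does not reproduce the colors of cm.rainbow, and `distinct_hex_colors` is a hypothetical helper.

```python
import colorsys

def distinct_hex_colors(n):
    """Return n hex colors with evenly spaced hues (full saturation and value)."""
    palette = []
    for k in range(n):
        r, g, b = colorsys.hsv_to_rgb(k / n, 1.0, 1.0)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = distinct_hex_colors(10)
print(palette)
```

Indexing `palette[bid]` with a Borough_ID then works the same way as indexing the rainbow list in the cell above.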

Let's repeat in Toronto the analysis done for Manhattan, New York.

I will not restrict the analysis to one borough, as was done with Manhattan.

Instead I will work with all the Toronto data, since Manhattan alone has 3166 venues and the whole of Toronto has roughly 40% of that. Let's check.


In [27]:
torontoVenues[['Category']].shape

(1329, 1)

In [28]:
torontoVenues.groupby('Neighborhood').count()

Neighborhood,PostalCode,Borough,B_Lat,B_Long,Venue,V_Lat,V_Long,Category,Borough_ID
Agincourt,4,4,4,4,4,4,4,4,4
"Alderwood, Long Branch",6,6,6,6,6,6,6,6,6
"Bathurst Manor, Wilson Heights, Downsview North",22,22,22,22,22,22,22,22,22
Bayview Village,4,4,4,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",24,24,24,24,24,24,24,24,24
...,...,...,...,...,...,...,...,...,...
"Willowdale, Willowdale East",30,30,30,30,30,30,30,30,30
"Willowdale, Willowdale West",6,6,6,6,6,6,6,6,6
Woburn,4,4,4,4,4,4,4,4,4
Woodbine Heights,8,8,8,8,8,8,8,8,8


In [29]:
torontoVenues[['Category']]

Unnamed: 0,Category
0,Park
1,Food & Drink Shop
2,Hockey Arena
3,Portuguese Restaurant
4,Coffee Shop
...,...
1324,Hardware Store
1325,Fast Food Restaurant
1326,Tanning Salon
1327,Thrift / Vintage Store


In [30]:
print('There are {} unique categories.'.format(len(torontoVenues['Category'].unique())))

There are 236 unique categories.


In [31]:
type (torontoVenues)

pandas.core.frame.DataFrame

In [188]:
torontoVenues.shape

(1329, 10)

In [189]:
toronto_onehot = pd.get_dummies(torontoVenues[['Category']])
toronto_onehot.head ()

Unnamed: 0,Category_Accessories Store,Category_Adult Boutique,Category_Airport,Category_Airport Food Court,Category_Airport Gate,Category_Airport Lounge,Category_Airport Service,Category_Airport Terminal,Category_American Restaurant,Category_Antique Shop,...,Category_Train Station,Category_Vegetarian / Vegan Restaurant,Category_Video Game Store,Category_Video Store,Category_Vietnamese Restaurant,Category_Warehouse Store,Category_Wine Bar,Category_Wings Joint,Category_Women's Store,Category_Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [220]:
# one hot encoding
toronto_onehot = pd.get_dummies(torontoVenues[['Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = torontoVenues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.shape


(1329, 236)
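
A note on the shape above: 236 unique categories were counted earlier, yet the frame still has 236 columns after the Neighborhood column is added. One possible explanation, which I am inferring from the numbers and have not checked against the data, is that one venue category is itself named "Neighborhood", so the assignment overwrites that dummy column. The toy sketch below reproduces the effect and shows one way around it; the column name `Neighborhood_Name` is hypothetical.

```python
import pandas as pd

# Toy venues; one of them has a category literally named "Neighborhood".
toy = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Agincourt"],
    "Category": ["Park", "Neighborhood", "Lounge"],
})

onehot = pd.get_dummies(toy[["Category"]], prefix="", prefix_sep="")
n_categories = onehot.shape[1]  # one dummy column per category

# Assigning a 'Neighborhood' column overwrites the dummy of the same name.
clobbered = onehot.copy()
clobbered["Neighborhood"] = toy["Neighborhood"]

# Inserting the key under a name that cannot collide keeps every category,
# and position 0 puts it first without reordering columns afterwards.
safe = onehot.copy()
safe.insert(0, "Neighborhood_Name", toy["Neighborhood"])

print(n_categories, clobbered.shape[1], safe.shape[1])
```

If the collision is real, the clustering further down works with one category fewer than the count reported above.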

In [221]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000,0.000000,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000,0.000000,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000,0.000000,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000,0.000000,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,"Willowdale, Willowdale East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000,0.033333,0.0,0.0,0.0,0.0
91,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000,0.000000,0.0,0.0,0.0,0.0
92,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000,0.000000,0.0,0.0,0.0,0.0
93,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.125,0.000000,0.0,0.0,0.0,0.0


In [222]:
toronto_grouped.shape

(95, 236)
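
Taking the mean of the one-hot rows of a neighbourhood is the same as computing each category's share of that neighbourhood's venues. A small sketch with plain Python, on a made-up neighbourhood with six venues:

```python
from collections import Counter

# Toy list of venue categories for a single neighbourhood.
categories = ["Pizza Place", "Pizza Place", "Gym",
              "Sandwich Place", "Coffee Shop", "Pub"]

counts = Counter(categories)
# Share of each category, rounded to two decimals like the printout further down.
freq = {cat: round(n / len(categories), 2) for cat, n in counts.items()}
print(freq)
```

Two pizza places out of six venues give 0.33, and each single venue gives 0.17, which is how the freq values in the per-neighbourhood listings should be read.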

In [223]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

           venue  freq
0     Pizza Place  0.33
1             Gym  0.17
2  Sandwich Place  0.17
3     Coffee Shop  0.17
4             Pub  0.17


----Bathurst Manor, Wilson Heights, Downsview North----
           venue  freq
0           Bank  0.09
1    Coffee Shop  0.09
2       Pharmacy  0.05
3   Intersection  0.05
4  Shopping Mall  0.05


----Bayview Village----
                 venue  freq
0                 Café  0.25
1                 Bank  0.25
2   Chinese Restaurant  0.25
3  Japanese Restaurant  0.25
4          Yoga Studio  0.00


----Bedford Park, Lawrence Manor East----
                  venue  freq
0           Coffee Shop  0.08
1        Sandwich Place  0.08
2    Italian Restaurant  0.08
3  Fast Food Restaurant  0.04
4       Thai Restaurant  0.04


----Berczy Park----
                venue  freq
0        Cocktail Bar  0.07
1  Seafood Restaurant  0.07
2      Farmers Market  0.07
3            Beer Bar  0.07
4         Coffee Shop  0.07


----Birch Cliff, Cliffside West----
         

In [224]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [225]:
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions past the 3rd fall outside `indicators` and get the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_grouped_sorted = pd.DataFrame(columns=columns)
toronto_grouped_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    toronto_grouped_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

toronto_grouped_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,College Gym
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Gym,Sandwich Place,Pub
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Gas Station,Chinese Restaurant,Bridal Shop
3,Bayview Village,Bank,Chinese Restaurant,Japanese Restaurant,Café,Women's Store
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Juice Bar,Thai Restaurant
...,...,...,...,...,...,...
90,"Willowdale, Willowdale East",Ramen Restaurant,Coffee Shop,Café,Sandwich Place,Pizza Place
91,"Willowdale, Willowdale West",Pizza Place,Discount Store,Coffee Shop,Butcher,Pharmacy
92,Woburn,Coffee Shop,Soccer Field,Korean BBQ Restaurant,Curling Ice,Drugstore
93,Woodbine Heights,Park,Dance Studio,Beer Store,Skating Rink,Bus Stop


In [226]:
toronto_grouped_sorted.shape

(95, 6)
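
The column names above are built with a three-item suffix list and a fallback to "th". That is fine for 5 columns, but it would label position 21 as "21th". A minimal sketch of a more general helper, where `ordinal` is a hypothetical name of mine:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 6)]
print(columns)
```

For the five columns used here the result is identical to the loop above.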

### Cluster the Toronto data

In [227]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 10

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(init="k-means++", n_clusters=kclusters, n_init=12).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([4, 1, 1, 4, 1, 4, 4, 4, 4, 4, 0, 1, 4, 1, 4, 4, 1, 4, 4, 1, 1, 1,
       4, 1, 4, 4, 2, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 7, 4, 4, 1, 4,
       4, 4, 0, 1, 4, 8, 2, 4, 4, 4, 4, 4, 5, 1, 7, 4, 1, 0, 1, 4, 4, 0,
       4, 9, 4, 1, 6, 1, 4, 4, 1, 4, 4, 1, 1, 4, 4, 0, 1, 4, 4, 1, 3, 1,
       2, 4, 1, 1, 1, 4, 2], dtype=int32)

In [228]:
len (kmeans.labels_)

95
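
Before plotting, it helps to see how many neighbourhoods fall into each cluster. The sketch below tallies the labels copied by hand from the kmeans.labels_ output above; since no random seed was fixed, another run of k-means may give different labels.

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output above (one per neighbourhood).
labels = [
    4, 1, 1, 4, 1, 4, 4, 4, 4, 4, 0, 1, 4, 1, 4, 4, 1, 4, 4, 1, 1, 1,
    4, 1, 4, 4, 2, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 7, 4, 4, 1, 4,
    4, 4, 0, 1, 4, 8, 2, 4, 4, 4, 4, 4, 5, 1, 7, 4, 1, 0, 1, 4, 4, 0,
    4, 9, 4, 1, 6, 1, 4, 4, 1, 4, 4, 1, 1, 4, 4, 0, 1, 4, 4, 1, 3, 1,
    2, 4, 1, 1, 1, 4, 2,
]

sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print("cluster {}: {} neighbourhoods".format(cluster, size))
```

In this run, clusters 4 and 1 together hold 79 of the 95 neighbourhoods, so most of the remaining clusters contain only one or two neighbourhoods.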

In [229]:
kmeans.cluster_centers_

array([[-4.33680869e-19,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  5.00000000e-02],
       [ 3.86624869e-03,  2.16840434e-19,  1.08420217e-19, ...,
         2.16840434e-19,  5.42101086e-19,  1.73472348e-18],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       ...,
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00]])

In [230]:
kmeans.cluster_centers_[0].shape

(235,)

In [231]:
len (kmeans.cluster_centers_[0])

235

In [233]:
centroid_labels = [kmeans.cluster_centers_[i] for i in kmeans.labels_]
len (centroid_labels[0])

235

In [234]:
# add clustering labels
toronto_grouped_sorted.insert(0, 'Cluster Labels', kmeans.labels_)


# join toronto_grouped_sorted onto torontoVenues to add the cluster label and the top venues to each venue row
toronto_merged = torontoVenues.join(toronto_grouped_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,B_Lat,B_Long,Venue,V_Lat,V_Long,Category,Borough_ID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park,0,0,Park,Food & Drink Shop,Women's Store,Curling Ice,Drugstore
1,M3A,North York,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop,0,0,Park,Food & Drink Shop,Women's Store,Curling Ice,Drugstore
2,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena,0,1,Pizza Place,Coffee Shop,Hockey Arena,Intersection,French Restaurant
3,M4A,North York,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant,0,1,Pizza Place,Coffee Shop,Hockey Arena,Intersection,French Restaurant
4,M4A,North York,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop,0,1,Pizza Place,Coffee Shop,Hockey Arena,Intersection,French Restaurant


In [235]:
columns= ['Borough', 'Neighborhood', 'Category', 'Cluster Labels']
toronto_merged[columns]

Unnamed: 0,Borough,Neighborhood,Category,Cluster Labels
0,North York,Parkwoods,Park,0
1,North York,Parkwoods,Food & Drink Shop,0
2,North York,Victoria Village,Hockey Arena,1
3,North York,Victoria Village,Portuguese Restaurant,1
4,North York,Victoria Village,Coffee Shop,1
...,...,...,...,...
1324,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",Hardware Store,4
1325,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",Fast Food Restaurant,4
1326,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",Tanning Salon,4
1327,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",Thrift / Vintage Store,4


In [236]:
toronto_grouped_sorted.head ()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,4,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,College Gym
1,1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Gym,Sandwich Place,Pub
2,1,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Gas Station,Chinese Restaurant,Bridal Shop
3,4,Bayview Village,Bank,Chinese Restaurant,Japanese Restaurant,Café,Women's Store
4,1,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Juice Bar,Thai Restaurant


### plot the clusters on the map

In [237]:
def add_categorical_legend(folium_map, title, colors, labels):
    if len(colors) != len(labels):
        raise ValueError("colors and labels must have the same length.")

    color_by_label = dict(zip(labels, colors))
    
    legend_categories = ""     
    for label, color in color_by_label.items():
        legend_categories += f"<li><span style='background:{color}'></span>{label}</li>"
        
    legend_html = f"""
    <div id='maplegend' class='maplegend'>
      <div class='legend-title'>{title}</div>
      <div class='legend-scale'>
        <ul class='legend-labels'>
        {legend_categories}
        </ul>
      </div>
    </div>
    """
    script = f"""
        <script type="text/javascript">
        var oneTimeExecution = (function() {{
                    var executed = false;
                    return function() {{
                        if (!executed) {{
                             var checkExist = setInterval(function() {{
                                       if ((document.getElementsByClassName('leaflet-top leaflet-right').length) || (!executed)) {{
                                          document.getElementsByClassName('leaflet-top leaflet-right')[0].style.display = "flex"
                                          document.getElementsByClassName('leaflet-top leaflet-right')[0].style.flexDirection = "column"
                                          document.getElementsByClassName('leaflet-top leaflet-right')[0].innerHTML += `{legend_html}`;
                                          clearInterval(checkExist);
                                          executed = true;
                                       }}
                                    }}, 100);
                        }}
                    }};
                }})();
        oneTimeExecution()
        </script>
      """
   

    css = """

    <style type='text/css'>
      .maplegend {
        z-index:9999;
        float:right;
        background-color: rgba(255, 255, 255, 1);
        border-radius: 5px;
        border: 2px solid #bbb;
        padding: 10px;
        font-size:12px;
        position: relative;
      }
      .maplegend .legend-title {
        text-align: left;
        margin-bottom: 5px;
        font-weight: bold;
        font-size: 90%;
        }
      .maplegend .legend-scale ul {
        margin: 0;
        margin-bottom: 5px;
        padding: 0;
        float: left;
        list-style: none;
        }
      .maplegend .legend-scale ul li {
        font-size: 80%;
        list-style: none;
        margin-left: 0;
        line-height: 18px;
        margin-bottom: 2px;
        }
      .maplegend ul.legend-labels li span {
        display: block;
        float: left;
        height: 16px;
        width: 30px;
        margin-right: 5px;
        margin-left: 0;
        border: 0px solid #ccc;
        }
      .maplegend .legend-source {
        font-size: 80%;
        color: #777;
        clear: both;
        }
      .maplegend a {
        color: #777;
        }
    </style>
    """

    folium_map.get_root().header.add_child(folium.Element(script + css))

    return folium_map
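
The part of the function that turns labels and colors into list items can be tested on its own, without folium. The sketch below is a variation under my own assumptions: `legend_items` is a hypothetical helper, it escapes the label text, and it pairs labels with colors using zip instead of a dict, so two entries with the same label are both kept.

```python
from html import escape

def legend_items(labels, colors):
    """Build the <li> entries of a legend, escaping the label text."""
    if len(colors) != len(labels):
        raise ValueError("colors and labels must have the same length.")
    return "".join(
        f"<li><span style='background:{color}'></span>{escape(label)}</li>"
        for label, color in zip(labels, colors)
    )

print(legend_items(["cluster 0", "a < b"], ["#ff0000", "#00ff00"]))
```

The returned string could be dropped into the `legend_categories` placeholder of the HTML template above.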

In [238]:
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [239]:
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(n)]
len (ys)
ys

[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]

In [240]:
kclusters

10

In [245]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# n= number of Boroughs
n= len (toronto_merged.Borough.unique())

# create map
map_clusters = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=12)

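# set color scheme for the clusters (one color per cluster, so the palette is sized by kclusters, not by the number of boroughs)
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]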
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, venue, categ, cluster in zip(toronto_merged['V_Lat'], toronto_merged['V_Long'], toronto_merged['Venue'], toronto_merged['Category'], toronto_merged['Cluster Labels']):
    label_str= "{}:{}({})".format (categ, venue, cluster)
    popup = folium.Popup (label_str, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        color=rainbow[cluster],
        popup= popup,
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

# add a legend on the right upper corner
clusters_colors= []
clusters_legends= []
for cluster in range (kclusters):
    clusters_colors.append (rainbow[cluster])
    clusters_legends.append ("cluster {}". format (cluster))

map_clusters= add_categorical_legend(map_clusters, 'Clusters',
                             colors = clusters_colors,
                           labels = clusters_legends)
print ('done.')


done.


In [255]:
clusters_legends

['cluster 0',
 'cluster 1',
 'cluster 2',
 'cluster 3',
 'cluster 4',
 'cluster 5',
 'cluster 6',
 'cluster 7',
 'cluster 8',
 'cluster 9']

In [247]:
map_clusters

In [292]:
t= 0
for i in range (kclusters):
    cluster= toronto_merged.loc[toronto_merged['Cluster Labels'] == i, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
    cluster.reset_index(drop=True, inplace=True)
    n= len (cluster)
    t= t+n
    print ("cluster {} has {} venues and the most common venues in this cluster are:".format (i, n))
    most_common= []
    for v in cluster['1st Most Common Venue'].unique ():
        most_common.append (v)
    print (most_common)
    print ()
print ("the {} clusters sum to {} venues.".format (kclusters, t))

cluster 0 has 15 venues and the most common venue in this cluster are:
['Park', 'River']

cluster 1 has 418 venues and the most common venue in this cluster are:
['Pizza Place', 'Coffee Shop', 'Sandwich Place', 'Furniture / Home Store', 'Bar', 'Ramen Restaurant', 'Indian Restaurant', 'Park', 'Dessert Shop', 'Fast Food Restaurant', 'Grocery Store']

cluster 2 has 11 venues and the most common venues in this cluster are:
['Park', 'Playground']

cluster 3 has 1 venues and the most common venues in this cluster are:
['Bakery']

cluster 4 has 872 venues and the most common venues in this cluster are:
['Coffee Shop', 'Clothing Store', 'Gym', 'Café', 'Park', 'Gastropub', 'Home Service', 'Restaurant', 'Trail', 'Grocery Store', 'Bakery', 'Golf Course', 'Hotel', 'Bar', 'Bank', 'Greek Restaurant', 'Fast Food Restaurant', 'American Restaurant', 'General Entertainment', 'Pool', 'Jewelry Store', 'Mexican Restaurant', 'Middle Eastern Restaurant', 'Gift Shop', 'Latin American Restaurant', 'Tennis Court',
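
The per-cluster loop above could also be expressed as a single `groupby`. The following is only a minimal sketch on a toy frame: `toy` and the helper `summarize_clusters` are hypothetical names introduced for illustration, and only the two column names `Cluster Labels` and `1st Most Common Venue` are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for toronto_merged (assumption: only the two columns used here).
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 1, 2],
    "1st Most Common Venue": ["Park", "Park", "River",
                              "Pizza Place", "Coffee Shop", "Park"],
})


def summarize_clusters(df, label_col="Cluster Labels",
                       venue_col="1st Most Common Venue"):
    """Return one row per cluster: venue count and distinct top venue categories."""
    grouped = df.groupby(label_col)[venue_col]
    return pd.DataFrame({
        "n_venues": grouped.size(),                 # rows (venues) per cluster
        "top_venues": grouped.unique().apply(list)  # distinct top categories
    })


summary = summarize_clusters(toy)
print(summary)
print("total venues:", summary["n_venues"].sum())
```

Applied to the real data, the call would presumably be `summarize_clusters(toronto_merged)`; the result is a DataFrame, so it can be sorted or filtered instead of being read off printed lines.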

In [290]:
cluster= toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster.reset_index(drop=True, inplace=True)
cluster['1st Most Common Venue'].unique ()

array(['Park', 'River'], dtype=object)
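
The same row and column selection is repeated in several cells with only the cluster label changing, so it may be convenient to wrap it in a small function. This is a sketch under assumptions: `cluster_rows` is a hypothetical helper, and the toy frame only imitates the column layout that the selection implies (Borough at position 1, venue columns from position 5 onward).

```python
import pandas as pd


def cluster_rows(df, label, label_col="Cluster Labels", first_kept=1, skip_until=5):
    """Rows of one cluster, keeping column `first_kept` and all columns from `skip_until` on."""
    keep = [first_kept] + list(range(skip_until, df.shape[1]))
    return df.loc[df[label_col] == label, df.columns[keep]]


# Toy frame (assumed layout, not the real toronto_merged).
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Regent Park, Harbourfront"],
    "Latitude": [43.753259, 43.65426],
    "Longitude": [-79.329656, -79.360636],
    "Venue": ["Brookbanks Park", "Tandem Coffee"],
    "Category": ["Park", "Coffee Shop"],
    "Cluster Labels": [0, 4],
})

view = cluster_rows(toy, 0)
print(view)
```

With such a helper, each inspection cell would reduce to a call like `cluster_rows(toronto_merged, 2)`, which keeps the column-index arithmetic in one place.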

In [250]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Venue,V_Lat,V_Long,Category,Borough_ID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,North York,Brookbanks Park,43.751976,-79.33214,Park,0,0,Park,Food & Drink Shop,Women's Store,Curling Ice,Drugstore
1,North York,Variety Store,43.751974,-79.333114,Food & Drink Shop,0,0,Park,Food & Drink Shop,Women's Store,Curling Ice,Drugstore
249,York,Nairn Park,43.690654,-79.4563,Park,5,0,Park,Women's Store,Pool,Cupcake Shop,Donut Shop
250,York,Maximum Woman,43.690651,-79.456333,Women's Store,5,0,Park,Women's Store,Pool,Cupcake Shop,Donut Shop
251,York,Fairbanks Pool,43.691959,-79.448922,Pool,5,0,Park,Women's Store,Pool,Cupcake Shop,Donut Shop
252,York,Fairbank Memorial Park,43.692028,-79.448924,Park,5,0,Park,Women's Store,Pool,Cupcake Shop,Donut Shop
807,Central Toronto,Lawrence Park Ravine,43.726963,-79.394382,Park,8,0,Park,Swim School,Bus Line,Women's Store,Curling Ice
808,Central Toronto,Zodiac Swim School,43.728532,-79.38286,Swim School,8,0,Park,Swim School,Bus Line,Women's Store,Curling Ice
809,Central Toronto,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line,8,0,Park,Swim School,Bus Line,Women's Store,Curling Ice
1163,Downtown Toronto,Rosedale Park,43.682328,-79.378934,Playground,1,0,Park,Playground,Trail,Women's Store,Cupcake Shop


In [251]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Venue,V_Lat,V_Long,Category,Borough_ID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
477,East York,Danforth & Jones,43.684352,-79.334792,Intersection,3,2,Park,Convenience Store,Intersection,Women's Store,Curling Ice
478,East York,The Path,43.683923,-79.335007,Park,3,2,Park,Convenience Store,Intersection,Women's Store,Curling Ice
479,East York,Sammon Convenience,43.686951,-79.335007,Convenience Store,3,2,Park,Convenience Store,Intersection,Women's Store,Curling Ice
817,York,Maison Birks,43.705857,-79.516102,Jewelry Store,5,2,Park,Jewelry Store,Convenience Store,Women's Store,Curling Ice
818,York,Grattan Park,43.706222,-79.521705,Park,5,2,Park,Jewelry Store,Convenience Store,Women's Store,Curling Ice
819,York,Olympic convenience store,43.704486,-79.515789,Convenience Store,5,2,Park,Jewelry Store,Convenience Store,Women's Store,Curling Ice
826,North York,Kitchen Food Fair,43.751298,-79.401393,Convenience Store,0,2,Park,Convenience Store,Women's Store,Curling Ice,Drugstore
827,North York,Tournament Park,43.751257,-79.399717,Park,0,2,Park,Convenience Store,Women's Store,Curling Ice,Drugstore
1093,Scarborough,McNicoll & Brimley,43.815461,-79.281716,Intersection,2,2,Playground,Park,Intersection,Women's Store,Cupcake Shop
1094,Scarborough,Port Royal Park,43.815477,-79.289773,Park,2,2,Playground,Park,Intersection,Women's Store,Cupcake Shop


In [252]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 7, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Venue,V_Lat,V_Long,Category,Borough_ID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
767,North York,Strathburn Park,43.721765,-79.532854,Baseball Field,0,7,Baseball Field,Food Service,Women's Store,Electronics Store,Drugstore
768,North York,Triple A,43.722412,-79.528716,Food Service,0,7,Baseball Field,Food Service,Women's Store,Electronics Store,Drugstore
1314,Etobicoke,The Artisan Cheese and Food Gallery,43.638785,-79.499953,Deli / Bodega,4,7,Baseball Field,Deli / Bodega,Women's Store,Escape Room,Eastern European Restaurant
1315,Etobicoke,Woodford Park,43.633152,-79.496266,Baseball Field,4,7,Baseball Field,Deli / Bodega,Women's Store,Escape Room,Eastern European Restaurant


In [148]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Venue,V_Lat,V_Long,Category,Borough_ID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,Brookbanks Park,43.751976,-79.33214,Park,0,3,Park,Food & Drink Shop,Women's Store,Curling Ice,Drugstore,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner
1,North York,Variety Store,43.751974,-79.333114,Food & Drink Shop,0,3,Park,Food & Drink Shop,Women's Store,Curling Ice,Drugstore,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner
249,York,Nairn Park,43.690654,-79.4563,Park,5,3,Park,Women's Store,Pool,Cupcake Shop,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant
250,York,Maximum Woman,43.690651,-79.456333,Women's Store,5,3,Park,Women's Store,Pool,Cupcake Shop,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant
251,York,Fairbanks Pool,43.691959,-79.448922,Pool,5,3,Park,Women's Store,Pool,Cupcake Shop,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant
252,York,Fairbank Memorial Park,43.692028,-79.448924,Park,5,3,Park,Women's Store,Pool,Cupcake Shop,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant
477,East York,Danforth & Jones,43.684352,-79.334792,Intersection,3,3,Park,Convenience Store,Intersection,Women's Store,Curling Ice,Drugstore,Donut Shop,Dog Run,Distribution Center,Discount Store
478,East York,The Path,43.683923,-79.335007,Park,3,3,Park,Convenience Store,Intersection,Women's Store,Curling Ice,Drugstore,Donut Shop,Dog Run,Distribution Center,Discount Store
479,East York,Sammon Convenience,43.686951,-79.335007,Convenience Store,3,3,Park,Convenience Store,Intersection,Women's Store,Curling Ice,Drugstore,Donut Shop,Dog Run,Distribution Center,Discount Store
807,Central Toronto,Lawrence Park Ravine,43.726963,-79.394382,Park,8,3,Park,Swim School,Bus Line,Women's Store,Curling Ice,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner


In [253]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Venue,V_Lat,V_Long,Category,Borough_ID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
8,Downtown Toronto,Roselle Desserts,43.653447,-79.362017,Bakery,1,4,Coffee Shop,Park,Bakery,Café,Theater
9,Downtown Toronto,Tandem Coffee,43.653559,-79.361809,Coffee Shop,1,4,Coffee Shop,Park,Bakery,Café,Theater
10,Downtown Toronto,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center,1,4,Coffee Shop,Park,Bakery,Café,Theater
11,Downtown Toronto,Body Blitz Spa East,43.654735,-79.359874,Spa,1,4,Coffee Shop,Park,Bakery,Café,Theater
12,Downtown Toronto,Impact Kitchen,43.656369,-79.356980,Restaurant,1,4,Coffee Shop,Park,Bakery,Café,Theater
...,...,...,...,...,...,...,...,...,...,...,...,...
1324,Etobicoke,RONA,43.629393,-79.518320,Hardware Store,4,4,Tanning Salon,Burger Joint,Supplement Shop,Wings Joint,Discount Store
1325,Etobicoke,McDonald's,43.630007,-79.518041,Fast Food Restaurant,4,4,Tanning Salon,Burger Joint,Supplement Shop,Wings Joint,Discount Store
1326,Etobicoke,Koala Tan Tanning Salon & Sunless Spa,43.631370,-79.519006,Tanning Salon,4,4,Tanning Salon,Burger Joint,Supplement Shop,Wings Joint,Discount Store
1327,Etobicoke,Value Village,43.631269,-79.518238,Thrift / Vintage Store,4,4,Tanning Salon,Burger Joint,Supplement Shop,Wings Joint,Discount Store
