# Assignment: Segmenting and Clustering Neighborhoods in Toronto
### part 3 of 3
*by Miguel Rozsas*


## abstract
In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to learn to be agile and refine the skill of learning new libraries and tools quickly, depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

### Start by importing the relevant libraries and the file saved in previous session

In [145]:
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes       
import requests # library to handle requests

## Import the DataFrame saved in part 2 of this assignment.
file= 'Coursera-Capstone_2of3.csv'
df_merged= pd.read_csv(file)
df_merged.drop ('Unnamed: 0', axis=1, inplace=True)
df_merged.head ()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


### Explore and cluster the neighborhoods in Toronto.

Let's start with the necessary libraries.

In [146]:

#!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming a JSON file into a pandas dataframe
from pandas import json_normalize


#! pip install folium==0.5.0
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

interest= "Toronto, Ontario"
geolocator = Nominatim(user_agent="toronto_explorer")
toronto = geolocator.geocode(interest)
print('The coordinates of Toronto are {}, {}.'.format(toronto.latitude, toronto.longitude))

Folium installed
Libraries imported.
The coordinates of Toronto are 43.6534817, -79.3839347.


Toronto's map

In [147]:
map_Toronto = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=12)
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format("neighbourhoods, boroughs and postal code in Toronto") 
map_Toronto.get_root().html.add_child(folium.Element(title_html))

# adding markers to map
for lat, lon, borough, neighbourhood, postal in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Borough'], df_merged['Neighbourhood'], df_merged['PostalCode']):
    label = '{}, {} - {}'.format(neighbourhood, borough, postal)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=4,
        popup=label,
        color='blue',
        fill=True
        ).add_to(map_Toronto)  
    
map_Toronto

### Toronto's venues by neighbourhood
Use Foursquare to get venues by neighbourhood.

In [148]:
# initialize my Foursquare credentials
CLIENT_ID = 'FZDPFCII3LOEUDNI2GETXZM2T2AFWPKWZMDP4VX5X3OK4DFH' # my Foursquare ID
CLIENT_SECRET = 'L314JHHIHHYXUSGEXJU1UZ5JSXOWNHACRNWAAD5EWPFHJO5Q' # my Foursquare Secret
ACCESS_TOKEN = 'YJ55TLGLG5XTRKKQ0PZZ1RVVWVZ4EK4BWJLS2ZCDA100YDVJ' # my FourSquare Access Token
VERSION = '20180604'
LIMIT = 30
print ("Foursquare credentials are set.")

Foursquare credentials are set.
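Hard-coding API keys in a notebook exposes them to anyone who can read the file. As a minimal sketch, assuming the keys are stored in environment variables (the names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical, not part of the original notebook), they could be loaded like this:

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; any names work as long as the
    notebook and the shell agree on them.
    """
    env = os.environ if env is None else env
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(
            "Missing environment variable: {}".format(missing)
        ) from None


# Example with an explicit mapping instead of the real environment
demo_env = {"FOURSQUARE_CLIENT_ID": "id", "FOURSQUARE_CLIENT_SECRET": "secret"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
```

Passing a plain dictionary, as in the example, keeps the helper easy to try without touching the real environment.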


Define a function that retrieves the venues around the latitude and longitude of each row in the merged dataframe above and returns a similar dataframe with the venue and category columns added.

In [149]:
def getNearbyVenues(df):
    # radius around the lat,lng
    radius=500

    # iterate over the input data frame using the lat, long of each Neighbourhood as an argument to Foursquare API to get the venues at that location.

    # define the base Foursquare API URL 
    base_url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}'

    venues_list=[]
    for i, r in df.iterrows ():
        postalCode= r['PostalCode']
        borough= r['Borough']
        nbhd= r['Neighbourhood']
        b_lat= r['Latitude'] 
        b_lng= r['Longitude']   
        #print("DEBUG: ", postalCode, nbhd)
        
        # create the API request URL
        url= base_url.format(CLIENT_ID, CLIENT_SECRET, VERSION, b_lat, b_lng, radius)
            
        # GET request
        result = requests.get(url)
        try:
            items= result.json()["response"]['groups'][0]['items']
        except KeyError:
            print ("KeyError: response has no groups or items. Check: ")
            print (result.json()["response"])
            # re-raise instead of calling quit(), which would stop the whole session
            raise

        # iterate over Foursquare data, get the venue name and category
        # (use a distinct loop variable so the outer row index i is not shadowed)
        for item in items:
            v= item['venue']
            venue= (postalCode, borough, nbhd, b_lat, b_lng, v['name'], v['location']['lat'], v['location']['lng'], v['categories'][0]['name'])
            venues_list.append(venue)

    # return venues_list as a dataframe
    nearby= pd.DataFrame(data= venues_list, columns= ['PostalCode', 'Borough', 'Neighborhood', 'B_Lat', 'B_Long', 'Venue', 'V_Lat', 'V_Long', 'Category'])
    return(nearby)
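The function above assumes that every response has at least one group and that every venue has at least one category. As a hedged sketch, the parsing step could be separated into a helper that tolerates both gaps. The nested layout is taken from the code above. The helper name `extract_venues` and the placeholder label `"Uncategorized"` are hypothetical.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from a decoded explore response.

    Assumes the layout used by getNearbyVenues:
    payload["response"]["groups"][0]["items"][i]["venue"].
    """
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []  # no groups: treat as "no venues" instead of failing

    venues = []
    for item in groups[0].get("items", []):
        v = item["venue"]
        categories = v.get("categories") or []
        # fall back to a placeholder label when a venue has no category
        category = categories[0]["name"] if categories else "Uncategorized"
        venues.append((v["name"], v["location"]["lat"], v["location"]["lng"], category))
    return venues


# Hand-written sample that mimics the assumed layout (not real API output)
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Toy Park",
                           "location": {"lat": 43.75, "lng": -79.33},
                           "categories": [{"name": "Park"}]}},
                {"venue": {"name": "No Category Place",
                           "location": {"lat": 43.76, "lng": -79.34},
                           "categories": []}},
            ]
        }]
    }
}
parsed = extract_venues(sample)
```

Keeping the parsing apart from the HTTP request also makes it possible to check the logic on a small dictionary without calling the API.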

In [6]:
print ("This may take a while...be patient")
torontoVenues= getNearbyVenues(df_merged)
print ("...done.")

This may take a while...be patient
...done.


Now our dataframe torontoVenues has borough data from 'https://cocl.us/Geospatial_data' (part 2) and venue names, coordinates and categories from Foursquare. Nice.

In [150]:
torontoVenues.head ()

Unnamed: 0,PostalCode,Borough,Neighborhood,B_Lat,B_Long,Venue,V_Lat,V_Long,Category,Borough_ID
0,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park,0
1,M3A,North York,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop,0
2,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena,0
3,M4A,North York,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant,0
4,M4A,North York,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop,0


In [8]:
torontoVenues.shape

(1326, 9)

There are 1326 venues in our Toronto dataset.

### Exploring the data.
As the question is open, there is no specific task, and questions are just popping into my mind in no particular order.

So, the first one I can think of is: how many distinct categories are there in each borough?

Which borough is the most diverse? Which is the least diverse?


In [151]:
torontoCategGroupBy= torontoVenues.groupby ('Borough')['Category']
torontoCategGroupBy.nunique().nlargest (100)

Borough
Downtown Toronto    145
North York           90
West Toronto         74
East Toronto         59
Central Toronto      56
Scarborough          56
East York            45
Etobicoke            40
York                 16
Mississauga          11
Name: Category, dtype: int64

So, Downtown Toronto has the greatest number of distinct venue categories, and Mississauga has the fewest.
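The most and least diverse boroughs can also be read off programmatically rather than by eye. A small sketch on toy data (the dataframe `toy` and its values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for torontoVenues with only the two columns that matter here
toy = pd.DataFrame({
    "Borough": ["A", "A", "A", "B", "B"],
    "Category": ["Park", "Café", "Park", "Café", "Café"],
})

# number of distinct categories per borough
distinct = toy.groupby("Borough")["Category"].nunique()

most_diverse = distinct.idxmax()   # borough with the most distinct categories
least_diverse = distinct.idxmin()  # borough with the fewest distinct categories
```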

Let's explore a particular borough, say Downtown Toronto. What is the most common venue category in Downtown Toronto?

In [154]:
downtown_data = torontoVenues[torontoVenues['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtown_data['Category'].value_counts ()

Coffee Shop             48
Café                    46
Restaurant              18
Park                    15
Bakery                  14
                        ..
Sculpture Garden         1
Ethiopian Restaurant     1
College Gym              1
Market                   1
IT Services              1
Name: Category, Length: 145, dtype: int64

So, Coffee Shop is the most common venue category in Downtown Toronto. Nice.

Let's list the 48 coffee shops in Downtown Toronto.

In [156]:
downtown_data[downtown_data['Category']=='Coffee Shop'][['Venue', 'Neighborhood', 'V_Lat', 'V_Long']]

Unnamed: 0,Venue,Neighborhood,V_Lat,V_Long
1,Tandem Coffee,"Regent Park, Harbourfront",43.653559,-79.361809
13,Sumach Espresso,"Regent Park, Harbourfront",43.658135,-79.359515
14,Arvo,"Regent Park, Harbourfront",43.649963,-79.361442
15,Rooster Coffee,"Regent Park, Harbourfront",43.6519,-79.365609
17,Dark Horse Espresso Bar,"Regent Park, Harbourfront",43.653081,-79.357078
20,Starbucks,"Regent Park, Harbourfront",43.651613,-79.364917
32,NEO COFFEE BAR,"Queen's Park, Ontario Provincial Government",43.66013,-79.38583
47,Starbucks,"Queen's Park, Ontario Provincial Government",43.658204,-79.388998
52,Tim Hortons,"Queen's Park, Ontario Provincial Government",43.661038,-79.393797
53,Starbucks,"Queen's Park, Ontario Provincial Government",43.660887,-79.39372


And how many distinct venue coordinates are there in Downtown Toronto?

In [158]:
ct= torontoVenues[torontoVenues['Borough']=='Downtown Toronto']
ct.groupby (['V_Lat', 'V_Long']).count ()

Unnamed: 0_level_0,Unnamed: 1_level_0,PostalCode,Borough,Neighborhood,B_Lat,B_Long,Venue,Category,Borough_ID
V_Lat,V_Long,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
43.627721,-79.389274,1,1,1,1,1,1,1,1
43.627813,-79.389109,1,1,1,1,1,1,1,1
43.630680,-79.395756,1,1,1,1,1,1,1,1
43.630706,-79.398760,1,1,1,1,1,1,1,1
43.630717,-79.398698,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...
43.673754,-79.423988,1,1,1,1,1,1,1,1
43.676352,-79.373842,1,1,1,1,1,1,1,1
43.678300,-79.382773,1,1,1,1,1,1,1,1
43.682036,-79.373788,1,1,1,1,1,1,1,1


There are 396 distinct coordinates in Downtown Toronto alone.
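The count of distinct coordinate pairs does not have to be read from the length of the grouped table. A minimal sketch on toy data (the values are made up) shows two equivalent ways to get the number directly:

```python
import pandas as pd

# Toy venues: the first two rows share the same coordinates
toy = pd.DataFrame({
    "V_Lat": [43.65, 43.65, 43.66],
    "V_Long": [-79.38, -79.38, -79.39],
    "Venue": ["a", "b", "c"],
})

# number of groups formed by the (latitude, longitude) pairs
n_distinct = toy.groupby(["V_Lat", "V_Long"]).ngroups

# same count, obtained by dropping duplicated coordinate pairs
n_distinct_alt = len(toy[["V_Lat", "V_Long"]].drop_duplicates())
```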

### Show boroughs in Toronto's map
Let's create a map of Toronto showing the boroughs.

We need to assign each borough a unique color from the chosen colormap (cm.rainbow).

I will add an additional column, Borough_ID, to our torontoVenues dataframe. This column is a unique ID for the borough and is used as an index into the color map table.

Let's create two auxiliary lists:

1. borough_list: a simple list of unique borough names
1. borough_IDs: another simple list of integers, one per borough in the list above.

In [159]:
borough_IDs= [i for i in range (0, len (torontoVenues.Borough.unique()))]
borough_list= torontoVenues.Borough.unique()

Now, using the 2 lists above, we can add the additional Borough_ID column to our dataframe.

In [160]:
torontoVenues['Borough_ID']= torontoVenues['Borough'].replace(to_replace=borough_list, value=borough_IDs, inplace=False)
torontoVenues.head ()

Unnamed: 0,PostalCode,Borough,Neighborhood,B_Lat,B_Long,Venue,V_Lat,V_Long,Category,Borough_ID
0,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park,0
1,M3A,North York,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop,0
2,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena,0
3,M4A,North York,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant,0
4,M4A,North York,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop,0
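The two auxiliary lists and the `replace` call can be collapsed into a single step. As a sketch of an alternative (not the approach used above), `pd.factorize` assigns integer codes in order of first appearance, which matches the order returned by `unique()`. The toy dataframe below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "North York", "Etobicoke"],
})

# codes: one integer per row; uniques: the borough names in order of appearance
codes, uniques = pd.factorize(toy["Borough"])
toy["Borough_ID"] = codes
```

Here `uniques` plays the role of `borough_list`, and the position of each name in it is its `Borough_ID`.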


Plot the venues in each neighbourhood, using Borough_ID as an index to a unique color, together with the borough centers in the same color.

In [161]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# n= number of Boroughs
n= len (torontoVenues.Borough.unique())

# create map
map_Toronto = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=12)
title= "Venues by Neighborhood in Toronto"
title_html = '''
             <h3 align="center" style="font-size:16px"><b>{}</b></h3>
             '''.format(title) 
map_Toronto.get_root().html.add_child(folium.Element(title_html))

# set color scheme for the clusters
x = np.arange(n)
ys = [i + x + (i*x)**2 for i in range(n)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add venues to the map
for name, category, bid, v_lat, v_lon, in zip(torontoVenues['Venue'], torontoVenues['Category'], torontoVenues['Borough_ID'], torontoVenues['V_Lat'], torontoVenues['V_Long']):
    label = '{}: {}'.format(category, name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [v_lat, v_lon],
        radius=3,
        popup=label,
        color=rainbow[bid],
        fill=True,
        fill_color=rainbow[bid],
        fill_opacity=0.6).add_to(map_Toronto)

# add the borough centers to the map
for borough, nbhd, bid, pc, b_lat, b_lon, in zip(torontoVenues['Borough'], torontoVenues['Neighborhood'], torontoVenues['Borough_ID'], torontoVenues['PostalCode'], torontoVenues['B_Lat'], torontoVenues['B_Long']):
    label = '{}: {} ({})'.format(borough, nbhd, pc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [b_lat, b_lon],
        radius=5,
        popup=label,
        color=rainbow[bid],
        fill=True,
        fill_color=rainbow[bid],
        fill_opacity=0.7).add_to(map_Toronto)
print ("done")

done


In [162]:
map_Toronto

Let's repeat the analysis done for Manhattan, New York, on Downtown Toronto, the biggest borough by number of venue categories.

The dataframe downtown_data was created earlier in this section, so let's just use it.


In [165]:
downtown_data.groupby('Neighborhood').count()

Unnamed: 0_level_0,PostalCode,Borough,B_Lat,B_Long,Venue,V_Lat,V_Long,Category,Borough_ID
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Berczy Park,30,30,30,30,30,30,30,30,30
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17,17,17,17,17,17,17,17,17
Central Bay Street,30,30,30,30,30,30,30,30,30
Christie,17,17,17,17,17,17,17,17,17
Church and Wellesley,30,30,30,30,30,30,30,30,30
"Commerce Court, Victoria Hotel",30,30,30,30,30,30,30,30,30
"First Canadian Place, Underground city",30,30,30,30,30,30,30,30,30
"Garden District, Ryerson",30,30,30,30,30,30,30,30,30
"Harbourfront East, Union Station, Toronto Islands",30,30,30,30,30,30,30,30,30
"Kensington Market, Chinatown, Grange Park",30,30,30,30,30,30,30,30,30


In [166]:
downtown_data[['Category']]

Unnamed: 0,Category
0,Bakery
1,Coffee Shop
2,Distribution Center
3,Spa
4,Restaurant
...,...
513,Sushi Restaurant
514,Indian Restaurant
515,Ethiopian Restaurant
516,Café


In [167]:
print('There are {} unique categories.'.format(len(downtown_data['Category'].unique())))

There are 145 unique categories.


In [169]:
downtown_onehot = pd.get_dummies(downtown_data[['Category']])
downtown_onehot.head ()

Unnamed: 0,Category_Airport,Category_Airport Food Court,Category_Airport Gate,Category_Airport Lounge,Category_Airport Service,Category_Airport Terminal,Category_American Restaurant,Category_Aquarium,Category_Art Gallery,Category_Art Museum,...,Category_Thai Restaurant,Category_Theater,Category_Theme Restaurant,Category_Trail,Category_Train Station,Category_Vegetarian / Vegan Restaurant,Category_Video Game Store,Category_Vietnamese Restaurant,Category_Wine Bar,Category_Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [170]:
# one hot encoding
downtown_onehot = pd.get_dummies(downtown_data[['Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
downtown_onehot['Neighborhood'] = downtown_data['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [downtown_onehot.columns[-1]] + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.shape


(518, 145)
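Note that the shape reports 145 columns, although 145 category dummies plus the Neighborhood column would make 146. One possible explanation (an assumption, not verified against the data) is that one of the Foursquare categories is itself named "Neighborhood", so the assignment overwrites that dummy column instead of adding a new one. The toy example below reproduces the effect and sketches a defensive variant. The column name `Neighborhood_Name` is hypothetical, and later cells would have to group by that name.

```python
import pandas as pd

# Toy data where one venue category is literally called "Neighborhood"
toy = pd.DataFrame({
    "Neighborhood": ["Christie", "Christie", "Berczy Park"],
    "Category": ["Café", "Neighborhood", "Park"],
})

onehot = pd.get_dummies(toy[["Category"]], prefix="", prefix_sep="")
n_dummies = onehot.shape[1]  # 3 dummy columns: Café, Neighborhood, Park

# Same pattern as the cell above: the assignment replaces the dummy column
# named "Neighborhood", so the column count does not grow.
clobbered = onehot.copy()
clobbered["Neighborhood"] = toy["Neighborhood"]

# Defensive variant: keep every dummy and insert the label first
# under a name that cannot collide with a category.
safe = onehot.copy()
safe.insert(0, "Neighborhood_Name", toy["Neighborhood"])
```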

In [171]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,...,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0
1,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.033333,0.033333,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0
5,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0
7,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Harbourfront East, Union Station, Toronto Islands",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
9,"Kensington Market, Chinatown, Grange Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.066667,0.033333


In [172]:
downtown_grouped.shape

(19, 145)

In [173]:
num_top_venues = 5

for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0            Beer Bar  0.07
1      Farmers Market  0.07
2         Coffee Shop  0.07
3        Cocktail Bar  0.07
4  Seafood Restaurant  0.07


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue  freq
0   Airport Service  0.18
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3       Coffee Shop  0.06
4               Bar  0.06


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.23
1                Café  0.07
2  Italian Restaurant  0.07
3         Yoga Studio  0.03
4    Sushi Restaurant  0.03


----Christie----
           venue  freq
0  Grocery Store  0.24
1           Café  0.18
2           Park  0.12
3    Candy Store  0.06
4     Baby Store  0.06


----Church and Wellesley----
              venue  freq
0       Coffee Shop  0.07
1  Sushi Restaurant  0.07
2       Pizza Place  0.03
3              Café  0.03
4       Escape Room  0.0

In [174]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [175]:
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # no 'st', 'nd' or 'rd' suffix left: fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe 
downtown_grouped_sorted = pd.DataFrame(columns=columns)
downtown_grouped_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
   downtown_grouped_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)

downtown_grouped_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Berczy Park,Cocktail Bar,Farmers Market,Seafood Restaurant,Beer Bar,Coffee Shop
1,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Plane,Boat or Ferry
2,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Yoga Studio,Art Museum
3,Christie,Grocery Store,Café,Park,Coffee Shop,Nightclub
4,Church and Wellesley,Coffee Shop,Sushi Restaurant,Indian Restaurant,Salon / Barbershop,Juice Bar
5,"Commerce Court, Victoria Hotel",Café,Hotel,Restaurant,Japanese Restaurant,Gastropub
6,"First Canadian Place, Underground city",Coffee Shop,Café,Restaurant,Seafood Restaurant,Sandwich Place
7,"Garden District, Ryerson",Café,Coffee Shop,Theater,Burger Joint,Bookstore
8,"Harbourfront East, Union Station, Toronto Islands",Hotel,Park,Café,Plaza,Performing Arts Venue
9,"Kensington Market, Chinatown, Grange Park",Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Caribbean Restaurant,Mexican Restaurant


### Cluster the Toronto data

In [176]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters. 
# Warning, If you change kcluster here, you should re-evaluate the block above.
kclusters = 10

downtown_grouped_clustering = downtown_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(init="k-means++", n_clusters=kclusters, n_init=12).fit(downtown_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 4, 7, 5, 6, 3, 3, 6, 6, 8, 7, 7, 3, 2, 1, 0, 1, 3, 9],
      dtype=int32)
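The labels above come from a run without a fixed seed, so re-running the cell may number or group the neighbourhoods differently. A minimal sketch of a reproducible fit uses the `random_state` parameter of `KMeans`. The seed value 42, the helper `fit_labels`, the number of clusters and the random toy matrix are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for downtown_grouped_clustering: 19 rows, 5 features
rng = np.random.RandomState(0)
features = rng.rand(19, 5)


def fit_labels(data, seed):
    """Fit k-means with a fixed seed and return the cluster labels."""
    model = KMeans(init="k-means++", n_clusters=4, n_init=12, random_state=seed)
    return model.fit(data).labels_


# Two independent fits with the same seed give the same labels
labels_a = fit_labels(features, seed=42)
labels_b = fit_labels(features, seed=42)
```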

In [177]:
# add clustering labels to downtown_grouped_sorted
downtown_grouped_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
downtown_grouped_sorted.head ()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,1,Berczy Park,Cocktail Bar,Farmers Market,Seafood Restaurant,Beer Bar,Coffee Shop
1,4,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Plane,Boat or Ferry
2,7,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Yoga Studio,Art Museum
3,5,Christie,Grocery Store,Café,Park,Coffee Shop,Nightclub
4,6,Church and Wellesley,Coffee Shop,Sushi Restaurant,Indian Restaurant,Salon / Barbershop,Juice Bar


In [178]:
# join downtown_data with downtown_grouped_sorted to add the cluster label and top venues to each venue row
downtown_merged = downtown_data.join(downtown_grouped_sorted.set_index('Neighborhood'), on='Neighborhood')

downtown_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,B_Lat,B_Long,Venue,V_Lat,V_Long,Category,Borough_ID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery,1,7,Coffee Shop,Park,Café,Theater,Breakfast Spot
1,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop,1,7,Coffee Shop,Park,Café,Theater,Breakfast Spot
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center,1,7,Coffee Shop,Park,Café,Theater,Breakfast Spot
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa,1,7,Coffee Shop,Park,Café,Theater,Breakfast Spot
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant,1,7,Coffee Shop,Park,Café,Theater,Breakfast Spot


In [179]:
columns= ['Borough', 'Neighborhood', 'Category', 'Cluster Labels']
downtown_merged[columns]

Unnamed: 0,Borough,Neighborhood,Category,Cluster Labels
0,Downtown Toronto,"Regent Park, Harbourfront",Bakery,7
1,Downtown Toronto,"Regent Park, Harbourfront",Coffee Shop,7
2,Downtown Toronto,"Regent Park, Harbourfront",Distribution Center,7
3,Downtown Toronto,"Regent Park, Harbourfront",Spa,7
4,Downtown Toronto,"Regent Park, Harbourfront",Restaurant,7
...,...,...,...,...
513,Downtown Toronto,Church and Wellesley,Sushi Restaurant,6
514,Downtown Toronto,Church and Wellesley,Indian Restaurant,6
515,Downtown Toronto,Church and Wellesley,Ethiopian Restaurant,6
516,Downtown Toronto,Church and Wellesley,Café,6


In [180]:
downtown_grouped_sorted.head ()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,1,Berczy Park,Cocktail Bar,Farmers Market,Seafood Restaurant,Beer Bar,Coffee Shop
1,4,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Plane,Boat or Ferry
2,7,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Yoga Studio,Art Museum
3,5,Christie,Grocery Store,Café,Park,Coffee Shop,Nightclub
4,6,Church and Wellesley,Coffee Shop,Sushi Restaurant,Indian Restaurant,Salon / Barbershop,Juice Bar


### plot the clusters on the map

In [181]:
def add_categorical_legend(folium_map, title, colors, labels):
    if len(colors) != len(labels):
        raise ValueError("colors and labels must have the same length.")

    color_by_label = dict(zip(labels, colors))
    
    legend_categories = ""     
    for label, color in color_by_label.items():
        legend_categories += f"<li><span style='background:{color}'></span>{label}</li>"
        
    legend_html = f"""
    <div id='maplegend' class='maplegend'>
      <div class='legend-title'>{title}</div>
      <div class='legend-scale'>
        <ul class='legend-labels'>
        {legend_categories}
        </ul>
      </div>
    </div>
    """
    script = f"""
        <script type="text/javascript">
        var oneTimeExecution = (function() {{
                    var executed = false;
                    return function() {{
                        if (!executed) {{
                             var checkExist = setInterval(function() {{
                                       if ((document.getElementsByClassName('leaflet-top leaflet-right').length) || (!executed)) {{
                                          document.getElementsByClassName('leaflet-top leaflet-right')[0].style.display = "flex"
                                          document.getElementsByClassName('leaflet-top leaflet-right')[0].style.flexDirection = "column"
                                          document.getElementsByClassName('leaflet-top leaflet-right')[0].innerHTML += `{legend_html}`;
                                          clearInterval(checkExist);
                                          executed = true;
                                       }}
                                    }}, 100);
                        }}
                    }};
                }})();
        oneTimeExecution()
        </script>
      """
   

    css = """

    <style type='text/css'>
      .maplegend {
        z-index:9999;
        float:right;
        background-color: rgba(255, 255, 255, 1);
        border-radius: 5px;
        border: 2px solid #bbb;
        padding: 10px;
        font-size:12px;
        position: relative;
      }
      .maplegend .legend-title {
        text-align: left;
        margin-bottom: 5px;
        font-weight: bold;
        font-size: 90%;
        }
      .maplegend .legend-scale ul {
        margin: 0;
        margin-bottom: 5px;
        padding: 0;
        float: left;
        list-style: none;
        }
      .maplegend .legend-scale ul li {
        font-size: 80%;
        list-style: none;
        margin-left: 0;
        line-height: 18px;
        margin-bottom: 2px;
        }
      .maplegend ul.legend-labels li span {
        display: block;
        float: left;
        height: 16px;
        width: 30px;
        margin-right: 5px;
        margin-left: 0;
        border: 0px solid #ccc;
        }
      .maplegend .legend-source {
        font-size: 80%;
        color: #777;
        clear: both;
        }
      .maplegend a {
        color: #777;
        }
    </style>
    """

    folium_map.get_root().header.add_child(folium.Element(script + css))

    return folium_map

In [182]:
kclusters

10

In [190]:
import matplotlib.cm as cm
import matplotlib.colors as colors


# create map
map_clusters = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=12)

# add a descriptive title to it
title= "Venues by cluster in Downtown Toronto"
subtitle= 'click on a marker to see the Category:Venue Name (cluster number)'
title_html = '''
             <h2 align="center" style="font-size:16px"><b>{}</b></h2>
             <h3 align="center" style="font-size:14px"><b>{}</b></h3>

             '''.format(title, subtitle) 
map_clusters.get_root().html.add_child(folium.Element(title_html))

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, venue, categ, cluster in zip(downtown_merged['V_Lat'], downtown_merged['V_Long'], downtown_merged['Venue'], downtown_merged['Category'], downtown_merged['Cluster Labels']):
    label_str= "{}:{}({})".format (categ, venue, cluster)
    popup = folium.Popup (label_str, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        color=rainbow[cluster],
        popup= popup,
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

# add a legend on the right upper corner
clusters_colors= []
clusters_legends= []
for cluster in range (kclusters):
    clusters_colors.append (rainbow[cluster])
    clusters_legends.append ("cluster {}". format (cluster))

map_clusters= add_categorical_legend(map_clusters, 'Clusters',
                             colors = clusters_colors,
                           labels = clusters_legends)
print ('done.')


done.


In [191]:
map_clusters

Let's inspect the clusters created.

For each cluster, print the 3 most common venues assigned to that cluster, so you can see a pattern emerging.
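The same kind of per-cluster summary can also be obtained with `groupby`. A small sketch on toy data (the dataframe `toy` is made up, with column names matching `downtown_merged`):

```python
import pandas as pd

# Toy stand-in for downtown_merged
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1, 1],
    "1st Most Common Venue": ["Café", "Café", "Park", "Hotel", "Park"],
})

# number of venue rows per cluster
sizes = toy.groupby("Cluster Labels").size()

# distinct "1st Most Common Venue" values per cluster, in order of appearance
firsts = toy.groupby("Cluster Labels")["1st Most Common Venue"].unique()
```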

In [185]:
t= 0
for i in range (kclusters):
    cluster= downtown_merged.loc[downtown_merged['Cluster Labels'] == i, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]
    cluster.reset_index(drop=True, inplace=True)
    n= len (cluster)
    t= t+n
    print ("cluster {} has {} venues and the most common venues (limited to 3) in this cluster are:".format (i, n))
    # print the distinct 1st, 2nd and 3rd most common venues found in this cluster
    for rank in ['1st', '2nd', '3rd']:
        most_common= list (cluster['{} Most Common Venue'.format (rank)].unique ())
        print ("{} most_common: ".format (rank), most_common)
    print ()
print ("the {} cluster sum {} venues.".format (kclusters, t))

cluster 0 has 30 venues and the most common venues (limited to 3) in this cluster are:
1st most_common:  ['Restaurant']
2nd most_common:  ['Bakery']
3rd most_common:  ['Café']

cluster 1 has 90 venues and the most common venues (limited to 3) in this cluster are:
1st most_common:  ['Café', 'Cocktail Bar']
2nd most_common:  ['Gastropub', 'Farmers Market', 'Restaurant']
3rd most_common:  ['Coffee Shop', 'Seafood Restaurant', 'Beer Bar']

cluster 2 has 4 venues and the most common venues (limited to 3) in this cluster are:
1st most_common:  ['Park']
2nd most_common:  ['Playground']
3rd most_common:  ['Trail']

cluster 3 has 120 venues and the most common venues (limited to 3) in this cluster are:
1st most_common:  ['Coffee Shop', 'Café']
2nd most_common:  ['Café', 'Coffee Shop', 'Hotel']
3rd most_common:  ['Seafood Restaurant', 'Restaurant']

cluster 4 has 17 venues and the most common venues (limited to 3) in this cluster are:
1st most_common:  ['Airport Service']
2nd most_common:  ['Air