# Segmenting and Clustering Neighborhoods in Toronto

## For this assignment:

### 1) We need to get the table from Wikipedia and then create a pandas dataframe from it.

We start by importing the libraries we need.

In [1]:
import requests #to send all kinds of HTTP requests
import pandas as pd #data analysis tool for the python programming language
from bs4 import BeautifulSoup # library to parse HTML documents

To get the table from the page we perform an __HTTP request__ as follows:

In [2]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response=requests.get(url)
print(response.status_code)

200


We can inspect the HTML returned by the request:

In [3]:
#response.text

The next step is to __parse the HTML__ using BeautifulSoup. After inspecting the page, we found that the information is inside the __table__ tag with class __wikitable__.

In [4]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(response.text, 'html.parser')
toronto_table=soup.find('table',{'class':"wikitable"})

We can have a look at the table:

In [5]:
#toronto_table

We can see that the table is __already grouped by postal code__: the Neighbourhood column lists the neighborhoods that share a postal code, separated by commas.

Now, we create a pandas __dataframe__ from this table using the method __read_html__

In [6]:
#convert to dataframe
df_toronto=pd.read_html(str(toronto_table))[0]

We can have a look at the first rows of the dataframe:

In [7]:
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


As we can see, there are rows where the Borough column has the __value 'Not assigned'__. Since these rows are not useful, we will __clean__ our data __by dropping__ them.

So, we can select the indexes where the column Borough has a value equal to 'Not assigned'

In [8]:
indexes = df_toronto[ df_toronto['Borough'] == 'Not assigned' ].index

And now we use these indexes to __drop__ the rows

In [9]:
df_toronto.drop(indexes , inplace=True)
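
The same cleaning step can also be written as a boolean mask instead of an index lookup followed by `drop`. The sketch below uses a small toy dataframe (the values are illustrative, not the scraped table) and leaves the original frame untouched:

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose Borough is assigned; `toy` itself is not modified.
toy_clean = toy[toy["Borough"] != "Not assigned"]
print(toy_clean)
```

Filtering this way keeps the original row labels, just as `drop` does, so later lookups by index label behave the same.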

Now we can check the frequency of each value in the Borough column. As we can see, no 'Not assigned' value remains in the column.

In [10]:
frequency_Borough = df_toronto['Borough'].value_counts()

In [11]:
frequency_Borough

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Name: Borough, dtype: int64

We can check whether any row has a Neighbourhood value of 'Not assigned':

In [12]:
indexes_2= df_toronto[ df_toronto['Neighbourhood'] == 'Not assigned' ].index
indexes_2

Int64Index([], dtype='int64')

After dropping the rows where Borough equals 'Not assigned', no row has a Neighbourhood equal to 'Not assigned', so no further action is needed to reproduce the requested table. We can now show the __final__ version of the __dataframe__.

In [13]:
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


And finally, we print the number of rows of our dataframe using the __shape__ attribute of the pandas dataframe.

In [14]:
print('number of rows: ', df_toronto.shape[0])

number of rows:  103


### 2) Add coordinates of each neighborhood to the data frame

To do this we need to get the latitude and longitude coordinates of each neighborhood. We will use the geocoder Python package.

In [15]:
import geocoder

We define a function that returns the latitude and the longitude for a given postal code.

In [16]:
def get_geocoder(postal_code_from_df):
    # initialize the coordinates to None
    lat_lng_coords = None
    # loop until the geocoder returns coordinates
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code_from_df))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude
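
A loop that repeats until the geocoder answers never ends if the service keeps returning nothing. One possible safeguard is to cap the number of attempts. The helper below is a hypothetical sketch, not part of the geocoder package: `fetch_with_retries` and its parameters are names introduced here for illustration, and the geocoder call is replaced by a stand-in so the example runs offline.

```python
import time

def fetch_with_retries(fetch, max_attempts=5, delay_seconds=0.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay_seconds)  # wait before the next attempt
    return None  # the caller decides how to handle a missing result

# Hypothetical stand-in for a geocoder call that fails twice, then succeeds.
answers = iter([None, None, (43.75245, -79.32991)])
coords = fetch_with_retries(lambda: next(answers))
print(coords)
```

With the real service, `fetch` could wrap the `geocoder.arcgis(...)` call and return its `latlng` attribute.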

We apply this function to the Postal Code column to build the Latitude and Longitude columns.

In [17]:
# zip() returns an iterator of tuples: the first items of each passed
# iterable are paired together, then the second items, and so on.
# zip(*pairs) therefore splits a sequence of (lat, lng) pairs into two tuples.
df_toronto['Latitude'], df_toronto['Longitude'] = zip(*df_toronto['Postal Code'].apply(get_geocoder))
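
To make the `zip(*...)` step concrete, here is a small offline sketch. `fake_lookup` is a hypothetical replacement for the geocoder function and its coordinates are rounded, illustrative values:

```python
import pandas as pd

codes = pd.Series(["M3A", "M4A"])

def fake_lookup(code):
    # Hypothetical lookup table standing in for the geocoder call.
    table = {"M3A": (43.75, -79.33), "M4A": (43.73, -79.31)}
    return table[code]

pairs = codes.apply(fake_lookup)      # Series of (lat, lng) tuples
latitudes, longitudes = zip(*pairs)   # transpose into two tuples
print(latitudes)
print(longitudes)
```

Each resulting tuple has one entry per row, so it can be assigned directly to a new dataframe column.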

In [18]:
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.75245,-79.32991
3,M4A,North York,Victoria Village,43.73057,-79.31306
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188


### 3) Explore and cluster the neighborhoods in Toronto.

In [19]:
# @hidden_cell

CLIENT_ID = 'your Foursquare ID' # your Foursquare ID
CLIENT_SECRET = 'your Foursquare Secret' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30

In [20]:
import numpy as np # library to handle data in a vectorized manner
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans # import k-means from clustering stage

We get the latitude and longitude values of Toronto, which we will need to center the map.

In [21]:
address = 'East Toronto, Ontario, Canada'  # East Toronto centers the map so that all neighborhoods are visible with zoom_start=11

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.72178945, -79.37402706301704.


We create a map of Toronto with neighborhoods using the coordinates on the dataframe

In [22]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) 

In [23]:
map_toronto

To __simplify__ the study, we narrow the boroughs down to the ones around Downtown Toronto. So, we __select only the boroughs whose name includes the word Toronto__.

In [24]:
# select the index of the rows whose Borough value includes the word Toronto
indexes_3= df_toronto[ df_toronto['Borough'].str.contains("Toronto")].index
indexes_3

Int64Index([  4,   6,  13,  22,  30,  31,  40,  41,  49,  50,  58,  59,  66,
             67,  68,  75,  76,  84,  93,  94, 102, 103, 104, 111, 112, 113,
            120, 121, 122, 129, 130, 138, 139, 147, 148, 156, 157, 165, 168],
           dtype='int64')
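
As a quick illustration of how `str.contains` selects rows, the sketch below filters a toy list of borough names. Note that it performs a substring match, so "East York" is excluded while "West Toronto" is kept:

```python
import pandas as pd

boroughs = pd.Series(["North York", "Downtown Toronto", "East York", "West Toronto"])

# Boolean mask: True where the borough name contains the substring "Toronto".
mask = boroughs.str.contains("Toronto")
print(boroughs[mask].tolist())
```

If the column could contain missing values, passing `na=False` to `str.contains` treats them as non-matches.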

In [25]:
df_toronto_center = df_toronto.loc[indexes_3]
df_toronto_center.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
22,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587
30,M4E,East Toronto,The Beaches,43.67709,-79.29547


In [26]:
address_downtown = 'Downtown Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address_downtown)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


In [27]:
# create map of Downtown Toronto using latitude and longitude values
map_downtown = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(df_toronto_center['Latitude'], df_toronto_center['Longitude'], df_toronto_center['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_downtown)  

In [28]:
map_downtown

__A)__ We can have a look at a __single neighborhood__; as an example, we select __Chinatown__.

In [29]:
df_toronto_center.loc[130, 'Neighbourhood']

'Kensington Market, Chinatown, Grange Park'

In [30]:
neighborhood_latitude = df_toronto_center.loc[130, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_toronto_center.loc[130, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_toronto_center.loc[130, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Kensington Market, Chinatown, Grange Park are 43.65351000000004, -79.39721999999995.


Using the Foursquare API, we get the __top 25 venues__ within __500 meters__ of the center of this neighborhood. To do this, we have to build the proper __URL__:

In [31]:
LIMIT = 25 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=your Foursquare ID&client_secret=your Foursquare Secret&v=20180604&ll=43.65351000000004,-79.39721999999995&radius=500&limit=25'
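
An alternative to formatting the query string by hand is to let the standard library encode the parameters. The sketch below is only an illustration: the credentials are placeholders, the coordinates are rounded, and `url_sketch` is a name introduced here.

```python
from urllib.parse import urlencode

# Placeholder credentials and rounded coordinates, for illustration only.
params = {
    "client_id": "YOUR_ID",
    "client_secret": "YOUR_SECRET",
    "v": "20180604",
    "ll": "43.65351,-79.39722",
    "radius": 500,
    "limit": 25,
}
base = "https://api.foursquare.com/v2/venues/explore"
url_sketch = base + "?" + urlencode(params)
print(url_sketch)
```

`urlencode` percent-encodes reserved characters (the comma in `ll` becomes `%2C`), which avoids malformed URLs when a value contains spaces or symbols.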

Using this URL, we __send an HTTP request__ and get back a __response__ with the information in __JSON format__.

In [32]:
results = requests.get(url).json()
#results

To get the categories from the response, we define the following function:

In [33]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to __clean the JSON response__ and __create__ a pandas __dataframe__. The information we need is stored under the items key.

In [34]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Banh Mi Nguyen Huong,Vietnamese Restaurant,43.653628,-79.398376
1,Meeplemart,Gaming Cafe,43.651628,-79.39741
2,The Moonbean Cafe,Café,43.654147,-79.400182
3,Essence of Life Organics,Organic Grocery,43.654111,-79.400431
4,El Rey,Cocktail Bar,43.652764,-79.400048


In [35]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

25 venues were returned by Foursquare.
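
To see what the flattening step does, the sketch below applies `pd.json_normalize` to two hand-written items that mimic the nested shape used above. The venue names and coordinates are made up, and the second item has an empty category list to show how that case can be handled:

```python
import pandas as pd

# Hand-written items mimicking the nested response shape (made-up values).
items = [
    {"venue": {"name": "Cafe A", "categories": [{"name": "Café"}],
               "location": {"lat": 43.65, "lng": -79.39}}},
    {"venue": {"name": "Bar B", "categories": [],
               "location": {"lat": 43.66, "lng": -79.40}}},
]

# Nested keys become dotted column names such as 'venue.location.lat'.
flat = pd.json_normalize(items)
print(sorted(flat.columns))

# Take the first category name, or None when the list is empty.
flat["category"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)
print(flat[["venue.name", "category"]])
```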


__B)__ Now we can __explore all the neighborhoods around the downtown__

First we define a function that will help us get the venues near each neighborhood.

In [36]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

We run this function over all the neighborhoods.

In [37]:
df_toronto_center_venues = getNearbyVenues(names=df_toronto_center['Neighbourhood'],
                                   latitudes=df_toronto_center['Latitude'],
                                   longitudes=df_toronto_center['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

In [38]:
print('There are {} unique categories.'.format(len(df_toronto_center_venues['Venue Category'].unique())))

There are 170 unique categories.


Now we start to analyze each neighborhood. We first one-hot encode the venue categories so that we can measure __how many venues of each category are in each neighborhood__.

In [39]:
# one hot encoding
toronto_center_onehot = pd.get_dummies(df_toronto_center_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_center_onehot['Neighbourhood'] = df_toronto_center_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_center_onehot.columns[-1]] + list(toronto_center_onehot.columns[:-1])
toronto_center_onehot = toronto_center_onehot[fixed_columns]

toronto_center_onehot.head()

Unnamed: 0,Neighbourhood,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We __group__ rows __by neighborhood__ and take the __mean__ of each one-hot column, which gives the __frequency of occurrence of each category__.

In [40]:
toronto_center_grouped = toronto_center_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_center_grouped

Unnamed: 0,Neighbourhood,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Berczy Park,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.04,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04
4,Central Bay Street,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.090909,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Commerce Court, Victoria Hotel",0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
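
The reason the mean of one-hot columns equals a category frequency can be checked on a tiny example. The sketch below uses made-up neighborhoods "A" and "B"; in "A", two of four venues are cafés, so the café frequency is 0.5:

```python
import pandas as pd

# Made-up venues: four in neighborhood A, two in neighborhood B.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Gym", "Park", "Park"],
})

# One column per category, 1 where the venue belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of 0/1 indicators per group = share of venues in each category.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Each row of the result sums to 1, which makes neighborhoods with different venue counts comparable.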


We look at the __top 5 venues__ for __each neighborhood__.

In [41]:
num_top_venues = 5

for hood in toronto_center_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_center_grouped[toronto_center_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venues','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
               venues  freq
0      Farmers Market  0.08
1  Seafood Restaurant  0.08
2        Cocktail Bar  0.08
3            Beer Bar  0.08
4              Museum  0.04


----Brockton, Parkdale Village, Exhibition Place----
                   venues  freq
0      Italian Restaurant  0.08
1  Furniture / Home Store  0.08
2              Restaurant  0.08
3             Coffee Shop  0.08
4             Supermarket  0.08


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                venues  freq
0                 Café  0.08
1          Coffee Shop  0.08
2           Restaurant  0.08
3         Concert Hall  0.08
4  American Restaurant  0.04


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
                 venues  freq
0    Italian Restaurant  0.12
1  Gym / Fitness Center  0.08
2                  Park  0.08
3           Yoga Studio  0.04
4     French Restaurant  0.04

We now need a dataframe that holds the most common venues of each neighborhood. First we define a function that sorts the categories of a row in descending order of frequency.

In [42]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In the new dataframe we will keep the __top 5 most common venues for each neighborhood__.

In [43]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_center_grouped['Neighbourhood']

for ind in np.arange(toronto_center_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_center_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Berczy Park,Cocktail Bar,Seafood Restaurant,Farmers Market,Beer Bar,Jazz Club
1,"Brockton, Parkdale Village, Exhibition Place",Coffee Shop,Restaurant,Furniture / Home Store,Italian Restaurant,Supermarket
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Concert Hall,Café,Restaurant,Sushi Restaurant
3,"CN Tower, King and Spadina, Railway Lands, Har...",Italian Restaurant,Park,Gym / Fitness Center,New American Restaurant,Sandwich Place
4,Central Bay Street,Coffee Shop,Plaza,New American Restaurant,Sandwich Place,Bubble Tea Shop
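
The ranking logic can be tried on a single row of frequencies. The sketch below uses made-up values without ties; `ordinal` is a hypothetical helper introduced here, and it is only meant for small ranks such as 1 to 10:

```python
import pandas as pd

# Made-up category frequencies for one neighborhood (no ties).
row = pd.Series({"Café": 0.40, "Park": 0.10, "Gym": 0.30, "Pub": 0.20})

# Sort by frequency, highest first, and keep the top three category names.
top3 = row.sort_values(ascending=False).index[:3].tolist()

def ordinal(n):
    # Hypothetical helper; correct for small ranks (1..10) only.
    return "{}{}".format(n, {1: "st", 2: "nd", 3: "rd"}.get(n, "th"))

labels = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 4)]
print(dict(zip(labels, top3)))
```

When several categories share the same frequency, their relative order depends on the sort, so tied positions should not be over-interpreted.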


Now we will __cluster the neighborhoods with k-means__

In [44]:
# set number of clusters
kclusters = 5

toronto_center_grouped_clustering = toronto_center_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_center_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
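
Since the first ten labels are all 0, it can be useful to count how many rows fall into each cluster before interpreting the result. The sketch below shows one way to do that on a toy dataset with two well-separated groups; the points are illustrative and `n_init=10` is set explicitly here as an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups of points (illustrative data only).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Number of points assigned to each cluster label.
sizes = np.bincount(model.labels_, minlength=2)
print(sizes)
```

Applied to `kmeans.labels_`, the same `np.bincount` call would show whether one cluster absorbs most neighborhoods.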

Finally we __merge all the information__ into a new dataframe that __includes the cluster as well as the top 5 venues for each neighborhood__

In [45]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_toronto_center_merged = df_toronto_center

# join the sorted venues and cluster labels onto df_toronto_center, which holds latitude/longitude for each neighborhood
df_toronto_center_merged = df_toronto_center_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
df_toronto_center_merged = df_toronto_center_merged.dropna()  # drop rows that have NaN values after the join

In [46]:
df_toronto_center_merged['Cluster Labels'] = df_toronto_center_merged['Cluster Labels'].astype(int)

In [47]:
df_toronto_center_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264,0,Coffee Shop,Breakfast Spot,Yoga Studio,Bakery,Restaurant
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188,0,Coffee Shop,Sandwich Place,Falafel Restaurant,Theater,Portuguese Restaurant
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804,0,Café,Ramen Restaurant,Theater,Clothing Store,Music Venue
22,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587,0,Gastropub,Coffee Shop,Restaurant,Japanese Restaurant,American Restaurant
30,M4E,East Toronto,The Beaches,43.67709,-79.29547,3,Trail,Health Food Store,Neighborhood,Pub,Yoga Studio


To __visualize the resulting clusters__ we create a new map with the proper information

In [49]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_toronto_center_merged['Latitude'], df_toronto_center_merged['Longitude'], df_toronto_center_merged['Neighbourhood'], df_toronto_center_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Now we __examine each cluster__ to try to understand which venue categories discriminate between the clusters

__Cluster 1__

In [50]:
df_toronto_center_merged.loc[df_toronto_center_merged['Cluster Labels'] == 0, df_toronto_center_merged.columns[[1] + list(range(5, df_toronto_center_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,Downtown Toronto,0,Coffee Shop,Breakfast Spot,Yoga Studio,Bakery,Restaurant
6,Downtown Toronto,0,Coffee Shop,Sandwich Place,Falafel Restaurant,Theater,Portuguese Restaurant
13,Downtown Toronto,0,Café,Ramen Restaurant,Theater,Clothing Store,Music Venue
22,Downtown Toronto,0,Gastropub,Coffee Shop,Restaurant,Japanese Restaurant,American Restaurant
31,Downtown Toronto,0,Cocktail Bar,Seafood Restaurant,Farmers Market,Beer Bar,Jazz Club
40,Downtown Toronto,0,Coffee Shop,Plaza,New American Restaurant,Sandwich Place,Bubble Tea Shop
41,Downtown Toronto,0,Café,Grocery Store,Playground,Athletics & Sports,Baby Store
49,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Seafood Restaurant,Sushi Restaurant
50,West Toronto,0,Grocery Store,Park,Bar,Convenience Store,Pharmacy
58,Downtown Toronto,0,Hotel,Coffee Shop,Park,Japanese Restaurant,Deli / Bodega


__Cluster 2__

In [51]:
df_toronto_center_merged.loc[df_toronto_center_merged['Cluster Labels'] == 1, df_toronto_center_merged.columns[[1] + list(range(5, df_toronto_center_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
93,Central Toronto,1,Bus Line,Swim School,Yoga Studio,Farmers Market,Falafel Restaurant


__Cluster 3__

In [52]:
df_toronto_center_merged.loc[df_toronto_center_merged['Cluster Labels'] == 2, df_toronto_center_merged.columns[[1] + list(range(5, df_toronto_center_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
103,Central Toronto,2,Park,French Restaurant,Yoga Studio,Discount Store,Event Space
111,Central Toronto,2,Playground,Park,Gym Pool,Yoga Studio,Discount Store
147,Downtown Toronto,2,Park,Shop & Service,Playground,Bike Trail,Tennis Court


__Cluster 4__

In [53]:
df_toronto_center_merged.loc[df_toronto_center_merged['Cluster Labels'] == 3, df_toronto_center_merged.columns[[1] + list(range(5, df_toronto_center_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
30,East Toronto,3,Trail,Health Food Store,Neighborhood,Pub,Yoga Studio


__Cluster 5__

In [54]:
df_toronto_center_merged.loc[df_toronto_center_merged['Cluster Labels'] == 4, df_toronto_center_merged.columns[[1] + list(range(5, df_toronto_center_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
129,Central Toronto,4,Gym,Playground,Trail,Yoga Studio,Diner


It seems that the __venues that were crucial for each cluster__ are:
- cluster 1, Coffee shops and restaurants
- cluster 2, Bus line
- cluster 3, Parks
- cluster 4, Trails
- cluster 5, Gyms
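
A reading like the one above can be cross-checked by computing the most frequent first-ranked venue per cluster. The sketch below does this on a made-up table with the same column names as the merged dataframe; it is an illustration, not the notebook's actual data:

```python
import pandas as pd

# Made-up rows using the merged dataframe's column names.
merged = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 2, 2],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Café",
                              "Bus Line", "Park", "Park"],
})

# For each cluster, pick the venue category that appears most often in first place.
summary = (merged.groupby("Cluster Labels")["1st Most Common Venue"]
                 .agg(lambda s: s.value_counts().idxmax()))
print(summary)
```

Clusters that contain a single neighborhood are summarized by that one row, so their label says more about the neighborhood than about a group.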