# Segmenting and Clustering Neighborhoods in Toronto
by Hugo Bertini @ 2020.05.24

### **this is part 1/3 of the assignment: "Scraping into a Dataframe"**   
We begin by scraping the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) into a pandas Dataframe:


_first we import the libraries we need:_

In [1]:
#first we import the libraries we need:
import requests
import urllib.request
import time
import pickle
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


_then we perform the request:_

In [2]:
#then we perform the request:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
print(response)

#a response of 200 means we received the page we asked for, so let's continue:
soup = BeautifulSoup(response.text, "html.parser")

<Response [200]>


_now that we have our soup ready, let's see what tables are available:_

In [3]:
#now that we have our soup ready, let's see what tables are available.
#as the output might be impractically big to display, we will limit it to 500 characters:
str(soup.findAll('table'))[0:500]

'[<table class="wikitable sortable">\n<tbody><tr>\n<th>Postal Code\n</th>\n<th>Borough\n</th>\n<th>Neighborhood\n</th></tr>\n<tr>\n<td>M1A\n</td>\n<td>Not assigned\n</td>\n<td>\n</td></tr>\n<tr>\n<td>M2A\n</td>\n<td>Not assigned\n</td>\n<td>\n</td></tr>\n<tr>\n<td>M3A\n</td>\n<td>North York\n</td>\n<td>Parkwoods\n</td></tr>\n<tr>\n<td>M4A\n</td>\n<td>North York\n</td>\n<td>Victoria Village\n</td></tr>\n<tr>\n<td>M5A\n</td>\n<td>Downtown Toronto\n</td>\n<td>Regent Park, Harbourfront\n</td></tr>\n<tr>\n<td>M6A\n</td>\n<td>North York\n</td>\n<td>'

_it looks like our table is the first one! so let's grab it and extract the data:_

In [4]:
#it looks like our table is the first one. let's grab it and extract the data:
pc_table = soup.find_all('table')[0]
pc_table_body = pc_table.find('tbody')
pc_table_rows = pc_table_body.find_all('tr')
table = []

#getting the table headers:
headers = pc_table_rows[0].find_all('th')
table.append([h.text.strip() for h in headers])

#getting the table data rows:
for row in pc_table_rows[1:]:
    cols = row.find_all('td')
    cols = [txt.text.strip() for txt in cols]
    #table.append([txt for txt in cols if txt])  # empty items are not added
    table.append([txt for txt in cols])
    
table[0:15]

[['Postal Code', 'Borough', 'Neighborhood'],
 ['M1A', 'Not assigned', ''],
 ['M2A', 'Not assigned', ''],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'],
 ['M6A', 'North York', 'Lawrence Manor, Lawrence Heights'],
 ['M7A', 'Downtown Toronto', "Queen's Park, Ontario Provincial Government"],
 ['M8A', 'Not assigned', ''],
 ['M9A', 'Etobicoke', 'Islington Avenue, Humber Valley Village'],
 ['M1B', 'Scarborough', 'Malvern, Rouge'],
 ['M2B', 'Not assigned', ''],
 ['M3B', 'North York', 'Don Mills'],
 ['M4B', 'East York', 'Parkview Hill, Woodbine Gardens'],
 ['M5B', 'Downtown Toronto', 'Garden District, Ryerson']]
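The same extraction can be tried offline on a small piece of HTML, which is handy when the wiki page is unreachable or has changed. This is only a sketch: `SAMPLE_HTML` is a hand-written stand-in that copies the layout printed above, and `table_to_rows` is a hypothetical helper name, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hand-written sample (assumption): same layout as the wiki table, only two data rows.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr></tbody></table>
"""

def table_to_rows(html):
    """Return the first table in `html` as a list of rows (header row first)."""
    soup = BeautifulSoup(html, "html.parser")
    first_table = soup.find("table")
    rows = []
    for tr in first_table.find_all("tr"):
        # header cells are <th>, data cells are <td>; empty cells are kept as ''
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

sample_rows = table_to_rows(SAMPLE_HTML)
print(sample_rows)
```

Keeping the empty cells matters here: every row then has three items, so the list can go straight into `pd.DataFrame`.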

_now we can create a pandas Dataframe:_

In [5]:
#now we can create our pandas Dataframe:

trtPC_df1 = pd.DataFrame(table[1:], columns=table[0])
print('table shape: {} rows x {} columns'.format(trtPC_df1.shape[0], trtPC_df1.shape[1]))
trtPC_df1.head(10)

table shape: 180 rows x 3 columns


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


**now we will wrangle the dataframe to have it ready for proper analysis:**

_Ignore empty cells and the ones with a borough that is 'Not assigned':_

In [6]:
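#let's discard rows whose Borough field is missing, empty or "Not assigned".
#.copy() gives us an independent dataframe, so later column assignments do not raise SettingWithCopyWarning:

trtPC_df1 = trtPC_df1[trtPC_df1['Borough'].notnull() & ~trtPC_df1['Borough'].isin(['', 'Not assigned'])].copy()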
print(trtPC_df1.shape)
trtPC_df1.head()

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
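Since the scraped cells are strings, a missing borough may show up as an empty string rather than as a null value. The sketch below tries the filter on a toy dataframe that contains all three cases; the data and the helper name `drop_unassigned` are made up for illustration.

```python
import pandas as pd

# Toy data (assumption): one good row plus the three "bad" variants.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "North York", "", None],
    "Neighborhood": ["", "Parkwoods", "Victoria Village", "Regent Park"],
})

def drop_unassigned(df, column="Borough"):
    """Keep rows whose `column` is present, non-blank and not 'Not assigned'."""
    values = df[column].fillna("").astype(str).str.strip()
    keep = (values != "") & (values != "Not assigned")
    return df[keep].copy()

toy_clean = drop_unassigned(toy)
print(toy_clean)
```

Only the `M3A` row survives, whichever way the missing borough was encoded.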


_for cases of empty or not assigned neighborhood, it will become the same as the borough:_

In [7]:
#although it looks like the original table in the wiki was cleaned up in advance, 
#we still treat those cases just in case the table is changed at origin:

trtPC_df1['Neighborhood'] = np.where((trtPC_df1['Neighborhood'].eq('Not assigned')) | (trtPC_df1['Neighborhood'].eq('')), trtPC_df1['Borough'], trtPC_df1['Neighborhood'])
trtPC_df1.head(15)

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


_When more than one neighborhood shares a postal code, they are combined into a single row:_

In [8]:
#here we temporarily alter one cell so that a postal code appears on more than one row, which lets us check that the grouping code is right.
#once testing is done this cell stays fully commented out, so we don't actually alter the data.
#we need this trick only because the wiki table already lists the neighborhoods together for the 103 postal codes that are assigned.

#trtPC_df1.iloc[1, 0]='M3A'
#trtPC_df1.head()

![dataframe head](./img/013image1.png)

_now we will aggregate the neighborhoods belonging to the same postal code,
so that we have one row per postal code with the corresponding neighborhoods listed in the 'Neighborhood' column.
Again, this seems to have been done already in the original table, but we process the data anyway in case the table changes in the future:_

In [9]:
#let's group by the postal code column and aggregate borough and neighborhood columns respectively to keep the first borough in each group, and the list of neighborhoods
trtPC_df = trtPC_df1.groupby(['Postal Code']).agg({'Borough':"first", 'Neighborhood':list}).reset_index()

In [10]:
#next we will get rid of the list representation brackets from the neighborhood column.
#please note the data corresponds to the original and not to the tweaked data used in testing this.
trtPC_df['Neighborhood'] = trtPC_df['Neighborhood'].apply(lambda n: ', '.join(n))
trtPC_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [11]:
#now let's check what the M3A row looks like after the grouping operations.
#With the tweaked test data we expect two neighborhoods listed, Parkwoods and Victoria Village.
#this cell is only for testing the code, not for the assignment submission,
#so it is submitted commented out.

#trtPC_df[trtPC_df['Postal Code']=='M3A'].head()

![dataframe head](./img/013image2.png)
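The grouping step can also be checked without touching the real data. The following is a minimal sketch on a toy dataframe (made-up rows in which `M3A` is repeated on purpose); it folds the list-building and the `', '.join` into a single aggregation.

```python
import pandas as pd

# Toy data (assumption): postal code M3A appears on two rows.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M3A", "M4A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Don Mills"],
})

# One row per postal code: keep the first borough, join the neighborhood names.
merged = (
    toy.groupby("Postal Code", as_index=False)
       .agg({"Borough": "first", "Neighborhood": ", ".join})
)
print(merged)
```

The `M3A` row ends up with both neighborhoods in one comma-separated string, which is the behaviour the commented-out test cells were checking.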

_using the **.shape** attribute to print the number of rows of our dataframe:_

In [12]:
#let's check the number of rows in the dataframe:
print('The dataframe has {} rows.'.format(trtPC_df.shape[0]))

The dataframe has 103 rows.


### **this is part 2/3 of the assignment: "Adding geolocation data"**   
Here we add geolocation data (latitude and longitude), obtained through the geocoder library, to the above-created pandas Dataframe:


_first we import the geocoder library:_   

In [13]:
import geocoder # import geocoder
from time import sleep

_After some experimenting with the geocoder library, I found that the Google provider returned **None** on every attempt.   
So I tried some other providers (from [geocoder](https://geocoder.readthedocs.io/)) and found that [ArcGIS](https://geocoder.readthedocs.io/providers/ArcGIS.html) responded promptly to manual requests.   
So let's be thankful and stick to this provider._

In [14]:
def get_coords_Toronto(pc, max_attempts=5):
    """Return (latitude, longitude) for a postal code, or (None, None) on failure."""
    # max_attempts controls the maximum number of calls to the location provider
    lat_lng_coords = None
    attempts = 0

    # keep asking the provider until it answers or we run out of attempts
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(pc))
        lat_lng_coords = g.latlng
        attempts += 1
        sleep(0.2)  # a small pause between requests, trying not to be kicked out.

    #print('coordinates for {} required {} call(s) to ArcGIS provider.'.format(pc, attempts))

    # without this check, indexing None below would raise a TypeError
    if lat_lng_coords is None:
        print('too many attempts for {}. quitting'.format(pc))
        return None, None

    latitude, longitude = lat_lng_coords
    return latitude, longitude

_let's test the function:_

In [15]:
print(get_coords_Toronto('M3A'))

(43.75293455500008, -79.33564142299997)
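The retry logic is easier to reason about when the provider call is passed in as a parameter, because it can then be tested without any network access. The sketch below assumes a hypothetical helper `geocode_with_retry` and a fake provider `flaky_lookup`; neither is part of the geocoder library.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, pause=0.2):
    """Call `lookup(query)` until it returns coordinates or attempts run out.

    `lookup` is any callable returning a (lat, lng) pair or None.
    Returns (None, None) when every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return tuple(coords)
        if attempt < max_attempts:
            time.sleep(pause)  # small pause before asking again
    return (None, None)

# Fake provider (assumption): fails twice, then answers.
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.75, -79.33]

print(geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario", pause=0))
print(geocode_with_retry(lambda q: None, "nowhere", max_attempts=2, pause=0))
```

With the real provider, the callable could be something like `lambda q: geocoder.arcgis(q).latlng`, which is the same call used in the function above.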


_recalling what the dataframe looks like:_

In [16]:
print('columns: {}'.format(', '.join(trtPC_df.columns)))
trtPC_df.tail()

columns: Postal Code, Borough, Neighborhood


Unnamed: 0,Postal Code,Borough,Neighborhood
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
102,M9W,Etobicoke,"Northwest, West Humber - Clairville"


_it's time to collect the coordinates. we'll store them in a pickles "jar" first to avoid having to repeat the call:_

In [17]:
coords_file = "./data/toronto-coords.p"
try:
    with open(coords_file, "rb") as f:
        coords = pickle.load(f)
    if len(coords) == 0:
        raise ValueError("No coordinates data found on file")
except (OSError, ValueError, EOFError, pickle.UnpicklingError):
    # no usable cache: request the coordinates from the provider (this takes a while) and store them
    coords = trtPC_df['Postal Code'].apply(get_coords_Toronto)
    with open(coords_file, "wb") as f:
        pickle.dump(coords, f)
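The "load from the jar, otherwise compute and store" pattern comes back later for the venues, so it can be sketched once as a small helper. This is only an illustration under assumptions: `load_or_compute` is a hypothetical name, and the demo writes into a temporary directory instead of `./data`.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(path, compute):
    """Return the object pickled at `path`, or compute, store and return it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            cached = pickle.load(fh)
        if len(cached) > 0:  # an empty container counts as "no cache"
            return cached
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)  # create ./data-like folders if missing
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result

# Demo: count how many times the "expensive" function really runs.
demo_calls = []

def expensive():
    demo_calls.append(1)
    return [(43.7, -79.4)]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "data" / "demo-coords.p"
    first = load_or_compute(cache_file, expensive)   # computes and stores
    second = load_or_compute(cache_file, expensive)  # served from the pickle

print(first == second, len(demo_calls))
```

Checking for the file explicitly, instead of catching every exception, means an unrelated error inside the computation is not mistaken for a missing cache.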

In [18]:
print(coords)
coords_df = pd.DataFrame(list(coords), columns=['latitude', 'longitude'])
coords_df.head()

0      (43.80862623100006, -79.18991284599997)
1      (43.78577865700004, -79.15736763799998)
2      (43.76580607300008, -79.18528434099994)
3      (43.77154467100007, -79.21813521299998)
4      (43.76879106300004, -79.23881306799996)
                        ...                   
98     (43.70549635400005, -79.52037009099996)
99     (43.69629612800003, -79.53312611699994)
100    (43.68688713700004, -79.56550730099997)
101    (43.74405485200003, -79.58120294599996)
102    (43.71161519300006, -79.58835079199997)
Length: 103, dtype: object


Unnamed: 0,latitude,longitude
0,43.808626,-79.189913
1,43.785779,-79.157368
2,43.765806,-79.185284
3,43.771545,-79.218135
4,43.768791,-79.238813


_now that we have the coordinates as a dataframe, we will add them as new columns to the desired dataframe **trtPC_df**:_

In [19]:
trtPC_df[['latitude', 'longitude']] = coords_df
trtPC_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,latitude,longitude
0,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.785779,-79.157368
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765806,-79.185284
3,M1G,Scarborough,Woburn,43.771545,-79.218135
4,M1H,Scarborough,Cedarbrae,43.768791,-79.238813
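The assignment above works because both dataframes share the same default index: `trtPC_df` was rebuilt with `reset_index()` after the grouping, and `coords_df` was created from a plain list. pandas matches rows by index label when a Series is assigned as a column, so the sketch below (toy data, made-up labels) shows what would happen if the labels differed, and a positional alternative.

```python
import pandas as pd

# Toy data (assumption): the left frame has labels 10..12, the coordinates 0..2.
left = pd.DataFrame({"Postal Code": ["M1B", "M1C", "M1E"]}, index=[10, 11, 12])
coords_demo = pd.DataFrame(
    {"latitude": [43.81, 43.79, 43.77], "longitude": [-79.19, -79.16, -79.19]}
)

# Label-based assignment: no label in common, so every value becomes NaN.
misaligned = left.copy()
misaligned["latitude"] = coords_demo["latitude"]

# Positional assignment: plain arrays carry no index, rows are matched by order.
positional = left.copy()
positional["latitude"] = coords_demo["latitude"].to_numpy()
positional["longitude"] = coords_demo["longitude"].to_numpy()

print(misaligned)
print(positional)
```

If the dataframe is ever filtered between the geocoding step and this assignment, resetting its index first (or assigning positionally as above) keeps the coordinates on the right rows.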


_before continuing, let's take a sneak peek at the one postal code we probed before:_

In [20]:
trtPC_df[trtPC_df['Postal Code']=='M3A']

Unnamed: 0,Postal Code,Borough,Neighborhood,latitude,longitude
25,M3A,North York,Parkwoods,43.752935,-79.335641


_as we can see below, the dataframe has 103 postal codes, and the **min** and **max** rows give the corner coordinates of the rectangle that bounds them:_

In [21]:
trtPC_df.describe(include='all')

Unnamed: 0,Postal Code,Borough,Neighborhood,latitude,longitude
count,103,103,103,103.0,103.0
unique,103,10,99,,
top,M3M,North York,Downsview,,
freq,1,24,4,,
mean,,,,43.704481,-79.394989
std,,,,0.052814,0.09487
min,,,,43.600895,-79.588351
25%,,,,43.656781,-79.450931
50%,,,,43.696448,-79.385653
75%,,,,43.746551,-79.350151


_finally, here is our dataframe with the coordinates:_

In [22]:
trtPC_df.head(15)

Unnamed: 0,Postal Code,Borough,Neighborhood,latitude,longitude
0,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.785779,-79.157368
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765806,-79.185284
3,M1G,Scarborough,Woburn,43.771545,-79.218135
4,M1H,Scarborough,Cedarbrae,43.768791,-79.238813
5,M1J,Scarborough,Scarborough Village,43.744203,-79.228725
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.726881,-79.265694
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.71334,-79.284942
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.723538,-79.228353
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.696448,-79.265642


### **this is part 3/3 of the assignment: "Clustering Toronto's neighborhoods"**   
Here we explore and cluster the neighborhoods in Toronto, based on information from Foursquare:


_let's import the required libraries:_

In [23]:
import time
import json # library to handle JSON files
import folium # map rendering library
import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe


_now we draw a map of Toronto with the neighborhoods on top of it:_

In [24]:
#getting the central coordinates of Toronto
g = geocoder.arcgis('Toronto, Ontario')
trt_lat, trt_lon = g.latlng

In [25]:
# create map using latitude and longitude values
map_toronto = folium.Map(location=[trt_lat, trt_lon], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(trtPC_df['latitude'], trtPC_df['longitude'], trtPC_df['Borough'], trtPC_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

_Now we will use the Foursquare service to gather information about Toronto Neighborhoods:_

However, for illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in central Toronto, that is, the boroughs whose name contains 'Toronto'. So let's slice the original dataframe and create a new dataframe from the central Toronto data.

In [26]:
trt_central_df = trtPC_df[trtPC_df['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(trt_central_df.describe())
trt_central_df.head()

        latitude  longitude
count  39.000000  39.000000
mean   43.666281 -79.390595
std     0.024061   0.035198
min    43.623750 -79.482692
25%    43.648668 -79.405198
50%    43.658720 -79.385649
75%    43.680021 -79.377118
max    43.729455 -79.295349


Unnamed: 0,Postal Code,Borough,Neighborhood,latitude,longitude
0,M4E,East Toronto,The Beaches,43.678148,-79.295349
1,M4K,East Toronto,"The Danforth West, Riverdale",43.683424,-79.354564
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668291,-79.315578
3,M4M,East Toronto,Studio District,43.648,-79.33926
4,M4N,Central Toronto,Lawrence Park,43.729455,-79.386415


In [27]:
CLIENT_ID = 'LWPPRHSKBQ3BFLAGZDAKYGZEXEXBAFPOQZQSKZT2IC4J24QW' # your Foursquare ID
CLIENT_SECRET = 'ZSWEL3T5T23PGV1WZGU4M5Y1MB2SK5L5MJATBKGXDLWVOSRI' # your Foursquare Secret
REDIRECT_URI = 'https://www.google.com'
VERSION = '20180605' # Foursquare API version
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: LWPPRHSKBQ3BFLAGZDAKYGZEXEXBAFPOQZQSKZT2IC4J24QW
CLIENT_SECRET:ZSWEL3T5T23PGV1WZGU4M5Y1MB2SK5L5MJATBKGXDLWVOSRI


#### Let's create a function to grab Foursquare information about the central neighborhoods in Toronto

In [28]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        #wait a bit before allowing the request to be performed
        #this tries to avoid lack of responses from foursquare
        time.sleep(2)
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
       
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
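The function above assumes that every response contains at least one group and that every venue has at least one category. A more defensive way to flatten a payload could look like the sketch below. It relies only on the nesting that `getNearbyVenues` already reads; the helper name `extract_venues` and the sample payload are hand-written assumptions, not a real API response.

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore-style payload into row tuples.

    `payload` is assumed to follow the nesting read by getNearbyVenues:
    payload["response"]["groups"][0]["items"][i]["venue"].
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # a venue without categories gets None instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written sample payload (assumption): one normal venue, one without category.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Glen Stewart Park",
                           "location": {"lat": 43.675278, "lng": -79.294647},
                           "categories": [{"name": "Park"}]}},
                {"venue": {"name": "Unnamed Spot",
                           "location": {"lat": 43.676, "lng": -79.295},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venues("The Beaches", 43.678148, -79.295349, sample_payload)
empty = extract_venues("Nowhere", 0.0, 0.0, {"response": {}})
print(rows)
print(empty)
```

A neighborhood with no venues then simply contributes no rows, instead of stopping the whole loop with a `KeyError` or `IndexError`.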

#### Now we run the above function on each neighborhood and create a new dataframe called *toronto_venues*.   
_in case the venues have already been retrieved, we will use that data instead of requesting it from Foursquare again. For this we keep the data as pickles._

In [29]:
venues_file = "./data/toronto_venues.p"
try:
    with open(venues_file, "rb") as f:
        toronto_venues = pickle.load(f)
    # check the venues we just loaded (not the coordinates)
    if len(toronto_venues) == 0:
        raise ValueError("No venues data found on file")
except (OSError, ValueError, EOFError, pickle.UnpicklingError):
    # no usable cache: request the venues from Foursquare (this takes a while) and store them
    toronto_venues = getNearbyVenues(names=trt_central_df['Neighborhood'],
                                     latitudes=trt_central_df['latitude'],
                                     longitudes=trt_central_df['longitude'])
    with open(venues_file, "wb") as f:
        pickle.dump(toronto_venues, f)

#### Let's check the size of the resulting dataframe

In [30]:
print(toronto_venues.shape)
toronto_venues.head()

(1584, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.678148,-79.295349,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.678148,-79.295349,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.678148,-79.295349,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.678148,-79.295349,Dip 'n Sip,43.678897,-79.297745,Coffee Shop
4,The Beaches,43.678148,-79.295349,Glen Stewart Park,43.675278,-79.294647,Park


Let's check how many venues were returned for each neighborhood

In [31]:
toronto_venues.groupby('Neighborhood')[['Venue']].count()

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
Berczy Park,66
"Brockton, Parkdale Village, Exhibition Place",43
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",100
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",63
Central Bay Street,54
Christie,12
Church and Wellesley,83
"Commerce Court, Victoria Hotel",100
Davisville,27
Davisville North,6


#### Let's find out how many unique categories can be curated from all the returned venues

In [32]:
print('There are {} unique categories.'.format(toronto_venues['Venue Category'].nunique() - 1))  # minus 1: one venue category is literally called 'Neighborhood', and we drop it in the next step

There are 222 unique categories.


<a id='item3'></a>

## 3. Analyze Each Neighborhood

We will use one-hot encoding to relate venue categories to the neighborhoods:

In [33]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# one venue category is itself called 'Neighborhood'; drop its dummy column so it does not clash with the neighborhood name column inserted below
toronto_onehot.drop('Neighborhood', inplace=True, axis=1)

# move neighborhood column to the first column
toronto_onehot.insert(0, 'Neighborhood', toronto_venues['Neighborhood'] )
toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [34]:
toronto_onehot.shape

(1584, 223)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [35]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.015152
1,"Brockton, Parkdale Village, Exhibition Place",0.023256,0.0,0.0,0.023256,0.0,0.023256,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.02,0.01,0.0,0.0,0.02,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.018519,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018519,0.018519,0.018519,0.0,0.0,0.0


#### Let's confirm the new size

In [36]:
toronto_grouped.shape

(39, 223)
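To see what the one-hot encoding and the per-neighborhood mean produce, here is a minimal sketch on a toy venue list (made-up neighborhoods `A` and `B`). It also computes the same table with `pd.crosstab(..., normalize="index")`, which is offered only as an alternative route, not as what the notebook uses.

```python
import pandas as pd

# Toy data (assumption): 4 venues in neighborhood A, 2 in neighborhood B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Pub", "Park", "Park"],
})

# Route 1: one-hot encode, then average the indicator columns per neighborhood.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
freq = onehot.groupby("Neighborhood").mean()

# Route 2: a row-normalised contingency table gives the same frequencies.
xt = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")

print(freq)
print(xt)
```

Each row sums to 1, so a value such as 0.5 for `Café` in neighborhood `A` reads as "half of the venues returned for A are cafés".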

#### Let's print each neighborhood along with the top 5 most common venues

In [37]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
          venue  freq
0   Coffee Shop  0.09
1  Cocktail Bar  0.05
2   Cheese Shop  0.03
3        Bakery  0.03
4         Hotel  0.03


----Brockton, Parkdale Village, Exhibition Place----
         venue  freq
0         Café  0.07
1  Coffee Shop  0.07
2        Diner  0.05
3   Restaurant  0.05
4    Gift Shop  0.05


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                 venue  freq
0          Coffee Shop  0.07
1                Hotel  0.05
2           Restaurant  0.04
3                 Café  0.03
4  Japanese Restaurant  0.03


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
               venue  freq
0               Café  0.06
1        Coffee Shop  0.06
2               Park  0.05
3  French Restaurant  0.05
4         Restaurant  0.03


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.15
1                Plaza  0.04
2  

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [38]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
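A quick way to convince ourselves that the sorting does what we want is to run it on one hand-made row. The sketch below uses `Series.nlargest` as an alternative to sorting the whole row; the row values and the helper name `top_categories` are made up for illustration.

```python
import pandas as pd

# Toy row (assumption): same shape as one row of toronto_grouped.
row = pd.Series({"Neighborhood": "Demo", "Café": 0.4, "Park": 0.1, "Pub": 0.3, "Zoo": 0.2})

def top_categories(row, n):
    """Names of the n highest-frequency categories in a grouped row."""
    freqs = pd.to_numeric(row.drop("Neighborhood"))  # the mixed row is object dtype
    return list(freqs.nlargest(n).index)

print(top_categories(row, 3))
```

Dropping the name column by label, rather than by position as in `row.iloc[1:]`, keeps the helper working even if the column order changes.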

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [39]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Café,Cheese Shop,Hotel,Restaurant,Pub,Beer Bar
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Diner,Gift Shop,Restaurant,Thrift / Vintage Store,North Indian Restaurant,Brewery,Caribbean Restaurant,Sandwich Place
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Hotel,Restaurant,Japanese Restaurant,Café,Salon / Barbershop,Seafood Restaurant,Burrito Place,Steakhouse,Gym
3,"CN Tower, King and Spadina, Railway Lands, Har...",Café,Coffee Shop,French Restaurant,Park,Lounge,Restaurant,Speakeasy,Gym / Fitness Center,Italian Restaurant,Bar
4,Central Bay Street,Coffee Shop,Bubble Tea Shop,Japanese Restaurant,Clothing Store,Plaza,Middle Eastern Restaurant,Restaurant,Sandwich Place,Chinese Restaurant,Poke Place
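The column names above are built with a three-item suffix list and a fallback to 'th', which is fine for 10 columns. If the number of top venues ever grows past 20, a small helper gets the suffixes right (21st, 22nd, but 11th, 12th). This is a sketch; `ordinal` and `columns_demo` are hypothetical names.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th, 111th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Same column layout as in the cell above, for the top 10 venues.
columns_demo = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 11)
]
print(columns_demo)
```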


<a id='item4'></a>

## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [40]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')  # the positional axis argument is no longer accepted by recent pandas versions

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 2, 0, 0, 0, 2])
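The number of clusters was fixed at 5 above. One common way to sanity-check such a choice is to compare the inertia (within-cluster sum of squares) for several values of k and look for the point where it stops dropping quickly. The sketch below does this on a tiny made-up matrix with two obvious groups; it illustrates the idea and is not an analysis of the Toronto data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency-like vectors (assumption): rows 0-2 and rows 3-5 form two groups.
X = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.85, 0.15, 0.00],
    [0.00, 0.10, 0.90],
    [0.00, 0.20, 0.80],
    [0.05, 0.10, 0.85],
])

# Inertia for a few candidate values of k.
inertias = {}
for k in (1, 2, 3):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 4))

# With k=2 the two groups should be recovered.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Passing `n_init` explicitly keeps the result independent of the library's default, which has changed between scikit-learn versions. The label numbers themselves are arbitrary, so only the grouping matters.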

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [41]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = trt_central_df

# join the sorted venues (with their cluster labels) onto trt_central_df, so each neighborhood keeps its latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.678148,-79.295349,0,Health Food Store,Pub,Trail,Park,Coffee Shop,Dog Run,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm
1,M4K,East Toronto,"The Danforth West, Riverdale",43.683424,-79.354564,2,Bus Line,Business Service,Grocery Store,Park,Discount Store,Yoga Studio,Eastern European Restaurant,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668291,-79.315578,0,Sandwich Place,Pizza Place,Fast Food Restaurant,Board Shop,Movie Theater,Sushi Restaurant,Italian Restaurant,Restaurant,Food & Drink Shop,Steakhouse
3,M4M,East Toronto,Studio District,43.648,-79.33926,0,Business Service,Government Building,Night Market,Baseball Field,Yoga Studio,Eastern European Restaurant,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
4,M4N,Central Toronto,Lawrence Park,43.729455,-79.386415,4,Swim School,Bus Line,Yoga Studio,Food & Drink Shop,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant


Finally, let's visualize the resulting clusters

In [42]:
# create map
map_clusters = folium.Map(location=[trt_lat, trt_lon], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['latitude'], toronto_merged['longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
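The colour scheme boils down to picking one colour per cluster from the rainbow colormap. A minimal sketch of that step on its own is shown below; `cluster_palette` and `color_of` are hypothetical names.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def cluster_palette(k):
    """Return k hex colours evenly spaced along matplotlib's rainbow colormap."""
    return [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

palette = cluster_palette(5)

# Map each cluster label (0..4) directly to its colour.
color_of = {label: palette[label] for label in range(5)}
print(color_of)
```

Note that the cell above indexes the palette with `cluster-1`, so label 0 picks the last colour through Python's negative indexing. Indexing by the label itself, as in this sketch, gives the same set of colours in a more predictable order.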

<a id='item5'></a>

## 5. Examine Clusters

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.

#### Cluster 1

In [43]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Health Food Store,Pub,Trail,Park,Coffee Shop,Dog Run,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm
2,East Toronto,0,Sandwich Place,Pizza Place,Fast Food Restaurant,Board Shop,Movie Theater,Sushi Restaurant,Italian Restaurant,Restaurant,Food & Drink Shop,Steakhouse
3,East Toronto,0,Business Service,Government Building,Night Market,Baseball Field,Yoga Studio,Eastern European Restaurant,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
7,Central Toronto,0,Dessert Shop,Italian Restaurant,Pizza Place,Sandwich Place,Coffee Shop,Café,Costume Shop,Seafood Restaurant,Gym,Diner
8,Central Toronto,0,Playground,Gym,Restaurant,Convenience Store,Creperie,Donut Shop,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
9,Central Toronto,0,Light Rail Station,Coffee Shop,Supermarket,Liquor Store,Athletics & Sports,Skating Rink,Yoga Studio,Electronics Store,Fish Market,Fish & Chips Shop
11,Downtown Toronto,0,Coffee Shop,Bakery,Café,Pizza Place,Pub,Restaurant,Italian Restaurant,Chinese Restaurant,Farm,Jewelry Store
12,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Restaurant,Café,Sushi Restaurant,Gay Bar,Pub,Smoke Shop,Dance Studio,Grocery Store
13,Downtown Toronto,0,Pub,Café,Athletics & Sports,Coffee Shop,Thai Restaurant,Mexican Restaurant,Seafood Restaurant,Performing Arts Venue,Food Truck,French Restaurant
14,Downtown Toronto,0,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Sandwich Place,Cosmetics Shop,Hotel,Restaurant,Café,Italian Restaurant,Bar


#### Cluster 2

In [44]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,Central Toronto,1,Park,Yoga Studio,Dog Run,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Ethiopian Restaurant


#### Cluster 3

In [45]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East Toronto,2,Bus Line,Business Service,Grocery Store,Park,Discount Store,Yoga Studio,Eastern European Restaurant,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
5,Central Toronto,2,Breakfast Spot,Bus Line,Gym,Department Store,Park,Food & Drink Shop,Yoga Studio,Electronics Store,Fish Market,Fish & Chips Shop
6,Central Toronto,2,Playground,Gym Pool,Park,Garden,Yoga Studio,Dog Run,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm
10,Downtown Toronto,2,Playground,Grocery Store,Park,Candy Store,Dog Run,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
30,Downtown Toronto,2,Café,Grocery Store,Playground,Park,Candy Store,Baby Store,Athletics & Sports,Coffee Shop,Deli / Bodega,Electronics Store
31,West Toronto,2,Park,Furniture / Home Store,Grocery Store,Pharmacy,Café,Smoke Shop,Bank,Bakery,Bus Line,Pet Store


#### Cluster 4

In [46]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Downtown Toronto,3,Harbor / Marina,Farm,Park,Theme Park,Yoga Studio,Dog Run,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market


#### Cluster 5

In [47]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,4,Swim School,Bus Line,Yoga Studio,Food & Drink Shop,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
