# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto, Matteo Vadi

---

# Part 1. Getting the dataframe.

### Importing the libraries needed for scraping html content

The requests library retrieves the HTML content of the website and lxml.html parses the relevant fields. Everything is then stored in a pandas DataFrame.

In [1]:
import requests
import lxml.html as lh
import pandas as pd

### Using requests and lxml for scraping data

In [2]:
# Setting the URL to the wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Creating an object to handle the contents of the website
webpage = requests.get(url)
# Storing the contents of the website under an object doc
doc = lh.fromstring(webpage.content)
# Parsing data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [3]:
# Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

We have now checked that each of these rows has 3 columns, which is consistent with our table. The next steps deal with the header of the HTML table.

In [4]:
# Create empty list
col = []
i = 0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))

1:"Postal Code
"
2:"Borough
"
3:"Neighbourhood
"


In [5]:
# First row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is the j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Skip the first column (postal code); for the others,
        #convert any purely numerical value to an integer
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
# Checking the consistency of number of rows for each column
[len(C) for (title,C) in col]

[181, 181, 181]
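
The row-by-row parsing above can also be illustrated on a tiny, self-contained example. The sketch below is not the method used in this notebook: it swaps lxml for the standard-library `html.parser`, and the `TableRows` class and the `sample_html` fragment are hypothetical names invented for illustration. It shows the same two ideas, namely collecting the cells of each `<tr>` and keeping only the rows that have exactly 3 cells, while also stripping the trailing newlines at parse time.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the stripped text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical HTML fragment standing in for the Wikipedia table
sample_html = """
<table>
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
<tr><td colspan="3">Some one-cell row</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
# Keep only rows with exactly 3 cells, which skips the one-cell row
header, *records = [row for row in parser.rows if len(row) == 3]
print(header)
print(records)
```

Stripping whitespace while parsing means no separate pass is needed later to remove the `\n` characters.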

Now let's build a dictionary (Dict) from the columns and load it into a pandas DataFrame (postcode).

In [6]:
# Creating a dictionary
Dict={title:column for (title,column) in col}
postcode=pd.DataFrame(Dict)

In [7]:
# Checking for the first 10 rows of the pandas DataFrame (postcode)
postcode.head(10)

,Postal Code\n,Borough\n,Neighbourhood\n
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
5,M6A\n,North York\n,"Lawrence Manor, Lawrence Heights\n"
6,M7A\n,Downtown Toronto\n,"Queen's Park, Ontario Provincial Government\n"
7,M8A\n,Not assigned\n,Not assigned\n
8,M9A\n,Etobicoke\n,"Islington Avenue, Humber Valley Village\n"
9,M1B\n,Scarborough\n,"Malvern, Rouge\n"


In [8]:
# Checking for the last 5 rows of the pandas DataFrame (postcode)
postcode.tail()

,Postal Code\n,Borough\n,Neighbourhood\n
176,M6Z\n,Not assigned\n,Not assigned\n
177,M7Z\n,Not assigned\n,Not assigned\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z\n,Not assigned\n,Not assigned\n
180,\n,Canadian postal codes\n,\n


Now we see that we have also imported a footer, which needs to be removed. Let's start preprocessing the data.

## Data Preprocessing

In [9]:
# Removing \n deriving from HTML text imported
postcode.replace('\n', '', regex = True, inplace = True)

In [10]:
# Changing columns' names
postcode.columns = ['PostalCode', 'Borough', 'Neighborhood']
postcode

,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z,Not assigned,Not assigned


In [11]:
# Removing the footer
postcode.drop(postcode[postcode.Borough == 'Canadian postal codes'].index, inplace = True)
postcode

,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [12]:
# Removing each not assigned borough
postcode.drop(postcode[postcode.Borough == 'Not assigned'].index, inplace = True)

In [13]:
# Sorting postcode by the PostalCode column for later use
postcode.sort_values('PostalCode', inplace = True)
postcode

,PostalCode,Borough,Neighborhood
9,M1B,Scarborough,"Malvern, Rouge"
18,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
27,M1E,Scarborough,"Guildwood, Morningside, West Hill"
36,M1G,Scarborough,Woburn
45,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
107,M9P,Etobicoke,Westmount
116,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
143,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [14]:
# Creating a new column to be set as index 
postcode['Index']=range(len(postcode))
postcode

,PostalCode,Borough,Neighborhood,Index
9,M1B,Scarborough,"Malvern, Rouge",0
18,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",1
27,M1E,Scarborough,"Guildwood, Morningside, West Hill",2
36,M1G,Scarborough,Woburn,3
45,M1H,Scarborough,Cedarbrae,4
...,...,...,...,...
98,M9N,York,Weston,98
107,M9P,Etobicoke,Westmount,99
116,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",100
143,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",101


In [15]:
# Setting new indices (from 0 to postcode length)
postcode.set_index('Index', inplace = True)
postcode

Index,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
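
The preprocessing cells above (stripping `\n`, renaming the columns, dropping the footer and the unassigned boroughs, sorting, and re-indexing) can be condensed into a few chained calls. The following is only a sketch on a hypothetical four-row frame named `raw`, not the notebook's actual data. It uses `reset_index(drop=True)` in place of the temporary `Index` column, which is an alternative to the approach taken above.

```python
import pandas as pd

# Hypothetical raw rows in the same shape as the scraped table
raw = pd.DataFrame({
    'Postal Code\n': ['M1A\n', 'M3A\n', 'M1B\n', '\n'],
    'Borough\n': ['Not assigned\n', 'North York\n', 'Scarborough\n', 'Canadian postal codes\n'],
    'Neighbourhood\n': ['Not assigned\n', 'Parkwoods\n', 'Malvern, Rouge\n', '\n'],
})

clean = raw.copy()
clean.columns = ['PostalCode', 'Borough', 'Neighborhood']
# Strip surrounding whitespace (including the trailing newline) from every cell
clean = clean.apply(lambda column: column.str.strip())
# Drop the footer row and the unassigned boroughs with one boolean mask
unwanted = clean['Borough'].isin(['Not assigned', 'Canadian postal codes'])
clean = (clean[~unwanted]
         .sort_values('PostalCode')
         .reset_index(drop=True))   # fresh 0..n-1 index, old index discarded
print(clean)
```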


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A covers two neighborhoods: Harbourfront and Regent Park. Such neighborhoods are combined into one row, separated by a comma, as the M5A row in the output above already shows.

In [16]:
# Checking whether any 'Not assigned' Neighborhood remains after removing the rows with a 'Not assigned' Borough
postcode[postcode['Neighborhood'] == 'Not assigned']

Index,PostalCode,Borough,Neighborhood


No rows are returned, so we can move on without changing anything.

In [17]:
# Printing the number of rows after preprocessing
postcode.shape

(103, 3)

The output cell above shows that there are 103 different postal codes, which we will use later on.

---

# Part 2. Getting the geographical coordinates.

In [18]:
geo_coo = pd.read_csv('/Users/matteovadi/Desktop/IBM certificate/Course 9 - Applied Data Science/Capstone project/Geospatial_Coordinates.csv', header = 0)
geo_coo

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


This dataframe is sorted by its Postal Code column, just as postcode is sorted by PostalCode, so the rows of the two dataframes line up.

In [19]:
# Storing last 2 columns into 2 objects
Latitude = geo_coo['Latitude']
Longitude = geo_coo['Longitude']

Now let's add these 2 columns to the postcode dataframe

In [20]:
postcode['Latitude'] = Latitude
postcode['Longitude'] = Longitude
postcode

Index,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
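
Assigning the two columns directly works here because both dataframes are sorted the same way and share the same 0..n-1 index. A more defensive alternative, sketched below on hypothetical miniature frames (`codes` and `coords` are illustrative names, not objects from this notebook), is to merge on the postal code itself so that the result does not depend on row order.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
})
coords = pd.DataFrame({
    'Postal Code': ['M1E', 'M1B', 'M1C'],   # deliberately in a different order
    'Latitude': [43.763573, 43.806686, 43.784535],
    'Longitude': [-79.188711, -79.194353, -79.160497],
})

merged = codes.merge(
    coords.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',              # keep every row of codes, in its original order
    validate='one_to_one',   # raise if a postal code is duplicated on either side
)
print(merged)
```

With `how='left'`, a postal code missing from the coordinates file would show up as NaN rather than silently shifting the other rows.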


# Part 3. Exploring and Clustering neighborhoods in Toronto.

In [21]:
# importing some libraries
import json 
import numpy as np
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
!pip install folium
import folium # map rendering library
print('Libraries imported.')

Libraries imported.


---

Now, let's create a new dataset containing only the neighborhoods in the city of Toronto (those whose Borough contains the word Toronto).

In [22]:
toronto = postcode[postcode['Borough'].str.contains('Toronto')]
toronto

Index,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


In [23]:
# Reset the index so it runs from 0 to len(toronto) - 1, as done before for postcode.
# reset_index returns a new DataFrame, so no value is set on a slice of postcode
toronto = toronto.reset_index(drop=True).rename_axis('Index')
toronto


Index,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


Now, let's get the geographical coordinates of Toronto.

In [24]:
# Toronto Geo Coordinates
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="matteo_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.6534817, -79.3839347.


In [25]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto['Latitude'], toronto['Longitude'], toronto['Borough'], toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [26]:
# Foursquare credentials are stored in a csv file. So let's load them
credentials = pd.read_csv('foursquare_credentials.csv')
CLIENT_ID = credentials.loc[0,'CLIENT_ID']
CLIENT_SECRET = credentials.loc[0,'CLIENT_SECRET']
VERSION = str(credentials.loc[0,'VERSION'])
LIMIT = 100 # A default Foursquare API limit value

## Exploring the very first neighborhood in our dataframe.

In [27]:
neighborhood_latitude = toronto.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = toronto.loc[0, 'Neighborhood'] # neighborhood name
print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


Now, let's get the top 50 venues within a radius of 1000 meters of The Beaches. To do so, let's first create the GET request URL.

In [28]:
# First, let's create the GET request URL
LIMIT = 50 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius 

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
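
As an aside, the same URL can be assembled from a dictionary of query parameters, which keeps each parameter next to its name and takes care of URL encoding. This is a minimal sketch: `build_explore_url` is a hypothetical helper, and the credentials and version string passed to it are placeholders, not real values.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with URL-encoded query parameters."""
    query = urlencode({
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    })
    return 'https://api.foursquare.com/v2/venues/explore?' + query

# Placeholder credentials and version, for illustration only
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20200101',
                                43.676357, -79.293031, 1000, 50)
print(example_url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is equivalent once the query string is decoded.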

Now, let's send the GET request and examine the results

In [29]:
results = requests.get(url).json()

In [30]:
# all the information is in the items key, so let's create a get_category_type function to extract the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [31]:
# now, cleaning the json and converting it to a pd dataframe
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,Tori's Bakeshop,Vegetarian / Vegan Restaurant,43.672114,-79.290331
2,The Fox Theatre,Indie Movie Theater,43.672801,-79.287272
3,The Beech Tree,Gastropub,43.680493,-79.288846
4,Beaches Bake Shop,Bakery,43.680363,-79.289692


In [32]:
# finally, let's see how many venues are returned by foursquare
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

50 venues were returned by Foursquare.


## Exploring the Neighborhoods in the dataset

In [33]:
# creating a function which gives back the venues for each neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    # now, cleaning the json and converting it to a pd dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
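
The function above indexes `categories[0]` directly and assumes every response contains at least one group. The sketch below separates the parsing step into a hypothetical helper, `extract_venues`, which tolerates venues without a category and responses without groups. The `sample_payload` dictionary is invented for illustration and only mimics the keys used above; it is not a real Foursquare response.

```python
def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one venues/explore response into a list of row tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((neighborhood, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hypothetical payload mimicking the structure used above
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.676821, 'lng': -79.293942},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Venue without category',
               'location': {'lat': 43.67, 'lng': -79.29},
               'categories': []}},
]}]}}

rows = extract_venues(sample_payload, 'The Beaches', 43.676357, -79.293031)
print(rows)
# A response without groups yields an empty list instead of raising
print(extract_venues({'response': {}}, 'The Beaches', 43.676357, -79.293031))
```

Keeping the parsing separate from the HTTP call also makes it possible to test it without network access.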

In [34]:
# now let's use the function in the dataset
toronto_venues = getNearbyVenues(names=toronto['Neighborhood'],
                                   latitudes=toronto['Latitude'],
                                   longitudes=toronto['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West,  Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High

In [35]:
# printing some information
print(toronto_venues.head())
print(toronto_venues.groupby('Neighborhood').count())
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

  Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0  The Beaches              43.676357              -79.293031   
1  The Beaches              43.676357              -79.293031   
2  The Beaches              43.676357              -79.293031   
3  The Beaches              43.676357              -79.293031   
4  The Beaches              43.676357              -79.293031   

               Venue  Venue Latitude  Venue Longitude  \
0  Glen Manor Ravine       43.676821       -79.293942   
1    Tori's Bakeshop       43.672114       -79.290331   
2    The Fox Theatre       43.672801       -79.287272   
3     The Beech Tree       43.680493       -79.288846   
4  Beaches Bake Shop       43.680363       -79.289692   

                  Venue Category  
0                          Trail  
1  Vegetarian / Vegan Restaurant  
2            Indie Movie Theater  
3                      Gastropub  
4                         Bakery  
                                                    Neig

## Analyzing each Neighborhood

In [36]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# printing some info
print(toronto_onehot.head())
print(toronto_onehot.shape)

   Zoo  Adult Boutique  Airport  Airport Lounge  American Restaurant  \
0    0               0        0               0                    0   
1    0               0        0               0                    0   
2    0               0        0               0                    0   
3    0               0        0               0                    0   
4    0               0        0               0                    0   

   Amphitheater  Animal Shelter  Antique Shop  Aquarium  Art Gallery  ...  \
0             0               0             0         0            0  ...   
1             0               0             0         0            0  ...   
2             0               0             0         0            0  ...   
3             0               0             0         0            0  ...   
4             0               0             0         0            0  ...   

   Toy / Game Store  Track  Trail  Train Station  Turkish Restaurant  \
0                 0      0      

Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [37]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
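
To see what the one-hot encoding followed by the grouped mean produces, here is a small worked example on hypothetical data (two neighborhoods, `A` and `B`, with invented venue categories). Each value in the result is the share of a neighborhood's venues that fall in that category, so every row sums to 1.

```python
import pandas as pd

# Hypothetical venues: 4 in neighborhood A, 2 in neighborhood B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Bakery', 'Park', 'Park'],
})

# One column per category; dtype=float so the means are plain frequencies
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

For neighborhood `A`, two of the four venues are coffee shops, so its `Coffee Shop` frequency is 0.5.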

Let's print each neighborhood along with the top 5 most common venues

In [38]:
num_top_venues = 5 # number of top venues to print per neighborhood
for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
         venue  freq
0  Coffee Shop  0.10
1     Creperie  0.04
2       Bakery  0.04
3     Beer Bar  0.04
4   Restaurant  0.04


----Brockton, Parkdale Village, Exhibition Place----
         venue  freq
0         Café  0.10
1       Bakery  0.06
2  Coffee Shop  0.06
3    Gift Shop  0.06
4   Restaurant  0.06


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue  freq
0                  Park  0.10
1               Brewery  0.06
2           Pizza Place  0.06
3  Fast Food Restaurant  0.04
4      Sushi Restaurant  0.04


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
             venue  freq
0      Coffee Shop  0.13
1  Harbor / Marina  0.13
2             Café  0.13
3          Dog Run  0.07
4          Airport  0.07


----Central Bay Street----
              venue  freq
0       Coffee Shop  0.10
1             Hotel  0.04
2  Ramen Restaurant  0.0

Let's write a function to sort the venues in descending order.

In [39]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
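
A quick usage sketch may help: the function is repeated here so the block runs on its own, and it is applied to a hypothetical row with invented frequencies. The first entry of the row is the neighborhood name, which is why the function skips it with `iloc[1:]`.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the function above: skip the first entry (the neighborhood name)
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row of category frequencies for one neighborhood
row = pd.Series({'Neighborhood': 'Berczy Park', 'Bakery': 0.04,
                 'Coffee Shop': 0.10, 'Park': 0.02, 'Zoo': 0.0})
top = return_most_common_venues(row, 2)
print(list(top))
```

Categories with exactly equal frequencies may come back in either order, since the default sort used by `sort_values` is not guaranteed to be stable.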

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [40]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Farmers Market,Café,Restaurant,Cheese Shop,Bakery,Park,Beer Bar,Japanese Restaurant,Creperie
1,"Brockton, Parkdale Village, Exhibition Place",Café,Gift Shop,Bakery,Restaurant,Coffee Shop,Furniture / Home Store,Italian Restaurant,Performing Arts Venue,Soup Place,Soccer Stadium
2,"Business reply mail Processing Centre, South C...",Park,Brewery,Pizza Place,Sushi Restaurant,Coffee Shop,Pet Store,Fast Food Restaurant,Italian Restaurant,Snack Place,Bistro
3,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Café,Harbor / Marina,Airport,Airport Lounge,Scenic Lookout,Park,Track,Dog Run,Dance Studio
4,Central Bay Street,Coffee Shop,Ramen Restaurant,Park,Plaza,Hotel,Café,Yoga Studio,Cosmetics Shop,Seafood Restaurant,Breakfast Spot
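
The column names above are built with a list of suffixes and a fallback to `th`, which is enough for the top 10. If `num_top_venues` were raised beyond 20, that approach would produce labels such as `21th`. A small helper like the hypothetical `ordinal` function below, which is not part of the original notebook, handles the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'   # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, 11)]
print(columns[1], '|', columns[3], '|', columns[10])
```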


## Cluster Neighborhoods.

Now, I will run k-means to cluster the neighborhoods into 5 clusters.

In [41]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 4, 4, 0, 0, 1, 0, 2, 4, 4], dtype=int32)
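
The choice of 5 clusters is fixed by hand here. One common way to sanity-check such a choice is to compare the k-means inertia (the within-cluster sum of squared distances) across several values of k and look for the point where it stops dropping sharply. The sketch below does this on synthetic data, since it is meant only to illustrate the idea: `points` is a made-up stand-in for `toronto_grouped_clustering`, and nothing here was run on the Toronto data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the feature matrix: 3 well-separated blobs of 20 points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([center + rng.normal(scale=0.3, size=(20, 2))
                    for center in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_   # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 2))
```

On data like this the inertia falls steeply up to k = 3 and only slightly afterwards; on real venue frequencies the bend is usually much less clear-cut, so it is a guide rather than a rule.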

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [42]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = toronto

Finally, let's visualize the resulting clusters

In [43]:
# merge toronto_grouped with toronto dataset to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
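
One caveat about the merge above: `DataFrame.join` performs a left join by default, so if no venues had been returned for some neighborhood, that neighborhood would end up with NaN in `Cluster Labels`, the whole column would become floating point, and indexing `rainbow` with it would fail. The sketch below shows one way to guard against this on hypothetical miniature data (`areas` and `labels` are illustrative names). Whether this situation occurs in the actual Toronto data is not verified here.

```python
import pandas as pd

# Hypothetical miniature data: 'C' has no venues, so it gets no cluster label
areas = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'],
                      'Latitude': [43.67, 43.68, 43.69],
                      'Longitude': [-79.29, -79.35, -79.31]})
labels = pd.DataFrame({'Neighborhood': ['A', 'B'], 'Cluster Labels': [0, 2]})

merged = areas.join(labels.set_index('Neighborhood'), on='Neighborhood')
print(merged['Cluster Labels'].tolist())   # the unlabeled row shows up as nan

# Drop unlabeled rows, then restore the integer dtype lost to NaN
plottable = merged.dropna(subset=['Cluster Labels']).copy()
plottable['Cluster Labels'] = plottable['Cluster Labels'].astype(int)
print(plottable)
```

After this step the labels are integers again and can safely be used as list indices when choosing marker colors.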

## Analyzing Clusters.

Now it is possible to examine each cluster and determine the venue categories that distinguish it, and then assign a name to each cluster.

In [44]:
# Cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Index,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Central Toronto,0,Coffee Shop,Italian Restaurant,Sushi Restaurant,Park,Restaurant,Spa,Café,Gym,Grocery Store,Liquor Store
10,Downtown Toronto,0,Coffee Shop,Grocery Store,Park,Pie Shop,BBQ Joint,Playground,Sandwich Place,Metro Station,Candy Store,Breakfast Spot
12,Downtown Toronto,0,Coffee Shop,Sushi Restaurant,Thai Restaurant,Gay Bar,Japanese Restaurant,Men's Store,Dance Studio,Restaurant,Yoga Studio,Burger Joint
13,Downtown Toronto,0,Coffee Shop,Café,Theater,Bakery,Park,Breakfast Spot,Pub,Italian Restaurant,Historic Site,Restaurant
14,Downtown Toronto,0,Coffee Shop,Plaza,Cosmetics Shop,Gastropub,Hotel,Theater,Ramen Restaurant,Middle Eastern Restaurant,Yoga Studio,Sandwich Place
16,Downtown Toronto,0,Coffee Shop,Farmers Market,Café,Restaurant,Cheese Shop,Bakery,Park,Beer Bar,Japanese Restaurant,Creperie
17,Downtown Toronto,0,Coffee Shop,Ramen Restaurant,Park,Plaza,Hotel,Café,Yoga Studio,Cosmetics Shop,Seafood Restaurant,Breakfast Spot
19,Downtown Toronto,0,Coffee Shop,Hotel,Park,Baseball Stadium,Brewery,Aquarium,Plaza,Yoga Studio,Ice Cream Shop,Sporting Goods Shop
27,Downtown Toronto,0,Coffee Shop,Café,Harbor / Marina,Airport,Airport Lounge,Scenic Lookout,Park,Track,Dog Run,Dance Studio
37,Downtown Toronto,0,Coffee Shop,Park,Sushi Restaurant,Bookstore,Italian Restaurant,Japanese Restaurant,Yoga Studio,Juice Bar,Indian Restaurant,Smoke Shop


In [45]:
# Cluster 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Index,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,1,Pub,Japanese Restaurant,Caribbean Restaurant,Coffee Shop,Park,Breakfast Spot,Beach,Bakery,Pizza Place,Health Food Store
1,East Toronto,1,Greek Restaurant,Yoga Studio,Bakery,Ice Cream Shop,Italian Restaurant,Pub,Coffee Shop,Trail,Turkish Restaurant,Bookstore
8,Central Toronto,1,Coffee Shop,Grocery Store,Italian Restaurant,Park,Gym,Bank,Café,Sandwich Place,Restaurant,Pizza Place
11,Downtown Toronto,1,Diner,Café,Coffee Shop,Restaurant,Park,Japanese Restaurant,Gastropub,Indian Restaurant,Steakhouse,Pool
23,Central Toronto,1,Bank,Café,Coffee Shop,Park,Gym / Fitness Center,Liquor Store,Sushi Restaurant,Pharmacy,Skating Rink,Burger Joint
24,Central Toronto,1,Café,Italian Restaurant,Grocery Store,Coffee Shop,Vegetarian / Vegan Restaurant,Museum,Indie Movie Theater,Design Studio,Beer Bar,Pizza Place
25,Downtown Toronto,1,Café,Bakery,Bookstore,Vegetarian / Vegan Restaurant,Restaurant,Grocery Store,Park,Japanese Restaurant,Dessert Shop,Music School
26,Downtown Toronto,1,Café,Vegetarian / Vegan Restaurant,Mexican Restaurant,Dessert Shop,Comfort Food Restaurant,Coffee Shop,Bar,Bakery,Caribbean Restaurant,Vietnamese Restaurant
30,Downtown Toronto,1,Café,Korean Restaurant,Coffee Shop,Grocery Store,Pizza Place,Indian Restaurant,Cocktail Bar,Park,Restaurant,Italian Restaurant


In [46]:
# Cluster 3 (Cluster Labels == 2): Borough, cluster label, and the ten most common venues
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Index,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,Downtown Toronto,2,Café,Coffee Shop,Cosmetics Shop,Bakery,Creperie,Gastropub,Cheese Shop,Restaurant,Farmers Market,Hotel
18,Downtown Toronto,2,Café,Coffee Shop,Theater,American Restaurant,Sushi Restaurant,Restaurant,Cosmetics Shop,Concert Hall,Yoga Studio,Monument / Landmark
20,Downtown Toronto,2,Café,Hotel,Coffee Shop,Restaurant,Japanese Restaurant,American Restaurant,Concert Hall,Beer Bar,Mediterranean Restaurant,Sporting Goods Shop
21,Downtown Toronto,2,Café,Japanese Restaurant,Restaurant,Hotel,Coffee Shop,Gym,American Restaurant,Gym / Fitness Center,Park,Seafood Restaurant
28,Downtown Toronto,2,Japanese Restaurant,Café,Bakery,Coffee Shop,Farmers Market,Creperie,Cheese Shop,Restaurant,Gastropub,Park
29,Downtown Toronto,2,Café,Coffee Shop,Hotel,Restaurant,American Restaurant,Concert Hall,Gym,Beer Bar,Deli / Bodega,Mediterranean Restaurant


In [47]:
# Cluster 4 (Cluster Labels == 3): Borough, cluster label, and the ten most common venues
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Index,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,3,Gym / Fitness Center,Pharmacy,College Quad,College Gym,Coffee Shop,Park,Café,Bookstore,Trail,Gas Station


In [48]:
# Cluster 5 (Cluster Labels == 4): Borough, cluster label, and the ten most common venues
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Index,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,East Toronto,4,Beach,Coffee Shop,Indian Restaurant,Restaurant,Fast Food Restaurant,Park,Brewery,Bakery,Burrito Place,Butcher
3,East Toronto,4,Coffee Shop,Bakery,Café,Bar,Brewery,French Restaurant,Italian Restaurant,Yoga Studio,Latin American Restaurant,Hotel
5,Central Toronto,4,Italian Restaurant,Coffee Shop,Pizza Place,Café,Dessert Shop,Yoga Studio,Food & Drink Shop,Restaurant,Bookstore,Seafood Restaurant
6,Central Toronto,4,Coffee Shop,Italian Restaurant,Sporting Goods Shop,Skating Rink,Café,Restaurant,Diner,Mexican Restaurant,Thai Restaurant,Park
7,Central Toronto,4,Italian Restaurant,Sushi Restaurant,Café,Dessert Shop,Indian Restaurant,Bookstore,Coffee Shop,Pizza Place,Restaurant,Gym
22,Central Toronto,4,Sushi Restaurant,Bank,Coffee Shop,Café,Pharmacy,Italian Restaurant,Dry Cleaner,Japanese Restaurant,Bakery,Bagel Shop
31,West Toronto,4,Café,Coffee Shop,Park,Portuguese Restaurant,Bar,Italian Restaurant,Sushi Restaurant,Pharmacy,Brewery,Bakery
32,West Toronto,4,Bar,Cocktail Bar,Asian Restaurant,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Restaurant,Café,Japanese Restaurant,Coffee Shop,Theater
33,West Toronto,4,Café,Gift Shop,Bakery,Restaurant,Coffee Shop,Furniture / Home Store,Italian Restaurant,Performing Arts Venue,Soup Place,Soccer Stadium
34,West Toronto,4,Café,Bar,Thai Restaurant,Antique Shop,Italian Restaurant,Sushi Restaurant,Coffee Shop,Gastropub,Flea Market,Grocery Store
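
The per-cluster cells above repeat the same selection with a different label each time. As a minimal sketch, the selection could be wrapped in a helper and applied to every label in one pass. The name `cluster_view`, its parameters, and the toy frame are hypothetical and only for illustration; the sketch assumes the same column layout as `toronto_merged` (Borough at position 1, `Cluster Labels` at position 5, venue columns after it).

```python
import pandas as pd


def cluster_view(df, label, label_col="Cluster Labels", first_col=5):
    """Rows of one cluster: column 1 (Borough) plus all columns from first_col onward."""
    cols = df.columns[[1] + list(range(first_col, df.shape[1]))]
    return df.loc[df[label_col] == label, cols]


# Toy stand-in for toronto_merged (hypothetical values, same column order assumed)
toy = pd.DataFrame({
    "Postal Code": ["X1", "X2", "X3"],
    "Borough": ["East Toronto", "Central Toronto", "East Toronto"],
    "Neighbourhood": ["A", "B", "C"],
    "Latitude": [43.0, 43.1, 43.2],
    "Longitude": [-79.0, -79.1, -79.2],
    "Cluster Labels": [1, 3, 1],
    "1st Most Common Venue": ["Pub", "Gym / Fitness Center", "Café"],
})

# One view per cluster label, keyed by label
views = {
    label: cluster_view(toy, label)
    for label in sorted(toy["Cluster Labels"].unique())
}

# Number of neighbourhoods assigned to each cluster
sizes = toy["Cluster Labels"].value_counts().sort_index()
```

With the real data, `cluster_view(toronto_merged, 1)` would be expected to return the same rows as the "Cluster 2" cell, and `sizes` gives a quick check of how unevenly the neighbourhoods are spread across clusters.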


---