## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
# PART 1 of 3
First we import the modules that will be needed to complete this part:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Next we download the source code of the Wikipedia page and parse it using BeautifulSoup:

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

Then we extract the information we are interested in and put it into a Pandas DataFrame object:

In [3]:
# Find the table that we're interested in:
table = soup.find('table', class_='wikitable sortable')

# Initiate objects to hold our column names and rows data:
columns = []
rows = []

# Find and store all the rows in the table:
table_rows = table.find_all('tr')

# For each row search for 'th' (table headers) and 'td' (regular rows):
for row in table_rows:
    values_found = row.find_all(('th', 'td'))
    
    # At this point we can also initiate an object to hold the row values:
    row_values = []
    
    # For each value that is found in a given row check if it's a 'th' (table header) or 'td' (regular row):
    for value in values_found:
        
        # In case it's a table header row ('th'), append it to the columns list:
        if value.name == 'th':
            if value.text.replace('\n', '') == 'Postcode':
                columns.append('PostalCode')
            else:
                columns.append(value.text.replace('\n', ''))
                
        # In case it's a regular table row ('td'), append it to the row_values list:
        if value.name == 'td':
            row_values.append(value.text.replace('\n', ''))
            
    # If there are any regular row values found in this particular row, i.e. it isn't a header row, add those values to the rows list:
    if len(row_values) != 0:
        rows.append(row_values)

# To finish this step off convert the rows list to a Pandas DataFrame providing the columns list as input for the columns parameter:
scrape_result = pd.DataFrame(rows, columns=columns)
scrape_result.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue
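
The header/row split used above can be tried on a tiny inline table, which makes the logic easy to check without a network request. This is only a sketch: the HTML snippet is made up for illustration, and `html.parser` is used instead of `lxml` purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML standing in for the Wikipedia table (illustration only).
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable sortable')

columns, rows = [], []
for tr in table.find_all('tr'):
    # get_text(strip=True) removes the surrounding whitespace and newlines.
    headers = [th.get_text(strip=True) for th in tr.find_all('th')]
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if headers:
        columns = ['PostalCode' if h == 'Postcode' else h for h in headers]
    if cells:
        rows.append(cells)

sample = pd.DataFrame(rows, columns=columns)
print(sample)
```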


The next step is to drop rows with a borough that is Not assigned:

In [4]:
# Use the drop method to remove rows where the value for 'Borough' is 'Not assigned': 
scrape_result.drop(scrape_result[scrape_result['Borough'] == 'Not assigned'].index, inplace=True)

# Reset the indices: 
scrape_result.reset_index(drop=True, inplace=True)

scrape_result.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North
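
The same filtering can also be written with a boolean mask instead of `drop`. The sketch below uses a small stand-in frame (`toy`, a hypothetical name) built from a few of the rows shown earlier, so it runs on its own.

```python
import pandas as pd

# Toy rows mirroring the scraped table (illustration only).
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose borough is assigned, then renumber the index.
assigned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

The mask version returns a new DataFrame rather than modifying the original in place, which can make it easier to re-run a cell.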


Next we combine rows where the post code is the same for a number of neighbourhoods:

In [5]:
# Make a list of all unique post codes:
unique_postal_codes = scrape_result['PostalCode'].unique()

# Set up a new list to hold the new data set:
toronto_neighbourhoods = []

# Iterate through all the post codes in the unique_postal_codes list:
for postal_code in unique_postal_codes:
    
    # For each post code locate all the neighbourhoods with a given post code:
    neighbourhoods_at_post_code = scrape_result.loc[scrape_result['PostalCode'] == postal_code]
    
    # Create a new object with the first located neighbourhood (copied so that the scraped data is not modified):
    postal_code_entry = neighbourhoods_at_post_code.iloc[0].copy()
    
    # If there is more than one neighbourhood for that post code
    if neighbourhoods_at_post_code.shape[0] > 1:
        
        # go over all the entries and add those neighbourhoods after a comma to the one neighbourhood that was saved in the previous step:
        for entry in range(1, neighbourhoods_at_post_code.shape[0]):
            neighbourhood = neighbourhoods_at_post_code.iloc[entry]['Neighbourhood']
            postal_code_entry['Neighbourhood'] = postal_code_entry['Neighbourhood'] + ', ' + neighbourhood
    
    # Save the entry for the current post code to the toronto_neighbourhoods list:
    toronto_neighbourhoods.append(postal_code_entry)

# To finish this step off convert the toronto_neighbourhoods list to a Pandas DataFrame and reset the indices:
toronto_neighbourhoods = pd.DataFrame(toronto_neighbourhoods)
toronto_neighbourhoods.reset_index(drop=True, inplace=True)
toronto_neighbourhoods.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
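
As an alternative to the explicit loop, pandas can combine the neighbourhoods with a `groupby`. This is a sketch on a toy frame (`toy` and `combined` are hypothetical names), assuming each postal code belongs to a single borough.

```python
import pandas as pd

# Toy rows in the same shape as the cleaned scrape result (illustration only).
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M6A', 'M6A', 'M1B', 'M1B'],
    'Borough': ['North York', 'North York', 'North York', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Parkwoods', 'Lawrence Heights', 'Lawrence Manor', 'Rouge', 'Malvern'],
})

# sort=False keeps the postal codes in their order of first appearance.
combined = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```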


I didn't find any rows where there was a Borough specified but not a Neighbourhood.  
The last cell of this part of the notebook checks the shape of the resulting data set:

In [6]:
toronto_neighbourhoods.shape

(103, 3)

# PART 2 of 3
First we import the modules that will be needed to complete this part:

In [7]:
import geocoder

Next we fetch the coordinates and add them to our data set:

In [None]:
# Set up objects to hold the coordinate components:
latitudes = []
longitudes = []

# Grab each of the post codes and fetch the coordinates:
# Here using the ArcGIS provider instead of Google due to much better reliability
for post_code in toronto_neighbourhoods['PostalCode']:
    geo_data = geocoder.arcgis(post_code+', Toronto, Ontario')
    coordinates = geo_data.latlng
    
    # Save the coordinate components to the latitudes and longitudes objects:
    latitudes.append(coordinates[0])
    longitudes.append(coordinates[1])

# Add new 'Latitude' and 'Longitude' columns to the data frame from the corresponding lists:
toronto_neighbourhoods['Latitude'] = latitudes
toronto_neighbourhoods['Longitude'] = longitudes
    
toronto_neighbourhoods.head(12)
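
A geocoding request can occasionally come back without coordinates, in which case indexing the result would fail. Below is a minimal retry sketch, assuming a failed lookup yields an empty or `None` result. `fetch_latlng` and `flaky_lookup` are hypothetical helpers, and the lookup is passed in as a callable so the example runs without a network connection.

```python
def fetch_latlng(lookup, query, max_attempts=3):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning [lat, lng] or an empty
    result, for example: lambda q: geocoder.arcgis(q).latlng
    """
    for _ in range(max_attempts):
        coordinates = lookup(query)
        if coordinates:
            return coordinates
    # Signal failure explicitly instead of failing later on indexing.
    return [None, None]


# Fake lookup that fails once before succeeding (illustration only).
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 2 else [43.75, -79.33]

print(fetch_latlng(flaky_lookup, 'M3A, Toronto, Ontario'))
```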

# PART 3 of 3
First we import the modules that will be needed to complete this part:

In [237]:
import folium
import numpy as np
from pandas import json_normalize
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

Let's start by mapping all the post code areas on a map of Toronto:  
(I will use the post codes for segmenting and clustering, as they are the only field for which every entry has unique coordinates in our dataset.)

In [238]:
# First find central coordinates for the city of Toronto:
toronto_coordinates = geocoder.arcgis('Toronto, Ontario').latlng

# Initialise a map object:
map_toronto = folium.Map(location=[toronto_coordinates[0], toronto_coordinates[1]], zoom_start=11)

# Add markers for each postal zone:
for lat, lng, post_code in zip(toronto_neighbourhoods['Latitude'], toronto_neighbourhoods['Longitude'], toronto_neighbourhoods['PostalCode']):
    label = folium.Popup(post_code)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7
        ).add_to(map_toronto)  

# Display the map:
map_toronto

In [239]:
# Foursquare API credentials (placeholders: substitute your own and keep them out of shared notebooks):
client_id = 'YOUR_FOURSQUARE_CLIENT_ID'
client_secret = 'YOUR_FOURSQUARE_CLIENT_SECRET'
version = '20200229'

Next we define a method that will fetch nearby venues for our postal areas:

In [240]:
def getNearbyVenues(post_codes, latitudes, longitudes, radius, limit):

    # Initialise an object to store the list of venues:
    venues_list = []
    
    # Iterate through pairs of post codes and coordinates:
    for post_code, lat, lng in zip(post_codes, latitudes, longitudes):

        # For each pair create a request URL:
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id, 
            client_secret, 
            version, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Collect only relevant information for each nearby venue from the response
        venues_list.append([(
            post_code, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # Convert the venues_list to a Pandas DataFrame:
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    
    # Name the columns in the DataFrame
    nearby_venues.columns = ['PostalCode', 
                  'PostalCode Latitude', 
                  'PostalCode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    # Return the DataFrame:
    return nearby_venues
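
The flattening step indexes `categories[0]`, which would raise an `IndexError` for a venue that has no category. The sketch below isolates that step and falls back to `None` in such a case. `venue_rows` is a hypothetical helper, and `sample_items` is made-up data in the shape the function above indexes into, not a real API response.

```python
def venue_rows(post_code, lat, lng, items):
    """Flatten explore-style items into tuples, one per venue."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category.
        category = categories[0]['name'] if categories else None
        rows.append((
            post_code, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Made-up items (illustration only).
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]
print(venue_rows('M3A', 43.75242, -79.329242, sample_items))
```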

Now we use the above defined method to get the nearby venues for our postal areas:

In [241]:
toronto_venues = getNearbyVenues(post_codes=toronto_neighbourhoods['PostalCode'],
                                 latitudes=toronto_neighbourhoods['Latitude'],
                                 longitudes=toronto_neighbourhoods['Longitude'],
                                 radius=1000,
                                 limit=100
                                )

Let's have a quick look at the size and some records of our dataset:

In [242]:
print(toronto_venues.shape)
toronto_venues.head(10)

(5120, 7)


Unnamed: 0,PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.75242,-79.329242,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,M3A,43.75242,-79.329242,Brookbanks Park,43.751976,-79.33214,Park
2,M3A,43.75242,-79.329242,Tim Hortons,43.760668,-79.326368,Café
3,M3A,43.75242,-79.329242,A&W,43.760643,-79.326865,Fast Food Restaurant
4,M3A,43.75242,-79.329242,Bruno's valu-mart,43.746143,-79.32463,Grocery Store
5,M3A,43.75242,-79.329242,High Street Fish & Chips,43.74526,-79.324949,Fish & Chips Shop
6,M3A,43.75242,-79.329242,Food Basics,43.760865,-79.326015,Supermarket
7,M3A,43.75242,-79.329242,Shoppers Drug Mart,43.745315,-79.3258,Pharmacy
8,M3A,43.75242,-79.329242,Shoppers Drug Mart,43.760857,-79.324961,Pharmacy
9,M3A,43.75242,-79.329242,Variety Store,43.751974,-79.333114,Food & Drink Shop


Next we check how many venues were returned for some of the areas:

In [243]:
toronto_venues.groupby('PostalCode').count()

PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,7,7,7,7,7,7
M1C,4,4,4,4,4,4
M1E,19,19,19,19,19,19
M1G,18,18,18,18,18,18
M1H,25,25,25,25,25,25
...,...,...,...,...,...,...
M9N,17,17,17,17,17,17
M9P,19,19,19,19,19,19
M9R,19,19,19,19,19,19
M9V,15,15,15,15,15,15


Let's also see how many unique categories we have to work with:

In [244]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 338 unique categories.


We continue analysing the data by converting the data to a format usable by the k-means algorithm:

In [245]:
# First encode the categorical values using the one hot encoding method:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add PostalCode column back to dataframe:
toronto_onehot['PostalCode'] = toronto_venues['PostalCode'] 

# Make PostalCode column the first column:
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Arcade,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we group the results by post code and take the mean of the frequency of occurrence of each category:

In [246]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Arcade,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.428571
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
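
To see why the grouped mean gives a frequency, it can help to run the two steps on a handful of rows. The sketch below uses made-up venue rows (`venues`, `onehot` and `grouped` are hypothetical names). With four venues of four different categories in one postal area, each category gets a frequency of 0.25.

```python
import pandas as pd

# Made-up venue rows (illustration only).
venues = pd.DataFrame({
    'PostalCode': ['M1C', 'M1C', 'M1C', 'M1C', 'M1B', 'M1B'],
    'Venue Category': ['Park', 'Breakfast Spot', 'Burger Joint',
                       'Italian Restaurant', 'Zoo Exhibit', 'Trail'],
})

# One column per category, holding 1 where the venue has that category.
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'PostalCode', venues['PostalCode'])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby('PostalCode').mean().reset_index()
print(grouped)
```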


We can print each postal area along with its 5 most common venue categories to get a better feel for each of the areas:

In [247]:
# Select the number of top venue categories to see:
num_top_venues = 5

# For each postal area calculate and display the frequency of top venue categories:
for post_code in toronto_grouped['PostalCode']:
    print("----"+post_code+"----")
    
    temp = toronto_grouped[toronto_grouped['PostalCode'] == post_code].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M1B----
                  venue  freq
0           Zoo Exhibit  0.43
1  Other Great Outdoors  0.14
2                 Trail  0.14
3            Hobby Shop  0.14
4  Fast Food Restaurant  0.14


----M1C----
                venue  freq
0                Park  0.25
1      Breakfast Spot  0.25
2        Burger Joint  0.25
3  Italian Restaurant  0.25
4        Noodle House  0.00


----M1E----
                  venue  freq
0                  Park  0.11
1     Convenience Store  0.11
2           Supermarket  0.11
3  Gym / Fitness Center  0.05
4            Restaurant  0.05


----M1G----
               venue  freq
0  Indian Restaurant  0.11
1     Discount Store  0.11
2               Park  0.11
3   Department Store  0.11
4        Pizza Place  0.11


----M1H----
               venue  freq
0  Indian Restaurant  0.16
1        Gas Station  0.08
2        Coffee Shop  0.08
3             Bakery  0.08
4           Bus Line  0.04


----M1J----
                venue  freq
0      Ice Cream Shop  0.17
1      San

Let's prepare a method that will return the most common venues for a given postal area:

In [248]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now using the above method we can construct a DataFrame with post codes and corresponding top venues:

In [257]:
# Select the number of top venue types to see:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Create column names for the chosen number of top venue types:
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe with the just prepared column names:
post_codes_venues_sorted = pd.DataFrame(columns=columns)
post_codes_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

# Use the above defined method (return_most_common_venues) to populate the DataFrame:
for ind in np.arange(toronto_grouped.shape[0]):
    post_codes_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

post_codes_venues_sorted.head(20)

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Zoo Exhibit,Other Great Outdoors,Hobby Shop,Trail,Fast Food Restaurant,Farm,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space
1,M1C,Breakfast Spot,Burger Joint,Park,Italian Restaurant,Zoo Exhibit,Fast Food Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Fabric Shop
2,M1E,Convenience Store,Park,Supermarket,Athletics & Sports,Sports Bar,Grocery Store,Restaurant,Gym / Fitness Center,Gymnastics Gym,Pharmacy
3,M1G,Discount Store,Pizza Place,Park,Indian Restaurant,Department Store,Coffee Shop,Supermarket,Sandwich Place,Chinese Restaurant,Thrift / Vintage Store
4,M1H,Indian Restaurant,Gas Station,Bakery,Coffee Shop,Flower Shop,Hakka Restaurant,Chinese Restaurant,Bus Line,Bank,Thai Restaurant
...,...,...,...,...,...,...,...,...,...,...,...
15,M1W,Chinese Restaurant,Coffee Shop,Pizza Place,Fast Food Restaurant,Intersection,Grocery Store,Bakery,Caribbean Restaurant,Supermarket,Other Great Outdoors
16,M2H,Chinese Restaurant,Park,Coffee Shop,Sandwich Place,Sushi Restaurant,Bakery,Cantonese Restaurant,Café,Grocery Store,Residential Building (Apartment / Condo)
17,M2J,Clothing Store,Coffee Shop,Fast Food Restaurant,Japanese Restaurant,Shoe Store,Bakery,Women's Store,Sandwich Place,Convenience Store,Food Court
18,M2K,Park,Café,Bank,Trail,Japanese Restaurant,Chinese Restaurant,Flower Shop,Fish & Chips Shop,Filipino Restaurant,Dumpling Restaurant
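
The column names rely on an ordinal suffix for each rank. A small helper can make that rule explicit and also cover ranks such as 11th or 21st, should more than ten columns ever be needed. This is a sketch, and `ordinal` is a hypothetical helper that is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Build the same ten column names as in the cell above.
columns = ['PostalCode'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```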


Next we cluster the postal areas using the k-means algorithm:

In [258]:
# Set the number of clusters:
kclusters = 5

# Drop the 'PostalCode' labels from the dataset as they don't participate in the clustering process:
toronto_grouped_clustering = toronto_grouped.drop(columns='PostalCode')

# Run the k-means clustering:
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 2, 3, 3, 3, 3, 3, 3, 3, 2], dtype=int32)
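
To get a feel for what the labels mean, k-means can be run on a few made-up frequency vectors. In this sketch (`frequencies` is invented data, not taken from the Toronto results), the first two rows are dominated by one category and the last two by another, so they should end up in separate clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category frequencies: rows are postal areas, columns are categories.
frequencies = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# n_init is set explicitly because its default differs between scikit-learn versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(frequencies)
labels = model.labels_
print(labels)
```

Note that the label numbers themselves are arbitrary: only the grouping they express is meaningful.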

Now we combine the original postal areas data with the results of clustering and the information about most common venue types:

In [259]:
# Add clustering labels:
post_codes_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# Initiate a new DataFrame object from the original postal areas dataset ('toronto_neighbourhoods'):
toronto_merged = toronto_neighbourhoods

# Merge the original postal areas dataset (now within 'toronto_merged') with information about most common venue types (in 'post_codes_venues_sorted'):
toronto_merged = toronto_merged.join(post_codes_venues_sorted.set_index('PostalCode'), on='PostalCode')

# Drop any records with missing data and make sure cluster labels are of type Int: 
toronto_merged.dropna(inplace=True)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
toronto_merged.reset_index(drop=True, inplace=True)

toronto_merged.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.75242,-79.329242,3,Park,Convenience Store,Pharmacy,Bus Stop,Pizza Place,Laundry Service,Chinese Restaurant,Supermarket,Fast Food Restaurant,Tennis Court
1,M4A,North York,Victoria Village,43.7306,-79.313265,3,Hockey Arena,Pet Store,Spa,Middle Eastern Restaurant,Thrift / Vintage Store,Coffee Shop,Intersection,Thai Restaurant,Portuguese Restaurant,Wings Joint
2,M5A,Downtown Toronto,Harbourfront,43.650295,-79.359166,1,Coffee Shop,Pub,Theater,Park,Café,Breakfast Spot,Bakery,Restaurant,Boat or Ferry,Bar
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.72327,-79.451286,1,Clothing Store,Restaurant,Coffee Shop,Dessert Shop,Furniture / Home Store,Fast Food Restaurant,Playground,Discount Store,Sushi Restaurant,Men's Store
4,M7A,Downtown Toronto,Queen's Park,43.66115,-79.391715,1,Coffee Shop,Japanese Restaurant,Gastropub,Bubble Tea Shop,Sushi Restaurant,Park,Café,Tea Room,Ramen Restaurant,Bar
5,M9A,Etobicoke,Islington Avenue,43.662299,-79.528195,3,Shopping Mall,Grocery Store,Pharmacy,Bakery,Liquor Store,Supermarket,Japanese Restaurant,Café,Bank,Bus Stop
6,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517,4,Zoo Exhibit,Other Great Outdoors,Hobby Shop,Trail,Fast Food Restaurant,Farm,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space
7,M3B,North York,Don Mills North,43.749055,-79.362227,3,Coffee Shop,Japanese Restaurant,Supermarket,Italian Restaurant,Bagel Shop,Bank,Basketball Court,Juice Bar,Sushi Restaurant,Pharmacy
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.707535,-79.311773,3,Pizza Place,Brewery,Bus Line,Fast Food Restaurant,Gastropub,Bank,Bakery,Rock Climbing Spot,Coffee Shop,Restaurant
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657363,-79.37818,1,Coffee Shop,Clothing Store,Gastropub,Restaurant,Diner,Italian Restaurant,Japanese Restaurant,Plaza,Bakery,Sushi Restaurant


It's time to visualise the result of clustering on a map: 

In [260]:
# Initialise a map object:
map_clusters = folium.Map(location=[toronto_coordinates[0], toronto_coordinates[1]], zoom_start=11)

# Provide a color scheme for the clusters:
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers for each postal zone to the map:
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

# Display the map:
map_clusters
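
The colour scheme boils down to picking one evenly spaced colour per cluster from the rainbow colormap. The sketch below reproduces just that step and prints which colour each label receives under the `cluster - 1` indexing used above, where label 0 wraps around to the last colour in the list.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced colour per cluster, converted to hex strings for folium.
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# With `cluster - 1`, label 0 maps to index -1, i.e. the last colour.
for cluster in range(kclusters):
    print(cluster, rainbow[cluster - 1])
```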

To help interpret the results, let's count how often each venue type appears among the five most common venues in each of the clusters:

In [262]:
for c in range(0, kclusters):
    cluster = toronto_merged.loc[toronto_merged['Cluster Labels'] == c, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
    print('------ Cluster '+str(c)+' ------')
    print(cluster['1st Most Common Venue'].value_counts())
    print(cluster['2nd Most Common Venue'].value_counts())
    print(cluster['3rd Most Common Venue'].value_counts())
    print(cluster['4th Most Common Venue'].value_counts())
    print(cluster['5th Most Common Venue'].value_counts())
    print('\n')

------ Cluster 0 ------
Coffee Shop       3
Pizza Place       2
Sandwich Place    1
Name: 1st Most Common Venue, dtype: int64
Pizza Place               3
Gas Station               1
Furniture / Home Store    1
Fast Food Restaurant      1
Name: 2nd Most Common Venue, dtype: int64
Coffee Shop               2
Pharmacy                  1
Flea Market               1
Furniture / Home Store    1
Pizza Place               1
Name: 3rd Most Common Venue, dtype: int64
Sandwich Place               2
Metro Station                1
Shopping Mall                1
Intersection                 1
Middle Eastern Restaurant    1
Name: 4th Most Common Venue, dtype: int64
Eastern European Restaurant    1
Restaurant                     1
Sushi Restaurant               1
Pharmacy                       1
Chinese Restaurant             1
Fast Food Restaurant           1
Name: 5th Most Common Venue, dtype: int64


------ Cluster 1 ------
Coffee Shop          19
Café                  6
Hotel                 5
Bar
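
The per-cluster counts can also be laid out as a single table, which may be easier to compare across clusters than separate printouts. This is a sketch on made-up data (`toy` is a hypothetical frame using the same column names as `toronto_merged`).

```python
import pandas as pd

# Made-up cluster assignments (illustration only).
toy = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 1, 1],
    '1st Most Common Venue': ['Coffee Shop', 'Pizza Place', 'Coffee Shop',
                              'Coffee Shop', 'Hotel'],
})

# Rows are clusters, columns are categories, cells count first-place rankings.
counts = pd.crosstab(toy['Cluster Labels'], toy['1st Most Common Venue'])
print(counts)
```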

### Cluster 0
__red markers__  
Seems to be a type of suburban area. Characteristic venues are:
- Pizza Places

### Cluster 1
__purple markers__  
Seems to be a type of common urban area. Characteristic venues are:
- Hotels
- Italian Restaurants
- Restaurants
- Bars

### Cluster 2
__blue markers__  
Seems to be a type of suburban area. Characteristic venues are:
- Parks

### Cluster 3
__green markers__  
Seems to be a type of common suburban area. Characteristic venues are:
- Pizza Places
- Parks
- Gas Stations

### Cluster 4
__orange markers__  
Seems to be a type of natural area. Characteristic venues are:
- Zoo Exhibits
- Other Great Outdoors