# Assignment Week 3

This assignment consists of 3 parts:
1. Scraping and cleaning the data from Wikipedia
2. Adding coordinates from Geocoder
3. Exploring and clustering the neighborhoods of Toronto

## 1. Scraping and cleaning the data from Wikipedia

Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data from the table of postal codes, and transform it into a pandas dataframe like the one shown below:

In [63]:
#import pandas
import pandas as pd

In [64]:
#Get data from Wikipedia. As several tables are found by read_html, 
#only the table containing the PostalCode M1A is selected.
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
raw_data = pd.read_html(url, match='M1A')

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [65]:
#Rename column Postcode, add column Neighborhood for concatenated neighbourhoods
boroughs=raw_data[0].rename(columns={'Postcode':'PostalCode'})
boroughs['Neighborhood'] = boroughs['Neighbourhood']
print(boroughs.shape)
boroughs.head(10)

(288, 4)


Unnamed: 0,PostalCode,Borough,Neighbourhood,Neighborhood
0,M1A,Not assigned,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned,Not assigned
2,M3A,North York,Parkwoods,Parkwoods
3,M4A,North York,Victoria Village,Victoria Village
4,M5A,Downtown Toronto,Harbourfront,Harbourfront
5,M5A,Downtown Toronto,Regent Park,Regent Park
6,M6A,North York,Lawrence Heights,Lawrence Heights
7,M6A,North York,Lawrence Manor,Lawrence Manor
8,M7A,Queen's Park,Not assigned,Not assigned
9,M8A,Not assigned,Not assigned,Not assigned


- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [66]:
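#Remove Boroughs which are not assigned.
#.copy() makes the result an independent dataframe, so later .loc assignments do not act on a slice
boroughs=boroughs[boroughs['Borough'] != 'Not assigned'].copy()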
print(boroughs.shape)
boroughs.head(10)

(211, 4)


Unnamed: 0,PostalCode,Borough,Neighbourhood,Neighborhood
2,M3A,North York,Parkwoods,Parkwoods
3,M4A,North York,Victoria Village,Victoria Village
4,M5A,Downtown Toronto,Harbourfront,Harbourfront
5,M5A,Downtown Toronto,Regent Park,Regent Park
6,M6A,North York,Lawrence Heights,Lawrence Heights
7,M6A,North York,Lawrence Manor,Lawrence Manor
8,M7A,Queen's Park,Not assigned,Not assigned
10,M9A,Etobicoke,Islington Avenue,Islington Avenue
11,M1B,Scarborough,Rouge,Rouge
12,M1B,Scarborough,Malvern,Malvern


- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [67]:
#Replace 'Not assigned' neighbourhoods with the name of the borough.
#A boolean mask selects the affected rows; both spellings of the column are updated.
not_assigned = boroughs['Neighbourhood'] == 'Not assigned'
boroughs.loc[not_assigned, 'Neighborhood'] = boroughs.loc[not_assigned, 'Borough']
boroughs.loc[not_assigned, 'Neighbourhood'] = boroughs.loc[not_assigned, 'Borough']
print(boroughs.shape)
#boroughs.to_csv('Assignment_Capstone_week_3_part_1a.csv')
boroughs.head(10)

(211, 4)


Unnamed: 0,PostalCode,Borough,Neighbourhood,Neighborhood
2,M3A,North York,Parkwoods,Parkwoods
3,M4A,North York,Victoria Village,Victoria Village
4,M5A,Downtown Toronto,Harbourfront,Harbourfront
5,M5A,Downtown Toronto,Regent Park,Regent Park
6,M6A,North York,Lawrence Heights,Lawrence Heights
7,M6A,North York,Lawrence Manor,Lawrence Manor
8,M7A,Queen's Park,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue,Islington Avenue
11,M1B,Scarborough,Rouge,Rouge
12,M1B,Scarborough,Malvern,Malvern
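
The replacement rule can also be written as a single vectorized step. The following is a minimal sketch on a made-up three-row frame (the `demo` data is hypothetical and not the scraped table), using `Series.mask`, which replaces the values where a condition holds.

```python
import pandas as pd

# Hypothetical toy data, not the scraped Wikipedia table
demo = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Harbourfront', 'Not assigned', 'Islington Avenue'],
})

# Where Neighborhood is 'Not assigned', take the value from Borough instead
not_assigned = demo['Neighborhood'] == 'Not assigned'
demo['Neighborhood'] = demo['Neighborhood'].mask(not_assigned, demo['Borough'])

print(demo)
```

Rows that already have a neighborhood are left unchanged; only the masked rows receive the borough name.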


- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [68]:
#Concatenate Neighbourhoods with the same PostalCode.
#For each PostalCode 'pc', the names of its Neighbourhoods are concatenated:
#the first Neighbourhood is added without additional characters,
#each following Neighbourhood is appended after a comma.
#After that, for all Neighbourhoods in the PostalCode 'pc', the column Neighborhood (now without 'ou')
#is set to the concatenated Neighbourhoods under the assumption that all Neighbourhood names
#are unique in the dataset.
#Finally, the duplicate rows are dropped and the index is reset.
#boroughs = pd.read_csv('Assignment_Capstone_week_3_part_1a.csv')
nbh_conc = ''
for pc in boroughs['PostalCode'].unique():
    
    nbh_conc = ''
    for nbh in boroughs[boroughs['PostalCode'] == pc]['Neighbourhood']:
        if nbh_conc == '':
            nbh_conc = nbh
        else:
            nbh_conc = nbh_conc + ', ' + nbh
    for nbh in boroughs[boroughs['PostalCode'] == pc]['Neighbourhood']:
        boroughs.loc[(boroughs['Neighbourhood'] == nbh) & (boroughs['PostalCode'] == pc), 'Neighborhood'] = nbh_conc
        #boroughs.loc[boroughs['Neighbourhood'] == nbh, boroughs['PostalCode'] == pc, 'Neighborhood'] = nbh_conc
        #boroughs.loc[boroughs['Neighbourhood'] == nbh, 'Neighborhood'] = nbh_conc
del boroughs['Neighbourhood']
boroughs.drop_duplicates(inplace=True)
boroughs.reset_index(drop=True, inplace=True)
print(boroughs.shape)

(103, 3)
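
The same combination of rows can probably be expressed more compactly with a `groupby`. The sketch below is an alternative to the loop, not the method used in this notebook; it runs on a hypothetical three-row frame and assumes that every postal code belongs to exactly one borough.

```python
import pandas as pd

# Hypothetical toy data mirroring the column names used in this notebook
demo = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Regent Park', "Queen's Park"],
})

# One row per (PostalCode, Borough); neighbourhood names joined with ', '.
# sort=False keeps the postal codes in order of first appearance.
combined = (
    demo.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
        .agg(', '.join)
        .reset_index()
        .rename(columns={'Neighbourhood': 'Neighborhood'})
)

print(combined)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without a separate merge.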


In [69]:
#save dataframe for future use
boroughs.to_csv('Assignment_Capstone_week_3_part_1.csv')

In [70]:
#Create dataframe as given in the assignment by creating a dataframe PostalCodes with the 
#PostalCodes in the order as given and then merge it with boroughs.
#I am not sure if this is required but this makes it easier to compare my work ;-)
PostalCodes = pd.DataFrame({'PostalCode': ['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A']})
boroughs_shown = pd.merge(PostalCodes, boroughs, left_on='PostalCode', right_on='PostalCode')
boroughs_shown

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Woodbine Gardens, Parkview Hill"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Maryvale, Wexford"
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."


- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [71]:
boroughs.shape[0]

103

## 2. Adding coordinates from Geocoder

In [72]:
#Load the saved dataframe and the coordinates from an alternative source (CSV file), then merge both on the postal code
boroughs = pd.merge(pd.read_csv('Assignment_Capstone_week_3_part_1.csv'), \
                    pd.read_csv('https://cocl.us/Geospatial_data'), \
                    left_on='PostalCode', right_on='Postal Code')

In [73]:
#delete unnecessary columns
boroughs.drop(['Unnamed: 0', 'Postal Code'], axis=1, inplace=True)

In [74]:
#quickcheck result
boroughs.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
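
`pd.merge` performs an inner join by default, so a postal code without a match in the coordinate file would be dropped silently. The following sketch, on hypothetical toy frames, shows one way to check for such rows with a left join and `indicator=True`.

```python
import pandas as pd

# Hypothetical toy frames; the real inputs are the saved CSV and the geospatial CSV
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M5A'],
                     'Borough': ['North York', 'North York', 'Downtown Toronto']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postal code; indicator=True adds a '_merge' column
checked = pd.merge(left, right, how='left',
                   left_on='PostalCode', right_on='Postal Code',
                   indicator=True)

# Rows flagged 'left_only' found no coordinates in the second source
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list would indicate that every postal code received coordinates.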


In [75]:
#save results
boroughs.to_csv('Assignment_Capstone_week_3_part_2.csv')

## 3. Exploring and clustering the neighborhoods of Toronto

In [76]:
#install libraries
import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


In [77]:
#coordinates of Toronto, taken from Wikipedia because the geocoder did not work
latitude = 43.741667
longitude = -79.373333
#get saved data from second part, delete unnecessary columns
boroughs=pd.read_csv('Assignment_Capstone_week_3_part_2.csv')
boroughs.drop(['Unnamed: 0'], axis=1, inplace=True)

In [78]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add a marker for each PostalCode to the map
for pc, lat, lng, borough, neighborhood in zip(boroughs['PostalCode'], boroughs['Latitude'], boroughs['Longitude'], \
        boroughs['Borough'], boroughs['Neighborhood']):
    label = '{}, {}: {}'.format(pc, borough, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

In [79]:
# The code was removed by Watson Studio for sharing.
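
The removed cell defined the Foursquare credentials (`CLIENT_ID`, `CLIENT_SECRET`) and the API version (`VERSION`) that the following cells rely on. One possible way to define them without storing secrets in the notebook is sketched below. The environment variable names and the default version date are assumptions for illustration; the values used originally are not shown.

```python
import os

# Hypothetical environment variable names; set them outside the notebook
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')
# Placeholder API version date in YYYYMMDD form (an assumption, not the original value)
VERSION = os.environ.get('FOURSQUARE_VERSION', '20180605')

if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set; API requests will fail.')
```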

In [80]:
#Create a function to get venues for all PostalCodes. 
#A radius of 500m around the downloaded coordinates is examined.
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'PostalCode Latitude', 
                  'PostalCode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
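
The list comprehension in `getNearbyVenues` reads `v['venue']['categories'][0]['name']`, which would raise an `IndexError` for a venue with an empty category list. Whether the API actually returns such venues is an assumption here. The sketch below shows a more defensive extraction step as a hypothetical helper, `extract_venue_rows`, applied to a made-up payload shaped like `response['groups'][0]['items']`.

```python
def extract_venue_rows(items, name, lat, lng):
    """Turn the 'items' list of an explore response into flat tuples.

    Venues without any category get None instead of raising IndexError.
    """
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hypothetical payload shaped like response['groups'][0]['items']
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Venue',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

rows = extract_venue_rows(sample_items, 'M3A', 43.753259, -79.329656)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to test it without network access.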

In [81]:
#get venues for Toronto
LIMIT = 1000
toronto_venues = getNearbyVenues(names=boroughs['PostalCode'],
                                   latitudes=boroughs['Latitude'],
                                   longitudes=boroughs['Longitude']
                                  )

M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M5K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z


In [82]:
#check the resulting dataframe
print(toronto_venues.shape)
toronto_venues.head()

(2250, 7)


Unnamed: 0,PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


In [83]:
#check how many venues are in each PostalCode area
toronto_venues.groupby('PostalCode').count()

PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,2,2,2,2,2,2
M1C,3,3,3,3,3,3
M1E,7,7,7,7,7,7
M1G,4,4,4,4,4,4
M1H,7,7,7,7,7,7
M1J,2,2,2,2,2,2
M1K,6,6,6,6,6,6
M1L,9,9,9,9,9,9
M1M,2,2,2,2,2,2
M1N,4,4,4,4,4,4


In [84]:
#create dataframe with dummy columns for the venues
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add PostalCode column back to dataframe
toronto_onehot['PostalCode'] = toronto_venues['PostalCode'] 

# move PostalCode column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [85]:
#group rows by PostalCode and take the mean of each one-hot column,
#i.e. the relative frequency of each venue category within the PostalCode
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
5,M1J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
6,M1K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
7,M1L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
8,M1M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.500000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
9,M1N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.000000
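
To illustrate why the mean of the one-hot columns is a relative frequency, here is a small worked example on hypothetical venues (two in M1B, three in M1C). The data is made up; only the technique matches the cells above.

```python
import pandas as pd

# Hypothetical venues: two in M1B, three in M1C
venues = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C'],
    'Venue Category': ['Fast Food Restaurant', 'Print Shop',
                       'Bar', 'Bar', 'Golf Course'],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'PostalCode', venues['PostalCode'])

# The mean of a 0/1 column within a group is the share of venues in that category
freq = onehot.groupby('PostalCode').mean().reset_index()
print(freq.round(2))
```

Each row of `freq` sums to 1 across the category columns, so postal codes with few venues get coarse values such as 0.5 or 0.33.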


In [86]:
#print each PostalCode and its 5 most common venue categories with their frequencies
num_top_venues = 5

for zipc in toronto_grouped['PostalCode']:
    print("----"+zipc+"----")
    temp = toronto_grouped[toronto_grouped['PostalCode'] == zipc].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M1B----
                             venue  freq
0             Fast Food Restaurant   0.5
1                       Print Shop   0.5
2                Accessories Store   0.0
3  Molecular Gastronomy Restaurant   0.0
4                      Music Venue   0.0


----M1C----
            venue  freq
0  History Museum  0.33
1             Bar  0.33
2     Golf Course  0.33
3    Neighborhood  0.00
4     Music Venue  0.00


----M1E----
                venue  freq
0  Mexican Restaurant  0.14
1   Electronics Store  0.14
2         Pizza Place  0.14
3      Medical Center  0.14
4      Breakfast Spot  0.14


----M1G----
               venue  freq
0        Coffee Shop  0.50
1  Korean Restaurant  0.25
2   Insurance Office  0.25
3  Accessories Store  0.00
4              Motel  0.00


----M1H----
                  venue  freq
0                  Bank  0.14
1                Bakery  0.14
2       Thai Restaurant  0.14
3  Caribbean Restaurant  0.14
4   Fried Chicken Joint  0.14


----M1J----
                 v

In [87]:
#define a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
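
For a postal code with fewer venue categories than `num_top_venues`, the function above fills the remaining ranks with categories whose frequency is 0. A possible variant that returns only categories that actually occur is sketched below; `top_nonzero_venues` and the example row are hypothetical.

```python
import pandas as pd

def top_nonzero_venues(row, num_top_venues):
    """Return up to num_top_venues category names with a frequency above 0."""
    # Skip the first entry (the postal code) and make the rest numeric
    freqs = row.iloc[1:].astype(float)
    freqs = freqs[freqs > 0].sort_values(ascending=False, kind='mergesort')
    return freqs.index[:num_top_venues].tolist()


# Hypothetical grouped row: only two categories actually occur
row = pd.Series({'PostalCode': 'M1B', 'Bar': 0.0,
                 'Fast Food Restaurant': 0.5, 'Print Shop': 0.5,
                 'Yoga Studio': 0.0})

result = top_nonzero_venues(row, 3)
print(result)
```

The returned list can be shorter than `num_top_venues`, so the caller would need to pad it before writing it into a fixed number of columns.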

In [88]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postalcode_venues_sorted = pd.DataFrame(columns=columns)
postalcode_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

for ind in np.arange(toronto_grouped.shape[0]):
    postalcode_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

postalcode_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Print Shop,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,M1C,History Museum,Bar,Golf Course,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
2,M1E,Pizza Place,Rental Car Location,Medical Center,Mexican Restaurant,Electronics Store,Intersection,Breakfast Spot,Coworking Space,Falafel Restaurant,Comic Shop
3,M1G,Coffee Shop,Insurance Office,Korean Restaurant,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
4,M1H,Caribbean Restaurant,Fried Chicken Joint,Bank,Thai Restaurant,Athletics & Sports,Bakery,Hakka Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant
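
The column names rely on a list of three suffixes with a fallback to "th", which is correct for ranks 1 to 10 but would yield "21th" for larger values. A small helper, sketched here under the hypothetical name `ordinal`, covers the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (also 111, 112, ...) take 'th' despite ending in 1, 2, 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['PostalCode'] + ['{} Most Common Venue'.format(ordinal(i))
                            for i in range(1, 11)]
print(columns)
```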


In [89]:
#Cluster the PostalCodes into 5 clusters
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('PostalCode', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [90]:
#create new dataframe that includes the cluster as well as the top 10 venues for each postalcode
# add clustering labels
postalcode_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = boroughs

# join boroughs with postalcode_venues_sorted to add the cluster label and the top venues to each PostalCode
toronto_merged = toronto_merged.join(postalcode_venues_sorted.set_index('PostalCode'), \
                                     on='PostalCode')

toronto_merged.head(20) # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2.0,Fast Food Restaurant,Food & Drink Shop,Park,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Intersection,Hockey Arena,Portuguese Restaurant,Coffee Shop,Yoga Studio,Dumpling Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,0.0,Coffee Shop,Park,Pub,Bakery,Café,Mexican Restaurant,Breakfast Spot,Restaurant,Shoe Store,Italian Restaurant
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,0.0,Furniture / Home Store,Women's Store,Event Space,Athletics & Sports,Clothing Store,Miscellaneous Shop,Accessories Store,Vietnamese Restaurant,Boutique,Coffee Shop
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,0.0,Coffee Shop,Gym,Diner,Park,Spa,Smoothie Shop,Seafood Restaurant,Sandwich Place,Burger Joint,Burrito Place
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242,,,,,,,,,,,
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.0,Fast Food Restaurant,Print Shop,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
7,M3B,North York,Don Mills North,43.745906,-79.352188,0.0,Japanese Restaurant,Caribbean Restaurant,Café,Gym / Fitness Center,Baseball Field,Basketball Court,Dumpling Restaurant,Doner Restaurant,Donut Shop,Drugstore
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,0.0,Pizza Place,Fast Food Restaurant,Pet Store,Athletics & Sports,Gym / Fitness Center,Intersection,Café,Gastropub,Bank,Pharmacy
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Fast Food Restaurant,Middle Eastern Restaurant,Café,Tea Room,Italian Restaurant,Pizza Place,Diner


In [91]:
#replace missing cluster labels by the extra label 5 (examined below as "Cluster 6")
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(5)

In [92]:
toronto_merged

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2.0,Fast Food Restaurant,Food & Drink Shop,Park,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Intersection,Hockey Arena,Portuguese Restaurant,Coffee Shop,Yoga Studio,Dumpling Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.654260,-79.360636,0.0,Coffee Shop,Park,Pub,Bakery,Café,Mexican Restaurant,Breakfast Spot,Restaurant,Shoe Store,Italian Restaurant
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,0.0,Furniture / Home Store,Women's Store,Event Space,Athletics & Sports,Clothing Store,Miscellaneous Shop,Accessories Store,Vietnamese Restaurant,Boutique,Coffee Shop
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,0.0,Coffee Shop,Gym,Diner,Park,Spa,Smoothie Shop,Seafood Restaurant,Sandwich Place,Burger Joint,Burrito Place
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242,5.0,,,,,,,,,,
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.0,Fast Food Restaurant,Print Shop,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
7,M3B,North York,Don Mills North,43.745906,-79.352188,0.0,Japanese Restaurant,Caribbean Restaurant,Café,Gym / Fitness Center,Baseball Field,Basketball Court,Dumpling Restaurant,Doner Restaurant,Donut Shop,Drugstore
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,0.0,Pizza Place,Fast Food Restaurant,Pet Store,Athletics & Sports,Gym / Fitness Center,Intersection,Café,Gastropub,Bank,Pharmacy
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Fast Food Restaurant,Middle Eastern Restaurant,Café,Tea Room,Italian Restaurant,Pizza Place,Diner


In [93]:
#Show the clusters on the map
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters+1)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], \
                                  toronto_merged['PostalCode'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
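
Note that the colour lookup uses `int(cluster)-1` as index. For label 0 this gives index -1, which Python resolves to the last element of the list, so label 0 receives the last colour rather than the first. The sketch below demonstrates the effect with six placeholder strings standing in for the `rainbow` list; the names `palette`, `shifted` and `direct` are made up for illustration.

```python
# Six placeholder colours standing in for the 'rainbow' list (kclusters + 1 entries)
palette = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5']
labels = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

# Indexing as in the map cell: label 0 wraps around to the last colour
shifted = {label: palette[int(label) - 1] for label in labels}

# Direct indexing: one colour per label in list order, no wrap-around
direct = {label: palette[int(label)] for label in labels}

print(shifted)
print(direct)
```

Both variants give every label a distinct colour; they differ only in which colour belongs to which label, which matters when the map is interpreted by colour name.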

Interpretation:
The algorithm puts most of the PostalCodes into the red cluster; the second largest cluster is the blue one. The red cluster contains the PostalCodes with the NaN values, meaning that no venues were found using the given search criteria. As most PostalCodes fall into only 2 clusters and the remaining clusters are rather small, k-means does not perform very well in clustering the PostalCodes.

### Examination of the clusters

#### Cluster 1

In [94]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, \
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0.0,Intersection,Hockey Arena,Portuguese Restaurant,Coffee Shop,Yoga Studio,Dumpling Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop
2,Downtown Toronto,0.0,Coffee Shop,Park,Pub,Bakery,Café,Mexican Restaurant,Breakfast Spot,Restaurant,Shoe Store,Italian Restaurant
3,North York,0.0,Furniture / Home Store,Women's Store,Event Space,Athletics & Sports,Clothing Store,Miscellaneous Shop,Accessories Store,Vietnamese Restaurant,Boutique,Coffee Shop
4,Queen's Park,0.0,Coffee Shop,Gym,Diner,Park,Spa,Smoothie Shop,Seafood Restaurant,Sandwich Place,Burger Joint,Burrito Place
6,Scarborough,0.0,Fast Food Restaurant,Print Shop,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
7,North York,0.0,Japanese Restaurant,Caribbean Restaurant,Café,Gym / Fitness Center,Baseball Field,Basketball Court,Dumpling Restaurant,Doner Restaurant,Donut Shop,Drugstore
8,East York,0.0,Pizza Place,Fast Food Restaurant,Pet Store,Athletics & Sports,Gym / Fitness Center,Intersection,Café,Gastropub,Bank,Pharmacy
9,Downtown Toronto,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Fast Food Restaurant,Middle Eastern Restaurant,Café,Tea Room,Italian Restaurant,Pizza Place,Diner
11,Etobicoke,0.0,Bank,Golf Course,Yoga Studio,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
12,Scarborough,0.0,History Museum,Bar,Golf Course,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant


#### Cluster 2

In [95]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, \
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,North York,1.0,Piano Bar,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store


#### Cluster 3

In [96]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, \
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,2.0,Fast Food Restaurant,Food & Drink Shop,Park,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
10,North York,2.0,Japanese Restaurant,Pub,Bakery,Park,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
16,York,2.0,Trail,Field,Hockey Arena,Park,Yoga Studio,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore
21,York,2.0,Park,Market,Women's Store,Fast Food Restaurant,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
35,East York,2.0,Park,Pizza Place,Convenience Store,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
40,North York,2.0,Airport,Park,Yoga Studio,Electronics Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
49,North York,2.0,Bakery,Construction & Landscaping,Park,Basketball Court,Yoga Studio,Eastern European Restaurant,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
61,Central Toronto,2.0,Lake,Swim School,Park,Bus Line,Yoga Studio,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
64,York,2.0,Park,Yoga Studio,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
66,North York,2.0,Bank,Convenience Store,Flower Shop,Park,Yoga Studio,Electronics Store,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant


#### Cluster 4

In [97]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, \
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
45,North York,3.0,Cafeteria,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store,Discount Store


#### Cluster 5

In [98]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, \
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
53,North York,4.0,Home Service,Baseball Field,Food Truck,Yoga Studio,Electronics Store,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant
57,North York,4.0,Baseball Field,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store,Discount Store


#### Cluster 6 (NaN-Values)

In [99]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, \
                   toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Etobicoke,5.0,,,,,,,,,,
95,Scarborough,5.0,,,,,,,,,,


Conclusion: The distribution of the clusters is very uneven: 85 of 104 PostalCodes fall into one cluster and 13 of the remaining PostalCodes fall into the second largest cluster. The remaining 6 PostalCodes are spread over 4 clusters. Another clustering method should be used for further analysis.
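
The cluster sizes behind such a conclusion can be counted directly from the label column. The following is a minimal sketch with a made-up label vector; in the notebook the input would be `toronto_merged['Cluster Labels']` after the missing values were filled.

```python
from collections import Counter

# Hypothetical label vector, not the actual clustering result
labels = [0, 0, 0, 2, 0, 5, 0, 2, 1, 0]

sizes = Counter(labels)
total = len(labels)

# Largest cluster first, with its share of all postal codes
for label, count in sizes.most_common():
    print('label {}: {} postal codes ({:.0%})'.format(label, count, count / total))
```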