## Coursera Capstone: Segmenting and Clustering Neighborhoods in Toronto
This project explores, segments, and clusters neighborhoods in the city of Toronto.

## Introduction

We will start by scraping the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a> to get a list of neighborhoods in Toronto. Then we will find the latitude and longitude of each neighborhood. Next, we will use the Foursquare API to explore the neighborhoods. We will get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. We will use the *k*-means clustering algorithm to complete this task. Finally, we will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

1. <a href="#toc1">Acquire Data</a>

2. <a href="#toc2">Explore Neighborhoods in Toronto</a>

3. <a href="#toc3">Analyze Each Neighborhood</a>

4. <a href="#toc4">Cluster Neighborhoods</a>

5. <a href="#toc5">Examine Clusters</a>    


First, import all the dependencies that we will need for this analysis.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id="toc1"></a>
## 1. Acquire Data

We will scrape the list of Toronto neighborhoods from the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a>. Postal codes beginning with M are located within the city of Toronto. Then we will find the geographical coordinates of each postal code.

#### Read data from web page

We will read the page and convert the table into a dataframe using the *pandas* read_html method.

In [2]:
#get Toronto neighborhoods from Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#read the Wikipedia page - returns list of dataframes
dfs = pd.read_html(url, header=0)
#take the first dataframe from the returned list (it should be the only dataframe in the list)
df = dfs[0]
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


We will ignore any records where *Borough* is "Not assigned".

In [3]:
#create new dataframe with records where Borough is not 'Not assigned'
df_assigned = df[df['Borough'] != 'Not assigned']
df_assigned.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In rows where the value of *Neighbourhood* is 'Not assigned' we will replace it with the value of the *Borough*.

In [4]:
#create a series of neighborhood names, substituting the borough where the neighborhood is 'Not assigned'
new_neigh = df_assigned['Neighbourhood'].where(df_assigned['Neighbourhood'] != 'Not assigned', other = df_assigned['Borough'], axis = 0)
#construct new dataframe using postcode and borough from the previous dataframe and neighborhood from the above series
df_replaced = pd.concat([df_assigned['Postcode'], df_assigned['Borough'], new_neigh], axis = 1)
df_replaced.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
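The replacement step relies on `Series.where`, which keeps a value where the condition holds and takes the `other` value elsewhere. A minimal sketch on a three-row toy frame (the frame `sample` is made up for illustration and only mirrors the column names used above):

```python
import pandas as pd

# Toy rows (illustrative only) with the same column names as the scraped table
sample = pd.DataFrame({
    'Postcode': ['M7A', 'M9A', 'M1B'],
    'Borough': ["Queen's Park", 'Etobicoke', 'Scarborough'],
    'Neighbourhood': ['Not assigned', 'Islington Avenue', 'Rouge'],
})

# Keep the neighbourhood where it is assigned; otherwise fall back to the borough
is_assigned = sample['Neighbourhood'] != 'Not assigned'
sample['Neighbourhood'] = sample['Neighbourhood'].where(is_assigned, other=sample['Borough'])

print(sample)
```

Only the first row changes: its neighbourhood takes the borough name, as in row 8 of the output above.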


Whenever we have more than one row per postcode, we will concatenate all neighborhoods into a comma-separated list.

In [5]:
#group the dataframe by Postcode and Borough and concatenate all neighborhoods into comma separated list
toronto_neighborhoods = df_replaced.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(list).apply(lambda x: ', '.join(x)).to_frame()
toronto_neighborhoods.reset_index(inplace = True)
toronto_neighborhoods.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
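The same grouping can be written a little more compactly by passing `', '.join` directly to `agg`. This is a sketch on toy rows, not a replacement for the cell above; the frame `sample` is an assumption made for illustration:

```python
import pandas as pd

# Toy rows: M1B appears twice and should collapse into a single row
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M3A'],
    'Borough': ['Scarborough', 'Scarborough', 'North York'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Parkwoods'],
})

# Join all neighbourhood names within each (Postcode, Borough) group
combined = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
          .agg(', '.join)
          .reset_index()
)

print(combined)
```

Within each group the original row order is preserved, so the joined string reads "Rouge, Malvern" rather than being re-sorted.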


In [6]:
#display the shape of the resulting dataframe
toronto_neighborhoods.shape

(103, 3)

### Assign geographical coordinates

The next step is to assign the geographical coordinates for each postcode. Instructions on the Coursera submission page suggest using the Geocoder Python package. Unfortunately, I was not able to get usable results from this package. As suggested on the submission page, I decided to use the dataset <a href='http://cocl.us/Geospatial_data'>http://cocl.us/Geospatial_data</a> with predefined coordinates per postal code.  

An alternative way to get coordinates would be to use the <a href='https://geopy.readthedocs.io/en/stable/'>GeoPy</a> library with the Nominatim geolocator service (instead of Google Maps). However, this library does not return coordinates based on postal codes but rather on neighborhood names. This would mean restructuring the above dataframe back to what it was before we concatenated neighborhoods into comma-separated values. I decided not to go this way, although I suspect it would have been a viable approach to solving this exercise.
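For reference, the GeoPy route could look roughly like the sketch below. It was not used in this notebook. The helper `geocode_neighborhoods`, the query suffix, and the stand-in geocoder are all assumptions made for illustration; the stand-in returns made-up coordinates so that the sketch runs without network access.

```python
from collections import namedtuple

def geocode_neighborhoods(names, geocode, suffix=', Toronto, Ontario'):
    """Return {name: (latitude, longitude) or None} using any geocode(query) callable."""
    coords = {}
    for name in names:
        location = geocode(name + suffix)
        # A geocoder may return None when it finds no match
        coords[name] = (location.latitude, location.longitude) if location else None
    return coords

# Stand-in geocoder so the sketch runs offline; the coordinates are made up
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def fake_geocode(query):
    return FakeLocation(43.8, -79.2) if query.startswith('Rouge') else None

result = geocode_neighborhoods(['Rouge', 'Nowhere'], fake_geocode)
print(result)

# With GeoPy installed, the real service could be plugged in like this:
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent='toronto_explorer')  # user_agent value is arbitrary
# geocode_neighborhoods(['Rouge', 'Malvern'], geolocator.geocode)
```

Passing the geocoder in as a callable keeps the lookup logic testable and makes it easy to handle names that the service cannot resolve.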

In [7]:
#read provided dataset
coords = pd.read_csv('http://cocl.us/Geospatial_data')
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We will merge this dataset with our *toronto_neighborhoods* dataset from above.

In [8]:
toronto_neighborhoods = pd.merge(toronto_neighborhoods, coords, left_on = 'Postcode', right_on = 'Postal Code')[['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']]
toronto_neighborhoods.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [9]:
toronto_neighborhoods.shape

(103, 5)
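The merge above uses the default inner join, which silently drops postcodes that have no match in the coordinates file. The unchanged row count of 103 suggests that nothing was dropped here, but a left join with `indicator=True` makes that check explicit. A minimal sketch on toy data, where `M9Z` is a made-up postcode with no coordinates:

```python
import pandas as pd

# Toy frames; 'M9Z' is a hypothetical postcode missing from the coordinates
neighborhoods = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                              'Borough': ['Scarborough', 'Scarborough', 'Unknown']})
coords = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# A left join keeps every neighborhood row; '_merge' records where each row matched
merged = pd.merge(neighborhoods, coords, how='left',
                  left_on='Postcode', right_on='Postal Code', indicator=True)

# Postcodes that found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```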

### Add additional data

#### Number of Elementary and Secondary schools per postal code

We will enrich the previous dataset with information about how many Elementary and Secondary schools there are in each neighborhood by postal code. A list of public schools can be found on the <a href='https://www.ontario.ca/data/ontario-public-school-contact-information'>Ontario public school contact information</a> web site and this list can be transformed into a table with school counts.

In [10]:
#read the Ontario public schools dataset
df = pd.read_excel('https://files.ontario.ca/opendata/publicly_funded_schools_xlsx_january_2019_en.xlsx')

We will remove the records where the postal code is null and keep only postal codes that begin with M, which are the Toronto postal codes. Additionally, we want to keep only the public schools, as these are the only ones that we are interested in.

In [11]:
#keep only rows where the postal code is not null
df = df[df['Postal Code'].notna()]
#keep only rows where the postal code begins with M - these are Toronto postal codes
df = df[df['Postal Code'].str.startswith('M')]
#keep only first 3 characters of postal code
df['Postal Code'] = df['Postal Code'].str[:3]
#keep only public schools
df = df[df['School Type'] == 'Public']

#keep selected columns and store in dataframe called toronto_schools
toronto_schools = df[['School Level', 'School Name', 'Postal Code']]
toronto_schools.head()

Unnamed: 0,School Level,School Name,Postal Code
540,Elementary,Collège français élémentaire,M5B
541,Secondary,Collège français secondaire,M5B
542,Elementary,École élémentaire Académie Alexandre-Dumas,M1E
546,Elementary,École élémentaire Charles-Sauriol,M6N
551,Elementary,École élémentaire Étienne-Brûlé,M2L


Next, we want to count how many elementary and how many secondary schools we have in each postal code.

In [12]:
schools_count = toronto_schools.groupby(['Postal Code', 'School Level']).count().reset_index()
schools_count.columns = ['Postal Code', 'School Level', 'Number of Schools']
schools_count.head()

Unnamed: 0,Postal Code,School Level,Number of Schools
0,M1B,Elementary,16
1,M1B,Secondary,1
2,M1C,Elementary,8
3,M1C,Secondary,2
4,M1E,Elementary,13


For the analysis we will pivot this dataframe to create columns for school level.

In [13]:
#perform pivot
schools_count_pivot = schools_count.pivot(index='Postal Code', columns='School Level', values='Number of Schools')
#reset index
schools_count_pivot.reset_index(inplace = True)
#select columns that we need
schools_count_pivot = schools_count_pivot[['Postal Code', 'Elementary', 'Secondary']]
schools_count_pivot.head()

School Level,Postal Code,Elementary,Secondary
0,M1B,16.0,1.0
1,M1C,8.0,2.0
2,M1E,13.0,9.0
3,M1G,10.0,2.0
4,M1H,4.0,1.0
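The counts appear as floats after the pivot, which is what pandas does when at least one postal code lacks one of the school levels and the gap is filled with NaN. A small sketch of that behavior, and of filling the gaps before the join, assuming a toy table in which the `M1X` row is made up:

```python
import pandas as pd

# Toy counts; the hypothetical postal code 'M1X' has no secondary school row
counts = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1X'],
    'School Level': ['Elementary', 'Secondary', 'Elementary'],
    'Number of Schools': [16, 1, 3],
})

# Pivot to one column per school level; missing combinations become NaN
wide = counts.pivot(index='Postal Code', columns='School Level', values='Number of Schools')

# Treat a missing level as zero schools and restore integer counts
wide = wide.fillna(0).astype(int).reset_index()
print(wide)
```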


For further analysis we will normalize the school counts. Let's find the maximum school count value and divide each count by this value.

In [14]:
#find the maximum school count value
max_school_cnt = max(schools_count_pivot['Elementary'].max(), schools_count_pivot['Secondary'].max())

#divide each count by the maximum value
elementary_normalized = schools_count_pivot['Elementary']/max_school_cnt
secondary_normalized = schools_count_pivot['Secondary']/max_school_cnt

#add these columns to the dataframe
schools_count_pivot.insert(schools_count_pivot.shape[1], 'Elementary Normalized', elementary_normalized)
schools_count_pivot.insert(schools_count_pivot.shape[1], 'Secondary Normalized', secondary_normalized)

#drop columns we don't need
schools_count_pivot = schools_count_pivot.drop(['Elementary', 'Secondary'], axis=1)
schools_count_pivot.head()

School Level,Postal Code,Elementary Normalized,Secondary Normalized
0,M1B,1.0,0.0625
1,M1C,0.5,0.125
2,M1E,0.8125,0.5625
3,M1G,0.625,0.125
4,M1H,0.25,0.0625
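Dividing both columns by one shared maximum keeps elementary and secondary counts on the same scale, so a value of 1.0 always means "as many schools as the busiest postal code at either level". The sketch below reproduces three rows of the output above in a compact form; the frame is retyped by hand from the earlier pivot table.

```python
import pandas as pd

# Three rows copied from the pivoted school counts shown earlier
schools = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M1H'],
                        'Elementary': [16, 8, 4],
                        'Secondary': [1, 2, 1]})

count_cols = ['Elementary', 'Secondary']

# One shared maximum across both school levels
max_count = schools[count_cols].max().max()

# Scale both columns by that maximum and label the results
normalized = schools[count_cols] / max_count
normalized.columns = [col + ' Normalized' for col in count_cols]

result = pd.concat([schools[['Postal Code']], normalized], axis=1)
print(result)
```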


#### Population count per postal code

We will enrich the dataset by adding population counts for each neighborhood by postal code as published on the Statistics Canada <a href = 'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&S=22&O=A'>Population and Dwelling Count Highlight Tables, 2016 Census</a> page.  

Three values are provided in this table: 
* Population, 2016: represents the number of Canadians whose usual place of residence is in that area, regardless of where they happened to be on census day
* Total private dwellings, 2016: refers to all private dwellings, whether or not they are occupied by usual residents
* Private dwellings occupied by usual residents, 2016: refers to private dwellings occupied by usual residents, not including tourists

In [15]:
#read population data from csv file downloaded from Statistics Canada
df = pd.read_csv('https://raw.githubusercontent.com/mferle/Coursera_Capstone/master/data/T120120190215054507.csv')

#keep only rows where the postal code begins with M - these are Toronto postal codes
df = df[df['Geographic code'].str.startswith('M')]

#remove any rows where the population count is less than 100
df = df[df['Population, 2016'] > 99]

#keep only columns we need for analysis
df = df[['Geographic code', 'Population, 2016', 'Total private dwellings, 2016', 'Private dwellings occupied by usual residents, 2016']]
df.head()

Unnamed: 0,Geographic code,"Population, 2016","Total private dwellings, 2016","Private dwellings occupied by usual residents, 2016"
895,M1B,66108,20957,20230
896,M1C,35626,11588,11274
897,M1E,46943,17637,17161
898,M1G,29690,10116,9767
899,M1H,24383,9274,8985


For the analysis, we are interested in residential areas and therefore want to exclude tourists, which means that our column of interest is *Private dwellings occupied by usual residents, 2016*. We will divide this number by the total population count to obtain the ratio of occupied dwellings to population. In residential areas, where most of the population is not made up of tourists, this ratio should be higher than elsewhere.

In [16]:
#calculate the ratio of occupied dwellings to population and add to dataframe as new column
pct = df['Private dwellings occupied by usual residents, 2016'] / df['Population, 2016']
df.insert(df.shape[1], 'Percent Occupied', pct)
#drop unwanted columns
toronto_population = df.drop(['Population, 2016', 'Total private dwellings, 2016', 'Private dwellings occupied by usual residents, 2016'], axis=1)
toronto_population.head()


Unnamed: 0,Geographic code,Percent Occupied
895,M1B,0.306014
896,M1C,0.316454
897,M1E,0.365571
898,M1G,0.328966
899,M1H,0.368494


#### Join school counts and population data with Toronto neighborhoods dataframe

In [17]:
#join school counts
toronto_neighborhoods = toronto_neighborhoods.join(schools_count_pivot.set_index('Postal Code'), on = 'Postcode')

#replace null values with 0
toronto_neighborhoods['Elementary Normalized'].fillna(0, inplace=True)
toronto_neighborhoods['Secondary Normalized'].fillna(0, inplace=True)
toronto_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,0.0625
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.5,0.125
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.8125,0.5625
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.625,0.125
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.25,0.0625


In [18]:
#join population data
toronto_neighborhoods = toronto_neighborhoods.join(toronto_population.set_index('Geographic code'), on = 'Postcode')

#replace null values with 0
toronto_neighborhoods['Percent Occupied'].fillna(0, inplace=True)
toronto_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,0.0625,0.306014
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.5,0.125,0.316454
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.8125,0.5625,0.365571
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.625,0.125,0.328966
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.25,0.0625,0.368494


<a id="toc2"></a>
## 2. Explore Neighborhoods in Toronto

Next, we will use the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [19]:
# The code was removed by Watson Studio for sharing.

#### Let's explore the first neighborhood in our dataframe

Find the name, latitude and longitude of the first neighborhood in the dataframe.

In [20]:
neighborhood_name = toronto_neighborhoods.loc[0, 'Neighbourhood'] # neighborhood name
neighborhood_latitude = toronto_neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge, Malvern are 43.806686299999996, -79.19435340000001.


#### Now, let's get the top 100 venues that are in the above neighborhood within a radius of 500 meters

First we will create the GET request URL.

In [21]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
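As an alternative to formatting the query string by hand, the parameters can be collected in a dictionary and encoded with the standard library. This is only a sketch: the helper `build_explore_url` is hypothetical, and the credential and version values are placeholders rather than the ones used in this notebook.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll parameter readable instead of %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# Placeholder credentials and version string (assumptions for illustration)
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 43.806686, -79.194353)
print(example_url)
```

Keeping the parameters in a dictionary makes it harder to misorder positional arguments and handles escaping of special characters.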


Send the GET request and examine the results.

In [22]:
results = requests.get(url).json()
#results  # uncomment to inspect the raw JSON response (a dict, so it has no .head())

We see that all the information that we want is in the *items* key. Before we proceed, we will create the **get_category_type** function which extracts the category name from a JSON object.

In [23]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
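To see what the function does, it can be called on plain dictionaries that mimic the two row layouts it accepts. The sample rows below are made up for illustration, and the function is repeated so that the sketch runs on its own:

```python
def get_category_type(row):
    """Return the name of the first category, or None if there is none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Row as produced by flattening the response (keys prefixed with 'venue.')
flat_row = {'venue.name': "Wendy's", 'venue.categories': [{'name': 'Fast Food Restaurant'}]}
# Row that carries a plain 'categories' key
plain_row = {'categories': [{'name': 'Bar'}]}
# Row without any category information
empty_row = {'venue.categories': []}

print(get_category_type(flat_row))
print(get_category_type(plain_row))
print(get_category_type(empty_row))
```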

In [24]:
#extract the items key from the results
venues = results['response']['groups'][0]['items']
#flatten JSON into a dataframe
nearby_venues = json_normalize(venues) 
#filter columns that we need for further analysis
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
#extract the category for each row using the previously defined function
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# optionally strip the 'venue.' prefix from the column names
#nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056


#### Create a function to repeat the same process for all the neighborhoods in Toronto

In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
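The function above indexes `categories[0]` directly, which would raise an `IndexError` for any venue that comes back without a category. A more defensive variant of the extraction step might look like the sketch below; the helper `extract_venue_rows` and the sample items are hypothetical and only imitate the shape of the response used above.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn response items into flat tuples, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Use None when the venue has no category instead of failing
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Made-up items shaped like the entries in results['response']['groups'][0]['items']
sample_items = [
    {'venue': {'name': "Wendy's", 'location': {'lat': 43.807448, 'lng': -79.199056},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Unlabelled Place', 'location': {'lat': 43.8, 'lng': -79.2},
               'categories': []}},
]

rows = extract_venue_rows('M1B', 43.806686, -79.194353, sample_items)
print(rows)
```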

The following code executes the above function on each neighborhood and creates a new dataframe called *toronto_venues*.

In [26]:
toronto_venues = getNearbyVenues(names = toronto_neighborhoods['Postcode'],
                                   latitudes = toronto_neighborhoods['Latitude'],
                                   longitudes = toronto_neighborhoods['Longitude']
                                  )

M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M2H
M2J
M2K
M2L
M2M
M2N
M2P
M2R
M3A
M3B
M3C
M3H
M3J
M3K
M3L
M3M
M3N
M4A
M4B
M4C
M4E
M4G
M4H
M4J
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5M
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6A
M6B
M6C
M6E
M6G
M6H
M6J
M6K
M6L
M6M
M6N
M6P
M6R
M6S
M7A
M7R
M7Y
M8V
M8W
M8X
M8Y
M8Z
M9A
M9B
M9C
M9L
M9M
M9N
M9P
M9R
M9V
M9W


Check the size and first few rows of the resulting dataframe.

In [27]:
print(toronto_venues.shape)
toronto_venues.head()

(2240, 7)


Unnamed: 0,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1E,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,M1E,43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


Count the number of appearances of each venue category.

In [28]:
toronto_venues.groupby('Venue Category').count()

Unnamed: 0_level_0,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Accessories Store,1,1,1,1,1,1
Adult Boutique,1,1,1,1,1,1
Afghan Restaurant,1,1,1,1,1,1
Airport,2,2,2,2,2,2
Airport Food Court,1,1,1,1,1,1
Airport Gate,1,1,1,1,1,1
Airport Lounge,2,2,2,2,2,2
Airport Service,2,2,2,2,2,2
Airport Terminal,2,2,2,2,2,2
American Restaurant,34,34,34,34,34,34


We see that many venue categories appear only a few times. It doesn't make sense to include these in segmentation because they don't appear often enough to have an impact, but they contribute to noise in the dataset. Therefore we will exclude venue categories that appear fewer than 5 times from the dataset.  

But first, we have to check how many venue categories appear fewer than 5 times to ensure that we still have enough venue categories left for segmentation.

In [29]:
#store the results of the above counts into a dataframe
toronto_venues_count = toronto_venues.groupby('Venue Category').count()
print('There are {} venue categories that appear less than 5 times.'.format(toronto_venues_count[toronto_venues_count['Postcode'] < 5].shape[0]))

There are 164 venue categories that appear less than 5 times.


The count above shows that 164 venue categories fall below this threshold. The shape check after filtering (below) shows that 111 categories remain, which should be sufficient for segmentation.  

Therefore we will exclude the venue categories that appear fewer than 5 times.

In [30]:
#create list with venue categories to exclude
categories_to_exclude = toronto_venues_count[toronto_venues_count['Postcode'] < 5].index.tolist()
#create filtered dataframe by excluding venue categories in above list
toronto_venues_filt = toronto_venues[~toronto_venues['Venue Category'].isin(categories_to_exclude)]
#check size of resulting dataframe
toronto_venues_filt.groupby('Venue Category').count().shape

(111, 6)
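The same filter can be expressed with `value_counts`, which avoids building the intermediate grouped dataframe. A minimal sketch on toy data, with the threshold held in a hypothetical `min_count` variable:

```python
import pandas as pd

# Toy venues: two frequent categories and one rare one
venues = pd.DataFrame({
    'Venue Category': ['Coffee Shop'] * 5 + ['Park'] * 5 + ['Airport'] * 2
})

min_count = 5  # categories seen fewer times than this are dropped

# Count occurrences per category and keep the names that meet the threshold
category_counts = venues['Venue Category'].value_counts()
frequent = category_counts[category_counts >= min_count].index

filtered = venues[venues['Venue Category'].isin(frequent)]
print(filtered['Venue Category'].value_counts())
```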

The number of venue categories is sufficient for further analysis. We will assign the filtered dataframe *toronto_venues_filt* back to the original name *toronto_venues*.

In [31]:
toronto_venues = toronto_venues_filt

<a id="toc3"></a>
## 3. Analyze Each Neighborhood

We will do one hot encoding to pivot category values into columns of the dataframe.  

One detail requires care: one of the category values is *Neighborhood*. After one hot encoding, this value will become a column name, which is easily confused with a column holding the neighborhood name. To avoid confusing the two, we will rename the column that comes from one hot encoding to *Neighborhood Category*.

In [32]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#rename the column 'Neighborhood' which represents a category name to 'Neighborhood Category' 
#this is to distinguish this column from the 'Neighborhood' column which we want to continue to use as the neighborhood name
toronto_onehot.rename(columns={'Neighborhood':'Neighborhood Category'}, inplace=True)

# add the Postcode column back to the dataframe
toronto_onehot['Postcode'] = toronto_venues['Postcode'] 

# move the Postcode column to the first position
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#toronto_onehot.head()


Check the new dataframe size:

In [33]:
toronto_onehot.shape

(1931, 112)

#### Group rows by postcode, taking the mean of the frequency of occurrence of each category

In [34]:
toronto_grouped = toronto_onehot.groupby('Postcode').mean().reset_index()
toronto_grouped.head(10)

Unnamed: 0,Postcode,American Restaurant,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.142857,...,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M1J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M1K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,M1L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M1M,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M1N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Check the new dataframe size:

In [35]:
toronto_grouped.shape

(97, 112)
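The two steps above, one hot encoding followed by a group mean, turn a list of venues into one row of category frequencies per postcode. A small sketch on three made-up venues shows the effect:

```python
import pandas as pd

# Toy venues: one in M1B and two in M1E
venues = pd.DataFrame({
    'Postcode': ['M1B', 'M1E', 'M1E'],
    'Venue Category': ['Fast Food Restaurant', 'Pizza Place', 'Electronics Store'],
})

# One column per category, 1 where the venue belongs to it and 0 otherwise
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Postcode', venues['Postcode'])

# The mean of the 0/1 indicators is the share of venues in each category
grouped = onehot.groupby('Postcode').mean().reset_index()
print(grouped)
```

For M1E, each of the two categories gets a frequency of 0.5, and each row sums to 1 across the category columns.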

#### Store the most common venues in a *pandas* dataframe

Write a function to sort the venues in descending order.

In [36]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
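A quick check of the function on a single hand-built row, with made-up frequencies, shows how it skips the first element (the postcode) and ranks the rest. The function is repeated so that the sketch runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the leading Postcode entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row shaped like one row of toronto_grouped (frequencies are illustrative)
row = pd.Series({'Postcode': 'M1H', 'Bakery': 0.2, 'Lounge': 0.5,
                 'Bank': 0.3, 'Yoga Studio': 0.0})

top = return_most_common_venues(row, 3)
print(list(top))
```

Note that categories with a frequency of 0 are still eligible, so for a postcode with only one or two venue categories the remaining "most common" slots are filled with categories that do not occur there at all.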

Create the new dataframe and display the top 10 venues for each neighborhood.

In [37]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns = columns)
neighborhoods_venues_sorted['Postcode'] = toronto_grouped['Postcode']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega
1,M1C,Bar,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega
2,M1E,Mexican Restaurant,Breakfast Spot,Electronics Store,Pizza Place,Yoga Studio,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop
3,M1G,Coffee Shop,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Department Store
4,M1H,Lounge,Fried Chicken Joint,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Bakery,Bank,Yoga Studio,Deli / Bodega,Diner


In [38]:
neighborhoods_venues_sorted.groupby(['1st Most Common Venue']).size()

1st Most Common Venue
American Restaurant            1
Bakery                         1
Bank                           1
Bar                            3
Beer Store                     1
Café                           6
Chinese Restaurant             2
Clothing Store                 3
Coffee Shop                   23
Construction & Landscaping     2
Discount Store                 2
Dog Run                        1
Fast Food Restaurant           2
Food Truck                     1
Gift Shop                      1
Greek Restaurant               1
Grocery Store                  6
Gym                            1
Gym / Fitness Center           2
Indian Restaurant              2
Japanese Restaurant            1
Lounge                         1
Mexican Restaurant             1
Miscellaneous Shop             1
Park                          12
Pharmacy                       1
Pizza Place                    9
Playground                     4
Ramen Restaurant               1
Sandwich Place       

<a id="toc4"></a>
## 4. Cluster Neighborhoods


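Before clustering, we add the normalized school counts and the *Percent Occupied* column to the grouped venue frequencies as extra features. We then run *k*-means to cluster the neighborhoods into 5 clusters.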

In [39]:
#join school counts
toronto_grouped = toronto_grouped.join(schools_count_pivot.set_index('Postal Code'), on = 'Postcode')

#replace null values with 0
toronto_grouped['Elementary Normalized'].fillna(0, inplace=True)
toronto_grouped['Secondary Normalized'].fillna(0, inplace=True)
toronto_grouped.head()

Unnamed: 0,Postcode,American Restaurant,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,...,Thai Restaurant,Theater,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio,Elementary Normalized,Secondary Normalized
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0625
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.125
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8125,0.5625
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.625,0.125
4,M1H,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.142857,...,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0625


In [40]:
#join population data
toronto_grouped = toronto_grouped.join(toronto_population.set_index('Geographic code'), on = 'Postcode')

#replace null values with 0
toronto_grouped['Percent Occupied'].fillna(0, inplace=True)
toronto_grouped.head()

Unnamed: 0,Postcode,American Restaurant,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,...,Theater,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio,Elementary Normalized,Secondary Normalized,Percent Occupied
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0625,0.306014
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.125,0.316454
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8125,0.5625,0.365571
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.625,0.125,0.328966
4,M1H,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.142857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0625,0.368494


In [41]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Postcode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([4, 3, 4, 4, 0, 2, 4, 0, 4, 0, 4, 4, 4, 4, 1, 4, 4, 4, 0, 4, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 4, 4, 0, 0, 4, 0, 0, 0, 1, 4, 0, 0, 1, 0, 0, 0, 2,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 4, 2, 2, 0, 2, 2,
       1, 4, 4, 4, 0, 1, 4, 4, 4, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 4, 0, 0,
       1, 0, 1, 4, 3], dtype=int32)
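
The choice of five clusters is taken as given here. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (inertia) across several values of *k* and look for the point where it stops dropping sharply. The following is a minimal sketch on synthetic data, not the project's data: `toy_features` is a hypothetical stand-in for `toronto_grouped_clustering`, and the three well-separated groups are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for toronto_grouped_clustering: three well-separated
# groups of points in a 2-D feature space (synthetic, not project data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_features = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

# Fit k-means for several candidate k values and record the inertia
# (sum of squared distances from each point to its assigned centroid).
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_features)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.3f}")
```

On data like this the inertia falls steeply up to the true number of groups and flattens afterwards. Real venue data is rarely this clean, so the curve is a guide rather than a rule.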

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [42]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = neighborhoods_venues_sorted

# join with toronto_neighborhoods to add borough, neighbourhood name and latitude/longitude for each postal code
toronto_merged = toronto_merged.join(toronto_neighborhoods.set_index('Postcode'), on='Postcode')

toronto_merged.head() # check the last columns!

Unnamed: 0,Cluster Labels,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
0,4,M1B,Fast Food Restaurant,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,0.0625,0.306014
1,3,M1C,Bar,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.5,0.125,0.316454
2,4,M1E,Mexican Restaurant,Breakfast Spot,Electronics Store,Pizza Place,Yoga Studio,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.8125,0.5625,0.365571
3,4,M1G,Coffee Shop,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Department Store,Scarborough,Woburn,43.770992,-79.216917,0.625,0.125,0.328966
4,0,M1H,Lounge,Fried Chicken Joint,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Bakery,Bank,Yoga Studio,Deli / Bodega,Diner,Scarborough,Cedarbrae,43.773136,-79.239476,0.25,0.0625,0.368494


Visualize the resulting clusters.

In [43]:
address = 'Toronto'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


In [157]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Note: because GitHub doesn't display Folium maps, a screenshot of the map is available <a href='img/Toronto.png'>here</a>.

<a id="toc5"></a>
## 5. Examine Clusters


We will examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster.
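
One way to make this inspection less subjective is to average the venue-frequency columns within each cluster and list the categories with the highest mean. The sketch below assumes a frame shaped like `toronto_grouped` with the cluster labels attached. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of toronto_grouped with cluster labels attached
# (values are illustrative, not taken from the project).
toy = pd.DataFrame({
    'Cluster Labels': [0, 0, 1, 1],
    'Park':        [0.0, 0.1, 0.6, 0.5],
    'Coffee Shop': [0.5, 0.4, 0.1, 0.2],
    'Pizza Place': [0.2, 0.3, 0.0, 0.1],
})

# Mean frequency of each venue category within each cluster.
cluster_profile = toy.groupby('Cluster Labels').mean()

# For each cluster, keep the two categories with the highest mean frequency.
top_categories = {
    label: row.nlargest(2).index.tolist()
    for label, row in cluster_profile.iterrows()
}
print(top_categories)
```

Categories that rank high in one cluster but low in the others are the natural candidates for naming that cluster.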

### Cluster 0: Residential

In [45]:
toronto_cluster0 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster0

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
4,M1H,Lounge,Fried Chicken Joint,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Bakery,Bank,Yoga Studio,Deli / Bodega,Diner,Scarborough,Cedarbrae,43.773136,-79.239476,0.25,0.0625,0.368494
7,M1L,Bakery,Bus Line,Park,Intersection,Food Truck,Fast Food Restaurant,Yoga Studio,Diner,Concert Hall,Construction & Landscaping,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577,0.375,0.0625,0.354266
9,M1N,Café,Skating Rink,Yoga Studio,Discount Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,0.4375,0.0625,0.410869
18,M2K,Chinese Restaurant,Café,Japanese Restaurant,Bank,Yoga Studio,Dog Run,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,North York,Bayview Village,43.786947,-79.385975,0.1875,0.0,0.441892
20,M2P,Park,Bank,Bar,Yoga Studio,Dog Run,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,North York,York Mills West,43.752758,-79.400049,0.125,0.125,0.385057
21,M2R,Pharmacy,Wine Bar,Coffee Shop,Pizza Place,Diner,Cocktail Bar,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,North York,Willowdale West,43.782736,-79.442259,0.375,0.1875,0.381815
23,M3B,Gym / Fitness Center,Pool,Caribbean Restaurant,Café,Japanese Restaurant,Baseball Field,Discount Store,Concert Hall,Construction & Landscaping,Convenience Store,North York,Don Mills North,43.745906,-79.352188,0.1875,0.0625,0.375338
24,M3C,Coffee Shop,Gym,Asian Restaurant,Beer Store,Italian Restaurant,Clothing Store,Restaurant,Sandwich Place,Fast Food Restaurant,Sporting Goods Shop,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923,0.375,0.125,0.404848
25,M3H,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Diner,Restaurant,Bank,Deli / Bodega,Sushi Restaurant,Fried Chicken Joint,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259,0.3125,0.0625,0.384372
26,M3J,Miscellaneous Shop,Bar,Coffee Shop,Yoga Studio,Dog Run,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,North York,"Northwood Park, York University",43.76798,-79.487262,0.25,0.0625,0.369293


Venue categories in this cluster are predominantly shops, with some interspersed coffee shops and restaurants as well as parks and sporting venues. It appears that this cluster represents residential areas.

### Cluster 1: Parks

In [46]:
toronto_cluster1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster1

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
14,M1V,Park,Playground,Grocery Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Scarborough,"Agincourt North, L'Amoreaux East, Milliken, St...",43.815252,-79.284577,0.6875,0.0625,0.294294
22,M3A,Park,Food & Drink Shop,Fast Food Restaurant,Yoga Studio,Discount Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,North York,Parkwoods,43.753259,-79.329656,0.625,0.125,0.382522
27,M3K,Park,Construction & Landscaping,Yoga Studio,Gym,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,North York,"CFB Toronto, Downsview East",43.737473,-79.464763,0.0625,0.0625,0.368184
37,M4J,Park,Coffee Shop,Convenience Store,Gym,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Cosmetics Shop,Creperie,Deli / Bodega,East York,East Toronto,43.685347,-79.338106,0.5625,0.5625,0.412838
41,M4N,Park,Bus Line,Yoga Studio,Coffee Shop,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Central Toronto,Lawrence Park,43.72802,-79.38879,0.1875,0.0625,0.403783
47,M4W,Park,Playground,Trail,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Downtown Toronto,Rosedale,43.679563,-79.377529,0.0625,0.0625,0.499073
69,M6E,Park,Pharmacy,Women's Store,Fast Food Restaurant,Discount Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,York,Caledonia-Fairbanks,43.689026,-79.453512,0.25,0.0625,0.386977
74,M6L,Park,Construction & Landscaping,Bakery,Yoga Studio,Dog Run,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,North York,"Maple Leaf Park, North Park, Upwood Park",43.713756,-79.490074,0.25,0.0,0.358314
92,M9N,Park,Yoga Studio,Gym,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,York,Weston,43.706876,-79.518188,0.25,0.0625,0.405639
94,M9R,Pizza Place,Park,Bus Line,Yoga Studio,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724,0.4375,0.1875,0.365587


In every neighborhood in this cluster, the first or second most common venue type is a park.

### Cluster 2: Downtown

In [47]:
toronto_cluster2 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster2

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
5,M1J,Playground,Gym,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Scarborough,Scarborough Village,43.744734,-79.239476,0.3125,0.0,0.334451
45,M4T,Playground,Trail,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,0.125,0.0,0.498136
57,M5K,Coffee Shop,Café,Hotel,Restaurant,American Restaurant,Italian Restaurant,Deli / Bodega,Gastropub,Bar,Beer Bar,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576,0.0,0.0,0.0
58,M5L,Coffee Shop,Café,Hotel,Restaurant,American Restaurant,Steakhouse,Italian Restaurant,Gym,Gastropub,Seafood Restaurant,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,0.0,0.0,0.0
64,M5W,Coffee Shop,Restaurant,Café,Italian Restaurant,Pub,Seafood Restaurant,Cocktail Bar,Hotel,Beer Bar,Japanese Restaurant,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,43.646435,-79.374846,0.0,0.0,0.0
65,M5X,Café,Coffee Shop,Hotel,Restaurant,American Restaurant,Gym,Seafood Restaurant,Deli / Bodega,Steakhouse,Bar,Downtown Toronto,"First Canadian Place, Underground city",43.648429,-79.38228,0.0,0.0,0.0
67,M6B,Playground,Pub,Bakery,Japanese Restaurant,Discount Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,North York,Glencairn,43.709577,-79.445073,0.125,0.0,0.376131
68,M6C,Playground,Trail,Dog Run,Gastropub,Diner,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,York,Humewood-Cedarvale,43.693781,-79.428191,0.3125,0.0625,0.426248
80,M7A,Coffee Shop,Sushi Restaurant,Gym,Diner,Japanese Restaurant,Chinese Restaurant,Restaurant,Bubble Tea Shop,Burger Joint,Burrito Place,Queen's Park,Queen's Park,43.662301,-79.389494,0.0,0.0,0.0
81,M7R,Coffee Shop,Hotel,Gym / Fitness Center,Sandwich Place,Burrito Place,Fried Chicken Joint,Mediterranean Restaurant,Middle Eastern Restaurant,American Restaurant,Sushi Restaurant,Mississauga,Canada Post Gateway Processing Centre,43.636966,-79.615819,0.0,0.0,0.0


Most of the neighborhoods in this cluster appear to be located near downtown. The venue categories are predominantly restaurants, coffee shops and hotels, with some shops, gyms and playgrounds. Mail processing centers also appear to be included in this segment.

### Cluster 3: Outliers

In [48]:
toronto_cluster3 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster3

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
1,M1C,Bar,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.5,0.125,0.316454
96,M9W,Bar,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Etobicoke,Northwest,43.706748,-79.594054,0.4375,0.125,0.33684


This cluster represents outliers. Each of the two postal codes has just one venue, a bar, so every other venue category has a frequency of 0. Their venue features are therefore identical, and they differ only slightly in the school and occupancy features, which is why the algorithm groups them together. To improve the segmentation, we should remove all postal codes where the number of different venues is less than 10.
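
The filtering step suggested above could be sketched as follows. This is an illustration under assumptions, not part of the original analysis: `toy_venues` is a hypothetical venue-level frame with one row per venue, and the threshold name `min_venues` is introduced here.

```python
import pandas as pd

# Hypothetical venue listing: one row per venue returned for a postal code.
toy_venues = pd.DataFrame({
    'Postcode': ['M1C'] + ['M5K'] * 12,
    'Venue': ['Bar A'] + [f'Venue {i}' for i in range(12)],
})

min_venues = 10  # threshold suggested in the text

# Count distinct venues per postal code and keep codes meeting the threshold.
venue_counts = toy_venues.groupby('Postcode')['Venue'].nunique()
keep_codes = venue_counts[venue_counts >= min_venues].index

# Drop all rows belonging to sparsely covered postal codes.
filtered = toy_venues[toy_venues['Postcode'].isin(keep_codes)]
print(sorted(filtered['Postcode'].unique()))
```

Applying such a filter before one-hot encoding and grouping would keep single-venue postal codes from forming their own cluster.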

### Cluster 4: Quick eats

In [49]:
toronto_cluster4 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster4

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
0,M1B,Fast Food Restaurant,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,0.0625,0.306014
2,M1E,Mexican Restaurant,Breakfast Spot,Electronics Store,Pizza Place,Yoga Studio,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.8125,0.5625,0.365571
3,M1G,Coffee Shop,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Department Store,Scarborough,Woburn,43.770992,-79.216917,0.625,0.125,0.328966
6,M1K,Discount Store,Coffee Shop,Convenience Store,Department Store,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Cosmetics Shop,Creperie,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029,0.6875,0.25,0.370194
8,M1M,American Restaurant,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Department Store,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476,0.4375,0.0625,0.376337
10,M1P,Indian Restaurant,Pet Store,Vietnamese Restaurant,Chinese Restaurant,Brewery,Discount Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Scarborough,"Dorset Park, Scarborough Town Centre, Wexford ...",43.75741,-79.273304,0.4375,0.4375,0.36295
11,M1R,Sandwich Place,Middle Eastern Restaurant,Breakfast Spot,Smoke Shop,Yoga Studio,Cocktail Bar,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Scarborough,"Maryvale, Wexford",43.750072,-79.295849,0.4375,0.3125,0.360607
12,M1S,Chinese Restaurant,Sandwich Place,Breakfast Spot,Lounge,Skating Rink,Diner,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Scarborough,Agincourt,43.7942,-79.262029,0.625,0.25,0.334772
13,M1T,Pizza Place,Thai Restaurant,Fast Food Restaurant,Italian Restaurant,Noodle House,Fried Chicken Joint,Chinese Restaurant,Comfort Food Restaurant,Concert Hall,Dessert Shop,Scarborough,"Clarks Corners, Sullivan, Tam O'Shanter",43.781638,-79.304302,0.625,0.0625,0.374378
15,M1W,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Pizza Place,Breakfast Spot,Japanese Restaurant,Pharmacy,Sandwich Place,Grocery Store,Bakery,Scarborough,"L'Amoreaux West, Steeles West",43.799525,-79.318389,0.6875,0.1875,0.333663


Venue categories in this cluster appear to be predominantly fast food restaurants, coffee shops, sandwich places, pizza places and grocery stores, all of which suggest places where one can find something quick to eat.

# Discussion

Of the clusters above, the following two appear best suited for families with children:

* **Cluster 0: Residential.** Venue categories in this cluster are predominantly shops, with some interspersed coffee shops and restaurants as well as parks and sporting venues, all of which are suitable for families.
* **Cluster 4: Quick eats.** Venue categories in this cluster appear to be predominantly fast food restaurants, coffee shops, sandwich places, pizza places and grocery stores, all of which suggest places where one can find something quick to eat. These types of places are typical of shopping malls, which in turn suggest residential areas.

Let's do some analysis to verify these observations.

In [183]:
toronto_merged.groupby('Cluster Labels').mean(numeric_only=True).reset_index().drop(columns=['Latitude', 'Longitude'])

Unnamed: 0,Cluster Labels,Elementary Normalized,Secondary Normalized,Percent Occupied
0,0,0.207031,0.065104,0.450952
1,1,0.3375,0.125,0.387721
2,2,0.079545,0.005682,0.148633
3,3,0.46875,0.125,0.326647
4,4,0.596154,0.199519,0.384852


We can see that the value of *Percent Occupied* is largest in cluster 0. This makes sense, because we would expect a higher percentage of occupied dwellings in residential areas.  

We can also see that the normalized counts of both elementary and secondary schools are highest in cluster 4, which suggests that these are neighborhoods where families with school-age children reside.

In [150]:
#neighborhoods that have been flagged as best to live in
toronto_flagged = pd.read_excel('https://github.com/mferle/Coursera_Capstone/blob/master/data/Top_blogs.xlsx?raw=true')
toronto_flagged.head()

Unnamed: 0,Postal Code,TopFamilyFlag
0,M5N,1
1,M2K,1
2,M5M,1
3,M6S,1
4,M6H,1


In [163]:
tf = toronto_flagged.join(toronto_merged.set_index('Postcode'), on = 'Postal Code')
tf = tf[tf['Cluster Labels'].notna()].copy()  # keep matched postal codes; copy so the assignment below is safe
tf['Cluster Labels'] = tf['Cluster Labels'].astype(int)
tf.head()

Unnamed: 0,Postal Code,TopFamilyFlag,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
1,M2K,1,0,Chinese Restaurant,Café,Japanese Restaurant,Bank,Yoga Studio,Dog Run,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,North York,Bayview Village,43.786947,-79.385975,0.1875,0.0,0.441892
2,M5M,1,0,Coffee Shop,Fast Food Restaurant,Italian Restaurant,Indian Restaurant,Comfort Food Restaurant,Pharmacy,Pizza Place,Café,Pub,Restaurant,North York,"Bedford Park, Lawrence Manor East",43.733283,-79.41975,0.25,0.0,0.362194
3,M6S,1,0,Coffee Shop,Pizza Place,Café,Italian Restaurant,Diner,Sushi Restaurant,Pharmacy,Sandwich Place,French Restaurant,Food & Drink Shop,West Toronto,"Runnymede, Swansea",43.651571,-79.48445,0.3125,0.0625,0.427272
4,M6H,1,4,Discount Store,Pharmacy,Bakery,Supermarket,Gym / Fitness Center,Brewery,Fast Food Restaurant,Liquor Store,Middle Eastern Restaurant,Music Venue,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259,0.5,0.1875,0.427341
5,M4C,1,4,Beer Store,Park,Cosmetics Shop,Skating Rink,Pharmacy,Discount Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,East York,Woodbine Heights,43.695344,-79.318389,0.5,0.25,0.414608


In [162]:
# reuse the cluster map (map_clusters) and overlay the flagged neighborhoods

# add a marker for each flagged neighborhood
for lat, lon, neigh in zip(tf['Latitude'], tf['Longitude'], tf['Neighbourhood']):
    label = folium.Popup(neigh, parse_html=True)
    folium.Marker(
        [lat, lon],
        popup=label).add_to(map_clusters)
       
map_clusters

In [164]:
tfg = tf[['Postal Code', 'Cluster Labels', 'Neighbourhood']]
tfg

Unnamed: 0,Postal Code,Cluster Labels,Neighbourhood
1,M2K,0,Bayview Village
2,M5M,0,"Bedford Park, Lawrence Manor East"
3,M6S,0,"Runnymede, Swansea"
4,M6H,4,"Dovercourt Village, Dufferin"
5,M4C,4,Woodbine Heights
6,M4S,0,Davisville
7,M4V,0,"Deer Park, Forest Hill SE, Rathnelly, South Hi..."
8,M3B,0,Don Mills North
9,M9C,4,"Bloordale Gardens, Eringate, Markland Wood, Ol..."
10,M9L,0,Humber Summit


In [149]:
tfg.groupby('Cluster Labels').count()

Unnamed: 0_level_0,Postal Code,Neighbourhood
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1
0,12,12
1,1,1
4,5,5


In [168]:
toronto_grouped[toronto_grouped['Postcode'] == 'M1C']

Unnamed: 0,Postcode,American Restaurant,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,...,Theater,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio,Elementary Normalized,Secondary Normalized,Percent Occupied
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.125,0.316454


In [169]:
toronto_grouped[toronto_grouped['Postcode'] == 'M9W']

Unnamed: 0,Postcode,American Restaurant,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,...,Theater,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio,Elementary Normalized,Secondary Normalized,Percent Occupied
96,M9W,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4375,0.125,0.33684


In [170]:
toronto_merged[toronto_merged['Postcode'] == 'M1C']

Unnamed: 0,Cluster Labels,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
1,3,M1C,Bar,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.5,0.125,0.316454


In [171]:
neighborhoods_venues_sorted[neighborhoods_venues_sorted['Postcode'] == 'M1C']

Unnamed: 0,Cluster Labels,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,3,M1C,Bar,Yoga Studio,Dog Run,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega
