## Coursera Capstone: Identify Residential Neighborhoods in Toronto, Canada
This is the Capstone Project: The Battle of Neighborhoods

## Introduction

The purpose of this project is to identify residential neighborhoods in Toronto, Canada that are suitable for families with children.  

We will start by scraping the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a> to get a list of neighborhoods in Toronto. Then we will find the latitude and longitude of each neighborhood. Next, we will use the Foursquare API to explore the neighborhoods. We will get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. We will use the *k*-means clustering algorithm to complete this task. Finally, we will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

1. <a href="#toc1">Acquire Data</a>

2. <a href="#toc2">Explore Neighborhoods in Toronto</a>

3. <a href="#toc3">Analyze Each Neighborhood</a>

4. <a href="#toc4">Cluster Neighborhoods</a>

5. <a href="#toc5">Examine Clusters</a>    


First, import all the dependencies that we will need for this analysis.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (newer pandas versions also expose this as pd.json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id="toc1"></a>
## 1. Acquire Data

We will scrape the list of Toronto neighborhoods from the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a>. Postal codes beginning with M are located within the city of Toronto. Then we will find the geographical coordinates of each postal code.

#### Read data from web page

We will read the page and convert the table into a dataframe using the *pandas* `read_html` function.

In [2]:
#get Toronto neighborhoods from Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#read the Wikipedia page - returns list of dataframes
dfs = pd.read_html(url, header=0)
#take the first dataframe from the returned list (it should be the only dataframe in the list)
df = dfs[0]
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


We will ignore any records where *Borough* is "Not assigned".

In [3]:
#create new dataframe with records where Borough is not 'Not assigned'
df_assigned = df[df['Borough'] != 'Not assigned']
df_assigned.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In rows where the value of *Neighbourhood* is 'Not assigned' we will replace it with the value of the *Borough*.

In [4]:
#create a list of neighborhoods, replacing the borough where neighborhood is 'Not assigned'
new_neigh = df_assigned['Neighbourhood'].where(df_assigned['Neighbourhood'] != 'Not assigned', other = df_assigned['Borough'], axis = 0)
#construct new dataframe using postcode and borough from the previous dataframe and neighborhood from the above list
df_replaced = pd.concat([df_assigned['Postcode'], df_assigned['Borough'], new_neigh], axis = 1)
df_replaced.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
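
As a quick check of the replacement logic, the following minimal sketch applies the same `where` call to a small hand-made frame. The two rows are illustrative only and the variable names `toy` and `filled` are assumptions made for this example.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Keep the neighbourhood where it is assigned; otherwise fall back to the borough
filled = toy['Neighbourhood'].where(toy['Neighbourhood'] != 'Not assigned',
                                    other=toy['Borough'])
print(filled.tolist())
```

`where` keeps the values for which the condition is true and takes the value from `other` everywhere else, which is why the second row ends up with the borough name.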


Whenever we have more than one row per postcode, we will concatenate all neighborhoods into a comma-separated list.

In [5]:
#group the dataframe by Postcode and Borough and concatenate all neighborhoods into comma separated list
toronto_neighborhoods = df_replaced.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(list).apply(lambda x: ', '.join(x)).to_frame()
toronto_neighborhoods.reset_index(inplace = True)
toronto_neighborhoods.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [6]:
#display the shape of the resulting dataframe
toronto_neighborhoods.shape

(103, 3)
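
The grouping step can be sketched on a toy frame as follows. This is a hedged illustration, not the notebook's own cell: it uses `agg(', '.join)` as a shorter alternative to `apply(list)` followed by a join, and the three rows are copied from the table above purely as sample input.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (Postcode, Borough); neighbourhood names joined with ', '
combined = (toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
               .agg(', '.join)
               .reset_index())
print(combined)
```

Because `groupby` sorts the group keys by default, M3A appears before M5A in the result.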

### Assign geographical coordinates

The next step is to assign geographical coordinates to each postcode. The Coursera instructions suggest using the Geocoder Python package. Unfortunately, I was not able to get usable results from this package. As suggested on the submission page, I decided to use the dataset <a href='http://cocl.us/Geospatial_data'>http://cocl.us/Geospatial_data</a>, which provides predefined coordinates per postal code.  

An alternative way to get coordinates would be to use the <a href='https://geopy.readthedocs.io/en/stable/'>GeoPy</a> library with the Nominatim geolocator service (instead of Google Maps). However, this library does not return coordinates based on postal codes but rather on neighborhood names. We would therefore have to restructure the above dataframe back to what it was before we concatenated neighborhoods into comma-separated values. I decided not to go this way, although I suspect it would have been a viable approach to solving this exercise.

In [7]:
#read provided dataset
coords = pd.read_csv('http://cocl.us/Geospatial_data')
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We will merge this dataset with our *toronto_neighborhoods* dataset from above.

In [8]:
toronto_neighborhoods = pd.merge(toronto_neighborhoods, coords, left_on = 'Postcode', right_on = 'Postal Code')[['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']]
toronto_neighborhoods.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [9]:
toronto_neighborhoods.shape

(103, 5)
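
`pd.merge` performs an inner join by default, so a postcode without coordinates would be dropped silently. The row count staying at 103 indicates that nothing was lost here. A minimal sketch of how such a check could be made explicit, assuming toy frames and a made-up postcode `M9Z`:

```python
import pandas as pd

neigh = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z']})  # 'M9Z' is a made-up code
coords_toy = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                           'Latitude': [43.806686, 43.784535],
                           'Longitude': [-79.194353, -79.160497]})

# A left join with indicator=True marks rows that found no match on the right
checked = pd.merge(neigh, coords_toy, left_on='Postcode', right_on='Postal Code',
                   how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

An empty `missing` list would confirm that every postcode received coordinates.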

### Add additional data

#### Number of Elementary and Secondary schools per postal code

We will enrich the previous dataset with information about how many Elementary and Secondary schools there are in each neighborhood by postal code. A list of public schools can be found on the <a href='https://www.ontario.ca/data/ontario-public-school-contact-information'>Ontario public school contact information</a> web site and this list can be transformed into a table with school counts.

In [10]:
#read the Ontario public schools dataset
df = pd.read_excel('https://files.ontario.ca/opendata/publicly_funded_schools_xlsx_january_2019_en.xlsx')
df.head(3)

Unnamed: 0,Region,Board Number,Board Name,Board Type,Board Language,School Number,School Name,School Level,School Language,School Type,...,City,Province,Postal Code,Phone,Fax,Grade Range,Date Open,School Email,School Website,Board Website
0,Sudbury-North Bay Regional Office,B28010,Algoma DSB,Pub Dist Sch Brd (E/F),English,902344,Algoma Education Connection Secondary School,Secondary,English,Public,...,Sault Ste Marie,Ontario,P6B4J4,705-945-7194,705-945-7173,9-12,2010-09-07,,http://www.adsb.on.ca,www.adsb.on.ca
1,Sudbury-North Bay Regional Office,B28010,Algoma DSB,Pub Dist Sch Brd (E/F),English,19186,Anna McCrea Public School,Elementary,English,Public,...,Sault Ste Marie,Ontario,P6A3M7,705-945-7106,705-945-7221,JK-8,1969-09-01,,http://www.adsb.on.ca/sites/schools/amc/defaul...,www.adsb.on.ca
2,Sudbury-North Bay Regional Office,B28010,Algoma DSB,Pub Dist Sch Brd (E/F),English,67679,Arthur Henderson Public School,Elementary,English,Public,...,Bruce Mines,Ontario,P0R1C0,705-785-3483,705-785-3220,JK-3,1969-09-01,,http://www.adsb.on.ca/sites/schools/art/defaul...,www.adsb.on.ca


We will remove the records where the postal code is null and keep only postal codes that begin with M, which are the Toronto postal codes. Additionally, we will keep only schools whose *School Type* is *Public*, as these are the only ones we are interested in.

In [11]:
#keep only rows where the postal code is not null
df = df[df['Postal Code'].notna()]
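#keep only rows where the postal code begins with M - these are Toronto postal codes
#(copy the result so that the column assignment below does not act on a slice of the original dataframe)
df = df[df['Postal Code'].str.startswith('M')].copy()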
#keep only first 3 characters of postal code
df['Postal Code'] = df['Postal Code'].str[:3]
#keep only public schools
df = df[df['School Type'] == 'Public']

#keep selected columns and store in dataframe called toronto_schools
toronto_schools = df[['School Level', 'School Name', 'Postal Code']]
toronto_schools.head()

Unnamed: 0,School Level,School Name,Postal Code
540,Elementary,Collège français élémentaire,M5B
541,Secondary,Collège français secondaire,M5B
542,Elementary,École élémentaire Académie Alexandre-Dumas,M1E
546,Elementary,École élémentaire Charles-Sauriol,M6N
551,Elementary,École élémentaire Étienne-Brûlé,M2L


We will add the columns *Borough*, *Neighbourhood*, *Latitude* and *Longitude* by joining with the *toronto_neighborhoods* dataset.

In [12]:
toronto_schools = toronto_schools.join(toronto_neighborhoods.set_index('Postcode'), on = 'Postal Code')
toronto_schools.head()

Unnamed: 0,School Level,School Name,Postal Code,Borough,Neighbourhood,Latitude,Longitude
540,Elementary,Collège français élémentaire,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
541,Secondary,Collège français secondaire,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
542,Elementary,École élémentaire Académie Alexandre-Dumas,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
546,Elementary,École élémentaire Charles-Sauriol,M6N,York,"The Junction North, Runnymede",43.673185,-79.487262
551,Elementary,École élémentaire Étienne-Brûlé,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714


#### Population count per postal code

We will enrich the dataset by adding population counts for each neighborhood by postal code as published on the Statistics Canada <a href = 'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&S=22&O=A'>Population and Dwelling Count Highlight Tables, 2016 Census</a> page.  

Three values are provided in this table: 
* Population, 2016: represents the number of Canadians whose usual place of residence is in that area, regardless of where they happened to be on census day
* Total private dwellings, 2016: refers to total private dwellings and private dwellings occupied by usual residents in Canada
* Private dwellings occupied by usual residents, 2016: refers to usual residents, not including tourists

In [13]:
#read population data from csv file downloaded from Statistics Canada
df = pd.read_csv('https://raw.githubusercontent.com/mferle/Coursera_Capstone/master/data/T120120190215054507.csv')

#keep only rows where the postal code begins with M - these are Toronto postal codes
df = df[df['Geographic code'].str.startswith('M')]

#remove any rows where the population count is less than 100
df = df[df['Population, 2016'] > 99]

#keep only columns we need for analysis
df = df[['Geographic code', 'Population, 2016', 'Total private dwellings, 2016', 'Private dwellings occupied by usual residents, 2016']]
df.head()

Unnamed: 0,Geographic code,"Population, 2016","Total private dwellings, 2016","Private dwellings occupied by usual residents, 2016"
895,M1B,66108,20957,20230
896,M1C,35626,11588,11274
897,M1E,46943,17637,17161
898,M1G,29690,10116,9767
899,M1H,24383,9274,8985


For the analysis, we are interested in residential areas and therefore want to exclude tourists, so our column of interest is *Private dwellings occupied by usual residents, 2016*. We will divide this number by the total population count to obtain the ratio of resident-occupied dwellings to population, stored as *Percent Occupied*. In residential areas, where most of the population is not made up of tourists, this ratio should be higher than elsewhere.

In [14]:
#calculate percentage and add to dataframe as new column
pct = df['Private dwellings occupied by usual residents, 2016'] / df['Population, 2016']
df.insert(df.shape[1], 'Percent Occupied', pct)
#drop unwanted columns
toronto_population = df.drop(['Population, 2016', 'Total private dwellings, 2016', 'Private dwellings occupied by usual residents, 2016'], axis=1)
toronto_population.head()


Unnamed: 0,Geographic code,Percent Occupied
895,M1B,0.306014
896,M1C,0.316454
897,M1E,0.365571
898,M1G,0.328966
899,M1H,0.368494
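
As a worked check of the *Percent Occupied* calculation, the sketch below repeats the division for the first two postal codes using the figures from the population table above. The dictionary layout and key names are assumptions made for this example.

```python
# Values for M1B and M1C copied from the population table above
rows = {
    'M1B': {'population': 66108, 'occupied_dwellings': 20230},
    'M1C': {'population': 35626, 'occupied_dwellings': 11274},
}

# Ratio of dwellings occupied by usual residents to total population
ratios = {code: r['occupied_dwellings'] / r['population'] for code, r in rows.items()}
for code, ratio in ratios.items():
    print(code, round(ratio, 6))
```

The printed values agree with the *Percent Occupied* column shown above, confirming that the column holds a ratio between 0 and 1 rather than a value scaled to 100.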


#### Join population data with Toronto neighborhoods dataframe

In [15]:
#join population data
toronto_neighborhoods = toronto_neighborhoods.join(toronto_population.set_index('Geographic code'), on = 'Postcode')

#replace null values with 0
toronto_neighborhoods['Percent Occupied'] = toronto_neighborhoods['Percent Occupied'].fillna(0)
toronto_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Percent Occupied
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.306014
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.316454
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.365571
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.328966
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.368494


<a id="toc2"></a>
## 2. Explore Neighborhoods in Toronto

Next, we will use the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [16]:
# The code was removed by Watson Studio for sharing.

#### Let's explore the first neighborhood in our dataframe

Find the name, latitude and longitude of the first neighborhood in the dataframe.

In [17]:
neighborhood_name = toronto_neighborhoods.loc[0, 'Neighbourhood'] # neighborhood name
neighborhood_latitude = toronto_neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge, Malvern are 43.806686299999996, -79.19435340000001.


#### Now, let's get the top 100 venues that are in the above neighborhood within a radius of 500 meters

First we will create the GET request URL.

In [18]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
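
The same kind of request URL can also be assembled from a parameter dictionary, which avoids long format strings and takes care of URL encoding. This is a minimal sketch using only the standard library; the credential strings and the version date are placeholders, not the values used in this notebook.

```python
from urllib.parse import urlencode

# Placeholder credentials for illustration; real values come from a Foursquare account
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',  # hypothetical version date
    'll': '43.806686,-79.194353',
    'radius': 500,
    'limit': 100,
}

# urlencode percent-encodes each value and joins the pairs with '&'
example_url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)
print(example_url)
```

Note that the comma in the `ll` value is percent-encoded as `%2C` in the resulting URL.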


Send the GET request and examine the results.

In [19]:
results = requests.get(url).json()
#results  # uncomment to inspect the raw response (a nested dictionary)

We see that all the information that we want is in the *items* key. Before we proceed, we will create the *get_category_type* function which extracts the category name from a JSON object.

In [20]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [21]:
#extract the items key from the results
venues = results['response']['groups'][0]['items']
#flatten JSON into a dataframe
nearby_venues = json_normalize(venues) 
#filter columns that we need for further analysis
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
#extract the category for each row using the previously defined function
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# optional: keep only the last part of each column name (e.g. 'venue.name' -> 'name'); left disabled here
#nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056


Create a function that repeats the same process for all the neighborhoods in Toronto.

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
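
The function above reads `v['venue']['categories'][0]['name']` directly, which would raise an `IndexError` for a venue with an empty category list. A hedged sketch of a more defensive lookup follows; `first_category_name` is a hypothetical helper and the items are hand-written to mimic the structure used above, not real API output.

```python
def first_category_name(item):
    """Return the first category name of a venue item, or None if it has none."""
    categories = item.get('venue', {}).get('categories', [])
    return categories[0]['name'] if categories else None

# Hand-written items that mimic the response structure (not real API output)
items = [
    {'venue': {'name': 'Venue A', 'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue B', 'categories': []}},
]

names = [first_category_name(item) for item in items]
print(names)
```

Venues that come back with `None` could then be dropped or labeled separately before building the dataframe.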

The following code executes the above function on each neighborhood and creates a new dataframe called *toronto_venues*.

In [23]:
toronto_venues = getNearbyVenues(names = toronto_neighborhoods['Postcode'],
                                   latitudes = toronto_neighborhoods['Latitude'],
                                   longitudes = toronto_neighborhoods['Longitude']
                                  )

M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M2H
M2J
M2K
M2L
M2M
M2N
M2P
M2R
M3A
M3B
M3C
M3H
M3J
M3K
M3L
M3M
M3N
M4A
M4B
M4C
M4E
M4G
M4H
M4J
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5M
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6A
M6B
M6C
M6E
M6G
M6H
M6J
M6K
M6L
M6M
M6N
M6P
M6R
M6S
M7A
M7R
M7Y
M8V
M8W
M8X
M8Y
M8Z
M9A
M9B
M9C
M9L
M9M
M9N
M9P
M9R
M9V
M9W


Check the size and first few rows of the resulting dataframe.

In [24]:
print(toronto_venues.shape)
toronto_venues.head()

(2254, 7)


Unnamed: 0,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1C,43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum
3,M1E,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


Count the number of appearances of each venue category.

In [25]:
toronto_venues.groupby('Venue Category').count()

Venue Category,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Accessories Store,2,2,2,2,2,2
Adult Boutique,1,1,1,1,1,1
Afghan Restaurant,1,1,1,1,1,1
Airport,2,2,2,2,2,2
Airport Food Court,1,1,1,1,1,1
Airport Gate,1,1,1,1,1,1
Airport Lounge,2,2,2,2,2,2
Airport Service,2,2,2,2,2,2
Airport Terminal,2,2,2,2,2,2
American Restaurant,34,34,34,34,34,34


We see that many venue categories appear only a few times. Including them in the segmentation makes little sense: they do not appear often enough to have an impact, yet they add noise to the dataset. Therefore, we will exclude venue categories that appear fewer than 10 times.  

But first, we have to check how many venue categories appear fewer than 10 times to ensure that we still have enough venue categories left for segmentation.

In [26]:
#store the results of the above counts into a dataframe
toronto_venues_count = toronto_venues.groupby('Venue Category').count()
print('There are {} venue categories that appear less than 10 times.'.format(toronto_venues_count[toronto_venues_count['Postcode'] < 10].shape[0]))

There are 220 venue categories that appear less than 10 times.


We have a total of 279 venue categories, so even after removing 220 of them we should still have sufficient data for segmentation.  

Therefore, we will exclude the venue categories that appear fewer than 10 times.

In [62]:
#create list of venue categories to exclude (those that appear fewer than 10 times)
neigh_to_exclude = toronto_venues_count[toronto_venues_count['Postcode'] < 10].index.tolist()
#create filtered dataframe by excluding the venue categories in the above list
toronto_venues_filt = toronto_venues[~toronto_venues['Venue Category'].isin(neigh_to_exclude)]
#check size of resulting dataframe
toronto_venues_filt.groupby('Venue Category').count().shape

(62, 6)
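
The same filtering idea can be sketched compactly with `value_counts`. This is an illustrative alternative on toy data, not the notebook's own cell; the threshold of 2 is chosen only to keep the example small.

```python
import pandas as pd

toy = pd.DataFrame({'Venue Category': ['Café', 'Café', 'Café', 'Airport', 'Bar', 'Bar']})
min_count = 2  # illustrative threshold; the analysis above uses 10

# Count occurrences per category and keep only the frequent ones
counts = toy['Venue Category'].value_counts()
frequent = counts[counts >= min_count].index
toy_filt = toy[toy['Venue Category'].isin(frequent)]
print(sorted(toy_filt['Venue Category'].unique()))
```

Here the single *Airport* row is removed, while the *Café* and *Bar* rows are kept.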

The number of venue categories is sufficient for further analysis. We will assign the filtered dataset *toronto_venues_filt* back to the original name *toronto_venues*.

In [63]:
toronto_venues = toronto_venues_filt

In [64]:
toronto_venues.groupby('Venue Category').count()

Venue Category,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
American Restaurant,34,34,34,34,34,34
Art Gallery,13,13,13,13,13,13
Asian Restaurant,16,16,16,16,16,16
Bakery,52,52,52,52,52,52
Bank,18,18,18,18,18,18
Bar,44,44,44,44,44,44
Beer Bar,15,15,15,15,15,15
Bookstore,18,18,18,18,18,18
Breakfast Spot,24,24,24,24,24,24
Brewery,13,13,13,13,13,13


#### Add school dataset as additional venue categories

We will add the school dataset, where the school levels *Elementary* and *Secondary* will each represent a venue category.  

First, we will restructure the *toronto_schools* dataset with columns that correspond to the *toronto_venues* dataset.

In [65]:
toronto_schools_as_venues = pd.concat([toronto_schools['Postal Code'], toronto_schools['Latitude'], toronto_schools['Longitude'], toronto_schools['School Name'], toronto_schools['Latitude'], toronto_schools['Longitude'], toronto_schools['School Level']], axis = 1)
toronto_schools_as_venues.columns = ['Postcode','Neighborhood Latitude','Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
toronto_schools_as_venues.head()

Unnamed: 0,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
540,M5B,43.657162,-79.378937,Collège français élémentaire,43.657162,-79.378937,Elementary
541,M5B,43.657162,-79.378937,Collège français secondaire,43.657162,-79.378937,Secondary
542,M1E,43.763573,-79.188711,École élémentaire Académie Alexandre-Dumas,43.763573,-79.188711,Elementary
546,M6N,43.673185,-79.487262,École élémentaire Charles-Sauriol,43.673185,-79.487262,Elementary
551,M2L,43.75749,-79.374714,École élémentaire Étienne-Brûlé,43.75749,-79.374714,Elementary


Append this dataset to the *toronto_venues* dataset.

In [66]:
toronto_all_venues = pd.concat([toronto_venues, toronto_schools_as_venues], sort = False)
toronto_all_venues.shape

(2960, 7)

In [67]:
toronto_venues = toronto_all_venues
toronto_venues.head()

Unnamed: 0,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,M1E,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
5,M1E,43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa


In [68]:
toronto_venues.tail()

Unnamed: 0,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
4749,M3N,43.761631,-79.520999,Yorkwoods Public School,43.761631,-79.520999,Elementary
4750,M5B,43.657162,-79.378937,Youthdale - Covenant House,43.657162,-79.378937,Secondary
4751,M4J,43.685347,-79.338106,Youthdale Treatment Centre - CTA,43.685347,-79.338106,Secondary
4753,M6H,43.669005,-79.442259,Youthdale Treatment Ctr - Jesse Ketchum,43.669005,-79.442259,Elementary
4754,M2H,43.803762,-79.363452,Zion Heights Middle School,43.803762,-79.363452,Elementary


<a id="toc3"></a>
## 3. Analyze Each Neighborhood

We will use one-hot encoding to pivot category values into columns of the dataframe.  

There is one observation that we have to be careful about: one of the category values is *Neighborhood*. After one-hot encoding, this value will become a column name. We are already using the column *Neighborhood* to represent the neighborhood name. To avoid confusing these columns, we will rename the column that comes from one-hot encoding as *Neighborhood Category*.

In [69]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#rename the column 'Neighborhood' which represents a category name to 'Neighborhood Category' 
#this is to distinguish this column from the 'Neighborhood' column which we want to continue to use as the neighborhood name
toronto_onehot.rename(columns={'Neighborhood':'Neighborhood Category'}, inplace=True)

# add postcode column back to dataframe
toronto_onehot['Postcode'] = toronto_venues['Postcode'] 

# move postcode column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#toronto_onehot.head()


Check the new dataframe size:

In [70]:
toronto_onehot.shape

(2960, 63)

#### Group rows by postal code, taking the mean of the frequency of occurrence of each category

In [71]:
toronto_grouped = toronto_onehot.groupby('Postcode').mean().reset_index()
toronto_grouped.head(10)

Unnamed: 0,Postcode,American Restaurant,Art Gallery,Asian Restaurant,Bakery,Bank,Bar,Beer Bar,Bookstore,Breakfast Spot,...,Spa,Sporting Goods Shop,Steakhouse,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,...,0.020408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.071429,0.071429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0
5,M1J,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M1K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,M1L,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M1M,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M1N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Check the new dataframe size:

In [72]:
toronto_grouped.shape

(103, 63)
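
To see why the mean of one-hot columns gives a frequency, consider this minimal sketch on toy data. The rows are invented for illustration, and `dtype=int` is passed so that the dummy columns are numeric regardless of the pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1E', 'M1E', 'M1E', 'M1E', 'M1B'],
    'Venue Category': ['Elementary', 'Elementary', 'Secondary', 'Spa', 'Elementary'],
})

# One column per category, with 1 where the row belongs to that category
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Postcode', toy['Postcode'])

# The mean of a 0/1 column is the share of rows in that category
freq = onehot.groupby('Postcode').mean()
print(freq)
```

For M1E, two of four rows are *Elementary*, so its frequency is 0.5, and each row of the result sums to 1.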

#### Store the most common venue categories in a *pandas* dataframe

Write a function that returns the most common venue categories of a row, sorted in descending order of frequency.

In [73]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False).to_frame().reset_index()
    row_cutoff = row_categories_sorted.head(num_top_venues)
    row_cutoff.columns = ['Venue', 'Appears']
    return_array = row_cutoff['Venue'].where(row_cutoff['Appears'] != 0, other = np.nan, axis = 0)

    return return_array.values

Create the new dataframe and display the top 10 venue categories for each postal code.

In [74]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns = columns)
neighborhoods_venues_sorted['Postcode'] = toronto_grouped['Postcode']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Elementary,Secondary,Fast Food Restaurant,,,,,,,
1,M1C,Elementary,Secondary,Bar,,,,,,,
2,M1E,Elementary,Secondary,Pizza Place,Breakfast Spot,Mexican Restaurant,Spa,Electronics Store,,,
3,M1G,Elementary,Secondary,Coffee Shop,,,,,,,
4,M1H,Elementary,Secondary,Fried Chicken Joint,Thai Restaurant,Bakery,Bank,,,,


In [75]:
neighborhoods_venues_sorted.groupby(['1st Most Common Venue']).size()

1st Most Common Venue
Café               1
Clothing Store     1
Coffee Shop       10
Elementary        84
Hotel              1
Secondary          5
Yoga Studio        1
dtype: int64
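
The ranking step for a single postal code can be illustrated on a toy frequency row. This is a simplified sketch with invented values: unlike the function above, it drops categories that never occur instead of padding the result with missing values.

```python
import pandas as pd

# Toy frequency row for one postal code (illustrative values)
row = pd.Series({'Bar': 0.0, 'Elementary': 0.5, 'Secondary': 0.3, 'Spa': 0.2})

# Drop categories that never occur, then rank the rest by frequency
top = row[row > 0].sort_values(ascending=False).head(3).index.tolist()
print(top)
```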

<a id="toc4"></a>
## 4. Cluster Neighborhoods


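We will run *k*-means to group the postal codes into 5 clusters. Before clustering, we add *Percent Occupied* from the population data as an additional feature.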

In [76]:
# join population data on the postal code
toronto_grouped = toronto_grouped.join(toronto_population.set_index('Geographic code'), on = 'Postcode')

# replace null values with 0 (assign back instead of using inplace on a column)
toronto_grouped['Percent Occupied'] = toronto_grouped['Percent Occupied'].fillna(0)
toronto_grouped.head()

Unnamed: 0,Postcode,American Restaurant,Art Gallery,Asian Restaurant,Bakery,Bank,Bar,Beer Bar,Bookstore,Breakfast Spot,...,Sporting Goods Shop,Steakhouse,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio,Percent Occupied
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.306014
1,M1C,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.316454
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.365571
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.328966
4,M1H,0.0,0.0,0.0,0.071429,0.071429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.368494


In [77]:
# set number of clusters
kclusters = 5

# k-means needs numeric features only, so drop the postal code column
toronto_grouped_clustering = toronto_grouped.drop(columns='Postcode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 3, 0, 3, 0, 3, 0, 0, 0, 3, 3, 3, 0, 0, 3, 0, 0, 1, 0, 3, 3,
       1, 3, 3, 0, 3, 1, 1, 0, 1, 1, 0, 0, 0, 1, 3, 0, 1, 1, 3, 1, 1, 1,
       3, 1, 1, 4, 0, 1, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 1, 3, 0, 4,
       4, 4, 3, 2, 2, 1, 1, 0, 3, 3, 1, 1, 1, 0, 3, 0, 1, 1, 1, 2, 2, 2,
       1, 1, 0, 0, 1, 3, 3, 0, 0, 3, 0, 1, 3, 3, 0], dtype=int32)
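
The value `kclusters = 5` is set by hand above. One common way to sanity-check such a choice is to compare the *k*-means inertia (within-cluster sum of squares) for several values of *k*. The sketch below does this on synthetic data; the data and the candidate range are assumptions, not results from the Toronto dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# three well-separated synthetic blobs in 2-D (illustrative data only)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.randn(30, 2) for c in centers])

# fit k-means for each candidate k and record the inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

On data like this the inertia drops sharply up to the true number of blobs and flattens afterwards; the same loop could be pointed at `toronto_grouped_clustering` to see whether 5 is a reasonable choice.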

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [78]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = neighborhoods_venues_sorted

# join with toronto_neighborhoods to add borough, neighbourhood name and latitude/longitude for each postal code
toronto_merged = toronto_merged.join(toronto_neighborhoods.set_index('Postcode'), on='Postcode')

toronto_merged.head() # check the last columns!

Unnamed: 0,Cluster Labels,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Percent Occupied
0,0,M1B,Elementary,Secondary,Fast Food Restaurant,,,,,,,,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.306014
1,0,M1C,Elementary,Secondary,Bar,,,,,,,,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.316454
2,3,M1E,Elementary,Secondary,Pizza Place,Breakfast Spot,Mexican Restaurant,Spa,Electronics Store,,,,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.365571
3,0,M1G,Elementary,Secondary,Coffee Shop,,,,,,,,Scarborough,Woburn,43.770992,-79.216917,0.328966
4,3,M1H,Elementary,Secondary,Fried Chicken Joint,Thai Restaurant,Bakery,Bank,,,,,Scarborough,Cedarbrae,43.773136,-79.239476,0.368494
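
The `join(other.set_index('Postcode'), on='Postcode')` pattern used above is a left join keyed on the postal code. A minimal sketch with toy frames follows; the frame names and the postcode `M9Z` are made up for illustration.

```python
import pandas as pd

labels = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                       'Cluster Labels': [0, 0, 3]})
coords = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# left join: every row of `labels` is kept; postcodes missing from `coords` get NaN
merged = labels.join(coords.set_index('Postcode'), on='Postcode')
print(merged)
```

Rows without a match keep their cluster label but have empty coordinates, so it is worth checking for NaN in `Latitude` before plotting.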


Visualize the resulting clusters.

In [79]:
address = 'Toronto'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


In [80]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Note: because GitHub doesn't display Folium maps, a screenshot of the map is available <a href='img/Toronto.png'>here</a>.
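
The colour scheme above assigns one hex colour per cluster. As a rough sketch of the same idea, the helper below builds an evenly spaced palette with the standard library's `colorsys` module instead of Matplotlib; the function name `cluster_palette` and the saturation and brightness values are assumptions.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hex colours (hypothetical helper)."""
    palette = []
    for i in range(n_clusters):
        # spread the hues evenly around the colour wheel
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255)))
    return palette

palette = cluster_palette(5)
print(palette)
```

Indexing such a palette with the cluster label itself (`palette[cluster]`) gives each label its own colour.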

<a id="toc5"></a>
## 5. Examine Clusters


We will examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster.
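
One possible way to find discriminating categories is to compare each cluster's mean venue frequency with the overall mean. The sketch below does this on a toy table; the values and the name `lift` are illustrative assumptions, and the notebook itself inspects the clusters by eye instead.

```python
import pandas as pd

# toy frequency table: one row per postcode, one column per venue category
toy = pd.DataFrame({
    'Cluster Labels': [0, 0, 1, 1],
    'Park':        [0.6, 0.4, 0.1, 0.1],
    'Coffee Shop': [0.1, 0.1, 0.5, 0.7],
    'Bank':        [0.3, 0.5, 0.4, 0.2],
})

overall_mean = toy.drop(columns='Cluster Labels').mean()
cluster_mean = toy.groupby('Cluster Labels').mean()

# positive values mark categories that are over-represented in a cluster
lift = cluster_mean - overall_mean
top_category = lift.idxmax(axis=1)
print(top_category)
```

The category with the largest positive difference is a candidate name for the cluster.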

### Cluster 0: Residential

In [81]:
toronto_cluster0 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster0

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Percent Occupied
0,M1B,Elementary,Secondary,Fast Food Restaurant,,,,,,,,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.306014
1,M1C,Elementary,Secondary,Bar,,,,,,,,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.316454
3,M1G,Elementary,Secondary,Coffee Shop,,,,,,,,Scarborough,Woburn,43.770992,-79.216917,0.328966
5,M1J,Elementary,,,,,,,,,,Scarborough,Scarborough Village,43.744734,-79.239476,0.334451
7,M1L,Elementary,Bakery,Secondary,Park,Fast Food Restaurant,,,,,,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577,0.354266
8,M1M,Elementary,Secondary,American Restaurant,,,,,,,,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476,0.376337
9,M1N,Elementary,Secondary,Café,,,,,,,,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,0.410869
13,M1T,Elementary,Secondary,Pizza Place,Pharmacy,Italian Restaurant,Chinese Restaurant,Fast Food Restaurant,Fried Chicken Joint,Thai Restaurant,,Scarborough,"Clarks Corners, Sullivan, Tam O'Shanter",43.781638,-79.304302,0.374378
14,M1V,Elementary,Secondary,Park,,,,,,,,Scarborough,"Agincourt North, L'Amoreaux East, Milliken, St...",43.815252,-79.284577,0.294294
16,M1X,Elementary,,,,,,,,,,Scarborough,Upper Rouge,43.836125,-79.205636,0.241704


In [82]:
t0 = toronto_merged[toronto_merged['Cluster Labels'] == 0]
# pool the ten 'Most Common Venue' columns into one Series and count each category
venue_columns = [col for col in t0.columns if col.endswith('Most Common Venue')]
t0_all_venues = pd.concat([t0[col] for col in venue_columns])
t0_all_venues.value_counts()

Elementary              29
Secondary               19
Park                     8
Coffee Shop              5
Fast Food Restaurant     4
Pizza Place              4
Café                     3
Chinese Restaurant       3
Bakery                   3
Bar                      3
Pharmacy                 2
Italian Restaurant       1
Fried Chicken Joint      1
Pub                      1
Liquor Store             1
Bank                     1
Gym / Fitness Center     1
American Restaurant      1
Japanese Restaurant      1
Sushi Restaurant         1
Thai Restaurant          1
dtype: int64

Venue categories in this cluster are predominantly shops with some interspersed coffee shops and restaurants, as well as parks and sporting venues. It appears that this cluster represents residential areas.

### Cluster 1: Parks

In [83]:
toronto_cluster1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster1

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Percent Occupied
18,M2J,Elementary,Clothing Store,Fast Food Restaurant,Coffee Shop,Secondary,Restaurant,Electronics Store,Bakery,Tea Room,Asian Restaurant,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,0.375277
22,M2N,Elementary,Secondary,Restaurant,Pizza Place,Coffee Shop,Café,Japanese Restaurant,Sandwich Place,Middle Eastern Restaurant,Fast Food Restaurant,North York,Willowdale South,43.77012,-79.408493,0.444879
27,M3C,Elementary,Secondary,Gym,Asian Restaurant,Coffee Shop,Japanese Restaurant,Clothing Store,Chinese Restaurant,Café,Restaurant,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923,0.404848
28,M3H,Elementary,Secondary,Coffee Shop,Pharmacy,Fried Chicken Joint,Diner,Deli / Bodega,Pizza Place,Restaurant,Sandwich Place,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259,0.384372
30,M3K,Elementary,Elem/Sec,Secondary,Electronics Store,Park,,,,,,North York,"CFB Toronto, Downsview East",43.737473,-79.464763,0.368184
31,M3L,Elementary,Grocery Store,Bank,Hotel,Shopping Mall,,,,,,North York,Downsview West,43.739015,-79.506944,0.33589
35,M4B,Elementary,Pizza Place,Secondary,Fast Food Restaurant,Gastropub,Breakfast Spot,Gym / Fitness Center,Café,Pharmacy,Bank,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,0.407934
38,M4G,Elementary,Sporting Goods Shop,Coffee Shop,Grocery Store,Secondary,Burger Joint,Sushi Restaurant,Breakfast Spot,Electronics Store,Dessert Shop,East York,Leaside,43.70906,-79.363452,0.405274
39,M4H,Elementary,Indian Restaurant,Burger Joint,Grocery Store,Gym / Fitness Center,Coffee Shop,Liquor Store,Park,Pizza Place,Sandwich Place,East York,Thorncliffe Park,43.705369,-79.349372,0.325833
41,M4K,Elementary,Greek Restaurant,Secondary,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore,Indian Restaurant,Grocery Store,Furniture / Home Store,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0.460881


In [84]:
t1 = toronto_merged[toronto_merged['Cluster Labels'] == 1]
# pool the ten 'Most Common Venue' columns into one Series and count each category
venue_columns = [col for col in t1.columns if col.endswith('Most Common Venue')]
t1_all_venues = pd.concat([t1[col] for col in venue_columns])
t1_all_venues.value_counts()

Elementary                   28
Coffee Shop                  19
Secondary                    18
Pizza Place                  12
Sandwich Place               10
Café                         10
Restaurant                    8
Park                          7
Gym                           7
Italian Restaurant            6
Burger Joint                  6
Pharmacy                      6
Fast Food Restaurant          6
Grocery Store                 6
Gym / Fitness Center          5
Pub                           5
Breakfast Spot                5
Liquor Store                  5
Clothing Store                5
Bakery                        5
Sushi Restaurant              4
Furniture / Home Store        4
Diner                         4
Electronics Store             4
Middle Eastern Restaurant     3
Japanese Restaurant           3
Bank                          3
Bar                           3
Dessert Shop                  3
Asian Restaurant              3
Vietnamese Restaurant         3
Bookstor

All of the first or second most common venue types in this neighborhood are parks.

### Cluster 2: Downtown

In [85]:
toronto_cluster2 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster2

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Percent Occupied
60,M5K,Coffee Shop,Café,Hotel,American Restaurant,Restaurant,Deli / Bodega,Italian Restaurant,Gastropub,Seafood Restaurant,Steakhouse,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576,0.0
61,M5L,Coffee Shop,Café,Hotel,Restaurant,American Restaurant,Seafood Restaurant,Bakery,Deli / Bodega,Gym,Italian Restaurant,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,0.0
69,M5W,Coffee Shop,Restaurant,Café,Italian Restaurant,Pub,Beer Bar,Hotel,Seafood Restaurant,Cocktail Bar,Japanese Restaurant,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,43.646435,-79.374846,0.0
70,M5X,Café,Coffee Shop,Hotel,Restaurant,American Restaurant,Seafood Restaurant,Asian Restaurant,Bakery,Bar,Deli / Bodega,Downtown Toronto,"First Canadian Place, Underground city",43.648429,-79.38228,0.0
85,M7A,Coffee Shop,Diner,Gym,Sushi Restaurant,Japanese Restaurant,Yoga Studio,Bubble Tea Shop,Fast Food Restaurant,Chinese Restaurant,Italian Restaurant,Queen's Park,Queen's Park,43.662301,-79.389494,0.0
86,M7R,Hotel,Coffee Shop,American Restaurant,Fried Chicken Joint,Burrito Place,Sandwich Place,Middle Eastern Restaurant,Gym / Fitness Center,,,Mississauga,Canada Post Gateway Processing Centre,43.636966,-79.615819,0.0
87,M7Y,Yoga Studio,Brewery,Farmers Market,Park,Restaurant,Burrito Place,Pizza Place,Fast Food Restaurant,,,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,0.0


In [86]:
t2 = toronto_merged[toronto_merged['Cluster Labels'] == 2]
# pool the ten 'Most Common Venue' columns into one Series and count each category
venue_columns = [col for col in t2.columns if col.endswith('Most Common Venue')]
t2_all_venues = pd.concat([t2[col] for col in venue_columns])
t2_all_venues.value_counts()

Coffee Shop                  6
Restaurant                   5
Hotel                        5
Seafood Restaurant           4
American Restaurant          4
Italian Restaurant           4
Café                         4
Deli / Bodega                3
Yoga Studio                  2
Japanese Restaurant          2
Gym                          2
Bakery                       2
Fast Food Restaurant         2
Burrito Place                2
Farmers Market               1
Sushi Restaurant             1
Pizza Place                  1
Pub                          1
Cocktail Bar                 1
Steakhouse                   1
Asian Restaurant             1
Bubble Tea Shop              1
Middle Eastern Restaurant    1
Sandwich Place               1
Diner                        1
Chinese Restaurant           1
Bar                          1
Brewery                      1
Park                         1
Gastropub                    1
Beer Bar                     1
Fried Chicken Joint          1
Gym / Fi

Most of the neighborhoods in this cluster appear to be located near downtown. The venue categories are predominantly restaurants, coffee shops, and hotels, with some shops, gyms, and playgrounds. Mail processing centers also appear to be included in this segment.

### Cluster 3: Outliers

In [87]:
toronto_cluster3 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster3

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Percent Occupied
2,M1E,Elementary,Secondary,Pizza Place,Breakfast Spot,Mexican Restaurant,Spa,Electronics Store,,,,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.365571
4,M1H,Elementary,Secondary,Fried Chicken Joint,Thai Restaurant,Bakery,Bank,,,,,Scarborough,Cedarbrae,43.773136,-79.239476,0.368494
6,M1K,Elementary,Secondary,Coffee Shop,,,,,,,,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029,0.370194
10,M1P,Secondary,Elementary,Indian Restaurant,Chinese Restaurant,Vietnamese Restaurant,Furniture / Home Store,,,,,Scarborough,"Dorset Park, Scarborough Town Centre, Wexford ...",43.75741,-79.273304,0.36295
11,M1R,Elementary,Secondary,Middle Eastern Restaurant,Bakery,Breakfast Spot,,,,,,Scarborough,"Maryvale, Wexford",43.750072,-79.295849,0.360607
12,M1S,Elementary,Secondary,Clothing Store,Breakfast Spot,,,,,,,Scarborough,Agincourt,43.7942,-79.262029,0.334772
15,M1W,Elementary,Secondary,Chinese Restaurant,Fast Food Restaurant,Coffee Shop,Breakfast Spot,Pizza Place,Japanese Restaurant,Sandwich Place,Pharmacy,Scarborough,"L'Amoreaux West, Steeles West",43.799525,-79.318389,0.333663
20,M2L,Elementary,Secondary,,,,,,,,,North York,"Silver Hills, York Mills",43.75749,-79.374714,0.339165
21,M2M,Elementary,Secondary,Park,,,,,,,,North York,"Newtonbrook, Willowdale",43.789053,-79.408493,0.383632
23,M2P,Elementary,Secondary,Park,Bank,,,,,,,North York,York Mills West,43.752758,-79.400049,0.385057


In [88]:
t3 = toronto_merged[toronto_merged['Cluster Labels'] == 3]
# pool the ten 'Most Common Venue' columns into one Series and count each category
venue_columns = [col for col in t3.columns if col.endswith('Most Common Venue')]
t3_all_venues = pd.concat([t3[col] for col in venue_columns])
t3_all_venues.value_counts()

Elementary                   26
Secondary                    26
Park                          9
Coffee Shop                   6
Pharmacy                      5
Pizza Place                   4
Breakfast Spot                4
Fast Food Restaurant          3
Sandwich Place                3
Bank                          3
Grocery Store                 3
Restaurant                    2
Bakery                        2
Fried Chicken Joint           2
Japanese Restaurant           2
Café                          2
Chinese Restaurant            2
Middle Eastern Restaurant     1
Gym / Fitness Center          1
Thai Restaurant               1
Elem/Sec                      1
Mexican Restaurant            1
Cosmetics Shop                1
Clothing Store                1
Electronics Store             1
Spa                           1
Asian Restaurant              1
Italian Restaurant            1
Diner                         1
Furniture / Home Store        1
Vietnamese Restaurant         1
Liquor S

This cluster represents outliers. Both postal codes have just one venue, a bar in each. The remaining venues all appear 0 times. Obviously, they represent the same cluster as they are exactly the same based on the input to the clustering algorithm. To improve the segmentation, we should remove all such postal codes where the number of different venues is less than 10.
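
The filtering suggested above could be sketched as follows. The helper name `drop_sparse_postcodes`, its parameters, and the toy table are assumptions; on the real `toronto_grouped` table, non-venue columns such as `Percent Occupied` would also need to be excluded before counting.

```python
import pandas as pd

def drop_sparse_postcodes(grouped, min_categories, id_column='Postcode'):
    """Keep rows with at least `min_categories` non-zero venue categories."""
    category_counts = (grouped.drop(columns=id_column) > 0).sum(axis=1)
    return grouped[category_counts >= min_categories].reset_index(drop=True)

# toy frequency table (illustrative values only)
toy_grouped = pd.DataFrame({
    'Postcode': ['A', 'B', 'C'],
    'Bar':      [1.0, 0.5, 0.0],
    'Park':     [0.0, 0.3, 0.5],
    'Bakery':   [0.0, 0.2, 0.5],
})

# a threshold of 2 suits the toy data; the text suggests 10 for the real table
filtered = drop_sparse_postcodes(toy_grouped, min_categories=2)
print(filtered['Postcode'].tolist())
```

Postcode `A`, which has a single venue category, is removed before clustering.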

### Cluster 4: Quick eats

In [89]:
toronto_cluster4 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]))]]
toronto_cluster4

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Percent Occupied
47,M4S,Elementary,Dessert Shop,Sandwich Place,Pizza Place,Sushi Restaurant,Italian Restaurant,Café,Coffee Shop,Farmers Market,Indian Restaurant,Central Toronto,Davisville,43.704324,-79.38879,0.505433
51,M4X,Coffee Shop,Elementary,Restaurant,Bakery,Park,Italian Restaurant,Pizza Place,Café,Pub,Gastropub,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,0.494957
52,M4Y,Secondary,Japanese Restaurant,Coffee Shop,Sushi Restaurant,Restaurant,Burger Joint,Gastropub,Pub,Elementary,Café,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,0.645347
53,M5A,Elementary,Coffee Shop,Park,Bakery,Café,Pub,Restaurant,Theater,Secondary,Breakfast Spot,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,0.543673
54,M5B,Clothing Store,Coffee Shop,Café,Secondary,Cosmetics Shop,Middle Eastern Restaurant,Elementary,Diner,Bar,Tea Room,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,0.552053
55,M5C,Coffee Shop,Restaurant,Café,Hotel,Clothing Store,Gastropub,Park,Bakery,Breakfast Spot,Cocktail Bar,Downtown Toronto,St. James Town,43.651494,-79.375418,0.583192
56,M5E,Coffee Shop,Cocktail Bar,Restaurant,Farmers Market,Café,Elementary,Italian Restaurant,Bakery,Steakhouse,Pub,Downtown Toronto,Berczy Park,43.644771,-79.373306,0.623163
57,M5G,Coffee Shop,Secondary,Café,Italian Restaurant,Elementary,Bar,Burger Joint,Chinese Restaurant,Ice Cream Shop,Sandwich Place,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0.585183
58,M5H,Coffee Shop,Café,American Restaurant,Thai Restaurant,Steakhouse,Gym,Restaurant,Clothing Store,Hotel,Bar,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,0.61995
59,M5J,Coffee Shop,Hotel,Pizza Place,Café,Bakery,Restaurant,Brewery,Italian Restaurant,Fried Chicken Joint,Bar,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752,0.594637


In [90]:
t4 = toronto_merged[toronto_merged['Cluster Labels'] == 4]
# pool the ten 'Most Common Venue' columns into one Series and count each category
venue_columns = [col for col in t4.columns if col.endswith('Most Common Venue')]
t4_all_venues = pd.concat([t4[col] for col in venue_columns])
t4_all_venues.value_counts()

Café                             13
Coffee Shop                      13
Elementary                       10
Restaurant                        8
Bakery                            7
Bar                               6
Secondary                         6
Italian Restaurant                5
Pub                               5
Park                              4
Pizza Place                       4
Hotel                             3
Sandwich Place                    3
Clothing Store                    3
Chinese Restaurant                3
Gastropub                         3
Japanese Restaurant               2
Indian Restaurant                 2
Farmers Market                    2
Sushi Restaurant                  2
Cosmetics Shop                    2
Cocktail Bar                      2
Breakfast Spot                    2
Burger Joint                      2
Steakhouse                        2
Bookstore                         1
Liquor Store                      1
Mexican Restaurant          

Venue categories in this cluster appear to be predominantly fast food restaurants, coffee shops, sandwich places, pizza places, and grocery stores, all of which suggest places where one can find something quick to eat.

## Discussion

Of the clusters above, the following appear to be best suited for families with children:

* **Cluster 0: Residential.** Venue categories in this cluster are predominantly shops with some interspersed coffee shops and restaurants, as well as parks and sporting venues, all of which are suitable for families.
* **Cluster 4: Quick eats.** Venue categories in this cluster appear to be predominantly fast food restaurants, coffee shops, sandwich places, pizza places, and grocery stores, all of which suggest places where one can find something quick to eat. These types of places are typical of shopping malls, which in turn suggest residential areas.

Let's do some analysis to verify these observations.

In [91]:
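# average the numeric columns per cluster; coordinates are not meaningful as averages here
toronto_merged.groupby('Cluster Labels').mean(numeric_only=True).reset_index().drop(columns=['Latitude', 'Longitude'])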

Unnamed: 0,Cluster Labels,Percent Occupied
0,0,0.377432
1,1,0.425731
2,2,0.0
3,3,0.387998
4,4,0.559054


We can see that the value of *Percent Occupied* is largest in cluster 4 (about 0.56). This makes sense, because we would expect a higher percentage of occupied dwellings in residential areas.

We can also see that the largest percentage of both elementary and secondary schools is in cluster 4, which indicates that these are neighborhoods where families with school-age children reside.
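
The cluster with the highest mean can also be read off programmatically rather than by eye. A minimal sketch, using the five *Percent Occupied* means from the table above as input (the variable names are illustrative):

```python
import pandas as pd

# per-cluster means copied from the output table above
percent_occupied = pd.Series(
    [0.377432, 0.425731, 0.0, 0.387998, 0.559054],
    index=pd.Index([0, 1, 2, 3, 4], name='Cluster Labels'),
    name='Percent Occupied',
)

# rank the clusters from highest to lowest mean
ranked = percent_occupied.sort_values(ascending=False)
print(ranked)
print('Highest:', percent_occupied.idxmax())
```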

In [92]:
#neighborhoods that have been flagged as best to live in
toronto_flagged = pd.read_excel('https://github.com/mferle/Coursera_Capstone/blob/master/data/Top_blogs.xlsx?raw=true')
toronto_flagged.head()

Unnamed: 0,Postal Code,TopFamilyFlag
0,M5N,1
1,M2K,1
2,M5M,1
3,M6S,1
4,M6H,1


In [93]:
tf = toronto_flagged.join(toronto_merged.set_index('Postcode'), on = 'Postal Code')
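# keep only flagged postal codes that were assigned a cluster; copy so the column assignment below is safe
tf = tf[~tf['Cluster Labels'].isna()].copy()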
tf['Cluster Labels'] = tf['Cluster Labels'].astype(int)
tf.head()

Unnamed: 0,Postal Code,TopFamilyFlag,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude,Percent Occupied
0,M5N,1,3,Elementary,Secondary,,,,,,,,,Central Toronto,Roselawn,43.711695,-79.416936,0.390488
1,M2K,1,0,Elementary,Chinese Restaurant,Bank,Japanese Restaurant,Café,,,,,,North York,Bayview Village,43.786947,-79.385975,0.441892
2,M5M,1,1,Elementary,Italian Restaurant,Thai Restaurant,Fast Food Restaurant,Coffee Shop,Restaurant,Sandwich Place,Café,Liquor Store,Pharmacy,North York,"Bedford Park, Lawrence Manor East",43.733283,-79.41975,0.362194
3,M6S,1,1,Elementary,Coffee Shop,Café,Pizza Place,Sushi Restaurant,Italian Restaurant,Secondary,Bookstore,Pub,Electronics Store,West Toronto,"Runnymede, Swansea",43.651571,-79.48445,0.427272
4,M6H,1,1,Elementary,Secondary,Pharmacy,Elem/Sec,Bakery,Gym,Gym / Fitness Center,Liquor Store,Middle Eastern Restaurant,Park,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259,0.427341


In [94]:
# add the flagged neighborhoods to the existing cluster map (map_clusters) instead of creating a new map
#map_best = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neigh, flg in zip(tf['Latitude'], tf['Longitude'], tf['Neighbourhood'], tf['TopFamilyFlag']):
    label = folium.Popup(neigh, parse_html=True)
    folium.Marker(
        [lat, lon],
        popup=label).add_to(map_clusters)
       
map_clusters

In [95]:
tfg = tf[['Postal Code', 'Cluster Labels', 'Neighbourhood']]
tfg

Unnamed: 0,Postal Code,Cluster Labels,Neighbourhood
0,M5N,3,Roselawn
1,M2K,0,Bayview Village
2,M5M,1,"Bedford Park, Lawrence Manor East"
3,M6S,1,"Runnymede, Swansea"
4,M6H,1,"Dovercourt Village, Dufferin"
5,M4C,3,Woodbine Heights
6,M4S,4,Davisville
7,M4V,1,"Deer Park, Forest Hill SE, Rathnelly, South Hi..."
8,M3B,3,Don Mills North
9,M9C,0,"Bloordale Gardens, Eringate, Markland Wood, Ol..."


In [96]:
tfg.groupby('Cluster Labels').count()

Unnamed: 0_level_0,Postal Code,Neighbourhood
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1
0,6,6
1,8,8
3,5,5
4,1,1
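
The counts can also be expressed as shares of all flagged neighborhoods. The sketch below uses the four counts from the output above; the variable names are illustrative.

```python
import pandas as pd

# flagged neighbourhoods per cluster, copied from the output above
flagged_counts = pd.Series({0: 6, 1: 8, 3: 5, 4: 1}, name='Flagged neighbourhoods')
flagged_counts.index.name = 'Cluster Labels'

# share of all flagged neighbourhoods that falls into each cluster
shares = flagged_counts / flagged_counts.sum()
print(shares.round(2))
```

Cluster 2 does not appear because none of the flagged postal codes were assigned to it.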
