## Coursera Capstone: Segmenting and Clustering Neighborhoods in Toronto
This project explores, segments, and clusters neighborhoods in the city of Toronto.

## Introduction

We will start by scraping the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a> to get a list of neighborhoods in Toronto. Then we will find the latitude and longitude of each neighborhood. Next, we will use the Foursquare API to explore the neighborhoods. We will get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. We will use the *k*-means clustering algorithm to complete this task. Finally, we will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

1. <a href="#toc1">Acquire Data</a>

2. <a href="#toc2">Explore Neighborhoods in Toronto</a>

3. <a href="#toc3">Analyze Each Neighborhood</a>

4. <a href="#toc4">Cluster Neighborhoods</a>

5. <a href="#toc5">Examine Clusters</a>    


First, import all the dependencies that we will need for this analysis.

In [168]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id="toc1"></a>
## 1. Acquire Data

We will scrape the list of Toronto neighborhoods from the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a>. Postal codes beginning with M are located within the city of Toronto. Then we will find the geographical coordinates of each postal code.

#### Read data from web page

We will read the page and convert the table into a dataframe using the *pandas* read_html method.

In [169]:
#get Toronto neighborhoods from Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#read the Wikipedia page - returns list of dataframes
dfs = pd.read_html(url, header=0)
#take the first dataframe from the returned list (it should be the only dataframe in the list)
df = dfs[0]
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


We will ignore any records where *Borough* is "Not assigned".

In [170]:
#create new dataframe with records where Borough is not 'Not assigned'
df_assigned = df[df['Borough'] != 'Not assigned']
df_assigned.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In rows where the value of *Neighbourhood* is 'Not assigned' we will replace it with the value of the *Borough*.

In [171]:
#create a list of neighborhoods, replacing the borough where neighborhood is 'Not assigned'
new_neigh = df_assigned['Neighbourhood'].where(df_assigned['Neighbourhood'] != 'Not assigned', other = df_assigned['Borough'], axis = 0)
#construct new dataframe using postcode and borough from the previous dataframe and neighborhood from the above list
df_replaced = pd.concat([df_assigned['Postcode'], df_assigned['Borough'], new_neigh], axis = 1)
df_replaced.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
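The replacement above relies on `Series.where`, which keeps a value where the condition holds and substitutes `other` elsewhere. A minimal sketch on a two-row toy frame (the rows echo the table above; the name `toy` is only for illustration):

```python
import pandas as pd

# two illustrative rows: one with an assigned neighbourhood, one without
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# keep the neighbourhood where it is assigned, otherwise fall back to the borough
filled = toy['Neighbourhood'].where(toy['Neighbourhood'] != 'Not assigned',
                                    other=toy['Borough'])
print(filled.tolist())
```

The condition describes the values to keep, so the test is written as "is assigned" rather than "is not assigned".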


Whenever we have more than one row per postcode, we will concatenate all neighborhoods into a comma-separated list.

In [172]:
#group the dataframe by Postcode and Borough and concatenate all neighborhoods into comma separated list
toronto_neighborhoods = df_replaced.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(list).apply(lambda x: ', '.join(x)).to_frame()
toronto_neighborhoods.reset_index(inplace = True)
toronto_neighborhoods.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [173]:
#display the shape of the resulting dataframe
toronto_neighborhoods.shape

(103, 3)
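The grouping step can be tried in isolation. The sketch below uses three toy rows and passes `', '.join` directly to `agg`, which should give the same result as the `apply(list)` followed by `apply(lambda ...)` chain used above; it is a possible simplification, not the code that produced the output shown:

```python
import pandas as pd

# toy rows: one postcode with two neighbourhoods, one with a single neighbourhood
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# ', '.join collapses the neighbourhood names of each group into one string
combined = (toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
               .agg(', '.join)
               .reset_index())
print(combined)
```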

### Assign geographical coordinates

The next step is to assign the geographical coordinates for each postcode. Instructions on the Coursera submission page suggest using the Geocoder Python package. Unfortunately, I was not able to get usable results from this package. As suggested on the submission page, I decided to use the dataset <a href='http://cocl.us/Geospatial_data'>http://cocl.us/Geospatial_data</a> with predefined coordinates per postal code.  

There is an alternate way by which we could get coordinates and that would be to use the <a href='https://geopy.readthedocs.io/en/stable/'>GeoPy</a> library with the Nominatim geolocator service (instead of Google Maps). However, this library does not return coordinates based on postal codes but rather on neighborhood names. This would mean that we would have to restructure the above dataframe back to what it was before we concatenated neighborhoods into comma separated values. I decided not to go this way, although I suspect it would have been a viable approach to solving this exercise.
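For reference, a rough sketch of what the GeoPy alternative might look like. This approach was not used in this notebook. The helper names `make_nominatim_geocode` and `locate`, the `user_agent` string, and the city suffix are all assumptions made for illustration; the stub geocoder exists only so that the sketch runs without network access.

```python
from collections import namedtuple

def make_nominatim_geocode(user_agent='toronto_explorer'):
    """Return a geocode callable backed by Nominatim (geopy is imported lazily)."""
    from geopy.geocoders import Nominatim
    return Nominatim(user_agent=user_agent).geocode

def locate(neighbourhood, geocode, city='Toronto, Ontario'):
    """Look up one neighbourhood name; return (lat, lng), or None if nothing is found."""
    location = geocode('{}, {}'.format(neighbourhood, city))
    if location is None:
        return None
    return (location.latitude, location.longitude)

# hypothetical offline stand-in for a real geocoder, with made-up coordinates
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def fake_geocode(query):
    return FakeLocation(1.0, 2.0) if query.startswith('Woburn') else None

print(locate('Woburn', fake_geocode))
print(locate('Nowhere', fake_geocode))
```

With network access, `locate('Woburn', make_nominatim_geocode())` would query the live service. The public Nominatim service asks clients to keep to roughly one request per second, so a loop over all neighborhoods would need a delay between calls.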

In [174]:
#read provided dataset
coords = pd.read_csv('http://cocl.us/Geospatial_data')
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We will merge this dataset with our *toronto_neighborhoods* dataset from above.

In [175]:
toronto_neighborhoods = pd.merge(toronto_neighborhoods, coords, left_on = 'Postcode', right_on = 'Postal Code')[['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']]
toronto_neighborhoods.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [176]:
toronto_neighborhoods.shape

(103, 5)
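The row count staying at 103 suggests that the default inner merge dropped no postcode. A more explicit check is possible with a left merge and `indicator=True`, sketched below on toy data (the postcode `M9Z` is a made-up value without coordinates):

```python
import pandas as pd

left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z']})  # 'M9Z' is hypothetical
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# a left merge keeps every postcode; the _merge column flags rows without a match
checked = pd.merge(left, right, how='left', left_on='Postcode',
                   right_on='Postal Code', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```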

### Add additional data

#### Number of Elementary and Secondary schools per postal code

We will enrich the previous dataset with information about how many Elementary and Secondary schools there are in each neighborhood by postal code. A list of public schools can be found on the <a href='https://www.ontario.ca/data/ontario-public-school-contact-information'>Ontario public school contact information</a> web site and this list can be transformed into a table with school counts.

In [177]:
#read the Ontario public schools dataset
df = pd.read_excel('https://files.ontario.ca/opendata/publicly_funded_schools_xlsx_january_2019_en.xlsx')

We will remove the records where the postal code is null and keep only postal codes that begin with M, which are the Toronto postal codes. Additionally, we keep only schools of type 'Public', as these are the only ones that we are interested in.

In [178]:
#keep only rows where the postal code is not null
df = df[df['Postal Code'].notna()]
#keep only rows where the postal code begins with M - these are Toronto postal codes
df = df[df['Postal Code'].str.startswith('M')]
#keep only first 3 characters of postal code
df['Postal Code'] = df['Postal Code'].str[:3]
#keep only public schools
df = df[df['School Type'] == 'Public']

#keep selected columns and store in dataframe called toronto_schools
toronto_schools = df[['School Level', 'School Name', 'Postal Code']]
toronto_schools.head()

Unnamed: 0,School Level,School Name,Postal Code
540,Elementary,Collège français élémentaire,M5B
541,Secondary,Collège français secondaire,M5B
542,Elementary,École élémentaire Académie Alexandre-Dumas,M1E
546,Elementary,École élémentaire Charles-Sauriol,M6N
551,Elementary,École élémentaire Étienne-Brûlé,M2L


Next, we want to count how many elementary and how many secondary schools we have in each postal code.

In [179]:
schools_count = toronto_schools.groupby(['Postal Code', 'School Level']).count().reset_index()
schools_count.columns = ['Postal Code', 'School Level', 'Number of Schools']
schools_count.head()

Unnamed: 0,Postal Code,School Level,Number of Schools
0,M1B,Elementary,16
1,M1B,Secondary,1
2,M1C,Elementary,8
3,M1C,Secondary,2
4,M1E,Elementary,13


For the analysis we will pivot this dataframe to create columns for school level.

In [180]:
#perform pivot
schools_count_pivot = schools_count.pivot(index='Postal Code', columns='School Level', values='Number of Schools')
#reset index
schools_count_pivot.reset_index(inplace = True)
#select columns that we need
schools_count_pivot = schools_count_pivot[['Postal Code', 'Elementary', 'Secondary']]
schools_count_pivot.head()

School Level,Postal Code,Elementary,Secondary
0,M1B,16.0,1.0
1,M1C,8.0,2.0
2,M1E,13.0,9.0
3,M1G,10.0,2.0
4,M1H,4.0,1.0


For further analysis we will normalize the school counts. Let's find the maximum school count value and divide each count by this value.

In [181]:
#find the maximum school count value
max_school_cnt = max(schools_count_pivot['Elementary'].max(), schools_count_pivot['Secondary'].max())

#divide each count by the maximum value
elementary_normalized = schools_count_pivot['Elementary']/max_school_cnt
secondary_normalized = schools_count_pivot['Secondary']/max_school_cnt

#add these columns to the dataframe
schools_count_pivot.insert(schools_count_pivot.shape[1], 'Elementary Normalized', elementary_normalized)
schools_count_pivot.insert(schools_count_pivot.shape[1], 'Secondary Normalized', secondary_normalized)

#drop columns we don't need
schools_count_pivot = schools_count_pivot.drop(['Elementary', 'Secondary'], axis=1)
schools_count_pivot.head()

School Level,Postal Code,Elementary Normalized,Secondary Normalized
0,M1B,1.0,0.0625
1,M1C,0.5,0.125
2,M1E,0.8125,0.5625
3,M1G,0.625,0.125
4,M1H,0.25,0.0625
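As a worked example of the normalization, the sketch below takes the first three rows of the pivoted counts. It assumes the overall maximum is 16, which holds for these three rows and is consistent with M1B normalizing to 1.0 in the output above.

```python
import pandas as pd

# first three rows of the pivoted school counts
counts = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M1E'],
                       'Elementary': [16.0, 8.0, 13.0],
                       'Secondary': [1.0, 2.0, 9.0]})

# one shared maximum for both columns keeps elementary and secondary counts comparable
max_cnt = counts[['Elementary', 'Secondary']].max().max()
normalized = counts[['Elementary', 'Secondary']] / max_cnt
print(max_cnt)
print(normalized)
```

Dividing both columns by the same maximum preserves the relative size of elementary versus secondary counts; scaling each column by its own maximum would not.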


#### Population count per postal code

We will enrich the dataset by adding population counts for each neighborhood by postal code as published on the Statistics Canada <a href = 'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&S=22&O=A'>Population and Dwelling Count Highlight Tables, 2016 Census</a> page.  

Three values are provided in this table: 
* Population, 2016: represent the number of Canadians whose usual place of residence is in that area, regardless of where they happened to be on census day
* Total private dwellings, 2016: refers to total private dwellings and private dwellings occupied by usual residents in Canada
* Private dwellings occupied by usual residents, 2016: refers to usual residents, not including tourists

In [182]:
#read population data from csv file downloaded from Statistics Canada
df = pd.read_csv('https://raw.githubusercontent.com/mferle/Coursera_Capstone/master/data/T120120190215054507.csv')

#keep only rows where the postal code begins with M - these are Toronto postal codes
df = df[df['Geographic code'].str.startswith('M')]

#remove any rows where the population count is less than 100
df = df[df['Population, 2016'] > 99]

#keep only columns we need for analysis
df = df[['Geographic code', 'Population, 2016', 'Total private dwellings, 2016', 'Private dwellings occupied by usual residents, 2016']]
df.head()

Unnamed: 0,Geographic code,"Population, 2016","Total private dwellings, 2016","Private dwellings occupied by usual residents, 2016"
895,M1B,66108,20957,20230
896,M1C,35626,11588,11274
897,M1E,46943,17637,17161
898,M1G,29690,10116,9767
899,M1H,24383,9274,8985


For the analysis, we are interested in residential areas, so our column of interest is *Private dwellings occupied by usual residents, 2016*, which excludes tourists. We will divide this number by the total population count. Strictly speaking, the result is a ratio of resident-occupied dwellings to population rather than a percentage of residents, but we use it as a rough indicator and store it as *Percent Occupied*. In residential areas, where tourists make up little of the population, the value should be higher than elsewhere.

In [183]:
#calculate percentage and add to dataframe as new column
pct = df['Private dwellings occupied by usual residents, 2016'] / df['Population, 2016']
df.insert(df.shape[1], 'Percent Occupied', pct)
#drop unwanted columns
toronto_population = df.drop(['Population, 2016', 'Total private dwellings, 2016', 'Private dwellings occupied by usual residents, 2016'], axis=1)
toronto_population.head()


Unnamed: 0,Geographic code,Percent Occupied
895,M1B,0.306014
896,M1C,0.316454
897,M1E,0.365571
898,M1G,0.328966
899,M1H,0.368494


#### Join school counts and population data with Toronto neighborhoods dataframe

In [184]:
#join school counts
toronto_neighborhoods = toronto_neighborhoods.join(schools_count_pivot.set_index('Postal Code'), on = 'Postcode')

#replace null values with 0
toronto_neighborhoods['Elementary Normalized'] = toronto_neighborhoods['Elementary Normalized'].fillna(0)
toronto_neighborhoods['Secondary Normalized'] = toronto_neighborhoods['Secondary Normalized'].fillna(0)
toronto_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,0.0625
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.5,0.125
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.8125,0.5625
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.625,0.125
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.25,0.0625


In [185]:
#join population data
toronto_neighborhoods = toronto_neighborhoods.join(toronto_population.set_index('Geographic code'), on = 'Postcode')

#replace null values with 0
toronto_neighborhoods['Percent Occupied'] = toronto_neighborhoods['Percent Occupied'].fillna(0)
toronto_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Elementary Normalized,Secondary Normalized,Percent Occupied
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,0.0625,0.306014
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.5,0.125,0.316454
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.8125,0.5625,0.365571
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.625,0.125,0.328966
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.25,0.0625,0.368494


<a id="toc2"></a>
## 2. Explore Neighborhoods in Toronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [186]:
# The code was removed by Watson Studio for sharing.

#### Let's explore the first neighborhood in our dataframe

Find the name, latitude and longitude of the first neighborhood in the dataframe.

In [187]:
neighborhood_name = toronto_neighborhoods.loc[0, 'Neighbourhood'] # neighborhood name
neighborhood_latitude = toronto_neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge, Malvern are 43.806686299999996, -79.19435340000001.


#### Now, let's get the top 100 venues that are in the above neighborhood within a radius of 500 meters

First we will create the GET request URL.

In [188]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
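An alternative way to assemble the same request URL is to let the standard library encode the query string. The sketch below is an assumption-laden illustration: `build_explore_url` is a hypothetical helper, and the credentials and version date are placeholders, not the values used in this notebook.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Assemble the venues/explore URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude and longitude as one value
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# placeholder credentials and version date, for illustration only
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605',
                                43.806686, -79.194353)
print(example_url)
```

`urlencode` percent-encodes the comma in `ll` as `%2C`, which a server decodes back to a comma. Keeping the parameters in a dictionary also makes it harder to swap two positional arguments by accident.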


Send the GET request and examine the results.

In [189]:
results = requests.get(url).json()
#results  # uncomment to inspect the raw JSON response (a dict, so it has no head() method)

We see that all the information that we want is in the *items* key. Before we proceed, we will create the **get_category_type** function which extracts the category name from a JSON object.

In [190]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [191]:
#extract the items key from the results
venues = results['response']['groups'][0]['items']
#flatten JSON into a dataframe
nearby_venues = json_normalize(venues) 
#filter columns that we need for further analysis
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
#extract the category for each row using the previously defined function
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# optional: strip the 'venue.' and 'venue.location.' prefixes from the column names
#nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056


#### Create a function to repeat the same process as above to all the neighborhoods in Toronto

In [192]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
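The list comprehension above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. A more defensive variant of the row extraction might look like the sketch below; `venue_rows` is a hypothetical helper, and the second sample item is made up to exercise the empty case.

```python
def venue_rows(name, lat, lng, items):
    """Build one tuple per venue, using None when a venue has no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# sample items shaped like the 'items' list; the second one is invented
sample_items = [
    {'venue': {'name': "Wendy's",
               'location': {'lat': 43.807448, 'lng': -79.199056},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.8, 'lng': -79.2},
               'categories': []}},
]
rows = venue_rows('Rouge, Malvern', 43.806686, -79.194353, sample_items)
print(rows)
```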

The following code executes the above function on each neighborhood and creates a new dataframe called *toronto_venues*.

In [193]:
toronto_venues = getNearbyVenues(names = toronto_neighborhoods['Neighbourhood'],
                                   latitudes = toronto_neighborhoods['Latitude'],
                                   longitudes = toronto_neighborhoods['Longitude']
                                  )

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West, Steeles West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The D

Check the size and first few rows of the resulting dataframe.

In [194]:
print(toronto_venues.shape)
toronto_venues.head()

(2251, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


Count the number of appearances of each venue category.

In [195]:
toronto_venues.groupby('Venue Category').count()

Unnamed: 0_level_0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Accessories Store,2,2,2,2,2,2
Adult Boutique,1,1,1,1,1,1
Afghan Restaurant,1,1,1,1,1,1
Airport,2,2,2,2,2,2
Airport Food Court,1,1,1,1,1,1
Airport Gate,1,1,1,1,1,1
Airport Lounge,2,2,2,2,2,2
Airport Service,2,2,2,2,2,2
Airport Terminal,2,2,2,2,2,2
American Restaurant,34,34,34,34,34,34


We see that many venue categories appear only a few times. It doesn't make sense to include these in segmentation because they don't appear often enough to have an impact, but they contribute to noise in the dataset. Therefore we will exclude venue categories that appear fewer than 5 times from the dataset.  

But first, we have to check how many venue categories appear fewer than 5 times to ensure that we still have enough venue categories left for segmentation.

In [196]:
#store the results of the above counts into a dataframe
toronto_venues_count = toronto_venues.groupby('Venue Category').count()
print('There are {} venue categories that appear less than 5 times.'.format(toronto_venues_count[toronto_venues_count['Neighborhood'] < 5].shape[0]))

There are 173 venue categories that appear less than 5 times.


We have a total of 259 venue categories, so even after we remove 173 of them we should still have sufficient data for segmentation.  

Therefore we will exclude the venue categories that appear fewer than 5 times.

In [197]:
#create list of venue categories to exclude
categories_to_exclude = toronto_venues_count[toronto_venues_count['Neighborhood'] < 5].index.tolist()
#create filtered dataframe by excluding the venue categories in the above list
toronto_venues_filt = toronto_venues[~toronto_venues['Venue Category'].isin(categories_to_exclude)]
#check size of resulting dataframe
toronto_venues_filt.groupby('Venue Category').count().shape

(106, 6)
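The same filter can be expressed with `value_counts`, which avoids building the intermediate grouped dataframe. This is a sketch on toy data with a threshold of 2 instead of 5, so that the effect is visible on six rows:

```python
import pandas as pd

# toy venue list: 'Park' appears 3 times, 'Bar' twice, 'Zoo' once
toy = pd.DataFrame({'Venue Category': ['Park'] * 3 + ['Bar'] * 2 + ['Zoo']})

min_count = 2  # the notebook uses 5; lowered here for the toy data
counts = toy['Venue Category'].value_counts()
frequent = counts[counts >= min_count].index

# keep only venues whose category appears at least min_count times
kept = toy[toy['Venue Category'].isin(frequent)]
print(sorted(kept['Venue Category'].unique()))
```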

The number of venue categories is sufficient for further analysis. We will rename the filtered dataset *toronto_venues_filt* back to the original dataset *toronto_venues*.

In [198]:
toronto_venues = toronto_venues_filt

<a id="toc3"></a>
## 3. Analyze Each Neighborhood

We will do one hot encoding to pivot category values into columns of the dataframe.  

There is one observation that we have to be careful about: one of the category values is *Neighborhood*. After one hot encoding, this value will become a column name. We are already using the column *Neighborhood* to represent the neighborhood name. To avoid confusing these columns, we will rename the column that comes from one hot encoding as *Neighborhood Category*.

In [199]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#rename the column 'Neighborhood' which represents a category name to 'Neighborhood Category' 
#this is to distinguish this column from the 'Neighborhood' column which we want to continue to use as the neighborhood name
toronto_onehot.rename(columns={'Neighborhood':'Neighborhood Category'}, inplace=True)

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#toronto_onehot.head()


Check the new dataframe size:

In [200]:
toronto_onehot.shape

(1910, 107)

#### Group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [201]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head(10)

Unnamed: 0,Neighborhood,American Restaurant,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,...,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Adelaide, King, Richmond",0.046512,0.0,0.011628,0.0,0.034884,0.0,0.0,0.0,0.034884,...,0.0,0.023256,0.0,0.046512,0.011628,0.0,0.011628,0.0,0.011628,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Bedford Park, Lawrence Manor East",0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0
8,Berczy Park,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.040816,...,0.0,0.0,0.020408,0.020408,0.0,0.0,0.0,0.0,0.0,0.0
9,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Check the new dataframe size:

In [202]:
toronto_grouped.shape

(97, 107)
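To see why the mean of one-hot columns yields a frequency, consider the toy example below (neighborhood names `A` and `B` and their venues are invented). Each one-hot column holds 1 for venues of that category and 0 otherwise, so its mean per neighborhood is the share of that neighborhood's venues in the category.

```python
import pandas as pd

# neighborhood A has four venues, neighborhood B has one
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Bar', 'Gym', 'Park'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# mean per neighborhood = fraction of its venues in each category
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Two of the four venues in `A` are parks, so its `Park` value is 0.5; every row of frequencies sums to 1.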

#### Store the above into a *pandas* dataframe

Write a function to sort the venue categories in descending order of frequency.

In [203]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
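A quick usage example on a single made-up row shows what the function returns. The function is repeated so that the snippet runs on its own; `toy_row` and its frequencies are invented.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighborhood name), then sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# one illustrative row: neighborhood name followed by category frequencies
toy_row = pd.Series({'Neighborhood': 'Toy Town',
                     'Bar': 0.1, 'Coffee Shop': 0.6, 'Park': 0.3})
top_two = list(return_most_common_venues(toy_row, 2))
print(top_two)
```

Categories with equal frequency, including those at 0.0, are not guaranteed to come back in any particular order. For neighborhoods with only a few venues, the lower-ranked "most common" entries may therefore be arbitrary categories that do not occur there at all.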

Create the new dataframe and display the top 10 venues for each neighborhood

In [204]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns = columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,American Restaurant,Thai Restaurant,Steakhouse,Asian Restaurant,Bakery,Bar,Hotel,Gym
1,Agincourt,Skating Rink,Breakfast Spot,Lounge,Clothing Store,Yoga Studio,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Diner,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Fast Food Restaurant,Coffee Shop,Pharmacy,Pizza Place,Sandwich Place,Beer Store,Liquor Store,Japanese Restaurant,Discount Store
4,"Alderwood, Long Branch",Pizza Place,Skating Rink,Gym,Pharmacy,Coffee Shop,Pub,Sandwich Place,Department Store,Cocktail Bar,Comfort Food Restaurant


In [205]:
neighborhoods_venues_sorted.groupby(['1st Most Common Venue']).size()

1st Most Common Venue
American Restaurant           1
Bakery                        1
Bank                          1
Bar                           2
Baseball Field                2
Beer Store                    1
Breakfast Spot                1
Bus Line                      1
Café                          7
Chinese Restaurant            3
Clothing Store                3
Coffee Shop                  20
Convenience Store             1
Discount Store                1
Fast Food Restaurant          4
Food Truck                    1
Fried Chicken Joint           1
Greek Restaurant              1
Grocery Store                 4
Gym                           1
Gym / Fitness Center          3
Indian Restaurant             2
Japanese Restaurant           1
Mexican Restaurant            2
Middle Eastern Restaurant     1
Park                         15
Pharmacy                      3
Pizza Place                   6
Playground                    1
Pub                           1
Ramen Restaurant  

<a id="toc4"></a>
## 4. Cluster Neighborhoods


Run *k*-means to cluster the neighborhoods into 5 clusters.

In [74]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 0, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 1, 4, 3, 2, 3, 3, 3, 3, 3, 3, 0, 3, 3, 0, 3, 3, 3, 3,
       3, 4, 3, 0, 4, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 0, 3,
       3, 3, 3, 3, 0], dtype=int32)
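The label array above is easier to read once it is tallied per cluster. The following is a minimal sketch that copies the printed labels into a plain list and counts them with the standard library; the variable names `labels` and `cluster_sizes` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# cluster labels copied from the kmeans.labels_ output above
labels = [
    3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 0, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 1, 4, 3, 2, 3, 3, 3, 3, 3, 3, 0, 3, 3, 0, 3, 3, 3, 3,
    3, 4, 3, 0, 4, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 0, 3,
    3, 3, 3, 3, 0,
]

# count how many neighborhoods were assigned to each cluster
cluster_sizes = Counter(labels)
for cluster in sorted(cluster_sizes):
    print(cluster, cluster_sizes[cluster])
```

Cluster 3 holds 79 of the 97 neighborhoods, so the remaining four clusters are small. Inside the notebook, `pd.Series(kmeans.labels_).value_counts()` should give the same tally.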

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [75]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = neighborhoods_venues_sorted

# join with toronto_neighborhoods to add latitude/longitude and the other attributes for each neighborhood
toronto_merged = toronto_merged.join(toronto_neighborhoods.set_index('Neighbourhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,...,10th Most Common Venue,Postcode,Borough,Latitude,Longitude,Elementary,Secondary,Population 2016,Private dwellings 2016,Occupied dwellings 2016
0,3,"Adelaide, King, Richmond",Coffee Shop,Café,American Restaurant,Thai Restaurant,Steakhouse,Asian Restaurant,Bakery,Bar,...,Gym,M5H,Downtown Toronto,43.650571,-79.384568,0,0,2005,1718,1243
1,3,Agincourt,Skating Rink,Breakfast Spot,Lounge,Clothing Store,Yoga Studio,Coffee Shop,Comfort Food Restaurant,Concert Hall,...,Cosmetics Shop,M1S,Scarborough,43.7942,-79.262029,10,4,37769,13195,12644
2,0,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Diner,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,...,Creperie,M1V,Scarborough,43.815252,-79.284577,11,1,54680,16449,16092
3,3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Fast Food Restaurant,Coffee Shop,Pharmacy,Pizza Place,Sandwich Place,Beer Store,Liquor Store,...,Discount Store,M9V,Etobicoke,43.739416,-79.588437,11,7,55959,17590,16808
4,3,"Alderwood, Long Branch",Pizza Place,Skating Rink,Gym,Pharmacy,Coffee Shop,Pub,Sandwich Place,Department Store,...,Comfort Food Restaurant,M8W,Etobicoke,43.602414,-79.543484,4,0,20674,9161,8770


Visualize the resulting clusters on a map of Toronto.

In [76]:
address = 'Toronto'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


In [77]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Note: because GitHub doesn't render Folium maps, a screenshot of the map is available <a href='img/Toronto.png'>here</a>.
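The marker loop above looks up colors with `rainbow[cluster-1]`. Because Python lists accept negative indices, cluster 0 takes the last color in the list rather than raising an error, and every cluster still gets its own color. The sketch below illustrates this with a hypothetical five-color palette standing in for `rainbow`; the hex values are made up for the example.

```python
# hypothetical palette standing in for the `rainbow` list built in the notebook
palette = ['#8000ff', '#00b4ec', '#80ffb4', '#ffb360', '#ff0000']

# same lookup as in the marker loop: index cluster - 1
color_for_cluster = {cluster: palette[cluster - 1] for cluster in range(len(palette))}

for cluster, color in color_for_cluster.items():
    print(cluster, color)
```

Cluster 0 maps to index -1, which is the last palette entry, while clusters 1 to 4 map to indices 0 to 3. Indexing with `rainbow[cluster]` would be the more direct choice if the colors should follow the cluster order.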

<a id="toc5"></a>
## 5. Examine Clusters


We will examine each cluster and determine the venue categories that distinguish it from the others. Based on these defining categories, we can then assign a name to each cluster.
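One simple way to look for defining categories is to count, per cluster, which category appears most often as the 1st most common venue. The sketch below does this with the standard library on (cluster, venue) pairs copied from the tables for clusters 0, 1, 2 and 4 in this section; cluster 3 is left out because its table is only partially shown. The names `rows` and `venues_by_cluster` are illustrative.

```python
from collections import Counter, defaultdict

# (cluster label, 1st most common venue) pairs copied from the cluster tables
rows = (
    [(0, 'Park')] * 10
    + [(1, 'Bank'), (1, 'Bar')]
    + [(2, 'Food Truck'), (2, 'Baseball Field'), (2, 'Baseball Field')]
    + [(4, 'Fast Food Restaurant'), (4, 'Park'), (4, 'Fast Food Restaurant')]
)

# tally the top venue categories separately for each cluster
venues_by_cluster = defaultdict(Counter)
for cluster, venue in rows:
    venues_by_cluster[cluster][venue] += 1

for cluster in sorted(venues_by_cluster):
    print(cluster, venues_by_cluster[cluster].most_common(1))
```

On the notebook's dataframe, `toronto_merged.groupby('Cluster Labels')['1st Most Common Venue'].value_counts()` should produce the same kind of summary for all five clusters.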

### Cluster 0: Residential

In [78]:
toronto_cluster0 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]-2))]]
toronto_cluster0

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Postcode,Borough,Latitude,Longitude,Elementary,Secondary,Population 2016
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Diner,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M1V,Scarborough,43.815252,-79.284577,11,1,54680
13,"CFB Toronto, Downsview East",Park,Yoga Studio,Clothing Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M3K,North York,43.737473,-79.464763,1,1,5997
15,Caledonia-Fairbanks,Park,Pharmacy,Fast Food Restaurant,Diner,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,M6E,York,43.689026,-79.453512,4,1,38041
61,Lawrence Park,Park,Bus Line,Yoga Studio,Diner,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M4N,Central Toronto,43.72802,-79.38879,3,1,15330
64,"Maple Leaf Park, North Park, Upwood Park",Park,Bakery,Yoga Studio,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M6L,North York,43.713756,-79.490074,4,0,20616
72,Rosedale,Park,Playground,Trail,Dessert Shop,Clothing Store,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,M4W,Downtown Toronto,43.679563,-79.377529,1,1,14561
77,"Silver Hills, York Mills",Park,Yoga Studio,Clothing Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M2L,North York,43.75749,-79.374714,4,2,11717
86,"The Kingsway, Montgomery Road, Old Mill North",Park,Yoga Studio,Clothing Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M8X,Etobicoke,43.653654,-79.506944,1,0,10787
90,Weston,Park,Yoga Studio,Clothing Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M9N,York,43.706876,-79.518188,4,1,25074
96,York Mills West,Park,Bank,Yoga Studio,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M2P,North York,43.752758,-79.400049,2,2,7843


Park is the most common venue in every neighborhood of this cluster, followed by playgrounds, yoga studios, and a scattering of shops, coffee shops, and restaurants. It appears that this cluster represents residential areas.

### Cluster 1: Ground transportation hub

In [79]:
toronto_cluster1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]-2))]]
toronto_cluster1

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Postcode,Borough,Latitude,Longitude,Elementary,Secondary,Population 2016
25,"Cloverdale, Islington, Martin Grove, Princess ...",Bank,Yoga Studio,Electronics Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M9B,Etobicoke,43.650943,-79.554724,5,4,32400
51,"Highland Creek, Rouge Hill, Port Union",Bar,Yoga Studio,Electronics Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M1C,Scarborough,43.784535,-79.160497,8,2,35626


The venues in this cluster suggest that it is a transportation hub, as they include bus and metro stations.

### Cluster 2: Downtown

In [80]:
toronto_cluster2 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]-2))]]
toronto_cluster2

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Postcode,Borough,Latitude,Longitude,Elementary,Secondary,Population 2016
35,Downsview Central,Food Truck,Baseball Field,Yoga Studio,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M3M,North York,43.728496,-79.495697,9,0,24046
40,"Emery, Humberlea",Baseball Field,Yoga Studio,Electronics Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M9M,North York,43.724766,-79.532242,2,3,22263
54,"Humber Bay, King's Mill Park, Kingsway Park So...",Baseball Field,Yoga Studio,Electronics Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M8Y,Etobicoke,43.636258,-79.498509,5,1,21299


Most of the neighborhoods in this cluster appear to be located near downtown. The venue categories are predominantly restaurants, coffee shops, and hotels, with some shops and yoga studios, which is representative of a *Downtown* city area.

### Cluster 3: Quick eats

In [81]:
toronto_cluster3 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]-2))]]
toronto_cluster3

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Postcode,Borough,Latitude,Longitude,Elementary,Secondary,Population 2016
0,"Adelaide, King, Richmond",Coffee Shop,Café,American Restaurant,Thai Restaurant,Steakhouse,Asian Restaurant,Bakery,Bar,Hotel,Gym,M5H,Downtown Toronto,43.650571,-79.384568,0,0,2005
1,Agincourt,Skating Rink,Breakfast Spot,Lounge,Clothing Store,Yoga Studio,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,M1S,Scarborough,43.794200,-79.262029,10,4,37769
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Fast Food Restaurant,Coffee Shop,Pharmacy,Pizza Place,Sandwich Place,Beer Store,Liquor Store,Japanese Restaurant,Discount Store,M9V,Etobicoke,43.739416,-79.588437,11,7,55959
4,"Alderwood, Long Branch",Pizza Place,Skating Rink,Gym,Pharmacy,Coffee Shop,Pub,Sandwich Place,Department Store,Cocktail Bar,Comfort Food Restaurant,M8W,Etobicoke,43.602414,-79.543484,4,0,20674
5,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Deli / Bodega,Diner,Sandwich Place,Fried Chicken Joint,Pharmacy,Pizza Place,Restaurant,Sushi Restaurant,Fast Food Restaurant,M3H,North York,43.754328,-79.442259,5,1,37011
6,Bayview Village,Chinese Restaurant,Café,Japanese Restaurant,Bank,Discount Store,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M2K,North York,43.786947,-79.385975,3,0,23852
7,"Bedford Park, Lawrence Manor East",Fast Food Restaurant,Italian Restaurant,Coffee Shop,American Restaurant,Sandwich Place,Greek Restaurant,Grocery Store,Indian Restaurant,Japanese Restaurant,Juice Bar,M5M,North York,43.733283,-79.419750,4,0,25975
8,Berczy Park,Coffee Shop,Restaurant,Cocktail Bar,Italian Restaurant,Farmers Market,Beer Bar,Seafood Restaurant,Cheese Shop,Steakhouse,Pub,M5E,Downtown Toronto,43.644771,-79.373306,1,0,9118
9,"Birch Cliff, Cliffside West",Café,Skating Rink,Yoga Studio,Diner,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M1N,Scarborough,43.692657,-79.264848,7,1,22136
10,"Bloordale Gardens, Eringate, Markland Wood, Ol...",Pharmacy,Liquor Store,Pizza Place,Café,Beer Store,Fried Chicken Joint,Department Store,Gift Shop,Coffee Shop,Comfort Food Restaurant,M9C,Etobicoke,43.643515,-79.577201,9,1,38291


The venues in this cluster are predominantly restaurants, fast food restaurants, coffee shops, and similar places to eat, which suggests an area for grabbing something quick to eat.

### Cluster 4: Airport

In [82]:
toronto_cluster4 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(2, toronto_merged.shape[1]-2))]]
toronto_cluster4

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Postcode,Borough,Latitude,Longitude,Elementary,Secondary,Population 2016
52,Hillcrest Village,Fast Food Restaurant,Mediterranean Restaurant,Yoga Studio,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M2H,North York,43.803762,-79.363452,10,1,24497
70,Parkwoods,Park,Fast Food Restaurant,Yoga Studio,Diner,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,M3A,North York,43.753259,-79.329656,10,2,34615
73,"Rouge, Malvern",Fast Food Restaurant,Yoga Studio,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Deli / Bodega,M1B,Scarborough,43.806686,-79.194353,16,1,66108


This cluster appears to represent an airport area.

## Discussion

In [83]:
# neighborhoods that have been flagged as the best to live in
df = pd.read_excel('https://github.com/mferle/Coursera_Capstone/blob/master/data/Top_blogs.xlsx?raw=true')
# a missing flag means "not listed": fill with 0 and store the flags as integers
flag_columns = ['Top16', 'Top10', 'Top8']
df[flag_columns] = df[flag_columns].fillna(0).astype(int)
toronto_flagged = df
toronto_flagged.head()

Unnamed: 0,Postal Code,Top16,Top10,Top8
0,M4E,0,1,1
1,M4L,1,0,0
2,M4P,1,1,1
3,M4S,1,1,1
4,M4X,1,1,0


In [85]:
tf = toronto_flagged.join(toronto_neighborhoods.set_index('Postcode'), on = 'Postal Code')
tf.head()

Unnamed: 0,Postal Code,Top16,Top10,Top8,Borough,Neighbourhood,Latitude,Longitude,Elementary,Secondary,Population 2016,Private dwellings 2016,Occupied dwellings 2016
0,M4E,0,1,1,East Toronto,The Beaches,43.676357,-79.293031,6,1,25044,11284,10784
1,M4L,1,0,0,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,6,0,32640,14634,13901
2,M4P,1,1,1,Central Toronto,Davisville North,43.712751,-79.390197,2,3,20039,12207,11487
3,M4S,1,1,1,Central Toronto,Davisville,43.704324,-79.38879,3,0,26506,14011,13397
4,M4X,1,1,0,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,2,0,20822,10809,10306
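The join above matches the `Postal Code` column of the flagged list against the `Postcode` index of the neighborhood table and keeps every flagged row, whether or not a match exists. The sketch below reproduces this on two rows copied from the outputs above plus one made-up postal code, `M0X`, which is an assumption added only to show what an unmatched row looks like.

```python
import pandas as pd

# two flagged postal codes from the output above; 'M0X' is a made-up code with no match
flagged = pd.DataFrame({'Postal Code': ['M4E', 'M4L', 'M0X'], 'Top16': [0, 1, 1]})
neighborhoods = pd.DataFrame({
    'Postcode': ['M4E', 'M4L'],
    'Neighbourhood': ['The Beaches', 'The Beaches West, India Bazaar'],
    'Latitude': [43.676357, 43.668999],
    'Longitude': [-79.293031, -79.315572],
})

# join() is a left join by default: the 'Postal Code' column is matched against the other frame's index
joined = flagged.join(neighborhoods.set_index('Postcode'), on='Postal Code')

# rows without a match keep NaN coordinates and cannot be placed on a map
unmatched = joined[joined['Latitude'].isna()]
print(unmatched['Postal Code'].tolist())
```

Checking for unmatched rows before plotting is a cheap safeguard, since a marker with missing coordinates cannot be drawn.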


In [94]:
# create map
map_best = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the markers
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map: one layer per list, drawn from the largest circles to the smallest
marker_styles = [('Top16', 9, rainbow[1]), ('Top10', 7, rainbow[2]), ('Top8', 5, rainbow[3])]
for column, radius, color in marker_styles:
    for lat, lon, poi, flag in zip(tf['Latitude'], tf['Longitude'], tf['Neighbourhood'], tf[column]):
        if flag == 1:
            label = folium.Popup(str(poi) + ' ' + column + ' ', parse_html=True)
            folium.CircleMarker(
                [lat, lon],
                radius=radius,
                popup=label,
                color=color,
                fill=True,
                fill_color=color,
                fill_opacity=0.7).add_to(map_best)

map_best