# Applied Data Science Capstone

## Introduction (wk 1)

*This notebook will serve as the Capstone Project for IBM's Data Science Certification.*

In [356]:
import pandas as pd
import numpy as np

In [357]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Neighborhood Segmentation & Clustering (wk 3)

## Segmenting and clustering neighborhoods in the city of Toronto, Canada

### Part A: Scrape Wikipedia page and build a dataframe with the postal code of each neighborhood

Import relevant libraries and modules:

In [576]:
import pandas as pd

I scrape the table with Toronto FSAs from [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and read into a dataframe, using pandas:

In [577]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
table = pd.read_html(url, header=0, keep_default_na=False) 
df = table[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


I rename column Postcode as PostalCode and Neighbourhood as Neighborhood:

In [578]:
df.columns = ['PostalCode','Borough','Neighborhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


I will process only rows with assigned boroughs, those not assigned one are dropped:

In [579]:
df= df.query('Borough != "Not assigned"').reset_index(drop=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


I combine neighborhoods with the same postal code, separated by a comma:

In [580]:
df = df.groupby('PostalCode', as_index=False).agg(lambda x: ', '.join(set(x.dropna())))
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Rouge Hill, Highland Creek, Port Union"
2,M1E,Scarborough,"West Hill, Guildwood, Morningside"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


I replace cell values not assigned a neighborhood with the name of the borough they belong to:

In [581]:
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood' ] = df['Borough']
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Rouge Hill, Highland Creek, Port Union"
2,M1E,Scarborough,"West Hill, Guildwood, Morningside"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea"
8,M1M,Scarborough,"Scarborough Village West, Cliffcrest, Cliffside"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [582]:
df.shape[0]

103

### Part B: Add the location coordinates of each neighborhood to the dataframe

Relevant libraries and modules:

In [364]:
import pandas as pd

As the Geocoder Python package did not run properly, latitude and longitude of each postal code are fetched using a csv file containing the coordinates:

In [583]:
ll = pd.read_csv("http://cocl.us/Geospatial_data/")
ll.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


I rename column Postal Code as PostalCode to match final dataframe in Part A:

In [584]:
ll.columns = ['PostalCode','Latitude','Longitude']
ll.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


I verify number of postal codes:

In [585]:
ll.shape[0]

103

Latitude and longitude coordinates of each postal code are added to final dataframe in Part A: 

In [586]:
df = pd.merge(left=df, right=ll, on="PostalCode", how="right")
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Highland Creek, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"West Hill, Guildwood, Morningside",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea",43.711112,-79.284577
8,M1M,Scarborough,"Scarborough Village West, Cliffcrest, Cliffside",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


### Part C: Explore and cluster neighborhoods in Toronto

Relevant libraries and modules:

In [369]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # geocoder

import requests # library to handle requests

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

import matplotlib.cm as cm # library for plotting
import matplotlib.colors as colors

from sklearn.cluster import KMeans # k-means for clustering stage

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.



Latitude and longitude of Toronto:

In [587]:
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinates of Toronto are 43.653963, -79.387207.


Map of Toronto with the neighborhoods superimposed

In [588]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='yellow', fill=True, fill_color='#f0e630', fill_opacity=0.7, parse_html=False).add_to(map_toronto)  

map_toronto

I choose to focus on Downtown Toronto and create a new dataframe of the area:

In [589]:
toronto_data = df[df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


I get location coordinates for Downtown Toronto to create a map of the area with the neighborhoods superimposed: 

In [590]:
address = 'Downtown Toronto, ON'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


In [591]:
# create map of Downtown Toronto using latitude and longitude values
map_downtown_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='yellow', fill=True, fill_color='#f0e630', fill_opacity=0.7, parse_html=False).add_to(map_downtown_toronto)  

map_downtown_toronto

Queen's Park seems to be out of place, so I check coordindates with those provided from the [Wikipedia page on Queen's Park](https://en.wikipedia.org/wiki/Queen%27s_Park_(Toronto)), according to which latitude is 43.6647 and longitude -79.3925:

In [592]:
toronto_data.loc[18]

PostalCode                   M9A
Borough         Downtown Toronto
Neighborhood        Queen's Park
Latitude                 43.6679
Longitude               -79.5322
Name: 18, dtype: object

I replace the above coordinates with the ones provided from the [Wikipedia page on Queen's Park](https://en.wikipedia.org/wiki/Queen%27s_Park_(Toronto)) and check changes:

In [593]:
toronto_data.at[18,'Latitude'] = 43.6647
toronto_data.at[18,'Longitude'] = -79.3925
toronto_data.loc[18]

PostalCode                   M9A
Borough         Downtown Toronto
Neighborhood        Queen's Park
Latitude                 43.6647
Longitude               -79.3925
Name: 18, dtype: object

I create an updated map with the new coordinates to check results: 

In [594]:
# create map of Downtown Toronto using latitude and longitude values
map_downtown_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='yellow', fill=True, fill_color='#f0e630', fill_opacity=0.7, parse_html=False).add_to(map_downtown_toronto)  

map_downtown_toronto

It looks fine now, so I proceed to utilise Foursquare to explore Downtown Toronto:

In [595]:
# Define credentials and version
CLIENT_ID = '2ZDVETLSRV3VQTBKHCPHTSQAIXVKQEPFVIQHMZW4T1XN5NIF' 
CLIENT_SECRET = 'FZPK0V2O43WYMC5BQYYNZRPWGWDTWRFSKBDC1VKYCJ1JJFNR' 
VERSION = '20180605' 

# define distance from current location and limit results
radius = 500
LIMIT = 100

# define URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=2ZDVETLSRV3VQTBKHCPHTSQAIXVKQEPFVIQHMZW4T1XN5NIF&client_secret=FZPK0V2O43WYMC5BQYYNZRPWGWDTWRFSKBDC1VKYCJ1JJFNR&ll=43.6563221,-79.3809161&v=20180605&radius=500&limit=100'

I create a function identifying top 100 venues of each neighborhood within a 500-meter radius:

In [596]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

I run above function in each neighborhood and create a new dataframe:

In [597]:
downtown_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude'])

Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, King, Adelaide
Toronto Islands, Union Station, Harbourfront East
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Kensington Market, Chinatown, Grange Park
Bathurst Quay, Harbourfront West, King and Spadina, CN Tower, South Niagara, Island airport, Railway Lands
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Queen's Park


Check resulting dataframe:

In [598]:
print(downtown_venues.shape)
downtown_venues.head()

(1320, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,"St. James Town, Cabbagetown",43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


In [599]:
print('There are {} unique categories.'.format(len(downtown_venues['Venue Category'].unique())))

There are 208 unique categories.


I create a new dataframe using one-hot encoding where categorical values of Venue Category are turned into numerical (dummy) variables, grouped by neighborhood:

In [600]:
# one hot encoding
downtown_onehot = pd.get_dummies(downtown_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
downtown_onehot['Neighborhood'] = downtown_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [downtown_onehot.columns[144]] + list(downtown_onehot.columns[:-144])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Butcher,Café,Camera Store,Candy Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Gym,College Rec Center,College Theater,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Dance Studio,Deli / Bodega
0,Rosedale,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Rosedale,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Rosedale,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Rosedale,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"St. James Town, Cabbagetown",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


I group rows by neighborhood and replace count with the mean frequency of occurance: 

In [601]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped.round(2)

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Butcher,Café,Camera Store,Candy Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Gym,College Rec Center,College Theater,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Dance Studio,Deli / Bodega
0,"Bathurst Quay, Harbourfront West, King and Spa...",0.0,0.07,0.07,0.07,0.13,0.13,0.13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.04,0.0,0.0,0.0,0.02,0.02,0.0,0.04,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.02,0.04,0.09,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.02,0.0,0.0
2,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.04,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.14,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.0,0.01,0.01,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0
5,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.03,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.02,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.03
6,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.03,0.0,0.03,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.03,0.01,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.03
7,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.07,0.0,0.09,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.04,0.0,0.0,0.0
8,"Harbord, University of Toronto",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.05,0.0,0.0,0.0,0.0,0.03,0.03,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.16,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.03,0.03,0.03,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Harbourfront,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.06,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.02,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0


In [602]:
downtown_grouped.shape

(19, 65)

The top 5 venues within each neighborhood are listed below:

In [603]:
num_top_venues = 5

for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Quay, Harbourfront West, King and Spadina, CN Tower, South Niagara, Island airport, Railway Lands----
              venue  freq
0  Airport Terminal  0.13
1    Airport Lounge  0.13
2   Airport Service  0.13
3          Boutique  0.07
4     Boat or Ferry  0.07


----Berczy Park----
          venue  freq
0   Coffee Shop  0.09
1        Bakery  0.04
2  Cocktail Bar  0.04
3      Beer Bar  0.04
4          Café  0.04


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.14
1        Burger Joint  0.04
2                Café  0.04
3     Bubble Tea Shop  0.02
4  Chinese Restaurant  0.02


----Christie----
                venue  freq
0                Café  0.18
1          Baby Store  0.06
2         Candy Store  0.06
3         Coffee Shop  0.06
4  Athletics & Sports  0.06


----Church and Wellesley----
                 venue  freq
0          Coffee Shop  0.06
1                 Café  0.02
2      Bubble Tea Shop  0.02
3    Afghan Restaurant  0.01
4  Arts & Craft

I sort venues in descending order and transform num_top_venues into a pandas dataframe:

In [604]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [605]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Bathurst Quay, Harbourfront West, King and Spa...",Airport Lounge,Airport Service,Airport Terminal,Boutique,Airport
1,Berczy Park,Coffee Shop,Beer Bar,Cocktail Bar,Bakery,Café
2,Central Bay Street,Coffee Shop,Café,Burger Joint,Chinese Restaurant,Bar
3,Christie,Café,Athletics & Sports,Candy Store,Coffee Shop,Baby Store
4,Church and Wellesley,Coffee Shop,Bubble Tea Shop,Café,Afghan Restaurant,Arts & Crafts Store


In [606]:
# set number of clusters
kclusters = 3

downtown_grouped_clustering = downtown_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(downtown_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 1, 1, 2, 1, 1, 1, 1, 2, 1], dtype=int32)

In [607]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

downtown_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
downtown_merged = downtown_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

downtown_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,1,Deli / Bodega,Dance Studio,Boat or Ferry,Bistro,Belgian Restaurant
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675,1,Coffee Shop,Bakery,Café,Deli / Bodega,Butcher
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,Coffee Shop,Bubble Tea Shop,Café,Afghan Restaurant,Arts & Crafts Store
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,1,Coffee Shop,Bakery,Café,Breakfast Spot,Chocolate Shop
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Bakery


Resulting clusters are presented in the map below:

In [608]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(downtown_merged['Latitude'], downtown_merged['Longitude'], downtown_merged['Neighborhood'], downtown_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

I check each cluster seperately to identify the defining categories and assign a name, accordingly:

In [609]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 0, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
14,Downtown Toronto,0,Airport Lounge,Airport Service,Airport Terminal,Boutique,Airport


In [610]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 1, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Downtown Toronto,1,Deli / Bodega,Dance Studio,Boat or Ferry,Bistro,Belgian Restaurant
1,Downtown Toronto,1,Coffee Shop,Bakery,Café,Deli / Bodega,Butcher
2,Downtown Toronto,1,Coffee Shop,Bubble Tea Shop,Café,Afghan Restaurant,Arts & Crafts Store
3,Downtown Toronto,1,Coffee Shop,Bakery,Café,Breakfast Spot,Chocolate Shop
4,Downtown Toronto,1,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Bakery
5,Downtown Toronto,1,Coffee Shop,Café,Clothing Store,Beer Bar,American Restaurant
6,Downtown Toronto,1,Coffee Shop,Beer Bar,Cocktail Bar,Bakery,Café
7,Downtown Toronto,1,Coffee Shop,Café,Burger Joint,Chinese Restaurant,Bar
8,Downtown Toronto,1,Coffee Shop,Café,Asian Restaurant,Bar,Burger Joint
9,Downtown Toronto,1,Coffee Shop,Aquarium,Café,Brewery,Baseball Stadium


In [611]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 2, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
12,Downtown Toronto,2,Café,Bookstore,Bar,Bakery,Beer Store
17,Downtown Toronto,2,Café,Athletics & Sports,Candy Store,Coffee Shop,Baby Store


Based on 1st most common venue, I change Cluster 1 name to Airport, Cluster 2 name to Coffee Shop, and Cluster 3 name to Café:

In [612]:
downtown_merged['Cluster Labels'].replace(0, 'Airport', inplace=True)
downtown_merged['Cluster Labels'].replace(1, 'Coffee Shop', inplace=True)
downtown_merged['Cluster Labels'].replace(2, 'Café', inplace=True)

The new dataframe is presented below:

In [613]:
downtown_merged.head(19)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,Coffee Shop,Deli / Bodega,Dance Studio,Boat or Ferry,Bistro,Belgian Restaurant
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675,Coffee Shop,Coffee Shop,Bakery,Café,Deli / Bodega,Butcher
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,Coffee Shop,Coffee Shop,Bubble Tea Shop,Café,Afghan Restaurant,Arts & Crafts Store
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,Coffee Shop,Coffee Shop,Bakery,Café,Breakfast Spot,Chocolate Shop
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,Coffee Shop,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Bakery
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,Coffee Shop,Coffee Shop,Café,Clothing Store,Beer Bar,American Restaurant
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,Coffee Shop,Coffee Shop,Beer Bar,Cocktail Bar,Bakery,Café
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,Coffee Shop,Coffee Shop,Café,Burger Joint,Chinese Restaurant,Bar
8,M5H,Downtown Toronto,"Richmond, King, Adelaide",43.650571,-79.384568,Coffee Shop,Coffee Shop,Café,Asian Restaurant,Bar,Burger Joint
9,M5J,Downtown Toronto,"Toronto Islands, Union Station, Harbourfront East",43.640816,-79.381752,Coffee Shop,Coffee Shop,Aquarium,Café,Brewery,Baseball Stadium
