# Segmenting and Clustering Toronto Neighborhoods

## Table of Contents

1. [Assignment Overview](#Assignment-Overview)
2. [Dependencies and Datasets](#Dependencies-and-Datasets)
3. [Data Exploration](#Data-Exploration)
4. [Data Preparation](#Data-Preparation)
5. [Fetch and Transform Foursquare Data](#Fetch-and-Transform-Foursquare-Data)
6. [Clustering](#Clustering)
    1. [Prepare data for clustering](#Prepara-data-for-clustering)
    2. [Cluster Neighborhoods](#Cluster-Neighborhoods)
    3. [Analyze clusters](#Analyze-clusters)
    4. [Visualize clusters](#Visualize-clusters)

## Assignment Overview

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make.
- to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your GitHub repository. (3 marks)

## Dependencies and Datasets

Import all required libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

Import the Toronto neighborhoods dataframe from the previous assignment

In [2]:
neighborhoods = pd.read_pickle('clustering-2.pkl')

## Data Exploration

Quick inspection of the loaded data

In [3]:
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [4]:
print('The shape of dataframe is {}'.format(neighborhoods.shape))

The shape of dataframe is (103, 5)


In [5]:
print('The dataframe has {} unique boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 11 unique boroughs and 103 neighborhoods.


In [6]:
print('The dataframe contains {} boroughs that contain the word Toronto'.format(
    len(neighborhoods[neighborhoods['Borough'].str.contains('Toronto')]['Borough'].unique())
))

The dataframe contains 4 boroughs that contain the word Toronto


## Data Preparation

**Filter neighborhoods: keep only boroughs whose name contains the word Toronto**

In [7]:
filtered_neighborhoods = neighborhoods[neighborhoods['Borough'].str.contains('Toronto')]

In [8]:
filtered_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


In [9]:
print('The shape of dataframe is {}'.format(filtered_neighborhoods.shape))

The shape of dataframe is (38, 5)
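The filter above relies on `Series.str.contains`, which by default treats its argument as a regular expression. The following is a minimal sketch of the same idea on a toy dataframe (the rows are made up for illustration and are not the real Toronto table). It passes `regex=False` for a literal substring match and calls `.copy()` so that later column assignments operate on an independent dataframe.

```python
import pandas as pd

# Toy data for illustration only; not the real Toronto table.
toy = pd.DataFrame({
    'Borough': ['Scarborough', 'East Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Woburn', 'The Beaches', 'Rosedale', 'Hillcrest Village'],
})

# regex=False treats 'Toronto' as a literal substring.
mask = toy['Borough'].str.contains('Toronto', regex=False)

# .copy() returns an independent dataframe instead of a slice of `toy`,
# which avoids SettingWithCopyWarning when columns are added later.
toy_filtered = toy[mask].copy()

print(toy_filtered['Borough'].tolist())
```

Note that the original row index is preserved by the filter, which is why the filtered table above starts at index 37 rather than 0.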


**Map of Toronto with superimposed boroughs and neighborhoods**

In [28]:
# Create a map
map_toronto = folium.Map(location=[43.653226, -79.3831843], zoom_start=12)

# set color scheme for the boroughs (one color per borough)
boroughs = filtered_neighborhoods['Borough'].unique()
colors_array = cm.rainbow(np.linspace(0, 1, len(boroughs)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
borough_rainbow = dict(zip(boroughs, rainbow))

# add markers to the map
for lat, lon, neighborhood, borough in zip(filtered_neighborhoods['Latitude'], filtered_neighborhoods['Longitude'], filtered_neighborhoods['Neighborhood'], filtered_neighborhoods['Borough']):
    label = folium.Popup(str(neighborhood) + ', ' + str(borough), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=borough_rainbow[borough],
        fill=True,
        fill_color=borough_rainbow[borough],
        fill_opacity=0.7).add_to(map_toronto)

In [30]:
map_toronto

## Fetch and Transform Foursquare Data

**Define Foursquare Credentials and Version**

In [56]:
CLIENT_ID = 'foobar' # your Foursquare ID
CLIENT_SECRET = 'foobar' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

**Create a helper function**

The function is a direct copy from the Capstone's week 3 lab. It goes through all given neighborhoods and fetches up to 100 top venues for each of them, then parses the location and category of each venue.

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            limit
        )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one dataframe, one row per venue
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues

**Get up to 100 top venues for each neighborhood**

Occasionally the script fails; it may have to be rerun a couple of times until it completes successfully.

In [13]:
venues = getNearbyVenues(names=filtered_neighborhoods['Neighborhood'], 
                         latitudes=filtered_neighborhoods['Latitude'],
                         longitudes=filtered_neighborhoods['Longitude']
                        )

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 
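Since the fetch occasionally fails and has to be rerun by hand, one option is to wrap the call in a small retry helper. The sketch below is an assumption-laden illustration, not part of the original notebook: `call_with_retries` is a hypothetical name, and the notebook does not say why the call fails (a network error or a response without the expected `groups` key are plausible but unconfirmed causes).

```python
import time

def call_with_retries(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)

# Hypothetical usage with the helper defined earlier in the notebook:
# venues = call_with_retries(
#     lambda: getNearbyVenues(names=filtered_neighborhoods['Neighborhood'],
#                             latitudes=filtered_neighborhoods['Latitude'],
#                             longitudes=filtered_neighborhoods['Longitude']))
```

Retrying the whole call repeats every request; retrying per neighborhood inside the loop would be cheaper, at the cost of modifying the copied lab function.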

**Let's check the size of the resulting dataframe**

In [14]:
print(venues.shape)
venues.head()

(1701, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
1,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop
2,The Beaches,43.676357,-79.293031,Glen Stewart Ravine,43.6763,-79.294784,Other Great Outdoors
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


**How many venues were returned for each neighborhood**

In [15]:
venues[['Neighborhood', 'Venue']].groupby('Neighborhood').count()

Neighborhood,Venue
"Adelaide, King, Richmond",100
Berczy Park,55
"Brockton, Exhibition Place, Parkdale Village",18
Business reply mail Processing Centre969 Eastern,19
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",14
"Cabbagetown, St. James Town",46
Central Bay Street,84
"Chinatown, Grange Park, Kensington Market",97
Christie,16
Church and Wellesley,86


**Number of unique categories that can be curated from all the returned venues**

In [16]:
len(venues['Venue Category'].unique())

234

## Clustering

### Prepare data for clustering

**Create a dataframe with one-hot encoding of venue categories**

In [17]:
# one hot encoding
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighborhood'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
print('The shape of onehot dataframe: {}'.format(onehot.shape))

The shape of onehot dataframe: (1701, 234)
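The one-hot dataframe has 234 columns, the same as the number of unique categories, even though a `Neighborhood` column was added. One plausible explanation, inferred from the sample rows above where a venue has the category `Neighborhood`, is that the assignment `onehot['Neighborhood'] = ...` overwrites the dummy column of the same name instead of adding a new one. The sketch below reproduces this on toy data; the column name `Area` is a hypothetical choice used only to avoid the collision.

```python
import pandas as pd

# Toy venues; one of them has the category 'Neighborhood'.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Pub', 'Neighborhood', 'Park'],
})

dummies = pd.get_dummies(toy_venues[['Venue Category']], prefix='', prefix_sep='')

# Assigning a column named 'Neighborhood' replaces the dummy column in place,
# so the column count does not grow and the category indicator is lost.
overwritten = dummies.copy()
overwritten['Neighborhood'] = toy_venues['Neighborhood']

# One way around it: insert the key under a name that cannot collide.
safe = dummies.copy()
safe.insert(0, 'Area', toy_venues['Neighborhood'])

print(overwritten.shape, safe.shape)
```

In the overwritten variant the key column also stays where the dummy column was, so code that moves "the last column" to the front would move a category column instead.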


**Group the one-hot rows by neighborhood, taking the mean of each category column**

In [19]:
grouped = onehot.groupby('Neighborhood').mean().reset_index()
grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business reply mail Processing Centre969 Eastern,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.071429,0.071429,0.071429,0.142857,0.142857,0.142857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
print('The shape of grouped dataframe: {}'.format(grouped.shape))

The shape of grouped dataframe: (38, 234)
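Taking the mean of one-hot columns per group yields the share of a neighborhood's venues that fall into each category, which is why the values above look like `0.01` or `0.071429`. A minimal sketch on toy data (the values are made up for illustration):

```python
import pandas as pd

# Each row is one venue; exactly one category column is 1.
toy_onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Coffee Shop':  [1, 1, 0, 0, 0, 1],
    'Park':         [0, 0, 1, 0, 1, 0],
    'Pub':          [0, 0, 0, 1, 0, 0],
})

# Mean of 0/1 indicators = fraction of venues in that category.
freq = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Each row of the result sums to 1 across the category columns, so neighborhoods with very different venue counts become comparable.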


**Create a dataframe with the most common venue categories per neighborhood**

In [21]:
def return_most_common_venues_groups(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
top_cat_per_neigh = pd.DataFrame(columns=columns)
top_cat_per_neigh['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    top_cat_per_neigh.iloc[ind, 1:] = return_most_common_venues_groups(grouped.iloc[ind, :], num_top_venues)

In [22]:
top_cat_per_neigh

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,American Restaurant,Thai Restaurant,Steakhouse,Cosmetics Shop,Hotel,Asian Restaurant,Gym,Clothing Store
1,Berczy Park,Coffee Shop,Restaurant,Cocktail Bar,Pub,Cheese Shop,Italian Restaurant,Steakhouse,Seafood Restaurant,Farmers Market,Café
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Breakfast Spot,Grocery Store,Furniture / Home Store,Pet Store,Convenience Store,Gym,Climbing Gym,Caribbean Restaurant
3,Business reply mail Processing Centre969 Eastern,Light Rail Station,Yoga Studio,Auto Workshop,Comic Shop,Pizza Place,Butcher,Recording Studio,Restaurant,Burrito Place,Brewery
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Lounge,Airport Terminal,Airport Service,Harbor / Marina,Sculpture Garden,Boutique,Plane,Boat or Ferry,Airport Gate,Airport
5,"Cabbagetown, St. James Town",Coffee Shop,Restaurant,Bakery,Indian Restaurant,Italian Restaurant,Café,Pub,Chinese Restaurant,Pizza Place,Park
6,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Ice Cream Shop,Burger Joint,Bar,Sandwich Place,Indian Restaurant,Falafel Restaurant,Japanese Restaurant
7,"Chinatown, Grange Park, Kensington Market",Café,Bar,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Coffee Shop,Bakery,Mexican Restaurant,Chinese Restaurant,Dumpling Restaurant,Burger Joint
8,Christie,Grocery Store,Café,Park,Diner,Athletics & Sports,Italian Restaurant,Restaurant,Nightclub,Coffee Shop,Convenience Store
9,Church and Wellesley,Japanese Restaurant,Sushi Restaurant,Coffee Shop,Gay Bar,Restaurant,Burger Joint,Mediterranean Restaurant,Gastropub,Bubble Tea Shop,Café
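The ranking step sorts each row's category frequencies and keeps the first ten labels. When a neighborhood has fewer than ten distinct categories, the remaining slots are filled with categories whose frequency is zero. The sketch below shows a possible variation, under the assumption that such padding is unwanted: it uses `Series.nlargest` and skips zero-frequency categories. The function name `top_categories` and the toy values are hypothetical.

```python
import pandas as pd

toy_grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Coffee Shop': [0.5, 0.4],
    'Park': [0.2, 0.6],
    'Pub': [0.3, 0.0],
})

def top_categories(row, n):
    # Drop the name column and convert the remaining values to floats.
    freqs = row.drop('Neighborhood').astype(float)
    # Keep only categories that actually occur, then take the n largest.
    return freqs[freqs > 0].nlargest(n).index.tolist()

top = {row['Neighborhood']: top_categories(row, 3)
       for _, row in toy_grouped.iterrows()}
print(top)
```

With this variation the result lists can have different lengths, so they would need padding (for example with `None`) before being written into fixed-width columns.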


### Cluster Neighborhoods

In [46]:
# set number of clusters
kclusters = 4

grouped_clustering = grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [47]:
# work on a copy so that filtered_neighborhoods itself is not modified
merged = filtered_neighborhoods.copy()

# add clustering labels
# caution: kmeans.labels_ follows the row order of `grouped` (sorted by neighborhood name),
# so this positional assignment only matches if `merged` is in that same order
merged['Cluster Labels'] = kmeans.labels_

# join the most common venue categories onto each neighborhood's borough and coordinates
merged = merged.join(top_cat_per_neigh.set_index('Neighborhood'), on='Neighborhood')
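The cell above assigns `kmeans.labels_` to the neighborhoods by position. The labels follow the row order of `grouped`, which is sorted by neighborhood name, while `filtered_neighborhoods` is ordered by postal code, so a positional assignment can pair labels with the wrong rows. A minimal sketch of aligning by name instead, using toy stand-ins (all values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Stand-in for filtered_neighborhoods (postal-code order).
toy_neighborhoods = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'Christie', 'Berczy Park'],
    'Latitude': [43.68, 43.67, 43.64],
})

# Stand-in for grouped (sorted by neighborhood name).
toy_grouped = pd.DataFrame({
    'Neighborhood': ['Berczy Park', 'Christie', 'The Beaches'],
})

# Stand-in for kmeans.labels_: one label per row of toy_grouped.
toy_labels = np.array([2, 0, 1])

# Index the labels by neighborhood name, then join on that name.
labels_by_name = pd.Series(toy_labels,
                           index=toy_grouped['Neighborhood'],
                           name='Cluster Labels')
toy_merged = toy_neighborhoods.join(labels_by_name, on='Neighborhood')
print(toy_merged)
```

Joining by name also copes with neighborhoods for which no venues were returned: they would simply receive a missing label instead of shifting every later label by one row.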


### Analyze clusters

In [48]:
merged.loc[merged['Cluster Labels'] == 0, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,East Toronto,0,Coffee Shop,Other Great Outdoors,Pub,Women's Store,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
41,East Toronto,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore,Yoga Studio,Bubble Tea Shop,Bakery,Spa,Juice Bar
42,East Toronto,0,Park,Sandwich Place,Light Rail Station,Fast Food Restaurant,Sushi Restaurant,Food & Drink Shop,Ice Cream Shop,Pub,Fish & Chips Shop,Burrito Place
43,East Toronto,0,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Fish Market,Juice Bar,Latin American Restaurant,Bookstore
44,Central Toronto,0,Bus Line,Dim Sum Restaurant,Park,Swim School,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
45,Central Toronto,0,Breakfast Spot,Grocery Store,Sandwich Place,Park,Burger Joint,Food & Drink Shop,Clothing Store,Dance Studio,Hotel,Donut Shop
46,Central Toronto,0,Clothing Store,Sporting Goods Shop,Coffee Shop,Yoga Studio,Gift Shop,Italian Restaurant,Fast Food Restaurant,Mexican Restaurant,Diner,Dessert Shop
47,Central Toronto,0,Pizza Place,Sandwich Place,Dessert Shop,Sushi Restaurant,Italian Restaurant,Café,Seafood Restaurant,Coffee Shop,Skating Rink,Chinese Restaurant
48,Central Toronto,0,Intersection,Playground,Park,Restaurant,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
49,Central Toronto,0,Pub,Coffee Shop,American Restaurant,Sushi Restaurant,Bagel Shop,Fried Chicken Joint,Sports Bar,Supermarket,Convenience Store,Pizza Place


In [49]:
merged.loc[merged['Cluster Labels'] == 1, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,Downtown Toronto,1,Coffee Shop,Café,Italian Restaurant,Ice Cream Shop,Burger Joint,Bar,Sandwich Place,Indian Restaurant,Falafel Restaurant,Japanese Restaurant
65,Central Toronto,1,Coffee Shop,Café,Sandwich Place,Pizza Place,French Restaurant,Pharmacy,Cosmetics Shop,Pub,Burger Joint,Liquor Store
68,Downtown Toronto,1,Airport Lounge,Airport Terminal,Airport Service,Harbor / Marina,Sculpture Garden,Boutique,Plane,Boat or Ferry,Airport Gate,Airport


In [50]:
merged.loc[merged['Cluster Labels'] == 2, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
63,Central Toronto,2,Pool,Garden,Women's Store,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


In [51]:
merged.loc[merged['Cluster Labels'] == 3, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
69,Downtown Toronto,3,Coffee Shop,Restaurant,Seafood Restaurant,Café,Cocktail Bar,Pub,Hotel,Italian Restaurant,Bakery,Cosmetics Shop
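In the outputs above, clusters 2 and 3 each show a single neighborhood, which suggests the clusters are quite uneven in size. A quick way to check this is to count the rows per label. The sketch below uses a made-up label vector; in the notebook the input would be `kmeans.labels_` or the `Cluster Labels` column.

```python
import pandas as pd

# Hypothetical label vector for illustration.
toy_labels = pd.Series([0, 0, 0, 1, 0, 2, 0, 1, 3, 0], name='Cluster Labels')

# Number of neighborhoods per cluster, ordered by cluster label.
sizes = toy_labels.value_counts().sort_index()
print(sizes.to_dict())
```

Very small clusters are often outliers in the feature space, so they may be worth inspecting before interpreting the larger groups.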


### Visualize clusters

In [54]:
# create map
map_clusters = folium.Map(location=[43.653226, -79.3831843], zoom_start=12)

# set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(merged['Latitude'], merged['Longitude'], merged['Neighborhood'], merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

In [55]:
map_clusters