# Segmenting and Clustering Neighborhoods in Toronto

## Part 1
Scrape the Wikipedia page, clean the data, and read it into a pandas DataFrame so that it is structured like the New York dataset

Import libraries

In [1]:
import pandas as pd
import numpy as np

Read table

In [90]:
df=pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)[0]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Check whether any row has an assigned Borough but a 'Not assigned' Neighbourhood

In [91]:
df[(df['Neighbourhood']=='Not assigned') & (df['Borough']!='Not assigned')].count()

Postal Code      0
Borough          0
Neighbourhood    0
dtype: int64

Drop all entries with 'Not assigned' values in Borough

In [92]:
df.drop(df[df['Borough']=='Not assigned'].index,inplace=True)
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Get dataframe shape

In [93]:
df.shape

(103, 3)
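
As an alternative to dropping rows in place, the same cleaning step can be written as a boolean mask that returns a new frame. The sketch below uses a small hand-made table (the rows are illustrative, not the scraped data), and the helper name `drop_unassigned` is hypothetical.

```python
import pandas as pd

def drop_unassigned(table, column="Borough", marker="Not assigned"):
    """Return a copy of `table` without rows whose `column` equals `marker`."""
    kept = table[table[column] != marker]
    # reset_index(drop=True) renumbers rows from 0 and discards the old index
    return kept.reset_index(drop=True)

# Illustrative rows only
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})
cleaned = drop_unassigned(sample)
print(cleaned.shape)
```

Because the helper returns a new frame, the original table stays available for comparison.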

## Part 2
Get the latitude and the longitude coordinates of each neighborhood

In [94]:
import geocoder  # third-party package: pip install geocoder

def getcoords(postal_code, max_tries=10):
    # initialize your variable to None
    lat_lng_coords = None

    # retry a bounded number of times instead of looping forever
    for _ in range(max_tries):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        if lat_lng_coords is not None:
            break
    else:
        raise RuntimeError('No coordinates returned for {}'.format(postal_code))

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude

A CSV file is used instead, since the Geocoder service was taking too long to respond

Read the CSV file of postal code coordinates

In [95]:
dfcoord=pd.read_csv('https://cocl.us/Geospatial_data')
dfcoord

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Join locations to each neighbourhood

In [96]:
df=df.join(dfcoord.set_index('Postal Code'),on='Postal Code')
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
165,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
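
A left join silently leaves `NaN` coordinates for any postal code that is missing from the lookup file. One way to check for this, sketched here on made-up rows with a hypothetical helper `attach_coords`, is to use `DataFrame.merge` with `validate` (to confirm each postal code appears at most once in the lookup table) and then list the codes that found no match. The code `M9Z` and the borough `Example` are invented for the illustration.

```python
import pandas as pd

def attach_coords(places, coords, key="Postal Code"):
    """Left-join coordinates onto places and report keys without a match."""
    # validate="many_to_one" raises if `key` is duplicated in `coords`
    merged = places.merge(coords, on=key, how="left", validate="many_to_one")
    missing = merged.loc[merged["Latitude"].isna(), key].tolist()
    return merged, missing

places = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M9Z"],
                       "Borough": ["North York", "North York", "Example"]})
coords = pd.DataFrame({"Postal Code": ["M3A", "M4A"],
                       "Latitude": [43.753259, 43.725882],
                       "Longitude": [-79.329656, -79.315572]})
merged, missing = attach_coords(places, coords)
print(missing)
```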


## Part 3
Explore and cluster the neighborhoods in Toronto

Import map rendering library

In [97]:
import folium 
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors

Filter neighborhoods to boroughs in Toronto proper

In [99]:
df=df[df['Borough'].str.contains('Toronto')].reset_index()
df

Unnamed: 0,level_0,index,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,0,4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,1,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,2,13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,3,22,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,4,30,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,5,31,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,6,40,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,7,41,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,8,49,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,9,50,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


Calculate center of map

In [100]:
latitude=df['Latitude'].mean()
longitude=df['Longitude'].mean()
[latitude,longitude]

[43.66713498717949, -79.38987324871795]

Generate a map centered at that location and add markers to it

In [74]:
dispmap=folium.Map(location=[latitude, longitude], zoom_start=11)

In [75]:
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    # parse_html belongs to the Popup, not to the CircleMarker
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(dispmap)
    
dispmap

Set Foursquare credentials and request parameters for the top 100 venues within a radius of 500 meters

In [25]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
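
One way to keep credentials out of the notebook is to read them from environment variables. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions chosen for illustration, not names defined by Foursquare.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare ID and secret from a mapping of environment variables."""
    # The variable names are hypothetical; use whatever names you export
    client_id = env.get("FOURSQUARE_CLIENT_ID", "")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET", "")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    return client_id, client_secret
```

In the notebook this could be used as `CLIENT_ID, CLIENT_SECRET = load_credentials()`, which fails early if either value is missing.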

Get venues by categories for each neighborhood

In [76]:
def getNearbyVenueCats(neighborhoods, latitudes, longitudes):
    radius=500
    venues_list=[]
    for name, lat, lng in zip(neighborhoods, latitudes, longitudes):
           
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Venue Category']
    
    return(nearby_venues)
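
The URL construction and the response parsing can be split into small helpers that are testable without calling the API. In this sketch, `build_explore_url` and `extract_categories` are hypothetical helper names; the endpoint and query parameters are the ones used in the cell above, and the sample payload is hand-written to mimic only the nesting that cell indexes into. Venues with an empty `categories` list are skipped, a case in which the list comprehension above would raise an `IndexError`.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL, letting urlencode escape the parameter values."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

def extract_categories(name, items):
    """Return (neighborhood, category) pairs, skipping venues without a category."""
    pairs = []
    for item in items:
        categories = item.get("venue", {}).get("categories", [])
        if categories:
            pairs.append((name, categories[0]["name"]))
    return pairs

# Hand-written payload in the shape of response['groups'][0]['items']
sample_items = [
    {"venue": {"categories": [{"name": "Bakery"}]}},
    {"venue": {"categories": []}},
    {"venue": {"categories": [{"name": "Coffee Shop"}]}},
]
url = build_explore_url("ID", "SECRET", "20180605", 43.65426, -79.360636)
pairs = extract_categories("Regent Park, Harbourfront", sample_items)
print(pairs)
```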

In [77]:
neighborhood_venues = getNearbyVenueCats(df['Neighbourhood'],df['Latitude'], df['Longitude'])

In [78]:
neighborhood_venues

Unnamed: 0,Neighborhood,Venue Category
0,"Regent Park, Harbourfront",Bakery
1,"Regent Park, Harbourfront",Coffee Shop
2,"Regent Park, Harbourfront",Distribution Center
3,"Regent Park, Harbourfront",Spa
4,"Regent Park, Harbourfront",Restaurant
...,...,...
1616,"Business reply mail Processing Centre, South C...",Light Rail Station
1617,"Business reply mail Processing Centre, South C...",Park
1618,"Business reply mail Processing Centre, South C...",Yoga Studio
1619,"Business reply mail Processing Centre, South C...",Butcher


One-hot encode venue categories

In [79]:
neighborhood_onehot = pd.get_dummies(neighborhood_venues['Venue Category'], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
neighborhood_onehot['Neighborhood'] = neighborhood_venues['Neighborhood'] 
neighborhood_onehot=neighborhood_onehot[['Neighborhood']+[col for col in neighborhood_onehot if col != 'Neighborhood']]
neighborhood_onehot


Unnamed: 0,Neighborhood,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1616,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1617,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1618,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1619,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [80]:
neighborhood_grouped = neighborhood_onehot.groupby('Neighborhood').mean().reset_index()
neighborhood_grouped

Unnamed: 0,Neighborhood,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.016393,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.058824,0.058824,0.058824,0.117647,0.117647,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.015385,0.0,0.0,0.015385,0.015385
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,0.0,...,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
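
The one-hot encoding followed by a grouped mean yields, for each neighborhood, the share of its venues that fall in each category. A compact way to see this on toy data (the rows below are invented for illustration) is `pd.crosstab` with `normalize='index'`, which should give the same proportions in a single step.

```python
import pandas as pd

# Invented rows: neighborhood "A" has four venues, "B" has two
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Bakery", "Park", "Park"],
})

# Two-step version, as in the cells above
onehot = pd.get_dummies(venues["Venue Category"], prefix="", prefix_sep="")
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
grouped = onehot.groupby("Neighborhood").mean()

# One-step version: contingency table normalised so each row sums to 1
shares = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")

print(grouped.loc["A", "Café"], shares.loc["A", "Café"])
```

Each row of either table sums to 1, so the values are directly comparable across neighborhoods with different venue counts.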


Create a new dataframe and display the top 10 venues for each neighborhood

In [81]:
num_top_venues = 10

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    columns.append('Most Common Venue - {}'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = neighborhood_grouped['Neighborhood']

for ind in np.arange(neighborhood_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(neighborhood_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,Most Common Venue - 1,Most Common Venue - 2,Most Common Venue - 3,Most Common Venue - 4,Most Common Venue - 5,Most Common Venue - 6,Most Common Venue - 7,Most Common Venue - 8,Most Common Venue - 9,Most Common Venue - 10
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Pharmacy,Farmers Market,Seafood Restaurant,Cheese Shop,Beer Bar,Restaurant,Café
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Grocery Store,Stadium,Burrito Place,Restaurant,Climbing Gym,Pet Store,Bakery
2,"Business reply mail Processing Centre, South C...",Yoga Studio,Pizza Place,Light Rail Station,Brewery,Farmers Market,Fast Food Restaurant,Burrito Place,Butcher,Restaurant,Auto Workshop
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport Terminal,Coffee Shop,Harbor / Marina,Plane,Rental Car Location,Sculpture Garden,Boutique,Bar
4,Central Bay Street,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Bank,Salad Place,Thai Restaurant,Bubble Tea Shop,Burger Joint,Yoga Studio
5,Christie,Grocery Store,Café,Park,Athletics & Sports,Italian Restaurant,Restaurant,Baby Store,Candy Store,Nightclub,Coffee Shop
6,Church and Wellesley,Sushi Restaurant,Japanese Restaurant,Coffee Shop,Gay Bar,Restaurant,Café,Fast Food Restaurant,Hotel,Dance Studio,Yoga Studio
7,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Café,Hotel,Gym,Italian Restaurant,Deli / Bodega,American Restaurant,Japanese Restaurant,Seafood Restaurant
8,Davisville,Dessert Shop,Sandwich Place,Pizza Place,Sushi Restaurant,Gym,Italian Restaurant,Café,Coffee Shop,Toy / Game Store,Greek Restaurant
9,Davisville North,Dance Studio,Department Store,Park,Gym / Fitness Center,Breakfast Spot,Sandwich Place,Food & Drink Shop,Hotel,Eastern European Restaurant,Donut Shop
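
A neighborhood with fewer than ten distinct categories still receives ten entries from the function above, because the remaining slots are filled with categories whose frequency is zero. The plain-Python sketch below (hypothetical helper `top_categories`, invented frequencies) ranks by descending frequency, breaks ties alphabetically so the order is reproducible, and drops zero-frequency categories.

```python
def top_categories(frequencies, n=10):
    """Return up to n category names with non-zero frequency, most common first."""
    nonzero = [(name, freq) for name, freq in frequencies.items() if freq > 0]
    # Sort by descending frequency, then by name so that ties are deterministic
    ranked = sorted(nonzero, key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:n]]

# Invented frequencies for a neighborhood with only four venues
row = {"Trail": 0.25, "Pub": 0.25, "Health Food Store": 0.25,
       "Yoga Studio": 0.25, "Airport": 0.0, "Donut Shop": 0.0}
print(top_categories(row, n=10))
```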


Import k-means for clustering and run clustering

In [82]:
from sklearn.cluster import KMeans

In [83]:
kclusters = 5

# pass the column by keyword; the positional axis argument was removed in pandas 2.0
neighborhood_grouped_clustering = neighborhood_grouped.drop(columns='Neighborhood')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(neighborhood_grouped_clustering)

Add clustering labels and rebuild dataframe

In [84]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_merged = df
neighborhoods_merged = neighborhoods_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')
neighborhoods_merged 

Unnamed: 0,level_0,index,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,Most Common Venue - 1,Most Common Venue - 2,Most Common Venue - 3,Most Common Venue - 4,Most Common Venue - 5,Most Common Venue - 6,Most Common Venue - 7,Most Common Venue - 8,Most Common Venue - 9,Most Common Venue - 10
0,0,4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Park,Bakery,Café,Pub,Breakfast Spot,Theater,Cosmetics Shop,Shoe Store,Brewery
1,1,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2,Coffee Shop,Sushi Restaurant,College Cafeteria,Yoga Studio,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café
2,2,13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Middle Eastern Restaurant,Cosmetics Shop,Theater,Pizza Place,Bookstore,Ramen Restaurant
3,3,22,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Coffee Shop,Café,Clothing Store,Gastropub,Cosmetics Shop,American Restaurant,Cocktail Bar,Farmers Market,Gym,Italian Restaurant
4,4,30,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Trail,Health Food Store,Pub,Yoga Studio,Deli / Bodega,Escape Room,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant
5,5,31,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,2,Coffee Shop,Cocktail Bar,Bakery,Pharmacy,Farmers Market,Seafood Restaurant,Cheese Shop,Beer Bar,Restaurant,Café
6,6,40,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,2,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Bank,Salad Place,Thai Restaurant,Bubble Tea Shop,Burger Joint,Yoga Studio
7,7,41,M6G,Downtown Toronto,Christie,43.669542,-79.422564,2,Grocery Store,Café,Park,Athletics & Sports,Italian Restaurant,Restaurant,Baby Store,Candy Store,Nightclub,Coffee Shop
8,8,49,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,2,Coffee Shop,Café,Restaurant,Thai Restaurant,Clothing Store,Deli / Bodega,Gym,Salad Place,Bakery,Bar
9,9,50,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259,2,Pharmacy,Bakery,Grocery Store,Supermarket,Middle Eastern Restaurant,Music Venue,Park,Pet Store,Pizza Place,Café
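
The preview above shows only rows labelled cluster 2, so it can be worth counting how many neighborhoods fall into each cluster before interpreting the map. A minimal sketch with `collections.Counter` follows; the helper name `cluster_sizes` is hypothetical and the label list is made up, whereas in the notebook the input would be `kmeans.labels_`.

```python
from collections import Counter

def cluster_sizes(labels, n_clusters):
    """Return a list where position i holds the number of items in cluster i."""
    counts = Counter(int(label) for label in labels)
    return [counts.get(i, 0) for i in range(n_clusters)]

example_labels = [2, 2, 2, 0, 2, 1, 2, 4, 2, 2]  # illustrative only
sizes = cluster_sizes(example_labels, n_clusters=5)
print(sizes)
```

A result dominated by one large cluster suggests the chosen number of clusters may be separating only a few outlying neighborhoods from the rest.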


Set color scheme for the clusters, generate markers, and plot map

In [87]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# one distinct color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(neighborhoods_merged['Latitude'], neighborhoods_merged['Longitude'], neighborhoods_merged['Neighbourhood'], neighborhoods_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # index colors directly by cluster label (0 to kclusters-1)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters