# Segmenting and Clustering Neighborhoods in Toronto

Overview: This notebook explores, segments, and clusters the neighborhoods of the city of Toronto.
Toronto neighborhood data is collected from Wikipedia (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), and the most common venue categories in each neighborhood are retrieved from the Foursquare API.
These features are used to group the neighborhoods into clusters with the k-means clustering algorithm. Finally, the Folium library is used to visualize the neighborhoods of Toronto and the resulting clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. Import data set and perform Data Wrangling

2. Get Geocoding info for each neighborhood

3. Explore and cluster the neighborhoods in Toronto
 
</font>
</div>

## Part 1 -  Data Wrangling

### Install the required packages

In [1]:
#!conda install -c conda-forge beautifulsoup4 --yes 
#!conda install -c conda-forge lxml --yes 
#!conda install -c conda-forge requests --yes 
#!conda install -c conda-forge geocoder --yes 
#!conda install -c conda-forge folium --yes 

### Import all the required packages

In [2]:
from bs4 import BeautifulSoup
import requests
import urllib.request
import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


### Import the data from the URL and parse it with BeautifulSoup to get the table

In [3]:
fp = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()

In [4]:
soup = BeautifulSoup(mystr,'lxml')
match=soup.find('table', class_='wikitable sortable')

### Transform the data into a pandas DataFrame

In [5]:
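from io import StringIO  # newer pandas versions expect a file-like object rather than a literal HTML string
dataFrameList = pd.read_html(StringIO(str(match)))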
neighborhoods = dataFrameList[0]
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## Start Data Wrangling

In [6]:
#Cleanup step 1: Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

assignedBoroughs = neighborhoods['Borough']!='Not assigned'
neighborhoods = neighborhoods[assignedBoroughs]

In [7]:
#cleanup step 2: Combine rows with the same postcode into one row, with the neighborhoods separated by a comma

neighborhoods = neighborhoods.groupby(['Postcode', 'Borough'], as_index=False, sort=False).agg(','.join)

In [8]:
#cleanup step 3:  If a cell has a borough but a Not assigned neighborhood, then assign borough to neighbourhood

neighborhoods['Neighbourhood'] = np.where(neighborhoods['Neighbourhood'] == 'Not assigned', neighborhoods['Borough'], neighborhoods['Neighbourhood'])

#Rename neighbourhood column to be consistent
neighborhoods.rename(columns={"Neighbourhood": "Neighborhood"}, inplace = True)

In [9]:
neighborhoods

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


In [10]:
neighborhoods.shape

(103, 3)
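
The three cleanup steps can be traced on a tiny table. This is only a sketch: the rows below are made-up stand-ins that mimic the shape of the Wikipedia table, and the variable names `raw` and `clean` are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the shape of the Wikipedia table (illustration only).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Step 1: keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"]

# Step 2: one row per postcode, neighbourhoods joined with commas.
clean = clean.groupby(["Postcode", "Borough"], as_index=False, sort=False).agg(",".join)

# Step 3: fall back to the borough name when the neighbourhood is unassigned.
clean["Neighbourhood"] = np.where(
    clean["Neighbourhood"] == "Not assigned", clean["Borough"], clean["Neighbourhood"]
)

print(clean)
```

Because `sort=False` is passed to `groupby`, the rows keep the order in which each postcode first appears.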

## Part 2 - Adding Geocoding

In [11]:
#import geocoder

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):

#  g = geocoder.google('{}, Toronto, Ontario'.format('M5G'))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
#print (latitude)
#print (longitude)

### The geocoder failed to return coordinates after multiple retries, so the implementation continues with the CSV file

In [12]:
long_lat_df = pd.read_csv('http://cocl.us/Geospatial_data')
#long_lat_df.rename(columns={"Postal Code": "Postcode"}, inplace = True)
long_lat_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
neighborhoods = neighborhoods.merge(long_lat_df, left_on='Postcode', right_on='Postal Code')
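# note: drop() returns a new DataFrame, so this only changes what is displayed; 'Postal Code' remains in neighborhoods
neighborhoods.drop(['Postal Code'], axis=1)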

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.654260,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,M9A,Queen's Park,Queen's Park,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
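
An inner merge silently drops postcodes that have no coordinates. One way to surface them is a left merge with pandas' `indicator` and `validate` options, sketched below on toy frames. The postcode `M9Z` and the borough `Example` are made up for illustration; the two coordinate pairs are copied from the output above.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Example"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = left.merge(
    right,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",               # keep every neighborhood row
    validate="one_to_one",    # raise if a postcode is duplicated on either side
    indicator=True,           # adds a "_merge" column describing each row's origin
)

# Postcodes that found no match in the coordinates table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()

# Assign the result so the helper columns are actually removed.
merged = merged.drop(columns=["Postal Code", "_merge"])
print(unmatched)
```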


Get Latitude and Longitude for Toronto

In [14]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, ON are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, ON are 43.653963, -79.387207.


## Part 3 - Explore and cluster neighborhoods in Toronto

### Create a map of Toronto with neighbourhoods superimposed on top

In [15]:
import folium

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Initialize Foursquare API credentials in a hidden cell

In [16]:
# The code was removed by Watson Studio for sharing.

In [17]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
VERSION = '20180605' # Foursquare API version



#### Get up to 100 venues for each neighborhood within a radius of 500 meters

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        #print (url)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
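
`getNearbyVenues` indexes straight into the JSON response, so a reply without `groups`, or a venue with an empty `categories` list, would raise an exception. A more tolerant flattening step might look like the sketch below. `extract_venue_rows` is a hypothetical helper, and `sample` is a hand-written payload that only mirrors the keys accessed in the function above; it is not real API output.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten an explore-style response into tuples, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Use None instead of failing when a venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hand-written payload with the same nesting the notebook code relies on.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Parkwoods", 43.753259, -79.329656, sample)
empty = extract_venue_rows("Parkwoods", 43.753259, -79.329656, {"response": {}})
print(rows)
print(empty)
```

Rows with a `None` category could then be dropped or labeled before one-hot encoding.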

In [None]:
toronto_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

In [20]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
3,Parkwoods,43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
4,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena


Check how many venues were returned for each neighborhood

In [21]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",100,100,100,100,100,100
Agincourt,4,4,4,4,4,4
"Agincourt North,L'Amoreaux East,Milliken,Steeles East",2,2,2,2,2,2
"Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,Silverstone,South Steeles,Thistletown",11,11,11,11,11,11
"Alderwood,Long Branch",9,9,9,9,9,9
"Bathurst Manor,Downsview North,Wilson Heights",20,20,20,20,20,20
Bayview Village,4,4,4,4,4,4
"Bedford Park,Lawrence Manor East",23,23,23,23,23,23
Berczy Park,56,56,56,56,56,56
"Birch Cliff,Cliffside West",4,4,4,4,4,4


## Analyze each neighborhood

In [22]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Group rows by neighborhood, taking the mean frequency of occurrence of each category

In [23]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide,King,Richmond",0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.020000,...,0.0,0.020000,0.00,0.000000,0.000000,0.000000,0.010000,0.000000,0.0,0.01
1,Agincourt,0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.000000,...,0.0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.000000,...,0.0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.000000,...,0.0,0.000000,0.00,0.090909,0.000000,0.000000,0.000000,0.000000,0.0,0.00
4,"Alderwood,Long Branch",0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.000000,...,0.0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
5,"Bathurst Manor,Downsview North,Wilson Heights",0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.000000,...,0.0,0.000000,0.00,0.050000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
6,Bayview Village,0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.000000,...,0.0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
7,"Bedford Park,Lawrence Manor East",0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.043478,...,0.0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
8,Berczy Park,0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.000000,...,0.0,0.017857,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
9,"Birch Cliff,Cliffside West",0.000000,0.0,0.000000,0.0000,0.0000,0.000,0.000,0.000,0.000000,...,0.0,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
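
To see why the grouped values are frequencies, it may help to run the same two steps on a handful of made-up venues. In this sketch the neighborhoods `A` and `B` and their venues are invented, and `dtype=int` is passed only so the indicator columns are 0/1 regardless of pandas version.

```python
import pandas as pd

# Made-up venues for two neighborhoods (illustration only).
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Café", "Bus Stop", "Café", "Café"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of 0/1 indicators is the share of venues in each category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Each row sums to 1, so a neighborhood with only two venues gets 0.5 per category, which is the pattern visible for the sparsely covered neighborhoods in the top-5 listing.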


### Print each neighborhood along with its top 5 most common venues

In [24]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
         venue  freq
0  Coffee Shop  0.07
1         Café  0.04
2   Steakhouse  0.04
3          Bar  0.04
4   Restaurant  0.03


----Agincourt----
                       venue  freq
0  Latin American Restaurant  0.25
1             Breakfast Spot  0.25
2               Skating Rink  0.25
3                     Lounge  0.25
4                Yoga Studio  0.00


----Agincourt North,L'Amoreaux East,Milliken,Steeles East----
                             venue  freq
0                       Playground   0.5
1                             Park   0.5
2                      Yoga Studio   0.0
3               Mexican Restaurant   0.0
4  Molecular Gastronomy Restaurant   0.0


----Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,Silverstone,South Steeles,Thistletown----
                  venue  freq
0         Grocery Store  0.18
1           Pizza Place  0.09
2          Liquor Store  0.09
3              Pharmacy  0.09
4  Fast Food Restaurant  0.09


----Alde

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create a dataframe with the venue info for each neighborhood

In [26]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Steakhouse,Bar,Asian Restaurant,Restaurant,Breakfast Spot,Hotel,Thai Restaurant,Gastropub
1,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Park,Playground,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Drugstore
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Grocery Store,Coffee Shop,Pizza Place,Fried Chicken Joint,Sandwich Place,Beer Store,Fast Food Restaurant,Video Store,Pharmacy,Liquor Store
4,"Alderwood,Long Branch",Pizza Place,Gym,Coffee Shop,Pharmacy,Skating Rink,Athletics & Sports,Pub,Sandwich Place,Discount Store,Dessert Shop
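
The row-by-row loop above can also be expressed with a single `argsort` call. The sketch below is an alternative formulation, not the notebook's method: the frequency table is made up, and `ordinal` is a hypothetical helper that also handles suffixes beyond the tenth position.

```python
import numpy as np
import pandas as pd

# Made-up frequency table: rows are neighborhoods, columns are categories.
freq = pd.DataFrame(
    {"Café": [0.5, 0.1], "Park": [0.3, 0.2], "Gym": [0.2, 0.7]},
    index=pd.Index(["A", "B"], name="Neighborhood"),
)


def ordinal(n):
    """Format 1 as '1st', 2 as '2nd', 3 as '3rd', 4 as '4th', 11 as '11th', and so on."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


top_n = 2
# Column positions sorted by descending frequency, truncated to the top_n.
order = np.argsort(-freq.to_numpy(), axis=1)[:, :top_n]
top = pd.DataFrame(
    freq.columns.to_numpy()[order],
    index=freq.index,
    columns=[f"{ordinal(i + 1)} Most Common Venue" for i in range(top_n)],
)
print(top)
```

When several categories share the same frequency (often 0.0 in sparse neighborhoods), their relative order is arbitrary, so the lower-ranked entries for such rows carry little meaning.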


## Cluster the neighborhoods

In [27]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print (kmeans.labels_[0:10])
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

[1 1 0 1 1 1 1 1 1 1]
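
The notebook fixes `kclusters = 5` without comparing alternatives. One common sanity check, shown here only as a sketch, is to scan several values of k and compare silhouette scores. The data below is synthetic (three well-separated blobs), not the Toronto frequency table, and `n_init=10` is an assumption made to keep the result stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs around made-up centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.1, size=(30, 2)) for c in centers])

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest score is the best-separated clustering.
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, `toronto_grouped_clustering` would take the place of `X`. A low score for every k would suggest that venue frequencies alone do not separate the neighborhoods cleanly.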


Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [28]:
# add clustering labels

print (neighborhoods_venues_sorted['Cluster Labels'])
toronto_merged = neighborhoods

# join the neighborhood data (with latitude/longitude) to the sorted venues and their cluster labels
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood', how='right')
print (toronto_merged['Cluster Labels'])
toronto_merged.head()

0     1
1     1
2     0
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    0
17    1
18    1
19    1
20    1
21    1
22    1
23    1
24    1
25    1
26    1
27    1
28    1
29    1
     ..
70    1
71    1
72    1
73    1
74    0
75    1
76    4
77    1
78    1
79    1
80    1
81    1
82    1
83    1
84    1
85    1
86    1
87    1
88    1
89    1
90    1
91    1
92    1
93    0
94    1
95    1
96    1
97    1
98    1
99    0
Name: Cluster Labels, Length: 100, dtype: int32
0      1
1      1
2      1
3      1
4      1
5      1
6      4
7      1
8      1
9      1
10     1
12     1
13     1
14     1
15     1
16     1
17     1
18     1
19     1
20     1
21     0
22     1
23     1
24     1
25     1
26     1
27     1
28     1
29     1
30     1
      ..
72     1
73     1
74     1
75     1
76     1
77     1
78     1
79     1
80     1
81     1
82     1
83     1
84     1
85     0
86     1
87     1
88     1
89     1
90     1
91     0
92   

Unnamed: 0,Postcode,Borough,Neighborhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656,1,Park,Food & Drink Shop,Construction & Landscaping,Bus Stop,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572,1,Coffee Shop,Pizza Place,Hockey Arena,Intersection,Portuguese Restaurant,Women's Store,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner
2,M5A,Downtown Toronto,Harbourfront,M5A,43.65426,-79.360636,1,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Restaurant,Café,Mexican Restaurant,Farmers Market,Event Space
3,M6A,North York,"Lawrence Heights,Lawrence Manor",M6A,43.718518,-79.464763,1,Clothing Store,Accessories Store,Furniture / Home Store,Event Space,Miscellaneous Shop,Boutique,Vietnamese Restaurant,Gift Shop,Coffee Shop,Doner Restaurant
4,M7A,Downtown Toronto,Queen's Park,M7A,43.662301,-79.389494,1,Coffee Shop,Gym,Park,Fast Food Restaurant,Salad Place,Portuguese Restaurant,Nightclub,Music Venue,Mexican Restaurant,Juice Bar


In [29]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,"Adelaide,King,Richmond",Coffee Shop,Café,Steakhouse,Bar,Asian Restaurant,Restaurant,Breakfast Spot,Hotel,Thai Restaurant,Gastropub
1,1,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
2,0,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Park,Playground,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Drugstore
3,1,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Grocery Store,Coffee Shop,Pizza Place,Fried Chicken Joint,Sandwich Place,Beer Store,Fast Food Restaurant,Video Store,Pharmacy,Liquor Store
4,1,"Alderwood,Long Branch",Pizza Place,Gym,Coffee Shop,Pharmacy,Skating Rink,Athletics & Sports,Pub,Sandwich Place,Discount Store,Dessert Shop


Visualize the clusters

In [30]:


# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
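
The color lookup above uses `cluster-1`, so label 0 wraps around to the last palette entry. That still gives each cluster its own color, but indexing by the label itself may be easier to read. The sketch below builds the palette that way; `cluster_palette` and `color_for_label` are illustrative names, not part of the notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(k):
    """Return k hex colors sampled evenly from the rainbow colormap."""
    return [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]


palette = cluster_palette(5)

# Map each k-means label (0..k-1) directly to its own color.
color_for_label = {label: palette[label] for label in range(5)}
print(color_for_label)
```

With this mapping, the marker call would use `color_for_label[int(cluster)]` for both `color` and `fill_color`.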

## Examine the clusters

Examine each cluster and determine the discriminating venue categories that distinguish each cluster.

### Cluster 1

In [31]:
cluster1_df = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster1_df['1st Most Common Venue'].value_counts()

Park    7
Name: 1st Most Common Venue, dtype: int64

### Cluster 2

In [32]:
cluster2_df = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster2_df['1st Most Common Venue'].value_counts()

Coffee Shop                  22
Café                          8
Pizza Place                   6
Park                          5
Gym                           4
Grocery Store                 4
Pharmacy                      3
Clothing Store                2
Bar                           2
Indian Restaurant             2
Rental Car Location           1
Gas Station                   1
Asian Restaurant              1
Intersection                  1
Garden                        1
Fast Food Restaurant          1
Chinese Restaurant            1
Brewery                       1
Trail                         1
Spa                           1
Bakery                        1
Gift Shop                     1
Department Store              1
College Stadium               1
Latin American Restaurant     1
Dessert Shop                  1
Yoga Studio                   1
Mobile Phone Shop             1
American Restaurant           1
Mediterranean Restaurant      1
Jewelry Store                 1
Middle E

### Cluster 3

In [33]:
cluster3_df = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster3_df['1st Most Common Venue'].value_counts()

Piano Bar    1
Name: 1st Most Common Venue, dtype: int64

### Cluster 4

In [34]:
cluster4_df = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster4_df['1st Most Common Venue'].value_counts()

Baseball Field    1
Breakfast Spot    1
Food Truck        1
Name: 1st Most Common Venue, dtype: int64

### Cluster 5

In [35]:
cluster5_df = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster5_df['1st Most Common Venue'].value_counts()

Fast Food Restaurant    1
Name: 1st Most Common Venue, dtype: int64
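
The five cells above repeat the same query with a different label. A single `groupby` can produce a comparable overview in one table. The frame below is a made-up miniature with the same two column names as `toronto_merged`, and `summary` is an illustrative name.

```python
import pandas as pd

# Made-up miniature of toronto_merged with only the two relevant columns.
merged = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1, 1, 2],
    "1st Most Common Venue": ["Park", "Park", "Coffee Shop", "Coffee Shop", "Café", "Piano Bar"],
})

# Per cluster: number of neighborhoods and the most frequent top venue.
summary = (
    merged.groupby("Cluster Labels")["1st Most Common Venue"]
    .agg(n="count", top_venue=lambda s: s.value_counts().idxmax())
)
print(summary)
```

Note that the k-means labels start at 0 while the headings above count from 1, so label 0 corresponds to "Cluster 1".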

# Conclusion

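Neighborhoods in the city of Toronto have been explored, segmented, and clustered into 5 groups. Judging by the most common venue in each neighborhood, as listed in the outputs above, the clusters look like this:

Cluster 1: Parks (Park is the top venue in all 7 neighborhoods)

Cluster 2: Coffee shops and cafés, followed by pizza places, parks, gyms, and grocery stores; this is by far the largest cluster

Cluster 3: A single neighborhood whose top venue is a piano bar

Cluster 4: Three neighborhoods whose top venues are a baseball field, a breakfast spot, and a food truck

Cluster 5: A single neighborhood whose top venue is a fast food restaurant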
    