# Segmenting and Clustering Neighborhoods in Toronto
This is the peer-graded assignment for Week 3. For this assignment, I have chosen the <b>North York</b> borough.
I use Pandas for web scraping. The assignment is similar to this week's lab. Let's get started.

### Import Libraries

In [305]:
import numpy as np
import pandas as pd

import requests

from geopy.geocoders import Nominatim

from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium

## Use the Notebook to build the code to scrape the Wikipedia page - Toronto's Postal Codes, Canada

First, define the URL, then call pd.read_html(url) to fetch the data. This function returns the tables found on the Wikipedia page as a list of DataFrames. In this case, the page has two tables, and the first element of the list is the one we need.

In [380]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [381]:
toronto_postal = pd.read_html(wiki_url)[0]

In [382]:
print(toronto_postal.shape)
toronto_postal.head(5)

(180, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### Filter out 'Not assigned'. 

In [383]:
toronto_postal = toronto_postal[toronto_postal['Neighbourhood'] != 'Not assigned']

In [384]:
print(toronto_postal.shape)
toronto_postal.head()

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
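
The filter above keeps rows through a boolean mask. As a minimal sketch, the same step on a toy table (the five rows are copied from the first output above) looks like this. Resetting the index afterwards is an optional extra that the notebook does not do at this point.

```python
import pandas as pd

# Toy copy of the first five rows of the scraped table.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Regent Park, Harbourfront"],
})

# Boolean mask: True where the neighbourhood has been assigned.
assigned_mask = sample["Neighbourhood"] != "Not assigned"

# Keep the matching rows and renumber them from 0 (optional).
assigned = sample[assigned_mask].reset_index(drop=True)
print(assigned)
```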


## Create Dataframe with Latitude and Longitude

### Get Geospatial Coordinates
I use the Geospatial_Coordinates.csv file for latitude and longitude. After that, I merge it with <b>toronto_postal</b> on the postal code.

In [385]:
toronto_geo_coords = pd.read_csv('Geospatial_Coordinates.csv')

In [386]:
toronto_geo_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Merge Geospatial Coordinates with Toronto Postal

In [387]:
# toronto_nb = toronto neighbours
toronto_nb = pd.merge(toronto_postal, toronto_geo_coords, how='inner', on='Postal Code') 

In [388]:
toronto_nb.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
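
An inner merge silently drops postal codes that have no coordinates. One way to check for that, sketched here on toy data, is a left merge with `indicator=True`. The code `M9Z` is a made-up value used only to force a mismatch.

```python
import pandas as pd

postal = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is hypothetical
    "Borough": ["North York", "North York", "North York"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left merge keeps every postal row; the _merge column records
# whether a matching coordinate row was found.
checked = postal.merge(coords, how="left", on="Postal Code", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

If the list is empty, the inner merge used in the notebook loses nothing.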


### North York Borough
As mentioned above, I have chosen the <b>North York</b> borough.
Use <b>reset_index</b> to renumber the index from 0.

In [391]:
north_york_data = toronto_nb[toronto_nb['Borough'] == 'North York'].reset_index(drop=True)

In [392]:
print(north_york_data.shape)
north_york_data.head()

(24, 5)


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


### Create a map of North York
To create an instance of the geocoder, we need to define a user_agent. We will name our agent north_york_explorer, as shown below.

In [393]:
address = 'North York, Toronto'

geolocator = Nominatim(user_agent="north_york_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7543263, -79.44911696639593.
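
geopy's `geocode` returns `None` when it finds no match, in which case `location.latitude` raises an `AttributeError`. A hedged sketch of a guard follows. The helper name `resolve_coordinates` and the stand-in geocoders are illustrative assumptions, not part of geopy.

```python
from types import SimpleNamespace


def resolve_coordinates(geocode, address, fallback):
    """Return (lat, lng) for address, or fallback if nothing is found.

    geocode is any callable that behaves like Nominatim(...).geocode:
    it returns an object with .latitude/.longitude, or None.
    """
    location = geocode(address)
    if location is None:
        return fallback
    return (location.latitude, location.longitude)


# Stand-in geocoders so the sketch runs without network access.
def fake_hit(address):
    return SimpleNamespace(latitude=43.7543263, longitude=-79.4491170)


def fake_miss(address):
    return None


found = resolve_coordinates(fake_hit, "North York, Toronto", (0.0, 0.0))
missing = resolve_coordinates(fake_miss, "North York, Toronto", (0.0, 0.0))
print(found, missing)
```

In the notebook, the real callable would be `Nominatim(user_agent="north_york_explorer").geocode`.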


In [394]:
# create map of North York using latitude and longitude values
map_north_york = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(north_york_data['Latitude'], north_york_data['Longitude'], north_york_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_north_york)  

In [395]:
map_north_york

## Explore and cluster the neighborhoods in Toronto - North York

### Use Foursquare API to explore the neighborhoods and segment them

Define Foursquare Credentials and Version

In [396]:
CLIENT_ID = 'PPJZU5DJCESX03USVZX5AGZ0BLFHHBVVW3Z40MMFQV0TEPLD' # your Foursquare ID
CLIENT_SECRET = 'Z1IB55F2HM112MI42S2PH5MQUQ01SKRKB3DIOIB3IOV2DJOH' # your Foursquare Secret
ACCESS_TOKEN = 'OTCBV5CCPX3UXFC3MHISSPXSQTQIM5TLWSJCX0W5GTALW3EF' # your FourSquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PPJZU5DJCESX03USVZX5AGZ0BLFHHBVVW3Z40MMFQV0TEPLD
CLIENT_SECRET:Z1IB55F2HM112MI42S2PH5MQUQ01SKRKB3DIOIB3IOV2DJOH


Let's explore the first neighborhood in our dataframe

In [397]:
neighborhood_latitude = north_york_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = north_york_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = north_york_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


Get the top 100 venues within a radius of 500 meters of this neighborhood.
- define the URL
- send the GET request
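
The two steps above can also be sketched with the standard library, building the query string from a dictionary instead of a long format string. Reading the credentials from environment variables keeps them out of the notebook. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are assumptions made for illustration.

```python
import os
from urllib.parse import urlencode


def build_explore_url(lat, lng, radius=500, limit=100, version="20180605"):
    """Build a Foursquare v2 explore URL from named parameters."""
    params = {
        # Environment variable names are hypothetical; placeholders are
        # used when they are not set.
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID"),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET"),
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" readable in the URL.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


example_url = build_explore_url(43.7532586, -79.3296565)
print(example_url)
```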

In [398]:
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

print(url)

https://api.foursquare.com/v2/venues/explore?&client_id=PPJZU5DJCESX03USVZX5AGZ0BLFHHBVVW3Z40MMFQV0TEPLD&client_secret=Z1IB55F2HM112MI42S2PH5MQUQ01SKRKB3DIOIB3IOV2DJOH&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100


In [399]:
results = requests.get(url).json()
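
The venue list sits at `response -> groups[0] -> items` in the returned JSON. If a request fails or a location has no venues, some of those keys may be missing, so a defensive accessor can help. This is a sketch; `extract_items` is a hypothetical helper name and the payloads below are toy examples.

```python
def extract_items(payload):
    """Return the list of venue items, or [] if the structure is incomplete."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])


# Toy payloads shaped like the explore response.
ok_payload = {"response": {"groups": [{"items": [{"venue": {"name": "Brookbanks Park"}}]}]}}
empty_payload = {"response": {}}

print(len(extract_items(ok_payload)))
print(extract_items(empty_payload))
```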

From the Foursquare lab, we know that all the information is in the items key. Before we proceed, let's borrow the get_category_type function from the Foursquare lab.

In [400]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json and structure it into a pandas dataframe

In [401]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Food & Drink Shop,43.751974,-79.333114
2,Corrosion Service Company Limited,Construction & Landscaping,43.752432,-79.334661


In [402]:
nearby_venues.shape

(3, 4)
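
To make the flattening step concrete, here is a small sketch on two toy items. The first mirrors a row from the output above. The second, with an empty category list, is made up to show the `None` case.

```python
import pandas as pd

items = [
    {"venue": {"name": "Brookbanks Park",
               "categories": [{"name": "Park"}],
               "location": {"lat": 43.751976, "lng": -79.33214}}},
    # Hypothetical venue without categories.
    {"venue": {"name": "Unnamed Venue",
               "categories": [],
               "location": {"lat": 43.75, "lng": -79.33}}},
]

# Nested dictionaries become dotted column names; lists stay as lists.
flat = pd.json_normalize(items)
wanted = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
flat = flat.loc[:, wanted].copy()

# Keep only the first category name, or None when the list is empty.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Strip the dotted prefixes: "venue.location.lat" -> "lat".
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```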

### Explore Neighborhoods in North York

Let's create a function that repeats the same process for all the neighborhoods in North York. But first, let's look at <b>north_york_data</b> once again.

In [403]:
north_york_data.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073
5,M3C,North York,Don Mills,43.7259,-79.340923
6,M2H,North York,Hillcrest Village,43.803762,-79.363452
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
9,M3J,North York,"Northwood Park, York University",43.76798,-79.487262


There are duplicates in the <b>Neighbourhood</b> column. For example, <b>Don Mills</b> has more than one postal code, each with a different latitude and longitude. I modified the <b>getNearbyVenues()</b> function to tell apart neighbourhoods that share a name but differ in <b>postal code, latitude and longitude</b> by appending the postal code to the neighbourhood value.
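
A small sketch of why the postal code is appended: on three rows taken from the table above, the plain names collapse to two distinct values, while the combined labels stay distinct.

```python
import pandas as pd

hoods = pd.DataFrame({
    "Postal Code": ["M3B", "M3C", "M3A"],
    "Neighbourhood": ["Don Mills", "Don Mills", "Parkwoods"],
})

# Plain names: the two Don Mills rows would be grouped together.
plain_unique = hoods["Neighbourhood"].nunique()

# Name plus postal code: one distinct label per row.
combined = hoods["Neighbourhood"] + ", " + hoods["Postal Code"]
combined_unique = combined.nunique()

print(plain_unique, combined_unique)
print(combined.tolist())
```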

In [404]:
# explore all 24 North York neighbourhoods
def getNearbyVenues(names, latitudes, longitudes, postal_codes, radius=500):
        
    venues_list=[]
    for name, lat, lng, postal_code in zip(names, latitudes, longitudes, postal_codes):

        print(name + ',', postal_code)
                
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name + ', ' + postal_code, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [405]:
north_york_venues = getNearbyVenues(names=north_york_data['Neighbourhood'],
                                   latitudes=north_york_data['Latitude'],
                                   longitudes=north_york_data['Longitude'],
                                    postal_codes=north_york_data['Postal Code']
                                  )

Parkwoods, M3A
Victoria Village, M4A
Lawrence Manor, Lawrence Heights, M6A
Don Mills, M3B
Glencairn, M6B
Don Mills, M3C
Hillcrest Village, M2H
Bathurst Manor, Wilson Heights, Downsview North, M3H
Fairview, Henry Farm, Oriole, M2J
Northwood Park, York University, M3J
Bayview Village, M2K
Downsview, M3K
York Mills, Silver Hills, M2L
Downsview, M3L
North Park, Maple Leaf Park, Upwood Park, M6L
Humber Summit, M9L
Willowdale, Newtonbrook, M2M
Downsview, M3M
Bedford Park, Lawrence Manor East, M5M
Humberlea, Emery, M9M
Willowdale, Willowdale East, M2N
Downsview, M3N
York Mills West, M2P
Willowdale, Willowdale West, M2R


There are 24 North York neighbourhoods, but only 23 unique values in <b>north_york_venues</b>. So one neighbourhood did not return any venues.

In [406]:
len(north_york_venues['Neighbourhood'].unique())

23
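
To find out which neighbourhood came back empty, a set difference between the requested labels and the returned labels is enough. The sketch below uses toy data. `"Example Heights, M0X"` is a made-up label and does not identify the neighbourhood that is actually missing.

```python
import pandas as pd

# Labels that were requested (the last one is hypothetical).
requested = ["Parkwoods, M3A", "Victoria Village, M4A", "Example Heights, M0X"]

# Toy stand-in for the venues dataframe returned by the API loop.
venues = pd.DataFrame({
    "Neighbourhood": ["Parkwoods, M3A", "Parkwoods, M3A", "Victoria Village, M4A"],
    "Venue": ["Brookbanks Park", "Variety Store", "Portugril"],
})

# Labels requested but absent from the results.
without_venues = sorted(set(requested) - set(venues["Neighbourhood"]))
print(without_venues)
```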

In [407]:
print(north_york_venues.shape)
north_york_venues.head()

(246, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Parkwoods, M3A",43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,"Parkwoods, M3A",43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,"Parkwoods, M3A",43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
3,"Victoria Village, M4A",43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,"Victoria Village, M4A",43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


Let's check how many venues were returned for each neighborhood

In [408]:
north_york_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Bathurst Manor, Wilson Heights, Downsview North, M3H",20,20,20,20,20,20
"Bayview Village, M2K",4,4,4,4,4,4
"Bedford Park, Lawrence Manor East, M5M",24,24,24,24,24,24
"Don Mills, M3B",4,4,4,4,4,4
"Don Mills, M3C",20,20,20,20,20,20
"Downsview, M3K",3,3,3,3,3,3
"Downsview, M3L",4,4,4,4,4,4
"Downsview, M3M",3,3,3,3,3,3
"Downsview, M3N",5,5,5,5,5,5
"Fairview, Henry Farm, Oriole, M2J",71,71,71,71,71,71


Let's find out how many unique categories can be curated from all the returned venues

In [409]:
print('There are {} unique categories.'.format(len(north_york_venues['Venue Category'].unique())))

There are 104 unique categories.


### Analyze Each Neighborhood

In [410]:
# one hot encoding
north_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
north_york_onehot['Neighbourhood'] = north_york_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [north_york_onehot.columns[-1]] + list(north_york_onehot.columns[:-1])
north_york_onehot = north_york_onehot[fixed_columns]

north_york_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Steakhouse,Supermarket,Supplement Shop,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Parkwoods, M3A",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Parkwoods, M3A",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Parkwoods, M3A",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Victoria Village, M4A",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Victoria Village, M4A",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [411]:
north_york_onehot.shape

(246, 105)

Next, let's group the rows by neighborhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category.

In [412]:
north_york_grouped = north_york_onehot.groupby('Neighbourhood').mean().reset_index()

In [413]:
print(north_york_grouped.shape)
north_york_grouped.head()

(23, 105)


Unnamed: 0,Neighbourhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Steakhouse,Supermarket,Supplement Shop,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview Nort...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,...,0.0,0.05,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0
1,"Bayview Village, M2K",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East, M5M",0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.041667,0.041667,0.0,0.0,0.0,0.0,0.0
3,"Don Mills, M3B",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Don Mills, M3C",0.0,0.0,0.0,0.05,0.0,0.05,0.0,0.0,0.0,...,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
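
The mean of a one-hot column within a group is the share of that group's venues in the category. A minimal sketch on toy data, where the neighbourhood labels `A` and `B` are placeholders:

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],  # placeholder labels
    "Venue Category": ["Park", "Park", "Café", "Bank", "Gym", "Park"],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean per neighbourhood = fraction of its venues in each category.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

For `A`, two of four venues are parks, so the `Park` frequency is 0.5, and each row sums to 1.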


Let's print each neighborhood along with the top 5 most common venues

In [414]:
num_top_venues = 5

for hood in north_york_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = north_york_grouped[north_york_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Wilson Heights, Downsview North, M3H----
                       venue  freq
0                Coffee Shop  0.10
1                       Bank  0.10
2                       Park  0.05
3             Ice Cream Shop  0.05
4  Middle Eastern Restaurant  0.05


----Bayview Village, M2K----
                 venue  freq
0   Chinese Restaurant  0.25
1  Japanese Restaurant  0.25
2                 Café  0.25
3                 Bank  0.25
4   Miscellaneous Shop  0.00


----Bedford Park, Lawrence Manor East, M5M----
                venue  freq
0  Italian Restaurant  0.12
1      Sandwich Place  0.08
2         Coffee Shop  0.08
3         Pizza Place  0.08
4    Greek Restaurant  0.04


----Don Mills, M3B----
                  venue  freq
0   Japanese Restaurant  0.25
1                   Gym  0.25
2  Caribbean Restaurant  0.25
3                  Café  0.25
4     Accessories Store  0.00


----Don Mills, M3C----
                venue  freq
0                 Gym  0.10
1          Beer Store

#### Let's put that into a pandas dataframe

First, let's write a function to sort the venues in descending order.

In [415]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [416]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = north_york_grouped['Neighbourhood']

for ind in np.arange(north_york_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(north_york_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview Nort...",Bank,Coffee Shop,Gas Station,Pharmacy,Bridal Shop,Park,Deli / Bodega,Restaurant,Middle Eastern Restaurant,Sandwich Place
1,"Bayview Village, M2K",Café,Bank,Japanese Restaurant,Chinese Restaurant,Women's Store,Dessert Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping
2,"Bedford Park, Lawrence Manor East, M5M",Italian Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Pub,Restaurant,Liquor Store,Café,Butcher,Indian Restaurant
3,"Don Mills, M3B",Caribbean Restaurant,Café,Gym,Japanese Restaurant,Women's Store,Dim Sum Restaurant,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
4,"Don Mills, M3C",Gym,Coffee Shop,Beer Store,Restaurant,Chinese Restaurant,Dim Sum Restaurant,Bike Shop,Discount Store,Sandwich Place,Sporting Goods Shop
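
The column names rely on an index error to fall back to the `th` suffix, which works for 1 to 10. As an alternative sketch, a small helper can compute the suffix directly and also handles values such as 11 or 21. `ordinal` is a hypothetical name, not a library function.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


column_names = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(column_names)
```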


### Cluster Neighborhoods

Run k-means to cluster the neighborhood into 5 clusters.

In [417]:
kclusters = 5

north_york_grouped_clustering = north_york_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(north_york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
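
The first ten labels are all 0, which hints that one cluster may hold most of the neighbourhoods. Counting the members of each cluster is a quick check. The label array below is hypothetical. In the notebook, `kmeans.labels_` would take its place.

```python
import numpy as np

kclusters = 5

# Hypothetical labels standing in for kmeans.labels_.
labels = np.array([0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 3, 4])

# Number of neighbourhoods per cluster, including empty clusters.
sizes = np.bincount(labels, minlength=kclusters)

# Share of all neighbourhoods that falls in each cluster.
share = sizes / sizes.sum()

print(sizes)
print(share.round(2))
```

A single dominant cluster with several one-member clusters can be a sign that a different number of clusters deserves a try.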

In [418]:
neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview Nort...",Bank,Coffee Shop,Gas Station,Pharmacy,Bridal Shop,Park,Deli / Bodega,Restaurant,Middle Eastern Restaurant,Sandwich Place
1,"Bayview Village, M2K",Café,Bank,Japanese Restaurant,Chinese Restaurant,Women's Store,Dessert Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping
2,"Bedford Park, Lawrence Manor East, M5M",Italian Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Pub,Restaurant,Liquor Store,Café,Butcher,Indian Restaurant
3,"Don Mills, M3B",Caribbean Restaurant,Café,Gym,Japanese Restaurant,Women's Store,Dim Sum Restaurant,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
4,"Don Mills, M3C",Gym,Coffee Shop,Beer Store,Restaurant,Chinese Restaurant,Dim Sum Restaurant,Bike Shop,Discount Store,Sandwich Place,Sporting Goods Shop


In [419]:
north_york_merged = north_york_data.copy()  # copy so that north_york_data keeps its original names

Because I modified <b>getNearbyVenues()</b> above to append the postal code to the neighbourhood value, I have to append the postal code to the North York neighbourhood names as well before merging the data.

In [420]:
north_york_merged['Neighbourhood'] = north_york_merged['Neighbourhood'] + ', ' +  north_york_merged['Postal Code']

In [421]:
north_york_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,"Parkwoods, M3A",43.753259,-79.329656
1,M4A,North York,"Victoria Village, M4A",43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights, M6A",43.718518,-79.464763
3,M3B,North York,"Don Mills, M3B",43.745906,-79.352188
4,M6B,North York,"Glencairn, M6B",43.709577,-79.445073
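
Plain assignment such as `b = a` binds a second name to the same DataFrame rather than copying it, so an in-place column edit shows up under both names. A short sketch of the difference on a one-row toy frame:

```python
import pandas as pd

original = pd.DataFrame({"Neighbourhood": ["Parkwoods"], "Postal Code": ["M3A"]})

# An explicit copy can be edited without touching the original.
independent = original.copy()
independent["Neighbourhood"] = independent["Neighbourhood"] + ", " + independent["Postal Code"]
after_copy_edit = original.loc[0, "Neighbourhood"]

# An alias is the same object, so the edit reaches the original too.
alias = original
alias["Neighbourhood"] = alias["Neighbourhood"] + ", " + alias["Postal Code"]
after_alias_edit = original.loc[0, "Neighbourhood"]

print(after_copy_edit)
print(after_alias_edit)
```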


Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [422]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join neighbourhoods_venues_sorted onto north_york_merged to add latitude/longitude for each neighborhood
north_york_merged = north_york_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood', how='inner')

north_york_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,"Parkwoods, M3A",43.753259,-79.329656,1,Park,Food & Drink Shop,Construction & Landscaping,Women's Store,Dessert Shop,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Convenience Store
1,M4A,North York,"Victoria Village, M4A",43.725882,-79.315572,0,French Restaurant,Coffee Shop,Hockey Arena,Portuguese Restaurant,Women's Store,Dessert Shop,Chocolate Shop,Clothing Store,Comfort Food Restaurant,Construction & Landscaping
2,M6A,North York,"Lawrence Manor, Lawrence Heights, M6A",43.718518,-79.464763,0,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Event Space,Coffee Shop,Carpet Store,Miscellaneous Shop,Vietnamese Restaurant,American Restaurant
3,M3B,North York,"Don Mills, M3B",43.745906,-79.352188,0,Caribbean Restaurant,Café,Gym,Japanese Restaurant,Women's Store,Dim Sum Restaurant,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
4,M6B,North York,"Glencairn, M6B",43.709577,-79.445073,0,Park,Bakery,Japanese Restaurant,Pub,Women's Store,Department Store,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant


Finally, let's visualize the resulting clusters

In [423]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(north_york_merged['Latitude'], north_york_merged['Longitude'], north_york_merged['Neighbourhood'], north_york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)

In [424]:
map_clusters

### Examine Clusters
Let's look at the dataframe for each cluster.

#### Cluster 1

In [425]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 0, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0,French Restaurant,Coffee Shop,Hockey Arena,Portuguese Restaurant,Women's Store,Dessert Shop,Chocolate Shop,Clothing Store,Comfort Food Restaurant,Construction & Landscaping
2,North York,0,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Event Space,Coffee Shop,Carpet Store,Miscellaneous Shop,Vietnamese Restaurant,American Restaurant
3,North York,0,Caribbean Restaurant,Café,Gym,Japanese Restaurant,Women's Store,Dim Sum Restaurant,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
4,North York,0,Park,Bakery,Japanese Restaurant,Pub,Women's Store,Department Store,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant
5,North York,0,Gym,Coffee Shop,Beer Store,Restaurant,Chinese Restaurant,Dim Sum Restaurant,Bike Shop,Discount Store,Sandwich Place,Sporting Goods Shop
6,North York,0,Mediterranean Restaurant,Golf Course,Fast Food Restaurant,Athletics & Sports,Dog Run,Pool,Department Store,Chocolate Shop,Clothing Store,Coffee Shop
7,North York,0,Bank,Coffee Shop,Gas Station,Pharmacy,Bridal Shop,Park,Deli / Bodega,Restaurant,Middle Eastern Restaurant,Sandwich Place
8,North York,0,Clothing Store,Coffee Shop,Fast Food Restaurant,Japanese Restaurant,Restaurant,Shoe Store,Food Court,Juice Bar,Jewelry Store,Bank
9,North York,0,Furniture / Home Store,Caribbean Restaurant,Massage Studio,Coffee Shop,Falafel Restaurant,Bar,Metro Station,Women's Store,Department Store,Clothing Store
10,North York,0,Café,Bank,Japanese Restaurant,Chinese Restaurant,Women's Store,Dessert Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping


#### Cluster 2

In [426]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 1, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1,Park,Food & Drink Shop,Construction & Landscaping,Women's Store,Dessert Shop,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Convenience Store
14,North York,1,Park,Construction & Landscaping,Bakery,Basketball Court,Women's Store,Dim Sum Restaurant,Clothing Store,Coffee Shop,Comfort Food Restaurant,Convenience Store
22,North York,1,Park,Construction & Landscaping,Convenience Store,Women's Store,Carpet Store,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Cosmetics Shop


#### Cluster 3

In [427]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 2, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,North York,2,Baseball Field,Women's Store,Diner,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega


#### Cluster 4

In [428]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 3, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,North York,3,Pizza Place,Home Service,Women's Store,Dim Sum Restaurant,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store


#### Cluster 5

In [429]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 4, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,North York,4,Park,Women's Store,Carpet Store,Chocolate Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
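
The per-cluster tables can be condensed into one summary. The sketch below uses only the cluster label and the 1st most common venue of the rows displayed above, so the counts describe those displayed rows and nothing more. The column names `n_rows` and `top_venue` are chosen for illustration.

```python
import pandas as pd

# (cluster label, 1st most common venue) for the rows shown above.
rows = [
    (0, "French Restaurant"), (0, "Clothing Store"), (0, "Caribbean Restaurant"),
    (0, "Park"), (0, "Gym"), (0, "Mediterranean Restaurant"), (0, "Bank"),
    (0, "Clothing Store"), (0, "Furniture / Home Store"), (0, "Café"),
    (1, "Park"), (1, "Park"), (1, "Park"),
    (2, "Baseball Field"), (3, "Pizza Place"), (4, "Park"),
]
shown = pd.DataFrame(rows, columns=["Cluster Labels", "1st Most Common Venue"])

# Per cluster: number of rows and the most frequent top venue.
summary = shown.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    n_rows="count",
    top_venue=lambda s: s.mode().iloc[0],
)
print(summary)
```

In these rows, the cluster labelled 1 is made up of park-led neighbourhoods, while the cluster labelled 0 mixes shops and restaurants.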
