<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto, Canada</font></h1>

# Part 1: Create a dataframe

In [1]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import requests

import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Libraries imported')

Libraries imported


In [2]:
#installations
!conda install -c anaconda lxml --yes
!conda install -c anaconda html5lib --yes

!conda install -c anaconda BeautifulSoup4 --yes
!conda install -c anaconda xlrd --yes
#!conda update -n base -c defaults conda --yes

!conda install -c conda-forge geopy --yes 

print('Libraries installed.')


In [3]:
import keras
import lxml
import html5lib

from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as bs
#bs('test','lxml') # Testing the package


Using TensorFlow backend.


Reading data from the web page

In [4]:
# Reading data from the web page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url,flavor="bs4")

The postal codes table is the first dataframe in the list because it is the first table on the page.

In [5]:
# postal codes is the first dataframe in the list 
df = dfs[0]

REQUIREMENT: The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood  
Solution: rename the `Postcode` column to `PostalCode`.

In [6]:
# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
df.rename(columns={'Postcode':'PostalCode'}, inplace=True)

REQUIREMENT: Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
Solution: replace every `"Not assigned"` value with `NaN`, the pandas missing-value marker, and use the built-in `dropna` to drop the corresponding rows.

In [7]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
df.replace("Not assigned", np.nan, inplace = True)
df.dropna(subset=["Borough"], axis=0, inplace=True)

# reset the index, because we dropped some rows
df.reset_index(drop=True, inplace=True)
print("data shape after unassigned codes were dropped: ", df.shape)


data shape after unassigned codes were dropped:  (210, 3)
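
The cleaning step above can be sketched on a small hand-made frame. The rows below are toy values chosen for illustration, not the Wikipedia table itself, and the names `toy` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; the real table comes from Wikipedia.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Turn the placeholder string into a real missing value, then drop only the
# rows whose Borough is missing. Rows that merely lack a Neighbourhood stay.
cleaned = (
    toy.replace("Not assigned", np.nan)
       .dropna(subset=["Borough"])
       .reset_index(drop=True)
)
print(cleaned)
```

Restricting `dropna` to `subset=["Borough"]` matters here: without it, the row that has a borough but no neighborhood would be dropped too, and the next requirement could not be met.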


REQUIREMENT: If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.  
Since we already replaced all **Not assigned** values with `NaN`, we can fill each missing neighborhood directly from the Borough column.

In [8]:
# fill each missing 'Neighbourhood' value with the corresponding 'Borough'
df["Neighbourhood"] = df["Neighbourhood"].fillna(df["Borough"])
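
As a rough illustration of the borough fallback, here is a toy frame (hypothetical values) where one neighborhood is missing. It uses `Series.fillna` with another column as the fill source, which pandas aligns row by row on the index.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", np.nan],
})

# fillna aligns on the index, so each missing neighbourhood takes the
# borough from its own row; existing values are left untouched.
toy["Neighbourhood"] = toy["Neighbourhood"].fillna(toy["Borough"])
print(toy)
```

Assigning the result back to the column avoids relying on `inplace=True` on a column selection, which does not reliably update the parent dataframe in recent pandas versions.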

REQUIREMENT: More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.  

Solution: 
 * We will group data by all other columns, since there are only 2 we will simply list them. 
 * We apply the "join" operation on the 'Neighbourhood' column using ', ' as separator. 
 * We will then reset the index to account for the dropped rows


In [9]:
df = df.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
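
The grouping steps listed above can be sketched on three toy rows. The M5A pair follows the example named in the requirement; the borough label and the frame names `toy` and `combined` are assumptions made for the illustration.

```python
import pandas as pd

# Toy rows: M5A appears twice, as in the requirement's example.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Woburn"],
})

# Group by the two key columns and join the neighbourhood names of each
# group into one comma-separated string.
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

Note that `groupby` sorts the group keys by default, so the result is ordered by postal code rather than by first appearance.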


REQUIREMENT: In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [10]:
print("The number of rows in the dataframe: {}.".format(df.shape[0])) 

The number of rows in the dataframe: 103.


REQUIREMENT: Use the Geocoder package or the csv file to add latitude and longitude coordinates of a given postal code:

Read CSV file

In [11]:
!wget -q -O 'Geospatial_Coordinates.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [12]:
df_gsc = pd.read_csv('Geospatial_Coordinates.csv')
df_gsc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Set the postal codes as the index of both dataframes, then concatenate them column-wise with an inner join.
Once merged, we need to reset the index and restore the `PostalCode` column name, since it is lost in the merge.

In [13]:
df_hood_gsc = pd.concat([df.set_index('PostalCode'),df_gsc.set_index('Postal Code')], axis=1, join='inner').reset_index() 
df_hood_gsc.rename(columns={'index':'PostalCode'}, inplace=True)

print("The number of rows in the dataframe: {}.".format(df_hood_gsc.shape[0])) 
df_hood_gsc.head()

The number of rows in the dataframe: 103.


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
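
An alternative way to express the same join is `DataFrame.merge`, which keeps the key as a regular column and so needs no index reset or rename afterwards. This is a sketch on toy frames, not the notebook's method; the `M9Z` row is a made-up code added to show that unmatched rows are discarded by an inner join.

```python
import pandas as pd

hoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B", "M9Z"],  # M9Z is hypothetical, no match
    "Latitude": [43.784535, 43.806686, 0.0],
    "Longitude": [-79.160497, -79.194353, 0.0],
})

# Inner merge on the differently named key columns, then drop the
# duplicate key that came from the right-hand frame.
merged = (
    hoods.merge(coords, left_on="PostalCode", right_on="Postal Code", how="inner")
         .drop(columns="Postal Code")
)
print(merged)
```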


### Use geopy library to get the latitude and longitude values of Toronto.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

In [14]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
address = 'Toronto, On'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [50]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_hood_gsc['Latitude'], df_hood_gsc['Longitude'], df_hood_gsc['Borough'], df_hood_gsc['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's simplify the above map and segment and cluster only the neighborhoods in North York.   
So let's slice the original dataframe and create a new dataframe of the North York data.

In [51]:
northyork_data = df_hood_gsc[df_hood_gsc['Borough'] == 'North York'].reset_index(drop=True)
northyork_data.rename(columns={'Neighbourhood':'Neighborhood'}, inplace=True)
northyork_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493


Let's get the geographical coordinates of North York.

In [20]:
address = 'North York, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York, ON are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York, ON are 43.7543263, -79.44911696639593.


As we did with all of Toronto, let's visualize North York and the neighborhoods in it.

In [21]:
# create map of North York using latitude and longitude values
map_northyork = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(northyork_data['Latitude'], northyork_data['Longitude'], northyork_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_northyork)  
    
map_northyork

Define Foursquare Credentials and Version

In [22]:
import os

# Read the credentials from environment variables rather than hard-coding them.
# The variable names below are an assumption; use whatever your setup defines.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Foursquare credentials loaded.')

Foursquare credentials loaded.


Let's create a function that repeats the process of finding the top venues for every neighborhood in North York.

In [25]:
LIMIT = 100
radius = 500
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
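
The flattening step inside the function can be tried without calling the API. The sketch below uses a hand-written sample shaped like `response['groups'][0]['items']` as the function above reads it; the sample values and the helper name `flatten_items` are hypothetical. It also guards against a venue with an empty category list, which would raise an `IndexError` in the list comprehension above.

```python
def flatten_items(name, lat, lng, items):
    """Turn Foursquare-style 'items' into flat tuples for one neighborhood."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written sample for illustration; not real API output.
sample_items = [
    {"venue": {"name": "Sample Cafe",
               "location": {"lat": 43.80, "lng": -79.36},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Spot",
               "location": {"lat": 43.81, "lng": -79.37},
               "categories": []}},
]

rows = flatten_items("Hillcrest Village", 43.803762, -79.363452, sample_items)
for row in rows:
    print(row)
```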

#### Run the above function on each neighborhood and create a new dataframe called *northyork_venues*.

In [26]:
northyork_venues = getNearbyVenues(names=northyork_data['Neighborhood'],
                                   latitudes=northyork_data['Latitude'],
                                   longitudes=northyork_data['Longitude']
                                  )


Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Bedford Park, Lawrence Manor East
Lawrence Heights, Lawrence Manor
Glencairn
Downsview, North Park, Upwood Park
Humber Summit
Emery, Humberlea


#### Let's check the size of the resulting dataframe

In [27]:
print(northyork_venues.shape)
northyork_venues.head()

(251, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Eagle's Nest Golf Club,43.805455,-79.364186,Golf Course
1,Hillcrest Village,43.803762,-79.363452,AY Jackson Pool,43.804515,-79.366138,Pool
2,Hillcrest Village,43.803762,-79.363452,Villa Madina,43.801685,-79.363938,Mediterranean Restaurant
3,Hillcrest Village,43.803762,-79.363452,Duncan Creek Park,43.805539,-79.360695,Dog Run
4,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,The LEGO Store,43.778207,-79.343483,Toy / Game Store


Let's check how many venues were returned for each neighborhood

In [28]:
northyork_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor, Downsview North, Wilson Heights",20,20,20,20,20,20
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
"CFB Toronto, Downsview East",3,3,3,3,3,3
Don Mills North,5,5,5,5,5,5
Downsview Central,4,4,4,4,4,4
Downsview Northwest,4,4,4,4,4,4
Downsview West,6,6,6,6,6,6
"Downsview, North Park, Upwood Park",4,4,4,4,4,4
"Emery, Humberlea",1,1,1,1,1,1


#### Let's find out how many unique categories can be curated from all the returned venues

In [29]:
print('There are {} unique categories.'.format(len(northyork_venues['Venue Category'].unique())))

There are 108 unique categories.


##  Analyze Each Neighborhood

In [33]:
# one hot encoding
northyork_onehot = pd.get_dummies(northyork_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
northyork_onehot['Neighborhood'] = northyork_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [northyork_onehot.columns[-1]] + list(northyork_onehot.columns[:-1])
northyork_onehot = northyork_onehot[fixed_columns]

northyork_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Bakery,Bank,Bar,Baseball Field,...,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store,Yoga Studio
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Fairview, Henry Farm, Oriole",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


And let's examine the new dataframe size.

In [34]:
northyork_onehot.shape

(251, 109)
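
The shape above follows from the encoding: one row per venue (251) and one column per category (108) plus the neighborhood column. A toy sketch with made-up neighborhoods `A` and `B` shows the same pattern on a smaller scale.

```python
import pandas as pd

# Toy venues for illustration only.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Park", "Pool", "Park"],
})

# One indicator column per distinct category, without any prefix.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# Put the neighborhood back as the first column.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```

Each row has exactly one active indicator, because every venue belongs to a single primary category here. Depending on the pandas version, the indicators may print as `True`/`False` rather than `1`/`0`.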

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [36]:
northyork_grouped = northyork_onehot.groupby('Neighborhood').mean().reset_index()
northyork_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Bakery,Bank,Bar,Baseball Field,...,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store,Yoga Studio
0,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CFB Toronto, Downsview East",0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downsview West,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Downsview, North Park, Upwood Park",0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Emery, Humberlea",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Let's confirm the new size

In [37]:
northyork_grouped.shape

(23, 109)
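
To see why the mean of one-hot rows is a category frequency, consider a toy neighborhood `A` with four venues and `B` with one (hypothetical data). Averaging the indicators gives the share of venues in each category.

```python
import pandas as pd

# Toy one-hot rows: A has one Park and three Pools, B has a single Park.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Park": [1, 0, 0, 0, 1],
    "Pool": [0, 1, 1, 1, 0],
})

# The mean of 0/1 indicators is the fraction of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

A neighborhood with a single venue, like `B` here, gets a frequency of 1.0 for that category, which is what the `Emery, Humberlea` row shows in the real output above.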

#### Let's put the list of neighborhoods with their top venues into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [39]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
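
A quick check of the function on a single toy row (hypothetical frequencies, chosen without ties so the order is unambiguous):

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and sort the rest.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Toy row for illustration only.
row = pd.Series({"Neighborhood": "A", "Bank": 0.3, "Park": 0.5, "Pool": 0.0, "Cafe": 0.2})
top = return_most_common_venues(row, 2)
print(list(top))
```

One caveat worth keeping in mind: categories with a frequency of 0 are still returned once the non-zero ones run out, so a neighborhood with only a few venues will list categories it does not actually contain in its lower ranks.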

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [69]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = northyork_grouped['Neighborhood']

for ind in np.arange(northyork_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(northyork_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Fried Chicken Joint,Shopping Mall,Middle Eastern Restaurant,Pet Store,Pharmacy,Pizza Place,Deli / Bodega,Ice Cream Shop,Restaurant
1,Bayview Village,Chinese Restaurant,Bank,Café,Japanese Restaurant,Yoga Studio,Electronics Store,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
2,"Bedford Park, Lawrence Manor East",Coffee Shop,Fast Food Restaurant,Italian Restaurant,Sandwich Place,Greek Restaurant,Indian Restaurant,Liquor Store,Juice Bar,Pharmacy,Pizza Place
3,"CFB Toronto, Downsview East",Park,Airport,Snack Place,Dog Run,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
4,Don Mills North,Japanese Restaurant,Gym / Fitness Center,Caribbean Restaurant,Baseball Field,Café,Electronics Store,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
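
The column names above rely on an `IndexError` to switch from `st`/`nd`/`rd` to `th`. An alternative sketch, using a hypothetical helper `ordinal`, computes the suffix directly and also handles values beyond 10 such as 11th or 21st:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens all take 'th'.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this produces the same column names as the loop above.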


##  Cluster Neighborhoods

Run k-means to cluster the neighborhoods into 5 clusters.

In [70]:
# set number of clusters
kclusters = 5

northyork_grouped_clustering = northyork_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(northyork_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 2, 0, 0, 0, 0, 0, 1], dtype=int32)
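
As a small illustration of what the labels mean, the sketch below clusters four toy frequency vectors (made-up values) into two groups. Rows with similar category profiles receive the same label; the label numbers themselves are arbitrary. Passing `n_init` explicitly is a choice made here to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency vectors: the first two rows are dominated by the first
# category, the last two by the second category.
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = kmeans.labels_
print(labels)
```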

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [71]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

northyork_merged = northyork_data

# join northyork_data with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood
northyork_merged = northyork_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')



In [85]:
northyork_merged.dropna(subset=["Cluster Labels"], axis=0, inplace=True)
northyork_merged[["Cluster Labels"]] = northyork_merged[["Cluster Labels"]].astype("int")
northyork_merged.head(10) # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,0,Dog Run,Mediterranean Restaurant,Pool,Golf Course,Frozen Yogurt Shop,Fried Chicken Joint,Concert Hall,Gas Station,Construction & Landscaping,Convenience Store
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,0,Clothing Store,Fast Food Restaurant,Coffee Shop,Japanese Restaurant,Women's Store,Toy / Game Store,Bus Station,Juice Bar,Bakery,Food Court
2,M2K,North York,Bayview Village,43.786947,-79.385975,0,Chinese Restaurant,Bank,Café,Japanese Restaurant,Yoga Studio,Electronics Store,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714,3,Martial Arts Dojo,Yoga Studio,Dog Run,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
5,M2N,North York,Willowdale South,43.77012,-79.408493,0,Sushi Restaurant,Ramen Restaurant,Sandwich Place,Coffee Shop,Café,Pizza Place,Restaurant,Hotel,Ice Cream Shop,Plaza
6,M2P,North York,York Mills West,43.752758,-79.400049,0,Park,Convenience Store,Bank,Bar,Yoga Studio,Electronics Store,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
7,M2R,North York,Willowdale West,43.782736,-79.442259,0,Pizza Place,Pharmacy,Discount Store,Coffee Shop,Grocery Store,Diner,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store
8,M3A,North York,Parkwoods,43.753259,-79.329656,0,Park,Pool,Food & Drink Shop,Yoga Studio,Discount Store,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega
9,M3B,North York,Don Mills North,43.745906,-79.352188,0,Japanese Restaurant,Gym / Fitness Center,Caribbean Restaurant,Baseball Field,Café,Electronics Store,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
10,M3C,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923,0,Coffee Shop,Clothing Store,Asian Restaurant,Gym,Beer Store,Chinese Restaurant,Sporting Goods Shop,Café,Dim Sum Restaurant,Japanese Restaurant
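
Row index 4 is absent from the table above, which is consistent with a neighborhood that returned no venues: it gets no match in the join, its cluster label is `NaN`, and `dropna` removes it. A toy sketch of that sequence (hypothetical frames `hoods` and `labels`):

```python
import pandas as pd

hoods = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [43.80, 43.78, 43.79],
})
# B returned no venues, so it never received a cluster label.
labels = pd.DataFrame({
    "Neighborhood": ["A", "C"],
    "Cluster Labels": [0, 1],
})

# Left join on the neighborhood name: B gets NaN, and the label column
# becomes float because NaN cannot be stored in a plain integer column.
merged = hoods.join(labels.set_index("Neighborhood"), on="Neighborhood")

# Drop unlabeled rows, then cast the labels back to integers.
merged = merged.dropna(subset=["Cluster Labels"])
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(merged)
```

The integer cast matters later, because the labels are used to index into the list of marker colors.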


Finally, let's visualize the resulting clusters

In [93]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
#rainbow = ['red','green','blue']

# add markers to the map

markers_colors = []
for lat, lon, poi, cluster in zip(northyork_merged['Latitude'], northyork_merged['Longitude'], northyork_merged['Neighborhood'], northyork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

### Cluster 1 - Business Areas

In [88]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 0, northyork_merged.columns[[1] + list(range(5, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,0,Dog Run,Mediterranean Restaurant,Pool,Golf Course,Frozen Yogurt Shop,Fried Chicken Joint,Concert Hall,Gas Station,Construction & Landscaping,Convenience Store
1,North York,0,Clothing Store,Fast Food Restaurant,Coffee Shop,Japanese Restaurant,Women's Store,Toy / Game Store,Bus Station,Juice Bar,Bakery,Food Court
2,North York,0,Chinese Restaurant,Bank,Café,Japanese Restaurant,Yoga Studio,Electronics Store,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
5,North York,0,Sushi Restaurant,Ramen Restaurant,Sandwich Place,Coffee Shop,Café,Pizza Place,Restaurant,Hotel,Ice Cream Shop,Plaza
6,North York,0,Park,Convenience Store,Bank,Bar,Yoga Studio,Electronics Store,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
7,North York,0,Pizza Place,Pharmacy,Discount Store,Coffee Shop,Grocery Store,Diner,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store
8,North York,0,Park,Pool,Food & Drink Shop,Yoga Studio,Discount Store,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega
9,North York,0,Japanese Restaurant,Gym / Fitness Center,Caribbean Restaurant,Baseball Field,Café,Electronics Store,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
10,North York,0,Coffee Shop,Clothing Store,Asian Restaurant,Gym,Beer Store,Chinese Restaurant,Sporting Goods Shop,Café,Dim Sum Restaurant,Japanese Restaurant
11,North York,0,Coffee Shop,Fried Chicken Joint,Shopping Mall,Middle Eastern Restaurant,Pet Store,Pharmacy,Pizza Place,Deli / Bodega,Ice Cream Shop,Restaurant


### Cluster 2 - Residential community with sports focus

In [89]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 1, northyork_merged.columns[[1] + list(range(5, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,North York,1,Baseball Field,Yoga Studio,Empanada Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant


### Cluster 3 - The Airport area

In [90]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 2, northyork_merged.columns[[1] + list(range(5, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,North York,2,Park,Airport,Snack Place,Dog Run,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store


### Cluster 4 - Sports and entertainment

In [91]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 3, northyork_merged.columns[[1] + list(range(5, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,North York,3,Martial Arts Dojo,Yoga Studio,Dog Run,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop


### Cluster 5 - Residential community

In [92]:
northyork_merged.loc[northyork_merged['Cluster Labels'] == 4, northyork_merged.columns[[1] + list(range(5, northyork_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,North York,4,Empanada Restaurant,Electronics Store,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant
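
Before reading too much into the cluster names, it can help to count how many neighborhoods fall into each cluster. The sketch below uses only the first ten labels printed by `kmeans.labels_[0:10]` earlier, so it is a partial view rather than the full result.

```python
import pandas as pd

# First ten labels as printed earlier; the full array has more entries.
labels = pd.Series([0, 0, 0, 2, 0, 0, 0, 0, 0, 1], name="Cluster Labels")

# Number of neighborhoods per cluster, ordered by cluster label.
sizes = labels.value_counts().sort_index()
print(sizes)
```

In the tables above, clusters 2 to 5 each contain a single neighborhood, so their descriptions rest on one row of data each.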
