# Clustering of Geospatial Data Based on Number of Businesses in Close Proximity and Visualization on OpenStreetMap (A Test Case to Cluster Neighborhoods in Toronto Based On Venue Categories)

### Aim of Project

This project applies k-means, an unsupervised clustering algorithm from Python's scikit-learn library, to a real-world problem: segregating neighborhoods in Toronto, Canada into clusters that share similarly high numbers of particular venue categories. With insight from this project, companies that wish to market products and services to targeted businesses in Toronto can streamline their campaigns and make them more efficient. Real estate agencies can also gain insight into the unique selling points of different neighborhoods with regard to their proximity to venues of different categories. This could influence the pricing of properties and ultimately boost profits. Businesses in the travel space can provide customers with high-level neighborhood information, which may help reduce churn.

After running the k-means clustering algorithm for five clusters, the following clusters were identified:
- Cluster 0: Neighborhoods with high number of dining venues. Color on map = Red
- Cluster 1: Neighborhoods with high number of photography studios. Color on map = Purple
- Cluster 2: Neighborhoods with high number of home service venues. Color on map = Cyan
- Cluster 3: Neighborhoods with high number of residential buildings close to parks. Color on map = Light Green
- Cluster 4: Neighborhoods with high number of outdoor playground and recreational venues. Color on map = Light Brown




### Motivation

Why Toronto, Canada?
Initially, I set out to carry out this project in my home country of Ghana. At one end of the spectrum, Ghana does not have enough data available on the internet to draw meaningful conclusions from. At the other end, developed nations like the U.S.A., England, France, and Germany have an overabundance of data available in many different formats on the internet. I needed to find a happy medium.

I have come to learn that a large part of a data scientist's job is data wrangling. Toronto, as a test case, has just enough data available on the web to let me apply Python's Beautiful Soup and pgeocode libraries, which I will need in the future. Additionally, Foursquare has curated data on businesses in Toronto, so I was able to practice making calls to Foursquare's API and parsing the resulting JSON response to extract the data I needed. Toronto afforded me the opportunity to get real feedback on the skills and concepts I learnt in courses, and pushed me to learn more and better ways to collect and clean data.

### Source of Data

The postal code, neighborhood and borough data used for this project were scraped from the web, cleaned, imported into pandas and cleaned again. However, coordinates for the neighborhoods were not readily available on any webpage, so I used pgeocode (a Python library for postal geocoding and distance calculations) to generate them. Data on venues within a 500 m radius of each neighborhood in Toronto was collected by making calls to Foursquare's API.

### Project Details (The code for this project can be found in Jupyter notebooks in my GitHub and Kaggle Repositories)

### Import Libraries

In [33]:
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

from bs4 import BeautifulSoup
import requests 

from sklearn.cluster import KMeans

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # import geocoder

import folium 


print('Libraries imported.')

Libraries imported.


### Scrape Data on Canada's Postal Codes, Boroughs and Neighborhoods from the Internet

We utilize Python's Beautiful Soup Library.

In [34]:
# Extract data on Canada's postal codes, boroughs and neighborhoods
wikipedia = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
wiki_page = wikipedia.content
canadian_soup = BeautifulSoup(wiki_page, "html.parser")
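The row-by-row extraction that follows can be tried offline first. The sketch below runs the same kind of `<tr>`/`<td>` walk on a small inline table. The HTML snippet and the helper name `parse_postal_table` are made up for illustration and are not the Wikipedia page's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the scraped page (not the real markup)
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A </td><td>Not assigned </td><td>Not assigned </td></tr>
  <tr><td>M3A </td><td>North York </td><td>Parkwoods </td></tr>
</table>
"""

def parse_postal_table(html):
    """Return one dict per data row of the first table found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr"):
        # get_text(strip=True) removes surrounding whitespace and newlines
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Header rows use <th>, so they produce no <td> cells and are skipped
        if len(cells) == 3:
            rows.append(dict(zip(("Postalcode", "Borough", "Neighborhood"), cells)))
    return rows

rows = parse_postal_table(SAMPLE_HTML)
print(rows)
```

Skipping rows without three `<td>` cells avoids the placeholder row of zeros that otherwise has to be filtered out during cleaning.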

### Import Scraped Data into a Dataframe and Display Raw Dataframe

In [35]:
# Predetermined column labels and an empty list that will collect one dict per table row
column_names = ['Postalcode','Borough','Neighborhood']
rows = []

# Parse through HTML for postal data in Table
content = canadian_soup.find('div', class_='mw-parser-output')
table = content.table.tbody
postcode = 0
borough = 0
neighborhood = 0

for tr in table.find_all('tr'):
    i = 0
    for td in tr.find_all('td'):
        if i == 0:
            postcode = td.text
            i = i + 1
        elif i == 1:
            borough = td.text
            i = i + 1
        elif i == 2: 
            neighborhood = td.text.strip('\n').replace(']','')
    rows.append({'Postalcode': postcode,'Borough': borough,'Neighborhood': neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
df_raw = pd.DataFrame(rows, columns = column_names)

df_raw.head()


Unnamed: 0,Postalcode,Borough,Neighborhood
0,0,0,0
1,M1A\n,Not assigned\n,Not assigned
2,M2A\n,Not assigned\n,Not assigned
3,M3A\n,North York\n,Parkwoods
4,M4A\n,North York\n,Victoria Village


### Clean Dataframe and Check the Shape of the Dataframe

In [36]:
# Remove '\n' from all entries
df_raw = df_raw.replace('\n', '', regex=True)

# Remove rows with 0 values in all columns
df_raw = df_raw[df_raw['Borough'] != 0]

# Remove rows with unassigned Borough names; .copy() gives an independent dataframe
# so that columns can be added later without a SettingWithCopyWarning
df = df_raw[~df_raw['Borough'].str.contains("Not assigned")].copy()

# Drop the current unordered index and replace it with one of increasing integers
df.reset_index(drop = True, inplace = True)

# Check for null/nan values
df.isnull().sum()

Postalcode      0
Borough         0
Neighborhood    0
dtype: int64

In [37]:
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [38]:
df.shape

(103, 3)

### Import Python's Library for Postal Code Geolocating

Given only the postal codes, pgeocode returns an approximate latitude and longitude for each postal code in the dataframe.

In [39]:
import pgeocode

nomi = pgeocode.Nominatim('ca')
postal_code = df["Postalcode"].values.tolist()
location = nomi.query_postal_code(postal_code)

Add the latitude and longitude columns to the original dataframe to make it georeferenced. Now that this is done, we are ready to filter our data for only neighborhoods in Toronto and visualize our data on a map.

In [40]:
df["Latitude"] = location.latitude
df["Longitude"] = location.longitude
df.head(10)



Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.6662,-79.5282
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193
7,M3B,North York,Don Mills,43.745,-79.359
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7063,-79.3094
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783


### Filter Dataframe for Only Neighborhoods in Toronto

In [41]:
df_tor = df[df['Borough'].str.contains("Toronto")]
df_tor.reset_index(drop = True, inplace = True)
df_tor.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756
4,M4E,East Toronto,The Beaches,43.6784,-79.2941


### A Visualization of Unclustered Neighborhoods in Toronto on OpenStreetMap Using Python's Folium Library

In [42]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.653226, -79.383184], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_tor['Latitude'], df_tor['Longitude'], df_tor['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

### Make Calls to Foursquare's API Using my Credentials For Top 100 Venues

### Create variables of Foursquare Credentials to be used later on to obtain a json file.

In [43]:
# John Owusu Duah's Foursquare API credentials (redacted here; substitute your own and never publish real keys)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' 
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' 
VERSION = '20180605' 
LIMIT = 100 
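Hard-coding keys in a notebook makes them easy to leak. One possible alternative, sketched below, reads them from environment variables and lets `urllib.parse.urlencode` assemble the query string. The environment variable names and the helper `build_explore_url` are assumptions for illustration; the endpoint and parameter names are the ones used in this notebook.

```python
import os
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint used by this notebook's API calls
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Assemble the request URL; urlencode takes care of escaping."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Hypothetical variable names; placeholders are used when they are not set
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET")

url = build_explore_url(43.6555, -79.3626, client_id, client_secret)
print(urlsplit(url).path)
```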

### Define function to collect data on the Top 100 Venues within a 500m Radius of all Neighbourhoods in Dataframe 

'getNearbyVenues' requests up to 100 venues within a 500 m radius of each neighborhood in the dataframe. For every venue returned, it records the neighborhood's name, latitude and longitude together with the venue's name, coordinates and first listed category, and packages everything into a single dataframe named nearby_venues.

In [44]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
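The list comprehension above assumes that every venue has at least one category and that the response always contains a `groups` entry. A more defensive variant might look like the sketch below. The payload is a made-up stand-in with the same nesting that `getNearbyVenues` indexes into, and `extract_venues` is a hypothetical helper name.

```python
# Made-up payload with the same nesting that getNearbyVenues indexes into
sample_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Cafe",
                   "location": {"lat": 43.6536, "lng": -79.3618},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Uncategorized Place",
                   "location": {"lat": 43.6540, "lng": -79.3620},
                   "categories": []}},
    ]}]}
}

def extract_venues(payload, neighborhood):
    """Return (neighborhood, name, lat, lng, category) tuples; category may be None."""
    groups = payload.get("response", {}).get("groups") or [{}]
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of raising IndexError on an empty list
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

venue_rows = extract_venues(sample_response, "Regent Park, Harbourfront")
print(venue_rows)
```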

Apply getNearbyVenues to all the neighborhoods in Toronto, stored earlier in df_tor.

In [45]:
toronto_venues = getNearbyVenues(names=df_tor['Neighborhood'],latitudes=df_tor['Latitude'],longitudes=df_tor['Longitude'])

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West,  Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport


### Check the shape of the dataframe that contains the categories and coordinates of the top 100 venues located within a 500m radius of each neighborhood in Toronto, and display its first five rows

In [46]:
print(toronto_venues.shape)
toronto_venues.head()

(1519, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.6555,-79.3626,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park, Harbourfront",43.6555,-79.3626,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park, Harbourfront",43.6555,-79.3626,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.6555,-79.3626,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
4,"Regent Park, Harbourfront",43.6555,-79.3626,Body Blitz Spa East,43.654735,-79.359874,Spa
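As a sanity check on the 500 m radius, the great-circle distance between a neighborhood's coordinates and one of its venues can be computed with the haversine formula. The sketch below uses the first row of the table above; the function name `haversine_m` and the mean Earth radius of 6,371 km are assumptions of this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (an approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# "Regent Park, Harbourfront" coordinates and the venue "Tandem Coffee" from the table above
distance = haversine_m(43.6555, -79.3626, 43.653559, -79.361809)
print(round(distance), "m")
```

The result lands well inside the 500 m radius passed to the API.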


Group the dataframe of venues within a 500m radius of each neighborhood in Toronto by neighborhood to find how many venues were returned for each one.

In [47]:
toronto_venues.groupby('Neighborhood').Venue.count()

Neighborhood
Berczy Park                                                                                                    92
Brockton, Parkdale Village, Exhibition Place                                                                   38
Business reply mail Processing Centre, South Central Letter Processing Plant Toronto                           13
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport     57
Central Bay Street                                                                                             58
Christie                                                                                                       11
Church and Wellesley                                                                                           75
Commerce Court, Victoria Hotel                                                                                100
Davisville                                                                 

### Create a dataframe which has normalized data of each venue category so that I can feed it into the k-means algorithm to cluster the neighborhoods according to venue categories. 

First, I created dummy variables for the venue category of every venue. The new dataframe of dummy variables had to keep each row matched to its neighborhood, so I concatenated it with the toronto_venues dataframe. After that, I made sure there was only one column named 'Neighborhood' and dropped the unnecessary columns.

In [48]:
# Create dummy columns of venue categories for each neighborhood in Toronto
toronto_dum = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Concatenate dummy dataframe with the original dataframe to get the neighborhoods and dummies of venue categories
# in one DataFrame
df_one = pd.concat([toronto_venues, toronto_dum], axis=1)

# After concatenating, there were two 'Neighborhood' column labels (apparently because 'Neighborhood' also
# occurs as a venue category), so we loop through the column names and give them separate suffixes so that
# the duplicate can be dropped next.
cols = []
count = 1
for column in df_one.columns:
    if column == 'Neighborhood':
        cols.append(f'Neighborhood_{count}')
        count+=1
        continue
    cols.append(column)
df_one.columns = cols

# Remove unnecessary columns
df_one.drop(['Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude','Venue Category','Neighborhood_2'], axis=1, inplace=True)

# Rename neighborhood_1 back to neighborhood
df_one = df_one.rename(columns={"Neighborhood_1":"Neighborhood"})

In [49]:
# Explore dummy dataframe
df_one.head()


Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Baby Store,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The dataframe with the dummy venue categories is aggregated by neighborhood and the mean of the dummy entries for each venue category is computed. Because each dummy is either 0 or 1, the mean is the fraction of a neighborhood's venues that fall in that category. This effectively normalizes the data: every value lies between 0 and 1 regardless of how many venues a neighborhood has.
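The effect of averaging one-hot columns can be seen on a toy example. The data below is made up: neighborhood "A" has four venues and "B" has two, yet both end up on the same 0 to 1 scale.

```python
import pandas as pd

# Made-up venues for two neighborhoods
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Bakery", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it
dummies = pd.get_dummies(venues["Venue Category"], dtype=float)
dummies.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of 0/1 indicators = share of the neighborhood's venues in each category
freq = dummies.groupby("Neighborhood").mean()
print(freq)
```

In this toy case each row sums to 1, so every neighborhood is described by proportions rather than raw counts.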

### The k-means algorithm will be run using the dataframe below without the neighborhood column

In [50]:
df_one_grouped = df_one.groupby(by='Neighborhood').mean().reset_index()
df_one_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Baby Store,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,Berczy Park,0.0,0.0,0.01087,0.021739,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01087,0.0,0.0,0.0,0.0,0.0,0.0,0.01087
1,"Brockton, Parkdale Village, Exhibition Place",0.026316,0.0,0.0,0.026316,0.0,0.026316,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544
4,Central Bay Street,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017241,0.017241,0.0,0.017241,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.013333,0.013333,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026667
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.03,0.01,0.0,0.0,0.03,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The function 'return_most_common_venues' sorts the venue categories of a neighborhood in descending order of frequency so that the most common ones can be displayed in our dataframe.

In [51]:
# Define a function to sort out the venues in descending order of frequency
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
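One side effect of sorting all categories is that a neighborhood with fewer distinct categories than `num_top_venues` has its list padded with categories whose frequency is zero. If that is not wanted, a variant along the following lines could drop them. `top_nonzero_venues` is a hypothetical name, and the sample row is made up.

```python
import pandas as pd

def top_nonzero_venues(row, num_top_venues):
    """Like return_most_common_venues, but skips categories with zero frequency."""
    freqs = row.iloc[1:].astype(float)  # the first entry is the neighborhood name
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    return list(freqs.index[:num_top_venues])

# Made-up row: three categories present, one absent
row = pd.Series({"Neighborhood": "Example", "Bakery": 0.3,
                 "Coffee Shop": 0.6, "Gym": 0.0, "Park": 0.1})

top = top_nonzero_venues(row, 8)
print(top)
```

The returned list can be shorter than `num_top_venues`, so a caller that fills fixed-width columns would need to pad it, for example with empty strings.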

### Create a new dataframe and display the top 8 venues for each neighborhood

In [52]:
# Create the new dataframe and display the top 8 venues for each neighborhood.
num_top_venues = 8

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = df_one_grouped['Neighborhood']

for ind in np.arange(df_one_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_one_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,Berczy Park,Coffee Shop,Hotel,Bakery,Café,Seafood Restaurant,Beer Bar,Japanese Restaurant,Restaurant
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Thrift / Vintage Store,Gift Shop,Brewery,Sandwich Place,Chiropractor
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Restaurant,Sushi Restaurant,Japanese Restaurant,Italian Restaurant,Bank,Intersection,Bookstore
3,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Italian Restaurant,Café,Bar,French Restaurant,Speakeasy,Grocery Store,Gym / Fitness Center
4,Central Bay Street,Coffee Shop,Middle Eastern Restaurant,Café,Sandwich Place,Bubble Tea Shop,Restaurant,Clothing Store,Italian Restaurant
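The `try`/`except` in the cell above yields correct labels up to 8, but it would produce "21th" if `num_top_venues` were ever raised past 20. A small helper such as the hypothetical `ordinal` below handles the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 9)]
print(columns)
```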


### The k-means algorithm is run to segment the neighborhoods into five clusters

In [53]:
# set number of clusters
kclusters = 5

df_one_grouped_clustering = df_one_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_one_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
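The labels printed above are all 0, which hints that most neighborhoods fall into a single cluster. Two quick diagnostics are the cluster sizes and the inertia (within-cluster sum of squares) for several values of k. The sketch below shows both on synthetic data, not on the Toronto dataframe; `n_init=10` is set explicitly because scikit-learn's default has changed between versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

inertias = {}
sizes = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    sizes[k] = np.bincount(model.labels_, minlength=k)
    print(k, round(model.inertia_, 2), sizes[k])
```

On data like this, inertia drops sharply up to the true number of groups and flattens afterwards. Applying the same loop to `df_one_grouped_clustering` would be one way to judge whether five clusters suit the Toronto data.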

A new dataframe that includes the cluster labels as well as the top 8 venues for each neighborhood will be created so that it can serve as the data for a Folium map that visualizes the clustered neighborhoods.

In [54]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_merged = df_tor

# join the sorted venues (with cluster labels) onto df_tor to add latitude/longitude for each neighborhood
df_merged = df_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

df_merged.head() # check the last columns!

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,0,Coffee Shop,Breakfast Spot,Yoga Studio,Theater,Food Truck,Spa,Event Space,Electronics Store
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,0,Gym,Coffee Shop,Sushi Restaurant,College Theater,Martial Arts School,Burrito Place,Dance Studio,Bubble Tea Shop
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,0,Coffee Shop,Clothing Store,Italian Restaurant,Japanese Restaurant,Cosmetics Shop,Middle Eastern Restaurant,Movie Theater,Café
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,0,Coffee Shop,Seafood Restaurant,Café,Cocktail Bar,Gastropub,American Restaurant,Italian Restaurant,Bakery
4,M4E,East Toronto,The Beaches,43.6784,-79.2941,0,Pub,Health Food Store,Cheese Shop,Trail,Bakery,Gastropub,Dumpling Restaurant,Fish Market
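`DataFrame.join` performs a left join by default, so a neighborhood for which no venues were returned would end up with a missing cluster label, and the label column would become floating point. That would break the integer indexing used for the map colors later on. A toy illustration with made-up data, including one possible way to handle it:

```python
import pandas as pd

# Made-up frames: neighborhood "C" has no cluster label
neighborhoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"],
                              "Latitude": [43.65, 43.66, 43.67]})
labels = pd.DataFrame({"Neighborhood": ["A", "B"], "Cluster Labels": [0, 1]})

merged = neighborhoods.join(labels.set_index("Neighborhood"), on="Neighborhood")
n_missing = int(merged["Cluster Labels"].isna().sum())
print(merged["Cluster Labels"].dtype, n_missing)

# Drop unlabeled rows and restore integer labels
merged = merged.dropna(subset=["Cluster Labels"]).copy()
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(merged)
```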


### A Visualization of Clustered Neighborhoods in Toronto on OpenStreetMap Using Python's Folium Library

In [55]:
# create folium map
map_cluster_tor = folium.Map(location=[43.653226, -79.383184], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Neighborhood'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_cluster_tor)
       
map_cluster_tor
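The lookup `rainbow[cluster-1]` is why cluster 0 is drawn in red: index -1 wraps around to the last color of the palette. The short check below rebuilds the palette, using `kclusters` directly in place of the intermediate `ys` list (both have length 5), and prints the color assigned to each label.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Five evenly spaced colors from the rainbow colormap, as hex strings
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Same lookup as the map cell: label 0 wraps around to the last color
color_for = {cluster: rainbow[cluster - 1] for cluster in range(kclusters)}

for cluster, hex_color in color_for.items():
    print(cluster, hex_color)
```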

### Each cluster is examined to determine the discriminating venue categories. Based on the defining categories, we will assign a name to each cluster.

Cluster 0 = Neighborhoods with High Numbers of Dining Venues (Color=Red)

In [56]:
df_merged.loc[df_merged['Cluster Labels'] == 0, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,"Regent Park, Harbourfront",0,Coffee Shop,Breakfast Spot,Yoga Studio,Theater,Food Truck,Spa,Event Space,Electronics Store
1,"Queen's Park, Ontario Provincial Government",0,Gym,Coffee Shop,Sushi Restaurant,College Theater,Martial Arts School,Burrito Place,Dance Studio,Bubble Tea Shop
2,"Garden District, Ryerson",0,Coffee Shop,Clothing Store,Italian Restaurant,Japanese Restaurant,Cosmetics Shop,Middle Eastern Restaurant,Movie Theater,Café
3,St. James Town,0,Coffee Shop,Seafood Restaurant,Café,Cocktail Bar,Gastropub,American Restaurant,Italian Restaurant,Bakery
4,The Beaches,0,Pub,Health Food Store,Cheese Shop,Trail,Bakery,Gastropub,Dumpling Restaurant,Fish Market
5,Berczy Park,0,Coffee Shop,Hotel,Bakery,Café,Seafood Restaurant,Beer Bar,Japanese Restaurant,Restaurant
6,Central Bay Street,0,Coffee Shop,Middle Eastern Restaurant,Café,Sandwich Place,Bubble Tea Shop,Restaurant,Clothing Store,Italian Restaurant
7,Christie,0,Café,Grocery Store,Playground,Coffee Shop,Park,Candy Store,Baby Store,Fish & Chips Shop
8,"Richmond, Adelaide, King",0,Café,Coffee Shop,Restaurant,Gym,Hotel,Asian Restaurant,Sushi Restaurant,Salad Place
9,"Dufferin, Dovercourt Village",0,Furniture / Home Store,Park,Grocery Store,Bakery,Pizza Place,Middle Eastern Restaurant,Pool,Bar


Cluster 1 = Neighborhoods with High Numbers of Photography Studios (Color=Purple)

In [57]:
df_merged.loc[df_merged['Cluster Labels'] == 1, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
18,Lawrence Park,1,Photography Studio,Park,Yoga Studio,Donut Shop,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market


Cluster 2 = Neighborhoods with High Numbers of Home Service Venues (Color=Cyan)

In [58]:
df_merged.loc[df_merged['Cluster Labels'] == 2, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
19,Roselawn,2,Home Service,Food Court,Flower Shop,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant


Cluster 3 = Neighborhoods with High Numbers of Residential Buildings Close to Parks (Color=Light Green)

In [59]:
df_merged.loc[df_merged['Cluster Labels'] == 3, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
22,"High Park, The Junction South",3,Park,Residential Building (Apartment / Condo),Yoga Studio,Donut Shop,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market


Cluster 4 = Neighborhoods with High Numbers of Outdoor Playground and Recreational Venues (Color=Light Brown)

In [60]:
df_merged.loc[df_merged['Cluster Labels'] == 4, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
21,"Forest Hill North & West, Forest Hill Road Park",4,Home Service,Park,Trail,Yoga Studio,Donut Shop,Fish Market,Fish & Chips Shop,Fast Food Restaurant
23,"North Toronto West, Lawrence Park",4,Playground,Gym Pool,Park,Garden,Donut Shop,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
29,"Moore Park, Summerhill East",4,Park,Thai Restaurant,Gym,Grocery Store,Trail,Yoga Studio,Ethiopian Restaurant,Dumpling Restaurant
33,Rosedale,4,Playground,Park,Grocery Store,Candy Store,Donut Shop,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
