# Segmenting and Clustering Neighborhoods in Toronto

## Part1

In [53]:
import numpy as np
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
from urllib.parse import urlparse, urlsplit

We will open the wikipedia page by BeautifulSoup, the result will be the html code of the page

In [54]:
website_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urlopen(website_page)
soup = BS(page)
soup



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))


<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":890001695,"wgRevisionId":890001695,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","w

In the wikipedia page we will extract the table of postal codes, in the page there are 5 tables and our table is the first [0]

In [55]:
Postal_Table = soup.find_all('table')[0]
Postal_Table

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

we will transform the table into a dataframe

In [76]:
df_table = pd.read_html(str(Postal_Table))[0]
df_table = pd.DataFrame(df_table)
df_table.head(10)

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned


we will rename columns and remove the first row which contains table headers

In [77]:
df_table.columns = ["PostalCode","Borough","Neighborhood"]
df_table = df_table.iloc[1:]
df_table.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
10,M8A,Not assigned,Not assigned


delete rows with a borough that is Not assigned.

In [83]:
df_table_nb = df_table.drop(df_table[(df_table.Borough == "Not assigned")].index)
#df_table_nb.reset_index(drop=True, inplace=True)
df_table_nb.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


select If a cell has a borough but a Not assigned neighborhood

In [84]:
df_table_nb.loc[df_table_nb['Neighborhood'] == "Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood
9,M7A,Queen's Park,Not assigned


then the neighborhood will be the same as the borough

In [85]:
df_table_nb.loc[df_table['Neighborhood'] == "Not assigned", ['Neighborhood']] = df_table_nb['Borough']

Let's check if the neighborhood is replaced and by result there is no "Not Assigned" neighborhood

In [86]:
df_table_nb.loc[df_table_nb['Neighborhood'] == "Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood


More than one neighborhood can exist in one postal code area.These two rows will be combined into one row with the neighborhoods separated with a comma 

In [87]:
df_table_concat = df_table_nb.groupby(["PostalCode", "Borough"], as_index=False, sort=False).agg(','.join)
df_table_concat.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


In [88]:
df_table_concat.shape

(103, 3)

## Part2

we will get geographical coordinates by reading the csv file http://cocl.us/Geospatial_data

In [13]:
geo_data = pd.read_csv("https://cocl.us/Geospatial_data")
geo_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In order to perform a join between the table of postal codes and the Geographic coordinates, we need a common field to join with.

we will rename the "Postal Code" field in geo_data dataframe and maket same as df_table_concat dataframe "PostalCode"

In [14]:
geo_data.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
geo_data.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [18]:
df_table_join = pd.merge(df_table_concat, geo_data[['PostalCode','Latitude', 'Longitude']], on='PostalCode')
df_table_join.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937


## Part3

In [19]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/DSX-Python35

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    branca-0.3.1               |             py_0          25 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    certifi-2018.8.24          |        py35_1001         139 KB  conda-forge
    openssl-1.0.2r             |       h14c3975_0         3.1 MB  conda-forge
    altair-2.2.2               |           py35_1         462 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    ca-certificates-2019.3.9   |       hecc5488_0         146 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.0 MB

The following NEW packages will

In [91]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/DSX-Python35

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.19.0               |             py_0          53 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          85 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.19.0-py_0 conda-forge


Downloading and Extracting Packages
geopy-1.19.0         | 53 KB     | ##################################### | 100% 
geographiclib-1.49   | 32 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [98]:
neighborhoods = df_table_join
df_table_join.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


In [None]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

#### Use geopy library to get the latitude and longitude values of Toronto City.

In [92]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto City are {}, {}.'.format(latitude, longitude))
#latitude = 43.653963
#longitude = -79.387207

The geograpical coordinate of New York City are 43.653963, -79.387207.


In [145]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.

However, for illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in Toronto. So let's slice the original dataframe and create a new dataframe of only boroughs that contain the word Toronto.

In [118]:
east_toronto_data = neighborhoods[neighborhoods['Borough'] =='East Toronto'].reset_index(drop=True)
#toronto_data = neighborhoods.query('Borough.str.contains("Toronto")', engine='python').reset_index(drop=True)
east_toronto_data.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558


Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [106]:
CLIENT_ID = 'L2YGMVEQ4LJAC5K1I3EIGVOS4JCDH3CBVFJTLZLG34IQLRXB' # your Foursquare ID
CLIENT_SECRET = 'YGMFHYRSQHRA2ZEQMQWVL2IJ2FNC15VOVRKTYDFIDK4DXJSB' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: L2YGMVEQ4LJAC5K1I3EIGVOS4JCDH3CBVFJTLZLG34IQLRXB
CLIENT_SECRET:YGMFHYRSQHRA2ZEQMQWVL2IJ2FNC15VOVRKTYDFIDK4DXJSB


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [119]:
east_toronto_data.loc[0,'Neighborhood']

'The Beaches'

In [120]:
neighborhood_latitude = east_toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = east_toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = east_toronto_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


#### Now, let's get the top 100 venues that are in Marble Hill within a radius of 500 meters.

First, let's create the GET request URL. Name your URL **url**.

In [112]:
radius=500
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, neighborhood_latitude, 
                                                                                                                            neighborhood_longitude,radius, LIMIT)
# display URL
url

'https://api.foursquare.com/v2/venues/explore?&client_id=L2YGMVEQ4LJAC5K1I3EIGVOS4JCDH3CBVFJTLZLG34IQLRXB&client_secret=YGMFHYRSQHRA2ZEQMQWVL2IJ2FNC15VOVRKTYDFIDK4DXJSB&v=20180605&ll=43.67635739999999,-79.2930312&radius=500&limit=100'

In [113]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5cd9565ef594df21bb8097d5'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4ad4c062f964a52011f820e3-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/shops/food_grocery_',
          'suffix': '.png'},
         'id': '50aa9e744b90af0d42d5de0e',
         'name': 'Health Food Store',
         'pluralName': 'Health Food Stores',
         'primary': True,
         'shortName': 'Health Food Store'}],
       'id': '4ad4c062f964a52011f820e3',
       'location': {'address': '125 Southwood Dr',
        'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'distance': 471,
        'formattedAddress': ['125 Southwood Dr',
         'Toronto ON M4E 0B8',
         'Canada'],
        'labeledLatLngs': [{'label': 'display',
      

In [114]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [115]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
1,Grover Pub and Grub,Pub,43.679181,-79.297215
2,St-Denis Studios Inc.,Music Venue,43.675031,-79.288022
3,Upper Beaches,Neighborhood,43.680563,-79.292869


And how many venues were returned by Foursquare?

In [116]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


#### Let's create a function to repeat the same process to all the neighborhoods in Toronto

In [117]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [122]:
east_toronto_venues = getNearbyVenues(names=east_toronto_data['Neighborhood'],
                                   latitudes=east_toronto_data['Latitude'],
                                   longitudes=east_toronto_data['Longitude']
                                  )

The Beaches
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Business Reply Mail Processing Centre 969 Eastern


#### Let's check the size of the resulting dataframe

In [123]:
print(east_toronto_venues.shape)
east_toronto_venues.head()

(124, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
1,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
2,The Beaches,43.676357,-79.293031,St-Denis Studios Inc.,43.675031,-79.288022,Music Venue
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West,Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


Let's check how many venues were returned for each neighborhood

In [124]:
east_toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Business Reply Mail Processing Centre 969 Eastern,17,17,17,17,17,17
Studio District,37,37,37,37,37,37
The Beaches,4,4,4,4,4,4
"The Beaches West,India Bazaar",22,22,22,22,22,22
"The Danforth West,Riverdale",44,44,44,44,44,44


#### Let's find out how many unique categories can be curated from all the returned venues

In [125]:
print('There are {} uniques categories.'.format(len(east_toronto_venues['Venue Category'].unique())))

There are 68 uniques categories.


### Analyze Each Neighborhood

In [126]:
# one hot encoding
east_toronto_onehot = pd.get_dummies(east_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
east_toronto_onehot['Neighborhood'] = east_toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [east_toronto_onehot.columns[-1]] + list(east_toronto_onehot.columns[:-1])
east_toronto_onehot = east_toronto_onehot[fixed_columns]

east_toronto_onehot.head()

Unnamed: 0,Yoga Studio,American Restaurant,Auto Workshop,Bakery,Bank,Bar,Bookstore,Brewery,Bubble Tea Shop,Burger Joint,Burrito Place,Café,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Clothing Store,Coffee Shop,Comfort Food Restaurant,Comic Shop,Convenience Store,Cosmetics Shop,Coworking Space,Dessert Shop,Diner,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Fish Market,Food & Drink Shop,Fruit & Vegetable Store,Furniture / Home Store,Garden,Garden Center,Gastropub,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Health Food Store,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Juice Bar,Latin American Restaurant,Light Rail Station,Liquor Store,Middle Eastern Restaurant,Movie Theater,Music Store,Music Venue,Neighborhood,Park,Pet Store,Pizza Place,Pub,Recording Studio,Restaurant,Sandwich Place,Seafood Restaurant,Skate Park,Smoke Shop,Spa,Sports Bar,Stationery Store,Steakhouse,Sushi Restaurant,Trail
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,The Beaches,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"The Danforth West,Riverdale",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [127]:
east_toronto_onehot.shape

(124, 68)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [128]:
east_toronto_grouped = east_toronto_onehot.groupby('Neighborhood').mean().reset_index()
east_toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,American Restaurant,Auto Workshop,Bakery,Bank,Bar,Bookstore,Brewery,Bubble Tea Shop,Burger Joint,Burrito Place,Café,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Clothing Store,Coffee Shop,Comfort Food Restaurant,Comic Shop,Convenience Store,Cosmetics Shop,Coworking Space,Dessert Shop,Diner,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Fish Market,Food & Drink Shop,Fruit & Vegetable Store,Furniture / Home Store,Garden,Garden Center,Gastropub,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Health Food Store,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Juice Bar,Latin American Restaurant,Light Rail Station,Liquor Store,Middle Eastern Restaurant,Movie Theater,Music Store,Music Venue,Park,Pet Store,Pizza Place,Pub,Recording Studio,Restaurant,Sandwich Place,Seafood Restaurant,Skate Park,Smoke Shop,Spa,Sports Bar,Stationery Store,Steakhouse,Sushi Restaurant,Trail
0,Business Reply Mail Processing Centre 969 Eastern,0.058824,0.0,0.058824,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.058824,0.058824,0.0,0.0,0.0,0.0,0.0,0.058824,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117647,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.058824,0.0,0.058824,0.058824,0.0,0.0,0.058824,0.058824,0.0,0.0,0.0,0.0,0.0,0.0
1,Studio District,0.027027,0.054054,0.0,0.054054,0.027027,0.027027,0.027027,0.027027,0.0,0.0,0.0,0.108108,0.0,0.027027,0.027027,0.027027,0.081081,0.027027,0.0,0.027027,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0,0.027027,0.0,0.027027,0.0,0.054054,0.0,0.0,0.027027,0.0,0.0,0.027027,0.0,0.027027,0.0,0.027027,0.0,0.027027,0.0,0.0,0.0,0.027027,0.027027,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0
2,The Beaches,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"The Beaches West,India Bazaar",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.045455,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.045455,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0,0.045455,0.045455,0.0,0.045455,0.0,0.0,0.090909,0.045455,0.045455,0.045455,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.045455,0.0
4,"The Danforth West,Riverdale",0.022727,0.022727,0.0,0.022727,0.0,0.0,0.045455,0.022727,0.022727,0.0,0.0,0.022727,0.022727,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.022727,0.0,0.022727,0.022727,0.0,0.0,0.0,0.0,0.0,0.022727,0.045455,0.0,0.0,0.0,0.181818,0.022727,0.0,0.0,0.0,0.068182,0.022727,0.045455,0.022727,0.022727,0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.022727,0.022727,0.0,0.022727,0.0,0.0,0.0,0.0,0.022727,0.022727,0.0,0.0,0.022727,0.022727


#### Let's confirm the new size

In [129]:

east_toronto_grouped.shape

(5, 68)

#### Let's print each neighborhood along with the top 5 most common venues

In [130]:
num_top_venues = 5

for hood in east_toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = east_toronto_grouped[east_toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0  Light Rail Station  0.12
1         Yoga Studio  0.06
2          Skate Park  0.06
3       Garden Center  0.06
4              Garden  0.06


----Studio District----
                 venue  freq
0                 Café  0.11
1          Coffee Shop  0.08
2               Bakery  0.05
3  American Restaurant  0.05
4   Italian Restaurant  0.05


----The Beaches----
                       venue  freq
0          Health Food Store  0.25
1                        Pub  0.25
2                Music Venue  0.25
3                Yoga Studio  0.00
4  Latin American Restaurant  0.00


----The Beaches West,India Bazaar----
            venue  freq
0            Park  0.09
1  Sandwich Place  0.09
2     Pizza Place  0.05
3     Coffee Shop  0.05
4  Ice Cream Shop  0.05


----The Danforth West,Riverdale----
                    venue  freq
0        Greek Restaurant  0.18
1             Coffee Shop  0.09
2          Ice Cream Sho

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [131]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [132]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = east_toronto_grouped['Neighborhood']

for ind in np.arange(east_toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(east_toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Farmers Market,Comic Shop,Burrito Place,Park,Pizza Place,Recording Studio,Restaurant,Brewery
1,Studio District,Café,Coffee Shop,Italian Restaurant,American Restaurant,Bakery,Ice Cream Shop,Gym / Fitness Center,Fish Market,Coworking Space,Convenience Store
2,The Beaches,Music Venue,Pub,Health Food Store,Trail,Dessert Shop,Comic Shop,Convenience Store,Cosmetics Shop,Coworking Space,Diner
3,"The Beaches West,India Bazaar",Park,Sandwich Place,Liquor Store,Food & Drink Shop,Sushi Restaurant,Gym,Coffee Shop,Ice Cream Shop,Italian Restaurant,Light Rail Station
4,"The Danforth West,Riverdale",Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore,Furniture / Home Store,Trail,Diner,Indian Restaurant,Grocery Store


### Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 5 clusters.

In [133]:
# set number of clusters
kclusters = 5

east_toronto_grouped_clustering = east_toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(east_toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([3, 4, 1, 2, 0], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [134]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

east_toronto_merged = east_toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
east_toronto_merged = east_toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

east_toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Music Venue,Pub,Health Food Store,Trail,Dessert Shop,Comic Shop,Convenience Store,Cosmetics Shop,Coworking Space,Diner
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore,Furniture / Home Store,Trail,Diner,Indian Restaurant,Grocery Store
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,2,Park,Sandwich Place,Liquor Store,Food & Drink Shop,Sushi Restaurant,Gym,Coffee Shop,Ice Cream Shop,Italian Restaurant,Light Rail Station
3,M4M,East Toronto,Studio District,43.659526,-79.340923,4,Café,Coffee Shop,Italian Restaurant,American Restaurant,Bakery,Ice Cream Shop,Gym / Fitness Center,Fish Market,Coworking Space,Convenience Store
4,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,3,Light Rail Station,Yoga Studio,Farmers Market,Comic Shop,Burrito Place,Park,Pizza Place,Recording Studio,Restaurant,Brewery


Finally, let's visualize the resulting clusters

In [144]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(east_toronto_merged['Latitude'], east_toronto_merged['Longitude'], east_toronto_merged['Neighborhood'], east_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine Clusters

#### Cluster 1

In [138]:
east_toronto_merged.loc[east_toronto_merged['Cluster Labels'] == 0, east_toronto_merged.columns[[1] + list(range(5, east_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East Toronto,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore,Furniture / Home Store,Trail,Diner,Indian Restaurant,Grocery Store


#### Cluster 2


In [139]:
east_toronto_merged.loc[east_toronto_merged['Cluster Labels'] == 1, east_toronto_merged.columns[[1] + list(range(5, east_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,1,Music Venue,Pub,Health Food Store,Trail,Dessert Shop,Comic Shop,Convenience Store,Cosmetics Shop,Coworking Space,Diner


#### Cluster 3


In [141]:
east_toronto_merged.loc[east_toronto_merged['Cluster Labels'] == 2, east_toronto_merged.columns[[1] + list(range(5, east_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,East Toronto,2,Park,Sandwich Place,Liquor Store,Food & Drink Shop,Sushi Restaurant,Gym,Coffee Shop,Ice Cream Shop,Italian Restaurant,Light Rail Station


#### Cluster 4


In [142]:
east_toronto_merged.loc[east_toronto_merged['Cluster Labels'] == 3, east_toronto_merged.columns[[1] + list(range(5, east_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,East Toronto,3,Light Rail Station,Yoga Studio,Farmers Market,Comic Shop,Burrito Place,Park,Pizza Place,Recording Studio,Restaurant,Brewery


#### Cluster 5


In [143]:
east_toronto_merged.loc[east_toronto_merged['Cluster Labels'] == 4, east_toronto_merged.columns[[1] + list(range(5, east_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,East Toronto,4,Café,Coffee Shop,Italian Restaurant,American Restaurant,Bakery,Ice Cream Shop,Gym / Fitness Center,Fish Market,Coworking Space,Convenience Store
