# Segmenting and Clustering Neighborhoods in Toronto

My Coursera submission to segment and cluster neighborhoods in Toronto.

<span style="font-size:xx-small;"><b>Do not copy or distribute.</b></span>

**Table of Contents**:
 - [Part 1](#Segmenting-and-Clustering-Neighborhoods-in-Toronto)
 - [Part 2: Geocoding](#Part-2:-Geocoding)

In [1]:
import pandas as pd
import numpy as np
import requests
import json

user_agent = 'datascience jupyter notebook/0.0 (Austin Rainwater, paco@heckin.io)'

## Step 2

 > Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

I will start by grabbing the data from Wikipedia using its API.

See https://stackoverflow.com/questions/40210536/how-to-obtain-data-in-a-table-from-wikipedia-api

In [2]:
url = 'https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=sections&page=List_of_postal_codes_of_Canada%3A_M'

results = requests.get(url, headers={'User-Agent': user_agent}).json()
sections = results['parse']['sections']

section_index = None
for section in sections:
    if section['line'] == 'Toronto - 103 FSAs':
        section_index = section['index']

if section_index is None:
    raise ValueError("The Toronto FSAs were not found.")
    
"Section index is " + section_index

'Section index is 1'

In [3]:
url = 'https://en.wikipedia.org/w/api.php?action=parse&format=json&page=List_of_postal_codes_of_Canada%3A_M&prop=text&section=' + section_index

results = requests.get(url, headers={'User-Agent': user_agent}).json()
html = results['parse']['text']['*']

"Beginning of HTML: " + html[:1000]

'Beginning of HTML: <div class="mw-parser-output"><h2><span class="mw-headline" id="Toronto_-_103_FSAs"><a href="/wiki/Toronto" title="Toronto">Toronto</a> - 103 <a href="/wiki/Postal_codes_in_Canada#Forward_sortation_areas" title="Postal codes in Canada">FSAs</a></span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_postal_codes_of_Canada:_M&amp;action=edit&amp;section=1" title="Edit section: Toronto - 103 FSAs">edit</a><span class="mw-editsection-bracket">]</span></span></h2>\n<p>Note: There are no rural FSAs in Toronto, hence no postal codes should start with M0. However, the postal code M0R 8T0 is assigned to an <a href="/wiki/Amazon_(company)" title="Amazon (company)">Amazon</a> warehouse in Mississauga, suggesting that Canada Post may have reserved the M0 FSA for high volume addresses.\n</p>\n<table class="wikitable sortable">\n<tbody><tr>\n<th>Postal Code\n</th>\n<th>Borough\n</th>\n<th>Neighbourhood\n</th></tr>\n<tr>

I was going to try to build a [regular expression to parse the HTML](https://stackoverflow.com/a/1732454), but then I learned that pandas already has a [`read_html()` function](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html):

In [4]:
df = pd.read_html(html)[0]
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


## Step 3

 > - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
 > - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
 > - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
 > - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
 > - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
 > - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.


I'll keep only the rows whose borough is not `Not assigned`, then list the unique borough values to see what's left.

In [5]:
df = df[df['Borough'] != 'Not assigned']
df['Borough'].unique()

array(['North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

All of those look like good values.

Looking at the page, it appears to have been edited recently so that each postal code already occupies a single row with its neighbourhoods combined, which is the result the instructions ask for. I can confirm this by checking how many unique boroughs each postal code has, and how many unique neighbourhood values each postal code and borough pair has.

In [6]:
a = df[['Postal Code', 'Borough']].groupby('Postal Code').nunique().max()
b = df.groupby(['Postal Code', 'Borough']).nunique().max()
# Each result is a one-element Series, so take its single value explicitly
print(f"Max unique borough values per postal code: {int(a.iloc[0])}")
print(f"Max unique neighborhood values per borough: {int(b.iloc[0])}")

Max unique borough values per postal code: 1
Max unique neighborhood values per borough: 1


However, I am curious if I can list out the neighborhoods in each postal code.

In [7]:
grouped_df = df[['Postal Code', 'Neighbourhood']].groupby('Postal Code')
grouped_df.aggregate(', '.join)

| Postal Code | Neighbourhood |
|---|---|
| M1B | Malvern, Rouge |
| M1C | Rouge Hill, Port Union, Highland Creek |
| M1E | Guildwood, Morningside, West Hill |
| M1G | Woburn |
| M1H | Cedarbrae |
| ... | ... |
| M9N | Weston |
| M9P | Westmount |
| M9R | Kingsview Village, St. Phillips, Martin Grove ... |
| M9V | South Steeles, Silverstone, Humbergate, Jamest... |


If the page still listed one row per neighbourhood, I would use the same approach, grouping on postal code and borough, to perform the aggregation the instructions ask for.
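As a rough illustration of that aggregation, here is a minimal sketch on a tiny hand-made frame. The `sample` data and the `combined` name are assumptions for the example, not the scraped table; only the column names match the notebook.

```python
import pandas as pd

# Hand-made rows in the "one row per neighbourhood" layout the instructions describe
sample = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Group on postal code and borough, then join the neighbourhood names with a comma
combined = (
    sample.groupby(['Postal Code', 'Borough'], as_index=False)['Neighbourhood']
    .agg(', '.join)
)
print(combined)
```

Grouping on both keys keeps the borough as a regular column in the result instead of dropping it.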

Now to continue assigning a value to "not assigned" neighborhoods.

In [8]:
def process_row(row):
    if row['Neighbourhood'] == 'Not assigned':
        return row['Borough']
    return row['Neighbourhood']

df['Neighbourhood'] = df.apply(process_row, axis=1)
# `in` on a Series tests the index labels, so compare the values instead
(df['Neighbourhood'] == 'Not assigned').any()

False
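One caveat about membership tests on a pandas Series: the `in` operator checks the index labels, not the values. A minimal sketch on a made-up Series (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Two values stored under the integer index labels 10 and 11
s = pd.Series(['North York', 'Not assigned'], index=[10, 11])

in_index = 'Not assigned' in s                 # tests the index labels 10 and 11
label_found = 10 in s                          # 10 is an index label
in_values = bool(s.eq('Not assigned').any())   # tests the values themselves

print(in_index, label_found, in_values)  # False True True
```

Comparing the values and calling `.any()` is therefore the safer way to confirm that no `Not assigned` entries remain.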

Let's finally check the shape of the dataframe.

In [9]:
df.shape

(103, 3)

---

# Part 2: Geocoding

Converting postal codes to coordinates.

In [10]:
!pip --quiet install --upgrade geocoder

import geocoder

In [11]:
def get_latitude_longitude(row):
    postal_code = row['Postal Code']
    geo = f"{postal_code}, Toronto, Ontario"
    
    i = 0
    while True:
        try:
            lat_lng = geocoder.google(geo).latlng
            assert lat_lng is not None
            break
        except AssertionError:
            i = i + 1
            if i >= 25:
                raise ValueError("Tried 25 times and did not get a result")
    
    return pd.Series(lat_lng, ['Latitude', 'Longitude'])

Let's see if we get any results on a couple of rows before trying that on all postal codes.

In [12]:
from traceback import print_exc

try:
    df.head(2).apply(get_latitude_longitude, axis=1, result_type='expand')
except Exception:
    print_exc()

Traceback (most recent call last):
  File "<ipython-input-11-5d555894ad42>", line 9, in get_latitude_longitude
    assert lat_lng is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<ipython-input-12-c67d69be9800>", line 4, in <module>
    df.head(2).apply(get_latitude_longitude, axis=1, result_type='expand')
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
    return op.get_result()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
    return self.apply_standard()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
    results[i] = self.f(v)
  File "<ipython-input-11-5d555894ad42>", line 14, in get_latitude_long

Hmmm...that did not work. What happens if I try with one value?

In [13]:
result = geocoder.google('M3A, Toronto, Ontario')
result

<[REQUEST_DENIED] Google - Geocode [empty]>

`REQUEST_DENIED` tells me this is no longer working. That makes sense: I'd imagine the free Google geocoder would not keep working if they wanted people to pay for the API.

Let's try [OpenStreetMap (OSM)](https://geocoder.readthedocs.io/providers/OpenStreetMap.html#geocoding).

In [14]:
def get_latitude_longitude(row):
    postal_code = row['Postal Code']
    neighbourhood = row['Neighbourhood']
    geo = f"{postal_code}, Toronto, Ontario"
    ngeo = f"{neighbourhood}, Toronto, Ontario"
    
    i = 0
    while True:
        try:
            # Have to modify it a little because the response is different
            result = geocoder.osm(geo).json or geocoder.osm(ngeo).json
            assert result is not None
            lat_lng = (result['lat'], result['lng'])
            assert lat_lng is not None
            return pd.Series(lat_lng, ['Latitude', 'Longitude'])
        except AssertionError:
            i = i + 1
            if i >= 25:
                raise ValueError(f"Tried 25 times for postal code {postal_code} and did not get a result")
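The loop above retries immediately, which can send many requests to a public service in a short burst. Below is a minimal sketch of the same bounded-retry idea with a pause between attempts. The names `retry` and `lookup` are hypothetical helpers, not part of `geocoder`, and the coordinates in the demo are made up.

```python
import time

def retry(lookup, attempts=5, delay=1.0, sleep=time.sleep):
    """Call lookup() until it returns something other than None, pausing between tries."""
    for attempt in range(1, attempts + 1):
        result = lookup()
        if result is not None:
            return result
        if attempt < attempts:
            # Wait before the next try; skipped after the final attempt
            sleep(delay)
    raise ValueError(f"Tried {attempts} times and did not get a result")

# Demo with a fake lookup that fails twice, then succeeds.
# The pauses are recorded in a list instead of actually sleeping.
responses = iter([None, None, (43.65, -79.38)])
pauses = []
lat_lng = retry(lambda: next(responses), attempts=5, delay=1.0, sleep=pauses.append)
print(lat_lng, pauses)  # (43.65, -79.38) [1.0, 1.0]
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.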

In [15]:
try:
    result = df.head(2).apply(get_latitude_longitude, axis=1, result_type='expand')
except Exception:
    print_exc()
    
result

| | Latitude | Longitude |
|---|---|---|
| 2 | 43.653482 | -79.383935 |
| 3 | 43.732658 | -79.311189 |


In [16]:
result = None

try:
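    # pd.concat expects a list of frames; axis=1 places the coordinates beside the existing columns
    df = pd.concat([df, df.apply(get_latitude_longitude, axis=1, result_type='expand')], axis=1)
    result = df.head(20)
except Exception: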
    print_exc()
    
result

Traceback (most recent call last):
  File "<ipython-input-14-86b0e35a0993>", line 12, in get_latitude_longitude
    assert result is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<ipython-input-16-b3440c6a1e19>", line 4, in <module>
    df = pd.concat(df, df.apply(get_latitude_longitude, axis=1, result_type='expand'))
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
    return op.get_result()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
    return self.apply_standard()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
    results[i] = self.f(v)
  File "<ipython-input-14-86b0e35a0993>", line 19, in get_l

Darn. I'm out of ideas. I guess I will use the data provided by the course.

In [17]:
lat_lng_df = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lng_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [18]:
postal_df = df.merge(lat_lng_df, on='Postal Code')
postal_df.set_index('Postal Code', inplace=True)
postal_df.head()

| Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


---

# Part 3: Neighborhood Clustering

In [19]:
!pip --quiet install --upgrade aiohttp pyyaml folium

# Using Python's async API allows one to make more requests 
# while waiting for the response from another.

import aiohttp 
import asyncio

import requests
import yaml
import os

from pandas import json_normalize
from matplotlib import cm, colors

from sklearn.cluster import KMeans

import folium

with open('secrets.yaml') as f:
    credentials = yaml.safe_load(f)
client_id = credentials['4SQ_CLIENT_ID']
client_secret = credentials['4SQ_CLIENT_SECRET']
v = '20201108'
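The comment in the cell above says the async API lets several requests wait at the same time. A minimal sketch of that idea using only the standard library follows. `fake_request` is a hypothetical stand-in for an HTTP call and makes no network request.

```python
import asyncio

async def fake_request(name, delay):
    """Stand-in for an HTTP call: waits `delay` seconds, then returns a label."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def fetch_all():
    # gather() runs all coroutines concurrently and returns results in input order
    tasks = [fake_request(name, 0.05) for name in ('M3A', 'M4A', 'M5A')]
    return await asyncio.gather(*tasks)

results = asyncio.run(fetch_all())
print(results)  # ['M3A done', 'M4A done', 'M5A done']
```

Inside a notebook, which already runs an event loop, `await fetch_all()` takes the place of `asyncio.run(fetch_all())`.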

## Caching data

I'm going to start by caching the responses for every postal code as a "pickle" (a serialized binary copy of a Python object). This will save me time (and API calls) if I have to come back to this, as I am likely to do.
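The general load-or-compute pattern can be sketched with the standard library alone. `load_or_compute` and `expensive` are hypothetical names for illustration; the notebook itself relies on pandas' own pickle helpers.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if it cannot be read."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        result = compute()
        # Create the parent directory so the write cannot fail on a missing folder
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        with open(path, 'wb') as f:
            pickle.dump(result, f)
        return result

calls = []

def expensive():
    """Pretend to be a slow API call and record each invocation."""
    calls.append(1)
    return {'answer': 42}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data', 'cache.pkl')
    first = load_or_compute(path, expensive)   # computes and writes the cache
    second = load_or_compute(path, expensive)  # reads the cache, no recomputation

print(first == second, len(calls))  # True 1
```

Pickles should only be loaded from files you created yourself, since unpickling can execute arbitrary code.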

In [20]:
async def get_neighborhood_data(postal, row, session):
    queries = {
        'postal': {'near': f"{postal}, Toronto, Ontario"},
        'neighborhood': {'near': f"{row['Neighbourhood'].split(',')[0]}, Toronto, Ontario"},
        'latlong': {'ll': f"{row.Latitude},{row.Longitude}"}
    }
    
    err = None
    for strategy, q in queries.items():
        url = "https://api.foursquare.com/v2/venues/explore"
        params = {
            'client_id': client_id,
            'client_secret': client_secret,
            'limit': '50',
            'v': v,
            **q
        }
        try:
            async with session.get(url, params=params) as result:
                data = await result.json()
                assert 'response' in data and 'groups' in data['response']
                return pd.Series(
                    [postal, strategy, q, data], 
                    ['postal_code', 'strategy', 'query', 'json_response']
                )
        except Exception as e:
            err = e
            continue
    raise err

async def query_data(test_rows=None):
    async with aiohttp.ClientSession() as session:
        tasks = [get_neighborhood_data(postal, row, session) for postal, row in postal_df.head(test_rows).iterrows()]
        results = await asyncio.gather(*tasks)
        return pd.DataFrame(results)

await query_data(3)

| | postal_code | strategy | query | json_response |
|---|---|---|---|---|
| 0 | M3A | postal | `{'near': 'M3A, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a62...` |
| 1 | M4A | postal | `{'near': 'M4A, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a62...` |
| 2 | M5A | postal | `{'near': 'M5A, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a65...` |


In [21]:
path = os.path.join('data', 'foursquare.pkl')
try:
    foursquare_df = pd.read_pickle(path)
except Exception:
    print("Foursquare pickle does not exist or cannot be read; reprocessing.")
    foursquare_df = await query_data()

    # Make sure the cache directory exists before writing the pickle
    os.makedirs('data', exist_ok=True)
    foursquare_df.to_pickle(path)
    
foursquare_df

Foursquare pickle does not exist or cannot be read; reprocessing.


| | postal_code | strategy | query | json_response |
|---|---|---|---|---|
| 0 | M3A | postal | `{'near': 'M3A, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a76...` |
| 1 | M4A | postal | `{'near': 'M4A, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a73...` |
| 2 | M5A | postal | `{'near': 'M5A, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a7c...` |
| 3 | M6A | postal | `{'near': 'M6A, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a70...` |
| 4 | M7A | postal | `{'near': 'M7A, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a79...` |
| ... | ... | ... | ... | ... |
| 98 | M8X | postal | `{'near': 'M8X, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a7e...` |
| 99 | M4Y | postal | `{'near': 'M4Y, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a79...` |
| 100 | M7Y | postal | `{'near': 'M7Y, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a7e...` |
| 101 | M8Y | postal | `{'near': 'M8Y, Toronto, Ontario'}` | `{'meta': {'code': 200, 'requestId': '5faa04a72...` |


In [22]:
foursquare_df.strategy.value_counts()

postal     102
latlong      1
Name: strategy, dtype: int64

In [23]:
foursquare_df[foursquare_df['strategy'] == 'latlong']

| | postal_code | strategy | query | json_response |
|---|---|---|---|---|
| 76 | M7R | latlong | `{'ll': '43.6369656,-79.61581899999999'}` | `{'meta': {'code': 200, 'requestId': '5faa04a8b...` |


I now have `foursquare_df` that contains the responses from Foursquare for each of the postal codes I passed in. Let's take a look at one.

In [24]:
foursquare_df.json_response.sample(1).iat[0]

{'meta': {'code': 200, 'requestId': '5faa04a768a18376623d690f'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'geocode': {'what': '',
   'where': 'm2m toronto ontario',
   'center': {'lat': 43.7915, 'lng': -79.4103},
   'displayString': 'M2M, Canada',
   'cc': 'CA',
   'longId': '146648462866283642'},
  'headerLocation': 'Current map view',
  'headerFullLocation': 'Current map view',
  'headerLocationGranularity': 'unknown',
  'totalResults': 102,
  'suggestedBounds': {'ne': {'lat': 43.8182336846777, 'lng': -79.377073401},
   'sw': {'lat': 43.76375180838496, 'lng': -79.44910298727369}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '58ba1f446d349d32d862ff70',
       'name': 'Daldongnae',
       '

## Hierarchy of Categories

I also want to grab Foursquare's [hierarchy of categories](https://developer.foursquare.com/docs/build-with-foursquare/categories/) to help build a better model.

In [25]:
url = 'https://api.foursquare.com/v2/venues/categories'
params = {
    'client_id': client_id,
    'client_secret': client_secret,
    'v': v
}
foursquare_categories = requests.get(url, params=params).json()
foursquare_categories

{'meta': {'code': 200, 'requestId': '5faa04a93f31f8193787e441'},
 'response': {'categories': [{'id': '4d4b7104d754a06370d81259',
    'name': 'Arts & Entertainment',
    'pluralName': 'Arts & Entertainment',
    'shortName': 'Arts & Entertainment',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
     'suffix': '.png'},
    'categories': [{'id': '56aa371be4b08b9a8d5734db',
      'name': 'Amphitheater',
      'pluralName': 'Amphitheaters',
      'shortName': 'Amphitheater',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
       'suffix': '.png'},
      'categories': []},
     {'id': '4fceea171983d5d06c3e9823',
      'name': 'Aquarium',
      'pluralName': 'Aquariums',
      'shortName': 'Aquarium',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/aquarium_',
       'suffix': '.png'},
      'categories': []},
     {'id': '4bf58dd8d48988d1e1931735',
      'name': 'A

Let's calculate the deepest level of categories.

In [26]:
def get_deepest_level(categories):
    if len(categories) == 0:
        return 0
    levels = []
    for category in categories:
        levels.append(1 + get_deepest_level(category['categories']))
    return max(levels)

get_deepest_level(foursquare_categories['response']['categories'])

5

So a category will be 5 levels deep at most. I'm going to make a dataframe of venues alongside the postal code and combine that with up to 5 levels of categories. 

Since there isn't a lot of data included with categories, I'm satisfied just using the short name.

I'm going to create a categories dataframe that is indexed on the category ID so I can quickly obtain the category hierarchy.
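The lookup idea can be sketched without pandas: flatten the nested categories into a dictionary keyed by category ID, so a venue's hierarchy is one lookup away. The `flatten` helper and the toy tree with its IDs are assumptions for illustration, not Foursquare data.

```python
def flatten(categories, prefix=()):
    """Map each category id to the tuple of short names from the root down to it."""
    paths = {}
    for category in categories:
        path = prefix + (category['shortName'],)
        paths[category['id']] = path
        # Recurse into the children, carrying the path built so far
        paths.update(flatten(category['categories'], path))
    return paths

# Hand-made tree with the same keys the API response uses
toy = [
    {'id': 'a', 'shortName': 'Food', 'categories': [
        {'id': 'b', 'shortName': 'Asian', 'categories': [
            {'id': 'c', 'shortName': 'Japanese', 'categories': []},
        ]},
    ]},
    {'id': 'd', 'shortName': 'Travel', 'categories': []},
]

paths = flatten(toy)
print(paths['c'])  # ('Food', 'Asian', 'Japanese')
```

The DataFrame version below stores the same paths, padded with `NaN` up to five levels.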

In [27]:
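def category_hier(categories, prefix=None):
    # Avoid a mutable default argument; the top-level call starts with an empty path
    prefix = prefix or []
    result = []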
    
    for category in categories:
        category = json_normalize(category).iloc[0]
        current_category = pd.Series(
            data=prefix + [category.shortName] + [np.nan] * (4 - len(prefix)),
            name=str(category.id),
            index=[
                'level_1',
                'level_2',
                'level_3',
                'level_4',
                'level_5'
            ]
        )
        result.append(current_category)
        if subcategories := category.categories:
            result += category_hier(subcategories, prefix + [category.shortName])
            
    return result

categories = foursquare_categories['response']['categories']
category_df = pd.DataFrame(category_hier(categories))

category_df

| | level_1 | level_2 | level_3 | level_4 | level_5 |
|---|---|---|---|---|---|
| 4d4b7104d754a06370d81259 | Arts & Entertainment | | | | |
| 56aa371be4b08b9a8d5734db | Arts & Entertainment | Amphitheater | | | |
| 4fceea171983d5d06c3e9823 | Arts & Entertainment | Aquarium | | | |
| 4bf58dd8d48988d1e1931735 | Arts & Entertainment | Arcade | | | |
| 4bf58dd8d48988d1e2931735 | Arts & Entertainment | Art Gallery | | | |
| ... | ... | ... | ... | ... | ... |
| 52f2ab2ebcbc57f1066b8b51 | Travel | Tram | | | |
| 54541b70498ea6ccd0204bff | Travel | Transportation Services | | | |
| 4f04b25d2fb6e1c99f3db0c0 | Travel | Lounge | | | |
| 57558b36e4b065ecebd306dd | Travel | Truck Stop | | | |


I did notice that each venue has an "array" of categories, so a venue could be a combination of two categories. I want to double-check whether that happens here, because if so, I should account for it.

In [28]:
max_categories = 0

for index, row in foursquare_df.iterrows():
    items = json_normalize(row.json_response['response']['groups'][0]['items'])
    for index, item in items.iterrows():
        if max_categories < len(item['venue.categories']):
            max_categories = len(item['venue.categories'])
        
max_categories

1

Cool, so I don't need to worry about venues with multiple categories. Let's create this dataframe, then.

## Venue Scoring DataFrame

In [29]:
venues = []

for index, row in foursquare_df.iterrows():
    items = json_normalize(row.json_response['response']['groups'][0]['items'])
    for item_index, item in items.iterrows():
        postal = postal_df.loc[row.postal_code]
        category = category_df.loc[item['venue.categories'][0]['id']]
        venues.append((
            row.postal_code,
            postal['Latitude'],
            postal['Longitude'],
            item['venue.location.lat'],
            item['venue.location.lng'],
            item['venue.name'],
            category.level_1,
            category.level_2,
            category.level_3,
            category.level_4,
            category.level_5
        ))

# Column names for the venue table, defined once after the loops
columns = [
    'postal_code',
    'postal_lat',
    'postal_lng',
    'venue_lat',
    'venue_lng',
    'venue',
    'category_1',
    'category_2',
    'category_3',
    'category_4',
    'category_5'
]

venue_df = pd.DataFrame(venues, columns=columns)

In [30]:
toronto_df = pd.merge(
    venue_df[['postal_code']], 
    pd.get_dummies(
        venue_df[['category_1', 'category_2', 'category_3', 'category_4']], 
        prefix=['Category', 'Subcategory', 'Type 1', 'Type 2'], 
        prefix_sep=': '
    ),
    left_index=True, 
    right_index=True
)

toronto_df

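| | postal_code | Category: Arts & Entertainment | Category: College & Education | Category: Food | Category: Nightlife | Category: Outdoors & Recreation | Category: Professional | Category: Shops | Category: Travel | Subcategory: Afghan | ... | Type 2: Hakka | Type 2: Martial Arts | Type 2: Peruvian | Type 2: Ramen | Type 2: Sushi | Type 2: Szechuan | Type 2: Taiwanese | Type 2: Track | Type 2: Xinjiang | Type 2: Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M3A | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | M3A | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | M3A | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | M3A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | M3A | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5145 | M8Z | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5146 | M8Z | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5147 | M8Z | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5148 | M8Z | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |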


In [31]:
toronto_df = toronto_df.groupby('postal_code').mean()
toronto_df

| postal_code | Category: Arts & Entertainment | Category: College & Education | Category: Food | Category: Nightlife | Category: Outdoors & Recreation | Category: Professional | Category: Shops | Category: Travel | Subcategory: Afghan | Subcategory: African | ... | Type 2: Hakka | Type 2: Martial Arts | Type 2: Peruvian | Type 2: Ramen | Type 2: Sushi | Type 2: Szechuan | Type 2: Taiwanese | Type 2: Track | Type 2: Xinjiang | Type 2: Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1B | 0.26 | 0.0 | 0.38 | 0.00 | 0.16 | 0.0 | 0.18 | 0.02 | 0.0 | 0.0 | ... | 0.02 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| M1C | 0.14 | 0.0 | 0.40 | 0.04 | 0.26 | 0.0 | 0.16 | 0.00 | 0.0 | 0.0 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| M1E | 0.02 | 0.0 | 0.42 | 0.04 | 0.24 | 0.0 | 0.26 | 0.02 | 0.0 | 0.0 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 |
| M1G | 0.00 | 0.0 | 0.46 | 0.06 | 0.18 | 0.0 | 0.28 | 0.02 | 0.0 | 0.0 | ... | 0.02 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 |
| M1H | 0.02 | 0.0 | 0.52 | 0.04 | 0.16 | 0.0 | 0.24 | 0.02 | 0.0 | 0.0 | ... | 0.02 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| M9N | 0.02 | 0.0 | 0.58 | 0.02 | 0.14 | 0.0 | 0.22 | 0.02 | 0.0 | 0.0 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| M9P | 0.02 | 0.0 | 0.56 | 0.02 | 0.16 | 0.0 | 0.20 | 0.04 | 0.0 | 0.0 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| M9R | 0.00 | 0.0 | 0.54 | 0.04 | 0.14 | 0.0 | 0.14 | 0.14 | 0.0 | 0.0 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| M9V | 0.04 | 0.0 | 0.78 | 0.00 | 0.04 | 0.0 | 0.12 | 0.02 | 0.0 | 0.0 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.04 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |


Now I can start clustering.

Because the postal codes are the index rather than a column, I can pass `toronto_df` to the model without dropping anything: the index is not part of the feature matrix.
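For reference, each feature in `toronto_df` is the share of a postal code's venues that fall in a given category, since averaging one-hot columns yields frequencies. Here is a minimal plain-Python sketch of that arithmetic. It uses `Counter` rather than `pd.get_dummies`, and both the venue list and the `category_shares` helper are made up for illustration.

```python
from collections import Counter, defaultdict

def category_shares(venues):
    """Map each postal code to {category: share of that postal code's venues}."""
    counts = defaultdict(Counter)
    for postal_code, category in venues:
        counts[postal_code][category] += 1
    shares = {}
    for postal_code, counter in counts.items():
        total = sum(counter.values())
        shares[postal_code] = {cat: n / total for cat, n in counter.items()}
    return shares

# Made-up (postal code, top-level category) pairs
toy_venues = [
    ('M3A', 'Food'), ('M3A', 'Food'), ('M3A', 'Shops'), ('M3A', 'Outdoors & Recreation'),
    ('M4A', 'Food'), ('M4A', 'Shops'),
]

shares = category_shares(toy_venues)
print(shares['M3A']['Food'])  # 0.5
```

Within one category level, the shares for a postal code sum to 1, so the clustering compares the mix of venues rather than their raw counts.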

In [32]:
clusters = 5

k_means = KMeans(n_clusters=clusters).fit(toronto_df)

k_means.labels_[:10]

array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0], dtype=int32)

In [33]:
latitude = (lat_lng_df['Latitude'].min() + lat_lng_df['Latitude'].max())/2
longitude = (lat_lng_df['Longitude'].min() + lat_lng_df['Longitude'].max())/2

In [34]:
toronto_df['cluster'] = k_means.labels_.astype('int')

## Visualizing the clusters

The clusters are created; now let's see them on a map.

In [35]:
postal_map = folium.Map(location=(latitude, longitude), zoom_start=11)

# Build one hex color per cluster by sampling matplotlib's rainbow colormap at evenly spaced points
colors_array = cm.rainbow(np.linspace(0, 1, clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for index, row in toronto_df.iterrows():
    neigh = postal_df.loc[index].Neighbourhood
    lat = postal_df.loc[index].Latitude
    lng = postal_df.loc[index].Longitude
    label = folium.Popup(f"{neigh}<br />Cluster {int(row.cluster)}")
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[int(row.cluster)],  # labels run 0..clusters-1, so index directly
        fill=True,
        fill_color=rainbow[int(row.cluster)],
        fill_opacity=0.7
    ).add_to(postal_map)

postal_map

## Cluster definitions

I'm going to finally write some code to see what defines the clusters. It will split the top values by category, subcategory, and the two types.

In [36]:
get_columns = lambda prefix: [col for col in toronto_df.columns if col.startswith(prefix)]

category_columns = get_columns('Category')
subcategory_columns = get_columns('Subcategory')
type1_columns = get_columns('Type 1')
type2_columns = get_columns('Type 2')

overall_avgs = toronto_df.mean()

In [37]:
def print_categories(name, series):
    part1 = f"\t{name}: "
    part2 = ', '.join([cat.split(": ")[1] for cat in series.index])
    part3 = "\n\t\t({})\n\n".format(", ".join(['{:.2f}%'.format(val*100) for val in series]))
    return part1 + part2 + part3

for cluster in range(clusters):
    areas = toronto_df[toronto_df.cluster == cluster]
    top_vals = lambda cols: areas[cols].mean().sort_values(ascending=False)[:5]
    top_categories = top_vals(category_columns)
    top_subcategories = top_vals(subcategory_columns)
    top_type_1 = top_vals(type1_columns)
    top_type_2 = top_vals(type2_columns)
    print(
        f"==Cluster {str(cluster)}==\n",
        print_categories('Top categories', top_categories),
        print_categories('Top subcategories', top_subcategories),
        print_categories('Top venue types 1', top_type_1),
        print_categories('Top venue types 2', top_type_2),
        '\n'
    )

==Cluster 0==
 	Top categories: Food, Outdoors & Recreation, Shops, Arts & Entertainment, Nightlife
		(42.92%, 23.23%, 18.00%, 8.92%, 4.46%)

 	Top subcategories: Park, Food & Drink, Asian, Athletics & Sports, Zoo
		(9.23%, 8.15%, 5.38%, 5.23%, 5.23%)

 	Top venue types 1: Zoo Exhibit, Japanese, Gym / Fitness, Grocery Store, Liquor Store
		(4.31%, 2.46%, 2.15%, 2.15%, 2.15%)

 	Top venue types 2: Gym, Sushi, Hakka, Xinjiang, Yoga Studio
		(1.69%, 0.92%, 0.46%, 0.31%, 0.15%)

 

==Cluster 1==
 	Top categories: Food, Shops, Outdoors & Recreation, Arts & Entertainment, Nightlife
		(54.76%, 24.82%, 11.00%, 4.29%, 2.18%)

 	Top subcategories: Asian, Food & Drink, Athletics & Sports, Coffee Shop, Middle Eastern
		(11.18%, 8.82%, 6.00%, 5.76%, 3.12%)

 	Top venue types 1: Japanese, Gym / Fitness, Grocery Store, Chinese, Liquor Store
		(3.53%, 3.24%, 2.82%, 2.59%, 1.94%)

 	Top venue types 2: Gym, Sushi, Ramen, Climbing Gym, Cantonese
		(1.47%, 1.29%, 0.53%, 0.41%, 0.41%)

 

==Cluster 2==
 	T