# Toronto Data Segmentation and Clustering
By Jonathan Dietrich

## Part 1: Get the data and prepare it

In [1]:
import numpy as np
import pandas as pd

Load the Toronto postal code table from the Wikipedia page into a pandas DataFrame and remove the postal codes with no assigned borough.

In [2]:
table_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_postal = pd.read_html(table_url, header=0)[0]
df_postal = df_postal[df_postal.Borough != 'Not assigned']
df_postal = df_postal.reset_index(drop=True)  # renumber rows 0..n-1 and discard the old index
assert df_postal['Postal Code'].is_unique
df_postal.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [3]:
df_postal.shape

(103, 3)
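
The filtering step can be checked on a small hand-made table without fetching the Wikipedia page. The following is a minimal sketch: the helper name `drop_unassigned` is hypothetical, and the three sample rows are illustrative only, not a copy of the live table.

```python
import pandas as pd


def drop_unassigned(table, placeholder='Not assigned'):
    """Return the rows whose Borough is assigned, renumbered from 0."""
    kept = table[table['Borough'] != placeholder]
    return kept.reset_index(drop=True)


# illustrative rows only (assumption: same column names as the Wikipedia table)
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

cleaned = drop_unassigned(sample)
print(cleaned)
```

Using `reset_index(drop=True)` avoids creating and then deleting a temporary `index` column.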

## Part 2: Get the coordinates for the postal codes

In [4]:
# ! pip install geocoder
import geocoder

Load the coordinates from a CSV file instead, because the geocoder API was not working.

In [5]:
df_coords = pd.read_csv('Geospatial_Coordinates.csv')
df_coords.set_index('Postal Code', inplace=True)
df_coords.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


Add the coordinates to df_postal with a left join on the postal code and store the result in df.

In [6]:
df = df_postal.join(df_coords, on='Postal Code', how='left')
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
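
A left join keeps every postal code even when the coordinate file has no matching row, in which case the coordinates come back as NaN. The sketch below shows one way to detect such gaps. The helper name `join_coordinates` and the postal code `M0Z` are made up for illustration; the two coordinate pairs are taken from the table above.

```python
import pandas as pd


def join_coordinates(postal, coords):
    """Left-join coordinates and list postal codes that found no match."""
    merged = postal.join(coords, on='Postal Code', how='left')
    missing = merged.loc[merged['Latitude'].isna(), 'Postal Code'].tolist()
    return merged, missing


postal = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M0Z'],  # 'M0Z' is a made-up code
    'Borough': ['North York', 'North York', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
}).set_index('Postal Code')

merged, missing = join_coordinates(postal, coords)
print(merged)
print('Postal codes without coordinates:', missing)
```

If the list of missing codes is not empty, those rows should be fixed or dropped before mapping and clustering.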


## Part 3: Explore and Cluster Neighborhoods

In [20]:
import json # library to handle JSON files

#! pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas DataFrame (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#! pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Only use boroughs that contain the word Toronto.

In [53]:
# filter first, then renumber the rows; avoids modifying a slice of df in place
toronto_df = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


Count boroughs and neighborhoods.

In [12]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto_df['Borough'].unique()),
        toronto_df.shape[0]
    )
)

The dataframe has 4 boroughs and 39 neighborhoods.


Show neighborhoods on map.

In [24]:
# get mean latitude and longitude
latitude_mean = toronto_df.Latitude.mean()
longitude_mean = toronto_df.Longitude.mean()

# create map of Toronto centered on the mean latitude and longitude
map_toronto = folium.Map(location=[latitude_mean, longitude_mean], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

The Foursquare credentials and API version are defined in a separate config file that is kept out of the git repository, so the credentials are not published.

In [33]:
from foursquare_config import CLIENT_ID, CLIENT_SECRET, VERSION
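
The contents of `foursquare_config` are not shown here. One possible way to write such a module, sketched under the assumption that the secrets live in environment variables, is given below. The variable names `FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET` and `FOURSQUARE_VERSION`, the helper `load_foursquare_credentials`, and all values in the demo call are hypothetical.

```python
import os


def load_foursquare_credentials(env=None):
    """Read the Foursquare settings from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    required = ('FOURSQUARE_CLIENT_ID', 'FOURSQUARE_CLIENT_SECRET', 'FOURSQUARE_VERSION')
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError('Missing settings: ' + ', '.join(missing))
    return {
        'CLIENT_ID': env['FOURSQUARE_CLIENT_ID'],
        'CLIENT_SECRET': env['FOURSQUARE_CLIENT_SECRET'],
        'VERSION': env['FOURSQUARE_VERSION'],
    }


# demo with placeholder values; real values would come from the environment
creds = load_foursquare_credentials({
    'FOURSQUARE_CLIENT_ID': 'my-id',
    'FOURSQUARE_CLIENT_SECRET': 'my-secret',
    'FOURSQUARE_VERSION': '20200101',
})
print(sorted(creds))
```

Failing early with a clear message is easier to debug than a rejected API request later in the notebook.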

### 3a Explore first neighborhood

Let's explore the first postal code in our dataframe.

Get the first row.

In [34]:
toronto_df.loc[0, :]

Postal Code                            M5A
Borough                   Downtown Toronto
Neighbourhood    Regent Park, Harbourfront
Latitude                           43.6543
Longitude                         -79.3606
Name: 0, dtype: object

Get up to 100 venues in the neighborhood Regent Park using a Foursquare GET request.

In [36]:
# search for venues matching the first neighbourhood name within 500 m of its coordinates
search_query = toronto_df.loc[0, 'Neighbourhood'].split(', ')[0]
neighborhood_latitude = toronto_df.loc[0, 'Latitude']
neighborhood_longitude = toronto_df.loc[0, 'Longitude']
radius = 500
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, search_query, radius, LIMIT)
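
Formatting the URL by hand leaves the space in "Regent Park" unescaped until the HTTP library encodes it. As an alternative sketch, the query string can be built from a dictionary with the standard library. The helper name `build_search_url` and the credential and version values in the demo call are placeholders, not values from this notebook.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'


def build_search_url(client_id, client_secret, version, lat, lng, query,
                     radius=500, limit=100):
    """Build the venue search URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    return SEARCH_ENDPOINT + '?' + urlencode(params)


# placeholder credentials and version, coordinates of the first row above
demo_url = build_search_url('ID', 'SECRET', '20200101',
                            43.65426, -79.360636, 'Regent Park')
print(demo_url)
```

With `requests`, the same dictionary can be passed as `requests.get(SEARCH_ENDPOINT, params=params)`, which performs the encoding itself.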

Send the GET request and examine the results.

In [37]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f40dbe9539a1348c59651ff'},
 'response': {'venues': [{'id': '4eef45335c5c794ad4305a1d',
    'name': 'Regent Park School of Music',
    'location': {'address': '534 Queen Street East',
     'lat': 43.65691161146316,
     'lng': -79.3571683258484,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.65691161146316,
       'lng': -79.3571683258484}],
     'distance': 406,
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['534 Queen Street East', 'Toronto ON', 'Canada']},
    'categories': [{'id': '4bf58dd8d48988d1e5931735',
      'name': 'Music Venue',
      'pluralName': 'Music Venues',
      'shortName': 'Music Venue',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/musicvenue_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1598086271',
    'hasPerk': False},
   {'id': '56cde159498e8978c386d953',
    'name': 'Regent P

In [45]:
# function that extracts the category of the venue
def get_category_type(row):
    # search results use 'categories'; flattened explore results use 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

Check how the venue data is structured.

In [46]:
results['response']['venues']

[{'id': '4eef45335c5c794ad4305a1d',
  'name': 'Regent Park School of Music',
  'location': {'address': '534 Queen Street East',
   'lat': 43.65691161146316,
   'lng': -79.3571683258484,
   'labeledLatLngs': [{'label': 'display',
     'lat': 43.65691161146316,
     'lng': -79.3571683258484}],
   'distance': 406,
   'cc': 'CA',
   'city': 'Toronto',
   'state': 'ON',
   'country': 'Canada',
   'formattedAddress': ['534 Queen Street East', 'Toronto ON', 'Canada']},
  'categories': [{'id': '4bf58dd8d48988d1e5931735',
    'name': 'Music Venue',
    'pluralName': 'Music Venues',
    'shortName': 'Music Venue',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/musicvenue_',
     'suffix': '.png'},
    'primary': True}],
  'referralId': 'v-1598086271',
  'hasPerk': False},
 {'id': '56cde159498e8978c386d953',
  'name': 'Regent Park Employment Services',
  'location': {'lat': 43.65785,
   'lng': -79.36189,
   'labeledLatLngs': [{'label': 'display', 'lat': 43.65785

Flatten the JSON structure into a pandas DataFrame.

In [51]:
venues = results['response']['venues']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Regent Park School of Music,Music Venue,43.656912,-79.357168
1,Regent Park Employment Services,Government Building,43.65785,-79.36189
2,Regent Park / Duke of York Junior Public School,Middle School,43.657764,-79.363933
3,Regent Park Focus,Community Center,43.658712,-79.364197
4,Regent Park South Rink,Skating Rink,43.659131,-79.364923


How many venues were returned by Foursquare?

In [52]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

17 venues were returned by Foursquare.


### 3b Explore all neighborhoods in Toronto

Let's create a function that repeats the same process for all the neighborhoods in Toronto.

In [55]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=LIMIT):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # query parameters for the explore endpoint
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': limit,
        }
            
        # make the GET request and stop on HTTP errors
        response = requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
        response.raise_for_status()
        results = response.json()['response']['groups'][0]['items']
        
        # keep only relevant information for each nearby venue
        for v in results:
            categories = v['venue']['categories']
            venues_list.append((
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                categories[0]['name'] if categories else None))  # venue may lack a category

    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Neighborhood', 
        'Neighborhood Latitude', 
        'Neighborhood Longitude', 
        'Venue', 
        'Venue Latitude', 
        'Venue Longitude', 
        'Venue Category'])
    
    return nearby_venues
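
The row-building step of this function can be tried offline, without calling the API. The sketch below is only an illustration: the helper name `items_to_rows` is hypothetical, and the two sample items are invented to mimic the keys that `getNearbyVenues` reads (`venue.name`, `venue.location`, `venue.categories`); they are not real Foursquare results.

```python
def items_to_rows(name, lat, lng, items):
    """Turn explore-style items into one tuple per venue."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        rows.append((
            name,
            lat,
            lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            # some venues carry no category, so fall back to None
            categories[0]['name'] if categories else None,
        ))
    return rows


# invented sample payload (assumption: same nesting as the explore response)
sample_items = [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.655, 'lng': -79.361},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Example Lot',
               'location': {'lat': 43.653, 'lng': -79.359},
               'categories': []}},
]

rows = items_to_rows('Regent Park, Harbourfront', 43.65426, -79.360636, sample_items)
for row in rows:
    print(row)
```

In the notebook itself, the function would presumably be called with the `Neighbourhood`, `Latitude` and `Longitude` columns of `toronto_df`.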