### Coursera Capstone Project - Segmenting and Clustering Neighborhoods in the City of Toronto

Author: Jörn Grimmer
Date: Dec. 2019

### Table of Contents
#### Part I Create initial file of City of Toronto, Postal Codes & Neighborhoods
#### Part II Assign Geospatial Data to Postal Codes
#### Part III Perform Analysis

###  Part I - Create Initial File - Approach
#### 1) Scrape the data from Wikipedia
#### 2) Drop rows where 'Borough' has the value 'Not assigned'
#### 3) Combine the 'Neighborhood' values that share an identical 'PostalCode'
#### 4) Replace 'Neighborhood' values of 'Not assigned' with the value of 'Borough'

#### 1) Scrape the data from Wikipedia
In this notebook, we explore and cluster the neighborhoods in Toronto.
We first build the code to scrape the following Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
To obtain the data in the page's table of postal codes, we transform that table into a pandas dataframe.

In [1]:
# First, we import all required packages, not only for screen scraping, but for clustering and visualization, too.
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes
from bs4 import BeautifulSoup
import urllib3
from urllib.request import urlopen
import requests
import csv

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs # the samples_generator submodule was removed in newer scikit-learn releases

print('Libraries imported.')

Libraries imported.


In [2]:
# Specify the url & load url and get the table of postal codes
# The idea for this program code was found on https://scipython.com/blog/scraping-a-wikipedia-table-with-beautiful-soup/
# get a local copy of the Wikipedia article
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
req = urlopen(url)
article = req.read().decode()
with open('List_of_postal_codes_of_Canada:_M', 'w') as fo:
    fo.write(article)

**Scraping with the Package BeautifulSoup**

Extract all the `<table>` tags and search for the one whose headings correspond to the data we want. Then iterate over its rows, pull out the columns we want, and write the cell text to the file 'List_of_postal_codes_of_Canada:_M' (this overwrites the local HTML copy saved in the previous cell). The file should be interpreted as UTF-8 encoded.

In [3]:
# Load article, turn into soup and get the <table>s.
article = open('List_of_postal_codes_of_Canada:_M').read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

# Search through the tables for the one with the headings we want.
# The page's table has three columns, so compare the first three headings.
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:3] == ['Postcode', 'Borough', 'Neighbourhood']:
        break

# Extract the columns we want and write to a semicolon-delimited text file.
# Note: joining with '; ' leaves a leading space in every field after the first one.
with open('List_of_postal_codes_of_Canada:_M', 'w') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue  # header rows contain only <th> cells
        postcode, borough, neighbourhood = [td.text.strip() for td in tds[:3]]
        print('; '.join([postcode, borough, neighbourhood]), file=fo)

In [4]:
# Read file 'List_of_postal_codes_of_Canada:_M' with Pandas and create Pandas.dataframe
data = pd.read_csv('List_of_postal_codes_of_Canada:_M', sep=";", header=None, names=["PostalCode", "Borough", "Neighborhood_Draft"])
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [5]:
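# Note: sort_values returns a new dataframe; the result is not assigned, so `data` itself stays unsorted
data.sort_values('Borough',ascending=True)
data.shape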

(287, 3)

#### 2) Drop rows where 'Borough' has the value 'Not assigned'

In [6]:
# Find all rows whose 'Borough' value contains 'Not assigned'
data[data['Borough'].str.contains('Not assigned',regex=False)]

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
8,M8A,Not assigned,Not assigned
12,M2B,Not assigned,Not assigned
19,M7B,Not assigned,Not assigned
20,M8B,Not assigned,Not assigned
29,M2C,Not assigned,Not assigned
35,M7C,Not assigned,Not assigned
36,M8C,Not assigned,Not assigned
44,M2E,Not assigned,Not assigned


In [7]:
# Drop all rows where column 'Borough' equals ' Not assigned'
# (the leading space comes from the '; ' separator used when the file was written)
to_drop = [' Not assigned']
data_new = data[~data['Borough'].isin(to_drop)]
data_new.shape

(210, 3)
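
The filter above only matches because every `Borough` value carries a leading space from the `'; '` separator. A minimal sketch on made-up rows (not the scraped data), assuming you would rather strip that whitespace once and then compare against the plain label:

```python
import pandas as pd

# Toy rows mimicking the scraped file, including the leading spaces
data = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A"],
    "Borough": [" Not assigned", " North York", " Downtown Toronto"],
    "Neighborhood_Draft": [" Not assigned", " Parkwoods", " Harbourfront"],
})

# Strip surrounding whitespace from the text columns
text_cols = ["Borough", "Neighborhood_Draft"]
data[text_cols] = data[text_cols].apply(lambda col: col.str.strip())

# Keep only rows whose borough is assigned
data_new = data[data["Borough"] != "Not assigned"].reset_index(drop=True)
print(data_new)
```

An alternative with the same effect would be to pass `skipinitialspace=True` to `pd.read_csv` when loading the file.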

#### 3) Combine the 'Neighborhood' values that share an identical 'PostalCode'

In [8]:
# Group all Postal codes with more than one neighborhood, and join corresponding neighborhoods
data_new = data_new.groupby(['PostalCode','Borough'])['Neighborhood_Draft'].apply(', '.join).reset_index()
data_new.head(103)

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ..."
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [9]:
data_new.shape

(103, 3)
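
The grouping step can be illustrated on a few hand-written rows. This is only a sketch of the `groupby` plus `', '.join` pattern used above; the three rows are illustrative and stand in for the scraped table.

```python
import pandas as pd

rows = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood_Draft": ["Rouge", "Malvern", "Woburn"],
})

# One output row per (PostalCode, Borough); neighborhoods are joined in their original order
combined = (
    rows.groupby(["PostalCode", "Borough"])["Neighborhood_Draft"]
    .apply(", ".join)
    .reset_index()
)
print(combined)
```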

#### 4) Replace 'Neighborhood' values of 'Not assigned' with the value of 'Borough'

In [10]:
# Find rows where value of Neighborhood_Draft is "Not assigned"
data_new[data_new['Neighborhood_Draft'].str.contains('Not assigned')]

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft
85,M7A,Queen's Park,Not assigned


In [11]:
# Replace 'Not assigned' in "Neighborhood" with the borough name (only one row is affected, so the value is hard-coded)
data_final=data_new['Neighborhood_Draft'].str.replace('Not assigned',"Queen's Park")
data_final.head(103)

0                                        Rouge,  Malvern
1               Highland Creek,  Rouge Hill,  Port Union
2                    Guildwood,  Morningside,  West Hill
3                                                 Woburn
4                                              Cedarbrae
5                                    Scarborough Village
6          East Birchmount Park,  Ionview,  Kennedy Park
7                      Clairlea,  Golden Mile,  Oakridge
8       Cliffcrest,  Cliffside,  Scarborough Village ...
9                           Birch Cliff,  Cliffside West
10      Dorset Park,  Scarborough Town Centre,  Wexfo...
11                                    Maryvale,  Wexford
12                                             Agincourt
13             Clarks Corners,  Sullivan,  Tam O'Shanter
14      Agincourt North,  L'Amoreaux East,  Milliken,...
15                                       L'Amoreaux West
16                                           Upper Rouge
17                             

In [12]:
data_submit_draft=data_final.rename("Neighborhood")
data_submit_draft.head()

0                              Rouge,  Malvern
1     Highland Creek,  Rouge Hill,  Port Union
2          Guildwood,  Morningside,  West Hill
3                                       Woburn
4                                    Cedarbrae
Name: Neighborhood, dtype: object

In [13]:
# join dataframe with updated series "Neighborhood"
data_submit=pd.concat([data_new,data_submit_draft],axis=1, join='inner')
data_submit

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern","Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union","Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill","Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn,Woburn
4,M1H,Scarborough,Cedarbrae,Cedarbrae
5,M1J,Scarborough,Scarborough Village,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park","East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge","Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ...","Cliffcrest, Cliffside, Scarborough Village ..."
9,M1N,Scarborough,"Birch Cliff, Cliffside West","Birch Cliff, Cliffside West"


In [14]:
# Drop column "Neighborhood_Draft", create final dataframe
data_submit_final=data_submit.drop(['Neighborhood_Draft'], axis=1)
data_submit_final

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ..."
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [15]:
data_submit_final.shape

(103, 3)
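
Cells 11 to 14 hard-code the replacement value "Queen's Park". A more general sketch, assuming the column names of the final dataframe above, copies the borough wherever the neighborhood is unassigned by using `Series.mask`. The two rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "PostalCode": ["M7A", "M1G"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

# True where the neighborhood is missing (strip guards against stray spaces)
unassigned = df["Neighborhood"].str.strip() == "Not assigned"

# mask() replaces the values where the condition is True with the borough of the same row
df["Neighborhood"] = df["Neighborhood"].mask(unassigned, df["Borough"])
print(df)
```

This avoids the intermediate series, the `concat`, and the dropped draft column, and it keeps working if more than one postal code is unassigned.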

#### Part II Assign Geospatial Data to Postal Codes

In [16]:
# Read file 'Geospatial_Coordinates.csv' with Pandas and create Pandas.dataframe
geospatial= pd.read_csv('Geospatial_Coordinates.csv', sep=",")
geospatial.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
# Join dataframes 'data_submit_final' and 'geospatial'
All_data_draft=pd.concat([data_submit_final,geospatial],axis=1, join='inner')
All_data_draft

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,M1J,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",M1K,43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",M1L,43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ...",M1M,43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",M1N,43.692657,-79.264848


In [18]:
# Drop column "Postal Code", create final dataframe
All_data=All_data_draft.drop(['Postal Code'], axis=1)
All_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ...",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [19]:
All_data.shape

(103, 5)
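
`pd.concat(..., axis=1)` aligns the two frames on their row index, so cell 17 is only correct because both frames happen to list the postal codes in the same order. A hedged alternative that matches rows on the key itself uses `DataFrame.merge`; the toy frames below reuse two rows from the outputs above and are deliberately in different orders.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],  # reversed order on purpose
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Match on the postal code columns, then drop the duplicate key column
merged = (
    neighborhoods
    .merge(coords, left_on="PostalCode", right_on="Postal Code", how="left")
    .drop(columns="Postal Code")
)
print(merged)
```

With `how="left"`, a postal code without coordinates would show up as `NaN` rather than silently shifting rows.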

#### Part III Perform Analysis

In [20]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed yet
import folium # map rendering library
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



#### Use geopy library to get the latitude and longitude values of Toronto.

In [21]:
address = 'Toronto'

geolocator = Nominatim(user_agent="joern.grimmer@arcor.de")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [22]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(All_data['Latitude'], All_data['Longitude'], All_data['Borough'], All_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

For the purpose of this exercise, we simplify the above map and segment and cluster only part of the city. We slice the original dataframe and create a new dataframe that contains only the boroughs whose name includes the word Toronto.

In [23]:
# Select boroughs that contain the word Toronto
toronto_data = All_data[All_data['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [24]:
# create map of Toronto_only using latitude and longitude values
map_toronto_only = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_only)  
    
map_toronto_only

In [25]:
CLIENT_ID = 'TVWVE10530JQXGX42ZBNVUKQKY4CMCI1DL4C00QBN301H55J' # your Foursquare ID
CLIENT_SECRET = 'SQ54GUPAO3NWOP4YR1WW403BPLLDGNCFNSSF15WVXDECCC0J' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: TVWVE10530JQXGX42ZBNVUKQKY4CMCI1DL4C00QBN301H55J
CLIENT_SECRET:SQ54GUPAO3NWOP4YR1WW403BPLLDGNCFNSSF15WVXDECCC0J


#### We explore the first neighborhood in our Toronto dataframe.

In [26]:
# Get the name of the first neighborhood in the dataframe
toronto_data.loc[0, 'Neighborhood']

' The Beaches'

In [27]:
# Get the neighborhood's geospatial values
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of  The Beaches are 43.67635739999999, -79.2930312.


In [28]:
#### Get the top 100 venues that are in The Beaches within a radius of 500 meters.

In [29]:
# First, create the GET request URL and store it in the variable `url`.
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=TVWVE10530JQXGX42ZBNVUKQKY4CMCI1DL4C00QBN301H55J&client_secret=SQ54GUPAO3NWOP4YR1WW403BPLLDGNCFNSSF15WVXDECCC0J&v=20180604&ll=43.67635739999999,-79.2930312&radius=500&limit=100'
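
String formatting works, but the query string can also be assembled from a dictionary. The sketch below uses the standard library's `urllib.parse.urlencode`. The credentials are placeholders, and `build_explore_url` is a hypothetical helper name introduced here for illustration, not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the explore URL from named parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes reserved characters, so the comma in `ll` becomes %2C
    return BASE_URL + "?" + urlencode(params)

url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180604",
                        43.6763574, -79.2930312)
print(url)
```

Keeping the parameters in a dictionary also makes it easy to pass them to `requests.get(BASE_URL, params=params)` instead of building the string by hand.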

In [30]:
# Send the get request - Examine results
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e033d5247e0d6001b3c38f1'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 5,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682094413784,
          'lng': -79.29394208780985}],
        'distanc

All the venue information is stored under the *items* key. To extract each venue's category, we define the **get_category_type** function.

In [31]:
# function that extracts the category of the venue
def get_category_type(row):
    # the key is 'categories' in a raw record and 'venue.categories' after flattening
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
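
To see what the function returns, it can be called on hand-written records. The dictionaries below are illustrative stand-ins for rows of the flattened Foursquare response, not real API output.

```python
def get_category_type(row):
    """Return the name of the first category of a venue record, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Illustrative records: flattened key, plain key, and a venue without categories
flat_row = {'venue.name': 'Glen Manor Ravine', 'venue.categories': [{'name': 'Trail'}]}
plain_row = {'name': 'Grover Pub and Grub', 'categories': [{'name': 'Pub'}]}
empty_row = {'venue.name': 'Unknown venue', 'venue.categories': []}

for record in (flat_row, plain_row, empty_row):
    print(get_category_type(record))
```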

In [32]:
# We need the json library and pandas' json_normalize to perform the next step
import json # library to handle JSON files
from pandas import json_normalize # top-level import (pandas >= 1.0); pandas.io.json.json_normalize is deprecated

In [33]:
# Now we are ready to clean the json and structure it into a *pandas* dataframe.
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869
4,Dip 'n Sip,Coffee Shop,43.678897,-79.297745
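
The flattening step is easier to follow on a small nested structure. The sketch below assumes pandas 1.0 or later (where `pd.json_normalize` is available) and uses two hand-written items shaped like the response excerpt above; it is not real API output.

```python
import pandas as pd

items = [
    {"venue": {"name": "Glen Manor Ravine",
               "categories": [{"name": "Trail"}],
               "location": {"lat": 43.676821, "lng": -79.293942}}},
    {"venue": {"name": "Grover Pub and Grub",
               "categories": [{"name": "Pub"}],
               "location": {"lat": 43.679181, "lng": -79.297215}}},
]

# Nested dictionaries become dotted column names; lists are kept as single cells
flat = pd.json_normalize(items)

# Reduce each category list to the name of its first entry
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Keep only the last part of each dotted column name
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```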


In [34]:
# count # of venues that were found by Foursquare.
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues were returned by Foursquare.


### Explore neighborhoods in Toronto

#### Create a function that repeats the same process for all the neighborhoods in Toronto

In [35]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [36]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

 The Beaches
 The Danforth West,  Riverdale
 The Beaches West,  India Bazaar
 Studio District
 Lawrence Park
 Davisville North
 North Toronto West
 Davisville
 Moore Park,  Summerhill East
 Deer Park,  Forest Hill SE,  Rathnelly,  South Hill,  Summerhill West
 Rosedale
 Cabbagetown,  St. James Town
 Church and Wellesley
 Harbourfront
 Ryerson,  Garden District
 St. James Town
 Berczy Park
 Central Bay Street
 Adelaide,  King,  Richmond
 Harbourfront East,  Toronto Islands,  Union Station
 Design Exchange,  Toronto Dominion Centre
 Commerce Court,  Victoria Hotel
 Roselawn
 Forest Hill North,  Forest Hill West
 The Annex,  North Midtown,  Yorkville
 Harbord,  University of Toronto
 Chinatown,  Grange Park,  Kensington Market
 CN Tower,  Bathurst Quay,  Island airport,  Harbourfront West,  King and Spadina,  Railway Lands,  South Niagara
 Stn A PO Boxes 25 The Esplanade
 First Canadian Place,  Underground city
 Christie
 Dovercourt Village,  Dufferin
 Little Portugal,  Trinity
 Brockton,

In [37]:
# Size of the resulting dataframe
print(toronto_venues.shape)
toronto_venues.head()

(1682, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.676357,-79.293031,Dip 'n Sip,43.678897,-79.297745,Coffee Shop


In [38]:
# Let's check how many venues were returned for each neighborhood
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,57,57,57,57,57,57
"Brockton, Exhibition Place, Parkdale Village",21,21,21,21,21,21
Business Reply Mail Processing Centre 969 Eastern,19,19,19,19,19,19
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",17,17,17,17,17,17
"Cabbagetown, St. James Town",46,46,46,46,46,46
Central Bay Street,82,82,82,82,82,82
"Chinatown, Grange Park, Kensington Market",94,94,94,94,94,94
Christie,18,18,18,18,18,18
Church and Wellesley,85,85,85,85,85,85


In [39]:
# Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 237 unique categories.


#### Analyze Each Neighborhood

In [40]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [41]:
# size of the dataframe
toronto_onehot.shape

(1682, 237)

#### Group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [42]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eas...,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, H...",0.0,0.0,0.058824,0.058824,0.058824,0.117647,0.117647,0.117647,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,...,0.0,0.0,0.0,0.0,0.0,0.012195,0.0,0.0,0.012195,0.0
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.042553,0.0,0.053191,0.010638,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.023529,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.011765,...,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.011765,0.0,0.011765


In [43]:
# size of the dataframe
toronto_grouped.shape

(38, 237)
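
The one-hot encoding and the per-neighborhood mean can be checked by hand on a toy table. This is a sketch with made-up venues. It uses `DataFrame.insert` to add the neighborhood column, which raises an error if a venue category is itself called `Neighborhood` (one such category appears in the sample output above) instead of silently overwriting that category column.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B"],
    "Venue Category": ["Cafe", "Park", "Cafe", "Cafe", "Pub"],
})

# One indicator column per category (dtype=int gives 0/1 instead of booleans)
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the neighborhood first; insert() refuses to overwrite an existing column
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of 0/1 indicators is the share of venues in each category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Neighborhood A has one cafe out of two venues, so its `Cafe` frequency is 0.5; neighborhood B has two cafes out of three venues.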

#### Print each neighborhood along with the top 5 most common venues

In [44]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Adelaide,  King,  Richmond----
              venue  freq
0       Coffee Shop  0.08
1              Café  0.05
2        Steakhouse  0.04
3               Bar  0.04
4  Asian Restaurant  0.03


---- Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2          Steakhouse  0.04
3   French Restaurant  0.04
4  Seafood Restaurant  0.04


---- Brockton,  Exhibition Place,  Parkdale Village----
            venue  freq
0            Café  0.14
1     Coffee Shop  0.10
2  Breakfast Spot  0.10
3   Burrito Place  0.05
4   Grocery Store  0.05


---- Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0  Light Rail Station  0.11
1         Yoga Studio  0.05
2       Garden Center  0.05
3                Park  0.05
4          Comic Shop  0.05


---- CN Tower,  Bathurst Quay,  Island airport,  Harbourfront West,  King and Spadina,  Railway Lands,  South Niagara----
              venue  freq
0    Airport Lounge  0.12
1   Airpor

#### Create pandas dataframe and put the data into it

In [45]:
# first write a function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
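
As a quick check of the helper, it can be applied to a single hand-written row. The frequencies below are illustrative; the first entry plays the role of the `Neighborhood` column that the function skips.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the neighborhood name
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# One row of a grouped frequency table: name first, then category frequencies
row = pd.Series({"Neighborhood": "Example", "Cafe": 0.5, "Park": 0.1, "Pub": 0.4})

top = return_most_common_venues(row, 2)
print(list(top))
```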

In [46]:
# Create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
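for ind in np.arange(num_top_venues):
    try:
        # ranks 1 to 3 take their suffix from `indicators`
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every later rank falls back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))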

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Salad Place,Asian Restaurant,Restaurant,Burger Joint,Bakery,Thai Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Café,Cheese Shop,Farmers Market,Seafood Restaurant,French Restaurant,Beer Bar,Steakhouse,Bakery
2,"Brockton, Exhibition Place, Parkdale Village",Café,Breakfast Spot,Coffee Shop,Furniture / Home Store,Convenience Store,Burrito Place,Stadium,Italian Restaurant,Bar,Restaurant
3,Business Reply Mail Processing Centre 969 Eas...,Light Rail Station,Auto Workshop,Park,Comic Shop,Pizza Place,Recording Studio,Restaurant,Burrito Place,Brewery,Skate Park
4,"CN Tower, Bathurst Quay, Island airport, H...",Airport Lounge,Airport Service,Airport Terminal,Harbor / Marina,Bar,Plane,Coffee Shop,Rental Car Location,Sculpture Garden,Boat or Ferry


####  Cluster Neighborhoods

Run *k*-means to group the neighborhoods into 5 clusters.

In [47]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
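
The first ten labels above all fall into cluster 0, so it is worth counting how many neighborhoods each cluster contains before interpreting the result. The sketch below shows the idea on four made-up frequency vectors with two clusters; `n_init=10` is set explicitly as an assumption so that the behavior does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of made-up frequency vectors
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = km.labels_

# Number of rows assigned to each cluster label
sizes = np.bincount(labels, minlength=2)
print(labels, sizes)
```

Applied to the notebook's data, `np.bincount(kmeans.labels_)` would show whether one cluster dominates the other four.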

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [48]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# join neighborhoods_venues_sorted to toronto_data so each neighborhood keeps its latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0.0,Pub,Health Food Store,Coffee Shop,Trail,Wings Joint,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0.0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Restaurant,Italian Restaurant,Bookstore,Furniture / Home Store,Indian Restaurant,Fruit & Vegetable Store,Juice Bar
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0.0,Park,Sandwich Place,Pet Store,Pub,Burger Joint,Burrito Place,Fast Food Restaurant,Fish & Chips Shop,Italian Restaurant,Steakhouse
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0.0,Café,Coffee Shop,Bakery,Gastropub,Italian Restaurant,American Restaurant,Yoga Studio,Convenience Store,Seafood Restaurant,Sandwich Place
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4.0,Gym / Fitness Center,Park,Bus Line,Swim School,Discount Store,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


In [49]:
toronto_merged.shape

(39, 16)
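The cluster labels in the table above appear as floats (0.0, 4.0) because `join` performs a left join by default: a neighborhood without a match on the right-hand side gets `NaN`, which forces the integer column to float. A minimal sketch with made-up data, where `left` and `right` are hypothetical stand-ins for `toronto_data` and `neighborhoods_venues_sorted`:

```python
import pandas as pd

# Toy stand-ins (values are made up for illustration).
left = pd.DataFrame({"Neighborhood": ["A", "B", "C"], "Latitude": [43.1, 43.2, 43.3]})
right = pd.DataFrame({"Neighborhood": ["A", "B"], "Cluster Labels": [0, 4]})

# Left join on the 'Neighborhood' column; "C" has no match on the right.
merged = left.join(right.set_index("Neighborhood"), on="Neighborhood")

print(merged["Cluster Labels"].tolist())  # [0.0, 4.0, nan]
print(merged["Cluster Labels"].dtype)     # float64
```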

Finally, let's visualize the resulting clusters.

In [50]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [51]:
toronto_merged['Cluster Labels'].isnull()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38     True
Name: Cluster Labels, dtype: bool
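Printing the full boolean Series works for 39 rows, but counting the missing values and listing only the affected rows is more compact. A minimal sketch on made-up data; `toy` is a hypothetical stand-in for `toronto_merged`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for toronto_merged (values are made up).
toy = pd.DataFrame(
    {"Neighborhood": ["A", "B", "C"], "Cluster Labels": [0.0, 4.0, np.nan]}
)

# Number of rows without a cluster label.
n_missing = toy["Cluster Labels"].isnull().sum()

# Names of the neighborhoods that have no cluster label.
missing_rows = toy.loc[toy["Cluster Labels"].isnull(), "Neighborhood"].tolist()

print(n_missing, missing_rows)  # 1 ['C']
```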

In [52]:
# drop rows with missing values; .copy() makes this an independent dataframe,
# so the assignment in the next cell does not trigger a SettingWithCopyWarning
toronto_merged_clean = toronto_merged.dropna().copy()
toronto_merged_clean.shape

(38, 16)

In [53]:
# the join turned the labels into floats; convert them back to integers
toronto_merged_clean['Cluster Labels'] = toronto_merged_clean['Cluster Labels'].astype(int)


In [55]:
import folium # map rendering library

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per neighborhood, colored by its cluster label
for lat, lon, poi, cluster in zip(toronto_merged_clean['Latitude'], toronto_merged_clean['Longitude'], toronto_merged_clean['Neighborhood'], toronto_merged_clean['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
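The color scheme only needs one hex color per cluster label. The following minimal sketch builds such a palette without folium; `n_clusters`, `palette` and `color_for` are hypothetical names used for illustration only.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters = 5  # assumed to match kclusters in the notebook

# Sample the rainbow colormap at evenly spaced points and convert to hex strings.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, n_clusters))]

# Map each cluster label (0 .. n_clusters - 1) directly to its color.
color_for = {label: palette[label] for label in range(n_clusters)}

print(color_for)
```

Indexing the palette by the label itself keeps the mapping between cluster label and color easy to read in the map legend or popups.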

#### 5. Examine Clusters
Examine each cluster and determine the venue categories that distinguish it. Based on these defining categories, assign a name to each cluster.

In [58]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 0, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Pub,Health Food Store,Coffee Shop,Trail,Wings Joint,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
1,East Toronto,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Restaurant,Italian Restaurant,Bookstore,Furniture / Home Store,Indian Restaurant,Fruit & Vegetable Store,Juice Bar
2,East Toronto,0,Park,Sandwich Place,Pet Store,Pub,Burger Joint,Burrito Place,Fast Food Restaurant,Fish & Chips Shop,Italian Restaurant,Steakhouse
3,East Toronto,0,Café,Coffee Shop,Bakery,Gastropub,Italian Restaurant,American Restaurant,Yoga Studio,Convenience Store,Seafood Restaurant,Sandwich Place
5,Central Toronto,0,Gym,Breakfast Spot,Food & Drink Shop,Hotel,Clothing Store,Sandwich Place,Convenience Store,Park,Ethiopian Restaurant,Discount Store
6,Central Toronto,0,Sporting Goods Shop,Coffee Shop,Clothing Store,Yoga Studio,Restaurant,Rental Car Location,Miscellaneous Shop,Bagel Shop,Mexican Restaurant,Dessert Shop
7,Central Toronto,0,Sandwich Place,Dessert Shop,Gym,Italian Restaurant,Café,Coffee Shop,Pizza Place,Sushi Restaurant,Gourmet Shop,Asian Restaurant
8,Central Toronto,0,Restaurant,Playground,Summer Camp,Trail,Tennis Court,Doner Restaurant,Diner,Discount Store,Dog Run,Dumpling Restaurant
9,Central Toronto,0,Pub,Coffee Shop,Bagel Shop,Vietnamese Restaurant,Liquor Store,Supermarket,Restaurant,Pizza Place,Sushi Restaurant,Light Rail Station
11,Downtown Toronto,0,Pizza Place,Coffee Shop,Café,Restaurant,Bakery,Pub,Italian Restaurant,Chinese Restaurant,Playground,Plaza


In [59]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 1, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Downtown Toronto,1,Park,Playground,Trail,Wings Joint,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


In [60]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 2, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,2,Garden,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


In [62]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 3, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,Central Toronto,3,Trail,Jewelry Store,Mexican Restaurant,Sushi Restaurant,Wings Joint,Discount Store,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant


In [63]:
toronto_merged_clean.loc[toronto_merged_clean['Cluster Labels'] == 4, toronto_merged_clean.columns[[1] + list(range(5, toronto_merged_clean.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,4,Gym / Fitness Center,Park,Bus Line,Swim School,Discount Store,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


According to the analysis, the neighborhoods in Toronto fall into the following five clusters, listed in the order of cluster labels 0 to 4:

1) Food, Coffee, Lunch & Dine

2) Park & Playground

3) Garden & Events

4) Trail & Shop

5) Leisure & Sports
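The names above were assigned by reading the per-cluster tables. As a rough cross-check, the cluster size and the most frequent top venue per cluster can be computed with a `groupby`. This is a minimal sketch on made-up data; `toy` is a hypothetical stand-in for `toronto_merged_clean`, and the result only suggests candidate names rather than replacing the manual reading.

```python
import pandas as pd

# Toy stand-in for toronto_merged_clean (values are made up).
toy = pd.DataFrame(
    {
        "Cluster Labels": [0, 0, 0, 1, 2],
        "1st Most Common Venue": ["Coffee Shop", "Pub", "Coffee Shop", "Park", "Garden"],
    }
)

# For each cluster: number of neighborhoods and the most frequent top venue.
summary = toy.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    n_neighborhoods="count",
    top_venue=lambda s: s.value_counts().idxmax(),
)

print(summary)
```

A cluster whose `n_neighborhoods` is 1 is described by a single neighborhood, so its name should be read with that in mind.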