# Segmenting and Clustering Neighborhoods in Toronto
### Applied Data Science Capstone, Peer-graded Assignment, Week 3

This notebook explores, segments, and clusters the neighborhoods of the city of Toronto.

## Part 1: Scraping data from Wikipedia

### 1. Convert html table into pandas dataframe

Load libraries:

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

Get the html of the page:

In [2]:
# Create Beautiful Soup object from the html
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
soup = BeautifulSoup(page.content,'lxml')

In [3]:
# Check the type
type(soup)

bs4.BeautifulSoup

In [4]:
# Check the title
soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

First convert the HTML table into a list of rows, which is then converted into a pandas dataframe:

In [5]:
# Store the table with neighborhoods
table = soup.find_all('table')[0]

In [6]:
# Get the rows of the table
rows = table.find_all('tr')

# Print the first three rows (including header)
rows[0:3]

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>,
 <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>,
 <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>]

In [7]:
# Create a list with the text of every row (the header row yields an empty entry)
list_rows = []

for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    list_rows.append(cleantext)

# Look at the first three entries (the first one comes from the header row)
list_rows[0:3]

['[]',
 '[M1A, Not assigned, Not assigned\n]',
 '[M2A, Not assigned, Not assigned\n]']

The first element of `list_rows` is `'[]'` because the first row is the header, which contains `th` cells instead of `td` cells.
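
Stringifying each row and splitting it later works for this table, but it would break if a cell itself contained a comma. As a hedged alternative, the sketch below reads the text of each `td` cell directly. It runs on a small inline HTML snippet that only mimics the layout of the Wikipedia table, and the names `sample_soup` and `records` are illustrative, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Small inline table that mimics the layout of the Wikipedia table (illustrative data)
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

# 'html.parser' ships with Python, so this sketch needs no extra parser
sample_soup = BeautifulSoup(html, 'html.parser')

records = []
for row in sample_soup.find_all('tr'):
    # get_text(strip=True) removes the trailing newline in the last cell
    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        records.append(cells)

print(records)
```

A list of lists like `records` could be passed to `pd.DataFrame` together with a list of column names, which would make the later split and strip steps unnecessary.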

In [8]:
# Convert the list of rows into pandas dataframe
df = pd.DataFrame(list_rows[1:])

In [9]:
# Look at the first few rows
df.head()

Unnamed: 0,0
0,"[M1A, Not assigned, Not assigned\n]"
1,"[M2A, Not assigned, Not assigned\n]"
2,"[M3A, North York, Parkwoods\n]"
3,"[M4A, North York, Victoria Village\n]"
4,"[M5A, Downtown Toronto, Harbourfront\n]"


In [10]:
# Check the size
df.shape

(287, 1)

### 2. Clean the dataframe

The dataframe is not yet in the desired shape. Several adjustments are needed:
- split one column with postal code, borough and neighborhood into multiple columns
- remove any unnecessary characters (brackets, leading/trailing spaces)
- add column names
- remove rows with a borough that is Not assigned
- join the rows with the same postal code into one row and drop duplicate rows afterwards
- use borough name as neighborhood name for rows with neighborhood that is not assigned
- reset index after all steps

In [11]:
# Split the column into multiple columns
df = df[0].str.split(',', expand=True)

# Remove brackets [] and \n
df[0] = df[0].str.strip('[')
df[2] = df[2].str.strip('\n]')

# Remove leading and trailing spaces
for column in df.columns:
    df[column] = df[column].str.strip()

# Assign column names
col_labels = ['PostalCode', 'Borough', 'Neighborhood']
df.columns = col_labels

# Look at the first ten rows in the dataframe
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue


In [12]:
# Remove rows with a Borough that is "Not assigned"
print('The number of rows with a Borough that is "Not assigned" before removing: ', df[df['Borough'] == 'Not assigned'].shape[0])

# Create the mask to filter for the respective rows
mask = df['Borough'] == 'Not assigned'

# Remove rows
df = df[~mask]

# Check the outcome
print('The number of rows with a Borough that is "Not assigned" after removing: ', df[df['Borough'] == 'Not assigned'].shape[0])

The number of rows with a Borough that is "Not assigned" before removing:  77
The number of rows with a Borough that is "Not assigned" after removing:  0


In [13]:
# Check the size of the dataframe
df.shape

(210, 3)

In [14]:
# Check the first ten rows
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Etobicoke,Islington Avenue
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


In [15]:
# Join the rows with the same postal code into one row and drop duplicate rows afterwards
df.loc[:, 'Neighborhood'] = df.groupby('PostalCode')['Neighborhood'].transform(', '.join)
df.drop_duplicates(inplace=True)

# Check the result
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,"Lawrence Heights, Lawrence Manor"
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Etobicoke,Islington Avenue
10,M1B,Scarborough,"Rouge, Malvern"
13,M3B,North York,Don Mills North
14,M4B,East York,"Woodbine Gardens, Parkview Hill"
16,M5B,Downtown Toronto,"Ryerson, Garden District"


In [16]:
# Check the size of the reduced dataframe
df.shape

(103, 3)

In [17]:
# If a row has a borough but a "Not assigned" neighborhood, use the borough name as the neighborhood
# Checking the neighborhood alone is enough because rows with a "Not assigned" borough were dropped earlier
print('Number of rows with a borough but without neighborhood before: ', df[df['Neighborhood'] == 'Not assigned'].shape[0])

# Replace neighborhood with borough if necessary
df.loc[:, 'Neighborhood'] = df.apply(lambda row: row['Borough'] if row['Neighborhood'] == 'Not assigned' else row['Neighborhood'], axis=1)

# Check that all neighborhoods are assigned
print('Number of rows with a borough but without neighborhood after: ', df[df['Neighborhood'] == 'Not assigned'].shape[0])

Number of rows with a borough but without neighborhood before:  0
Number of rows with a borough but without neighborhood after:  0


In [18]:
# Reset index
df.reset_index(drop=True, inplace=True)

__Final dataframe:__

In [19]:
# Check the final dataframe
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


In [20]:
# Check the size of the final dataframe
df.shape

(103, 3)
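
The cleaning steps above can also be expressed as one short chain. The following is a minimal sketch on a made-up dataframe, not the notebook's data: the `M0X` row and `Example Borough` are hypothetical and only demonstrate the borough fallback. It fills unassigned neighborhoods before grouping, which is a slightly different order from the cells above.

```python
import pandas as pd

# Illustrative input; the M0X row is hypothetical
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M6A', 'M6A', 'M0X', 'M9A'],
    'Borough': ['Not assigned', 'North York', 'North York', 'Example Borough', 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Lawrence Heights', 'Lawrence Manor',
                     'Not assigned', 'Islington Avenue'],
})

# Drop rows without a borough
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# Use the borough name where the neighborhood is not assigned
missing = assigned['Neighborhood'] == 'Not assigned'
assigned.loc[missing, 'Neighborhood'] = assigned.loc[missing, 'Borough']

# Combine neighborhoods that share a postal code; sort=False keeps the original order
clean = (assigned.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
         .agg(', '.join)
         .reset_index())

print(clean)
```

Grouping and aggregating in one step avoids the separate `drop_duplicates` and `reset_index` calls.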

## Part 2: Get the latitude and the longitude coordinates of each neighborhood

Use the [pgeocode](https://pypi.org/project/pgeocode/) library to obtain coordinates for the neighborhoods. (The `geocoder.google` function from the [Geocoder package](https://geocoder.readthedocs.io/index.html) did not work.)

In [21]:
# Import the library
import pgeocode

In [22]:
# Create a geolocator instance for Canada
nomi = pgeocode.Nominatim('ca')
nomi

<pgeocode.Nominatim at 0xcaad450>

Check the functionality on the neighborhood with postal code M5G:

In [23]:
nomi.query_postal_code('M5G')

postal_code                                         M5G
country code                                         CA
place_name        Downtown Toronto (Central Bay Street)
state_name                                      Ontario
state_code                                           ON
county_name                                     Toronto
county_code                                 8.13339e+06
community_name                                      NaN
community_code                                      NaN
latitude                                        43.6564
longitude                                       -79.386
accuracy                                              6
Name: 0, dtype: object

The coordinates are stored in the `latitude` and `longitude` attributes.

Let's write a loop that assigns latitude and longitude coordinates to every neighborhood in our dataset:

In [24]:
# Define a list to store postal code, latitude and longitude
coordinates = []

for pcode in df['PostalCode']:
    # Query each postal code once and read both coordinates from the result
    result = nomi.query_postal_code(pcode)

    # Add postal code, latitude and longitude to coordinates
    coordinates.append([pcode, result.latitude, result.longitude])

Create a dataframe from coordinates and add column headers:

In [25]:
# Create dataframe from coordinates
df_coord = pd.DataFrame(coordinates)

# Add column headers
df_coord.columns = ['PostalCode', 'Latitude', 'Longitude']

# Check the size of the new dataframe
print('Size of the dataframe with postal code, latitude and longitude:\n {} rows, {} columns\n\n'.format(df_coord.shape[0], df_coord.shape[1]))

# Check the result
df_coord.head()

Size of the dataframe with postal code, latitude and longitude:
 103 rows, 3 columns




Unnamed: 0,PostalCode,Latitude,Longitude
0,M3A,43.7545,-79.33
1,M4A,43.7276,-79.3148
2,M5A,43.6555,-79.3626
3,M6A,43.7223,-79.4504
4,M7A,43.6641,-79.3889


Join the new dataframe containing coordinates with the old dataframe:

In [26]:
# Use join based on PostalCode column
# PostalCode must be set as an index for both dataframes
# Remove index at the end
df = df.set_index('PostalCode').join(df_coord.set_index('PostalCode')).reset_index()

# Check the size of the new dataframe
print('Size of the dataframe with postal code, latitude and longitude:\n {} rows, {} columns\n\n'.format(df.shape[0], df.shape[1]))

# Check the result
df.head(10)

Size of the dataframe with postal code, latitude and longitude:
 103 rows, 5 columns




Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,Harbourfront,43.6555,-79.3626
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.7223,-79.4504
4,M7A,Downtown Toronto,Queen's Park,43.6641,-79.3889
5,M9A,Etobicoke,Islington Avenue,43.6662,-79.5282
6,M1B,Scarborough,"Rouge, Malvern",43.8113,-79.193
7,M3B,North York,Don Mills North,43.745,-79.359
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.7063,-79.3094
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.6572,-79.3783
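
The join above relies on `PostalCode` being unique in both dataframes. As a hedged sketch, the same step can be written with `DataFrame.merge`, whose `validate` argument raises an error if that assumption does not hold. The two small dataframes below are illustrative stand-ins for `df` and `df_coord`.

```python
import pandas as pd

# Illustrative stand-ins for df and df_coord
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M7R'],
    'Borough': ['North York', 'North York', 'Mississauga'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M7R'],
    'Latitude': [43.7545, 43.7276, float('nan')],
    'Longitude': [-79.33, -79.3148, float('nan')],
})

# validate='one_to_one' fails loudly if a postal code is duplicated on either side
merged = left.merge(right, on='PostalCode', how='left', validate='one_to_one')

# Postal codes for which the lookup returned no coordinates
missing = merged.loc[merged['Latitude'].isna(), 'PostalCode'].tolist()
print(missing)
```

Listing the postal codes without coordinates right after the merge makes gaps visible before any rows are dropped.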


In [27]:
# Check that every neighborhood contains values for latitude and longitude
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
PostalCode      103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
Latitude        102 non-null float64
Longitude       102 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.9+ KB


In [28]:
df_missing = df[df['Latitude'].isnull()]

In [29]:
df_missing

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
76,M7R,Mississauga,Canada Post Gateway Processing Centre,,


There is only one row with missing coordinates. Let's try to get the coordinates for this row again:

In [30]:
nomi.query_postal_code('M7R')

postal_code       M7R
country code      NaN
place_name        NaN
state_name        NaN
state_code        NaN
county_name       NaN
county_code       NaN
community_name    NaN
community_code    NaN
latitude          NaN
longitude         NaN
accuracy          NaN
Name: 0, dtype: object

The lookup returns no data for M7R, so this neighborhood has no latitude and longitude. Let's drop it for now.

In [31]:
df_neigh = df.dropna().reset_index(drop=True)

__Check the final dataframe again:__

In [32]:
# Check the size
df_neigh.shape

(102, 5)

In [33]:
# View the first ten rows
df_neigh.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,Harbourfront,43.6555,-79.3626
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.7223,-79.4504
4,M7A,Downtown Toronto,Queen's Park,43.6641,-79.3889
5,M9A,Etobicoke,Islington Avenue,43.6662,-79.5282
6,M1B,Scarborough,"Rouge, Malvern",43.8113,-79.193
7,M3B,North York,Don Mills North,43.745,-79.359
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.7063,-79.3094
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.6572,-79.3783


## Part 3: Exploring and clustering the neighborhoods in Toronto

Load libraries:

In [34]:
# Import geopy
from geopy.geocoders import Nominatim

# Import matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means from clustering stage
from sklearn.cluster import KMeans

# Import folium, map rendering library
import folium

# Import package to work with JSON files
import json

# Transform JSON data into a pandas dataframe
from pandas.io.json import json_normalize

### 1. Show Toronto and its neighborhoods on a map

Get the latitude and longitude coordinates of Toronto using the [geopy](https://geopy.readthedocs.io/en/stable/) library:

In [35]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent='ca_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Create a map of Toronto with the neighborhoods superimposed on top:

In [36]:
# Create the map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to the map
for lat, lng, borough, neighborhood in zip(df_neigh['Latitude'], df_neigh['Longitude'], df_neigh['Borough'], df_neigh['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### 2. Get up to 100 venues for each neighborhood within a radius of 500 metres

Define the Foursquare credentials and API version. (Note: The credentials are stored in a separate file, `credentials.json`, that is not version-controlled on GitHub.)

In [37]:
# The with block closes the file automatically, so no explicit close() is needed
with open('credentials.json') as file:
    data = json.load(file)
    CLIENT_ID = data['id']    # Foursquare ID
    CLIENT_SECRET = data['secret']    # Foursquare Secret
    
VERSION = '20200310'    # Foursquare API version

print('Credentials loaded.')

Credentials loaded.
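
Keeping secrets out of version control can also be done with environment variables. The sketch below is one possible approach, not the notebook's method: the function `load_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical, and the JSON fallback assumes the `id` and `secret` keys used above.

```python
import json
import os


def load_credentials(path='credentials.json'):
    """Return (client_id, client_secret) from environment variables or a JSON file."""
    # Hypothetical variable names; pick whatever fits your environment
    client_id = os.environ.get('FOURSQUARE_CLIENT_ID')
    client_secret = os.environ.get('FOURSQUARE_CLIENT_SECRET')
    if client_id and client_secret:
        return client_id, client_secret

    # Fall back to a JSON file with 'id' and 'secret' keys
    with open(path) as fh:
        data = json.load(fh)
    return data['id'], data['secret']
```

With this helper, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would work both locally with the file and in environments where only the variables are set.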


Define a function that extracts the category of a venue:

In [38]:
# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Define a function to get nearby venues for the neighborhoods:

In [39]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
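
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]`, which would raise an `IndexError` for a venue without any category. A more defensive variant of the parsing step might look like the sketch below. The helper `parse_venues` and the hand-written `sample_payload` are hypothetical; the payload only imitates the keys that the function above reads.

```python
def parse_venues(name, lat, lng, payload):
    """Flatten one explore-style response into rows for a single neighborhood."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None when a venue has no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written payload in the shape the function indexes into (illustrative values)
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Brookbanks Park',
                   'location': {'lat': 43.751976, 'lng': -79.33214},
                   'categories': [{'name': 'Park'}]}},
        {'venue': {'name': 'Venue Without Category',
                   'location': {'lat': 43.75, 'lng': -79.33},
                   'categories': []}},
    ]}]}
}

rows = parse_venues('Parkwoods', 43.7545, -79.33, sample_payload)
print(rows)
```

Separating the parsing from the HTTP request also makes it possible to check the parsing logic without calling the API.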

Get nearby venues for the Toronto neighborhoods and store them in a pandas dataframe:

In [40]:
LIMIT = 100

toronto_venues = getNearbyVenues(names=df_neigh['Neighborhood'],
                                   latitudes=df_neigh['Latitude'],
                                   longitudes=df_neigh['Longitude']
                                )

Parkwoods
Victoria Village
Harbourfront
Lawrence Heights, Lawrence Manor
Queen's Park
Islington Avenue
Rouge, Malvern
Don Mills North
Woodbine Gardens, Parkview Hill
Ryerson, Garden District
Glencairn
Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park
Highland Creek, Rouge Hill, Port Union
Flemingdon Park, Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Downsview North, Wilson Heights
Thorncliffe Park
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
East Birchmount Park, Ionview, Kennedy Park
Bayview Village
CFB Toronto, Downsview East
The Danforth West,

### 3. Explore the new dataframe with venues and prepare dataset for clustering

In [41]:
# Check the size of the new dataframe with venues
toronto_venues.shape

(2264, 7)

In [42]:
# Check the dataframe with venues
toronto_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.7545,-79.33,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.7545,-79.33,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
2,Parkwoods,43.7545,-79.33,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.7276,-79.3148,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.7276,-79.3148,Tim Hortons,43.725517,-79.313103,Coffee Shop
5,Victoria Village,43.7276,-79.3148,Portugril,43.725819,-79.312785,Portuguese Restaurant
6,Victoria Village,43.7276,-79.3148,Eglinton Ave E & Sloane Ave/Bermondsey Rd,43.726086,-79.31362,Intersection
7,Victoria Village,43.7276,-79.3148,Pizza Nova,43.725824,-79.31286,Pizza Place
8,Victoria Village,43.7276,-79.3148,Wigmore Park,43.731023,-79.310771,Park
9,Harbourfront,43.6555,-79.3626,Tandem Coffee,43.653559,-79.361809,Coffee Shop


Get the number of venues per neighborhood:

In [43]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",1,1,1,1,1,1
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",13,13,13,13,13,13
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Downsview North, Wilson Heights",6,6,6,6,6,6
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",25,25,25,25,25,25
Berczy Park,90,90,90,90,90,90
"Birch Cliff, Cliffside West",4,4,4,4,4,4


In [44]:
print('There are {} unique categories.'.format(toronto_venues['Venue Category'].nunique()))

There are 262 unique categories.


A few venues have the category Neighborhood (probably by mistake):

In [45]:
toronto_venues[toronto_venues['Venue Category'] == 'Neighborhood']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
440,The Beaches,43.6784,-79.2941,Upper Beaches,43.680563,-79.292869,Neighborhood
582,Central Bay Street,43.6564,-79.386,Downtown Toronto,43.653232,-79.385296,Neighborhood
769,"Adelaide, King, Richmond",43.6496,-79.3833,Downtown Toronto,43.653232,-79.385296,Neighborhood
1127,"Brockton, Exhibition Place, Parkdale Village",43.6383,-79.4301,Parkdale,43.640524,-79.4322,Neighborhood
1914,Stn A PO Boxes 25 The Esplanade,43.6437,-79.3787,Harbourfront,43.639526,-79.380688,Neighborhood


Let's drop venues that have category Neighborhood:

In [46]:
# Create a mask to filter venues that have category Neighborhood
mask = toronto_venues['Venue Category'] == 'Neighborhood'

# Remove rows with venues that have category Neighborhood from venues dataframe
toronto_venues = toronto_venues[~mask]

# Reset index
toronto_venues.reset_index(drop=True, inplace=True)

# Check the size of the new dataframe with venues
print('Dataframe with venues has {} rows and {} columns.'.format(toronto_venues.shape[0], toronto_venues.shape[1]))
print()

# Check the dataframe with venues
toronto_venues.head()

Dataframe with venues has 2259 rows and 7 columns.



Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.7545,-79.33,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.7545,-79.33,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
2,Parkwoods,43.7545,-79.33,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.7276,-79.3148,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.7276,-79.3148,Tim Hortons,43.725517,-79.313103,Coffee Shop


Get the number of different venue categories again:

In [47]:
print('There are {} unique categories.'.format(toronto_venues['Venue Category'].nunique()))

There are 261 unique categories.


Get the number of venues per neighborhood again:

In [48]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",99,99,99,99,99,99
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",1,1,1,1,1,1
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",13,13,13,13,13,13
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Downsview North, Wilson Heights",6,6,6,6,6,6
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",25,25,25,25,25,25
Berczy Park,90,90,90,90,90,90
"Birch Cliff, Cliffside West",4,4,4,4,4,4


Some of the neighborhoods don't have many venues. Let's select only neighborhoods with more than 10 venues and use those for clustering:

In [49]:
# Create a new dataframe with neighborhoods that have more than 10 venues
toronto_venues_filtered = toronto_venues.groupby('Neighborhood').filter(lambda x: len(x) > 10)

# Reset index
toronto_venues_filtered.reset_index(drop=True, inplace=True)
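
The filter above keeps groups with `len(x) > 10`, that is, strictly more than 10 venues. The following sketch on made-up data shows how `> 10` and `>= 10` differ. It uses `transform('size')` to build a boolean mask, which is an alternative to `groupby(...).filter(...)` rather than what the notebook does.

```python
import pandas as pd

# Made-up data: A has 11 venues, B has exactly 10, C has 3
venues = pd.DataFrame({
    'Neighborhood': ['A'] * 11 + ['B'] * 10 + ['C'] * 3,
    'Venue': ['venue_{}'.format(i) for i in range(24)],
})

# Number of venues in each row's neighborhood, aligned with the original rows
counts = venues.groupby('Neighborhood')['Venue'].transform('size')

more_than_10 = venues[counts > 10]
at_least_10 = venues[counts >= 10]

print(sorted(more_than_10['Neighborhood'].unique()))  # ['A']
print(sorted(at_least_10['Neighborhood'].unique()))   # ['A', 'B']
```

A neighborhood with exactly 10 venues is kept only by the `>= 10` version, so the choice of operator decides which neighborhoods enter the clustering.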

Check the new (and final) dataframe:

In [50]:
# Check the size
toronto_venues_filtered.shape

(2014, 7)

In [51]:
# Check the number of neighborhoods to analyze
toronto_venues_filtered['Neighborhood'].nunique()

44

In [52]:
# Count the venues in neighborhoods
toronto_venues_filtered.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",99,99,99,99,99,99
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",13,13,13,13,13,13
"Bedford Park, Lawrence Manor East",25,25,25,25,25,25
Berczy Park,90,90,90,90,90,90
"Brockton, Exhibition Place, Parkdale Village",39,39,39,39,39,39
Business Reply Mail Processing Centre 969 Eastern,16,16,16,16,16,16
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",57,57,57,57,57,57
"Cabbagetown, St. James Town",41,41,41,41,41,41
Central Bay Street,90,90,90,90,90,90
"Chinatown, Grange Park, Kensington Market",84,84,84,84,84,84


### 4. Analyze neighborhoods

#### Prepare dataframe for clustering

In [53]:
# One hot encoding
toronto_onehot = pd.get_dummies(toronto_venues_filtered[['Venue Category']], prefix='', prefix_sep='')

# Add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues_filtered['Neighborhood'] 

# Move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,...,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The size of the new dataframe:

In [54]:
toronto_onehot.shape

(2014, 238)

Calculate average frequency of venues by neighborhood:

In [55]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,...,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.030303,0.010101,0.0,0.030303,0.0,0.0,0.0,...,0.010101,0.0,0.0,0.0,0.0,0.010101,0.0,0.0,0.0,0.0
1,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Berczy Park,0.0,0.0,0.011111,0.022222,0.0,0.0,0.0,0.0,0.011111,...,0.011111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Brockton, Exhibition Place, Parkdale Village",0.025641,0.0,0.0,0.025641,0.025641,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
6,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544
7,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Central Bay Street,0.0,0.0,0.011111,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.011111,0.011111,0.0,0.0,0.011111,0.0,0.0,0.0,0.0
9,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.02381,0.011905,0.0,0.0,0.0,0.0,...,0.035714,0.0,0.047619,0.0,0.0,0.011905,0.0,0.0,0.0,0.0


Check the size of the new grouped dataframe:

In [56]:
toronto_grouped.shape

(44, 238)
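
Averaging one-hot columns per neighborhood gives the share of each venue category among that neighborhood's venues. As a sanity check, the sketch below uses made-up data to show that the same table can be obtained with `pd.crosstab` and `normalize='index'`. The variable names are illustrative.

```python
import pandas as pd

# Made-up venues for two neighborhoods
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Hotel', 'Park', 'Park'],
})

# Route 1: one-hot encode, then average per neighborhood (as in the notebook)
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
via_onehot = onehot.groupby('Neighborhood').mean()

# Route 2: cross-tabulate and normalize each row to sum to 1
via_crosstab = pd.crosstab(venues['Neighborhood'], venues['Venue Category'],
                           normalize='index')

print(via_onehot)
print(via_crosstab)
```

In neighborhood A, two of four venues are coffee shops, so both routes give 0.5 for that cell.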

#### Create a pandas dataframe with the top 10 venue categories for each neighborhood

In [57]:
# Define function to return the top venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [58]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Café,Restaurant,Coffee Shop,Steakhouse,Gastropub,Japanese Restaurant,Hotel,Thai Restaurant,Asian Restaurant,Gym
1,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pharmacy,Liquor Store,Sandwich Place,Japanese Restaurant,Discount Store,Fried Chicken Joint,Caribbean Restaurant,Hardware Store,Pizza Place
2,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Restaurant,Sandwich Place,Comfort Food Restaurant,Pharmacy,Pizza Place,Pub,Café,Butcher
3,Berczy Park,Coffee Shop,Café,Hotel,Restaurant,Bakery,Seafood Restaurant,Cocktail Bar,Beer Bar,Japanese Restaurant,Breakfast Spot
4,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Japanese Restaurant,Breakfast Spot,Thrift / Vintage Store,Gift Shop,Liquor Store,Supermarket,Boutique,Brewery
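
The `indicators` lookup with a `'th'` fallback gives correct labels for the ten columns used here, but it would produce labels such as "21th" if `num_top_venues` were raised above 20. A small helper along the following lines would cover the general case. The function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # Numbers ending in 11, 12 or 13 take 'th' despite their last digit
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Build the same column names as in the cell above
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1])
print(columns[-1])
```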


#### Cluster neighborhoods

In [59]:
# Set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 0, 2, 3, 3, 3, 3, 3, 3, 3])
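
The value `kclusters = 5` is set without a selection step. One common way to compare candidate values is the silhouette score. The sketch below illustrates the idea on synthetic, well-separated data generated with `make_blobs`. It is not run on `toronto_grouped_clustering`, so it says nothing about which value would suit the Toronto data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (stand-in for the frequency matrix)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# Fit k-means for several values of k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the candidate this criterion prefers
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

The same loop could be pointed at the real feature matrix, as long as every candidate k stays below the number of rows.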

In [60]:
# Add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# Keep only the neighborhoods that were used for clustering
toronto_merged = df_neigh[df_neigh['Neighborhood'].isin(toronto_venues_filtered['Neighborhood'].unique())]

# Merge dataframes to get a row containing all the information for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,Harbourfront,43.6555,-79.3626,3,Coffee Shop,Breakfast Spot,Restaurant,Thrift / Vintage Store,Spa,Event Space,Mexican Restaurant,Electronics Store,Pub,Beer Store
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.7223,-79.4504,3,Clothing Store,Coffee Shop,Women's Store,Jewelry Store,Bakery,Toy / Game Store,Pharmacy,Restaurant,Men's Store,Electronics Store
4,M7A,Downtown Toronto,Queen's Park,43.6641,-79.3889,3,Coffee Shop,Italian Restaurant,Gym,Portuguese Restaurant,Burger Joint,Dance Studio,Seafood Restaurant,Café,Fast Food Restaurant,Chinese Restaurant
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.6572,-79.3783,3,Coffee Shop,Clothing Store,Café,Japanese Restaurant,Middle Eastern Restaurant,Restaurant,Electronics Store,Pizza Place,Plaza,Theater
10,M6B,North York,Glencairn,43.7081,-79.4479,0,Pizza Place,Grocery Store,Sushi Restaurant,Mediterranean Restaurant,Latin American Restaurant,Fast Food Restaurant,Ice Cream Shop,Gas Station,Japanese Restaurant,Electronics Store


Check the labels:

In [61]:
toronto_merged['Cluster Labels'].unique()

array([3, 0, 2, 1, 4], dtype=int64)

In [62]:
# Check the size
toronto_merged.shape

(44, 16)

#### Create a map of Toronto with color-coded neighborhoods

In [63]:
# Create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# Set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
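The color lookup in the cell above can be inspected on its own. The sketch below assumes the same `kclusters = 5` and rebuilds the palette directly from `kclusters` (equivalent here, because `ys` is only used for its length). The name `label_to_color` is introduced only for illustration. It shows that indexing with `cluster-1` still gives each label its own color, with label 0 wrapping around to the last entry of the list.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 5  # same value as in the clustering cell

# One hex color per cluster, sampled evenly from the rainbow colormap
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# The marker loop indexes with cluster - 1, so label 0 wraps to the last color
label_to_color = {label: rainbow[label - 1] for label in range(kclusters)}

for label, color in label_to_color.items():
    print(f"label {label} -> {color}")
```

Indexing with `rainbow[cluster]` would work equally well; it would only change which color each cluster gets.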

#### Examine clusters

Get the number of neighborhoods in clusters:

In [64]:
toronto_merged[['Neighborhood', 'Cluster Labels']].groupby('Cluster Labels').count()

Cluster Labels,Neighborhood
0,3
1,2
2,10
3,28
4,1


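Let's look at the neighborhoods in each cluster. The headings below number the clusters from 1, while the `Cluster Labels` column is zero-based, so Cluster 1 corresponds to label 0.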

##### Cluster 1

In [65]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,North York,0,Pizza Place,Grocery Store,Sushi Restaurant,Mediterranean Restaurant,Latin American Restaurant,Fast Food Restaurant,Ice Cream Shop,Gas Station,Japanese Restaurant,Electronics Store
60,North York,0,Discount Store,Pizza Place,Grocery Store,Pharmacy,Sandwich Place,Liquor Store,Caribbean Restaurant,Fried Chicken Joint,Beer Store,Gas Station
88,Etobicoke,0,Grocery Store,Pharmacy,Liquor Store,Sandwich Place,Japanese Restaurant,Discount Store,Fried Chicken Joint,Caribbean Restaurant,Hardware Store,Pizza Place


Three neighborhoods belong to cluster 1. The most common and discriminating venue categories could be the following:
- Pizza Place
- Grocery Store
- Discount Store
- Pharmacy
- Liquor Store
- Sandwich Place  

However, only the first two (Pizza Place, Grocery Store) appear in the top ten of all three neighborhoods.
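The check for categories shared by every neighborhood in a cluster can be done with a set intersection. The sketch below is a minimal illustration: the three lists are copied from the Cluster 1 output above, and the names `cluster_rows`, `category_sets` and `common` are introduced only for this example.

```python
import pandas as pd

# Top-10 venue lists copied from the Cluster 1 output (index = original row label)
cluster_rows = pd.DataFrame(
    [
        ["Pizza Place", "Grocery Store", "Sushi Restaurant", "Mediterranean Restaurant",
         "Latin American Restaurant", "Fast Food Restaurant", "Ice Cream Shop",
         "Gas Station", "Japanese Restaurant", "Electronics Store"],
        ["Discount Store", "Pizza Place", "Grocery Store", "Pharmacy", "Sandwich Place",
         "Liquor Store", "Caribbean Restaurant", "Fried Chicken Joint", "Beer Store",
         "Gas Station"],
        ["Grocery Store", "Pharmacy", "Liquor Store", "Sandwich Place",
         "Japanese Restaurant", "Discount Store", "Fried Chicken Joint",
         "Caribbean Restaurant", "Hardware Store", "Pizza Place"],
    ],
    index=[10, 60, 88],
)

# Turn each row into a set of categories, then keep what all rows share
category_sets = [set(row) for row in cluster_rows.to_numpy()]
common = sorted(set.intersection(*category_sets))
print(common)  # ['Grocery Store', 'Pizza Place']
```

In the notebook itself, the same idea should apply to the ten venue columns of `toronto_merged` filtered by cluster label, assuming those are the last ten columns as in the output shown earlier.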

##### Cluster 2

In [66]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25,Downtown Toronto,1,Grocery Store,Café,Candy Store,Park,Coffee Shop,Playground,Athletics & Sports,Baby Store,Empanada Restaurant,Ethiopian Restaurant
31,West Toronto,1,Park,Bakery,Pharmacy,Grocery Store,Bus Line,Furniture / Home Store,Middle Eastern Restaurant,Brazilian Restaurant,Pool,Bar


Cluster 2 has only two neighborhoods. The venue categories that appear in the top ten of both are:
- Grocery Store
- Park

##### Cluster 3

In [67]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Scarborough,2,Pizza Place,Coffee Shop,Pharmacy,Fast Food Restaurant,Grocery Store,Beer Store,Sandwich Place,Liquor Store,Restaurant,Burger Joint
29,East York,2,Indian Restaurant,Sandwich Place,Afghan Restaurant,Coffee Shop,Turkish Restaurant,Bank,Burger Joint,Restaurant,Fried Chicken Joint,Supermarket
47,East Toronto,2,Park,Sandwich Place,Liquor Store,Movie Theater,Burrito Place,Food & Drink Shop,Brewery,Steakhouse,Sushi Restaurant,Pub
55,North York,2,Italian Restaurant,Coffee Shop,Restaurant,Sandwich Place,Comfort Food Restaurant,Pharmacy,Pizza Place,Pub,Café,Butcher
59,North York,2,Pizza Place,Ramen Restaurant,Sushi Restaurant,Shopping Mall,Coffee Shop,Sandwich Place,Café,Japanese Restaurant,Middle Eastern Restaurant,Fried Chicken Joint
74,Central Toronto,2,Café,Sandwich Place,American Restaurant,History Museum,Middle Eastern Restaurant,French Restaurant,Burger Joint,Flower Shop,Mexican Restaurant,Pub
76,Etobicoke,2,Pharmacy,Bank,Shopping Mall,Sandwich Place,Supermarket,Chinese Restaurant,Gas Station,Beer Store,Mobile Phone Shop,Bus Line
78,Central Toronto,2,Dessert Shop,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Farmers Market,Restaurant,Toy / Game Store,Diner,Sushi Restaurant
81,Scarborough,2,Pizza Place,Pharmacy,Thai Restaurant,Convenience Store,Fast Food Restaurant,Gas Station,Fried Chicken Joint,Chinese Restaurant,Bank,Italian Restaurant
89,Scarborough,2,Fast Food Restaurant,Chinese Restaurant,Pharmacy,Supermarket,Pizza Place,Coffee Shop,Electronics Store,Sandwich Place,Other Great Outdoors,Bubble Tea Shop


The most common venue categories in cluster 3, which contains ten neighborhoods:
- Sandwich Place
- Pizza Place
- Coffee Shop
- Pharmacy
- Café
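One way to back up a list like this is to count in how many neighborhoods each category appears among the top ten. The sketch below does that for the Cluster 3 output above. The lists are copied from that table, and the names `top10`, `venues` and `counts` are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# Top-10 venue lists copied from the Cluster 3 output (key = original row label)
top10 = {
    18: ["Pizza Place", "Coffee Shop", "Pharmacy", "Fast Food Restaurant", "Grocery Store",
         "Beer Store", "Sandwich Place", "Liquor Store", "Restaurant", "Burger Joint"],
    29: ["Indian Restaurant", "Sandwich Place", "Afghan Restaurant", "Coffee Shop",
         "Turkish Restaurant", "Bank", "Burger Joint", "Restaurant",
         "Fried Chicken Joint", "Supermarket"],
    47: ["Park", "Sandwich Place", "Liquor Store", "Movie Theater", "Burrito Place",
         "Food & Drink Shop", "Brewery", "Steakhouse", "Sushi Restaurant", "Pub"],
    55: ["Italian Restaurant", "Coffee Shop", "Restaurant", "Sandwich Place",
         "Comfort Food Restaurant", "Pharmacy", "Pizza Place", "Pub", "Café", "Butcher"],
    59: ["Pizza Place", "Ramen Restaurant", "Sushi Restaurant", "Shopping Mall",
         "Coffee Shop", "Sandwich Place", "Café", "Japanese Restaurant",
         "Middle Eastern Restaurant", "Fried Chicken Joint"],
    74: ["Café", "Sandwich Place", "American Restaurant", "History Museum",
         "Middle Eastern Restaurant", "French Restaurant", "Burger Joint",
         "Flower Shop", "Mexican Restaurant", "Pub"],
    76: ["Pharmacy", "Bank", "Shopping Mall", "Sandwich Place", "Supermarket",
         "Chinese Restaurant", "Gas Station", "Beer Store", "Mobile Phone Shop", "Bus Line"],
    78: ["Dessert Shop", "Coffee Shop", "Italian Restaurant", "Sandwich Place", "Café",
         "Farmers Market", "Restaurant", "Toy / Game Store", "Diner", "Sushi Restaurant"],
    81: ["Pizza Place", "Pharmacy", "Thai Restaurant", "Convenience Store",
         "Fast Food Restaurant", "Gas Station", "Fried Chicken Joint",
         "Chinese Restaurant", "Bank", "Italian Restaurant"],
    89: ["Fast Food Restaurant", "Chinese Restaurant", "Pharmacy", "Supermarket",
         "Pizza Place", "Coffee Shop", "Electronics Store", "Sandwich Place",
         "Other Great Outdoors", "Bubble Tea Shop"],
}

# One row per neighborhood, one column per rank
venues = pd.DataFrame.from_dict(top10, orient="index")

# Flatten all cells and count how often each category occurs
counts = venues.stack().value_counts()
print(counts.head(6))
```

Because a category appears at most once per neighborhood here, each count is also the number of neighborhoods that list it. Note that this ignores rank: a 1st-place and a 10th-place entry weigh the same.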

##### Cluster 4

In [68]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,3,Coffee Shop,Breakfast Spot,Restaurant,Thrift / Vintage Store,Spa,Event Space,Mexican Restaurant,Electronics Store,Pub,Beer Store
3,North York,3,Clothing Store,Coffee Shop,Women's Store,Jewelry Store,Bakery,Toy / Game Store,Pharmacy,Restaurant,Men's Store,Electronics Store
4,Downtown Toronto,3,Coffee Shop,Italian Restaurant,Gym,Portuguese Restaurant,Burger Joint,Dance Studio,Seafood Restaurant,Café,Fast Food Restaurant,Chinese Restaurant
9,Downtown Toronto,3,Coffee Shop,Clothing Store,Café,Japanese Restaurant,Middle Eastern Restaurant,Restaurant,Electronics Store,Pizza Place,Plaza,Theater
15,Downtown Toronto,3,Coffee Shop,Café,Restaurant,Seafood Restaurant,Hotel,Bakery,Italian Restaurant,Breakfast Spot,Clothing Store,Cosmetics Shop
20,Downtown Toronto,3,Coffee Shop,Café,Hotel,Restaurant,Bakery,Seafood Restaurant,Cocktail Bar,Beer Bar,Japanese Restaurant,Breakfast Spot
23,East York,3,Department Store,Electronics Store,Restaurant,Sporting Goods Shop,Sports Bar,Pet Store,Sushi Restaurant,Coffee Shop,Portuguese Restaurant,Rental Car Location
24,Downtown Toronto,3,Coffee Shop,Clothing Store,Japanese Restaurant,Italian Restaurant,Sushi Restaurant,Middle Eastern Restaurant,Chinese Restaurant,Electronics Store,Juice Bar,Café
30,Downtown Toronto,3,Café,Restaurant,Coffee Shop,Steakhouse,Gastropub,Japanese Restaurant,Hotel,Thai Restaurant,Asian Restaurant,Gym
33,North York,3,Clothing Store,Fast Food Restaurant,Japanese Restaurant,Coffee Shop,Women's Store,Baseball Field,Tea Room,Juice Bar,Cosmetics Shop,Spa


The most common venue categories for neighborhoods belonging to cluster 4 are:
- Coffee Shop
- Café
- Restaurant

These neighborhoods have plenty of places to get coffee, as well as restaurants listed under the generic category Restaurant (with no cuisine specified).

##### Cluster 5

In [69]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
44,Scarborough,4,Intersection,Bus Line,Bakery,Metro Station,Soccer Field,Coffee Shop,Park,Bus Station,Fast Food Restaurant,Fish & Chips Shop


The last cluster contains a single neighborhood whose top venues are largely transport-related, for example Intersection, Bus Line, and Metro Station.

##### Summary

__To sum up, the clusters of neighborhoods can be briefly described as follows:__
- Cluster 1 - Pizza, Stores (Grocery, Discount, Liquor), Sandwich, Pharmacy
- Cluster 2 - Grocery Store, Park
- Cluster 3 - Sandwich and Pizza, Coffee, Pharmacy
- Cluster 4 - Coffee, Restaurant
- Cluster 5 - Traffic, Transportation
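A summary like the one above could also be cross-checked numerically by averaging the venue frequencies within each cluster, which is roughly what the k-means centroids represent. The sketch below is only an illustration under stated assumptions: `freq` is a made-up table shaped like `toronto_grouped`, `labels` stands in for `kmeans.labels_`, and all values are invented.

```python
import pandas as pd

# Hypothetical frequency table shaped like toronto_grouped (values are made up)
freq = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D"],
        "Coffee Shop": [0.30, 0.25, 0.00, 0.05],
        "Pizza Place": [0.05, 0.05, 0.20, 0.25],
        "Park": [0.00, 0.02, 0.10, 0.00],
    }
)

# Stand-in for kmeans.labels_: one cluster label per row of freq
labels = pd.Series([0, 0, 1, 1], name="Cluster Labels")

# Mean frequency of each category within each cluster
profile = freq.drop(columns="Neighborhood").groupby(labels).mean()

# Two categories with the highest mean frequency per cluster
top = {cluster: row.nlargest(2).index.tolist() for cluster, row in profile.iterrows()}
print(top)
```

Unlike the top-ten tables, this uses the actual frequencies, so a category that dominates one neighborhood counts for more than one that barely makes the list.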

