# IBM Applied Data Science Capstone
## Peer-graded Assignment
## Segmenting and Clustering Neighborhoods in Toronto
### Sidclay da Silva
### June 2020
---

### Introduction

This notebook contains the peer-graded assignment for Week 3 of the IBM Applied Data Science Capstone course on Coursera, which requires exploring, segmenting, and clustering the neighborhoods in the city of Toronto. In short, the assignment consists of three main tasks:

1. Build a dataframe with the Toronto Postal Codes from a web page.
1. Include the coordinates for each neighborhood in the dataframe.
1. Explore, cluster and display the neighborhood clusters on a map.

Most of the code could be grouped to make a shorter notebook, but the objective is to make each step clear, so the code has been split into small cells with Markdown explanations.

The tool I chose for this assignment was a Jupyter Notebook running on JupyterLab. The notebook will be available in a GitHub repository so that peers can grade it.

---

### Important notice:
Unfortunately, the Folium maps are not displayed on GitHub because they are interactive objects that GitHub does not render. To see the maps, open the [Jupyter nbviewer](https://nbviewer.org/) and paste the URL of the GitHub document.

---

### Task 1 - Build a dataframe with the Toronto Postal Codes from a web page

Import the required libraries. For this task the __Requests__ library will be used to send the web request, and __BeautifulSoup__ to parse the data from the web page.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Send a request to the provided URL and check whether the data was loaded successfully.

In [2]:
# send a request to the URL and store the response
raw = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# check if data was loaded [status 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200


Parse the raw data from the web page using __BeautifulSoup__. The provided _Wikipedia_ page contains more tables than the required __Toronto Postal Code table__, but only that table will be loaded. The __tag table__ will be used to select the tables from the parsed data, and __index 0__ will be used to pick the first table, which is the one required for this assignment.

In [3]:
# parse the raw data
par = BeautifulSoup(raw.text, 'html.parser')

# load only the first table from parsed data [tag 'table' / index 0]
par_table = par.find_all('table')[0]

Check the number of columns and their headers. The headers will be used to name the dataframe columns, and the __tag th__ will be used to select them when running a loop.

In [4]:
# print the number of columns and the columns' headers
print('The source table has {} columns'.format(len(par_table.find_all('th'))))
par_table.find_all('th')

The source table has 3 columns


[<th>Postal Code
 </th>,
 <th>Borough
 </th>,
 <th>Neighborhood
 </th>]

Store the columns' headers in a list.

In [5]:
# define an empty list object
headers = list()

# run a loop to append the headers to the list [tag 'th']
for h in par_table.find_all('th'):
    headers.append(h.get_text())

# check the headers
headers

['Postal Code\n', 'Borough\n', 'Neighborhood\n']

Unfortunately, *get_text()* also returned unwanted characters, such as __'\n'__. They will be removed, together with the blank spaces between words, so that the headers can be used as dataframe column names.

In [6]:
# run a loop to remove the '\n' from headers
for i, h in enumerate(headers):
    headers[i] = h.replace('\n','')

# run a loop to remove the blank spaces between words
for i, h in enumerate(headers):
    headers[i] = h.replace(' ','')

# check the clean headers
headers

['PostalCode', 'Borough', 'Neighborhood']
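
The two clean-up loops could also be collapsed into a single pass. This is a minimal sketch, assuming the headers look like the list printed earlier; `clean_header` is a hypothetical helper name, not part of the notebook.

```python
# sample headers copied from the get_text() output shown earlier
raw_headers = ['Postal Code\n', 'Borough\n', 'Neighborhood\n']

def clean_header(text):
    """Drop surrounding whitespace (including the newline) and inner spaces."""
    return text.strip().replace(' ', '')

# one list comprehension replaces both loops
clean_headers = [clean_header(h) for h in raw_headers]
print(clean_headers)
```

Either approach gives the same column names; the single pass just avoids mutating the list twice.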

Create an empty dataframe using the table headers as column names.

In [7]:
pcode = pd.DataFrame(columns=headers)
pcode.reset_index()
pcode

Unnamed: 0,PostalCode,Borough,Neighborhood


Before populating the dataframe with the postal code data, check how many rows the table contains, excluding the header. The __tag tr__ will be used for this.

In [8]:
# print the number of rows the table contains
print('The source table has {} rows'.format(len(par_table.find_all('tr'))-1))

The source table has 180 rows


The dataframe can be populated by running a nested loop. The first level runs by row, identified by the __tag tr__. For each row, the second level runs by column, identified by the __tag td__. The data is stored temporarily in a list and, after each row is read, the list is stored in the dataframe. If the Borough is not assigned, the complete row is ignored.

In [9]:
# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # skip the first row [headers]
    if i > 0:
        # create an empty list
        d = list()
        
        # run a loop by column for the current row [tag 'td']
        for column in row.find_all('td'):
            # append the text of current cell to the list, already removing the '\n'
            d.append(column.get_text().replace('\n',''))

        # if Borough is not 'not assigned' then store the list into the dataframe
        # [DataFrame.append was removed in pandas 2.0, so the row is added by label instead]
        if d[1].lower()!='not assigned':
            pcode.loc[len(pcode)] = d
            
# inform when it is finished
print('Dataframe populated.')

Dataframe populated.
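
As an alternative to growing the dataframe row by row, the rows could be collected in a plain list first. The sketch below illustrates this on a small made-up HTML fragment shaped like the source table; `sample_html` and `table_rows` are hypothetical names, and the fragment is not the real Wikipedia page.

```python
from bs4 import BeautifulSoup

# made-up HTML fragment with the same three-column layout as the source table
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

def table_rows(html):
    """Return the data rows of the first table, skipping unassigned boroughs."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    rows = []
    for tr in table.find_all('tr'):
        # strip=True removes the trailing newline that get_text() would keep
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        # header rows have no <td> cells, so they yield an empty list and are skipped
        if len(cells) == 3 and cells[1].lower() != 'not assigned':
            rows.append(cells)
    return rows

rows = table_rows(sample_html)
print(rows)
```

The resulting list can then be passed once to `pd.DataFrame(rows, columns=headers)`, which is usually simpler than appending inside the loop.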


Check the first 10 observations in the dataframe.

In [10]:
pcode.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Quick summary of the data.

In [11]:
print('There are {} unique postal codes.'.format(len(pcode['PostalCode'].unique())))
print('There are {} unique boroughs.'.format(len(pcode['Borough'].unique())))

There are 103 unique postal codes.
There are 10 unique boroughs.


Check how many observations the dataframe contains.

In [12]:
pcode.shape

(103, 3)

This completes __Task 1__.

---

### Task 2 - Include the coordinates for each neighborhood in the dataframe

Import the required library. For this task __pgeocode__ will be used to get the coordinates, latitude and longitude.

In [13]:
import pgeocode

Get the coordinates for each neighborhood in the Toronto Postal Code dataframe. This will be accomplished using __query_postal_code__ from __pgeocode__, which only requires a list of target postal codes as input. Its output is a pandas DataFrame containing, among other columns, the latitude and longitude.

In [14]:
# create the geocoder for Canada ['ca' is the country code]
geol = pgeocode.Nominatim('ca')

# get the geo data
loct = geol.query_postal_code(pcode['PostalCode'].tolist())

# inform when it is finished
print('Coordinates loaded.')

Coordinates loaded.


Check the first 5 returned observations.

In [15]:
loct.head()

Unnamed: 0,postal_code,country code,place_name,state_name,state_code,county_name,county_code,community_name,community_code,latitude,longitude,accuracy
0,M3A,CA,North York (York Heights / Victoria Village / ...,Ontario,ON,North York,,,,43.7545,-79.33,1.0
1,M4A,CA,North York (Sweeney Park / Wigmore Park),Ontario,ON,,,,,43.7276,-79.3148,6.0
2,M5A,CA,Downtown Toronto (Regent Park / Port of Toronto),Ontario,ON,Toronto,8133394.0,,,43.6555,-79.3626,6.0
3,M6A,CA,North York (Lawrence Manor / Lawrence Heights),Ontario,ON,North York,,,,43.7223,-79.4504,6.0
4,M7A,CA,Queen's Park Ontario Provincial Government,Ontario,ON,,,,,43.6641,-79.3889,


Update the Toronto Postal Code dataframe with the coordinates. This is done by simply adding the two returned columns, latitude and longitude, at the end of the Toronto Postal Code dataframe.

In [16]:
# add the two returned columns to the dataframe
pcode['Latitude'] = loct.latitude
pcode['Longitude'] = loct.longitude

print('Coordinates added to the dataframe.')

Coordinates added to the dataframe.
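
Assigning the columns directly relies on both dataframes sharing the same row order and index. A more defensive variant, sketched here on a small sample built from values shown in the tables of this task, joins on the postal code instead. The names `codes`, `coords` and `merged` are illustrative only.

```python
import pandas as pd

# small sample using values that appear in the tables of this task
codes = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M5A'],
                      'Borough': ['North York', 'North York', 'Downtown Toronto']})

# lookup result deliberately listed in a different order
coords = pd.DataFrame({'postal_code': ['M5A', 'M3A', 'M4A'],
                       'latitude': [43.6555, 43.7545, 43.7276],
                       'longitude': [-79.3626, -79.3300, -79.3148]})

# a left merge keeps every row of `codes` and matches coordinates by key
merged = codes.merge(coords, left_on='PostalCode', right_on='postal_code', how='left')
merged = merged.drop(columns='postal_code').rename(
    columns={'latitude': 'Latitude', 'longitude': 'Longitude'})
print(merged)
```

With a key-based join, a reordered or filtered lookup result cannot silently attach coordinates to the wrong postal code.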


Check the first 10 observations in the dataframe.

In [17]:
pcode.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.6662,-79.5282
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193
7,M3B,North York,Don Mills,43.745,-79.359
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7063,-79.3094
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783


This completes __Task 2__.

---

### Task 3 - Explore, cluster and display the neighborhood clusters on a map

Import the required libraries. __Folium__ will be used to create maps, __Matplotlib Colors__ and __Pyplot__ to handle colors, the __JSON__ library to handle JSON files, and __KMeans__ from __scikit-learn__ for the clustering tasks.

In [18]:
import folium
import json
from sklearn.cluster import KMeans
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

Before exploring the neighborhoods, check whether there are any missing coordinates that cannot be displayed or explored.

In [19]:
pcode[np.isnan(pcode['Latitude'])]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
76,M7R,Mississauga,Canada Post Gateway Processing Centre,,


For postal code M7R, __pgeocode__ could not find the coordinates. __M7R__ is the postal code for the __Canada Post Gateway Processing Centre__; it will be dropped from the postal code dataframe.

In [20]:
didx = pcode[np.isnan(pcode['Latitude'])]['Neighborhood'].index.tolist()
pcode = pcode.drop(index=didx)
pcode.shape

(102, 5)
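
The same clean-up can be expressed with `dropna`, which may be easier to read than collecting the index first. This is a minimal sketch on a three-row sample; `sample` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# three-row sample: the middle row has no coordinates, like M7R in the notebook
sample = pd.DataFrame({'PostalCode': ['M3A', 'M7R', 'M4A'],
                       'Latitude': [43.7545, np.nan, 43.7276],
                       'Longitude': [-79.3300, np.nan, -79.3148]})

# drop rows where either coordinate is missing; resetting the index is optional
cleaned = sample.dropna(subset=['Latitude', 'Longitude']).reset_index(drop=True)
print(cleaned.shape)
```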

A __color table by Borough__ will be created, allowing each Neighborhood to be plotted on the Toronto map with its Borough-specific color. The color list will be created using __Pyplot__, which returns each color as a list of four values (RGBA format). These will be converted to hexadecimal (HEX format) using __Matplotlib Colors__ so that the colors can be used in the Folium map.

In [21]:
# create a list of unique boroughs
boroughs = pcode['Borough'].unique()

# create a color list by borough [RGBA format]
# tab10 is the chosen color map 
trgb = plt.cm.tab10(np.linspace(0, 1, len(boroughs)))

# convert the color from RGBA to HEX
# RGBA format contains 4 positions, to convert RGB to HEX only the first 3 positions are taken
thex = list()
for i in range(len(trgb)):
    thex.append(mcolors.rgb2hex(trgb[i][:3]))

# create a temporary list to connect boroughs and colors
tmplist = list()
for b, c in zip(boroughs, thex):
    tmplist.append([b,c])

# create a dataframe from the temporary list
colortable = pd.DataFrame(columns=['Borough','Color'], data=tmplist)

# show the borough color table
colortable

Unnamed: 0,Borough,Color
0,North York,#1f77b4
1,Downtown Toronto,#ff7f0e
2,Etobicoke,#2ca02c
3,Scarborough,#d62728
4,East York,#8c564b
5,York,#e377c2
6,East Toronto,#7f7f7f
7,West Toronto,#bcbd22
8,Central Toronto,#17becf
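
To make the RGBA to HEX step concrete, here is a small pure-Python sketch of the conversion. It is not how Matplotlib is implemented internally, only an illustration of the arithmetic, and `rgba_to_hex` is a hypothetical helper name.

```python
def rgba_to_hex(rgba):
    """Convert an (r, g, b, a) sequence of floats in [0, 1] to '#rrggbb'.

    The alpha channel is ignored, as in the notebook, which keeps only
    the first three positions.
    """
    r, g, b = (round(channel * 255) for channel in rgba[:3])
    return '#{:02x}{:02x}{:02x}'.format(r, g, b)

# the color listed for North York in the color table, written as 8-bit values / 255
blue = (31 / 255, 119 / 255, 180 / 255, 1.0)
print(rgba_to_hex(blue))
```

Each channel is scaled from the 0 to 1 range to 0 to 255 and written as two hexadecimal digits.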


Merge the Toronto Postal Code with the Borough Color Table into a new dataframe using the Borough feature as key.

In [22]:
# merge postal code and color table data frames
dfmap = pd.merge(pcode, colortable, on='Borough')

# show the first 10 observations
dfmap.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Color
0,M3A,North York,Parkwoods,43.7545,-79.33,#1f77b4
1,M4A,North York,Victoria Village,43.7276,-79.3148,#1f77b4
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504,#1f77b4
3,M3B,North York,Don Mills,43.745,-79.359,#1f77b4
4,M6B,North York,Glencairn,43.7081,-79.4479,#1f77b4
5,M3C,North York,Don Mills,43.7334,-79.3329,#1f77b4
6,M2H,North York,Hillcrest Village,43.8015,-79.3577,#1f77b4
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.7535,-79.4472,#1f77b4
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.7801,-79.3479,#1f77b4
9,M3J,North York,"Northwood Park, York University",43.7694,-79.4921,#1f77b4


Create a Toronto map showing the neighborhoods. Initially the map was centered on the Toronto coordinates, but the neighborhoods in the top right corner were out of view, and an area without any neighborhood was visible at the bottom. I decided to center the map using the coordinates of the neighborhoods in the Toronto Postal Code dataframe instead, so that all of them are visible.

In [23]:
# calculate mid point for latitude and longitude
tlat = round((min(pcode['Latitude'])+max(pcode['Latitude']))/2, 4)
tlng = round((min(pcode['Longitude'])+max(pcode['Longitude']))/2, 4)

# print calculated coordinates
print('Latitude: {} , Longitude: {}'.format(tlat, tlng))

Latitude: 43.7181 , Longitude: -79.3736
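
The calculation is the midpoint of the bounding box around all points, not the average of the points. Since the same calculation is needed again later for Downtown Toronto, it could be wrapped in a small function. The sketch below uses a hypothetical helper name, `bbox_center`, and a few coordinates taken from the tables in Task 2.

```python
def bbox_center(latitudes, longitudes, ndigits=4):
    """Return the midpoint of the bounding box around the given points."""
    lat = round((min(latitudes) + max(latitudes)) / 2, ndigits)
    lng = round((min(longitudes) + max(longitudes)) / 2, ndigits)
    return lat, lng

# a handful of coordinates copied from the postal code dataframe
lats = [43.6555, 43.7545, 43.8113]
lngs = [-79.3626, -79.5282, -79.1930]

center = bbox_center(lats, lngs)
print('Latitude: {} , Longitude: {}'.format(*center))
```

Folium maps also provide a `fit_bounds` method, which might be an alternative way to make sure all points are in view.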


Create the Toronto map using the calculated coordinates. The __Folium__ library will be used to create it.

In [24]:
toronto_map = folium.Map(location=[tlat, tlng], zoom_start=11)

Create the neighborhood points on the map, each with its borough-specific color and a label.

In [25]:
# run a loop through the dataframe creating the points (circle marks)
for lat, lng, borough, neighborhood, cl in zip(dfmap['Latitude'], dfmap['Longitude'], dfmap['Borough'], dfmap['Neighborhood'], dfmap['Color']):
    # create the label
    label = '{}; Borough: {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    
    # create the circle mark
    folium.CircleMarker([lat, lng],
                        radius = 5,
                        popup = label,
                        color = cl,
                        fill = True,
                        fill_color = cl,
                        fill_opacity = 0.3,
                        parse_html = False,
                        no_touch=True).add_to(toronto_map)

Show the map.

In [26]:
toronto_map

Looking at the map, with each borough in its own color, it is easy to notice a gray circle among the red ones. The gray color represents the borough __East Toronto__. Check what is in the Toronto Postal Code dataframe.

In [27]:
# check observations for East Toronto
dfmap[dfmap['Borough']=='East Toronto']

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Color
82,M4E,East Toronto,The Beaches,43.6784,-79.2941,#7f7f7f
83,M4K,East Toronto,"The Danforth West, Riverdale",43.6803,-79.3538,#7f7f7f
84,M4L,East Toronto,"India Bazaar, The Beaches West",43.6693,-79.3155,#7f7f7f
85,M4M,East Toronto,Studio District,43.6561,-79.3406,#7f7f7f
86,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505,#7f7f7f


From the postal code pattern, __M7Y__ does not seem to fit the East Toronto borough, and its coordinates are also some distance from the others, as seen on the map. Check the borough represented by the red color, __Scarborough__.

In [28]:
# check observations for Scarborough
dfmap[dfmap['Borough']=='Scarborough']

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Color
55,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193,#d62728
56,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7878,-79.1564,#d62728
57,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7678,-79.1866,#d62728
58,M1G,Scarborough,Woburn,43.7712,-79.2144,#d62728
59,M1H,Scarborough,Cedarbrae,43.7686,-79.2389,#d62728
60,M1J,Scarborough,Scarborough Village,43.7464,-79.2323,#d62728
61,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.7298,-79.2639,#d62728
62,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.7122,-79.2843,#d62728
63,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.7247,-79.2312,#d62728
64,M1N,Scarborough,"Birch Cliff, Cliffside West",43.6952,-79.2646,#d62728


From the postal code pattern, __M7Y__ does not fit Scarborough either. This is quite strange. A deeper look into the Toronto postal code methodology might be required to understand it, but that is outside the scope of this assignment, so it will be left as it is.

For the __clustering__ assignment the borough __Downtown Toronto__ will be used. Create a new dataframe only with borough Downtown Toronto.

In [29]:
# copy the observations from Downtown Toronto to a new dataframe
dtdata = dfmap[dfmap['Borough']=='Downtown Toronto']

# reset the index
dtdata = dtdata.reset_index(drop=True)

# print how many observations there are
print('There are {} neighborhoods in Downtown Toronto.'.format(dtdata.shape[0]))

# print the first 10 observations
dtdata.head(10)

There are 19 neighborhoods in Downtown Toronto.


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Color
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,#ff7f0e
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,#ff7f0e
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,#ff7f0e
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,#ff7f0e
4,M5E,Downtown Toronto,Berczy Park,43.6456,-79.3754,#ff7f0e
5,M5G,Downtown Toronto,Central Bay Street,43.6564,-79.386,#ff7f0e
6,M6G,Downtown Toronto,Christie,43.6683,-79.4205,#ff7f0e
7,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.6496,-79.3833,#ff7f0e
8,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.623,-79.3936,#ff7f0e
9,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.6469,-79.3823,#ff7f0e


Have a closer look at Downtown Toronto and its neighborhoods. Calculate the coordinates to center the map using the neighborhoods' coordinates.

In [30]:
# calculate mid point for latitude and longitude
tlat = round((min(dtdata['Latitude'])+max(dtdata['Latitude']))/2, 4)
tlng = round((min(dtdata['Longitude'])+max(dtdata['Longitude']))/2, 4)

# print calculated coordinates
print('Latitude: {} , Longitude: {}'.format(tlat, tlng))

Latitude: 43.6529 , Longitude: -79.3915


Create the Downtown Toronto map using the calculated coordinates and add the neighborhood points to the map.

In [31]:
# create the map
dt_map = folium.Map(location=[tlat, tlng], zoom_start=13)

# run a loop through the dataframe creating the points (circle marks)
for lat, lng, neighborhood, cl in zip(dtdata['Latitude'], dtdata['Longitude'], dtdata['Neighborhood'], dtdata['Color']):
    # create the label
    label = folium.Popup(neighborhood, parse_html=True)
    
    # create the circle mark
    folium.CircleMarker([lat, lng],
                        radius = 5,
                        popup = label,
                        color = cl,
                        fill = True,
                        fill_color = cl,
                        fill_opacity = 0.3,
                        parse_html = False).add_to(dt_map)

# show the map
dt_map

Explore Downtown Toronto. The __Foursquare API__ will be used to explore the borough. To send requests to the Foursquare API, the __client id__ and __client secret__ must be provided; they are sensitive values and should be kept secret. The version of the API should also be defined, but it is not secret.

In [32]:
import os
# credentials are read from environment variables (placeholder names) to keep them secret
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']

In [37]:
# define the version of API Foursquare to be used
VERSION = '20180605'

# define the limit of venues to be returned
LIMIT = 100

# define the radius to be explored for each neighborhood 
RADIUS = 500

Send requests to the __Foursquare API__ to get the list of venues near each neighborhood. A loop through the Downtown Toronto dataframe collects the venue information and stores it in a temporary list.

In [38]:
# create an empty list to store the venues information
venueslist = list()

# run a loop through the dataframe
for nname, nlat, nlng in zip(dtdata['Neighborhood'], dtdata['Latitude'], dtdata['Longitude']):
    # create the API Foursquare request URL
    url='https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        nlat,
        nlng,
        RADIUS,
        LIMIT)
    
    # make the GET request and store in JSON format
    results = requests.get(url).json()['response']['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for v in results:
        venueslist.append([nname,
                            nlat,
                            nlng,
                            v['venue']['name'],
                            v['venue']['location']['lat'],
                            v['venue']['location']['lng'],
                            v['venue']['categories'][0]['name']])
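
The loop assumes every response contains at least one group and every venue has at least one category. A more tolerant way to flatten a response might look like the sketch below. The payload is made up and only mirrors the keys read in the loop, and `extract_venues` is a hypothetical helper, not part of the Foursquare API.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into rows, tolerating missing groups or categories."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # fall back to None instead of raising IndexError on an empty list
        category = categories[0]['name'] if categories else None
        rows.append([name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category])
    return rows

# made-up payload that mirrors the keys used in the loop
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Venue',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = extract_venues(sample_payload, 'Regent Park, Harbourfront', 43.6555, -79.3626)
print(len(rows), 'venues extracted')
```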

Check how many venues were returned.

In [39]:
# print the quantity of venues returned
print(len(venueslist), 'venues returned')

1166 venues returned


Create a dataframe to store the returned venues and show the first 10 observations.

In [40]:
dfvenues = pd.DataFrame(columns=['Neighborhood',
                               'NeighborhoodLatitude',
                               'NeighborhoodLongitude',
                               'VenueName',
                               'VenueLatitude',
                               'VenueLongitude',
                               'VenueCategory'],
                      data=venueslist)

# show the first 10 observations
dfvenues.head(10)

Unnamed: 0,Neighborhood,NeighborhoodLatitude,NeighborhoodLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,"Regent Park, Harbourfront",43.6555,-79.3626,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.6555,-79.3626,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.6555,-79.3626,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.6555,-79.3626,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
4,"Regent Park, Harbourfront",43.6555,-79.3626,Dominion Pub and Kitchen,43.656919,-79.358967,Pub
5,"Regent Park, Harbourfront",43.6555,-79.3626,Body Blitz Spa East,43.654735,-79.359874,Spa
6,"Regent Park, Harbourfront",43.6555,-79.3626,Sumach Espresso,43.658135,-79.359515,Coffee Shop
7,"Regent Park, Harbourfront",43.6555,-79.3626,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
8,"Regent Park, Harbourfront",43.6555,-79.3626,Sukhothai,43.658444,-79.365681,Thai Restaurant
9,"Regent Park, Harbourfront",43.6555,-79.3626,Berkeley Church,43.655123,-79.365873,Event Space


Check how many venues were returned for each neighborhood. The limit was set to 100.

In [41]:
dfvenues.groupby('Neighborhood').count()

Neighborhood,NeighborhoodLatitude,NeighborhoodLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Berczy Park,92,92,92,92,92,92
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",57,57,57,57,57,57
Central Bay Street,62,62,62,62,62,62
Christie,12,12,12,12,12,12
Church and Wellesley,77,77,77,77,77,77
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"First Canadian Place, Underground city",100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100
"Harbourfront East, Union Station, Toronto Islands",4,4,4,4,4,4
"Kensington Market, Chinatown, Grange Park",51,51,51,51,51,51


Use the __one-hot encoding__ technique to cross all venue categories with all neighborhoods. The result will be a dataframe with one row per venue and one indicator column per venue category, which will later be aggregated by neighborhood.

In [42]:
# create the one hot encoding dataframe
dfvenues_ohe = pd.get_dummies(dfvenues[['VenueCategory']], prefix='', prefix_sep='')

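Notice that a column named __Neighborhood__ is already part of the __One Hot Encoding dataframe__. It was generated from a venue category that is itself called Neighborhood, so it holds 0/1 indicator values rather than neighborhood names.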

In [43]:
dfvenues_ohe['Neighborhood']

0       0
1       0
2       0
3       0
4       0
       ..
1161    0
1162    0
1163    0
1164    0
1165    0
Name: Neighborhood, Length: 1166, dtype: uint8

Overwrite the feature __Neighborhood__ with the neighborhood names from the __Venues dataframe__.

In [44]:
dfvenues_ohe['Neighborhood'] = dfvenues['Neighborhood']

Move the feature __Neighborhood__ to the first position.

In [45]:
# check the column position of feature Neighborhood
clidx = dfvenues_ohe.columns.get_loc('Neighborhood')

# define the columns positions having Neighborhood at first
columnnames = ['Neighborhood']+list(dfvenues_ohe.columns[:clidx])+list(dfvenues_ohe.columns[clidx+1:])

# change the columns positions
dfvenues_ohe = dfvenues_ohe[columnnames]

# show the first 10 observations
dfvenues_ohe.head(10)

Unnamed: 0,Neighborhood,Afghan Restaurant,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check the shape of the One Hot Encoding dataframe.

In [46]:
dfvenues_ohe.shape

(1166, 182)

Group the One Hot Encoding dataframe by Neighborhood calculating the mean of each venue category.

In [47]:
# group the dataframe and calculate mean
dfvenues_grp = dfvenues_ohe.groupby('Neighborhood').mean().reset_index()

# show the first 10 observations
dfvenues_grp.head(10)

Unnamed: 0,Neighborhood,Afghan Restaurant,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.01087,0.021739,0.0,0.0,0.0,0.0,0.01087,0.0,...,0.0,0.01087,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01087
1,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.017544
2,Central Bay Street,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.016129,0.016129,0.0,0.016129,0.0,0.0,0.0,0.0
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.083333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.012987,0.012987,0.0,0.0,0.012987,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,0.0,0.025974
5,"Commerce Court, Victoria Hotel",0.0,0.03,0.01,0.0,0.0,0.03,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
6,"First Canadian Place, Underground city",0.0,0.03,0.01,0.0,0.0,0.03,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
7,"Garden District, Ryerson",0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0
8,"Harbourfront East, Union Station, Toronto Islands",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Kensington Market, Chinatown, Grange Park",0.0,0.0,0.039216,0.0,0.039216,0.0,0.0,0.0,0.0,...,0.0,0.058824,0.0,0.039216,0.0,0.019608,0.0,0.0,0.0,0.0


Check the shape of the Venues Group dataframe.

In [48]:
dfvenues_grp.shape

(19, 182)

Notice that there are 19 observations again, the same number as the neighborhoods in Downtown Toronto.

Return the most common venue categories, from top 1 to top 10, for each neighborhood. This will be done in a new dataframe, __Venues Top10__.

In [50]:
# define the column names for the new dataframe
columnnames = ['Neighborhood',
               'CommonTop01','CommonTop02','CommonTop03','CommonTop04','CommonTop05',
               'CommonTop06','CommonTop07','CommonTop08','CommonTop09','CommonTop10']

# create the new dataframe
dfvenues_top10 = pd.DataFrame(columns=columnnames)

# get the neighborhoods from Venues Group dataframe
dfvenues_top10['Neighborhood'] = dfvenues_grp['Neighborhood']

Define a function to return the most common venues for the requested neighborhood. It will be used in the next step.

In [51]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
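
A quick way to check the function is to run the same logic on a single hand-made row. The sketch below re-implements it under the hypothetical name `top_categories`; the shares in `sample_row` are made up for illustration.

```python
import pandas as pd

def top_categories(row, n):
    """Skip the first entry (the neighborhood name), sort descending, keep n labels."""
    return row.iloc[1:].sort_values(ascending=False).index.values[:n]

# made-up shares for one neighborhood; the first entry is the name, not a share
sample_row = pd.Series({'Neighborhood': 'Christie', 'Café': 0.25,
                        'Grocery Store': 0.5, 'Park': 0.125, 'Pub': 0.0})

print(list(top_categories(sample_row, 3)))
```

Note that categories with a share of zero can still fill the remaining slots when a neighborhood has fewer distinct categories than requested, so the lower ranks are not always meaningful.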

Get the most common venue categories by running a loop through the Venues Group dataframe with the function defined above, and store the information in the __Venues Top10__ dataframe.

In [52]:
# set the limit of top to return
numtop = 10

# run a loop to get the most common venues categories for each neighborhood
for i in np.arange(dfvenues_grp.shape[0]):
    dfvenues_top10.iloc[i,1:] = return_most_common_venues(dfvenues_grp.iloc[i,:], numtop)

# show the first 10 observations
dfvenues_top10.head(10)

Unnamed: 0,Neighborhood,CommonTop01,CommonTop02,CommonTop03,CommonTop04,CommonTop05,CommonTop06,CommonTop07,CommonTop08,CommonTop09,CommonTop10
0,Berczy Park,Coffee Shop,Hotel,Café,Restaurant,Cocktail Bar,Seafood Restaurant,Beer Bar,Japanese Restaurant,Bakery,Cheese Shop
1,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Café,Bar,Italian Restaurant,French Restaurant,Gym / Fitness Center,Restaurant,Park,Bank,Bakery
2,Central Bay Street,Coffee Shop,Middle Eastern Restaurant,Sandwich Place,Bubble Tea Shop,Café,Italian Restaurant,Japanese Restaurant,Clothing Store,Restaurant,Ramen Restaurant
3,Christie,Grocery Store,Café,Athletics & Sports,Park,Coffee Shop,Candy Store,Baby Store,Playground,BBQ Joint,Fish Market
4,Church and Wellesley,Japanese Restaurant,Sushi Restaurant,Coffee Shop,Gay Bar,Restaurant,Yoga Studio,Men's Store,Mediterranean Restaurant,Bubble Tea Shop,Burger Joint
5,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Restaurant,Hotel,Gym,Salad Place,Seafood Restaurant,Steakhouse,Japanese Restaurant,Asian Restaurant
6,"First Canadian Place, Underground city",Coffee Shop,Café,Restaurant,Hotel,Gym,Salad Place,Seafood Restaurant,Steakhouse,Japanese Restaurant,Asian Restaurant
7,"Garden District, Ryerson",Coffee Shop,Clothing Store,Italian Restaurant,Cosmetics Shop,Middle Eastern Restaurant,Café,Japanese Restaurant,Bakery,Lingerie Store,Hotel
8,"Harbourfront East, Union Station, Toronto Islands",Harbor / Marina,Park,Café,Music Venue,Dog Run,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
9,"Kensington Market, Chinatown, Grange Park",Café,Mexican Restaurant,Bakery,Vegetarian / Vegan Restaurant,Art Gallery,Grocery Store,Vietnamese Restaurant,Arts & Crafts Store,Coffee Shop,Park


__Cluster__ the neighborhoods using the Venues Group dataframe, excluding the Neighborhood feature. The neighborhoods will be clustered by the frequency of the venue categories they contain, and the number of clusters will be set to 4.

In [53]:
# set number of clusters
k = 4

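# create a features dataframe to run the clustering model
# [the positional axis argument of drop was removed in pandas 2.0, so the column is named explicitly]
dfvenues_grp_clustering = dfvenues_grp.drop(columns='Neighborhood')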

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(dfvenues_grp_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
      dtype=int32)
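As a minimal sketch of the step above, the toy example below runs the same drop-then-fit pattern on a tiny, made-up frequency table. The dataframe `toy_grp`, its neighborhood names and its values are hypothetical and are not taken from the Foursquare data; only the `KMeans` usage mirrors the cell above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# hypothetical venue-category frequencies per neighborhood (toy values)
toy_grp = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Coffee Shop': [0.50, 0.45, 0.00, 0.05],
    'Park':        [0.05, 0.10, 0.60, 0.55],
    'Café':        [0.45, 0.45, 0.40, 0.40],
})

# keep only the numeric features; the text column cannot be clustered
features = toy_grp.drop(columns='Neighborhood')

# n_init is set explicitly because its default changed across scikit-learn versions
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# labels_ follows the row order of the features dataframe,
# so it can be assigned back to the original rows directly
toy_grp['Cluster'] = model.labels_
print(toy_grp[['Neighborhood', 'Cluster']])
```

The label numbers themselves are arbitrary; what matters is which rows share a label. Here the two coffee-heavy rows should end up together and the two park-heavy rows together.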

Add the clustering results to the Venues Top10 dataframe.

In [54]:
# add the cluster labels as the first column of the Venues Top10 dataframe
dfvenues_top10.insert(0, 'Cluster', kmeans.labels_)

# show the first 10 observations
dfvenues_top10.head(10)

Unnamed: 0,Cluster,Neighborhood,CommonTop01,CommonTop02,CommonTop03,CommonTop04,CommonTop05,CommonTop06,CommonTop07,CommonTop08,CommonTop09,CommonTop10
0,0,Berczy Park,Coffee Shop,Hotel,Café,Restaurant,Cocktail Bar,Seafood Restaurant,Beer Bar,Japanese Restaurant,Bakery,Cheese Shop
1,0,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Café,Bar,Italian Restaurant,French Restaurant,Gym / Fitness Center,Restaurant,Park,Bank,Bakery
2,0,Central Bay Street,Coffee Shop,Middle Eastern Restaurant,Sandwich Place,Bubble Tea Shop,Café,Italian Restaurant,Japanese Restaurant,Clothing Store,Restaurant,Ramen Restaurant
3,2,Christie,Grocery Store,Café,Athletics & Sports,Park,Coffee Shop,Candy Store,Baby Store,Playground,BBQ Joint,Fish Market
4,0,Church and Wellesley,Japanese Restaurant,Sushi Restaurant,Coffee Shop,Gay Bar,Restaurant,Yoga Studio,Men's Store,Mediterranean Restaurant,Bubble Tea Shop,Burger Joint
5,0,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Restaurant,Hotel,Gym,Salad Place,Seafood Restaurant,Steakhouse,Japanese Restaurant,Asian Restaurant
6,0,"First Canadian Place, Underground city",Coffee Shop,Café,Restaurant,Hotel,Gym,Salad Place,Seafood Restaurant,Steakhouse,Japanese Restaurant,Asian Restaurant
7,0,"Garden District, Ryerson",Coffee Shop,Clothing Store,Italian Restaurant,Cosmetics Shop,Middle Eastern Restaurant,Café,Japanese Restaurant,Bakery,Lingerie Store,Hotel
8,3,"Harbourfront East, Union Station, Toronto Islands",Harbor / Marina,Park,Café,Music Venue,Dog Run,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
9,0,"Kensington Market, Chinatown, Grange Park",Café,Mexican Restaurant,Bakery,Vegetarian / Vegan Restaurant,Art Gallery,Grocery Store,Vietnamese Restaurant,Arts & Crafts Store,Coffee Shop,Park


Create a __Clustering Map dataframe__ for the Folium map containing each neighborhood, its coordinates, its Top10 venue categories and its cluster label.

In [55]:
# merge the Toronto Dataframe with the Venues Top10 Dataframe
dfclustermap = pd.merge(dtdata[['Neighborhood', 'Latitude', 'Longitude']], dfvenues_top10, on='Neighborhood')

# show the first 10 observations
dfclustermap.head(10)

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster,CommonTop01,CommonTop02,CommonTop03,CommonTop04,CommonTop05,CommonTop06,CommonTop07,CommonTop08,CommonTop09,CommonTop10
0,"Regent Park, Harbourfront",43.6555,-79.3626,0,Coffee Shop,Restaurant,Breakfast Spot,Yoga Studio,Bakery,Pub,Distribution Center,Electronics Store,Event Space,Food Truck
1,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,0,Sushi Restaurant,Ramen Restaurant,Italian Restaurant,Bubble Tea Shop,Burrito Place,Café,Martial Arts Dojo,Coffee Shop,Burger Joint,Indian Restaurant
2,"Garden District, Ryerson",43.6572,-79.3783,0,Coffee Shop,Clothing Store,Italian Restaurant,Cosmetics Shop,Middle Eastern Restaurant,Café,Japanese Restaurant,Bakery,Lingerie Store,Hotel
3,St. James Town,43.6513,-79.3756,0,Café,Coffee Shop,Seafood Restaurant,American Restaurant,Cocktail Bar,Gastropub,Restaurant,Italian Restaurant,Moroccan Restaurant,Clothing Store
4,Berczy Park,43.6456,-79.3754,0,Coffee Shop,Hotel,Café,Restaurant,Cocktail Bar,Seafood Restaurant,Beer Bar,Japanese Restaurant,Bakery,Cheese Shop
5,Central Bay Street,43.6564,-79.386,0,Coffee Shop,Middle Eastern Restaurant,Sandwich Place,Bubble Tea Shop,Café,Italian Restaurant,Japanese Restaurant,Clothing Store,Restaurant,Ramen Restaurant
6,Christie,43.6683,-79.4205,2,Grocery Store,Café,Athletics & Sports,Park,Coffee Shop,Candy Store,Baby Store,Playground,BBQ Joint,Fish Market
7,"Richmond, Adelaide, King",43.6496,-79.3833,0,Café,Coffee Shop,Restaurant,Gym,Hotel,American Restaurant,Salad Place,Asian Restaurant,Japanese Restaurant,Steakhouse
8,"Harbourfront East, Union Station, Toronto Islands",43.623,-79.3936,3,Harbor / Marina,Park,Café,Music Venue,Dog Run,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
9,"Toronto Dominion Centre, Design Exchange",43.6469,-79.3823,0,Hotel,Coffee Shop,Café,Restaurant,American Restaurant,Japanese Restaurant,Italian Restaurant,Salad Place,Seafood Restaurant,Bakery


Check the Clustering Map Dataframe shape.

In [56]:
dfclustermap.shape

(19, 14)
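The shape check matters because `pd.merge` performs an inner join by default, so any neighborhood missing from one of the two dataframes would be dropped silently. The sketch below illustrates this with two small, hypothetical frames (`coords` and `top` are made-up names and values) and shows one possible way to audit the join with the `validate` and `indicator` parameters.

```python
import pandas as pd

# hypothetical coordinates for three neighborhoods
coords = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Latitude': [43.65, 43.66, 43.67],
    'Longitude': [-79.36, -79.38, -79.40],
})

# hypothetical cluster labels, with neighborhood 'C' missing
top = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Cluster': [0, 1],
})

# default how='inner' keeps only neighborhoods present in both frames;
# validate raises an error if a neighborhood name is duplicated on either side
inner = pd.merge(coords, top, on='Neighborhood', validate='one_to_one')

# a left join with indicator=True reveals the rows that found no match
audit = pd.merge(coords, top, on='Neighborhood', how='left', indicator=True)
unmatched = audit.loc[audit['_merge'] == 'left_only', 'Neighborhood'].tolist()

print(inner.shape)   # (2, 4)
print(unmatched)     # ['C']
```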

__Create the Clustering Map__ using the __Folium__ library; each cluster will be represented by a different color.

In [58]:
# create the Folium map
cluster_map = folium.Map(location=[tlat,tlng], zoom_start=13)

# set colors for the clusters: k evenly spaced samples from the tab10 colormap
c_array = plt.cm.tab10(np.linspace(0, 1, k))
c = [mcolors.rgb2hex(i) for i in c_array]

# add markers to the map
for lat, lng, nbh, clt in zip(dfclustermap['Latitude'], dfclustermap['Longitude'], dfclustermap['Neighborhood'], dfclustermap['Cluster']):
    label = folium.Popup(str(nbh) + ' Cluster ' + str(clt), parse_html=True)
    folium.CircleMarker([lat, lng],
                        radius = 5,
                        popup = label,
                        color = c[clt],       # cluster labels run from 0 to k-1
                        fill = True,
                        fill_color = c[clt],
                        fill_opacity = 0.3).add_to(cluster_map)

# show the map
cluster_map
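The color assignment used for the markers can be sketched on its own, without Folium. The example below is an illustrative alternative, not the exact code of the cell above: it picks the first `k` entries of the qualitative `tab10` palette by integer index, and the helper `cluster_color` is a hypothetical name introduced here. It assumes `k` is at most 10, the number of colors in `tab10`.

```python
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

k = 4

# tab10 is a qualitative palette with 10 entries;
# calling the colormap with an integer i returns its i-th RGBA color
palette = [mcolors.to_hex(plt.cm.tab10(i)) for i in range(k)]

def cluster_color(label):
    """Return the hex color for a cluster label in the range 0..k-1."""
    return palette[label]

for label in range(k):
    print(label, cluster_color(label))
```

Indexing the palette directly by the cluster label keeps the mapping between label and color easy to read in a legend or popup.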

Check the cluster distribution in the Venues Top10 dataframe.

In [59]:
dfvenues_top10.groupby('Cluster')['Neighborhood'].count()

Cluster
0    16
1     1
2     1
3     1
Name: Neighborhood, dtype: int64

Of the 19 neighborhoods in Downtown Toronto, the KMeans model, clustering on venue categories, has grouped 16 together and assigned each of the remaining 3 to its own cluster.
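A distribution this uneven may suggest that another number of clusters deserves a look. One common way to compare candidate values of `k`, not part of the assignment, is the silhouette score from scikit-learn. The sketch below applies it to synthetic points (`X` is made-up data with three well-separated groups, not the venue features) purely to show the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: three tight groups of 20 points each (hypothetical values)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.1, size=(20, 2)) for c in centers])

# fit k-means for several candidate k values and record the silhouette score
scores = {}
for n in range(2, 6):
    labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(X)
    scores[n] = silhouette_score(X, labels)

# the candidate with the highest score separates the groups best
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On the real venue features the scores are likely to be much closer to each other, so the result would be a hint rather than a definitive choice.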

This completes __Task 3__.

---