# Applied Data Science Capstone Project
IBM's Data Science Specialization on Coursera <br>
Isabel Almeida <br>
April 2020

# Table of Contents

[SECTION 1 | Obtaining the data](#section1)<br>
    [1.1 Scraping with BeautifulSoup](#section1.1)<br>
    [1.2 Scraping with pandas](#section1.2)<br>
    [1.3 Data Wrangling](#section1.3)<br><br>
    
[SECTION 2 | Retrieving the latitude and longitude of each postal code](#section2)<br><br>

[SECTION 3  |  Neighborhood analysis and clustering](#section3)<br>
    [3.1 Listing venues in each borough/neighborhood](#section3.1)<br>
    [3.2 Uncovering most common types of venues in each borough/neighborhood](#section3.2)<br>
    [3.3 Clustering neighborhoods](#section3.3)<br>

In [1]:
import numpy as np
import pandas as pd

<a id='section1'></a>
# Section 1  |  Obtaining the data

<a id='section1.1'></a>
## 1.1 Scraping with BeautifulSoup

In [2]:
from bs4 import BeautifulSoup as bs
import requests

In [3]:
# Retrieve webpage we want to get data from.
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source_page = requests.get(url).text

# Parse and print webpage's HTML (expand the "..." under this cell).
soup = bs(source_page,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"652df336-7a6c-4f10-8854-52123f8436fb","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":960187814,"wgRevisionId":960187814,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toron

By examining the page's HTML, we notice that the information we need is under the tag < table > whose class attribute contains "wikitable". Since class is not a unique attribute, we make sure this is the only table on the page with that class value by counting all instances of this combination:

In [4]:
tables_classwiki = soup.find_all('table', class_='wikitable')

print('Instances of tag <table class="wikitable">:',
      len(tables_classwiki)
     )

Instances of tag <table class="wikitable">: 1


Now that we're sure there is only one instance of a table tag with class="wikitable" and that we are pulling only the information we want from the page, we assign it to a variable so we can work specifically with its subtags.

In [5]:
# Since there is only one element in the tables_classwiki result set, we know its index is 0.
table = tables_classwiki[0] 

print(table.prettify()) # Expand the "..." under this cell to see the resulting HTML.

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postal Code
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighborhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    North York
   </td>
   <td>
    Parkwoods
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    North York
   </td>
   <td>
    Victoria Village
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    Downtown Toronto
   </td>
   <td>
    Regent Park, Harbourfront
   </td>
  </tr>
  <tr>
   <td>
    M6A
   </td>
   <td>
    North York
   </td>
   <td>
    Lawrence Manor, Lawrence Heights
   </td>
  </tr>
  <tr>
   <td>
    M7A
   </td>
   <td>
    Downtown Toronto
   </td>
   <td>
    Queen's Park, Ontario Provincial Government
   </td>
  </tr>
  <tr>
   <td>
    M8A

We can see that each < tr > tag starts a table row. The first row is the table header and contains < th > subtags; the remaining rows hold the table data and contain < td > subtags.

In [6]:
rows = table.find_all('tr')

# Get rid of headers to facilitate looping.
del rows[0]

# Create empty lists to receive table data from loop.
postal_code = []
borough = []
neighbourhood = []

for row in rows:
    # Look up the row's cells once instead of once per column.
    cells = row.find_all('td')
    postal_code.append(cells[0].text)
    borough.append(cells[1].text)
    neighbourhood.append(cells[2].text)
    
cols = {'Postal Code': postal_code,'Borough': borough,'Neighbourhood': neighbourhood}

df = pd.DataFrame(cols)
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
...,...,...,...
175,M5Z\n,Not assigned\n,Not assigned\n
176,M6Z\n,Not assigned\n,Not assigned\n
177,M7Z\n,Not assigned\n,Not assigned\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."
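Every value scraped with BeautifulSoup still carries a trailing newline (the \n visible in the output above). The following is a minimal sketch of one way to clean it up, assuming a small hand-made dataframe (`scraped`, a hypothetical stand-in for the dataframe built above) rather than the real scraped data:

```python
import pandas as pd

# Hypothetical sample mimicking the scraped values, trailing "\n" included.
scraped = pd.DataFrame({
    'Postal Code': ['M3A\n', 'M4A\n'],
    'Borough': ['North York\n', 'North York\n'],
    'Neighbourhood': ['Parkwoods\n', 'Victoria Village\n'],
})

# Strip leading/trailing whitespace from every text column.
cleaned = scraped.apply(lambda col: col.str.strip())

print(cleaned)
```

The same effect could probably be had at scraping time by stripping each cell's text before appending it to the lists.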


<a id='section1.2'></a>
## 1.2 Scraping with pandas

Since we are trying to scrape data from a table, we can simply use pandas' read_html, which gives us a list of all the tables it can find in an HTML page.

In [7]:
tables = pd.read_html(url)
print('No. of tables found on page:',
      len(tables)
     )

No. of tables found on page: 3


We have found 3 tables on our Wikipedia page. By trying indexes, we find that the table we want is the one with index 0, which we isolate in a new dataframe.

In [8]:
df = tables[0]
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
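Instead of finding the right table by trying indexes, we could pick it by the columns we expect. The sketch below assumes a hypothetical helper, `pick_table`, and two hand-made dataframes standing in for the list returned by read_html; it is not part of the original analysis:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include all required ones."""
    for candidate in tables:
        if set(required_columns).issubset(candidate.columns):
            return candidate
    raise ValueError('No table contains columns: {}'.format(required_columns))

# Hypothetical stand-ins for the list returned by pd.read_html.
tables_demo = [
    pd.DataFrame({0: ['navigation'], 1: ['links']}),
    pd.DataFrame({'Postal Code': ['M3A'],
                  'Borough': ['North York'],
                  'Neighborhood': ['Parkwoods']}),
]

wanted = pick_table(tables_demo, ['Postal Code', 'Borough', 'Neighborhood'])
print(wanted)
```

read_html also accepts a match argument that keeps only tables containing a given text, which may be enough on its own for a page like this one.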


<a id='section1.3'></a>
## 1.3 Data Wrangling

Because it turned out cleaner, we will use the table scraped with pandas to wrangle the data. Moving through each criterion:

#### 1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood <br>
Criterion already met.

#### 2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [9]:
df = df[df.Borough != 'Not assigned']
df = df.reset_index(drop=True)
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


#### 3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

The data seems to already be organized in this way, but to check if this is true for all rows we will pivot the data using the postal code as the index and count the number of neighborhood rows associated.

In [10]:
pd.pivot_table(df, index=["Postal Code"], values="Neighborhood",aggfunc=len, margins=True)

Unnamed: 0_level_0,Neighborhood
Postal Code,Unnamed: 1_level_1
M1B,1
M1C,1
M1E,1
M1G,1
M1H,1
...,...
M9P,1
M9R,1
M9V,1
M9W,1


Because the count of neighbourhood values is equal to the number of unique postal codes (103), we conclude the entire data is already organized in the manner required.
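For reference, here is a small sketch of how the same check could be done directly, and how repeated postal codes could be combined if there were any. It assumes a hypothetical three-row sample in which M5A appears twice, as in the criterion's description; our actual data has no such duplicates:

```python
import pandas as pd

# Hypothetical sample with M5A listed twice, as in the criterion's example.
sample = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# True only when no postal code repeats.
print(sample['Postal Code'].is_unique)

# Combine repeated postal codes into one row, joining neighborhoods with a comma.
combined = (
    sample.groupby(['Postal Code', 'Borough'], as_index=False, sort=False)['Neighborhood']
          .agg(', '.join)
)
print(combined)
```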

#### 4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [11]:
check = df.Neighborhood.str.contains('Not assigned', case=False)
df[check]

Unnamed: 0,Postal Code,Borough,Neighborhood


There don't seem to be any unassigned neighborhoods left in our data, so no replacement is needed.
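Had the check returned any rows, the replacement required by this criterion could be sketched as follows. The sample dataframe and its 'Example Borough' row are hypothetical and only serve to illustrate the step:

```python
import pandas as pd

# Hypothetical sample: the first row has a borough but no assigned neighborhood.
sample = pd.DataFrame({
    'Postal Code': ['M0X', 'M3A'],
    'Borough': ['Example Borough', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Copy the borough name into the neighborhood wherever it is unassigned.
mask = sample['Neighborhood'] == 'Not assigned'
sample.loc[mask, 'Neighborhood'] = sample.loc[mask, 'Borough']

print(sample)
```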

#### 5. Use the .shape method to print the number of rows of your dataframe.

In [12]:
df.shape

(103, 3)

After excluding unassigned boroughs and grouping neighbourhoods, we are left with 103 unique postal codes in our data.

<a id='section2'></a>
# Section 2  |  Retrieving the latitude and longitude of each postal code

We will use the provided .csv to get the information needed.

In [13]:
csv = pd.read_csv('https://cocl.us/Geospatial_data')
csv

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


We can merge the data based on the postal code column.

In [14]:
df = df.merge(csv, how='inner', on='Postal Code')
df

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
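An inner merge silently drops postal codes that have no coordinates. One way to check for that, sketched here on hypothetical sample data (the 'M0X' row is made up), is to merge with an indicator column and validate that keys are unique on both sides:

```python
import pandas as pd

# Hypothetical samples; 'M0X' has no matching coordinates on purpose.
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises if a postal code repeats on either side;
# indicator adds a '_merge' column saying where each row came from.
checked = left.merge(right, how='left', on='Postal Code',
                     validate='one_to_one', indicator=True)

unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched)
```

If `unmatched` is empty, the inner merge loses nothing.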


<a id='section3'></a>
# Section 3  |  Neighborhood analysis and clustering

In [15]:
import folium

First, let's visualize all boroughs and their respective neighborhoods on a map.

In [16]:
# Creating a map of Toronto. A quick Google search returns latitude = 43.651070 and longitude = -79.347015.
map_toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# Adding markers to map.
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

<a id='section3.1'></a>
## 3.1 Listing venues in each borough/neighborhood

Now, for each borough we will look up nearby venues within a 500 m radius, up to a maximum of 50 venues. For this, we will use Foursquare's data, which can be accessed via its API.

In [17]:
# Defining my Foursquare credentials. They are read from environment variables
# so that they are not stored in the notebook (the variable names are assumed).
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # my Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # my Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [18]:
# Defining the function that will pull nearby venues from each borough.
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=50):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # Creating the API request URL.
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Making the GET request.
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Returning only relevant information for each nearby venue.
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
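The function above indexes straight into the response, so it fails if a response has no groups or a venue has no category. Below is a minimal sketch of a more defensive extraction step. The helper name `extract_venues` and the sample payload are hypothetical; the payload only mirrors the keys the function above already relies on:

```python
def extract_venues(payload):
    """Pull (name, lat, lng, category) tuples out of an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue has no category listed.
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hypothetical payload shaped like the response used above.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled Venue',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}

print(extract_venues(sample_payload))
print(extract_venues({}))
```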

In [20]:
toronto_venues = getNearbyVenues(names=df['Neighborhood'],
                                 latitudes=df['Latitude'],
                                 longitudes=df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

Our call has returned a total of 1676 venue records for the boroughs we are looking into.

In [21]:
print(toronto_venues.shape)
toronto_venues.head()

(1676, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


The number of venues returned in each borough (listed by neighborhood group) is found below.

In [22]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",9,9,9,9,9,9
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
...,...,...,...,...,...,...
"Willowdale, Willowdale East",32,32,32,32,32,32
"Willowdale, Willowdale West",5,5,5,5,5,5
Woburn,3,3,3,3,3,3
Woodbine Heights,8,8,8,8,8,8


The 1676 venues listed are spread over 250 unique categories.

In [23]:
print('There are {} unique venue categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 250 unique venue categories.


<a id='section3.2'></a>
## 3.2 Uncovering most common types of venues in each borough/neighborhood

First, we will count each venue under its respective category.

In [24]:
# One hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Adding neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Moving neighborhood column to the first column
toronto_onehot = toronto_onehot.set_index('Neighborhood').reset_index()

toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Then we will average the frequency of occurrence of each category per borough (listed as neighborhood group).

In [25]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,"Willowdale, Willowdale East",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0
90,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
91,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
92,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0
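To see why the mean of the one-hot columns gives a relative frequency, consider this toy sketch. The neighborhoods 'A' and 'B' and their venues are made up for illustration:

```python
import pandas as pd

# Hypothetical venues: neighborhood A has four venues, B has one.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Coffee Shop', 'Bank', 'Park'],
})

# One column per category, 1 where the venue belongs to it, 0 otherwise.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Averaging 0/1 flags per neighborhood yields each category's share of venues.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

In this sketch, two of the four venues in A are parks, so the Park column for A is 0.5, and each row's shares add up to 1.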


Finally, we will display the 10 most common types of venues for each neighborhood.

In [26]:
# Creating a function to sort venues in descending order of frequency.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [69]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Creating columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # Positions beyond the third use the 'th' suffix.
        columns.append('{}th Most Common Venue'.format(ind+1))

# Creating dataframe listing most common types of venues for each neighborhood.
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Breakfast Spot,Lounge,Skating Rink,Latin American Restaurant,Clothing Store,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant
1,"Alderwood, Long Branch",Pizza Place,Athletics & Sports,Dance Studio,Pharmacy,Coffee Shop,Pub,Sandwich Place,Gym,Antique Shop,Department Store
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Fried Chicken Joint,Bridal Shop,Sandwich Place,Diner,Deli / Bodega,Restaurant,Middle Eastern Restaurant,Supermarket
3,Bayview Village,Bank,Japanese Restaurant,Chinese Restaurant,Café,Yoga Studio,Dog Run,Diner,Discount Store,Distribution Center,Doner Restaurant
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Restaurant,Coffee Shop,Comfort Food Restaurant,Thai Restaurant,Juice Bar,Butcher,Café,Indian Restaurant
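The suffix list used for the column names works for the top 10, but it would label position 21 as "21th" if num_top_venues were ever raised that far. A small sketch of a more general helper follows; `ordinal` is a hypothetical name, not something used elsewhere in this notebook:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the other teens always take 'th'.
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Rebuilding the same column names as above with the helper.
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```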


<a id='section3.3'></a>
## 3.3 Clustering neighborhoods

In [35]:
from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

Based on the frequency of each type of venue in each borough, we will group the boroughs into clusters using the k-means method.

In [227]:
# Setting number of clusters.
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# Running k-means clustering.
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Checking cluster labels generated for each row in the dataframe.
kmeans.labels_[0:10]

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
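The first ten labels all fall in the same cluster, so it can help to look at how many rows each cluster received. A minimal sketch, assuming a hypothetical `labels` array in place of kmeans.labels_:

```python
import numpy as np

# Hypothetical label array standing in for kmeans.labels_.
labels = np.array([2, 2, 0, 2, 1, 2, 0, 3, 2, 2])
kclusters = 4

# Count how many rows were assigned to each cluster label.
sizes = np.bincount(labels, minlength=kclusters)
for cluster, size in enumerate(sizes):
    print('Cluster {}: {} rows'.format(cluster, size))
```

A very uneven split, with one large cluster and several tiny ones, may suggest trying a different number of clusters.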

The following table shows which cluster each borough and contained neighborhoods belong to (scroll all the way to the right to see the 'Cluster Labels' column).

In [229]:
# Adding clustering labels.
neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_

toronto_merged = df

# Merging dataframes to add latitude/longitude for each neighborhood.
toronto_merged = toronto_merged.merge(neighborhoods_venues_sorted, how='inner', on='Neighborhood')

toronto_merged

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,M3A,North York,Parkwoods,43.753259,-79.329656,Fast Food Restaurant,Park,Food & Drink Shop,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,0
1,M4A,North York,Victoria Village,43.725882,-79.315572,Pizza Place,Coffee Shop,Portuguese Restaurant,Hockey Arena,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant,2
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,Coffee Shop,Bakery,Park,Pub,Theater,Café,Breakfast Spot,Farmers Market,Distribution Center,Restaurant,2
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Clothing Store,Furniture / Home Store,Coffee Shop,Event Space,Miscellaneous Shop,Accessories Store,Vietnamese Restaurant,Boutique,Women's Store,Eastern European Restaurant,2
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Coffee Shop,Sushi Restaurant,Yoga Studio,Discount Store,Park,Music Venue,Mexican Restaurant,Italian Restaurant,Hobby Shop,General Entertainment,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,River,Park,Pool,Yoga Studio,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant,0
94,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,Sushi Restaurant,Coffee Shop,Restaurant,Yoga Studio,Men's Store,Burger Joint,Gay Bar,Park,Juice Bar,Diner,2
95,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,Light Rail Station,Yoga Studio,Auto Workshop,Comic Shop,Pizza Place,Restaurant,Burrito Place,Brewery,Skate Park,Smoke Shop,2
96,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,Baseball Field,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,3


Finally, we create a map to visualize the clusters and get an idea of how the city is divided based on the types of venues found in each area.<br><br>
Comments on each cluster can be found further below.

In [230]:
# Creating a map. As mentioned earlier, Toronto's coordinates are: latitude = 43.651070 and longitude = -79.347015.
map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=11)

# Setting color scheme for the clusters.
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Adding markers to the map.
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
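Note that the markers are coloured with rainbow[cluster-1], so cluster 0 wraps around to the last colour in the list. The sketch below illustrates that indexing with descriptive colour names taken from the cluster comments in this notebook; the `palette` list is a stand-in for the hex codes, not the actual values:

```python
# Colour names in palette order for kclusters = 4
# (descriptive stand-ins for the hex codes in `rainbow`).
palette = ['purple', 'turquoise', 'faint green', 'red']

# Same indexing as the map code: an index of -1 selects the last colour.
mapping = {cluster: palette[cluster - 1] for cluster in range(len(palette))}

for cluster, colour in mapping.items():
    print('Cluster {} -> {}'.format(cluster, colour))
```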

#### Cluster 0

Cluster 0, in red on the map, is marked by many parks.

In [231]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,North York,Fast Food Restaurant,Park,Food & Drink Shop,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,0
19,York,Park,Women's Store,Pool,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,0
33,East York,Park,Convenience Store,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,0
48,North York,Park,Bakery,Construction & Landscaping,Basketball Court,Yoga Studio,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,0
57,Central Toronto,Park,Swim School,Bus Line,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,0
60,York,Park,Convenience Store,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,0
62,North York,Park,Construction & Landscaping,Convenience Store,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,0
64,Central Toronto,Jewelry Store,Park,Sushi Restaurant,Trail,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Yoga Studio,0
79,Central Toronto,Restaurant,Park,Trail,Yoga Studio,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,0
81,Scarborough,Bakery,Playground,Park,Coffee Shop,Yoga Studio,Dog Run,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,0


In [232]:
pd.pivot_table(
    toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]],
    index = "1st Most Common Venue",
    values = "Borough",
    aggfunc = len)

Unnamed: 0_level_0,Borough
1st Most Common Venue,Unnamed: 1_level_1
Bakery,1
Fast Food Restaurant,1
Jewelry Store,1
Park,7
Restaurant,1
River,1
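The same tally could also be obtained with value_counts, which sorts the categories by frequency. A short sketch on a hypothetical four-row sample (not the actual cluster data):

```python
import pandas as pd

# Hypothetical rows standing in for one cluster's slice of toronto_merged.
cluster_rows = pd.DataFrame({
    'Borough': ['York', 'East York', 'North York', 'Scarborough'],
    '1st Most Common Venue': ['Park', 'Park', 'Fast Food Restaurant', 'Bakery'],
})

# Count how often each category is the most common venue, most frequent first.
counts = cluster_rows['1st Most Common Venue'].value_counts()
print(counts)
```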


#### Cluster 1

Cluster 1, in purple on the map, consists of only one borough, where the most common venue category is playground.

In [233]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
30,Scarborough,Playground,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run,1


#### Cluster 2

Cluster 2, in turquoise on the map, spans a wide area and offers many coffee shops. This cluster seems to concentrate most economic activity in the city.

In [235]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
1,North York,Pizza Place,Coffee Shop,Portuguese Restaurant,Hockey Arena,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant,2
2,Downtown Toronto,Coffee Shop,Bakery,Park,Pub,Theater,Café,Breakfast Spot,Farmers Market,Distribution Center,Restaurant,2
3,North York,Clothing Store,Furniture / Home Store,Coffee Shop,Event Space,Miscellaneous Shop,Accessories Store,Vietnamese Restaurant,Boutique,Women's Store,Eastern European Restaurant,2
4,Downtown Toronto,Coffee Shop,Sushi Restaurant,Yoga Studio,Discount Store,Park,Music Venue,Mexican Restaurant,Italian Restaurant,Hobby Shop,General Entertainment,2
5,Scarborough,Fast Food Restaurant,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,Dog Run,2
...,...,...,...,...,...,...,...,...,...,...,...,...
91,Downtown Toronto,Coffee Shop,Pizza Place,Italian Restaurant,Bakery,Restaurant,Café,Chinese Restaurant,Pub,Playground,Pharmacy,2
92,Downtown Toronto,Café,Coffee Shop,Restaurant,Japanese Restaurant,Seafood Restaurant,Tea Room,Concert Hall,Deli / Bodega,Gym,Hotel,2
94,Downtown Toronto,Sushi Restaurant,Coffee Shop,Restaurant,Yoga Studio,Men's Store,Burger Joint,Gay Bar,Park,Juice Bar,Diner,2
95,East Toronto,Light Rail Station,Yoga Studio,Auto Workshop,Comic Shop,Pizza Place,Restaurant,Burrito Place,Brewery,Skate Park,Smoke Shop,2


In [236]:
pd.pivot_table(
    toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]],
    index = "1st Most Common Venue",
    values = "Borough",
    aggfunc = len)

Unnamed: 0_level_0,Borough
1st Most Common Venue,Unnamed: 1_level_1
Airport Service,1
Auto Garage,1
Bakery,1
Bank,1
Bar,2
Breakfast Spot,2
Café,10
Chinese Restaurant,1
Clothing Store,4
Coffee Shop,12


#### Cluster 3

Cluster 3, in faint green on the map, has only two boroughs, where the most common venues are for sports (baseball fields).

In [237]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
54,North York,Baseball Field,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,3
96,Etobicoke,Baseball Field,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Doner Restaurant,3
