
## Applied Data Science Capstone Course - Week 3 Assignment - Part 3


Assignment Instructions:

Explore and cluster the neighborhoods in Toronto. 

Make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make.
- to generate maps to visualize your neighborhoods and how they cluster together.


In [1]:
# import dependencies
import pandas as pd
import numpy as np

import requests

from bs4 import BeautifulSoup
import lxml

import json
from pandas.io.json import json_normalize

from sklearn.cluster import KMeans 

import matplotlib.cm as cm
import matplotlib.colors as colors

print("Libraries imported.")

Libraries imported.


In [2]:
# install and import folium for working with maps
!conda install -c conda-forge folium=0.5.0 --yes 
import folium

Solving environment: done

# All requested packages already installed.



In [3]:
# install geopy and import Nominatim for working with geo coordinates
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

Solving environment: done

# All requested packages already installed.



First, we will repeat the steps from Week 3 Assignment - Part 1 and Week 3 Assignment - Part 2 to build the dataframe of Toronto neighborhoods.

In [4]:
# get the neighborhoods data wiki page and parse content with BeautifulSoup and lxml parser
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
table = BeautifulSoup(page.content, 'lxml').find('table', class_="wikitable sortable")
print("Wiki page data captured.")

Wiki page data captured.


In [5]:
# clean up data and prepare dataframe
clean_data = []

for row in table.find_all('tr'):
    cells = row.find_all('td')
    l=[]
    
    if len(cells) == 0:
        pass
    else:
        PostalCode = cells[0].text
        Borough = cells[1].text
        Neighborhood = (cells[2].text).rstrip() #remove any whitespace / newline
    
        # skip records where Borough = "Not assigned" and populate Neighborhood where "Not assigned"
        if Borough == "Not assigned":
            pass
        else:
            if Neighborhood == "Not assigned":
                Neighborhood = Borough
                
            l.append(PostalCode)
            l.append(Borough)
            l.append(Neighborhood)
            clean_data.append(l)                

# create pandas dataframe from clean data
df = pd.DataFrame(clean_data, columns = ['PostalCode', 'Borough', 'Neighborhood'])

# group Neighborhoods within the same Borough and PostalCode into a list 
group = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(list)

# convert list of Neighborhoods to String and save group as dataframe
df_grouped = pd.DataFrame(group.str.join(", "))

#reset index
df_grouped.reset_index(inplace=True)
df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
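
The grouping step above can be illustrated on a few toy rows. This is only a minimal sketch, assuming rows shaped like the scraped table (the three rows below are illustrative and are not the scraped data); it uses `agg` with `", ".join` as a one-step alternative to building lists and joining them afterwards.

```python
import pandas as pd

# toy rows shaped like the scraped table (illustrative values only)
rows = [
    ("M1B", "Scarborough", "Rouge"),
    ("M1B", "Scarborough", "Malvern"),
    ("M1G", "Scarborough", "Woburn"),
]
df_toy = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

# join the neighborhoods that share a postal code and borough into one string
df_toy_grouped = (
    df_toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
          .agg(", ".join)
          .reset_index()
)
print(df_toy_grouped)
```

Rows keep their original order within each group, so the joined string lists the neighborhoods in the order they were scraped.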


In [6]:
# access geo coordinates data and add to neighborhoods dataframe
df_geo = pd.read_csv("http://cocl.us/Geospatial_data")

df_merged = pd.merge(df_grouped, df_geo, left_on = 'PostalCode', right_on = 'Postal Code')

# remove redundant Postal Code column
df_merged.drop(columns = 'Postal Code', inplace = True)

df_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
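
`pd.merge` defaults to an inner join, so a postal code with no match in the coordinates file would be dropped silently. The sketch below shows one way to check for that, assuming toy frames with the same column names as above; the postal code `M9Z` and the borough `Hypothetical` are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # M9Z is a made-up code
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every neighborhood row; indicator=True adds a _merge column
checked = pd.merge(left, right, left_on="PostalCode", right_on="Postal Code",
                   how="left", indicator=True)

# rows marked "left_only" found no coordinates in the lookup table
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

An empty `missing` list would confirm that the inner join used in the notebook did not lose any neighborhoods.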


For our clustering exercise we will use a subset of this data, focusing only on boroughs whose name contains "Toronto".

In [7]:
# create a subset dataframe where Borough contains word "Toronto"  
df_toronto = df_merged[df_merged['Borough'].str.contains('Toronto')]
df_toronto = df_toronto.reset_index(drop = True)

print("The shape of our new subset is: " + str(df_toronto.shape) + "\n")
df_toronto.head()

The shape of our new subset is: (38, 5)



Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


We will now use the Folium library to display a map of Toronto and mark the coordinates of our neighborhoods.

In [8]:
# use geopy to obtain Toronto coordinates
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="none")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create a map of Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add neighborhood markers to the map
for lat, lng, label in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='orange',
        fill=True,
        fill_color='#ffa500',
        fill_opacity=0.7).add_to(map_toronto)  

#display map    
map_toronto

Now we will use the Foursquare API to explore popular venues in each neighborhood.

In [9]:
# define Foursquare credentials (placeholders: supply your own values and keep them out of shared notebooks)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605' # Foursquare API version

# set limit and radius
LIMIT = 100
radius = 500

In [10]:
# define function to retrieve nearby venues (optional: pass query as 'q'; e.g. q='restaurants' )
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100, q=''):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create request URL and make request
        url = 'https://api.foursquare.com/v2/venues/explore?&query={}&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            q, CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # collect only relevant information
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    # store
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

We will now use this function to retrieve up to 100 popular venues for all neighborhoods in our filtered dataframe. 

In [11]:
# neighborhoods and their coordinates are stored in df_toronto dataframe
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'], latitudes=df_toronto['Latitude'], longitudes=df_toronto['Longitude'])

print("We have retrieved " + str(toronto_venues.shape[0]) + " venues from Foursquare API\n")
toronto_venues.head()

We have retrieved 1691 venues from Foursquare API



Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
1,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop
2,The Beaches,43.676357,-79.293031,Williamson Road Playground,43.674716,-79.297338,Playground
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
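
The parsing step inside `getNearbyVenues` can be exercised without calling the API. The sketch below assumes a payload with the same keys that the function indexes (`venue`, `name`, `location`, `categories`); the helper name `parse_items` and the second sample item are hypothetical. It also tolerates an empty `categories` list, which the list comprehension above would fail on with an `IndexError` (whether the API ever returns such an item is an assumption, since this run did not hit one).

```python
def parse_items(name, lat, lng, items):
    """Flatten Foursquare-style 'items' into rows, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to None instead of raising when no category is present
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# hand-written sample shaped like the keys used in getNearbyVenues
sample_items = [
    {"venue": {"name": "Starbucks",
               "location": {"lat": 43.678798, "lng": -79.298045},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Uncategorized spot",  # hypothetical venue
               "location": {"lat": 43.68, "lng": -79.29},
               "categories": []}},
]

parsed = parse_items("The Beaches", 43.676357, -79.293031, sample_items)
for row in parsed:
    print(row)
```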


In [12]:
# check how many unique categories returned
print("There were " + str(len(toronto_venues['Venue Category'].unique())) + " unique categories returned")

There were 231 unique categories returned


We will use the k-means clustering method to analyze the Toronto neighborhood data.<br>
Since our data so far consists of categorical variables, we first need to derive numeric features from it.<br>
For each neighborhood, we will use the relative frequency of each venue category among its nearby venues as the input to k-means clustering.

In [13]:
#construct a dataframe for Category counters
df_counters = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

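# add in the neighborhoods column
# note: the venue categories include one named "Neighborhood", so this assignment
# overwrites that dummy column in place rather than appending a new last column
df_counters['Neighborhood'] = toronto_venues['Neighborhood'] 

# move the last column to the front; because of the overwrite above this moves 'Yoga Studio',
# while 'Neighborhood' only becomes the first column after groupby/reset_index below
df_counters = df_counters[[df_counters.columns[-1]] + list(df_counters.columns[:-1])]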

# calculate category frequency grouping by neighborhood; store results in the new dataframe and reset index
df_grouped = df_counters.groupby('Neighborhood').mean().reset_index()

print("The shape of our dataset for clustering is: " + str(df_grouped.shape) + "\n")
df_grouped.head()


The shape of our dataset for clustering is: (38, 231)



Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.01
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business reply mail Processing Centre969 Eastern,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.076923,0.076923,0.076923,0.153846,0.153846,0.153846,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
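
To make the frequency encoding concrete, here is a minimal sketch on toy data. The venue rows are illustrative, and the label column is named `Area` (a hypothetical name) so that it cannot collide with a venue category that happens to be called `Neighborhood`, as one of the categories returned above is.

```python
import pandas as pd

venues = pd.DataFrame({
    "Area": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Pub", "Park",
                       "Park", "Playground"],
})

# one indicator column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Area", venues["Area"])

# the mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby("Area").mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with many venues and neighborhoods with few venues are compared on the same scale.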


We can now use k-means clustering to group neighborhoods based on their most popular venues.

In [14]:
# set number of clusters
kclusters = 5

# prepare dataframe for clustering removing non-numeric columns
df_clustering = df_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

# check cluster labels that were generated
kmeans.labels_

array([2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
       1, 2, 0, 2, 2, 3, 4, 2, 2, 2, 2, 2, 2, 2, 1, 2], dtype=int32)
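
Before interpreting the clusters it helps to see how many neighborhoods fall into each one. The sketch below simply tallies the label array printed above, copied here as a plain list so that it runs on its own; the variable names are illustrative.

```python
from collections import Counter

# cluster labels copied from the kmeans.labels_ output above
printed_labels = [2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
                  1, 2, 0, 2, 2, 3, 4, 2, 2, 2, 2, 2, 2, 2, 1, 2]

# map each cluster label to the number of neighborhoods assigned to it
cluster_sizes = dict(sorted(Counter(printed_labels).items()))
print(cluster_sizes)
```

The tally shows that cluster 2 holds most of the neighborhoods, while clusters 0, 3 and 4 each contain a single neighborhood, so their "top categories" describe one place rather than a group.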

We can now review our clusters.<br>
We will look at the 5 most popular categories within each cluster to gain insight into the neighborhoods it contains.

In [15]:
#add labels to clustering dataframe and calculate average by category
df_clustering['Label'] = kmeans.labels_
df_clusters = df_clustering.groupby('Label').mean().reset_index()
print("We have created " + str(df_clusters.shape[0]) + " clusters from " + str(df_clustering.shape[0]) + " neighborhoods\n")
df_clusters.head()

We have created 5 clusters from 38 neighborhoods



Unnamed: 0,Label,Yoga Studio,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.013889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,0.006681,0.000358,0.000358,0.002481,0.002481,0.002481,0.004963,0.004963,0.004963,...,0.0013,0.000768,0.001626,0.006041,0.001336,0.005672,0.000323,0.003194,0.000358,0.000645
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We will now examine individual clusters:

In [16]:
# Cluster 0: 
# isolate the cluster using filter by Label, then remove non-numeric column, transpose and sort in descending order
c0 = df_clusters[df_clusters['Label']==0].drop(columns='Label').transpose().sort_values(by = 0, ascending = False)
c0.head()

Unnamed: 0,0
Playground,0.5
Restaurant,0.5
Nightclub,0.0
Men's Store,0.0
Mexican Restaurant,0.0


**Insights**: The single neighborhood in Cluster 0 is likely a mostly residential area, where restaurants and playgrounds are the popular venues.

In [17]:
# Cluster 1: 
# isolate the cluster using filter by Label, then remove non-numeric column, transpose and sort in descending order
c1 = df_clusters[df_clusters['Label']==1].drop(columns='Label').transpose().sort_values(by = 1, ascending = False)
c1.head()

Unnamed: 0,1
Bus Line,0.1125
Park,0.091667
Sushi Restaurant,0.076389
Jewelry Store,0.0625
Trail,0.0625


**Insights**: The Cluster 1 neighborhoods are likely a mix of residential and commercial properties, where shopping, dining and outdoor attractions are popular destinations.

In [18]:
# Cluster 2: 
# isolate the cluster using filter by Label, then remove non-numeric column, transpose and sort in descending order
c2 = df_clusters[df_clusters['Label']==2].drop(columns='Label').transpose().sort_values(by = 2, ascending = False)
c2.head()

Unnamed: 0,2
Coffee Shop,0.085801
Café,0.056629
Restaurant,0.027945
Italian Restaurant,0.025659
Pub,0.025022


**Insights**: The Cluster 2 neighborhoods are likely dense urban areas, where the most popular venues are coffee shops and restaurants.

In [19]:
# Cluster 3: 
# isolate the cluster using filter by Label, then remove non-numeric column, transpose and sort in descending order
c3 = df_clusters[df_clusters['Label']==3].drop(columns='Label').transpose().sort_values(by = 3, ascending = False)
c3.head()

Unnamed: 0,3
Park,0.5
Playground,0.25
Trail,0.25
New American Restaurant,0.0
Men's Store,0.0


**Insights**: The single neighborhood in Cluster 3 is likely a residential area close to a park or natural zone, where the most popular attractions are outdoor activities.

In [20]:
# Cluster 4: 
# isolate the cluster using filter by Label, then remove non-numeric column, transpose and sort in descending order
c4 = df_clusters[df_clusters['Label']==4].drop(columns='Label').transpose().sort_values(by = 4, ascending = False)
c4.head()

Unnamed: 0,4
Pool,0.5
Garden,0.5
Yoga Studio,0.0
Nightclub,0.0
Mexican Restaurant,0.0


**Insights**: The single neighborhood in Cluster 4 is likely a mix of commercial/municipal properties near a residential area.
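
The five cells above repeat the same filter, drop, transpose and sort steps. As a sketch only, they could be folded into one helper; the function name `top_categories` and the toy frame below are hypothetical, and the frame just mimics the layout of `df_clusters` (a `Label` column followed by one column per category).

```python
import pandas as pd

def top_categories(cluster_means, label, n=5):
    """Return the n largest category means for one cluster label."""
    row = cluster_means.loc[cluster_means["Label"] == label].drop(columns="Label")
    # iloc[0] turns the single matching row into a Series indexed by category
    return row.iloc[0].sort_values(ascending=False).head(n)

# toy stand-in for df_clusters (illustrative values only)
toy_clusters = pd.DataFrame({
    "Label": [0, 1],
    "Park": [0.5, 0.0],
    "Playground": [0.25, 0.1],
    "Coffee Shop": [0.0, 0.6],
})

top = top_categories(toy_clusters, 0, n=2)
print(top)
```

With the real data, looping over `range(kclusters)` and calling `top_categories(df_clusters, label)` would produce the same five summaries shown above.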

Now we will visualize our clusters on a map of Toronto.

In [21]:
#combine cluster labels by neighborhood with the geographical neighborhood data
df_grouped['Label'] = kmeans.labels_
df_labeled = pd.concat([df_grouped['Neighborhood'],df_grouped['Label']], axis=1)
df_final = pd.merge(df_labeled, df_toronto, left_on = 'Neighborhood', right_on = 'Neighborhood')

print("We will map and label " + str(df_final.shape[0]) + " neighborhoods\n")
df_final.head()


We will map and label 38 neighborhoods



Unnamed: 0,Neighborhood,Label,PostalCode,Borough,Latitude,Longitude
0,"Adelaide, King, Richmond",2,M5H,Downtown Toronto,43.650571,-79.384568
1,Berczy Park,2,M5E,Downtown Toronto,43.644771,-79.373306
2,"Brockton, Exhibition Place, Parkdale Village",2,M6K,West Toronto,43.636847,-79.428191
3,Business reply mail Processing Centre969 Eastern,1,M7Y,East Toronto,43.662744,-79.321558
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",2,M5V,Downtown Toronto,43.628947,-79.39442


In [22]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add cluster markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(df_final['Latitude'], df_final['Longitude'], df_final['Neighborhood'], df_final['Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

#display map       
map_clusters

Thank you.