# Applied Data Science Capstone Project - Week 3

In this capstone project, we explore the postal code data for Toronto, map the neighborhoods, and analyze them.

## Conversion of the Wikipedia Page Table to a Data Frame

Importing some libraries that will be used in this section

In [39]:
import requests # library to handle web requests
import json # library to handle json files
import pandas as pd
from pandas import json_normalize # transform a JSON file into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)
import numpy as np

# urllib is part of the Python standard library, so it needs no separate installation
#!conda install -c conda-forge bs4 --yes # uncomment if not installed - BeautifulSoup for html files

#!conda install -c conda-forge geopy --yes # uncomment this line if not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# some libraries for html manipulation
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if not installed
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

The following context disables certificate verification so that SSL errors are ignored. I copied it from a previous course, as we were advised by Prof. Chuck!

In [2]:
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
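
Turning verification off also removes the protection against tampered responses, so it may be worth keeping it as an explicit opt-out. The sketch below wraps the same three lines in a helper; the name `make_ssl_context` is hypothetical and not part of the original notebook.

```python
import ssl

def make_ssl_context(verify=True):
    """Return an SSL context; verify=False reproduces the workaround above.

    `make_ssl_context` is a hypothetical helper name used for illustration.
    """
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname has to be switched off before verify_mode can be CERT_NONE
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx

strict_ctx = make_ssl_context()               # certificates are checked
relaxed_ctx = make_ssl_context(verify=False)  # same settings as the cell above
```

Either context can be passed to `urllib.request.urlopen` through its `context` argument.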

Let's fetch the HTML of the page and then use BeautifulSoup to parse it. If you print 'soup', you will see that it contains the whole Wikipedia page.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = urllib.request.urlopen(url, context = ctx).read()
soup = BeautifulSoup(html, 'html.parser')

We then use some trial and error to find the exact location of the table containing the data we need. We note that, if we get all the 'table' tags, the first item in the resulting list is the table that we need!

In [4]:
table_whole = soup.find_all('table')
table_whole[0]

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td>

Let's just focus on the pieces that we need. As you can see, all the data are stored in table data or 'td' tags which are nested inside the 'tr' tags, so we need to loop through all the 'tr' tags to find the relevant 'td' text and save them in a dataframe.

In [5]:
column_names = ['PostalCode', 'Borough', 'Neighborhood']

# Collect one dict per table row, then build the dataframe once at the end.
# (DataFrame.append was removed in pandas 2.0, and appending row by row is slow.)
rows = []
trs = table_whole[0].find_all('tr')
for tr in trs[1:]: # skip the header row: it holds 'th' cells, so find_all('td') returns an empty list and tds[0] fails
    tds = tr.find_all('td')
    rows.append({'PostalCode': tds[0].get_text().rstrip(),
                 'Borough': tds[1].get_text().rstrip(),
                 'Neighborhood': tds[2].get_text().rstrip()})

neighborhoods = pd.DataFrame(rows, columns=column_names)
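
To make the row/cell traversal concrete without a network call, here is a minimal sketch on an inline HTML snippet. It swaps BeautifulSoup for the standard library's `html.parser`, so it is an illustration of the same idea and not the method used in this notebook. The class name `TableRowParser` and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:   # a header row has only <th> cells, so it stays empty and is skipped
                self.rows.append(self._row)
            self._row = None

# Small sample shaped like the Wikipedia table (trailing newlines inside cells included).
sample_html = """
<table>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
table_rows = parser.rows
```

The header row drops out naturally here because it contains no 'td' cells, which is the same reason the loop above starts at the second 'tr'.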

Below is the pulled table, but according to the instructions, it needs some clean-up.

In [6]:
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


First, let's remove all the rows where Borough is 'Not assigned', then reset the index to start from zero again:

In [7]:
neighborhoods = neighborhoods[neighborhoods.Borough != 'Not assigned']

In [8]:
neighborhoods.reset_index(drop = True, inplace=True)

In [9]:
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Second, if there is no Neighborhood defined for a specific row, we need to use the Borough's name as the Neighborhood name. Interestingly, I didn't find such an instance!

In [10]:
neighborhoods[neighborhoods.Neighborhood == '']

Unnamed: 0,PostalCode,Borough,Neighborhood
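
The check above only looks for empty strings, and it found nothing on this snapshot of the page. If the page ever contains such rows, the fallback could look like the sketch below. The sample rows are made up for illustration (the 'M9Z' row is not from the Wikipedia table), and treating 'Not assigned' as missing is my assumption.

```python
import pandas as pd

# Made-up rows for illustration; 'M9Z' is not taken from the Wikipedia table.
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M9Z'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Treat both an empty string and the 'Not assigned' placeholder as missing.
missing = sample['Neighborhood'].isin(['', 'Not assigned'])

# Copy the borough name into the neighborhood column for those rows only.
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']
```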


The shape of the final dataframe is as follows:

In [11]:
neighborhoods.shape

(103, 3)

## Geocoding Neighborhoods

The first step is to create a list of all the neighborhoods (the last column in the neighborhoods dataframe). Given that some rows have more than one neighborhood, separated by commas, we use the split function and indexing to get the first neighborhood.

In [12]:
neigh_loc = []
for i in np.arange(neighborhoods.shape[0]):
    neigh_loc.append(neighborhoods.loc[i, 'Neighborhood'].split(',')[0])
neigh_loc

['Parkwoods',
 'Victoria Village',
 'Regent Park',
 'Lawrence Manor',
 "Queen's Park",
 'Islington Avenue',
 'Malvern',
 'Don Mills',
 'Parkview Hill',
 'Garden District',
 'Glencairn',
 'West Deane Park',
 'Rouge Hill',
 'Don Mills',
 'Woodbine Heights',
 'St. James Town',
 'Humewood-Cedarvale',
 'Eringate',
 'Guildwood',
 'The Beaches',
 'Berczy Park',
 'Caledonia-Fairbanks',
 'Woburn',
 'Leaside',
 'Central Bay Street',
 'Christie',
 'Cedarbrae',
 'Hillcrest Village',
 'Bathurst Manor',
 'Thorncliffe Park',
 'Richmond',
 'Dufferin',
 'Scarborough Village',
 'Fairview',
 'Northwood Park',
 'East Toronto',
 'Harbourfront East',
 'Little Portugal',
 'Kennedy Park',
 'Bayview Village',
 'Downsview',
 'The Danforth West',
 'Toronto Dominion Centre',
 'Brockton',
 'Golden Mile',
 'York Mills',
 'Downsview',
 'India Bazaar',
 'Commerce Court',
 'North Park',
 'Humber Summit',
 'Cliffside',
 'Willowdale',
 'Downsview',
 'Studio District',
 'Bedford Park',
 'Del Ray',
 'Humberlea',
 'Birch C

Next, we loop through the neighborhoods in the list we created, pull the latitude and longitude of each one with the geopy library, save them in two separate lists, and add those lists to our master dataframe as new columns. Note that a few entries have no specific coordinates, like 'Canada Post Gateway Processing Centre', so I defined an if condition that stores 'No Coordinates' for those locations. They will then be removed from the final dataset:

In [13]:
latitude = []
longitude = []
geolocator = Nominatim(user_agent="toronto_explorer") # create the geocoder once, outside the loop
for loc in neigh_loc:
    location = geolocator.geocode(loc)
    if location is None: # no match was found for this name
        latitude.append('No Coordinates')
        longitude.append('No Coordinates')
    else:
        latitude.append(location.latitude)
        longitude.append(location.longitude)
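
A possible refinement of the loop above is to qualify each query with the city, province, and country, so that an ambiguous name is less likely to resolve to a place elsewhere. The sketch below is offline: `qualify`, `geocode_all`, and `fake_geocode` are hypothetical names, and the fake geocoder returns made-up coordinates just to show the flow.

```python
from types import SimpleNamespace

def qualify(name, city='Toronto', region='Ontario', country='Canada'):
    """Build a less ambiguous geocoder query from a bare neighborhood name."""
    return f'{name}, {city}, {region}, {country}'

def geocode_all(names, geocode):
    """Return one (latitude, longitude) pair per name, or (None, None) if nothing is found.

    `geocode` is any callable that takes a query string and returns either None
    or an object with `.latitude` and `.longitude` attributes.
    """
    coords = []
    for name in names:
        location = geocode(qualify(name))
        if location is None:
            coords.append((None, None))
        else:
            coords.append((location.latitude, location.longitude))
    return coords

def fake_geocode(query):
    """Offline stand-in for a real geocoder, with made-up coordinates."""
    known = {
        'Parkwoods, Toronto, Ontario, Canada': SimpleNamespace(latitude=43.75, longitude=-79.33),
    }
    return known.get(query)

demo_coords = geocode_all(['Parkwoods', 'Nowhere Special'], fake_geocode)
```

With geopy, the call would be `geocode_all(neigh_loc, geolocator.geocode)`. Storing None instead of the string 'No Coordinates' keeps the coordinate columns numeric. For many requests, geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which can space out the calls.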

It seems we created the list successfully! Let's see

In [14]:
neighborhoods['Latitude'] = latitude
neighborhoods['Longitude'] = longitude
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,37.8568,-122.221
1,M4A,North York,Victoria Village,43.7327,-79.3112
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6607,-79.3605
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7221,-79.4375
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6597,-79.3903
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6474,-79.5113
99,M4Y,Downtown Toronto,Church and Wellesley,43.6655,-79.3838
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",No Coordinates,No Coordinates
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",-35.0266,138.808


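A quick look at the table above shows that several of the coordinates are wrong: Parkwoods, for example, was placed at (37.8568, -122.221) and the M8Y row at (-35.0266, 138.808), both far from Toronto. The queries were bare neighborhood names with no city attached, so geopy likely matched places with the same name elsewhere. So we will use the csv file given to us. Let's see what is inside the file: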

In [15]:
neigh_csv = pd.read_csv('https://cocl.us/Geospatial_data')
neigh_csv.head(5)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The file contains the Postal Code, Latitude, and Longitude, but the Borough and Neighborhood data are missing. So, let's combine it with our dataframe to create the final dataframe for analysis. For this purpose, we can use the pandas merge function, and the simplest way to do that is to give the key column the same name in both dataframes. So, we first rename 'Postal Code' to 'PostalCode' for consistency:

In [16]:
neigh_csv.columns = ['PostalCode', 'Latitude', 'Longitude']
neigh_csv.head(5)

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We then drop the Latitude and Longitude columns of the neighborhoods dataframe and merge in the new Latitude and Longitude columns. The table below is the final dataset:

In [17]:
neighborhoods.drop(columns=['Latitude', 'Longitude'], inplace = True)

In [18]:
neighborhoods = neighborhoods.merge(neigh_csv, on='PostalCode')
neighborhoods.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
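
One thing to keep in mind: merge performs an inner join by default, so a postal code with no match in the csv would be dropped without any warning. A minimal sketch of how to check for that, using tiny stand-in dataframes (the 'M0X' row and 'Example Borough' are made up):

```python
import pandas as pd

# Tiny stand-ins for the two dataframes; 'M0X' is a made-up code with no coordinates.
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Example Borough']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column
# telling whether each row was found in both frames or only in the left one.
checked = left.merge(right, on='PostalCode', how='left', indicator=True)

# Postal codes that the default inner join would have dropped.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.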


At the end, let's recreate the latitude and longitude lists (as we replaced the data with csv):

In [19]:
latitude = neighborhoods['Latitude']
latitude = list(map(float, latitude))
longitude = neighborhoods['Longitude']
longitude = list(map(float, longitude))

## Neighborhood Clustering

I decided to look at the top 5 venues for every neighborhood in Toronto and then cluster the neighborhoods based on the type and frequency of those venues. I chose three clusters, just for practice. We may choose a different number.

Let's first create a map and see how the coordinates look...they look good!

In [20]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
loc_toronto = geolocator.geocode(address)
lat_toronto = loc_toronto.latitude
long_toronto = loc_toronto.longitude

map_toronto = folium.Map(location=[lat_toronto, long_toronto], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's load my credentials on Foursquare:

In [21]:
CLIENT_ID = '***' # my Foursquare ID - removed for security reasons!
CLIENT_SECRET = '***' # my Foursquare Secret - removed for security reasons!
VERSION = '20200531' # Foursquare API version

Let's create a function that repeats the same process, i.e., getting all the nearby venues, for every neighborhood in Toronto.

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
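
The function above indexes straight into the JSON, so an error response or a venue without categories would raise a KeyError or IndexError partway through the loop. Here is a minimal sketch of a more forgiving parser. The name `extract_venues` is hypothetical, and both payloads are hand-written to mirror only the fields the function above reads; they are not real API output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response.

    Missing keys give an empty list, and a venue without categories gets a
    None category, instead of raising KeyError or IndexError.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    venues = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        category = categories[0].get('name') if categories else None
        venues.append((venue.get('name'), location.get('lat'), location.get('lng'), category))
    return venues

# Hand-written payloads shaped like the fields read by getNearbyVenues.
ok_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorized Venue',   # made-up entry with no category
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}
error_payload = {'meta': {'code': 429}}   # made-up error-style body without venue data

parsed_ok = extract_venues(ok_payload)
parsed_error = extract_venues(error_payload)
```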

Then we get all the venues in Toronto:

In [23]:
LIMIT = 100
radius = 500

toronto_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

Let's see what is in the new dataframe and the size of it:

In [24]:
print(toronto_venues.shape)
toronto_venues.head()

(2129, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


Getting the dataframe ready for analysis by one-hot encoding and some processing:

In [25]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

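# add the neighborhood column as the first column
# (pop succeeds only because the one-hot columns already include a venue category named 'Neighborhood')
toronto_onehot.pop('Neighborhood')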
toronto_onehot.insert(0, 'Neighborhood', toronto_venues['Neighborhood'])

toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's look at the size:

In [26]:
toronto_onehot.shape

(2129, 272)

The next step is to group all the rows by neighborhood. This way, we get the frequency of each venue category by neighborhood. Here it is:

In [27]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').sum().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Alderwood, Long Branch",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Bathurst Manor, Wilson Heights, Downsview North",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Bedford Park, Lawrence Manor East",0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,"Willowdale, Willowdale West",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
92,Woburn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
93,Woodbine Heights,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
94,York Mills West,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
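
To see what the one-hot encoding and the grouping do, here is a minimal sketch on a made-up venue list with two neighborhoods. It also shows the `mean` variant, which gives shares instead of counts; the neighborhood names 'A' and 'B' are placeholders.

```python
import pandas as pd

# Made-up venue list: two neighborhoods, three categories.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Coffee Shop', 'Bank'],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues['Venue Category'])
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

counts = onehot.groupby('Neighborhood').sum()   # venues per category (what the cell above computes)
shares = onehot.groupby('Neighborhood').mean()  # fraction of a neighborhood's venues per category
```

Neighborhood 'A' ends up with a count of 2 for 'Park' and a share of two thirds, since two of its three venues are parks.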


Interestingly, some neighborhoods did not have any venues identified, and rows that share a neighborhood name (such as Don Mills) are grouped into a single row. That is why the number of rows is 96 (and not the total 103 neighborhoods in Toronto). Just like the Lab, we print out each neighborhood with its top 5 most common venues. I know there are some neighborhoods with fewer than 5 venues, so I am curious to see how the following code, which is copied from the Lab, works!

In [28]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(int)
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0             Clothing Store     1
1               Skating Rink     1
2             Breakfast Spot     1
3                     Lounge     1
4  Latin American Restaurant     1


----Alderwood, Long Branch----
          venue  freq
0   Pizza Place     2
1      Pharmacy     1
2          Pool     1
3  Skating Rink     1
4           Pub     1


----Bathurst Manor, Wilson Heights, Downsview North----
                       venue  freq
0                       Bank     2
1                Coffee Shop     2
2  Middle Eastern Restaurant     1
3              Shopping Mall     1
4          Mobile Phone Shop     1


----Bayview Village----
                 venue  freq
0                 Bank     1
1  Japanese Restaurant     1
2                 Café     1
3   Chinese Restaurant     1
4                Motel     0


----Bedford Park, Lawrence Manor East----
                venue  freq
0          Restaurant     2
1      Sandwich Place     2
2  Italian 

Interesting! When a neighborhood has fewer than 5 venue categories, the list is padded to 5 items with categories whose frequency is zero. Which zero-frequency categories are shown depends only on how the sort happens to order ties, so they say nothing about the neighborhood. Anyway! We need to put the results in a pandas dataframe.

In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
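
If the zero-count padding is not wanted, one option is to filter before sorting. This is a sketch of a variant, not the Lab's function; the name `top_venues_nonzero` and the sample counts are made up.

```python
import pandas as pd

def top_venues_nonzero(row, num_top_venues):
    """Return up to `num_top_venues` category names whose count is above zero.

    `row` is a Series of per-category counts (without the 'Neighborhood' entry).
    The name `top_venues_nonzero` is hypothetical, not from the Lab.
    """
    present = row[row > 0]
    # kind='stable' makes the order of tied categories deterministic
    ranked = present.sort_values(ascending=False, kind='stable')
    return ranked.index[:num_top_venues].tolist()

# Made-up counts for a neighborhood with only three venue categories.
example_row = pd.Series({'Bank': 1, 'Café': 1, 'Motel': 0, 'Park': 2, 'Yoga Studio': 0})
example_top = top_venues_nonzero(example_row, 5)
```

The result has three entries instead of five, with 'Park' first. A dataframe built from such lists would need padding (for example with None) to keep a fixed number of columns.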

Now let's create the new dataframe and display the top 5 venues for each neighborhood.

In [73]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no 'st'/'nd'/'rd' suffix left, so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Lounge,Breakfast Spot,Latin American Restaurant,Skating Rink,Clothing Store
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Sandwich Place,Pub,Skating Rink
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Diner,Sushi Restaurant,Middle Eastern Restaurant
3,Bayview Village,Café,Bank,Chinese Restaurant,Japanese Restaurant,Yoga Studio
4,"Bedford Park, Lawrence Manor East",Restaurant,Italian Restaurant,Coffee Shop,Sandwich Place,Comfort Food Restaurant


OK, starting some cool stuff...clustering using k-means! I will use 3 clusters.

In [74]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 3

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood') # the positional axis argument was removed in pandas 2.0

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 2, 0, 0, 0, 0], dtype=int32)
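
Because toronto_grouped holds raw counts, the distances that k-means computes are probably influenced by how many venues a neighborhood has, not only by the mix of categories. One option, which this notebook does not use, is to cluster on shares instead. A minimal sketch with made-up counts:

```python
import pandas as pd

# Made-up counts: 'Busy' and 'Quiet' have the same venue mix at different volumes.
counts = pd.DataFrame(
    {'Coffee Shop': [20, 2, 0], 'Park': [10, 1, 3]},
    index=['Busy', 'Quiet', 'Green'],
)

# Divide each row by its total so that every neighborhood sums to 1.
row_totals = counts.sum(axis=1)
shares = counts.div(row_totals, axis=0)
```

After this step, 'Busy' and 'Quiet' have identical rows, so a clustering on `shares` would treat them as the same kind of neighborhood. Whether that is desirable depends on the question being asked.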

Let's create a new dataframe that includes the clusters as well as the top 5 venues for each neighborhood.

In [75]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = neighborhoods

# merge toronto_grouped with neighborhoods to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0.0,Food & Drink Shop,Park,Yoga Studio,Dim Sum Restaurant,Diner
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Hockey Arena,French Restaurant,Coffee Shop,Portuguese Restaurant,Dog Run
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2.0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Accessories Store,Coffee Shop,Miscellaneous Shop,Furniture / Home Store,Boutique
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2.0,Coffee Shop,Sushi Restaurant,Bank,Bar,Beer Bar


Finally, let's visualize the resulting clusters

After some trials, I noticed two issues: 1. There are some NaN labels (for neighborhoods that had no venues returned), 2. The cluster labels in the toronto_merged dataframe are floats. Both issues break the for loop that creates the map, so we fix them. For the first one, we simply replace the NaN with zero. This should work given that the number of NaNs is small (I think two), but note that it places those neighborhoods in the cluster labeled 0. For the second one, I added the int function inside the for loop, so it reads rainbow[int(cluster)-1].

In [78]:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(0)
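
Filling with zero is quick, but it makes unclustered neighborhoods look like members of the first cluster. Two alternatives are sketched below on made-up labels; neither is what the notebook does.

```python
import pandas as pd

# Made-up labels: neighborhood 'B' had no venues, so its label is missing.
labels = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'],
                       'Cluster Labels': [0.0, float('nan'), 2.0]})

# Option 1: keep only the neighborhoods that were actually clustered.
clustered = labels.dropna(subset=['Cluster Labels']).copy()
clustered['Cluster Labels'] = clustered['Cluster Labels'].astype(int)

# Option 2: keep every row, marking unclustered ones with a sentinel of -1.
flagged = labels.copy()
flagged['Cluster Labels'] = flagged['Cluster Labels'].fillna(-1).astype(int)
```

Both options also turn the labels into integers, so no int() call is needed later. With option 2, the map loop would have to skip or specially color the -1 rows.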

In [82]:
# creating the Toronto coordinates:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
loc_toronto = geolocator.geocode(address)
lat_toronto = loc_toronto.latitude
long_toronto = loc_toronto.longitude

# create map
map_clusters = folium.Map(location=[lat_toronto, long_toronto], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
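
Two small notes on the color scheme above: `ys` is only used for its length (which equals kclusters), and `rainbow[int(cluster)-1]` gives label 0 the last color through negative indexing. A simpler palette can be sketched as follows. The helper name `cluster_palette` is hypothetical, and it uses the standard library's `colorsys` in place of matplotlib's rainbow colormap, so the actual colors differ from the map above.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return `n_clusters` hex colors spaced evenly around the hue wheel.

    `cluster_palette` is a hypothetical helper; it relies on the standard
    library's colorsys module and not on matplotlib.
    """
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.9, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(3)
color_for_label_0 = palette[0]   # index by the label itself, with no "- 1" offset
```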

Cluster 1 (label 0) - this is the cluster that encompasses the most neighborhoods. It mostly includes open-space venues, such as parks, gyms, fields, and playgrounds, along with shops. That is why on the map these neighborhoods are mostly located outside the congested downtown area.

In [83]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,North York,0.0,Food & Drink Shop,Park,Yoga Studio,Dim Sum Restaurant,Diner
1,North York,0.0,Hockey Arena,French Restaurant,Coffee Shop,Portuguese Restaurant,Dog Run
3,North York,0.0,Accessories Store,Coffee Shop,Miscellaneous Shop,Furniture / Home Store,Boutique
5,Etobicoke,0.0,,,,,
6,Scarborough,0.0,Fast Food Restaurant,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner
7,North York,0.0,Gym,Beer Store,Japanese Restaurant,Coffee Shop,Restaurant
8,East York,0.0,Pizza Place,Pet Store,Intersection,Athletics & Sports,Gastropub
10,North York,0.0,Pub,Park,Japanese Restaurant,Metro Station,Dog Run
11,Etobicoke,0.0,Filipino Restaurant,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner
12,Scarborough,0.0,Bar,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner


Cluster 2 (label 1) - Downtown Toronto, where most of the coffee shops and cafés are located.

In [85]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
9,Downtown Toronto,1.0,Clothing Store,Coffee Shop,Middle Eastern Restaurant,Café,Cosmetics Shop
24,Downtown Toronto,1.0,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Japanese Restaurant
30,Downtown Toronto,1.0,Coffee Shop,Café,Restaurant,Thai Restaurant,Deli / Bodega
36,Downtown Toronto,1.0,Coffee Shop,Aquarium,Hotel,Café,Restaurant
42,Downtown Toronto,1.0,Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant
48,Downtown Toronto,1.0,Coffee Shop,Café,Restaurant,Hotel,American Restaurant
92,Downtown Toronto,1.0,Coffee Shop,Café,Seafood Restaurant,Japanese Restaurant,Italian Restaurant
97,Downtown Toronto,1.0,Coffee Shop,Café,Japanese Restaurant,Restaurant,Hotel


Cluster 3 (label 2) - Similar to the downtown cluster, but cafés are more prevalent here than in the downtown cluster. My guess is that, if I had chosen two clusters, these two clusters (2 and 3) could have been combined.

In [86]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Downtown Toronto,2.0,Coffee Shop,Bakery,Pub,Park,Breakfast Spot
4,Downtown Toronto,2.0,Coffee Shop,Sushi Restaurant,Bank,Bar,Beer Bar
15,Downtown Toronto,2.0,Café,Coffee Shop,Cocktail Bar,Restaurant,American Restaurant
20,Downtown Toronto,2.0,Coffee Shop,Cocktail Bar,Cheese Shop,Beer Bar,Bakery
33,North York,2.0,Clothing Store,Coffee Shop,Restaurant,Fast Food Restaurant,Japanese Restaurant
37,West Toronto,2.0,Bar,Asian Restaurant,Café,Coffee Shop,Vegetarian / Vegan Restaurant
41,East Toronto,2.0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store
54,East Toronto,2.0,Café,Coffee Shop,Bakery,Gastropub,American Restaurant
79,Central Toronto,2.0,Dessert Shop,Sandwich Place,Pizza Place,Café,Sushi Restaurant
80,Downtown Toronto,2.0,Café,Bakery,Bar,Italian Restaurant,Japanese Restaurant
