# Segmenting and Clustering neighborhoods in Toronto, CA
This notebook is created as the evaluation exercise for Week 3 of the Applied Data Science Capstone in Coursera.  
This course is part of the IBM Data Science Professional Certificate - https://www.coursera.org/professional-certificates/ibm-data-science?skipBrowseRedirect=true

## Part 1 - Getting borough and neighborhood information for Toronto from Wikipedia
In this section, we will use the Beautiful Soup package to scrape the Postal Code, Borough and Neighborhood from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and create a pandas dataframe that we can analyze further. 

In [1]:
# Let's import the packages we'll be using in this part
import bs4
import numpy as np
import pandas as pd

from requests import get

#### Lets start by grabbing the url and feeding it to Beatiful Soup. Then we will print the prettified version to look where the data we want is stored.

In [2]:
response = get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
webpage  = bs4.BeautifulSoup(response.text, 'html.parser')
print(webpage.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"Xo02agpAIHwAA6rXcZcAAABE","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":949497198,"wgRevisionId":949497198,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toronto","Ontario

#### According to what we've seen above, the data we want is inside a table tagged "table". Each information triplet is then stored inside a ''tr'' tag within this table, with each element within a "td" tag and the headers within a "th" tag. So let's grab them...

In [3]:
headers = [i.get_text().replace('\n', '') for i in webpage.table.find_all('th')]
values  = [i.get_text().replace('\n', '') for i in webpage.table.find_all('td')]

#### Now we can use those elements to build our pd.DataFrame and populate it with information from the values list. Let's just go through the list 3 values at a time, and if the 'Borough' field is set to 'Not assigned', just ignore that values triplet. 

In [4]:
toronto_df = pd.DataFrame(columns = headers)
for i in range(0,len(values),3):
    pcode, boro, neighb  = values[i:i+3]    
    if boro == 'Not assigned':
        continue
    else:
        toronto_df = toronto_df.append({headers[0]: pcode.strip(),
                                        headers[1]: boro.strip(),
                                        headers[2]: neighb.replace(' /', ',') if neighb != '' else boro.strip()},
                                       ignore_index = True)
toronto_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood


#### Now we show the final shape of the data frame to answer question 1 from the assignment.

In [6]:
toronto_df.shape

(103, 3)

## Part 2 - Gather latitude and longitude information for all available postal codes
This is done using the geocoder package and providing the postal code of each location of interest.  
There is a limit of 2500 calls/day that can be made using geocoder, but with 103 postal codes it should be sufficient.

In [7]:
# Let's import the packages we'll be using in this part
import folium

#### Unfortunately, Google is now requiring an API-key to use geocoder.google. In that case, we will use the .csv file provided.

In [8]:
locations = pd.read_csv('Geospatial_Coordinates.csv')

# Now we loop through our data frame combining the proper coordinates with each postal code
lat = []
lon = []
for pcode in toronto_df.iloc[:,0]:
    vi  = locations['Postal Code'] == pcode
    lat.append(locations.loc[vi,'Latitude'].values)
    lon.append(locations.loc[vi,'Longitude'].values)
    
toronto_df['Latitude']  = np.array(lat)
toronto_df['Longitude'] = np.array(lon)
toronto_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


#### Now let's create a folium map to showcase these points in a map of Toronto

In [9]:
toronto_map = folium.Map(location = [toronto_df.Latitude.mean(), toronto_df.Longitude.mean()], zoom_start=10)

# add markers to map
for lat, lon, boro, neighb in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighb, boro)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker([lat, lon], radius = 5, popup = label, color = 'blue', fill = True,
                        fill_color = 'lightblue', fill_opacity = 0.7, parse_html = False).add_to(toronto_map)  
    
toronto_map

## Part 3 - Neighborhood segmentation across Toronto using Foursquare data

First, let's see which boroughs are there in Toronto and how many neighborhoods are present in each one

In [95]:
# Libraries to be used in this section
from sklearn.cluster import KMeans
from pandas.io.json import json_normalize 

In [14]:
boroughs = toronto_df.groupby('Borough').count()
boroughs

Unnamed: 0_level_0,Postal code,Neighborhood,Latitude,Longitude
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Central Toronto,9,9,9,9
Downtown Toronto,19,19,19,19
East Toronto,5,5,5,5
East York,5,5,5,5
Etobicoke,12,12,12,12
Mississauga,1,1,1,1
North York,24,24,24,24
Scarborough,17,17,17,17
West Toronto,6,6,6,6
York,5,5,5,5


#### Looks like we have a total of 39 different postal codes in boroughs with "Toronto" in their names. This is a good sample size for us to segment our data. So let's start by extracting only the data associated with neighborhoods that contain the word "Toronto". We will plot it into a map with different colors for each borough.

In [25]:
toronto_borough = toronto_df[toronto_df['Borough'].str.contains('Toronto')]
borough_map     = folium.Map(location = [toronto_borough.Latitude.mean(), 
                                         toronto_borough.Longitude.mean()], zoom_start=10)

# add markers to map
for lat, lon, boro, neighb in zip(toronto_borough['Latitude'], toronto_borough['Longitude'], toronto_borough['Borough'], toronto_borough['Neighborhood']):
    label = '{}, {}'.format(neighb, boro)
    label = folium.Popup(label, parse_html = True)
    if 'Downtown' in boro:
        color = 'blue'
    elif 'East' in boro:
        color = 'red'
    elif 'West' in boro:
        color = 'green'
    elif 'Central' in boro:
        color = 'black'
    else:
         raise ValueError('Something went wrong during the color definition...')   
    
    folium.CircleMarker([lat, lon], radius = 5, popup = label, color = color, fill = True,
                        fill_color = 'lightblue', fill_opacity = 0.7, parse_html = False).add_to(borough_map)  

borough_map

#### Okay, now we now how each borough is distributed. The question is, does this borough separation also persists if we try to segment the different neighborhoods based on the type of places we can find within each one?
#### First, let's define our Foursquare credentials (sensitive fields ommitted for protection).

In [37]:
foursqr_api = 'https://api.foursquare.com/v2/venues/explore'
params      = {'client_id'    : 'My_id',
               'client_secret': 'My_secret',
               'v'            : '20180605',
               'radius'       : '1000',
               'll'           : '',
               'query'        : '',
               'limit'        : '30'}

#### Now we will loop through all postal code/neighborhoods in our toronto_borough dataframe and grab up to 30 locations within a 1000 meters radius from the postal code central coordinate. This data will be saved directly into a new pandas dataframe containing the venue's name, latitude, longitude, category, unique id and the associated pcode and neighborhood laitutde and longitude.

In [82]:
headers        = ['Postal code', 'Latitude', 'Longitude', 'VenueName', 'VenueCategory', 'VenueLatitude', 'VenueLongitude', 'VenueId']
toronto_venues = pd.DataFrame(columns = headers)
for pcode, lat, lon in zip(toronto_borough['Postal code'], toronto_borough['Latitude'], toronto_borough['Longitude']):
    params['ll'] = str(lat) + ',' + str(lon)
    results      = get(url = foursqr_api, params = params).json()
    
    for v in results['response']['groups'][0]['items']:
        toronto_venues = toronto_venues.append({headers[0]: pcode,
                                                headers[1]: lat,
                                                headers[2]: lon,
                                                headers[3]: v['venue']['name'],
                                                headers[4]: v['venue']['categories'][0]['name'], 
                                                headers[5]: v['venue']['location']['lat'],
                                                headers[6]: v['venue']['location']['lng'],
                                                headers[7]: v['venue']['id']}, ignore_index = True)

toronto_venues.head()

Unnamed: 0,Postal code,Latitude,Longitude,VenueName,VenueCategory,VenueLatitude,VenueLongitude,VenueId
0,M5A,43.65426,-79.360636,Roselle Desserts,Bakery,43.653447,-79.362017,54ea41ad498e9a11e9e13308
1,M5A,43.65426,-79.360636,Tandem Coffee,Coffee Shop,43.653559,-79.361809,53b8466a498e83df908c3f21
2,M5A,43.65426,-79.360636,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008,574c229e498ebb5c6b257902
3,M5A,43.65426,-79.360636,Impact Kitchen,Restaurant,43.656369,-79.35698,5612b1cc498e3dd742af0dc8
4,M5A,43.65426,-79.360636,The Distillery Historic District,Historic Site,43.650244,-79.359323,4ad4c05ef964a520bff620e3


#### Okay, we have our venues dataframe. Let's check its size and how many values we got for each postal code.

In [85]:
print(toronto_venues.groupby('Postal code')['VenueName'].count())
print(toronto_venues.shape)

Postal code
M4E    30
M4K    30
M4L    30
M4M    30
M4N     9
M4P    30
M4R    30
M4S    30
M4T    30
M4V    30
M4W    21
M4X    30
M4Y    30
M5A    30
M5B    30
M5C    30
M5E    30
M5G    30
M5H    30
M5J    30
M5K    30
M5L    30
M5N    23
M5P    30
M5R    30
M5S    30
M5T    30
M5V    15
M5W    30
M5X    30
M6G    30
M6H    30
M6J    30
M6K    30
M6P    30
M6R    30
M6S    30
M7A    30
M7Y    30
Name: VenueName, dtype: int64
(1118, 8)


#### It seems that, overall, we got a decent amount of venues at each location. We did get a single postal code with only 9 venues, which could be indicative of a quiet neighborhood.
#### Let's see how many unique categories we found as well, since they are going to form the basis of our K-means analysis.

In [84]:
print('We found {} unique categories within those neighborhoods.'.format(len(toronto_venues.VenueCategory.unique())))

We found 197 unique categories within those neighborhoods.


#### So we have 197 unique categories within 1118 venues. Let's create a one hot encoding of these 197 categories and define a new dataframe with the respective postal code of each row.

In [93]:
venues_onehot                = pd.get_dummies(toronto_venues.VenueCategory)
venues_onehot['Postal code'] = toronto_venues['Postal code']

# And put the "Postal code" as the first column
col_order     = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])
venues_onehot = venues_onehot[col_order]

venues_onehot.head()

Unnamed: 0,Postal code,Airport,Airport Lounge,American Restaurant,Amphitheater,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Track,Trail,Train Station,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Now we obtained the average score across each category for each postal code. This gives us back our 39 original postal codes, now with information about the frequency of each venue category in each postal code.

In [94]:
venues_aggregate = venues_onehot.groupby('Postal code').mean().reset_index()
venues_aggregate.shape

(39, 198)

#### Now we run our K-Means clustering algorithm. Remember that our goal is to see whether or not we can get the same 4 original boroughs back after segmenting these neighborhoods/postal codes based on the available venues in each location. For that reason, we define the number of clusters to be 4.
#### After running K-Means, we put the estimated labels back into our original toronto_borough dataframe and see which unique labels fall within each borough.

In [101]:
kmeans = KMeans(n_clusters = 4, random_state = 13).fit(venues_aggregate.iloc[:,1:])

toronto_borough['KMeans_labels'] = kmeans.labels_
toronto_borough.groupby('Borough')['KMeans_labels'].unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Borough
Central Toronto        [1, 0, 2]
Downtown Toronto       [0, 2, 1]
East Toronto        [3, 0, 1, 2]
West Toronto           [2, 0, 1]
Name: KMeans_labels, dtype: object

#### As we can see, segmenting our neighborhoods did not produce consistent labels within each borough. Let's put this into a map to give us a visual perspective and see if any spatial patterns are apparent.

In [110]:
map_clusters = folium.Map(location = [toronto_borough.Latitude.mean(), 
                                      toronto_borough.Longitude.mean()], zoom_start = 12)

# add markers to the map
markers_colors = []
for lat, lon, poi, boro, cluster in zip(toronto_borough.Latitude, toronto_borough.Longitude, toronto_borough['Neighborhood'], toronto_borough.Borough, toronto_borough.KMeans_labels):
    label = folium.Popup(str(poi) + ' / Cluster ' + str(cluster), parse_html = True)
    
    # Marker color will be based on the cluster label
    if cluster == 0:
        color = 'red'
    elif cluster == 1:
        color = 'green'
    elif cluster == 2:
        color = 'blue'
    elif cluster == 3:
        color = 'yellow'
    else:
        raise ValueError('Number of clusters exceeded. Check your KMeans n_clusters arguments.')            
    
    # Marker shape will be based on the Borough
    if 'Downtown' in boro:
        sides = 3
    elif 'East' in boro:
        sides = 4
    elif 'West' in boro:
        sides = 6
    elif 'Central' in boro:
        sides = 30
    else:
         raise ValueError('Something went wrong during the color definition...') 
            
    folium.RegularPolygonMarker(location = [lat, lon], popup = label, radius = 10, number_of_sides = sides, 
                        fill_color = color, fill_opacity = 0.7).add_to(map_clusters)
       
map_clusters

#### From the above image, we can see that there is no apparent spatial pattern. Only a single neighborhood was labelled as Cluster 3 (yellow), and it's located in East Toronto (diamond shaped). This was also the only borough to have all cluster labels found in it. Downtown Toronto (triangles) was mostly composed of Cluster labels 0 (red) and 2 (blue), whereas Central Toronto (circles) was dominated by Cluster labels 1 (green) and 2 (blue). Finally, West Toronto (hexagons) was dominated by Cluster label 2 (blue).
#### Finally, let's see what are the dominant categories for each cluster and save it into a new data frame. We'll extract the top 10.

In [161]:
venues_aggregate['Clusters'] = kmeans.labels_

clusters_dominant = pd.DataFrame(columns = [c for c in np.sort(venues_aggregate['Clusters'].unique())], index = np.arange(1,11))
for c in np.sort(venues_aggregate['Clusters'].unique()):
    clust = pd.Series(index = venues_aggregate.columns[1:-1],
                      data  = np.squeeze(venues_aggregate.loc[venues_aggregate.Clusters == c, venues_aggregate.columns[1:]].groupby('Clusters').mean().values.T))
    
    clusters_dominant.loc[:,c] = clust.sort_values(ascending = False)[:10].index.values
    
clusters_dominant

Unnamed: 0,0,1,2,3
1,Coffee Shop,Café,Italian Restaurant,Bus Line
2,Park,Coffee Shop,Café,Park
3,Café,Restaurant,Coffee Shop,Gym / Fitness Center
4,Japanese Restaurant,Gastropub,Sushi Restaurant,Trail
5,Breakfast Spot,American Restaurant,Bakery,Café
6,Gastropub,Hotel,Bar,College Quad
7,Dance Studio,Japanese Restaurant,Park,College Gym
8,Ice Cream Shop,Bakery,Pizza Place,Coffee Shop
9,Pub,Cocktail Bar,Restaurant,Bookstore
10,Bakery,Beer Bar,Brewery,Dog Run


#### It becomes quite clear why Cluster 3 was so unique within our distribution. It is the only cluster for which the dominant venues have nothing to do with food, being dominated by places of activities (park, gym and trail). Cluster 0 is predominantly made up of food venues, although a large frequency of parks (second position) and dance studios distinguishes it from the remaining 1 and 2 Clusters. Cluster 1 has a high frequency of hotels (sixth position) and bars, whereas Cluster 2 is mostly restaurants. A deeper analysis could probably be achieved if we aggregated all similar venues into major categories, e.g. Restaurants and Bars. This would reduce the number of unique venues and allow the segmentation algorithm to focus on major differences between neighborhoods.