# Segmenting and Clustering Neighborhoods in Toronto

## Introduction
As part of the final assignment for the IBM Data Science Certification, we are going to explore neighborhoods in Toronto, use the FourSquare API to get the venues in each neighborhood (restaurants, bars, sports venues, etc.), and then cluster those neighborhoods using a summary of their venues as the features for our algorithm.

## 1. Web scraping and creating the Toronto postcodes dataframe
As a first step, we are going to extract the list of Toronto postcodes, boroughs and neighborhoods from the HTML table in a Wikipedia article: https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969



In [1]:
#Imports that we need
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import requests
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [2]:
#Download the html from the URL and convert into a BeautifulSoup object
url='https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'
html_data  = requests.get(url).text 
soup_object = BeautifulSoup(html_data,"html5lib")  # create a soup object using the variable 'html_data'

In [3]:
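#Extract the tables/table
wiki_tables = soup_object.find_all('table')

#Use pandas to transform the table into a dataframe
#(the HTML is wrapped in StringIO because recent pandas versions deprecate passing a literal HTML string)
from io import StringIO
wiki_df = pd.read_html(StringIO(str(wiki_tables[0])), flavor='bs4')[0]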

In [4]:
wiki_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
#Clean the dataframe: drop the postal codes that are not assigned
wiki_clean_df = wiki_df[wiki_df['Neighbourhood']!='Not assigned'].reset_index(drop=True)
print(wiki_clean_df.shape)
wiki_clean_df.head()

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


**Note: Adjacent neighborhoods that share a postal code have been joined and are treated as a single neighborhood. The Wikipedia table used as the data source already contains this join.**
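
If the source table had listed one neighbourhood per row instead, the same join could be reproduced with a group-by. The following is only a minimal sketch on made-up rows (the column names match the Wikipedia table, the split rows are an assumption for illustration):

```python
import pandas as pd

# Toy rows (made up for illustration): two neighbourhoods share postal code M5A.
raw = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One row per postal code; neighbourhood names are joined with ", ".
merged = (
    raw.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(merged)
```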

## 2. Add geolocation data (latitude and longitude) to dataframe

As we will need to use the FourSquare API, we need to add the geographical coordinates of each postal code to our dataframe.
One option is to use the Geocoder Python package, which returns the latitude and longitude for each postal code.

However, this package has proven very unreliable.
Hence, it is impossible to get the coordinates for all the target postal codes within an acceptable amount of time.

In [6]:
#!pip install geocoder
#import geocoder # import geocoder
# initialize your variable to None
#lat_lng_coords = None
#postal_code = 'M3A' 

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
#latitude
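
The commented-out loop above retries until a result arrives, with no upper bound. A bounded variant could look like the sketch below. The `lookup` callable is hypothetical: it stands in for any geocoder call that returns `[lat, lng]` or `None`, and the flaky stand-in exists only to make the example runnable offline.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call `lookup(query)` until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning [lat, lng] or None.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait before the next attempt
    return None  # give up instead of looping forever

# Stand-in geocoder for demonstration: fails twice, then succeeds.
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return [43.753259, -79.329656] if calls["n"] >= 3 else None

coords = geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario", delay_seconds=0)
print(coords)
```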

So, instead of the Geocoder package, we are going to download the coordinates of each postal code directly from http://cocl.us/Geospatial_data
This link points to a CSV file, which we will load as a dataframe.

In [7]:
#Use the pandas option to read a csv from URL
url_csv = 'https://cocl.us/Geospatial_data'
postcode_csv_df = pd.read_csv(url_csv)
postcode_csv_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Once we have loaded that new dataframe, we need to join it with the one that we obtained from Wikipedia and create our final dataframe containing:
* postal codes
* boroughs
* neighborhoods
* latitudes
* longitudes

In [8]:
#Join our 2 dataframes to get the final one that we will use in next steps
toronto_df = wiki_clean_df
toronto_df = toronto_df.join(postcode_csv_df.set_index('Postal Code'),on='Postal Code')
toronto_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
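
A join like this can silently leave rows without coordinates when a postal code has no match. The sketch below shows one possible sanity check on toy data. It uses `merge` with `validate=` rather than the notebook's `join`, and the unmatched code `M0Z` is invented for the demonstration.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (coordinates copied from the table above).
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0Z"],  # "M0Z" is a made-up code with no match
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Nowhere"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate= raises an error if either side has duplicate postal codes.
joined = left.merge(coords, on="Postal Code", how="left", validate="one_to_one")

# Rows whose postal code had no coordinates end up with NaN.
missing = joined.loc[joined["Latitude"].isna(), "Postal Code"].tolist()
print(missing)
```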


## 3. Show our aggregated neighborhoods on the Toronto map

We are going to use the Folium package for Python to show the center of each neighborhood (aggregated by postal code) on the map of Toronto.

In [9]:
#Install and import folium and nominatim
!pip install folium==0.5.0
!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import folium # plotting library



In [10]:
#Use Nominatim to get the coordinates of the center of Toronto
#We will need it to fix the center of the Folium Map
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [11]:
#Use Folium to plot the map of Toronto and the neighborhoods
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## 4. Invoke FourSquare API in order to get the venues for each neighborhood

In the next step, we are going to make some calls to the FourSquare API in order to get the venues within 500 meters of the center of each neighborhood.
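
To make the 500-meter radius concrete, the distance between a neighborhood center and a venue can be estimated with the haversine formula. This is a rough sketch, not something the notebook itself computes: the Earth radius is an approximation and the second point is made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Centre of Parkwoods (from the dataframe above) and a made-up point
# 0.003 degrees of latitude further north.
center = (43.753259, -79.329656)
nearby = (43.756259, -79.329656)

distance = haversine_m(*center, *nearby)
within_radius = distance <= 500
print(round(distance), within_radius)
```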


In [12]:
# The code was removed by Watson Studio for sharing.

We will loop through all the neighborhoods, send a request to FourSquare for each one, and collect all the returned venues in a single dataframe.
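
As an illustration of what each request looks like, the URL can also be assembled from a parameter dictionary so that values are escaped automatically. The credentials and settings were removed from the notebook, so `CLIENT_ID`, `CLIENT_SECRET`, `VERSION` and `LIMIT` below are placeholder assumptions, and `build_explore_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Placeholder values: the real credentials and settings are not in the notebook.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20180605"
LIMIT = 100

def build_explore_url(lat, lng, radius=500):
    """Build the explore URL; urlencode escapes the parameter values."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": LIMIT,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = build_explore_url(43.753259, -79.329656)
print(url)
```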


In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
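
The function above indexes straight into `groups[0]` and `categories[0]`, which raises an error if either list is empty. A more defensive extraction step might look like the following sketch. `extract_venues` is a hypothetical helper, and the sample payload is hand-written to mirror the keys used above, not real API output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Missing groups yield an empty list and missing categories yield None,
    instead of raising IndexError or KeyError.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows

# Hand-written sample payload (not real API output).
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Park",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorised Spot",
               "location": {"lat": 43.76, "lng": -79.32},
               "categories": []}},
]}]}}

rows = extract_venues(sample)
print(rows)
```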

Once our function "getNearbyVenues" has been defined, we are going to call it for our Toronto neighborhoods.

In [None]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighbourhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [None]:
#Check head and size
print(toronto_venues.shape)
toronto_venues.head()

Once we have the list of venues for our Toronto neighbourhoods and the category of each venue, we want to analyze each neighborhood taking those venue categories into account.
Our hypothesis is that the frequency of the different venue categories in each neighborhood will help us to characterize it and to group similar neighborhoods of Toronto into clusters.

So, as a first step, we are going to use one-hot encoding to create a new dataframe with one row per venue (1 in the column that represents the venue's category and 0 in all the other category columns).
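
A toy version of this step, on made-up venues, is sketched here. It also shows why a column named `Neighborhood` may need to be dropped: if a venue's category is literally "Neighborhood", its indicator column would clash with the neighborhood label column.

```python
import pandas as pd

# Toy venue list (made up). Note the venue whose category is "Neighborhood".
venues = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village"],
    "Venue Category": ["Park", "Neighborhood", "Coffee Shop"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# The category named "Neighborhood" would clash with the label column,
# so it is dropped (if present) before the label column is inserted.
onehot = onehot.drop(columns=["Neighborhood"], errors="ignore")
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```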

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

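#Remove the one hot encoded column for the venue category named 'Neighborhood', if present,
#because it would clash with the neighborhood label column added below
toronto_onehot.drop(columns=['Neighborhood'], errors='ignore', inplace=True)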

#Add actual neighborhood column back to dataframe
toronto_onehot.insert(0, 'Neighborhood', toronto_venues['Neighborhood'], True)

print(toronto_onehot.shape)
toronto_onehot.head()

Once we have created our one-hot encoded dataframe with all the venues, we are going to group it by neighborhood and take the mean of each column, which gives the frequency of each venue category in each neighborhood.
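
The reason the mean works as a frequency is that each column only holds 0s and 1s, so its average is the share of venues in that category. A small sketch on made-up data:

```python
import pandas as pd

# Toy one-hot rows (made up): one row per venue.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Coffee Shop": [1, 1, 1, 0, 0],
    "Park": [0, 0, 0, 1, 1],
})

# The mean of 0/1 indicators is the share of venues in each category.
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```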

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

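**Note: There are only 95 neighborhoods in this dataframe, versus the total of 103 postal codes that we have for Toronto. The 8 missing rows are either neighborhoods for which no venue was returned or neighborhood names shared by several postal codes (for example, Don Mills appears twice in the log above), which the group-by collapses into a single row.
Neighborhoods without venues will need to be removed in the next steps.**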
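
A row count lower than the number of postal codes can have two causes: neighborhoods for which no venue was returned, and names shared by several postal codes (the progress log above prints Don Mills twice), which the group-by collapses into one row. The sketch below separates the two cases on made-up names; "Quiet Corner" is invented for the example.

```python
import pandas as pd

# Toy data (made up): "Don Mills" appears under two postal codes, and
# "Quiet Corner" has no venues at all.
all_names = pd.Series(["Parkwoods", "Don Mills", "Don Mills", "Quiet Corner"])
names_with_venues = pd.Series(["Parkwoods", "Don Mills"])

# Names that never show up in the venue data.
no_venues = sorted(set(all_names) - set(names_with_venues))

# Names that occur more than once in the full list.
repeated = sorted(all_names[all_names.duplicated()].unique())

print(no_venues, repeated)
```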

We are going to show the 10 most common venue types for each neighborhood in Toronto.

In [None]:
#First, we create a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
#Now use the function to create a dataframe with the 10 top venue categories for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_venues_sorted = pd.DataFrame(columns=columns)
toronto_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    toronto_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

toronto_venues_sorted
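
The column labels above rely on a three-element suffix list plus an exception for everything else, which is fine for 10 columns. If more columns were ever needed, a small helper could handle cases such as 11th or 21st. `ordinal` is a hypothetical name, and this is just one possible way to write it.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```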

## 5. Cluster Toronto Neighborhoods

Finally, we want to group the Toronto neighborhoods into clusters according to the frequency of the venue categories in each one of them.
That way, neighborhoods with similar types of venues will be included in the same cluster.

As a first step, we are going to run the k-means clustering algorithm to divide the neighborhoods into 6 different clusters.

In [None]:
#Set the number of clusters
k_clusters = 6

#Remove the Neighborhood column before using the dataframe for fitting the algorithm
#(the positional axis argument, drop('Neighborhood', 1), is not accepted by pandas 2.x)
toronto_grouped_cluster = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(toronto_grouped_cluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_
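
To give an intuition for what the fit does, here is a bare-bones sketch of the k-means iterations in NumPy on made-up frequency rows. It is not scikit-learn's implementation (which, among other things, uses k-means++ initialisation by default); `kmeans_sketch` is a hypothetical name used only for this illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Plain Lloyd's iterations; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy frequency rows (made up): two cafe-heavy and two park-heavy neighbourhoods.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```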

In [None]:
#Add now the labels into venues sorted dataframe
# add clustering labels
toronto_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [None]:
toronto_merged = toronto_df

#Merge toronto_venues_sorted with toronto_df to add latitude/longitude for each neighborhood in there
toronto_merged = toronto_merged.join(toronto_venues_sorted.set_index('Neighborhood'),how='inner', on='Neighbourhood')
toronto_merged.shape

Once we have our neighborhoods classified into 6 different clusters, we are going to show them on the map with a different colour for each cluster label.
This visual information will help us to better understand how our k-means algorithm has grouped the neighborhoods.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k_clusters)
ys = [i + x + (i*x)**2 for i in range(k_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
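
The colour scheme only needs one distinct hex colour per cluster label. As an alternative sketch that avoids the intermediate `x` and `ys` arrays, the standard library's `colorsys` module can generate evenly spaced hues. `cluster_palette` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters hex colours evenly spaced around the hue wheel."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(6)
# Labels from k-means run from 0 to k-1, so they can index the palette directly.
color_for_cluster_0 = palette[0]
print(palette)
```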

In [None]:
#Need to set the labels as integers
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k_clusters)
ys = [i + x + (i*x)**2 for i in range(k_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters