<h1>Clustering Neighborhoods in Toronto</h1>

Here we will explore, segment, and cluster the neighborhoods in the city of Toronto.

The Toronto neighborhood data we need is available on a Wikipedia page. We will scrape that page, clean and wrangle the data, and load it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we will explore and cluster the neighborhoods in the city of Toronto.

## Table of Contents

1. <a href="#part1">Scrape the Wikipedia web page to build a dataframe</a>
2. <a href="#part2">Get latitude & longitude details of the neighbourhoods and add to the dataframe</a>  
3. <a href="#part3">Explore and cluster the neighborhoods in Toronto</a>  

### 1. Scrape the Wikipedia web page to build a dataframe

_Import beautifulsoup and other required libraries to scrape the web page and load the data into a dataframe_

In [34]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

_Scrape the webpage and create a Soup object_

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(source, 'lxml')

_Create a data frame with the webpage data_

In [3]:
datatable = soup.find('div', class_='mw-parser-output').table
acttable = datatable.find_all('td')

# Collect one record per assigned postal code, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0).
rows = []
for td in acttable:
    boroughNeigh = td.p.span.text
    if boroughNeigh != "Not assigned":
        pstcd = td.p.b.text
        rows.append({'PostalCode': pstcd, 'Borough': boroughNeigh})

column_names = ['PostalCode', 'Borough']
df = pd.DataFrame(rows, columns=column_names)
df.head(10)

Unnamed: 0,PostalCode,Borough
0,M3A,North York(Parkwoods)
1,M4A,North York(Victoria Village)
2,M5A,Downtown Toronto(Regent Park / Harbourfront)
3,M6A,North York(Lawrence Manor / Lawrence Heights)
4,M7A,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke(Islington Avenue)
6,M1B,Scarborough(Malvern / Rouge)
7,M3B,North York(Don Mills)North
8,M4B,East York(Parkview Hill / Woodbine Gardens)
9,M5B,"Downtown Toronto(Garden District, Ryerson)"


_Wrangle the dataframe to correct format for analysis_

In [4]:
# The neighbourhood Queen's Park / Ontario Provincial Government (postal code M7A) has no borough associated with it, so this row is dropped.

df = df[df.PostalCode != 'M7A'].reset_index(drop=True)

# Split the Borough column into Borough and Neighborhood
df["Neighborhood"] = df["Borough"].str.split(pat='(', n=-1, expand=True)[1]
df["Neighborhood"] = df["Neighborhood"].str.split(pat=')', n=-1, expand=True)[0]
df["Borough"] = df["Borough"].str.split(pat='(', n=-1, expand=True)[0]

# If there are multiple neighborhoods, replace the separator ' /' with ','
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', regex=False)
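
The splitting rule used in this cell can also be written as a small standalone function, which makes it easy to check against individual cell values. The sketch below is only an illustration: `split_borough_cell` is a hypothetical helper that is not part of the notebook, and it assumes every scraped value looks like `Borough(Neighborhood / Neighborhood)`.

```python
import re

# Hypothetical helper (not part of the notebook): split "Borough(Neighborhood / ...)" text.
_CELL_PATTERN = re.compile(r"^(?P<borough>[^(]*)\((?P<neighborhoods>[^)]*)\)")


def split_borough_cell(cell):
    """Return (borough, neighborhoods); neighborhoods is None when no parentheses are present."""
    match = _CELL_PATTERN.match(cell)
    if match is None:
        return cell.strip(), None
    borough = match.group("borough").strip()
    # Mirror the notebook's separator change: ' /' becomes ','
    neighborhoods = match.group("neighborhoods").replace(" /", ",")
    return borough, neighborhoods


print(split_borough_cell("Downtown Toronto(Regent Park / Harbourfront)"))
# ('Downtown Toronto', 'Regent Park, Harbourfront')
```

A value without parentheses, such as the M7A entry, comes back with `None` for the neighborhoods, so such rows can be spotted instead of being removed by postal code.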

_Grouping the data by Borough to check the different Boroughs and borough counts_

In [5]:
df['Borough'].value_counts()

North York                                                      24
Downtown Toronto                                                17
Scarborough                                                     17
Etobicoke                                                       11
Central Toronto                                                  9
West Toronto                                                     6
York                                                             5
East Toronto                                                     4
East York                                                        4
EtobicokeNorthwest                                               1
East TorontoBusiness reply mail Processing Centre969 Eastern     1
MississaugaCanada Post Gateway Processing Centre                 1
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
East YorkEast Toronto                                            1
Name: Borough, dtype: int64

_Some borough values occur only once because extra text is attached to the actual borough name. We correct these, for example "East YorkEast Toronto" to "East York" and "Downtown TorontoStn A PO Boxes25 The Esplanade" to "Downtown Toronto", so that each record is counted under its actual borough._

In [6]:
# assign with a single .loc call to avoid chained assignment
df.loc[34, 'Borough'] = 'East York'
df.loc[91, 'Borough'] = 'Downtown Toronto'
df.loc[75, 'Borough'] = 'Mississauga'
df.loc[99, 'Borough'] = 'East Toronto'
df.loc[93, 'Borough'] = 'Etobicoke'
df['Borough'].value_counts()

North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Name: Borough, dtype: int64
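
The corrections above rely on hard-coded row positions, which may shift if the Wikipedia page changes. As a hedged alternative, the sketch below maps a noisy value to the longest known borough name it starts with. `normalize_borough` and `KNOWN_BOROUGHS` are hypothetical names introduced for illustration, and the list of boroughs is an assumption taken from the counts shown above.

```python
# Hypothetical helper: map a noisy borough string to the longest known borough it starts with.
KNOWN_BOROUGHS = [
    "North York", "Downtown Toronto", "Scarborough", "Etobicoke", "Central Toronto",
    "West Toronto", "York", "East Toronto", "East York", "Mississauga",
]


def normalize_borough(raw, known=KNOWN_BOROUGHS):
    matches = [name for name in known if raw.startswith(name)]
    # Prefer the longest prefix; leave the value untouched if nothing matches.
    return max(matches, key=len) if matches else raw


print(normalize_borough("East YorkEast Toronto"))  # East York
print(normalize_borough("EtobicokeNorthwest"))     # Etobicoke
```

With pandas this could be applied as `df['Borough'].map(normalize_borough)`, which avoids referring to index values at all.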

_The formatted dataframe is displayed_

In [7]:
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M9A,Etobicoke,Islington Avenue
5,M1B,Scarborough,"Malvern, Rouge"
6,M3B,North York,Don Mills
7,M4B,East York,"Parkview Hill, Woodbine Gardens"
8,M5B,Downtown Toronto,"Garden District, Ryerson"
9,M6B,North York,Glencairn


_Shape of the dataframe_

In [8]:
df.shape

(102, 3)

### 2. Get latitude & longitude details of the neighbourhoods and add to the dataframe

_Get latitude & longitude details of the neighbourhoods using Nominatim._  
_Nominatim allows at most 1 request per second, so a time delay is added between requests._  
_The latitude & longitude values are collected in lists and then added to the dataframe._

In [9]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import time

torontodata = df.copy() # work on a copy of df so the original dataframe stays unchanged

latList = []
lngList = []

geolocator = Nominatim(user_agent="toronto_explorer") # create the geocoder once and reuse it

for ind in torontodata.index:
    neighborhood = torontodata['Neighborhood'][ind]
    borough = torontodata['Borough'][ind]
    location = geolocator.geocode(neighborhood + ', ' + borough)
    time.sleep(1)
    if location is None:
        # fall back to each individual neighborhood, then to the borough alone
        fallbacks = [n + ', ' + borough for n in neighborhood.split(', ')] + [borough]
        for tempneighbor in fallbacks:
            location = geolocator.geocode(tempneighbor)
            time.sleep(1)
            if location is not None:
                break
    # record NaN when nothing could be geocoded instead of retrying forever
    latList.append(location.latitude if location is not None else float('nan'))
    lngList.append(location.longitude if location is not None else float('nan'))
    
torontodata['Latitude'] = latList 
torontodata['Longitude'] = lngList

torontodata.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654174,-79.380812
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.722079,-79.437507
4,M9A,Etobicoke,Islington Avenue,43.622575,-79.514215
5,M1B,Scarborough,"Malvern, Rouge",43.809196,-79.221701
6,M3B,North York,Don Mills,43.775347,-79.345944
7,M4B,East York,"Parkview Hill, Woodbine Gardens",43.712078,-79.302567
8,M5B,Downtown Toronto,"Garden District, Ryerson",43.653552,-79.379373
9,M6B,North York,Glencairn,43.708712,-79.440685
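
The fallback order used while geocoding (full neighborhood string, then each individual neighborhood, then the borough alone) can be isolated and tested without any network access. The sketch below is an illustration under assumptions: `candidate_queries` and `first_hit` are hypothetical helpers, and the dictionary stands in for a geocoder with placeholder coordinates rather than real Nominatim results.

```python
def candidate_queries(neighborhood, borough):
    """Hypothetical helper: ordered geocoding queries, most specific first."""
    queries = [f"{neighborhood}, {borough}"]
    parts = [part.strip() for part in neighborhood.split(",") if part.strip()]
    if len(parts) > 1:
        queries.extend(f"{part}, {borough}" for part in parts)
    queries.append(borough)
    return queries


def first_hit(queries, geocode):
    """Return the first non-None result of geocode(query), or None if every query fails."""
    for query in queries:
        result = geocode(query)
        if result is not None:
            return result
    return None


# Stand-in for a geocoder: a lookup table with placeholder coordinates.
fake_geocoder = {"Rouge, Scarborough": (43.8, -79.2)}
queries = candidate_queries("Malvern, Rouge", "Scarborough")
print(queries)
print(first_hit(queries, fake_geocoder.get))  # (43.8, -79.2)
```

Against the real service, the `geocode` argument would be a function that calls `geolocator.geocode` and then sleeps for a second to respect the rate limit.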


### 3. Explore and cluster the neighborhoods in Toronto

_Install folium if not installed earlier and import it._

In [10]:
!pip install folium
import folium # map rendering library



_Create a map of Toronto with neighborhoods superimposed on top._

In [98]:
city = 'Toronto, Ontario'

# get the latitude and longitude of Toronto
geolocator = Nominatim(user_agent="city_explorer")
location = geolocator.geocode(city)
latitude = location.latitude
longitude = location.longitude

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(torontodata['Latitude'], torontodata['Longitude'], torontodata['Borough'], torontodata['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

_Credentials and version for Foursquare API_

In [1]:
# The code was removed by Watson Studio for sharing.
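
Since the credentials cell is blank in the shared notebook, the later cells need `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION` to be defined some other way. One possible approach, sketched here under assumptions, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET`, and `FOURSQUARE_VERSION` and the default version date are placeholders, not values from the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read credentials from environment variables (the variable names are assumptions)."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first")
    # The v parameter is a YYYYMMDD date string; this default is only a placeholder.
    version = env.get("FOURSQUARE_VERSION", "20180605")
    return client_id, client_secret, version
```

With the variables set, `CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials()` would provide the names that `getNearbyVenues` expects, without storing secrets in the notebook.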

_Defining get_category_type function to extract the category of the venue._

In [13]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

_Function to explore nearby venues of all the neighborhoods in Toronto using the Foursquare API._

In [14]:
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
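
The list comprehension inside `getNearbyVenues` reads `v['venue']['categories'][0]['name']`, which would raise an `IndexError` for a venue that has no category. The sketch below shows one way the flattening step might tolerate that case. `venue_rows` is a hypothetical helper, and `sample_items` is hand-written data shaped like the `items` list the notebook reads, not real API output.

```python
def venue_rows(name, lat, lng, items):
    """Hypothetical helper: flatten explore-style items, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written sample shaped like the 'items' list the notebook reads; not real API output.
sample_items = [
    {"venue": {"name": "Venue A", "location": {"lat": 43.76, "lng": -79.32},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 43.75, "lng": -79.31},
               "categories": []}},
]
for row in venue_rows("Parkwoods", 43.7588, -79.320197, sample_items):
    print(row)
```

Rows with a `None` category can then be dropped or kept before one-hot encoding, depending on what the analysis needs.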

_Use above function on each neighborhood and create a new dataframe called toronto_venues._

In [17]:
toronto_venues = getNearbyVenues(names=torontodata['Neighborhood'],
                                   latitudes=torontodata['Latitude'],
                                   longitudes=torontodata['Longitude']
                                  )

_The size of the new dataframe toronto_venues and first 5 rows of it._

In [18]:
print(toronto_venues.shape)
toronto_venues.head()

(3690, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.7588,-79.320197,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.7588,-79.320197,LCBO,43.757774,-79.314257,Liquor Store
2,Parkwoods,43.7588,-79.320197,Petro-Canada,43.75795,-79.315187,Gas Station
3,Parkwoods,43.7588,-79.320197,Shoppers Drug Mart,43.760857,-79.324961,Pharmacy
4,Parkwoods,43.7588,-79.320197,TD Canada Trust,43.757569,-79.314976,Bank


_One-hot encode the venue categories in toronto_venues to create a new dataframe for further analysis._

In [67]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot.insert(0, "Neighbourhood", toronto_venues['Neighborhood'], True) 

print(toronto_onehot.shape)
toronto_onehot.head()

(3690, 263)


Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


_Group rows by neighborhood, taking the mean of the frequency of occurrence of each category._

In [69]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.076923,...,0.000000,0.000000,0.00,0.076923,0.000000,0.000,0.000000,0.000000,0.000000,0.000000
1,"Alderwood, Long Branch",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000,0.000000,0.000000,0.000000,0.000000
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.100000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000,0.000000,0.000000,0.000000,0.000000
3,Bayview Village,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000,0.000000,0.000000,0.000000,0.000000
4,"Bedford Park, Lawrence Manor East",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000,0.000000,0.000000,0.000000,0.000000
5,Berczy Park,0.0,0.000000,0.020000,0.000000,0.000000,0.010000,0.010000,0.000000,0.010000,...,0.000000,0.010000,0.00,0.000000,0.000000,0.000,0.000000,0.000000,0.010000,0.000000
6,"Birch Cliff, Cliffside West",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.000000,0.000000,0.000,0.000000,0.000000,0.000000,0.000000
7,"Brockton, Parkdale Village, Exhibition Place",0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.00,0.150000,0.000000,0.000,0.000000,0.000000,0.000000,0.000000
8,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.000000,0.020000,0.000000,0.000000,0.000000,0.000000,0.000000,0.020000,...,0.020000,0.000000,0.00,0.000000,0.000000,0.000,0.010000,0.000000,0.010000,0.000000
9,Caledonia-Fairbanks,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.010000,...,0.010000,0.000000,0.00,0.000000,0.000000,0.000,0.020000,0.000000,0.000000,0.000000
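
Taking the mean of one-hot columns within a neighborhood gives the share of that neighborhood's venues that fall in each category. The plain-Python sketch below computes the same quantity directly from (neighborhood, category) pairs, which may make the meaning of the values above easier to see. `category_frequencies` is a hypothetical helper and the pairs are toy data.

```python
from collections import Counter, defaultdict


def category_frequencies(pairs):
    """Hypothetical helper: per-neighborhood share of each venue category."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for neighborhood, counter in counts.items()
    }


# Toy (neighborhood, category) pairs for illustration only.
pairs = [("A", "Bank"), ("A", "Cafe"), ("A", "Cafe"), ("A", "Park"), ("B", "Cafe")]
freqs = category_frequencies(pairs)
print(freqs["A"]["Cafe"])  # 0.5
print(freqs["B"]["Cafe"])  # 1.0
```

Unlike the dataframe, this dictionary simply omits categories that never occur in a neighborhood rather than storing 0.0 for them.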


_Function to return the most common venues of each neighborhood; the number of venues is set by num_top_venues._

In [70]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

_Create a new dataframe to display the top 10 venues for each neighborhood._

In [144]:
import numpy as np # Import Numpy

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Chinese Restaurant,Butcher,Cantonese Restaurant,Train Station,Hong Kong Restaurant,Korean Restaurant,Coffee Shop,Asian Restaurant,Peking Duck Restaurant,Vietnamese Restaurant
1,"Alderwood, Long Branch",Pizza Place,Pub,Gym,Coffee Shop,Skating Rink,Pool,Pharmacy,Sandwich Place,Flower Shop,Flea Market
2,"Bathurst Manor, Wilson Heights, Downsview North",Italian Restaurant,Asian Restaurant,Locksmith,Pizza Place,Deli / Bodega,Convenience Store,Sandwich Place,Beer Store,Bagel Shop,Intersection
3,Bayview Village,Bank,Sporting Goods Shop,Persian Restaurant,Outdoor Supply Store,Sandwich Place,Fast Food Restaurant,Fish Market,Gas Station,Metro Station,Breakfast Spot
4,"Bedford Park, Lawrence Manor East",Locksmith,Rental Car Location,Seafood Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Fish Market,Empanada Restaurant
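
The column names above are built from a three-element suffix list with a fallback to "th", which works for 10 venues but would label a 21st column as "21th". If more columns were ever needed, a general ordinal helper could be used instead. The `ordinal` function below is a hypothetical addition, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[1])   # 1st Most Common Venue
print(columns[10])  # 10th Most Common Venue
```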


_Run k-means to cluster the neighborhood into 5 clusters._

In [145]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 4, 4, 4, 4, 4, 0, 4, 4, 4], dtype=int32)
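
To make the clustering step less of a black box, the sketch below runs the two alternating steps of k-means (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain Python. It is a simplified illustration and not scikit-learn's implementation: the `kmeans` function, the toy frequency vectors, and the hand-picked starting centroids are all assumptions made for the example.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's iterations on lists of equal-length tuples (illustration only)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Toy frequency vectors (two categories) and hand-picked starting centroids.
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = kmeans(points, centroids=[(1.0, 0.0), (0.0, 1.0)])
print(labels)  # [0, 0, 1, 1]
```

In the notebook, each row of `toronto_grouped_clustering` plays the role of one point, with one dimension per venue category.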

_Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood._

In [146]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster_Labels', kmeans.labels_)

toronto_merged = torontodata

# join the cluster labels and top venues onto torontodata, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.7588,-79.320197,4.0,Caribbean Restaurant,Gas Station,Liquor Store,Chinese Restaurant,Bank,Laundry Service,Shopping Mall,Convenience Store,Coffee Shop,Bus Line
1,M4A,North York,Victoria Village,43.732658,-79.311189,4.0,Thai Restaurant,Middle Eastern Restaurant,Yoga Studio,Flower Shop,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Food & Drink Shop
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654174,-79.380812,4.0,Clothing Store,Coffee Shop,Restaurant,Seafood Restaurant,Italian Restaurant,Bakery,Fast Food Restaurant,Bookstore,Tea Room,Electronics Store
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.722079,-79.437507,0.0,Doctor's Office,Electronics Store,Bank,Kids Store,Park,Flower Shop,Fast Food Restaurant,Fish & Chips Shop,Fish Market,Flea Market
4,M9A,Etobicoke,Islington Avenue,43.622575,-79.514215,4.0,Restaurant,Coffee Shop,Yoga Studio,Gourmet Shop,Pet Store,Movie Theater,Liquor Store,Japanese Restaurant,Intersection,Ice Cream Shop


_Check for unique Cluster Labels from the merged dataframe_

In [147]:
np.unique(toronto_merged['Cluster_Labels'], return_counts=True)

(array([ 0.,  1.,  2.,  3.,  4., nan, nan]),
 array([ 8,  1,  4,  1, 86,  1,  1]))

_After merging the two dataframes, we can see that some rows have NaN values for the cluster labels._
_Drop the rows with NaN cluster labels and check the unique cluster labels in the merged dataframe again._

In [148]:
# drop the rows that received no cluster label in the join
toronto_merged = toronto_merged.dropna(subset=['Cluster_Labels'])

# As the Cluster label data type is float in dataframe toronto_merged, casting it to integer
toronto_merged = toronto_merged.astype({"Cluster_Labels": int})

np.unique(toronto_merged['Cluster_Labels'], return_counts=True)

(array([0, 1, 2, 3, 4]), array([ 8,  1,  4,  1, 86]))
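
The earlier label count listed `nan` twice, each with a count of 1. A likely reason is that NaN never compares equal to itself, so each missing label was treated as its own distinct value; other NumPy versions may group them differently. The sketch below shows that comparison and a hypothetical helper, `count_labels`, that pools all missing labels into a single total.

```python
import math
from collections import Counter


def count_labels(labels):
    """Hypothetical helper: count labels, pooling all NaN/None entries into one 'missing' total."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    counts = Counter(int(v) for v in labels if not is_missing(v))
    missing = sum(1 for v in labels if is_missing(v))
    return dict(sorted(counts.items())), missing


nan = float("nan")
print(nan == nan)  # False: NaN never compares equal, even to itself
print(count_labels([4.0, 4.0, nan, 0.0, nan]))  # ({0: 1, 4: 2}, 2)
```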

_Let's visualize the resulting clusters_

In [149]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster_Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
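
The map above takes its colors from matplotlib's rainbow colormap and looks them up with `rainbow[cluster-1]`, so label 0 wraps around to the last color. As a hedged alternative that needs only the standard library, the sketch below builds one hex color per cluster from evenly spaced hues and indexes it directly by label. `cluster_colors` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_colors(k):
    """Hypothetical helper: k evenly spaced hues as '#rrggbb' strings."""
    colors = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.9, 0.9)
        colors.append("#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255)))
    return colors


palette = cluster_colors(5)
# Index the palette directly by cluster label: label 0 -> palette[0], and so on.
color_for_cluster_3 = palette[3]
print(palette)
```

In the marker loop, `palette[cluster]` could then be passed as both `color` and `fill_color`.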