# Segmenting and Clustering Neighborhoods in Toronto

Peer-graded Assignment

[**1. Loading and preprocessing the data (part 1)**](#1.-Loading-and-preprocessing-the-data)

[**2. Adding latitude and longitude (part 2)**](#2.-Adding-latitude-and-longitude)

[**3. Exploring and clustering the neighborhoods in Toronto (part 3)**](#3.-Exploring-and-clustering-the-neighborhoods-in-Toronto)

### Importing packages

In [1]:
import pandas as pd
import wget
from geopy.geocoders import Nominatim
import folium
import requests
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

## 1. Loading and preprocessing the data

Scraping the Wikipedia page to obtain the table of postal codes and load it into a pandas dataframe

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df_list = pd.read_html(url)

Getting the desired dataframe from the list returned by pandas

In [3]:
df = df_list[0]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Keeping only the rows that have an assigned borough and dropping rows whose borough is *Not assigned*

In [4]:
df = df.loc[df.loc[:,"Borough"] != "Not assigned"]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Grouping by postal code and borough and concatenating the neighbourhoods

In [5]:
df = df.groupby(["Postal Code","Borough"]).agg({'Neighbourhood': ','.join}).reset_index()
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
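
In the scraped table each postal code already occupies a single row, so the join above has nothing to concatenate. The sketch below, on a small hypothetical sample (the rows are made up for illustration), shows what the same grouping does when one postal code spans several rows. It joins with `", "` so the result matches the comma-and-space style of the scraped names.

```python
import pandas as pd

# Hypothetical sample in which one postal code spans two rows.
sample = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One row per (postal code, borough); neighbourhood names joined with ", ".
combined = (
    sample.groupby(["Postal Code", "Borough"], as_index=False)
          .agg({"Neighbourhood": ", ".join})
)
print(combined)
```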


If a row has a borough but a *Not assigned* neighbourhood, the neighbourhood is set to the name of the borough

In [6]:
df.loc[df.loc[:,"Neighbourhood"] == "Not assigned","Neighbourhood"] = df.loc[df.loc[:,"Neighbourhood"] == "Not assigned","Borough"]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
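
The same fallback can be written with `Series.mask`, which may read more clearly than the double `.loc` assignment. This is a sketch on a hypothetical two-row sample, not on the scraped data.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no neighbourhood.
sample = pd.DataFrame({
    "Postal Code": ["M0X", "M3A"],
    "Borough": ["Example Borough", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where the neighbourhood is "Not assigned", take the borough name instead.
missing = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(missing, sample["Borough"])
print(sample)
```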


In [7]:
print('The dataframe has {} boroughs and {} unique postal codes.'.format(len(df['Borough'].unique()),df.shape[0]))

The dataframe has 10 boroughs and 103 unique postal codes.


Shape of the cleaned dataframe

In [8]:
df.shape

(103, 3)

## 2. Adding latitude and longitude

Downloading the coordinates data

In [9]:
url = "https://cocl.us/Geospatial_data"
wget.download(url,"./data/coordinates.csv")
coordinates = pd.read_csv("./data/coordinates.csv")
coordinates

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Merging the coordinates to add latitude and longitude to the data

In [10]:
df = df.merge(coordinates, on = "Postal Code")
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
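
The default inner merge silently drops any postal code that has no match in the coordinates file. A hedged sketch of how this could be checked, using a left merge with pandas' `validate` and `indicator` options on small hypothetical frames (`boroughs` and `coords` are illustrative names):

```python
import pandas as pd

# Hypothetical frames: "M1E" has no entry in the coordinates table.
boroughs = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column recording where each row came from.
merged = boroughs.merge(
    coords, on="Postal Code", how="left", validate="one_to_one", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

In the notebook the row count stays at 103 after the merge, so no postal code was lost there.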


## 3. Exploring and clustering the neighborhoods in Toronto

Because a postal code can cover several neighbourhoods, the analysis below uses the postal code, rather than the neighbourhood, as the unit of analysis.

### 3.1. Retrieving data from Foursquare

Using the geopy library to get the latitude and longitude of Toronto

In [11]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Creating a map of Toronto with neighborhoods superimposed on top

In [12]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='blue', fill=True,
                        fill_color='#3186cc', fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

Defining Foursquare Credentials and Version

In [13]:
CLIENT_ID = 'MY_ID'
CLIENT_SECRET = 'MY_SECRET'
VERSION = '20180605'
LIMIT = 100

Defining a function that does the following for each postal code in Toronto:

- sends a GET request to the Foursquare API
- retrieves up to 100 venues (the value of `LIMIT`) within the given radius of the postal code's coordinates
- cleans the JSON response and structures it into a pandas dataframe

In [14]:
def getNearbyVenues(codes, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for code, lat, lng in zip(codes, latitudes, longitudes):
                   
        # creating the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)

        # making the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # returning only relevant information for each nearby venue
        venues_list.append([(code, lat, lng, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) 
                            for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 'Latitude', 'Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    
    return nearby_venues
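
The nested indexing in `getNearbyVenues` assumes that every response contains at least one group and that every venue has at least one category. The sketch below shows a more defensive way to flatten such a response. It runs against a hand-made payload rather than real API output, and `extract_venues`, `fake_payload`, and the `"Uncategorized"` placeholder are hypothetical names introduced only for illustration.

```python
def extract_venues(payload, code):
    """Flatten one explore-style response into (code, name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Fall back to a placeholder when a venue has no category listed.
        categories = venue.get("categories") or [{"name": "Uncategorized"}]
        rows.append((
            code,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"],
        ))
    return rows


# Hand-made payload imitating the nesting used in getNearbyVenues; not real API output.
fake_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Cafe A", "location": {"lat": 43.80, "lng": -79.19},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Venue B", "location": {"lat": 43.81, "lng": -79.20},
                   "categories": []}},
    ]}]}
}

rows = extract_venues(fake_payload, "M1B")
empty = extract_venues({"response": {}}, "M1C")
print(rows)
print(empty)
```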

Running the above function on each postal code and creating a new dataframe called *toronto_venues*

In [15]:
toronto_venues = getNearbyVenues(codes=df['Postal Code'], latitudes=df['Latitude'], longitudes=df['Longitude'], radius=1050)

Checking the size of the resulting dataframe

In [16]:
print(toronto_venues.shape)
toronto_venues.head()

(5202, 7)


Unnamed: 0,Postal Code,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Images Salon & Spa,43.802283,-79.198565,Spa
1,M1B,43.806686,-79.194353,Harvey's,43.80002,-79.198307,Restaurant
2,M1B,43.806686,-79.194353,RBC Royal Bank,43.798782,-79.19709,Bank
3,M1B,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
4,M1B,43.806686,-79.194353,Caribbean Wave,43.798558,-79.195777,Caribbean Restaurant


Checking how many venues were returned for each postal code

In [17]:
toronto_venues.groupby('Postal Code').count()

Postal Code,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,17,17,17,17,17,17
M1C,5,5,5,5,5,5
M1E,25,25,25,25,25,25
M1G,12,12,12,12,12,12
M1H,32,32,32,32,32,32
...,...,...,...,...,...,...
M9N,16,16,16,16,16,16
M9P,20,20,20,20,20,20
M9R,17,17,17,17,17,17
M9V,16,16,16,16,16,16


Finding out how many unique categories can be curated from all the returned venues

In [18]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 331 unique categories.


### 3.2. Analyzing each neighborhood

Implementing one hot encoding

In [19]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# adding Postal Code column back to dataframe
toronto_onehot['Postal Code'] = toronto_venues['Postal Code']

# moving Postal Code column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Postal Code,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
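
As a small illustration of the encoding step, the sketch below applies `pd.get_dummies` to a hypothetical three-venue sample. Passing `dtype=int` is an assumption made here so that the output shows 0/1 as in the table above, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical sample: three venues across two postal codes.
venues = pd.DataFrame({
    "Postal Code": ["M1B", "M1B", "M1C"],
    "Venue Category": ["Spa", "Bank", "Park"],
})

# One indicator column per category; each row has exactly one 1.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the postal code back as the first column.
onehot.insert(0, "Postal Code", venues["Postal Code"])
print(onehot)
```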


Checking the size

In [20]:
toronto_onehot.shape

(5202, 332)

Grouping rows by postal code and taking the mean of each one-hot column, which gives the relative frequency of each category

In [21]:
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
toronto_grouped

Unnamed: 0,Postal Code,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0000,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0000,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0000,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0000,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0000,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M9N,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0000,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
99,M9P,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0000,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
100,M9R,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,...,0.0000,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
101,M9V,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0625,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.00000,0.0
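
To see why the mean of one-hot columns is a relative frequency, consider this sketch on a hypothetical sample: a postal code with four venues, two of them banks, gets 0.5 in the `Bank` column, and every row of the result sums to 1.

```python
import pandas as pd

# Hypothetical sample: four venues in M1B, one in M1C.
venues = pd.DataFrame({
    "Postal Code": ["M1B", "M1B", "M1B", "M1B", "M1C"],
    "Venue Category": ["Bank", "Bank", "Spa", "Park", "Park"],
})

onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Mean of the 0/1 indicators per postal code = share of venues in each category.
freq = onehot.groupby(venues["Postal Code"]).mean()
print(freq)
```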


Confirming the new size

In [22]:
toronto_grouped.shape

(103, 332)

Defining a function that returns the most common venue categories in a row, sorted by descending frequency

In [23]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
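
The ranking idea can be tried on a single row in isolation. In the sketch below, `top_categories` and the frequency values are hypothetical. It works on a Series that holds only the category frequencies, whereas `return_most_common_venues` first skips the postal code entry with `row.iloc[1:]`.

```python
import pandas as pd

def top_categories(freqs, n):
    """Return the n category names with the highest frequency."""
    # kind="stable" requests a stable sorting algorithm.
    ranked = freqs.sort_values(ascending=False, kind="stable")
    return ranked.index[:n].tolist()

# Hypothetical frequencies for one postal code.
freqs = pd.Series({"Bank": 0.2, "Coffee Shop": 0.5, "Park": 0.3})
top_two = top_categories(freqs, 2)
print(top_two)
```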

Creating the new dataframe and displaying the top 10 venue categories for each postal code

In [64]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# creating columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# creating a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped['Postal Code']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Coffee Shop,Restaurant,Trail,Fast Food Restaurant,Bank,Spa,Chinese Restaurant,Paper / Office Supplies Store,Park,Martial Arts School
1,M1C,Breakfast Spot,Park,Burger Joint,Playground,Italian Restaurant,Zoo,Filipino Restaurant,Event Space,Falafel Restaurant,Farm
2,M1E,Pizza Place,Breakfast Spot,Restaurant,Bank,Fast Food Restaurant,Beer Store,Sandwich Place,Greek Restaurant,Discount Store,Supermarket
3,M1G,Coffee Shop,Park,Indian Restaurant,Chinese Restaurant,Mobile Phone Shop,Department Store,Pharmacy,Juice Bar,Fast Food Restaurant,Ethiopian Restaurant
4,M1H,Coffee Shop,Bakery,Gas Station,Bank,Indian Restaurant,Grocery Store,Fried Chicken Joint,Caribbean Restaurant,Music Store,Restaurant
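
The column labels above rely on a list lookup falling back to "th", which is correct for 1 to 10 but would label 21 as "21th" if `num_top_venues` were raised that far. A hedged alternative is a small helper; `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take "th" despite ending in 1, 2, 3.
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

columns = ["Postal Code"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```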


### 3.3. Clustering neighbourhoods

Running k-means to group the postal codes into 5 clusters

In [65]:
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Postal Code')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

kmeans.labels_[0:10]

array([1, 0, 3, 3, 3, 3, 3, 3, 3, 3])
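
For intuition about what the clustering step does, the sketch below implements the basic assign-then-update loop (Lloyd's algorithm) in NumPy on a tiny synthetic data set. It is only an illustration of the idea and not scikit-learn's implementation, which among other things uses a smarter initialisation. `lloyd_kmeans` and the sample points are hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations: assign points to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (fancy indexing returns a copy).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; leave it alone if it has none.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated synthetic groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = lloyd_kmeans(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping carries meaning, which also applies to the `kmeans.labels_` values shown above.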

Creating a new dataframe that includes the cluster as well as the top 10 venues for each postal code

In [66]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')

toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,1,Coffee Shop,Restaurant,Trail,Fast Food Restaurant,Bank,Spa,Chinese Restaurant,Paper / Office Supplies Store,Park,Martial Arts School
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,0,Breakfast Spot,Park,Burger Joint,Playground,Italian Restaurant,Zoo,Filipino Restaurant,Event Space,Falafel Restaurant,Farm
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,3,Pizza Place,Breakfast Spot,Restaurant,Bank,Fast Food Restaurant,Beer Store,Sandwich Place,Greek Restaurant,Discount Store,Supermarket
3,M1G,Scarborough,Woburn,43.770992,-79.216917,3,Coffee Shop,Park,Indian Restaurant,Chinese Restaurant,Mobile Phone Shop,Department Store,Pharmacy,Juice Bar,Fast Food Restaurant,Ethiopian Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,3,Coffee Shop,Bakery,Gas Station,Bank,Indian Restaurant,Grocery Store,Fried Chicken Joint,Caribbean Restaurant,Music Store,Restaurant


Visualizing the resulting clusters

In [67]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# setting color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# adding markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Postal Code'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster], fill=True, fill_color=rainbow[cluster], fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
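
The colour scheme only needs one distinct hex colour per cluster label. A minimal sketch of that mapping, assuming labels run from 0 to `kclusters - 1` as k-means produces them (`palette` and `color_for` are hypothetical names):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Sample the rainbow colormap at kclusters evenly spaced points and convert to hex.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

def color_for(cluster):
    """Return the hex colour assigned to a cluster label."""
    return palette[int(cluster)]

print([color_for(c) for c in range(kclusters)])
```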