<a href="https://colab.research.google.com/github/lugoll/Coursera_Capstone/blob/main/neighborhood_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coursera Capstone: Segmenting and Clustering Neighborhoods in Toronto

This notebook covers the task 'Segmenting and Clustering Neighborhoods in Toronto' from the Coursera course 'Applied Data Science Capstone'.



First, define the constants used for the Foursquare API requests:

In [1]:
CLIENT_ID = ''
CLIENT_SECRET = ''
VERSION = '20201016'
LIMIT = 100
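
The empty strings keep the credentials out of the repository, but the cell then has to be edited by hand before each run. One possible alternative, sketched here, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions for illustration and are not part of the original notebook.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or empty strings if unset."""
    # The two variable names below are hypothetical; pick whatever fits your setup.
    client_id = env.get("FOURSQUARE_CLIENT_ID", "")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET", "")
    return client_id, client_secret

CLIENT_ID, CLIENT_SECRET = load_credentials()
if not CLIENT_ID or not CLIENT_SECRET:
    print("Foursquare credentials are not set; API requests will fail.")
```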

Import the required libraries:

In [14]:
import requests
import pandas as pd
import numpy as np

!pip install geocoder > /dev/null
import geocoder 

from sklearn.cluster import KMeans

!pip install geopy > /dev/null
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install folium > /dev/null
import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

print("Imports successful")

Imports successful


## Retrieve Postal Codes for Toronto

Read the postal codes from the Wikipedia page into a pandas DataFrame:

In [3]:
postal_codes_df = pd.read_html('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
postal_codes_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


## Clean Dataframe

1. Remove rows whose Borough is **Not assigned**

In [5]:
# .copy() avoids a SettingWithCopyWarning when columns are added to this frame later
postal_codes_clean = postal_codes_df[postal_codes_df['Borough'] != 'Not assigned'].copy()
postal_codes_clean.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


2. Check for duplicate Postal Codes

In [7]:
print("Rows in Dataframe: {}".format(postal_codes_clean.shape[0]))
print("Number of Unique Postal Codes: {}".format(postal_codes_clean['Postal Code'].unique().shape[0]))

Rows in Dataframe: 103
Number of Unique Postal Codes: 103


As we can see, there are no duplicate Postal Code entries; it seems this has been fixed on the Wikipedia page.
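
Had the table still contained several rows for one postal code, they could be combined into a single row. The following is a minimal sketch on toy data (the duplicated rows are invented for illustration and are not taken from the Wikipedia table); it joins the neighbourhood names with a comma.

```python
import pandas as pd

# Toy data with a deliberately duplicated postal code
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One row per postal code; neighbourhood names are joined into one string
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```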


3. Fill **Not assigned** Neighbourhoods


In [9]:
postal_codes_clean[postal_codes_clean['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


There are no **Not assigned** neighbourhoods either; this has also been fixed on the Wikipedia page.


In [10]:
postal_codes_clean.shape

(103, 3)

## Get Coordinates for Postal Codes

Define a function for retrieving the coordinates:

In [11]:
def add_coordinates(row):
  # initialize your variable to None
  lat_lng_coords = None

  # loop until you get the coordinates
  while lat_lng_coords is None:
    g = geocoder.google('{}, Toronto, Ontario'.format(row['Postal Code']))
    lat_lng_coords = g.latlng

  row['Latitude'] = lat_lng_coords[0]
  row['Longitude'] = lat_lng_coords[1]
  # return the row; without this, DataFrame.apply would only collect None values
  return row
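
The `while` loop in `add_coordinates` never terminates if the geocoder keeps returning `None`. A bounded variant might look like the following sketch. The names `geocode_with_retry`, `max_attempts`, `delay` and `lookup` are hypothetical and introduced only here; `lookup` stands in for a call such as `geocoder.google(...).latlng`, and `fake_lookup` simulates a service that fails twice before answering.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # optional pause before the next attempt
    return None  # the caller decides how to handle a failed lookup

# Stand-in lookup that fails twice and then succeeds (for demonstration only)
calls = {"n": 0}

def fake_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.753259, -79.329656]

print(geocode_with_retry(fake_lookup, "M3A, Toronto, Ontario"))
```

Returning `None` after the last attempt lets the notebook fall back to another data source instead of hanging.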

Now apply the function to the DataFrame:

In [None]:
postal_codes_clean['Latitude'] = pd.Series(dtype='float64')
postal_codes_clean['Longitude'] = pd.Series(dtype='float64')

toronto_df = postal_codes_clean.apply(add_coordinates, axis=1)
toronto_df.head()

In case the geocoder doesn't work, import the backup CSV file and merge it on the postal code:

In [12]:
backup_df = pd.read_csv('http://cocl.us/Geospatial_data')
toronto_df = postal_codes_clean.merge(backup_df, on='Postal Code')
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


## Cluster Neighbourhoods

Define and apply a function that retrieves the venues near each neighbourhood:

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

# Now apply it
toronto_venues = getNearbyVenues(toronto_df['Neighbourhood'],toronto_df['Latitude'],toronto_df['Longitude'])
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
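
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` if a venue came back with an empty category list. A more defensive parsing step could look like the sketch below. `parse_items` is a hypothetical helper, and the sample dictionaries only mimic the fields that the function above reads, not a full Foursquare response; the second entry is invented.

```python
def parse_items(name, lat, lng, items):
    """Flatten Foursquare-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None as the category when the list is empty or absent
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

sample_items = [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed Spot",  # invented entry without a category
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": []}},
]
rows = parse_items("Parkwoods", 43.753259, -79.329656, sample_items)
print(rows)
```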


One-hot encode the venue categories of each neighbourhood:

In [16]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = list(toronto_onehot.columns)
fixed_columns.remove('Neighborhood')
fixed_columns = ['Neighborhood'] + fixed_columns
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Garage,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bike Shop,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,...,Smoothie Shop,Snack Place,Soccer Field,Social Club,Soup Place,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Supplement Shop,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tea Room,Tennis Court,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Take the mean of every category per neighbourhood, which gives the relative frequency of each venue category:

In [17]:
toronto_mean = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_mean.shape

(96, 273)
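
Because every one-hot row contains a single 1, the group mean of a column is the share of a neighbourhood's venues that fall into that category. The toy example below (venue data invented for illustration) makes this visible on four rows.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Cafe", "Cafe"],
})

# One column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean per neighbourhood = fraction of venues in each category
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

Neighbourhood A has two parks among three venues, so its `Park` value is 2/3; neighbourhood B has only a cafe, so its `Cafe` value is 1.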

Run _k_-means to group the neighborhoods into 5 clusters.


In [18]:
# set number of clusters
kclusters = 5

toronto_mean_clustering = toronto_mean.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_mean_clustering)

# check cluster labels generated for each row in the dataframe
len(kmeans.labels_)

96
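
The number of clusters is fixed at 5 in this notebook. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (`inertia_`) for several values of k and look for the point where it stops dropping sharply. The sketch below runs on random toy data rather than the venue frequencies, and it sets `n_init=10` explicitly, which is an assumption and not a setting taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated toy blobs in 2-D (a stand-in for toronto_mean_clustering)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 5.0, 10.0)])

inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 3))
```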

Add the cluster labels to the *toronto_mean* DataFrame:

In [19]:
toronto_mean.insert(0, 'Cluster Labels', kmeans.labels_)

Merge the labels into the *toronto_df* DataFrame. Because this is an inner merge, neighborhoods without a label are dropped. These neighborhoods have no label because no venues were found for them.

In [20]:
toronto_df.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)
toronto_label = toronto_mean[['Neighborhood', 'Cluster Labels']].merge(toronto_df, on='Neighborhood')
toronto_label.head()

Unnamed: 0,Neighborhood,Cluster Labels,Postal Code,Borough,Latitude,Longitude
0,Agincourt,1,M1S,Scarborough,43.7942,-79.262029
1,"Alderwood, Long Branch",1,M8W,Etobicoke,43.602414,-79.543484
2,"Bathurst Manor, Wilson Heights, Downsview North",1,M3H,North York,43.754328,-79.442259
3,Bayview Village,1,M2K,North York,43.786947,-79.385975
4,"Bedford Park, Lawrence Manor East",1,M5M,North York,43.733283,-79.41975
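
To see which neighborhoods the merge removes, the labelled names can be compared with the full list. The sketch below uses small invented frames that reuse the column names from above; `all_hoods`, `labelled` and `dropped` are hypothetical variable names.

```python
import pandas as pd

# Stand-ins for toronto_df and toronto_mean; "No Venues Here" is a made-up name
all_hoods = pd.DataFrame({"Neighborhood": ["Agincourt", "Bayview Village", "No Venues Here"]})
labelled = pd.DataFrame({"Neighborhood": ["Agincourt", "Bayview Village"],
                         "Cluster Labels": [1, 1]})

# Rows whose neighborhood has no cluster label would vanish in an inner merge
dropped = all_hoods[~all_hoods["Neighborhood"].isin(labelled["Neighborhood"])]
print(dropped["Neighborhood"].tolist())
```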


## Visualize Neighborhood clusters

Get the coordinates of Toronto:

In [21]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Create a map of Toronto with the labeled neighborhoods:

In [23]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_label['Latitude'], toronto_label['Longitude'], toronto_label['Neighborhood'], toronto_label['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
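
The color list holds one entry per cluster, so a cluster label can index it directly. The expression `rainbow[cluster-1]` also yields distinct colors, but only because index `-1` wraps around to the last entry for cluster 0. A sketch of the direct mapping, using the same `cm.rainbow` colormap (`palette` is a hypothetical name):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5
# One hex color per cluster label 0..kclusters-1
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

for cluster in range(kclusters):
    print(cluster, palette[cluster])
```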