# Segmenting and Clustering Neighborhoods in Toronto

# <font color='blue'> 1. First Part </font>

## 1.1. Scrape the Wikipedia page about the Neighborhoods in Toronto

### Source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
# libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# Fetch the HTML of the page that lists the neighborhoods in Toronto
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(source.text, 'lxml')

# Walk the rows of the first .wikitable on the page and store the text of each row's cells in a list
data = []
columns = []
table = soup.find(class_='wikitable')
for index, tr in enumerate(table.find_all('tr')):
    section = []
    for td in tr.find_all(['th','td']):
        section.append(td.text.rstrip())
    
    #First row of data is the header
    if (index == 0):
        columns = section
    else:
        data.append(section)

#convert list into Pandas DataFrame
canada_df = pd.DataFrame(data = data,columns = columns)
canada_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
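
The header/data split used above can be tried offline. The following is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser` and runs on a small hand-written HTML snippet (`SAMPLE_HTML` and the `TableRows` class are illustrative names, not part of the notebook):

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample, not the live Wikipedia page
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)

# First row is the header, the rest is data
columns, data = parser.rows[0], parser.rows[1:]
print(columns)
print(data)
```

As in the notebook, the first `<tr>` supplies the column names and every later row becomes one record.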


## 1.2. Data Preparation

* <b> The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood </b>

In [2]:
# rename the columns
canada_df.rename(columns={'Postal Code': 'PostalCode', 'Borough': 'Borough', 'Neighbourhood': 'Neighborhood'}, inplace=True)
canada_df.head(2)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned


* <b> Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. </b>

In [3]:
# drop the rows with Borough == "Not assigned"
canada_df.drop(canada_df[canada_df.Borough == "Not assigned"].index, inplace=True)

# list of Boroughs
print(canada_df.Borough.unique())

canada_df.head(2)

['North York' 'Downtown Toronto' 'Etobicoke' 'Scarborough' 'East York'
 'York' 'East Toronto' 'West Toronto' 'Central Toronto' 'Mississauga']


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village


* <b> If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. </b>

In [4]:
# where Neighborhood is "Not assigned", use the name of the Borough instead
canada_df['Neighborhood'] = canada_df['Neighborhood'].where(canada_df['Neighborhood'] != "Not assigned", canada_df['Borough'])

# list of Neighborhood
#print(canada_df.Neighborhood.unique())

canada_df.head(2)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village


* <b> Number of rows of your dataframe </b>

In [5]:
canada_df.shape

(103, 3)
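
The three preparation rules in this section can be checked on a tiny frame before trusting them on the scraped table. This is a sketch on hand-made rows (the postal code `X0X` is made up for illustration); it follows the same steps as the cells above but writes the neighborhood fallback with a boolean mask:

```python
import pandas as pd

# Hand-made rows, not scraped data: one unassigned borough, one complete row,
# and one borough whose neighbourhood is "Not assigned".
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "X0X"],
    "Borough": ["Not assigned", "North York", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: three columns named PostalCode, Borough, Neighborhood
df = raw.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"})

# Rule 2: keep only rows with an assigned borough
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 3: an unassigned neighborhood takes the borough's name
missing = df["Neighborhood"] == "Not assigned"
df.loc[missing, "Neighborhood"] = df.loc[missing, "Borough"]

print(df)
```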

# <font color='blue'> 2. Second Part </font> 

## The Latitude and the Longitude coordinates of each neighborhood (Using geocoder with ArcGIS)

In [26]:
#!pip install geopy
#!pip install geocoder

In [7]:
import geocoder

# Define a function that returns the latitude and longitude of a postal code
def get_latlng(postal_code):
    
    # Initialize the location (lat. and long.) to None
    lat_lng_coords = None
    
    # A failed lookup yields no coordinates, so repeat the request until coordinates come back
    while not lat_lng_coords:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
                
    return lat_lng_coords

In [8]:
# Append the latitude and longitude to the dataframe
canada_df['Latitude'], canada_df['Longitude'] = zip(*canada_df['PostalCode'].apply(get_latlng))
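
The `zip(*...)` idiom in the cell above turns a sequence of `[lat, lng]` pairs into one tuple of latitudes and one tuple of longitudes. A small illustration in plain Python, using the first three coordinate pairs shown in this notebook:

```python
# Each inner list is one [latitude, longitude] pair, as returned per postal code
pairs = [
    [43.75188, -79.33036],
    [43.73042, -79.31282],
    [43.65514, -79.36265],
]

# Unpacking with * passes each pair as a separate argument to zip,
# which then groups first elements together and second elements together.
latitudes, longitudes = zip(*pairs)

print(latitudes)
print(longitudes)
```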

In [9]:
canada_df.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.75188,-79.33036
3,M4A,North York,Victoria Village,43.73042,-79.31282
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65514,-79.36265
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72321,-79.45141
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66449,-79.39302


# <font color='blue'> 3. Third Part </font>  

## Explore and cluster the neighborhoods in Toronto

In [10]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge geopy --yes # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed
import folium # map rendering library


### Plot of Toronto's neighborhoods

Let's get the geographical coordinates of Toronto.

In [11]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
toronto_latitude = location.latitude
toronto_longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_latitude, toronto_longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Now let's plot Toronto's neighborhoods.

In [12]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(canada_df['Latitude'], canada_df['Longitude'], canada_df['Borough'], canada_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Central Toronto Exploration

In [13]:
# create a dataset with the neighborhoods of Central Toronto
central_toronto = canada_df[canada_df['Borough'] == "Central Toronto"].reset_index(drop=True)
central_toronto

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72898,-79.39173
1,M5N,Central Toronto,Roselawn,43.71194,-79.41912
2,M4P,Central Toronto,Davisville North,43.71276,-79.38851
3,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.69479,-79.4144
4,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.71452,-79.40696
5,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67484,-79.40185
6,M4S,Central Toronto,Davisville,43.7034,-79.38596
7,M4T,Central Toronto,"Moore Park, Summerhill East",43.69066,-79.38356
8,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.68569,-79.40232


### Plot Central Toronto's neighborhoods

In [14]:
# The geographical coordinates of Central Toronto

address = 'Central Toronto, Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Central Toronto using latitude and longitude values
map_central_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(central_toronto['Latitude'], central_toronto['Longitude'], central_toronto['Borough'], central_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_central_toronto)  
    
map_central_toronto

### The top 100 venues in each Central Toronto neighborhood (within 1 km)

In [27]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

In [16]:
# Function to get all the venues around the Central Toronto

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    LIMIT = 500
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
               
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
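
The list comprehension inside `getNearbyVenues` flattens a nested JSON payload into one tuple per venue, and it indexes `categories[0]` directly, which raises an `IndexError` for a venue with an empty category list. The sketch below shows the same flattening on a hand-made dictionary that only imitates the nesting the function relies on (it is not real API output), with a fallback of `None` for a missing category. `flatten_items` and the second venue are hypothetical.

```python
# Hand-made sample shaped like the nesting the function indexes into
sample = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Lawrence Park Ravine",
                   "location": {"lat": 43.726963, "lng": -79.394382},
                   "categories": [{"name": "Park"}]}},
        {"venue": {"name": "Hypothetical Venue",
                   "location": {"lat": 43.73, "lng": -79.39},
                   "categories": []}},
    ]}]}
}


def flatten_items(name, lat, lng, payload):
    """Return one tuple per venue; venues without a category get None."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


rows = flatten_items("Lawrence Park", 43.72898, -79.39173, sample)
for row in rows:
    print(row)
```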

In [17]:
# DataSet with the venues at Central Toronto
central_toronto_venues = getNearbyVenues(names=central_toronto['Neighborhood'],latitudes=central_toronto['Latitude'],longitudes=central_toronto['Longitude'])
central_toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.72898,-79.39173,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.72898,-79.39173,For The Win Cafe,43.728636,-79.403255,Bubble Tea Shop
2,Lawrence Park,43.72898,-79.39173,T-buds,43.731247,-79.40364,Tea Room
3,Lawrence Park,43.72898,-79.39173,STACK,43.729311,-79.403241,BBQ Joint
4,Lawrence Park,43.72898,-79.39173,Granite Club,43.733043,-79.381986,Gym / Fitness Center


In [18]:
central_toronto_venues.shape

(584, 7)
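
As a sanity check on the 1000 m radius, the distance between a neighborhood's coordinates and a returned venue can be computed by hand. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km; this is an independent check and an assumption about how distance is measured, not a description of how the venue service computes it. The coordinates are the Lawrence Park and Lawrence Park Ravine values from the table above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (assumed value)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Lawrence Park (neighborhood) to Lawrence Park Ravine (venue)
distance = haversine_m(43.72898, -79.39173, 43.726963, -79.394382)
print(round(distance), "m; within radius:", distance <= 1000)
```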

### Count of venues by Neighborhood

In [19]:
central_toronto_venues.groupby('Neighborhood').count().Venue

Neighborhood
Davisville                                                           100
Davisville North                                                      97
Forest Hill North & West, Forest Hill Road Park                       41
Lawrence Park                                                         38
Moore Park, Summerhill East                                           66
North Toronto West,  Lawrence Park                                    47
Roselawn                                                               6
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park     89
The Annex, North Midtown, Yorkville                                  100
Name: Venue, dtype: int64

In [20]:
print('There are {} unique categories.'.format(len(central_toronto_venues['Venue Category'].unique())))

There are 129 unique categories.


### Most Common Venues per Neighborhood

In [21]:
# one hot encoding
central_toronto_onehot = pd.get_dummies(central_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
central_toronto_onehot['Neighborhood'] = central_toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [central_toronto_onehot.columns[-1]] + list(central_toronto_onehot.columns[:-1])
central_toronto_onehot = central_toronto_onehot[fixed_columns]

# group rows by neighborhood and take the mean of each category column (its relative frequency)
central_toronto_grouped = central_toronto_onehot.groupby('Neighborhood').mean().reset_index()

# Function to return the most common venues per neighborhood
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = central_toronto_grouped['Neighborhood']

for ind in np.arange(central_toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(central_toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Davisville,Italian Restaurant,Sushi Restaurant,Indian Restaurant,Coffee Shop,Café,Restaurant,Dessert Shop,Pizza Place,Bakery,Gym
1,Davisville North,Coffee Shop,Pizza Place,Italian Restaurant,Café,Dessert Shop,Sushi Restaurant,Fast Food Restaurant,Restaurant,Yoga Studio,Ramen Restaurant
2,"Forest Hill North & West, Forest Hill Road Park",Italian Restaurant,Café,Park,Coffee Shop,Sushi Restaurant,Gym / Fitness Center,Bank,Pharmacy,Juice Bar,Burger Joint
3,Lawrence Park,Coffee Shop,Bank,Bus Line,Italian Restaurant,Café,Frozen Yogurt Shop,Hobby Shop,Gastropub,Sandwich Place,Restaurant
4,"Moore Park, Summerhill East",Coffee Shop,Grocery Store,Italian Restaurant,Gym,Park,Thai Restaurant,Pharmacy,Sushi Restaurant,Pizza Place,Pub
5,"North Toronto West, Lawrence Park",Coffee Shop,Italian Restaurant,Restaurant,Park,Sporting Goods Shop,Café,Mexican Restaurant,Diner,Skating Rink,Electronics Store
6,Roselawn,Pharmacy,Trail,Café,Skating Rink,Bank,Dog Run,Discount Store,Donut Shop,Dry Cleaner,French Restaurant
7,"Summerhill West, Rathnelly, South Hill, Forest...",Coffee Shop,Park,Sushi Restaurant,Bank,Italian Restaurant,Café,Thai Restaurant,Grocery Store,Gym,Sandwich Place
8,"The Annex, North Midtown, Yorkville",Italian Restaurant,Coffee Shop,French Restaurant,Café,Boutique,Museum,Vegetarian / Vegan Restaurant,Indian Restaurant,Spa,Restaurant
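
The ranking above rests on one idea: after one-hot encoding, the mean of each category column within a neighborhood is that category's share of the neighborhood's venues. A toy example with made-up neighborhoods `A` and `B` (not data from this notebook) makes the numbers easy to verify by hand:

```python
import pandas as pd

# Made-up venues: A has two cafes and a park, B has two parks and a gym
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park", "Park", "Gym"],
})

# One indicator column per category (floats so the mean is a plain frequency)
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of each indicator column = share of that category in the neighborhood
grouped = onehot.groupby("Neighborhood").mean()

# Two most frequent categories per neighborhood
top2 = {
    name: row.sort_values(ascending=False).index[:2].tolist()
    for name, row in grouped.iterrows()
}

print(grouped)
print(top2)
```

Because each row is a set of shares that sum to 1, neighborhoods with very different venue counts can still be compared on the same scale.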


## Cluster Neighborhoods (into 3 clusters)

In [22]:
# set number of clusters
kclusters = 3

central_toronto_grouped_clustering = central_toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(central_toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 2, 2, 2, 0, 1, 2, 0])
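
To see what k-means does with category-frequency rows, here is a minimal sketch on four made-up rows with three category shares each. The data and the choice of two clusters are illustrative only; `n_init` is set explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency rows: the first two are dominated by the first category,
# the last two by the third category.
X = np.array([
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.10, 0.20, 0.70],
    [0.05, 0.25, 0.70],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Label numbers themselves are arbitrary; only the grouping is meaningful
print(labels)
```

The same caveat applies to the labels above: cluster `0`, `1`, and `2` are identifiers, not ranks.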

In [23]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

central_toronto_merged = central_toronto

# join the cluster labels and top venues onto central_toronto, which holds the latitude/longitude for each neighborhood
central_toronto_merged = central_toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [24]:
central_toronto_merged

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4N,Central Toronto,Lawrence Park,43.72898,-79.39173,2,Coffee Shop,Bank,Bus Line,Italian Restaurant,Café,Frozen Yogurt Shop,Hobby Shop,Gastropub,Sandwich Place,Restaurant
1,M5N,Central Toronto,Roselawn,43.71194,-79.41912,1,Pharmacy,Trail,Café,Skating Rink,Bank,Dog Run,Discount Store,Donut Shop,Dry Cleaner,French Restaurant
2,M4P,Central Toronto,Davisville North,43.71276,-79.38851,0,Coffee Shop,Pizza Place,Italian Restaurant,Café,Dessert Shop,Sushi Restaurant,Fast Food Restaurant,Restaurant,Yoga Studio,Ramen Restaurant
3,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.69479,-79.4144,2,Italian Restaurant,Café,Park,Coffee Shop,Sushi Restaurant,Gym / Fitness Center,Bank,Pharmacy,Juice Bar,Burger Joint
4,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.71452,-79.40696,0,Coffee Shop,Italian Restaurant,Restaurant,Park,Sporting Goods Shop,Café,Mexican Restaurant,Diner,Skating Rink,Electronics Store
5,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67484,-79.40185,0,Italian Restaurant,Coffee Shop,French Restaurant,Café,Boutique,Museum,Vegetarian / Vegan Restaurant,Indian Restaurant,Spa,Restaurant
6,M4S,Central Toronto,Davisville,43.7034,-79.38596,0,Italian Restaurant,Sushi Restaurant,Indian Restaurant,Coffee Shop,Café,Restaurant,Dessert Shop,Pizza Place,Bakery,Gym
7,M4T,Central Toronto,"Moore Park, Summerhill East",43.69066,-79.38356,2,Coffee Shop,Grocery Store,Italian Restaurant,Gym,Park,Thai Restaurant,Pharmacy,Sushi Restaurant,Pizza Place,Pub
8,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.68569,-79.40232,2,Coffee Shop,Park,Sushi Restaurant,Bank,Italian Restaurant,Café,Thai Restaurant,Grocery Store,Gym,Sandwich Place


In [25]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(central_toronto_merged['Latitude'], central_toronto_merged['Longitude'], central_toronto_merged['Neighborhood'], central_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1, so they index the list directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
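
Since k-means labels run from 0 to k-1, a label can index a list of k colors directly. The sketch below builds such a list with the standard library's `colorsys` module instead of Matplotlib's rainbow colormap (a deliberate swap so it runs without plotting libraries); `cluster_colors` is a hypothetical helper name.

```python
import colorsys


def cluster_colors(k):
    """Return k hex colors with evenly spaced hues, one per cluster label."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_colors(3)

# Labels taken from the k-means output above
labels = [0, 0, 2, 2, 2, 0, 1, 2, 0]
marker_colors = [palette[label] for label in labels]
print(palette)
print(marker_colors)
```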