# Segmenting and Clustering Neighborhoods in Toronto

In this notebook, we segment and cluster the neighborhoods of Toronto, Canada. Since the neighborhood data is not readily available in a structured format, we first have to scrape it from the internet. Luckily, a Wikipedia page lists all the information we need, so we can scrape, wrangle and clean the data and read it into a pandas DataFrame: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.



## 1. Creating the DataFrame

We will begin by importing all the necessary libraries.

In [31]:
from bs4 import BeautifulSoup # library for pulling data out of HTML files
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle vectorized data

print('Libraries imported.')

Libraries imported.


### Scraping the list of Toronto Postal Codes

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser')
table = soup.find('table')

Now we proceed to build the DataFrame.

In [3]:
col_names = ['PostalCode', 'Borough', 'Neighborhood']
toronto_df = pd.DataFrame(columns=col_names)
toronto_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|


We shall fill our DataFrame with the data collected from the webpage.

In [4]:
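for tr in table.find_all('tr'):
    # collect the stripped text of every data cell in this table row
    row = [td.text.strip() for td in tr.find_all('td')]
    # keep only rows with exactly three data cells; other rows, such as a header row, are skipped
    if len(row) == 3:
        toronto_df.loc[len(toronto_df)] = row
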
toronto_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
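
The row-by-row extraction can be tried without a network connection. The following is a minimal sketch, assuming a small hand-written HTML fragment in place of the Wikipedia page; the `html` string and the helper name `table_to_rows` are hypothetical and not part of the notebook. It collects the rows in a plain list and builds the DataFrame once at the end, which is one possible alternative to appending with `.loc` inside the loop.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML fragment standing in for the Wikipedia table.
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def table_to_rows(table_html):
    """Return one list of stripped cell texts per <tr> with exactly three <td> cells."""
    table = BeautifulSoup(table_html, 'html.parser').find('table')
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.text.strip() for td in tr.find_all('td')]
        if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
            rows.append(cells)
    return rows


rows = table_to_rows(html)
# Build the DataFrame in one step from the collected rows.
sample_df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(sample_df)
```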


### Data Cleaning and Wrangling

Our first step here will be to remove all rows where the Borough column has a 'Not assigned' value.

In [5]:
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned']
toronto_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Now, we must check whether there are any 'Not assigned' cells in the Neighborhood column. To do this, we create a boolean Series that is True wherever the Neighborhood is 'Not assigned' and apply the sum() method to it, which counts the number of True values. If the result is 0, no 'Not assigned' values remain in the Neighborhood column of our toronto_df DataFrame.

In [6]:
clean_df = toronto_df['Neighborhood'] == 'Not assigned'
clean_df.sum()

0
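
To see why the sum of a boolean Series counts the True values, here is a small sketch on made-up data (the `sample` frame and the borough name 'Hypothetical Borough' are invented for illustration). It also shows one possible way to handle a non-zero count, namely reusing the borough name as the neighborhood; that step is an assumption and was not needed in this notebook, where the count was 0.

```python
import pandas as pd

# Made-up sample with one unassigned neighborhood.
sample = pd.DataFrame({
    'Borough': ['North York', 'Hypothetical Borough', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Not assigned', 'Regent Park, Harbourfront'],
})

# True is treated as 1 and False as 0, so sum() counts the True entries.
is_unassigned = sample['Neighborhood'] == 'Not assigned'
n_unassigned = int(is_unassigned.sum())
print(n_unassigned)  # 1

# Assumed fix: where the neighborhood is unassigned, fall back to the borough name.
sample['Neighborhood'] = sample['Neighborhood'].mask(is_unassigned, sample['Borough'])
remaining = int((sample['Neighborhood'] == 'Not assigned').sum())
print(remaining)  # 0
```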

Rows such as M5A, M6A and M7A already list several neighborhoods under a single postal code, so we will assume that neighborhoods sharing a postal code have already been combined into one row for us. To cross-check, we can count the number of unique values in the PostalCode column of toronto_df and compare it with the total number of rows.

In [7]:
print(len(toronto_df['PostalCode'].unique()))
print(toronto_df.shape[0])

103
103
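
Had the two numbers differed, the rows sharing a postal code would still need to be combined. A minimal sketch of how that could be done with a groupby, assuming a hypothetical `raw` table in which each neighborhood sits on its own row (this layout is invented for illustration; the scraped table here was already combined):

```python
import pandas as pd

# Hypothetical uncombined layout: one row per neighborhood.
raw = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A', 'M6A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York', 'North York', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Lawrence Manor', 'Lawrence Heights', 'Parkwoods'],
})

# Join the neighborhoods of each postal code into one comma-separated string.
combined = (
    raw.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)

# The same cross-check as above: one row per unique postal code.
print(combined['PostalCode'].nunique() == combined.shape[0])
```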


Finally, we print the shape of our DataFrame.

In [8]:
toronto_df.shape

(103, 3)

## 2. Getting Neighborhood Coordinates

Here, we will download a CSV file containing the latitude and longitude of every Toronto postal code and read it into a pandas DataFrame. Then, we will merge it with our toronto_df, joining on the postal code column.

In [9]:
!wget -q -O 'toronto_coord_data.csv' http://cocl.us/Geospatial_data

In [10]:
tor_coord = pd.read_csv('toronto_coord_data.csv')
tor_coord.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


We will rename the first column so that it matches the 'PostalCode' column of our toronto_df.

In [11]:
tor_coord.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
tor_coord.head()

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now we are all set to merge both DataFrames.

In [12]:
tor_df = pd.merge(toronto_df, tor_coord, on='PostalCode')
tor_df.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
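
By default, pd.merge performs an inner join, so a postal code missing from the coordinate file would be dropped without any warning. The sketch below, on made-up frames (`left`, `right` and the postal code 'M9Z' are invented for illustration), shows one way to check for such rows using the `how`, `indicator` and `validate` parameters of pd.merge.

```python
import pandas as pd

# Made-up frames; 'M9Z' has no matching coordinates.
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Hypothetical Borough'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every row of `left`; the `_merge` column records where each row came from.
# validate='one_to_one' raises an error if a postal code is duplicated on either side.
checked = pd.merge(left, right, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')

unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)  # ['M9Z']
```

An empty `unmatched` list would confirm that the inner join used in the notebook loses no rows.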


## 3. Exploring and Clustering Neighborhoods

To visualize all neighborhoods in Toronto, we create a map centered on the city, which gives us a first look at how they might group into clusters. As before, we begin by importing the libraries needed for this task.

In [13]:
!conda install -c conda-forge folium=0.5.0 --yes


In [14]:
# Convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means clustering algorithm
from sklearn.cluster import KMeans

# Library for map rendering
import folium 

print('Libraries imported.')

Libraries imported.


Now, we proceed to get the latitude and longitude coordinates for the city of Toronto. We define our user_agent for the geocoder instance as *tor_explorer*.

In [15]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent='tor_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
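
In geopy, geocode returns None when no match is found, in which case reading location.latitude would fail. The following is a minimal sketch of a fallback, assuming it is acceptable to center the map on the mean of the known coordinates instead. The helper `resolve_center` and the stand-in object `found` are hypothetical; the sample coordinates are the first five rows of the merged table.

```python
from types import SimpleNamespace

import pandas as pd


def resolve_center(location, coords):
    """Return [lat, lng] from a geocoder result, or the mean coordinate if the result is None."""
    if location is not None:
        return [location.latitude, location.longitude]
    return [coords['Latitude'].mean(), coords['Longitude'].mean()]


# First five rows of the merged table, used here as sample data.
coords = pd.DataFrame({
    'Latitude': [43.753259, 43.725882, 43.65426, 43.718518, 43.662301],
    'Longitude': [-79.329656, -79.315572, -79.360636, -79.464763, -79.389494],
})

# Stand-in for a successful geocoder result (a plain object, not a geopy class).
found = SimpleNamespace(latitude=43.6534817, longitude=-79.3839347)

center_found = resolve_center(found, coords)    # geocoder result is used as-is
center_fallback = resolve_center(None, coords)  # mean of the sample coordinates
print(center_found)
print(center_fallback)
```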


Here we display the Toronto map with its neighborhoods superimposed on top.

In [16]:
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(tor_df['Latitude'], tor_df['Longitude'], tor_df['Borough'], tor_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_tor)
    
map_tor

For illustration purposes, however, we will segment and cluster only the boroughs whose name contains 'Toronto'. So, we slice our DataFrame and recreate the above map.

In [17]:
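# take an explicit copy so that a Cluster column can be added later without a SettingWithCopyWarning
tor_df_sliced = tor_df[tor_df['Borough'].str.contains('Toronto', regex=False)].copy()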
tor_df_sliced.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |


In [18]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(tor_df_sliced['Latitude'], tor_df_sliced['Longitude'], tor_df_sliced['Borough'], tor_df_sliced['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

Now, we will use k-means to cluster the above neighborhoods. Only the Latitude and Longitude columns are kept as features, so the clusters are purely geographic.

In [22]:
tor_clust = tor_df_sliced.drop(['PostalCode', 'Borough', 'Neighborhood'], axis=1)

In [23]:
k = 5
k_means = KMeans(n_clusters=k, random_state=1)
k_means.fit(tor_clust)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=1, tol=0.0001, verbose=0)
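
To illustrate what the fit step does, here is a minimal NumPy sketch of the basic k-means iteration (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points, and repeat until the assignments stop changing. This is not scikit-learn's implementation; it assumes a single random initialization, whereas the KMeans output above shows init='k-means++' and n_init=10. The function name `lloyd_kmeans` is hypothetical, and the sample coordinates are the five rows shown for tor_df_sliced.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=100, seed=1):
    """Basic k-means: returns (labels, centers) for an (n, d) float array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center, shape (n, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so the centers are stable too
        labels = new_labels
        # Move each center to the mean of its assigned points (empty clusters keep their center).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers


# Latitude/longitude of the five sample rows of tor_df_sliced.
coords = np.array([
    [43.65426, -79.360636],
    [43.662301, -79.389494],
    [43.657162, -79.378937],
    [43.651494, -79.375418],
    [43.676357, -79.293031],
])

labels, centers = lloyd_kmeans(coords, k=2)
print(labels)
print(centers)
```

Note that both this sketch and the notebook treat raw degrees as plain Euclidean coordinates, which is only an approximation of distance on the ground.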

We insert a new first column into our tor_df_sliced DataFrame that holds the cluster label assigned to each neighborhood.

In [29]:
tor_df_sliced.insert(loc=0, column='Cluster', value=k_means.labels_)
tor_df_sliced.head()

| | Cluster | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 2 | 3 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | 3 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | 3 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | 3 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | 2 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
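
Once the labels are attached, a per-cluster summary is a quick sanity check before plotting. The sketch below uses only the five rows shown above (so the counts cover this sample, not the full table) and assumes a hypothetical frame name `labelled`; it reports the number of neighborhoods and the mean coordinate of each cluster.

```python
import pandas as pd

# The five rows shown above, with their cluster labels.
labelled = pd.DataFrame({
    'Cluster': [3, 3, 3, 3, 2],
    'Neighborhood': [
        'Regent Park, Harbourfront',
        "Queen's Park, Ontario Provincial Government",
        'Garden District, Ryerson',
        'St. James Town',
        'The Beaches',
    ],
    'Latitude': [43.65426, 43.662301, 43.657162, 43.651494, 43.676357],
    'Longitude': [-79.360636, -79.389494, -79.378937, -79.375418, -79.293031],
})

# Named aggregation: one output column per (input column, function) pair.
summary = labelled.groupby('Cluster').agg(
    n_neighborhoods=('Neighborhood', 'count'),
    mean_lat=('Latitude', 'mean'),
    mean_lng=('Longitude', 'mean'),
)
print(summary)
```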


Finally, we display the Toronto map with all the clusters to see how the algorithm grouped the neighborhoods.

In [37]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label (labels run from 0 to k-1)
for lat, lng, neighb, cluster in zip(tor_df_sliced['Latitude'], tor_df_sliced['Longitude'], tor_df_sliced['Neighborhood'], tor_df_sliced['Cluster']):
    label = folium.Popup(str(neighb) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters