# Clustering Neighborhoods in Toronto

#### Description: The goal of this assignment is to explore the neighborhoods in Toronto and use ML Clustering to group them logically

Step 1: Collect the data for the neighborhoods in Toronto. For this, we will scrape a table from a Wikipedia page.

In [1]:
# Let's start by importing the necessary modules to access a web URL

import requests

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
website_url = requests.get(url).text


In [2]:
# We now need to parse the web content into a structure that we can search and analyze later.
# We will use the Beautiful Soup package:

from bs4 import BeautifulSoup

soup = BeautifulSoup(website_url, 'html5lib')

In [3]:
# The web content is now parsed into a tree of HTML tags.
# Next we need to find the tag that identifies our table and access it.

my_table = soup.find('table', {'class':'wikitable sortable'})
my_table


<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td>

In [4]:
# Looking at 'my_table', we notice it is neither a plain string nor a list. Let's find out what type of variable it is:

type(my_table)

bs4.element.Tag

In [5]:
# It is a Beautiful Soup Tag object. To access the elements inside my_table we will need to make use
# of the functions in the Beautiful Soup library.

# We now have the table, but it contains several nested elements and we only need some of them.
# The 'tr' elements act as the rows of the table, so we will extract all of them.

tr_elements = my_table.find_all('tr')
tr_elements[0:3]

[<tr>
 <th>Postal Code
 </th>
 <th>Borough
 </th>
 <th>Neighborhood
 </th></tr>,
 <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>,
 <tr>
 <td>M2A
 </td>
 <td>Not assigned
 </td>
 <td>Not assigned
 </td></tr>]

In [6]:
# The result behaves like a list whose items are tags, which is easy to work with.
# Each 'tr' delimits a row; inside it, nested 'td' tags hold the cell values, while 'th' tags hold the column titles.
# To access the 'td' tags we will use the 'find_all' function.

# Let's now store the elements into lists that will act as the columns of our future dataframe:

Postal_Code = []
Borough = []
Neighborhood = []

# Now we iterate over every row ('tr'), skipping the header row, and append each cell to its list.
# get_text(strip=True) returns the cell text without the trailing newline.

for row in tr_elements[1:]:
    cells = row.find_all('td')
    if len(cells) == 3:
        Postal_Code.append(cells[0].get_text(strip=True))
        Borough.append(cells[1].get_text(strip=True))
        Neighborhood.append(cells[2].get_text(strip=True))

print(Postal_Code[0:9])
print(Borough[0:9])
print(Neighborhood[0:9])

['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M6A', 'M7A', 'M8A', 'M9A']
['Not assigned', 'Not assigned', 'North York', 'North York', 'Downtown Toronto', 'North York', 'Downtown Toronto', 'Not assigned', 'Etobicoke']
['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront', 'Lawrence Manor, Lawrence Heights', "Queen's Park, Ontario Provincial Government", 'Not assigned', 'Islington Avenue, Humber Valley Village']
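
The row-extraction loop can be tried offline on a small inline table. The sketch below is only an illustration: `sample_html` is a hand-written stand-in for the Wikipedia table, and it uses the built-in `'html.parser'` instead of `'html5lib'` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (illustrative data, not the live page).
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'html5lib'.
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})

sample_rows = []
for tr in sample_table.find_all('tr')[1:]:  # skip the header row
    cells = tr.find_all('td')
    if len(cells) == 3:
        # get_text(strip=True) drops the newline that follows each cell value
        sample_rows.append(tuple(cell.get_text(strip=True) for cell in cells))

print(sample_rows)
```

Collecting each row as a tuple keeps the three values of a record together, which avoids the three lists drifting out of step if one cell is ever skipped.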


In [7]:
# It looks like our lists captured the records from the Wikipedia table, so the scraping is complete.
# We still need to clean the data. Let's start by converting these lists into a DataFrame.

import pandas as pd

df = pd.DataFrame(list(zip(Postal_Code, Borough, Neighborhood)), columns = ['Postal Code', 'Borough', 'Neighborhood'])
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [8]:
# Let's start the cleaning by dropping the rows where the value of Borough is 'Not assigned'

df.drop(df[df['Borough'] == 'Not assigned'].index, inplace = True)
df

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
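
An equivalent way to express this filter is a boolean mask, which returns a new DataFrame instead of modifying `df` in place. The following is a minimal sketch on made-up sample rows; the names `sample_df` and `assigned_df` are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative rows in the same shape as the scraped table.
sample_df = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned_df = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)

print(assigned_df)
```

Because the original frame is left untouched, the cell can be re-run safely without dropping rows twice.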


In [9]:
# We now check for rows whose Neighborhood is 'Not assigned', so that we can set them to their Borough where applicable

df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighborhood
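
If such rows did exist, one possible way to handle them is to copy the Borough name into the Neighborhood column. The sketch below assumes that rule and uses made-up sample rows (`sample_df` is illustrative only); the scraped table above did not need it.

```python
import pandas as pd

# Made-up rows: the second one has a Borough but no Neighborhood.
sample_df = pd.DataFrame({
    'Postal Code': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Regent Park, Harbourfront', 'Not assigned'],
})

# Boolean mask selecting the rows that need a fallback value.
unassigned = sample_df['Neighborhood'] == 'Not assigned'

# For those rows only, reuse the Borough as the Neighborhood.
sample_df.loc[unassigned, 'Neighborhood'] = sample_df.loc[unassigned, 'Borough']

print(sample_df)
```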


In [10]:
# Good. There are no such cases, so there is nothing to fix.

# With this our data cleaning tasks end, we have our dataframe as required. 

# Let's print the shape of our df:

df.shape

(103, 3)

This concludes **Exercise 1**

####  ----------------------------------------

#### Getting the Latitude and Longitude Coordinates of each Neighborhood

#### -----------------------------------------

In [11]:
# We import the location data from the csv file provided:

location_csv = 'http://cocl.us/Geospatial_data'

df_geo = pd.read_csv(location_csv)
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [12]:
# Let's check the shape of this df:

df_geo.shape

(103, 3)

In [13]:
# Good, the number of rows matches!

# The merge below matches rows on the 'Postal Code' column rather than on the index, but dropping rows
# left gaps in the index of our working df, so we reset it to keep the DataFrame tidy.

df.reset_index(drop = True, inplace = True)
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [14]:
# We can now proceed to merge both tables

df3 = pd.merge(df, df_geo, on = 'Postal Code')
df3

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
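
Matching row counts do not by themselves guarantee that every postal code found a partner. As a hedged sketch, `pd.merge` can check this with its `validate` and `indicator` parameters; the sample frames below are illustrative, and 'M5A' is deliberately left out of the coordinates table to show what a miss looks like.

```python
import pandas as pd

# Illustrative frames; 'M5A' is intentionally absent from the coordinates.
boroughs = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
coords = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# how='left' keeps every borough row; validate raises if a postal code is duplicated
# on either side; indicator adds a '_merge' column saying where each row came from.
checked = pd.merge(boroughs, coords, on='Postal Code', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes that found no coordinates.
missing_codes = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing_codes)
```

An empty `missing_codes` list on the real data would confirm that the merge above lost no rows.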


### Cool! This concludes exercise 2!

#### ------------------------------------------

#### Now it is time to Cluster and Visualize the Neighborhoods in Toronto - *Exercise 3*

#### -------------------------------------------

In [15]:
# First we filter our DataFrame to keep only the boroughs whose name contains the word 'Toronto'

toronto_data = df3[df3['Borough'].str.contains("Toronto")].reset_index(drop = True)
toronto_data.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


In [16]:
# The next step is to plot the Toronto data on a map and add the appropriate labels.
# We first need to install the appropriate packages:

!conda install -c conda-forge geopy --yes

from geopy.geocoders import Nominatim
import folium

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [17]:
# In order to plot the map we need to first get the coordinates for Toronto

address = 'Toronto, TO'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
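
Geocoders in geopy return `None` when an address cannot be resolved, in which case `location.latitude` would raise an `AttributeError`. The sketch below shows one way to guard against that. The helper `geocode_or_default` and the `StubGeolocator` class are hypothetical names introduced here; the stub only mimics the `.geocode(address)` call so that the example runs without network access.

```python
from collections import namedtuple

# Minimal stand-in for the object a geocoder returns (only the two fields we use).
Location = namedtuple('Location', ['latitude', 'longitude'])


def geocode_or_default(geolocator, address, default):
    """Return (latitude, longitude) for address, or default if the lookup finds nothing."""
    location = geolocator.geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)


class StubGeolocator:
    """Hypothetical offline geocoder exposing the same .geocode(address) call."""

    def __init__(self, known):
        self.known = known

    def geocode(self, address):
        # dict.get returns None for unknown addresses, like a failed lookup
        return self.known.get(address)


stub = StubGeolocator({'Toronto, ON': Location(43.6534817, -79.3839347)})

found = geocode_or_default(stub, 'Toronto, ON', (0.0, 0.0))
fallback = geocode_or_default(stub, 'Unknown place', (0.0, 0.0))
print(found, fallback)
```

In the notebook, the `geolocator` created above could be passed in place of the stub.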


In [18]:
# Now we proceed to create a map of Toronto with these values

toronto_map = folium.Map(location = [latitude,longitude], zoom_start = 12)

# add markers to the map

for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(toronto_map)
    
toronto_map