# Segmenting and Clustering Neighborhoods in Toronto
## Juan Prieto-Pena

This notebook is the week 3 project of the capstone for the IBM Data Science course on Coursera. It contains all three parts of the project in one place for ease of use.

### First part: Data scraping

We start by importing all the necessary libraries:

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np

import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

Next, we request the page that contains the table.

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
website_url=requests.get(url).text

The page is parsed with BeautifulSoup so that the table can be extracted. Printing the parsed HTML is a quick way to check that the page was scraped correctly.

In [3]:
html_soup = BeautifulSoup(website_url, 'lxml')
# Uncomment to inspect the parsed HTML; omitted in the GitHub notebook to avoid unnecessary output.
# print(html_soup.prettify())

We extract the first table from the HTML code.

In [4]:
postal_code_table= str(html_soup.table)

We then use pandas to transform it into a data frame, since pandas can parse HTML tables directly. read_html returns a list of data frames, which here contains a single element, so dfs[0] is the data frame we want.

In [5]:
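from io import StringIO

# Wrap the HTML string in StringIO: recent pandas versions deprecate passing literal HTML.
dfs = pd.read_html(StringIO(postal_code_table))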
df=dfs[0]
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


The assignment tells us that the name of the first column should be Postalcode instead of Postal code. We will change that.

In [6]:
df.rename(columns={'Postal code':'Postalcode'}, inplace=True)
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


The scraping is complete, but the data frame still needs to be cleaned and preprocessed according to the rules stated in the assignment before it is presented for review.

#### Cleaning and preprocessing

In [7]:
# First, drop every row whose Borough value is 'Not assigned'.
df.drop(df[df.Borough == 'Not assigned'].index, axis=0, inplace=True)
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


In [8]:
# Group the neighbourhoods that share a postcode and join their names with commas
df=df.groupby(['Postalcode','Borough'],sort=False).agg(', '.join)
# Replace the ' / ' separators with commas as well (plain text match, not a regex)
df['Neighborhood'] = df['Neighborhood'].str.replace(' / ',', ',regex=False)
df.reset_index(inplace=True)
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
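
As a minimal sketch of what the grouping step does, the toy frame below (made-up rows, not the scraped table) lists one postal code twice. Grouping on `Postalcode` and `Borough` and joining the names collapses it into a single row. The variable names `toy` and `grouped` are only illustrative.

```python
import pandas as pd

# Hypothetical toy rows: postal code M5A appears twice with different neighbourhoods.
toy = pd.DataFrame({
    'Postalcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per (Postalcode, Borough); neighbourhood names are joined with commas.
# sort=False keeps the groups in order of first appearance.
grouped = (
    toy.groupby(['Postalcode', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(grouped)
```

Selecting the `Neighborhood` column explicitly before aggregating makes it clear which column is being joined, which can help if more columns are added later.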


In [9]:
# Finally, replace neighbourhood names that are 'Not assigned' with the name of the borough
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned',df['Borough'], df['Neighborhood'])
df

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
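
The fallback rule can be checked on a small example. The sketch below uses two made-up rows (the second one is hypothetical and only serves to trigger the rule) and counts how many 'Not assigned' names remain afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second one has a borough but no neighbourhood name.
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Where the neighbourhood is 'Not assigned', fall back to the borough name;
# otherwise keep the existing neighbourhood name.
toy['Neighborhood'] = np.where(
    toy['Neighborhood'] == 'Not assigned', toy['Borough'], toy['Neighborhood']
)

# Sanity check: no 'Not assigned' names should be left.
remaining = int((toy['Neighborhood'] == 'Not assigned').sum())
print(toy)
print(remaining)
```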


The assignment asks us to show the shape of the data frame in a cell:

In [10]:
df.shape

(103, 3)

### Associating a latitude and longitude with each postal code in Toronto

We load the latitude and longitude of each postal code from the available CSV file.

In [11]:
geodata=pd.read_csv('https://cocl.us/Geospatial_data')
geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [12]:
geodata.rename(columns={'Postal Code':'Postalcode'},inplace=True)
df_geo=pd.merge(df,geodata,on='Postalcode')
df_geo.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
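
`pd.merge` performs an inner join by default, so a postal code without coordinates would be dropped silently. The sketch below shows one way to check for that on made-up tables: a left join with `indicator=True` marks the rows that found no match. The code 'M9Z' and the names `codes`, `coords` and `checked` are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables; 'M9Z' deliberately has no coordinates.
codes = pd.DataFrame({'Postalcode': ['M3A', 'M4A', 'M9Z']})
coords = pd.DataFrame({
    'Postalcode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every postal code; indicator=True adds a '_merge'
# column telling whether each row found a match in coords.
checked = pd.merge(codes, coords, on='Postalcode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postalcode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner join above did not lose any postal codes.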


### Borough visualization and clustering on a map

We will only use the boroughs whose name contains the word Toronto.

In [13]:
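# .copy() makes df_to an independent frame, so adding a column to it later does not raise SettingWithCopyWarning
df_to = df_geo[df_geo['Borough'].str.contains('Toronto',regex=False)].copy()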
df_to.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


## To see the maps, please refer to the pdf file.

In [14]:
To_map = folium.Map(location=[43.651070,-79.347015],zoom_start=11)

for lat,lng,borough,neighborhood in zip(df_to['Latitude'],df_to['Longitude'],df_to['Borough'],df_to['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    # parse_html belongs to folium.Popup, so it is not repeated on the marker
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='#003153',
        fill=True,
        fill_color='#003153',
        fill_opacity=0.7).add_to(To_map)

To_map

Next, we cluster the neighbourhoods geographically. The postal codes are clustered by latitude and longitude only, since we have no other data to use as features, and the names of the boroughs and neighbourhoods and the postcodes themselves are not useful here.

In [15]:
k=5
# Cluster on the coordinates only
to_cluster = df_to[['Latitude','Longitude']]
kmeans = KMeans(n_clusters = k,random_state=0).fit(to_cluster)
# Store the cluster label of each postal code as the first column
df_to.insert(0, 'Cluster Labels', kmeans.labels_)
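
The value k=5 is set by hand here. One common way to compare candidate values is to look at how the k-means inertia (within-cluster sum of squared distances) falls as k grows. The sketch below does this on synthetic points, not on the Toronto data; the point groups, `k_try` and `inertias` are assumptions made for illustration. Note that k-means uses Euclidean distance, so latitude and longitude are treated as plain planar coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates: two tight groups of points, not the Toronto data.
rng = np.random.default_rng(0)
west = rng.normal(loc=[43.65, -79.45], scale=0.005, size=(20, 2))
east = rng.normal(loc=[43.68, -79.30], scale=0.005, size=(20, 2))
points = np.vstack([west, east])

# Fit k-means for several values of k and record the inertia of each fit.
inertias = {}
for k_try in range(1, 6):
    model = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(points)
    inertias[k_try] = model.inertia_

for k_try, value in inertias.items():
    print(k_try, value)
```

On data like this the inertia drops sharply from k=1 to k=2 and changes little afterwards, which is the "elbow" usually taken as a reasonable choice of k. Running the same loop on `to_cluster` would show whether 5 is a sensible value for the Toronto postal codes.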

The map is created:

In [16]:
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=11)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map (cluster-1 shifts the colour index, so label 0 wraps around to the last colour)
for lat, lon, neighborhood, cluster in zip(df_to['Latitude'], df_to['Longitude'], df_to['Neighborhood'], df_to['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
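
In the colour scheme above only the length of `ys` is used, so the palette depends on k alone. The sketch below is a simplified variant under that assumption: it samples k colours from the rainbow colormap directly and indexes them by cluster label. Unlike the cell above it does not shift the index by one, so the colours are assigned to the clusters in a different order. `palette` and `cluster_color` are hypothetical names.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters, as in the notebook

# Sample k evenly spaced colours from the rainbow colormap and convert them to hex strings.
palette = [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

def cluster_color(label):
    """Return the hex colour for a cluster label in the range 0 .. k-1."""
    return palette[label]

print(palette)
```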

The map shows how the model grouped the postal codes: one cluster in the center of the city, two to the west, one to the east and one to the north. The two western clusters may differ mainly in their distance from the city center: the second group of points (blue on the map) lies farther from the center than the green one.
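
The distance argument can be checked numerically. The sketch below computes great-circle distances with the haversine formula, using the point on which the folium maps are centered as a stand-in for the city center (an assumption) and two postal-code coordinates from the tables above. The function name `haversine_km` is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0  # mean Earth radius, an approximation
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))

# Point used to center the folium maps (assumed here to represent the city center).
center = (43.651070, -79.347015)

# Two postal-code coordinates taken from the tables above.
regent_park = (43.65426, -79.360636)   # M5A
the_beaches = (43.676357, -79.293031)  # M4E

d_regent = haversine_km(*center, *regent_park)
d_beaches = haversine_km(*center, *the_beaches)
print(d_regent, d_beaches)
```

Applying the same function to the rows of `kmeans.cluster_centers_` would give one distance per cluster, which is a simple way to compare how far the two western clusters are from the center.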