# Segmenting and Clustering Neighborhoods in Toronto

## Week 3 - Assignment

#### In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to be agile and learn new libraries and tools quickly depending on the project.

#### For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

#### Once the data is in a structured format, you can replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

#### Your submission will be a link to your Jupyter Notebook in your GitHub repository.

### Note: The Folium maps may not render in this view. The notebook with the maps can also be seen at the following link:
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/5f577cf6-ea52-4cec-aa5d-62c3a875ac40/view?access_token=d9361f6f7a8029b1fe68c114f3d604db3488af61af5420e39ce

##### Author: Jhimy Cussi

In [86]:
#!conda install -c conda-forge folium=0.5.0 --yes

In [167]:
import pandas as pd
import numpy as np
from IPython.display import display_html
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us to download a web page

print('Libraries imported.')

Libraries imported.


## 1) Explore

In [164]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
data  = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")
tables = soup.find_all('table')
len(tables)

3

In [165]:
# Verify that this is the correct table.
tab = str(tables[0])
display_html(tab, raw=True)

0,1,2,3,4,5,6,7,8
M1A Not assigned,M2A Not assigned,M3A North York (Parkwoods),M4A North York (Victoria Village),M5A Downtown Toronto (Regent Park / Harbourfront),M6A North York (Lawrence Manor / Lawrence Heights),M7A Queen's Park (Ontario Provincial Government),M8A Not assigned,M9A Etobicoke (Islington Avenue)
M1B Scarborough (Malvern / Rouge),M2B Not assigned,M3B North York (Don Mills) North,M4B East York (Parkview Hill / Woodbine Gardens),"M5B Downtown Toronto (Garden District, Ryerson)",M6B North York (Glencairn),M7B Not assigned,M8B Not assigned,M9B Etobicoke (West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale)
M1C Scarborough (Rouge Hill / Port Union / Highland Creek),M2C Not assigned,M3C North York (Don Mills) South (Flemingdon Park),M4C East York (Woodbine Heights),M5C Downtown Toronto (St. James Town),M6C York (Humewood-Cedarvale),M7C Not assigned,M8C Not assigned,M9C Etobicoke (Eringate / Bloordale Gardens / Old Burnhamthorpe / Markland Wood)
M1E Scarborough (Guildwood / Morningside / West Hill),M2E Not assigned,M3E Not assigned,M4E East Toronto (The Beaches),M5E Downtown Toronto (Berczy Park),M6E York (Caledonia-Fairbanks),M7E Not assigned,M8E Not assigned,M9E Not assigned
M1G Scarborough (Woburn),M2G Not assigned,M3G Not assigned,M4G East York (Leaside),M5G Downtown Toronto (Central Bay Street),M6G Downtown Toronto (Christie),M7G Not assigned,M8G Not assigned,M9G Not assigned
M1H Scarborough (Cedarbrae),M2H North York (Hillcrest Village),M3H North York (Bathurst Manor / Wilson Heights / Downsview North),M4H East York (Thorncliffe Park),M5H Downtown Toronto (Richmond / Adelaide / King),M6H West Toronto (Dufferin / Dovercourt Village),M7H Not assigned,M8H Not assigned,M9H Not assigned
M1J Scarborough (Scarborough Village),M2J North York (Fairview / Henry Farm / Oriole),M3J North York (Northwood Park / York University),M4J East York East Toronto (The Danforth East),M5J Downtown Toronto (Harbourfront East / Union Station / Toronto Islands),M6J West Toronto (Little Portugal / Trinity),M7J Not assigned,M8J Not assigned,M9J Not assigned
M1K Scarborough (Kennedy Park / Ionview / East Birchmount Park),M2K North York (Bayview Village),M3K North York (Downsview) East (CFB Toronto),M4K East Toronto (The Danforth West / Riverdale),M5K Downtown Toronto (Toronto Dominion Centre / Design Exchange),M6K West Toronto (Brockton / Parkdale Village / Exhibition Place),M7K Not assigned,M8K Not assigned,M9K Not assigned
M1L Scarborough (Golden Mile / Clairlea / Oakridge),M2L North York (York Mills / Silver Hills),M3L North York (Downsview) West,M4L East Toronto (India Bazaar / The Beaches West),M5L Downtown Toronto (Commerce Court / Victoria Hotel),M6L North York (North Park / Maple Leaf Park / Upwood Park),M7L Not assigned,M8L Not assigned,M9L North York (Humber Summit)
M1M Scarborough (Cliffside / Cliffcrest / Scarborough Village West),M2M North York (Willowdale / Newtonbrook),M3M North York (Downsview) Central,M4M East Toronto (Studio District),M5M North York (Bedford Park / Lawrence Manor East),M6M York (Del Ray / Mount Dennis / Keelsdale and Silverthorn),M7M Not assigned,M8M Not assigned,M9M North York (Humberlea / Emery)


In [166]:
print(tables[0].prettify())

 <br/>
      (
      <a href="/wiki/University_of_Toronto" title="University of Toronto">
       University of Toronto
      </a>
      / Harbord)
     </span>
    </p>
   </td>
   <td style="vertical-align:top;">
    <p>
     <b>
      M6S
     </b>
     <br/>
     <span style="font-size:85%;">
      West Toronto
      <br/>
      (
      <a href="/wiki/Runnymede,_Toronto" title="Runnymede, Toronto">
       Runnymede
      </a>
      /
      <a href="/wiki/Swansea,_Toronto" title="Swansea, Toronto">
       Swansea
      </a>
      )
     </span>
    </p>
   </td>
   <td style="vertical-align:top; color:#ccc;">
    <p>
     <b>
      M7S
     </b>
     <br/>
     <span style="font-size:85%;">
      <i>
       Not assigned
      </i>
     </span>
    </p>
   </td>
   <td style="vertical-align:top; color:#ccc;">
    <p>
     <b>
      M8S
     </b>
     <br/>
     <span style="font-size:85%;">
      <i>
       Not assigned
      </i>
     </span>
    </p>
   </td>
   <td style="vertical-

In [102]:
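# Build the dataframe from the table cells.
records = []

for row in tables[0].tbody.find_all("tr"):
    for cell in row.find_all("td"):
        postcode = cell.b.text
        italic = cell.i  # unassigned postal codes are rendered in italics
        if italic is not None:
            borough = italic.text
            neighbourhood = italic.text
        else:
            # The first link title is the borough; the last of the remaining link titles is the neighbourhood.
            # A cell with fewer than two links keeps the value left over from the previous cell.
            for position, link in enumerate(cell.find_all("a")):
                if position == 0:
                    borough = link["title"]
                else:
                    neighbourhood = link["title"]
        records.append({"postcode": postcode, "borough": borough, "neighbourhood": neighbourhood})

# Build the frame once from the collected rows (DataFrame.append no longer exists in pandas 2.x).
data = pd.DataFrame(records, columns=["postcode", "borough", "neighbourhood"])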

# Print result
data

Unnamed: 0,postcode,borough,neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Harbourfront, Toronto"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,The Queensway


In [103]:
data.shape

(180, 3)
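
The loop above carries `borough` and `neighbourhood` over from the previous cell whenever a cell has fewer than two links, which is why M7A ends up with "Lawrence Heights" in the output. A minimal sketch of a variant that resets both values for every cell is shown here. The HTML fragment is hand-written to imitate the cell layout (it is not the live Wikipedia markup), the helper name `parse_cells` is hypothetical, and joining all neighbourhood links instead of keeping only the last one is an assumption about the desired result.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the cell layout; not the real page.
SAMPLE_HTML = """
<table><tbody><tr>
<td><p><b>M1A</b><br/><span><i>Not assigned</i></span></p></td>
<td><p><b>M6A</b><br/><span><a title="North York">North York</a><br/>
(<a title="Lawrence Manor">Lawrence Manor</a> /
<a title="Lawrence Heights">Lawrence Heights</a>)</span></p></td>
<td><p><b>M7A</b><br/><span><a title="Queen's Park (Toronto)">Queen's Park</a><br/>
(Ontario Provincial Government)</span></p></td>
</tr></tbody></table>
"""


def parse_cells(html):
    """Hypothetical helper: one row per table cell, values reset for every cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for cell in soup.find_all("td"):
        postcode = cell.b.get_text(strip=True)
        titles = [a["title"] for a in cell.find_all("a") if a.has_attr("title")]
        if cell.i is not None or not titles:
            # Italic text marks an unassigned postal code.
            borough = neighbourhood = "Not assigned"
        else:
            borough = titles[0]
            # Join every neighbourhood link; use "Not assigned" when there is none.
            neighbourhood = ", ".join(titles[1:]) or "Not assigned"
        rows.append({"postcode": postcode, "borough": borough,
                     "neighbourhood": neighbourhood})
    return pd.DataFrame(rows, columns=["postcode", "borough", "neighbourhood"])


sample = parse_cells(SAMPLE_HTML)
print(sample)
```

With this variant, a cell that names only a borough gets "Not assigned" as its neighbourhood, so the later cleaning step can replace it with the borough name.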

### Data preprocessing and cleaning

In [128]:
# Drop the rows whose borough is 'Not assigned'.
df = data[data.borough != 'Not assigned']

# Combine rows that share the same postal code, joining their neighbourhoods with commas.
df = df.groupby(['postcode','borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True)

# If a cell has a borough but a 'Not assigned' neighbourhood, use the borough as the neighbourhood.
df['neighbourhood'] = np.where(df['neighbourhood'] == 'Not assigned', df['borough'], df['neighbourhood'])

df.head()


Unnamed: 0,postcode,borough,neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Toronto"
3,M6A,North York,Lawrence Heights
4,M7A,Queen's Park (Toronto),Lawrence Heights


In [129]:
df.shape

(103, 3)
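
The three cleaning steps can be checked on a small hand-made frame. The rows below are illustrative toy values, not the scraped data, and the names `raw` and `clean` are assumptions introduced for this sketch.

```python
import numpy as np
import pandas as pd

# Toy input: one unassigned code, one code split over two rows,
# and one code that has a borough but no neighbourhood.
raw = pd.DataFrame({
    "postcode": ["M1A", "M5A", "M5A", "M7A"],
    "borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "neighbourhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# 1) Drop rows without a borough.
clean = raw[raw["borough"] != "Not assigned"]

# 2) One row per postal code, neighbourhoods joined with commas.
clean = (clean.groupby(["postcode", "borough"], sort=False)["neighbourhood"]
              .agg(", ".join)
              .reset_index())

# 3) Fall back to the borough when the neighbourhood is unassigned.
clean["neighbourhood"] = np.where(clean["neighbourhood"] == "Not assigned",
                                  clean["borough"], clean["neighbourhood"])
print(clean)
```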

## 2) Segment

### Importing CSV data containing the latitude and longitude of each postal code

In [130]:
geodata = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv')
geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Joining the Toronto neighbourhood table and the geospatial data

In [132]:
# Rename the columns so that 'Postal Code' becomes 'postcode' and the coordinate columns are lowercase
geodata.rename(columns={'Postal Code':'postcode', 'Latitude':'latitude', 'Longitude':'longitude'}, inplace=True)

# Merge
df = pd.merge(df, geodata, on='postcode')
df.head()

Unnamed: 0,postcode,borough,neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Toronto",43.65426,-79.360636
3,M6A,North York,Lawrence Heights,43.718518,-79.464763
4,M7A,Queen's Park (Toronto),Lawrence Heights,43.662301,-79.389494


In [133]:
df.shape

(103, 5)
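
`pd.merge` defaults to an inner join, which silently drops postal codes that have no coordinates. Here the row count stays at 103, so nothing was lost, but the check can be made explicit. The sketch below uses a left join with `indicator=True` on toy frames; the postal code `M9Z` is a made-up value used to force a miss, and the frame names are assumptions.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "postcode": ["M3A", "M4A", "M9Z"],  # "M9Z" is a hypothetical unmatched code
    "borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "postcode": ["M3A", "M4A", "M5A"],
    "latitude": [43.753259, 43.725882, 43.65426],
    "longitude": [-79.329656, -79.315572, -79.360636],
})

# A left join keeps every neighbourhood row; the _merge column says where each row came from.
merged = neighbourhoods.merge(coords, on="postcode", how="left",
                              validate="one_to_one", indicator=True)

# Postal codes that did not find coordinates.
missing = merged.loc[merged["_merge"] == "left_only", "postcode"].tolist()
print(missing)
```

`validate="one_to_one"` additionally raises an error if either frame contains a duplicated postal code.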

## 3) Cluster the neighborhoods

### Explore and cluster the neighborhoods in Toronto, considering only boroughs that contain the word Toronto

In [139]:
dfc = df[df['borough'].str.contains('Toronto',regex=False)]
dfc.head()

Unnamed: 0,postcode,borough,neighbourhood,latitude,longitude
2,M5A,Downtown Toronto,"Harbourfront, Toronto",43.65426,-79.360636
4,M7A,Queen's Park (Toronto),Lawrence Heights,43.662301,-79.389494
6,M1B,"Scarborough, Toronto","Rouge, Toronto",43.806686,-79.194353
9,M5B,Downtown Toronto,Ryerson University,43.657162,-79.378937
12,M1C,"Scarborough, Toronto","Highland Creek, Toronto",43.784535,-79.160497


In [140]:
dfc.shape

(52, 5)
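
A short sketch of the same filter on toy values. It adds two things that the cell above does not use and that are assumptions here: `na=False`, so that missing borough names count as non-matches, and `.copy()`, so that columns added later go to an independent frame.

```python
import pandas as pd

boroughs = pd.DataFrame({
    "borough": ["Downtown Toronto", "North York", "East Toronto", None, "Etobicoke"],
})

# regex=False treats "Toronto" as a plain substring; na=False maps missing values to False.
mask = boroughs["borough"].str.contains("Toronto", regex=False, na=False)

# .copy() returns an independent frame that is safe to modify later.
toronto_only = boroughs[mask].copy()
print(toronto_only)
```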

### Create a map of Toronto with neighborhoods superimposed on top.

In [143]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [151]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighbourhood in zip(dfc['latitude'], dfc['longitude'], dfc['borough'], dfc['neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

### Run k-means to cluster the neighborhoods

In [153]:
# set number of clusters
kclusters = 5

# keep only the latitude and longitude columns as clustering features
toronto_clustering = dfc.drop(['postcode', 'borough', 'neighbourhood'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# Add the cluster labels to the dataset dfc.
dfc.insert(0, 'cluster_label', kmeans.labels_)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 1, 0, 1, 0, 3, 1, 0, 2])

In [154]:
dfc.head()

Unnamed: 0,cluster_label,postcode,borough,neighbourhood,latitude,longitude
2,0,M5A,Downtown Toronto,"Harbourfront, Toronto",43.65426,-79.360636
4,0,M7A,Queen's Park (Toronto),Lawrence Heights,43.662301,-79.389494
6,1,M1B,"Scarborough, Toronto","Rouge, Toronto",43.806686,-79.194353
9,0,M5B,Downtown Toronto,Ryerson University,43.657162,-79.378937
12,1,M1C,"Scarborough, Toronto","Highland Creek, Toronto",43.784535,-79.160497
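
The value `kclusters = 5` is fixed by hand. One common way to sanity-check the choice is to compare the k-means inertia (within-cluster sum of squared distances) for several values of k and look for the point where it stops dropping sharply. The sketch below does this on synthetic coordinates around three made-up centres. The points are generated, not taken from `dfc`, and the range of k values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic latitude/longitude points around three hypothetical centres.
rng = np.random.default_rng(0)
centers = np.array([[43.65, -79.38], [43.70, -79.30], [43.66, -79.45]])
points = np.vstack([c + rng.normal(scale=0.005, size=(20, 2)) for c in centers])

# Fit k-means for each candidate k and record the inertia.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 5))
```

On the real data, the same loop could be run on the latitude and longitude columns used for clustering.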


### Map of the clusters

In [162]:
map_toronto_cluster = folium.Map(location=[latitude, longitude],zoom_start=10)

# set color scheme for the clusters (one hex color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(dfc['latitude'],dfc['longitude'],dfc['neighbourhood'], dfc['cluster_label']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_toronto_cluster)
       
map_toronto_cluster
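
The markers above are colored with `rainbow[cluster-1]`, so cluster 0 takes the last color of the list. Every cluster still gets its own color, but the order is shifted by one. A minimal sketch of a direct mapping from cluster label to color follows; the names `palette` and `color_for` are hypothetical and not part of the notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One hex color per cluster, taken at evenly spaced points of the rainbow colormap.
palette = [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]


def color_for(cluster):
    """Hypothetical helper: cluster label k maps to palette[k]."""
    return palette[int(cluster)]


for label in range(kclusters):
    print(label, color_for(label))
```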