# Segmenting and Clustering Neighborhoods in Toronto
In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about data science is that each project can be challenging in its own way, so you need to be agile and practice learning new libraries and tools quickly as each project demands.

For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.
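
As a rough illustration of the scrape-and-wrangle step, the sketch below parses a small inline HTML snippet rather than the live page. The snippet `SAMPLE_HTML` and the helper `parse_cells` are hypothetical, and the assumed cell layout (a bold postal code followed by a span holding the borough and neighborhood separated by a line break) may not match the current Wikipedia markup.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical two-cell excerpt; the markup of the live page may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td><p><b>M1A</b><br/><span>Not assigned</span></p></td>
    <td><p><b>M3A</b><br/><span>North York<br/>(Parkwoods)</span></p></td>
  </tr>
</table>
"""

def parse_cells(html):
    """Return one dict per <td>, skipping cells whose borough is 'Not assigned'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for td in soup.find_all("td"):
        postcode = td.p.b.get_text(strip=True)
        # A separator makes each <br/> boundary inside the span show up as '|'
        parts = td.span.get_text(separator="|", strip=True).split("|")
        borough = parts[0]
        if borough == "Not assigned":
            continue
        neighborhood = parts[1] if len(parts) > 1 else ""
        rows.append({"Postalcode": postcode,
                     "Borough": borough,
                     "Neighborhood": neighborhood})
    return rows

sample_df = pd.DataFrame(parse_cells(SAMPLE_HTML),
                         columns=["Postalcode", "Borough", "Neighborhood"])
print(sample_df)
```

Collecting the rows in a list and building the dataframe once at the end avoids growing a dataframe row by row.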

Once the data is in a structured format, you can replicate the analysis that we performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

In [2]:
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen as uReq
import pandas as pd
import numpy as np
import json # library to handle JSON files
# install geopy (comment this line out if it is already installed, e.g. from the Foursquare API lab)
!conda install -c conda-forge geopy --yes 

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# install folium (comment this line out if it is already installed, e.g. from the Foursquare API lab)
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

#pip install wget

# print('Libraries imported.')

In [3]:
# download the page HTML
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
uClient = uReq(url)
canada_html = uClient.read()
uClient.close()
# parse the HTML with the built-in parser
canada_soup = Soup(canada_html, "html.parser")
# capture every table tag
tableContainer = canada_soup.findAll('table')
# capture the td tags of the first table, which contain all the required information
tdContainer = tableContainer[0].findAll('td')

# column names for the dataframe
column_names = ['Postalcode', 'Borough', 'Neighborhood']

# loop over each td, extract the fields, and collect one row per postal code
rows = []
for td in tdContainer:
    postcode = td.p.b.text
    borough = td.span.text.strip()
    neighborhood = ''
    if borough != 'Not assigned':
        # replace <br/> with |br| so the borough and neighborhood can be split apart
        spanContainer = td.find('span').encode_contents().replace(b'<br/>', b'|br|')
        words = Soup(spanContainer, 'html.parser').text.split('|br|')
        borough = words[0]
        neighborhood = words[1]
    rows.append({'Postalcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

# build the dataframe once (DataFrame.append was removed in pandas 2.0)
toronto = pd.DataFrame(rows, columns=column_names)
print('Matrix of dataframe before cleaning :',toronto.shape)
print('\n')
print(toronto.head(10))
toronto.replace('', np.nan, inplace=True)
toronto=toronto.dropna()
toronto.reset_index(drop=True, inplace=True)
print('\n\n\n')
print('Matrix of dataframe after cleaning :',toronto.shape)
print('\n')
print(toronto.head(10))

Matrix of dataframe before cleaning : (180, 3)


  Postalcode           Borough                         Neighborhood
0        M1A      Not assigned                                     
1        M2A      Not assigned                                     
2        M3A        North York                          (Parkwoods)
3        M4A        North York                   (Victoria Village)
4        M5A  Downtown Toronto         (Regent Park / Harbourfront)
5        M6A        North York  (Lawrence Manor / Lawrence Heights)
6        M7A      Queen's Park      (Ontario Provincial Government)
7        M8A      Not assigned                                     
8        M9A         Etobicoke                   (Islington Avenue)
9        M1B       Scarborough                    (Malvern / Rouge)




Matrix of dataframe after cleaning : (103, 3)


  Postalcode           Borough                         Neighborhood
0        M3A        North York                          (Parkwoods)
1        M4A   

##### Retrieve postcode coordinates.

Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data. We are supposed to use the Google Maps Geocoding API to get these coordinates. However, that API is a paid service (http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/), so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a given postal code may return None, and the same call made again may return the coordinates. So, to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code.
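
The retry idea described above could look roughly like the sketch below. The function `geocode_with_retry` and its `lookup` parameter are hypothetical names introduced for illustration, and `flaky_lookup` is a stand-in that simulates a geocoder returning None on its first two calls. Unlike an open-ended while loop, this version caps the number of attempts so that a postal code that never resolves cannot hang the notebook.

```python
import time

def geocode_with_retry(postal_code, lookup, max_attempts=5, delay_seconds=0.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        # Pause before the next attempt (0 seconds by default)
        time.sleep(delay_seconds)
    return None

# Stand-in for a real geocoder: returns None twice, then fixed coordinates.
calls = {"count": 0}

def flaky_lookup(postal_code):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return (43.753259, -79.329656)

result = geocode_with_retry("M3A", flaky_lookup)
print(result, calls["count"])
```

In practice, `lookup` would be a small function that wraps a call to the geocoding package and returns a latitude and longitude pair, or None when the call fails.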

For this task, we just use a prepared CSV file to retrieve the coordinates.

Load the CSV file of Toronto geographical coordinates into a dataframe.

In [4]:
toronto_geocsv = 'https://cocl.us/Geospatial_data'
geocsv_data = pd.read_csv(toronto_geocsv)
# rename the key column so it matches the scraped dataframe
geocsv_data.rename(columns={'Postal Code': 'Postalcode'}, inplace=True)

# join the coordinates onto the neighborhoods by postal code
torontodf = pd.merge(toronto, geocsv_data, on='Postalcode')
torontodf.head(10)

| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | (Parkwoods) | 43.753259 | -79.329656 |
| 1 | M4A | North York | (Victoria Village) | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | (Regent Park / Harbourfront) | 43.65426 | -79.360636 |
| 3 | M6A | North York | (Lawrence Manor / Lawrence Heights) | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | (Ontario Provincial Government) | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | (Islington Avenue) | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | (Malvern / Rouge) | 43.806686 | -79.194353 |
| 7 | M3B | North York | (Don Mills) | 43.745906 | -79.352188 |
| 8 | M4B | East York | (Parkview Hill / Woodbine Gardens) | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | (Garden District, Ryerson) | 43.657162 | -79.378937 |
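
Because an inner merge silently drops postal codes that have no match in the coordinates file, it can be worth checking for unmatched keys. The sketch below is one possible check on toy data: the first two rows reuse values shown in the table, while the postal code `M0X` and its borough are hypothetical and exist only to demonstrate an unmatched key.

```python
import pandas as pd

# Two rows taken from the table, plus one hypothetical code ("M0X")
# added only to demonstrate an unmatched key.
neighborhoods = pd.DataFrame({
    "Postalcode": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
coordinates = pd.DataFrame({
    "Postalcode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighborhood row; indicator=True adds a "_merge"
# column recording whether each row found a match in the coordinates table.
checked = pd.merge(neighborhoods, coordinates,
                   on="Postalcode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every postal code received coordinates.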


##### Use the geopy library to get the latitude and longitude values of Toronto.

In [11]:
address = 'Toronto, Toronto'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(torontodf['Borough'].unique()),
        torontodf.shape[0]
    )
)

The geographical coordinates of Toronto City are 43.65238435, -79.38356765.
The dataframe has 11 boroughs and 103 neighborhoods.


##### Create a map of Toronto with neighborhoods superimposed on top.

In [13]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(torontodf['Latitude'], torontodf['Longitude'], torontodf['Borough'], torontodf['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

