# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. 

Use the notebook to build the code to scrape the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Note: There are several web scraping libraries and packages in Python. To scrape the table on the page above, you can simply use pandas to read it into a pandas dataframe.
Another way, which is worth learning for more complicated web scraping cases, is to use the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/
The package is so popular that there is a plethora of tutorials and examples on how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k
Use pandas, the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into a pandas dataframe.
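
As a minimal sketch of the BeautifulSoup route, the cell below parses a small inline HTML table rather than the live page, so it runs without network access. The sample rows and the `table_to_dataframe` helper are illustrative assumptions, not part of the assignment, and the real page's markup may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Illustrative stand-in for the Wikipedia table (not the real page markup).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park / Harbourfront</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Parse the first <table> in `html` into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    rows = soup.find("table").find_all("tr")
    # The first row holds the column names; the remaining rows hold the data.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    body = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows[1:]]
    return pd.DataFrame(body, columns=header)

sample_df = table_to_dataframe(SAMPLE_HTML)
print(sample_df)
```

Note that empty cells come back as empty strings here, whereas `pd.read_html` reports them as NaN, so any later missing-value check has to match the route taken.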

In [1]:
# import libraries
import numpy as np  # library to handle data in a vectorised manner
import pandas as pd  # library for data analysis
import json  # library to handle JSON files
!pip install geopy
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values
import requests  # library to handle HTTP requests
from pandas import json_normalize  # flatten JSON into a pandas dataframe (top-level import, pandas >= 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans  # k-means, used in the clustering stage
!pip install folium
import folium  # map rendering library
print('Libraries imported')

Libraries imported


In [2]:
!pip install lxml
!pip install html5lib
import urllib.request


In [3]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',header=0)
for df in dfs:
    print(df)

    Postal code           Borough  \
0           M1A      Not assigned   
1           M2A      Not assigned   
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
..          ...               ...   
175         M5Z      Not assigned   
176         M6Z      Not assigned   
177         M7Z      Not assigned   
178         M8Z         Etobicoke   
179         M9Z      Not assigned   

                                          Neighborhood  
0                                                  NaN  
1                                                  NaN  
2                                            Parkwoods  
3                                     Victoria Village  
4                           Regent Park / Harbourfront  
..                                                 ...  
175                                                NaN  
176                                                NaN  
177                                       

In [4]:
TorontoNeigh = dfs[0]

In [5]:
TorontoNeigh

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [6]:
TorontoNeigh['Borough'].unique()

array(['Not assigned', 'North York', 'Downtown Toronto', 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

In [7]:
#remove all the Boroughs that are not assigned
TorontoNeigh = TorontoNeigh[~TorontoNeigh.Borough.str.contains('Not assigned')]
TorontoNeigh = TorontoNeigh.reset_index(drop=True)
TorontoNeigh.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


In [8]:
TorontoNeigh.shape

(103, 3)

In [9]:
TorontoNeigh['Borough'].unique()

array(['North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

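More than one neighborhood can exist in one postal code area. For example, the assignment notes that M5A is listed twice in the table on the Wikipedia page, with two neighborhoods: Harbourfront and Regent Park. Such rows should be combined into one row, with the neighborhoods separated by a comma. In the table scraped above, each postal code already appears in a single row with its neighborhoods separated by a slash (the check below finds 103 unique codes for 103 rows), so only the separator needs to be replaced.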
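
For a table that does list a postal code on several rows, the combining step could be sketched as follows. The three-row `raw` dataframe is a hypothetical sample, not data scraped in this notebook.

```python
import pandas as pd

# Hypothetical sample with one row per neighborhood.
raw = pd.DataFrame({
    "Postal code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group rows that share a postal code and borough, then join the
# neighborhood names with a comma. sort=False keeps first-seen order.
combined = (
    raw.groupby(["Postal code", "Borough"], sort=False)["Neighborhood"]
       .agg(lambda names: ", ".join(names))
       .reset_index()
)
print(combined)
```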

In [10]:
TorontoNeigh['Postal code'].value_counts().shape

(103,)

In [11]:
TorontoNeigh = TorontoNeigh.replace('/', ',', regex=True)
TorontoNeigh.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park , Harbourfront"
3,M6A,North York,"Lawrence Manor , Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [12]:
TorontoNeigh.isnull().values.any()

False
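
The check above finds no missing neighborhoods, so no fill is needed for this table. If some were missing, the rule could be applied roughly as in this sketch. The two-row `sample` dataframe is a hypothetical example, and treating both NaN and the literal string 'Not assigned' as missing is an assumption.

```python
import pandas as pd

# Hypothetical sample: one row has a borough but no neighborhood.
sample = pd.DataFrame({
    "Postal code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": [None, "Parkwoods"],
})

# Flag rows whose neighborhood is NaN or the literal string 'Not assigned'.
missing = sample["Neighborhood"].isna() | (sample["Neighborhood"] == "Not assigned")

# Copy the borough name into the neighborhood column for those rows only.
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]
print(sample)
```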

In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [13]:
TorontoNeigh.shape

(103, 3)

Now that we have built a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a given postal code may return None, and the same call made again may then return the coordinates. To make sure that you get the coordinates for all of the neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:

In [14]:
!pip install geocoder



In [15]:
import geocoder #import geocoder

In [25]:
# define the geocoder function
def get_geocoder(postal_code_from_df):
    lat_lng_coords = None #initialise your variable to None
    #loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code_from_df))
        lat_lng_coords = g.latlng
    neigh_lat = lat_lng_coords[0]
    neigh_lng = lat_lng_coords[1]
    return neigh_lat, neigh_lng  # (latitude, longitude): the next cell unpacks the values in this order
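
The while loop above never exits if a postal code cannot be geocoded at all. One possible variant, sketched here, caps the number of attempts. The `geocode_with_retry` helper, its parameters, and the `flaky_lookup` stand-in are hypothetical and not part of the geocoder package; the lookup is passed in as a callable so the sketch runs without network access.

```python
import time

def geocode_with_retry(lookup, postal_code, max_attempts=5, pause_seconds=0.0):
    """Call `lookup(postal_code)` until it returns coordinates or attempts run out.

    `lookup` is any callable that returns a (lat, lng) pair or None.
    Returns (None, None) if every attempt fails.
    """
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(pause_seconds)  # optional pause between attempts
    return None, None

# Stand-in lookup that fails twice before succeeding, to exercise the retries.
calls = {"count": 0}

def flaky_lookup(postal_code):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.65, -79.38]

result = geocode_with_retry(flaky_lookup, "M5G")
print(result)
```

With the real service, the lookup could be something like `lambda pc: geocoder.arcgis('{}, Toronto, Ontario'.format(pc)).latlng`, which mirrors the call used in the function above.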

In [26]:
# work on a copy so that TorontoNeigh itself is left unchanged
post_TorontoNeigh = TorontoNeigh.copy()
# geocode every postal code; get_geocoder returns (latitude, longitude), so the columns are assigned in that order
post_TorontoNeigh['Latitude'], post_TorontoNeigh['Longitude'] = zip(*post_TorontoNeigh['Postal code'].apply(get_geocoder))

In [27]:
post_TorontoNeigh.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.66179,-79.38939


In [28]:
post_TorontoNeigh

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.311890
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.661790,-79.389390
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North",43.653340,-79.509766
99,M4Y,Downtown Toronto,Church and Wellesley,43.666659,-79.381472
100,M7Y,East Toronto,Business reply mail Processing CentrE,43.648700,-79.385450
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,...",43.632798,-79.493017
