# Segmenting and Clustering Neighborhoods in Toronto

This notebook segments and clusters the neighborhoods of Toronto based on the distribution of nearby venue categories.

## Part 1 - Collecting and Processing Neighborhoods 

### Web scraping the neighborhood data

The neighborhood names are obtained by **web scraping** a Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. This can be done with the pandas function **read_html()**, which returns the tables found on the page as a list of dataframes.

In [1]:
# Importing pandas library
import pandas as pd
print("Pandas library successfully imported.")

Pandas library successfully imported.


In [2]:
# URL with postal codes, boroughs and neighbourhoods in Toronto
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Extracting url's tables as a list of dataframes
df_list = pd.read_html(url)

In [3]:
# The neighbourhood table is the first dataframe in this list
df = df_list[0]
print("This dataframe has {} rows.".format(df.shape[0]))
df

This dataframe has 180 rows.


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


### Processing neighborhood table

The dataframe containing the neighborhood data must meet the following requirements:
* The dataframe will consist of **three columns**: PostalCode, Borough, and Neighborhood

In [4]:
# Renaming columns in df
df.columns = ['PostalCode', 'Borough','Neighborhood']

* Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.

In [5]:
# Mask selecting the rows whose borough is assigned
mask = (df['Borough'] != 'Not assigned')
print("There are {} 'not assigned' boroughs.".format(df.shape[0] - sum(mask)))

There are 77 'not assigned' boroughs.


In [6]:
# Keeping only the rows that have an assigned borough
df = df[mask].reset_index(drop=True)

In [7]:
# Counting duplicated postal codes (0 means every postal code is unique)
df['PostalCode'].duplicated().sum()

0

* If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.

In [8]:
# Not assigned neighborhoods
mask = df['Neighborhood'] == 'Not assigned'
print("There are {} 'not assigned' neighborhoods.".format(sum(mask)))

There are 0 'not assigned' neighborhoods.


In [9]:
# Replacing not assigned neighborhoods with their boroughs
# (.loc writes to df itself; chained indexing would only modify a temporary copy)
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']
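
Because the scraped table contains no 'Not assigned' neighborhoods, the replacement rule has no visible effect on the real data. The sketch below applies the same rule to a small toy dataframe. The postal codes and rows are made up for illustration and are not taken from the Wikipedia table. It uses `.loc` with a boolean mask, which writes to the dataframe itself rather than to a temporary copy.

```python
import pandas as pd

# Toy data (hypothetical rows, only for illustrating the rule)
toy = pd.DataFrame({
    'PostalCode': ['X1A', 'X2A', 'X3A'],
    'Borough': ['North York', 'Etobicoke', 'Scarborough'],
    'Neighborhood': ['Parkwoods', 'Not assigned', 'Not assigned'],
})

# Rows that have a borough but no assigned neighborhood
toy_mask = toy['Neighborhood'] == 'Not assigned'

# Copying the borough name into the neighborhood column for those rows
toy.loc[toy_mask, 'Neighborhood'] = toy.loc[toy_mask, 'Borough']

print(toy)
```

After this cell runs, the two 'Not assigned' rows carry their borough names, and the first row is unchanged.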

* Finally, use the **.shape** attribute to print the number of rows of the dataframe.

In [10]:
# Final dataframe
print("The final dataframe has {} rows.".format(df.shape[0]))
df

The final dataframe has 103 rows.


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


## Part 2 - Collecting and Processing Locations

To use the Foursquare location data, we need the latitude and longitude coordinates of each neighborhood. The neighborhood locations are obtained through the **Geocoder Python package:** https://geocoder.readthedocs.io/index.html

The problem with this package is that it sometimes takes persistence to get the geographical coordinates of a given postal code: a call for the latitude and longitude may return **None**, while repeating the same call returns the coordinates. To make sure coordinates are obtained for every neighborhood, the call can be repeated in a while loop for each postal code (M5G, for example) until it returns a result.
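
The retry idea described above can be sketched as a small helper. This is a minimal illustration, not the notebook's actual code: `geocode_with_retry`, `make_flaky_lookup`, the attempt limit, and the coordinates returned by the stub are all hypothetical. The helper takes any lookup function that returns a `(latitude, longitude)` pair or `None`, so it can be tried without network access. It caps the number of attempts instead of looping forever.

```python
import time
from typing import Callable, Optional, Tuple

LatLng = Tuple[float, float]


def geocode_with_retry(lookup: Callable[[str], Optional[LatLng]],
                       query: str,
                       max_attempts: int = 5,
                       delay_seconds: float = 0.0) -> Optional[LatLng]:
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # optional pause between attempts
    return None  # the caller decides how to handle a postal code that never resolved


def make_flaky_lookup(failures: int) -> Callable[[str], Optional[LatLng]]:
    """Stand-in for a geocoder call: returns None for the first `failures` calls."""
    calls = {'count': 0}

    def lookup(query: str) -> Optional[LatLng]:
        calls['count'] += 1
        if calls['count'] <= failures:
            return None
        return (43.65, -79.38)  # made-up coordinates, for illustration only

    return lookup


# The stub fails twice, then succeeds on the third attempt
result = geocode_with_retry(make_flaky_lookup(2), 'M5G, Toronto')
print(result)
```

With the Geocoder package, the lookup argument could plausibly be a function that makes the provider call and returns its `latlng` attribute, which is `None` when the call fails.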

In [11]:
# Installing and importing geocoder library
#!pip install geocoder
import geocoder
print("Geocoder library successfully imported.")

Geocoder library successfully imported.


In [12]:
# Searching for location coordinates using the Bing geocoder
import os

# Reading the API key from an environment variable instead of hardcoding it
bing_key = os.environ['BING_API_KEY']

for index, row in df.iterrows():
    g = geocoder.bing(row['PostalCode'] + ', Toronto', key=bing_key)
    df.at[index, 'Latitude'] = g.latlng[0]
    df.at[index, 'Longitude'] = g.latlng[1]

print('Location coordinates successfully added!')
df

Location coordinates successfully added!


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.756123 | -79.329636 |
| 1 | M4A | North York | Victoria Village | 43.726780 | -79.310738 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.655354 | -79.365044 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.721996 | -79.445915 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.663910 | -79.388733 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.652699 | -79.511276 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.666286 | -79.382446 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.663506 | -79.317429 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.633709 | -79.496521 |


## Part 3 - Exploring and Clustering Neighborhoods