## Segmenting and Clustering Neighborhoods in Toronto

**Jay Gendron  -  July 2019**

In this project, we explore and cluster the neighborhoods in Toronto. The project has three parts:

### Table of Contents

1. <a href="#part1">Create dataframe of Toronto's PostalCodes, Boroughs, and Neighborhoods</a>
2. <a href="#part2">Getting neighborhood latitude and longitude using Geocoder package</a>  
3. <a href="#part3">Exploring and clustering the neighborhoods in Toronto</a>


<a name="part1"></a> 
    
### Part 1. Create dataframe of Toronto's PostalCodes, Boroughs, and Neighborhoods

The first steps in most data science projects are ETL (extract, transform, and load) of the data. In this project, Part 1 and Part 2 retrieve the data and perform the needed transformations. First things first: we are given a data source on Wikipedia that lists the postal codes within the city of Toronto (in the province of Ontario). That source is:

[Data Source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)


In [1]:
#Import libraries
import numpy as np
import pandas as pd
import requests

We can now use the **requests** library's `get()` function to download the HTML of the Wikipedia page that holds the postal code table.

In [2]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

Pandas can parse HTML tables with the `read_html()` function, which returns a list of dataframes, one per table it finds.

In [3]:
from io import StringIO

#Parse the HTML already downloaded into website_url instead of fetching the page again
df_raw = pd.read_html(StringIO(website_url),
                      flavor='bs4') #flavor bs4 uses BeautifulSoup as a parsing engine
df = df_raw[0] #extract the first element from the list returned from read_html
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
df.tail() #check that all table rows down to M9Z were read into the dataframe

Unnamed: 0,Postcode,Borough,Neighbourhood
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor
287,M9Z,Not assigned,Not assigned


Now that the raw data is available, there are six pre-processing requirements. They are quoted here from the project instructions for reference.

#### Pre-Processing Requirements
> 1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
> 2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
> 3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
> 4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
> 5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
> 6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

**Requirement 2:** Use only boroughs assigned a name

In [5]:
#Use copy() so df_named is an independent dataframe rather than a view of df
print(f'Before eliminating rows there were {df.shape[0]} rows')
df_named = df[df['Borough']!='Not assigned'].copy()
print(f'After eliminating rows there were {df_named.shape[0]} rows')

Before eliminating rows there were 288 rows
After eliminating rows there were 211 rows
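
The filter above is a boolean mask. As a minimal sketch of how it works, here is the same idea on a made-up three-row table (the `toy` dataframe and the names `mask` and `toy_named` are illustrative and not part of the project data):

```python
import pandas as pd

# Hypothetical miniature of the raw table; values are for illustration only
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Comparing a column to a value yields a boolean Series, one flag per row
mask = toy['Borough'] != 'Not assigned'

# Indexing with the mask keeps only the rows flagged True; copy() makes the
# result an independent dataframe, so later column assignments act on it alone
toy_named = toy[mask].copy()
print(toy_named.shape)
```

Keeping the mask in its own variable makes it easy to inspect how many rows will be dropped before committing to the filter.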


**Requirement 4:** Fill in unnamed neighborhoods with name of borough

In [6]:
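#Use apply to assign the borough name where the neighborhood is 'Not assigned'
#Columns are accessed by label because positional access such as x[-1] is deprecated for labeled Series
df_named['Neighbourhood'] = df_named.apply(func=lambda x: x['Borough'] if x['Neighbourhood']=='Not assigned' else x['Neighbourhood'],
                                           axis=1) #axis=1 applies the function to each row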

#verify by checking borough named Queen's Park
df_named[df_named['Borough']=="Queen's Park"]

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Queen's Park
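
A row-wise `apply()` is easy to read but runs a Python function once per row. As a hedged alternative, the same fill can be sketched with the vectorized `Series.mask()` method. The `toy` dataframe below is made up for illustration:

```python
import pandas as pd

# Hypothetical two-row table; only the first row needs filling
toy = pd.DataFrame({
    'Postcode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# mask() replaces values where the condition is True with the value
# from `other` at the same index label, here the Borough column
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned', toy['Borough'])
print(toy)
```

Either approach should give the same dataframe. The vectorized form mainly matters for larger tables.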


**Requirement 3:** Combine postal code neighborhoods with comma-separated string for label

This step combines Pandas methods with built-in Python features:
* `groupby()` to gather the rows having the same postal code
* `apply()` to run a user-defined function on each group
* a `lambda` expression to define that function inline
* `to_list()` to gather the neighborhood names in the group into a list
* `', '.join()` to convert that list into a comma-separated string

The grouped neighborhoods are saved into an additional dataframe that will merge with the original data.

In [7]:
grouped_neighbourhoods = df_named.groupby('Postcode')['Neighbourhood'].apply(lambda x:', '.join(x.to_list()))
grouped_neighbourhoods = grouped_neighbourhoods.to_frame().reset_index()
grouped_neighbourhoods.head()

Unnamed: 0,Postcode,Neighbourhood
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae
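
To see what the lambda receives, it can help to loop over the groups by hand. The sketch below uses a made-up three-row table (`toy` is illustrative) and prints each group before joining it:

```python
import pandas as pd

# Hypothetical rows: M5A appears twice, M1G once
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M1G'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Woburn'],
})

# Iterating over the grouped column yields (postal code, Series of names) pairs;
# each Series is what the lambda sees as x
for postcode, names in toy.groupby('Postcode')['Neighbourhood']:
    print(postcode, names.to_list())

# The same call as in the notebook, applied to the toy table
joined = toy.groupby('Postcode')['Neighbourhood'].apply(lambda x: ', '.join(x.to_list()))
print(joined)
```

The result is a Series indexed by postal code, which is why the notebook follows up with `to_frame().reset_index()`.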


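Combine the grouped neighborhoods with the `df_named` dataframe using the `merge()` function. Because both dataframes have a `Neighbourhood` column, Pandas appends the default suffixes `_x` (left dataframe) and `_y` (right dataframe) to tell them apart.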

In [8]:
df_grouped = pd.merge(df_named, grouped_neighbourhoods, on='Postcode')
df_grouped.head()

Unnamed: 0,Postcode,Borough,Neighbourhood_x,Neighbourhood_y
0,M3A,North York,Parkwoods,Parkwoods
1,M4A,North York,Victoria Village,Victoria Village
2,M5A,Downtown Toronto,Harbourfront,"Harbourfront, Regent Park"
3,M5A,Downtown Toronto,Regent Park,"Harbourfront, Regent Park"
4,M6A,North York,Lawrence Heights,"Lawrence Heights, Lawrence Manor"


Now we can select the three columns we want to keep and drop the duplicate rows, as seen above with Postcode M5A.

In [9]:
df_final = df_grouped[['Postcode','Borough','Neighbourhood_y']].drop_duplicates()
df_final = df_final.sort_values(by='Postcode') #sort final dataframe by postal code

#Verify by checking that Postcode M5A is listed once with two neighborhoods: Harbourfront and Regent Park
df_final[df_final['Postcode']=='M5A']

Unnamed: 0,Postcode,Borough,Neighbourhood_y
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
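
The merge and `drop_duplicates()` steps can also be avoided by grouping on both key columns at once. This is a sketch of an alternative and not the approach used in this notebook. It assumes each postal code belongs to exactly one borough, and the `toy` data is made up:

```python
import pandas as pd

# Hypothetical rows; assumes one borough per postal code
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Grouping by Postcode and Borough together keeps the borough in the result,
# so no merge back to the original dataframe is needed
grouped = (toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
              .agg(', '.join)
              .reset_index())
print(grouped)
```

If a postal code ever spanned two boroughs, this version would produce one row per (postal code, borough) pair, so the assumption is worth checking on real data.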


**Requirement 1:** Rename the three required columns

In [10]:
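#Assign a flat list of names (a nested list would create a MultiIndex) and use the names from Requirement 1
df_final.columns = ['PostalCode','Borough','Neighborhood']
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood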
8,M1B,Scarborough,"Rouge, Malvern"
21,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
33,M1E,Scarborough,"Guildwood, Morningside, West Hill"
39,M1G,Scarborough,Woburn
43,M1H,Scarborough,Cedarbrae


**Requirement 6:** Use the `.shape` attribute to print the number of rows of your dataframe

In [11]:
print(f'After processing, the dataframe was reduced from {df_named.shape} to {df_final.shape}')

After processing, the dataframe was reduced from (211, 3) to (103, 3)
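
Beyond the row count, a few quick checks can confirm the requirements were met. The helper below is a hypothetical sketch: `check_clean_table` and its `code_col` parameter are names invented here, and it runs on a made-up table. Pass whichever name the postal code column actually has.

```python
import pandas as pd

def check_clean_table(df, code_col):
    """Return True if df passes the checks; raise AssertionError otherwise."""
    # Requirement 3: rows sharing a postal code were combined into one
    assert df[code_col].is_unique, 'duplicate postal codes remain'
    # Requirements 2 and 4: no placeholder values survive
    assert not (df['Borough'] == 'Not assigned').any(), 'unassigned borough'
    assert not (df['Neighborhood'] == 'Not assigned').any(), 'unassigned neighborhood'
    return True

# Hypothetical two-row table in the cleaned layout
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront, Regent Park', "Queen's Park"],
})
print(check_clean_table(toy, code_col='PostalCode'))
```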


<a id="part2"></a>

### Part 2. Getting neighborhood latitude and longitude using Geocoder package

<a id="part3"></a>


### Part 3. Exploring and clustering the neighborhoods in Toronto