# Segmenting and Clustering Neighborhoods in Toronto 
### Peer-graded assignment in Applied Data Science Capstone by IBM/Coursera

#### *Submitted by Jessiedee Mark B. Gingo*

## Table of contents
* [Installing Required Libraries](#installation)
* [Webscraping](#webscraping)
* [Data Cleaning](#cleaning)
* [Dataframe](#dataframe)


In this assignment, we will explore and cluster the neighborhoods in Toronto. The dataset is not readily available, so we will scrape it from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.

## Installing Required Libraries <a name="installation"></a>

For webscraping, we will use the **BeautifulSoup** package. Our chosen parser is **lxml**, and we will install **html5lib** as well so that a second parser is available if a page does not parse cleanly.

In [385]:
!pip install beautifulsoup4
!pip install lxml
!pip install html5lib



We will import the required libraries: BeautifulSoup for webscraping, requests for sending a GET request to a webpage, and pandas for preparing the data.

In [386]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Webscraping <a name="webscraping"></a>

The following code requests the page and takes the response body (the page's HTML) as text. We then create the **BeautifulSoup** object, passing the source and the name of the parser.

In [387]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(source, 'lxml')


We will now define our column names. Inspecting the page shows that the table headers are in **'th'** tags. Passing **'th'** to the **find_all** method returns a list of all such tags, and slicing the first 3 gives us **'Postcode'**, **'Borough'**, and **'Neighbourhood'**.

In [388]:
columns = [] # Column values

for header in soup.find_all('th')[0:3]:
    header = header.text
    columns.append(header)

columns[-1] = columns[-1].strip() # Removing the \n in the last item
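
The header extraction can be tried offline on a small table. The following is a minimal sketch, assuming a simplified, made-up HTML snippet in place of the Wikipedia page and Python's built-in `html.parser` instead of lxml; the names `html`, `sample_soup`, and `sample_columns` are illustrative only. It uses `get_text(strip=True)`, which trims whitespace from every header rather than only the last one.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the table markup on the page.
html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

# 'html.parser' ships with Python, so this sketch runs without lxml.
sample_soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) removes surrounding whitespace, including a trailing '\n'.
sample_columns = [th.get_text(strip=True) for th in sample_soup.find_all('th')[0:3]]
print(sample_columns)  # ['Postcode', 'Borough', 'Neighbourhood']
```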

Next, we prepare the data. The rows list is created to contain the table contents. On the page, the data are located in **'tr'** tags, so we pass **'tr'** to the **find_all** method and skip the first row, which holds the headers.

In [389]:
# Rows values
rows = [] # Rows value container

for data in soup.find_all('tr')[1:]:
    data = data.text
    data = data.split('\n')
    rows.append(data)

rows = rows[0:-5] # Drop the last 5 rows (unwanted information)

for i in range(len(rows)): # Keep items 1 to 3 of each row, dropping the empty strings left by the split
    rows[i] = rows[i][1:4]

rows

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', 'Downtown Toronto', "Queen's Park"],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', "Queen's Park", 'Not assigned'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 ['M9B', 'Etobicoke', 'Islington'],
 ['M9B', 'Etobicoke', 'Martin Grove'],
 ['M9B', 'Et
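
The slice `[1:4]` in the cell above makes more sense once the split result is visible. Here is a small sketch in plain Python, assuming the text of one table row starts and ends with a newline; the string `row_text` is a made-up example, not output captured from the page.

```python
# Hypothetical text of one table row, as tr.text might return it:
# a newline before the first cell and after the last one.
row_text = '\nM3A\nNorth York\nParkwoods\n'

# Splitting on '\n' leaves an empty string at each end of the list.
parts = row_text.split('\n')
print(parts)       # ['', 'M3A', 'North York', 'Parkwoods', '']

# Items 1 to 3 hold the postcode, borough, and neighbourhood.
print(parts[1:4])  # ['M3A', 'North York', 'Parkwoods']
```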

## Data Cleaning <a name="cleaning"></a>

The columns and rows are now established, and the dataframe can be created.

In [390]:
df = pd.DataFrame(data=rows, columns=columns) 
df

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| ... | ... | ... | ... |
| 282 | M8Z | Etobicoke | Mimico NW |
| 283 | M8Z | Etobicoke | The Queensway West |
| 284 | M8Z | Etobicoke | Royal York South West |
| 285 | M8Z | Etobicoke | South of Bloor |


Rows whose **Borough** is 'Not assigned' are dropped.

In [391]:
# Keep rows with an assigned borough; drop=True discards the old index instead of adding it as a column
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
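
The effect of this filter can be checked on a few rows. This is a minimal sketch, assuming a made-up three-row frame named `sample` with the same columns as the scraped table. It calls `reset_index(drop=True)`, which gives the same result as resetting the index and then dropping the extra `index` column.

```python
import pandas as pd

# Hypothetical three-row sample in the same shape as the scraped table.
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# The boolean mask keeps only rows with an assigned borough;
# drop=True renumbers the rows from 0 without keeping the old index.
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```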

**Neighbourhood** values that share the same **Postcode** are grouped and joined into one comma-separated string.

In [392]:
df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
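
To see how the grouping collapses several rows into one, here is a small sketch on a made-up frame named `sample`. It assumes two rows share a postcode, mirroring the M1B rows seen earlier, and applies the same `groupby` and `', '.join` combination.

```python
import pandas as pd

# Hypothetical sample: two rows share the postcode M1B.
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M3A'],
    'Borough': ['Scarborough', 'Scarborough', 'North York'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Parkwoods'],
})

# Group by postcode and borough, then join each group's neighbourhoods
# into a single comma-separated string.
grouped = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)
print(grouped)
```

The two M1B rows become one row whose **Neighbourhood** is 'Rouge, Malvern', while the M3A row is unchanged.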


Where **Neighbourhood** is 'Not assigned', the value is replaced by the value in the **Borough** column of the same row.

In [393]:
# Boolean mask of the rows whose Neighbourhood is 'Not assigned'
not_assigned = df['Neighbourhood'] == 'Not assigned'

# Copy each row's own Borough into its Neighbourhood
# (a value-based df.replace would write the first borough into every such row)
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
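
The intended result of this step is that each 'Not assigned' neighbourhood takes the borough of its own row. The sketch below illustrates that with a boolean mask and `.loc`, assuming a made-up frame named `sample` with two such rows in different boroughs; the row values are illustrative, not taken from the scraped data.

```python
import pandas as pd

# Hypothetical sample with two unassigned neighbourhoods in different boroughs.
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Harbourfront', 'Not assigned', 'Not assigned'],
})

# Select the affected rows, then copy each row's borough into its neighbourhood.
mask = sample['Neighbourhood'] == 'Not assigned'
sample.loc[mask, 'Neighbourhood'] = sample.loc[mask, 'Borough']
print(sample)
```

Because the assignment is aligned row by row, the two unassigned rows receive different values, which a single value-for-value replacement could not do.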

## Dataframe <a name="dataframe"></a>

The first rows of the dataframe, requested with `df.head(12)`, are shown below.

In [394]:
df.head(12)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


The shape of the dataframe is shown.

In [395]:
df.shape

(103, 3)