# Segmenting and Clustering Neighborhoods in Toronto

## Part 1

Create and clean a *pandas* dataframe using the data provided on Wikipedia's table of Canadian postal codes.

In [1]:
# install appropriate packages
!pip install beautifulsoup4 # install the BeautifulSoup library for web scraping
!pip install lxml # install lxml parser to break down the html page into parts
!pip install requests # install the requests library
print('...Packages installed!')

...Packages installed!


In [2]:
# import necessary packages
from bs4 import BeautifulSoup # bs4 = beautifulsoup4
import requests
import pandas as pd

In [3]:
# download the page's HTML via the requests library
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [4]:
# use BeautifulSoup to scrape and lxml to parse
soup = BeautifulSoup(source, 'lxml')

In [5]:
# prettify() indents the HTML, making the nested tags easier to read
# commented out because the lengthy output cannot be collapsed on GitHub
## print(soup.prettify())

In [6]:
# retrieve the <table> elements, then read them into a pandas dataframe
from io import StringIO # pandas 2.1+ deprecates passing a raw HTML string to read_html

table = soup.find_all('table')
df = pd.read_html(StringIO(str(table)))[0] # read_html returns a list of dataframes; [0] selects the first one
df

|     | Postcode | Borough | Neighbourhood |
| --- | --- | --- | --- |
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| ... | ... | ... | ... |
| 283 | M8Z | Etobicoke | Mimico NW |
| 284 | M8Z | Etobicoke | The Queensway West |
| 285 | M8Z | Etobicoke | Royal York South West |
| 286 | M8Z | Etobicoke | South of Bloor |


In [7]:
# rename headers
labels = ['PostalCode', 'Borough', 'Neighborhood']
df.columns = labels
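
Assigning a list to `df.columns` depends on the column order staying the same. As a hedged alternative, the sketch below renames by name with `DataFrame.rename`; the `toy` frame is a made-up stand-in for the scraped table, not the real data.

```python
import pandas as pd

# Toy frame using the headers as they appear in the scraped table (illustrative rows only)
toy = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

# Map old names to new names explicitly; columns not listed are left unchanged
renamed = toy.rename(columns={"Postcode": "PostalCode", "Neighbourhood": "Neighborhood"})
print(list(renamed.columns))
```

Because the mapping is keyed by the old names, this version keeps working if the source table ever reorders its columns.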

In [8]:
import numpy as np

# replace "Not assigned" values in the Borough column with NaN for easy removal
df['Borough'] = df['Borough'].replace("Not assigned", np.nan) # assign back rather than using inplace on a column selection
df

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M1A | NaN | Not assigned |
| 1 | M2A | NaN | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| ... | ... | ... | ... |
| 283 | M8Z | Etobicoke | Mimico NW |
| 284 | M8Z | Etobicoke | The Queensway West |
| 285 | M8Z | Etobicoke | Royal York South West |
| 286 | M8Z | Etobicoke | South of Bloor |


In [9]:
# remove rows with NaN values in the Borough column
df.dropna(subset=["Borough"], axis=0, inplace=True)
df

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| ... | ... | ... | ... |
| 282 | M8Z | Etobicoke | Kingsway Park South West |
| 283 | M8Z | Etobicoke | Mimico NW |
| 284 | M8Z | Etobicoke | The Queensway West |
| 285 | M8Z | Etobicoke | Royal York South West |
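
The replace-then-drop steps above can also be written as a single boolean filter. The following is a minimal sketch on a made-up three-row frame (`toy` is illustrative, not the scraped data) showing that both routes keep the same rows.

```python
import numpy as np
import pandas as pd

# Illustrative rows only: one unassigned borough, two assigned
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# Two-step route, as in the cells above: mark as NaN, then drop those rows
two_step = toy.copy()
two_step["Borough"] = two_step["Borough"].replace("Not assigned", np.nan)
two_step = two_step.dropna(subset=["Borough"])

# One-step alternative: keep only rows whose Borough is assigned
one_step = toy[toy["Borough"] != "Not assigned"]

print(two_step["PostalCode"].tolist())
print(one_step["PostalCode"].tolist())
```

The NaN route is handy when other cleaning steps also rely on missing-value tools; the boolean filter is shorter when dropping the rows is the only goal.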


In [10]:
# count how many neighborhoods (rows) share each PostalCode
df['PostalCode'].value_counts()

M8Y    8
M9V    8
M5V    7
M9B    5
M8Z    5
      ..
M1X    1
M5E    1
M4M    1
M4P    1
M9A    1
Name: PostalCode, Length: 103, dtype: int64
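
The `Length` reported by `value_counts()` is the number of distinct postal codes, which is what the grouped dataframe should end up with. A small sketch of that relationship, using a made-up four-row frame rather than the real table:

```python
import pandas as pd

# Illustrative rows only: M5A appears twice, the other codes once
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A", "M4A"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods", "Victoria Village"],
})

counts = toy["PostalCode"].value_counts()   # rows per postal code
n_codes = toy["PostalCode"].nunique()       # number of distinct postal codes
shared = counts[counts > 1]                 # codes listed on more than one row

print(len(counts), n_codes)
print(shared)
```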

In [11]:
# join neighborhoods that exist within the same PostalCode and Borough into 1 row each
# use commas as separator
df = df.groupby(['PostalCode','Borough']).agg(lambda x: ','.join(x))
df

## value_counts() above reported 103 distinct PostalCodes, so the joined dataframe should have 103 rows

| PostalCode | Borough | Neighborhood |
| --- | --- | --- |
| M1B | Scarborough | Rouge,Malvern |
| M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| M1E | Scarborough | Guildwood,Morningside,West Hill |
| M1G | Scarborough | Woburn |
| M1H | Scarborough | Cedarbrae |
| ... | ... | ... |
| M9N | York | Weston |
| M9P | Etobicoke | Westmount |
| M9R | Etobicoke | Kingsview Village,Martin Grove Gardens,Richvie... |
| M9V | Etobicoke | Albion Gardens,Beaumond Heights,Humbergate,Jam... |
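
To make the join step easier to follow, here is a minimal sketch of the same group-and-join idea on a made-up three-row frame. Passing `as_index=False` is an optional variation, not what the cell above does: it keeps `PostalCode` and `Borough` as ordinary columns, so no separate index reset is needed.

```python
import pandas as pd

# Illustrative rows only: two neighborhoods share the postal code M5A
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on both keys and join each group's neighborhoods with commas;
# as_index=False returns the keys as regular columns instead of a MultiIndex
grouped = toy.groupby(["PostalCode", "Borough"], as_index=False).agg(
    {"Neighborhood": ",".join}
)
print(grouped)
```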


In [12]:
# reset index
df.reset_index(inplace=True)
df

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village,Martin Grove Gardens,Richvie... |
| 101 | M9V | Etobicoke | Albion Gardens,Beaumond Heights,Humbergate,Jam... |


In [13]:
# check if 'Not assigned' values exist in the Neighborhood column
np.where(df['Neighborhood'] == 'Not assigned')

(array([85]),)

In [14]:
# examine row 85
df.iloc[85]

PostalCode               M7A
Borough         Queen's Park
Neighborhood    Not assigned
Name: 85, dtype: object

In [15]:
# change row 85's Neighborhood to Queen's Park (so it matches its Borough)
df.loc[85, 'Neighborhood'] = "Queen's Park"
df

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village,Martin Grove Gardens,Richvie... |
| 101 | M9V | Etobicoke | Albion Gardens,Beaumond Heights,Humbergate,Jam... |


In [16]:
# check for any more 'Not assigned' values
np.where(df['Neighborhood'] == 'Not assigned')

(array([], dtype=int64),)
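
Fixing row 85 by position works for this snapshot of the table, but the row number would change if the source data changed. As a hedged alternative, the sketch below fills every unassigned neighborhood from its borough using a boolean mask; the two-row `toy` frame is made up for illustration.

```python
import pandas as pd

# Illustrative rows only: the second row has no neighborhood assigned
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront,Regent Park", "Not assigned"],
})

# Select the unassigned rows, then copy each one's Borough into Neighborhood
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

# Count what is left, mirroring the np.where check above
remaining = int((toy["Neighborhood"] == "Not assigned").sum())
print(remaining)
```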

In [17]:
df.shape

(103, 3)

The *pandas* dataframe is now clean!

---

## Part 2

Find the latitude and longitude coordinates for each postal code.
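
One possible way to attach coordinates, assuming they are available as a separate lookup table keyed by postal code, is a left merge on `PostalCode`. This is only a sketch: the `coords` frame is hypothetical and its latitude and longitude values are placeholders, not real coordinates.

```python
import pandas as pd

# Two rows shaped like the cleaned dataframe from Part 1
codes = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})

# Hypothetical lookup table; the numbers are placeholders, NOT real coordinates
coords = pd.DataFrame({
    "PostalCode": ["M1C", "M1B"],
    "Latitude": [0.0, 1.0],
    "Longitude": [0.0, 1.0],
})

# A left merge keeps every postal code from Part 1, in its original order;
# validate raises an error if either table repeats a postal code
merged = codes.merge(coords, on="PostalCode", how="left", validate="one_to_one")
print(merged)
```

With `how="left"`, any postal code missing from the lookup table would show up with NaN coordinates instead of being silently dropped.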

---

## Part 3

Explore and cluster Toronto neighborhoods, and create a map to visualize these clusters.

---

A peer-graded assignment for week 3 of IBM's Applied Data Science Capstone online course on Coursera  
Notebook created by Paige Larsen 

Acknowledgements: 
* Thank you to Corey Schafer's YouTube video, "Python Tutorial: Web Scraping with BeautifulSoup and Requests", for helping with BeautifulSoup
* Thank you to pythonprogramminglanguage.com's page on "Web Scraping with Pandas and Beautifulsoup", for helping to convert my soup into a *pandas* df
* Thank you to the various Stack Overflow pages consulted for ideas when I did not know how to progress