# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
# Part one

### This notebook was created for the IBM Data Science Capstone project on Coursera.
### The project's main purpose is to explore, segment, and cluster the neighborhoods in the city of Toronto.


## Importing libraries

In [2]:
import numpy as np
import pandas as pd


## Data Requirements

The first step in exploring the neighborhoods of Toronto is to define a list of those neighborhoods. To do this, we can build a dataset from the postal codes located within the city of Toronto in the province of Ontario.

To gather data for further analysis, we scrape a Wikipedia page; using this source is one of the Coursera project requirements.



## Data Collection with web scraping from Wikipedia


To scrape the page we can use the Python library Beautiful Soup, which is imported in a later cell, just before the page is parsed.

The Wikipedia page "List of postal codes of Canada: M" provides all the information required for this project. Its HTML is downloaded with the requests library and stored in the variable *url*.

In [3]:
# Importing library to handle requests
import requests

# Requesting the Wikipedia page; note that `url` holds the returned HTML text, not the address
url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
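
The request above assumes the download succeeds. A minimal sketch of a more defensive variant is shown below; the helper name `fetch_html`, the constant `WIKI_URL`, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

# Address of the page used in this notebook
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def fetch_html(address: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: return the page HTML, or raise if the request fails."""
    response = requests.get(address, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently passing an error page on to the parser
    response.raise_for_status()
    return response.text


# Usage (requires network access):
# html = fetch_html(WIKI_URL)
```

Failing early here makes a later "table not found" error easier to trace back to a failed download.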

In [15]:
# Importing Beautiful Soup Library for web scraping
from bs4 import BeautifulSoup

# Creating soup object with the BeautifulSoup function 
soup = BeautifulSoup(url, 'lxml')

# Pretty-printing the nested tags of the document
#print(soup.prettify()) # uncomment to see the full HTML

Inspecting the HTML document shows that the table of interest has the class "wikitable sortable". The next lines of code extract the table's content.

In [5]:
# Finding "wikitable sortable" class in HTML
table = soup.find('table',{'class':'wikitable sortable'})

# Extracting column names from header
header = [head.find_all(string=True)[0].strip() for head in table.find_all("th")]

# Extracting content from table
data = [[td.find_all(string=True)[0].strip() for td in tr.find_all("td")] for tr in table.find_all("tr")]
# Keeping only rows with one cell per header column (e.g. drops the header row, which has no <td> cells)
data = [row for row in data if len(row) == len(header)]

# Transforming scraped data into dataframe for further cleaning/analysis
df = pd.DataFrame(data,columns=header)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
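
To see how the extraction works without downloading anything, here is a minimal sketch on a hand-written miniature table. The HTML snippet and the `mini_*` names are made up for illustration, and the sketch uses `get_text(strip=True)` and the built-in `html.parser` as assumed substitutes for the calls in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td><a href="#">Parkwoods</a></td></tr>
</table>
"""

mini_soup = BeautifulSoup(html, "html.parser")
mini_table = mini_soup.find("table", {"class": "wikitable sortable"})

# Column names come from the <th> cells
mini_header = [th.get_text(strip=True) for th in mini_table.find_all("th")]

# One list of cell texts per <tr>; the header row yields an empty list
mini_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in mini_table.find_all("tr")
]

# Keep only rows that have exactly one cell per column
mini_rows = [row for row in mini_rows if len(row) == len(mini_header)]

print(mini_header)
print(mini_rows)
```

The length filter is what removes the header row, since that row contains `<th>` cells and no `<td>` cells.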


## Data Cleaning

The dataframe *df* created through web scraping contains many postal codes that are not assigned to any borough or neighborhood. The following cells clean the data to produce a dataframe whose rows all contain complete information: postal code, borough and neighborhood.

The first step is to check whether any rows have the value "Not assigned" in only one of the two columns.

In [6]:
# Checking for rows where only one of Borough / Neighbourhood is "Not assigned"

check1 = df[(df["Borough"] == "Not assigned") & (df["Neighbourhood"] != "Not assigned")].reset_index(drop=True)

check2 = df[(df["Borough"] != "Not assigned") & (df["Neighbourhood"] == "Not assigned")].reset_index(drop=True)

len(check1), len(check2)

(0, 1)

One row has an assigned borough but the string "Not assigned" in the Neighbourhood column.

In [7]:
check2

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M7A,Queen's Park,Not assigned


In [8]:
# Replacing "Not assigned" in Neighbourhood column with Borough name
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
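
A minimal sketch of this replacement on a made-up three-row frame shows which rows actually change. The `toy` frame and the `mask` variable are illustrative assumptions, not data from the scraped table.

```python
import pandas as pd

# Made-up rows covering the three possible cases
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rows whose neighbourhood is missing
mask = toy["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighbourhood for those rows only
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]

print(toy)
```

The row with a known borough (M7A) receives that borough's name, while a row where both values are "Not assigned" (M1A) is left unchanged and still has to be filtered out later.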

To meet another project requirement, neighbourhoods that share a postcode have to be combined into one row, with the individual neighbourhoods separated by commas in the Neighbourhood column.

In [16]:
df = df.groupby(["Postcode", "Borough"], sort=False).agg(", ".join).reset_index()
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M4T,Central Toronto,"Moore Park, Summerhill East"
1,M4N,Central Toronto,Lawrence Park
2,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi..."
3,M4S,Central Toronto,Davisville
4,M5N,Central Toronto,Roselawn
5,M4P,Central Toronto,Davisville North
6,M5P,Central Toronto,"Forest Hill North, Forest Hill West"
7,M5R,Central Toronto,"The Annex, North Midtown, Yorkville"
8,M4R,Central Toronto,North Toronto West
9,M6G,Downtown Toronto,Christie
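
The grouping step can be illustrated with a minimal sketch on a few hand-picked rows. The `toy` and `combined` names are assumptions for illustration, and the sketch selects the Neighbourhood column explicitly before aggregating.

```python
import pandas as pd

# A few rows in the shape of the scraped table, one neighbourhood per row
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

# Group by postcode and borough, then join the neighbourhood names
# of each group into one comma-separated string
combined = (
    toy.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)

print(combined)
```

Passing `sort=False` keeps the groups in their order of first appearance instead of sorting them by key.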


With the single partially assigned row handled, the remaining rows that contain the string "Not assigned" can be excluded from the dataframe.

In [10]:
# Excluding rows without assignment from dataframe
df = df[(df["Borough"] != "Not assigned") & (df["Neighbourhood"] != "Not assigned")]
df.reset_index(drop=True).head(10)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


In [11]:
# Sorting dataframe by borough
df = df.sort_values(by=["Borough"]).reset_index(drop=True)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M4T,Central Toronto,"Moore Park, Summerhill East"
1,M4N,Central Toronto,Lawrence Park
2,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi..."
3,M4S,Central Toronto,Davisville
4,M5N,Central Toronto,Roselawn
5,M4P,Central Toronto,Davisville North
6,M5P,Central Toronto,"Forest Hill North, Forest Hill West"
7,M5R,Central Toronto,"The Annex, North Midtown, Yorkville"
8,M4R,Central Toronto,North Toronto West
9,M6G,Downtown Toronto,Christie


In [12]:
df.shape

(103, 3)

The resulting dataframe has 103 rows and 3 columns.
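
Beyond the shape, a few quick checks can confirm that the cleanup did what was intended. The sketch below is one possible way to do this; the helper `check_clean` and the two-row `sample` frame are hypothetical and stand in for the real *df*.

```python
import pandas as pd


def check_clean(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: summarize basic sanity checks on a cleaned frame."""
    unassigned = (frame[["Borough", "Neighbourhood"]] == "Not assigned").any().any()
    return {
        # No "Not assigned" left in either column
        "no_unassigned": not unassigned,
        # Each postcode appears in exactly one row after grouping
        "unique_postcodes": bool(frame["Postcode"].is_unique),
        "shape": frame.shape,
    }


# Made-up sample in the shape of the cleaned dataframe
sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront, Regent Park"],
})

report = check_clean(sample)
print(report)
```

Calling `check_clean(df)` on the notebook's dataframe would report the same three items for the real data.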