# Segmenting and Clustering Neighborhoods in Toronto
### Peer-graded Assignment for the course:<br/>*Applied Data Science Capstone (IBM Data Science Professional Certificate)*, Coursera/IBM.
**Author: Paw Hermansen, 2018, Oct. 17**


## Gets the Toronto postal codes from Wikipedia

### Imports

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

### Reads the HTML from the Wikipedia webpage and finds the rows
Assumes that the table of postal codes has the class 'wikitable' and that no other table on the page has this class.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
sauce = requests.get(url).content
soup = BeautifulSoup(sauce, 'lxml')

table = soup.find("table", {"class":"wikitable"})
rows = table.find_all('tr')
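
The single-table assumption above can be checked explicitly instead of being taken on trust. The following is a minimal sketch, not part of the original notebook: the inline HTML is a made-up stand-in for the downloaded page, and `html.parser` is used instead of `lxml` only so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page:
# one unrelated table and one table carrying the 'wikitable' class.
html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching table, so its length shows
# whether exactly one table has the 'wikitable' class.
wikitables = soup.find_all("table", {"class": "wikitable"})
if len(wikitables) != 1:
    raise ValueError(f"expected exactly one wikitable, found {len(wikitables)}")

rows = wikitables[0].find_all("tr")
print(len(rows))
```

Failing loudly here is a design choice: if the page layout changes, the notebook stops at the scraping step rather than silently building a dataframe from the wrong table.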

### Builds a Pandas dataframe with the Toronto postal codes

First, all the text except the table's header is collected into a Python list of triplets. Then the complete list is transformed into a Pandas dataframe and the column names are added.

All scraped text is stripped of leading and trailing whitespace (spaces, line ends, etc.).

In [3]:
data = []
for row in rows:
    tds = row.find_all('td')
    # Keep only data rows with three cells; this skips the header row
    if len(tds) == 3:
        postalCode = tds[0].text.strip()
        borough = tds[1].text.strip()
        neighborhood = tds[2].text.strip()
        datarow = (postalCode, borough, neighborhood)
        data.append(datarow)

titles = ['PostalCode', 'Borough', 'Neighborhood']
df = pd.DataFrame.from_records(data, columns=titles)

print(df.shape)
df.head(10)

(289, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |


### Removes rows with no borough

In [4]:
# Take an explicit copy so later column assignments do not act on a view of the original
df = df.loc[df['Borough'] != 'Not assigned'].copy()

print(df.shape)
df.head(8)

(212, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |


### Assigns the name of the borough to neighborhoods with no name

In [5]:
mask = df['Neighborhood'] == 'Not assigned'
df['Neighborhood'] = df['Neighborhood'].mask(mask, other=df['Borough'], axis=0)

print(df.shape)
df.head(8)

(212, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 10 | M9A | Etobicoke | Islington Avenue |
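
The same replacement can also be written as a boolean-indexed assignment with `.loc`, which some readers find easier to follow than `Series.mask`. This is only a sketch of an alternative, not the notebook's method; the `toy` dataframe and its values are illustrative.

```python
import pandas as pd

# Toy rows mirroring the notebook's columns; the values are illustrative only.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# True for the rows whose neighborhood has no name.
unnamed = toy["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column on those rows only.
toy.loc[unnamed, "Neighborhood"] = toy.loc[unnamed, "Borough"]

print(toy["Neighborhood"].tolist())
```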


### Groups neighborhoods by the postal code and borough

The neighborhood names within each group are concatenated by passing the string *join* method to the *apply* method, as seen below. *reset_index()* is then called to turn the group keys from the index back into ordinary columns, which makes the result a "normal" dataframe.

In [6]:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

print(df.shape)
df.head()

(103, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
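
To see what each step of the grouping contributes, it may help to run it on a few rows and inspect the intermediate result. The sketch below uses a small illustrative dataframe (`toy`, `grouped`, and `flat` are names introduced here, not from the notebook).

```python
import pandas as pd

# Three illustrative rows: two neighborhoods share the postal code M5A.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Queen's Park"],
})

# apply(', '.join) joins the neighborhood strings of each group.
# The result is a Series indexed by the (PostalCode, Borough) pairs.
grouped = toy.groupby(["PostalCode", "Borough"])["Neighborhood"].apply(", ".join)

# reset_index moves the two group keys back into ordinary columns.
flat = grouped.reset_index()

print(grouped.index.nlevels)
print(flat)
```

Without `reset_index()`, the postal code and borough would stay in a two-level index instead of being regular columns.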


### Saves the dataframe

In [7]:
df.to_csv('toronto_postal_codes.csv', quoting=csv.QUOTE_ALL)
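
Note that `to_csv` writes the dataframe index as an unnamed first column by default, so reading the file back yields an extra `Unnamed: 0` column. The sketch below illustrates this with an in-memory buffer and shows `index=False` as one possible way to avoid it; whether that is wanted depends on how the file is read later, which the notebook does not state. The `toy` dataframe is illustrative.

```python
import csv
import io

import pandas as pd

# One illustrative row; the neighborhood value contains a comma.
toy = pd.DataFrame({
    "PostalCode": ["M1B"],
    "Borough": ["Scarborough"],
    "Neighborhood": ["Rouge, Malvern"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index, quoting=csv.QUOTE_ALL)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the three data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, quoting=csv.QUOTE_ALL, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)

print(cols_with_index)
print(round_trip.columns.tolist())
```

Quoting every field with `csv.QUOTE_ALL` keeps values such as "Rouge, Malvern" intact, since the comma sits inside the quotes.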

### Writes the final number of neighborhoods in the cleaned list of Toronto neighborhoods

In [8]:
print("Number of rows (neighborhoods) = ", df.shape[0])

Number of rows (neighborhoods) =  103
