### Coursera Capstone Project - Segmenting and Clustering Neighborhoods in the City of Toronto

Author: Jörn Grimmer
Date: Dec. 2019

### Table of Contents
#### Part I Create initial file of City of Toronto, Postal Codes & Neighborhoods
#### Part II Assign Geospatial Data to Postal Codes
#### Part III Perform Analysis

###  Part I - Create Initial File - Approach
#### 1) Scrape the data from Wikipedia
#### 2) Drop rows where 'Borough' has the value 'Not assigned'
#### 3) Combine 'Neighborhood' values that share the same 'PostalCode'
#### 4) Replace 'Neighborhood' where the value is 'Not assigned' with the value of 'Borough'
#### 1) Scrape the data from Wikipedia
In this notebook, we explore and cluster the neighborhoods in Toronto.
We first build the code to scrape the following Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
To work with the table of postal codes on that page, we load its rows into a pandas dataframe.

In [1]:
# First, we import all required packages, not only for screen scraping but also for clustering and visualization.
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes
from bs4 import BeautifulSoup
import urllib3
from urllib.request import urlopen
import requests
import csv

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs  # samples_generator was removed from newer scikit-learn releases

print('Libraries imported.')

Libraries imported.


In [2]:
# Specify the url & load url and get the table of postal codes
# The idea for this program code was found on https://scipython.com/blog/scraping-a-wikipedia-table-with-beautiful-soup/
# get a local copy of the Wikipedia article
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
req = urlopen(url)
article = req.read().decode()
with open('List_of_postal_codes_of_Canada:_M', 'w', encoding='utf-8') as fo:
    fo.write(article)

**Scraping with the Package BeautifulSoup**

Extract all the `<table>` tags and search for the one whose headings match the data we want. Then iterate over its rows, pull out the first three columns, and write the cell text to the file 'List_of_postal_codes_of_Canada:_M', which overwrites the saved HTML copy. The file should be interpreted as UTF-8 encoded.

In [3]:
# Load article, turn into soup and get the <table>s.
with open('List_of_postal_codes_of_Canada:_M', encoding='utf-8') as fi:
    article = fi.read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

# Search through the tables for the one with the headings we want.
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    # Compare the first three headings with the column names used on the Wikipedia page.
    if headings[:3] == ['Postcode', 'Borough', 'Neighbourhood']:
        break

# Extract the columns we want and write to a semicolon-delimited text file.
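with open('List_of_postal_codes_of_Canada:_M', 'w', encoding='utf-8') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            # Header rows contain only <th> cells; skip them.
            continue
        # Only the first three cells are needed: postcode, borough, neighbourhood.
        postcode, borough, neighbourhood = [td.text.strip() for td in tds[:3]]
        # The '; ' separator leaves a leading space in every field after the first.
        print('; '.join([postcode, borough, neighbourhood]), file=fo)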
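
The row-extraction idea above can be checked offline, without fetching the page. The following is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser`, so it has no third-party dependencies. The class name `TableRows` and the inline `SAMPLE_HTML` are illustrative assumptions, not part of the original notebook.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup that mimics the structure of the postal code table.
SAMPLE_HTML = """
<table class="sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td><a href="#">Parkwoods</a></td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

Separating the header row from the data rows this way makes it easy to verify the headings before trusting the extracted columns.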

In [4]:
# Read file 'List_of_postal_codes_of_Canada:_M' with Pandas and create Pandas.dataframe
data = pd.read_csv('List_of_postal_codes_of_Canada:_M', sep=";", header=None, names=["PostalCode", "Borough", "Neighborhood_Draft"])
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [5]:
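# sort_values returns a new dataframe; the result is not stored, so 'data' keeps its original order.
data.sort_values('Borough',ascending=True)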
data.shape

(287, 3)

#### 2) Drop rows where 'Borough' has the value 'Not assigned'

In [6]:
# Find all cells in column 'Borough' containing ' Not assigned'
data[data['Borough'].str.contains('Not assigned',regex=False)]

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
8,M8A,Not assigned,Not assigned
12,M2B,Not assigned,Not assigned
19,M7B,Not assigned,Not assigned
20,M8B,Not assigned,Not assigned
29,M2C,Not assigned,Not assigned
35,M7C,Not assigned,Not assigned
36,M8C,Not assigned,Not assigned
44,M2E,Not assigned,Not assigned


In [7]:
# Drop all rows where column 'Borough' equals ' Not assigned' (note the leading space left by the '; ' separator)
to_drop = [' Not assigned']
data_new = data[~data['Borough'].isin(to_drop)]
data_new.shape

(210, 3)
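
The filter above only matches because `to_drop` includes the leading space that the '; ' separator leaves in each field. A minimal sketch of a less fragile variant, assuming toy data that mimics the file, strips the text columns first and then filters on the clean value. The names `df`, `clean`, and `kept` are illustrative.

```python
import pandas as pd

# Toy rows mimicking the '; '-joined file: every field after the first
# carries a leading space.
df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": [" Not assigned", " North York", " Queen's Park"],
    "Neighborhood_Draft": [" Not assigned", " Parkwoods", " Not assigned"],
})

# Without stripping, an exact match on the clean value finds nothing.
unstripped_hits = int(df["Borough"].isin(["Not assigned"]).sum())

# Strip the text columns once, then filter on the clean value.
text_cols = ["Borough", "Neighborhood_Draft"]
clean = df.copy()
clean[text_cols] = clean[text_cols].apply(lambda col: col.str.strip())
kept = clean[clean["Borough"] != "Not assigned"].reset_index(drop=True)

print(unstripped_hits)
print(kept)
```

Another option is to pass `skipinitialspace=True` to `pd.read_csv`, which drops the spaces after the delimiter at load time.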

#### 3) Combine 'Neighborhood' values that share the same 'PostalCode'

In [8]:
# Group all Postal codes with more than one neighborhood, and join corresponding neighborhoods
data_new = data_new.groupby(['PostalCode','Borough'])['Neighborhood_Draft'].apply(', '.join).reset_index()
data_new.head(103)

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ..."
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [9]:
data_new.shape

(103, 3)
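
To see what the grouping step does on a small scale, here is a minimal sketch with three made-up input rows. Two rows share the postal code M1B and collapse into one row with a comma-separated neighborhood list.

```python
import pandas as pd

rows = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood_Draft": ["Rouge", "Malvern", "Woburn"],
})

# One output row per (PostalCode, Borough) pair; the neighborhoods of each
# group are joined into a single string in their original order.
combined = (
    rows.groupby(["PostalCode", "Borough"])["Neighborhood_Draft"]
        .agg(", ".join)
        .reset_index()
)

print(combined)
```

Grouping on both 'PostalCode' and 'Borough' keeps the borough as a regular column after `reset_index()`.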

#### 4) Replace 'Neighborhood' where the value is 'Not assigned' with the value of 'Borough'

In [10]:
# Find rows where value of Neighborhood_Draft is "Not assigned"
data_new[data_new['Neighborhood_Draft'].str.contains('Not assigned')]

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft
85,M7A,Queen's Park,Not assigned


In [11]:
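# Replace 'Neighborhood_Draft' where the value is 'Not assigned' with the value of 'Borough'
# (taken from the 'Borough' column of the same row rather than hardcoded)
is_assigned = data_new['Neighborhood_Draft'].str.strip() != 'Not assigned'
data_final = data_new['Neighborhood_Draft'].where(is_assigned, data_new['Borough'])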
data_final.head(103)

0                                        Rouge,  Malvern
1               Highland Creek,  Rouge Hill,  Port Union
2                    Guildwood,  Morningside,  West Hill
3                                                 Woburn
4                                              Cedarbrae
5                                    Scarborough Village
6          East Birchmount Park,  Ionview,  Kennedy Park
7                      Clairlea,  Golden Mile,  Oakridge
8       Cliffcrest,  Cliffside,  Scarborough Village ...
9                           Birch Cliff,  Cliffside West
10      Dorset Park,  Scarborough Town Centre,  Wexfo...
11                                    Maryvale,  Wexford
12                                             Agincourt
13             Clarks Corners,  Sullivan,  Tam O'Shanter
14      Agincourt North,  L'Amoreaux East,  Milliken,...
15                                       L'Amoreaux West
16                                           Upper Rouge
17                             

In [12]:
data_submit_draft=data_final.rename("Neighborhood")
data_submit_draft.head()

0                              Rouge,  Malvern
1     Highland Creek,  Rouge Hill,  Port Union
2          Guildwood,  Morningside,  West Hill
3                                       Woburn
4                                    Cedarbrae
Name: Neighborhood, dtype: object

In [13]:
# join dataframe with updated series "Neighborhood"
data_submit=pd.concat([data_new,data_submit_draft],axis=1, join='inner')
data_submit

Unnamed: 0,PostalCode,Borough,Neighborhood_Draft,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern","Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union","Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill","Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn,Woburn
4,M1H,Scarborough,Cedarbrae,Cedarbrae
5,M1J,Scarborough,Scarborough Village,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park","East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge","Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ...","Cliffcrest, Cliffside, Scarborough Village ..."
9,M1N,Scarborough,"Birch Cliff, Cliffside West","Birch Cliff, Cliffside West"


In [14]:
# Drop column "Neighborhood_Draft", create final dataframe
data_submit_final=data_submit.drop(['Neighborhood_Draft'], axis=1)
data_submit_final

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ..."
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [15]:
data_submit_final.shape

(103, 3)
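
The replace, rename, concat, and drop cells of step 4 can also be collapsed into a single expression. This is a minimal sketch on made-up rows, assuming the same column names as above. The variable names `df`, `unassigned`, and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "PostalCode": ["M1B", "M7A"],
    "Borough": ["Scarborough", "Queen's Park"],
    "Neighborhood_Draft": ["Rouge, Malvern", "Not assigned"],
})

# True for rows whose neighborhood is the placeholder value.
unassigned = df["Neighborhood_Draft"].str.strip() == "Not assigned"

# mask() replaces the flagged values with the borough of the same row;
# the draft column is dropped in the same chain.
result = (
    df.assign(Neighborhood=df["Neighborhood_Draft"].mask(unassigned, df["Borough"]))
      .drop(columns="Neighborhood_Draft")
)

print(result)
```

Because the borough is read from the same row, the expression keeps working if more than one postal code has an unassigned neighborhood.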

#### Part II Assign Geospatial Data to Postal Codes

In [19]:
# Read file 'Geospatial_Coordinates.csv' with Pandas and create Pandas.dataframe
geospatial= pd.read_csv('Geospatial_Coordinates.csv', sep=",")
geospatial.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [20]:
# Join dataframes 'data_submit_final' and 'geospatial'
All_data_draft=pd.concat([data_submit_final,geospatial],axis=1, join='inner')
All_data_draft

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,M1J,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",M1K,43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",M1L,43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ...",M1M,43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",M1N,43.692657,-79.264848


In [21]:
# Drop column "Postal Code", create final dataframe
All_data=All_data_draft.drop(['Postal Code'], axis=1)
All_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village ...",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [22]:
All_data.shape

(103, 5)
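
The `pd.concat(..., axis=1)` join above aligns rows by index position, so it relies on both dataframes listing the postal codes in the same order. A minimal sketch of a key-based alternative, using `merge` on the postal code, is shown below. The left frame is deliberately unsorted to show that the match does not depend on row order. The names `neighborhoods`, `coords`, and `merged` are illustrative.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M1C", "M1B", "M1E"],  # deliberately not sorted
    "Borough": ["Scarborough"] * 3,
    "Neighborhood": [
        "Highland Creek, Rouge Hill, Port Union",
        "Rouge, Malvern",
        "Guildwood, Morningside, West Hill",
    ],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Rename the key so both frames share one column name, then match on it.
# validate="one_to_one" raises an error if a postal code appears twice.
merged = neighborhoods.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="inner",
    validate="one_to_one",
)

print(merged)
```

With a shared key name, the merged frame has a single 'PostalCode' column, so no separate drop step is needed.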