# Segmenting and Clustering Neighborhoods in Toronto

## Applied Data Science Capstone Project by Martin Manjolo - Part One

### Introduction

This is part 1 of 3 notebooks that explore, segment, and cluster the neighbourhoods in the city of Toronto. Part 1 scrapes Wikipedia data and builds a dataframe.

The Wikipedia page is available here: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

This notebook uses Python 3.8.

### Library and Dependency Import

In [1]:
import requests
import lxml.html as lh
import bs4 as bs
import urllib.request

import pandas as pd

### Data Source

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

### Scraping HTML tables using BeautifulSoup

In [3]:
# scrape table
def scrape_table(cname, cols):
    page = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(page, 'lxml')
    table = soup.find("table", class_=cname)
    header = [head.findAll(text=True)[0].strip() for head in table.find_all("th")]
    data = [[td.findAll(text=True)[0].strip() for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    data = [row for row in data if len(row) == cols]
    # Store data to this temporary data frame
    raw_df = pd.DataFrame(data, columns=header)
    return raw_df

# Alternative parser using lxml and XPath
def scrape_table_lxml(xpath, cols):
    page = requests.get(url)
    doc = lh.fromstring(page.content)
    table_content = doc.xpath(xpath)
    # Only the first table matched by the XPath expression is parsed
    for table in table_content:
        # './/' keeps each query inside this table; a bare '//' searches the whole page
        headers = [th.text_content().strip() for th in table.xpath('.//th')]
        headers = headers[:cols]
        data = [[td.text_content().strip() for td in tr.xpath('td')]
                for tr in table.xpath('.//tr')]
        data = [row for row in data if len(row) == cols]
        raw_df = pd.DataFrame(data, columns=headers)
        return raw_df
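
The row filter in the functions above is easier to see on a small offline input. The following is a minimal sketch, assuming a hypothetical three-row HTML snippet (`SAMPLE_HTML`) and a hypothetical helper `parse_table`; it uses the built-in `html.parser` backend and `get_text(strip=True)` instead of the `lxml` backend and `findAll(text=True)` used in the notebook, so it is not the notebook's exact method.

```python
import bs4 as bs
import pandas as pd

# Hypothetical miniature of the Wikipedia table, used instead of a live request.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def parse_table(html, cname, cols):
    """Parse the first table with CSS class `cname` into a dataframe."""
    soup = bs.BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=cname)
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    # The header row has no <td> cells, so the length filter drops it.
    rows = [row for row in rows if len(row) == cols]
    return pd.DataFrame(rows, columns=header)


sample_df = parse_table(SAMPLE_HTML, "wikitable", 3)
print(sample_df)
```

Keeping only rows with exactly `cols` cells is what removes the header row and any malformed rows before the dataframe is built.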

In [4]:
# Scrape the table with the BeautifulSoup parser
raw_TorontoPostalCodes = scrape_table("wikitable", 3)

# Summarise the scraped dataframe
print("# Toronto Postal codes stored in data")
print(raw_TorontoPostalCodes.info(verbose=True))

# Toronto Postal codes stored in data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Postal code   180 non-null    object
 1   Borough       180 non-null    object
 2   Neighborhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB
None


### Data Cleanup and Dataframe Creation

The Wikipedia table data is loaded into a pandas dataframe and cleaned according to the following rules:
* The dataframe has 3 columns: Postal_Code, Borough and Neighborhood
* Rows without an assigned borough are removed
* When a postal area contains more than one neighbourhood, the neighbourhoods are combined into one row, separated by a comma
* When a row has a borough but no neighbourhood, the neighbourhood takes the borough's name
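
The three cleanup rules can be sketched on a small example. This is an illustration under stated assumptions: the rows in `raw` are hypothetical and chosen only to exercise each rule, and the `groupby` step is one possible way to combine neighbourhoods, not necessarily what this notebook ends up needing.

```python
import pandas as pd

# Hypothetical rows chosen to exercise all three rules; not taken from the page.
raw = pd.DataFrame({
    "Postal_Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# Remove rows without an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# A neighbourhood that is not assigned takes the borough's name.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Combine neighbourhoods that share a postal code into one comma-separated row.
clean = (clean.groupby(["Postal_Code", "Borough"])["Neighborhood"]
              .agg(", ".join)
              .reset_index())
print(clean)
```

Filling the missing neighbourhoods before grouping means the join never has to handle the placeholder text.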

In [5]:
df=pd.DataFrame(raw_TorontoPostalCodes) 
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


In [6]:
# Rename the 'Postal code' column
df.rename(columns = {'Postal code':'Postal_Code'}, inplace = True)
df.head()

| | Postal_Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


In [7]:
# Drop rows with unassigned boroughs
df1=df[~df.Borough.str.contains("Not assigned")]
df1=df1.reset_index(drop=True)
df1.head()

| | Postal_Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |


In [8]:
# Replace 'Not assigned' neighbourhoods with the borough's name
df1.loc[df1['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df1['Borough']
df1.head(10)

| | Postal_Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern / Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill / Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [9]:
# Check for duplicate rows. It's a bit of a scroll...
pd.set_option('display.max_rows', None)
df1.duplicated(subset=None, keep='first')

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
30     False
31     False
32     False
33     False
34     False
35     False
36     False
37     False
38     False
39     False
40     False
41     False
42     False
43     False
44     False
45     False
46     False
47     False
48     False
49     False
50     False
51     False
52     False
53     False
54     False
55     False
56     False
57     False
58     False
59     False
60     False
61     False
62     False
63     False
64     False
65     False
66     False
67     False
68     False
69     False
70     False
71     False
72     False
73     False
74     False
75     False
76     False

Every row returns False, so the dataframe contains no duplicate rows. The neighbourhoods are already combined for each postal code in the source table, so the dataframe is ready.
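
Scrolling through the full boolean series can be avoided by reducing the check to a single value. The sketch below assumes a hypothetical three-row stand-in (`sample`) for `df1`; it also checks that postal codes are unique, which is a stricter test than whole-row duplication because two rows could share a postal code and still differ in neighbourhood.

```python
import pandas as pd

# Hypothetical stand-in for df1 with the same column names.
sample = pd.DataFrame({
    "Postal_Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park / Harbourfront"],
})

# True if any row repeats an earlier row exactly.
has_duplicate_rows = bool(sample.duplicated().any())

# True if every postal code appears only once.
codes_are_unique = bool(sample["Postal_Code"].is_unique)

print(has_duplicate_rows, codes_are_unique)  # prints: False True
```

If `codes_are_unique` were False, the neighbourhoods for the repeated postal codes would still need to be combined.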

### Final Dataframe and Shape

In [10]:
df1.head(20)

| | Postal_Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern / Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill / Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


Export the dataframe to CSV for use in the other notebooks.

In [11]:
df1.to_csv('toronto_post_codes.csv', index=False)
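
A later notebook could load the exported file with `pd.read_csv`. The round trip below is a sketch only: it writes a hypothetical two-row dataframe (`sample`) to a temporary directory instead of the working directory, and how Parts 2 and 3 actually read the file is an assumption.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for df1.
sample = pd.DataFrame({
    "Postal_Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toronto_post_codes.csv")
    # index=False keeps the row index out of the file, so no
    # 'Unnamed: 0' column appears when the file is read back.
    sample.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```

Writing with `index=False` means the reloaded dataframe has the same three columns as the one that was exported.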

### End of Part 1

In [12]:
df1.shape

(103, 3)