# Toronto Neighborhoods Assignment (1)

### Heriberto Encinas López

## Page scraping

This first section covers scraping the web page and extracting the relevant fields: postal code, borough name and neighbourhood name.

#### Import the relevant libraries

In [1]:
#first we import the libraries we need
import pandas as pd
import requests
from bs4 import BeautifulSoup
import lxml

#### Use beautiful soup to extract data from the HTML file

In [2]:
#then we define the target url: 'list of postal codes of Canada'
wiki_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
#then we get the raw page and extract the text
raw_wiki_page = requests.get(wiki_link)
#stop early if the request failed (4xx/5xx status)
raw_wiki_page.raise_for_status()
wiki_text = raw_wiki_page.text

In [4]:
#then we create a soup object from the html text
soup = BeautifulSoup(wiki_text,'lxml')
#prettify() returns a string, so it has to be printed to be of any use
#print(soup.prettify())
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [5]:
#then we extract the postal codes table from the page
table = soup.find('table')
#print(table)
print(type(table))

<class 'bs4.element.Tag'>


In [6]:
#then we extract the table's header
header = table.find_all('th')
print(header)

[<th>Postcode</th>, <th>Borough</th>, <th>Neighbourhood
</th>]


In [7]:
#then we extract the text of each element of the header
heads = [th.text.rstrip() for th in header]
print(heads)

['Postcode', 'Borough', 'Neighbourhood']


In [8]:
#then we do the same for the data inside the table
data = table.find_all('td')
#cells come as a flat sequence: postcode, borough, neighbourhood, repeated per row
cells = [cell.text.rstrip() for cell in data]
pcode = cells[0::3]
bor = cells[1::3]
neigh = cells[2::3]
print('We have',len(pcode), 'Postal code entries')
print('We have',len(bor), 'Borough entries')
print('We have',len(neigh), 'Neighbourhood entries')

We have 288 Postal code entries
We have 288 Borough entries
We have 288 Neighbourhood entries
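
The cell above reads the `td` cells as one flat sequence, which assumes every row has exactly three cells. A row-wise variant is sketched below on a small inline HTML fragment. The fragment and the function name `parse_rows` are illustrative assumptions, not part of the notebook, and the built-in `html.parser` is used here instead of `lxml` to keep the sketch dependency-light. Walking each `tr` keeps the three columns aligned even if a row is malformed.

```python
from bs4 import BeautifulSoup

# Illustrative HTML fragment standing in for the Wikipedia table (assumption).
SAMPLE_HTML = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html):
    """Return (header, rows) from the first table found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    header = [th.get_text().strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        # The header row has no <td> cells, so it is skipped here,
        # as is any row whose cell count does not match the header.
        if len(cells) == len(header):
            rows.append(cells)
    return header, rows

header, rows = parse_rows(SAMPLE_HTML)
print(header)
print(rows)
```

The resulting `rows` list can be passed straight to `pd.DataFrame(rows, columns=header)`.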


## Building the initial dataframe

In this section we clean the extracted data with pandas. This is the preprocessing stage: at the end of it we will have the final dataframe that we will use for clustering.

In [9]:
#here we create our first data frame
data = pd.DataFrame(columns = heads)
data.head()
type(data)

pandas.core.frame.DataFrame

In [10]:
#then we add the extracted data to the data frame
data['Postcode'] = pcode
data['Borough'] = bor
data['Neighbourhood'] = neigh
print(data.head(5))
print(data.shape)

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
(288, 3)


In [11]:
#this is a checkpoint
toclean = data.copy()

In [12]:
#here we drop the rows with no borough assigned to them
#.copy() makes the filtered frame independent of the original
toclean = toclean[toclean['Borough']!='Not assigned'].copy()
toclean.head()
len(toclean)

211

In [13]:
#we compute number of rows with 'Neighbourhood' = 'Not assigned'
sum(toclean['Neighbourhood']=='Not assigned')

1

In [14]:
#we get a boolean mask locating the instance
index_ = toclean['Neighbourhood']=='Not assigned'
#we set neighbourhood equal to borough
#(.loc writes to the frame directly instead of using chained indexing)
toclean.loc[index_,'Neighbourhood'] = toclean.loc[index_,'Borough']

In [15]:
#we recompute the number of rows with 'Neighbourhood' = 'Not assigned' to confirm none remain
sum(toclean['Neighbourhood']=='Not assigned')

0
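
The replacement step above can also be written as a single `Series.where` call, which keeps a value where a condition holds and takes a fallback otherwise. A minimal sketch on a two-row toy frame (the rows are illustrative stand-ins, not read from the scraped table):

```python
import pandas as pd

# Toy frame (illustrative rows) with one unassigned neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Keep the neighbourhood where it is assigned; otherwise fall back to the borough.
assigned = toy["Neighbourhood"] != "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].where(assigned, toy["Borough"])
print(toy)
```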

In [16]:
#print head
toclean.head()

  Postcode           Borough     Neighbourhood
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
5      M5A  Downtown Toronto       Regent Park
6      M6A        North York  Lawrence Heights


In [17]:
#this is a checkpoint
clean_data = toclean.copy()
clean_data.head()

  Postcode           Borough     Neighbourhood
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
5      M5A  Downtown Toronto       Regent Park
6      M6A        North York  Lawrence Heights


In [18]:
#here we sort the data based on borough
clean_data.sort_values('Borough',inplace=True)
clean_data.head()

    Postcode          Borough    Neighbourhood
212      M4V  Central Toronto   Forest Hill SE
197      M4T  Central Toronto  Summerhill East
196      M4T  Central Toronto       Moore Park
211      M4V  Central Toronto        Deer Park
183      M4S  Central Toronto       Davisville


In [19]:
#here we group data based on Postcode and Borough
#this is done to get neighbourhoods in same borough/postalcode 
grouped_data = clean_data.groupby(['Postcode','Borough']).apply(lambda x: ', '.join(x['Neighbourhood']))
grouped_data.head()

Postcode  Borough    
M1B       Scarborough                            Malvern, Rouge
M1C       Scarborough    Rouge Hill, Port Union, Highland Creek
M1E       Scarborough         Guildwood, West Hill, Morningside
M1G       Scarborough                                    Woburn
M1H       Scarborough                                 Cedarbrae
dtype: object
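
A grouped Series like this one can also be turned into a three-column frame directly, without writing to disk. A minimal sketch on toy rows, assuming the same column names as above (the values are illustrative):

```python
import pandas as pd

# Toy rows mirroring the cleaned frame's layout (values are illustrative).
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Join the neighbourhoods of each (Postcode, Borough) pair, then turn
# the group keys back into ordinary columns with reset_index().
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```

Because `groupby` sorts by the group keys by default, the result is ordered by postcode, and `reset_index()` yields a fresh 0-based index.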

In [20]:
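#this can be seen as another checkpoint
#header=False keeps the file free of a header row, since column names are supplied when reading it back
grouped_data.to_csv('grouped_data.csv',header=False)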

In [21]:
#we read the data into a frame
dataset = pd.read_csv('grouped_data.csv',names=heads)
dataset.head()

  Postcode      Borough                           Neighbourhood
0      M1B  Scarborough                          Malvern, Rouge
1      M1C  Scarborough  Rouge Hill, Port Union, Highland Creek
2      M1E  Scarborough       Guildwood, West Hill, Morningside
3      M1G  Scarborough                                  Woburn
4      M1H  Scarborough                               Cedarbrae


In [22]:
#finally, we reset index and save
#reset_index returns a new frame, so the result has to be assigned back
dataset = dataset.reset_index(drop=True)
dataset.to_csv('pbn_dataset.csv',index=False)
dataset.head()

  Postcode      Borough                           Neighbourhood
0      M1B  Scarborough                          Malvern, Rouge
1      M1C  Scarborough  Rouge Hill, Port Union, Highland Creek
2      M1E  Scarborough       Guildwood, West Hill, Morningside
3      M1G  Scarborough                                  Woburn
4      M1H  Scarborough                               Cedarbrae


### Final Dataset Shape:

In [23]:
#here we print the final dataset shape
print('The dataset shape is:',dataset.shape)

The dataset shape is: (103, 3)
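
Beyond the shape, a few cheap checks can confirm that the cleaning did what was intended: one row per postal code and no leftover 'Not assigned' values. The sketch below uses a hypothetical helper, `validate_postcode_frame`, on a toy frame with illustrative rows; it is not part of the notebook.

```python
import pandas as pd

def validate_postcode_frame(df):
    """Hypothetical helper: basic consistency checks on the final frame."""
    unassigned = (df[["Borough", "Neighbourhood"]] == "Not assigned").any().any()
    return {
        "unique_postcodes": bool(df["Postcode"].is_unique),
        "no_unassigned": not unassigned,
        "shape": df.shape,
    }

# Toy frame standing in for the final dataset (values are illustrative).
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})
report = validate_postcode_frame(toy)
print(report)
```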
