## Segmenting and Clustering Neighborhoods in Toronto
## Part 1: Web Scraping and Dataframe Construction

### Applied Data Science Project

#### Luis Andrade, August 2021

We are going to extract neighborhood information about Toronto by scraping a Wikipedia page.

To achieve this, we will scrape the web page using the Beautiful Soup library.

If you don't have Beautiful Soup installed, run this installation cell first. It also installs `html5lib`, the parser used below. Otherwise, skip it and go to the next cell.

In [None]:
!pip install beautifulsoup4 html5lib

Import the libraries:

In [1]:
from bs4 import BeautifulSoup  # Web scraping library
import requests  # HTTP requests
import pandas as pd  # Dataframes

Store the URL of the page that holds the required data in a variable:

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Download the web page content in text format:

In [3]:
data_req = requests.get(url).text

Create a soup object:

In [4]:
soup = BeautifulSoup(data_req, 'html5lib')

Locate the table that holds the postal code data:

In [5]:
table = soup.find('table')  # find() returns the first <table> element on the page

The next step is to wrangle the table data into a structured form organized by postal code, borough and neighborhood. First we create an empty list, then fill it with one dictionary per html cell (**td**) of the web page table. There are some additional considerations:

- Cells marked 'Not assigned' are omitted.
- Postal codes are 3 characters long.
- The borough comes after the postal code and before the first opening parenthesis '('.
- Neighborhoods come next, inside the parentheses and separated by slashes, which we replace with commas for our dataframe.

In [6]:
# Create an empty list that will be populated with one dictionary per cell
code_table = []

for td in table.find_all('td'):
    span_text = td.span.text
    # Skip cells whose postal code is not assigned
    if span_text == 'Not assigned':
        continue
    cell = {}
    # The postal code is the first 3 characters of the cell text
    cell['PostalCode'] = td.p.text[:3]
    # The borough is everything before the opening parenthesis
    cell['Borough'] = span_text.split('(')[0]
    # Neighborhoods sit inside the parentheses, separated by slashes
    neighborhoods = span_text.split('(')[1].strip(')')
    cell['Neighborhood'] = neighborhoods.replace(' /', ',').replace(')', ' ').strip(' ')
    code_table.append(cell)
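
The string handling inside the loop can be tried in isolation before running it on the whole table. The sketch below is a minimal illustration, assuming the text of each `span` looks like `Borough(Neighborhood A / Neighborhood B)`. The helper name `parse_span_text` and the sample strings are hypothetical and were not taken from the scraped page.

```python
def parse_span_text(span_text):
    """Split 'Borough(Hood A / Hood B)' into (borough, neighborhoods).

    Hypothetical helper that mirrors the string operations of the loop.
    Returns None for cells marked 'Not assigned'.
    """
    if span_text == 'Not assigned':
        return None
    parts = span_text.split('(')
    # Everything before the first '(' is the borough
    borough = parts[0]
    # Drop the closing parenthesis and turn ' /' separators into commas
    neighborhoods = parts[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    return borough, neighborhoods


# Sample strings are assumptions about the cell layout
print(parse_span_text('Downtown Toronto(Regent Park / Harbourfront)'))
print(parse_span_text('North York(Parkwoods)'))
print(parse_span_text('Not assigned'))
```

Like the loop, this sketch raises an `IndexError` if a cell has no opening parenthesis, so it only suits pages that follow the assumed layout.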

Convert the Toronto postal code table to a pandas dataframe:

In [7]:
df=pd.DataFrame(code_table)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


The head looks good. However, the web page contains some formatting anomalies that run several pieces of text together in certain Toronto borough names, and these carried over into our dataframe, for example:
- MississaugaCanada Post Gateway Processing Centre
- Downtown TorontoStn A PO Boxes25 The Esplanade
- EtobicokeNorthwest
- East TorontoBusiness reply mail Processing Centre969 Eastern

In [8]:
df['Borough'].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'East YorkEast Toronto', 'Central Toronto',
       'MississaugaCanada Post Gateway Processing Centre',
       'Downtown TorontoStn A PO Boxes25 The Esplanade',
       'EtobicokeNorthwest',
       'East TorontoBusiness reply mail Processing Centre969 Eastern'],
      dtype=object)
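
Rather than spotting these names by eye, one option is to flag any name in which a lowercase letter is immediately followed by an uppercase letter. The sketch below is only a heuristic and not part of the original analysis. The names `FUSED` and `looks_fused` are hypothetical, and the list is a hand-copied subset of the output above.

```python
import re

# A lowercase letter directly followed by an uppercase letter suggests
# that two pieces of text were joined without a space (heuristic only)
FUSED = re.compile(r'[a-z][A-Z]')


def looks_fused(name):
    """Return True if the name contains a lowercase-uppercase junction."""
    return bool(FUSED.search(name))


# Subset of the borough names shown in the output above
boroughs = [
    'North York',
    "Queen's Park",
    'East YorkEast Toronto',
    'EtobicokeNorthwest',
    'MississaugaCanada Post Gateway Processing Centre',
]

suspects = [b for b in boroughs if looks_fused(b)]
print(suspects)
```

In the notebook, the same check could be applied to `df['Borough'].unique()`. Legitimate names with internal capitals would also be flagged, so the result still needs a manual review.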

We replace these anomalous borough names with cleaner ones:

In [9]:
df['Borough'] = df['Borough'].replace({
                                      'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                      'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                      'EtobicokeNorthwest':'Etobicoke Northwest',
                                      'East YorkEast Toronto':'East York/East Toronto',
                                      'MississaugaCanada Post Gateway Processing Centre':'Mississauga'
                                      })
df['Borough'].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'East York/East Toronto', 'Central Toronto', 'Mississauga',
       'Downtown Toronto Stn A', 'Etobicoke Northwest',
       'East Toronto Business'], dtype=object)

Now the boroughs have clean names. Finally, we display the first 10 rows and check the shape of the Toronto postal code dataframe:

In [10]:
print(df.shape)
df.head(10)

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


As we can see, the dataframe has 103 rows, one per postal code, and 3 columns: postal code, borough and neighborhood.
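
As an optional sanity check, the postal codes can be validated against the expected format. The sketch below assumes every Toronto code consists of the letter M, a digit and an uppercase letter. The names `FSA_PATTERN` and `invalid_codes` are hypothetical, and the sample list is copied from the rows displayed above.

```python
import re

# Assumed format of a Toronto postal code prefix: 'M', a digit, a letter
FSA_PATTERN = re.compile(r'M\d[A-Z]')


def invalid_codes(codes):
    """Return the codes that do not match the assumed format."""
    return [c for c in codes if not FSA_PATTERN.fullmatch(c)]


sample = ['M3A', 'M4A', 'M5A', 'M1B']
print(invalid_codes(sample))          # []
print(invalid_codes(['M3A', 'Not']))  # ['Not']
```

In the notebook, `invalid_codes(df['PostalCode'])` should return an empty list if the scraping step sliced the codes correctly.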

In [11]:
df.to_csv('toronto_codes.csv')
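
Note that `to_csv` writes the dataframe index as an extra, unnamed first column by default, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates the difference on a small stand-in dataframe, using an in-memory buffer instead of a file. Whether to pass `index=False` depends on how the file is read later, so treat this as an option rather than a required change.

```python
import io

import pandas as pd

# Small stand-in for the Toronto dataframe (first two rows only)
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# Default: the index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
print(list(with_index.columns))

# index=False: only the three data columns are written
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))
print(list(without_index.columns))
```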