# Exploring Toronto
For this project I need the Canadian postal codes that begin with M, which cover Toronto. I will scrape them from this Wikipedia page:
> https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

I first downloaded the page to my PC and uploaded it as a data asset. However, I struggled to use the file in the cloud environment, so in the interest of time I decided to scrape the content directly from the website instead.

To get the data from this website I will be learning a new library: **BeautifulSoup**.
The first step is to understand how to use it.

### Beautiful Soup
The documentation for BeautifulSoup is:
> https://beautiful-soup-4.readthedocs.io/en/latest/

Here is a tutorial video that I will be going through to understand it:
> https://www.youtube.com/watch?v=ng2o98k983k

The video covers the following topics at the times given below:
1. Using a file (not in a cloud environment): [09:00]
2. Website interactions: [20:00]

In the interest of time I will combine all three installation steps into a single cell.
##### Step1 - Installing BeautifulSoup
Installing the HTML manipulation tool into my notebook. BeautifulSoup exposes the parsed document as a tree of tags that can be searched and navigated.
##### Step2 - Installing an HTML parser
Installing the HTML parser (lxml). This is the part that reads the raw HTML and builds the parse tree that BeautifulSoup works with.
##### Step3 - Installing the Requests library [for working with web APIs]
Installing the library used to fetch web pages and interact with APIs.

In [1]:
!conda install beautifulsoup4
!conda install lxml
!conda install requests

Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    beautifulsoup4: 4.6.0-py35h442a8c9_1 --> 4.6.3-py35_0

beautifulsoup4 100% |################################| Time: 0:00:00  39.93 MB/s
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    libgcc-ng: 7.2.0-h7cc24e2_2     --> 8.2.0-hdf63c60_1    
    libxml2:   2.9.4-h6b072ca_5     --> 2.9.8-hf84eae3_0    
    libxslt:   1.1.29-hcf9102b_5    --> 1.1.33-h7d1a2b0_0   
    lxml:      4.1.0-py35ha401a81_0 --> 4.2.5-py35hefd8a0e_0

libgcc-ng-8.2. 100% |################################| Time: 0:00:00  86.60 MB/s
libxml2-2.9.8- 100% |################################| Time: 0:00:00  67.37 MB/s
libxslt-1.1.33 100% |################################| Time: 0:00:00  62.

##### Step4 - Importing for use

In [2]:
from bs4 import BeautifulSoup as bs
import requests as rq
import pandas as pd
import numpy as np
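
Before pointing BeautifulSoup at the live page, it helps to see how it navigates a table. The sketch below is only an illustration: the HTML string is made up, and it uses Python's built-in `html.parser` rather than `lxml` so that it runs without any extra parser installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML, shaped like a small postal-code table (not the real page)
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'
soup = BeautifulSoup(html, "html.parser")

# soup.table returns the first <table> element in the document
table = soup.table

# find_all returns every matching tag beneath the element it is called on
headers = [th.text.strip() for th in table.find_all("th")]
rows = [
    [td.text.strip() for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.td  # the header row contains <th> only, so tr.td is None there
]

print(headers)
print(rows)
```

The same three calls (`soup.table`, `find_all('th')`, `find_all('tr')`) are the ones used on the Wikipedia page in the next step.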

##### Step5 - Scraping the website
**_Part A: pulling table._**
In the cell below I simply pull the table into an initial DataFrame. This DataFrame still has multiple rows for a single postcode.

In [3]:
source = rq.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = bs(source, 'lxml')

# The first <table> on the page holds the postal codes
table = soup.table

# Header cells; strip() removes the trailing newline on the last header
column_names = [th.text.strip() for th in table.find_all('th')]

# Collect one dict per kept row, then build the DataFrame in a single step
rows = []
for tr in table.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    # Skip the header row (no <td> cells) and rows whose Borough is 'Not assigned'
    if len(cells) < 3 or cells[1] == 'Not assigned':
        continue
    postcode, borough, nhood = cells[:3]
    # Where a Borough is given but the Neighbourhood is 'Not assigned', reuse the Borough
    if nhood == 'Not assigned':
        nhood = borough
    rows.append({'Postcode': postcode, 'Borough': borough, 'Neighbourhood': nhood})

pc_DF = pd.DataFrame(rows, columns=column_names)
# pc_DF.loc[pc_DF['Postcode'] == 'M5A']  # Checking M5A against the original table to make sure the table was read correctly
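
The two 'Not assigned' rules applied in the loop above can be isolated and checked without touching the network. This is a minimal sketch: `clean_row` is a hypothetical helper that is not part of the notebook, and the sample rows are illustrative.

```python
def clean_row(cells):
    """Return (postcode, borough, neighbourhood), or None if the row is dropped.

    `cells` is a list of three strings, one per table column.
    """
    postcode, borough, nhood = cells
    # Rule 1: a row without a borough is dropped
    if borough == 'Not assigned':
        return None
    # Rule 2: a missing neighbourhood falls back to the borough name
    if nhood == 'Not assigned':
        nhood = borough
    return postcode, borough, nhood


# Illustrative sample rows covering the three cases
sample = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M7A', "Queen's Park", 'Not assigned'],
]
kept = [row for row in map(clean_row, sample) if row is not None]
print(kept)
```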

**_Part B: combining duplicate postcode rows (comma-separated values)._**
In the cell below I join the 'Neighbourhood' values of all rows that share a postcode, separated by a comma.

In [4]:
# Getting the Neighbourhoods with the same postcodes comma separated
new_pc_DF = pc_DF.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join)
new_pc_DF = new_pc_DF.to_frame().reset_index()
new_pc_DF.loc[new_pc_DF['Postcode'] == 'M7A']
# ['Postcode'] == 'M7A'
# type(new_pc_DF)

|    | Postcode | Borough | Neighbourhood |
|----|----------|---------|---------------|
| 85 | M7A | Queen's Park | Queen's Park |
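
The grouping step is easier to follow on a toy frame. The sketch below uses three hand-typed rows (an assumption for illustration, not the scraped data) to show how `groupby` followed by `','.join` collapses rows that share a postcode.

```python
import pandas as pd

# Toy data: two rows share postcode M1B, one row stands alone
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Group on the key columns, then join each group's neighbourhoods with a comma
merged = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .apply(','.join)
       .reset_index()
)
print(merged)
```

Grouping on both 'Postcode' and 'Borough' keeps the borough as a column in the result; grouping on 'Postcode' alone would drop it.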


In [5]:
new_pc_DF.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| 7 | M1L | Scarborough | Clairlea,Golden Mile,Oakridge |
| 8 | M1M | Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff,Cliffside West |


In [8]:
print('The shape of the DF is: {}'.format(new_pc_DF.shape))

The shape of the DF is: (103, 3)
