Step 1: Notebook created!!
--------------------------------------

<hr />

Step 2: Scrape the [wiki page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and create a dataframe
-----------------------------------------------------------------------------------------------------------------------------------

#### Import some useful stuff:

In [1]:
import pandas as pd

import requests
from bs4 import BeautifulSoup # Great web scraping library!

#### Download and parse the web page:

In [2]:
page_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_response = requests.get(page_link, timeout=5) # Use GET to obtain the response
page_content = BeautifulSoup(page_response.content, 'html.parser') # Use html parser to parse the page

#### Let's extract the table from the content:

In [3]:
target_table = page_content.find('table', {'class': 'wikitable sortable'})
type(target_table)

bs4.element.Tag

#### And now let's define some helper functions

Function to count rows and columns (here we assume that every row has exactly the same number of columns)

In [8]:
def count_rows_and_columns(table):
    n_rows = n_columns = 0
    
    for row in table.find_all('tr'): # 'tr' is the html tag that defines a Table Row
        data_cells_in_row = row.find_all('td') # 'td' is the html tag that defines a Table Data cell
        if data_cells_in_row: # if the row contains any data cells (the header row has none)
            n_rows += 1 # count this row
            if n_columns == 0: # and if we have not counted the number of columns yet
                n_columns = len(data_cells_in_row) # count it too
    return n_rows, n_columns
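
A quick way to check the counting logic without hitting the network is to run it on a tiny hand-written table. The sketch below repeats the function so that it runs on its own; `SAMPLE_HTML` and `sample_table` are illustrative names, and the two data rows are sample input rather than output scraped from the page.

```python
from bs4 import BeautifulSoup

# Hand-written sample table (an assumption for illustration, not scraped output)
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

def count_rows_and_columns(table):
    """Count rows that contain <td> cells; take the column count from the first such row."""
    n_rows = n_columns = 0
    for row in table.find_all('tr'):
        data_cells_in_row = row.find_all('td')
        if data_cells_in_row:  # the header row has only <th> cells, so it is skipped
            n_rows += 1
            if n_columns == 0:
                n_columns = len(data_cells_in_row)
    return n_rows, n_columns

sample_table = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('table')
shape = count_rows_and_columns(sample_table)
print(shape)  # (2, 3)
```

The header row is not counted because it holds `th` cells only, so the result describes the data rows.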

Function to get column names (here we assume that column names are defined only once in the table)

In [9]:
def get_column_names_list(table):
    column_names = []
    column_name_elements = table.find_all('th') # 'th' is html tag that defines Table Header (column name)
    
    if len(column_name_elements): # if there are some column names
        for column_name_element in column_name_elements: # cycle through column name elements
            column_names.append(column_name_element.get_text())
    
    return column_names
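
One thing to watch for: `get_text()` returns the cell text exactly as it appears in the markup, including any trailing newline. A minimal sketch of the difference, assuming header cells that end with a line break (the `header_html` string is a made-up example), uses the `strip=True` argument of `get_text`:

```python
from bs4 import BeautifulSoup

# Made-up header row whose cells end with a newline (an assumption for illustration)
header_html = "<table><tr><th>Postcode\n</th><th>Borough\n</th><th>Neighbourhood\n</th></tr></table>"
header_table = BeautifulSoup(header_html, 'html.parser').find('table')

# Plain get_text() keeps the trailing newline from the markup
raw_names = [th.get_text() for th in header_table.find_all('th')]

# get_text(strip=True) removes leading and trailing whitespace
clean_names = [th.get_text(strip=True) for th in header_table.find_all('th')]

print(raw_names)
print(clean_names)
```

Stripping matters later, because these names may end up as dictionary keys and DataFrame column labels.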

Function to get one row at a specific index (yes, extracting all the rows from the table every time we want only one row is not efficient, but in our case it's acceptable and very convenient. Also, let's assume that the column names come in the first row, so we can simplify things and just throw it away)

In [14]:
def get_row_as_dict(table, row_index):
    table_rows = table.find_all('tr')
    table_rows.pop(0) # throw away row with column names
    column_names = get_column_names_list(table)
    result_row = {}
    for column_name, column_data_element in zip(column_names, table_rows[row_index].find_all('td')):
        result_row[column_name] = column_data_element.get_text()
    return result_row
    
    

In [13]:
a = target_table.find_all('tr')
a.pop(0)
a[3].find_all('td')

[<td>M4A</td>,
 <td><a href="/wiki/North_York" title="North York">North York</a></td>,
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td>]
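
Putting the pieces together, the rows can be collected into a list of dictionaries and handed to `pd.DataFrame`. This is only a sketch under stated assumptions: `table_to_dataframe` is a hypothetical helper name, the sample table is hand-written (its last row mirrors the `M4A` cell markup shown above), and whitespace is stripped from every cell.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written sample table (an assumption for illustration, not scraped output)
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td>
      <td><a href="/wiki/North_York" title="North York">North York</a></td>
      <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
      </td></tr>
</table>
"""

def table_to_dataframe(table):
    """Hypothetical helper: one dict per data row, keyed by the header names."""
    column_names = [th.get_text(strip=True) for th in table.find_all('th')]
    records = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if not cells:  # skip the header row, which has no <td> cells
            continue
        records.append({name: cell.get_text(strip=True)
                        for name, cell in zip(column_names, cells)})
    return pd.DataFrame(records, columns=column_names)

sample_table = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('table')
df = table_to_dataframe(sample_table)
print(df)
```

Unlike fetching one row at a time, this walks the table once, which is cheaper when every row is needed anyway.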