# Identifying and parsing tables

In this notebook we look at some of the functionality of the table-finding script contained in the `cleaning.py` module, and at a first attempt to parse tables from the output of this script.

In [1]:
import sys
sys.path.append('../scripts')

import re  # used explicitly by parse_table_to_dataframe below

import pandas as pd
from cleaning import *

In [2]:
Aggreko_path = "../data/raw/annual_reports/2020/Aggreko_Annual_Report_2020.pdf"
Microsoft_path = "../data/raw/annual_reports/2020/Microsoft_Annual_Report_2020.pdf"
Amazon_path = "../data/raw/annual_reports/2020/Amazon_Annual_Report_2020.pdf"
Just_Eat_path = "../data/raw/annual_reports/2020/Just_Eat_Annual_Report_2020.pdf"

In [3]:
Aggreko = parse_file(Aggreko_path)
Microsoft = parse_file(Microsoft_path)
Amazon = parse_file(Amazon_path)
Just_Eat = parse_file(Just_Eat_path)

In [4]:
Aggreko_tables = find_tables(Aggreko)
Microsoft_tables = find_tables(Microsoft)
Amazon_tables = find_tables(Amazon)
Just_Eat_tables = find_tables(Just_Eat)

An example for which many tables are found:

In [5]:
Aggreko_tables

{1: {'page': 9,
  'paragraph': 21,
  'raw_table_text': '0\n2020\n2025\n2024\n2021\n2023\n2022\n2030\n'},
 2: {'page': 10,
  'paragraph': 17,
  'raw_table_text': 'Just under 90% of our revenue comes  \nfrom seven key sectors:\n1. Petrochemical and refining\n2. Building services and construction\n3. Oil and gas\n4. Utilities\n5. Events\n6. Manufacturing\n7. Mining \nOther sectors served include data centres \nand shipping.\n'},
 3: {'page': 11,
  'paragraph': 8,
  'raw_table_text': 'What sets us apart? \nAggreko has a number of differentiating \nfactors leading to high levels of repeat \nbusiness from our customers. \n1. Brand strength and reputation \nCustomers value our flexibility, reliability \nand innovation in offering options for \nlower cost and emissions, and strong \nfocus on our ethics. \n2. Global network of sales  \nand service centres\nA truly global footprint, operating in  \n80 countries with 182 sales and service \ncentres. We respond quickly to worldwide \nevents, movin
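The raw dictionary is hard to scan when many candidates are found. As a minimal sketch, assuming only the structure visible in the output above (integer keys mapping to `page`, `paragraph` and `raw_table_text`), a short helper can condense each candidate to one row. The name `summarise_tables` is hypothetical and not part of `cleaning.py`; the sample entry is copied from the first candidate shown above.

```python
def summarise_tables(tables):
    """Return one summary dict per candidate table (hypothetical helper)."""
    rows = []
    for table_id, info in tables.items():
        lines = info["raw_table_text"].splitlines()
        rows.append({
            "table": table_id,
            "page": info["page"],
            "paragraph": info["paragraph"],
            "n_lines": len(lines),
            # First line gives a quick hint of what the candidate contains
            "first_line": lines[0] if lines else "",
        })
    return rows


# Sample entry copied from the first candidate in the output above
sample_tables = {
    1: {
        "page": 9,
        "paragraph": 21,
        "raw_table_text": "0\n2020\n2025\n2024\n2021\n2023\n2022\n2030\n",
    },
}

summary = summarise_tables(sample_tables)
print(summary)
```

The resulting list of dicts can be passed to `pd.DataFrame` for display in the notebook.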

### First attempt at parsing tables

Below is a draft function to parse tables from text. It only works in very limited cases, where the headers are clearly defined and come before the first numerical entry, so it was left out of the project implementation.

In [7]:
def parse_table_to_dataframe(table_string):
    """Parse a table from string format into a DataFrame.

    Args:
        table_string: string with newline characters separating
            headers and entries

    Returns:
        table_df: DataFrame containing the table data
    """
    # Find all entries separated by newline characters
    
    # Raw strings avoid invalid-escape warnings for \S and \d
    newline_matcher = re.compile(r'\S\n\S')
    numerical_matcher = re.compile(r'\S\n\d')
    list_of_matches = newline_matcher.findall(table_string)
    number_entries = len(list_of_matches)
    
    # Loop over the newline matches until the first one whose
    # second entry starts with a digit, and call its position n
    
    for n, match in enumerate(list_of_matches):
        if numerical_matcher.match(match) is not None:
            break
            
    # Define the table entries as a list
    
    table_entries = table_string.split('\n')[:-1]
    
    # Table strings usually contain all headers and the name
    # of the first column before the first numerical match
    # Headers are all the entries before the first column
    # and the first column is the entry before the first
    # numerical match
    # Indexes are usually not disclosed either,
    # so we must include one as a header
    
    headers = ['Index'] + table_entries[:n]
    
    # Start a dictionary that will be converted to a dataframe
    
    dataframe_dictionary = {}
    
    # For each header in position j, get the entries of the
    # column by taking every (n+1)-th entry, starting at
    # position n+j, until the end of the table entries
    
    for j, column in enumerate(headers):
        dataframe_dictionary[column] = table_entries[n + j::n + 1]
        
    return pd.DataFrame(dataframe_dictionary)
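The stride-based reshaping used by the function can be illustrated on its own. The following is a minimal sketch, not the project's implementation: `reshape_flat_table` is a hypothetical name, it detects the first numerical entry by checking each entry directly rather than with the regular expressions above, and it adds guards that the draft function does not have. The toy string assumes the layout the draft function expects (headers first, then rows of one label followed by numbers).

```python
def reshape_flat_table(table_string):
    """Split a newline-flattened table into a {header: column} dict.

    Hypothetical helper: assumes all value headers come first, followed by
    rows made of one label and then one numerical entry per header.
    """
    entries = [e for e in table_string.split("\n") if e]

    # Position of the first entry that starts with a digit
    first_numeric = next(
        (i for i, e in enumerate(entries) if e[:1].isdigit()), None
    )
    if first_numeric is None or first_numeric < 2:
        raise ValueError("no header row followed by a labelled numeric row")

    n = first_numeric - 1   # number of value headers
    width = n + 1           # one row label plus n values
    headers = ["Index"] + entries[:n]
    body = entries[n:]
    if len(body) % width != 0:
        raise ValueError("table body does not divide evenly into rows")

    # Column j is every width-th entry of the body, starting at offset j
    return {header: body[j::width] for j, header in enumerate(headers)}


# Toy input in the assumed layout (two value headers, two rows)
toy = "Average\nYear end\nEuro\n1.14\n1.17\nUAE Dirhams\n4.69\n4.8\n"
columns = reshape_flat_table(toy)
print(columns)
```

The returned dict has the same shape as `dataframe_dictionary` in the function above, so it could be passed to `pd.DataFrame` directly.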

### Successful parsing example

We found one table that this naive function parses successfully:

In [22]:
table = parse_table_to_dataframe(Aggreko_tables[14]['raw_table_text'])
table

|   | Index | Average | Year end |
|---|---|---|---|
| 0 | United States Dollar | 1.28 | 1.31 |
| 1 | Euro | 1.14 | 1.17 |
| 2 | UAE Dirhams | 4.69 | 4.8 |
| 3 | Australian Dollar | 1.83 | 1.88 |
| 4 | Brazilian Reals | 5.03 | 5.3 |
| 5 | Argentinian Peso | 61.1 | 78.28 |
| 6 | Russian Rouble | 82.61 | 80.94 |


However, parsing breaks for most of the other candidates, because tables can have much more intricate headers and indexes. We did not pursue parsing tables from text any further in this project.
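One failure mode can be reproduced with the first candidate from the Aggreko output shown earlier. The sketch below reuses the two regular expressions from the draft function on that raw text; the `next(...)` expression is a compact stand-in for the function's loop, not the original code. Because the text starts with a number, the first numerical match is at position 0 and no header is detected.

```python
import re

# Raw text of the first Aggreko candidate, copied from the output above
raw = "0\n2020\n2025\n2024\n2021\n2023\n2022\n2030\n"

newline_matcher = re.compile(r"\S\n\S")
numerical_matcher = re.compile(r"\S\n\d")

matches = newline_matcher.findall(raw)

# Position of the first match whose second entry starts with a digit
n = next(
    (i for i, m in enumerate(matches) if numerical_matcher.match(m)),
    None,
)

entries = raw.split("\n")[:-1]
headers = ["Index"] + entries[:n]

print(n)        # 0: the very first boundary is already numerical
print(headers)  # ['Index']: no real header was found
```

With `n` equal to 0, the draft function would place all eight entries in a single `Index` column, which is a well-formed DataFrame but not a meaningful table.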