# 1. Table extraction

In [1]:
import logging
import pathlib

logger = logging.getLogger()
logger.setLevel(logging.ERROR)

In [2]:
from esg_data_pipeline import PDFTableExtractor
from esg_data_pipeline import config

In [3]:
PDFTableExtractor_kwargs = {
    "batch_size": -1,
    "cscdtabnet_config": config.CONFIG_FOLDER / "cascade_mask_rcnn_hrnetv2p_w32_20e.py",
    "cscdtabnet_ckpt": config.CHECKPOINT_FOLDER / "icdar_19b2.pth",
    "bbox_thres": 0.85,
    "dpi": 200,
}

In [4]:
table_extractor = PDFTableExtractor(**PDFTableExtractor_kwargs)

2020-08-17 20:33:03,398 — esg_data_pipeline.components.pdf_table_extractor — INFO —__create_model:70 — cascadetabnet checkpoint does not exist. Downloading...


Downloading...
From: https://drive.google.com/uc?id=1-QieHkR1Q7CXuBu4fp3rYrvDG9j26eFT
To: /esg_data_pipeline/esg_data_pipeline/checkpoint/icdar_19b2.pth
39.3MB [00:00, 68.3MB/s]('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
50.3MB [00:00, 97.4MB/s]


2020-08-17 20:33:06,031 — esg_data_pipeline.components.pdf_table_extractor — INFO —__create_model:85 — Error while downloading cascadetabnet checkpoint. Redownloading...


Downloading...
From: https://drive.google.com/uc?id=1-QieHkR1Q7CXuBu4fp3rYrvDG9j26eFT
To: /esg_data_pipeline/esg_data_pipeline/checkpoint/icdar_19b2.pth
664MB [00:03, 212MB/s] 


### Extract tables from a single pdf file
Table extraction happens in two stages:
1. **Infer bounding boxes:** Convert each pdf page into an image and run an object detection model, CascadeTabNet, to recognize tables. The output of this stage is the coordinates of each detected table.

2. **Extract tables:** Pass the coordinates to Tabula, which analyzes only that area of the page and extracts the table from the pdf.
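
The hand-off between the two stages can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the detector returns pixel boxes as `(x1, y1, x2, y2, score)` on a page rendered at `dpi`, that `bbox_thres` is a minimum confidence score, and that Tabula takes its area as `[top, left, bottom, right]` in PDF points (72 per inch). The helper names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def keep_confident(detections, bbox_thres=0.85):
    """Drop detections whose confidence score is below the threshold."""
    return [d for d in detections if d[4] >= bbox_thres]


def pixel_bbox_to_area(bbox, dpi=200):
    """Convert a pixel box (x1, y1, x2, y2) to [top, left, bottom, right] in points."""
    x1, y1, x2, y2 = bbox
    return [
        y1 * POINTS_PER_INCH / dpi,
        x1 * POINTS_PER_INCH / dpi,
        y2 * POINTS_PER_INCH / dpi,
        x2 * POINTS_PER_INCH / dpi,
    ]


# Hypothetical detections for one page: (x1, y1, x2, y2, score)
detections = [(200, 400, 1000, 800, 0.97), (50, 60, 300, 200, 0.40)]
areas = [pixel_bbox_to_area(d[:4]) for d in keep_confident(detections)]
print(areas)
```

Each resulting area could then be handed to tabula-py, for example `tabula.read_pdf(path, pages=page_number, area=area)`.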

### Automate these processes using run_folder() function

The `data/pdfs` folder contains a single pdf, `NYSE_TOT_2015 annual.pdf`. We extract all tables from the pdfs in this folder, but first we delete all files in the extraction folder.

In [5]:
!ls $config.PDF_FOLDER

'NYSE_TOT_2015 annual.pdf'


In [6]:
!rm -rf $config.EXTRACTION_FOLDER/*

In [7]:
!ls $config.EXTRACTION_FOLDER
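
The `!rm -rf` cell above relies on a POSIX shell. A portable alternative can be sketched in plain Python; `clear_folder` is a hypothetical helper, not part of `esg_data_pipeline`, and unlike the shell glob it also removes hidden files.

```python
import shutil
import tempfile
from pathlib import Path


def clear_folder(folder):
    """Delete everything inside `folder` but keep the folder itself."""
    for entry in Path(folder).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()


# Demonstration on a throwaway directory rather than the real extraction folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "old_table.csv").write_text("a,b\n1,2\n")
    (Path(tmp) / "nested").mkdir()
    clear_folder(tmp)
    remaining = list(Path(tmp).iterdir())

print(remaining)
```

In the notebook, the call would be `clear_folder(config.EXTRACTION_FOLDER)`.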

In [8]:
table_extractor.run_folder(config.PDF_FOLDER, config.EXTRACTION_FOLDER)

2020-08-17 20:33:25,345 — esg_data_pipeline.components.pdf_table_extractor — INFO —run:273 — PDFTableExtractor is running on file /esg_data_pipeline/esg_data_pipeline/data/pdfs/NYSE_TOT_2015 annual.pdf...


Page:   0%|          | 0/295 [00:00<?, ?it/s]
  "See the documentation of nn.Upsample for details.".format(mode))

Inferring tables for page 1-295:   0%|          | 1/294 [00:01<06:04,  1.24s/it]
...
Inferring tables for page 1-295:  99%|█████████▉| 291/294 [02:19<00:02,  1.13it/s]

Each detected table is saved as a CSV file.

In [9]:
!ls $config.EXTRACTION_FOLDER

'NYSE_TOT_2015 annual_page102_1.csv'  'NYSE_TOT_2015 annual_page225_2.csv'
'NYSE_TOT_2015 annual_page104_1.csv'  'NYSE_TOT_2015 annual_page225_3.csv'
'NYSE_TOT_2015 annual_page105_1.csv'  'NYSE_TOT_2015 annual_page226_1.csv'
'NYSE_TOT_2015 annual_page108_1.csv'  'NYSE_TOT_2015 annual_page226_2.csv'
'NYSE_TOT_2015 annual_page109_1.csv'  'NYSE_TOT_2015 annual_page226_3.csv'
'NYSE_TOT_2015 annual_page109_2.csv'  'NYSE_TOT_2015 annual_page226_4.csv'
'NYSE_TOT_2015 annual_page110_1.csv'  'NYSE_TOT_2015 annual_page227_1.csv'
'NYSE_TOT_2015 annual_page110_2.csv'  'NYSE_TOT_2015 annual_page228_1.csv'
'NYSE_TOT_2015 annual_page112_1.csv'  'NYSE_TOT_2015 annual_page229_1.csv'
'NYSE_TOT_2015 annual_page112_2.csv'  'NYSE_TOT_2015 annual_page22_1.csv'
'NYSE_TOT_2015 annual_page113_1.csv'  'NYSE_TOT_2015 annual_page230_1.csv'
'NYSE_TOT_2015 annual_page113_2.csv'  'NYSE_TOT_2015 annual_page230_2.csv'
'NYSE_TOT_2015 annual_page114_1.csv'  'NYSE_TOT_2015 annual_page230_3.csv'
'NYSE_TOT_201
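
The listing suggests a naming pattern of `<pdf name>_page<N>_<k>.csv`, where `<k>` numbers the tables found on a page. That pattern is inferred from the file names above, not from the library's documentation. Under that assumption, a small parser makes it possible to sort the files by page number, which `ls` does not do (`page22` is listed after `page229`).

```python
import re

# Pattern inferred from the listing; treat it as an assumption
TABLE_NAME = re.compile(r"^(?P<stem>.+)_page(?P<page>\d+)_(?P<index>\d+)\.csv$")


def parse_table_filename(filename):
    """Split '<pdf stem>_page<N>_<k>.csv' into (stem, page, index)."""
    match = TABLE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected table file name: {filename!r}")
    return match["stem"], int(match["page"]), int(match["index"])


names = [
    "NYSE_TOT_2015 annual_page229_1.csv",
    "NYSE_TOT_2015 annual_page22_1.csv",
    "NYSE_TOT_2015 annual_page102_1.csv",
]
# Sorting on the parsed tuple orders pages numerically instead of alphabetically
ordered = sorted(names, key=parse_table_filename)
print(ordered)
```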

# 2. Table Curation

We use Allyson's annotations for the report `NYSE_TOT_2015 annual.pdf`. The annotation file is located in `config.ANNOTATION_FOLDER`.

In [10]:
from esg_data_pipeline import TableCurator

In [11]:
tb_cur = TableCurator(
    neg_pos_ratio=1,
    create_neg_samples=True,
    columns_to_read=["company", "source_file", "source_page", "kpi_id", "year", "answer", "data_type"],
    company_to_exclude=["CEZ"],
)
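
Judging by their names, `columns_to_read` selects which annotation columns are loaded and `company_to_exclude` drops the rows of the listed companies. That reading is an assumption; the sketch below only illustrates the idea on made-up rows, and `select_annotations` is a hypothetical helper.

```python
def select_annotations(rows, columns_to_read, company_to_exclude):
    """Keep only the requested columns and drop rows of excluded companies."""
    excluded = set(company_to_exclude)
    return [
        {col: row[col] for col in columns_to_read}
        for row in rows
        if row["company"] not in excluded
    ]


# Made-up annotation rows, for illustration only
rows = [
    {"company": "Total SA", "source_page": 266, "kpi_id": 1, "comment": "ok"},
    {"company": "CEZ", "source_page": 12, "kpi_id": 1, "comment": "skip"},
]
kept = select_annotations(rows, ["company", "source_page", "kpi_id"], ["CEZ"])
print(kept)
```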

In [12]:
import glob

# Match .xlsx files, skipping Excel lock files whose names start with "~$"
annotation_excels = glob.glob("{}/[!~$]*.xlsx".format(config.ANNOTATION_FOLDER))
tb_cur.run(config.EXTRACTION_FOLDER, annotation_excels, config.CURATION_FOLDER)

  res_values = method(rvalues)

Most of the warnings refer to pdfs that are listed in the Excel file but were not extracted, because they do not exist in the pdfs folder. The curation folder now contains a CSV file with the curated data.
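
As an aside on how the annotation files are located in cell 12: glob patterns use bracket expressions, which are easy to misread. `[!~$]` matches any single character except `~` or `$`, which skips the `~$...` lock files Excel creates while a workbook is open. A trailing `[.xlsx]`, on the other hand, is a character class that matches one character out of `.`, `x`, `l` and `s`, not the literal extension. The check below uses `fnmatchcase`, which follows the same matching rules as `glob`; the file names are hypothetical.

```python
from fnmatch import fnmatchcase

strict = "[!~$]*.xlsx"   # literal ".xlsx" extension
loose = "[!~$]*[.xlsx]"  # last character is any one of ".", "x", "l", "s"

results = {}
for name in ["annotations.xlsx", "~$annotations.xlsx", "report.xls"]:
    results[name] = (fnmatchcase(name, strict), fnmatchcase(name, loose))
    print(name, results[name])
```

Only the strict pattern rejects `report.xls`; both reject the lock file.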

In [13]:
!ls $config.CURATION_FOLDER

esg_TABLE_dataset.csv


In [14]:
import pandas as pd

df = pd.read_csv(config.CURATION_FOLDER / "esg_TABLE_dataset.csv", index_col=0)
df.head()

| Unnamed: 0 | Company | Year | Question | Answer | Table_filename | Label |
|---|---|---|---|---|---|---|
| 0 | Total SA | 2013 | What is the volume of estimated proven or prob... | | NYSE_TOT_2015 annual_page242_4.csv | 0 |
| 1 | Total SA | 2013 | What is the volume of estimated proven or prob... | 11,526 million barrels of oil equivalent | NYSE_TOT_2015 annual_page266_1.csv | 1 |
| 2 | Total SA | 2014 | What is the volume of estimated proven or prob... | | NYSE_TOT_2015 annual_page230_2.csv | 0 |
| 3 | Total SA | 2014 | What is the volume of estimated proven or prob... | 11,523 million barrels of oil equivalent | NYSE_TOT_2015 annual_page266_1.csv | 1 |
| 4 | Total SA | 2015 | What is the volume of estimated proven or prob... | | NYSE_TOT_2015 annual_page8_1.csv | 0 |
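
With `neg_pos_ratio=1`, each positive example would be expected to come with one negative counterpart; that interpretation of the parameter is an assumption. A quick sanity check on the five preview rows, using only the standard library, counts the labels and confirms that only positive rows carry an answer.

```python
import csv
import io
from collections import Counter

# The five preview rows shown above, as CSV (questions are truncated in the source)
preview = """\
Unnamed: 0,Company,Year,Question,Answer,Table_filename,Label
0,Total SA,2013,What is the volume of estimated proven or prob...,,NYSE_TOT_2015 annual_page242_4.csv,0
1,Total SA,2013,What is the volume of estimated proven or prob...,"11,526 million barrels of oil equivalent",NYSE_TOT_2015 annual_page266_1.csv,1
2,Total SA,2014,What is the volume of estimated proven or prob...,,NYSE_TOT_2015 annual_page230_2.csv,0
3,Total SA,2014,What is the volume of estimated proven or prob...,"11,523 million barrels of oil equivalent",NYSE_TOT_2015 annual_page266_1.csv,1
4,Total SA,2015,What is the volume of estimated proven or prob...,,NYSE_TOT_2015 annual_page8_1.csv,0
"""

rows = list(csv.DictReader(io.StringIO(preview)))
label_counts = Counter(row["Label"] for row in rows)
# True when a row has an answer exactly if it is labeled 1
answers_match_labels = all(
    (row["Answer"] != "") == (row["Label"] == "1") for row in rows
)
print(label_counts, answers_match_labels)
```

On the full dataset, the same count is available through `df["Label"].value_counts()`.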


# 3. Text Extraction

The text extraction stage extracts all paragraphs from a pdf and saves them in a JSON file, which is easy to load in Python.

This process is applied to every pdf mentioned in an annotation Excel file provided by Allianz.
One JSON file is created per pdf, with the same file name.
Place the Excel files in `data/annotations` and the pdfs in `data/pdfs`.
The extracted JSON files are saved in `data/extraction`. These directories can be changed in the config file `config/config.py`.

In [15]:
import os

from esg_data_pipeline import config
from esg_data_pipeline.components import PDFTextExtractor
from esg_data_pipeline.components import TextCurator

In [16]:
config.PDFTextExtractor_kwargs

{'min_paragraph_length': 30,
 'annotation_folder': PosixPath('/esg_data_pipeline/esg_data_pipeline/data/annotations')}
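
The `min_paragraph_length` setting presumably discards short fragments such as page numbers and headings. Whether the extractor measures length in characters or words is not stated here, so the sketch below assumes characters; `filter_paragraphs` is a hypothetical helper and the page content is made up.

```python
import json


def filter_paragraphs(paragraphs, min_paragraph_length=30):
    """Keep paragraphs with at least `min_paragraph_length` characters."""
    kept = []
    for text in paragraphs:
        text = " ".join(text.split())  # collapse whitespace and line breaks
        if len(text) >= min_paragraph_length:
            kept.append(text)
    return kept


# Hypothetical page content: a page number, a heading and one real paragraph
page = ["12", "Reserves", "Proved reserves are estimated\nusing reference prices."]
kept_paragraphs = filter_paragraphs(page)
print(json.dumps(kept_paragraphs, indent=2))
```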

In [17]:
ext = PDFTextExtractor(**config.PDFTextExtractor_kwargs)
ext.run_folder(config.PDF_FOLDER, config.EXTRACTION_FOLDER)

2020-08-17 20:42:22,506 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf 2015_BASF_Report.pdf
2020-08-17 20:42:22,508 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf BASF_Report_2016.pdf
2020-08-17 20:42:22,510 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf BASF_Report_2017.pdf
2020-08-17 20:42:22,511 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf BASF_Report_2018.pdf
2020-08-17 20:42:22,513 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf Wintershall Dea annual report 2019.pdf
2020-08-17 20:42:22,514 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf Wintershall-Dea_Sustainability_Report_2019.pdf
2020-08-17 20:42:22,516 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not



2020-08-17 20:43:29,444 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf NYSE_TOT_2016 annual.pdf
2020-08-17 20:43:29,446 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf NYSE_TOT_2017 annual.pdf
2020-08-17 20:43:29,447 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf NYSE_TOT_2018 annual.pdf
2020-08-17 20:43:29,449 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf Transocean_Sustain_digital_FN_4 2017_2018.pdf
2020-08-17 20:43:29,450 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf sustainable development 2017.pdf
2020-08-17 20:43:29,451 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf annual 2018 .pdf
2020-08-17 20:43:29,452 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could no

We can see that a JSON file has been created for the pdf, next to the CSV files from the table extraction.

In [18]:
!ls -lh $config.EXTRACTION_FOLDER

total 2.4M
-rw-r--r-- 1 root root 1.3M Aug 17 20:43 'NYSE_TOT_2015 annual.json'
-rw-r--r-- 1 root root 2.1K Aug 17 20:39 'NYSE_TOT_2015 annual_page102_1.csv'
-rw-r--r-- 1 root root  311 Aug 17 20:39 'NYSE_TOT_2015 annual_page104_1.csv'
-rw-r--r-- 1 root root  734 Aug 17 20:39 'NYSE_TOT_2015 annual_page105_1.csv'
-rw-r--r-- 1 root root 1.8K Aug 17 20:39 'NYSE_TOT_2015 annual_page108_1.csv'
-rw-r--r-- 1 root root  539 Aug 17 20:39 'NYSE_TOT_2015 annual_page109_1.csv'
-rw-r--r-- 1 root root 1.5K Aug 17 20:39 'NYSE_TOT_2015 annual_page109_2.csv'
-rw-r--r-- 1 root root  783 Aug 17 20:39 'NYSE_TOT_2015 annual_page110_1.csv'
-rw-r--r-- 1 root root 1.4K Aug 17 20:39 'NYSE_TOT_2015 annual_page110_2.csv'
-rw-r--r-- 1 root root 1.8K Aug 17 20:39 'NYSE_TOT_2015 annual_page112_1.csv'
-rw-r--r-- 1 root root  296 Aug 17 20:39 'NYSE_TOT_2015 annual_page112_2.csv'
-rw-r--r-- 1 root root  465 Aug 17 20:39 'NYSE_TOT_2015 annual_page113_1.csv'
-rw-r--r-- 1 root root 1.8K Aug 17 20:39 'NYSE_TO

**Alternatively**, we can run the pipeline on a **single pdf** by calling the `run()` method with the path to the desired pdf file.
Let's create a test directory and save the JSON file in there.

In [19]:
test_dir = "{}/test_dir".format(config.DATA_FOLDER)
os.makedirs(test_dir, exist_ok=True)  # no error if the directory already exists

sample_pdf = "{}/NYSE_TOT_2015 annual.pdf".format(config.PDF_FOLDER)
text_dict = ext.run(input_filepath=sample_pdf, output_folder=test_dir)

In [20]:
!ls $config.DATA_FOLDER/test_dir

'NYSE_TOT_2015 annual.json'


# 4. Text Curation


The extracted JSON files are fed into the next stage to curate a training dataset.
<br>The positive examples (label 1) are taken from the annotated data provided by Allianz.
<br>A negative example (label 0) for each question is created by selecting a random paragraph from the JSON files.
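
The sampling described above can be sketched as follows. This is not the `TextCurator` implementation: the function name, the data and the choice to exclude the annotated paragraph from the candidates are all assumptions made for illustration.

```python
import random


def add_negative_examples(positives, paragraphs, neg_pos_ratio=1, seed=0):
    """Pair each positive example with randomly drawn non-matching paragraphs."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    dataset = []
    for question, answer_paragraph in positives:
        dataset.append((question, answer_paragraph, 1))
        # Assumption: the annotated paragraph itself is never used as a negative
        candidates = [p for p in paragraphs if p != answer_paragraph]
        k = min(neg_pos_ratio, len(candidates))
        for negative in rng.sample(candidates, k=k):
            dataset.append((question, negative, 0))
    return dataset


# Made-up paragraphs and annotations
paragraphs = ["Reserves paragraph", "Emissions paragraph", "Safety paragraph"]
positives = [
    ("What are the proven reserves?", "Reserves paragraph"),
    ("What are the total emissions?", "Emissions paragraph"),
]
dataset = add_negative_examples(positives, paragraphs)
for example in dataset:
    print(example)
```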

In [21]:
ls $config.CURATION_FOLDER/*

/esg_data_pipeline/esg_data_pipeline/data/curation/esg_TABLE_dataset.csv


In [22]:
cur = TextCurator(**config.TextCurator_kwargs)
cur.run(config.EXTRACTION_FOLDER, annotation_excels, config.CURATION_FOLDER)

[13.]
2020-08-17 20:45:29,151 — esg_data_pipeline.components.text_curator — INFO —run:91 — Curated 215 examples
2020-08-17 20:45:29,151 — esg_data_pipeline.components.text_curator — INFO —run:92 — Saving the dataset in /esg_data_pipeline/esg_data_pipeline/data/curation/esg_TEXT_dataset.csv


In [23]:
ls $config.CURATION_FOLDER/*

/esg_data_pipeline/esg_data_pipeline/data/curation/esg_TABLE_dataset.csv
/esg_data_pipeline/esg_data_pipeline/data/curation/esg_TEXT_dataset.csv


# 5. Extraction pipeline

We can create an extraction pipeline with an `Extractor` object, which runs the extraction of both text and table data. The config file defines which extractors are run on the pdfs.

In [24]:
config.EXTRACTORS

[('PDFTableExtractor',
  {'batch_size': -1,
   'cscdtabnet_config': PosixPath('/esg_data_pipeline/esg_data_pipeline/config/cascade_mask_rcnn_hrnetv2p_w32_20e.py'),
   'cscdtabnet_ckpt': PosixPath('/esg_data_pipeline/esg_data_pipeline/checkpoint/icdar_19b2.pth'),
   'bbox_thres': 0.85,
   'dpi': 200}),
 ('PDFTextExtractor',
  {'min_paragraph_length': 30,
   'annotation_folder': PosixPath('/esg_data_pipeline/esg_data_pipeline/data/annotations')})]
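
The list of `(name, kwargs)` pairs above suggests a simple registry pattern: look up each component class by name, instantiate it with its keyword arguments, and run the components in order. The sketch below illustrates that pattern with stand-in classes; it is not the actual `Extractor` implementation, and all class names in it are hypothetical.

```python
class UpperCaseStep:
    """Stand-in for an extractor; a real one would read the pdf."""

    def __init__(self, suffix=""):
        self.suffix = suffix

    def run(self, input_filepath, output_folder):
        return f"{output_folder}/{input_filepath.upper()}{self.suffix}"


class CountStep:
    """Second stand-in component with its own keyword argument."""

    def __init__(self, unit="chars"):
        self.unit = unit

    def run(self, input_filepath, output_folder):
        return f"{len(input_filepath)} {self.unit}"


# Maps the names used in the config to the classes that implement them
REGISTRY = {"UpperCaseStep": UpperCaseStep, "CountStep": CountStep}


class Pipeline:
    """Instantiate every configured component and run them in sequence."""

    def __init__(self, steps):
        self.steps = [REGISTRY[name](**kwargs) for name, kwargs in steps]

    def run(self, input_filepath, output_folder):
        return [step.run(input_filepath, output_folder) for step in self.steps]


steps = [("UpperCaseStep", {"suffix": ".json"}), ("CountStep", {})]
results = Pipeline(steps).run("a.pdf", "out")
print(results)
```

Keeping the component list in the config means a stage can be added or removed without touching the code that runs the pipeline.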

In [25]:
from esg_data_pipeline.components import Extractor

ext = Extractor(config.EXTRACTORS)

In [26]:
ext.run_folder(config.PDF_FOLDER, config.EXTRACTION_FOLDER)

2020-08-17 20:45:52,122 — esg_data_pipeline.components.pdf_table_extractor — INFO —run:273 — PDFTableExtractor is running on file /esg_data_pipeline/esg_data_pipeline/data/pdfs/NYSE_TOT_2015 annual.pdf...



Page:   0%|          | 0/295 [00:00<?, ?it/s]

  "See the documentation of nn.Upsample for details.".format(mode))


Inferring tables for page 1-295:   0%|          | 1/294 [00:00<01:58,  2.47it/s]
...
Inferring tables for page 1-295:  95%|█████████▍| 279/294 [02:12<00:13,  1.12it/s]

2020-08-17 20:53:54,095 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf 2015_BASF_Report.pdf
2020-08-17 20:53:54,096 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf BASF_Report_2016.pdf
2020-08-17 20:53:54,097 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf BASF_Report_2017.pdf
2020-08-17 20:53:54,098 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf BASF_Report_2018.pdf
2020-08-17 20:53:54,100 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf Wintershall Dea annual report 2019.pdf
2020-08-17 20:53:54,102 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf Wintershall-Dea_Sustainability_Report_2019.pdf
2020-08-17 20:53:54,103 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not



2020-08-17 20:55:02,085 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf NYSE_TOT_2016 annual.pdf
2020-08-17 20:55:02,087 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf NYSE_TOT_2017 annual.pdf
2020-08-17 20:55:02,088 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf NYSE_TOT_2018 annual.pdf
2020-08-17 20:55:02,090 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf Transocean_Sustain_digital_FN_4 2017_2018.pdf
2020-08-17 20:55:02,091 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf sustainable development 2017.pdf
2020-08-17 20:55:02,093 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could not find pdf annual 2018 .pdf
2020-08-17 20:55:02,094 — esg_data_pipeline.components.pdf_text_extractor — INFO —get_path_pdf:104 — Could no

# 6. Curation pipeline

Similarly, we can create a curation pipeline with a `Curator` object, which runs the curation of both text and table data. The config file defines the curators.

In [27]:
config.CURATORS

[('TextCurator',
  {'retrieve_paragraph': False,
   'neg_pos_ratio': 1,
   'create_neg_samples': False}),
 ('TableCurator',
  {'neg_pos_ratio': 1,
   'create_neg_samples': True,
   'columns_to_read': ['company',
    'source_file',
    'source_page',
    'kpi_id',
    'year',
    'answer',
    'data_type'],
   'company_to_exclude': ['CEZ']})]

In [28]:
from esg_data_pipeline.components import Curator

cur = Curator(config.CURATORS)

In [29]:
cur.run(config.EXTRACTION_FOLDER, config.ANNOTATION_FOLDER, config.CURATION_FOLDER)

2020-08-17 20:56:07,500 — esg_data_pipeline.components.curator — INFO —run:45 — Received 1 excel files
[13.]
2020-08-17 20:56:08,572 — esg_data_pipeline.components.text_curator — INFO —run:91 — Curated 215 examples
2020-08-17 20:56:08,573 — esg_data_pipeline.components.text_curator — INFO —run:92 — Saving the dataset in /esg_data_pipeline/esg_data_pipeline/data/curation/esg_TEXT_dataset.csv

  res_values = method(rvalues)