# KPI Extraction

At this point in the training pipeline, we have developed notebooks to curate the ESG data with annotations and to train a relevance classifier on it, which determines whether a given paragraph is relevant to answering a given KPI question. In this notebook, we will train a model that, given a relevant paragraph and a KPI question, extracts the precise answer to that question from the paragraph.

The KPI extraction model that we will train for this task is a question-answering (QA) model. Training such models generally requires the input data to be in a specific format, such as that of SQuAD, a public dataset for the question-answering task. The first section of this notebook therefore curates the ESG data into a SQuAD-like format, and the subsequent sections consume this reformatted data to train the model.
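
For orientation, here is a minimal sketch of what a SQuAD-like record looks like, following the public SQuAD v2 JSON layout (`data`, then `paragraphs`, then `qas`, then `answers`). The paragraph text, questions, title, and ids are illustrative and not taken from the curated ESG data, and the exact fields written by the curator may differ.

```python
import json

# Illustrative paragraph and answer (not from the curated ESG data).
context = "In 2019 the company reported total scope 1 emissions of 1.2 million tonnes CO2e."
answer_text = "1.2 million tonnes CO2e"

squad_like = {
    "version": "v2.0",
    "data": [
        {
            "title": "example_report.pdf",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            # Answerable question: the answer is a span of the context.
                            "id": 0,
                            "question": "What is the total amount of scope 1 emissions?",
                            "is_impossible": False,
                            "answers": [
                                {
                                    "text": answer_text,
                                    "answer_start": context.index(answer_text),
                                }
                            ],
                        },
                        {
                            # Unanswerable question: no span in the context answers it.
                            "id": 1,
                            "question": "What is the target year for climate commitment?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def answers_are_aligned(dataset):
    """Check that every answer's character offset points at its text in the context."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if paragraph["context"][start:end] != answer["text"]:
                        return False
    return True


print(json.dumps(squad_like["data"][0]["paragraphs"][0]["qas"][0], indent=2))
print(answers_are_aligned(squad_like))
```

The alignment check is worth running on any hand-built file, because a QA model is trained on character offsets and a shifted `answer_start` silently teaches it the wrong span.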

The model training classes used in this notebook include components provided by the FARM library. FARM is a framework that facilitates transfer learning for BERT-based models. Documentation for FARM is available here: https://farm.deepset.ai.

In [1]:
import os
import json
import pprint
import pathlib
import pandas as pd
from dotenv import load_dotenv
from io import BytesIO
import zipfile

from farm.infer import QAInferencer
from farm.evaluation import squad_evaluation
from farm.data_handler.utils import write_squad_predictions

from src.models.qa_farm_trainer import QAFARMTrainer
from src.data.s3_communication import S3Communication, S3FileType
from src.components.preprocessing.kpi_inference_curator import TextKPIInferenceCurator

import config
from config_qa_farm_train import (
    QAFileConfig,
    QATokenizerConfig,
    QAProcessorConfig,
    QAModelConfig,
    QATrainingConfig,
    QAMLFlowConfig,
)

07/13/2022 16:23:46 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)
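
If the connector fails to authenticate, a quick first check is whether the four environment variables read above are actually set. The following is a small sketch of such a check; `missing_credentials` is a hypothetical helper and not part of the project code.

```python
import os

# Environment variables read by the S3 connector cell above.
REQUIRED_VARS = (
    "S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3_BUCKET",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All S3 credentials are set.")
```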

In [4]:
kpi_df = s3c.download_df_from_s3(
    f"{config.EXPERIMENT_NAME}/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0,
)
kpi_df.head()

Unnamed: 0,kpi_id,question,sectors,add_year,kpi_category,Unnamed: 5,Unnamed: 6
0,0.0,What is the company name?,"OG, CM, CU",False,TEXT,,
1,1.0,In which year was the annual report or the sus...,"OG, CM, CU",False,TEXT,,
2,2.0,What is the total volume of proven and probabl...,OG,True,"TEXT, TABLE",,
3,2.1,What is the volume of estimated proven hydroca...,OG,True,"TEXT, TABLE",,
4,2.2,What is the volume of estimated probable hydro...,OG,True,"TEXT, TABLE",,


## Curate Data in SQuAD Format

The following code curates the data and writes it out in a SQuAD-like format.

In [5]:
# When running in Automation using Elyra and Kubeflow Pipelines,
# set AUTOMATION = 1 as an environment variable
if os.getenv("AUTOMATION"):
    # extracted pdfs
    if not os.path.exists(config.BASE_EXTRACTION_FOLDER):
        config.BASE_EXTRACTION_FOLDER.mkdir(parents=True, exist_ok=True)

    if not os.path.exists(config.BASE_ANNOTATION_FOLDER):
        config.BASE_ANNOTATION_FOLDER.mkdir(parents=True, exist_ok=True)

    if not os.path.exists(config.BASE_INFER_RELEVANCE_FOLDER):
        config.BASE_INFER_RELEVANCE_FOLDER.mkdir(parents=True, exist_ok=True)

    # processed data
    if not os.path.exists(config.BASE_PROCESSED_DATA):
        config.BASE_PROCESSED_DATA.mkdir(parents=True, exist_ok=True)

    # output squad
    if not os.path.exists(config.TextKPIInferenceCurator_kwargs['output_squad_folder']):
        pathlib.Path(config.TextKPIInferenceCurator_kwargs['output_squad_folder']).mkdir(parents=True, exist_ok=True)

    # download extracted pdfs from s3
    s3c.download_files_in_prefix_to_dir(
        config.BASE_EXTRACTION_S3_PREFIX,
        config.BASE_EXTRACTION_FOLDER,
    )

    # download the annotation files
    s3c.download_files_in_prefix_to_dir(
        config.BASE_ANNOTATION_S3_PREFIX,
        config.BASE_ANNOTATION_FOLDER,
    )
    # download the relevance infer files
    s3c.download_files_in_prefix_to_dir(
        config.BASE_INFER_RELEVANCE_S3_PREFIX,
        config.BASE_INFER_RELEVANCE_FOLDER)

In [6]:
# run data curation with default settings
tkpi = TextKPIInferenceCurator(
    **config.TextKPIInferenceCurator_kwargs,
    kpi_df=kpi_df,
    columns_to_read=config.TRAIN_KPI_INFERENCE_COLUMNS_TO_READ,
)
train_squad, val_squad = tkpi.curate(**config.CurateConfig().__dict__)

07/13/2022 16:23:47 - INFO - src.components.preprocessing.kpi_inference_curator -   6.7
07/13/2022 16:23:47 - INFO - src.components.preprocessing.kpi_inference_curator -   6.7
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_single.loc[:, "source_page"] = df_single["source_page"].apply(lambda x: x[0])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_single.loc[:, "relevant_paragraphs"] = df["relevant_paragraphs"].apply(
07/13/2022 16:23:48 - INFO - src.components.preprocessing.kpi_inference_curator -   LSE_WG_2016.pdf json file has not been extracted. Will

Now, we have the data in SQuAD format, which is ready for training.

In [7]:
# see that the data has been reformatted and placed in the output folder
output_dir = str(config.TextKPIInferenceCurator_kwargs['output_squad_folder'])
!ls $output_dir

kpi_train.json	      reference_kpi_08-07-2022.csv
kpi_train_split.json  reference_kpi_12-07-2022.csv
kpi_val_split.json    reference_kpi_13-07-2022.csv


## Train Model

Now that we have the data in an appropriate format, we will train a QA model (or rather, fine tune a pretrained QA model) in this section.

### Configure Training Parameters  
Before training starts, the parameters for each component of the training pipeline must be set. For this, we create `config` objects that hold these parameters. Default values have already been set, but they can easily be changed.

In [8]:
# Settings for data files and checkpoints
file_config = QAFileConfig(config.EXPERIMENT_NAME)

# Settings for the processor component
processor_config = QAProcessorConfig(config.EXPERIMENT_NAME)

# Settings for the tokenizer
tokenizer_config = QATokenizerConfig(config.EXPERIMENT_NAME)

# Settings for the model
model_config = QAModelConfig(config.EXPERIMENT_NAME)

# Settings for training
train_config = QATrainingConfig(config.EXPERIMENT_NAME)

# Settings for MLflow tracking
mlflow_config = QAMLFlowConfig(config.EXPERIMENT_NAME)

In [9]:
# processed data
pred_dir = pathlib.Path(file_config.dev_predictions_filename).parent
if not os.path.exists(pred_dir):
    pred_dir.mkdir(parents=True, exist_ok=True)

Parameters can be changed as follows:

In [10]:
# config.EXPERIMENT_NAME = "test-training-pipeline"

However, we advise that you manually update the parameters in the corresponding config file:

`./config_qa_farm_train.py`
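
As an illustration of an in-notebook override, the sketch below uses a stand-in object rather than the real `QATrainingConfig`, so that it runs on its own. The attribute names (`learning_rate`, `n_epochs`, `batch_size`) are the ones printed later in this notebook; the new values are arbitrary.

```python
from types import SimpleNamespace

# Stand-in for QATrainingConfig (hypothetical; the real class is defined in the config file).
train_config_demo = SimpleNamespace(learning_rate=2e-5, n_epochs=1, batch_size=2)

# Override selected parameters before the trainer is created.
overrides = {"n_epochs": 3, "batch_size": 4}
for name, value in overrides.items():
    # Guard against typos: setting an unknown attribute would otherwise pass silently.
    if not hasattr(train_config_demo, name):
        raise AttributeError(f"Unknown training parameter: {name}")
    setattr(train_config_demo, name, value)

print(train_config_demo)
```

Overrides made this way only last for the current session, which is one reason to prefer editing the config file for anything that should be reproducible.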

In [11]:
print(f"Experiment_name: \n {file_config.experiment_name} \n")
print(f"Data directory: \n {file_config.data_dir} \n")
print(f"Curated dataset path: \n {file_config.curated_data} \n")
print(f"Split train/validation ratio: \n{file_config.dev_split} \n")
print(f"Training dataset path: \n {file_config.train_filename} \n")
print(f"Validation dataset path: \n {file_config.dev_filename} \n")
print(f"Directory where trained model is saved: \n {file_config.saved_models_dir} \n")

Experiment_name: 
 test-demo-2 

Data directory: 
 /opt/app-root/src/aicoe-osc-demo/data 

Curated dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train.json 

Split train/validation ratio: 
0.2 

Training dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train_split.json 

Validation dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_val_split.json 

Directory where trained model is saved: 
 /opt/app-root/src/aicoe-osc-demo/models/KPI_EXTRACTION 



In [12]:
print(f"Max number of tokens per example: {processor_config.max_seq_len} \n")

Max number of tokens per example: 384 



In [13]:
print(f"Use GPU: {train_config.use_cuda} \n")

Use GPU: True 



In [14]:
print(f"Learning_rate: {train_config.learning_rate} \n")
print(f"Number of epochs for fine tuning: {train_config.n_epochs} \n")
print(f"Batch size: {train_config.batch_size} \n")
print(f"Perform Cross validation: {train_config.run_cv} \n")

Learning_rate: 2e-05 

Number of epochs for fine tuning: 1 

Batch size: 2 

Perform Cross validation: False 



### Fine Tune on Curated Dataset

Now we will fine-tune the model on the curated ESG data. In the following cells, we instantiate a `QAFARMTrainer` object by passing it all the configuration objects. This object defines the model, which is essentially a BERT-based model with extra dense layers for the question-answering task. The weights of this model are initialized from a model pretrained on the SQuAD dataset. In addition to the model, other necessary components such as the tokenizer and the processor are loaded. These components create features from the input text.

In [15]:
# need to convert root directory location from PosixPath to str
# otherwise transformers lib is unable to serialize and save the tokenizer
tokenizer_config.root = str(tokenizer_config.root)

In [16]:
# init trainer / fine-tuner
farm_trainer = QAFARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

Call the `run()` method to start training.

**Note:** The first time the model is loaded, it takes a little longer because the checkpoints have to be downloaded. The model is cached after that.

In [17]:
# start training
farm_trainer.run(metric="f1")

07/13/2022 16:23:53 - INFO - src.models.qa_farm_trainer -   Loading the /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train.json data and splitting to train and val...
07/13/2022 16:23:53 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: True
07/13/2022 16:23:53 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
07/13/2022 16:23:53 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
07/13/2022 16:23:53 - INFO - farm.data_handler.data_silo -   Loading train set from: /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train_split.json 
07/13/2022 16:23:53 - INFO - farm.data_handler.data_silo -   Got ya 7 parallel workers to convert 824 dictionaries to pytorch datasets (chunksize = 24)...
07/13/2022 16:23:53 - INFO - farm.data_h

0.8266915389753079

At the end of the training process, the model and the processor vocabulary are saved to the directory `file_config.saved_models_dir`.

In [None]:
!ls -al $file_config.saved_models_dir

You can find the development dataset at `file_config.dev_filename`. This dataset has not been seen by the model.

In [19]:
file_config.dev_filename

'/opt/app-root/src/aicoe-osc-demo/data/squad/kpi_val_split.json'

## Run Inference on Samples

At this point, we have a fine-tuned model that can be used to extract answers to KPI questions! Let's load the saved model and test it on some real examples that the model has not seen before.

In [20]:
# load model
model = QAInferencer.load(file_config.saved_models_dir, batch_size=40, gpu=True)

07/13/2022 16:28:13 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
07/13/2022 16:28:21 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
07/13/2022 16:28:21 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [1024, 2]
07/13/2022 16:28:21 - INFO - farm.modeling.prediction_head -   Loading prediction head from /opt/app-root/src/aicoe-osc-demo/models/KPI_EXTRACTION/prediction_head_0.bin
07/13/2022 16:28:21 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
07/13/2022 16:28:22 - INFO - farm.data_handler.processor -   Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for using the default task or add a custom task later via processor.add_task()
07/13/2022 16:28:22 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
07/13/2

Now, let's make a prediction on a single paragraph and question pair.

In [21]:
# sample context and question
context = """the paris agreement on climate change drafted in 2015 aims to reduce worldwide emissions of greenhouse
gases to a level intended to limit a rise in global temperatures to below 2 degrees or, better still,
to below 1.5 degrees. verbund’s target of reducing greenhouse gas emissions by 90% measured beginning from
the basis year 2011 5 million tonnes co2e until 2021 includes scope 1, scope 2 market- based and parts of scope 3 emissions
for energy and air travel. the science based targets initiative validated this goal as science-based in october 2016,
i.e. it meets global standards. according to current planning, the target can be achieved.
however, if the grid operator requires higher generation volumes
"""
question = "What is the target year for climate commitment?"

In [22]:
# model response
QA_input = [
    {
        "qas": [question],
        "context":  context
    }
]

result = model.inference_from_dicts(dicts=QA_input)[0]
pprint.pprint(result)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 26.03 Batches/s]

{'predictions': [{'answers': [{'answer': '2021',
                               'context': 'the basis year 2011 5 million '
                                          'tonnes co2e until 2021 includes '
                                          'scope 1, scope 2 market- based and '
                                          'par',
                               'document_id': '0-0',
                               'offset_answer_end': 364,
                               'offset_answer_start': 360,
                               'offset_context_end': 412,
                               'offset_context_start': 312,
                               'probability': None,
                               'score': 4.064626693725586},
                              {'answer': 'no_answer',
                               'context': '',
                               'document_id': '0-0',
                               'offset_answer_end': 0,
                               'offset_answer_start': 0,
      




What does the prediction result show? 

In [23]:
# This is the best answer. It can be either a text span or no_answer, whichever scores higher.
# Here the top answer is the span '2021'
result['predictions'][0]['answers'][0]

{'score': 4.064626693725586,
 'probability': None,
 'answer': '2021',
 'offset_answer_start': 360,
 'offset_answer_end': 364,
 'context': 'the basis year 2011 5 million tonnes co2e until 2021 includes scope 1, scope 2 market- based and par',
 'offset_context_start': 312,
 'offset_context_end': 412,
 'document_id': '0-0'}

In [24]:
# No-answer candidate: its score (about -0.87) is well below the span score (about 4.06),
# so the model favors the view that the context does contain the answer.
result['predictions'][0]['answers'][1]

{'score': -0.8681855201721191,
 'probability': None,
 'answer': 'no_answer',
 'offset_answer_start': 0,
 'offset_answer_end': 0,
 'context': '',
 'offset_context_start': 0,
 'offset_context_end': 0,
 'document_id': '0-0'}
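
The two candidates above can also be compared programmatically. The sketch below assumes the result layout shown in the output (a `predictions` list whose items hold an `answers` list with `answer` and `score` keys) and uses a trimmed copy of that output so that it runs without the model. `top_answer` is a hypothetical helper, not part of FARM.

```python
# Trimmed copy of the inference output shown above.
sample_result = {
    "predictions": [
        {
            "answers": [
                {"answer": "2021", "score": 4.064626693725586},
                {"answer": "no_answer", "score": -0.8681855201721191},
            ]
        }
    ]
}


def top_answer(result):
    """Return (answer, score, margin over the no_answer candidate) for the first question."""
    answers = result["predictions"][0]["answers"]
    best = max(answers, key=lambda a: a["score"])
    no_answer_scores = [a["score"] for a in answers if a["answer"] == "no_answer"]
    # The margin is None when the output contains no no_answer candidate.
    margin = best["score"] - no_answer_scores[0] if no_answer_scores else None
    return best["answer"], best["score"], margin


answer, score, margin = top_answer(sample_result)
print(f"answer={answer!r} score={score:.2f} margin over no_answer={margin:.2f}")
```

Note that `probability` is `None` in this output, so the raw `score` values are only meaningful relative to one another, not as calibrated confidences.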

## Evaluate Model

Next, we will make predictions on the SQuAD-formatted validation file.

In [25]:
# run inference on validation dataset
results = model.inference_from_file(
    file=file_config.dev_filename,
    return_json=False,
)
result_squad = [x.to_squad_eval() for x in results]

write_squad_predictions(
    predictions=result_squad,
    predictions_filename=file_config.dev_filename,
    out_filename=file_config.dev_predictions_filename,
)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.63 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.76 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.04 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.82 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.85 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.80 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.71 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  5.62 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.14 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.26 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  8.26 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.48 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00

The results are written to the path passed as `out_filename`, that is, `file_config.dev_predictions_filename`.

In [26]:
# evaluate the model using the squad evaluation tool provided by farm
# settings for squad evaluation
eval_params = {
    "data_file": file_config.dev_filename,
    "pred_file": file_config.dev_predictions_filename,
    "out_file": file_config.model_performance_metrics_filename,
    "na_prob_thresh": 1,
    "na_prob_file": False,
}
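
The `na_prob_thresh` setting controls when a prediction is overridden to "no answer". The sketch below reproduces the thresholding logic of the official SQuAD v2 evaluation script, on which FARM's `squad_evaluation` module appears to be based. Treat it as an approximation of what `apply_no_ans_threshold` does rather than its exact source. The question ids and scores are invented.

```python
def apply_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
    """Replace a question's score when its no-answer probability exceeds the threshold.

    When the model is deemed to predict "no answer", the score becomes 1.0 for
    questions that truly have no answer and 0.0 otherwise.
    """
    new_scores = {}
    for qid, score in scores.items():
        if na_probs[qid] > na_prob_thresh:
            new_scores[qid] = float(not qid_to_has_ans[qid])
        else:
            new_scores[qid] = score
    return new_scores


# Invented raw F1 scores and ground-truth answerability for three questions.
raw_f1 = {0: 0.8, 1: 0.0, 2: 1.0}
has_ans = {0: True, 1: False, 2: True}

# As in this notebook: all no-answer probabilities are 0.0 and the threshold is 1,
# so no score is overridden.
unchanged = apply_threshold(raw_f1, {0: 0.0, 1: 0.0, 2: 0.0}, has_ans, 1)

# With non-zero probabilities and a lower threshold, questions 1 and 2 are overridden.
overridden = apply_threshold(raw_f1, {0: 0.1, 1: 0.9, 2: 0.7}, has_ans, 0.5)
print(unchanged)
print(overridden)
```

With `na_prob_file` set to `False`, every no-answer probability defaults to 0.0, so the thresholded scores in this notebook equal the raw scores.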

In [27]:
with open(eval_params['data_file']) as f:
    dataset_json = json.load(f)
    dataset = dataset_json['data']

with open(eval_params['pred_file']) as f:
    preds = json.load(f)

# NOTE: in predictions, the keys are strings but need to be converted to ints
# if we want to use squad evaluation file provided by farm
preds = {int(k): v for k, v in preds.items()}

if eval_params['na_prob_file']:
    with open(eval_params['na_prob_file']) as f:
        na_probs = json.load(f)
else:
    na_probs = {k: 0.0 for k in preds}

# maps qid to True/False
qid_to_has_ans = squad_evaluation.make_qid_to_has_ans(dataset)
has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]

# get raw scores
exact_raw, f1_raw = squad_evaluation.get_raw_scores_extended(dataset, preds)

# apply thresholds
exact_thresh = squad_evaluation.apply_no_ans_threshold(
    exact_raw,
    na_probs,
    qid_to_has_ans,
    eval_params['na_prob_thresh'],
)
f1_thresh = squad_evaluation.apply_no_ans_threshold(
    f1_raw,
    na_probs,
    qid_to_has_ans,
    eval_params['na_prob_thresh'],
)

# create results dict
results_squad = squad_evaluation.make_eval_dict(exact_thresh, f1_thresh)
if has_ans_qids:
    has_ans_eval = squad_evaluation.make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
    squad_evaluation.merge_eval(results_squad, has_ans_eval, 'HasAns')
if no_ans_qids:
    no_ans_eval = squad_evaluation.make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
    squad_evaluation.merge_eval(results_squad, no_ans_eval, 'NoAns')

# convert to df
scores_df = pd.DataFrame(results_squad, index=[0])
scores_df

Unnamed: 0,exact,f1,total,HasAns_exact,HasAns_f1,HasAns_total,NoAns_exact,NoAns_f1,NoAns_total
0,74.709302,82.168055,344,71.530249,80.661249,281,88.888889,88.888889,63
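
For reference, the `exact` and `f1` columns follow the usual SQuAD definitions: exact match is 1 when the normalized prediction equals the normalized gold answer, and F1 is the harmonic mean of token-level precision and recall. The sketch below is a simplified version of those definitions (lowercasing, stripping punctuation and English articles); FARM's `get_raw_scores_extended` may differ in detail. The last lines check that the overall `exact` score in the table is the count-weighted average of the `HasAns` and `NoAns` scores.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold, pred):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(gold) == normalize(pred))


def f1_score(gold, pred):
    """Token-level F1 between the normalized gold answer and prediction."""
    gold_tokens = normalize(gold).split()
    pred_tokens = normalize(pred).split()
    if not gold_tokens or not pred_tokens:
        # Both empty (no answer) counts as a match, otherwise as a miss.
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("2021", "2021"), f1_score("5 million tonnes CO2e", "5 million tonnes"))

# Consistency check on the reported table: overall = weighted average of the subsets.
overall_exact = (71.530249 * 281 + 88.888889 * 63) / 344
print(round(overall_exact, 4))
```

The partial-credit behaviour of F1 explains why `f1` is higher than `exact` for answerable questions, while for unanswerable questions the two coincide because a prediction is either empty or not.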


In [28]:
# save locally and upload to s3
scores_df.to_csv(file_config.model_performance_metrics_filename)

# upload performance files to s3
s3c.upload_df_to_s3(
    scores_df,
    s3_prefix=f"{config.BASE_SAVED_MODELS_S3_PREFIX}",
    s3_key="kpi_scores.csv",
    filetype=S3FileType.CSV,
)

{'ResponseMetadata': {'RequestId': 'PCV73MM5ZW2CC99H',
  'HostId': '6NjBcy45TH2f25I8FwpQsoX+6xgMsMm1vsXyfVJp3DMhh3oDKf9pg1Qz8jiDKC75P76iERuQskI=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '6NjBcy45TH2f25I8FwpQsoX+6xgMsMm1vsXyfVJp3DMhh3oDKf9pg1Qz8jiDKC75P76iERuQskI=',
   'x-amz-request-id': 'PCV73MM5ZW2CC99H',
   'date': 'Wed, 13 Jul 2022 16:28:30 GMT',
   'etag': '"0820ac557bbbe05deb432818f2292e41"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"0820ac557bbbe05deb432818f2292e41"'}

## Save Model to S3

In [29]:
buffer = BytesIO()
with zipfile.ZipFile(buffer, 'a') as z:
    for dirname, _, files in os.walk(f"{file_config.saved_models_dir}"):
        for f in files:
            f_path = os.path.join(dirname, f)
            with open(f_path, 'rb') as file_content:
                z.writestr(f"KPI_EXTRACTION/{f}", file_content.read())

In [30]:
buffer.seek(0)
# upload model to s3
s3c._upload_bytes(
    buffer_bytes=buffer,
    prefix=config.BASE_SAVED_MODELS_S3_PREFIX,
    key="KPI_EXTRACTION.zip"
)

{'ResponseMetadata': {'RequestId': '6CGBP33PMPKRAC5Z',
  'HostId': 'EX6/h1B8YXiYFEBk3BuyBDF5hNiZXjoBLq1LK7elzeNDgjlqPrjQnHTMRuLPXQSJVQQpbqZvH1I=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'EX6/h1B8YXiYFEBk3BuyBDF5hNiZXjoBLq1LK7elzeNDgjlqPrjQnHTMRuLPXQSJVQQpbqZvH1I=',
   'x-amz-request-id': '6CGBP33PMPKRAC5Z',
   'date': 'Wed, 13 Jul 2022 16:28:36 GMT',
   'etag': '"d3349b66b152cbe145ccf21dce2db593"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"d3349b66b152cbe145ccf21dce2db593"'}
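
Note that the zipping cell above stores every file under `KPI_EXTRACTION/` by its base name, so files in nested subdirectories would be flattened into one level. If the saved model directory ever contains subfolders, a variant that keeps relative paths may be preferable. The sketch below is one way to do that; `zip_directory`, the demo directory, and the demo file names are illustrative and not part of the project code.

```python
import os
import tempfile
import zipfile
from io import BytesIO


def zip_directory(src_dir, archive_root):
    """Zip src_dir into an in-memory buffer, keeping paths relative to src_dir."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirname, _, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(dirname, name)
                rel_path = os.path.relpath(full_path, src_dir)
                # Zip entries use forward slashes regardless of the host OS.
                arcname = f"{archive_root}/{rel_path.replace(os.sep, '/')}"
                archive.write(full_path, arcname)
    # Rewind so the buffer can be read or uploaded from the start.
    buffer.seek(0)
    return buffer


# Demo on a throwaway directory with one top-level and one nested file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("weights.bin", os.path.join("sub", "vocab.json")):
        with open(os.path.join(tmp, rel), "w") as handle:
            handle.write("demo")
    demo_buffer = zip_directory(tmp, "KPI_EXTRACTION")

with zipfile.ZipFile(demo_buffer) as archive:
    names = sorted(archive.namelist())
print(names)
```

Keeping relative paths also avoids silent overwrites when two subfolders contain files with the same name.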

# Conclusion

In this notebook, we developed a model that can be used for answering a KPI question, given a relevant paragraph (context). With this model in place, the training pipeline is now complete. That is, we have a pipeline that takes a set of raw PDFs, runs extraction, curates the data for relevance training, trains a relevance model, curates the relevance results for question-answering training, and finally trains a question-answering model.