## Train Relevance Model

Now that the dataset has been [extracted](pdf_text_extraction.ipynb) and [curated](pdf_text_curation.ipynb), we will train the relevance classifier in this notebook. The classifier is a transformer model (e.g., BERT) that is loaded into the pipeline already pre-trained on the NQ dataset and is then fine-tuned on the curated data for our specific relevance detection task.

Our pipeline includes components provided by the FARM library. FARM is a framework that facilitates transfer learning for BERT-based models. Documentation for FARM is available here: https://farm.deepset.ai.

In [1]:
import os
import config
import zipfile
import pathlib
from io import BytesIO

import pandas as pd

from dotenv import load_dotenv

from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

from src.models import FARMTrainer
from src.data.s3_communication import S3Communication, S3FileType

from config_farm_train import (
    FileConfig,
    ModelConfig,
    MLFlowConfig,
    TrainingConfig,
    TokenizerConfig,
    ProcessorConfig,
)
from farm.infer import Inferencer

07/13/2022 16:02:14 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

#### Set parameters

Before starting training, parameters for each component of the training pipeline must be set. For this we create `config` objects which hold these parameters. Default values have already been set but they can be easily changed. To do so, you can manually update the parameters in the corresponding config file:

`aicoe-osc-demo/notebooks/demo2/config_farm_train.py`
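
Parameters can also be overridden in the notebook after a config object has been created, which is what the cross-validation section does with `train_config`. The sketch below illustrates that pattern with a hypothetical stand-in dataclass, `TrainingConfigSketch`. It is not the real `TrainingConfig` from `config_farm_train.py`; only the attribute names and default values are taken from the outputs printed in this notebook.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfigSketch:
    """Hypothetical stand-in for TrainingConfig (illustration only)."""

    experiment_name: str
    learning_rate: float = 1e-5  # defaults mirror the values printed in this notebook
    n_epochs: int = 1
    batch_size: int = 1
    run_cv: bool = False


# Create the config with its defaults, then override selected attributes.
sketch_config = TrainingConfigSketch("test-demo-2")
sketch_config.n_epochs = 3
sketch_config.batch_size = 8

print(sketch_config)
```

An override made this way lasts only for the current session, whereas editing the config file changes the defaults for every run.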

In [4]:
# Settings for data files and checkpoint parameters
file_config = FileConfig(config.EXPERIMENT_NAME)

# Settings for the processor component
processor_config = ProcessorConfig(config.EXPERIMENT_NAME)

# Settings for the tokenizer
tokenizer_config = TokenizerConfig(config.EXPERIMENT_NAME)
# NOTE: specifically for tokenizer, we need to ensure root dir is a string
tokenizer_config.root = str(tokenizer_config.root)

# Settings for the model
model_config = ModelConfig(config.EXPERIMENT_NAME)

# Settings for training
train_config = TrainingConfig(config.EXPERIMENT_NAME)

# Settings for MLflow tracking
mlflow_config = MLFlowConfig(config.EXPERIMENT_NAME)

We can check the value for some parameters:

In [5]:
print(f"Experiment_name: \n {file_config.experiment_name} \n")
print(f"Data directory: \n {file_config.data_dir} \n")
print(f"Curated dataset path: \n {file_config.curated_data} \n")
print(f"Split train/validation ratio: \n{file_config.dev_split} \n")
print(f"Training dataset path: \n {file_config.train_filename} \n")
print(f"Validation dataset path: \n {file_config.dev_filename} \n")
print(f"Directory where trained model is saved: \n {file_config.saved_models_dir} \n")

Experiment_name: 
 test-demo-2 

Data directory: 
 /opt/app-root/src/aicoe-osc-demo/data 

Curated dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/curation/esg_TEXT_dataset.csv 

Split train/validation ratio: 
0.2 

Training dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/kpi_train_split.csv 

Validation dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/kpi_val_split.csv 

Directory where trained model is saved: 
 /opt/app-root/src/aicoe-osc-demo/models/RELEVANCE 



In [6]:
print(f"Max number of tokens per example: {processor_config.max_seq_len} \n")

Max number of tokens per example: 128 



In [7]:
print(f"Use GPU: {train_config.use_cuda} \n")

Use GPU: True 



In [8]:
print(f"Learning_rate: {train_config.learning_rate} \n")
print(f"Number of epochs for fine tuning: {train_config.n_epochs} \n")
print(f"Batch size: {train_config.batch_size} \n")
print(f"Perform Cross validation: {train_config.run_cv} \n")

Learning_rate: 1e-05 

Number of epochs for fine tuning: 1 

Batch size: 1 

Perform Cross validation: False 



## Load Pretrained Model

A relevance classifier has already been trained on Google's large NQ dataset. We download this checkpoint and save it in the directory `model_config.load_dir`, whose last component is `relevance_roberta`.

In [10]:
# When running in Automation using Elyra and Kubeflow Pipelines,
# set AUTOMATION = 1 as an environment variable
if os.getenv("AUTOMATION"):
    # extracted pdfs
    if not os.path.exists(config.BASE_EXTRACTION_FOLDER):
        config.BASE_EXTRACTION_FOLDER.mkdir(parents=True, exist_ok=True)

    # curated pdfs
    if not os.path.exists(config.BASE_CURATION_FOLDER):
        config.BASE_CURATION_FOLDER.mkdir(parents=True, exist_ok=True)

    # processed data
    if not os.path.exists(config.BASE_PROCESSED_DATA):
        config.BASE_PROCESSED_DATA.mkdir(parents=True, exist_ok=True)

    # load dir
    if not os.path.exists(model_config.load_dir):
        pathlib.Path(model_config.load_dir).mkdir(parents=True, exist_ok=True)

    # download extracted pdfs from s3
    s3c.download_files_in_prefix_to_dir(
        config.BASE_EXTRACTION_S3_PREFIX,
        config.BASE_EXTRACTION_FOLDER,
    )
    # download curated pdfs from s3
    s3c.download_files_in_prefix_to_dir(
        config.BASE_CURATION_S3_PREFIX,
        config.BASE_CURATION_FOLDER,
    )
    # download the pretrained model
    model_root = pathlib.Path(model_config.load_dir).parent
    model_rel_zip = pathlib.Path(model_root, "relevance_roberta.zip")

    s3c.download_file_from_s3(
        model_rel_zip, config.CHECKPOINT_S3_PREFIX, "relevance_roberta.zip"
    )

    with zipfile.ZipFile(pathlib.Path(model_root, "relevance_roberta.zip"), "r") as z:
        z.extractall(model_root)

In [11]:
file_config.data_type = "Text"
print(f"Data type: \n {file_config.data_type} \n")

Data type: 
 Text 



We need to load this model in our pipeline to fine-tune a relevance classifier on our specific ESG curated dataset. For this we have to set the parameter `model_config.load_dir` to be the directory where we saved our first checkpoint. We can check that this is set:

In [12]:
print(f"NQ checkpoint directory: {model_config.load_dir}")

NQ checkpoint directory: /opt/app-root/src/aicoe-osc-demo/models/relevance_roberta


## Fine-tune on curated ESG data

Once all the parameters are set, a `FARMTrainer` object can be instantiated by passing it all the configuration objects:

In [13]:
# init farm trainer
farm_trainer = FARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    processor_config=processor_config,
    model_config=model_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

Call the method `run()` to start training:

In [14]:
farm_trainer.run()

07/13/2022 16:02:27 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: True
07/13/2022 16:02:27 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
07/13/2022 16:02:28 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
07/13/2022 16:02:28 - INFO - farm.data_handler.data_silo -   Loading train set from: /opt/app-root/src/aicoe-osc-demo/data/kpi_train_split.csv 
07/13/2022 16:02:28 - INFO - farm.data_handler.data_silo -   Got ya 7 parallel workers to convert 1175 dictionaries to pytorch datasets (chunksize = 34)...
07/13/2022 16:02:28 - INFO - farm.data_handler.data_silo -    0    0    0    0    0    0    0 
07/13/2022 16:02:28 - INFO - farm.data_handler.data_silo -   /w\  /w\  /w\  /w\  /w\  /|\  /|\
07/13/2022 16:02:28 - INFO

0.9931972789115646

At the end of the training process, the model and the processor vocabulary are saved into the directory `file_config.saved_models_dir`

In [15]:
file_config.saved_models_dir

'/opt/app-root/src/aicoe-osc-demo/models/RELEVANCE'

In [16]:
!ls -al $file_config.saved_models_dir

total 488352
drwxrwsr-x. 2 1000630000 1000630000      4096 Jul 12 21:40 .
drwxrwsr-x. 6 1000630000 1000630000      4096 Jul 12 21:50 ..
-rw-rw-r--. 1 1000630000 1000630000 498669047 Jul 13 16:07 language_model.bin
-rw-rw-r--. 1 1000630000 1000630000       562 Jul 13 16:07 language_model_config.json
-rw-rw-r--. 1 1000630000 1000630000    456318 Jul 13 16:07 merges.txt
-rw-rw-r--. 1 1000630000 1000630000      7489 Jul 13 16:07 prediction_head_0.bin
-rw-rw-r--. 1 1000630000 1000630000       321 Jul 13 16:07 prediction_head_0_config.json
-rw-rw-r--. 1 1000630000 1000630000       735 Jul 13 16:07 processor_config.json
-rw-rw-r--. 1 1000630000 1000630000       772 Jul 13 16:07 special_tokens_map.json
-rw-rw-r--. 1 1000630000 1000630000       224 Jul 13 16:07 tokenizer_config.json
-rw-rw-r--. 1 1000630000 1000630000    898822 Jul 13 16:07 vocab.json


## Cross-validation

To better estimate the performance of the model on new data, it is recommended to perform k-fold cross-validation (CV). CV works as follows:

- Split the entire dataset randomly into k folds (usually 5 to 10)
- Fit the model on k - 1 folds, validate it on the remaining fold, and save the scores
- Repeat until every fold has served as the validation set, then average the saved scores
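
The steps above can be sketched in plain Python. This is a minimal illustration of the general technique, not how `FARMTrainer` or FARM implements it; the helper names `k_fold_indices` and `cross_validate` and the toy scoring function are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Toy usage: the "score" is just the validation fraction, to show the flow.
mean_score = cross_validate(10, 5, lambda train, val: len(val) / 10)
print(mean_score)
```

In a real run, `fit_and_score` would train a fresh model on the training indices and return a metric computed on the validation indices.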

`FARMTrainer` includes this feature. To perform 2-fold CV, proceed as follows:

In [17]:
train_config.run_cv = True
train_config.xval_folds = 2
train_config.n_epochs = 1
train_config.batch_size = 8

In [18]:
farm_trainer = FARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

In [19]:
farm_trainer.run()

07/13/2022 16:07:16 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: True
07/13/2022 16:07:16 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
07/13/2022 16:07:16 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
07/13/2022 16:07:16 - INFO - farm.data_handler.data_silo -   Loading train set from: /opt/app-root/src/aicoe-osc-demo/data/kpi_train_split.csv 
07/13/2022 16:07:17 - INFO - farm.data_handler.data_silo -   Got ya 7 parallel workers to convert 1175 dictionaries to pytorch datasets (chunksize = 34)...
07/13/2022 16:07:17 - INFO - farm.data_handler.data_silo -    0    0    0    0    0    0    0 
07/13/2022 16:07:17 - INFO - farm.data_handler.data_silo -   /w\  /w\  /w\  /w\  /w\  /|\  /|\
07/13/2022 16:07:17 - INFO

**NOTE:** CV mode does not save a checkpoint; it is only used for validation.

## Model Performance Metrics

In this section, we will quantify the performance of the fine-tuned model on our dataset. Specifically, we will calculate the precision, recall, and F1 score for each KPI question.

In [20]:
# load test set
test_data = pd.read_csv(file_config.dev_filename, index_col=0)
test_data.head()

Unnamed: 0,label,text,text_b
59,1,In which year was the annual report or the sus...,Annual Report 2017
287,1,What is the base year for carbon reduction com...,The Group intends to reduce its carbon intensi...
1420,1,What is the volume of estimated proven hydroca...,OMV had proven reserves of approximately 1.03 ...
631,1,What is the target carbon reduction in percent...,The objective for 2025 is to reduce upstream e...
1395,1,What is the volume of estimated proven hydroca...,"As of December 31, 2015, TOTAL's combined prov..."


In [21]:
# get predictions from current model
model = Inferencer.load(file_config.saved_models_dir)

result = model.inference_from_file(file_config.dev_filename)
results = [d for r in result for d in r["predictions"]]
preds = [int(r["label"]) for r in results]

test_data["pred"] = preds

07/13/2022 16:07:59 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
07/13/2022 16:08:02 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
07/13/2022 16:08:02 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
07/13/2022 16:08:02 - INFO - farm.modeling.prediction_head -   Loading prediction head from /opt/app-root/src/aicoe-osc-demo/models/RELEVANCE/prediction_head_0.bin
07/13/2022 16:08:02 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
07/13/2022 16:08:02 - INFO - farm.data_handler.processor -   Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for using the default task or add a custom task later via processor.add_task()
07/13/2022 16:08:02 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
07/13/2022 16:0
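
The list comprehensions in the cell above flatten the batched inference output into one prediction per row. The sketch below reproduces that step on mock data. Only the keys the cell actually reads (`"predictions"` and `"label"`) are mocked; the real output may carry additional fields, and `n_rows` is a hypothetical stand-in for `len(test_data)`.

```python
# Mock output shaped like the structure the cell above iterates over:
# a list of batches, each holding a "predictions" list of dicts with a "label".
mock_result = [
    {"predictions": [{"label": "1"}, {"label": "0"}]},
    {"predictions": [{"label": "1"}]},
]

flat = [d for batch in mock_result for d in batch["predictions"]]
preds = [int(d["label"]) for d in flat]

# The column assignment is positional: prediction i is assumed to belong
# to row i of the file. This check makes the count assumption explicit.
n_rows = 3  # stands in for len(test_data)
if len(preds) != n_rows:
    raise ValueError(f"expected {n_rows} predictions, got {len(preds)}")

print(preds)
```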

In [22]:
# evaluate performance for each KPI question
groups = test_data.groupby("text")
scores = {}
for group, data in groups:
    pred = data.pred
    true = data.label
    scores[group] = {}
    scores[group]["accuracy"] = accuracy_score(true, pred)
    scores[group]["f1_score"] = f1_score(true, pred)
    scores[group]["recall_score"] = recall_score(true, pred)
    scores[group]["precision_score"] = precision_score(true, pred)
    scores[group]["support"] = len(pred)

In [23]:
# KPI-wise performance metrics
scores_df = pd.DataFrame(scores)
scores_df.head()

Unnamed: 0,In which year was the annual report or the sustainability report published?,What is the annual total production from coal?,What is the base year for carbon reduction commitment?,What is the climate commitment scenario considered?,What is the company name?,What is the target carbon reduction in percentage?,What is the target year for climate commitment?,What is the total amount of direct greenhouse gases emissions referred to as scope 1 emissions?,What is the total amount of energy indirect greenhouse gases emissions referred to as scope 2 emissions?,What is the total amount of scope 1 and 2 greenhouse gases emissions?,"What is the total amount of scope 1, scope 2 and scope 3 greenhouse gases emissions?",What is the total amount of upstream energy indirect greenhouse gases emissions referred to as scope 3 emissions?,What is the total installed capacity from coal?,What is the total volume of crude oil liquid production?,What is the total volume of hydrocarbons production?,What is the total volume of natural gas liquid production?,What is the total volume of natural gas production?,What is the total volume of proven and probable hydrocarbons reserves?,What is the volume of estimated proven hydrocarbons reserves?
accuracy,1.0,1.0,1.0,0.96,0.971429,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f1_score,1.0,1.0,1.0,0.971429,0.981132,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
recall_score,1.0,1.0,1.0,1.0,0.962963,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
precision_score,1.0,1.0,1.0,0.944444,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
support,39.0,5.0,18.0,25.0,35.0,22.0,28.0,13.0,9.0,2.0,1.0,8.0,2.0,1.0,28.0,4.0,13.0,21.0,20.0
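
As a worked check of how these metrics relate, take the column for "What is the climate commitment scenario considered?" in the table above (support 25, accuracy 0.96, precision 0.944444, recall 1.0). The confusion counts used below are not reported by the notebook; they are inferred here as the counts consistent with those values, and the helper `binary_metrics` is written only for this illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Inferred counts (assumption): 17 true positives, 1 false positive,
# 0 false negatives, 7 true negatives, for a support of 25.
accuracy, precision, recall, f1 = binary_metrics(tp=17, fp=1, fn=0, tn=7)
print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
# prints: 0.96 0.944444 1.0 0.971429
```

The F1 score of 0.971429 matches the table, which shows that a single false positive among 25 examples accounts for every value below 1.0 in that column.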


In [24]:
# save results locally
if not os.getenv("AUTOMATION"):
    scores_df.to_csv(file_config.model_performance_metrics_file)

In [None]:
if os.getenv("AUTOMATION"):
    # upload the train and test split files to s3
    s3c.upload_file_to_s3(
        file_config.train_filename,
        config.BASE_TRAIN_TEST_DATASET_S3_PREFIX,
        "rel_train_split.csv"
    )
    s3c.upload_file_to_s3(
        file_config.dev_filename,
        config.BASE_TRAIN_TEST_DATASET_S3_PREFIX,
        "rel_test_split.csv"
    )

## Save model to s3

Great, we have a fine-tuned model at this point. We will now save this model, as well as its performance metrics, to S3.

In [25]:
# upload performance files to s3
s3c.upload_df_to_s3(
    scores_df,
    s3_prefix=config.BASE_SAVED_MODELS_S3_PREFIX,
    s3_key="relevance_scores.csv",
    filetype=S3FileType.CSV,
)

{'ResponseMetadata': {'RequestId': 'CET4PGRKP6SKEMMM',
  'HostId': 'pZOSuLYNLXzheNMBZCMH+fiiDhCYxQNCFCB2GOpVqkncEXYh1nnjlraOnKwsvHv5EtbWgBLRCMo=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'pZOSuLYNLXzheNMBZCMH+fiiDhCYxQNCFCB2GOpVqkncEXYh1nnjlraOnKwsvHv5EtbWgBLRCMo=',
   'x-amz-request-id': 'CET4PGRKP6SKEMMM',
   'date': 'Wed, 13 Jul 2022 16:09:08 GMT',
   'etag': '"3ab74db546b026fac40d8dde1f078ca2"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"3ab74db546b026fac40d8dde1f078ca2"'}

In [26]:
buffer = BytesIO()
with zipfile.ZipFile(buffer, 'a') as z:
    for dirname, _, files in os.walk(file_config.saved_models_dir):
        for f in files:
            f_path = os.path.join(dirname, f)
            with open(f_path, 'rb') as file_content:
                z.writestr(f"RELEVANCE/{f}", file_content.read())
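
Before uploading, the in-memory archive can be inspected to confirm it contains the expected entries. The sketch below is self-contained and only illustrative: a temporary directory with a single placeholder file stands in for `file_config.saved_models_dir`, the file is zipped into a `BytesIO` buffer under the same `RELEVANCE/` prefix, and the entries are listed.

```python
import os
import tempfile
import zipfile
from io import BytesIO

with tempfile.TemporaryDirectory() as model_dir:
    # Placeholder file with dummy content, standing in for the saved model files.
    with open(os.path.join(model_dir, "processor_config.json"), "w") as fh:
        fh.write("{}")

    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as z:
        for dirname, _, files in os.walk(model_dir):
            for name in files:
                # arcname sets the path stored inside the archive.
                z.write(os.path.join(dirname, name), arcname=f"RELEVANCE/{name}")

# Rewind before reading (or uploading) the buffer.
buffer.seek(0)
with zipfile.ZipFile(buffer) as z:
    entry_names = z.namelist()

print(entry_names)
```

Reading the archive moves the buffer position, so `buffer.seek(0)` is needed again before the buffer is passed to an upload call.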

In [27]:
buffer.seek(0)
# upload model to s3
s3c._upload_bytes(
    buffer_bytes=buffer,
    prefix=config.BASE_SAVED_MODELS_S3_PREFIX,
    key="RELEVANCE.zip"
)

{'ResponseMetadata': {'RequestId': 'KJ4971ZTBB0PYN26',
  'HostId': 'uu3MbfW0G78Ba32m/8qcFuABPkWZpAp9n8SsF057wmAb0usaBskB89epBmuYGbjOFPjO4biGIbw=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'uu3MbfW0G78Ba32m/8qcFuABPkWZpAp9n8SsF057wmAb0usaBskB89epBmuYGbjOFPjO4biGIbw=',
   'x-amz-request-id': 'KJ4971ZTBB0PYN26',
   'date': 'Wed, 13 Jul 2022 16:09:10 GMT',
   'etag': '"0e5f4b5fd9e45485af9c40bc4167840a"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"0e5f4b5fd9e45485af9c40bc4167840a"'}

## Conclusion

In this notebook, we developed a model that finds the paragraphs relevant to a KPI question, given a list of paragraphs from climate report PDFs. With this model in place, we can move on to training the KPI extraction model.