## Hugging Face relevance model

This notebook first tries zero-shot learning on the climate relevance task, in other words direct prediction with a pretrained model (the pipeline default, facebook/bart-large-mnli) and no task-specific training. It then fine-tunes a DistilBERT model for the relevance task using the Hugging Face transformers package.

In [None]:
import config
import os
import pathlib
from dotenv import load_dotenv
from statistics import median
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from src.data.s3_communication import S3Communication
from sparsezoo import Model
import zipfile
from io import BytesIO
import evaluate
from datasets import Dataset, DatasetDict
import pandas as pd
import datasets
import random
import numpy as np
from IPython.display import display, HTML
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, pipeline
from transformers import AutoTokenizer
import shutil

In [None]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [None]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

## Retrieve the train and test datasets

In [9]:
s3c.download_files_in_prefix_to_dir(
    config.BASE_TRAIN_TEST_DATASET_S3_PREFIX,
    config.BASE_PROCESSED_DATA)

In [10]:
test_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_test_split.csv'
test_data = pd.read_csv(test_data_path, index_col=0)
test_data.rename(columns={'text':'question', 'text_b':'sentence'}, inplace=True)

train_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_train_split.csv'
train_data = pd.read_csv(train_data_path, index_col=0)
train_data.rename(columns={'text':'question', 'text_b':'sentence'}, inplace=True)

In [11]:
train_data[train_data['question']=='What is the annual total production from lignite (brown coal)?']

Unnamed: 0,label,question,sentence
403,0,What is the annual total production from ligni...,"PJM's Operating ORDC Filing — On March 29, 201..."
402,1,What is the annual total production from ligni...,64.8 million metric tons of lignite produced


In [12]:
trds = Dataset.from_pandas(train_data)
teds = Dataset.from_pandas(test_data.drop('label', axis=1))

climate_dataset = DatasetDict()

climate_dataset['train'] = trds
climate_dataset['test'] = teds

## Try zero-shot learning, i.e. direct inference with a pretrained model

In [None]:
sequences = (test_data['question'] + ' [SEP] ' + test_data['sentence']).values.tolist()
classifier = pipeline(task='zero-shot-classification')

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [9]:
def make_batches(sequences, size=10):
    # Split the sequences into consecutive chunks of at most `size` items.
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]


batches = make_batches(sequences, size=75)

In [10]:
results = list()
for batch in batches:
    results.extend(classifier(batch, [0, 1]))

In [11]:
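# Note: the pipeline sorts labels by descending score, so scores[0] is the
# score of the top-ranked label for each input, not of a fixed label.
label_1 = [result['scores'][0] for result in results]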
cutoff = median(label_1)
pred = list()
for label in label_1:
    pred.append(1 if label > cutoff else 0)
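
The zero-shot pipeline returns one dict per input whose `labels` and `scores` lists are sorted by descending score, so `scores[0]` belongs to whichever label ranked first. The sketch below shows one way to read the score of a specific label and then apply the same median cutoff. It runs on made-up result dicts in the pipeline's output shape (the values are illustrative assumptions, not outputs of this notebook), and `score_for_label` and `median_cutoff_predictions` are hypothetical helper names.

```python
from statistics import median


def score_for_label(result, label):
    """Return the score assigned to `label` in one pipeline result dict."""
    return result["scores"][result["labels"].index(label)]


def median_cutoff_predictions(scores):
    """Predict 1 for scores strictly above the median, else 0."""
    cutoff = median(scores)
    return [1 if s > cutoff else 0 for s in scores]


# Made-up results; labels are sorted by descending score, as the pipeline does.
fake_results = [
    {"sequence": "a", "labels": [1, 0], "scores": [0.9, 0.1]},
    {"sequence": "b", "labels": [0, 1], "scores": [0.7, 0.3]},
    {"sequence": "c", "labels": [0, 1], "scores": [0.6, 0.4]},
    {"sequence": "d", "labels": [1, 0], "scores": [0.8, 0.2]},
]

label_1_scores = [score_for_label(r, 1) for r in fake_results]
predictions = median_cutoff_predictions(label_1_scores)
print(label_1_scores)
print(predictions)
```

Note that a median cutoff marks at most half of the inputs as relevant, regardless of how many truly are, which may not match the label distribution of the test set.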

In [12]:
test_data["pred"] = pred

In [14]:
# evaluate performance
groups = test_data.groupby("question")
scores = {}
for group, data in groups:
    pred = data.pred
    true = data.label
    scores[group] = {}
    scores[group]["accuracy"] = accuracy_score(true, pred)
    scores[group]["f1_score"] = f1_score(true, pred)
    scores[group]["recall_score"] = recall_score(true, pred)
    scores[group]["precision_score"] = precision_score(true, pred)
    scores[group]["support"] = len(pred)



In [15]:
# kpi wise performance metrics
scores_df = pd.DataFrame(scores)
scores_df.head()

Unnamed: 0,In which year was the annual report or the sustainability report published?,What is the annual total production from coal?,What is the base year for carbon reduction commitment?,What is the climate commitment scenario considered?,What is the company name?,What is the target carbon reduction in percentage?,What is the target year for climate commitment?,What is the total amount of direct greenhouse gases emissions referred to as scope 1 emissions?,What is the total amount of energy indirect greenhouse gases emissions referred to as scope 2 emissions?,What is the total amount of scope 1 and 2 greenhouse gases emissions?,...,What is the total amount of upstream energy indirect greenhouse gases emissions referred to as scope 3 emissions?,What is the total installed capacity from coal?,What is the total installed capacity from lignite (brown coal)?,What is the total volume of crude oil liquid production?,What is the total volume of hydrocarbons production?,What is the total volume of natural gas liquid production?,What is the total volume of natural gas production?,What is the total volume of proven and probable hydrocarbons reserves?,What is the volume of estimated probable hydrocarbons reserves?,What is the volume of estimated proven hydrocarbons reserves?
accuracy,0.695652,0.166667,0.32,0.52,0.603774,0.470588,0.608696,0.555556,0.538462,0.0,...,0.466667,0.5,1.0,0.333333,0.371429,0.5,0.3125,0.454545,0.0,0.531915
f1_score,0.666667,0.285714,0.26087,0.25,0.571429,0.526316,0.357143,0.692308,0.0,0.0,...,0.0,0.5,0.0,0.0,0.371429,0.5,0.352941,0.4375,0.0,0.592593
recall_score,0.717949,0.25,0.25,0.181818,0.7,0.833333,0.25,1.0,0.0,0.0,...,0.0,0.333333,0.0,0.0,0.464286,0.5,0.6,0.411765,0.0,0.727273
precision_score,0.622222,0.333333,0.272727,0.4,0.482759,0.384615,0.625,0.529412,0.0,0.0,...,0.0,1.0,0.0,0.0,0.309524,0.5,0.25,0.466667,0.0,0.5
support,92.0,6.0,25.0,25.0,53.0,34.0,46.0,18.0,13.0,2.0,...,15.0,4.0,1.0,3.0,70.0,4.0,16.0,33.0,1.0,47.0


In [16]:
scores_df.loc['f1_score'].mean()

0.30309084611632864

An average F1 score of about 0.30 is poor.
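
Some KPI groups in the table have very few rows (support of 1 or 2). When a group has no positive labels and no positive predictions, F1 is undefined, and scikit-learn reports 0 by default with a warning, even if accuracy for that group is 1.0 (as in the "total installed capacity from lignite" column). These zeros then pull down the unweighted mean over KPIs. A minimal hand-rolled sketch of that effect, using toy groups that are assumptions and not data from this notebook:

```python
def f1_binary(true, pred):
    """F1 for the positive class (1); returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy groups: (true labels, predicted labels). Names are hypothetical.
groups = {
    "kpi_a": ([1, 0, 1, 0], [1, 0, 0, 0]),  # one hit, one miss
    "kpi_b": ([0], [0]),                    # correct, but no positives at all
}

per_group = {name: f1_binary(t, p) for name, (t, p) in groups.items()}
mean_f1 = sum(per_group.values()) / len(per_group)
print(per_group, mean_f1)
```

Weighting each group by its support, or computing a single F1 over all rows, are alternative summaries that are less sensitive to tiny groups.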

## Using a DistilBERT model for the task

In [13]:
test_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_test_split.csv'
test_data = pd.read_csv(test_data_path, index_col=0)
test_data.rename(columns={'text':'question', 'text_b':'sentence'}, inplace=True)

In [14]:
task = "qnli"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

In [15]:
actual_task = "qnli"
metric = evaluate.load('f1')

In [17]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))


show_random_elements(climate_dataset["train"])

Unnamed: 0,label,question,sentence,__index_level_0__
0,1,What is the base year for carbon reduction commitment?,"The global challenge of climate change will dominate many debates in 2020 and the years ahead. Equinor`s joint statement with Climate Action 100+ from April 2019, forms the starting point for our investor dialogue in support of the goals of the Paris Agreement. In our updated climate roadmap, we recognise the need for significant changes in the energy markets, which means that also Equinor`s portfolio will have to change accordingly to remain competitive. We will produce less oil in a low carbon future, but value creation will still be high. Oil and gas production with low greenhouse gas emissions will be an even stronger competitive advantage for us. In addition, profitable growth in renewables gives significant new opportunities to create attractive returns.",497
1,0,In which year was the annual report or the sustainability report published?,"The new Industrial Plan gives impetus to growth through an integrated business model. The portfolio of conventional assets1, the high percentage of gas reserves and the development of renewable sources thanks to synergies with Eni’s industrial assets will favour the evolution of the business model towards a low-carbon scenario, also thanks to technological development and digitalization in support of asset integrity and operating efficiency. Moreover, in the Gas & Power sector Eni will continue to restructure its procurement portfolio and re- duce logistics costs, also by increasing integration with other businesses including LNG and Trading. The Plan provides for the continued development of Green projects, including the start-up of the Gela green refinery plant and the expansion of the Venice plant, as well as the commitment to sustainable mobility through the increased supply of alternative fuels and the growth of enjoy2. Circular economy initiatives for waste transformation will also be developed; through these, Eni aims to reduce green- house gas emissions in production processes by increasing energy efficiency.",307
2,1,In which year was the annual report or the sustainability report published?,BP Sustainability Report 2019,111
3,1,In which year was the annual report or the sustainability report published?,Cabot Oil & Gas Corporation 2019 Annual Report,122
4,1,What is the total amount of direct greenhouse gases emissions referred to as scope 1 emissions?,"In 2018, direct emissions of CO2 equivalent (Scope 1) amounted to approximately 95 million equivalent tons, registering a decrease of 10% compared to 2017.",1469
5,0,What is the company name?,EmblaHod Deep WestEldﬁskEkoﬁskEddaVest ekoﬁskExploration prospect,747
6,1,What is the total volume of hydrocarbons production?,Upstream plans to maintain production at around 110 mboepd in 2019 (and at 100-110 mboepd in 2019-23),2021
7,0,What is the total volume of hydrocarbons production?,"Hunting Dearborn is a world leader in the deep drilling of high grade, non-magnetic components. As a Group, Hunting has the ability to produce fully integrated advanced downhole tools and equipment, manufactured, assembled and tested to the customer’s specifications using its proprietary know-how.",1847
8,0,What is the total volume of natural gas production?,"Galp is a member of the London Benchmarking Group and uses its methodology, which is an international benchmark to classify, manage, measure and communicate its contribution to society.",2114
9,0,What is the total volume of proven and probable hydrocarbons reserves?,"FormoresegmentreportingandthereconciliationofthesefigureswiththeIFRS‐EUFinancialStatements,seeAppendix II.",2245


In [18]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
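
In the output above, 101 and 102 are the ids of the `[CLS]` and `[SEP]` special tokens in the bert-base-uncased vocabulary, which DistilBERT shares. The output contains no `token_type_ids`, so the `[SEP]` token is the only marker separating the two sentences. The sketch below splits the printed ids back into the two segments; `split_pair` is a hypothetical helper for illustration only.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-uncased vocabulary

# Ids copied from the tokenizer output above.
input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]


def split_pair(ids):
    """Split a pair encoding [CLS] A [SEP] B [SEP] into the ids of A and B."""
    if ids[0] != CLS_ID or ids[-1] != SEP_ID:
        raise ValueError("expected an encoding of the form [CLS] A [SEP] B [SEP]")
    first_sep = ids.index(SEP_ID)
    return ids[1:first_sep], ids[first_sep + 1:-1]


first, second = split_pair(input_ids)
print(len(first), len(second))
```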

In [19]:
task_to_keys = {"qnli": ("question", "sentence")}
sentence1_key, sentence2_key = task_to_keys[actual_task]
print(f"Sentence 1: {climate_dataset['train'][0][sentence1_key]}")
print(f"Sentence 2: {climate_dataset['train'][0][sentence2_key]}")

Sentence 1: What is the climate commitment scenario considered?
Sentence 2: This is the motivation behind Enel's support for the initiatives undertaken by the countries in which it operates, aimed at achieving the objectives established in the Paris Agreement. The commitment to the SDGs was strengthened by setting targets through 2030, strengthening the objective of reducing specific CO2 emissions to 0.23 kg/kWheq (SDG 13) and increasing the level of interaction between the Group and local communities, fostering their access to education (SDG 4), energy (SDG 7) and employment as well as sustainable and inclusive economic growth (SDG 8).


In [20]:
def preprocess_function(examples):
    return tokenizer(examples[sentence1_key],
                     examples[sentence2_key],
                     truncation=True)


encoded_climate_dataset = climate_dataset.map(preprocess_function, batched=True)


In [21]:
num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
metric_name="f1"
model_name = model_checkpoint.split("/")[-1]
model_name

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classi

'distilbert-base-uncased'

In [22]:
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "no",
    save_strategy = "no",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=16,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
)

In [23]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [24]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_climate_dataset["train"],
    eval_dataset=None,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: __index_level_0__, sentence, question. If __index_level_0__, sentence, question are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2033
  Num Epochs = 16
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2048


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.1675
1000,0.006
1500,0.0006
2000,0.0001




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=2048, training_loss=0.042531981715001166, metrics={'train_runtime': 241.1649, 'train_samples_per_second': 134.879, 'train_steps_per_second': 8.492, 'total_flos': 1815869827636536.0, 'train_loss': 0.042531981715001166, 'epoch': 16.0})
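
The 2048 optimization steps reported in the training log follow from the dataset size, batch size, and epoch count. A quick arithmetic check, assuming a single device and no gradient accumulation (consistent with the log, which shows a total train batch size of 16 and 1 accumulation step):

```python
import math

num_examples = 2033   # "Num examples" in the training log
batch_size = 16       # per-device batch size, single device assumed
num_epochs = 16

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Cross-check against the reported runtime in seconds.
steps_per_second = total_steps / 241.1649
print(steps_per_epoch, total_steps, round(steps_per_second, 3))
```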

In [27]:
teds = Dataset.from_pandas(test_data)
climate_dataset['test'] = teds
encoded_climate_dataset_wl = climate_dataset.map(preprocess_function, batched=True)


In [28]:
pred = trainer.predict(encoded_climate_dataset_wl['test'])

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: __index_level_0__, sentence, question. If __index_level_0__, sentence, question are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 509
  Batch size = 16


In [29]:
label_pred = np.argmax(pred.predictions, axis=1)
test_data["pred"]=label_pred
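
`pred.predictions` holds one row of raw logits per example, and `np.argmax(..., axis=1)` picks the index of the larger logit. For two classes this matches thresholding the softmax probability of class 1 at 0.5. A dependency-free sketch with made-up logits (assumptions, not outputs of this model):

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits for three examples: [logit for class 0, logit for class 1].
logits = [[2.0, -1.0], [-0.5, 0.5], [0.1, 0.3]]

labels = [argmax(row) for row in logits]
probs_class_1 = [softmax(row)[1] for row in logits]
thresholded = [1 if p > 0.5 else 0 for p in probs_class_1]
print(labels, thresholded)
```

Keeping the class-1 probability around, rather than only the argmax, would also allow a different decision threshold if recall or precision needs to be favored.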

In [30]:
# evaluate performance
groups = test_data.groupby("question")
scores = {}
for group, data in groups:
    pred = data.pred
    true = data.label
    scores[group] = {}
    scores[group]["accuracy"] = accuracy_score(true, pred)
    scores[group]["f1_score"] = f1_score(true, pred)
    scores[group]["recall_score"] = recall_score(true, pred)
    scores[group]["precision_score"] = precision_score(true, pred)
    scores[group]["support"] = len(pred)

# kpi wise performance metrics
scores_df = pd.DataFrame(scores)
scores_df.head()



Unnamed: 0,In which year was the annual report or the sustainability report published?,What is the annual total production from coal?,What is the base year for carbon reduction commitment?,What is the climate commitment scenario considered?,What is the company name?,What is the target carbon reduction in percentage?,What is the target year for climate commitment?,What is the total amount of direct greenhouse gases emissions referred to as scope 1 emissions?,What is the total amount of energy indirect greenhouse gases emissions referred to as scope 2 emissions?,What is the total amount of scope 1 and 2 greenhouse gases emissions?,...,What is the total amount of upstream energy indirect greenhouse gases emissions referred to as scope 3 emissions?,What is the total installed capacity from coal?,What is the total installed capacity from lignite (brown coal)?,What is the total volume of crude oil liquid production?,What is the total volume of hydrocarbons production?,What is the total volume of natural gas liquid production?,What is the total volume of natural gas production?,What is the total volume of proven and probable hydrocarbons reserves?,What is the volume of estimated probable hydrocarbons reserves?,What is the volume of estimated proven hydrocarbons reserves?
accuracy,0.847826,1.0,0.96,0.96,0.886792,0.970588,0.956522,0.944444,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.75,1.0,1.0,1.0,1.0
f1_score,0.847826,1.0,0.96,0.952381,0.857143,0.96,0.952381,0.947368,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,0.8,1.0,1.0,1.0,1.0
recall_score,1.0,1.0,1.0,0.909091,0.9,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
precision_score,0.735849,1.0,0.923077,1.0,0.818182,0.923077,0.909091,0.9,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,0.666667,1.0,1.0,1.0,1.0
support,92.0,6.0,25.0,25.0,53.0,34.0,46.0,18.0,13.0,2.0,...,15.0,4.0,1.0,3.0,70.0,4.0,16.0,33.0,1.0,47.0


In [31]:
scores_df.loc['f1_score'].mean()

0.9179571080911388

This F1 score of ~91.8% is much better. So far, this notebook has successfully used Hugging Face transformers for the relevance task. The FARM model had an F1 score of around 91%, so this slightly surpasses the original F1 score.

## Save model locally and to s3

In [47]:
local_model_path = '/opt/app-root/src/aicoe-osc-demo/models/transformers/RELEVANCE'
trainer.save_model(local_model_path)
shutil.make_archive(local_model_path, 'zip', local_model_path)

'/opt/app-root/src/aicoe-osc-demo/models/transformers/RELEVANCE.zip'

In [49]:
buffer = BytesIO()
with zipfile.ZipFile(buffer, 'a') as z:
    for dirname, _, files in os.walk(local_model_path):
        for f in files:
            f_path = os.path.join(dirname, f)
            with open(f_path, 'rb') as file_content:
                z.writestr(f"RELEVANCE/{f}", file_content.read())
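
The loop above stores every file as `RELEVANCE/{f}`, which flattens any subdirectories into one level. That is fine for a flat model directory, but if the saved directory ever contained nested folders (an assumption, not something observed in this notebook), a variant that preserves relative paths could look like the sketch below. `zip_dir_to_buffer` is a hypothetical helper name, and the demo files are placeholders.

```python
import os
import tempfile
import zipfile
from io import BytesIO


def zip_dir_to_buffer(src_dir, arc_root):
    """Zip src_dir into an in-memory buffer, keeping paths relative to src_dir."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as z:
        for dirname, _, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(dirname, name)
                rel_path = os.path.relpath(full_path, src_dir)
                # Zip archives use forward slashes regardless of the OS.
                arcname = arc_root + "/" + rel_path.replace(os.sep, "/")
                z.write(full_path, arcname)
    buffer.seek(0)  # rewind so the buffer can be read or uploaded
    return buffer


# Demo on a throwaway directory with one nested file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "config.json"), "w") as fh:
        fh.write("{}")
    with open(os.path.join(tmp, "sub", "weights.bin"), "wb") as fh:
        fh.write(b"\x00\x01")
    buf = zip_dir_to_buffer(tmp, "RELEVANCE")
    names = sorted(zipfile.ZipFile(buf).namelist())
print(names)
```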

In [50]:
buffer.seek(0)
# upload model to s3
s3c._upload_bytes(
    buffer_bytes=buffer,
    prefix=config.BASE_SAVED_MODELS_S3_PREFIX,
    key="RELEVANCE.zip"
)

{'ResponseMetadata': {'RequestId': 'CAH611SY1SHVETHG',
  'HostId': 'GuFNxgxi116yMs8jPJo7aUpKXq8zwT7ELbrNnIJC6gB+gDlnUyeE/Zv6nYMqgAoH0IQD17nJqC0=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'GuFNxgxi116yMs8jPJo7aUpKXq8zwT7ELbrNnIJC6gB+gDlnUyeE/Zv6nYMqgAoH0IQD17nJqC0=',
   'x-amz-request-id': 'CAH611SY1SHVETHG',
   'date': 'Thu, 20 Oct 2022 17:43:17 GMT',
   'etag': '"2f37148816b514eeeee532469d37e4ef"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"2f37148816b514eeeee532469d37e4ef"'}

## Sparse model

In [128]:
test_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_test_split.csv'
test_data = pd.read_csv(test_data_path, index_col=0)
test_data.rename(columns={'text':'question', 'text_b':'sentence'}, inplace=True)

In [129]:
stub='zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni'
path='/opt/app-root/src/aicoe-osc-demo/models/distilbert'
sparse_model = Model(stub, path)

In [None]:
sparse_model = Model(stub, path)
sparse_model.download()

In [131]:
num_labels = 2
path='/opt/app-root/src/aicoe-osc-demo/models/distilbert/training'
sparse_model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=num_labels)

loading configuration file /opt/app-root/src/aicoe-osc-demo/models/distilbert/training/config.json
Model config DistilBertConfig {
  "_name_or_path": "/opt/app-root/src/aicoe-osc-demo/models/distilbert/training",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "mnli",
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.23.1",
  "vocab_size": 30522
}

loading weights file /opt/app-root/src/aicoe-osc-demo/models/distilbert/training/pytorch_model.bin
Some weights of the model checkpoint at /opt/app-root/src/aicoe-osc-demo/models/distilbert/

In [132]:
trainer = Trainer(
    sparse_model,
    args,
    train_dataset=encoded_climate_dataset["train"],
    eval_dataset=None,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: question, sentence, __index_level_0__. If question, sentence, __index_level_0__ are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2033
  Num Epochs = 16
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2048
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.4027
1000,0.1914
1500,0.0951
2000,0.0728




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=2048, training_loss=0.1870914592873305, metrics={'train_runtime': 240.8874, 'train_samples_per_second': 135.034, 'train_steps_per_second': 8.502, 'total_flos': 1815869827636536.0, 'train_loss': 0.1870914592873305, 'epoch': 16.0})

In [136]:
trainer.evaluate(eval_dataset=encoded_climate_dataset['test'])

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: question, sentence, __index_level_0__. If question, sentence, __index_level_0__ are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 509
  Batch size = 16


{'eval_runtime': 1.1964,
 'eval_samples_per_second': 425.446,
 'eval_steps_per_second': 26.747,
 'epoch': 16.0}

In [133]:
pred = trainer.predict(encoded_climate_dataset['test'])
test_data["pred"] = np.argmax(pred.predictions, axis=1)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: question, sentence, __index_level_0__. If question, sentence, __index_level_0__ are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 509
  Batch size = 16


In [134]:
# evaluate performance
groups = test_data.groupby("question")
scores = {}
for group, data in groups:
    pred = data.pred
    true = data.label
    scores[group] = {}
    scores[group]["accuracy"] = accuracy_score(true, pred)
    scores[group]["f1_score"] = f1_score(true, pred)
    scores[group]["recall_score"] = recall_score(true, pred)
    scores[group]["precision_score"] = precision_score(true, pred)
    scores[group]["support"] = len(pred)

# kpi wise performance metrics
scores_df = pd.DataFrame(scores)
scores_df.head()



Unnamed: 0,In which year was the annual report or the sustainability report published?,What is the annual total production from coal?,What is the base year for carbon reduction commitment?,What is the climate commitment scenario considered?,What is the company name?,What is the target carbon reduction in percentage?,What is the target year for climate commitment?,What is the total amount of direct greenhouse gases emissions referred to as scope 1 emissions?,What is the total amount of energy indirect greenhouse gases emissions referred to as scope 2 emissions?,What is the total amount of scope 1 and 2 greenhouse gases emissions?,...,What is the total amount of upstream energy indirect greenhouse gases emissions referred to as scope 3 emissions?,What is the total installed capacity from coal?,What is the total installed capacity from lignite (brown coal)?,What is the total volume of crude oil liquid production?,What is the total volume of hydrocarbons production?,What is the total volume of natural gas liquid production?,What is the total volume of natural gas production?,What is the total volume of proven and probable hydrocarbons reserves?,What is the volume of estimated probable hydrocarbons reserves?,What is the volume of estimated proven hydrocarbons reserves?
accuracy,0.858696,1.0,1.0,0.92,0.830189,1.0,0.956522,0.833333,0.923077,1.0,...,1.0,1.0,1.0,1.0,0.971429,1.0,1.0,0.939394,1.0,1.0
f1_score,0.850575,1.0,1.0,0.909091,0.790698,1.0,0.952381,0.857143,0.857143,1.0,...,1.0,1.0,0.0,1.0,0.965517,1.0,1.0,0.941176,1.0,1.0
recall_score,0.948718,1.0,1.0,0.909091,0.85,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.941176,1.0,1.0
precision_score,0.770833,1.0,1.0,0.909091,0.73913,1.0,0.909091,0.75,0.75,1.0,...,1.0,1.0,0.0,1.0,0.933333,1.0,1.0,0.941176,1.0,1.0
support,92.0,6.0,25.0,25.0,53.0,34.0,46.0,18.0,13.0,2.0,...,15.0,4.0,1.0,3.0,70.0,4.0,16.0,33.0,1.0,47.0


In [135]:
scores_df.loc['f1_score'].mean()

0.9106535083232097

This is similar to the previous model's score, though slightly lower (about 0.911 versus 0.918). Next, let's see what the inference timings are for these models in the transformer_inference notebook.