# Fine-tuning a Transformers model for first-page prediction

This notebook covers one approach to training a model that predicts whether a page is the first page of a document. Such a model makes it possible to correctly split PDFs that contain more than one actual document (we assume that the pages are already in order). The approach is to fine-tune a Transformers model (BERT) on our document-related dataset.
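To make the goal concrete, here is a minimal sketch of how per-page predictions could be turned into a document split. The helper `split_by_first_page`, the example page strings, and the convention that label `1` means "first page" are assumptions made for illustration; they are not part of the konfuzio_sdk API.

```python
from typing import List


def split_by_first_page(pages: List[str], is_first: List[int]) -> List[List[str]]:
    """Group ordered pages into documents, opening a new one at each predicted first page."""
    documents: List[List[str]] = []
    for page, flag in zip(pages, is_first):
        # Open a new document on a predicted first page, or if none is open yet.
        if flag == 1 or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents


# Hypothetical pages and predictions (1 = first page, 0 = any other page).
pages = ["invoice p1", "invoice p2", "contract p1", "contract p2", "contract p3"]
predictions = [1, 0, 1, 0, 0]
split = split_by_first_page(pages, predictions)
print(split)
```

A single wrong prediction either merges two documents or cuts one in half, which is why the quality of the first-page classifier matters.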

Before you start, make sure you have **installed** and **initialized** the konfuzio_sdk package as shown in the readme of the [repository](https://github.com/konfuzio-ai/Python-SDK).

In [None]:
!pip install konfuzio-sdk

In [None]:
!konfuzio_sdk init

Also, you will need to install the Transformers-related packages:

In [None]:
!pip install transformers datasets

Importing necessary libraries and packages:

In [3]:
import os

import numpy as np

from datasets import load_dataset, load_metric
from transformers import BertTokenizer, AutoModelForSequenceClassification, \
                        TrainingArguments, DataCollatorWithPadding, Trainer

Setting the seed for reproducibility (the `Trainer` additionally seeds itself through the `seed=42` training argument used later):

In [4]:
seed_value = 42
os.environ['PYTHONHASHSEED'] = str(seed_value)
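Setting `PYTHONHASHSEED` from inside a running interpreter only affects processes started afterwards, because the hash seed of the current process is fixed at startup. A broader seeding helper might look like the sketch below; `seed_everything` is a hypothetical name, and the NumPy and PyTorch calls are skipped if those packages are not installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the common sources of randomness (hypothetical helper)."""
    # Only inherited by subprocesses; the running interpreter keeps its hash seed.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```

Calling such a helper before the model is created also makes the randomly initialized classification head reproducible, since the `Trainer` seeds itself only when it is constructed.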

Initializing the model and the tokenizer:

In [None]:
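# Note: `max_length=10000` exceeds BERT's 512-token input limit; pages are actually
# shortened by `truncation=True` at tokenization time.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, max_length=10000, padding="max_length")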

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Loading the dataset, previously saved as .csv files, with the `load_dataset` function of the Hugging Face `datasets` library:

In [None]:
dataset = load_dataset('csv',
                      data_files={'train': 'drive/MyDrive/knfz/train.csv',
                                 'test': 'drive/MyDrive/knfz/test.csv'})
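The rest of the notebook reads the columns `text` and `label` from these files. As a rough illustration of that layout, the sketch below writes a tiny CSV with the same two columns into memory. The page texts are invented, and the meaning of label `1` as "first page" is an assumption.

```python
import csv
import io

# Invented example rows: one row per page, with the page text and a binary label.
rows = [
    {"text": "Invoice No. 123 Date: 01.02.2022", "label": 1},
    {"text": "Page 2 of 2 Total amount due", "label": 0},
    {"text": "Rental agreement between the parties", "label": 1},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "label"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```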

Setting the training arguments:


In [None]:
arguments = TrainingArguments(
    do_predict=True,
    output_dir='drive/MyDrive/knfz/model', 
    evaluation_strategy="steps", 
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    num_train_epochs=5,
    logging_steps=500, 
    logging_strategy='steps', 
    save_strategy='steps',
    save_steps=500,
    seed=42,

)
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")

using `logging_steps` to initialize `eval_steps` to 500
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
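The log above shows that `eval_steps` inherits the value of `logging_steps`, so evaluation, logging, and checkpointing all happen every 500 optimizer steps. The sketch below estimates how many steps a run takes and which checkpoints it writes. It assumes a single device, no gradient accumulation, and a hypothetical training set of 1000 pages (the real dataset size is not stated in this notebook).

```python
import math


def checkpoint_steps(num_examples: int, batch_size: int, epochs: int, save_steps: int):
    """Return the total number of optimizer steps and the steps at which checkpoints are saved."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    saved = list(range(save_steps, total_steps + 1, save_steps))
    return total_steps, saved


# Hypothetical dataset size; batch size, epochs and save_steps match the arguments above.
total, saved = checkpoint_steps(num_examples=1000, batch_size=4, epochs=5, save_steps=500)
print(total, saved)
```

Under these assumptions the checkpoints appear in the output directory as `checkpoint-500`, `checkpoint-1000`, and so on.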


Tokenizing our dataset:

In [9]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
tokenized = dataset.map(preprocess_function, batched=True)
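BERT accepts at most 512 tokens per input, including the special `[CLS]` and `[SEP]` tokens, so `truncation=True` keeps only the beginning of long pages. The sketch below imitates that behaviour with whitespace-separated words standing in for WordPiece tokens; `truncate_for_bert` is a hypothetical helper, not a Transformers function.

```python
from typing import List


def truncate_for_bert(tokens: List[str], max_length: int = 512) -> List[str]:
    """Keep the leading tokens so that the sequence fits together with [CLS] and [SEP]."""
    body = tokens[: max_length - 2]
    return ["[CLS]"] + body + ["[SEP]"]


# A made-up page of 1000 "tokens".
page_tokens = ("word " * 1000).split()
encoded = truncate_for_bert(page_tokens)
print(len(encoded))
```

Since only the head of each page survives, any first-page cues located near the bottom of a long page are not seen by the model.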

Defining our metric of choice, accuracy:

In [None]:
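# Note: `load_metric` is deprecated in recent `datasets` releases;
# the `evaluate` package offers `evaluate.load('accuracy')` as its replacement.
metric = load_metric('accuracy')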

In [12]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
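`compute_metrics` receives the raw logits, picks the class with the highest score for each example, and compares it to the reference label. A dependency-free sketch of the same computation, using invented logits and labels, could look like this:

```python
from typing import List, Sequence


def accuracy_from_logits(logits: Sequence[Sequence[float]], labels: List[int]) -> float:
    """Arg-max each row of logits and return the share of correct predictions."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Invented logits for four pages and two classes.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]
accuracy = accuracy_from_logits(logits, labels)
print(accuracy)
```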

Initializing the Trainer class and starting the training process:

In [14]:
trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

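Let's also try training with a different number of epochs. Note that the `Trainer` below receives the same `model` object, so this run continues from the weights produced by the 5-epoch run rather than from the original `bert-base-uncased` checkpoint: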

In [None]:
arguments = TrainingArguments(
    do_predict=True,
    output_dir='drive/MyDrive/knfz/model_2', 
    evaluation_strategy="steps", 
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    num_train_epochs=10,
    logging_steps=500, 
    logging_strategy='steps', 
    save_strategy='steps',
    save_steps=500,
    seed=42,

)

In [None]:
trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

Evaluating the models' performance on a second test set:

In [None]:
dataset = load_dataset('csv',
                      data_files={
                                 'test': 'drive/MyDrive/knfz/test_2.csv'})

In [None]:
tokenized = dataset.map(preprocess_function, batched=True)

In [None]:
model_5 = AutoModelForSequenceClassification.from_pretrained('drive/MyDrive/knfz/model/checkpoint-1000')
tokenizer_5 = BertTokenizer.from_pretrained('drive/MyDrive/knfz/model/checkpoint-1000')

In [None]:
trainer = Trainer(
    model=model_5,
    args=arguments,
    eval_dataset=tokenized['test'],
    tokenizer=tokenizer_5,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 7
  Batch size = 4


{'eval_loss': 5.0522613525390625,
 'eval_accuracy': 0.42857142857142855,
 'eval_runtime': 0.3286,
 'eval_samples_per_second': 21.301,
 'eval_steps_per_second': 6.086}

In [None]:
model_10 = AutoModelForSequenceClassification.from_pretrained('drive/MyDrive/knfz/model_2/checkpoint-2000')
tokenizer_10 = BertTokenizer.from_pretrained('drive/MyDrive/knfz/model_2/checkpoint-2000')

In [None]:
trainer = Trainer(
    model=model_10,
    args=arguments,
    eval_dataset=tokenized['test'],
    tokenizer=tokenizer_10,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 7
  Batch size = 4


{'eval_loss': 6.361591339111328,
 'eval_accuracy': 0.42857142857142855,
 'eval_runtime': 0.2696,
 'eval_samples_per_second': 25.963,
 'eval_steps_per_second': 7.418}

## Metrics & prediction

Let's compare the accuracy of both models on the two test sets:

| Model | Accuracy on test set #1 | Accuracy on test set #2 |
|---|---|---|
| bert-base-uncased (5 epochs) | 1.00 | 0.43 |
| bert-base-uncased (10 epochs) | 0.96 | 0.43 |
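The second test set contains only 7 examples, so the reported accuracy can be converted back into a count of correct predictions. The short check below uses only the numbers printed by `trainer.evaluate()` above.

```python
num_examples = 7
reported_accuracy = 0.42857142857142855

# Recover the number of correctly classified pages from the reported accuracy.
correct = round(reported_accuracy * num_examples)
# With 7 examples, a single page changes the accuracy by about 0.14.
step = 1 / num_examples
print(correct, round(step, 2))
```

With so few examples, the identical score of the two models on this set says little about which of them generalizes better.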


In [None]:
import torch
from tqdm import tqdm

def calculate_metrics(texts, labels):
    """Compute precision, recall and F1 for the positive class (label 1) using model_10."""
    true_positive = 0
    false_positive = 0
    false_negative = 0

    model_10.eval()  # make sure dropout is disabled during inference
    for label, text in tqdm(zip(labels, texts), total=len(labels)):
        # Tokenize one page and move the tensors to the device the model lives on.
        inputs = tokenizer_10(text, truncation=True, return_tensors="pt").to(model_10.device)
        with torch.no_grad():
            logits = model_10(**inputs).logits
        pred = logits.argmax().item()

        if label == 1 and pred == 1:
            true_positive += 1
        elif label == 1 and pred == 0:
            false_negative += 1
        elif label == 0 and pred == 1:
            false_positive += 1

    # Guard every ratio against division by zero.
    if true_positive + false_positive != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0

    if true_positive + false_negative != 0:
        recall = true_positive / (true_positive + false_negative)
    else:
        recall = 0

    if precision + recall != 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0

    return precision, recall, f1
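The counting logic in `calculate_metrics` can be checked without a model. The sketch below applies the same definitions of precision, recall, and F1 to invented label and prediction lists; `precision_recall_f1` is a hypothetical helper, and label `1` is assumed to be the positive class.

```python
from typing import List, Tuple


def precision_recall_f1(labels: List[int], predictions: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented labels and predictions for seven pages: 2 TP, 1 FP, 2 FN.
labels = [1, 0, 1, 1, 0, 0, 1]
predictions = [1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(precision, recall, f1)
```

For page splitting, a false positive cuts a document in two and a false negative merges two documents, so both precision and recall are worth tracking alongside accuracy.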

In [None]:
pages_test_docs = list(dataset['test']['text'])
pages_labels_test = list(dataset['test']['label'])

In [None]:
precision, recall, f1 = calculate_metrics(pages_test_docs, pages_labels_test)

In [None]:
print('\n Precision: {} \n Recall: {} \n F1-score: {}'.format(precision, recall, f1))