In this notebook, we fine-tune `LayoutLMv2ForTokenClassification` on the [CORD](https://github.com/clovaai/cord) dataset. The goal is for the model to assign the correct label to each word in a scanned document (here, receipts). We treat this as a sequence labeling problem, as in NER. Unlike BERT, however, LayoutLMv2 also incorporates visual and layout information about the tokens when encoding them into vectors, which makes it well suited to document understanding tasks.

LayoutLMv2 is itself an upgrade of LayoutLM. The main novelty of LayoutLMv2 is that it also pre-trains visual embeddings, whereas the original LayoutLM only adds visual embeddings during fine-tuning.

* Paper: https://arxiv.org/abs/2012.14740
* Original repo: https://github.com/microsoft/unilm/tree/master/layoutlmv2

NOTES: 

* You first need to prepare the CORD dataset for LayoutLMv2. For that, check out the notebook "Prepare CORD for LayoutLMv2".
* This notebook is heavily inspired by [this GitHub repository](https://github.com/omarsou/layoutlm_CORD), which fine-tunes both BERT and LayoutLM (v1) on the CORD dataset.



## Install dependencies

First, we install the required libraries:
* Transformers (for the LayoutLMv2 model)
* Datasets (for data preprocessing)
* Seqeval (for metrics)
* Detectron2 (which LayoutLMv2 requires for its visual backbone).



In [None]:
!pip install transformers


In [None]:
!pip install -q datasets seqeval

[?25l[K     |█                               | 10 kB 16.8 MB/s eta 0:00:01[K     |██                              | 20 kB 16.4 MB/s eta 0:00:01[K     |███▏                            | 30 kB 11.0 MB/s eta 0:00:01[K     |████▏                           | 40 kB 9.6 MB/s eta 0:00:01[K     |█████▎                          | 51 kB 5.8 MB/s eta 0:00:01[K     |██████▎                         | 61 kB 5.8 MB/s eta 0:00:01[K     |███████▍                        | 71 kB 5.8 MB/s eta 0:00:01[K     |████████▍                       | 81 kB 6.4 MB/s eta 0:00:01[K     |█████████▌                      | 92 kB 5.0 MB/s eta 0:00:01[K     |██████████▌                     | 102 kB 5.4 MB/s eta 0:00:01[K     |███████████▋                    | 112 kB 5.4 MB/s eta 0:00:01[K     |████████████▋                   | 122 kB 5.4 MB/s eta 0:00:01[K     |█████████████▊                  | 133 kB 5.4 MB/s eta 0:00:01[K     |██████████████▊                 | 143 kB 5.4 MB/s eta 0:00:01[K  

In [None]:
!pip install pyyaml==5.1
!pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

# the Detectron2 wheel index should match the CUDA build of torch installed above (cu111)
!python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.10/index.html


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Prepare the data

First, let's read in the annotations which we prepared in the other notebook. These contain the word-level annotations (words, labels, normalized bounding boxes).

In [None]:
import pandas as pd

train = pd.read_pickle('/content/drive/MyDrive/sparrow/CORD/CORD_layoutlmv2_format/train.pkl')
val = pd.read_pickle('/content/drive/MyDrive/sparrow/CORD/CORD_layoutlmv2_format/dev.pkl')
test = pd.read_pickle('/content/drive/MyDrive/sparrow/CORD/CORD_layoutlmv2_format/test.pkl')
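
The "normalized bounding boxes" mentioned above follow the LayoutLM convention of rescaling pixel coordinates to a 0-1000 range relative to the image size. The preparation notebook is not shown here, so the following is only a sketch of that convention; `normalize_box` and the example image size are hypothetical.

```python
def normalize_box(box, width, height):
    """Rescale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# Made-up word box on a made-up 864 x 1296 pixel receipt image.
example_box = normalize_box((108, 432, 216, 480), width=864, height=1296)
print(example_box)
```

Because the coordinates are relative to the page, the same box values describe the same position regardless of the scan resolution.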

Let's define a list of all unique labels. For that, let's first count the number of occurrences for each label:

In [None]:
from collections import Counter

all_labels = [item for sublist in train[1] for item in sublist] + [item for sublist in val[1] for item in sublist] + [item for sublist in test[1] for item in sublist]
Counter(all_labels)

Counter({'menu.cnt': 2429,
         'menu.discountprice': 403,
         'menu.etc': 19,
         'menu.itemsubtotal': 7,
         'menu.nm': 6597,
         'menu.num': 109,
         'menu.price': 2585,
         'menu.sub_cnt': 189,
         'menu.sub_etc': 9,
         'menu.sub_nm': 822,
         'menu.sub_price': 160,
         'menu.sub_unitprice': 14,
         'menu.unitprice': 750,
         'menu.vatyn': 9,
         'sub_total.discount_price': 191,
         'sub_total.etc': 283,
         'sub_total.othersvc_price': 6,
         'sub_total.service_price': 353,
         'sub_total.subtotal_price': 1482,
         'sub_total.tax_price': 1283,
         'total.cashprice': 1393,
         'total.changeprice': 1297,
         'total.creditcardprice': 410,
         'total.emoneyprice': 129,
         'total.menuqty_cnt': 630,
         'total.menutype_cnt': 130,
         'total.total_etc': 89,
         'total.total_price': 2120,
         'void_menu.nm': 3,
         'void_menu.price': 1})

As we can see, some labels have very few examples. Let's replace them with the "neutral" label "O" (which stands for "Outside").

In [None]:
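# The outputs recorded in this notebook were produced with the key 'menu.itemsubtotal'
# misspelled as 'mneu.itemsubtotal', which is why that label still appears in them.
replacing_labels = {'menu.etc': 'O', 'menu.itemsubtotal': 'O', 'menu.sub_etc': 'O', 'menu.sub_unitprice': 'O', 'menu.vatyn': 'O',
                  'void_menu.nm': 'O', 'void_menu.price': 'O', 'sub_total.othersvc_price': 'O'}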
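
The dictionary above is written by hand. As a possible alternative, the same kind of mapping could be derived from the label counts with a frequency threshold. This is only a sketch: `build_replacement_map` and the threshold of 20 are assumptions, not part of the original notebook, and the sample counts are a subset of the `Counter` output shown earlier.

```python
from collections import Counter

def build_replacement_map(label_counts, min_count):
    """Map every label seen fewer than `min_count` times to 'O'."""
    return {label: "O" for label, n in label_counts.items() if n < min_count}

# A few counts copied from the Counter output above.
sample_counts = Counter({"menu.nm": 6597, "menu.num": 109, "menu.etc": 19,
                         "menu.vatyn": 9, "void_menu.price": 1})
rare = build_replacement_map(sample_counts, min_count=20)
print(sorted(rare))
```

Deriving the keys from the counts avoids misspelled label names, since every key comes directly from the data.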

In [None]:
def replace_elem(elem):
  # return the replacement label if there is one, otherwise keep the label unchanged
  return replacing_labels.get(elem, elem)

def replace_list(ls):
  return [replace_elem(elem) for elem in ls]

# index 1 of each split holds the word-level labels
train[1] = [replace_list(ls) for ls in train[1]]
val[1] = [replace_list(ls) for ls in val[1]]
test[1] = [replace_list(ls) for ls in test[1]]

In [None]:
all_labels = [item for sublist in train[1] for item in sublist] + [item for sublist in val[1] for item in sublist] + [item for sublist in test[1] for item in sublist]
Counter(all_labels)

Counter({'O': 61,
         'menu.cnt': 2429,
         'menu.discountprice': 403,
         'menu.itemsubtotal': 7,
         'menu.nm': 6597,
         'menu.num': 109,
         'menu.price': 2585,
         'menu.sub_cnt': 189,
         'menu.sub_nm': 822,
         'menu.sub_price': 160,
         'menu.unitprice': 750,
         'sub_total.discount_price': 191,
         'sub_total.etc': 283,
         'sub_total.service_price': 353,
         'sub_total.subtotal_price': 1482,
         'sub_total.tax_price': 1283,
         'total.cashprice': 1393,
         'total.changeprice': 1297,
         'total.creditcardprice': 410,
         'total.emoneyprice': 129,
         'total.menuqty_cnt': 630,
         'total.menutype_cnt': 130,
         'total.total_etc': 89,
         'total.total_price': 2120})

Now we have to save all the unique labels in a list.

In [None]:
labels = list(set(all_labels))
print(labels)

['menu.sub_nm', 'total.menutype_cnt', 'sub_total.subtotal_price', 'menu.discountprice', 'sub_total.etc', 'sub_total.service_price', 'sub_total.discount_price', 'sub_total.tax_price', 'total.total_etc', 'total.emoneyprice', 'menu.cnt', 'total.creditcardprice', 'total.cashprice', 'O', 'total.menuqty_cnt', 'menu.sub_cnt', 'menu.sub_price', 'total.changeprice', 'menu.num', 'total.total_price', 'menu.itemsubtotal', 'menu.nm', 'menu.price', 'menu.unitprice']


In [None]:
import pickle

label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for idx, label in enumerate(labels)}
print(label2id)
print(id2label)

with open('/content/drive/MyDrive/sparrow/CORD/label2id.pkl', 'wb') as t:
    pickle.dump(label2id, t)

with open('/content/drive/MyDrive/sparrow/CORD/id2label.pkl', 'wb') as t:
    pickle.dump(id2label, t)

{'menu.sub_nm': 0, 'total.menutype_cnt': 1, 'sub_total.subtotal_price': 2, 'menu.discountprice': 3, 'sub_total.etc': 4, 'sub_total.service_price': 5, 'sub_total.discount_price': 6, 'sub_total.tax_price': 7, 'total.total_etc': 8, 'total.emoneyprice': 9, 'menu.cnt': 10, 'total.creditcardprice': 11, 'total.cashprice': 12, 'O': 13, 'total.menuqty_cnt': 14, 'menu.sub_cnt': 15, 'menu.sub_price': 16, 'total.changeprice': 17, 'menu.num': 18, 'total.total_price': 19, 'menu.itemsubtotal': 20, 'menu.nm': 21, 'menu.price': 22, 'menu.unitprice': 23}
{0: 'menu.sub_nm', 1: 'total.menutype_cnt', 2: 'sub_total.subtotal_price', 3: 'menu.discountprice', 4: 'sub_total.etc', 5: 'sub_total.service_price', 6: 'sub_total.discount_price', 7: 'sub_total.tax_price', 8: 'total.total_etc', 9: 'total.emoneyprice', 10: 'menu.cnt', 11: 'total.creditcardprice', 12: 'total.cashprice', 13: 'O', 14: 'total.menuqty_cnt', 15: 'menu.sub_cnt', 16: 'menu.sub_price', 17: 'total.changeprice', 18: 'menu.num', 19: 'total.total_pr
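
Note that `list(set(...))` does not guarantee the same order of strings across Python sessions, so the ids stored in the pickled mappings may differ from run to run. One possible way to make the mapping reproducible is to sort the labels first. The sketch below uses a shortened label list; `build_label_maps` is a hypothetical helper, not part of the original notebook.

```python
import json

def build_label_maps(all_labels):
    """Build label <-> id mappings with a deterministic (sorted) order."""
    labels = sorted(set(all_labels))
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return labels, label2id, id2label

labels_demo, label2id_demo, id2label_demo = build_label_maps(
    ["menu.nm", "O", "menu.cnt", "menu.nm"]
)
print(json.dumps(label2id_demo))
```

With a stable order, a model checkpoint and the saved mappings stay consistent even if the notebook is re-run in a new session.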

In [None]:
import os
from os import listdir
from torch.utils.data import Dataset
import torch
from PIL import Image

class CORDDataset(Dataset):
    """CORD dataset."""

    def __init__(self, annotations, image_dir, processor=None, max_length=512):
        """
        Args:
            annotations (List[List]): List of lists containing the word-level annotations (words, labels, boxes).
            image_dir (string): Directory with all the document images.
            processor (LayoutLMv2Processor): Processor to prepare the text + image.
            max_length (int): Sequence length to pad/truncate every example to.
        """
        self.words, self.labels, self.boxes = annotations
        self.image_dir = image_dir
        # NOTE: images and annotations are matched by position, so the order returned by
        # listdir() must be the order in which the annotations were prepared
        self.image_file_names = [f for f in listdir(image_dir)]
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.image_file_names)

    def __getitem__(self, idx):
        # first, take an image
        item = self.image_file_names[idx]
        image = Image.open(os.path.join(self.image_dir, item)).convert("RGB")

        # get word-level annotations
        words = self.words[idx]
        boxes = self.boxes[idx]
        word_labels = self.labels[idx]

        assert len(words) == len(boxes) == len(word_labels)

        # convert label names to integer ids
        word_labels = [label2id[label] for label in word_labels]
        # use processor to prepare everything
        encoded_inputs = self.processor(image, words, boxes=boxes, word_labels=word_labels,
                                        padding="max_length", truncation=True,
                                        max_length=self.max_length,
                                        return_tensors="pt")

        # remove batch dimension
        for k, v in encoded_inputs.items():
            encoded_inputs[k] = v.squeeze(0)

        # sanity checks on the shapes
        assert encoded_inputs.input_ids.shape == torch.Size([self.max_length])
        assert encoded_inputs.attention_mask.shape == torch.Size([self.max_length])
        assert encoded_inputs.token_type_ids.shape == torch.Size([self.max_length])
        assert encoded_inputs.bbox.shape == torch.Size([self.max_length, 4])
        assert encoded_inputs.image.shape == torch.Size([3, 224, 224])
        assert encoded_inputs.labels.shape == torch.Size([self.max_length])

        return encoded_inputs

In [None]:
from transformers import LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

train_dataset = CORDDataset(annotations=train,
                            image_dir='/content/drive/MyDrive/sparrow/CORD/train/image/', 
                            processor=processor)
val_dataset = CORDDataset(annotations=val,
                            image_dir='/content/drive/MyDrive/sparrow/CORD/dev/image/', 
                            processor=processor)
test_dataset = CORDDataset(annotations=test,
                            image_dir='/content/drive/MyDrive/sparrow/CORD/test/image/', 
                            processor=processor)


Let's verify an example:

In [None]:
encoding = train_dataset[0]
encoding.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])

In [None]:
for k,v in encoding.items():
  print(k, v.shape)

input_ids torch.Size([512])
token_type_ids torch.Size([512])
attention_mask torch.Size([512])
bbox torch.Size([512, 4])
labels torch.Size([512])
image torch.Size([3, 224, 224])


In [None]:
print(processor.tokenizer.decode(encoding['input_ids']))

[CLS] 2 apple pie 22, 000 1 apple creamcheese pastry 13, 000 total 35, 000 cash 100, 000 change 65, 000 [SEP] [PAD] [PAD] [PAD] ... (the remaining positions up to 512 are [PAD] tokens)

In [None]:
train[0][0]

['2',
 'APPLE',
 'PIE',
 '22,000',
 '1',
 'APPLE',
 'CREAMCHEESE',
 'PASTRY',
 '13,000',
 'TOTAL',
 '35,000',
 'CASH',
 '100,000',
 'CHANGE',
 '65,000']

In [None]:
train[1][0]

['menu.cnt',
 'menu.nm',
 'menu.nm',
 'menu.price',
 'menu.cnt',
 'menu.nm',
 'menu.nm',
 'menu.nm',
 'menu.price',
 'total.total_price',
 'total.total_price',
 'total.cashprice',
 'total.cashprice',
 'total.changeprice',
 'total.changeprice']

In [None]:
[id2label[label] for label in encoding['labels'].tolist() if label != -100]

['menu.cnt',
 'menu.nm',
 'menu.nm',
 'menu.price',
 'menu.cnt',
 'menu.nm',
 'menu.nm',
 'menu.nm',
 'menu.price',
 'total.total_price',
 'total.total_price',
 'total.cashprice',
 'total.cashprice',
 'total.changeprice',
 'total.changeprice']

In [None]:
# `token_id` avoids shadowing the built-in `id`
for token_id, label in zip(encoding['input_ids'][:30], encoding['labels'][:30]):
  print(processor.tokenizer.decode([token_id]), label.item())

[CLS] -100
2 10
apple 21
pie 21
22 22
, -100
000 -100
1 10
apple 21
cream 21
##chee -100
##se -100
pastry 21
13 22
, -100
000 -100
total 19
35 19
, -100
000 -100
cash 12
100 12
, -100
000 -100
change 17
65 17
, -100
000 -100
[SEP] -100
[PAD] -100
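
The listing above shows the labeling pattern: only the first token of each word carries the word's label id, while the remaining subword tokens and the special tokens get -100. A minimal sketch of that alignment follows. It is not the processor's actual implementation; `align_labels` and the hand-written `word_ids` list are illustrative assumptions.

```python
IGNORE_INDEX = -100

def align_labels(word_ids, word_labels):
    """Give each word's label to its first token and IGNORE_INDEX to everything else.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens such as [CLS], [SEP] and [PAD].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first token of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation of the same word
        previous = word_id
    return aligned

# Tokens: [CLS] 2 apple pie 22 , 000 [SEP]
word_ids = [None, 0, 1, 2, 3, 3, 3, None]
word_labels = [10, 21, 21, 22]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

This reproduces the first rows of the listing: `22`, `,` and `000` belong to the single word `22,000`, so only `22` keeps label id 22.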


Next, we create corresponding dataloaders.

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=2)  # no shuffling needed for validation
test_dataloader = DataLoader(test_dataset, batch_size=2)

## Train the model

Let's train the model using native PyTorch. We use the AdamW optimizer with a learning rate of 5e-5, a common starting point when fine-tuning Transformer-based models.



In [None]:
from transformers import LayoutLMv2ForTokenClassification
# torch.optim.AdamW is used here instead of the deprecated transformers.AdamW
from torch.optim import AdamW
import torch
from tqdm.notebook import tqdm

model = LayoutLMv2ForTokenClassification.from_pretrained('microsoft/layoutlmv2-base-uncased',
                                                         num_labels=len(labels))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

global_step = 0
num_train_epochs = 4

# put the model in training mode
model.train()
for epoch in range(num_train_epochs):
    print("Epoch:", epoch)
    for batch in tqdm(train_dataloader):
        # move the inputs to the device
        input_ids = batch['input_ids'].to(device)
        bbox = batch['bbox'].to(device)
        image = batch['image'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        # named batch_labels so that the `labels` list defined earlier is not overwritten
        batch_labels = batch['labels'].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(input_ids=input_ids,
                        bbox=bbox,
                        image=image,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        labels=batch_labels)
        loss = outputs.loss

        # print loss every 100 steps
        if global_step % 100 == 0:
            print(f"Loss after {global_step} steps: {loss.item()}")

        loss.backward()
        optimizer.step()
        global_step += 1

# save the fine-tuned weights to Google Drive
model.save_pretrained("/content/drive/MyDrive/sparrow/CORD/Checkpoints")

Some weights of the model checkpoint at microsoft/layoutlmv2-base-uncased were not used when initializing LayoutLMv2ForTokenClassification: ['layoutlmv2.visual.backbone.bottom_up.res3.0.conv3.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.22.conv3.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.12.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.18.conv2.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res2.0.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res3.2.conv2.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.13.conv3.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.21.conv1.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.16.conv3.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.9.conv2.norm.num_batches_tracked', 'layoutlmv2.visual.backbone.bottom_up.res4.5.conv3.norm.num_batches_track

Epoch: 0
Loss after 0 steps: 3.1878292560577393
Loss after 100 steps: 1.9094265699386597
Loss after 200 steps: 2.232769012451172
Loss after 300 steps: 1.8571228981018066
Epoch: 1
Loss after 400 steps: 0.7522667050361633
Loss after 500 steps: 0.7076326608657837
Loss after 600 steps: 0.9862611889839172
Loss after 700 steps: 0.9300562143325806
Epoch: 2
Loss after 800 steps: 0.6010346412658691
Loss after 900 steps: 0.6442053318023682
Loss after 1000 steps: 0.23071077466011047
Loss after 1100 steps: 0.30630722641944885
Epoch: 3
Loss after 1200 steps: 0.1786537766456604
Loss after 1300 steps: 0.2862323522567749
Loss after 1400 steps: 0.37083208560943604
Loss after 1500 steps: 0.09407611936330795
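
The values logged during training are cross-entropy losses. PyTorch's `CrossEntropyLoss` skips targets equal to its `ignore_index`, which defaults to -100, so the positions labeled -100 earlier should not contribute to the loss. The plain-Python sketch below illustrates that computation on toy numbers; it is an illustration under that assumption, not the model's own code, and `masked_cross_entropy` is a hypothetical name.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over the positions whose target is not `ignore_index`."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored position: contributes nothing
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        count += 1
    return total / count if count else 0.0

# Three positions, three classes; the last position is ignored.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
toy_targets = [0, 1, -100]
toy_loss = masked_cross_entropy(toy_logits, toy_targets)

# A model that predicts all 24 labels with equal probability has loss ln(24).
uniform_loss = masked_cross_entropy([[0.0] * 24], [3])
print(toy_loss, uniform_loss)
```

For reference, ln(24) is about 3.18, which is close to the loss of 3.19 logged at step 0, as one would expect from a freshly initialized classification head over 24 labels.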


## Evaluation

Let's evaluate the model on the test set. First, let's do a sanity check on the first example of the test set (the cells for this check are commented out; uncomment them to run it).

In [None]:
# encoding = test_dataset[0]
# processor.tokenizer.decode(encoding['input_ids'])

In [None]:
# ground_truth_labels = [id2label[label] for label in encoding['labels'].squeeze().tolist() if label != -100]
# print(ground_truth_labels)

In [None]:
# for k,v in encoding.items():
#   encoding[k] = v.unsqueeze(0).to(device)

# model.eval()
# # forward pass
# outputs = model(input_ids=encoding['input_ids'], attention_mask=encoding['attention_mask'],
#                 token_type_ids=encoding['token_type_ids'], bbox=encoding['bbox'],
#                 image=encoding['image'])

In [None]:
# prediction_indices = outputs.logits.argmax(-1).squeeze().tolist()
# print(prediction_indices)

In [None]:
# prediction_indices = outputs.logits.argmax(-1).squeeze().tolist()
# predictions = [id2label[label] for gt, label in zip(encoding['labels'].squeeze().tolist(), prediction_indices) if gt != -100]
# print(predictions)

In [None]:
import numpy as np

# collect the logits and label ids of every batch, then concatenate once at the end
all_logits = []
all_label_ids = []

# put model in evaluation mode
model.eval()
for batch in tqdm(test_dataloader, desc="Evaluating"):
    with torch.no_grad():
        input_ids = batch['input_ids'].to(device)
        bbox = batch['bbox'].to(device)
        image = batch['image'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        # named batch_labels so that the `labels` list is not overwritten
        batch_labels = batch['labels'].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, bbox=bbox, image=image, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=batch_labels)

    all_logits.append(outputs.logits.detach().cpu().numpy())
    all_label_ids.append(batch["labels"].detach().cpu().numpy())

# despite its name, preds_val holds the logits for the test set
preds_val = np.concatenate(all_logits, axis=0)
out_label_ids = np.concatenate(all_label_ids, axis=0)



In [None]:
import warnings
warnings.filterwarnings("ignore")
from seqeval.metrics import (
    classification_report,
    f1_score,
    precision_score,
    recall_score)

def results_test(preds, out_label_ids, labels):
  preds = np.argmax(preds, axis=2)

  label_map = {i: label for i, label in enumerate(labels)}

  out_label_list = [[] for _ in range(out_label_ids.shape[0])]
  preds_list = [[] for _ in range(out_label_ids.shape[0])]

  for i in range(out_label_ids.shape[0]):
      for j in range(out_label_ids.shape[1]):
          if out_label_ids[i, j] != -100:
              out_label_list[i].append(label_map[out_label_ids[i][j]])
              preds_list[i].append(label_map[preds[i][j]])

  results = {
      "precision": precision_score(out_label_list, preds_list),
      "recall": recall_score(out_label_list, preds_list),
      "f1": f1_score(out_label_list, preds_list),
  }
  return results, classification_report(out_label_list, preds_list)

In [None]:
labels = list(set(all_labels))
val_result, class_report = results_test(preds_val, out_label_ids, labels)
print("Overall results:", val_result)
print(class_report)

Overall results: {'precision': 0.9069069069069069, 'recall': 0.9158453373768006, 'f1': 0.9113542059600152}
                         precision    recall  f1-score   support

                enu.cnt       0.92      1.00      0.96       224
      enu.discountprice       0.20      0.20      0.20        10
       enu.itemsubtotal       0.00      0.00      0.00         6
                 enu.nm       0.92      0.95      0.93       251
                enu.num       0.89      0.73      0.80        11
              enu.price       0.97      0.97      0.97       247
            enu.sub_cnt       1.00      0.06      0.11        17
             enu.sub_nm       0.57      0.81      0.67        32
          enu.sub_price       0.73      0.80      0.76        20
          enu.unitprice       0.97      0.99      0.98        68
         otal.cashprice       0.95      0.82      0.88        71
       otal.changeprice       0.89      0.92      0.90        59
   otal.creditcardprice       0.78      0.82   
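
The label names in the report above are missing their first character (`enu.cnt`, `otal.cashprice`). A likely explanation is that seqeval expects IOB-style tags such as `B-PER` or `I-PER` and reads the first character of each tag as the prefix. The sketch below imitates that parsing; it is an assumption about seqeval's behaviour rather than its actual code, and `split_tag` and `add_iob_prefix` are hypothetical helpers.

```python
def split_tag(label):
    """Imitate an IOB-style split: first character = prefix, rest = entity type."""
    prefix = label[0]
    entity_type = label[1:].split("-", maxsplit=1)[-1] or "_"
    return prefix, entity_type

def add_iob_prefix(label):
    """Prepend 'I-' to every label except 'O' so the full name survives the split."""
    return label if label == "O" else "I-" + label

plain = split_tag("menu.cnt")
prefixed = split_tag(add_iob_prefix("menu.cnt"))
print(plain, prefixed)
```

If the labels were prefixed this way before being passed to seqeval, the report should list the full label names. The scores would need to be recomputed to confirm that they are unaffected.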

The results I was getting were: 

`{'precision': 0.9307458143074582, 'recall': 0.9272175890826384, 'f1': 0.9289783516900872}`