[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Tuha3PWqvUicwhS7uUGKH0HBI185pDY-?usp=sharing)

# Fine-tuning LayoutLM for Document Image Classification

In this notebook, we fine-tune the multimodal LayoutLM model for document image classification on the RVL-CDIP dataset.

- Small dataset: [here](https://huggingface.co/datasets/sitloboi2012/rvl_cdip_small_dataset). To test with a larger dataset, use [this one](https://huggingface.co/datasets/sitloboi2012/rvl_cdip_large_dataset).
- Model documentation: [here](https://huggingface.co/docs/transformers/model_doc/layoutlm)
- Academic paper for LayoutLM: linked from the [model documentation](https://huggingface.co/docs/transformers/model_doc/layoutlm)

# Setting the environment

- Use the `requirements.txt` file if you need to install `torch` and `torchvision` from scratch. Otherwise, just run the second cell to install `transformers` and the other required packages.

- We also need to install `tesseract-ocr`.
  - Ubuntu users can run `sudo apt-get install tesseract-ocr`.
  - Windows users, please refer to this link


In [None]:
%%writefile requirements.txt
--find-links https://download.pytorch.org/whl/torch_stable.html

transformers==4.33.1
datasets
seqeval
torch==2.0.1+cu117
torchvision==0.15.2+cu117
pytesseract==0.3.10
accelerate
transformers[torch]

In [None]:
!pip install transformers datasets seqeval pytesseract accelerate -q -q -q --exists-action i
!sudo apt-get install -y tesseract-ocr

# If you also need to install torch and torchvision,
# uncomment and run the line below instead of the first pip install

#!pip install -r requirements.txt

In [None]:
# log in to the Hugging Face Hub to pull the dataset and to push the model at the end

!huggingface-cli login

# Get the dataset and preview

We are going to use `sitloboi2012/rvl_cdip_small_dataset` here, but you can also use `nielsr/rvl_cdip_10_examples_per_class`.

In [4]:
from datasets import load_dataset

dataset = load_dataset("sitloboi2012/rvl_cdip_small_dataset")
train_dataset = dataset["train"]

Visualize the image

In [None]:
from PIL import ImageDraw

image = train_dataset[0]["image"]  # index the row first so only one image is decoded
image = image.convert("RGB")
image

Visualize the OCR using PyTesseract

In [None]:
import pytesseract
import numpy as np

ocr_df = pytesseract.image_to_data(image, output_type='data.frame')
ocr_df = ocr_df.dropna().reset_index(drop=True)
float_cols = ocr_df.select_dtypes('float').columns
ocr_df[float_cols] = ocr_df[float_cols].round(0).astype(int)
ocr_df = ocr_df.replace(r'^\s*$', np.nan, regex=True)
words = ' '.join([word for word in ocr_df.text if str(word) != 'nan'])
print("Words after running PyTesseract: ", words)

Draw OCR Bounding Boxes on Image

In [None]:
coordinates = ocr_df[['left', 'top', 'width', 'height']]
actual_boxes = []
for idx, row in coordinates.iterrows():
    x, y, w, h = tuple(row) # the row comes in (left, top, width, height) format
    actual_box = [x, y, x+w, y+h] # we turn it into (left, top, left+width, top+height) to get the actual box
    actual_boxes.append(actual_box)

draw = ImageDraw.Draw(image, "RGB")
for box in actual_boxes:
    draw.rectangle(box, outline='red')

image

# Preprocess Image

In [12]:
from datasets import Dataset
import datasets

def normalize_box(box, width, height):
    # scale pixel coordinates to the 0-1000 grid that LayoutLM expects
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]

def apply_ocr(row: datasets.formatting.formatting.LazyRow):
    """
    Apply PyTesseract OCR to add columns for the words and their bounding boxes

    Args:
      row `(datasets.formatting.formatting.LazyRow)`: a row in the dataset

    Returns:
      `(datasets.formatting.formatting.LazyRow)`: the updated row, containing the words and bounding boxes
    """

    # get the image
    image = row["image"]

    width, height = image.size

    # apply ocr to the image
    ocr_df = pytesseract.image_to_data(image, output_type='data.frame')
    float_cols = ocr_df.select_dtypes('float').columns
    ocr_df = ocr_df.dropna().reset_index(drop=True)
    ocr_df[float_cols] = ocr_df[float_cols].round(0).astype(int)
    ocr_df = ocr_df.replace(r'^\s*$', np.nan, regex=True)
    ocr_df = ocr_df.dropna().reset_index(drop=True)

    # get the words and actual (unnormalized) bounding boxes
    words = [str(w) for w in ocr_df.text]
    coordinates = ocr_df[['left', 'top', 'width', 'height']]
    actual_boxes = []
    for idx, bbox_row in coordinates.iterrows():
        x, y, w, h = tuple(bbox_row) # the row comes in (left, top, width, height) format
        actual_box = [x, y, x+w, y+h] # we turn it into (left, top, left+width, top+height) to get the actual box
        actual_boxes.append(actual_box)

    # normalize the bounding boxes
    boxes = []
    for box in actual_boxes:
        boxes.append(normalize_box(box, width, height))

    # add as extra columns
    assert len(words) == len(boxes)
    row['words'] = words
    row['bbox'] = boxes
    return row
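
The box handling above can be tried in isolation. Below is a minimal sketch, assuming a hypothetical 800x1000 pixel page and a hypothetical word box; `to_corner_box` and `normalize_box_clamped` are illustrative helpers, not part of the notebook. Unlike `normalize_box`, this variant also clamps values to the 0-1000 grid, which is an extra safety step rather than something the notebook does.

```python
def to_corner_box(left, top, width, height):
    """Convert a Tesseract-style (left, top, width, height) box to (x0, y0, x1, y1)."""
    return [left, top, left + width, top + height]


def normalize_box_clamped(box, page_width, page_height):
    """Scale pixel corners to the 0-1000 grid and clamp values that fall outside it."""
    sizes = [page_width, page_height, page_width, page_height]
    return [min(1000, max(0, int(1000 * value / size))) for value, size in zip(box, sizes)]


# Hypothetical word box on a hypothetical 800x1000 pixel page.
corner_box = to_corner_box(left=80, top=250, width=160, height=50)
print(corner_box)                                                       # [80, 250, 240, 300]
print(normalize_box_clamped(corner_box, page_width=800, page_height=1000))  # [100, 250, 300, 300]
```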

In [None]:
updated_dataset = train_dataset.map(apply_ocr)

# Processing Image

Convert the labels to a list and build the id/label dictionaries

In [None]:
labels: list[str] = sorted(set(train_dataset["label"]))  # sorted so the label ids are stable across runs
id2label: dict[int, str] = dict(enumerate(labels))
label2id: dict[str, int] = {k: v for v, k in enumerate(labels)}
print(f"Label to Id: {label2id}")
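
Iterating over a Python `set` of strings does not guarantee the same order from one run to the next, so label ids built from it can change between sessions. A minimal sketch of building stable mappings follows; `build_label_maps` and the sample label values are hypothetical and only serve as illustration.

```python
def build_label_maps(raw_labels):
    """Return (labels, id2label, label2id) with ids assigned in sorted order."""
    labels = sorted(set(raw_labels))
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return labels, id2label, label2id


# Hypothetical label column with repeated class names.
example_labels = ["invoice", "letter", "memo", "letter", "invoice"]
labels_demo, id2label_demo, label2id_demo = build_label_maps(example_labels)
print(label2id_demo)  # {'invoice': 0, 'letter': 1, 'memo': 2}
```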

Initialize the pretrained `Tokenizer` and encode the dataset

In [15]:
from transformers import LayoutLMTokenizer
import torch

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

def encode_example(example, max_seq_length=512, pad_token_box=[0, 0, 0, 0]):
    """
    Take a row of the `datasets.Dataset` object and convert it to the input format
    the model expects, using the `Tokenizer`

    Args:
      example `(datasets.formatting.formatting.LazyRow)`: a row in the dataset object
      max_seq_length `(int)`: the maximum sequence length to tokenize
      pad_token_box `(list[int])`: the bounding box assigned to padding tokens

    Returns:
      `(BatchEncoding)`: the encoded inputs for the model, including `bbox` and `label`
    """
    words = example['words']
    normalized_word_boxes = example['bbox']

    assert len(words) == len(normalized_word_boxes)

    token_boxes = []
    for word, box in zip(words, normalized_word_boxes):
        word_tokens = tokenizer.tokenize(word)
        token_boxes.extend([box] * len(word_tokens))

    # Truncation of token_boxes
    special_tokens_count = 2
    if len(token_boxes) > max_seq_length - special_tokens_count:
        token_boxes = token_boxes[: (max_seq_length - special_tokens_count)]

    # add bounding boxes of cls + sep tokens
    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

    encoding = tokenizer(' '.join(words), padding='max_length', truncation=True, max_length=max_seq_length)
    # Pad token_boxes up to the sequence length.
    padding_length = max_seq_length - len(token_boxes)
    token_boxes += [pad_token_box] * padding_length
    encoding['bbox'] = token_boxes
    encoding['label'] = label2id[example['label']]
    encoding['label'] = label2id[example['label']]

    assert len(encoding['input_ids']) == max_seq_length
    assert len(encoding['attention_mask']) == max_seq_length
    assert len(encoding['token_type_ids']) == max_seq_length
    assert len(encoding['bbox']) == max_seq_length

    return encoding
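
The alignment between word boxes and token boxes in `encode_example` can be sketched without downloading a tokenizer. In the example below, `toy_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` (it simply cuts words into 4-character pieces), and `align_boxes`, the sample words, and the sample boxes are illustrative assumptions.

```python
def toy_tokenize(word, piece_len=4):
    """Hypothetical stand-in for `tokenizer.tokenize`: cut a word into fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def align_boxes(words, word_boxes, tokenize, max_seq_length=8):
    """Repeat each word box once per sub-token, truncate, add special boxes, then pad."""
    token_boxes = []
    for word, box in zip(words, word_boxes):
        token_boxes.extend([list(box)] * len(tokenize(word)))
    token_boxes = token_boxes[: max_seq_length - 2]  # leave room for [CLS] and [SEP]
    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]
    token_boxes += [[0, 0, 0, 0]] * (max_seq_length - len(token_boxes))  # padding boxes
    return token_boxes


words_demo = ["Invoice", "2023"]                          # "Invoice" -> 2 pieces, "2023" -> 1 piece
boxes_demo = [[100, 50, 300, 80], [320, 50, 400, 80]]
aligned_demo = align_boxes(words_demo, boxes_demo, toy_tokenize)
for token_box in aligned_demo:
    print(token_box)
```

The result always has exactly `max_seq_length` boxes: one for `[CLS]`, one per sub-token, one for `[SEP]`, and the rest for padding.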

In [None]:
from datasets import Features, Sequence, ClassLabel, Value, Array2D, Image

# we need to define the features ourselves because the LayoutLM bbox is an extra feature
features = Features({
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'label': ClassLabel(names=labels),
    'image': Image(),
    'words': Sequence(feature=Value(dtype='string')),
})

encoded_dataset = updated_dataset.map(encode_example, features=features)

In [17]:
encoded_dataset.set_format(type='torch', columns=['input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'label'])

Double-check the data before feeding it to the model

In [None]:
dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=1, shuffle=True)
batch = next(iter(dataloader))
batch

In [None]:
tokenizer.decode(batch['input_ids'][0].tolist())

In [None]:
id2label[batch['label'][0].item()]
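
Beyond eyeballing one batch, a few invariants can be checked programmatically. The sketch below uses plain Python lists, and `check_encoded_example` is a hypothetical helper. It assumes LayoutLM's convention that every bounding-box coordinate lies on the 0-1000 grid.

```python
def check_encoded_example(input_ids, attention_mask, bbox, max_seq_length=512):
    """Return a list of problems found in one encoded example (an empty list means OK)."""
    problems = []
    columns = (("input_ids", input_ids), ("attention_mask", attention_mask), ("bbox", bbox))
    for name, column in columns:
        if len(column) != max_seq_length:
            problems.append(f"{name} has length {len(column)}, expected {max_seq_length}")
    for position, box in enumerate(bbox):
        if len(box) != 4 or any(not 0 <= value <= 1000 for value in box):
            problems.append(f"bbox at position {position} is off the 0-1000 grid: {box}")
    return problems


# Hypothetical tiny example with a sequence length of 4.
good = check_encoded_example([101, 7, 102, 0], [1, 1, 1, 0],
                             [[0, 0, 0, 0], [10, 20, 30, 40], [1000] * 4, [0, 0, 0, 0]],
                             max_seq_length=4)
bad = check_encoded_example([101, 7, 102, 0], [1, 1, 1, 0],
                            [[0, 0, 0, 0], [10, 20, 1200, 40], [1000] * 4, [0, 0, 0, 0]],
                            max_seq_length=4)
print(good)
print(bad)
```

With the `batch` from the cells above, the arguments could be obtained with `batch['input_ids'][0].tolist()`, `batch['attention_mask'][0].tolist()`, and `batch['bbox'][0].tolist()`.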

# Training

## Model Initialization

In [None]:
from transformers import LayoutLMForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LayoutLMForSequenceClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=len(labels))
model.to(device)

In [22]:
model.config.id2label = id2label
model.config.label2id = label2id

## Metric Setup

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

def compute_metrics(eval_pred):
    """
    Evaluate the model output with several metrics

    Args:
      eval_pred `(EvalPrediction)`: the model output, containing the logits and the labels

    Returns:
      `dict[str, float]`: the computed evaluation metrics
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # scikit-learn expects (y_true, y_pred); swapping them would exchange precision and recall
    accuracy = accuracy_score(labels, predictions)
    precision_micro = precision_score(labels, predictions, average = "micro", zero_division=0)
    precision_macro = precision_score(labels, predictions, average = "macro", zero_division=0)
    recall_micro = recall_score(labels, predictions, average = "micro", zero_division=0)
    recall_macro = recall_score(labels, predictions, average = "macro", zero_division=0)
    f1_micro = f1_score(labels, predictions, average = "micro", zero_division=0)
    f1_macro = f1_score(labels, predictions, average = "macro", zero_division=0)
    return {"accuracy": accuracy, "precision_micro": precision_micro, "recall_micro": recall_micro, "f1_micro": f1_micro, "precision_macro": precision_macro, "recall_macro": recall_macro, "f1_macro": f1_macro}
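
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy is symmetric in its arguments, but macro-averaged precision and recall are not. The pure-Python sketch below illustrates this; `macro_precision_recall` is a hypothetical helper and the label arrays are made up.

```python
def macro_precision_recall(y_true, y_pred):
    """Compute macro-averaged precision and recall, scoring empty classes as 0."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for cls in classes:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_predicted = sum(1 for p in y_pred if p == cls)
        n_actual = sum(1 for t in y_true if t == cls)
        precisions.append(true_pos / n_predicted if n_predicted else 0.0)
        recalls.append(true_pos / n_actual if n_actual else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


y_true_demo = [0, 0, 0, 1]   # made-up ground truth
y_pred_demo = [0, 0, 1, 1]   # made-up predictions

precision, recall = macro_precision_recall(y_true_demo, y_pred_demo)
swapped_precision, swapped_recall = macro_precision_recall(y_pred_demo, y_true_demo)
print(precision, recall)                  # precision is 0.75, recall is 5/6
print(swapped_precision, swapped_recall)  # the two values trade places
```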

## Training Arguments and Trainer Setup

In [24]:
from transformers import Trainer, TrainingArguments
import numpy as np

metric_name = "precision_micro" # change if you want to

train_args = TrainingArguments(
    output_dir = "result_trainer",
    evaluation_strategy="epoch",
    save_strategy = "epoch",
    per_device_train_batch_size=10,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    warmup_ratio = 0.1,
    fp16 = True, # mixed precision; requires a CUDA GPU, set to False on CPU
    learning_rate = 1e-5,
    weight_decay = 0.01,
    logging_steps = 10,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    remove_unused_columns=False,
    #push_to_hub = True,
    #push_to_hub_model_id = f"layoutlm-finetune-rvlcdip-small",
)
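
To get a feel for what `warmup_ratio = 0.1` means in optimizer steps, the arithmetic can be sketched as follows. This assumes a single device, no gradient accumulation, and that the warmup step count is rounded up; the dataset size of 160 examples is hypothetical, and `schedule_summary` is an illustrative helper.

```python
import math


def schedule_summary(num_examples, batch_size, num_epochs, warmup_ratio):
    """Return (steps per epoch, total optimizer steps, warmup steps)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)   # the last partial batch still counts
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)     # assumed rounding: up
    return steps_per_epoch, total_steps, warmup_steps


# Batch size, epochs, and warmup ratio mirror the arguments above; 160 examples is hypothetical.
summary_demo = schedule_summary(num_examples=160, batch_size=10, num_epochs=10, warmup_ratio=0.1)
print(summary_demo)  # (16, 160, 16)
```

Early stopping can end training before `total_steps` is reached, so this is an upper bound on the number of steps actually run.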

In [25]:
from transformers import EarlyStoppingCallback

early_stopping = EarlyStoppingCallback(early_stopping_patience = 3)

In [26]:
import torch

trainer = Trainer(
    model,
    train_args,
    train_dataset=encoded_dataset,
    eval_dataset=encoded_dataset,  # same data as training; use a held-out split for a meaningful evaluation
    compute_metrics=compute_metrics,
    callbacks = [early_stopping]
)
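
The `Trainer` above evaluates on the same rows it trains on, so the reported metrics and the early-stopping decision reflect training fit rather than generalization. One way to hold out data is a per-class split of row indices, sketched below in plain Python. `stratified_split_indices` and the sample labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, eval_fraction=0.2, seed=42):
    """Split row indices into (train, eval) lists, holding out a share of every class."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    train_indices, eval_indices = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # hold out at least one row per class, unless the class has a single row
        n_eval = max(1, round(len(indices) * eval_fraction)) if len(indices) > 1 else 0
        eval_indices.extend(indices[:n_eval])
        train_indices.extend(indices[n_eval:])
    return sorted(train_indices), sorted(eval_indices)


split_labels_demo = ["letter"] * 10 + ["memo"] * 5   # hypothetical label column
train_idx_demo, eval_idx_demo = stratified_split_indices(split_labels_demo)
print(len(train_idx_demo), len(eval_idx_demo))       # 12 3
```

The resulting index lists could then be passed to `encoded_dataset.select(...)` to build separate `train_dataset` and `eval_dataset` arguments. `Dataset.train_test_split` is a built-in alternative when a stratified split is not required.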

## Train

In [None]:
trainer.train()

## Evaluate

In [None]:
predictions, labels, metrics = trainer.predict(encoded_dataset)
metrics

# Inference

In [None]:
model.eval()

In [None]:
image = train_dataset[1]["image"].convert("RGB")
image

In [30]:
def inference_ocr(image):
    """
    A copy of the `apply_ocr` function, adapted for inference

    Args:
      image (PIL.Image): the image to do OCR on

    Returns:
      `tuple[list[str], list[list[int]]]`: the words and their normalized bounding boxes, respectively
    """
    width, height = image.size

    # apply ocr to the image
    ocr_df = pytesseract.image_to_data(image, output_type='data.frame')
    float_cols = ocr_df.select_dtypes('float').columns
    ocr_df = ocr_df.dropna().reset_index(drop=True)
    ocr_df[float_cols] = ocr_df[float_cols].round(0).astype(int)
    ocr_df = ocr_df.replace(r'^\s*$', np.nan, regex=True)
    ocr_df = ocr_df.dropna().reset_index(drop=True)

    # get the words and actual (unnormalized) bounding boxes
    words = list(ocr_df.text)
    words = [str(w) for w in words]
    coordinates = ocr_df[['left', 'top', 'width', 'height']]
    actual_boxes = []
    for idx, row in coordinates.iterrows():
        x, y, w, h = tuple(row) # the row comes in (left, top, width, height) format
        actual_box = [x, y, x+w, y+h] # we turn it into (left, top, left+width, top+height) to get the actual box
        actual_boxes.append(actual_box)

    # normalize the bounding boxes
    boxes = [normalize_box(box, width, height) for box in actual_boxes]

    # sanity check: one box per word
    assert len(words) == len(boxes)
    return words, boxes

In [None]:
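word_ocr, boxes_ocr = inference_ocr(image)
encoded_input = tokenizer(' '.join(word_ocr), padding='max_length', truncation=True, max_length=512, return_tensors = "pt")

# the model was fine-tuned with bounding boxes, so pass the token-level boxes at inference too
token_boxes = []
for word, box in zip(word_ocr, boxes_ocr):
    token_boxes.extend([box] * len(tokenizer.tokenize(word)))
token_boxes = [[0, 0, 0, 0]] + token_boxes[:510] + [[1000, 1000, 1000, 1000]]
token_boxes += [[0, 0, 0, 0]] * (512 - len(token_boxes))
encoded_input['bbox'] = torch.tensor([token_boxes])

In [32]:
for k,v in encoded_input.items():
  encoded_input[k] = v.to(model.device)

In [33]:
# no labels are passed, so `outputs.loss` would be None; only the logits are needed
with torch.no_grad():
    outputs = model(**encoded_input)
logits = outputs.logits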

In [None]:
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
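
The `argmax` above only gives the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python; `softmax`, `top_k`, the logits, and the class names are hypothetical illustrations.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Made-up logits and class names for illustration.
logits_demo = [0.2, 2.5, -1.0]
id2label_demo2 = {0: "invoice", 1: "letter", 2: "memo"}
top_demo = top_k(logits_demo, id2label_demo2, k=2)
for label, prob in top_demo:
    print(f"{label}: {prob:.3f}")
```

With the model output from the cells above, the first argument could be `logits[0].tolist()` and the second `model.config.id2label`.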

# Extra

In [None]:
# In case you want to load model from HuggingFace Hub

model = LayoutLMForSequenceClassification.from_pretrained("<MODEL_TAG>")
model.eval()

In [35]:
# In case you want to load model from checkpoint

model = LayoutLMForSequenceClassification.from_pretrained("/content/result_trainer/checkpoint-20")
model.eval()

