[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rHpuvwp63UqOBaUcliiEo80YRuDLPe07?usp=sharing)

# Fine-tuning LayoutLMv3 for Document Image Classification

In this notebook, we walk through fine-tuning the LayoutLMv3 multimodal model for document image classification on the RVL-CDIP dataset.

- Small dataset [here](https://huggingface.co/datasets/sitloboi2012/rvl_cdip_small_dataset); if you want to test with a larger dataset, see [here](https://huggingface.co/datasets/sitloboi2012/rvl_cdip_large_dataset)
- Model documentation [here](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)
- Academic paper for LayoutLMv3 [here](https://arxiv.org/abs/2204.08387)

# Setting the environment

- Use the `requirements.txt` file if you need to install `torch` and `torchvision` from scratch. Otherwise, just run the second cell to install `transformers` and the other required packages.

- We also need to install `tesseract-ocr`.
  - For Ubuntu users: `sudo apt-get install tesseract-ocr`.
  - For Windows users, please refer to this link

- We also need to install `detectron2`.
  - For Ubuntu users, see [Detectron2 Pre-built Install](https://detectron2.readthedocs.io/en/latest/tutorials/install.html#install-pre-built-detectron2-linux-only)
  - For Windows users, run `python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'`

  - __WARNING__ ❗:
    - _gcc & g++ ≥ 5.4_ are required
    - _ninja_ is optional but recommended for faster build.
    - _PyTorch ≥ 1.8 and torchvision_ that matches the PyTorch installation.
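
Before running the install cells, it can help to see which of the prerequisites listed above are already present. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and it only checks that modules and executables can be found, not that their versions satisfy the requirements above.

```python
import importlib.util
import shutil


def check_environment(
    modules=("torch", "torchvision", "transformers", "detectron2"),
    binaries=("tesseract", "gcc", "g++", "ninja"),
):
    """Report which Python modules and command-line tools can be found."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        report[f"binary:{name}"] = shutil.which(name) is not None
    return report


for item, found in check_environment().items():
    print(f"{item:<22} {'found' if found else 'MISSING'}")
```

Anything reported as `MISSING` (other than the optional `ninja`) needs to be installed before the rest of the notebook will run.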


In [None]:
%%writefile requirements.txt
--find-links https://download.pytorch.org/whl/torch_stable.html

transformers
datasets
seqeval
torch==2.0.1+cu117
torchvision==0.15.2+cu117
pytesseract==0.3.10
accelerate
transformers[torch]

In [None]:
!pip install transformers datasets seqeval pytesseract accelerate -q -q -q --exists-action i
!sudo apt-get install -y tesseract-ocr # For Tesseract OCR
!python -m pip install -q 'git+https://github.com/facebookresearch/detectron2.git' # For Detectron2

# If you also need to install torch and torchvision,
# uncomment the line below and run it instead of the first pip install above

#!pip install -r requirements.txt

In [None]:
# login to huggingface hub to push the model at the end and pull the dataset down

!huggingface-cli login

# Get the dataset and preview

We are going to use `sitloboi2012/rvl_cdip_small_dataset` for this run, but you can also use the `nielsr/rvl_cdip_10_examples_per_class` dataset.

In [None]:
from datasets import load_dataset

dataset = load_dataset("sitloboi2012/rvl_cdip_small_dataset")
train_dataset = dataset["train"]

Visualize the image

In [None]:
from PIL import ImageDraw

image = train_dataset[0]["image"]  # index the row first so only one image is decoded
image = image.convert("RGB")
image

Visualize the OCR using PyTesseract

In [None]:
import pytesseract
import numpy as np

ocr_df = pytesseract.image_to_data(image, output_type='data.frame')
ocr_df = ocr_df.dropna().reset_index(drop=True)
float_cols = ocr_df.select_dtypes('float').columns
ocr_df[float_cols] = ocr_df[float_cols].round(0).astype(int)
ocr_df = ocr_df.replace(r'^\s*$', np.nan, regex=True)
words = ' '.join([word for word in ocr_df.text if str(word) != 'nan'])
print("Words after running PyTesseract: ", words)

Draw OCR Bounding Boxes on Image

In [None]:
coordinates = ocr_df[['left', 'top', 'width', 'height']]
actual_boxes = []
for idx, row in coordinates.iterrows():
    x, y, w, h = tuple(row) # the row comes in (left, top, width, height) format
    actual_box = [x, y, x+w, y+h] # we turn it into (left, top, left+width, top+height) to get the actual box
    actual_boxes.append(actual_box)

draw = ImageDraw.Draw(image, "RGB")
for box in actual_boxes:
  draw.rectangle(box, outline='red')

image
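
The boxes drawn above are in pixel coordinates. LayoutLM-family models work with boxes rescaled to a 0-1000 range, and the processor takes care of this itself when it runs OCR. The snippet below is a minimal sketch of that rescaling for illustration only; the helper name `normalize_box` and the example numbers are assumptions, not part of the notebook.

```python
def normalize_box(box, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Hypothetical box on a 500 x 1000 pixel page
example_box = [50, 100, 250, 140]
print(normalize_box(example_box, width=500, height=1000))  # [100, 100, 500, 140]
```

With the notebook's variables, the equivalent call would be `normalize_box(actual_boxes[0], *image.size)`.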

# Processing Image

Convert label to list and to dictionary

In [None]:
# sort the unique labels so the label/id mapping is identical on every run
labels: list[str] = sorted(set(train_dataset["label"]))
id2label: dict[int, str] = dict(enumerate(labels))
label2id: dict[str, int] = {k: v for v, k in enumerate(labels)}
print(f"Label to Id: {label2id}")

Initialize the `Tokenizer`, `FeatureExtractor` and `DataCollator` from the pretrained checkpoint, then encode the dataset

In [None]:
from transformers import LayoutLMv3FeatureExtractor, LayoutLMv3Tokenizer, LayoutLMv3Processor, DataCollatorWithPadding
from datasets import Features, Sequence, ClassLabel, Value, Array2D, Array3D

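# runs Tesseract OCR on each image by default; newer transformers releases offer LayoutLMv3ImageProcessor for the same role
feature_extractor = LayoutLMv3FeatureExtractor()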
tokenizer = LayoutLMv3Tokenizer.from_pretrained("microsoft/layoutlmv3-base")
processor = LayoutLMv3Processor(feature_extractor, tokenizer)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
features = Features({
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': ClassLabel(num_classes=len(labels), names=labels),

})


def preprocess_data(examples):
  # take a batch of images
  images = [image_obj.convert("RGB") for image_obj in examples['image']]

  encoded_inputs = processor(images, padding="max_length", truncation=True)

  # add labels
  encoded_inputs["labels"] = [label2id[label] for label in examples["label"]]

  return encoded_inputs

train_encoded_dataset = train_dataset.map(preprocess_data, remove_columns=train_dataset.column_names, features=features,
                              batched=True, batch_size=2)

#test_encoded_dataset = test_dataset.map(preprocess_data, remove_columns=test_dataset.column_names, features=features, batched=True)
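
The commented-out line above assumes a separate `test_dataset`. If you only have a training split, one option is to hold out part of it while keeping every class represented. The following is a minimal sketch of such a stratified split in plain Python; the function name `stratified_split`, the 20% fraction, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, eval_fraction=0.2, seed=42):
    """Return (train_indices, eval_indices) with each class present in both parts when possible."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_indices, eval_indices = [], []
    for label in sorted(by_label, key=str):
        indices = by_label[label]
        rng.shuffle(indices)
        # hold out at least one example per class, but never the whole class
        n_eval = max(1, round(len(indices) * eval_fraction))
        n_eval = min(n_eval, len(indices) - 1)
        eval_indices.extend(indices[:n_eval])
        train_indices.extend(indices[n_eval:])
    return sorted(train_indices), sorted(eval_indices)


toy_labels = ["invoice"] * 10 + ["letter"] * 5
train_idx, eval_idx = stratified_split(toy_labels)
print(len(train_idx), len(eval_idx))  # 12 3
```

The resulting index lists can be passed to `Dataset.select` (for example `train_dataset.select(eval_idx)`) before mapping `preprocess_data`. The `datasets` library also provides `Dataset.train_test_split` as a built-in alternative.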

In [12]:
train_encoded_dataset.set_format(type="torch", device="cuda")

Double-check the data before feeding it to the model

In [13]:
import torch

train_dataloader = torch.utils.data.DataLoader(train_encoded_dataset, batch_size=4, shuffle = True)
batch = next(iter(train_dataloader))

In [None]:
for k,v in batch.items():
  print(k, v.shape)

In [None]:
processor.tokenizer.decode(batch['input_ids'][0].tolist())

In [None]:
id2label[batch['labels'][0].item()]

# Training

## Model Initialization

In [None]:
from transformers import LayoutLMv3ForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LayoutLMv3ForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=len(labels), id2label = id2label, label2id = label2id)
model.to(device)

## Metric Setup

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

def compute_metrics(eval_pred):
    """
    Evaluate the model output on several classification metrics

    Args:
      eval_pred `(EvalPrediction)`: pair of (logits, labels) passed in by the Trainer

    Returns:
      `dict[str, float]`: accuracy plus micro- and macro-averaged precision, recall and F1
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    # scikit-learn expects (y_true, y_pred); the order matters for macro precision and recall
    accuracy = accuracy_score(labels, predictions)
    precision_micro = precision_score(labels, predictions, average = "micro", zero_division=0)
    precision_macro = precision_score(labels, predictions, average = "macro", zero_division=0)
    recall_micro = recall_score(labels, predictions, average = "micro", zero_division=0)
    recall_macro = recall_score(labels, predictions, average = "macro", zero_division=0)
    f1_micro = f1_score(labels, predictions, average = "micro", zero_division=0)
    f1_macro = f1_score(labels, predictions, average = "macro", zero_division=0)
    return {"accuracy": accuracy, "precision_micro": precision_micro, "recall_micro": recall_micro, "f1_micro": f1_micro, "precision_macro": precision_macro, "recall_macro": recall_macro, "f1_macro": f1_macro}

## Training Arguments and Trainer Setup

In [30]:
import warnings
# Silence "FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers."
warnings.filterwarnings("ignore", category=FutureWarning)

We will train the model for 10 epochs with a learning rate of 1e-5, logging every 5 steps. We also add an `EarlyStoppingCallback` to reduce overfitting. Feel free to change the parameters if needed; they are documented [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

To push your model to the Hugging Face Hub, uncomment the last two lines of the `TrainingArguments` call.

In [32]:
from transformers import Trainer, TrainingArguments

metric_name = "precision_micro" # change if you want to

train_args = TrainingArguments(
    output_dir = "result_trainer",
    evaluation_strategy="epoch",
    save_strategy = "epoch",
    per_device_train_batch_size=10,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    warmup_ratio = 0.1,
    fp16 = True,
    learning_rate = 1e-5,
    weight_decay = 0.01,
    logging_steps = 5,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    remove_unused_columns=False,
    #push_to_hub = True,
    #hub_model_id = "layoutlmv3-finetune-rvlcdip-small",
)
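
To see what `warmup_ratio = 0.1` and `learning_rate = 1e-5` mean in practice: with the Trainer's default linear scheduler, the learning rate ramps up from zero during warmup and then decays linearly back to zero. The snippet below is a simplified sketch of that shape, not the library's implementation; the function name `linear_schedule_lr` and the `total_steps = 100` value are assumptions for illustration.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=1e-5, warmup_ratio=0.1):
    """Approximate learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # decay from base_lr to 0 over the remaining steps
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


total_steps = 100  # hypothetical; the real value depends on dataset size, batch size and epochs
for step in (0, 5, 10, 55, 100):
    print(step, linear_schedule_lr(step, total_steps))
```

The peak value `1e-5` is reached at the end of warmup, so a short run spends a noticeable share of its steps below the nominal learning rate.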

In [33]:
from transformers import EarlyStoppingCallback

early_stopping = EarlyStoppingCallback(early_stopping_patience = 3)
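
With `early_stopping_patience = 3`, training stops once the tracked metric has failed to improve for three evaluations in a row. The snippet below is a simplified sketch of that rule, not the callback's actual code: it ignores the callback's improvement threshold, and the function name `early_stopping_evaluation` and the metric history are made up for illustration.

```python
def early_stopping_evaluation(metric_history, patience=3, greater_is_better=True):
    """Return the 1-based evaluation at which training would stop, or None if it never does."""
    best = None
    stale = 0  # consecutive evaluations without improvement
    for evaluation, value in enumerate(metric_history, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best, stale = value, 0
        else:
            stale += 1
            if stale >= patience:
                return evaluation
    return None


# Hypothetical per-epoch values of the tracked metric
history = [0.60, 0.72, 0.80, 0.79, 0.80, 0.78, 0.81]
print(early_stopping_evaluation(history))  # 6
```

In this toy history the run stops at the sixth evaluation, so the later 0.81 is never reached; a larger patience trades extra training time for a chance to recover from plateaus.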

In [34]:
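# Optional subclass that reuses the DataLoader built earlier.
# The next cell instantiates the stock `Trainer`, so swap in `ClassificationTrainer` there if you want to use it.
class ClassificationTrainer(Trainer):
    def get_train_dataloader(self):
      return train_dataloader

    def get_test_dataloader(self, test_dataset):
      return train_dataloader # please change this to your evaluation dataloader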

In [35]:
trainer = Trainer(
    model,
    train_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=train_encoded_dataset,
    eval_dataset=train_encoded_dataset, # please change this to your evaluate dataset
    compute_metrics=compute_metrics,
    callbacks = [early_stopping]
)

## Train

This should take around 3 minutes to finish, depending on your hardware

In [None]:
trainer.train()

## Evaluate

In [None]:
# use `label_ids` so the `labels` list defined earlier is not overwritten
predictions, label_ids, metrics = trainer.predict(train_encoded_dataset)
metrics

We report both __macro__ and __micro__ averages to get a better picture of how the model is doing.
- Micro: calculate the metrics globally by counting the total true positives, false negatives and false positives
- Macro: calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account
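
The difference between the two averages is easiest to see on a small imbalanced example. The snippet below is a minimal pure-Python sketch for precision only; the function name `precision_micro_macro` and the toy labels are assumptions for illustration, and classes with no predictions score 0, mirroring `zero_division=0`.

```python
def precision_micro_macro(y_true, y_pred):
    """Return (micro, macro) precision for single-label classification."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}  # true positives per predicted class
    fp = {c: 0 for c in classes}  # false positives per predicted class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[pred] += 1
        else:
            fp[pred] += 1

    # micro: pool the counts of all classes, then divide once
    total_tp, total_fp = sum(tp.values()), sum(fp.values())
    micro = total_tp / (total_tp + total_fp)

    # macro: compute precision per class, then take the unweighted mean
    per_class = [tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in classes]
    macro = sum(per_class) / len(per_class)
    return micro, macro


y_true = ["letter"] * 8 + ["memo"] * 2
y_pred = ["letter"] * 7 + ["memo"] + ["memo", "letter"]
micro, macro = precision_micro_macro(y_true, y_pred)
print(f"micro={micro:.4f} macro={macro:.4f}")  # micro=0.8000 macro=0.6875
```

Here the rare `memo` class has a precision of only 0.5, which pulls the macro average well below the micro average even though 8 of 10 predictions are correct.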

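The model's current performance is:
- `Val Loss`: ~2.11
- `Val Accuracy`: ~0.93
- `Val Precision Micro`, `Val Recall Micro`, `Val F1 Micro`: ~0.93
- `Val Precision Macro`, `Val Recall Macro`, `Val F1 Macro`: ~0.91

The micro scores are higher than the macro scores, so the model does better when every prediction is counted globally than when each label is weighted equally.

In other words, there is still room for improvement on some individual labels.

Note that the cells above evaluate on the training set, so these numbers are likely optimistic; use a held-out set for a fairer estimate.

For future improvement, we could run a hyperparameter search, for example with Ray Tune, which the Hugging Face `Trainer` supports.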

# Inference

In [None]:
model.eval()

In [None]:
image = train_dataset[5]["image"].convert("RGB")
image

In [41]:
# prepare image for the model
encoded_inputs = processor(image, return_tensors="pt", padding="max_length", truncation=True)

# make sure all keys of encoded_inputs are on the same device as the model
for k,v in encoded_inputs.items():
  encoded_inputs[k] = v.to(model.device)

# forward pass without gradient tracking
# no labels are passed at inference time, so outputs.loss would be None
with torch.no_grad():
  outputs = model(**encoded_inputs)
logits = outputs.logits

In [None]:
predicted_class_idx = logits.argmax(-1).item()
print(f"Predicted class: {model.config.id2label[predicted_class_idx]}")
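
The `argmax` above only gives the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The snippet below is a minimal sketch in plain Python; the helper names `softmax` and `top_k`, the toy logits, and the toy label mapping are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


toy_logits = [2.0, 0.5, -1.0]                            # hypothetical model output
toy_id2label = {0: "letter", 1: "memo", 2: "invoice"}    # hypothetical mapping
for label, prob in top_k(toy_logits, toy_id2label):
    print(f"{label}: {prob:.3f}")
```

With the notebook's variables, this would be called as `top_k(logits[0].tolist(), model.config.id2label)`; `torch.softmax(logits, dim=-1)` gives the same probabilities directly on the tensor.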

# Extra

In [None]:
# In case you want to load model from HuggingFace Hub

model = LayoutLMv3ForSequenceClassification.from_pretrained("<MODEL_TAG>")
model.eval()

In [None]:
# In case you want to load model from checkpoint

model = LayoutLMv3ForSequenceClassification.from_pretrained("/content/result_trainer/checkpoint-20")
model.eval()