Document Question Answering, also referred to as Document Visual Question Answering, is a task that involves providing answers to questions posed about document images. The input to models supporting this task is typically a combination of an image and a question, and the output is an answer expressed in natural language. These models utilize multiple modalities, including text, the positions of words (bounding boxes), and the image itself.

This guide illustrates how to:
- Fine-tune LayoutLMv2 on the DocVQA dataset.
- Use the fine-tuned model for inference.

LayoutLMv2 solves the document question-answering task by adding a question-answering head on top of the final hidden states of the tokens, to predict the positions of the start and end tokens of the answer. In other words, the problem is treated as extractive question answering: given the context, extract which piece of information answers the question. The context comes from the output of an OCR engine, which is Google’s Tesseract in this particular case.
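
To make the extractive formulation concrete, here is a minimal sketch of how an answer span can be recovered from per-token start and end scores. The tokens and logits are invented for illustration, and `extract_answer_span` and `max_answer_len` are hypothetical names, not part of the LayoutLMv2 API; a real model would produce the logits from the question-answering head.

```python
def extract_answer_span(tokens, start_logits, end_logits, max_answer_len=30):
    """Return the best-scoring span (start <= end) and its token indices."""
    best_score = float("-inf")
    best_span = (0, 0)
    for start, start_logit in enumerate(start_logits):
        # Only consider end positions at or after the start position,
        # and at most max_answer_len tokens long.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best_span = (start, end)
    start, end = best_span
    return " ".join(tokens[start:end + 1]), best_span


# Made-up OCR tokens and logits, for illustration only
tokens = ["invoice", "date", ":", "march", "3", ",", "1998"]
start_logits = [0.1, 0.2, -1.0, 4.0, 0.5, -0.5, 0.3]
end_logits = [0.0, 0.1, -1.0, 0.2, 0.4, -0.5, 3.5]

answer, span = extract_answer_span(tokens, start_logits, end_logits)
print(answer, span)
```

Scoring a span as the sum of its start and end logits, restricted to `start <= end`, is one common decoding choice; simply taking the two independent argmaxes can yield an end position that precedes the start.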

# Libraries

In [1]:
pip install -q transformers datasets

Note: you may need to restart the kernel to use updated packages.


In [5]:
!git clone https://github.com/facebookresearch/detectron2.git
!python3 -m pip install -e detectron2
!pip install torchvision

Cloning into 'detectron2'...
remote: Enumerating objects: 15819, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 15819 (delta 31), reused 47 (delta 17), pack-reused 15743 (from 1)
Receiving objects: 100% (15819/15819), 6.38 MiB | 24.30 MiB/s, done.
Resolving deltas: 100% (11525/11525), done.
Obtaining file:///Users/nm/Projects/multimodal_AI/multimodal_AI/detectron2
  Preparing metadata (setup.py) ... done
Collecting pycocotools>=2.0.2 (from detectron2==0.6)
  Using cached pycocotools-2.0.8-cp310-cp310-macosx_10_9_universal2.whl.metadata (1.1 kB)
Collecting cloudpickle (from detectron2==0.6)
  Using cached cloudpickle-3.1.0-py3-none-any.whl.metadata (7.0 kB)
Collecting iopath<0.1.10,>=0.1.7 (from detectron2==0.6)
  Using cached iopath-0.1.9-py3-none-any.whl.metadata (370 bytes)
Collecting omegaconf<2.4,>=2.1 (from detectron2==0.6)
  Using cached omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collect

In [9]:
!pip install -q pytesseract

In [10]:
from datasets import load_dataset


In [11]:
# GLOBAL VARS
# Note that the LayoutLMv2 checkpoint that we use in this guide has been trained with max_position_embeddings = 512
model_checkpoint = "microsoft/layoutlmv2-base-uncased"
batch_size = 4

# Load Data

In [12]:
# Load a subset of the DocVQA data set
# Full data set can be found at https://rrc.cvc.uab.es/?ch=17
dataset = load_dataset("nielsr/docvqa_1200_examples")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 200
    })
})

In [None]:
# The dataset is split into train and test sets already
dataset["train"].features

Here’s what the individual fields within the data set represent:

- id: the example’s id
- image: a PIL.Image.Image object containing the document image
- query: the question in natural language, stored as a dictionary of strings keyed by language code (for example, "en")
- answers: a list of correct answers provided by human annotators
- words and bounding_boxes: the results of OCR, which we use here only to estimate the length of each example when filtering
- answer: an answer matched by a different model, which we will not use here

Let’s keep only the English questions and drop the answer feature, which appears to contain predictions from another model. We’ll also keep only the first of the answers provided by the annotators; alternatively, you can sample one at random:

In [None]:
updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
updated_dataset = updated_dataset.map(
    lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
)
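
The random-sampling alternative mentioned above could look like the following sketch. `pick_answer` is a hypothetical helper, and the example record is invented; the function operates on a plain dictionary so that it can be tried without downloading the dataset.

```python
import random


def pick_answer(example, rng=None):
    """Pick one annotator answer: the first one, or a random one if rng is given."""
    answers = example["answers"]
    if rng is None:
        return {"answer": answers[0]}
    return {"answer": rng.choice(answers)}


# Invented record with the same "answers" field as the dataset
example = {"answers": ["1998", "March 1998"]}

rng = random.Random(0)  # seeded so that runs are repeatable
first = pick_answer(example)
sampled = pick_answer(example, rng)
print(first, sampled)
```

In principle the same function could be passed to `map` in place of the lambda above, for example `updated_dataset.map(lambda ex: pick_answer(ex, rng), remove_columns=["answer", "answers"])`. This has not been run against the dataset here.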

In [None]:
# Since max_position_embeddings = 512, remove the few examples whose input is likely to exceed 512 tokens
# (the word count is only a rough proxy, since one word can become several tokens).
# We could truncate instead, but the answer might sit at the end of a long document and be cut off.
# Alternatively, implement a sliding window strategy: https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb
updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512)
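
As a rough illustration of the sliding-window alternative mentioned in the comment above, the sketch below splits a long document into overlapping word-level windows, keeping words and bounding boxes aligned. `sliding_windows`, `max_words`, and `overlap` are hypothetical names, and the data is invented. This is a simplification: the linked notebook works at the token level using the tokenizer's overflow handling, and a full implementation would also need to locate the answer within each window.

```python
def sliding_windows(words, boxes, max_words=384, overlap=128):
    """Split parallel word/box lists into overlapping windows of at most max_words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    step = max_words - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        end = start + max_words
        windows.append((words[start:end], boxes[start:end]))
        if end >= len(words):
            break  # the last window reaches the end of the document
        start += step
    return windows


# Invented 10-word document with dummy bounding boxes
words = [f"w{i}" for i in range(10)]
boxes = [[i, 0, i + 1, 1] for i in range(10)]

windows = sliding_windows(words, boxes, max_words=4, overlap=2)
for window_words, _ in windows:
    print(window_words)
```

The overlap means an answer that straddles a window boundary still appears whole in at least one window, provided it is no longer than `overlap` words.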