Document Question Answering, also referred to as Document Visual Question Answering, is a task that involves providing answers to questions posed about document images. The input to models supporting this task is typically a combination of an image and a question, and the output is an answer expressed in natural language. These models utilize multiple modalities, including text, the positions of words (bounding boxes), and the image itself.

This guide illustrates how to:
- Fine-tune LayoutLMv2 on the DocVQA dataset.
- Use the fine-tuned model for inference.

LayoutLMv2 solves the document question-answering task by adding a question-answering head on top of the final hidden states of the tokens, to predict the positions of the start and end tokens of the answer. In other words, the problem is treated as extractive question answering: given the context, extract which piece of information answers the question. The context comes from the output of an OCR engine, which is Google’s Tesseract in this particular case.
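
To make the extractive formulation concrete, here is a minimal sketch of how an answer span could be recovered from per-token start and end scores. The helper `best_answer_span`, the toy tokens, and the logit values are all invented for illustration; they are not LayoutLMv2 outputs, and real pipelines typically add further checks (for example, ignoring spans that fall inside the question).

```python
def best_answer_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best_score = float("-inf")
    best_span = (0, 0)
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best_span = (start, end)
    return best_span


# Toy input: question tokens, a separator, then OCR'd context tokens (hypothetical).
tokens = ["[CLS]", "who", "signed", "?", "[SEP]", "signed", "by", "jane", "doe", "[SEP]"]
start_logits = [0.1, 0.0, 0.2, 0.0, 0.1, 0.3, 0.2, 4.0, 1.0, 0.0]
end_logits = [0.0, 0.1, 0.0, 0.1, 0.0, 0.2, 0.1, 1.5, 3.5, 0.2]

start, end = best_answer_span(start_logits, end_logits)
answer = " ".join(tokens[start : end + 1])
print(start, end, answer)
```

The point of the sketch is only that the model never generates text: it scores positions, and the answer is read back out of the OCR'd context.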

# Libraries

In [1]:
pip install -q transformers datasets

Note: you may need to restart the kernel to use updated packages.


In [2]:
!git clone https://github.com/facebookresearch/detectron2.git
!python3 -m pip install -e detectron2
!pip install torchvision

fatal: destination path 'detectron2' already exists and is not an empty directory.
Obtaining file:///Users/nm/Projects/multimodal_AI/multimodal_AI/detectron2
  Preparing metadata (setup.py) ... done
Installing collected packages: detectron2
  Attempting uninstall: detectron2
    Found existing installation: detectron2 0.6
    Uninstalling detectron2-0.6:
      Successfully uninstalled detectron2-0.6
  Running setup.py develop for detectron2
Successfully installed detectron2-0.6


In [3]:
!pip install -q pytesseract

In [4]:
from datasets import load_dataset


In [5]:
# GLOBAL VARS
# Note that the LayoutLMv2 checkpoint that we use in this guide has been trained with max_position_embeddings = 512
model_checkpoint = "microsoft/layoutlmv2-base-uncased"
batch_size = 4

# Load Data

In [6]:
# Load a subset of the DocVQA data set
# Full data set can be found at https://rrc.cvc.uab.es/?ch=17
dataset = load_dataset("nielsr/docvqa_1200_examples")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 200
    })
})

In [7]:
# The dataset is split into train and test sets already
dataset["train"].features

{'id': Value(dtype='string', id=None),
 'image': Image(decode=True, id=None),
 'query': {'de': Value(dtype='string', id=None),
  'en': Value(dtype='string', id=None),
  'es': Value(dtype='string', id=None),
  'fr': Value(dtype='string', id=None),
  'it': Value(dtype='string', id=None)},
 'answers': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'words': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'bounding_boxes': Sequence(feature=Sequence(feature=Value(dtype='float32', id=None), length=4, id=None), length=-1, id=None),
 'answer': {'match_score': Value(dtype='float64', id=None),
  'matched_text': Value(dtype='string', id=None),
  'start': Value(dtype='int64', id=None),
  'text': Value(dtype='string', id=None)}}

Here’s what the individual fields within the data set represent:

- id: the example’s id
- image: a PIL.Image.Image object containing the document image
- query: the question, asked in natural language and provided in several languages
- answers: a list of correct answers provided by human annotators
- words and bounding_boxes: the results of OCR, which we will not use here, apart from counting words to filter out long documents
- answer: an answer matched by a different model which we will not use here
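
As a rough illustration of how these fields fit together, the sketch below builds a hand-written record that mirrors the schema and reads a few values from it. Every value in `record` is invented (it is not a DocVQA example), `summarize` is a hypothetical helper, and the check that there is one bounding box per word is an assumption about the data rather than something the schema guarantees.

```python
# Hypothetical record mirroring the schema; values are made up for illustration.
record = {
    "id": "example-0",
    "image": None,  # a PIL.Image.Image in the real dataset
    "query": {"en": "What is the invoice date?"},  # other languages omitted here
    "answers": ["3 May 1998", "May 3, 1998"],
    "words": ["Invoice", "date:", "3", "May", "1998"],
    "bounding_boxes": [
        [10.0, 10.0, 60.0, 20.0],
        [65.0, 10.0, 95.0, 20.0],
        [100.0, 10.0, 108.0, 20.0],
        [112.0, 10.0, 135.0, 20.0],
        [140.0, 10.0, 170.0, 20.0],
    ],
}


def summarize(example):
    """Collect a few quick facts about one example."""
    return {
        "question_en": example["query"]["en"],
        "num_answers": len(example["answers"]),
        "num_words": len(example["words"]),
        # Assumption: OCR yields exactly one 4-value box per word.
        "boxes_match_words": len(example["bounding_boxes"]) == len(example["words"]),
    }


summary = summarize(record)
print(summary)
```

The same helper could in principle be called on `dataset["train"][0]` once the dataset is loaded.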

Let’s keep only the English questions and drop the answer feature, which appears to contain predictions by another model. We’ll also keep the first of the answers provided by the annotators (alternatively, you could sample one at random):

In [8]:
updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
updated_dataset = updated_dataset.map(
    lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
)
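
The cell above always keeps the first annotated answer. A random-sampling variant might look like the sketch below; `pick_answer` is a hypothetical helper, and the seeded `random.Random` instance is an assumption made here so that the choice is reproducible.

```python
import random


def pick_answer(answers, rng=None):
    """Return one annotated answer: the first by default, a random one if rng is given."""
    if not answers:
        raise ValueError("expected at least one annotated answer")
    return rng.choice(answers) if rng is not None else answers[0]


rng = random.Random(0)  # fixed seed so repeated runs pick the same answers
answers = ["1998", "in 1998", "the year 1998"]  # invented annotations

first = pick_answer(answers)
sampled = pick_answer(answers, rng)
print(first, "|", sampled)
```

In the `map` call, `pick_answer(example["answers"], rng)` would take the place of `example["answers"][0]`; this substitution has not been run against the dataset here.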


In [9]:
# Since max_position_embeddings = 512, remove the few examples whose encoded input is likely to exceed 512 tokens.
# We could have truncated these examples instead, but the answer might sit at the end of a long document and be cut off.
# Alternatively, implement a sliding window strategy: https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb
updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512)
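
For reference, the sliding window alternative mentioned in the comments can be sketched in plain Python as follows. This is a simplified illustration rather than the implementation from the linked notebook: `sliding_windows` is a hypothetical helper, `stride` is taken here to mean the number of tokens shared by consecutive windows, and a real version would also have to prepend the question to every window and slice the bounding boxes in step with the tokens.

```python
def sliding_windows(tokens, max_len=512, stride=128):
    """Split tokens into windows of at most max_len items.

    Consecutive windows overlap by `stride` items, so an answer cut off at the
    end of one window can still appear whole in the next.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start : start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


# Tiny example: 10 "tokens", windows of 4 with an overlap of 2.
tokens = list(range(10))
windows = sliding_windows(tokens, max_len=4, stride=2)
print(windows)
```

Compared with filtering, this keeps long documents in the training set at the cost of producing several inputs per document, only some of which contain the answer.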


In [12]:
updated_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'image', 'words', 'bounding_boxes', 'answer', 'question'],
        num_rows: 904
    })
    test: Dataset({
        features: ['id', 'image', 'words', 'bounding_boxes', 'answer', 'question'],
        num_rows: 190
    })
})