
Feature request: widgets for document image understanding and document image VQA #21

Closed
NielsRogge opened this issue Aug 19, 2021 · 18 comments

@NielsRogge
Contributor

NielsRogge commented Aug 19, 2021

I'm currently adding LayoutLMv2 and LayoutXLM to HuggingFace Transformers. These models, built by Microsoft, have impressive capabilities for understanding document images (scanned documents, such as PDFs). LayoutLM, and its successor, LayoutLMv2, are extensions of BERT that incorporate layout and visual information, besides just text. LayoutXLM is a multilingual version of LayoutLMv2.

It would be really cool to have inference widgets for the following tasks:

  • document image understanding
  • document image visual question answering
  • document image classification

Document image understanding

Document image understanding (also called form understanding) means understanding all the pieces of information in a document image. Example datasets here are FUNSD, CORD, SROIE and Kleister-NDA.

The input is a document image:
[screenshot: example input document image]

The output should be the same image, but with colored bounding boxes indicating, for example, which parts of the image are questions (blue), which are answers (green), which are headers (orange), etc.
[screenshot: the same document with colored bounding boxes per entity type]

LayoutLMv2 solves this as a NER problem, using LayoutLMv2ForTokenClassification. First, an OCR engine is run on the image to get a list of words + corresponding coordinates. These are then tokenized and, together with the image, sent through the LayoutLMv2 model. The model then labels each token using its classification head.
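Roughly, inference would look like this (a sketch; the base checkpoint name is just an example, in practice a model fine-tuned on e.g. FUNSD would be used, and the processor needs detectron2 + Tesseract installed):

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")  # runs OCR, tokenizes words + boxes
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)  # one label id per token
```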

Document visual question answering

Document visual question answering means, given an image + question, generate (or extract) an answer. For example, for the PDF document above, a question could be "what's the date at which this document was sent?", and the answer is "January 11, 1999".
An example dataset here is DocVQA, on which LayoutLMv2 obtains SOTA performance (who might have guessed).

LayoutLMv2 solves this as an extractive question answering problem, similar to SQuAD. I've defined a LayoutLMv2ForQuestionAnswering, which predicts the start_positions and end_positions.
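A sketch of what that looks like (the checkpoint name is again only illustrative; a model fine-tuned on DocVQA would be used in practice):

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("document.png").convert("RGB")
question = "What's the date at which this document was sent?"
encoding = processor(image, question, return_tensors="pt")  # question + OCR'd words + image
outputs = model(**encoding)

start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding["input_ids"][0, start : end + 1])
```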

Document image classification

Document image classification is fairly simple: given a document image, classify it (e.g. invoice/form/letter/email/etc.). An example dataset here is [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/). For this, I have defined a LayoutLMv2ForSequenceClassification, which just places a linear layer on top of the model in order to classify documents.
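A sketch (RVL-CDIP has 16 document classes; the checkpoint name is again illustrative):

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=16  # 16 classes in RVL-CDIP
)

image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")
predicted_class = model(**encoding).logits.argmax(-1).item()
```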

Remarks

I don't think we can leverage the existing 'token-classification', 'question-answering' and 'image-classification' pipelines, as the inputs are quite different (document images instead of text). To ease the development of new pipelines, I have implemented a new LayoutLMv2Processor, which takes care of all the preprocessing required for LayoutLMv2. It combines a LayoutLMv2FeatureExtractor (for the image modality) and LayoutLMv2Tokenizer (for the text modality). I would also argue that if we add other models in the future, they should all implement a processor that takes care of all the preprocessing (and possibly postprocessing). Processors are ideal for multi-modal models (they have been defined previously for CLIP and Wav2Vec2).
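Concretely, the processor is composed as follows (a sketch):

```python
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2Processor,
    LayoutLMv2TokenizerFast,
)

feature_extractor = LayoutLMv2FeatureExtractor()  # apply_ocr=True by default (uses Tesseract)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

# image: a PIL.Image of a scanned document, as in the examples above
encoding = processor(image, return_tensors="pt")
# -> input_ids, token_type_ids, attention_mask, bbox, image
```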

@mishig25 mishig25 self-assigned this Aug 19, 2021
@osanseviero
Member

Thanks for your ideas! I think these are great

cc @mishig25, who has been working on adding new widgets. The first step for having the widget is to actually have this working in the Inference API.

We usually try to have the widgets be as generic as possible, so let's think of ways this can be reused by similar tasks so we don't end up with 200 tasks :)

For image classification, isn't the existing image-classification widget good enough? (example).

@Narsil deeply understands the Inference API using transformers, so he can probably give good insights into how to integrate this for transformers.

@NielsRogge
Contributor Author

NielsRogge commented Aug 19, 2021

> For image classification, isn't the existing image-classification widget good enough? (example).

We can perhaps leverage it, the only difference is that this model requires a processor instead of a feature extractor for preparing the inputs.

> The first step for having the widget is to actually have this working in the Inference API.

This means, first defining a pipeline for each task (if the existing ones can't be used)?

@osanseviero
Member

> This means, first defining a pipeline for each task (if the existing ones can't be used)?

Yes. To give you a quick summary, we have api-inference, which uses transformers, and api-inference-community, which is used for third-party libraries + a WIP generic inference (which we could use for a proof of concept).

Each widget maps to a single task, and the input/output specifications are somewhat similar between the two inference services. In the case of api-inference, each task maps to a single transformers pipeline, but maybe @Narsil can correct me here. So if we want to have a working widget that uses transformers, we'll need to define the pipeline. For non-transformers models we have more flexibility and could get something up relatively quickly.

@Narsil

Narsil commented Aug 19, 2021

image-classification already exists:

  • image-token-classification can be added
  • image-question-answering too.

We can set up something quick & dirty with generic to demo the widget; however, proper integration within transformers, while a great idea, will take more time (a heavy refactor of pipelines + backward compatibility in transformers means more time for review).

@osanseviero
Member

Agree with @Narsil here. We should build a quick & dirty implementation with generic, but a proper integration will take some time.

I'm not sure I understand how image-token-classification (for document-image-understanding) differs in terms of inputs/outputs from object detection. Aren't they the same?

@Narsil

Narsil commented Aug 20, 2021

Interesting idea to fuse it with object-detection (which I insist is exactly image-segmentation but let's not dwell on it).

For @NielsRogge, the output is `[{"bounding_box": ..., "label": "XXXX", "instance": "Y", "parent"?: "Z"}, ...]`. I feel like the label could be the NER tag, but we still need a field to print the written text, otherwise it's not really useful. `"text_content"?` could be a new optional key.
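For instance, one entry could look something like this (the values are made up; `text_content` is the proposed optional key):

```python
{
    "bounding_box": {"xmin": 54, "ymin": 21, "xmax": 310, "ymax": 43},
    "label": "QUESTION",
    "instance": "1",
    "text_content": "Date of report:",
}
```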

Maybe the output is different enough for a separate pipeline here. It's on the edge IMO.

@NielsRogge
Contributor Author

NielsRogge commented Aug 20, 2021

Actually, the output is the same as object detection (a list of class labels and bounding boxes). Of course, postprocessing is quite different (for DETR, one can use DetrFeatureExtractor.post_process(); for LayoutLMv2, I'm currently converting the predictions manually in a Jupyter notebook cell, but I could add a similar post_process() method to LayoutLMv2Processor).
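Roughly, that manual conversion looks like this (a sketch, given the outputs and encoding of a LayoutLMv2ForTokenClassification forward pass as in the snippet above; the helper name and the exact filtering are just illustrative):

```python
def unnormalize_box(box, width, height):
    # LayoutLMv2 works with boxes normalized to a 0-1000 scale
    return [
        width * (box[0] / 1000),
        height * (box[1] / 1000),
        width * (box[2] / 1000),
        height * (box[3] / 1000),
    ]

predictions = outputs.logits.argmax(-1).squeeze().tolist()
token_boxes = encoding["bbox"].squeeze().tolist()
width, height = image.size

results = [
    {"label": model.config.id2label[pred], "box": unnormalize_box(box, width, height)}
    for pred, box in zip(predictions, token_boxes)
    if box != [0, 0, 0, 0]  # skip special tokens / padding
]
```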

We can then probably just use the existing image-classification and (WIP) object-detection pipelines. The only difference is that now a Processor will be required for preparing the inputs + postprocessing, instead of a FeatureExtractor.

For the other task, DocVQA, a new pipeline will be required: image-question-answering.

@Narsil

Narsil commented Aug 20, 2021

AFAIK, Processor is only a superset of FeatureExtractor that encompasses all lower-level objects (FeatureExtractor + Tokenizer). Is that true, or is there some subtlety I am missing? Pipelines try to stay away from Processor if possible, as it makes the logic harder to follow (too much magic).

@NielsRogge
Contributor Author

> AFAIK, Processor is only a superset of FeatureExtractor that encompasses all lower-level objects (FeatureExtractor + Tokenizer). Is that true, or is there some subtlety I am missing?

That is only true for the processors defined for CLIP and Wav2Vec2. For those, the processor is just a wrapper around the FeatureExtractor and Tokenizer, and it can act as one of the two at any point in time. However, the processor of LayoutLMv2 is different in the sense that it applies the feature extractor and tokenizer in sequence.
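Roughly, what happens inside is something like this (a sketch, with the feature extractor's OCR enabled; key names follow its output in that configuration):

```python
features = feature_extractor(image)  # step 1: resize the image + run OCR
words, boxes = features["words"], features["boxes"]

encoding = tokenizer(text=words, boxes=boxes, return_tensors="pt")  # step 2: tokenize words + boxes
encoding["image"] = features["pixel_values"]  # the model consumes both modalities
```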

@Narsil

Narsil commented Aug 23, 2021

Thanks for the clarification, will have to dig a bit into this.

@osanseviero
Member

related: this could be a nice library to integrate https://github.com/mindee/doctr

@NielsRogge
Contributor Author

NielsRogge commented Aug 25, 2021

There's another beautiful library for deep-learning based layout analysis: https://github.com/Layout-Parser/layout-parser

They might be interested too.

@osanseviero
Member

Document QA is now supported in transformers and the widget, so closing.

@NielsRogge
Contributor Author

This is still failing for Donut:

[screenshot: the document QA widget erroring for a Donut model]

@osanseviero
Member

osanseviero commented Nov 21, 2023

This is because Donut does not conform to the expected output specification of the document QA pipeline in transformers. All other models output both the answer and a score, while Donut only outputs the answer. For example:

```json
[
  {
    "score": 0.9670659899711609,
    "answer": "The Alignment Handbook",
    "start": 2,
    "end": 4
  }
]
```

@mishig25 would it be possible to make the document QA widget not need the score? It's not displayed in the UI so it should not be needed
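For reference, a minimal way to reproduce the two output shapes (the checkpoint names are just examples, and the outputs shown in the comments are illustrative):

```python
from transformers import pipeline

extractive = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
generative = pipeline("document-question-answering", model="naver-clova-ix/donut-base-finetuned-docvqa")

question = "What is the title?"
extractive(image="document.png", question=question)
# -> [{"score": 0.96, "answer": "The Alignment Handbook", "start": 2, "end": 4}]
generative(image="document.png", question=question)
# -> [{"answer": "The Alignment Handbook"}]  (no score key)
```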

@NielsRogge
Contributor Author

Thanks for looking into that @osanseviero. Donut is a generative model, so it indeed does not output a score. All newer models (including Nougat, which is also Donut-based) are generative, so it would be great to make the widget more general.

@julien-c
Member

let's re-open and transfer to the https://github.com/huggingface/huggingface.js repo then

Should be a simple fix to make score optional (@NielsRogge want to give it a try?)

@osanseviero
Member

Let's open a new issue, as this is more of a bug and we don't want to clutter huggingface.js with the whole feature request.
