
Feature request: widgets for document image understanding and document image VQA #21

Closed
NielsRogge opened this issue Aug 19, 2021 · 18 comments

@NielsRogge
Contributor

NielsRogge commented Aug 19, 2021

I'm currently adding LayoutLMv2 and LayoutXLM to HuggingFace Transformers. These models, built by Microsoft, have impressive capabilities for understanding document images (scanned documents, such as PDFs). LayoutLM, and its successor, LayoutLMv2, are extensions of BERT that incorporate layout and visual information, besides just text. LayoutXLM is a multilingual version of LayoutLMv2.

It would be really cool to have inference widgets for the following tasks:

  • document image understanding
  • document image visual question answering
  • document image classification

Document image understanding

Document image understanding (also called form understanding) means understanding all the pieces of information in a document image. Example datasets here are FUNSD, CORD, SROIE and Kleister-NDA.

The input is a document image:
[screenshot: example input document image]

The output should be the same image, but with colored bounding boxes indicating, for example, which parts of the image are questions (blue), which are answers (green), which are headers (orange), etc.
[screenshot: the same document with colored bounding boxes per entity type]

LayoutLMv2 solves this as a NER problem, using LayoutLMv2ForTokenClassification. First, an OCR engine is run on the image to get a list of words + corresponding coordinates. These are then tokenized and, together with the image, sent through the LayoutLMv2 model. The model then labels each token using its classification head.
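Roughly, inference would look like this (a sketch; the base checkpoint name is just an example, in practice a model fine-tuned on e.g. FUNSD would be used, and the processor needs detectron2 + Tesseract installed):

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")  # runs OCR, tokenizes words + boxes
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)  # one label id per token
```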

Document visual question answering

Document visual question answering means, given an image + question, generate (or extract) an answer. For example, for the PDF document above, a question could be "what's the date at which this document was sent?", and the answer is "January 11, 1999".
An example dataset here is DocVQA, on which LayoutLMv2 obtains SOTA performance (who might have guessed).

LayoutLMv2 solves this as an extractive question answering problem, similar to SQuAD. I've defined a LayoutLMv2ForQuestionAnswering, which predicts the start_positions and end_positions.
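A sketch of what that looks like (the checkpoint name is again only illustrative; a model fine-tuned on DocVQA would be used in practice):

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("document.png").convert("RGB")
question = "What's the date at which this document was sent?"
encoding = processor(image, question, return_tensors="pt")  # question + OCR'd words + image
outputs = model(**encoding)

start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding["input_ids"][0, start : end + 1])
```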

Document image classification

Document image classification is fairly simple: given a document image, classify it (e.g. invoice/form/letter/email/etc.). An example dataset here is [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/). For this, I have defined a LayoutLMv2ForSequenceClassification, which just places a linear layer on top of the model in order to classify documents.
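A sketch (RVL-CDIP has 16 document classes; the checkpoint name is again illustrative):

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=16  # 16 classes in RVL-CDIP
)

image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")
predicted_class = model(**encoding).logits.argmax(-1).item()
```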

Remarks

I don't think we can leverage the existing 'token-classification', 'question-answering' and 'image-classification' pipelines, as the inputs are quite different (document images instead of text). To ease the development of new pipelines, I have implemented a new LayoutLMv2Processor, which takes care of all the preprocessing required for LayoutLMv2. It combines a LayoutLMv2FeatureExtractor (for the image modality) and LayoutLMv2Tokenizer (for the text modality). I would also argue that if we add other models in the future, they should all implement a processor that takes care of all the preprocessing (and possibly postprocessing). Processors are ideal for multi-modal models (they have been defined previously for CLIP and Wav2Vec2).
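Concretely, the processor is composed as follows (a sketch):

```python
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2Processor,
    LayoutLMv2TokenizerFast,
)

feature_extractor = LayoutLMv2FeatureExtractor()  # apply_ocr=True by default (uses Tesseract)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

# image: a PIL.Image of a scanned document, as in the examples above
encoding = processor(image, return_tensors="pt")
# -> input_ids, token_type_ids, attention_mask, bbox, image
```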

@mishig25 mishig25 self-assigned this Aug 19, 2021
@osanseviero
Member

Thanks for your ideas! I think these are great

cc @mishig25, who has been working on adding new widgets. The first step for having the widget is to actually have this working in the Inference API.

We usually try to have the widgets be as generic as possible, so let's think of ways this can be reused by similar tasks so we don't end up with 200 tasks :)

For image classification, isn't the existing image-classification widget good enough? (example).

@Narsil deeply understands the Inference API using transformers, so he can probably give good insights into how to integrate this for transformers.

@NielsRogge
Contributor Author

NielsRogge commented Aug 19, 2021

> For image classification, isn't the existing image-classification widget good enough? (example).

We can perhaps leverage it, the only difference is that this model requires a processor instead of a feature extractor for preparing the inputs.

> The first step for having the widget is to actually have this working in the Inference API.

This means, first defining a pipeline for each task (if the existing ones can't be used)?

@osanseviero
Member

> This means, first defining a pipeline for each task (if the existing ones can't be used)?

Yes. To give you a quick summary, we have api-inference, which uses transformers, and api-inference-community, which is used for third-party libraries + a WIP generic inference (which we could use for a proof of concept).

Each widget maps to a single task, and the input/output specifications are somewhat similar between the two inference services. In the case of api-inference, each task maps to a single transformers pipeline, but maybe @Narsil can correct me here. So if we want to have a working widget that uses transformers, we'll need to define the pipeline. For non-transformers models we have more flexibility and could get something up relatively quickly.

@Narsil

Narsil commented Aug 19, 2021

image-classification already exists:

  • image-token-classification can be added
  • image-question-answering too.

We can set up something quick & dirty with generic to demo the widget; however, proper integration within transformers, while a great idea, will take more time (a heavy refactor of pipelines + backward compatibility in transformers means more time for review).

@osanseviero
Member

Agree with @Narsil here. We should build a quick & dirty implementation with generic, but a proper integration will take some time.

I'm not sure I understand how image-token-classification (for document-image-understanding) differs in terms of inputs/outputs from object detection. Aren't they the same?

@Narsil

Narsil commented Aug 20, 2021

Interesting idea to fuse it with object-detection (which I insist is exactly image-segmentation but let's not dwell on it).

For @NielsRogge, the output is `[{"bounding_box": ..., "label": "XXXX", "instance": "Y", "parent"?: "Z"}, ...]`. I feel like the label could be the NER tag, but we still need a field to print the written text, otherwise it's not really useful. `"text_content"?` could be a new optional key.
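For instance, one entry could look something like this (the values are made up; `text_content` is the proposed optional key):

```python
{
    "bounding_box": {"xmin": 54, "ymin": 21, "xmax": 310, "ymax": 43},
    "label": "QUESTION",
    "instance": "1",
    "text_content": "Date of report:",
}
```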

Maybe the output is different enough for a separate pipeline here. It's on the edge IMO.

@NielsRogge
Contributor Author

NielsRogge commented Aug 20, 2021

Actually, the output is the same as object detection (a list of class labels and bounding boxes). Of course, postprocessing is quite different (for DETR, one can use DetrFeatureExtractor.post_process(); for LayoutLMv2, I'm currently converting the predictions manually in a Jupyter notebook cell, but I could add a similar post_process() method to LayoutLMv2Processor).
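Roughly, that manual conversion looks like this (a sketch, given the outputs and encoding of a LayoutLMv2ForTokenClassification forward pass as in the snippet above; the helper name and the exact filtering are just illustrative):

```python
def unnormalize_box(box, width, height):
    # LayoutLMv2 works with boxes normalized to a 0-1000 scale
    return [
        width * (box[0] / 1000),
        height * (box[1] / 1000),
        width * (box[2] / 1000),
        height * (box[3] / 1000),
    ]

predictions = outputs.logits.argmax(-1).squeeze().tolist()
token_boxes = encoding["bbox"].squeeze().tolist()
width, height = image.size

results = [
    {"label": model.config.id2label[pred], "box": unnormalize_box(box, width, height)}
    for pred, box in zip(predictions, token_boxes)
    if box != [0, 0, 0, 0]  # skip special tokens / padding
]
```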

We can then probably just use the existing image-classification and (WIP) object-detection pipelines. The only difference is that now a Processor will be required for preparing the inputs + postprocessing, instead of a FeatureExtractor.

For the other task, DocVQA, a new pipeline will be required: image-question-answering.

@Narsil

Narsil commented Aug 20, 2021

AFAIK, Processor is only a superset of FeatureExtractor that encompasses all lower-level objects (FeatureExtractor + Tokenizer). Is that true, or is there some subtlety I am missing? Pipelines try to stay away from Processor if possible, as it makes the logic harder to follow (too much magic).

@NielsRogge
Contributor Author

> AFAIK, Processor is only a superset of FeatureExtractor that encompasses all lower-level objects (FeatureExtractor + Tokenizer). Is that true, or is there some subtlety I am missing?

That is only true for the processors defined for CLIP and Wav2Vec2. For those, the processor is just a wrapper around the FeatureExtractor and Tokenizer, and it can act as one of the two at any point in time. However, the processor of LayoutLMv2 is different in the sense that it applies the feature extractor and tokenizer in sequence.
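Roughly, what happens inside is something like this (a sketch, with the feature extractor's OCR enabled; key names follow its output in that configuration):

```python
features = feature_extractor(image)  # step 1: resize the image + run OCR
words, boxes = features["words"], features["boxes"]

encoding = tokenizer(text=words, boxes=boxes, return_tensors="pt")  # step 2: tokenize words + boxes
encoding["image"] = features["pixel_values"]  # the model consumes both modalities
```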

@Narsil

Narsil commented Aug 23, 2021

Thanks for the clarification, will have to dig a bit into this.

@osanseviero
Member

related: this could be a nice library to integrate https://github.com/mindee/doctr

@NielsRogge
Contributor Author

NielsRogge commented Aug 25, 2021

There's another beautiful library for deep-learning based layout analysis: https://github.com/Layout-Parser/layout-parser

They might be interested too.

@osanseviero
Member

Document QA is now supported in transformers and the widget, so closing.

@NielsRogge
Contributor Author

This is still failing for Donut:

[screenshot: the document QA widget erroring for a Donut model]

@osanseviero
Member

osanseviero commented Nov 21, 2023

This is because Donut does not conform to the expected output specification of the document QA pipeline in transformers. All other models output both the answer and a score, while Donut only outputs the answer. For example:

```json
[
  {
    "score": 0.9670659899711609,
    "answer": "The Alignment Handbook",
    "start": 2,
    "end": 4
  }
]
```

@mishig25 would it be possible to make the document QA widget not need the score? It's not displayed in the UI so it should not be needed
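For reference, a minimal way to reproduce the two output shapes (the checkpoint names are just examples, and the outputs shown in the comments are illustrative):

```python
from transformers import pipeline

extractive = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
generative = pipeline("document-question-answering", model="naver-clova-ix/donut-base-finetuned-docvqa")

question = "What is the title?"
extractive(image="document.png", question=question)
# -> [{"score": 0.96, "answer": "The Alignment Handbook", "start": 2, "end": 4}]
generative(image="document.png", question=question)
# -> [{"answer": "The Alignment Handbook"}]  (no score key)
```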

@NielsRogge
Contributor Author

Thanks for looking into that @osanseviero. Donut is a generative model, so it indeed does not output a score. All newer models (including Nougat, which is also Donut-based) are generative, so it would be great to make the widget more general.

@julien-c
Member

let's re-open and transfer to the https://github.com/huggingface/huggingface.js repo then

Should be a simple fix to make score optional (@NielsRogge want to give it a try?)

@osanseviero
Member

Let's open a new issue, as this is more of a bug and we don't want to clutter huggingface.js with the whole feature request.
