Feature request: widgets for document image understanding and document image VQA #21
Comments
Thanks for your ideas! I think these are great, cc @mishig25 who has been working on adding new widgets. The first step for having the widget is to actually have this working in the Inference API. We usually try to have the widgets be as generic as possible, so let's think of ways this can be reused by similar tasks so we don't end up with 200 tasks :) For image classification, isn't the existing image-classification widget enough? @Narsil deeply understands the Inference API.
We can perhaps leverage it; the only difference is that this model requires a processor instead of a feature extractor for preparing the inputs.
Does this mean first defining a pipeline for each task (if the existing ones can't be used)?
Yes. To give you a quick summary: each widget maps to a single task, and the input/output specifications are somewhat similar between the two inference services.
We can set up something quick & dirty first.
Agree with @Narsil here. We should build a quick & dirty implementation first. I'm not sure I understand how …
Interesting idea to fuse it with … For @NielsRogge, the output is `[{"bounding_box": ..., "label": "XXXX", "instance": "Y", "parent"?: "Z"}, ...]`. I feel like the label could be the NER tag, but we still need a field to print the written text, otherwise it's not really useful. `"text_content"`? could be a new optional key. Maybe the output is different enough for a separate pipeline here. It's on the edge IMO.
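To make the payload being discussed concrete, here is a purely illustrative example (not from the thread) using the keys mentioned above plus the proposed optional `text_content` field; the coordinates, tag names, and values are assumptions made up for the sake of illustration:

```python
# Hypothetical "document image understanding" output, extending the
# object-detection-style payload discussed above with an optional text field.
example_output = [
    {
        "bounding_box": [57, 103, 244, 121],   # illustrative pixel coordinates
        "label": "QUESTION",                   # the NER-style tag
        "instance": "1",
        "text_content": "Date of report:",     # proposed optional key
    },
    {
        "bounding_box": [250, 103, 341, 121],
        "label": "ANSWER",
        "instance": "2",
        "parent": "1",                         # optional link back to its question
        "text_content": "January 11, 1999",
    },
]
```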
Actually, the output is the same as object detection (a list of class labels and bounding boxes). Of course, the postprocessing is quite different (for DETR, one can use the feature extractor's post-processing method). We can then probably just use the existing object-detection pipeline. For the other task, DocVQA, a new pipeline will be required.
AFAIK, the processor is just a wrapper around the feature extractor and the tokenizer, so it can act as either one.
That is only true for the processors defined for CLIP and Wav2Vec2. For those, the processor is just a wrapper around the feature extractor and the tokenizer, and it can act as either of the two at any point in time. However, the processor of LayoutLMv2 is different in the sense that it applies the feature extractor and the tokenizer in sequence.
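To make that distinction concrete, here is a minimal sketch (not from the thread) of how `LayoutLMv2Processor` chains the two components, assuming a `transformers` version with LayoutLMv2 support and `pytesseract` installed for the OCR step; exact output key names may vary between versions:

```python
from PIL import Image
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2Tokenizer,
    LayoutLMv2Processor,
)

image = Image.open("document.png").convert("RGB")

# Step 1: the feature extractor handles the image modality. With apply_ocr=True
# (the default) it also runs Tesseract to get words + normalized bounding boxes.
feature_extractor = LayoutLMv2FeatureExtractor()
features = feature_extractor(image, return_tensors="pt")
words, boxes = features["words"][0], features["boxes"][0]

# Step 2: the tokenizer handles the text modality, consuming the OCR output.
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
text_encoding = tokenizer(text=words, boxes=boxes, truncation=True, return_tensors="pt")

# The processor wraps both components and applies them in sequence in one call.
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
encoding = processor(image, truncation=True, return_tensors="pt")
print(encoding.keys())  # e.g. input_ids, attention_mask, token_type_ids, bbox, image
```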
Thanks for the clarification, I will have to dig a bit into this.
Related: this could be a nice library to integrate: https://github.com/mindee/doctr
There's another beautiful library for deep-learning-based layout analysis: https://github.com/Layout-Parser/layout-parser. They might be interested too.
Document QA is now supported in Transformers and the widget, so closing.
This is still failing for Donut.
This is because Donut does not conform to the expected output specification of the document QA pipeline in `transformers`.
@mishig25 would it be possible to make the document QA widget not require the score? It's not displayed in the UI, so it should not be needed.
Thanks for looking into that @osanseviero. Donut is a generative model, so it indeed does not output a score. All newer models (including Nougat, which is also a Donut-based model) are generative, so it would be great to make the widget more general.
Let's re-open and transfer to the https://github.com/huggingface/huggingface.js repo then. Should be a simple fix to make the score optional (@NielsRogge want to give it a try?).
Let's open a new issue instead, as this is more of a bug and we don't want to clutter this thread.
I'm currently adding LayoutLMv2 and LayoutXLM to HuggingFace Transformers. These models, built by Microsoft, have impressive capabilities for understanding document images (scanned documents, such as PDFs). LayoutLM and its successor LayoutLMv2 are extensions of BERT that incorporate layout and visual information in addition to text. LayoutXLM is a multilingual version of LayoutLMv2.
It would be really cool to have inference widgets for the following tasks:
Document image understanding
Document image understanding (also called form understanding) means understanding all pieces of information in a document image. Example datasets here are FUNSD, CORD, SROIE and Kleister-NDA.
The input is a document image:
![Screenshot 2021-08-19 at 10 12 18](https://user-images.githubusercontent.com/48327001/130034903-13f88eaf-7a6e-4dca-9973-e1e83a41deba.png)
The output should be the same image, but with colored bounding boxes indicating, for example, which parts of the image are questions (blue), which are answers (green), which are headers (orange), etc.
![Screenshot 2021-08-19 at 10 12 45](https://user-images.githubusercontent.com/48327001/130035118-39cce15b-a9a7-4e5a-9c5c-ce759b89d272.png)
LayoutLMv2 solves this as a NER problem, using `LayoutLMv2ForTokenClassification`. First, an OCR engine is run on the image to get a list of words and their corresponding coordinates. These are then tokenized and, together with the image, sent through the LayoutLMv2 model. The model then labels each token using its classification head (a rough sketch of such an inference is shown below).
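To make the flow above concrete, here is a minimal inference sketch (not part of the original issue). The fine-tuned checkpoint path is a placeholder, and it assumes `detectron2` is installed for the model and `pytesseract` for the OCR step:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

# The processor bundles the feature-extraction/OCR and tokenization steps.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
# Placeholder: a LayoutLMv2 checkpoint fine-tuned for form understanding (e.g. on FUNSD).
model = LayoutLMv2ForTokenClassification.from_pretrained("path/to/layoutlmv2-finetuned-funsd")

image = Image.open("scanned_document.png").convert("RGB")
encoding = processor(image, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# One predicted label id per token; these can be mapped back to words and their
# bounding boxes to draw the colored boxes shown in the screenshot above.
predictions = outputs.logits.argmax(-1).squeeze().tolist()
labels = [model.config.id2label[p] for p in predictions]
```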
Document visual question answering
Document visual question answering means: given an image and a question, generate (or extract) an answer. For example, for the PDF document above, a question could be "what's the date at which this document was sent?", and the answer is "January 11, 1999".
An example dataset here is DocVQA, on which LayoutLMv2 obtains SOTA performance (who might have guessed).
LayoutLMv2 solves this as an extractive question answering problem, similar to SQuAD. I've defined a `LayoutLMv2ForQuestionAnswering`, which predicts the `start_positions` and `end_positions` (see the sketch below).
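As an illustration of this extractive setup, here is another hedged sketch (again not from the issue; the fine-tuned checkpoint path is a placeholder):

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
# Placeholder: a LayoutLMv2 checkpoint fine-tuned on DocVQA.
model = LayoutLMv2ForQuestionAnswering.from_pretrained("path/to/layoutlmv2-finetuned-docvqa")

image = Image.open("scanned_document.png").convert("RGB")
question = "What's the date at which this document was sent?"
encoding = processor(image, question, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# Extractive QA: pick the most likely start/end tokens and decode that span.
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding["input_ids"][0, start : end + 1])
print(answer)
```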
Document image classification
Document image classification is fairly simple: given a document image, classify it (e.g. invoice, form, letter, email, etc.). An example dataset here is [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/). For this, I have defined a `LayoutLMv2ForSequenceClassification`, which just places a linear layer on top of the model in order to classify documents (a sketch follows below).
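A corresponding sketch for this third task, under the same assumptions as above (the checkpoint path is again a placeholder):

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
# Placeholder: a LayoutLMv2 checkpoint fine-tuned on RVL-CDIP document classes.
model = LayoutLMv2ForSequenceClassification.from_pretrained("path/to/layoutlmv2-finetuned-rvl-cdip")

image = Image.open("scanned_document.png").convert("RGB")
encoding = processor(image, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# A single label for the whole document (invoice, form, letter, email, ...).
predicted_class = model.config.id2label[outputs.logits.argmax(-1).item()]
print(predicted_class)
```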
Remarks
I don't think we can leverage the existing 'token-classification', 'question-answering' and 'image-classification' pipelines, as the inputs are quite different (document images instead of just text). To ease the development of new pipelines, I have implemented a new `LayoutLMv2Processor`, which takes care of all the preprocessing required for LayoutLMv2. It combines a `LayoutLMv2FeatureExtractor` (for the image modality) and a `LayoutLMv2Tokenizer` (for the text modality). I would also argue that if we add other models like this in the future, they should all implement a processor that takes care of all the preprocessing (and possibly postprocessing). Processors are ideal for multi-modal models (they have previously been defined for CLIP and Wav2Vec2).