
Improve LayoutLM #9476

Merged: 7 commits merged into huggingface:master on Jan 12, 2021
Conversation

@NielsRogge (Contributor) commented on Jan 8, 2021

What does this PR do?

  • Improve the LayoutLM documentation: explain how to normalize bounding boxes before passing them to the model, add links to the various datasets on which the model achieves state-of-the-art results, and add code examples for the various models
  • Add notebook to the list of community notebooks showcasing how to fine-tune LayoutLMForTokenClassification on the FUNSD dataset (on which the model achieves SOTA results)
  • Add integration tests, which confirm that the model outputs the same tensors as the original implementation on the same input data
  • Add LayoutLMForSequenceClassification, which makes it possible to fine-tune LayoutLM for document image classification tasks (such as the RVL-CDIP dataset), with extra tests included.
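The documentation change in the first bullet concerns normalizing bounding boxes to the 0–1000 coordinate range LayoutLM expects, relative to the page size. A minimal sketch of that normalization (the helper name is mine, not from the PR):

```python
def normalize_bbox(bbox, width, height):
    # LayoutLM expects box coordinates on a 0-1000 scale,
    # normalized by the width/height of the original document page.
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# e.g. a box from a 762x1000 px scanned page
print(normalize_bbox([212, 257, 231, 260], width=762, height=1000))
```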

Fixes the following issues:

Who can review?

@LysandreJik, @patrickvonplaten, @sgugger

def test_LayoutLM_backward_pass_reduces_loss(self):
    """Test that the loss/gradients match the reference implementation."""
    self.assertTrue(torch.allclose(outputs.loss, expected_loss, atol=0.1))
A contributor commented:
atol=1e-3 would not pass here?
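For context on the tolerance question: torch.allclose passes when |actual − expected| ≤ atol + rtol·|expected| (rtol defaults to 1e-5), so tightening atol can flip the result. A small illustration with made-up values:

```python
import torch

expected = torch.tensor([0.6250])
actual = torch.tensor([0.6255])  # differs by 5e-4

# loose tolerance passes
print(torch.allclose(actual, expected, atol=1e-2))  # True: 5e-4 <= 1e-2
# tight tolerance fails: 5e-4 > 1e-4 + 1e-5 * 0.625
print(torch.allclose(actual, expected, atol=1e-4))  # False
```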

@patrickvonplaten (Contributor) left a comment:

Very nice PR! Thanks so much for taking care of this. The notebook looks great as well. I left a couple of comments. If possible, it would be awesome to make the example a bit more concise (e.g. just use tokenizer(...) instead of tokenize(...) and convert_tokens_to_ids(...)).

@sgugger (Collaborator) left a comment:

Thanks for all your work on this!

Review threads (outdated, resolved) on:
  • docs/source/model_doc/layoutlm.rst
  • src/transformers/models/layoutlm/modeling_layoutlm.py (×3)
  • tests/test_modeling_layoutlm.py (×2)
@LysandreJik (Member) left a comment:

LGTM, great job @NielsRogge! Thanks a lot for your contribution!

Review threads on:
  • src/transformers/models/layoutlm/modeling_layoutlm.py (×2, outdated, resolved)
  • tests/test_modeling_layoutlm.py (resolved)
@NielsRogge (Contributor, Author) commented on Jan 8, 2021

Thanks for the reviews, I've addressed all comments. There are 2 things remaining:

  • in the code examples, I use both tokenize() and convert_tokens_to_ids() because the bounding boxes (which are at word level) need to be converted to token level. Is there a better solution?
words = ["Hello", "world"]
normalized_word_boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]
tokens = []
token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    token_boxes.extend([box] * len(word_tokens))
  • according to @sgugger, the input data on which the integration tests are run may be too long, and black formatting causes them to be flattened vertically. Could you maybe fix this, @LysandreJik?
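For reference, the word-to-token expansion in the snippet above can be completed with boxes for the special tokens; the [0, 0, 0, 0] and [1000, 1000, 1000, 1000] boxes for [CLS] and [SEP] follow the convention used in the LayoutLM documentation. The tokenizer below is a stand-in stub so the sketch runs without downloading a vocabulary; real code would use LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased").tokenize instead:

```python
words = ["Hello", "world"]
normalized_word_boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]

def tokenize(word):
    # stand-in for tokenizer.tokenize(word): crude WordPiece-like split
    # that keeps short words whole and splits longer ones into subwords
    word = word.lower()
    return [word] if len(word) <= 5 else [word[:5], "##" + word[5:]]

tokens, token_boxes = [], []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenize(word)
    tokens.extend(word_tokens)
    # every subword of a word inherits that word's bounding box
    token_boxes.extend([box] * len(word_tokens))

# give [CLS] and [SEP] their conventional boxes
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]
print(tokens, token_boxes)
```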

@LysandreJik (Member) commented:
I pushed the reformat you asked for @NielsRogge, make sure to pull before doing any more changes!

@NielsRogge (Contributor, Author) commented:
Ok, thank you. So the only thing remaining is to make the code examples more efficient? Is there a way to improve the code block (see comment above)?

@LysandreJik LysandreJik merged commit e45eba3 into huggingface:master Jan 12, 2021
guyrosin pushed a commit to guyrosin/transformers that referenced this pull request Jan 15, 2021
* Add LayoutLMForSequenceClassification and integration tests

Improve docs

Add LayoutLM notebook to list of community notebooks

* Make style & quality

* Address comments by @sgugger, @patrickvonplaten and @LysandreJik

* Fix rebase with master

* Reformat in one line

* Improve code examples as requested by @patrickvonplaten

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>