
How were the 2D positional segments generated from OCR for the pretraining tasks of LayoutLMv3 #838

Closed
wandering-walrus opened this issue Aug 23, 2022 · 1 comment

Comments

@wandering-walrus

Describe
Model: LayoutLMv3

I understand that LayoutLMv3 uses 2D positional encodings for whole text segments instead of positional encodings per word. How are these generated for the pretraining task? Is there a specific OCR that was used for this?

I understand data for the FUNSD finetuning task was modified based on the labels of the training data and what key group / value group they belonged to, but how are the text segment positional encodings generated without the use of labels?

Is there any discussion of the OCR engine that was used for pretraining to obtain segment positions instead of word-level positions? Is this a special OCR engine or a model trained for segment extraction? It's a little unclear how much is being hand-waved here, or whether I'm just missing something.

Thanks!

@HYPJUDY
Contributor

HYPJUDY commented Aug 25, 2022

Many OCR engines support line finding. For example, the paper on the open-source Tesseract OCR Engine discusses "Line Finding" in Section 3.1.
We use the Microsoft Read API; you can find a sample OCR output here.

The JSON response maintains the original line groupings of recognized words. It includes the extracted text lines and their bounding box coordinates. Each text line includes all extracted words with their coordinates and confidence scores.

{
  "status": "succeeded",
  "createdDateTime": "2021-02-04T06:32:08.2752706+00:00",
  "lastUpdatedDateTime": "2021-02-04T06:32:08.7706172+00:00",
  "analyzeResult": {
    "version": "3.2",
    "readResults": [
      {
        "page": 1,
        "angle": 2.1243,
        "width": 502,
        "height": 252,
        "unit": "pixel",
        "lines": [
          {
            "boundingBox": [
              58,
              42,
              314,
              59,
              311,
              123,
              56,
              121
            ],
            "text": "Tabs vs",
            "appearance": {
              "style": {
                "name": "handwriting",
                "confidence": 0.96
              }
            },
            "words": [
              {
                "boundingBox": [
                  68,
                  44,
                  225,
                  59,
                  224,
                  122,
                  66,
                  123
                ],
                "text": "Tabs",
                "confidence": 0.933
              },
              {
                "boundingBox": [
                  241,
                  61,
                  314,
                  72,
                  314,
                  123,
                  239,
                  122
                ],
                "text": "vs",
                "confidence": 0.977
              }
            ]
          }
        ]
      }
    ]
  }
}

We use the extracted lines as text segments for pre-training and some downstream tasks without labeled line groupings. For FUNSD, please see #793.
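To make the mapping concrete, here is a minimal sketch of turning the Read API response above into segment-level 2D positions: every word inherits the bounding box of the OCR line it belongs to, normalized to the 0-1000 range used by the LayoutLM family. The file name read_result.json and the helper names (quad_to_box, normalize_box) are illustrative, not part of the released code.

import json

def quad_to_box(quad):
    # The Read API returns an 8-value quadrilateral [x1, y1, ..., x4, y4];
    # take the enclosing axis-aligned box (x0, y0, x1, y1).
    xs, ys = quad[0::2], quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)

def normalize_box(box, width, height):
    # Scale a pixel-space box to the 0-1000 range expected by LayoutLM-style models.
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

with open("read_result.json") as f:  # assumed path to the sample response shown above
    result = json.load(f)

words, boxes = [], []
for page in result["analyzeResult"]["readResults"]:
    w, h = page["width"], page["height"]
    for line in page["lines"]:
        # One segment box per OCR line, shared by every word in that line.
        segment_box = normalize_box(quad_to_box(line["boundingBox"]), w, h)
        for word in line["words"]:
            words.append(word["text"])
            boxes.append(segment_box)

print(words)  # ['Tabs', 'vs']
print(boxes)  # the same line-level box repeated for each word

The only difference from word-level positions (as in LayoutLMv2) is that each word's own boundingBox would be normalized instead; for segment positions, words in the same line share the line's box.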
