
How were the 2D positional segments generated from OCR for the pretraining tasks of LayoutLMv3 #838

Closed
wandering-walrus opened this issue Aug 23, 2022 · 1 comment

Comments

@wandering-walrus

Describe
Model: LayoutLMv3

I understand that LayoutLMv3 uses 2D positional encodings for whole text segments instead of positional encodings per word. How are these generated for the pretraining task? Is there a specific OCR that was used for this?

I understand data for the FUNSD finetuning task was modified based on the labels of the training data and what key group / value group they belonged to, but how are the text segment positional encodings generated without the use of labels?

Is there any discussion of the OCR engine that was used for pretraining to obtain segment positions instead of word-level positions? Is this a special OCR engine or a model trained for segment extraction? It's a little unclear how much is being hand-waved here, or whether I'm just missing something.

Thanks!

@HYPJUDY
Contributor

HYPJUDY commented Aug 25, 2022

Many OCR engines support line finding. For example, the paper on the open-source Tesseract OCR Engine discusses "Line Finding" in Section 3.1.
We use the Microsoft Read API; you can find a sample OCR output here.

The JSON response maintains the original line groupings of recognized words. It includes the extracted text lines and their bounding box coordinates. Each text line includes all extracted words with their coordinates and confidence scores.

{
  "status": "succeeded",
  "createdDateTime": "2021-02-04T06:32:08.2752706+00:00",
  "lastUpdatedDateTime": "2021-02-04T06:32:08.7706172+00:00",
  "analyzeResult": {
    "version": "3.2",
    "readResults": [
      {
        "page": 1,
        "angle": 2.1243,
        "width": 502,
        "height": 252,
        "unit": "pixel",
        "lines": [
          {
            "boundingBox": [
              58,
              42,
              314,
              59,
              311,
              123,
              56,
              121
            ],
            "text": "Tabs vs",
            "appearance": {
              "style": {
                "name": "handwriting",
                "confidence": 0.96
              }
            },
            "words": [
              {
                "boundingBox": [
                  68,
                  44,
                  225,
                  59,
                  224,
                  122,
                  66,
                  123
                ],
                "text": "Tabs",
                "confidence": 0.933
              },
              {
                "boundingBox": [
                  241,
                  61,
                  314,
                  72,
                  314,
                  123,
                  239,
                  122
                ],
                "text": "vs",
                "confidence": 0.977
              }
            ]
          }
        ]
      }
    ]
  }
}

We use the extracted lines as text segments for pre-training and some downstream tasks without labeled line groupings. For FUNSD, please see #793.
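To make the mapping concrete, here is a minimal sketch of turning the Read API response above into segment-level 2D positions: every word inherits the bounding box of the OCR line it belongs to, normalized to the 0-1000 range used by the LayoutLM family. The file name read_result.json and the helper names (quad_to_box, normalize_box) are illustrative, not part of the released code.

import json

def quad_to_box(quad):
    # The Read API returns an 8-value quadrilateral [x1, y1, ..., x4, y4];
    # take the enclosing axis-aligned box (x0, y0, x1, y1).
    xs, ys = quad[0::2], quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)

def normalize_box(box, width, height):
    # Scale a pixel-space box to the 0-1000 range expected by LayoutLM-style models.
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

with open("read_result.json") as f:  # assumed path to the sample response shown above
    result = json.load(f)

words, boxes = [], []
for page in result["analyzeResult"]["readResults"]:
    w, h = page["width"], page["height"]
    for line in page["lines"]:
        # One segment box per OCR line, shared by every word in that line.
        segment_box = normalize_box(quad_to_box(line["boundingBox"]), w, h)
        for word in line["words"]:
            words.append(word["text"])
            boxes.append(segment_box)

print(words)  # ['Tabs', 'vs']
print(boxes)  # the same line-level box repeated for each word

The only difference from word-level positions (as in LayoutLMv2) is that each word's own boundingBox would be normalized instead; for segment positions, words in the same line share the line's box.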
