<a href="https://colab.research.google.com/github/maheshpec/dockie/blob/initial/notebooks/layoutlm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set up

We need Hugging Face `transformers` to load the [LayoutLMv3 model fine-tuned on FUNSD](https://huggingface.co/nielsr/layoutlmv3-finetuned-funsd). We install it from the GitHub repository.

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


We also need:
1. `datasets` to load an example image.
1. `pytesseract` to call Tesseract from Python.
1. `numpy` to transform the data returned by OCR.

In [36]:
!pip install datasets pytesseract numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


`pytesseract` is only a wrapper and doesn't install Tesseract itself. We need to install `tesseract-ocr` via `apt` (since we are on an Ubuntu notebook).

In [34]:
!apt install tesseract-ocr

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 20 not upgraded.
Need to get 4,795 kB of archives.
After this operation, 15.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-eng all 4.00~git24-0e00fe6-1.2 [1,588 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-osd all 4.00~git24-0e00fe6-1.2 [2,989 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr amd64 4.00~git2288-10f4998a-2 [218 kB]
Fetched 4,795 kB in 1s (8,927 kB/s)
Selecting previously unselect
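
Once the install finishes, it can be worth confirming that the `tesseract` binary is actually on the `PATH`, since `pytesseract` shells out to it. The helper below is a minimal sketch using only the standard library; the name `tesseract_version` is ours and is not part of `pytesseract`.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if the binary is missing."""
    binary = shutil.which("tesseract")  # look the executable up on PATH
    if binary is None:
        return None
    result = subprocess.run([binary, "--version"], capture_output=True, text=True)
    # Depending on the build, the version banner may go to stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


print(tesseract_version())
```

If this prints `None`, the `apt` step did not succeed and the processor's built-in OCR will most likely fail later.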

## Download and run the language model

The model we use is `nielsr/layoutlmv3-finetuned-funsd`, a LayoutLMv3 checkpoint fine-tuned on FUNSD, a dataset of noisy scanned forms. The processor comes from `microsoft/layoutlmv3-base`. We need it to run OCR on the image and to convert the image, the words, and their bounding boxes into the input tensors that we feed to the model.

In [65]:
from transformers import AutoModelForTokenClassification
from transformers import AutoProcessor
from transformers import LayoutLMv3Tokenizer
import torch

# we'll use the Auto API here - it will load LayoutLMv3Processor behind the scenes,
# based on the checkpoint we provide from the hub
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = AutoModelForTokenClassification.from_pretrained("nielsr/layoutlmv3-finetuned-funsd")
image_column_name = "image"
text_column_name = "tokens"
boxes_column_name = "bboxes"
label_column_name = "ner_tags"


Load an example from FUNSD.

In [13]:
from datasets import load_dataset 

# this dataset uses the new Image feature :)
dataset = load_dataset("nielsr/funsd-layoutlmv3")

Downloading builder script:   0%|          | 0.00/5.13k [00:00<?, ?B/s]

Downloading and preparing dataset funsd-layoutlmv3/funsd to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9...


Downloading data:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset funsd-layoutlmv3 downloaded and prepared to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

[LayoutLMv3 needs images to be in RGB](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).

The processor usage in v3 is [similar to v2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2#usage-layoutlmv2processor). We use it with OCR enabled. If we didn't, we would have to provide the words and their segment bounding boxes ourselves. LayoutLMv3 also uses byte-pair encoding instead of WordPiece for tokenization.

These items were helpful in figuring this out:
1. [Usage for LayoutLMv2Processor](https://huggingface.co/docs/transformers/model_doc/layoutlmv2#usage-layoutlmv2processor)
1. [Dataset creation for FUNSD for use in LayoutLMv3](https://huggingface.co/datasets/nielsr/funsd-layoutlmv3/blob/main/funsd-layoutlmv3.py#L140)
1. [Reference for LayoutLMv3](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3)
1. [How the LayoutLMv2 model was fine-tuned on FUNSD](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Fine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD.ipynb)
1. [How the LayoutLMv3 model was fine-tuned on FUNSD](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb#scrollTo=V6GvYURlY5ZV)
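
If OCR is run outside the processor, the words and boxes have to be prepared by hand. The sketch below shows one way this could look. It assumes the OCR result is a dict of parallel lists with `text`, `left`, `top`, `width`, and `height` keys (the shape that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns), and it assumes the 0-1000 box scale described in the LayoutLM documentation. The function names and the sample data are ours, for illustration only.

```python
def normalize_box(left, top, width, height, page_width, page_height):
    """Convert a pixel box (left, top, width, height) to [x0, y0, x1, y1] on a 0-1000 scale."""
    return [
        int(1000 * left / page_width),
        int(1000 * top / page_height),
        int(1000 * (left + width) / page_width),
        int(1000 * (top + height) / page_height),
    ]


def words_and_boxes(ocr, page_width, page_height):
    """Drop empty OCR entries and return parallel lists of words and normalized boxes."""
    words, boxes = [], []
    for text, left, top, width, height in zip(
        ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
    ):
        if not text.strip():
            continue  # layout-only rows (page, block, line) carry no text
        words.append(text)
        boxes.append(normalize_box(left, top, width, height, page_width, page_height))
    return words, boxes


# Hand-written stand-in for OCR output on a 200 x 100 pixel page (not real data).
ocr = {
    "text": ["", "Date:", "1999"],
    "left": [0, 20, 100],
    "top": [0, 10, 10],
    "width": [200, 40, 50],
    "height": [100, 20, 20],
}
words, boxes = words_and_boxes(ocr, page_width=200, page_height=100)
print(words)  # ['Date:', '1999']
print(boxes)  # [[100, 100, 300, 300], [500, 100, 750, 300]]
```

With a processor created with `apply_ocr=False`, these lists would then be passed as `processor(image, words, boxes=boxes, return_tensors="pt")`.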

## Run inference

We do the following:
1. Load the image at index 20 of the FUNSD training split.
1. Convert it to RGB.
1. Run it through the processor with OCR enabled, which gives us the encodings.

We could also run our own OCR, but then we have to pass the words and their segment bounding boxes to the processor ourselves.

In [89]:
image = (dataset["train"][20]["image"]).convert("RGB")
encoding = processor(image, return_tensors="pt")
#print(encoding['text_pair'])
print(encoding.keys())

dict_keys(['input_ids', 'attention_mask', 'bbox', 'pixel_values'])


Run the inputs through the model

In [69]:
with torch.no_grad():
  outputs = model(**encoding)



In [85]:
predictions = outputs.logits.argmax(-1).squeeze().tolist()
print(predictions)
print(f'Predictions length: {len(predictions)} Words length: {encoding.input_ids.shape}')


[0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 3, 4, 5, 6, 6, 6, 3, 3, 5, 3, 4, 3, 4, 3, 4, 3, 4, 4, 3, 5, 3, 4, 4, 0, 5, 0, 3, 0, 3, 3, 3, 3, 5, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 3, 0, 0, 3, 5, 4, 3, 4, 3, 5, 6, 6, 6, 6, 6, 5, 6, 6, 3, 4, 3, 4, 4, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 5, 6, 6, 5, 3, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 1, 2, 0, 3, 4, 3, 4, 3, 3, 4, 3, 3, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 0, 3, 4, 3, 4, 3, 3, 3, 4, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 3, 4, 4, 4, 3, 3, 4, 3, 4, 4, 4, 4, 3, 3, 4, 3, 3, 3, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 3, 4, 5, 6, 6, 6, 6, 6, 3, 3, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 0, 5, 6, 0, 0, 0, 3, 4, 3, 4, 3, 4, 4, 5, 5, 3, 3, 5, 6, 3, 4, 3, 4, 3, 4, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 3, 4, 4, 6, 3, 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4, 5, 6, 6, 3, 4, 4, 4, 4, 4, 3, 4, 4, 3, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0]
Predictions length: 309 Words length: torch.Size([1, 309])
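
Note that the 309 predictions are per token (sub-word pieces plus special tokens), not per word. To read them, the class ids have to be mapped to label names and collapsed to one label per word. The sketch below shows one way to do this. The label order in `ID2LABEL` is an assumption based on the FUNSD label list; `model.config.id2label` is the authoritative mapping. The `word_ids` list is a hand-written stand-in for what `encoding.word_ids(0)` should return when the processor uses a fast tokenizer.

```python
# Assumed label order; check model.config.id2label for the real mapping.
ID2LABEL = {
    0: "O",
    1: "B-HEADER",
    2: "I-HEADER",
    3: "B-QUESTION",
    4: "I-QUESTION",
    5: "B-ANSWER",
    6: "I-ANSWER",
}


def labels_per_word(predictions, word_ids):
    """Keep the prediction of the first sub-token of each word; skip special tokens."""
    labels = []
    previous = None
    for pred, word_id in zip(predictions, word_ids):
        # word_id is None for special tokens and repeats for later sub-tokens of a word.
        if word_id is not None and word_id != previous:
            labels.append(ID2LABEL[pred])
        previous = word_id
    return labels


# Toy input: start token, a two-token word, a one-token word, end token.
toy_predictions = [0, 3, 4, 5, 0]
toy_word_ids = [None, 0, 0, 1, None]
print(labels_per_word(toy_predictions, toy_word_ids))  # ['B-QUESTION', 'B-ANSWER']
```

Taking the first sub-token's label is a common convention for token classification, but it is a design choice; a majority vote over the sub-tokens would be another option.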
