<a href="https://colab.research.google.com/github/maheshpec/dockie/blob/initial/notebooks/layoutlm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set up

We need Hugging Face `transformers` to load the [LayoutLMv3 model fine-tuned on FUNSD](https://huggingface.co/nielsr/layoutlmv3-finetuned-funsd).

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git > /dev/null || echo 'error'

We also need:
1. `datasets` to load an example image.
1. `pytesseract` to interact with Tesseract from Python.
1. `numpy` to transform the data returned by OCR processing.

In [8]:
!pip install -U datasets pytesseract numpy  Pillow==9.0.0 > /dev/null || echo 'error'

`pytesseract` doesn't actually install Tesseract itself. We need to install `tesseract-ocr` via `apt` (since we are on an Ubuntu notebook).

In [3]:
!apt install tesseract-ocr > /dev/null || echo 'error install tesseract'
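
Because the install output is redirected to `/dev/null`, it can be worth confirming that the binary is actually reachable. A minimal sketch, assuming the executable is named `tesseract` and lives on `PATH`; the helper `tesseract_available` is illustrative and not part of the notebook:

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


print("tesseract found:", tesseract_available())
```

If this prints `False`, `pytesseract` calls will most likely fail later, since the package only wraps the command-line tool.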





**Make sure to restart the runtime** so that the newly installed packages are picked up.

## Download and run the language model

The model we use is `nielsr/layoutlmv3-finetuned-funsd`, which has been fine-tuned on FUNSD, a dataset of noisy scanned forms. The processor is `microsoft/layoutlmv3-base`; we need it to convert the image and the text data (after OCR) into the inputs the model expects (token IDs, bounding boxes, and pixel values).

In [1]:
from transformers import AutoModelForTokenClassification, AutoProcessor
from transformers import LayoutLMv3FeatureExtractor, LayoutLMv3Tokenizer, LayoutLMv3Processor
import torch

# we'll use the Auto API here - it will load LayoutLMv3Processor behind the scenes,
# based on the checkpoint we provide from the hub
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = AutoModelForTokenClassification.from_pretrained("nielsr/layoutlmv3-finetuned-funsd")
model.config.id2label



{0: 'O',
 1: 'B-HEADER',
 2: 'I-HEADER',
 3: 'B-QUESTION',
 4: 'I-QUESTION',
 5: 'B-ANSWER',
 6: 'I-ANSWER'}
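
These labels follow the usual BIO tagging scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. A small sketch of splitting such a tag into its parts; the helper `split_label` is hypothetical and the dictionary is copied from the output above:

```python
def split_label(label):
    """Split a BIO tag such as 'B-HEADER' into (prefix, entity_type)."""
    if label == "O":
        return "O", None
    prefix, _, entity_type = label.partition("-")
    return prefix, entity_type


# Label map copied from model.config.id2label above
id2label = {0: 'O', 1: 'B-HEADER', 2: 'I-HEADER', 3: 'B-QUESTION',
            4: 'I-QUESTION', 5: 'B-ANSWER', 6: 'I-ANSWER'}

parsed = {idx: split_label(label) for idx, label in id2label.items()}
print(parsed)
```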

Load an example from FUNSD.

In [2]:
from datasets import load_dataset 

# this dataset uses the new Image feature :)
dataset = load_dataset("nielsr/funsd-layoutlmv3")

Downloading and preparing dataset funsd-layoutlmv3/funsd to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9...

Dataset funsd-layoutlmv3 downloaded and prepared to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9. Subsequent calls will reuse this data.

[LayoutLMv3 needs images to be in RGB](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).

The processor usage in v3 is [similar to v2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2#usage-layoutlmv2processor). We use it with OCR. If we didn't, we would have to provide the words and their corresponding segment bounding boxes ourselves. LayoutLMv3 also uses byte-pair encoding (BPE) instead of WordPiece tokenization.

These resources were helpful in figuring this out:
1. [Usage for LayoutLMv2Processor](https://huggingface.co/docs/transformers/model_doc/layoutlmv2#usage-layoutlmv2processor)
1. [Dataset creation for FUNSD for use with LayoutLMv3](https://huggingface.co/datasets/nielsr/funsd-layoutlmv3/blob/main/funsd-layoutlmv3.py#L140)
1. [Reference for LayoutLMv3](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3)
1. [How the LayoutLMv2 model was fine-tuned on FUNSD](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Fine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD.ipynb)
1. [How the LayoutLMv3 model was fine-tuned on FUNSD](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb#scrollTo=V6GvYURlY5ZV)

## Run inference

We do the following:
1. Load the image at index 20 of the FUNSD training split.
1. Convert it to RGB.
1. Run it through the processor with OCR on, which gives us the encodings.

We could also run our own OCR, but then we would have to provide the words and their bounding boxes ourselves.
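
If we bring our own OCR, the LayoutLM family expects each box as `[x0, y0, x1, y1]` scaled to a 0-1000 range relative to the page size. A minimal sketch of that scaling, assuming the OCR engine returns boxes in pixel coordinates; `normalize_box` and the sample numbers are illustrative and not taken from the notebook:

```python
def normalize_box(box, width, height):
    """Scale a pixel-space box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Hypothetical word box on a 500 x 1000 pixel page
print(normalize_box((50, 100, 150, 200), width=500, height=1000))
```

With `apply_ocr=True`, as used here, the feature extractor already takes care of this step.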

In [3]:
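# example at index 20 of the training split, converted to RGB as LayoutLMv3 expects
image = (dataset["train"][20]["image"]).convert("RGB")

# with apply_ocr=True the feature extractor runs Tesseract and returns words and boxes
features = processor.feature_extractor(images=image, return_tensors="pt")
#print(features.keys())
words = features['words'][0]
# tokenize the words and boxes only; this path does not add pixel_values to the encoding
encoding = processor.tokenizer(
    text=features['words'], boxes=features['boxes'], return_tensors="pt",
    add_special_tokens=True, return_overflowing_tokens=False, verbose=True,
)
#encoding = processor(image, return_tensors="pt")
print(encoding.keys())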



dict_keys(['input_ids', 'attention_mask', 'bbox'])


Run the inputs through the model

In [4]:
with torch.no_grad():
  outputs = model(**encoding)
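
The model returns one row of logits per token, while OCR returns whole words, and BPE may split a word into several tokens. A minimal sketch of collapsing token-level predictions to word level by keeping the first token of each word. It assumes a list like the one `encoding.word_ids()` returns for fast tokenizers (`None` for special tokens, otherwise the word index); the toy values and the helper `first_token_predictions` are hypothetical:

```python
def first_token_predictions(word_ids, token_predictions):
    """Keep the prediction of the first token of every word.

    word_ids: one entry per token, None for special tokens,
              otherwise the index of the word the token came from.
    token_predictions: one predicted label id per token.
    """
    word_level = []
    previous = None
    for word_id, pred in zip(word_ids, token_predictions):
        # skip special tokens and continuation pieces of the same word
        if word_id is not None and word_id != previous:
            word_level.append(pred)
        previous = word_id
    return word_level


# Toy data: <s>, "Total" (1 token), "Cost" (2 tokens), </s>
word_ids = [None, 0, 1, 1, None]
token_predictions = [0, 3, 4, 4, 0]
toy_words = ["Total", "Cost"]
toy_id2label = {0: "O", 3: "B-QUESTION", 4: "I-QUESTION"}

word_preds = first_token_predictions(word_ids, token_predictions)
pairs = [(w, toy_id2label[p]) for w, p in zip(toy_words, word_preds)]
print(pairs)
```

Taking the first token is only one possible convention; a majority vote over the tokens of a word would be another.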



In [7]:
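# logits have shape (batch, tokens, labels); take the best label id per token
predictions = outputs.logits.argmax(-1).squeeze().tolist()
# NOTE: predictions are per token (special tokens and sub-word pieces included),
# while `words` holds whole OCR words, so this zip only pairs them approximately
pred = [(w, model.config.id2label[p]) for (w, p) in zip(words, predictions)]
print(pred)
#print(f'Predictions length: {len(predictions)} Words length: {encoding.input_ids.shape}')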


[('INTERNATIONAL', 'O'), ('MARKETING', 'O'), ('RESEARCH', 'O'), ('‘CHANGE', 'O'), ('OF', 'I-HEADER'), ('AUTHORIZED', 'I-HEADER'), ('COST', 'I-HEADER'), ('ate:', 'O'), ('6/2190', 'O'), ('No.', 'O'), ('27', 'O'), ('Description:', 'O'), ('ona', 'O'), ('Kong:', 'O'), ('Cisarette', 'O'), ('Market', 'I-HEADER'), ('Monitor', 'I-HEADER'), ('Supplier:', 'I-HEADER'), ('HOR', 'O'), ('HK', 'O'), ('Total', 'I-QUESTION'), ('Cost', 'I-QUESTION'), ('1990', 'I-QUESTION'), ('Cost', 'I-QUESTION'), ('wi', 'I-ANSWER'), ('itn', 'I-ANSWER'), ('$43,335.00', 'I-QUESTION'), ('$__0,00—', 'I-QUESTION'), ('‘dnt.', 'I-QUESTION'), ('of', 'I-QUESTION'), ('Change:', 'I-QUESTION'), ('Increase', 'I-QUESTION'), ('X_', 'I-QUESTION'), ('Decrease', 'I-QUESTION'), ('__', 'I-QUESTION'), ('$4,376.47', 'I-QUESTION'), ('_.08', 'I-QUESTION'), ('(los', 'I-QUESTION'), ('chaage)', 'I-QUESTION'), ('Adjusted', 'I-QUESTION'), ('Total', 'I-QUESTION'), ('Cost', 'I-QUESTION'), ('of', 'I-QUESTION'), ('Project:', 'O'), ('S_4L', 'I-QUESTION'