<https://huggingface.co/transformers/model_doc/layoutlm.html#layoutlmmodel>

In [1]:
from transformers import LayoutLMTokenizer, LayoutLMModel
import torch

2021-08-06 10:35:08.646484: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-08-06 10:35:08.646524: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
tokenizer = LayoutLMTokenizer.from_pretrained('microsoft/layoutlm-base-uncased')
model = LayoutLMModel.from_pretrained('microsoft/layoutlm-base-uncased')

Some weights of the model checkpoint at microsoft/layoutlm-base-uncased were not used when initializing LayoutLMModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing LayoutLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LayoutLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
words = ["Hello", "world"]
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]

In [4]:
token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    print(f"word_tokens = {word_tokens}")
    token_boxes.extend([box] * len(word_tokens))
# add bounding boxes of cls + sep tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]
token_boxes

word_tokens = ['hello']
word_tokens = ['world']


[[0, 0, 0, 0],
 [637, 773, 693, 782],
 [698, 773, 733, 782],
 [1000, 1000, 1000, 1000]]
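The word boxes above are already normalized. LayoutLM expects box coordinates on a 0-1000 scale, so boxes measured in pixels need rescaling by the page size first. A minimal sketch of that step follows. The page size and pixel box are hypothetical values chosen for illustration; they are not taken from this notebook.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical page size and pixel box (assumptions, for illustration only).
page_width, page_height = 2000, 1000
pixel_box = (1274, 773, 1386, 782)

normalized = normalize_bbox(pixel_box, page_width, page_height)
print(normalized)  # [637, 773, 693, 782]
```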

In [5]:
encoding = tokenizer(' '.join(words), return_tensors="pt")
encoding

{'input_ids': tensor([[ 101, 7592, 2088,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

The `return_tensors` argument appears to select which type of tensor is returned:

- `pt` gives PyTorch tensors, as in the output above
- Let's try `tf` to check that it returns a `tf.Tensor` from TensorFlow

In [6]:
tokenizer(' '.join(words), return_tensors="tf")

2021-08-06 10:35:20.336196: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-08-06 10:35:20.336256: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-06 10:35:20.336281: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (homography-x220t): /proc/driver/nvidia/version does not exist


{'input_ids': <tf.Tensor: shape=(1, 4), dtype=int32, numpy=array([[ 101, 7592, 2088,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 4), dtype=int32, numpy=array([[0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 4), dtype=int32, numpy=array([[1, 1, 1, 1]], dtype=int32)>}

**(?)** We've got `input_ids` of `101, 7592, 2088, 102`. Do they map as follows?

- `101 -> <cls>`
- `7592 -> hello`
- `2088 -> world`
- `102 -> <sep>`

Is there a dictionary for the correspondences?

**(R)** Yes, use the `decode()` method of the tokenizer instance. Note that passing a single `int` behaves differently from passing a list of `int`s: in the outputs below, the single-`int` form comes back with a space between every character.

In [24]:
tokenizer.decode(101)

'[ C L S ]'

In [25]:
tokenizer.decode([101, 102])

'[CLS] [SEP]'

In [26]:
tokenizer.decode(2088)

'w o r l d'

In [28]:
len(tokenizer.decode(2088))

9

In [27]:
tokenizer.decode([101, 7592, 2088,  102])

'[CLS] hello world [SEP]'
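A plausible explanation for the spaced-out results, offered as an assumption rather than a reading of the library source: if a single `int` is converted to one token string and that string is then iterated and joined with spaces, each character is treated as its own token. The toy functions below (`toy_convert_ids_to_tokens` and `toy_decode`, both hypothetical names) reproduce the outputs above using only the four ids seen in this notebook.

```python
# Toy vocabulary containing only the ids seen in this notebook.
toy_vocab = {101: "[CLS]", 7592: "hello", 2088: "world", 102: "[SEP]"}


def toy_convert_ids_to_tokens(ids):
    """Return a str for a single int, and a list of str for a list of ints."""
    if isinstance(ids, int):
        return toy_vocab[ids]
    return [toy_vocab[i] for i in ids]


def toy_decode(ids):
    """Join with spaces whatever iterating over the converted result yields."""
    tokens = toy_convert_ids_to_tokens(ids)
    # Iterating over a str yields characters; over a list, whole tokens.
    return " ".join(tokens)


print(toy_decode(101))          # [ C L S ]
print(toy_decode([101, 102]))   # [CLS] [SEP]
print(len(toy_decode(2088)))    # 9: five letters plus four spaces
```

Wrapping a single id in a list, as in `tokenizer.decode([101])`, sidesteps the issue.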

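### Other parameters when calling a `LayoutLMTokenizer` instance
01. `padding`: when set to `"max_length"`, the output is padded to the maximum length, which is `512` here. Invalid values are rejected: the error below reports `5` as not being a valid padding strategy and lists the accepted ones.
02. `truncation`: when set to `True`, an input longer than the maximum length is cut down to it.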
```python
ValueError: 5 is not a valid PaddingStrategy, please select one of ['longest', 'max_length', 'do_not_pad']
```

In [17]:
max_padded = tokenizer(' '.join(words), return_tensors="tf", padding="max_length")
max_padded["input_ids"]

<tf.Tensor: shape=(1, 512), dtype=int32, numpy=
array([[ 101, 7592, 2088,  102,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,   

In [13]:
"apple pie " * 3

'apple pie apple pie apple pie '

In [14]:
many_apple_pie = tokenizer("apple pie " * 512, truncation=True)["input_ids"]
many_apple_pie

[101,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 6207,
 11345,
 620

In [15]:
len(many_apple_pie)

512

Notice two things:

01. the output is truncated to length `512`
02. the first and last `input_ids` are still `101` and `102`, i.e. the `[CLS]` and `[SEP]` tokens decoded earlier, so truncation appears to leave room for the two special tokens

In [16]:
many_apple_pie[0], many_apple_pie[-1]

(101, 102)
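The padding and truncation behaviour observed above can be mimicked in plain Python. This is a sketch only, not the tokenizer's implementation: `toy_encode` is a hypothetical helper, and the ids `101`, `102` and `0` are simply the values seen in the outputs of this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids observed in the outputs above


def toy_encode(content_ids, max_length=512, pad=False, truncate=False):
    """Wrap content ids in [CLS]/[SEP], optionally truncating or padding."""
    content_ids = list(content_ids)
    if truncate:
        # Reserve two positions for the special tokens.
        content_ids = content_ids[: max_length - 2]
    ids = [CLS_ID] + content_ids + [SEP_ID]
    attention_mask = [1] * len(ids)
    if pad:
        n_pad = max(0, max_length - len(ids))
        ids += [PAD_ID] * n_pad
        attention_mask += [0] * n_pad  # padded positions are masked out
    return ids, attention_mask


# "apple pie " * 512 gives 1024 content tokens before truncation.
long_ids, _ = toy_encode([6207, 11345] * 512, truncate=True)
print(len(long_ids), long_ids[0], long_ids[-1])  # 512 101 102

short_ids, short_mask = toy_encode([7592, 2088], pad=True)
print(len(short_ids), short_ids[:5], sum(short_mask))  # 512 [101, 7592, 2088, 102, 0] 4
```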

In [8]:
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
token_type_ids = encoding["token_type_ids"]

In [9]:
bbox = torch.tensor([token_boxes])
outputs = model(input_ids=input_ids,
                bbox=bbox,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
)
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.1377, -0.0217,  0.5168,  ..., -0.3557,  0.0057, -0.0747],
         [ 0.2349, -0.0126,  0.2175,  ...,  0.3546,  0.4783, -0.5063],
         [ 0.1883,  0.0240,  0.5338,  ..., -0.0384,  0.2696, -0.7188],
         [ 0.1364, -0.0190,  0.5194,  ..., -0.3599,  0.0044, -0.0714]]],
       grad_fn=<NativeLayerNormBackward>), pooler_output=tensor([[-0.1216,  0.2186, -0.0275, -0.0761,  0.2830,  0.0359, -0.4294, -0.1931,
          0.1522,  0.0605,  0.0462,  0.2257,  0.2945, -0.3746,  0.2547,  0.1569,
          0.4103,  0.1967, -0.0695,  0.4194, -0.3590,  0.0327,  0.0486, -0.2101,
         -0.1370,  0.1006,  0.2010, -0.0674,  0.3664, -0.0214,  0.3499, -0.3179,
         -0.0975,  0.2563, -0.0059, -0.4568, -0.1184, -0.2711,  0.1816,  0.0836,
         -0.1046, -0.2034,  0.3083,  0.3592,  0.0949,  0.2845,  0.2826, -0.1538,
         -0.3005, -0.0784, -0.0088,  0.1495, -0.3902, -0.0309, -0.4107, -0.1594,
         -0.0673, -0.1504, 

In [15]:
type(outputs)

transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions

In [14]:
[s for s in dir(outputs) if not s.startswith("_")]

['attentions',
 'clear',
 'copy',
 'cross_attentions',
 'fromkeys',
 'get',
 'hidden_states',
 'items',
 'keys',
 'last_hidden_state',
 'move_to_end',
 'past_key_values',
 'pooler_output',
 'pop',
 'popitem',
 'setdefault',
 'to_tuple',
 'update',
 'values']

In [12]:
outputs.pooler_output.shape, outputs.last_hidden_state.shape

(torch.Size([1, 768]), torch.Size([1, 4, 768]))

Note that

- we called `model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, token_type_ids=token_type_ids)`, treating the `LayoutLMModel` instance as a function. This works because the model is a `torch.nn.Module`, whose `__call__()` method invokes `forward()`.
- The documentation recommends calling the instance rather than `model.forward()` directly because
  - calling the instance takes care of running the pre- and post-processing steps
  - whereas calling `forward()` directly silently skips them
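To make the difference concrete, here is a toy analogue in plain Python. It is not PyTorch's implementation: `ToyModule` is a hypothetical class that only imitates the idea of `__call__()` wrapping `forward()` with registered hooks, which is how post-processing can be skipped when `forward()` is called directly.

```python
class ToyModule:
    """Hypothetical stand-in for a module whose __call__ wraps forward()."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        self._forward_hooks.append(hook)

    def forward(self, x):
        return x * 2

    def __call__(self, x):
        out = self.forward(x)
        # Post-processing: run every registered hook after forward().
        for hook in self._forward_hooks:
            hook(self, x, out)
        return out


calls = []
toy = ToyModule()
toy.register_forward_hook(lambda module, inp, out: calls.append((inp, out)))

toy(3)          # goes through __call__, so the hook fires
toy.forward(4)  # bypasses __call__, so the hook is skipped
print(calls)    # [(3, 6)]
```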