# Bilma tutorial
## Import bilma

In [1]:
from bilma import bilma_model
import numpy as np

## Load a trained model

In [2]:
model = bilma_model.load("models-final/bilma_small_MX_epoch-1.h5")

## Show the model

The model structure can be shown with `model.summary()`:

In [3]:
model.summary()

Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
capt_input (InputLayer)      [(None, 280)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 280, 512)          14860800  
_________________________________________________________________
encoder_5 (Encoder)          (None, 280, 512)          9456640   
_________________________________________________________________
dense_37 (Dense)             (None, 280, 29025)        14889825  
Total params: 39,207,265
Trainable params: 39,207,265
Non-trainable params: 0
_________________________________________________________________


The model's input is a tensor of shape `(bs, 280)`, where `bs` is the batch size. Each sequence holds the IDs of the tokens in the input text, with a maximum length of 280.

In [4]:
model.inputs

[<KerasTensor: shape=(None, 280) dtype=float32 (created by layer 'capt_input')>]

The output is a tensor of shape `(bs, 280, 29025)`, where 29025 is the vocabulary size.

In [5]:
model.outputs

[<KerasTensor: shape=(None, 280, 29025) dtype=float32 (created by layer 'dense_37')>]

# Test MLM on new text

To feed text to the model, we first need to transform it into tokens with a tokenizer. `bilma_model.tokenizer` returns a tokenizer for that purpose; we only need to pass the vocabulary file and the maximum sequence length.

In [6]:
tokenizer = bilma_model.tokenizer(vocab_file="d:/data/twitts/vocab_file_ALL.txt", max_length=280)

After that, we can use the `tokenize` function to transform the text.

In [7]:
texts = ["Tenemos tres días sin internet ni señal de celular en el pueblo.",
         "Incomunicados en el siglo XXI tampoco hay servicio de telefonía fija",
         "Vamos a comer unos tacos"]
tweet = tokenizer.tokenize(texts)

Now `tweet` contains the tokenized text:

In [8]:
tweet[0]

array([   2, 1252, 1430, 1098, 1063, 1925, 1048, 2694, 1007, 1576, 1011,
       1010, 1285,   18,    3,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

The `tokenize` function adds a **start** and an **end** token to each sequence of text and fills the rest with **pad** tokens to reach 280 tokens. The **start** token is 2, **end** is 3, and **pad** is 0.
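The layout described above can be reproduced with a few lines of NumPy. This is only a sketch: `pad_ids` is a hypothetical helper, not part of bilma, and truncating inputs that are too long is an assumption of this sketch rather than documented tokenizer behavior. The IDs are copied from the first tokenized tweet shown above.

```python
import numpy as np

# Special token IDs as described in this tutorial.
PAD_ID, START_ID, END_ID = 0, 2, 3

def pad_ids(token_ids, max_length=280):
    """Wrap token IDs with start/end markers and pad to max_length.

    Hypothetical helper for illustration; not the bilma implementation.
    """
    # Assumption: leave room for start/end and truncate anything longer.
    body = list(token_ids)[: max_length - 2]
    seq = [START_ID] + body + [END_ID]
    seq += [PAD_ID] * (max_length - len(seq))
    return np.array(seq, dtype=np.int32)

# Token IDs of the first tweet, without the start, end, and pad tokens.
ids = [1252, 1430, 1098, 1063, 1925, 1048, 2694, 1007, 1576, 1011, 1010, 1285, 18]
row = pad_ids(ids)
print(row.shape)
print(row[:16])
```

Counting the non-zero entries of a row (`np.count_nonzero(row)`) then gives the sequence length including the start and end tokens.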

Now we can feed the tokenized tweets to the model:

In [9]:
p = model.predict(tweet)
p.shape

(3, 280, 29025)

The output holds one score per vocabulary token at each position; applying softmax turns these scores into a probability distribution. To display the prediction we can use `detokenize`, but first we need the most probable token at each position, which we get with `np.argmax(p, axis=2)`:

In [10]:
tokenizer.detokenize(np.argmax(p, axis=2))

['tenemos tres dias sin internet ni senal de celular en el pueblo .',
 'inc ##om ##un ##ica ##dos en el siglo xxi tampoco hay servicio de que .',
 'vamos a comer unos tacos']
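The softmax step mentioned above is not needed for `argmax`, but it is useful when actual probabilities are wanted. A minimal sketch follows, using a small random array as a stand-in for the model output `p`; the `softmax` helper is written here for illustration and is not taken from bilma.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Stand-in for the model output: 2 sequences, 4 positions, 6-token vocabulary.
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 4, 6))

probs = softmax(scores, axis=2)
best = np.argmax(probs, axis=2)

# Softmax is monotonic, so the argmax is the same before and after it.
assert np.array_equal(best, np.argmax(scores, axis=2))
print(probs.sum(axis=2))  # each position sums to 1 up to rounding
```

With the real model, `softmax(p, axis=2)` would have the same shape as `p`, `(3, 280, 29025)`.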

Note that the tokenizer uses WordPiece and may split a word such as *incomunicados* into several tokens; continuation pieces are prefixed with `##`.
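If whole words are preferred in the detokenized output, the pieces can be joined again. The sketch below assumes the `##` continuation prefix seen in the output above; `merge_wordpieces` is a hypothetical helper, not a bilma function.

```python
def merge_wordpieces(text):
    """Join WordPiece continuation pieces (prefixed with '##') to the previous piece."""
    words = []
    for piece in text.split():
        if piece.startswith("##") and words:
            # Continuation piece: glue it to the preceding piece.
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)

print(merge_wordpieces("inc ##om ##un ##ica ##dos en el siglo xxi"))
# incomunicados en el siglo xxi
```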

You can insert a **mask** token by replacing a token's ID with 4. Positions are counted from zero and include the **start** token, so *internet* in the first tweet is at position 5. To mask it we can do:

In [11]:
tweet[0][5] = 4

Let's now mask the tokens *XXI* and *tacos* from the other tweets:

In [12]:
tweet[1][9] = 4
tweet[2][5] = 4

And predict the masked tweets:

In [13]:
p = model.predict(tweet)
tokenizer.detokenize(np.argmax(p, axis=2))

['tenemos tres dias sin internet ni senal de celular en el pueblo .',
 'inc ##om ##un ##ica ##dos en el siglo xxi tampoco hay servicio de que .',
 'vamos a comer unos tacos']

Note that all three masked tokens (*internet*, *xxi*, and *tacos*) were predicted correctly.
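The in-place assignments used above to mask tokens overwrite the original IDs. When the unmasked batch is still needed, the masking can be wrapped in a helper that works on a copy. This is a sketch: `mask_positions` is a hypothetical name, and a small toy array stands in for the tokenized tweets.

```python
import numpy as np

MASK_ID = 4  # mask token ID used in this tutorial

def mask_positions(batch, positions, mask_id=MASK_ID):
    """Return a copy of batch with one position per row replaced by mask_id."""
    masked = np.array(batch, copy=True)
    rows = np.arange(len(positions))
    # Fancy indexing pairs row i with positions[i].
    masked[rows, positions] = mask_id
    return masked

# Toy batch standing in for the tokenized tweets (3 rows, 10 IDs each).
toy = np.arange(10, 40).reshape(3, 10)
masked = mask_positions(toy, [5, 9, 5])
print(masked)
```

With the real data the call would be `mask_positions(tweet, [5, 9, 5])`, leaving `tweet` itself unchanged.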

We can also display the top *k* predictions at a given position. Here, `mask_pos` lists the position we want in each tweet:

In [14]:
mask_pos = [5, 9, 5] 

Now we can get the top *k* predictions with:

In [15]:
tokenizer.top_k(p, mask_pos, k=5)

[['internet', 'saber', 'luz', 'tener', 'tomar'],
 ['xxi', 'pasado', '0', 'que', 'donde'],
 ['tacos', 'taquitos', 'tamales', 'chilaquiles', 'dias']]
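To make the top *k* selection concrete, here is a minimal NumPy sketch of the same idea. It is not the bilma implementation of `top_k`: `top_k_ids` is a hypothetical helper that returns token IDs rather than token strings, and the toy array stands in for the model output `p`.

```python
import numpy as np

def top_k_ids(p, positions, k=5):
    """For each sequence i, return the k highest-scoring token IDs at positions[i]."""
    result = []
    for i, pos in enumerate(positions):
        scores = p[i, pos]
        # argsort is ascending, so take the last k indices and reverse them.
        best = np.argsort(scores)[-k:][::-1]
        result.append(best.tolist())
    return result

# Toy scores: 2 sequences, 3 positions, 5-token vocabulary.
toy = np.zeros((2, 3, 5))
toy[0, 1] = [0.1, 0.7, 0.05, 0.9, 0.2]
toy[1, 2] = [0.3, 0.2, 0.8, 0.1, 0.6]
print(top_k_ids(toy, [1, 2], k=2))  # [[3, 1], [2, 4]]
```

For a vocabulary of 29025 tokens, `np.argpartition` would avoid sorting the full score vector, at the cost of a slightly longer helper.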