## Playing with BERT base functionality locally

This notebook was inspired by the content of https://github.com/google-research/bert.
All models, from the smallest to the standard one, can be downloaded from
https://storage.googleapis.com/bert_models/2020_02_20/all_bert_models.zip
The models were downloaded and unzipped into the `models` subdirectory. Everything was run in a Conda environment.

---

### Other useful links
1. [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/)
2. [BERT](https://medium.com/@shoray.goel/bert-f6d23b06069f)
3. Video [LSTM is dead. Long Live Transformers!](https://www.youtube.com/watch?v=S27pHKBEp30)
4. Video [BERT Explained!](https://www.youtube.com/watch?v=OR0wfP2FD3c)
5. [What does BERT know about books, movies and music? Probing BERT for Conversational Recommendation](https://arxiv.org/pdf/2007.15356.pdf)
6. [SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/pdf/1907.10529.pdf)

---

This notebook shows how BERT predicts `[MASK]`-ed words (MLM mode, masked language modeling).
BERT also has a second mode, NSP (next sentence prediction), which is not presented here.

Several of the downloaded BERT model sizes are used. For a given sentence with `[MASK]`-ed words, three vectors are generated: `token_input`, `seg_input` and `mask_input`.
`BERT` expects each of these three vectors to be 512 long.

`token_input` starts with `[CLS]`, and every sentence ends with `[SEP]`: a single sentence gives one `[SEP]`, two sentences give two.
`token_input` can additionally contain `[MASK]`.
`mask_input` has 1 at the positions where `token_input` has `[MASK]` and 0 everywhere else.
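
The layout described above can be sketched without loading any model. This is a minimal illustration, not the notebook's own code: `build_inputs` and `VOCAB` are hypothetical names, the token ids are copied from the tokenizer output shown further down in this notebook, and the segment id 1 for a second sentence follows the usual BERT convention rather than anything run here.

```python
SEQ_LEN = 512  # input length used throughout this notebook

# Toy vocabulary (hypothetical helper): ids copied from the notebook output.
VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
    "for": 2005, "me": 2033, "the": 1996, "most": 2087,
    "important": 2590, "thing": 2518, "in": 1999, "is": 2003, ".": 1012,
}


def build_inputs(first, second=None, seq_len=SEQ_LEN, vocab=VOCAB):
    """Return (token_input, seg_input, mask_input) as plain lists."""
    tokens = ["[CLS]"] + list(first) + ["[SEP]"]
    segments = [0] * len(tokens)
    if second is not None:
        # Assumption: the second sentence gets segment id 1 and its own [SEP].
        tokens += list(second) + ["[SEP]"]
        segments += [1] * (len(second) + 1)
    if len(tokens) > seq_len:
        raise ValueError("sentence does not fit into {} tokens".format(seq_len))

    pad = seq_len - len(tokens)
    token_input = [vocab[t] for t in tokens] + [vocab["[PAD]"]] * pad
    seg_input = segments + [0] * pad
    # 1 where the token is [MASK], 0 everywhere else (including padding).
    mask_input = [1 if t == "[MASK]" else 0 for t in tokens] + [0] * pad
    return token_input, seg_input, mask_input


first = ["for", "me", "the", "most", "important", "thing",
         "in", "[MASK]", "is", "[MASK]", "."]
token_input, seg_input, mask_input = build_inputs(first)
print(token_input[:13])
print([i for i, m in enumerate(mask_input) if m == 1])  # masked positions
```

Passing a second token list, for example `build_inputs(["the", "thing"], ["is", "[MASK]"])`, yields two `[SEP]` tokens and a `seg_input` that switches from 0 to 1 at the second sentence.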


In [33]:
#Required libraries
import numpy as np
from keras_bert import load_trained_model_from_checkpoint
import tokenization # source --> https://github.com/google-research/bert/blob/master/tokenization.py


After downloading `tokenization.py` for the first time, one line needs to be changed as follows (under TensorFlow 2, where `tf.gfile` is no longer available):


`[before]`
- `with tf.gfile.GFile(vocab_file, "r") as reader:` [github](https://github.com/google-research/bert/blob/master/tokenization.py#L125)

`[after]`
- `with tf.io.gfile.GFile(vocab_file, "r") as reader:`
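
The same change can be applied with a small script instead of editing the file by hand. This is only a sketch: `patch_source` and `patch_file` are hypothetical helper names, and the default path assumes `tokenization.py` sits next to the notebook.

```python
from pathlib import Path

OLD = "tf.gfile.GFile"
NEW = "tf.io.gfile.GFile"


def patch_source(source: str) -> str:
    """Replace the old gfile call; already patched text is left unchanged."""
    return source.replace(OLD, NEW)


def patch_file(path: str = "tokenization.py") -> bool:
    """Patch the file in place. Returns True if anything was changed."""
    file = Path(path)
    original = file.read_text(encoding="utf-8")
    patched = patch_source(original)
    if patched == original:
        return False
    file.write_text(patched, encoding="utf-8")
    return True


# Demonstration on the affected line only; the file itself is not touched here.
line = '  with tf.gfile.GFile(vocab_file, "r") as reader:'
print(patch_source(line))
# To patch the downloaded file, call: patch_file("tokenization.py")
```

Running the patch a second time is harmless, because the replaced text no longer contains the old call.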



In [34]:
def init_tokenizer_and_load_bert(model_name='uncased_L-2_H-128_A-2', do_lower_case=True, model_trainable=False):
    model_dir = './models/{}'.format(model_name)

    config_path = model_dir + '/bert_config.json'
    checkpoint_path = model_dir +'/bert_model.ckpt'
    vocab_path = model_dir + '/vocab.txt'
    
    print("loading: {}".format(model_name))
    
    tokenizer = tokenization.FullTokenizer(vocab_file=vocab_path, do_lower_case=do_lower_case)
    print("vocab size: {}".format(len(tokenizer.vocab)))
    
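    # training=True makes keras_bert build the model with its MLM/NSP heads,
    # which guess_words needs; with False only the encoder output is returned.
    model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=model_trainable)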
    print("loaded: {}".format(model_name))
    
    return tokenizer, model

In [35]:
def guess_words(sentence, tokenizer, model):
    
    print('Sentence to be processed :  ', sentence )
    
    TOKEN_MASK = "[MASK]"

    # Strip the spaces around every [MASK] so that splitting on it gives clean chunks.
    sentence = sentence.replace(' {} '.format(TOKEN_MASK), TOKEN_MASK)
    sentence = sentence.replace('{} '.format(TOKEN_MASK), TOKEN_MASK)
    sentence = sentence.replace(' {}'.format(TOKEN_MASK), TOKEN_MASK)

    tokens = ['[CLS]'] #first token

    for idx, chunk_sent in enumerate(sentence.split(TOKEN_MASK)):
        if idx == 0:
            tokens += tokenizer.tokenize(chunk_sent) 
        else:
            tokens +=  [TOKEN_MASK] + tokenizer.tokenize(chunk_sent) 

    tokens += ['[SEP]']


    print(tokens)
    token_input = tokenizer.convert_tokens_to_ids(tokens) 
    print('Tokenized : --->',token_input)
    
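    # Pad to the fixed sequence length of 512.
    token_input = token_input + [0] * (512 - len(token_input))
    seg_input = [0] * 512 # all zeros because there is only one sentence
    
    # Look up the [MASK] id in the vocabulary instead of hard-coding 103.
    mask_id = tokenizer.vocab[TOKEN_MASK]
    mask_input = [0]*512
    for i in range(len(mask_input)):
        if token_input[i] == mask_id:
            mask_input[i] = 1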

    
    # Add a batch dimension: each input becomes an array of shape (1, 512).
    token_input = np.asarray([token_input], dtype=np.int16)
    mask_input = np.asarray([mask_input], dtype=np.int16)
    seg_input = np.asarray([seg_input], dtype=np.int16)


    print('shapes :  ', token_input.shape, seg_input.shape, mask_input.shape) # just to check if shape is OK
    
    predicts = model.predict([token_input, seg_input, mask_input])[0]
    preds_argmax = np.argmax(predicts, axis=2)[0]
    result = preds_argmax[:len(tokens)]
    
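    # Collect the predicted token id at every masked position.
    out = [result[i] for i in range(len(tokens)) if mask_input[0][i] == 1]
    predicted_tokens = tokenizer.convert_ids_to_tokens(out)

    # Join the predictions and merge WordPiece continuations ("##...").
    out1 = ' '.join(predicted_tokens)
    out1 = tokenization.printable_text(out1)
    out1 = out1.replace(' ##', '')

    # Substitute each [MASK] with its own prediction, in order.
    remaining = list(predicted_tokens)
    for nr, tok in enumerate(tokens):
        if tok == TOKEN_MASK:
            tokens[nr] = remaining.pop(0)
    guessed_sentence = " ".join(tokens)

    return out1, guessed_sentence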
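
The decoding step at the end of `guess_words` (take the highest-scoring vocabulary entry at each masked position) can be illustrated on toy data. The scores and the four-word vocabulary below are invented for the illustration and are not model output; `argmax` and `decode_masks` are hypothetical helper names.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def decode_masks(scores, mask_input, id_to_token):
    """Pick the best-scoring token at every position flagged in mask_input."""
    return [
        id_to_token[argmax(scores[pos])]
        for pos, flag in enumerate(mask_input)
        if flag == 1
    ]


# Made-up data: 4 positions, vocabulary of 4 tokens.
id_to_token = ["[PAD]", "this", "important", "is"]
scores = [
    [0.7, 0.1, 0.1, 0.1],  # not masked, ignored
    [0.1, 0.6, 0.2, 0.1],  # masked, best id is 1
    [0.1, 0.1, 0.1, 0.7],  # not masked, ignored
    [0.0, 0.2, 0.7, 0.1],  # masked, best id is 2
]
mask_input = [0, 1, 0, 1]

print(decode_masks(scores, mask_input, id_to_token))
```

In the notebook itself the same selection is done with `np.argmax(predicts, axis=2)` over the full vocabulary.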

In [36]:
tokenizer, model = init_tokenizer_and_load_bert('uncased_L-2_H-128_A-2', model_trainable=True)

loading: uncased_L-2_H-128_A-2
vocab size: 30522
loaded: uncased_L-2_H-128_A-2


In [37]:
model.summary()

Model: "model_10"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        [(None, 512)]        0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      [(None, 512)]        0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 512, 128), ( 3906816     Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 512, 128)     256         Input-Segment[0][0]              
___________________________________________________________________________________________

In [38]:
bert_models = {'uncased_L-2_H-128_A-2': 'uncased_L-2_H-128_A-2',
               'uncased_L-2_H-256_A-4': 'uncased_L-2_H-256_A-4',
               'uncased_L-12_H-768_A-12': 'uncased_L-12_H-768_A-12'}

In [39]:
sentence = 'For me the most important thing in  [MASK] is [MASK].'
# Use a separate loop variable so the model name is not overwritten by the loaded model.
for model_name in bert_models:
    print(model_name)
    print('----------------------------------------------------------------------------\n')
    tokenizer, model = init_tokenizer_and_load_bert(model_name, model_trainable=True)
    res = guess_words(sentence, tokenizer, model)
    print('Masked words-->', res, '\n\n')

uncased_L-2_H-128_A-2
----------------------------------------------------------------------------

loading: uncased_L-2_H-128_A-2
vocab size: 30522
loaded: uncased_L-2_H-128_A-2
Sentence to be processed :   For me the most important thing in  [MASK] is [MASK].
['[CLS]', 'for', 'me', 'the', 'most', 'important', 'thing', 'in', '[MASK]', 'is', '[MASK]', '.', '[SEP]']
Tokenized : ---> [101, 2005, 2033, 1996, 2087, 2590, 2518, 1999, 103, 2003, 103, 1012, 102]
shapes :   (1, 512) (1, 512) (1, 512)
Masked words--> ('this important', '[CLS] for me the most important thing in this is important . [SEP]') 


uncased_L-2_H-256_A-4
----------------------------------------------------------------------------

loading: uncased_L-2_H-256_A-4
vocab size: 30522
loaded: uncased_L-2_H-256_A-4
Sentence to be processed :   For me the most important thing in  [MASK] is [MASK].
['[CLS]', 'for', 'me', 'the', 'most', 'important', 'thing', 'in', '[MASK]', 'is', '[MASK]', '.', '[SEP]']
Tokenized : ---> [101, 200