## Playing with BERT base functionality locally

This notebook was inspired by the content of https://github.com/google-research/bert.
All models, from the smallest to the standard one, can be downloaded from
https://storage.googleapis.com/bert_models/2020_02_20/all_bert_models.zip
The models were downloaded and unzipped to the `models` subdirectory. Everything was done in an environment prepared with Conda.
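
Before loading a model it can help to confirm that the unzipped directory is complete. The sketch below is an illustration only: the helper name `missing_model_files` is hypothetical, and it assumes the layout this notebook reads from (`bert_config.json`, `vocab.txt`, and a TensorFlow checkpoint stored as several files sharing the `bert_model.ckpt` prefix).

```python
from pathlib import Path

# Files the notebook reads from each unzipped model directory.
# The checkpoint is assumed to be split into several files that share
# the prefix "bert_model.ckpt", so it is matched by prefix.
REQUIRED_FILES = ("bert_config.json", "vocab.txt")
CHECKPOINT_PREFIX = "bert_model.ckpt"


def missing_model_files(model_dir):
    """Return the expected entries that are absent from model_dir."""
    model_dir = Path(model_dir)
    missing = [name for name in REQUIRED_FILES
               if not (model_dir / name).is_file()]
    # Any file starting with the checkpoint prefix counts as a checkpoint.
    if not any(model_dir.glob(CHECKPOINT_PREFIX + "*")):
        missing.append(CHECKPOINT_PREFIX + "*")
    return missing
```

For example, `missing_model_files('./models/uncased_L-2_H-128_A-2')` should return an empty list when the directory was unzipped correctly.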

---

### Other useful links
1. [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/)
2. [BERT](https://medium.com/@shoray.goel/bert-f6d23b06069f)
3. Video [LSTM is dead. Long Live Transformers!](https://www.youtube.com/watch?v=S27pHKBEp30)
4. Video [BERT Explained!](https://www.youtube.com/watch?v=OR0wfP2FD3c)
5. [What does BERT know about books, movies and music? Probing BERT for Conversational Recommendation](https://arxiv.org/pdf/2007.15356.pdf)
6. [SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/pdf/1907.10529.pdf)

---

This notebook shows how BERT predicts the next sentence (NSP mode).
There is also another mode, masked language modeling (MLM), which is not presented here.

Downloaded BERT models of different sizes are used.
`token_input`, `seg_input`, and `mask_input` are generated from the given sentences, which may contain `[MASK]`-ed words.
`BERT` expects each of these three vectors to be 512 elements long.

`token_input` starts with `[CLS]`, and each sentence ends with `[SEP]`: one `[SEP]` for a single sentence, two for a pair of sentences.
Additionally, `token_input` can contain `[MASK]`.
`mask_input` holds 1 at the positions where `token_input` has `[MASK]` and 0 everywhere else.
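
The layout of the three vectors can be sketched without loading any model. The example below is a minimal illustration under stated assumptions: `build_inputs` is a hypothetical helper, and `toy_vocab` is a made-up token-to-id mapping standing in for the real `vocab.txt`. Padding uses id 0, as the notebook does.

```python
MAX_LEN = 512  # sequence length the notebook pads every input to


def build_inputs(tokens_one, tokens_two, vocab, max_len=MAX_LEN):
    """Build padded token, segment, and mask vectors for one or two sentences.

    `vocab` is a hypothetical token -> id mapping; in the notebook this
    role is played by the tokenizer loaded from vocab.txt.
    """
    tokens = ["[CLS]"] + tokens_one + ["[SEP]"]
    first_len = len(tokens)
    if tokens_two:
        tokens += tokens_two + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")

    pad = max_len - len(tokens)
    token_input = [vocab[t] for t in tokens] + [0] * pad
    # Segment 0 covers [CLS], sentence one and its [SEP]; segment 1 the rest.
    seg_input = [0] * first_len + [1] * (len(tokens) - first_len) + [0] * pad
    # 1 marks positions holding [MASK], 0 everywhere else (padding included).
    mask_input = [1 if t == "[MASK]" else 0 for t in tokens] + [0] * pad
    return token_input, seg_input, mask_input


# Made-up ids for illustration; the real ids come from vocab.txt.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
             "i": 4, "shot": 5, "the": 6, "sheriff": 7}

token_input, seg_input, mask_input = build_inputs(
    ["i", "[MASK]", "the", "sheriff"], [], toy_vocab)
print(token_input[:6], mask_input[:6])
```

Here the single sentence produces one `[SEP]`, and `mask_input` has a 1 only at the position of `[MASK]`.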


In [1]:
import numpy as np
from keras_bert import load_trained_model_from_checkpoint
import tokenization # source --> https://github.com/google-research/bert/blob/master/tokenization.py

In [None]:
def init_tokenizer_and_load_bert(model_name='uncased_L-2_H-128_A-2', do_lower_case=True, model_trainable=False):
    model_dir = './models/{}'.format(model_name)

    config_path = model_dir + '/bert_config.json'
    checkpoint_path = model_dir +'/bert_model.ckpt'
    vocab_path = model_dir + '/vocab.txt'
    
    print("loading: {}".format(model_name))
    
    tokenizer = tokenization.FullTokenizer(vocab_file=vocab_path, do_lower_case=do_lower_case)
    print("vocab size: {}".format(len(tokenizer.vocab)))
    
    model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=model_trainable)
    print("loaded: {}".format(model_name))
    
    return tokenizer, model

In [3]:
tokenizer, model = init_tokenizer_and_load_bert('uncased_L-2_H-128_A-2', model_trainable=True)

loading: uncased_L-2_H-128_A-2
vocab size: 30522
loaded: uncased_L-2_H-128_A-2


In [4]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        [(None, 512)]        0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      [(None, 512)]        0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 512, 128), ( 3906816     Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 512, 128)     256         Input-Segment[0][0]              
______________________________________________________________________________________________

## Next Sentence Prediction (NSP)

Let's check the probability that the second sentence really follows the first one.

### Example

- `Input` = `[CLS] That’s the first sentence. [SEP] Hahaha, nice! [SEP]`
- `Label` = `IsNext`
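
The NSP head produces two probabilities per sentence pair. The sketch below shows one way to turn them into a percentage for the `IsNext` label. It assumes that column 0 is the `IsNext` class, which matches the convention of the original BERT pre-training code; the helper name `is_next_percent` and the numbers in `example` are made up for illustration.

```python
def is_next_percent(nsp_output):
    """Convert rows of [p_is_next, p_not_next] into IsNext percentages.

    Assumes column 0 is the IsNext class. Accepts nested lists or a
    NumPy array of shape (batch, 2).
    """
    percents = []
    for row in nsp_output:
        if len(row) != 2:
            raise ValueError("each row must hold exactly two probabilities")
        percents.append(int(round(float(row[0]) * 100)))
    return percents


# Made-up probabilities for two sentence pairs, for illustration only.
example = [[0.84, 0.16], [0.13, 0.87]]
print(is_next_percent(example))
```

A value close to 100 means the model considers the second sentence a plausible continuation of the first.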

In [7]:
bert_models = {'uncased_L-2_H-128_A-2': 'uncased_L-2_H-128_A-2',
               'uncased_L-2_H-256_A-4': 'uncased_L-2_H-256_A-4',
               'uncased_L-12_H-768_A-12': 'uncased_L-12_H-768_A-12'}
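
The model names encode the architecture: `L` is the number of Transformer layers, `H` the hidden size, and `A` the number of attention heads. As a small sketch, the hypothetical helper `parse_model_name` below extracts these values, assuming every name follows the `<casing>_L-<n>_H-<n>_A-<n>` pattern seen above.

```python
import re

# Pattern for names such as "uncased_L-2_H-128_A-2".
_NAME_PATTERN = re.compile(
    r"^(?P<casing>\w+?)_L-(?P<layers>\d+)_H-(?P<hidden>\d+)_A-(?P<heads>\d+)$")


def parse_model_name(name):
    """Split a BERT model directory name into its architecture settings."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError("unexpected model name: {}".format(name))
    return {
        "casing": match.group("casing"),
        "layers": int(match.group("layers")),
        "hidden": int(match.group("hidden")),
        "heads": int(match.group("heads")),
    }


print(parse_model_name("uncased_L-12_H-768_A-12"))
```

This makes it easy to sort or label results by model size when comparing the NSP scores of several models.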

In [13]:
def compare_senten(sentence_one, sentence_two, tokenizer, model):
    """Return the NSP probability (in percent) that sentence_two follows sentence_one."""
    print('----------------------------------------------------------------------------')
    print('1. ',sentence_one, '>>>>  2. ', sentence_two)
    print('----------------------------------------------------------------------------\n')

    tokens_sen_one = tokenizer.tokenize(sentence_one)
    tokens_sen_two = tokenizer.tokenize(sentence_two)

    tokens = ['[CLS]'] + tokens_sen_one + ['[SEP]'] + tokens_sen_two + ['[SEP]']
    print(tokens)

    # Token ids, padded with 0 up to the fixed input length of 512.
    token_input = tokenizer.convert_tokens_to_ids(tokens)
    token_input = token_input + [0] * (512 - len(token_input))

    # No [MASK] tokens are used for NSP, so the mask vector is all zeros.
    mask_input = [0] * 512

    # Segment ids: 0 for [CLS] + first sentence + [SEP], 1 for second sentence + [SEP].
    seg_input = [0] * 512
    len_1 = len(tokens_sen_one) + 2  # first sentence + 2 tokens: [CLS], [SEP]
    for i in range(len(tokens_sen_two) + 1):  # +1 for the [SEP] that closes the second sentence
        seg_input[len_1 + i] = 1

    # Convert to NumPy arrays with a leading batch dimension of 1.
    token_input = np.asarray([token_input], dtype=np.int32)
    mask_input = np.asarray([mask_input], dtype=np.int16)
    seg_input = np.asarray([seg_input], dtype=np.int16)

    # The second model output is the NSP prediction.
    predicts = model.predict([token_input, seg_input, mask_input])[1]

    # The first value of the NSP output is reported as the probability in percent.
    return int(round(predicts[0][0]*100))

In [16]:
sentence_one = "I shot the sheriff "
sentence_two = "It was in self defence"

for model_name in bert_models:
    print(model_name)
    # Use a separate loop variable so the loaded model does not overwrite the name.
    tokenizer, model = init_tokenizer_and_load_bert(model_name, model_trainable=True)
    
    res = compare_senten(sentence_one,sentence_two,  tokenizer, model)
    print('NSP probabilty for the sentences is :', res,' % ')
    print('==========================================================================================\n\n')

uncased_L-2_H-128_A-2
loading: uncased_L-2_H-128_A-2
vocab size: 30522
loaded: uncased_L-2_H-128_A-2
----------------------------------------------------------------------------
1.  I shot the sheriff  >>>>  2.  It was in self defence
----------------------------------------------------------------------------

['[CLS]', 'i', 'shot', 'the', 'sheriff', '[SEP]', 'it', 'was', 'in', 'self', 'defence', '[SEP]']
NSP probabilty for the sentences is : 13  % 


uncased_L-2_H-256_A-4
loading: uncased_L-2_H-256_A-4
vocab size: 30522
loaded: uncased_L-2_H-256_A-4
----------------------------------------------------------------------------
1.  I shot the sheriff  >>>>  2.  It was in self defence
----------------------------------------------------------------------------

['[CLS]', 'i', 'shot', 'the', 'sheriff', '[SEP]', 'it', 'was', 'in', 'self', 'defence', '[SEP]']
NSP probabilty for the sentences is : 11  % 


uncased_L-12_H-768_A-12
loading: uncased_L-12_H-768_A-12
vocab size: 30522
loaded: unc

In [17]:
sentence_one = "What goes around. "
sentence_two = "Comes around."

for model_name in bert_models:
    print(model_name)
    # Use a separate loop variable so the loaded model does not overwrite the name.
    tokenizer, model = init_tokenizer_and_load_bert(model_name, model_trainable=True)
    
    res = compare_senten(sentence_one,sentence_two,  tokenizer, model)
    print('NSP probabilty for the sentences is :', res,' % ')
    print('==========================================================================================\n\n')

uncased_L-2_H-128_A-2
loading: uncased_L-2_H-128_A-2
vocab size: 30522
loaded: uncased_L-2_H-128_A-2
----------------------------------------------------------------------------
1.  What goes around.  >>>>  2.  Comes around.
----------------------------------------------------------------------------

['[CLS]', 'what', 'goes', 'around', '.', '[SEP]', 'comes', 'around', '.', '[SEP]']
NSP probabilty for the sentences is : 84  % 


uncased_L-2_H-256_A-4
loading: uncased_L-2_H-256_A-4
vocab size: 30522
loaded: uncased_L-2_H-256_A-4
----------------------------------------------------------------------------
1.  What goes around.  >>>>  2.  Comes around.
----------------------------------------------------------------------------

['[CLS]', 'what', 'goes', 'around', '.', '[SEP]', 'comes', 'around', '.', '[SEP]']
NSP probabilty for the sentences is : 79  % 


uncased_L-12_H-768_A-12
loading: uncased_L-12_H-768_A-12
vocab size: 30522
loaded: uncased_L-12_H-768_A-12
---------------------------