# BERT (Bidirectional Encoder Representations from Transformers)

**Intro**: BERT is a Transformer-based model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

BERT is pre-trained on two different but related NLP tasks that exploit this bidirectional context:
- Masked Language Modeling (MLM), and
- Next Sentence Prediction (NSP).
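To make the Masked Language Modeling objective concrete, here is a minimal sketch of how one training example could be corrupted. It assumes the proportions described in the BERT paper (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged). The function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are illustrative assumptions; real BERT works on WordPiece tokens and does not select special tokens.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one toy MLM training example.

    labels[i] holds the original token at positions chosen for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                   # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)          # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                masked.append(token)               # 10%: keep the original token
        else:
            masked.append(token)                   # not selected: left untouched
            labels.append(None)
    return masked, labels

# Toy input: whitespace tokens stand in for WordPiece tokens (an assumption).
tokens = "the cat sat on the mat because it was tired".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

The sketch decides independently for each token, so the number of selected positions varies from run to run; it is a simplification meant only to show the idea. The loss would be computed only at positions where `labels` is not `None`.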

The English model used here was pre-trained on about 2,500M words from Wikipedia and 800M words from BooksCorpus.

**Word Embeddings**: BERT generates contextual word embeddings, meaning the vector for a word depends on the sentence it appears in. Unlike static embeddings such as word2vec or GloVe, which assign one fixed vector per word, BERT looks at the words both before and after a given word, so the same word gets different representations depending on how it is used.



Sources:
1. BERT models: https://tfhub.dev/google/collections/bert/1
2. Text preprocessing layer: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3
3. https://jalammar.github.io/illustrated-bert/
4. https://www.techtarget.com/searchenterpriseai/definition/BERT-language-model#:~:text=BERT%2C%20which%20stands%20for%20Bidirectional,calculated%20based%20upon%20their%20connection.



In [4]:
# Import necessary libraries

import tensorflow_hub as hub


In [2]:
!pip install -U tensorflow-text


Collecting tensorflow-text
  Downloading tensorflow_text-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Collecting tensorflow<2.16,>=2.15.0 (from tensorflow-text)
  Downloading tensorflow-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)
Collecting tensorboard<2.16,>=2.15 (from tensorflow<2.16,>=2.15.0->tensorflow-text)
  Downloading tensorboard-2.15.1-py3-none-any.whl (5.5 MB)
Collecting tensorflow-estimator<2.16,>=2.15.0 (from tensorflow<2.16,>=2.15.0->tensorflow-text)
  Downloading tensorflow_estimator-2.15.0-py2.py3-none-any.whl (441 kB)

In [1]:
import tensorflow_text as tf_text


In [2]:
# Text preprocessing model
text_preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

# BERT encoder model (base, uncased: 12 layers, hidden size 768, 12 attention heads)
bert_encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"



In [5]:
text_preprocess_bert_model = hub.KerasLayer(text_preprocess_url)


Now let's run this preprocessing model on some sentences.

In [10]:
sample_text = ['How are you?','I am doing fine', 'How about you?']
sample_text_preprocessed = text_preprocess_bert_model(sample_text)
sample_text_preprocessed

{'input_type_ids': <tf.Tensor: shape=(3, 128), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

sample_text_preprocessed is a dictionary with 3 keys: 'input_type_ids', 'input_word_ids', 'input_mask'

In [11]:
sample_text_preprocessed.keys()

dict_keys(['input_type_ids', 'input_word_ids', 'input_mask'])

In [12]:
sample_text_preprocessed['input_word_ids']

<tf.Tensor: shape=(3, 128), dtype=int32, numpy=
array([[ 101, 2129, 2024, 2017, 1029,  102,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0],
       [ 101, 1045, 2572, 2725, 2986,  102,    0,    0,    0,    0,    0,
           0,

In **shape=(3, 128)**:

3: The batch contains 3 separate sequences (sentences), one per input string.

128: The sequence length. Each sequence is padded or truncated to a fixed length of 128 tokens, because a batched tensor needs a consistent shape. 128 is the length this preprocessing model produced by default here; it is a common choice because it covers most sentences without being too expensive computationally.

- Each row is an array of integers, and each integer is a token ID. The model uses these IDs to look up the embedding of each token.
- The value 101 is the classification token **[CLS]**, which is added at the beginning of every sequence.
- The value 102 is the separator token **[SEP]**, which separates two sentences or ends a single sentence.
- The value 0 is the padding token **[PAD]**, which fills the sequence up to the required fixed length.

- The output above shows input_word_ids, the token IDs that the BERT model receives after the text has been preprocessed.
- The zeros after each sentence's tokens are padding, applied so that every input sequence reaches the fixed length.
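The packing described above can be reproduced by hand. The following is a minimal sketch in plain Python, not the preprocessing model's actual implementation: the helper name `pack_single_segment` is hypothetical, and the token IDs are copied from the first row of the `input_word_ids` output. It wraps the IDs in `[CLS]` and `[SEP]`, pads with zeros, and builds the matching `input_mask` and `input_type_ids`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs seen in the output above

def pack_single_segment(token_ids, seq_length=128):
    """Wrap token IDs in [CLS] ... [SEP], then truncate or pad to seq_length.

    Assumes seq_length >= 2 so that both special tokens fit.
    """
    body = list(token_ids)[: seq_length - 2]   # leave room for [CLS] and [SEP]
    word_ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(word_ids)
    n_pad = seq_length - n_real
    return {
        "input_word_ids": word_ids + [PAD_ID] * n_pad,
        "input_mask": [1] * n_real + [0] * n_pad,   # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,         # single segment: all zeros
    }

# Token IDs for 'How are you?' copied from the first row of the output above.
packed = pack_single_segment([2129, 2024, 2017, 1029], seq_length=8)
print(packed["input_word_ids"])  # [101, 2129, 2024, 2017, 1029, 102, 0, 0]
print(packed["input_mask"])      # [1, 1, 1, 1, 1, 1, 0, 0]
```

A short `seq_length` of 8 is used only to keep the printed lists readable; with `seq_length=128` the first six values match the first row shown earlier. The all-zero `input_type_ids` is consistent with the output above, where every input is a single sentence.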

### Now let's get the BERT model
Then pass the preprocessed text into the model to get the embeddings

In [13]:
bert_encoder_model = hub.KerasLayer(bert_encoder_url)

In [15]:
results = bert_encoder_model(sample_text_preprocessed)

In [17]:
results.keys()

dict_keys(['sequence_output', 'pooled_output', 'default', 'encoder_outputs'])

In [19]:
# Count the encoder outputs (we use the base version, which has 12 encoder blocks)
len(results['encoder_outputs'])

12
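The 12 returned above matches the `L-12` in the encoder URL. In the naming scheme of the BERT releases, `L` is the number of encoder layers, `H` the hidden size, and `A` the number of attention heads. As a small sketch, the hypothetical helper `parse_bert_size` below reads these values from the URL and derives the per-head size as `H / A`.

```python
import re

def parse_bert_size(model_url):
    """Extract layer count, hidden size and head count from an 'L-x_H-y_A-z' name."""
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", model_url)
    if match is None:
        raise ValueError(f"no L-x_H-y_A-z pattern in {model_url!r}")
    layers, hidden, heads = (int(group) for group in match.groups())
    return {
        "layers": layers,
        "hidden_size": hidden,
        "attention_heads": heads,
        "head_size": hidden // heads,   # dimensions handled by each attention head
    }

bert_encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
print(parse_bert_size(bert_encoder_url))
# {'layers': 12, 'hidden_size': 768, 'attention_heads': 12, 'head_size': 64}
```

The hidden size of 768 is also the last dimension of every tensor in `encoder_outputs`.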

In [21]:
# results['encoder_outputs']

The last entry of encoder_outputs should be identical to sequence_output. Let's check that with an element-wise comparison:

In [23]:
results['encoder_outputs'][-1] == results['sequence_output']

<tf.Tensor: shape=(3, 128, 768), dtype=bool, numpy=
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  

In [24]:
# Here is the final output
results['sequence_output']

<tf.Tensor: shape=(3, 128, 768), dtype=float32, numpy=
array([[[-5.70411943e-02,  1.62799835e-01, -4.03557688e-01, ...,
         -4.44653511e-01,  2.88596094e-01,  3.04885596e-01],
        [-1.72945946e-01, -4.66228306e-01, -2.68049657e-01, ...,
         -2.24454895e-01,  7.56450593e-01, -4.46049809e-01],
        [ 3.81558180e-01, -6.49014413e-01,  3.82462382e-01, ...,
         -7.14544415e-01,  1.52210504e-01, -7.01340914e-01],
        ...,
        [-1.23435929e-01, -7.13972673e-02,  4.56849366e-01, ...,
          2.09850937e-01, -6.09567501e-02, -1.39756560e-01],
        [-4.57099080e-02, -1.21522211e-02,  3.87843579e-01, ...,
          1.79219753e-01,  3.81778851e-02, -1.89389363e-01],
        [ 2.23090753e-01, -6.91267624e-02,  3.55052412e-01, ...,
          2.13947430e-01,  1.44929647e-01, -2.09377870e-01]],

       [[ 3.26697975e-01,  4.97255296e-01,  8.14630240e-02, ...,
          6.18204474e-04,  2.98998684e-01,  2.34256759e-01],
        [ 2.42227480e-01,  4.22549285e-02,  2.52

- This tensor holds a batch of 3 sequences of 128 tokens each, and for every token position BERT produces a 768-dimensional embedding.
- This output can be used for downstream tasks such as classification, named entity recognition, or any other task that requires an understanding of the content of the sequences.
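For sentence-level tasks, the per-token embeddings usually need to be reduced to one vector per sentence. The notebook's model already returns a `pooled_output` for this purpose; the sketch below shows a different, commonly used alternative that is not part of the original notebook: averaging the token vectors while skipping padding positions via `input_mask`. It uses plain nested lists and tiny toy dimensions instead of the real `(3, 128, 768)` tensor, and the function name `masked_mean_pool` is an assumption.

```python
def masked_mean_pool(sequence_output, input_mask):
    """Average token vectors per sequence, ignoring positions where mask == 0.

    sequence_output: nested lists shaped [batch][seq_len][hidden]
    input_mask:      nested lists shaped [batch][seq_len], 1 = real token
    """
    pooled = []
    for token_vectors, mask in zip(sequence_output, input_mask):
        hidden = len(token_vectors[0])
        total = [0.0] * hidden
        count = sum(mask)                      # number of real (non-padding) tokens
        for vector, keep in zip(token_vectors, mask):
            if keep:
                for j, value in enumerate(vector):
                    total[j] += value
        # max(count, 1) avoids division by zero for an all-padding row
        pooled.append([t / max(count, 1) for t in total])
    return pooled

# Toy data: batch of 2, sequence length 3, hidden size 2.
toy_output = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third position is padding
    [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]],   # no padding
]
toy_mask = [[1, 1, 0], [1, 1, 1]]
print(masked_mean_pool(toy_output, toy_mask))  # [[2.0, 3.0], [2.0, 2.0]]
```

Skipping the padded positions matters because the model also produces embeddings for `[PAD]` tokens, and including them would pull the average toward values that carry no sentence content.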