## **BERT BASE**

Finding Embeddings

In [3]:
import tensorflow_hub as hub
# Imported for its side effect: registers the custom ops the preprocessing model needs
import tensorflow_text as text

## **BERT Preprocessing**

BERT preprocessing tokenizes the text into the tokens BERT was trained on, adds special tokens (such as [CLS] and [SEP]), and creates the attention mask.

In [4]:
preprocess_url = "https://kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3"

In [5]:
bert_preprocess_model = hub.KerasLayer(preprocess_url)

In [6]:
test = ["Please subscribe to this newsletter", "You are being offered this job role"]
text_preprocessed = bert_preprocess_model(test)
text_preprocessed.keys()

dict_keys(['input_word_ids', 'input_type_ids', 'input_mask'])

## **OUTPUT Explanation**

The three outputs, 'input_word_ids', 'input_mask', and 'input_type_ids', are formatted to meet the input requirements of BERT models.

**'input_word_ids':**

Also called token IDs. The preprocessing model first splits the text into words or subwords (subword tokenization lets BERT represent words that are not in its vocabulary as sequences of known pieces). Each token is then mapped to a unique integer ID from the vocabulary.
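
The subword splitting described above can be sketched with a greedy longest-match-first rule, which is how WordPiece tokenization works in BERT. This is a minimal illustration only: the vocabulary `TOY_VOCAB`, its IDs, and the helper names `wordpiece` and `to_ids` are made up for this example and are not the real BERT vocabulary or the preprocessing model's API.

```python
# Toy vocabulary: entries and IDs are hypothetical, NOT BERT's real vocabulary.
TOY_VOCAB = {
    "[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
    "please": 3, "sub": 4, "##scribe": 5, "news": 6, "##letter": 7,
}


def wordpiece(word, vocab):
    """Split one lowercase word into subwords, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def to_ids(sentence, vocab):
    """Tokenize a sentence, add [CLS]/[SEP], and map tokens to integer IDs."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = to_ids("Please subscribe newsletter", TOY_VOCAB)
print(tokens)
print(ids)
```

With this toy vocabulary, "subscribe" and "newsletter" are not single entries, so each is split into two known pieces instead of becoming an unknown token.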

**'input_mask':**

Also called the attention mask. It tells the model which positions of the input hold actual tokens and which hold padding.

The attention mask is binary (0 or 1):
1 - real token that should be attended to.
0 - padding.

**'input_type_ids':**

Also called segment IDs. They tell the model which part of the input belongs to sentence A and which part belongs to sentence B. With single-sentence inputs, as in this example, every position is 0.

0 - first sentence.
1 - second sentence.
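
To make the relationship between the three outputs concrete, here is a minimal pure-Python sketch that packs already-tokenized input into fixed-length lists. The helper name `pack_inputs` and the word IDs 7, 8, 9 are hypothetical; 101, 102, and 0 are used here as the [CLS], [SEP], and padding IDs. The sketch does not implement truncation and is not the preprocessing model's actual code.

```python
def pack_inputs(ids_a, ids_b=None, seq_length=8, cls_id=101, sep_id=102, pad_id=0):
    """Build input_word_ids, input_mask and input_type_ids for one example."""
    word_ids = [cls_id] + ids_a + [sep_id]
    type_ids = [0] * len(word_ids)          # sentence A (with [CLS] and [SEP])
    if ids_b is not None:
        word_ids += ids_b + [sep_id]
        type_ids += [1] * (len(ids_b) + 1)  # sentence B (with its [SEP])
    if len(word_ids) > seq_length:
        raise ValueError("input too long; truncation is not part of this sketch")
    padding = seq_length - len(word_ids)
    return {
        "input_word_ids": word_ids + [pad_id] * padding,
        "input_mask": [1] * len(word_ids) + [0] * padding,  # 1 = real, 0 = pad
        "input_type_ids": type_ids + [0] * padding,
    }


single = pack_inputs([7, 8, 9])   # one sentence
pair = pack_inputs([7, 8], [9])   # sentence A and sentence B
print(single)
print(pair)
```

For the single sentence, all type IDs are 0 and the mask marks the five real positions; for the pair, the type IDs switch to 1 for sentence B and its closing [SEP].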



In [7]:
text_preprocessed['input_mask'], text_preprocessed['input_type_ids'], text_preprocessed['input_word_ids']

(<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 <tf.Tensor: shape=(2, 128), dtype=int32, numpy=
 

## **BERT BASE Pre-Trained Model**

In [8]:
encoder_url = "https://www.kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-l-12-h-768-a-12/4"

In [9]:
bert_model = hub.KerasLayer(encoder_url)

In [10]:
bert_results = bert_model(text_preprocessed)

In [11]:
bert_results.keys()

dict_keys(['encoder_outputs', 'sequence_output', 'default', 'pooled_output'])

## **OUTPUT Explanation**

The keys 'encoder_outputs', 'sequence_output', 'pooled_output', and 'default' are the outputs the BERT model returns after processing the input text. Each key exposes a different view of the BERT layers' output, useful for different downstream NLP tasks.

1.	**'encoder_outputs':**

•	This key provides the outputs of each individual encoder (Transformer block) within the BERT model. This is useful for tasks that might benefit from accessing intermediate layers of the model, rather than just the final output.

•	In the BERT-base model, encoder_outputs contains 12 items, one per Transformer block, each contributing to the model's understanding of the input text at a different level of abstraction.

Accessing the output of each individual layer can be highly beneficial for certain NLP tasks. For example, earlier layers might be better for tasks focused on the syntactic nuances of the text, while later layers might be more effective for tasks involving complex semantic understanding.


In [12]:
len(bert_results['encoder_outputs'])

12

In [13]:
bert_results['encoder_outputs'][-1] == bert_results['sequence_output']

<tf.Tensor: shape=(2, 128, 768), dtype=bool, numpy=
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]]])>
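
The comparison above returns one boolean per element, which is hard to read for a (2, 128, 768) tensor. A short sketch of collapsing it into a single answer follows. It uses NumPy arrays filled with random numbers as stand-ins for the real tensors, so the variable `last_layer` is an assumption for illustration and no model is loaded.

```python
import numpy as np

# Random stand-ins with the same shape as the real tensors (assumption:
# the last encoder output and sequence_output hold identical values).
rng = np.random.default_rng(0)
last_layer = rng.normal(size=(2, 128, 768)).astype(np.float32)
sequence_output = last_layer.copy()

# Elementwise comparison: one boolean per value, same shape as the inputs.
elementwise = last_layer == sequence_output
print(elementwise.shape)

# Reduce to a single boolean: True only if every element matches.
# With the real TensorFlow tensors, tf.reduce_all(tf.equal(a, b)) does the same.
all_equal = bool(np.all(elementwise))
print(all_equal)
```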

2. **'sequence_output':**

The 'sequence_output' is the output from the last layer of the BERT model for each token in the input sequence. It provides a high-dimensional representation of each token in the context of the entire input sequence.

Here the sequence_output tensor has shape (2, 128, 768). Each dimension has a specific meaning:

**Batch Size (2):**

The output covers a batch of two input sequences, one per sentence in test.

**Sequence Length (128):**

The number of tokens (words or subwords) in each input sequence. Every sequence is padded or truncated to this fixed length so that inputs have a uniform size for batch processing; 128 is the default sequence length of this preprocessing model.

**Hidden Size (768):**

The size of the hidden layers in the BERT model, and therefore the dimensionality of the output vector BERT produces for each token. Each token's 768-dimensional vector encodes that token in the context of the whole input sequence.
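
The three dimensions can be explored by indexing. The sketch below uses a random NumPy array in place of the real sequence_output, and rebuilds an input_mask with 8 and 9 real tokens to match the mask printed earlier; the variable names `cls_vectors`, `real_token_counts`, and `first_sentence_tokens` are hypothetical.

```python
import numpy as np

batch_size, seq_length, hidden_size = 2, 128, 768

# Random stand-in for bert_results['sequence_output'] (no model is loaded).
rng = np.random.default_rng(0)
sequence_output = rng.normal(
    size=(batch_size, seq_length, hidden_size)
).astype(np.float32)

# Stand-in for input_mask: 8 real tokens in sentence 0, 9 in sentence 1.
input_mask = np.zeros((batch_size, seq_length), dtype=np.int32)
input_mask[0, :8] = 1
input_mask[1, :9] = 1

# Axis 0 = sentence, axis 1 = token position, axis 2 = hidden dimension.
cls_vectors = sequence_output[:, 0, :]       # vector at position 0 ([CLS])
real_token_counts = input_mask.sum(axis=1)   # non-padding tokens per sentence
first_sentence_tokens = sequence_output[0, : real_token_counts[0], :]

print(cls_vectors.shape)
print(real_token_counts)
print(first_sentence_tokens.shape)
```

Slicing with the mask count keeps only the vectors for real tokens and drops the positions that exist purely because of padding.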

In [15]:
bert_results['sequence_output']

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[ 0.22257687, -0.10579046,  0.00254981, ..., -0.24145333,
         -0.05483516,  0.30309677],
        [ 0.29199606, -0.08624656, -0.50955796, ...,  0.0044163 ,
          0.2859007 , -0.6242738 ],
        [-0.25697884, -0.5023774 , -0.2705382 , ..., -0.1796462 ,
         -0.32368812, -0.30876014],
        ...,
        [ 0.39664716,  0.07789326,  0.5355841 , ..., -0.04175995,
         -0.43019056,  0.09946573],
        [ 0.2147437 , -0.07457222,  0.5009455 , ...,  0.03458769,
         -0.32633275, -0.07769995],
        [ 0.3138128 , -0.03680493,  0.61024684, ..., -0.01120434,
         -0.33619204, -0.08630194]],

       [[ 0.16681583,  0.1201781 , -0.1994772 , ..., -0.1812833 ,
          0.22992885,  0.2749213 ],
        [-0.02964864, -0.14141597, -0.1265569 , ...,  0.0872253 ,
          0.9378395 , -0.72198415],
        [ 0.5813221 ,  0.10083358,  0.00974214, ..., -0.30160144,
          0.49794245,  0.25141135],
        ...,

3. **'pooled_output':**

The 'pooled_output' is a fixed-length vector representing the entire input sequence. It is derived from the final hidden state of the first token (the special [CLS] token), which is typically used as the "aggregate representation" for classification tasks. That hidden state is passed through an additional dense layer with a tanh activation to produce the pooled output. This output is useful for classification tasks where the whole input sequence must be represented as a single fixed-size vector.
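
The pooling step described above (take the [CLS] hidden state, apply a dense layer, then tanh) can be sketched in NumPy. This is an illustration under stated assumptions: `pooler_kernel` and `pooler_bias` are random placeholders, not BERT's trained pooler weights, and sequence_output is a random stand-in, so the numbers will not match the model's real output.

```python
import numpy as np

hidden_size = 768
rng = np.random.default_rng(0)

# Random stand-in for bert_results['sequence_output'].
sequence_output = rng.normal(size=(2, 128, hidden_size)).astype(np.float32)

# Hypothetical dense-layer parameters (placeholders for the trained pooler).
pooler_kernel = rng.normal(
    scale=0.02, size=(hidden_size, hidden_size)
).astype(np.float32)
pooler_bias = np.zeros(hidden_size, dtype=np.float32)

# 1. Take the final hidden state of the first token ([CLS]) per sentence.
cls_hidden = sequence_output[:, 0, :]

# 2. Dense layer followed by tanh gives one fixed-size vector per sentence.
pooled_output = np.tanh(cls_hidden @ pooler_kernel + pooler_bias)

print(pooled_output.shape)
```

Because of the tanh, every value of the pooled output lies between -1 and 1, which is consistent with the values printed for the real model.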

In [16]:
bert_results['pooled_output']

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.91949916, -0.45129243, -0.774078  , ..., -0.36198598,
        -0.7211742 ,  0.880642  ],
       [-0.8155011 , -0.2725284 ,  0.0147709 , ...,  0.1272465 ,
        -0.56835717,  0.86130506]], dtype=float32)>

4. **'default':**

This key is an alias for the output that should be used for most tasks. In many BERT implementations on TensorFlow Hub, 'default' points to 'pooled_output', since that is the most commonly used output for classification tasks. Depending on the specific implementation or model variant, it may point to a different output.