BERT's novelty lies in its built-in attention mechanism. Attention builds on several components working
together, so it helps to start with seq2seq. Seq2seq is a model that takes a sequence of words as input and
produces another sequence of words as output. It uses an encoder-decoder architecture in which the information
from the input is captured in a “context vector”: the encoder compresses the input sequence into this fixed-size
representation. The encoder and decoder are typically recurrent neural networks (RNNs), because an RNN carries a
hidden state from one step to the next and so retains information about earlier words; in newer NLP models the
RNNs tend to be bidirectional too, which allows the model to capture context from both directions. The most
common context vector sizes are 256, 512, and 1024.

At each step, the encoder takes one tokenized word together with the current hidden state. It processes the two
inputs and produces an output and a new hidden state, and that hidden state is fed back into the encoder along
with the next word. In this way the encoder builds a representation of the input, and the decoder unrolls that
representation into an output sequence.
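
The encoder loop described above can be sketched with NumPy. This is only a minimal illustration, assuming a plain tanh RNN cell with random, untrained weights; the function name `rnn_encode`, the weight names and the toy sizes are hypothetical and not taken from any library.

```python
import numpy as np

def rnn_encode(embeddings, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a sequence of token embeddings.

    Returns every hidden state plus the final one (the "context vector").
    """
    hidden = np.zeros(W_hh.shape[0])            # initial hidden state
    states = []
    for x_t in embeddings:                      # one token per step
        # the new hidden state depends on the current token and the previous state
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states), hidden

rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 8, 16, 5       # toy sizes (assumed)
embeddings = rng.normal(size=(seq_len, embed_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

all_states, context_vector = rnn_encode(embeddings, W_xh, W_hh, b_h)
print(all_states.shape, context_vector.shape)   # (5, 16) (16,)
```

A plain seq2seq decoder would receive only `context_vector`; the per-step states in `all_states` are what an attention mechanism uses in addition.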

To make the model “attentive,” this encoder-decoder is modified so the model can pay attention to important parts of
the input sequence. The main difference is that the encoder passes all of its hidden states to the decoder, not just
the last one.

During decoding, each encoder hidden state is associated with a specific word in the input sequence. Each
hidden state gets a “score”, the scores are normalized by a softmax function, and each hidden state is multiplied
by its normalized score. This amplifies the hidden states with high scores and shrinks the ones with low scores;
the weighted hidden states are then summed into the context vector for that decoding step. BERT learns to use
attention by masking a word of the text and inferring it from the surrounding words, with the scores determining
how much each of those words contributes.
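
The scoring step can be illustrated with a few lines of NumPy. This sketch assumes dot-product scoring between a decoder state and each encoder hidden state, which is one of several possible scoring functions; the helper name `attend` and the toy vectors are made up for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))                   # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Score each encoder hidden state and return the weighted sum."""
    scores = encoder_states @ decoder_state     # one score per input position
    weights = softmax(scores)                   # normalized scores, summing to 1
    context = weights @ encoder_states          # weighted sum of the hidden states
    return weights, context

# three toy encoder hidden states (one per input word) and one decoder state
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])

weights, context = attend(decoder_state, encoder_states)
print(weights)   # the first and third states receive the larger weights
print(context)
```

States that align with the decoder state dominate the weighted sum, while the others contribute little.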

In [41]:
import os

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import bert
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights
from bert.tokenization.bert_tokenization import FullTokenizer


https://www.kaggle.com/praveengovi/classify-emotions-in-text-with-bert

The dataset contains 6 different emotions: ['sadness', 'anger', 'love', 'surprise', 'fear', 'joy'].

The dataset is already preprocessed, but the emotions (labels) must be converted to numerical values
since this is a classification task. pandas is used to load the data and map each emotion
to an integer: {'joy':0,'sadness':1,'anger':2,'fear':3,'love':4,'surprise':5}.
Once the labels are mapped, the column must be cast to int32 because this is the
encoding that the network uses.

In [42]:
train_df = pd.read_csv('D:/bert-fine-tune-tf2/data/pre-processed-data/train.txt', header=None, sep=';')
test_df = pd.read_csv('D:/bert-fine-tune-tf2/data/pre-processed-data/test.txt', header=None, sep=';')

train_df.columns=['sentence', 'emotion']
test_df.columns=['sentence','emotion']

train_df=train_df.replace({'emotion': {'joy':0,'sadness':1,'anger':2,'fear':3,'love':4,'surprise':5}})
test_df=test_df.replace({'emotion': {'joy':0,'sadness':1,'anger':2,'fear':3,'love':4,'surprise':5}})

train_df["emotion"]=train_df['emotion'].astype('int32')
test_df["emotion"]=test_df['emotion'].astype('int32')

The data is split 90/10 between training and test, which gives two datasets
of shape 18,000 by 2 (train) and 2,000 by 2 (test).

In [43]:
train_df.to_csv('D:/bert-fine-tune-tf2/data/processed-data/train.csv', sep=',',index=False, header=False)
test_df.to_csv('D:/bert-fine-tune-tf2/data/processed-data/test.csv', sep=',', index=False, header=False)

Hugging Face offers pre-trained models for the English language, with different architectures
to choose from. Each transformer language model has its own tokenizer. Tokenizing is the process of mapping
words (or pieces of words) to numerical values. BERT uses an attention model to understand context, and it
learns by masking a word and inferring it. The dataset must be tokenized with the tokenizer that
corresponds to the model being used.
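
To make the word-to-number mapping concrete, here is a small sketch of greedy longest-match-first sub-word splitting in the style of WordPiece. The vocabulary `toy_vocab` and its IDs are invented for illustration and are far smaller than the real vocab.txt, and `wordpiece_tokenize` is a hypothetical helper, not the `FullTokenizer` implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate    # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                            # try a shorter piece
        if piece is None:
            return [unk_token]                  # no piece matched: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# toy vocabulary (assumed); real vocabularies hold tens of thousands of entries
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "feel": 4, "##ing": 5, "happy": 6, "un": 7, "##happy": 8}

tokens = []
for word in "feeling unhappy".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
ids = [toy_vocab[t] for t in tokens]

print(tokens)   # ['feel', '##ing', 'un', '##happy']
print(ids)      # [4, 5, 7, 8]
```

Because the IDs only have meaning relative to one vocabulary, text tokenized for one model cannot be fed to a model trained with a different vocabulary.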

In [44]:
tokenizer= FullTokenizer(vocab_file='D:/bert-fine-tune-tf2/architecture-uncased_L-24_H-1024_A-16/vocab.txt',do_lower_case=True)


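Each transformer model folder comes with 3 files: the vocabulary (vocab.txt) used by the tokenizer above,
the model configuration (bert_config.json), and the checkpoint (bert_model.ckpt).

The configuration describes the architecture used to train this model, which must be loaded in order to
fine-tune on a particular dataset. Its contents are the following: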

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
The second file is the checkpoint, which saves all the parameters of the model.
The checkpoint is different from the configuration: the configuration describes the architecture, while the
checkpoint holds the weight tensors learned during training.
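
As a rough illustration of how the configuration values relate to the architecture, the sketch below parses a subset of the fields listed above with the standard `json` module and derives two quantities from them. The configuration is embedded as a string so the example does not depend on any file path; the variable names are only for this example.

```python
import json

# a subset of the fields from the configuration shown above
config_text = """
{
  "hidden_size": 1024,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "vocab_size": 30522
}
"""
config = json.loads(config_text)

# each attention head works on an equal slice of the hidden vector
head_dim = config["hidden_size"] // config["num_attention_heads"]

# the token embedding table has one hidden-size vector per vocabulary entry
token_embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim)                 # 64
print(token_embedding_params)   # 31254528
```

The checkpoint has to contain tensors with exactly these shapes, which is why the configuration and the checkpoint must come from the same model folder.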

In [45]:
bert_model='D:/bert-fine-tune-tf2/architecture-uncased_L-24_H-1024_A-16/uncased_L-24_H-1024_A-16'
bert_ckpt_file = 'D:/bert-fine-tune-tf2/architecture-uncased_L-24_H-1024_A-16/bert_model.ckpt'
bert_config_file = 'D:/bert-fine-tune-tf2/architecture-uncased_L-24_H-1024_A-16/bert_config.json'

Using transformers
The Transformer is an architecture built from a stack of encoders and a stack of decoders. Because it
processes all the tokens of a sequence in parallel rather than one at a time, it speeds up training.
The encoder and decoder stacks each have six layers in the published model, but there can be more than 6
(the BERT configuration above uses 24). The input flows upward through all the encoders in the stack, and
each encoder has 2 sub-layers inside: a self-attention component and a feed-forward neural network.
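
The two sub-layers can be sketched in NumPy as follows. This is a deliberately simplified version: it assumes a single attention head, ReLU instead of GELU, residual connections but no layer normalization, and random untrained weights. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention over all positions at once."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (seq_len, seq_len) score matrix
    return softmax(scores) @ v

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network (ReLU here for simplicity)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p):
    x = x + self_attention(x, p["W_q"], p["W_k"], p["W_v"])     # sub-layer 1 + residual
    x = x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]) # sub-layer 2 + residual
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_layers = 4, 8, 32, 6   # toy sizes (assumed)

def init_layer():
    return {
        "W_q": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W_k": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W_v": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "W2": rng.normal(scale=0.1, size=(d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

encoder_stack = [init_layer() for _ in range(num_layers)]

out = rng.normal(size=(seq_len, d_model))       # embedded input sequence
for params in encoder_stack:                    # the input flows up through the stack
    out = encoder_layer(out, params)
print(out.shape)                                # (4, 8)
```

Every layer keeps the shape `(seq_len, d_model)`, which is what allows the layers to be stacked to any depth.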

The following code takes two datasets, train and test, and tokenizes the samples. It also adds the "[CLS]" and
"[SEP]" tokens, which the model requires to mark the beginning and end of each phrase. The input length is limited
to 192 tokens by default; if the longest tokenized sample is shorter than the limit, that length is used instead.
Since all the samples must be the same length, shorter samples are padded with zeroes until they reach it.
The returned arrays contain only integers (token IDs and class indices).

In [46]:
class IntentDetectionData:
    DATA_COLUMN = "sentence"
    LABEL_COLUMN = "emotion"

    def __init__(self, train_df, test_df, tokenizer: FullTokenizer, classes, max_seq_len=192):
        self.tokenizer = tokenizer
        self.max_seq_len = 0  # updated in _prepare to the longest tokenized sample
        self.classes = classes

        # tokenize both splits; this also records the longest sequence seen
        ((self.train_x, self.train_y), (self.test_x, self.test_y)) = \
            map(self._prepare, [train_df, test_df])
        # never exceed the requested limit
        self.max_seq_len = min(self.max_seq_len, max_seq_len)
        self.train_x, self.test_x = map(self._pad, [self.train_x, self.test_x])

    def _prepare(self, df):
        x, y = [], []

        for _, row in df.iterrows():
            text, label = row[IntentDetectionData.DATA_COLUMN], row[IntentDetectionData.LABEL_COLUMN]
            tokens = self.tokenizer.tokenize(text)
            tokens = ["[CLS]"] + tokens + ["[SEP]"]
            token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
            self.max_seq_len = max(self.max_seq_len, len(token_ids))
            x.append(token_ids)
            y.append(self.classes.index(label))
        # x is ragged (one list per sample), so keep it as a list until it is padded
        return x, np.array(y)

    def _pad(self, ids):
        x = []
        for input_ids in ids:
            # truncate long samples, then right-pad with zeros up to max_seq_len
            input_ids = input_ids[:min(len(input_ids), self.max_seq_len - 2)]
            input_ids = input_ids + [0] * (self.max_seq_len - len(input_ids))
            x.append(np.array(input_ids))
        return np.array(x)
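
A worked example of the padding step, independent of the class above, may help. The helper `pad_ids` is a simplified stand-in for `_pad` (it truncates at `max_seq_len` itself), and the token IDs are illustrative only.

```python
def pad_ids(token_ids, max_seq_len, pad_id=0):
    """Truncate to max_seq_len, then right-pad with pad_id."""
    token_ids = token_ids[:max_seq_len]
    return token_ids + [pad_id] * (max_seq_len - len(token_ids))

# illustrative IDs: 101 and 102 stand in for [CLS] and [SEP]
batch = [
    [101, 7, 8, 102],                  # shorter than the limit: gets padded
    [101, 7, 8, 9, 10, 11, 102],       # longer than the limit: gets truncated
]
padded = [pad_ids(sample, max_seq_len=6) for sample in batch]

print(padded)   # [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 11]]
```

After this step every sample has the same length, so the batch can be stacked into a single rectangular array.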
    

The TensorFlow 2 implementation was taken from this repo:

https://github.com/kpe/bert-for-tf2

In [47]:
def create_model(max_seq_len, bert_ckpt_file):
  # note: bert_config_file and classes are read from the notebook's global scope

  # build the BERT layer from the stock configuration file
  with tf.io.gfile.GFile(bert_config_file, "r") as reader:
    bc = StockBertConfig.from_json_string(reader.read())
    bert_params = map_stock_config_to_params(bc)
    bert_params.adapter_size = None
    bert = BertModelLayer.from_params(bert_params, name="bert")

  input_ids = keras.layers.Input(shape=(max_seq_len, ), dtype='int32', name="input_ids")
  bert_output = bert(input_ids)

  print("bert shape", bert_output.shape)

  # keep only the output at position 0, i.e. the [CLS] token
  cls_out = keras.layers.Lambda(lambda seq: seq[:, 0, :])(bert_output)
  cls_out = keras.layers.Dropout(0.5)(cls_out)
  logits = keras.layers.Dense(units=768, activation="tanh")(cls_out)
  logits = keras.layers.Dropout(0.5)(logits)
  # the final layer applies softmax, so the model outputs class probabilities
  logits = keras.layers.Dense(units=len(classes), activation="softmax")(logits)

  model = keras.Model(inputs=input_ids, outputs=logits)
  model.build(input_shape=(None, max_seq_len))

  # load the pre-trained weights into the BERT layer
  load_stock_weights(bert, bert_ckpt_file)

  return model
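
The `seq[:, 0, :]` slice used in the classification head can be checked on a small NumPy array. The shapes here are toy values chosen for readability, not the real BERT output shape.

```python
import numpy as np

batch, seq_len, hidden = 2, 5, 4                # toy sizes (assumed)
# stand-in for the BERT output: one hidden vector per token of each sample
sequence_output = np.arange(batch * seq_len * hidden, dtype=float).reshape(batch, seq_len, hidden)

# keep only the vector at position 0 of every sample, where [CLS] sits
cls_vectors = sequence_output[:, 0, :]

print(cls_vectors.shape)   # (2, 4)
print(cls_vectors)         # rows [0, 1, 2, 3] and [20, 21, 22, 23]
```

The sequence dimension disappears, leaving one fixed-size vector per sample for the dense layers to classify.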

In [48]:
classes=train_df.emotion.unique().tolist()
data = IntentDetectionData(train_df, test_df, tokenizer, classes, max_seq_len=128)



In [49]:
model = create_model(data.max_seq_len, bert_ckpt_file)

bert shape (None, 87, 1024)
Done loading 388 BERT weights from: D:/bert-fine-tune-tf2/architecture-uncased_L-24_H-1024_A-16/bert_model.ckpt into <bert.model.BertModelLayer object at 0x0000024A872A3B88> (prefix:bert). Count of weights not found in the checkpoint was: [0]. Count of weights with mismatched shape: [0]
Unused weights from checkpoint: 
	bert/embeddings/token_type_embeddings
	bert/pooler/dense/bias
	bert/pooler/dense/kernel
	cls/predictions/output_bias
	cls/predictions/transform/LayerNorm/beta
	cls/predictions/transform/LayerNorm/gamma
	cls/predictions/transform/dense/bias
	cls/predictions/transform/dense/kernel
	cls/seq_relationship/output_bias
	cls/seq_relationship/output_weights


In [50]:
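model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    # the model's last layer already applies softmax, so the loss receives probabilities, not logits
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)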
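
For intuition about the loss and metric chosen here, the following NumPy sketch computes sparse categorical cross-entropy and accuracy from integer labels and predicted class probabilities. It is a hand-written approximation for illustration, not the Keras implementation, and the toy predictions are invented.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)                     # avoid log(0)
    picked = probs[np.arange(len(labels)), labels]       # probability of each true class
    return float(np.mean(-np.log(picked)))

def sparse_categorical_accuracy(labels, probs):
    """Fraction of samples whose most probable class is the true class."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

labels = np.array([0, 2])                                # integer class indices
probs = np.array([[0.7, 0.2, 0.1],                       # each row sums to 1
                  [0.1, 0.3, 0.6]])

loss = sparse_categorical_crossentropy(labels, probs)
acc = sparse_categorical_accuracy(labels, probs)
print(loss, acc)
```

The "sparse" variants take the labels as plain integers, which matches the int32 emotion column prepared earlier, so no one-hot encoding is needed.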


