# Macquarie University

- COMP6420 - AI for Text and Vision 
- Assignment 3, Part 2 - Find complex answers to medical questions
- 47828013 - Mason Phung

In [7]:
# Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
import keras_tuner as kt
import keras_nlp
from nltk.tokenize import sent_tokenize, word_tokenize
from keras import ops
from keras.models import Model, Sequential
from keras.layers import Input, Flatten, Dense, Dropout, Lambda, Layer, Embedding, MultiHeadAttention, LayerNormalization, GlobalAveragePooling1D
from keras.callbacks import EarlyStopping
from keras.preprocessing import sequence
from sklearn.feature_extraction.text import TfidfVectorizer


## Objectives of this assignment

In assignment 3 you will work on a general answer selection task. Given a question and a list of sentences, the final goal is to predict which of these sentences from the list can be used as part of the answer to the question. Assignment 3 is divided into two parts. Part 1 will help you get familiar with the data, and Part 2 requires you to implement deep neural networks.

The data is in the file `train.csv`, which is provided in both GitHub repository and in iLearn. Each row of the file consists of a question ('qtext' column), an answer ('atext' column), and a label ('label' column) that indicates whether the  answer is correctly related to the question (1) or not (0).

The following code uses pandas to store the file `train.csv` in a data frame and shows the first few rows of data.

In [8]:
# Import data; the original 'dataset' name is changed to 'train_data'
# The archive is unzipped first, then each CSV file is read directly (as in the example), not from the zip.
test_data = pd.read_csv("test.csv")
val_data = pd.read_csv("val.csv")
train_data = pd.read_csv("train.csv")
train_data.head()

Unnamed: 0,qtext,label,atext
0,What are the symptoms of gastritis?,1,"However, the most common symptoms include: Nau..."
1,What are the symptoms of gastritis?,0,var s_context; s_context= s_context || {}; s_c...
2,What are the symptoms of gastritis?,0,"!s_sensitive, chron ID: $('article embeded_mod..."
3,What does the treatment for gastritis involve?,1,Treatment for gastritis usually involves: Taki...
4,What does the treatment for gastritis involve?,1,Eliminating irritating foods from your diet su...


In [9]:
# check data balance
train_data.groupby('label').count()

label,qtext,atext
0,5039,5039
1,4341,4341
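
A quick way to read these counts is to turn them into class proportions and a majority-class baseline. The sketch below is my own addition: it copies the two counts from the output above and uses plain Python only. The baseline describes the training split; the val/test splits may have different proportions.

```python
# Label counts copied from the groupby output above
label_counts = {0: 5039, 1: 4341}

total = sum(label_counts.values())
proportions = {label: count / total for label, count in label_counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent label
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = proportions[majority_label]

print(f"Total rows: {total}")
print(f"Class proportions: {proportions}")
print(f"Majority-class baseline (label {majority_label}): {majority_baseline:.4f}")
```

The classes are roughly balanced, so accuracy is a reasonable metric here, and any model should be compared against this baseline rather than against 0.5.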


Note: the left-most index is not part of the data; the notebook adds it automatically for easy reading. You can also browse the data using Microsoft Excel or similar software.

# Now let's get started.

Use the provided files `train.csv`, `val.csv`, and `test.csv` in the data.zip file for all the tasks below.

# Task 1 (8 marks): Simple Siamese NN - Contrastive Loss

✅ = check as done

Implement a simple TensorFlow-Keras neural model that meets the following requirements:

1. ✅(0.5 marks) An input layer that will accept the tf.idf of paired data. The input of the Siamese network is a pair of data, i.e., (qtext, atext).
2. ✅(2 marks) Use two hidden layers and a ReLU activation function. You need to determine the size of the hidden layers in {64, 128, 256} using val data, assuming these two layers use the same hidden size.
3. ✅(0.5 marks) Use Euclidean-distance-based contrastive loss to train the model.
4. ✅(0.5 marks) Use Sigmoid function for classification.
5. ✅(1 mark) Calculate prediction accuracy.
6. ✅(1.5 marks) Give an example of a failure case, explain the possible reason, and discuss a potential solution.
7. (1 mark) Good coding style as explained in the above Assessment Section.
8. (1 mark) Correctly feeding data into your model, and correctly training and testing of your models.

Use the test data to report the final accuracy of your best model.

**Let's create utility functions**

We use these to compute the pair distance and the training loss for the model

In [10]:
def euclidean_distance(vectors):
    """
    Calculate the euclidean distance between two vectors.
    
    Parameters:
    - vectors: A list containing two input tensors (vectors) to compute the distance between.
    
    Returns:
        A tensor representing the Euclidean distance between the two input vectors.
    """
    # compute the sum of squared distances between the vectors
    sum_squared = K.sum(K.square(vectors[0] - vectors[1]), axis=1, keepdims=True)
    # return the euclidean distance between the vectors
    return K.sqrt(K.maximum(sum_squared, K.epsilon()))

def contrastive_loss(y_true, y_pred):
    """
    Calculate the contrastive loss.

    Adapted from the week 10 practical class; the only change is that the
    margin is fixed to 1 inside the function.

    Parameters:
    - y_true: Tensor of labels, each label is of type float32.
    - y_pred: Tensor of predictions of the same length as y_true,
                each prediction is of type float32.

    Returns:
        A tensor containing the contrastive loss as a floating point value.
    """
    # Directly set margin=1
    margin = 1
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    return K.mean((1 - y_true) * 0.5 * square_pred + y_true * 0.5 * margin_square)
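
To make the two utility functions concrete, here is a small plain-Python sketch of the same formulas on toy numbers. The names `euclidean_distance_py` and `contrastive_loss_py` are my own, and `eps=1e-7` assumes the default Keras epsilon. It is only an illustration, not the code used for training.

```python
import math

def euclidean_distance_py(u, v, eps=1e-7):
    """Euclidean distance between two equal-length lists, floored at sqrt(eps)."""
    sum_squared = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(max(sum_squared, eps))

def contrastive_loss_py(y_true, y_pred, margin=1.0):
    """Mean contrastive loss over a batch, following the formula used above."""
    losses = []
    for t, p in zip(y_true, y_pred):
        square_pred = p ** 2
        margin_square = max(margin - p, 0.0) ** 2
        losses.append((1 - t) * 0.5 * square_pred + t * 0.5 * margin_square)
    return sum(losses) / len(losses)

# Distance along a 3-4-5 triangle
print(euclidean_distance_py([0.0, 3.0], [4.0, 0.0]))

# Label 0: a small prediction gives a small loss
print(contrastive_loss_py([0], [0.2]))
# Label 1: the same small prediction is penalized much more
print(contrastive_loss_py([1], [0.2]))
# Label 1: a prediction close to the margin is barely penalized
print(contrastive_loss_py([1], [0.9]))
```

With this formula, label 0 pulls the prediction toward 0 and label 1 pushes it toward the margin of 1, which matches thresholding the sigmoid output at 0.5 later on.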

### Data pre-processing

Vectorize the text data with TF-IDF: fit the vectorizer once, then reuse it to transform every other column

In [None]:
# Use TF-IDF to vectorize the text data
vectorizer = TfidfVectorizer(stop_words='english')

# The text data is split into two columns: qtext and atext
# The vocabulary is fitted on the training questions (fit_transform);
# all other columns and sets are only transformed with that same vocabulary
qtrain_tfidf = vectorizer.fit_transform(train_data['qtext']).toarray()
atrain_tfidf = vectorizer.transform(train_data['atext']).toarray()
qval_tfidf = vectorizer.transform(val_data['qtext']).toarray()
aval_tfidf = vectorizer.transform(val_data['atext']).toarray()
qtest_tfidf = vectorizer.transform(test_data['qtext']).toarray()
atest_tfidf = vectorizer.transform(test_data['atext']).toarray()

# Convert labels to arrays
train_label = np.array(train_data['label'])
val_label = np.array(val_data['label'])
test_label = np.array(test_data['label'])
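
One thing to keep in mind with this setup is that a vocabulary learned from the questions alone cannot represent words that appear only in the answers. The sketch below illustrates the effect with plain Python sets. It uses simple whitespace tokenization (not the tokenizer inside `TfidfVectorizer`) and two made-up rows, so it is an approximation of the idea rather than of the exact sklearn behaviour.

```python
def build_vocabulary(texts):
    """Collect the set of lowercase whitespace-separated tokens in `texts`."""
    return {token for text in texts for token in text.lower().split()}

def out_of_vocabulary(texts, vocabulary):
    """Return the tokens in `texts` that are missing from `vocabulary`."""
    return build_vocabulary(texts) - vocabulary

# Toy rows (hypothetical, not taken from train.csv)
questions = ["what are the symptoms of gastritis"]
answers = ["common symptoms include nausea and vomiting"]

question_vocab = build_vocabulary(questions)
combined_vocab = build_vocabulary(questions + answers)

# Answer words that would be dropped with a question-only vocabulary
print(sorted(out_of_vocabulary(answers, question_vocab)))
# Answer words dropped when the vocabulary covers both columns
print(sorted(out_of_vocabulary(answers, combined_vocab)))
```

In the notebook, one possible variation (an assumption on my side, not what was run for the results reported here) would be to fit the vectorizer on `pd.concat([train_data['qtext'], train_data['atext']])` so that answer-only terms are kept.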

### Build the base model

Keep this building step separate from the other parts in case we need to build a different base model

In [12]:
def base_model(hidden_size):
    """
    Build a base neural network model for processing input vectors.
    
    Parameters:
    - hidden_size: An integer specifying the number of units in the hidden layers.
    
    Returns:
        A Keras model consisting of two hidden layers with a ReLU activation.
    """
    model = Sequential()
    # Two hidden layers and a relu activation function
    model.add(Dense(units = hidden_size, activation='relu', name = "relu_layer"))
    model.add(Dense(units = hidden_size, activation=None, name = "dense_layer"))
    
    return model

### Build a siamese model including:
- The base model
- Contrastive loss
- Hidden layer size tuning {64, 128, 256}

In [13]:
def siam_model(hp):
    """
    Construct a Siamese neural network model.
    Hidden size is ready to be hp tuned.
    
    Parameters:
    - hp(HyperParameters object): For tuning model configurations.
    
    Returns:
        A compiled Keras Model for Siamese architecture that takes two input vectors and outputs a similarity score.
    """
    # Build the base model with hidden size hp tuning
    hidden_size = hp.Choice(name = "hidden_size", values = [64,128,256])
    network = base_model(hidden_size)
    
    # Two input layers, one for each text in the pair
    input_qtext = Input(shape=(qtrain_tfidf.shape[1],))
    input_atext = Input(shape=(atrain_tfidf.shape[1],))
    # Pass both inputs through the same network so that the weights are shared
    processed_qtext = network(input_qtext)
    processed_atext = network(input_atext)

    # Create a Lambda layer that calculates the euclidean distance between the two processed inputs
    distance = Lambda(euclidean_distance, output_shape=(1,))([processed_qtext, processed_atext])
    
    # Create a classification layer with `sigmoid` function
    prediction = Dense(1, activation='sigmoid', name= 'class_layer')(distance)

    # Build a complete model w. the inputs and the prediction layers
    model = Model([input_qtext, input_atext], prediction)

    model.compile(optimizer="adam", loss=contrastive_loss, metrics = ["accuracy"])
    return model

### Tune the hidden size with keras tuner

In [14]:
# Set up the tuner class to search
tuner = kt.BayesianOptimization(
    hypermodel = siam_model,
    objective = kt.Objective('val_accuracy', 'max'),
    max_trials = 5,
    num_initial_points = 2,
    overwrite = True
)

# Take a look at the search space
tuner.search_space_summary()

Search space summary
Default search space size: 1
hidden_size (Choice)
{'default': 64, 'conditions': [], 'values': [64, 128, 256], 'ordered': True}


In [15]:
# Set early stopping settings
early_stopping = EarlyStopping(monitor = 'val_loss', min_delta = 0, patience = 3)
# Search for the best hyperparameters
tuner.search(
    [qtrain_tfidf, atrain_tfidf],
    train_label,
    validation_data = ([qval_tfidf, aval_tfidf], val_label),
    epochs = 10,
    shuffle = True,
    callbacks = [early_stopping]
    )

# Print the best hyperparameters and model (Top 1)
topN = 1
for x in range(topN):
    best_hp = tuner.get_best_hyperparameters(topN)[x]
    print(best_hp.values)
    print(tuner.get_best_models(topN)[x].summary())

Trial 5 Complete [00h 00m 03s]
val_accuracy: 0.5801600813865662

Best val_accuracy So Far: 0.5801600813865662
Total elapsed time: 00h 00m 20s
{'hidden_size': 64}


  saveable.load_own_variables(weights_store.get(inner_path))


None


### Train & test the optimal model

In [16]:
# Fit the model with the best hp
optimal_model = siam_model(best_hp)
print('Fit the model using train and val set\n')
optimal_model.fit(
    [qtrain_tfidf, atrain_tfidf], 
    train_label, 
    validation_data = ([qval_tfidf, aval_tfidf], val_label),
    callbacks = [early_stopping], 
    epochs = 10, shuffle=True,  batch_size = 32
)

# Evaluate the model with test set
print('\nEvaluate the model with the test set')
loss, acc =  optimal_model.evaluate([qtest_tfidf, atest_tfidf], test_label, batch_size = 32)

print('\nTest loss:', loss)
print('Test accuracy:', acc)

Fit the model using train and val set

Epoch 1/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.5314 - loss: 0.1240 - val_accuracy: 0.5498 - val_loss: 0.1218
Epoch 2/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5539 - loss: 0.1095 - val_accuracy: 0.5712 - val_loss: 0.1208
Epoch 3/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.7598 - loss: 0.0928 - val_accuracy: 0.5879 - val_loss: 0.1213
Epoch 4/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8555 - loss: 0.0775 - val_accuracy: 0.5501 - val_loss: 0.1242
Epoch 5/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8945 - loss: 0.0636 - val_accuracy: 0.5513 - val_loss: 0.1261

Evaluate the model with the test set
161/161 ━━━━━━━━━━━━━━━━━━━━ 0s 445us/step - accuracy: 0.5681 - loss: 0.1210

Test loss: 0.1236235573887825
Test accuracy: 0.5535

<span style="color:red">

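- The best test accuracy is 0.55. Training accuracy climbs to about 0.89 while validation accuracy stays near 0.55, so the model is overfitting.
- The two dense layers are wrapped in `sequential_1` (our base model), which is why they do not appear individually in the summary above.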

</span>

### Error Analysis

- The predictions we get are probabilities.
- This model has a single sigmoid output, so each prediction is one value between 0 and 1 that shows how "sure" the model is that the label is 1.
- We compare this value with 0.5: if it is larger, the model thinks that the label is more likely equal to 1 and we mark it as True (else False).
- As our labels are 0 and 1, we simply convert T/F to 1/0 to match the original label format.

In [17]:
# Get the predicted labels
preds = optimal_model.predict([qtest_tfidf, atest_tfidf])
# Keep the original probabilities to see "how sure" the model is when classifying
original_pred_label = preds.flatten()
# We check if each predicted value > 0.5, then return T/F to 1/0
# use flatten() to convert the array to 1D, similar to the test_label for easy comparison
pred_label = (preds > 0.5).astype(int).flatten()

# Compare each label
incorrect_index = np.where(pred_label != test_label)[0]

# Taking the 5th incorrect prediction as an example
index = incorrect_index[4] 

# Take the values at the chosen index
qtext = test_data.iloc[index]['qtext']
atext = test_data.iloc[index]['atext']
true_label = test_label[index]
predicted_label = pred_label[index]
original_predicted = original_pred_label[index]

print(f"Failure Case Example")
print("--------------------")
print(f"qtext: {qtext}")
print(f"atext: {atext}")
print(f"True Label: {true_label}")
print(f"Predicted Label: {predicted_label}")
print(f"Classification prob: {original_predicted}")

[1m161/161[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 917us/step
Failure Case Example
--------------------
qtext: What are the signs of an insulin overdose?
atext: Your doctor may call it hypoglycemia.
True Label: 0
Predicted Label: 1
Classification prob: 0.527047336101532
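
A single failure case says little about which kind of mistake dominates. As a possible extension, the sketch below counts the four outcome types for binary labels. `confusion_counts` is a hypothetical helper of mine and the toy labels are made up just to show the call.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == 1 and prediction == 1:
            counts["tp"] += 1
        elif truth == 0 and prediction == 0:
            counts["tn"] += 1
        elif truth == 0 and prediction == 1:
            counts["fp"] += 1  # same error type as the failure case above
        else:
            counts["fn"] += 1
    return counts

# Toy labels (hypothetical), only to show the call
toy_true = [0, 0, 1, 1, 0]
toy_pred = [1, 0, 1, 0, 0]
print(confusion_counts(toy_true, toy_pred))
```

In the notebook this could be called as `confusion_counts(test_label.tolist(), pred_label.tolist())` to see whether false positives like the one above are more common than false negatives.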


<span style="color:red">

- tf.idf does not capture the semantic meaning of a sentence. This matters because we need the general context of the sentence, or of the words in it, to understand its correct meaning.

- In this case:
    - The model may have caught the word `insulin` in the question.
    - The answer includes `doctor` and `hypoglycemia`, which are also related to `insulin`.
    - Therefore, the model may think that they are similar and label the pair `1`. In the run shown above the classification probability is about 0.53, only slightly above the 0.5 threshold, so the model is not very confident about its guess.

- Potential solution:
    - We can use word embeddings, as they help the model better understand the semantic relationships between the words of a sentence.

- Moreover, the model is overfitting, so we can reduce the hidden size to lower the complexity, or add dropout layers to regularize the model.

- P/S: The actual case can change between runs due to the randomness of TensorFlow training, but the idea should be similar.
</span>

# Task 2 (12 marks): Transformer

In this task, let's use Transformer to predict whether two sentences are related or not. Implement a simple Transformer neural network that meets the following requirements:

1. ✅(1 mark) Each input for this model should be a concatenation of qtext and atext. Use [SEP] to separate qtext and atext, e.g., "Can high blood pressure bring on heart failure? [SEP] Hypertensive heart disease is the No." You need to pad the input to a fixed length. How do you determine a suitable length?
2. ✅(1.5 marks) Choose a suitable tokenizer and justify your choice.
3. ✅(1 mark) An embedding layer that generates embedding vectors of the sentence text into size 128. Remember to add position embedding.
4. ✅(1 mark) One transformer encoder layer, you need to find a hidden dimension in {64, 128, 256}. Use 3 heads in MultiHeadAttention.
5. ✅(1 mark) Do we need a transformer decoder layer for this task? If yes, find a hidden dimension in {64, 128, 256} and use 3 heads in MultiHeadAttention. If no, explain why.
6. ✅(0.5 marks) 1 hidden layer with size 256 and ReLU activation function.
7. ✅(0.5 marks) 1 output layer with size 2 for binary classification to predict whether two inputs are related or not. 
8. ✅(1 mark) Choose a suitable loss to train the model
9. ✅(1 mark) Report your best accuracy on the test split.
10. ✅(1.5 marks) Give an example of a failure case, and explain the possible reason and discuss a potential solution.
11. (1 mark) Good coding style as explained in the above Assessment Section.
12. (1 mark) Correctly feeding data into your model, and correctly training and testing of your models.



### Answer some of the required questions

<span style="color:red">


- 1. The max length is the maximum number of tokens that the model takes for each input. I think a suitable length is one that covers most of the inputs. I'll use the median of the tokenized lengths to determine this value later.

- 2. We choose the BERT tokenizer because its vocabulary includes the [SEP] token, which lets us mark the boundary between the two sentences inside a single input. The tokenizer of the ‘bert-base-uncased’ model is also suitable for text classification.


- 5. We don't need a Decoder layer.
        - As we are doing a classification task, each output is simply a single class prediction.
        - A decoder layer is usually used when the output must be a sequence, as in generation tasks or multiple regressions.
        - Using a decoder layer would be excessive in our case.

- 8. Cross-entropy optimizes directly for class probabilities, focusing on correctly classifying each input pair as related or unrelated. I use `sparse_categorical_crossentropy` as we have 2 output neurons and integer labels here (instead of binary cross-entropy with a single neuron).


</span>
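
To illustrate answer 8, the sketch below shows in plain Python that a 2-neuron softmax with sparse categorical cross-entropy gives the same probability and the same loss as a single sigmoid neuron with binary cross-entropy, when the sigmoid input is the difference of the two logits. The logits are made-up numbers and all function names are my own.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def sigmoid(value):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-value))

def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one example: negative log-probability of the integer label."""
    return -math.log(probabilities[label])

def binary_crossentropy(label, probability_of_one):
    """Loss for one example with a single sigmoid output."""
    return -(label * math.log(probability_of_one)
             + (1 - label) * math.log(1.0 - probability_of_one))

logits = [0.3, 1.1]  # hypothetical scores for class 0 and class 1
probs = softmax(logits)
p_one = sigmoid(logits[1] - logits[0])

print(probs[1], p_one)
print(sparse_categorical_crossentropy(1, probs), binary_crossentropy(1, p_one))
```

So the choice between the two setups is mostly a matter of output shape: with the required output layer of size 2, the sparse categorical version is the natural fit.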

### To create the Transformer model
- I use the classes that we learned in the practical class: `TokenAndPositionEmbedding` for the embedding layer and `TransformerBlock` as the transformer encoder layer.

- Use the BERT tokenizer as explained above.

- Set the fixed parameters (`embed_dim`, `num_heads`) according to the requirements. For the parameters that I have to determine myself (`max_length`, `vocab_size`), I explain the choice either where I set them or in the answers above.

- Data preprocessing
    - Concatenate `qtext` and `atext` in each of the 3 sets
    - Call the function `data_preprocess` to convert them to lists, tokenize, then pad the sentences.
    - Convert the labels to array format

- Build the model following the requirements, and:
    - Tune the hidden dimension using `keras_tuner`. Use `BayesianOptimization` as it is suitable for our case.
    - I had to use `GlobalAveragePooling1D()(x)` as we expect a single output label per input. Without it, the `TransformerBlock` would output a sequence of vectors (one per token), which does not fit our output layer.
    - Use the optimal hidden size to fit the train/val data and evaluate with the test data.

### Embedding layer
- Source: the Keras documentation and the practical class

In [18]:
class TokenAndPositionEmbedding(Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = ops.shape(x)[-1]
        positions = ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
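
What this layer computes can be sketched without Keras: each token id is looked up in a token table, each position in a position table, and the two vectors are added. The tiny tables below are hypothetical values of mine (in the real layer they are learned weights), and `embed_sequence` is only an illustration.

```python
def embed_sequence(token_ids, token_table, position_table):
    """Add each token's vector to the vector of the position it occupies."""
    embedded = []
    for position, token_id in enumerate(token_ids):
        token_vector = token_table[token_id]
        position_vector = position_table[position]
        embedded.append([t + p for t, p in zip(token_vector, position_vector)])
    return embedded

# Hypothetical tables with embed_dim = 2: 3 tokens in the vocabulary, 2 positions
token_table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
position_table = [[0.1, 0.1], [0.2, 0.2]]

# The same token id at two different positions gets two different vectors
print(embed_sequence([1, 1], token_table, position_table))
```

This is why the position embedding matters: without it, the encoder could not tell apart the same word appearing in the question and in the answer.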

### Transformer block
- Source: the Keras documentation and the practical class
- We'll later use this transformer block as our transformer encoder layer

In [19]:
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = Sequential(
            [
                Dense(ff_dim, activation="relu"),
                Dense(embed_dim),
            ]
        )
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)
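
The core of the block is self-attention. The sketch below is a heavily simplified plain-Python version that I wrote for illustration: a single head, queries = keys = values, and no learned projections, dropout, residual connections or layer normalization (all of which `MultiHeadAttention` and the block above do have). The input vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention with queries = keys = values."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token with every token, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Weighted average of all token vectors
        mixed = [sum(w * value[i] for w, value in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3-token input
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each output vector is a mixture of all token vectors, so after the block every position carries information from the whole question/answer pair.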

### BERT tokenizer

In [20]:
# Use BERT pretrained tokenizer
tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased")

### Concatenate data

- Based on the requirements

In [46]:
# Concatenate `qtext` and `atext` together, separated by [SEP]
text_train = train_data['qtext'] + " [SEP] " + train_data['atext']
text_val = val_data['qtext'] + " [SEP] " + val_data['atext']
text_test = test_data['qtext'] + " [SEP] " + test_data['atext']

### Set parameters

These parameters are set mostly based on the requirements of the question.
- `max_length` is 50. The median tokenized length is 35, so 50 leaves some headroom above the typical input and I think that it can cover most of our sentences.
- `vocab_size` is equal to the vocabulary size of the BERT tokenizer, plus 1 for padding index 0.

We can do the steps below to check the median tokenized length of the training set

In [49]:
# Check the median by using the tokenized data length
# Convert data to list
text = text_train.tolist()
# Tokenize each sentence in the data
text = tokenizer.tokenize(text)

# Compute the length of each tokenized sentence
token_lengths = [len(sentence) for sentence in text]
med_length = np.median(token_lengths)
print("Median of tokenized sentence lengths:", med_length)

Median of tokenized sentence lengths: 35.0
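
The median only says that half of the inputs are at most 35 tokens long. To check how much a candidate `max_length` really covers, one option is to compute the fraction of inputs that fit without truncation. The sketch below uses made-up lengths and a hypothetical helper `coverage`.

```python
import statistics

def coverage(lengths, max_length):
    """Fraction of sequences that fit in `max_length` tokens without truncation."""
    return sum(1 for length in lengths if length <= max_length) / len(lengths)

# Hypothetical token lengths, only to show the calls
toy_lengths = [12, 20, 28, 35, 35, 41, 48, 60, 95, 130]

print("median:", statistics.median(toy_lengths))
for candidate in (35, 50, 100):
    print(candidate, coverage(toy_lengths, candidate))
```

In the notebook the same call would be `coverage(token_lengths, 50)`, which gives a direct number to justify (or revise) the chosen length.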


In [21]:
max_length = 50
embed_dim = 128
num_heads = 3
vocab_size = tokenizer.vocabulary_size()+1 # +1 for padding index[0]

### Data pre-processing

- The `qtext` and `atext` columns of the 3 sets were concatenated above
- Call the function `data_preprocess` to convert them to lists, tokenize, then pad the sentences.
- Convert the labels to array format

In [44]:
def data_preprocess(text, tokenizer, maxlen):
    """
    Preprocess the data by converting the series into a list and tokenizing it with the predefined tokenizer.
    Then pad each sentence in the data to the max length allowed.
    
    Parameters:
    - text (pd.Series): text data to be preprocessed
    - tokenizer (Tokenizer): A tokenizer instance used to convert text into sequences of token IDs.
    - maxlen (int): The maximum length for padding each sequence. Sequences shorter than this length are padded,
                    and longer ones are truncated.

    Returns:
    - numpy.ndarray: A 2D array where each row is a padded token sequence of a fixed length `maxlen`.
    """
    # Convert data to list
    text = text.tolist()
    # Tokenize each sentence in the data
    text = tokenizer.tokenize(text)
    
    # Pad the sentences (by default, pad_sequences pads and truncates at the start of each sequence)
    text = sequence.pad_sequences(text, maxlen = maxlen)
    return text
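
`pad_sequences` is called with its defaults here, which are `padding="pre"` and `truncating="pre"`. The plain-Python sketch below mimics those two options on short id lists so the effect is easy to see. `pad_or_truncate` is a hypothetical helper of mine, not the Keras implementation.

```python
def pad_or_truncate(token_ids, maxlen, padding="pre", truncating="pre", value=0):
    """Return a list of exactly `maxlen` ids (maxlen > 0), mimicking the two options."""
    if len(token_ids) > maxlen:
        # "pre" drops ids from the start, "post" drops ids from the end
        token_ids = token_ids[-maxlen:] if truncating == "pre" else token_ids[:maxlen]
    filler = [value] * (maxlen - len(token_ids))
    # "pre" puts the padding before the ids, "post" puts it after
    return filler + token_ids if padding == "pre" else token_ids + filler

short_ids = [7, 8, 9]
long_ids = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_ids, 5))                     # [0, 0, 7, 8, 9]
print(pad_or_truncate(long_ids, 4))                      # [3, 4, 5, 6]
print(pad_or_truncate(long_ids, 4, truncating="post"))   # [1, 2, 3, 4]
print(pad_or_truncate(short_ids, 5, padding="post"))     # [7, 8, 9, 0, 0]
```

Because `qtext` comes first in the concatenated input, truncating at the start removes question tokens first whenever an input is longer than `max_length`. Passing `truncating="post"` to `pad_sequences` would keep the question and cut the end of the answer instead; which one works better here is something I have not tested.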

In [45]:
# Preprocess data with our defined function
text_train = data_preprocess(text_train, tokenizer, max_length)
text_val = data_preprocess(text_val, tokenizer, max_length)
text_test = data_preprocess(text_test, tokenizer, max_length)

# Process labels
label_train = np.array(train_data['label'].tolist())
label_val = np.array(val_data['label'].tolist())
label_test = np.array(test_data['label'].tolist())

### Create model
- Simply create each layer then add them to the model one by one

In [24]:
def transformer_model(hp):
    """
    Builds and compiles a Transformer classification model.
    Hidden size is ready to be hp tuned.

    Parameters:
    - hp(HyperParameters object): For tuning model configurations.

    Returns:
        A compiled Transformer classification model, with hidden size ready to be tuned with Keras tuner.
    """
    # Create an input layer (named `inputs` to avoid shadowing the built-in `input`)
    inputs = Input(shape = (max_length,))

    # Create an embedding layer and use it with the input
    embedding_layer = TokenAndPositionEmbedding(max_length, vocab_size, embed_dim)
    model = embedding_layer(inputs)
    
    # Tune hidden size for the transformer block (encoder layer)
    hidden_size = hp.Choice(name = "hidden_size", values = [64,128,256])
    # Create a transformer block layer and add to the model
    transformer_block = TransformerBlock(embed_dim, num_heads, hidden_size)
    model = transformer_block(model)
    
    # A GAP1D layer to reduce the sequence output of the transformer block to one vector for our output layer
    model = GlobalAveragePooling1D()(model)
    
    # Add a 256 units hidden layer and the output layer
    model = Dense(units = 256, activation="relu")(model)
    outputs = Dense(units = 2, activation="softmax")(model)

    trans_model = Model(inputs=inputs, outputs=outputs)
    trans_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics = ["accuracy"])
   
    return trans_model
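
The pooling step in the model can be pictured as a plain average over the token axis. Below is a minimal plain-Python sketch with a made-up encoder output. `global_average_pooling_1d` is my own name for it and it ignores batching.

```python
def global_average_pooling_1d(sequence_vectors):
    """Average a list of per-token vectors into one vector (mean over the sequence axis)."""
    length = len(sequence_vectors)
    dim = len(sequence_vectors[0])
    return [sum(vector[i] for vector in sequence_vectors) / length for i in range(dim)]

# Hypothetical encoder output: 3 tokens, each with a 2-dimensional vector
encoder_output = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(global_average_pooling_1d(encoder_output))   # [3.0, 4.0]
```

As far as I can tell no mask is passed in the model above, so the padded positions are averaged together with the real tokens. For short inputs this may dilute the signal, which could be worth checking.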

In [25]:
# Set up the tuner class to search
tuner = kt.BayesianOptimization(
    hypermodel = transformer_model,
    objective = kt.Objective('val_accuracy', 'max'),
    max_trials = 5,
    num_initial_points = 2,
    overwrite = True
)

# Take a look at the search space
tuner.search_space_summary()

Search space summary
Default search space size: 1
hidden_size (Choice)
{'default': 64, 'conditions': [], 'values': [64, 128, 256], 'ordered': True}


In [26]:
# Set early stopping settings
early_stopping = EarlyStopping(monitor = 'val_loss', min_delta = 0, patience = 3)
# Search for the best hyperparameters
tuner.search(
    text_train,
    label_train,
    validation_data = (text_val, label_val),
    epochs = 10,
    shuffle = True,
    callbacks = [early_stopping]
    )

# Print the best hyperparameters and model (Top 1)
topN = 1
for x in range(topN):
    best_hp = tuner.get_best_hyperparameters(topN)[x]
    print(best_hp.values)
    print(tuner.get_best_models(topN)[x].summary())

Trial 5 Complete [00h 01m 39s]
val_accuracy: 0.5993208885192871

Best val_accuracy So Far: 0.6495270729064941
Total elapsed time: 00h 06m 09s
{'hidden_size': 64}


  saveable.load_own_variables(weights_store.get(inner_path))


None


In [27]:
# Fit the model with the best hp
optimal_trans_model = transformer_model(best_hp)
print('Fit the model using train and val set\n')
optimal_trans_model.fit(
    text_train, 
    label_train, 
    validation_data = (text_val, label_val),
    callbacks = [early_stopping], 
    epochs = 5, shuffle=True,  batch_size = 32
)

# Evaluate the model with test set
print('\nEvaluate the model with the test set')
loss, acc =  optimal_trans_model.evaluate(text_test, label_test, batch_size = 32)
print('\nTest loss:', loss)
print('Test accuracy:', acc)

Fit the model using train and val set

Epoch 1/5
294/294 ━━━━━━━━━━━━━━━━━━━━ 19s 61ms/step - accuracy: 0.5931 - loss: 0.6589 - val_accuracy: 0.5239 - val_loss: 0.7447
Epoch 2/5
294/294 ━━━━━━━━━━━━━━━━━━━━ 17s 56ms/step - accuracy: 0.8128 - loss: 0.4256 - val_accuracy: 0.6168 - val_loss: 0.7646
Epoch 3/5
294/294 ━━━━━━━━━━━━━━━━━━━━ 17s 58ms/step - accuracy: 0.8878 - loss: 0.2890 - val_accuracy: 0.5886 - val_loss: 0.8078
Epoch 4/5
294/294 ━━━━━━━━━━━━━━━━━━━━ 16s 55ms/step - accuracy: 0.9234 - loss: 0.2157 - val_accuracy: 0.5501 - val_loss: 1.3668

Evaluate the model with the test set
161/161 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - accuracy: 0.5815 - loss: 1.3924

Test loss: 1.3522511720657349
Test accuracy: 0.5772674083709717


<span style="color:red">

The best test accuracy is about 0.58. Training accuracy reaches 0.92 while validation accuracy falls back to 0.55, so the model is overfitting.

</span>

### Error Analysis
- We have to handle the predictions differently here because each prediction is a pair of values (one per class).
- The left value is the probability that the model assigns to label 0 and, similarly, the right value shows how sure it is that the label is 1.
- We look at the right value: if it is larger than 0.5, the model thinks that the label is more likely equal to 1 and we mark it as True (else False).
- As our labels are 0 and 1, we simply convert T/F to 1/0 to match the original label format.

In [28]:
# Get the predicted labels
preds = optimal_trans_model.predict(text_test)

# Use the probability for the "related" class (second column) as the prediction confidence
original_pred_label = preds[:, 1]

# Convert probs to binary labels: check if the right prob value > 0.5 (T/F), then convert to 1/0
pred_label = (preds[:, 1] > 0.5).astype(int)

# Compare each label with the true test label
incorrect_index = np.where(pred_label != label_test)[0]

# Taking the 5th incorrect prediction as an example
index = incorrect_index[4] 

# Take the values at the chosen index
qtext = test_data.iloc[index]['qtext']
atext = test_data.iloc[index]['atext']
true_label = label_test[index]
predicted_label = pred_label[index]
original_predicted = original_pred_label[index]

print(f"Failure Case Example")
print("--------------------")
print(f"qtext: {qtext}")
print(f"atext: {atext}")
print(f"True Label: {true_label}")
print(f"Predicted Label: {predicted_label}")
print(f"Classification prob: {original_predicted}")

[1m161/161[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 15ms/step
Failure Case Example
--------------------
qtext: What are the signs of an insulin overdose?
atext: Your doctor may call it hypoglycemia.
True Label: 0
Predicted Label: 1
Classification prob: 0.9908889532089233
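
To see how much a confident mistake like this costs, the sketch below computes the cross-entropy contribution of this single example in plain Python. The probability is copied from the output above, and the "hesitant" value of 0.51 is a made-up comparison point.

```python
import math

# Probability assigned to label 1 in the failure case above; the true label is 0
prob_label_1 = 0.9908889532089233
prob_true_label = 1.0 - prob_label_1

# Cross-entropy contribution of this single example
example_loss = -math.log(prob_true_label)

# For comparison: a wrong prediction that is barely past the 0.5 threshold
hesitant_loss = -math.log(1.0 - 0.51)

print(f"confident mistake: {example_loss:.2f}")
print(f"hesitant mistake:  {hesitant_loss:.2f}")
```

Both mistakes count the same for accuracy, but the confident one adds far more to the loss. This is consistent with the test loss (about 1.35) being much higher than the training loss while the accuracy stays near 0.58.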


<span style="color:red">

- Aside from the semantic reasons that we could assume from words on a similar topic appearing together (insulin/hypoglycemia), we can see that the model is heavily overfitting.

- Looking at our settings (which are given as requirements), the hidden sizes of the dense layers are relatively high (up to 256 at the final dense layer), which may be the main reason. With a large hidden size, the model is more complex and is better at fitting the training set (we can clearly see how quickly it reaches a high accuracy on the train set). 

- However, this makes it memorize specific details of the train set and "think" that they are the correct patterns. As a result, the model performs worse on other sets, including the val/test sets, which is overfitting.

- Potential solution:
    - Reduce the hidden size, which is better for small data like ours. Avoid an overly complex model when it is not necessary.
    - We can add dropout layers, which randomly deactivate neurons during training and so regularize the model better.

</span>

## References

- https://datascience.stackexchange.com/questions/85847/role-of-decoder-in-transformer#:~:text=About%20needing%20the%20decoder%20in,task%2C%20the%20encoder%20would%20suffice.

- https://levelup.gitconnected.com/understanding-transformers-from-start-to-end-a-step-by-step-math-example-16d4e64e6eb1

- https://keras.io/examples/nlp/text_classification_from_scratch/

- http://jalammar.github.io/illustrated-bert/