# Assignment 3 Part 2 - Find complex answers to medical questions

*Submission deadline: Friday 1 November 2024, 11:55pm.*

*Assessment marks: 20 marks (20% of the total unit assessment)*

Unless a Special Consideration request has been submitted and approved, a 5% penalty (of the total possible mark of the task) will be applied for each day a written report or presentation assessment is not submitted, up until the 7th day (including weekends). After the 7th day, a grade of ‘0’ will be awarded even if the assessment is submitted. For example, if the assignment is worth 8 marks (of the entire unit) and your submission is late by 19 hours (or 23 hours 59 minutes 59 seconds), 0.4 marks (5% of 8 marks) will be deducted. If your submission is late by 24 hours (or 47 hours 59 minutes 59 seconds), 0.8 marks (10% of 8 marks) will be deducted, and so on. The submission time for all uploaded assessments is 11:55 pm. A 1-hour grace period will be provided to students who experience a technical concern. For any late submission of time-sensitive tasks, such as scheduled tests/exams, performance assessments/presentations, and/or scheduled practical assessments/labs, please apply for [Special Consideration](https://students.mq.edu.au/study/assessment-exams/special-consideration).

Note that the work submitted should be your own work. For rules of using of AI tools, refer to "Using Generative AI Tools" on iLearn.


# A note on the use of AI generators
In this assignment, we view AI code generators such as copilot, CodeGPT, etc as tools that can help you write code quickly. You are allowed to use these tools, but with some conditions. To understand what you can and what you cannot do, please visit these information pages provided by Macquarie University:

Artificial Intelligence Tools and Academic Integrity in FSE - https://bit.ly/3uxgQP4

If you choose to use these tools, make the following explicit in your submitted file as comments starting with "Use of AI generators in this assignment" explain:

* What part of your code is based on the output of such tools,
* What tools you used,
* What prompts you used to generate the code or text, and
* What modifications you made on the generated code or text.


This will help us assess your work fairly. If we observe that you have used an AI generator and you do not give the above information, you may face disciplinary action.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Objectives of this assignment

In assignment 3 you will work on a general answer selection task. Given a question and a list of sentences, the final goal is to predict which of these sentences from the list can be used as part of the answer to the question. Assignment 3 is divided into two parts. Part 1 will help you get familiar with the data, and Part 2 requires you to implement deep neural networks.

The data is in the file `train.csv`, which is provided in both GitHub repository and in iLearn. Each row of the file consists of a question ('qtext' column), an answer ('atext' column), and a label ('label' column) that indicates whether the  answer is correctly related to the question (1) or not (0).

The following code uses pandas to store the file `train.csv` in a data frame and shows the first few rows of data.

In [2]:
import pandas as pd
train_data = pd.read_csv("/content/drive/MyDrive/data/train.csv")
val_data = pd.read_csv("/content/drive/MyDrive/data/val.csv")
test_data = pd.read_csv("/content/drive/MyDrive/data/test.csv")
train_data.head()

Unnamed: 0,qtext,label,atext
0,What are the symptoms of gastritis?,1,"However, the most common symptoms include: Nau..."
1,What are the symptoms of gastritis?,0,var s_context; s_context= s_context || {}; s_c...
2,What are the symptoms of gastritis?,0,"!s_sensitive, chron ID: $('article embeded_mod..."
3,What does the treatment for gastritis involve?,1,Treatment for gastritis usually involves: Taki...
4,What does the treatment for gastritis involve?,1,Eliminating irritating foods from your diet su...


Note: the left-most index is not part of the data, it is added by ipynb automatically for easy reading. You can also browse the data using Microsoft Excel or similar software.

# Now let's get started.

Use the provided files `train.csv`, `val.csv`, and `test.csv` in the data.zip file for all the tasks below.

## Instruction
* You are required to finish the two tasks below.
* You need to write code in this ipynb file.
* Your ipynb file needs to include the running outputs of your final code.
* **You need to submit this ipynb file, containing your code and outputs, to iLearn.**

## Assessment

1. We mark based on the correctness of your code, outputs, and coding style.
2. We assign 2 marks (1 mark each Task) for good coding style, including but not limited to clean codes, self-explained variable names, good comments that help understand the code, etc.
3. We assign 2 marks (1 mark each Task) for correctly feeding data into your model, and correctly training and testing of your models.
4. 2 marks will be deducted for the task that does not have outputs or its outputs are incorrect.
4. For the remaining detailed marks, please refer to each specific task below.

# Task 1 (8 marks): Simple Siamese NN - Contrastive Loss

Implement a simple TensorFlow-Keras neural model that meets the following requirements:

1. (0.5 marks) An input layer that will accept the tf.idf of paired data. The input of the Siamese network is a pair of data, i.e., (qtext, atext).
2. (2 marks) Use two hidden layers and a ReLU activation function. You need to determine the size of the hidden layers in {64, 128, 256} using val data, assuming these two layers use the same hidden size.
3. (0.5 marks) Use Euclidean-distance-based contrastive loss to train the model.
4. (0.5 marks) Use Sigmoid function for classification.
5. (1 mark) Calculate prediction accuracy.
6. (1.5 marks) Give an example of failure case, and explain the possible reason and discuss potential solution.
7. (1 mark) Good coding style as explained in the above Assessment Section.
8. (1 mark) Correctly feeding data into your model, and correctly training and testing of your models.

Use the test data to report the final accuracy of your best model.

1. Creating the required functions for all the tasks.



In [3]:
import tensorflow as tf
from tensorflow.keras import layers, Model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import numpy as np

# 1. (0.5 marks) An input layer that will accept the tf.idf of paired data. The input of the Siamese network is a pair of data

In [4]:
# Step 1: Data Preparation (TF-IDF of qtext and atext)
def prepare_data(df, vectorizer=None):
    # Fit TF-IDF on training data if vectorizer is not provided
    if vectorizer is None:
        vectorizer = TfidfVectorizer(stop_words='english')
        tfidf_qtext = vectorizer.fit_transform(df['qtext'].values)
    else:
        # Transform using the provided vectorizer (for val/test data)
        tfidf_qtext = vectorizer.transform(df['qtext'].values)

    # Transform answer text using the same vectorizer
    tfidf_atext = vectorizer.transform(df['atext'].values)

    # Prepare paired input data
    Q1 = tfidf_qtext.toarray()
    A1 = tfidf_atext.toarray()
    y = df['label'].values

    return Q1, A1, y, vectorizer


# 2. (2 marks) Use two hidden layers and a ReLU activation function. You need to determine the size of the hidden layers in {64, 128, 256} using val data, assuming these two layers use the same hidden size.

- Hidden Size kept as variable.

In [5]:

# Step 2: Define the Siamese Neural Network Model
def build_siamese_nn(input_shape, hidden_size):
    # Define the shared subnetwork
    input_layer = layers.Input(shape=input_shape)
    hidden_layer1 = layers.Dense(hidden_size, activation='relu')(input_layer)
    hidden_layer2 = layers.Dense(hidden_size, activation='relu')(hidden_layer1)
    return Model(input_layer, hidden_layer2)


# 3. (0.5 marks) Use Euclidean-distance-based contrastive loss to train the model.

In [6]:
# Step 3: Contrastive Loss function
def contrastive_loss(y_true, distances):
    margin = 1
    return tf.reduce_mean(y_true * tf.square(distances) + (1 - y_true) * tf.square(tf.maximum(margin - distances, 0)))

# 4. (0.5 marks) Use Sigmoid function for classification.

In [7]:

# Step 4: Siamese Network Architecture
def build_siamese_model(input_shape, hidden_size):
    # Ensure input_shape is in tuple form
    input_shape = (input_shape,) if isinstance(input_shape, int) else input_shape

    # Two input layers (for qtext and atext)
    input_q = layers.Input(shape=input_shape)
    input_a = layers.Input(shape=input_shape)

    # Shared network
    shared_network = build_siamese_nn(input_shape, hidden_size)

    # Output embeddings from the shared network
    embedding_q = shared_network(input_q)
    embedding_a = shared_network(input_a)

    # Calculate Euclidean distance between embeddings
    distance = layers.Lambda(lambda tensors: tf.sqrt(tf.reduce_sum(tf.square(tensors[0] - tensors[1]), axis=1, keepdims=True)))(
        [embedding_q, embedding_a])

    # Sigmoid output for binary classification
    output = layers.Dense(1, activation='sigmoid')(distance)

    # Build the full model
    model = Model([input_q, input_a], output)

    # Compile the model with contrastive loss
    model.compile(optimizer='adam', loss=contrastive_loss, metrics=['accuracy'])

    return model


# Step 5: Train the Model
def train_model(X1_train, X2_train, y_train, X1_val, X2_val, y_val,hidden_size):

    input_shape = X1_train.shape[1]  # Input shape based on TF-IDF vector size
    hidden_size = hidden_size  # Validate using different sizes from {64, 128, 256}

    model = build_siamese_model(input_shape, hidden_size)

    # Train the model
    history = model.fit([X1_train, X2_train], y_train, validation_data=([X1_val, X2_val], y_val), epochs=1, batch_size=32)

    return model, history

def analyze_failure_case(model, df, vectorizer):


    # Prepare validation data
    X1_val, X2_val, y_val, _ = prepare_data(df, vectorizer=vectorizer)


    # Make predictions on validation set
    predictions = model.predict([X1_val, X2_val])

    # Loop through predictions to find a failure case
    for i, (pred, true) in enumerate(zip(predictions, y_val)):
        if np.isnan(pred[0]):  # Skip if prediction is NaN
            continue
        predicted_label = int(round(pred[0]))  # Round the prediction to 0 or 1 for comparison

        if predicted_label != true:  # Identify incorrect predictions
            print("Failure Case Identified:")
            print(f"Question: {df['qtext'].iloc[i]}")
            print(f"Answer: {df['atext'].iloc[i]}")
            print(f"True Label: {true}, Predicted Label: {predicted_label}")
            print("\nPossible Reason: The TF-IDF encoding may not capture subtle semantic differences in the text, which can lead to misclassifications.")
            print("Suggested Solution: Consider switching to word embeddings (e.g., Word2Vec, GloVe, or BERT embeddings) for better semantic understanding and accuracy in classification.")
            break
    else:
        print("No failure case found.")





Preparing the data in the TfIdf format.

In [8]:


# For training data, fit the vectorizer
Ques_train, Ans_train, Label_train, vectorizer = prepare_data(train_data)

# For validation and test data, use the fitted vectorizer
Ques_val, Ans_val, Label_val, vectorizer_val = prepare_data(val_data, vectorizer=vectorizer)
Ques_test, Ans_test, Label_test, vectorizer_test = prepare_data(test_data, vectorizer=vectorizer)


In [9]:
# Check the shapes of prepared data
print("Ques_train shape:", Ques_train.shape)
print("Ans_train shape:", Ans_train.shape)
print("Label_train shape:", Label_train.shape)


Ques_train shape: (9380, 1734)
Ans_train shape: (9380, 1734)
Label_train shape: (9380,)


# 3. Calculating the accuracy on hidden size [64,128,256] on validation Data.

In [10]:
# Evaluate Different Hidden Sizes

hidden_sizes = [64,128,256]
best_hidden_size = None
best_val_accuracy = 0
best_model = None

for hidden_size in hidden_sizes:
   print(f"Training with hidden size: {hidden_size}")



        # Train the model with the current hidden size
   model, history = train_model(X1_train= Ques_train, X2_train= Ans_train, y_train=Label_train, X1_val= Ques_val, X2_val= Ans_val, y_val= Label_val, hidden_size = 64)

        # Evaluate the model on the validation set
   val_loss, val_accuracy = model.evaluate([Ques_val, Ans_val], Label_val, verbose=0)

   print(f"Validation accuracy for hidden size {hidden_size}: {val_accuracy:.4f}")

        # Check if this is the best accuracy
   if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_hidden_size = hidden_size
            best_model = model


print(f"Best hidden size: {best_hidden_size} with validation accuracy: {best_val_accuracy:.4f}")





Training with hidden size: 64
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 8ms/step - accuracy: 0.5377 - loss: nan - val_accuracy: 0.5498 - val_loss: nan
Validation accuracy for hidden size 64: 0.5498
Training with hidden size: 128
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - accuracy: 0.5267 - loss: nan - val_accuracy: 0.5498 - val_loss: nan
Validation accuracy for hidden size 128: 0.5498
Training with hidden size: 256
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - accuracy: 0.5365 - loss: nan - val_accuracy: 0.5498 - val_loss: nan
Validation accuracy for hidden size 256: 0.5498
Best hidden size: 64 with validation accuracy: 0.5498


## Therefore, the best accuracy is 54.98 with hidden size 64.

## Training the best model on hidden size 64

In [11]:
model, history = train_model(X1_train = Ques_train, X2_train = Ans_train, y_train= Label_train , X1_val=Ques_val, X2_val = Ans_val, y_val= Label_val, hidden_size = 64)

[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 16ms/step - accuracy: 0.5308 - loss: nan - val_accuracy: 0.5498 - val_loss: nan


# 5. Calculating the accuracy on test dataset

In [12]:
test_loss, test_accuracy = model.evaluate([Ques_test, Ans_test], Label_test, verbose=0)
print(f" accuracy on Data Set for hidden size { best_hidden_size }: {test_accuracy:.4f}")

 accuracy on Data Set for hidden size 64: 0.5403


6. Calculating the failure cases.


In [13]:
import numpy as np

# Check for NaN or inf values in Ques_val and Ans_val
if np.isnan(Ques_val).any() or np.isinf(Ques_val).any():
    print("NaN or inf values found in Ques_val")
if np.isnan(Ans_val).any() or np.isinf(Ans_val).any():
    print("NaN or inf values found in Ans_val")


    # Make predictions on validation set
    predictions = model.predict([Ques_val, Ans_val])
    print(predictions)
    # Loop through predictions to find a failure case
    for i, (pred, true) in enumerate(zip(predictions, Label_val)):
        if np.isnan(pred[0]):  # Skip if prediction is NaN
            continue
        predicted_label = int(round(pred[0]))  # Round the prediction to 0 or 1 for comparison

        if predicted_label != true:  # Identify incorrect predictions
            print("Failure Case Identified:")
            print(f"Question: {df['qtext'].iloc[i]}")
            print(f"Answer: {df['atext'].iloc[i]}")
            print(f"True Label: {true}, Predicted Label: {predicted_label}")
            print("\nPossible Reason: The TF-IDF encoding may not capture subtle semantic differences in the text, which can lead to misclassifications.")
            print("Suggested Solution: Consider switching to word embeddings (e.g., Word2Vec, GloVe, or BERT embeddings) for better semantic understanding and accuracy in classification.")
            break
    else:
        print("No failure case found.")




# Task 2 (12 marks): Transformer

In this task, let's use Transformer to predict whether two sentences are related or not. Implement a simple Transformer neural network that meets the following requirements:

1. (1 mark) Each input for this model should be a concatenation of qtext and atext. Use [SEP] to separate qtext and atext, e.g., "Can high blood pressure bring on heart failure? [SEP] Hypertensive heart disease is the No." You need to pad the input to a fixed length. How do you determine a suitable length?
2. (1.5 marks) Choose a suitable tokenizer and justify your choice.
3. (1 mark) An embedding layer that generates embedding vectors of the sentence text into size 128. Remember to add position embedding.
4. (1 mark) One transformer encoder layer, you need to find a hidden dimension in {64, 128, 256}. Use 3 heads in MultiHeadAttention.
5. (1 mark) Do we need a transformer decoder layer for this task? If yes, find a hidden dimension in {64, 128, 256} and use 3 heads in MultiHeadAttention. If no, explain why.
6. (0.5 marks) 1 hidden layer with size 256 and ReLU activation function.
7. (0.5 marks) 1 output layer with size 2 for binary classification to predict whether two inputs are related or not.
8. (1 mark) Choose a suitable loss to train the model
9. (1 mark) Report your best accuracy on the test split.
10. (1.5 marks) Give an example of a failure case, and explain the possible reason and discuss a potential solution.
11. (1 mark) Good coding style as explained in the above Assessment Section.
12. (1 mark) Correctly feeding data into your model, and correctly training and testing of your models.



In [14]:
# Assuming `train_data`, `val_data`, and `test_data` have a 'label' column with binary labels
train_pairs = list(zip(train_data['qtext'], train_data['atext'], train_data['label']))
val_pairs = list(zip(val_data['qtext'], val_data['atext'], val_data['label']))
test_pairs = list(zip(test_data['qtext'], test_data['atext'], test_data['label']))


In [15]:
import numpy as np

def concatenate_texts(qtext, atext, sep_token="[SEP]"):
    """Concatenates qtext and atext with a separator token."""
    return f"{qtext} {sep_token} {atext}"

def calculate_suitable_length(data_pairs, percentile=95):
    """Calculates a suitable length for padding based on the specified percentile of text lengths."""
    concatenated_texts = [concatenate_texts(q, a) for q, a, _ in data_pairs]
    lengths = [len(text.split()) for text in concatenated_texts]
    suitable_length = int(np.percentile(lengths, percentile))
    return suitable_length


suitable_length = calculate_suitable_length(train_pairs)




In [16]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

# Load the datasets
train_data = pd.read_csv("/content/drive/MyDrive/data/train.csv")
val_data = pd.read_csv("/content/drive/MyDrive/data/val.csv")
test_data = pd.read_csv("/content/drive/MyDrive/data/test.csv")

# Task 1: Function to concatenate qtext and atext with [SEP] and pad to a fixed length
def preprocess_data(data, max_length= suitable_length):
    """Preprocess data by concatenating qtext and atext with [SEP] and padding."""
    # Concatenate qtext and atext
    data['input_text'] = data['qtext'] + " [SEP] " + data['atext']
    # Tokenization and padding
    tokenizer = keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(data['input_text'])
    sequences = tokenizer.texts_to_sequences(data['input_text'])
    padded_sequences = keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_length)

    return padded_sequences, data['label'].values, tokenizer

# Preprocess the training, validation, and test data
train_inputs, train_labels, tokenizer = preprocess_data(train_data)
val_inputs, val_labels, _ = preprocess_data(val_data)
test_inputs, test_labels, _ = preprocess_data(test_data)

# Task 3: Embedding layer with positional embeddings
class SentenceEmbeddingWithPosition(layers.Layer):
    def __init__(self, vocab_size, embed_size, max_length):
        super(SentenceEmbeddingWithPosition, self).__init__()
        self.token_embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_size)
        self.position_embedding = layers.Embedding(input_dim=max_length, output_dim=embed_size)

    def call(self, inputs):
        positions = tf.range(start=0, limit=tf.shape(inputs)[-1], delta=1)
        embedded_tokens = self.token_embedding(inputs)
        embedded_positions = self.position_embedding(positions)
        return embedded_tokens + embedded_positions  # Combine token and position embeddings

# Transformer Encoder Layer
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_size, num_heads, ff_dim, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_size)
        self.ffn = keras.Sequential([
            layers.Dense(ff_dim, activation='relu'),
            layers.Dense(embed_size)
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.attention(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)


# Simple Transformer Model
class SimpleTransformerModel(keras.Model):
    def __init__(self, vocab_size, embed_size, max_length):
        super(SimpleTransformerModel, self).__init__()
        self.embedding = SentenceEmbeddingWithPosition(vocab_size, embed_size, max_length)
        self.encoder = TransformerEncoder(embed_size, num_heads=3, ff_dim=128)  # Use the custom encoder
        self.hidden = layers.Dense(256, activation='relu')  # Hidden layer
        self.output_layer = layers.Dense(1, activation='sigmoid')  # Single output for binary classification

    def call(self, inputs, training=False):  # Include training parameter
        x = self.embedding(inputs)
        x = self.encoder(x, training=training)  # Encoder outputs a sequence

        # Aggregate the encoder outputs (e.g., by taking the mean of the outputs)
        x = tf.reduce_mean(x, axis=1)  # Shape becomes (None, embed_size)

        x = self.hidden(x)
        return self.output_layer(x)  # Final output shape (None, 1)


# Task 2: Choose a suitable tokenizer
vocab_size = len(tokenizer.word_index) + 1  # Add 1 for padding
embed_size = 128  # Task 3: Embedding size
max_length = 50  # Fixed length determined earlier

# Instantiate the model
model = SimpleTransformerModel(vocab_size, embed_size, max_length)

# Task 8: Choose a suitable loss function
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])  # For binary classification

# Create TensorFlow datasets for training and validation
train_dataset = tf.data.Dataset.from_tensor_slices((train_inputs, train_labels)).batch(32)
val_dataset = tf.data.Dataset.from_tensor_slices((val_inputs, val_labels)).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((test_inputs, test_labels)).batch(32)
# Task 8: Choose a suitable loss function
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Task 12: Train the model
history = model.fit(train_dataset, validation_data=val_dataset, epochs=10)

# Task 9: Evaluate on test split
test_loss, test_accuracy = model.evaluate(test_dataset)

print(f"Test Accuracy: {test_accuracy:.4f}")






Epoch 1/10
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m68s[0m 215ms/step - accuracy: 0.5199 - loss: 0.6954 - val_accuracy: 0.5489 - val_loss: 0.7056
Epoch 2/10
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m75s[0m 191ms/step - accuracy: 0.6820 - loss: 0.6085 - val_accuracy: 0.5472 - val_loss: 0.8636
Epoch 3/10
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m52s[0m 176ms/step - accuracy: 0.7834 - loss: 0.4775 - val_accuracy: 0.5520 - val_loss: 0.8678
Epoch 4/10
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m99s[0m 235ms/step - accuracy: 0.8420 - loss: 0.3789 - val_accuracy: 0.5666 - val_loss: 1.0712
Epoch 5/10
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m77s[0m 219ms/step - accuracy: 0.8651 - loss: 0.3249 - val_accuracy: 0.5724 - val_loss: 1.2598
Epoch 6/10
[1m294/294[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m71s[0m 243ms/step - accuracy: 0.9056 - loss: 0.2525 - val_accuracy: 0.5709 - val_loss: 1.1863
Epoch 7/10

2. (1.5 marks) Choose a suitable tokenizer and justify your choice.
For this type of text-pair input, BERT's tokenizer (or a tokenizer based on the BERT model, like bert-base-uncased from Hugging Face) is a strong choice. Here’s why:

[**SEP] Token:** BERT’s tokenizer naturally supports a [SEP] token, which we need to separate qtext and atext. The tokenizer treats [SEP] as a unique token that it encodes distinctly, which helps the model to understand the structure of the input by distinguishing the question and answer components.

WordPiece Tokenization: **bold text**BERT’s WordPiece tokenizer efficiently handles out-of-vocabulary (OOV) words by breaking them into subword tokens, which is valuable when handling specialized or uncommon terms in the texts. This subword tokenization captures context better for rare words and helps retain their meaning.

**Pre-trained Embeddings: **Since BERT is trained on large corpora, it provides robust embeddings for general English text. Fine-tuning BERT’s pre-trained tokenizer helps leverage these embeddings, which enhances the model's performance on tasks involving semantic understanding in question-answer pairs.
Consistent with Transformer Models: BERT's tokenizer is designed specifically to work well with transformer-based models, making it an ideal fit for a model architecture where pre-trained transformers could be applied or adapted.


8. (1 mark) Choose a suitable loss to train the model

For this binary classification task, categorical cross-entropy is suitable because the model has a softmax output layer with two nodes, one for each class (related or not related). Categorical cross-entropy measures the difference between the true labels and the predicted probabilities, making it effective for classification tasks with mutually exclusive classes.

Reasoning:
Two Classes: Since the output layer has two classes, categorical cross-entropy will work effectively in penalizing incorrect predictions based on how far the predicted probabilities are from the actual label.

Softmax Activation: With softmax, each output value is a probability (summing to 1), which pairs naturally with categorical cross-entropy for gradient-based optimization.

5. (1 mark) Do we need a transformer decoder layer for this task? If yes, find a hidden dimension in {64, 128, 256} and use 3 heads in MultiHeadAttention. If no, explain why.

Transformer Decoder Layer: Decide whether a transformer decoder is needed. For many answer selection tasks, it may not be necessary since the input is not autoregressive

In [None]:
# Task 10: Example of a failure case
def find_failure_cases(data, model, tokenizer):
    failure_cases = []
    for index, row in data.iterrows():
        input_text = row['qtext'] + " [SEP] " + row['atext']
        input_seq = tokenizer.texts_to_sequences([input_text])
        padded_seq = keras.preprocessing.sequence.pad_sequences(input_seq, maxlen=max_length)
        prediction = model.predict(padded_seq)
        predicted_label = np.argmax(prediction)
        if predicted_label != row['label']:
            failure_cases.append((row['qtext'], row['atext'], row['label'], predicted_label))
    return failure_cases
# Finding failure cases in the test data
failure_cases = find_failure_cases(test_data, model, tokenizer)

# Print a few failure cases
print("Failure Cases (if any):")
for case in failure_cases[:5]:  # Displaying only the first 5 failure cases
    print(f"Question: {case[0]}, Answer: {case[1]}, Actual: {case[2]}, Predicted: {case[3]}")


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 77ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 67ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 78ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 96ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 61ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 62ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 76ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 65ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 71ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 78ms

In [None]:
import numpy as np

# Check for NaN or inf values in Ques_val and Ans_val
if np.isnan(Ques_val).any() or np.isinf(Ques_val).any():
    print("NaN or inf values found in Ques_val")
if np.isnan(Ans_val).any() or np.isinf(Ans_val).any():
    print("NaN or inf values found in Ans_val")


# Submission

Your submission should consist of this Jupyter notebook with all your code and explanations inserted into the notebook as code/text cells. **The notebook should contain the output of the runs. All code should run without errors. Code with syntax errors or code without output will not be assessed.**

**Do not submit multiple files.**

Examine the text cells of this notebook so that you can have an idea of how to format text for good visual impact. You can also read this useful [guide to the MarkDown notation](https://daringfireball.net/projects/markdown/syntax),  which explains the format of the text cells.