# Introduction

This competition involves **building a machine learning model to answer multiple-choice questions that were created by a large language model (LLM)**. The dataset consists of questions with **five possible answers (labeled A through E)**.

The task is to **predict the top three most probable answers for each question in the test set**. For each question, there is one answer that is considered the most correct, according to the generating LLM.

The specific files are:

    train.csv: This file contains **200 example questions along with the correct answers**. This data should be used to train your machine learning model.

    test.csv: This file contains the questions for which you must predict the answers. Note that the provided test.csv file is just a placeholder; the actual test data will be provided when your submission is scored. The true test data will have a similar format, but will consist of ~4,000 different questions.

    sample_submission.csv: This file shows the correct format for submitting your predictions.

For each question in test.csv, your model should predict the labels of the top three most probable answers, separated by spaces. These predictions should be stored in a new column called 'prediction'. The order of the labels matters, with the first label being the most likely answer according to your model, the second label being the second most likely, and so on.

Finally, **your predictions should be written to a CSV file for submission**. The submission file should have two columns: 'id' and 'prediction'. The 'id' column should match the 'id' column in test.csv, and the 'prediction' column should contain your model's predictions.
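As a quick sanity check on the format described above, the sketch below validates a prediction string before it goes into the submission file. The helper name `is_valid_prediction` is hypothetical and not part of the competition tooling. It assumes each row should hold three distinct labels from A to E, separated by single spaces.

```python
VALID_LABELS = {"A", "B", "C", "D", "E"}

def is_valid_prediction(prediction: str) -> bool:
    """Return True if `prediction` looks like 'A B C': three distinct labels from A to E."""
    labels = prediction.split(" ")
    return (
        len(labels) == 3
        and all(label in VALID_LABELS for label in labels)
        and len(set(labels)) == 3
    )

# A few hypothetical prediction strings and whether they pass the check.
checks = {p: is_valid_prediction(p) for p in ["D A C", "A A B", "A B", "A,B,C"]}
print(checks)
```

Running a check like this over the 'prediction' column before saving can catch formatting slips such as commas or repeated labels.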

# Load the Train Dataset

First, we load the training data with pandas. The result, train_df, is a DataFrame that contains the 200 training questions, their five options, and the correct answers.

In [1]:
import pandas as pd
train_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/train.csv')
train_df

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D
...,...,...,...,...,...,...,...,...
195,195,What is the relation between the three moment ...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...,C
196,196,"What is the throttling process, and why is it ...",The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,B
197,197,What happens to excess base metal as a solutio...,"The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be...",B
198,198,"What is the relationship between mass, force, ...",Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...,D


# Tokenize and Format the Dataset

The LLM understands only numbers, not raw text. Therefore, we need to **tokenize our dataset (convert the text into numbers)** and **format it in the way the LLM expects**. In a multiple-choice question answering task, each sample in our dataset consists of **a context (the question) and five possible responses (the options)**. We use **a transformer's tokenizer** for this purpose. In this notebook, **the LLM is BERT, so the tokenizer is the matching BERT tokenizer**.

**The LLM needs to know the correct answer for each question** in the dataset to learn from it. In our dataset, the correct answer is given in the 'answer' column as a letter (A, B, C, D, or E). We **convert these letters into indices (0, 1, 2, 3, or 4)** because the LLM works with numbers, and then give each question-option pair a label of 1 if the option is the correct answer and 0 otherwise.

In [2]:
from transformers import AutoTokenizer
from datasets import Dataset

MODEL_DIR = "/kaggle/input/huggingface-bert/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "bert-large-uncased")

def encode(row):
    # Format the context and the options.
    prompt = str(row['prompt'])
    options = [str(option) for option in row[['A', 'B', 'C', 'D', 'E']].values.tolist()]
    
    answer_mapping = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
    correct_answer_id = answer_mapping[row['answer']]

    encoded_rows = []
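    # Tokenize the question with each option in turn, and attach the label.
    # Note: passing a two-element list makes the tokenizer encode the prompt and the
    # option as two separate sequences, so each example holds two rows of 512 token IDs.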
    for idx, option in enumerate(options):
        text_pair = [prompt, option]
        encoded = tokenizer(text_pair, truncation = True, padding = 'max_length', max_length = 512)
        
        # We set the label to 1 if this is the correct answer, otherwise 0.
        encoded['labels'] = 1 if idx == correct_answer_id else 0
        encoded_rows.append(encoded)

    return encoded_rows

encoded_train = []
for _, row in train_df.iterrows():
    encoded_train.extend(encode(row))

# Now each item in encoded_train is a dictionary representing a single example.
# We can convert it into a Dataset.
encoded_train_dataset = Dataset.from_dict({key: [dic[key] for dic in encoded_train] for key in encoded_train[0]})

In [3]:
encoded_train_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 1000
})

The encoded_train_dataset is an instance of the Dataset class from the Hugging Face datasets library. This dataset contains preprocessed and tokenized training data that we can use to train a machine learning model.

The features field tells what kind of information each example in the dataset includes. Here, it includes:

    input_ids: These are the tokenized inputs to the model. Each token in the input has been mapped to an ID using the vocabulary of the tokenizer.
    
    token_type_ids: These are used by some models (like BERT) to differentiate between different sequences in the input. For example, it can tell the model where the question ends and where the answer options begin.
    
    attention_mask: This is used to tell the model which parts of the input are actual content and which parts are padding (i.e., meaningless tokens added to make all inputs the same length).
    
    labels: A binary flag for each question-option pair: 1 if the option is the correct answer, 0 otherwise. This is what the model is trying to predict.

The num_rows field tells you that there are 1,000 examples in this dataset: 200 questions times five options each.

Please note that the actual content of the dataset is not shown in this overview. We can access the data using indexing, for example "encoded_train_dataset[0]" to get the first example.
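To make the three input fields described above more concrete, here is a minimal sketch built on a toy whitespace tokenizer. It is not the real BERT tokenizer: the function `toy_encode_pair`, its on-the-fly vocabulary, and the `max_length` of 12 are assumptions made purely for illustration. It mimics the `[CLS] question [SEP] option [SEP]` layout commonly used when a question and an option are encoded together as a pair.

```python
def toy_encode_pair(question, option, max_length=12):
    """Toy whitespace 'tokenizer' (NOT the real BERT tokenizer) that mimics the
    [CLS] question [SEP] option [SEP] layout used for sequence pairs."""
    first = ["[CLS]"] + question.lower().split() + ["[SEP]"]
    second = option.lower().split() + ["[SEP]"]
    # Naive truncation: simply cut everything beyond max_length.
    tokens = (first + second)[:max_length]
    token_type_ids = ([0] * len(first) + [1] * len(second))[:max_length]
    attention_mask = [1] * len(tokens)

    # Pad every field up to max_length so all examples have the same shape.
    pad = max_length - len(tokens)
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad

    # Map each token to an integer ID with a vocabulary built on the fly.
    vocab = {"[PAD]": 0}
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    return {"tokens": tokens, "input_ids": input_ids,
            "token_type_ids": token_type_ids, "attention_mask": attention_mask}

encoded = toy_encode_pair("What is mass ?", "An inertial property")
for name, values in encoded.items():
    print(f"{name:>15}: {values}")
```

In the printed result, token_type_ids switches from 0 to 1 where the option starts, and attention_mask drops to 0 on the two padding positions at the end.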

In [4]:
#encoded_train_dataset[0]

# See the Labels

In [5]:
print(encoded_train_dataset['labels'][:10])

[0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
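The ten labels printed above cover the first two questions, with five option flags each. As a small sketch of how such flags follow from the answer letters, the hypothetical helper `binary_labels` below expands the answers D and A (the first two rows of train_df) into one flag per option.

```python
answer_mapping = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}

def binary_labels(answers):
    """Expand answer letters into one 0/1 flag per (question, option) pair."""
    flags = []
    for answer in answers:
        correct = answer_mapping[answer]
        flags.extend(1 if idx == correct else 0 for idx in range(5))
    return flags

# The first two training questions have the answers D and A.
print(binary_labels(["D", "A"]))  # [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
```

Only one flag in every block of five is 1, so four out of five training examples carry the label 0.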


In [6]:
answer_mapping = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
train_labels = train_df['answer'].map(answer_mapping)

In [7]:
train_labels

0      3
1      0
2      0
3      2
4      3
      ..
195    2
196    1
197    1
198    3
199    2
Name: answer, Length: 200, dtype: int64

# Initialize the Model

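We need to initialize the LLM for fine-tuning. We use a version of the LLM that is **specifically designed for multiple-choice tasks** (AutoModelForMultipleChoice): the pre-trained BERT weights are loaded and a new, untrained multiple-choice head is added on top.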

**This time we use BERT-large.** The primary difference between the "base" and "large" versions of BERT models lies in **their size, which is reflected in the number of parameters they have, the number of transformer layers (i.e., the "depth" of the network), and the size of these layers (i.e., the "width" of the network)**. This directly impacts the model's capacity to learn from data, its computational requirements, and its performance on different tasks.

Here's a quick comparison:

    BERT-base: BERT-base models are smaller versions, with 12 transformer layers, each with a hidden size of 768, and 12 attention heads. This results in a total of about 110 million parameters.

    BERT-large: BERT-large models are much bigger, with 24 transformer layers, each with a hidden size of 1024, and 16 attention heads. This results in a total of about 340 million parameters.

**Because BERT-large models are larger and have more parameters, they have a greater capacity to learn and model complex patterns in data.** As a result, they typically perform better on tasks involving understanding natural language. **However, they also require more computational resources (both for training and inference), and the improvements they provide may not always justify the increased computational cost**, depending on the specific application and available resources.
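The parameter counts quoted above can be roughly reproduced from the layer count and hidden size. The sketch below is a back-of-the-envelope estimate and does not inspect the actual model. It assumes the uncased WordPiece vocabulary of 30,522 tokens, 512 position embeddings, a feed-forward width of four times the hidden size, and it counts only the encoder (embeddings, transformer layers, pooler), not any task-specific head. The function name `bert_encoder_params` is hypothetical.

```python
def bert_encoder_params(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (embeddings, layers, pooler)."""
    layer_norm = 2 * hidden                       # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)    # query, key, value, output projections
    intermediate = 4 * hidden                     # feed-forward width, assumed 4x hidden
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler

base = bert_encoder_params(num_layers=12, hidden=768)
large = bert_encoder_params(num_layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M parameters")
print(f"BERT-large: {large / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the estimate because the heads split the hidden size among themselves rather than adding parameters. Under these assumptions the results land close to the rounded figures of about 110 million and about 340 million.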

**The uncased model does not distinguish between uppercase and lowercase letters (it lowercases all input before tokenizing), whereas the cased model does keep the original letter cases.**

In [8]:
from transformers import AutoModelForMultipleChoice

model = AutoModelForMultipleChoice.from_pretrained(MODEL_DIR + "bert-large-uncased")

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
Some weights of the model checkpoint at /kaggle/input/huggingface-bert/bert-large-uncased were not used when initializing BertForMultipleChoice: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSe

# Train the Model

Finally, we can train the model using a library like Hugging Face's Transformers, which provides an easy-to-use Trainer class. We need to provide our encoded dataset, the correct labels, and some training arguments to the Trainer, and then call the train method to start training.

**Remember, this is a simplification. In a real setting, you would probably want to include a validation step, handle the tokenization in a more sophisticated way to deal with long sequences, and so on.**

Moreover, **training LLMs from scratch is computationally expensive and can take a very long time, even on multiple GPUs**. In practice, **we often use a pre-trained LLM and fine-tune it on our specific task, which is much quicker and requires fewer computational resources**.

In [9]:
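import numpy as np

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions
    map3 = mean_average_precision_at_3(labels, preds)
    return {
        'map3': map3
    }

def mean_average_precision_at_3(labels, preds):
    ap3s = [average_precision_at_3(label, pred) for label, pred in zip(labels, preds)]
    return sum(ap3s) / len(ap3s)

def average_precision_at_3(label, pred):
    # pred holds one score per choice. Rank the choices from the highest to the
    # lowest score and keep the indices of the top three as a plain Python list,
    # because NumPy arrays have no index method.
    top3 = np.argsort(-np.asarray(pred))[:3].tolist()
    try:
        return (1 / (top3.index(int(label)) + 1))
    except ValueError:
        return 0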

This code **computes the average precision at 3 for each question, then takes the mean of these scores**. Because each question has exactly one correct answer, the average_precision_at_3 function returns the reciprocal of the rank of the correct label (1, 1/2, or 1/3) if it is within the top 3 predictions, or 0 otherwise. It uses the index method to find the position of the correct label, adding 1 because index is 0-based while ranks are 1-based. The try/except block handles the case where the correct label is not in the top 3 predictions. Note that the Trainer only calls compute_metrics during evaluation, so it has no effect unless an evaluation dataset is provided.
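A small worked example may help to check the metric by hand. The sketch below uses made-up scores for three questions and a hypothetical helper `ap_at_3_from_scores` that ranks the options by score first. It is independent of the Trainer and only illustrates the arithmetic.

```python
def ap_at_3_from_scores(label, scores):
    """Reciprocal rank of `label` among the three highest-scoring choices, else 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top3 = ranked[:3]
    return 1 / (top3.index(label) + 1) if label in top3 else 0.0

# Hypothetical scores for options A to E of three questions, with the correct index of each.
scores = [
    [0.1, 0.2, 0.9, 0.3, 0.0],  # correct = 2 (C), ranked 1st -> 1
    [0.5, 0.1, 0.4, 0.3, 0.2],  # correct = 2 (C), ranked 2nd -> 1/2
    [0.9, 0.8, 0.7, 0.1, 0.2],  # correct = 3 (D), not in the top 3 -> 0
]
labels = [2, 2, 3]

ap3s = [ap_at_3_from_scores(label, s) for label, s in zip(labels, scores)]
map3 = sum(ap3s) / len(ap3s)
print(ap3s, map3)  # [1.0, 0.5, 0.0] 0.5
```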

In [10]:
from transformers import TrainingArguments, Trainer

# Disable wandb globally.
import os
os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir = './finetuned_bert',  # change to a local directory
    num_train_epochs = 3,
    per_device_train_batch_size = 1,
    learning_rate = 2e-5,
    gradient_accumulation_steps = 2,
    report_to =  [],  # Disable all integrations.
)


trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = encoded_train_dataset,
    compute_metrics = compute_metrics,  # optional function to compute metrics for evaluation
)

In [11]:
trainer.train()



Step,Training Loss
500,0.8116
1000,0.8996
1500,0.9007


TrainOutput(global_step=1500, training_loss=0.870613037109375, metrics={'train_runtime': 1261.9969, 'train_samples_per_second': 2.377, 'train_steps_per_second': 1.189, 'total_flos': 5591569287168000.0, 'train_loss': 0.870613037109375, 'epoch': 3.0})
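The global_step of 1500 reported above follows from the training arguments. The arithmetic is sketched below, assuming a single device. The Trainer's exact handling of leftover batches may differ between library versions, but it does not matter here because the numbers divide evenly.

```python
import math

num_examples = 200 * 5              # 200 questions x 5 options
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
num_train_epochs = 3
num_devices = 1                     # assumption: a single GPU (or CPU)

# Number of examples that contribute to one optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Batches seen per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_devices))
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)  # 2 500 1500
```

With an effective batch size of 2, each epoch takes 500 optimizer steps, which is also why the loss is logged at steps 500, 1000, and 1500.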

# Predict the Test Data

We now use the fine-tuned model to make predictions on the test data.

In [12]:
test_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")
test_df

Unnamed: 0,id,prompt,A,B,C,D,E
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...
...,...,...,...,...,...,...,...
195,195,What is the relation between the three moment ...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...
196,196,"What is the throttling process, and why is it ...",The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...
197,197,What happens to excess base metal as a solutio...,"The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be..."
198,198,"What is the relationship between mass, force, ...",Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...


Encoding: This is the step we are performing with our encode_test() function. Each prompt and option pair is tokenized.

In [13]:
def encode_test(example):
    # Format the context and the options.
    prompt = str(example['prompt'])
    options = [str(option) for option in example[['A', 'B', 'C', 'D', 'E']].values.tolist()]
    examples = []

    # Tokenize the question and the options.
    for option in options:
        text_pair = [prompt, option]
        encoded = tokenizer(text_pair, truncation = True, padding = 'max_length', max_length = 512)
        examples.append(encoded)

    return examples

encoded_test_df = test_df.apply(encode_test, axis = 1)
encoded_test_df

0      [[input_ids, token_type_ids, attention_mask], ...
1      [[input_ids, token_type_ids, attention_mask], ...
2      [[input_ids, token_type_ids, attention_mask], ...
3      [[input_ids, token_type_ids, attention_mask], ...
4      [[input_ids, token_type_ids, attention_mask], ...
                             ...                        
195    [[input_ids, token_type_ids, attention_mask], ...
196    [[input_ids, token_type_ids, attention_mask], ...
197    [[input_ids, token_type_ids, attention_mask], ...
198    [[input_ids, token_type_ids, attention_mask], ...
199    [[input_ids, token_type_ids, attention_mask], ...
Length: 200, dtype: object

Prediction: Next, we need to loop over the encoded inputs, feed them to the model, and store the model outputs.

In [14]:
import torch

# Check if a GPU is available and if not, default to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Make sure the model is on the chosen device and in evaluation mode (dropout disabled).
model.to(device)
model.eval()

predictions = []
for row in encoded_test_df:
    # Each row is a list of five encodings, one per option (A to E).
    # Stack them into tensors for input_ids and attention_mask.
    input_ids = torch.tensor([item['input_ids'] for item in row], dtype = torch.long).to(device)
    attention_mask = torch.tensor([item['attention_mask'] for item in row], dtype = torch.long).to(device)

    # Run inference one question at a time, without tracking gradients.
    with torch.no_grad():
        outputs = model(input_ids = input_ids, attention_mask = attention_mask)

    predictions.append(outputs.logits.detach().cpu().numpy())

    # Free GPU memory by deleting tensors.
    del input_ids, attention_mask, outputs

In [15]:
predictions[0:5]

[array([[ 1.7470471, -2.75272  ],
        [ 1.8250095, -2.7173905],
        [ 1.4937419, -2.5346873],
        [ 1.5739166, -2.7298005],
        [ 1.7606474, -3.0581536]], dtype=float32),
 array([[ 1.7203988, -3.1391954],
        [ 1.6464144, -2.781812 ],
        [ 1.8905977, -3.043336 ],
        [ 1.960199 , -3.2677853],
        [ 1.578221 , -2.9811356]], dtype=float32),
 array([[ 1.6226839, -2.752939 ],
        [ 1.3052988, -2.861816 ],
        [ 1.8684896, -2.730403 ],
        [ 1.5543457, -2.7988257],
        [ 1.3113433, -2.4241931]], dtype=float32),
 array([[ 1.5114813, -2.925351 ],
        [ 1.7303464, -2.8416784],
        [ 1.3561625, -3.0312107],
        [ 1.8990922, -2.547281 ],
        [ 1.5986308, -2.5669444]], dtype=float32),
 array([[ 1.5440885, -3.2125337],
        [ 1.6516278, -2.7359486],
        [ 1.7285665, -2.8988252],
        [ 1.6379677, -2.4674737],
        [ 1.4059753, -3.0843155]], dtype=float32)]

# Submission

In [16]:
import numpy as np

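# Convert the list of predictions to a numpy array of shape (num_questions, 5, 2).
predictions = np.array(predictions)

# Sort the five options of each question by descending logit (separately for each of
# the two logit columns) and keep the indices of the top 3.
top_three_indices = (-predictions).argsort(axis = 1)[:, :3].tolist()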

In [17]:
# Initialize an empty list to store the extracted values.
top_values = []

# Loop over all elements in the 'top_three_indices' list.
for i in range(len(top_three_indices)):
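    # Use a list comprehension to extract the second element (index 1) from each sublist.
    # Index 1 selects the ranking based on the second logit column, which corresponds
    # to label 1 (the option is the correct answer).
    # This will create a new list 'values' containing these three elements.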
    values = [top_three_indices[i][j][1] for j in range(3)]
    # Append this new list to our 'top_values' list.
    top_values.append(values)

# Print the resulting list of lists.
print(top_values)

[[2, 1, 3], [1, 4, 2], [4, 2, 0], [3, 4, 1], [3, 1, 2], [1, 3, 4], [4, 2, 1], [4, 0, 2], [3, 1, 2], [4, 1, 0], [0, 4, 1], [3, 2, 1], [4, 3, 2], [1, 2, 0], [2, 0, 4], [1, 0, 3], [0, 1, 3], [4, 0, 3], [0, 4, 1], [4, 1, 0], [1, 2, 0], [3, 1, 2], [3, 0, 2], [4, 1, 2], [0, 4, 3], [2, 3, 0], [2, 0, 1], [4, 3, 1], [0, 1, 2], [2, 4, 3], [0, 1, 3], [3, 4, 1], [3, 1, 0], [3, 2, 1], [3, 0, 1], [0, 4, 3], [1, 3, 2], [1, 3, 4], [4, 0, 2], [1, 2, 3], [0, 1, 2], [4, 2, 0], [1, 4, 2], [1, 3, 2], [2, 3, 4], [2, 0, 3], [1, 4, 2], [1, 2, 4], [0, 4, 1], [1, 3, 2], [1, 3, 4], [1, 3, 2], [0, 2, 4], [2, 1, 0], [1, 0, 4], [2, 0, 3], [2, 0, 3], [0, 2, 4], [3, 2, 1], [2, 1, 0], [1, 4, 2], [0, 1, 3], [3, 2, 1], [2, 4, 1], [1, 4, 3], [4, 1, 3], [3, 0, 2], [1, 4, 0], [4, 2, 3], [2, 4, 3], [1, 3, 2], [1, 2, 4], [3, 4, 2], [3, 4, 1], [3, 0, 2], [2, 0, 4], [0, 4, 2], [1, 0, 4], [2, 0, 1], [2, 3, 4], [0, 4, 3], [1, 0, 3], [3, 1, 0], [0, 2, 3], [4, 2, 0], [1, 0, 3], [0, 2, 4], [2, 4, 0], [1, 0, 2], [4, 1, 0], [3, 0, 2]

In [18]:
# Define a mapping from indices to labels.
index_to_label = {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'}

# Convert the top three indices to the required format (labels separated by spaces).
top_three_labels = [' '.join([index_to_label[idx] for idx in sublist]) for sublist in top_values]
top_three_labels[0:5]

['C B D', 'B E C', 'E C A', 'D E B', 'D B C']
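The three cells above can also be written more compactly. The sketch below is one possible equivalent, not the notebook's own code. It applies NumPy to the first two logit arrays printed earlier, selects the second logit column directly, and reproduces the first two label strings.

```python
import numpy as np

# First two logit arrays from the output above: one row per option (A to E), two columns.
logits = np.array([
    [[1.7470471, -2.75272], [1.8250095, -2.7173905], [1.4937419, -2.5346873],
     [1.5739166, -2.7298005], [1.7606474, -3.0581536]],
    [[1.7203988, -3.1391954], [1.6464144, -2.781812], [1.8905977, -3.043336],
     [1.960199, -3.2677853], [1.578221, -2.9811356]],
], dtype=np.float32)

# Score each option by its second logit (index 1), then rank options from high to low.
scores = logits[:, :, 1]                         # shape (num_questions, 5)
top3 = np.argsort(-scores, axis=1)[:, :3]        # indices of the three best options
labels = [" ".join("ABCDE"[i] for i in row) for row in top3]
print(labels)  # ['C B D', 'B E C']
```

Selecting the column before sorting avoids the intermediate nested list and the explicit loop.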

In [19]:
# Create a new DataFrame for the submission.
submission_df = pd.DataFrame({
    'id': test_df['id'],
    'prediction': top_three_labels
})

# Save the submission DataFrame to a .csv file.
submission_df.to_csv('submission.csv', index = False)

In [20]:
submission_df

Unnamed: 0,id,prediction
0,0,C B D
1,1,B E C
2,2,E C A
3,3,D E B
4,4,D B C
...,...,...
195,195,B A C
196,196,A B C
197,197,A C E
198,198,B C A


# Conclusion

**This notebook should give you a basic understanding of how to fine-tune an LLM on this dataset and use it to produce a submission.**

I am a medical doctor working on **artificial intelligence (AI) for medicine**. AI is now widely used in the medical field. In healthcare in particular, it is applied to tasks such as **image classification, object detection, semantic segmentation, GANs, text classification, etc**. **If you are interested in AI for medicine, please see my other notebooks.**