# NLP Projektarbeit WiSe 2024/25

Luca Kaesmann & Michael Zimmet

## 1. Einleitung

Eines der zentralen Themen im Bereich der Künstlichen Intelligenz ist die Verarbeitung von natürlicher Sprache, dabei handelt es sich um die Interaktion zwischen Computer und menschlicher Sprache. Ein Teilgebiet hiervon beschäftigt sich mit dem Question Answering, also dem Beantworten gezielter Fragen auf Basis eines gegebenen Textkontextes. Ein typischer Ansatz ist dabei ein Transformer-Modell mithilfe eines spezifischen Datensatzes zu optimieren. Dies wird als fine-tuning bezeichnet. Mit dem Aufkommen leistungsstarker Large Language Models, ist es jedoch nicht mehr notwendig Transformer-Modelle zu optimieren, da Large Language Models ohne Optimierung und mit geziehlten Prompt-Templates das Question Answering bewerkstelligen können. Ziel dieser Studienarbeit ist es 2 solcher Transformer-Modelle (DistilBERT und T5) eigens zu fine-tunen und diese mit den vortrainierten Large Language Models (ChatGPT und LLAMA 2) gegenüberzustellen. Durch die Gegenüberstellung der Modelle kann die Leistungsfähigkeit beider Ansätze verglichen werden. Dies sollte Rückschlüsse in Bezug auf die Genaugikeit, Flexibilität und Effizienz geben. Für das Question Answering soll der SQuAD Datensatz herangezogen werden und anhand des F1-Scores, sowie des Exact Match Scores eine Bewertung der Modelle abgegeben werden.

## 2. Projektplanung und Vorgehen

In diesem Abschnitt wird kurz das Vorgehen während der Studienarbeit beschrieben. Zunächst machte sich jeder mit der Aufgabenstellung vertraut und beschäftigte sich mit grundlegenden Aspekten wie der Beschaffung des Datensatzes und der Modelle. Da sich die Studienarbeit mit dem Gegenüberstellen von Transformer-Modellen und Large Language Models beschäftigt, erfolgte ebenso die Aufteilung der Arbeit nach den Modelltypen. Einerseits musste somit das fine-tuning für zwei Transformermodelle mit anschließender Evaluierung implementiert werden und andererseits die programmatische Erstellung der Prompts, welche an die zwei Large Language Models übergeben werden, mit anschließender Evaluierung implementiert werden. Bei der Evaluierung der Modelle wurde der Exact Match Score und der F1 Score herangezogen. Diese Aufgabenteilung ermöglichte das parallele und unabhängige bearbeiten der Aufgaben.

## 3. Implementierung

### Installing Packages

In [1]:
!pip3 install --user torch torchvision torchaudio
!pip3 install --user datasets


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m





### Import the needed packages

In [89]:
# Downloading, saving and loading the dataset
import datasets
import json

# Handling the models and data
from transformers import DistilBertForQuestionAnswering
from transformers import DistilBertTokenizerFast

from transformers import T5ForQuestionAnswering
from transformers import T5TokenizerFast

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

### Get the SQuAD Dataset

In [10]:
ds = datasets.load_dataset("rajpurkar/squad")

README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [16]:
ds.save_to_disk("./squad/squad_1_0")

Saving the dataset (0/1 shards):   0%|          | 0/87599 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/10570 [00:00<?, ? examples/s]

In [3]:
ds = datasets.load_from_disk("./squad/squad_1_0")

In [4]:
train_ds = ds["train"]
test_ds = ds["validation"]

### Data Preparation

#### Get the relevat information from the dataset

In [5]:
def prepare_data(data):
    contexts = []
    questions = []
    answers = []
    
    for article in data:
        contexts.append(article["context"])
        questions.append(article["question"])

        # the answer is saved in a list. The list cause issues in the future so it will be resolved into a string
        answer_dict = {"text": article["answers"]["text"][0], "answer_start": article["answers"]["answer_start"][0]}
        answers.append(answer_dict)

    return contexts, questions, answers

In [6]:
train_contexts, train_questions, train_answers = prepare_data(train_ds)
test_contexts, test_questions, test_answers = prepare_data(test_ds)

In [7]:
print(train_contexts[0])
print(train_questions[0])
print(train_answers[0])

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
{'text': 'Saint Bernadette Soubirous', 'answer_start': 515}


#### Get the end of the answer

In [8]:
def get_end_index(answers, contexts):
    for answer, context in zip(answers, contexts):
        true_answer = answer["text"]
        start_index = answer["answer_start"]
        end_index = start_index + len(true_answer)

        if context[start_index:end_index] == true_answer:
            answer["answer_end"] = end_index
        else:
            for n in [1,2]:
                if context[start_index-n:end_index-n] == true_answer:
                    answer["answer_start"] = start_index
                    answer["answer_end"] = end_index

In [9]:
get_end_index(train_answers, train_contexts)
get_end_index(test_answers, test_contexts)

In [10]:
print(train_answers[5:10])
print(test_answers[5:10])

[{'text': 'September 1876', 'answer_start': 248, 'answer_end': 262}, {'text': 'twice', 'answer_start': 441, 'answer_end': 446}, {'text': 'The Observer', 'answer_start': 598, 'answer_end': 610}, {'text': 'three', 'answer_start': 126, 'answer_end': 131}, {'text': '1987', 'answer_start': 908, 'answer_end': 912}]
[{'text': '"golden anniversary"', 'answer_start': 487, 'answer_end': 507}, {'text': 'February 7, 2016', 'answer_start': 334, 'answer_end': 350}, {'text': 'American Football Conference', 'answer_start': 133, 'answer_end': 161}, {'text': '"golden anniversary"', 'answer_start': 487, 'answer_end': 507}, {'text': 'American Football Conference', 'answer_start': 133, 'answer_end': 161}]


### Utility Operations and Funcitons

#### Setting up some variables

In [11]:
cuda5_device = torch.device("cuda:5")

#### Add token position

In [12]:
def add_token_postition(encodings, answers, tokenizer):
    start_positions = []
    end_positions = []

    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]["answer_start"]))
        end_positions.append(encodings.char_to_token(i, answers[i]["answer_end"]))

        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        go_back = 1
        while end_positions[-1] is None:
            end_positions[-1] = encodings.char_to_token(i, answers[i]["answer_end"]-go_back)
            go_back += 1

        encodings.update({
            "start_positions": start_positions,
            "end_positions": end_positions
        })

#### Create a Dataset

In [13]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encoding):
        self.encodings = encoding
        
    def __getitem__(self, index):
        return {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
        
    def __len__(self):
        return len(self.encodings.input_ids)

#### Loop to evaluate a dataset with a given model and tokenizer

In [55]:
# this function takes a tokenizer and a model and gets the Accuracy and F1-Score of the give dataset
f1_scores = []
em_scores = []

def evaluate_dataset(model, tokenizer, dataset, contexts, questions):
    for i in tqdm(range(0, len(dataset), 1)):
        # get the gold answer tokens
        gold_start = dataset[i]["start_positions"].item()
        gold_end = dataset[i]["end_positions"].item()

        gold_answer = dataset[i]["input_ids"][gold_start:gold_end]
        # get the prediction
        with torch.no_grad():
            model_inputs = tokenizer(contexts[i], questions[i], return_tensors="pt", padding=True).to(cuda5_device)
            input_ids = model_inputs["input_ids"].to(cuda5_device)
            attention_mask = model_inputs["attention_mask"].to(cuda5_device)

            if input_ids.size()[-1] > 512:
                continue
            
            outputs = model(input_ids, attention_mask=attention_mask)

            start_pred = torch.argmax(outputs["start_logits"], dim=1)
            end_pred = torch.argmax(outputs["end_logits"], dim=1)
            pred_answer = input_ids[0][start_pred.item():end_pred.item()]

        em_scores.append(gold_answer.tolist() == pred_answer.tolist())

        # calculate f1-score and acc:
        f1_score = compute_f1(gold_answer, pred_answer)
        f1_scores.append(f1_score)

    return f1_scores, em_scores


#### Get an answer from a given model

In [60]:
def get_answer(context, question, model, tokenizer):
    inputs = tokenizer(context, question, return_tensors="pt", padding=True).to(cuda5_device)
    with torch.no_grad():
        input_ids = inputs["input_ids"].to(cuda5_device)
        attention_mask = inputs["attention_mask"].to(cuda5_device)
        
        outputs = model(input_ids, attention_mask=attention_mask)

        start_pred = torch.argmax(outputs["start_logits"], dim=1)
        end_pred = torch.argmax(outputs["end_logits"], dim=1)

        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_pred:end_pred]))

    return answer

#### Compute the f1 score of a gold answer and a prediction

In [21]:
def compute_f1(gold, pred):
    gold = gold.tolist()
    pred = pred.tolist()
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred) == 0 or len(gold) == 0:
        return int(pred == gold)

    common_tokens = set(pred) & set(gold)
  
    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0
  
    prec = len(common_tokens) / len(pred)
    rec = len(common_tokens) / len(gold)
  
    return round(2 * (prec * rec) / (prec + rec), 2)

### DistilBERT for Question Answering

#### Training

In [14]:
# plain model and tokenizer from pretrained
distilbert_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
distilbert_model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
distilbert_train_encodings = distilbert_tokenizer(train_contexts, train_questions, truncation=True, padding=True)
distilbert_test_encodings = distilbert_tokenizer(test_contexts, test_questions, truncation=True, padding=True)

In [16]:
distilbert_train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

In [17]:
# Add start and end token of the answer
add_token_postition(distilbert_train_encodings, train_answers, distilbert_tokenizer)
add_token_postition(distilbert_test_encodings, test_answers, distilbert_tokenizer)

In [18]:
# Create a dataset object
distilbert_train_dataset = SquadDataset(distilbert_train_encodings)
distilbert_test_dataset = SquadDataset(distilbert_test_encodings)

In [22]:
# move the model to the device and set an opimizer
distilbert_model.to(cuda5_device)
distilbert_model.train()
distitlbert_optim = torch.optim.AdamW(distilbert_model.parameters(), lr=5e-5)

In [27]:
# Create a trainloader Object
train_loader = DataLoader(distilbert_train_dataset, batch_size=64, shuffle=True)

In [28]:
distilbert_train_encodings.keys()

dict_keys(['input_ids', 'attention_mask', 'start_positions', 'end_positions'])

In [29]:
for epoch in range(3):
    # this is for the visualisation of the progress with tqdm
    loop = tqdm(train_loader)
    for batch in loop:
        distitlbert_optim.zero_grad()

        input_ids = batch["input_ids"].to(cuda5_device)
        attention_mask = batch["attention_mask"].to(cuda5_device)
        start_positions = batch["start_positions"].to(cuda5_device)
        end_positions = batch["end_positions"].to(cuda5_device)

        outputs = distilbert_model(input_ids, 
                        attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)

        loss = outputs[0]
        loss.backward()
        distitlbert_optim.step()

        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss = loss.item())

Epoch 0: 100%|██████████| 1369/1369 [14:19<00:00,  1.59it/s, loss=1.58] 
Epoch 1: 100%|██████████| 1369/1369 [14:21<00:00,  1.59it/s, loss=0.885]
Epoch 2: 100%|██████████| 1369/1369 [14:19<00:00,  1.59it/s, loss=0.76] 


In [30]:
# Save the model
distilbert_model_path = "model/distilbert_kaesmann_zimmet"
distilbert_model.save_pretrained(distilbert_model_path)
distilbert_tokenizer.save_pretrained(distilbert_model_path)

('model/distilbert_kaesmann_zimmet/tokenizer_config.json',
 'model/distilbert_kaesmann_zimmet/special_tokens_map.json',
 'model/distilbert_kaesmann_zimmet/vocab.txt',
 'model/distilbert_kaesmann_zimmet/added_tokens.json',
 'model/distilbert_kaesmann_zimmet/tokenizer.json')

#### Evaluate the model

In [72]:
# Load the model and tokenizer from the disk
distilbert_trained_model = DistilBertForQuestionAnswering.from_pretrained("model/distilbert_kaesmann_zimmet/", local_files_only=True)
distilbert_trained_tokenizer = DistilBertTokenizerFast.from_pretrained("model/distilbert_kaesmann_zimmet/", local_files_only=True)

In [73]:
distilbert_trained_model.eval()
distilbert_trained_model.to(cuda5_device)

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
     

In [141]:
distilbert_f1_scores, distilbert_em_scores = evaluate_dataset(distilbert_trained_model, distilbert_trained_tokenizer, distilbert_test_dataset, test_contexts, test_questions)

100%|██████████| 10570/10570 [00:42<00:00, 248.07it/s]


In [145]:
print(sum(distilbert_f1_scores) / len(distilbert_f1_scores))
print(sum(distilbert_em_scores) / len(distilbert_em_scores))

0.7132413924291433
0.5715236827087693


### T5 for Question Answering

#### Training

In [11]:
# Pretrained models
t5_tokenizer = T5TokenizerFast.from_pretrained("t5-base")
t5_model = T5ForQuestionAnswering.from_pretrained("t5-base")

In [12]:
t5_train_encodings = t5_tokenizer(train_contexts, train_questions, padding=True)
t5_test_encodings = t5_tokenizer(test_contexts, test_questions, padding=True)

In [13]:
t5_train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

In [14]:
# Add the token positions of the answer
add_token_postition(t5_train_encodings, train_answers, t5_tokenizer)
add_token_postition(t5_test_encodings, test_answers, t5_tokenizer)

In [17]:
# Create Dataset objects
t5_train_dataset = SquadDataset(t5_train_encodings)
t5_test_dataset = SquadDataset(t5_test_encodings)

In [23]:
# move the model to the device and set a optimizer
t5_model.to(cuda5_device)
t5_model.train()
t5_optim = torch.optim.AdamW(t5_model.parameters(), lr=5e-5)

In [40]:
# create a trainloader object
t5_train_loader = torch.utils.data.DataLoader(t5_train_dataset, batch_size=4, shuffle=True)

In [None]:
for epoch in range(3):
    # this is for the visualisation of the progress with tqdm
    loop = tqdm(t5_train_loader)
    for batch in loop:
        t5_optim.zero_grad()

        input_ids = batch["input_ids"].to(cuda5_device)
        attention_mask = batch["attention_mask"].to(cuda5_device)
        start_positions = batch["start_positions"].to(cuda5_device)
        end_positions = batch["end_positions"].to(cuda5_device)

        outputs = t5_model(input_ids, 
                        attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)

        loss = outputs[0]
        loss.backward()
        t5_optim.step()

        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss = loss.item())

Epoch 0:  41%|████      | 8881/21900 [1:31:19<2:13:27,  1.63it/s, loss=1.2]   

In [None]:
t5_model_path = "model/t5_kaesmann_zimmet"
t5_model.save_pretrained(t5_model_path)
t5_tokenizer.save_pretrained(t5_model_path)

#### Evaluation

In [15]:
# Load the fine-tuned model from the disk
t5_trained_model = T5ForQuestionAnswering.from_pretrained("model/t5_kaesmann_zimmet/", local_files_only=True)
t5_trained_tokenizer = T5TokenizerFast.from_pretrained("model/t5_kaesmann_zimmet/", local_files_only=True)

In [16]:
# Set to eval mode and move to device
t5_trained_model.eval()
t5_trained_model.to(cuda5_device)

T5ForQuestionAnswering(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dropout

In [19]:
# create the t5 encodings, add the token positions of the answers and create a dataset object
t5_test_encodings = t5_trained_tokenizer(test_contexts, test_questions, padding=True)
add_token_postition(t5_test_encodings, test_answers, t5_trained_tokenizer)
t5_test_dataset = SquadDataset(t5_test_encodings)

In [22]:
t5_f1_scores, t5_em_scores = evaluate_dataset(t5_trained_model, t5_trained_tokenizer, t5_test_dataset, test_contexts, test_questions)

100%|██████████| 10570/10570 [04:40<00:00, 37.64it/s]


In [23]:
print(sum(t5_f1_scores) / len(t5_f1_scores))
print(sum(t5_em_scores) / len(t5_em_scores))

0.7933860970725666
0.6476926010678871


#### Check the model before it was finetuned

This is just a validation of the training

In [24]:
t5_plain_model = T5ForQuestionAnswering.from_pretrained("t5-base")
t5_plain_tokenizer = T5TokenizerFast.from_pretrained("t5-base")

Some weights of T5ForQuestionAnswering were not initialized from the model checkpoint at t5-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
t5_plain_model.eval()
t5_plain_model.to(cuda5_device)

T5ForQuestionAnswering(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dropout

In [52]:
# create the t5 encodings, add the token positions of the answers and create a dataset object
t5_test_encodings = t5_plain_tokenizer(test_contexts, test_questions, padding=True)
add_token_postition(t5_test_encodings, test_answers, t5_plain_tokenizer)
t5_test_dataset = SquadDataset(t5_test_encodings)

In [53]:
t5_plain_f1_scores, t5_plain_em_scores = evaluate_dataset(t5_plain_model, t5_plain_tokenizer, t5_test_dataset, test_contexts, test_questions)

 39%|███▉      | 4154/10570 [01:52<01:53, 56.62it/s]

contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing


 40%|███▉      | 4189/10570 [01:53<02:05, 50.69it/s]

contunuing
contunuing
contunuing
contunuing
contunuing
contunuing


 40%|███▉      | 4203/10570 [01:53<02:02, 51.82it/s]

contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing


 41%|████      | 4284/10570 [01:55<01:27, 72.11it/s]

contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing


 41%|████      | 4299/10570 [01:55<01:35, 65.54it/s]

contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing
contunuing


 44%|████▎     | 4603/10570 [02:03<02:03, 48.34it/s]

contunuing
contunuing
contunuing
contunuing
contunuing


 46%|████▌     | 4861/10570 [02:10<02:07, 44.81it/s]

contunuing
contunuing
contunuing
contunuing


 52%|█████▏    | 5454/10570 [02:26<02:11, 38.95it/s]

contunuing


 54%|█████▎    | 5663/10570 [02:31<01:51, 43.85it/s]

contunuing
contunuing
contunuing
contunuing
contunuing


 59%|█████▉    | 6230/10570 [02:47<01:30, 48.08it/s]

contunuing
contunuing
contunuing
contunuing
contunuing


 60%|██████    | 6384/10570 [02:51<01:32, 45.07it/s]

contunuing
contunuing
contunuing
contunuing
contunuing


 69%|██████▉   | 7339/10570 [03:17<01:26, 37.20it/s]

contunuing


100%|██████████| 10570/10570 [04:46<00:00, 36.87it/s]


In [54]:
print(sum(t5_plain_f1_scores) / len(t5_plain_f1_scores))
print(sum(t5_plain_em_scores) / len(t5_plain_em_scores))

0.09862863966197859
0.04473224947184716


##### Distilbert on TD-B

### TD-B dataset

#### Get data

In [76]:
# Open the text file for reading
with open('TD_B.txt', 'r') as file:
    # Read the content of the file
    content = file.read()

# Convert the string to a Python list
ids_list = eval(content)  # Safely evaluate the string as a list

In [77]:
td_b = []

for entry in test_ds:
    if entry["id"] in ids_list:
        td_b.append(entry)

#### Prepare Data

In [78]:
td_b_contexts, td_b_questions, td_b_answers = prepare_data(td_b)
get_end_index(td_b_answers, td_b_contexts)

#### TD-B mit Distilbert

In [40]:
distilbert_td_b_encodings = distilbert_trained_tokenizer(td_b_contexts, td_b_questions)
add_token_postition(distilbert_td_b_encodings, td_b_answers, distilbert_trained_tokenizer)
distilbert_td_b_dataset = SquadDataset(distilbert_td_b_encodings)

In [50]:
distilbert_td_b_f1_scores, distilbert_td_b_em_scores = evaluate_dataset(distilbert_trained_model, distilbert_trained_tokenizer, distilbert_td_b_dataset, td_b_contexts, td_b_questions)

100%|██████████| 100/100 [00:00<00:00, 198.52it/s]


In [51]:
print(sum(distilbert_td_b_f1_scores) / len(distilbert_td_b_f1_scores))
print(sum(distilbert_td_b_em_scores) / len(distilbert_td_b_em_scores))

0.7687500000000003
0.59


#### TD-B mit T5

In [33]:
t5_td_b_encodings = t5_trained_tokenizer(td_b_contexts, td_b_questions, padding=True)
add_token_postition(t5_td_b_encodings, td_b_answers, t5_trained_tokenizer)
t5_tdb_dataset = SquadDataset(t5_td_b_encodings)

In [48]:
t5_td_b_f1_scores, t5_td_b_em_scores = evaluate_dataset(t5_trained_model, t5_trained_tokenizer, t5_tdb_dataset, td_b_contexts, td_b_questions)

100%|██████████| 100/100 [00:02<00:00, 35.98it/s]


In [49]:
print(sum(t5_td_b_f1_scores) / len(t5_td_b_f1_scores))
print(sum(t5_td_b_em_scores) / len(t5_td_b_em_scores))

0.7791
0.59


#### Get answers on TD-B with Distilbert

In [79]:
distilbert_answers_td_b = {}

for i in range (0, len(td_b_questions), 1):
    context = td_b_contexts[i]
    question = td_b_questions[i]
    gold_answer = td_b_answers[i]["text"]
    pred_answer = get_answer(context, question, distilbert_trained_model, distilbert_trained_tokenizer)
    distilbert_answers_td_b[i] = {
        "context": context,
        "question": question,
        "gold_answer": gold_answer,
        "pred_answer": pred_answer
    }

#### Get answers in TD-B with T5

In [82]:
t5_answers_td_b = {}

for i in range (0, len(td_b_questions), 1):
    context = td_b_contexts[i]
    question = td_b_questions[i]
    gold_answer = td_b_answers[i]["text"]
    pred_answer = get_answer(context, question, t5_trained_model, t5_trained_tokenizer)
    t5_answers_td_b[i] = {
        "context": context,
        "question": question,
        "gold_answer": gold_answer,
        "pred_answer": pred_answer
    }

In [84]:
t5_answers_td_b[1]

{'context': 'The Panthers finished the regular season with a 15–1 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP). They defeated the Arizona Cardinals 49–15 in the NFC Championship Game and advanced to their second Super Bowl appearance since the franchise was founded in 1995. The Broncos finished the regular season with a 12–4 record, and denied the New England Patriots a chance to defend their title from Super Bowl XLIX by defeating them 20–18 in the AFC Championship Game. They joined the Patriots, Dallas Cowboys, and Pittsburgh Steelers as one of four teams that have made eight appearances in the Super Bowl.',
 'question': 'Who did the Panthers beat in the NFC Championship Game?',
 'gold_answer': 'Arizona Cardinals',
 'pred_answer': 'Arizona Cardinal'}

In [86]:
# Save the dictionaries as json files
with open("distilbert_answers.json", "w") as outfile: 
    json.dump(distilbert_answers_td_b, outfile)

with open("t5_answers.json", "w") as outfile: 
    json.dump(t5_answers_td_b, outfile)

## Large Language Models for Question Answering

The complete process for question answering using the large language models (ChatGPT, LLAMA) is described below.
Due to the pricing of ChatGPT, the prompts had to be transferred manually via the web interface.

Important files:
1. prompts_1.txt, prompts_2.txt -> contain the prepared prompts for the models
2. answers_1_gpt.txt, answers_2_gpt.txt -> contain the answers for ChatGPT 
3. answers_1_lama.txt, answers_2_lama.txt -> contain the answers for LLAMA

### Get SQuAD Dataset

In [23]:
import pandas as pd
from typing import Tuple

def download_squad_dataset() -> None:
    splits = {'train': 'plain_text/train-00000-of-00001.parquet',
              'validation': 'plain_text/validation-00000-of-00001.parquet'}

    test_df = pd.read_parquet("hf://datasets/rajpurkar/squad/" + splits["train"])
    validation_df = pd.read_parquet("hf://datasets/rajpurkar/squad/" + splits["validation"])

    test_df.to_json('squad_test_data.json')
    validation_df.to_json('squad_validation_data.json')

In [24]:
def load_squad_dataset() -> Tuple[pd.DataFrame, pd.DataFrame]:
    test_df = pd.read_json('squad_test_data.json')
    val_df = pd.read_json('squad_validation_data.json')
    return test_df, val_df

In [25]:
def create_random_batch(n: int, dataframe: pd.DataFrame) -> pd.DataFrame:
    #set random_state to always get the same dataset
    return dataframe.sample(n=n, random_state=42)

In [26]:
# only use the download_squad_dataset() function below once at the beginning
#download_squad_dataset()
test_df, validation_df = load_squad_dataset()

# original TD-B dataset
validation_batch = create_random_batch(100, validation_df)

# This ids are the ids used in the below code block from TD_B.txt
ids = validation_batch["id"].tolist()

### Prompt Generation

In [27]:
import pandas as pd
from typing import List

In [28]:
def generate_prompts(dataframe: pd.DataFrame) -> List[str]:
    """
    Generates a List based on a prompt-template out of the given dataframe
    :param dataframe: squad dataset as pandas dataframe
    :return: List of prepared prompts
    """
    
    prompt_template = ('Can you answer me the following question "%1" based on the following context "%2"? Please structure your answer always in the same format like '
                       'Question ":" Answer". Dont output long Instruction just the answer as short as possible. If its possible only with one word/phrase')

    prompts = []
    for i in dataframe.itertuples():
        #prepare question and context to avoid multiline prompts
        question = i.question.strip().replace('\n', ' ')
        context = i.context.strip().replace('\n', ' ')
        
        # add question and context to the prompt-template
        prompt = prompt_template.replace('%1', question).replace('%2', context)
        prompts.append(prompt)

    return prompts

In [29]:
def write_data_to_file(data: List[str], file_name:str) -> None:
    with open(file_name, "w") as file:
        for i in data:
            file.write(i + "\n")

In [30]:
prompts = generate_prompts(validation_batch)
write_data_to_file(prompts, 'LLM_Results/second_atempt/prompts_2.txt')

### Prompt Processing for LLAMA

In [31]:
def load_prompts(file_name: str) -> List[str]:
    """
    Reads the prompts from a file and returns them as a list
    :param file_name: prompts file
    :return: List of prompts
    """
    with open(file_name, 'r') as f:
        prompts = f.readlines()

    return prompts

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = load_prompts('prompts_2.txt')

answers = []
count = 1
print('Start')
for i in prompts:
    # Tokenize the prompt into a for the model suitable format ('pt' = pytorch tensor)
    inputs = tokenizer(i, return_tensors="pt")

    #Generate the answer
    outputs = model.generate(inputs['input_ids'],
                             num_return_sequences=1,                # Limit the Answers to one
                             pad_token_id=tokenizer.eos_token_id)   # Use end of sequence token as padding
    
    # Decode back to readable text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answers.append(response.strip().replace('\n', ''))
    
    # Monitor the progress
    print(count, ' = ', response)
    count += 1

    write_data_to_file(answers, 'answers_2_lama.txt')

When using the Large Language Models LLAMA, a massive amount of computing time was required during the first try.  
The model required more than 5 hours to process the 100 prepared prompts from the TD-B data set. Likewise, the  
quality of the answers left a lot to be desired, as the questions was not always answered using the attached context.   
However, after the prompt template was revised again, this problem was reduced to a minimum. In general,  
there were no serious problems in connection with the Large Language Models.

### Evaluation for both Models

In [32]:
from pathlib import Path
from datasets import load_dataset
import datasets

In [33]:
def load_answers(file: Path) -> List[str]:
    """
    Opens a file and reads the answers line by line into a list
    :param file: answers file
    :return: List containing the answers
    """
    with file.open('r') as f:
        answers = f.readlines()

    return answers

In [34]:
def calc_exact_match_score(answers_file: Path, dataset: datasets) -> float:
    """
    Calculates the exact match score over a while dataset
    :param answers_file: File containing the answers (one answer per line)
    :param dataset: original dataset for the real answers
    :return: exact match score
    """
    td_b_answers = load_answers(answers_file)

    exact_matches = 0
    
    # iterate over both lists and compare the answers
    for i, j in zip(dataset['answers'], td_b_answers):
        answers = i['text']
        
        #every answer has the following format "Answer: answer"
        answer_model = j.split(':', maxsplit=1)[1].strip()

        if answer_model.lower() in (answer.lower() for answer in answers):
            exact_matches += 1

    exact_match_score = exact_matches / len(td_b)
    return exact_match_score

In [35]:
def calc_f1_score(answers_file: Path, dataset: datasets) -> float:
    """
    Calculates the f1 score over a whole dataset
    :param answers_file: File containing the answers (one answer per line)
    :param dataset: original dataset for the real answers
    :return: f1 score
    """
    td_b_answers = load_answers(answers_file)
    f1_scores = []

    for i, j in zip(dataset['answers'], td_b_answers):
        answers = [answer.lower() for answer in i['text']]
        answer_model = j.split(':', maxsplit=1)[1].strip().lower()

        f1 = 0
        # calc the highest f1 score for each possible answer (every answer has various choices)
        for item in answers:
            f1 = max(f1, calc_single_f1_score(answer_model, item))
        f1_scores.append(f1)
    
    # calculate the average f1 score over all answers
    final_f1_score = sum(f1_scores) / len(f1_scores)
    return final_f1_score

The F1-Score is a metric that combines the precision and recall of a model. It is calculated by the following formular:  
   
<div style="text-align: center">
$Precision = \frac{TP}{TP + FP}$
<br><br>
$Recall = \frac{TP}{TP + FN}$
<br><br>
$F1 = 2 \cdot \frac{\text{Präzision} \cdot \text{Recall}}{\text{Präzision} + \text{Recall}}$
</div>
<br>

- TP (True Positives): The number of equal word in the predicted answer and the real answer  
- FP (False Positives): The number of words in the predicted answer that are not in the real answer  
- FN (False Negatives): The number of words in the real answer that are missing in the predicted answer  

In [36]:
def calc_single_f1_score(predicted: str, answer: str) -> float:
    """
    Calculates the f1 score for a given question
    :param predicted: Answer predicted by the model
    :param answer: Real answer from the dataset
    :return: f1 score
    """
    predicted_tokens = predicted.split()
    tokens = answer.split()

    # calculate basic values
    # To find the intersection of two lists, we first convert the lists to sets and link them with the & operator
    common_tokens = set(predicted_tokens) & set(tokens)
    tp = len(common_tokens)             # True Positives
    fp = len(predicted_tokens) - tp     # False Positives
    fn = len(tokens) - tp               # False Negatives

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0

    if precision + recall == 0:
        return 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

Due to problems with the answers column in the dataframe (column was a whole string which wasn't that easy to convert to a suitable dictionary)
it was more practical to work with datasets. Important to mention hereby is that the answers for each model were generated in a specific order
to keep this order I used list comprehension. 

In [37]:
validation_dataset = load_dataset("rajpurkar/squad")['validation']
td_b = validation_dataset.select(
    [validation_dataset['id'].index(id_) for id_ in ids if id_ in validation_dataset['id']])

exact_match_gpt = calc_exact_match_score(Path('LLM_Results/second_atempt/answers_2_gpt.txt'), td_b)
exact_match_llama = calc_exact_match_score(Path('LLM_Results/second_atempt/answers_2_lama.txt'), td_b)

f1_gpt = calc_f1_score(Path('LLM_Results/second_atempt/answers_2_gpt.txt'), td_b)
f1_llama = calc_f1_score(Path('LLM_Results/second_atempt/answers_2_lama.txt'), td_b)

print('Exact Match GPT: ', exact_match_gpt)
print('Exact Match LLAMA: ', exact_match_llama)
print()
print('F1 GPT: ', f1_gpt)
print('F1 LLAMA: ', f1_llama)

Exact Match GPT:  0.81
Exact Match LLAMA:  0.48

F1 GPT:  0.9197234432234431
F1 LLAMA:  0.5953380394115687


## 4. Ergebnisse

1. Fine-tuned Modelle auf TD-A und TD-B  
    Die fine-tuned Modelle zeigen unterschiedliche Ergebnisse, je nachdem, ob sie auf TD-A oder TD-B bewertet werden.
    Die Ergebnisse sind in den nachstehenden Tabellen aufgeführt. Auf den größeren Datensatz TD-A performed T5 leicht
    besser als DistilBERT, allerdings erreichen beide keine herausragend guten Ergebnisse.
     
2. Ist TD-B groß genug, um Rückschlüsse auf das Verhalten bei TD-A zu ziehen?  
    Ja, TD-B lässt Rückschlüsse auf das Verhalten der Modelle auf TD-A zu, da die Datenquellen identisch sind,  
    aber die geringe Größe von TD-B bedeutet, dass die Ergebnisse nicht die gesamte Komplexität und Vielfalt von TD-A abdecken. Daher kann eine  
    Analyse auf TD-B einige der Herausforderungen von TD-A nicht vollständig widerspiegeln.

### Exact Match Score

<table border="1" style="border-collapse: collapse; text-align: center; float: left">
  <tr>
    <th>Datensatz</th>
    <th>DistilBERT</th>
    <th>T5</th>
    <th>Chat GPT</th>
    <th>LLAMA</th>
  </tr>
  <tr>
    <td>TD-A</td>
    <td>57,3%</td>
    <td>64,8%</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>TD-B</td>
    <td>59%</td>
    <td>59%</td>
    <td>81%</td>
    <td>48%</td>
  </tr>
</table>


### F1-Score

<table border="1" style="border-collapse: collapse; text-align: center; float: left">
  <tr>
    <th>Datensatz</th>
    <th>DistilBERT</th>
    <th>T5</th>
    <th>Chat GPT</th>
    <th>LLAMA</th>
  </tr>
  <tr>
    <td>TD-A</td>
    <td>71,3%</td>
    <td>79,3%</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>TD-B</td>
    <td>76,9%</td>
    <td>77,9%</td>
    <td>91%</td>
    <td>59%</td>
  </tr>
</table>


### Betrachtung verschiedener Antworten:
#### What does increased oxygen concentrations in the patient's lungs displace:  

 In 1979, the Soviet Union deployed its 40th Army into Afghanistan, attempting to suppress an Islamic rebellion 
 against an allied Marxist regime in the Afghan Civil War. The conflict, pitting indigenous impoverished Muslims (mujahideen)
 against an anti-religious superpower, galvanized thousands of Muslims around the world to send aid and sometimes to go
 themselves to fight for their faith. Leading this pan-Islamic effort was Palestinian sheikh Abdullah Yusuf Azzam. 
 While the military effectiveness of these "Afghan Arabs" was marginal, an estimated 16,000 to 35,000 Muslim volunteers 
 came from around the world came to fight in Afghanistan.

 Distilbert:     carbon monoxide from the heme group of hemoglobin  
 T5:             carbon monoxid  
 ChatGPT:        What does increased oxygen concentrations in the patient's lungs displace? : Carbon monoxide  
 LLAMA:          Answer: Displace  
 
 Gold Answer:   carbon monoxide  
 
 Kommentar: Alle modelle liefern gute Ergebnisse außer LLAMA, T5 hat allerdings "monoxid" statt "monoxide" zurückgegeben.     
   
  
#### How many Muslims came from around the world to fight in Afghanistan?  
    
Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient 
and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the 'bends') are 
sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme 
group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial 
pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles 
of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of 
the treatment.
  
Distilbert:     16, 000 to 35,  
T5:             16,000 to 3  
ChatGPT:        How many Muslims came from around the world to fight in Afghanistan? : 16,000 to 35,000  
LLAMA:          Answer: 16,000 to 35,000   

Gold Answer:    16,000 to 35,000

Kommentar: Die LLMs geben die richtige Antwort zurück. Die Transformer Modelle scheinen Probleme mit Zahlen zu haben.
  
#### What differs about secondary chloroplasts' membranes?  

 While primary chloroplasts have a double membrane from their cyanobacterial ancestor, secondary chloroplasts have additional 
 membranes outside of the original two, as a result of the secondary endosymbiotic event, when a nonphotosynthetic eukaryote engulfed 
 a chloroplast-containing alga but failed to digest it—much like the cyanobacterium at the beginning of this story. The engulfed alga 
 was broken down, leaving only its chloroplast, and sometimes its cell membrane and nucleus, forming a chloroplast with three or four 
 membranes—the two cyanobacterial membranes, sometimes the eaten alga's cell membrane, and the phagosomal vacuole from the host's cell 
 membrane.

 Distilbert:     double membrane from their cyanobacterial ancestor, secondary chloroplasts have additional membranes outside of the original two  
 T5:             additional membranes outside of the original two  
 ChatGPT:        What differs about secondary chloroplasts' membranes? : They have additional membranes outside the original two.  
 LLAMA:          Answer: additional   

 Gold Answer:   additional membranes outside of the original two
 
 Kommentar: Bis auf LLAMA geben alle Modelle korrekte Antworten zurück.

## Probleme und Diskussion

### Kapazität der GPUs

Während dem Training des T5-Modells mit dem pretrained-Modell "t5-base" konnte nur eine sehr kleine Batch-Size gewählt werden. Sobald die Batch-Size auf 8 oder größer gestellt wurde gab es Probleme mit dem V-RAM der GPUs. Überraschenderweise lief dieser bei einer zu großen Batch-Size voll und das Training wurde abgebrochen. Nach dem das Training abgebrochen wurde, lagen die "Reste" der Tensoren noch im V-RAM der GPU. Eine Lösung, um den V-RAM aufzuräumen war der Restart des Python-Kernels.  

### Rechenleistung for LLAMA Modell

Bei der Benutzung der Large Language Models LLAMA wurde beim erstmaligen ab-
setzen der Prompts eine massive Rechenzeit benötigt. Das Modell benötigte für die
Bearbeitung der 100 aufbereiteten Prompts aus dem TD-B Datensatz mehr als 5 Stunden. 
Ebenso lies die Qualität der Antworten immer wieder zu wünschen übrig, da die
Frage nicht immer mithilfe des angefügten Kontextes beantwortet wurde. Nachdem
das Prompt-Template nochmals überarbeitet wurde, konnte dieses Problem jedoch auf
ein Minimum reduziert werden. Im Allgemeinen ergaben sich im Zusammenhang mit
den Large Language Models aber keine Schwerwiegenden Probleme.

### Diskussion

Besonders für das Modell T5 gibt es weiter pretrained-Modelle neben dem "t5-base", welche möglicherweise eine bessere Performance liefern. Im Rahmen dieser Arbeit wurde allerding nur das vorhing genannte pretrained-Modell verwendet. Es sind einerseits Modelle in weiteren Größen vorhanden, welche unterschiedlich viel Zeit für das Training und die Evaluierung benötigen. Andererseits gibt es von den Modellen. der verschiedenen Größen auch weiterentwickelte Versionen, mit unterschiedlichen Verbesserungen in der Architektur der Modelle. Um die Performance des T5-Modells zu verbessert können verschiedene Modelle getestet werden. Diese versuche ziehen auch eine Anpassung der Hyperparameter für das Training wie die Anzahl der Epochen, Auswahl des Optimizers oder der Learning-Rate nach sich