# Proyek UAS NLP 
### Anggota Kelompok:
* Christopher Nathaniel Tanamas // 222200153
* Elaine Evelyn // 222102311
* Grace Calista Lim // 222102176

# Masalah
Dokter seringkali membutuhkan akses cepat dan tepat terhadap informasi yang relevan dalam rekam medis pasien untuk menjawab pertanyaan medis yang muncul selama konsultasi. Namun, menelusuri catatan medis secara manual bisa memakan waktu dan tidak efisien. Oleh karena itu, kami menawarkan solusi berupa model NLP Question Answering yang dapat membantu dokter menjawab pertanyaan berbasis teks secara akurat dan relevan, dengan memanfaatkan konteks dari rekam medis pasien yang sudah tersedia.

Model Question Answering ini dirancang untuk memberikan jawaban atas pertanyaan medis berdasarkan informasi dalam rekam medis pasien, seperti riwayat konsumsi obat, diagnosis sebelumnya, hingga prosedur medis yang telah dijalani. Dalam proyek ini, kami menggunakan dataset EMRQA-MSquad yang berfokus pada pertanyaan-pertanyaan medis umum. Model yang akan kami eksplorasi meliputi GRU, BERT, dan RoBERTa.

# Import All Libraries

In [26]:
pip install transformers datasets evaluate



In [None]:
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering, default_data_collator, TrainingArguments, Trainer, pipeline)
from evaluate import load
import numpy as np
from tqdm.auto import tqdm
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.optim import AdamW
from torch.nn import CrossEntropyLoss

# Pilihan Penggunaan GPU jika Tersedia

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Load Dataset
Kami menggunakan dataset EMRQA-MSquad yang tersedia di Hugging Face. Dataset ini berisi kumpulan pertanyaan dan jawaban yang berfokus pada topik-topik medis. Data ini mengandung pertanyaan medis yang sering diajukan oleh pasien atau tenaga medis, bersama dengan jawaban yang relevan. Dataset ini sangat cocok untuk digunakan dalam pelatihan model QnA medis, karena mencakup berbagai pertanyaan yang berfokus pada penyakit, gejala, prosedur medis, dan pengobatan. Dataset ini dapat diakses melalui https://huggingface.co/datasets/Eladio/emrqa-msquad/viewer/default/train?p=2&views%5B%5D=train


In [28]:
dataset = load_dataset("Eladio/emrqa-msquad")

In [29]:
dataset

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answers'],
        num_rows: 130956
    })
    validation: Dataset({
        features: ['context', 'question', 'answers'],
        num_rows: 32739
    })
})

## Train test split
Train test split dilakukan dengan menggunakan dataset yang telah disediakan. Dataset berisikan train data dan validation data. Maka, kami menggunakan train data untuk training dan validation data untuk testing

In [30]:
raw_train_dataset = dataset["train"]
raw_test_dataset = dataset["validation"]

Menggunakan sebagian dataset untuk training agar tidak terlalu memakan banyak waktu

In [31]:
selected_train = raw_train_dataset.select(range(5000)) # gunakan 5000 data
selected_test = raw_test_dataset.select(range(1000)) # gunakan 1000 data

Kami menambahkan ID pada setiap data untuk memudahkan proses evaluasi model

In [32]:
def add_id(example, idx):
    example["id"] = str(idx)
    return example

In [33]:
train_dataset = selected_train.map(add_id, with_indices=True)
test_dataset = selected_test.map(add_id, with_indices=True)

# Pre-Processing Data

### Siapkan tokenizer untuk setiap model
**NOTE**: bert_tokenizer akan digunakan juga sebagai tokenizer data GRU

In [34]:
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [35]:
roberta_tokenizer = AutoTokenizer.from_pretrained("deepset/tinyroberta-squad2")

### Fungsi Pre-processing

Pre-processing digunakan untuk tokenisasi data, padding, dan mencari posisi jawaban setelah ditokenisasi

In [None]:
def preprocess_function(examples, model_name):
    questions = [q.strip() for q in examples["question"]]

    # Tentukan jenis model yang digunakan
    if model_name == "bert" or model_name=="gru":
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    elif model_name == "roberta":
        tokenizer = AutoTokenizer.from_pretrained("deepset/tinyroberta-squad2")

    # Tokenisasi
    tokenized = tokenizer(
        questions,
        examples["context"],
        truncation="only_second", # potong bagian konteks jika panjang melebihi batas
        max_length=384, # batas token adalah 384
        stride=128, # jarak geser sliding window
        return_overflowing_tokens=True, # kembalikan contoh yang terpotong (jika ada)
        return_offsets_mapping=True, # offset (indeks awal dan akhir karakter)
        padding="max_length" # padding agar panjang token sama
    )

    sample_mapping = tokenized.pop("overflow_to_sample_mapping") # menghubungkan token yang terpotong dengan contoh asli
    offset_mapping = tokenized["offset_mapping"] # untuk menemukan jawaban dalam konteks

    start_positions = []
    end_positions = []
    ids = []

    # mencari posisi jawaban untuk setiap contoh
    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id) # posisi awal
        sequence_ids = tokenized.sequence_ids(i) # apakah token dari pertanyaan atau konteks

        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        ids.append(examples["id"][sample_index])

        # Jika tidak ada jawaban
        if len(answers["answer_start"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        # kondisi ketika ada jawaban
        else:
            # menemukan posisi token untuk jawaban
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # menentukan token awal dan akhir jawaban
            if offsets[token_start_index][0] > start_char or offsets[token_end_index][1] < end_char:
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_positions.append(token_start_index - 1)

                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_positions.append(token_end_index + 1)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    tokenized["id"] = ids
    return tokenized


### Pre-processing For GRU Model

In [None]:
tokenized_train_gru = train_dataset.map(
    lambda examples: preprocess_function(examples, model_name="gru"),
    batched=True,
    remove_columns=train_dataset.column_names)
tokenized_test_gru = test_dataset.map(
    lambda examples: preprocess_function(examples, model_name="gru"),
    batched=True,
    remove_columns=test_dataset.column_names)

### Pre-processing For Bert Model

In [37]:
tokenized_train_bert = train_dataset.map(
    lambda examples: preprocess_function(examples, model_name="bert"),
    batched=True,
    remove_columns=train_dataset.column_names)
tokenized_test_bert = test_dataset.map(
    lambda examples: preprocess_function(examples, model_name="bert"),
    batched=True,
    remove_columns=test_dataset.column_names)

### Pre-processing For Roberta Model

In [38]:
tokenized_train_roberta = train_dataset.map(
    lambda examples: preprocess_function(examples, model_name="roberta"),
    batched=True,
    remove_columns=train_dataset.column_names)
tokenized_test_roberta = test_dataset.map(
    lambda examples: preprocess_function(examples, model_name="roberta"),
    batched=True,
    remove_columns=test_dataset.column_names)

# Model Preparation

### GRU Model

In [None]:
class GRUForQA(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256):
        super(GRUForQA, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim) # Konversi token ID ke vektor
        self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True, bidirectional=True) # bidirectional agar bisa memahami konteks sebelum dan sesudah
        self.fc_start = nn.Linear(hidden_dim * 2, 1) # token awal
        self.fc_end = nn.Linear(hidden_dim * 2, 1) # token akhir

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)
        gru_out, _ = self.gru(embedded) # vektor hasil embedding
        start_logits = self.fc_start(gru_out).squeeze(-1) # output posisi awal
        end_logits = self.fc_end(gru_out).squeeze(-1) # output posisi akhir
        return start_logits, end_logits

In [None]:
# Penggabungan data dari kumpulan batch. Ubah data menjadi bentuk tensor
def collate_fn(batch):
    input_ids = torch.tensor([item["input_ids"] for item in batch]) 
    start_positions = torch.tensor([item["start_positions"] for item in batch])
    end_positions = torch.tensor([item["end_positions"] for item in batch])
    return input_ids, start_positions, end_positions

### Bert Model

In [None]:
bert_model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Roberta Model
Model diambil dari https://huggingface.co/deepset/tinyroberta-squad2

In [None]:
roberta_model = AutoModelForQuestionAnswering.from_pretrained("deepset/tinyroberta-squad2")

## Gunakan API untuk training model (kecuali GRU)

#### Training model untuk GRU

In [None]:
def train_gru_model(model, train_dataset, epochs=3, batch_size=16, lr=5e-4):
    model.to(device)
    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    loss_fn = CrossEntropyLoss()
    dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn) # Ambil batch data

    for epoch in range(epochs):
        total_loss = 0
        step = 0
        for input_ids, start_pos, end_pos in tqdm(dataloader, desc=f"Epoch {epoch+1}"): # loop untuk setiap batch
            optimizer.zero_grad()

            # Pakai GPU
            input_ids = input_ids.to(device)
            start_pos = start_pos.to(device)
            end_pos = end_pos.to(device)

            # Mendapatkan logits untuk start dan end dari model
            start_logits, end_logits = model(input_ids)

            # Menghitung loss untuk start dan end positions
            loss_start = loss_fn(start_logits, start_pos)
            loss_end = loss_fn(end_logits, end_pos)
            loss = (loss_start + loss_end) / 2

            # Backward untuk hitung gradien
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            step += 1
            avg_loss = total_loss / step

        print(f"Epoch {epoch+1} - Loss: {avg_loss:.4f}")


### Training Arguments untuk Bert dan Roberta

In [31]:
training_args = TrainingArguments(
    output_dir="./bert-qa-emrqa",   # folder hasil model
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,  # regularisasi
    logging_steps=200
)

#### Training model untuk Bert

In [32]:
bert_trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=tokenized_train_bert,
    eval_dataset=tokenized_test_bert,
    tokenizer=bert_tokenizer,
    data_collator=default_data_collator, # mengatur penggabungan batch
)

  bert_trainer = Trainer(


#### Training model untuk Roberta

In [33]:
roberta_trainer = Trainer(
    model=roberta_model,
    args=training_args,
    train_dataset=tokenized_train_roberta,
    eval_dataset=tokenized_test_roberta,
    tokenizer=roberta_tokenizer,
    data_collator=default_data_collator, # mengatur penggabungan batch
)

  roberta_trainer = Trainer(


# Model Training
Training model dan menyimpan modelnya

#### Training model GRU

In [None]:
vocab_size = bert_tokenizer.vocab_size # ambil ukuran kosakata
gru_model = GRUForQA(vocab_size).to(device)

In [None]:
train_gru_model(gru_model, tokenized_train_gru)
gru_model.save_pretrained("./model_gru")

Epoch 1:   0%|          | 0/822 [00:00<?, ?it/s]

Epoch 1 - Loss: 2.5197


Epoch 2:   0%|          | 0/822 [00:00<?, ?it/s]

Epoch 2 - Loss: 2.1548


Epoch 3:   0%|          | 0/822 [00:00<?, ?it/s]

Epoch 3 - Loss: 1.9749


#### Training model Bert

In [34]:
bert_trainer.train()
bert_model.save_pretrained("./model_bert")

Step,Training Loss
200,1.1547
400,1.2992
600,1.1915
800,1.1366
1000,0.9046
1200,0.9029
1400,0.8574
1600,0.8334
1800,0.6885
2000,0.6643


#### Training model Roberta

In [35]:
roberta_trainer.train()
roberta_model.save_pretrained("./model_roberta")

Step,Training Loss
200,1.4362
400,1.2034
600,1.1074
800,1.0578
1000,0.9187
1200,0.8204
1400,0.8368
1600,0.797
1800,0.6541
2000,0.6641


# Evaluation
Kami menggunakan 2 metriks untuk evaluasi, yaitu Exact Match (EM) dan F1 score. 
- **Exact Match (EM)** adalah seberapa sering prediksi model tepat sama dengan jawaban sebenarnya. 
- **F1 Score** mengukur kemiripan antara prediksi dan jawaban (berdasarkan precision dan recall)

#### Exact Match (EM)

$$
\text{EM} = \frac{\text{Jumlah prediksi tepat}}{\text{Jumlah total pertanyaan}} \times 100\%
$$

#### F1 Score

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$



#### Gunakan metriks squad untuk evaluasi

In [15]:
metric = load("squad")

#### Post-processing digunakan untuk mengubah output dari model menjadi teks jawaban biasa

In [None]:
def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_answer_length=30):
    all_start_logits, all_end_logits = raw_predictions  # mengambil hasil prediksi model

    # Hubungkan example dengan features
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = {}
    for i, feature in enumerate(features):
        example_id = feature["id"]
        features_per_example.setdefault(example_id, []).append(i)

    predictions = {}

    # ambil semua fitur untuk setiap example
    for example_id, feature_indices in tqdm(features_per_example.items()):
        context = examples[example_id_to_index[example_id]]["context"]

        valid_answers = []

        for feature_index in feature_indices:
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            offset_mapping = features[feature_index]["offset_mapping"]

            # mengambil kandidat jawaban
            start_indexes = np.argsort(start_logits)[-1: -n_best_size - 1: -1].tolist()
            end_indexes = np.argsort(end_logits)[-1: -n_best_size - 1: -1].tolist()

            # periksa kombinasi kandidat token
            for start_index in start_indexes:
                for end_index in end_indexes:
                    if start_index >= len(offset_mapping) or end_index >= len(offset_mapping):
                        continue
                    if offset_mapping[start_index] is None or offset_mapping[end_index] is None:
                        continue
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    # simpah skor jika valid
                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    answer = context[start_char: end_char]
                    score = start_logits[start_index] + end_logits[end_index]
                    valid_answers.append({"text": answer, "score": score})

        # pilih jawaban terbaik
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            best_answer = {"text": "", "score": 0.0}

        # simpan hasil akhir
        predictions[example_id] = best_answer["text"]

    return predictions


## GRU

In [None]:
dataloader = DataLoader(tokenized_test_gru, batch_size=16, collate_fn=collate_fn) # Membuat dataloader agar dapat diproses dalam batch
all_start_logits, all_end_logits = [], []

# mode evaluasi
gru_model.eval()
with torch.no_grad():
    for input_ids, _, _ in dataloader:
        input_ids = input_ids.to(device)
        start_logits, end_logits = gru_model(input_ids) 
        all_start_logits.extend(start_logits.tolist()) # tambahin elemen ke list
        all_end_logits.extend(end_logits.tolist())

# Post-process predictions
final_predictions = postprocess_qa_predictions(
    examples=test_dataset,              # dataset asli
    features=tokenized_test_gru,        # hasil tokenisasi
    raw_predictions=(all_start_logits, all_end_logits)
)

# Evaluation
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": {'answer_start': ex['answers']['answer_start'], 'text': ex['answers']['text']}} for ex in test_dataset]

metric_result = metric.compute(predictions=formatted_predictions, references=references)

print("Hasil evaluasi GRU:")
print(metric_result)


  0%|          | 0/1000 [00:00<?, ?it/s]

Hasil evaluasi GRU:
{'exact_match': 1.9, 'f1': 3.532710927370739}


## Bert
Load ulang model Bert yang sudah di fine tuned kemudian buat trainernya kembali untuk prediksi.

In [None]:
bert_model = AutoModelForQuestionAnswering.from_pretrained("./model_bert")

In [41]:
training_args = TrainingArguments(
    output_dir="./bert-qa-model",
    per_device_eval_batch_size=16,
)

bert_trainer = Trainer(
    model=bert_model,
    args=training_args,
    tokenizer=bert_tokenizer,
    data_collator=default_data_collator
)

  bert_trainer = Trainer(


#### Prediksi menggunakan model Bert

In [None]:
raw_predictions = bert_trainer.predict(tokenized_test_bert) # memprediksi dataset uji
final_predictions = postprocess_qa_predictions(
    test_dataset,  # versi asli
    tokenized_test_bert,  # versi tokenisasi
    raw_predictions.predictions # hasil logits
)
# menyusun hasil prediksi
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]

# referensi jawaban asli
references = [{"id": ex["id"], "answers": {'answer_start': ex['answers']['answer_start'], 'text': ex['answers']['text']}} for ex in test_dataset]

# evaluasi dengan SQuAD metrics
metric_result = metric.compute(predictions=formatted_predictions, references=references)

print("Hasil evaluasi Bert:")
print(metric_result)

[34m[1mwandb[0m: Currently logged in as: [33mnathanchris435[0m ([33mnathanchris435-calvin-institute-of-technology[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


  0%|          | 0/1000 [00:00<?, ?it/s]

Hasil evaluasi Bert:
{'exact_match': 40.9, 'f1': 53.39313913556089}


## Roberta
Load ulang model Roberta yang sudah di fine tuned kemudian buat trainernya kembali untuk prediksi.

In [None]:
roberta_model = AutoModelForQuestionAnswering.from_pretrained("./model_roberta")

In [44]:
training_args = TrainingArguments(
    output_dir="./bert-qa-model",
    per_device_eval_batch_size=16,
)

roberta_trainer = Trainer(
    model=roberta_model,
    args=training_args,
    tokenizer=roberta_tokenizer,
    data_collator=default_data_collator
)

  roberta_trainer = Trainer(


#### Prediksi menggunakan model Roberta

In [None]:
raw_predictions = roberta_trainer.predict(tokenized_test_roberta) # memprediksi dataset uji
final_predictions = postprocess_qa_predictions(
    test_dataset,  # versi asli
    tokenized_test_roberta,  # versi tokenisasi
    raw_predictions.predictions # hasil logits
)

# menyusun hasil prediksi
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]

# referensi jawaban asli
references = [{"id": ex["id"], "answers": {'answer_start': ex['answers']['answer_start'], 'text': ex['answers']['text']}} for ex in test_dataset]

# evaluasi dengan SQuAD metrics
metric_result = metric.compute(predictions=formatted_predictions, references=references)

print("Hasil evaluasi Roberta:")
print(metric_result)

  0%|          | 0/1000 [00:00<?, ?it/s]

Hasil evaluasi Roberta:
{'exact_match': 48.2, 'f1': 70.56925233621385}


## Hasil Evaluasi Model

| Nama Model  | F1 Score |   EM   |
|-------------|----------|--------|
| GRU         |   3.53   |  1.9   |
| Bert        |  53.39   | 40.9   |
| Roberta     |  70.57   | 48.2   |


# Inference
Mencoba menggunakan model yang sudah dilatih untuk memprediksi output dari input baru

## Pertanyaan:

In [20]:
# Example
question = "Has patient ever been prescribed lopressor"
context = "Mr. Quigg is a 42-year-old man with history of diabetes, end-stage renal disease on hemodialysis, left Charcot foot complicated by recurrent cellulitis who presented with left lower leg swelling, erythema, and pain. On admission, his temperature was 100.8, heart rate was 111, and blood pressure was 140/70. His left lower extremity had 1+ pitting edema with erythema on the anterior shin and foot. He was uptitrated to 5mg and also lopressor, started on Lyrica and oxycodone for breakthrough pain, and received Fentanyl PCA. His home medications included Colace 100 mg b.i.d., folate 1 mg p.o. daily, gemfibrozil 600 mg b.i.d., Lantus 30 mg subcu q.p.m., Lipitor 80 mg nightly, Nephrocaps, Neurontin 300 mg daily, PhosLo 2001 mg t.i.d., Protonix 40 mg daily, Renagel 3200 mg t.i.d., Requip 2 mg p.o. b.i.d., and Coumadin. His Lipitor was decreased to 20mg due to rhabdomylosis risk, and he was also started on low dose b-blocker to reduce perioperative MI risk prior to his surgery. His Vancomycin was continued given his history of MRSA cellulitis, with a goal of a level less than 20, and he was bridged with heparin with a goal PTT of 60-80. He was restarted on his Lantus and Aspart doses with meals, and his Coumadin was held prior to surgery and decreased to 20mg with a repeat lipid panel in 4-6 weeks. He required antibiotics which were discontinued at this time and he was discharged with dry sterile dressing changes to his residual limb daily, PTT goal 60-80, INR goal 2-3 until stable off of levofloxacin, monitoring of FS and adjustment of DM regimen, monitoring pain scale and decreasing pain medications as pain improves, hemodialysis M/W/F, and follow up with Dr. Carpino voice message left on his medical assistant's voice mail and Dr. Lynes 6/10/06 at 9:30am. Psychiatry service was consulted who recommended low dose Ativan prior to him going for dialysis. He was initially placed on a ketamine drip and given IV Levofloxacin and IV Flagyl to cover gram negatives and anaerobes respectively, and started on oxycontin 80mg tid with oxycodone for breakthrough pain and Lyrica for neuropathic pain. He was comfortable prior to discharge on this current regimen."

## Expected Output: 
``` He was uptitrated to 5mg and also lopressor,```

### GRU

##### Model GRU membutuhkan metode tersendiri untuk prediksi

In [None]:
def predict_gru_answer(model, question, context, tokenizer, max_length=384):
    model.eval()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Tokenisasi input
    inputs = tokenizer(question, context,
                       return_offsets_mapping=True,
                       padding="max_length",
                       truncation=True,
                       max_length=max_length,
                       return_tensors="pt")

    input_ids = inputs["input_ids"].to(device)
    offsets = inputs["offset_mapping"][0]  # (seq_len, 2)

    # Predict
    with torch.no_grad():
        start_logits, end_logits = model(input_ids)

    # Hapus kemungkinan dia mengambil index ke 0 (token CLS --> jawaban kosong)
    start_logits = start_logits[:, 1:]
    end_logits = end_logits[:, 1:]
    offsets = offsets[1:]

    # Ambil posisi tertinggi (logit paling besar)
    start_index = torch.argmax(start_logits, dim=1).item()
    end_index = torch.argmax(end_logits, dim=1).item()

    # Koreksi jika end sebelum start
    if end_index < start_index:
        end_index = start_index

    # Ambil teks asli dari context
    tokens = input_ids[0][start_index:end_index + 1]
    answer = tokenizer.decode(tokens, skip_special_tokens=True)

    return answer


In [None]:
result = predict_gru_answer(gru_model, question, context, bert_tokenizer)
print("Jawaban:", result)

Jawaban: . his vancomycin was continued given his history of mrsa cellulitis


### Bert

In [21]:
# Load fine-tuned model
bert_model = AutoModelForQuestionAnswering.from_pretrained("./model_bert")

# Create QA pipeline
qa_bert_pipeline = pipeline("question-answering", model=bert_model, tokenizer=bert_tokenizer)

Device set to use cuda:0


In [22]:
result = qa_bert_pipeline(question=question, context=context)
print("Jawaban:", result['answer'])

Jawaban: started on Lyrica


### Roberta

In [23]:
# Load fine-tuned model
roberta_model = AutoModelForQuestionAnswering.from_pretrained("./model_roberta")

# Create QA pipeline
qa_roberta_pipeline = pipeline("question-answering", model=roberta_model, tokenizer=roberta_tokenizer)

Device set to use cuda:0


In [24]:
result = qa_roberta_pipeline(question=question, context=context)
print("Jawaban:", result['answer'])

Jawaban: He was uptitrated to 5mg and also lopressor,


# Kesimpulan
Dari ketiga model yang ada (GRU, BERT, dan RoBERTa), RoBERTa menjadi model yang menunjukkan performa terbaik untuk masalah Question Answering pada dataset ini. F1 score yang tinggi (70.57) menunjukkan bahwa model menghasilkan jawaban yang cukup akurat dengan aslinya. EM yang lebih tinggi (48.2) menunjukkan bahwa kemungkinan bahwa kemungkinan jawaban yang dihasilkan oleh RoBERTa lebih sering sama dengan jawaban aslinya. Sementara itu, BERT memiliki nilai f1 score dan EM di bawah RoBERTa. Hal ini menunjukkan bahwa RoBERTa lebih efektif dibandingkan BERT. Terakhir, GRU memiliki f1 score dan EM yang paling rendah di antara kedua model lainnya. Hal ini menunjukkan bahwa GRU kesulitan untuk memberikan jawaban yang tepat. Hal ini disebabkan karena GRU tidak dapat memberikan jawaban sebaik model-model Transformer