<a href="https://colab.research.google.com/github/gonzalovaldenebro/NaturalLanguageProcessing-Portfolio/blob/main/F7_1_TransferLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Handling Long-Term Information in Recurrent Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb)

## Reference

Hugging Face NLP Course Chapter 1: Transformer Models https://huggingface.co/learn/nlp-course/chapter1/1

Hugging Face NLP Course Chapter 3: Fine-tuning a model with the Trainer API or Keras https://huggingface.co/learn/nlp-course/chapter3/1

Hugging Face NLP Course Chapter 7, Section 5: Summarization https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf

In [55]:
import sys
!{sys.executable} -m pip install --no-cache-dir datasets keras tensorflow sentencepiece

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Transfer Learning

**Transfer Learning** is the process of taking a model that was trained (**pre-trained**) on one task and then **fine tuned** for another task.

Today we're going to practice fine-tuning a pre-trained **transformer** model - we'll cover transformers in more detail next week, but they work a lot like the other neural network models we've looked at so far.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/pretraining.svg?raw=1" width=700>
    <br />
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/finetuning.svg?raw=1" width=700>
</div>

image source: https://huggingface.co/learn/nlp-course/chapter1/4?fw=tf

## Common pre-trained models

There are a variety of pre-trained models out there
* usually *very large*
* pretrained on *massive amounts of data*

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/model_parameters.png?raw=1" width=800>
</div>

**Encoders:** BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa
* Usually trained on masked input - model tries to predict the missing word in a sequence


**Decoders:** CTRL, GPT, GPT-2, Transformer XL
* Neural language models - usually trying to predict the next word in a sequence

**Encoder-Decoder Models:** BART, mBART, Marian, T5
* full sequence-to-sequence models


## Working Example

We're going to work through our text-to-emoji example, fine-tuning a variant of T5.

### Load and filter our dataset just like before

In [56]:
from datasets import load_dataset


# Define a function to check if 'text' is not None
def is_not_none(example):
    return example['text'] is not None

dataset = load_dataset("KomeijiForce/Text2Emoji",split="train")

# Filter the dataset
dataset = dataset.filter(is_not_none)
dataset

Dataset({
    features: ['text', 'emoji', 'topic'],
    num_rows: 503682
})

### choosing a sample to work with

Even the smaller transformer models will take too long to train on in class

Let's choose a small sample to work on in class

In [57]:
# Shuffle the dataset
shuffled_dataset = dataset.shuffle(seed=42)

# Select a small sample
sample_size = 1000  # Define your sample size
sample_dataset = shuffled_dataset.select(range(sample_size))

#if you want to use the entire dataset just uncomment the following
#sample_dataset = shuffled_dataset

### Train/test split

Hugging Face datasets actually include a `train_test_split` function for splitting into training and testing sets if you don't already have them split.

In [58]:
dataset_split = sample_dataset.train_test_split(test_size=0.2)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 200
    })
})

### Reminder of what the data looks like

In [41]:
print(dataset_split["train"]["text"][46])
print(dataset_split["train"]["emoji"][46])

La lluvia est llegando con fuerza, es mejor quedarse adentro y disfrutar de una buena jornada de pelculas y mantita.
☔🏠🎥🍿🌈


### The Tokenizer

Since we will be using an existing model to start, we need to make sure we prepare our data in the same way that model was trained on.

**T5:** an encoder-decoder Transformer architecture suitable for sequence-to-sequences tasks

**mT5:** A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages

**mt5-small:** A small version of mT5, suitable for getting things working before attempting to train on a large model

`mt5-small` uses the SentencePiece tokenizer

In [26]:
from transformers import AutoTokenizer

#uses the sentencepiece tokenizer
model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

### Looking at an example of the tokenization

You'll see that the token ids get returned as `input_ids`

It also includes an `attention_mask` which allows the algorithm to focus on specific important words using its attention mechanism - it's initialized to all 1s

In [27]:
inputs = tokenizer(dataset_split["train"]["text"][46])
inputs

{'input_ids': [12808, 259, 262, 9154, 304, 6775, 288, 772, 20727, 8174, 514, 287, 64642, 1052, 305, 97794, 12954, 1052, 24375, 149200, 285, 12017, 259, 264, 609, 277, 263, 1469, 259, 262, 10112, 51053, 3153, 13636, 347, 751, 3721, 25086, 281, 772, 5810, 2586, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Converting ids back to tokens

Here's what the tokens look like.

The `▁` and `</s>` are hallmarks of the SentencePiece tokenizer algorithm

In [28]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁Add',
 '▁',
 'a',
 '▁pop',
 '▁of',
 '▁color',
 '▁to',
 '▁your',
 '▁windows',
 'ill',
 '▁with',
 '▁the',
 '▁vibr',
 'ant',
 '▁and',
 '▁flam',
 'boy',
 'ant',
 '▁Bro',
 'melia',
 'd',
 '▁plant',
 '▁',
 '-',
 '▁it',
 "'",
 's',
 '▁like',
 '▁',
 'a',
 '▁fire',
 'works',
 '▁show',
 '▁happen',
 'ing',
 '▁all',
 '▁year',
 '▁round',
 '▁in',
 '▁your',
 '▁own',
 '▁home',
 '!',
 '</s>']

### How does it work on the emojis?

Fortunately, this seems to work pretty well for the emoji output too

some may come back as `<unk>` for unknown tokens

In [29]:
target = tokenizer(dataset_split["train"]["emoji"][46])
target

{'input_ids': [259, 244125, 239199, 191820, 238833, 239328, 248969, 96088, 223546, 238782, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [30]:
tokenizer.convert_ids_to_tokens(target.input_ids)

['▁', '🎆', '🌺', '💥', '🌈', '💫', '🌆', '🔥', '🇺', '🇸', '</s>']

In [31]:
tokenizer.decode(target.input_ids)

'🎆🌺💥🌈💫🌆🔥🇺🇸</s>'

### Let's define a preprocessing function

This will allow us to tokenize both the text and labels while allow use to add the token ids from the emojis as the `"labels"` key in the overall data structure where it will be convenient to have them for training.

In [32]:
max_input_length = 100
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["emoji"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



Hugging Face datasets have a `map` method that allows you to apply a preprocessing function like this to every example in the data set.

Notice that we get everything we had before (text, emoji, topic), but now we also have the input_ids (the tokens), the attention mask, and the labels (also token ids).

In [33]:
#turn the tokenized data back into a dataset
tokenized_datasets = dataset_split.map(preprocess_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

### Grabbing the pre-trained model

as a reminder, `model_checkpoint` was defined earlier - it is `"google/mt5-small"`

Note that this is an encoder-decoder transformer model the was pretrained on a 750 GB dataset which included tasks for summarization, translation, question answering, and classification.

In [34]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


### Using a data collator

Hugging Face provides a Data Collator class which is used to collect the training data into batches and dynamically pad them so that each batch is appropriately padded but without an overall fixed length.

With `return_tensors="tf"` we're saying we want the data back in an appropriate data structure suitable for using with Keras/Tensorflow.

In [35]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

Let's make a version of the dataset where the original text fields are removed so we can use it with the collator.

In [36]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(["text","emoji","topic"])
tokenized_datasets_no_text

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [37]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

### Setting up the optimizer

When fine-tuning a pre-trained algorithm, you usually want to use a smaller learning rate.

Note that we do not specify a loss function - it will use whatever was used in the base model.

*NB:* I'm using values that were in the example on the website (https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf ) for a different dataset - I don't know if these are the best for this problem

In [38]:
from transformers import create_optimizer
import tensorflow as tf

num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16 - can be helpful if running on a GPU
tf.keras.mixed_precision.set_global_policy("mixed_float16")

In [39]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.src.callbacks.History at 0x2ffc116d0>

### Saving a copy of the model's weights

This will allow us to load the model later and work with it without completely retraining.

In [None]:
model.save_pretrained("models/emoji-model-v2")

### Reload a saved model

In [None]:
model = TFAutoModelForSeq2SeqLM.from_pretrained("models/emoji-model-v1")

### Inference

Let's suppose we have an example to get a prediction for. For now, let's grab one from the test set

In [44]:
print( tokenized_datasets["test"]["text"][15] )
print( tokenized_datasets["test"]["emoji"][15])
print( tokenized_datasets["test"]["input_ids"][15])

Looking for the perfect gadget to streamline your home organization? You'll love these versatile storage bins.
📦✨🏠💕
[259, 61408, 332, 287, 5571, 259, 88586, 288, 38101, 1397, 772, 2586, 29660, 291, 1662, 277, 1578, 3869, 259, 3824, 259, 170280, 468, 25973, 3111, 263, 260, 1]


Use the `generate` method to get a prediction sequence from the intput IDs.

If you don't already have the tokens, make sure to use your tokenizer first.

In [45]:
prediction = model.generate([tokenized_datasets["test"]["input_ids"][15]], max_length=max_target_length)
tokenizer.convert_ids_to_tokens(prediction[0])

2023-11-21 16:30:08.213844: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x34050cb60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-11-21 16:30:08.214001: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-11-21 16:30:08.236120: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-21 16:30:08.338200: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


['<pad>', '▁<extra_id_0>', '.', '</s>']

In [46]:
decoded_output = tokenizer.decode(prediction[0], skip_special_tokens=True)
decoded_output

'<extra_id_0>.'

## Applied Exploration

The applied exploration for this fortnight will be a little different. I want everyone to get some experience fine-tuning an existing model, so this will be the task for the entire fortnight.

Fine-tune an existing model with the following requirements
* Choose a different starting model - you can use any Hugging Face model, but consider starting with a general one like BART or Llama2.
* Choose a different data set - think about something that would be good to include in an application that interests you
* Evaluate how well it performed. For sequence-to-sequence model, try going back and using Rouge from Fortnight 1.

# Fine-tunning the Dynamic Tinibert Model with Medical Data


I will be working on fine-tunning the existing [Dynamic-TinyBERT](https://huggingface.co/Intel/dynamic_tinybert) with medical data, this is the [dataset](https://huggingface.co/datasets/GonzaloValdenebro/MedicalQuestionAnsweringDataset) that I will be using for that.

In [233]:
from datasets import load_dataset

MedData = (load_dataset("GonzaloValdenebro/MedicalQuestionAnsweringDataset", split='train')
        .train_test_split(train_size=0.80, test_size=0.20))
MedData


DatasetDict({
    train: Dataset({
        features: ['id', 'Question', 'Context', 'Topic', 'Answer'],
        num_rows: 13124
    })
    test: Dataset({
        features: ['id', 'Question', 'Context', 'Topic', 'Answer'],
        num_rows: 3282
    })
})

In [234]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [235]:
context = MedData["train"][0]["Context"]
question = MedData["train"][0]["Question"]

inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

'[CLS] What are the symptoms of Moyamoya disease? [SEP] What are the signs and symptoms of Moyamoya disease? The Human Phenotype Ontology provides the following list of signs and symptoms for Moyamoya disease. If the information is available, the table below includes how often the symptom is seen in people with this condition. You can use the MedlinePlus Medical Dictionary to look up the definitions for these medical terms. Signs and Symptoms Approximate number of patients ( when available ) Abnormality of the cerebral vasculature 50 % Cognitive impairment 50 % Seizures 50 % Ventriculomegaly 50 % Autosomal recessive inheritance - Inflammatory arteriopathy - Telangiectasia - The Human Phenotype Ontology ( HPO ) has collected information on how often a sign or symptom occurs in a condition. Much of this information comes from Orphanet, a European rare disease database. The frequency of a sign or symptom is usually listed as a rough estimate of the percentage of patients who have that fea

In [236]:
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

In [237]:
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [238]:
inputs = tokenizer(
    MedData["train"][2:6]["Question"],
    MedData["train"][2:6]["Context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

In [239]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["Question"]]
    inputs = tokenizer(
        questions,
        examples["Context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["Answer"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = int(answer["answer_start"])
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [241]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["Question"]]
    inputs = tokenizer(
        questions,
        examples["Context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [242]:
validation_dataset = MedData["test"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=MedData["test"].column_names,
)
len(MedData["test"]), len(validation_dataset)

Map:   0%|          | 0/3282 [00:00<?, ? examples/s]

(3282, 4820)

In [243]:
small_eval_set = MedData["test"].select(range(100))
trained_checkpoint = "distilbert-base-cased-distilled-squad"

tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)
eval_set = small_eval_set.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=MedData["test"].column_names,
)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [97]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [98]:
import torch
from transformers import AutoModelForQuestionAnswering

eval_set_for_model = eval_set.remove_columns(["example_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(
    device
)

with torch.no_grad():
    outputs = trained_model(**batch)

In [99]:
start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

In [100]:
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)

In [130]:
import numpy as np

n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example["id"]
    context = example["Context"]
    answers = []

    for feature_index in example_to_features[example_id]:
        start_logit = start_logits[feature_index]
        end_logit = end_logits[feature_index]
        offsets = eval_set["offset_mapping"][feature_index]

        start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
        end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                # Skip answers with a length that is either < 0 or > max_answer_length.
                if (
                    end_index < start_index
                    or end_index - start_index + 1 > max_answer_length
                ):
                    continue

                answers.append(
                    {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                )

    best_answer = max(answers, key=lambda x: x["logit_score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})

In [117]:
import evaluate

metric = evaluate.load("squad")

In [132]:
theoretical_answers = [
    {"id": ex["id"], "Answer": ex["Answer"]} for ex in small_eval_set
]

# Convert 'answers' to a list of dictionaries for theoretical_answers
theoretical_answers = [{'id': item['id'], 'Answer': [{'text': item['Answer']}]} for item in theoretical_answers]

# Now both theoretical_answers and predicted_answers have the expected format


In [133]:
print(predicted_answers[0])
print(theoretical_answers[0])

{'id': 908, 'prediction_text': 'bone and eye abnormalities'}
{'id': 908, 'Answer': [{'text': 'bone and eye abnormalities'}]}


In [164]:
from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["Context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})
    

    # Modify the format of predicted_answers
    formatted_predicted_answers = [{'id': str(pred['id']), 'prediction_text': pred['prediction_text']} for pred in predicted_answers]

    # Modify the format of theoretical_answers
    formatted_theoretical_answers = [{'id': str(ex['id']), 'answers': [{'text': ex['Answer'], 'answer_start': 0}]} for ex in examples]
    
    # Assuming 'metric' is an instance of your evaluation module
    result = metric.compute(predictions=formatted_predicted_answers, references=formatted_theoretical_answers)

    return result

# Call the compute_metrics function with the necessary arguments
compute_metrics(start_logits, end_logits, eval_set, small_eval_set)


  0%|          | 0/100 [00:00<?, ?it/s]

{'exact_match': 81.0, 'f1': 88.52499508575357}

In [144]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [198]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [201]:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/Users/gonzalovaldenebro/Library/CloudStorage/OneDrive-DrakeUniversity/CS 195/ Fortnight 7/bert-finetuned-model",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=False,  # Set to False to disable mixed precision training
    push_to_hub=False,
)


In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)
trainer.train()

In [206]:


from torch.utils.data import DataLoader

# Assuming you have a `train_dataset` defined
train_dataloader = DataLoader(train_dataset, batch_size=3, shuffle=True)

# Now you can iterate through batches
for batch in train_dataloader:
    print(batch)
    break

In [220]:
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./Users/gonzalovaldenebro/Library/CloudStorage/OneDrive-DrakeUniversity/CS 195/ Fortnight 7/bert-finetuned-model",  # Set the local directory for saving the fine-tuned model
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=False,  # Set to False to disable mixed precision training
    push_to_hub=True,  # Set to False to avoid pushing to the Hugging Face Model Hub
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    
)
trainer.train()

  0%|          | 0/4308 [00:00<?, ?it/s]

  return table.fast_gather(key % table.num_rows)


IndexError: list index out of range

In [218]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, raw_datasets["validation"])

IndexError: list index out of range

In [225]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

train_dataset.set_format("torch")
validation_set = validation_dataset.remove_columns(["id"])
validation_set.set_format("torch")

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    validation_set, collate_fn=default_data_collator, batch_size=8
)

In [226]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Downloading pytorch_model.bin:   0%|          | 0.00/431M [00:00<?, ?B/s]

In [227]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [229]:
from accelerate import Accelerator

accelerator = Accelerator(fp16=False)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

TypeError: Accelerator.__init__() got an unexpected keyword argument 'fp16'

In [230]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

## Loading the Intel/dynamic_tinybert model

In [45]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model_checkpoint = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

In [46]:
inputs = tokenizer(MedData["train"]["Question"][46])
inputs


{'input_ids': [0, 2264, 32, 5, 9186, 1022, 1330, 7, 5991, 12, 45558, 18546, 118, 14115, 17487, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [47]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['<s>',
 'What',
 'Ġare',
 'Ġthe',
 'Ġgenetic',
 'Ġchanges',
 'Ġrelated',
 'Ġto',
 'ĠLi',
 '-',
 'Fra',
 'umen',
 'i',
 'Ġsyndrome',
 'Ġ?',
 '</s>']

In [48]:
target = tokenizer(MedData["train"]["Answer"][46])
target

{'input_ids': [0, 133, 3858, 717, 530, 176, 8, 31847, 4540, 14819, 32, 3059, 19, 5991, 12, 45558, 18546, 118, 14115, 4, 1437, 901, 87, 457, 9, 70, 1232, 19, 5991, 12, 45558, 18546, 118, 14115, 33, 17136, 28513, 11, 5, 31847, 4540, 10596, 4, 31847, 4540, 16, 10, 16570, 23192, 368, 10596, 6, 61, 839, 14, 24, 6329, 2607, 797, 5, 434, 8, 2757, 9, 4590, 4, 13549, 1635, 11, 42, 10596, 64, 1157, 4590, 7, 11079, 11, 41, 38411, 169, 8, 1026, 23991, 4, 1944, 9186, 8, 3039, 2433, 32, 67, 533, 7, 3327, 5, 810, 9, 1668, 11, 82, 19, 31847, 4540, 28513, 4, 1437, 83, 367, 1232, 19, 16640, 26293, 9, 5991, 12, 45558, 18546, 118, 14115, 8, 5991, 12, 45558, 18546, 118, 12, 3341, 14115, 109, 45, 33, 31847, 4540, 28513, 6, 53, 33, 28513, 11, 5, 3858, 717, 530, 176, 10596, 4, 2011, 5, 31847, 4540, 10596, 6, 3858, 717, 530, 176, 16, 10, 16570, 23192, 368, 10596, 4, 11161, 32, 9684, 549, 3858, 717, 530, 176, 28513, 888, 1303, 209, 1274, 50, 32, 8315, 3059, 19, 41, 1130, 810, 9, 1402, 16640, 36, 8529, 6181, 166

In [None]:
tokenizer.convert_ids_to_tokens(target.input_ids)

In [50]:
tokenizer.decode(target.input_ids)

'<s>The CHEK2 and TP53 genes are associated with Li-Fraumeni syndrome.  More than half of all families with Li-Fraumeni syndrome have inherited mutations in the TP53 gene. TP53 is a tumor suppressor gene, which means that it normally helps control the growth and division of cells. Mutations in this gene can allow cells to divide in an uncontrolled way and form tumors. Other genetic and environmental factors are also likely to affect the risk of cancer in people with TP53 mutations.  A few families with cancers characteristic of Li-Fraumeni syndrome and Li-Fraumeni-like syndrome do not have TP53 mutations, but have mutations in the CHEK2 gene. Like the TP53 gene, CHEK2 is a tumor suppressor gene. Researchers are uncertain whether CHEK2 mutations actually cause these conditions or are merely associated with an increased risk of certain cancers (including breast cancer).</s>'

# Preprocessing function

This will allow us to tokenize both the text and labels while allow us to add the token ids from the emojis as the "labels" key in the overall data structure where it will be convenient to have them for training.

In [51]:
max_input_length = 100
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["Question"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["Answer"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [52]:
tokenized_datasets = MedData.map(preprocess_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'Question', 'Answer', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['id', 'Question', 'Answer', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

In [54]:
from transformers import AutoModelForQuestionAnswering

model_checkpoint = "deepset/roberta-base-squad2"
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


# Using a data collator

Hugging Face provides a Data Collator class which is used to collect the training data into batches and dynamically pad them so that each batch is appropriately padded but without an overall fixed length.

With return_tensors="tf" we're saying we want the data back in an appropriate data structure suitable for using with Keras/Tensorflow.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [56]:
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator(tokenizer, model=model, return_tensors="tf")

TypeError: DefaultDataCollator.__init__() got an unexpected keyword argument 'model'

In [58]:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

class CustomDataCollator(DataCollatorWithPadding):
    def __init__(self, tokenizer, model):
        super().__init__(tokenizer=tokenizer)

        # Store the model for later use
        self.model = model

    def __call__(self, features):
        # Implement your custom collation logic here, using self.model if needed
        return super().__call__(features)

# Example usage
custom_data_collator = CustomDataCollator(tokenizer=tokenizer, model=model)



In [59]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(["Question","Answer"])
tokenized_datasets_no_text

DatasetDict({
    train: Dataset({
        features: ['id', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['id', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

In [63]:
from transformers import Trainer, TrainingArguments

# Define your training arguments
training_args = TrainingArguments(
    output_dir="./Users/gonzalovaldenebro/Library/CloudStorage/OneDrive-DrakeUniversity/CS 195/ Fortnight 7",  # Specify the directory where the trained model will be saved
    per_device_train_batch_size=8,  # Adjust batch size as needed
    per_device_eval_batch_size=8,   # Adjust batch size as needed
    num_train_epochs=3,             # Adjust number of training epochs as needed
    logging_dir="./Users/gonzalovaldenebro/Library/CloudStorage/OneDrive-DrakeUniversity/CS 195/ Fortnight 7",           # Specify the directory for TensorBoard logs
    logging_steps=100,               # Adjust logging frequency as needed
    evaluation_strategy="steps",     # Specify evaluation strategy (steps or epoch)
    save_strategy="steps",           # Specify saving strategy (steps or epoch)
    save_total_limit=2,              # Specify the maximum number of checkpoints to keep
)

# Define your Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_no_text["train"],
    eval_dataset=tokenized_datasets_no_text["test"],
    data_collator=data_collator,
)

# Start training
trainer.train()


  0%|          | 0/3000 [00:00<?, ?it/s]

ValueError: expected sequence of length 10 at dim 1 (got 9)

In [None]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="/Users/gonzalovaldenebro/MedicalQuestionAnswer",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
    hub_token="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",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_no_text["train"],
    eval_dataset=tokenized_datasets_no_text["test"],
    tokenizer=tokenizer,
)

# Training
trainer.train()

In [105]:
from transformers import AutoTokenizer
from datasets import load_dataset

# Load dataset
#dataset = load_dataset("GonzaloValdenebro/MedicalQuestionAnsweringDataset", split='train')

dataset = (load_dataset("GonzaloValdenebro/MedicalQuestionAnsweringDataset", split='train')
        .train_test_split(train_size=800, test_size=200))

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

def preprocess_function(example):
    return tokenizer(example["Question"], example["Answer"], padding="max_length", truncation="only_second", return_offsets_mapping=True)

# Tokenize and preprocess the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)


Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [52]:
import pandas as pd

# Load the existing DataFrame from the CSV file
existing_df = pd.read_csv("/Users/gonzalovaldenebro/Library/CloudStorage/OneDrive-DrakeUniversity/CS 195/Medical_QA/MedicalQuestionAnswering.csv")



In [53]:
# Assuming existing_df is your DataFrame
existing_df = existing_df.rename(columns={'id': 'id', 'Question': 'Question', 'Answer': 'Context', 'topic': 'Topic'})

existing_df

Unnamed: 0,id,Question,Context,Topic
0,1,What is (are) keratoderma with woolly hair ?,Keratoderma with woolly hair is a group of rel...,growth_hormone_receptor
1,2,How many people are affected by keratoderma wi...,Keratoderma with woolly hair is rare; its prev...,growth_hormone_receptor
2,3,What are the genetic changes related to kerato...,"Mutations in the JUP, DSP, DSC2, and KANK2 gen...",growth_hormone_receptor
3,4,Is keratoderma with woolly hair inherited ?,Most cases of keratoderma with woolly hair hav...,growth_hormone_receptor
4,5,What are the treatments for keratoderma with w...,These resources address the diagnosis or manag...,growth_hormone_receptor
...,...,...,...,...
16401,16402,What is (are) COPD ?,COPD (chronic obstructive pulmonary disease) m...,Other
16402,16403,What is (are) Complex Regional Pain Syndrome ?,Complex regional pain syndrome (CRPS) is a chr...,Other
16403,16404,What is (are) Kidney Stones ?,A kidney stone is a solid piece of material th...,Other
16404,16405,What is (are) Meniere's Disease ?,Meniere's disease is a disorder of the inner e...,Other


In [54]:
import pandas as pd
from transformers import pipeline
import time

# Use a Hugging Face model to generate answers
qa_generator = pipeline('question-answering', model='distilbert-base-cased-distilled-squad')

# Record the start time
start_time = time.time()

# Generate answers based on the provided context and question
existing_df['Answer'] = existing_df.apply(lambda row: qa_generator(question=row['Question'], context=row['Context'])['answer'], axis=1)

# Record the end time
end_time = time.time()

# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed Time: {elapsed_time} seconds")

# Optionally, you can print the updated DataFrame with the answers
print("\nUpdated DataFrame:")
existing_df

Elapsed Time: 1234.3289606571198 seconds

Updated DataFrame:


Unnamed: 0,id,Question,Context,Topic,Answer
0,1,What is (are) keratoderma with woolly hair ?,Keratoderma with woolly hair is a group of rel...,growth_hormone_receptor,palmoplantar
1,2,How many people are affected by keratoderma wi...,Keratoderma with woolly hair is rare; its prev...,growth_hormone_receptor,"up to 1 in 1,000"
2,3,What are the genetic changes related to kerato...,"Mutations in the JUP, DSP, DSC2, and KANK2 gen...",growth_hormone_receptor,skin and hair abnormalities
3,4,Is keratoderma with woolly hair inherited ?,Most cases of keratoderma with woolly hair hav...,growth_hormone_receptor,autosomal recessive pattern of inheritance
4,5,What are the treatments for keratoderma with w...,These resources address the diagnosis or manag...,growth_hormone_receptor,"Cardiomyopathy, dilated"
...,...,...,...,...,...
16401,16402,What is (are) COPD ?,COPD (chronic obstructive pulmonary disease) m...,Other,chronic obstructive pulmonary disease
16402,16403,What is (are) Complex Regional Pain Syndrome ?,Complex regional pain syndrome (CRPS) is a chr...,Other,a chronic pain condition
16403,16404,What is (are) Kidney Stones ?,A kidney stone is a solid piece of material th...,Other,National Institute of Diabetes and Digestive a...
16404,16405,What is (are) Meniere's Disease ?,Meniere's disease is a disorder of the inner e...,Other,a disorder of the inner ear
