<a href="https://colab.research.google.com/github/gonzalovaldenebro/NaturalLanguageProcessing-Portfolio/blob/main/F7_1_TransferLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Handling Long-Term Information in Recurrent Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb)

## Reference

Hugging Face NLP Course Chapter 1: Transformer Models https://huggingface.co/learn/nlp-course/chapter1/1

Hugging Face NLP Course Chapter 3: Fine-tuning a model with the Trainer API or Keras https://huggingface.co/learn/nlp-course/chapter3/1

Hugging Face NLP Course Chapter 7, Section 5: Summarization https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf

In [55]:
import sys
!{sys.executable} -m pip install --no-cache-dir datasets keras tensorflow sentencepiece

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Transfer Learning

**Transfer Learning** is the process of taking a model that was trained (**pre-trained**) on one task and then **fine tuned** for another task.

Today we're going to practice fine-tuning a pre-trained **transformer** model - we'll cover transformers in more detail next week, but they work a lot like the other neural network models we've looked at so far.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/pretraining.svg?raw=1" width=700>
    <br />
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/finetuning.svg?raw=1" width=700>
</div>

image source: https://huggingface.co/learn/nlp-course/chapter1/4?fw=tf

## Common pre-trained models

There are a variety of pre-trained models out there
* usually *very large*
* pretrained on *massive amounts of data*

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/model_parameters.png?raw=1" width=800>
</div>

**Encoders:** BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa
* Usually trained on masked input - model tries to predict the missing word in a sequence


**Decoders:** CTRL, GPT, GPT-2, Transformer XL
* Neural language models - usually trying to predict the next word in a sequence

**Encoder-Decoder Models:** BART, mBART, Marian, T5
* full sequence-to-sequence models


## Working Example

We're going to work through our text-to-emoji example, fine-tuning a variant of T5.

### Load and filter our dataset just like before

In [56]:
from datasets import load_dataset


# Define a function to check if 'text' is not None
def is_not_none(example):
    return example['text'] is not None

dataset = load_dataset("KomeijiForce/Text2Emoji",split="train")

# Filter the dataset
dataset = dataset.filter(is_not_none)
dataset

Dataset({
    features: ['text', 'emoji', 'topic'],
    num_rows: 503682
})

### choosing a sample to work with

Even the smaller transformer models will take too long to train on in class

Let's choose a small sample to work on in class

In [57]:
# Shuffle the dataset
shuffled_dataset = dataset.shuffle(seed=42)

# Select a small sample
sample_size = 1000  # Define your sample size
sample_dataset = shuffled_dataset.select(range(sample_size))

#if you want to use the entire dataset just uncomment the following
#sample_dataset = shuffled_dataset

### Train/test split

Hugging Face datasets actually include a `train_test_split` function for splitting into training and testing sets if you don't already have them split.

In [58]:
dataset_split = sample_dataset.train_test_split(test_size=0.2)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 200
    })
})

### Reminder of what the data looks like

In [41]:
print(dataset_split["train"]["text"][46])
print(dataset_split["train"]["emoji"][46])

La lluvia est llegando con fuerza, es mejor quedarse adentro y disfrutar de una buena jornada de pelculas y mantita.
☔🏠🎥🍿🌈


### The Tokenizer

Since we will be using an existing model to start, we need to make sure we prepare our data in the same way that model was trained on.

**T5:** an encoder-decoder Transformer architecture suitable for sequence-to-sequences tasks

**mT5:** A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages

**mt5-small:** A small version of mT5, suitable for getting things working before attempting to train on a large model

`mt5-small` uses the SentencePiece tokenizer

In [26]:
from transformers import AutoTokenizer

#uses the sentencepiece tokenizer
model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

### Looking at an example of the tokenization

You'll see that the token ids get returned as `input_ids`

It also includes an `attention_mask` which allows the algorithm to focus on specific important words using its attention mechanism - it's initialized to all 1s

In [27]:
inputs = tokenizer(dataset_split["train"]["text"][46])
inputs

{'input_ids': [12808, 259, 262, 9154, 304, 6775, 288, 772, 20727, 8174, 514, 287, 64642, 1052, 305, 97794, 12954, 1052, 24375, 149200, 285, 12017, 259, 264, 609, 277, 263, 1469, 259, 262, 10112, 51053, 3153, 13636, 347, 751, 3721, 25086, 281, 772, 5810, 2586, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Converting ids back to tokens

Here's what the tokens look like.

The `▁` and `</s>` are hallmarks of the SentencePiece tokenizer algorithm

In [28]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁Add',
 '▁',
 'a',
 '▁pop',
 '▁of',
 '▁color',
 '▁to',
 '▁your',
 '▁windows',
 'ill',
 '▁with',
 '▁the',
 '▁vibr',
 'ant',
 '▁and',
 '▁flam',
 'boy',
 'ant',
 '▁Bro',
 'melia',
 'd',
 '▁plant',
 '▁',
 '-',
 '▁it',
 "'",
 's',
 '▁like',
 '▁',
 'a',
 '▁fire',
 'works',
 '▁show',
 '▁happen',
 'ing',
 '▁all',
 '▁year',
 '▁round',
 '▁in',
 '▁your',
 '▁own',
 '▁home',
 '!',
 '</s>']

### How does it work on the emojis?

Fortunately, this seems to work pretty well for the emoji output too

some may come back as `<unk>` for unknown tokens

In [29]:
target = tokenizer(dataset_split["train"]["emoji"][46])
target

{'input_ids': [259, 244125, 239199, 191820, 238833, 239328, 248969, 96088, 223546, 238782, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [30]:
tokenizer.convert_ids_to_tokens(target.input_ids)

['▁', '🎆', '🌺', '💥', '🌈', '💫', '🌆', '🔥', '🇺', '🇸', '</s>']

In [31]:
tokenizer.decode(target.input_ids)

'🎆🌺💥🌈💫🌆🔥🇺🇸</s>'

### Let's define a preprocessing function

This will allow us to tokenize both the text and labels while allow use to add the token ids from the emojis as the `"labels"` key in the overall data structure where it will be convenient to have them for training.

In [32]:
max_input_length = 100
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["emoji"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



Hugging Face datasets have a `map` method that allows you to apply a preprocessing function like this to every example in the data set.

Notice that we get everything we had before (text, emoji, topic), but now we also have the input_ids (the tokens), the attention mask, and the labels (also token ids).

In [33]:
#turn the tokenized data back into a dataset
tokenized_datasets = dataset_split.map(preprocess_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

### Grabbing the pre-trained model

as a reminder, `model_checkpoint` was defined earlier - it is `"google/mt5-small"`

Note that this is an encoder-decoder transformer model the was pretrained on a 750 GB dataset which included tasks for summarization, translation, question answering, and classification.

In [34]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


### Using a data collator

Hugging Face provides a Data Collator class which is used to collect the training data into batches and dynamically pad them so that each batch is appropriately padded but without an overall fixed length.

With `return_tensors="tf"` we're saying we want the data back in an appropriate data structure suitable for using with Keras/Tensorflow.

In [35]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

Let's make a version of the dataset where the original text fields are removed so we can use it with the collator.

In [36]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(["text","emoji","topic"])
tokenized_datasets_no_text

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [37]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

### Setting up the optimizer

When fine-tuning a pre-trained algorithm, you usually want to use a smaller learning rate.

Note that we do not specify a loss function - it will use whatever was used in the base model.

*NB:* I'm using values that were in the example on the website (https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf ) for a different dataset - I don't know if these are the best for this problem

In [38]:
from transformers import create_optimizer
import tensorflow as tf

num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16 - can be helpful if running on a GPU
tf.keras.mixed_precision.set_global_policy("mixed_float16")

In [39]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.src.callbacks.History at 0x2ffc116d0>

### Saving a copy of the model's weights

This will allow us to load the model later and work with it without completely retraining.

In [None]:
model.save_pretrained("models/emoji-model-v2")

### Reload a saved model

In [None]:
model = TFAutoModelForSeq2SeqLM.from_pretrained("models/emoji-model-v1")

### Inference

Let's suppose we have an example to get a prediction for. For now, let's grab one from the test set

In [44]:
print( tokenized_datasets["test"]["text"][15] )
print( tokenized_datasets["test"]["emoji"][15])
print( tokenized_datasets["test"]["input_ids"][15])

Looking for the perfect gadget to streamline your home organization? You'll love these versatile storage bins.
📦✨🏠💕
[259, 61408, 332, 287, 5571, 259, 88586, 288, 38101, 1397, 772, 2586, 29660, 291, 1662, 277, 1578, 3869, 259, 3824, 259, 170280, 468, 25973, 3111, 263, 260, 1]


Use the `generate` method to get a prediction sequence from the intput IDs.

If you don't already have the tokens, make sure to use your tokenizer first.

In [45]:
prediction = model.generate([tokenized_datasets["test"]["input_ids"][15]], max_length=max_target_length)
tokenizer.convert_ids_to_tokens(prediction[0])

2023-11-21 16:30:08.213844: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x34050cb60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-11-21 16:30:08.214001: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-11-21 16:30:08.236120: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-21 16:30:08.338200: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


['<pad>', '▁<extra_id_0>', '.', '</s>']

In [46]:
decoded_output = tokenizer.decode(prediction[0], skip_special_tokens=True)
decoded_output

'<extra_id_0>.'

## Applied Exploration

The applied exploration for this fortnight will be a little different. I want everyone to get some experience fine-tuning an existing model, so this will be the task for the entire fortnight.

Fine-tune an existing model with the following requirements
* Choose a different starting model - you can use any Hugging Face model, but consider starting with a general one like BART or Llama2.
* Choose a different data set - think about something that would be good to include in an application that interests you
* Evaluate how well it performed. For sequence-to-sequence model, try going back and using Rouge from Fortnight 1.

# Fine-tunning the Dynamic Tinibert Model with Medical Data


I will be working on fine-tunning the existing [Dynamic-TinyBERT](https://huggingface.co/Intel/dynamic_tinybert) with medical data, this is the [dataset](https://huggingface.co/datasets/GonzaloValdenebro/MedicalQuestionAnsweringDataset) that I will be using for that.

In [1]:
from datasets import load_dataset

MedData = (load_dataset("GonzaloValdenebro/MedicalQuestionAnsweringDataset", split='train')
        .train_test_split(train_size=0.80, test_size=0.20))
MedData


DatasetDict({
    train: Dataset({
        features: ['id', 'Question', 'Context', 'Topic', 'Answer'],
        num_rows: 13124
    })
    test: Dataset({
        features: ['id', 'Question', 'Context', 'Topic', 'Answer'],
        num_rows: 3282
    })
})

## Importing the model and tokenizer

In [4]:
from transformers import AutoTokenizer

model_checkpoint = "Intel/dynamic_tinybert"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer

BertTokenizerFast(name_or_path='Intel/dynamic_tinybert', vocab_size=30522, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

## Checking tokenizer results

We can pass to our tokenizer the question and the context together, and it will properly insert the special tokens to form a sentence like this:

*[CLS] question [SEP] context [SEP]*


In [6]:
context = MedData["train"][0]["Context"]
question = MedData["train"][0]["Question"]

inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

'[CLS] what is ( are ) x - linked creatine deficiency? [SEP] x - linked creatine deficiency is an inherited disorder that primarily affects the brain. people with this disorder have intellectual disability, which can range from mild to severe, and delayed speech development. some affected individuals develop behavioral disorders such as attention deficit hyperactivity disorder or autistic behaviors that affect communication and social interaction. they may also experience seizures. children with x - linked creatine deficiency may experience slow growth and exhibit delayed development of motor skills such as sitting and walking. affected individuals tend to tire easily. a small number of people with x - linked creatine deficiency have additional signs and symptoms including abnormal heart rhythms, an unusually small head ( microcephaly ), or distinctive facial features such as a broad forehead and a flat or sunken appearance of the middle of the face ( midface hypoplasia ). [SEP]'

To see how this works using the current example, we can limit the length to 100 and use a sliding window of 50 tokens. As a reminder, we use:

- max_length to set the maximum length (here 100)
- truncation="only_second" to truncate the context (which is in the second position) when the question with its context is too long
- stride to set the number of overlapping tokens between two successive chunks (here 50)
- return_overflowing_tokens=True to let the tokenizer know we want the overflowing tokens

In [7]:
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] what is ( are ) x - linked creatine deficiency? [SEP] x - linked creatine deficiency is an inherited disorder that primarily affects the brain. people with this disorder have intellectual disability, which can range from mild to severe, and delayed speech development. some affected individuals develop behavioral disorders such as attention deficit hyperactivity disorder or autistic behaviors that affect communication and social interaction. they may also experience seizures. children with x - linked creatine deficiency may experience slow growth and [SEP]
[CLS] what is ( are ) x - linked creatine deficiency? [SEP] delayed speech development. some affected individuals develop behavioral disorders such as attention deficit hyperactivity disorder or autistic behaviors that affect communication and social interaction. they may also experience seizures. children with x - linked creatine deficiency may experience slow growth and exhibit delayed development of motor skills such as sitti

The dataset provides us with the start character of the answer in the context, and by adding the length of the answer, we can find the end character in the context. To map those to token indices, we will need to use the offset mappings. We can have our tokenizer return these by passing along return_offsets_mapping=True:

In [8]:
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

As we can see, we get back the usual input IDs, token type IDs, and attention mask, as well as the offset mapping we required and an extra key, overflow_to_sample_mapping. The corresponding value will be of use to us when we tokenize several texts at the same time (which we should do to benefit from the fact that our tokenizer is backed by Rust). Since one sample can give several features, it maps each feature to the example it originated from. Because here we only tokenized one example, we get a list of 0s:

In [9]:
inputs["overflow_to_sample_mapping"]

[0, 0, 0, 0]

But if we tokenize more examples, this will become more useful:

In [10]:
inputs = tokenizer(
    MedData["train"][2:6]["Question"],
    MedData["train"][2:6]["Context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

The 4 examples gave 18 features.
Here is where each comes from: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3].


As we can see, the first three examples (at indices 2, 3, and 4 in the training set) each gave four features and the last example (at index 5 in the training set) gave 7 features.

This information will be useful to map each feature we get to its corresponding label. As mentioned earlier, those labels are:

- ( 0, 0) if the answer is not in the corresponding span of the context
- (start_position, end_position) if the answer is in the corresponding span of the context, with start_position being the index of the token (in the input IDs) at the start of the answer and end_position being the index of the token (in the input IDs) where the answer ends

To determine which of these is the case and, if relevant, the positions of the tokens, we first find the indices that start and end the context in the input IDs. We could use the token type IDs to do this, but since those do not necessarily exist for all models (DistilBERT does not require them, for instance), we’ll instead use the sequence_ids() method of the BatchEncoding our tokenizer returns.

Once we have those token indices, we look at the corresponding offsets, which are tuples of two integers representing the span of characters inside the original context. We can thus detect if the chunk of the context in this feature starts after the answer or ends before the answer begins (in which case the label is (0, 0)). If that’s not the case, we loop to find the first and last token of the answer:

In [49]:
answers = MedData["train"][2:6]["Answer"]
start_positions = []
end_positions = []

for i, offset_tuple in enumerate(inputs["offset_mapping"]):
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    answer = answers[sample_idx]

    # Check if offset_tuple is not empty
    if offset_tuple and isinstance(answer, list) and len(answer) > 0:
        # Assuming you want to consider the first string in the list
        answer_text = answer[0]

        # Extract the start and end offsets from the tuple
        start_offset, end_offset = offset_tuple[0], offset_tuple[1]

        # Use the start offset directly
        start_char = start_offset
        end_char = start_char + len(answer_text)

        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if end_offset < start_char or end_offset < start_offset or start_offset > end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise, it's the start and end token positions
            idx = context_start
            while idx <= context_end and end_offset >= start_offset:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and start_offset <= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    else:
        # Handle the case where offset_tuple is empty or answer is not in the expected format
        start_positions.append(0)
        end_positions.append(0)

start_positions, end_positions


([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Let’s take a look at a few results to verify that our approach is correct. For the first feature we find (0, 0) as labels, so let’s compare the theoretical answer with the decoded span of tokens from 0 to 0 (inclusive):

In [46]:
idx = 0
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end + 1])

print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")

Theoretical answer: people who have diabetes, labels give: [CLS]


So that’s a match! Now let’s check index 4, where we set the labels to (0, 0), which means the answer is not in the context chunk of that feature:

In [52]:
idx = 4
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]

decoded_example = tokenizer.decode(inputs["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}")

Theoretical answer: people who have diabetes, decoded example: [CLS] who is at risk for peripheral artery disease?? [SEP] for p. a. d. plaque builds up in your arteries as you age. older age combined with other risk factors, such as smoking or diabetes, also puts you at higher risk for p. a. d. diseases and conditions many diseases and conditions can raise your risk of p. a. d., including : diabetes high blood pressure high blood cholesterol coronary heart disease stroke metabolic syndrome [SEP]


Now that we have seen step by step how to preprocess our training data, we can group it in a function we will apply on the whole training dataset. We’ll pad every feature to the maximum length we set, as most of the contexts will be long (and the corresponding samples will be split into several features), so there is no real benefit to applying dynamic padding here:

In [53]:
max_length = 384
stride = 128

def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["Question"]]
    inputs = tokenizer(
        questions,
        examples["Context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["Answer"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]

        if isinstance(answer, list) and len(answer) > 0:
            # Assuming you want to consider the first string in the list
            answer_text = answer[0]
            start_char = offset[0]  # Assuming start character is at the beginning of the offset
            end_char = start_char + len(answer_text)

            sequence_ids = inputs.sequence_ids(i)

            # Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1

            # If the answer is not fully inside the context, label is (0, 0)
            if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Otherwise, it's the start and end token positions
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)

                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)
        else:
            # Handle the case where answer is not in the expected format
            start_positions.append(0)
            end_positions.append(0)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs


Note that we defined two constants to determine the maximum length used as well as the length of the sliding window, and that we added a tiny bit of cleanup before tokenizing: some of the questions in the MedData dataset have extra spaces at the beginning and the end that don’t add anything (and take up space when being tokenized if you use a model like RoBERTa), so we removed those extra spaces.

To apply this function to the whole training set, we use the Dataset.map() method with the batched=True flag. It’s necessary here as we are changing the length of the dataset (since one example can give several training features):

In [54]:
train_dataset = MedData["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=MedData["train"].column_names,
)
len(MedData["train"]), len(train_dataset)

Map:   0%|          | 0/13124 [00:00<?, ? examples/s]

(13124, 18631)

As we can see, the preprocessing added roughly 1,000 features. Our training set is now ready to be used — let’s dig into the preprocessing of the validation set!

## Processing the validation data

Preprocessing the validation data will be slightly easier as we don’t need to generate labels (unless we want to compute a validation loss, but that number won’t really help us understand how good the model is). The real joy will be to interpret the predictions of the model into spans of the original context. For this, we will just need to store both the offset mappings and some way to match each created feature to the original example it comes from. Since there is an ID column in the original dataset, we’ll use that ID.

The only thing we’ll add here is a tiny bit of cleanup of the offset mappings. They will contain offsets for the question and the context, but once we’re in the post-processing stage we won’t have any way to know which part of the input IDs corresponded to the context and which part was the question (the sequence_ids() method we used is available for the output of the tokenizer only). So, we’ll set the offsets corresponding to the question to None:

In [59]:
MedData

DatasetDict({
    train: Dataset({
        features: ['id', 'Question', 'Context', 'Topic', 'Answer'],
        num_rows: 13124
    })
    test: Dataset({
        features: ['id', 'Question', 'Context', 'Topic', 'Answer'],
        num_rows: 3282
    })
})

In [60]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["Question"]]
    inputs = tokenizer(
        questions,
        examples["Context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

We can apply this function on the whole validation dataset like before:/

In [61]:
validation_dataset = MedData["test"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=MedData["test"].column_names,
)
len(MedData["test"]), len(validation_dataset)

Map:   0%|          | 0/3282 [00:00<?, ? examples/s]

(3282, 4579)

In this case we’ve only added a couple of hundred samples, so it appears the contexts in the validation dataset are a bit shorter.

Now that we have preprocessed all the data, we can get to the training.

## Fine-tuning the model with the Trainer API


The training code for this example will look a lot like the code in the previous sections — the hardest thing will be to write the compute_metrics() function. Since we padded all the samples to the maximum length we set, there is no data collator to define, so this metric computation is really the only thing we have to worry about. The difficult part will be to post-process the model predictions into spans of text in the original examples; once we have done that, the metric from the 🤗 Datasets library will do most of the work for us.

The model will output logits for the start and end positions of the answer in the input IDs, as we saw during our exploration of the question-answering pipeline. The post-processing step will be similar to what we did there, so here’s a quick reminder of the actions we took:

We masked the start and end logits corresponding to tokens outside of the context.
We then converted the start and end logits into probabilities using a softmax.
We attributed a score to each (start_token, end_token) pair by taking the product of the corresponding two probabilities.
We looked for the pair with the maximum score that yielded a valid answer (e.g., a start_token lower than end_token).
Here we will change this process slightly because we don’t need to compute actual scores (just the predicted answer). This means we can skip the softmax step. To go faster, we also won’t score all the possible (start_token, end_token) pairs, but only the ones corresponding to the highest n_best logits (with n_best=20). Since we will skip the softmax, those scores will be logit scores, and will be obtained by taking the sum of the start and end logits (instead of the product, because of the rule 
log

To demonstrate all of this, we will need some kind of predictions. Since we have not trained our model yet, we are going to use the default model for the QA pipeline to generate some predictions on a small part of the validation set. We can use the same processing function as before; because it relies on the global constant tokenizer, we just have to change that object to the tokenizer of the model we want to use temporarily:

In [62]:
small_eval_set = MedData["test"].select(range(100))
#trained_checkpoint = "distilbert-base-cased-distilled-squad"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
eval_set = small_eval_set.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=MedData["test"].column_names,
)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Now that the preprocessing is done, we change the tokenizer back to the one we originally picked:

In [64]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer

BertTokenizerFast(name_or_path='Intel/dynamic_tinybert', vocab_size=30522, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

We then remove the columns of our eval_set that are not expected by the model, build a batch with all of that small validation set, and pass it through the model. If a GPU is available, we use it to go faster:

In [65]:
import torch
from transformers import AutoModelForQuestionAnswering

eval_set_for_model = eval_set.remove_columns(["example_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}
trained_model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint).to(
    device
)

with torch.no_grad():
    outputs = trained_model(**batch)

Since the Trainer will give us predictions as NumPy arrays, we grab the start and end logits and convert them to that format:

In [66]:
start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

Now, we need to find the predicted answer for each example in our small_eval_set. One example may have been split into several features in eval_set, so the first step is to map each example in small_eval_set to the corresponding features in eval_set:

In [67]:
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)

With this in hand, we can really get to work by looping through all the examples and, for each example, through all the associated features. As we said before, we’ll look at the logit scores for the n_best start logits and end logits, excluding positions that give:

- An answer that wouldn’t be inside the context
- An answer with negative length
- An answer that is too long (we limit the possibilities at max_answer_length=30)

Once we have all the scored possible answers for one example, we just pick the one with the best logit score:

In [68]:
import numpy as np

n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example["id"]
    context = example["Context"]
    answers = []

    for feature_index in example_to_features[example_id]:
        start_logit = start_logits[feature_index]
        end_logit = end_logits[feature_index]
        offsets = eval_set["offset_mapping"][feature_index]

        start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
        end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                # Skip answers with a length that is either < 0 or > max_answer_length.
                if (
                    end_index < start_index
                    or end_index - start_index + 1 > max_answer_length
                ):
                    continue

                answers.append(
                    {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                )

    best_answer = max(answers, key=lambda x: x["logit_score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})

The final format of the predicted answers is the one that will be expected by the metric we will use. As usual, we can load it with the help of the 🤗 Evaluate library:

In [74]:
import evaluate

metric = evaluate.load("squad")

This metric expects the predicted answers in the format we saw above (a list of dictionaries with one key for the ID of the example and one key for the predicted text) and the theoretical answers in the format below (a list of dictionaries with one key for the ID of the example and one key for the possible answers):

In [76]:
theoretical_answers = [
    {"id": ex["id"], "answers": ex["Answer"]} for ex in small_eval_set
]

We can now check that we get sensible results by looking at the first element of both lists:

In [77]:
print(predicted_answers[0])
print(theoretical_answers[0])

{'id': 3050, 'prediction_text': 'Diagnostic Tests  - Drug Therapy  - Surgery and Rehabilitation  - Genetic Counseling   - Palliative Care'}
{'id': 3050, 'answers': 'Genetic Testing Registry'}


Not too bad! Now let’s have a look at the score the metric gives us:

In [79]:
from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["Context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})
    

    # Modify the format of predicted_answers
    formatted_predicted_answers = [{'id': str(pred['id']), 'prediction_text': pred['prediction_text']} for pred in predicted_answers]

    # Modify the format of theoretical_answers
    formatted_theoretical_answers = [{'id': str(ex['id']), 'answers': [{'text': ex['Answer'], 'answer_start': 0}]} for ex in examples]
    
    # Assuming 'metric' is an instance of your evaluation module
    result = metric.compute(predictions=formatted_predicted_answers, references=formatted_theoretical_answers)

    return result

# Call the compute_metrics function with the necessary arguments
compute_metrics(start_logits, end_logits, eval_set, small_eval_set)


  0%|          | 0/100 [00:00<?, ?it/s]

{'exact_match': 38.0, 'f1': 57.82542720952935}

In [103]:
from tqdm.auto import tqdm
import collections
import numpy as np

from datasets import load_metric

# Load the SQuAD metric
squad_metric = load_metric("squad")

def compute_metrics(start_logits, end_logits, features, examples, n_best=10, max_answer_length=30, metric=None):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["Context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})
    
    # Modify the format of predicted_answers
    formatted_predicted_answers = [{'id': str(pred['id']), 'prediction_text': pred['prediction_text']} for pred in predicted_answers]

    # Modify the format of theoretical_answers
    formatted_theoretical_answers = [{'id': str(ex['id']), 'answers': [{'text': ex['Answer'], 'answer_start': 0}]} for ex in examples]
    
    # Assuming 'metric' is an instance of your evaluation module
    if metric is not None:
        result = metric.compute(predictions=formatted_predicted_answers, references=formatted_theoretical_answers)
        return result
    else:
        return None


# Call the compute_metrics function with the necessary arguments
compute_metrics(start_logits, end_logits, eval_set, small_eval_set, n_best=10, max_answer_length=30, metric=squad_metric)


  0%|          | 0/100 [00:00<?, ?it/s]

{'exact_match': 1.0, 'f1': 9.956097229298873}

Actually this results are not really good, given that the **"Intel/dynamic_tinybert"** performance analysis trained on the SQuAD1.1. was giving a 88.71 F1 and now we get 58, so fur sure lots of work to do here in the future

## Fine-tuning the model

We are now ready to train our model. Let’s create it first, using the AutoModelForQuestionAnswering class like before:

In [81]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
model

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-5): 6 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

As usual, we get a warning that some weights are not used (the ones from the pretraining head) and some others are initialized randomly (the ones for the question answering head). You should be used to this by now, but that means this model is not ready to be used just yet and needs fine-tuning — good thing we’re about to do that!

To be able to push our model to the Hub, we’ll need to log in to Hugging Face. If you’re running this code in a notebook, you can do so with the following utility function, which displays a widget where you can enter your login credentials:

In [82]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Once this is done, we can define our TrainingArguments. As we said when we defined our function to compute the metric, we won’t be able to have a regular evaluation loop because of the signature of the compute_metrics() function. We could write our own subclass of Trainer to do this (an approach you can find in the question answering example script), but that’s a bit too long for this section. Instead, we will only evaluate the model at the end of training here and show you how to do a regular evaluation in “A custom training loop” below.

This is really where the Trainer API shows its limits and the 🤗 Accelerate library shines: customizing the class to a specific use case can be painful, but tweaking a fully exposed training loop is easy.

Let’s take a look at our TrainingArguments:

In [87]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./Users/gonzalovaldenebro/Library/CloudStorage/OneDrive-DrakeUniversity/CS 195/ Fortnight 7/bert-finetuned-model",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=False,   # Set to False to disable mixed-precision training
    push_to_hub=False,
)

Finally, we just pass everything to the Trainer class and launch the training:

In [88]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)
trainer.train()

  0%|          | 0/6987 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.0543, 'learning_rate': 1.8568770573923e-05, 'epoch': 0.21}
{'loss': 0.0, 'learning_rate': 1.7137541147846e-05, 'epoch': 0.43}
{'loss': 0.0, 'learning_rate': 1.5706311721769e-05, 'epoch': 0.64}
{'loss': 0.0, 'learning_rate': 1.4275082295692e-05, 'epoch': 0.86}
{'loss': 0.0, 'learning_rate': 1.2843852869615001e-05, 'epoch': 1.07}
{'loss': 0.0, 'learning_rate': 1.1412623443538e-05, 'epoch': 1.29}
{'loss': 0.0, 'learning_rate': 9.981394017461e-06, 'epoch': 1.5}
{'loss': 0.0, 'learning_rate': 8.550164591383999e-06, 'epoch': 1.72}
{'loss': 0.0, 'learning_rate': 7.118935165307e-06, 'epoch': 1.93}
{'loss': 0.0, 'learning_rate': 5.6877057392299995e-06, 'epoch': 2.15}
{'loss': 0.0, 'learning_rate': 4.256476313152999e-06, 'epoch': 2.36}
{'loss': 0.0, 'learning_rate': 2.8252468870759988e-06, 'epoch': 2.58}
{'loss': 0.0, 'learning_rate': 1.3940174609989982e-06, 'epoch': 2.79}
{'train_runtime': 3492.1293, 'train_samples_per_second': 16.005, 'train_steps_per_second': 2.001, 'train_loss': 0

TrainOutput(global_step=6987, training_loss=0.0038880865149810715, metrics={'train_runtime': 3492.1293, 'train_samples_per_second': 16.005, 'train_steps_per_second': 2.001, 'train_loss': 0.0038880865149810715, 'epoch': 3.0})

Note that while the training happens, each time the model is saved (here, every epoch) it is uploaded to the Hub in the background. This way, you will be able to to resume your training on another machine if necessary. The whole training takes a while (a little over an hour on a Titan RTX), so you can grab a coffee or reread some of the parts of the course that you’ve found more challenging while it proceeds. Also note that as soon as the first epoch is finished, you will see some weights uploaded to the Hub and you can start playing with your model on its page.

Once the training is complete, we can finally evaluate our model (and pray we didn’t spend all that compute time on nothing). The predict() method of the Trainer will return a tuple where the first elements will be the predictions of the model (here a pair with the start and end logits). We send this to our compute_metrics() function:

In [108]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
model_results = compute_metrics(start_logits, end_logits, validation_dataset, MedData["test"])


  0%|          | 0/573 [00:00<?, ?it/s]

  0%|          | 0/3282 [00:00<?, ?it/s]

In [109]:
model_results

In [110]:
from datasets import load_metric

# Load the SQuAD metric
squad_metric = load_metric("squad")

# Call the compute_metrics function with the necessary arguments
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions

compute_metrics(start_logits, end_logits, validation_dataset, MedData["test"], metric=squad_metric)

  0%|          | 0/573 [00:00<?, ?it/s]

In [None]:
MedData["test"]

In [None]:
compute_metrics(start_logits, end_logits, validation_dataset, validation_dataset)

In [100]:
#for example in examples:
#    print(example)
example

{'id': 3716,
 'Question': 'What is (are) retroperitoneal fibrosis ?',
 'Context': 'Retroperitoneal fibrosis is a disorder in which inflammation and extensive scar tissue (fibrosis) occur in the back of the abdominal cavity, behind (retro-) the membrane that surrounds the organs of the digestive system (the peritoneum). This area is known as the retroperitoneal space. Retroperitoneal fibrosis can occur at any age but appears most frequently between the ages of 40 and 60.  The inflamed tissue characteristic of retroperitoneal fibrosis typically causes gradually increasing pain in the lower abdomen, back, or side. Other symptoms arise from blockage of blood flow to and from various parts of the lower body, due to the development of scar tissue around blood vessels. The fibrosis usually develops first around the aorta, which is the large blood vessel that distributes blood from the heart to the rest of the body. Additional blood vessels including the inferior vena cava, which returns blood

Great! As a comparison, the baseline scores reported in the BERT article for this model are 80.8 and 88.5, so we’re right where we should be.

Finally, we use the push_to_hub() method to make sure we upload the latest version of the model:

In [206]:
trainer.push_to_hub(commit_message="Training complete")