<a href="https://colab.research.google.com/github/gonzalovaldenebro/NaturalLanguageProcessing-Portfolio/blob/main/F7_1_TransferLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Handling Long-Term Information in Recurrent Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb)

## Reference

Hugging Face NLP Course Chapter 1: Transformer Models https://huggingface.co/learn/nlp-course/chapter1/1

Hugging Face NLP Course Chapter 3: Fine-tuning a model with the Trainer API or Keras https://huggingface.co/learn/nlp-course/chapter3/1

Hugging Face NLP Course Chapter 7, Section 5: Summarization https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf

In [55]:
import sys
!{sys.executable} -m pip install --no-cache-dir datasets keras tensorflow sentencepiece

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Transfer Learning

**Transfer Learning** is the process of taking a model that was trained (**pre-trained**) on one task and then **fine-tuning** it for another task.

Today we're going to practice fine-tuning a pre-trained **transformer** model - we'll cover transformers in more detail next week, but they work a lot like the other neural network models we've looked at so far.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/pretraining.svg?raw=1" width=700>
    <br />
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/finetuning.svg?raw=1" width=700>
</div>

image source: https://huggingface.co/learn/nlp-course/chapter1/4?fw=tf

## Common pre-trained models

There are a variety of pre-trained models out there
* usually *very large*
* pretrained on *massive amounts of data*

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/model_parameters.png?raw=1" width=800>
</div>

**Encoders:** BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa
* Usually trained on masked input - model tries to predict the missing word in a sequence


**Decoders:** CTRL, GPT, GPT-2, Transformer XL
* Neural language models - usually trying to predict the next word in a sequence

**Encoder-Decoder Models:** BART, mBART, Marian, T5
* full sequence-to-sequence models
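To make the three training setups above concrete, here is a minimal sketch of what the training examples look like in each case. It uses whole words and a single mask for readability; real pre-training works on subword tokens and masks many positions, and the helper names (`masked_lm_example`, `next_word_examples`) are made up for this illustration.

```python
import random


def masked_lm_example(tokens, mask_token="[MASK]", seed=0):
    """Encoder-style example: hide one token, predict it from context on both sides."""
    rng = random.Random(seed)
    position = rng.randrange(len(tokens))
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, position, target


def next_word_examples(tokens):
    """Decoder-style examples: each prefix is an input, the following token is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat", "on", "the", "mat"]

masked, position, target = masked_lm_example(tokens)
print("masked input:", masked, "-> target:", target)

for prefix, next_word in next_word_examples(tokens):
    print("prefix:", prefix, "-> next word:", next_word)

# Encoder-decoder style example: a whole input sequence maps to a whole output sequence
seq2seq_example = ("It is raining, stay inside", "☔🏠")
print("input:", seq2seq_example[0], "-> output:", seq2seq_example[1])
```
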


## Working Example

We're going to work through our text-to-emoji example, fine-tuning a variant of T5.

### Load and filter our dataset just like before

In [56]:
from datasets import load_dataset


# Define a function to check if 'text' is not None
def is_not_none(example):
    return example['text'] is not None

dataset = load_dataset("KomeijiForce/Text2Emoji",split="train")

# Filter the dataset
dataset = dataset.filter(is_not_none)
dataset

Dataset({
    features: ['text', 'emoji', 'topic'],
    num_rows: 503682
})

### Choosing a sample to work with

Even the smaller transformer models will take too long to train on the full dataset in class.

Let's choose a small sample to work with instead.

In [57]:
# Shuffle the dataset
shuffled_dataset = dataset.shuffle(seed=42)

# Select a small sample
sample_size = 1000  # Define your sample size
sample_dataset = shuffled_dataset.select(range(sample_size))

#if you want to use the entire dataset just uncomment the following
#sample_dataset = shuffled_dataset

### Train/test split

Hugging Face datasets actually include a `train_test_split` function for splitting into training and testing sets if you don't already have them split.

In [58]:
# Fix the seed so the same examples land in train/test every time the cell is rerun
dataset_split = sample_dataset.train_test_split(test_size=0.2, seed=42)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 200
    })
})

### Reminder of what the data looks like

In [41]:
print(dataset_split["train"]["text"][46])
print(dataset_split["train"]["emoji"][46])

La lluvia está llegando con fuerza, es mejor quedarse adentro y disfrutar de una buena jornada de películas y mantita.
☔🏠🎥🍿🌈


### The Tokenizer

Since we will be starting from an existing model, we need to make sure we prepare our data the same way it was prepared when that model was trained.

**T5:** an encoder-decoder Transformer architecture suitable for sequence-to-sequence tasks

**mT5:** A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages

**mt5-small:** A small version of mT5, suitable for getting things working before attempting to train on a large model

`mt5-small` uses the SentencePiece tokenizer

In [26]:
from transformers import AutoTokenizer

#uses the sentencepiece tokenizer
model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

### Looking at an example of the tokenization

You'll see that the token ids get returned as `input_ids`

It also includes an `attention_mask`, which tells the model's attention mechanism which positions hold real tokens (1) and which are padding to be ignored (0) - for a single unpadded sequence, it's all 1s

In [27]:
inputs = tokenizer(dataset_split["train"]["text"][46])
inputs

{'input_ids': [12808, 259, 262, 9154, 304, 6775, 288, 772, 20727, 8174, 514, 287, 64642, 1052, 305, 97794, 12954, 1052, 24375, 149200, 285, 12017, 259, 264, 609, 277, 263, 1469, 259, 262, 10112, 51053, 3153, 13636, 347, 751, 3721, 25086, 281, 772, 5810, 2586, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Converting ids back to tokens

Here's what the tokens look like.

The `▁` marks the start of a new word (SentencePiece encodes the preceding space as part of the token), and `</s>` is the end-of-sequence token that the tokenizer appends.

In [28]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁Add',
 '▁',
 'a',
 '▁pop',
 '▁of',
 '▁color',
 '▁to',
 '▁your',
 '▁windows',
 'ill',
 '▁with',
 '▁the',
 '▁vibr',
 'ant',
 '▁and',
 '▁flam',
 'boy',
 'ant',
 '▁Bro',
 'melia',
 'd',
 '▁plant',
 '▁',
 '-',
 '▁it',
 "'",
 's',
 '▁like',
 '▁',
 'a',
 '▁fire',
 'works',
 '▁show',
 '▁happen',
 'ing',
 '▁all',
 '▁year',
 '▁round',
 '▁in',
 '▁your',
 '▁own',
 '▁home',
 '!',
 '</s>']
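As a rough illustration of how the `▁` marker works, the pieces can be glued back into text by joining them and turning each `▁` into a space. This is a simplified sketch with a made-up helper name (`pieces_to_text`); in practice `tokenizer.decode` does this for you and handles more edge cases.

```python
def pieces_to_text(pieces, space_marker="▁", eos="</s>"):
    """Join SentencePiece-style pieces and turn the space marker back into spaces."""
    kept = [piece for piece in pieces if piece != eos]
    return "".join(kept).replace(space_marker, " ").strip()


# The first few pieces from the output above, plus the end-of-sequence token
pieces = ["▁Add", "▁", "a", "▁pop", "▁of", "▁color", "▁to", "▁your", "▁windows", "ill", "</s>"]
print(pieces_to_text(pieces))  # Add a pop of color to your windowsill
```
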

### How does it work on the emojis?

Fortunately, this seems to work pretty well for the emoji output too.

Some emojis may come back as `<unk>` (the unknown token) if they aren't in the tokenizer's vocabulary.

In [29]:
target = tokenizer(dataset_split["train"]["emoji"][46])
target

{'input_ids': [259, 244125, 239199, 191820, 238833, 239328, 248969, 96088, 223546, 238782, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [30]:
tokenizer.convert_ids_to_tokens(target.input_ids)

['▁', '🎆', '🌺', '💥', '🌈', '💫', '🌆', '🔥', '🇺', '🇸', '</s>']

In [31]:
tokenizer.decode(target.input_ids)

'🎆🌺💥🌈💫🌆🔥🇺🇸</s>'

### Let's define a preprocessing function

This will allow us to tokenize both the text and the emojis, and to add the token ids from the emojis under the `"labels"` key of the overall data structure, where it will be convenient to have them for training.

In [32]:
max_input_length = 100
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["emoji"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



Hugging Face datasets have a `map` method that allows you to apply a preprocessing function like this to every example in the data set.

Notice that we get everything we had before (text, emoji, topic), but now we also have the input_ids (the tokens), the attention mask, and the labels (also token ids).
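To see why the original columns survive and new ones appear, here is a toy sketch of what a batched `map` does: the function receives a dictionary of columns (each a list) for a slice of rows and returns new columns of the same length. `toy_tokenizer`, `toy_preprocess`, and `toy_map` are stand-ins invented for this illustration, not the real `datasets` implementation.

```python
def toy_tokenizer(texts, max_length):
    """Stand-in tokenizer: one fake id per word (its length), truncated to max_length."""
    return {"input_ids": [[len(word) for word in text.split()][:max_length] for text in texts]}


def toy_preprocess(examples):
    """Same shape as preprocess_function: tokenize inputs, attach target ids as labels."""
    model_inputs = toy_tokenizer(examples["text"], max_length=4)
    labels = toy_tokenizer(examples["emoji"], max_length=2)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def toy_map(columns, function, batch_size=2):
    """Call `function` on slices of the columns and add whatever columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


data = {
    "text": ["rain is coming stay inside today", "happy birthday", "good morning"],
    "emoji": ["☔ 🏠 🎥", "🎂", "☀ ☕"],
}
mapped = toy_map(data, toy_preprocess)
print(list(mapped))          # ['text', 'emoji', 'input_ids', 'labels']
print(mapped["input_ids"])   # [[4, 2, 6, 4], [5, 8], [4, 7]]
print(mapped["labels"])      # [[1, 1], [1], [1, 1]]
```
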

In [33]:
#turn the tokenized data back into a dataset
tokenized_datasets = dataset_split.map(preprocess_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

### Grabbing the pre-trained model

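As a reminder, `model_checkpoint` was defined earlier - it is `"google/mt5-small"`

Note that this is an encoder-decoder transformer model that was pretrained on the multilingual mC4 corpus. The original T5 was pretrained on the 750 GB C4 dataset together with supervised tasks such as summarization, translation, question answering, and classification, but mT5 was pretrained without any supervised tasks, so it has to be fine-tuned before it is useful on a downstream task.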

In [34]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


### Using a data collator

Hugging Face provides a Data Collator class which is used to collect the training data into batches and dynamically pad them so that each batch is appropriately padded but without an overall fixed length.

With `return_tensors="tf"` we're saying we want the data back in a data structure suitable for use with Keras/TensorFlow.
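The idea of dynamic padding can be sketched in plain Python: each batch is padded only up to the length of its own longest example, so different batches end up with different widths. This is a simplified illustration with a made-up `collate` helper, not the real collator; it assumes a pad id of 0 for inputs and -100 for labels, the value `DataCollatorForSeq2Seq` uses by default so that padded label positions are ignored by the loss.

```python
def collate(examples, pad_id=0, label_pad_id=-100):
    """Pad one batch to the length of its own longest input and longest label."""
    max_input = max(len(ex["input_ids"]) for ex in examples)
    max_label = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        input_pad = max_input - len(ex["input_ids"])
        label_pad = max_label - len(ex["labels"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * input_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * input_pad)
        batch["labels"].append(ex["labels"] + [label_pad_id] * label_pad)
    return batch


# Made-up token ids, just to show the shapes
examples = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [5, 1], "labels": [9, 8, 1]},
    {"input_ids": [5, 6, 7, 8, 9, 10, 1], "labels": [9, 1]},
    {"input_ids": [5, 6, 1], "labels": [9, 1]},
]

first = collate(examples[:2])
second = collate(examples[2:])
print(len(first["input_ids"][0]), len(second["input_ids"][0]))  # 4 7
print(first["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(first["labels"])          # [[9, 1, -100], [9, 8, 1]]
```
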

In [35]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

Let's make a version of the dataset where the original text fields are removed so we can use it with the collator.

In [36]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(["text","emoji","topic"])
tokenized_datasets_no_text

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [37]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

### Setting up the optimizer

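When fine-tuning a pre-trained model, you usually want to use a smaller learning rate than you would when training from scratch, so the updates don't wipe out what the model has already learned.

Note that we do not specify a loss function - when `compile` is called without one, Hugging Face TensorFlow models fall back to their own built-in loss, computed from the `labels` in each batch.

*NB:* I'm using values that were in the example on the website (https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf ) for a different dataset - I don't know if these are the best for this problem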
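As a rough picture of what the learning-rate schedule does, the sketch below assumes the default behavior of `create_optimizer`: an optional linear warmup followed by a linear decay from `init_lr` down to zero over the remaining training steps. The function `linear_decay_lr` is written for this illustration only, and the step counts assume the 800-example training split and batch size of 32 used in this notebook.

```python
def linear_decay_lr(step, init_lr, num_train_steps, num_warmup_steps=0, end_lr=0.0):
    """Learning rate at a given step: linear warmup, then linear decay to end_lr."""
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = num_train_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return (init_lr - end_lr) * (1.0 - progress) + end_lr


steps_per_epoch = 800 // 32          # assumed: 800 training examples, batch_size=32
num_train_steps = steps_per_epoch * 8  # 8 epochs, as in the cell below

for step in (0, 50, 100, 150, 200):
    print(step, linear_decay_lr(step, init_lr=5.6e-5, num_train_steps=num_train_steps))
```

The learning rate starts at `5.6e-5`, is halved by the midpoint of training, and reaches zero on the last step, which is why `num_train_steps` needs to match the number of epochs actually passed to `fit`.
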

In [38]:
from transformers import create_optimizer
import tensorflow as tf

num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16 - can be helpful if running on a GPU
tf.keras.mixed_precision.set_global_policy("mixed_float16")

In [39]:
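# Reuse num_train_epochs so the epoch count matches the schedule built by create_optimizer
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_train_epochs)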

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.src.callbacks.History at 0x2ffc116d0>

### Saving a copy of the model's weights

This will allow us to load the model later and work with it without completely retraining.

In [None]:
model.save_pretrained("models/emoji-model-v2")

### Reload a saved model

In [None]:
# Load from the same directory that save_pretrained wrote to above
model = TFAutoModelForSeq2SeqLM.from_pretrained("models/emoji-model-v2")

### Inference

Let's suppose we have an example to get a prediction for. For now, let's grab one from the test set

In [44]:
print( tokenized_datasets["test"]["text"][15] )
print( tokenized_datasets["test"]["emoji"][15])
print( tokenized_datasets["test"]["input_ids"][15])

Looking for the perfect gadget to streamline your home organization? You'll love these versatile storage bins.
📦✨🏠💕
[259, 61408, 332, 287, 5571, 259, 88586, 288, 38101, 1397, 772, 2586, 29660, 291, 1662, 277, 1578, 3869, 259, 3824, 259, 170280, 468, 25973, 3111, 263, 260, 1]


Use the `generate` method to get a prediction sequence from the input IDs.

If you don't already have the tokens, make sure to use your tokenizer first.

In [45]:
prediction = model.generate([tokenized_datasets["test"]["input_ids"][15]], max_length=max_target_length)
tokenizer.convert_ids_to_tokens(prediction[0])

2023-11-21 16:30:08.213844: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x34050cb60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-11-21 16:30:08.214001: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-11-21 16:30:08.236120: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-21 16:30:08.338200: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


['<pad>', '▁<extra_id_0>', '.', '</s>']

In [46]:
decoded_output = tokenizer.decode(prediction[0], skip_special_tokens=True)
decoded_output

'<extra_id_0>.'

## Applied Exploration

The applied exploration for this fortnight will be a little different. I want everyone to get some experience fine-tuning an existing model, so this will be the task for the entire fortnight.

Fine-tune an existing model with the following requirements
* Choose a different starting model - you can use any Hugging Face model, but consider starting with a general one like BART or Llama2.
* Choose a different data set - think about something that would be good to include in an application that interests you
* Evaluate how well it performed. For a sequence-to-sequence model, try going back and using ROUGE from Fortnight 1.
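As a reminder of what ROUGE measures, here is a minimal sketch of ROUGE-1 (unigram overlap) in plain Python. It splits on whitespace and skips the stemming and other normalization that full implementations apply, so treat it as an illustration of the idea rather than a replacement for a proper ROUGE library; the function name `rouge1` is made up for this example.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall, and F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps, for each word, the smaller of the two counts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Here five of the six words match in both directions, so precision, recall, and F1 all come out to 5/6.
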

# Fine-tuning the Dynamic-TinyBERT Model with Medical Data


I will be fine-tuning the existing [Dynamic-TinyBERT](https://huggingface.co/Intel/dynamic_tinybert) model on medical question-answering data, using this [dataset](https://huggingface.co/datasets/GonzaloValdenebro/MedicalQuestionAnsweringDataset).

In [169]:
from datasets import load_dataset

MedData = (load_dataset("GonzaloValdenebro/MedicalQuestionAnsweringDataset", split='train')
        .train_test_split(train_size=8000, test_size=2000))
MedData

DatasetDict({
    train: Dataset({
        features: ['Question', 'Answer', 'topic', 'split'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['Question', 'Answer', 'topic', 'split'],
        num_rows: 2000
    })
})

In [170]:
print(MedData["train"]["Question"][0])
print(MedData["train"]["Answer"][0])

print(MedData["test"]["Question"][0])
print(MedData["test"]["Answer"][0])

What are the symptoms of Microcephalic primordial dwarfism, Montreal type ?
What are the signs and symptoms of Microcephalic primordial dwarfism, Montreal type? The Human Phenotype Ontology provides the following list of signs and symptoms for Microcephalic primordial dwarfism, Montreal type. If the information is available, the table below includes how often the symptom is seen in people with this condition. You can use the MedlinePlus Medical Dictionary to look up the definitions for these medical terms. Signs and Symptoms Approximate number of patients (when available) Abnormal dermatoglyphics 90% Abnormal hair quantity 90% Abnormality of the nipple 90% Abnormality of the palate 90% Carious teeth 90% Cognitive impairment 90% Convex nasal ridge 90% Cryptorchidism 90% Dental malocclusion 90% Dry skin 90% EEG abnormality 90% Hernia of the abdominal wall 90% Hyperhidrosis 90% Hyperreflexia 90% Hypertonia 90% Kyphosis 90% Lipoatrophy 90% Low posterior hairline 90% Low-set, posteriorly ro

## Loading the Intel/dynamic_tinybert model

In [171]:
# Use a pipeline as a high-level helper
from transformers import pipeline

model = pipeline("question-answering", model="Intel/dynamic_tinybert")

In [172]:
from transformers import AutoTokenizer

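# NOTE: this loads DistilBERT's uncased WordPiece tokenizer; loading the tokenizer from the
# same checkpoint as the model ("Intel/dynamic_tinybert") is generally the safer way to keep them matched.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")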

In [173]:
inputs = tokenizer(MedData["train"]["Question"][46])
inputs


{'input_ids': [101, 2054, 2000, 2079, 2005, 2460, 6812, 2884, 8715, 1029, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [174]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['[CLS]',
 'what',
 'to',
 'do',
 'for',
 'short',
 'bow',
 '##el',
 'syndrome',
 '?',
 '[SEP]']

In [175]:
target = tokenizer(MedData["train"]["Answer"][46])
target

{'input_ids': [101, 1011, 2460, 6812, 2884, 8715, 2003, 1037, 2177, 1997, 3471, 3141, 2000, 3532, 16326, 1997, 20435, 1012, 1011, 2111, 2007, 2460, 6812, 2884, 8715, 3685, 16888, 2438, 2300, 1010, 17663, 2015, 1010, 13246, 1010, 5250, 1010, 6638, 1010, 10250, 18909, 1010, 1998, 2060, 20435, 2013, 2833, 1012, 1011, 1996, 2364, 25353, 27718, 5358, 1997, 2460, 6812, 2884, 8715, 2003, 22939, 12171, 20192, 4135, 9232, 1010, 28259, 14708, 2015, 1012, 22939, 12171, 20192, 2064, 2599, 2000, 2139, 10536, 7265, 3508, 1010, 15451, 24072, 14778, 3258, 1010, 1998, 3635, 3279, 1012, 1011, 1037, 2740, 2729, 10802, 2097, 16755, 3949, 2005, 2460, 6812, 2884, 8715, 2241, 2006, 1037, 5776, 1005, 1055, 28268, 3791, 1012, 3949, 2089, 2421, 1011, 28268, 2490, 1011, 20992, 1011, 5970, 1011, 20014, 19126, 22291, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [176]:
tokenizer.convert_ids_to_tokens(target.input_ids)

['[CLS]',
 '-',
 'short',
 'bow',
 '##el',
 'syndrome',
 'is',
 'a',
 'group',
 'of',
 'problems',
 'related',
 'to',
 'poor',
 'absorption',
 'of',
 'nutrients',
 '.',
 '-',
 'people',
 'with',
 'short',
 'bow',
 '##el',
 'syndrome',
 'cannot',
 'absorb',
 'enough',
 'water',
 ',',
 'vitamin',
 '##s',
 ',',
 'minerals',
 ',',
 'protein',
 ',',
 'fat',
 ',',
 'cal',
 '##ories',
 ',',
 'and',
 'other',
 'nutrients',
 'from',
 'food',
 '.',
 '-',
 'the',
 'main',
 'sy',
 '##mpt',
 '##om',
 'of',
 'short',
 'bow',
 '##el',
 'syndrome',
 'is',
 'dia',
 '##rr',
 '##hea',
 '##lo',
 '##ose',
 ',',
 'watery',
 'stool',
 '##s',
 '.',
 'dia',
 '##rr',
 '##hea',
 'can',
 'lead',
 'to',
 'de',
 '##hy',
 '##dra',
 '##tion',
 ',',
 'mal',
 '##nut',
 '##rit',
 '##ion',
 ',',
 'and',
 'weight',
 'loss',
 '.',
 '-',
 'a',
 'health',
 'care',
 'provider',
 'will',
 'recommend',
 'treatment',
 'for',
 'short',
 'bow',
 '##el',
 'syndrome',
 'based',
 'on',
 'a',
 'patient',
 "'",
 's',
 'nutritional',
 '

In [177]:
tokenizer.decode(target.input_ids)

"[CLS] - short bowel syndrome is a group of problems related to poor absorption of nutrients. - people with short bowel syndrome cannot absorb enough water, vitamins, minerals, protein, fat, calories, and other nutrients from food. - the main symptom of short bowel syndrome is diarrhealoose, watery stools. diarrhea can lead to dehydration, malnutrition, and weight loss. - a health care provider will recommend treatment for short bowel syndrome based on a patient's nutritional needs. treatment may include - nutritional support - medications - surgery - intestinal transplant [SEP]"

In [178]:
max_input_length = 100
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["Question"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["Answer"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [179]:
tokenized_datasets = MedData.map(preprocess_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Question', 'Answer', 'topic', 'split', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['Question', 'Answer', 'topic', 'split', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

In [180]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained("Intel/dynamic_tinybert")
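One thing to keep in mind with `AutoModelForQuestionAnswering`: it loads an extractive question-answering head, which is trained to point at a span inside a context passage using `start_positions` and `end_positions`, rather than to generate answer tokens. The sketch below illustrates what that kind of label looks like using whitespace-split words; `find_answer_span` is a hypothetical helper, and a real pipeline would locate the span in subword tokens using the tokenizer's offsets.

```python
def find_answer_span(context_tokens, answer_tokens):
    """Return (start, end) token indices of the answer inside the context, or None."""
    n = len(answer_tokens)
    if n == 0:
        return None
    for start in range(len(context_tokens) - n + 1):
        if context_tokens[start:start + n] == answer_tokens:
            return start, start + n - 1
    return None


context = ("short bowel syndrome is a group of problems related to "
           "poor absorption of nutrients").split()
answer = "poor absorption of nutrients".split()

span = find_answer_span(context, answer)
print(span)  # (10, 13)
```

A dataset with only question and answer columns has no context passage to point into, so it would need either a context column added or a generative (sequence-to-sequence) model instead.
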

In [181]:
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()

In [182]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(["Question","Answer"])
tokenized_datasets_no_text

DatasetDict({
    train: Dataset({
        features: ['topic', 'split', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['topic', 'split', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

In [None]:
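# Training arguments
training_args = TrainingArguments(
    output_dir="/Users/gonzalovaldenebro/Downloads",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
    hub_token="my_token",  # placeholder - replace with a real Hugging Face access token
)

# Trainer
# Pass the tokenized datasets rather than the raw MedData text, which the model cannot consume.
# NOTE: AutoModelForQuestionAnswering computes its loss from `start_positions` and
# `end_positions`, not from a `labels` column of answer token ids, so the preprocessing
# above still needs to be adapted before training can compute a loss.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_no_text["train"],
    eval_dataset=tokenized_datasets_no_text["test"],
    tokenizer=tokenizer,
)

# Training
trainer.train()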