# Adaptative tuning 🤖⚙️

In this third notebook, we will perform MLM fine-tuning over a pre-trained BERT model in order to adapt it to a target domain. This will generate a BERT encoder able to better capture our dataset's semantics, hence leading to improved results on downstream tasks.

Adaptative fine-tuning is performed using the same kins of unsupervised objectives that the ones performed during a from-scratch LM training. This is:
* Masked Language Modeling: A random n% of the input tokens are masked and the model is asked to fill in the gaps.
* Next Sentence Prediction: The model is asked to predict the whole next sentence of a text.


<figure style='text-align:center';>
  <img src="../data/images/AFT.png">
  
  <figcaption>
  Adaptative fine-tuning schema 
  </figcaption>
</figure>


**Due to the nature of our dataset, we will just perform MLM.**

**Also keep in mind that there are different ways to do this. We will use a simple approach as demonstration, but more complex procedures are usually done (we will mention some of them).**

Important points:
* Dataset: [medical_questions_pairs](https://huggingface.co/datasets/medical_questions_pairs)
* Model: [bert-base-cased](https://huggingface.co/bert-base-cased)
* We will define auxiliar functions in auxiliar.py file
* We will be logging the results in Weight&Biases.
<br>

In [3]:
import torch
import config

if torch.cuda.is_available():
   device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

In [None]:
device

## 1. Data preparation

The data prep in this case will require a little bit more work.

We will have to mask a random % of input tokens to create our training loop input.

**Important: We will just use the training partition, since we don't want the model to see any of our test set. We keep that for the downstream task evaluation.**

### 1.1. Import and set creation

Import data, create partitions and select train set. In order to better work with the data, let's export this to a pandas dataframe.

**We have to replicate the process we have followed in previous notebooks, so the partitions are the same.**

In [9]:
from datasets import load_dataset
import pandas as pd

# Download and extract data
data = load_dataset("medical_questions_pairs")
data = data['train']

# Split it
dataset = data.train_test_split(test_size=0.07, seed=config.SEED)

# Just keep the train partition
dataset = dataset['train']

# Export to pandas
df = dataset.to_pandas()

Found cached dataset medical_questions_pairs (C:/Users/Juanju/.cache/huggingface/datasets/medical_questions_pairs/default/0.0.0/db30a35b934dceb7abed5ef6b73a432bb59682d00e26f9a1acd960635333bc80)
100%|██████████| 1/1 [00:00<00:00, 91.39it/s]
Loading cached split indices for dataset at C:\Users\Juanju\.cache\huggingface\datasets\medical_questions_pairs\default\0.0.0\db30a35b934dceb7abed5ef6b73a432bb59682d00e26f9a1acd960635333bc80\cache-3a6913e31ee3f147.arrow and C:\Users\Juanju\.cache\huggingface\datasets\medical_questions_pairs\default\0.0.0\db30a35b934dceb7abed5ef6b73a432bb59682d00e26f9a1acd960635333bc80\cache-55366722f45172c0.arrow


### 1.2. Dataset modification

Next step is to gather proper data for our training. We want to have a big list of sentences.

Let's take a look at how our original dataset is composed.

In [12]:
data[:3]

{'dr_id': [1, 1, 1],
 'question_1': ['After how many hour from drinking an antibiotic can I drink alcohol?',
  'After how many hour from drinking an antibiotic can I drink alcohol?',
  'Am I over weight (192.9) for my age (39)?'],
 'question_2': ['I have a party tonight and I took my last dose of Azithromycin this morning. Can I have a few drinks?',
  'I vomited this morning and I am not sure if it is the side effect of my antibiotic or the alcohol I took last night...',
  'I am a 39 y/o male currently weighing about 193 lbs. Do you think I am overweight?'],
 'label': [1, 0, 1]}

As you can see, the field *question_1* contains repeated segments, since for each *question_1* there are two rephrasings in *question_2*.

* For our usecase, we don't want duplicated sentences in the training set, but we DO want to also consider *question_2* field for training our MLM.
* We will join all texts into a single list and remove duplicates.

In [19]:
# Join texts
texts = df['question_1'].to_list() + df['question_2'].to_list()

# Remove duplicates
texts = list(set(texts))

In [24]:
len(texts)

4351

### 1.3. Tokenization and encoding

Last step of our preprocessing consists of tokenizing our texts and create the encodings.

* Tokenize texts and create input_ids.
* Insert [MASK] tokens randomly in our input_ids.
* Create labels as a copy of our input_ids.
* Build a dataset.
* KUDOS to [James Briggs](https://www.youtube.com/watch?v=R6hcxMMOrPE) for the quick implementation!

In [28]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config.checkpoint, use_fast=True)

In [47]:
from typing import Dict
import torch

def create_mlm_inputs(texts, tokenizer, percentage=0.15) -> Dict:
    inputs = tokenizer(texts, return_tensors='pt', max_length=512, truncation=True, padding='max_length')

    # Create labels as a clone of the input_ids.
    inputs['labels'] = inputs.input_ids.detach().clone()

    # Create mask filter
    # We don't want to mask special tokens:
    # 101 -> [CLS]
    # 0 -> [PAD]
    # 102 -> [SEP]
    rand = torch.rand(inputs.input_ids.shape)
    mask_filt = (rand < percentage) * (inputs.input_ids != 101) * (inputs.input_ids != 102) * (inputs.input_ids != 0)

    # Mask tokens!
    # For each sample, get mask_filt row and mask tokens at index.
    for i in range(mask_filt.shape[0]):
        mask_idxs = torch.flatten(mask_filt[i].nonzero()).tolist()
        inputs.input_ids[i, mask_idxs] = 103
    
    return inputs
    

In [37]:
inputs = create_mlm_inputs(texts, tokenizer)

---

##### **NOTE**

Another option is tu make use of transformer's datacollator functionallity

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

Also, there are other kind of techniques for preparing input data in a wiser maner.
* Joining all texts together and then split them in chunks (so we have less risk to truncation in larger datasets).
* Apply word masking: instead of mask single tokens, we can mask subsequent tokens (words).

Check [HuggingFace's tutorial](https://huggingface.co/course/chapter7/3?fw=tf#preprocessing-the-data) for a full guide 

---

And for faster use, back again to a HF dataset.

In [44]:
from datasets import Dataset

train_data = Dataset.from_dict(inputs)

In [45]:
train_data

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 4351
})

## 2. Training

Okay, we are ready to go!

In [None]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(config.checkpoint)

### 2.1. Init WandB

In [None]:
import wandb

wandb.login()

In [None]:
run_name = 'adaptative_training'
notes = "This experiment consists on performing MLM finetuning over a pre-trained bert with our dataset."
run = wandb.init(project='fine-tuning-mlms',
           name=run_name,
           notes=notes,
           job_type='train')


[34m[1mwandb[0m: Currently logged in as: [33mjjceamoran[0m. Use [1m`wandb login --relogin`[0m to force relogin


### 2.2. Train

In [48]:
from transformers import Trainer, TrainingArguments
import sklearn

training_args = TrainingArguments(
    output_dir="./experiments/" + run_name,
    learning_rate=1e-5, # lower learning rate.
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=8,
    weight_decay=0.01,
    do_eval=False, # We just want to train the model. Not eval objective.
    save_strategy="epoch",
    # load_best_model_at_end=True,
    report_to='wandb',
    run_name=run_name
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data['train'],
    tokenizer=tokenizer,
)

NameError: name 'run_name' is not defined

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_1, question_2, dr_id. If question_1, question_2, dr_id are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2834
  Num Epochs = 8
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2840
  Number of trainable parameters = 108311810
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.493627,0.79,1
2,0.502200,0.520202,0.81,1
3,0.267700,0.768769,0.82,1
4,0.267700,0.972102,0.82,1
5,0.118500,1.013869,0.84,1
6,0.032900,1.261037,0.8,1
7,0.032900,1.179059,0.85,1
8,0.012000,1.199661,0.83,1


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_1, question_2, dr_id. If question_1, question_2, dr_id are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 214
  Batch size = 8
Saving model checkpoint to ./experiments/behavioural_training/checkpoint-355
Configuration saved in ./experiments/behavioural_training/checkpoint-355/config.json
Model weights saved in ./experiments/behavioural_training/checkpoint-355/pytorch_model.bin
tokenizer config file saved in ./experiments/behavioural_training/checkpoint-355/tokenizer_config.json
Special tokens file saved in ./experiments/behavioural_training/checkpoint-355/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_1, questi

TrainOutput(global_step=2840, training_loss=0.16475957799965227, metrics={'train_runtime': 2124.0411, 'train_samples_per_second': 10.674, 'train_steps_per_second': 1.337, 'total_flos': 5965253847121920.0, 'train_loss': 0.16475957799965227, 'epoch': 8.0})

### 2.3. Store Model

In [None]:
# Log model

artifact = wandb.Artifact('classifier', type='model')
artifact.add_dir('./experiments/behavioural_training/checkpoint-2485')
wandb.log_artifact(artifact)

[34m[1mwandb[0m: Adding directory to artifact (./experiments/behavioural_training/checkpoint-2485)... Done. 6.7s


<wandb.sdk.wandb_artifacts.Artifact at 0x7fc602499d90>