# Fine-tuning MLMs 🤖⚙️

This brief study explores different fine-tuning approaches for adapting BERT-like masked language models (MLMs) to a custom domain dataset.

Important points:
* Dataset: [medical_questions_pairs](https://huggingface.co/datasets/medical_questions_pairs)
* Model: [bert-base-cased](https://huggingface.co/bert-base-cased)
* We will define auxiliary functions in the auxiliar.py file.
* We will log the results to Weights & Biases.
<br>

<figure>
  <img src="../data/images/adaptive_fine-tuning.png">
  
  <figcaption style='text-align:center;'>
  Framework for fine-tuning LMs. 
  <a href="https://ruder.io/recent-advances-lm-fine-tuning/">Sebastian Ruder's post</a>
  </figcaption>
</figure>

In [1]:
import torch
import config

if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

In [2]:
device

device(type='cuda', index=0)

## 1. Data preparation

### 1.1. Import and set creation

Import data and create partitions.

In [2]:
from datasets import load_dataset

# Download and extract data
data = load_dataset("medical_questions_pairs")
data = data['train']

# Split it
data = data.train_test_split(test_size=0.07, seed=config.SEED)






In [4]:
data

DatasetDict({
    train: Dataset({
        features: ['dr_id', 'question_1', 'question_2', 'label'],
        num_rows: 2834
    })
    test: Dataset({
        features: ['dr_id', 'question_1', 'question_2', 'label'],
        num_rows: 214
    })
})

As we can see, the dataset is small (2,834 training and 214 test samples). We will have to take that into account when training the models.
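The split sizes above can be checked by hand. The sketch below assumes the test share is rounded up to a whole sample, which matches the counts printed above; it is not taken from the library's source, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test), rounding the test share up to a whole sample."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 2834 + 214 = 3048 rows in the original 'train' split.
n_train, n_test = split_sizes(2834 + 214, 0.07)
print(n_train, n_test)  # 2834 214
```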

### 1.2. Tokenize and encode data

As mentioned, we will use the **bert-base-cased** tokenizer.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config.checkpoint, use_fast=True)

In [4]:
data = data.map(lambda x: tokenizer(x['question_1'], x['question_2'], truncation=True, padding='max_length'), batched=True)
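Passing two texts to the tokenizer encodes them as a single sentence pair. The sketch below illustrates the layout BERT uses for pairs. Whitespace splitting stands in for WordPiece, and both the `encode_pair` helper and the example questions are made up for illustration.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out two token lists the way BERT expects a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first question and its [SEP];
    # segment 1 covers the second question and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

# Toy input: whitespace tokens instead of real WordPiece tokens.
tokens, type_ids = encode_pair(
    "Is a fever dangerous ?".split(),
    "Should I worry about fever ?".split(),
)
print(tokens)
print(type_ids)
```

The real tokenizer additionally maps tokens to vocabulary ids and returns an attention mask.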



In [5]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
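Note that the `map` call above already pads every example with `padding='max_length'` (512 tokens, assuming the checkpoint's default maximum length), so the collator should have nothing left to pad. The sketch below gives a rough idea of what per-batch (dynamic) padding would save if that argument were dropped. The token counts are hypothetical, not measured on the dataset.

```python
def padded_tokens_fixed(lengths, max_length=512):
    """Total token slots when every sequence is padded to max_length."""
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, batch_size):
    """Total token slots when each batch is padded to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total

# Hypothetical token counts for six question pairs.
lengths = [38, 52, 41, 67, 45, 59]
print(padded_tokens_fixed(lengths))       # 6 * 512 = 3072
print(padded_tokens_dynamic(lengths, 3))  # 3 * 52 + 3 * 67 = 357
```

Changing the padding strategy would also change training speed, so any timing comparison with the runs below would need to be repeated.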

## 2. Exp 1: Baseline model training

Our first experiment is a baseline with no fine-tuning of the encoder: we freeze all parameters of the base model and train only the last fully connected (FC) layer.

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(config.checkpoint, num_labels=2)

# Freeze all encoder parameters; only the classification head stays trainable
for param in model.bert.parameters():
    param.requires_grad = False

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b
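With the encoder frozen, the only trainable part should be the classification head, which in this architecture is a single linear layer on top of the pooled output. The sketch below counts its parameters, assuming the usual hidden size of 768 for bert-base models; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

hidden_size = 768  # hidden size of bert-base models (assumed, not read from the config)
num_labels = 2     # as passed to from_pretrained above
print(linear_params(hidden_size, num_labels))  # 1538
```

This agrees with the number of trainable parameters that the Trainer reports for this experiment.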

### 2.1. Init WandB

In [7]:
import wandb

wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: jjceamoran. Use `wandb login --relogin` to force relogin


True

In [8]:
run_name = 'baseline_training'
notes = "This experiment consists of basic BERT training with all of the encoder's layers frozen"
run = wandb.init(project='fine-tuning-mlms',
           name=run_name,
           notes=notes,
           job_type='train')


In [9]:
from transformers import Trainer, TrainingArguments
from training_aux import compute_metrics
import sklearn

training_args = TrainingArguments(
    output_dir="./experiments/" + run_name,
    learning_rate=2e-5, # low learning rate.
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to='wandb',
    run_name=run_name
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data['train'],
    eval_dataset=data['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
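The `compute_metrics` helper is imported from `training_aux` and is not shown in this notebook. Judging by the columns in the results table, it probably returns accuracy and F1. The sketch below is a hypothetical stand-in, not the author's implementation; it uses plain Python so that it works on both lists and NumPy arrays.

```python
def compute_metrics(eval_pred):
    """Hypothetical stand-in for training_aux.compute_metrics: accuracy and binary F1."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Toy check with hand-written logits (not model output).
toy = ([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]], [1, 0, 0, 1])
print(compute_metrics(toy))  # {'accuracy': 0.5, 'f1': 0.5}
```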

In [10]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_2, dr_id, question_1. If question_2, dr_id, question_1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2834
  Num Epochs = 8
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1424
  Number of trainable parameters = 1538
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


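| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.692202 | 0.55 | 1 |
| 2 | No log | 0.686184 | 0.58 | 1 |
| 3 | 0.704300 | 0.680066 | 0.57 | 1 |
| 4 | 0.704300 | 0.678746 | 0.65 | 1 |
| 5 | 0.704300 | 0.675955 | 0.65 | 1 |
| 6 | 0.689100 | 0.674562 | 0.66 | 1 |
| 7 | 0.689100 | 0.673423 | 0.65 | 1 |
| 8 | 0.689100 | 0.673317 | 0.66 | 1 |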


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_2, dr_id, question_1. If question_2, dr_id, question_1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 214
  Batch size = 16
Saving model checkpoint to ./experiments/baseline_training/checkpoint-178
Configuration saved in ./experiments/baseline_training/checkpoint-178/config.json
Model weights saved in ./experiments/baseline_training/checkpoint-178/pytorch_model.bin
tokenizer config file saved in ./experiments/baseline_training/checkpoint-178/tokenizer_config.json
Special tokens file saved in ./experiments/baseline_training/checkpoint-178/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_2, dr_id, question_1. I

TrainOutput(global_step=1424, training_loss=0.6922249311811468, metrics={'train_runtime': 960.3683, 'train_samples_per_second': 23.608, 'train_steps_per_second': 1.483, 'total_flos': 5965253847121920.0, 'train_loss': 0.6922249311811468, 'epoch': 8.0})
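The step counts in the logs (a checkpoint at step 178 after the first epoch, 1424 steps in total) follow from the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation, as the logs indicate; `optimization_steps` is a hypothetical helper.

```python
import math

def optimization_steps(num_examples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps) for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

print(optimization_steps(2834, 16, 8))  # (178, 1424)
```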

In [11]:
# Log model

artifact = wandb.Artifact('classifier', type='model')
artifact.add_dir('./experiments/baseline_training/checkpoint-1424')
wandb.log_artifact(artifact)


wandb: Adding directory to artifact (./experiments/baseline_training/checkpoint-1424)... Done. 2.6s


<wandb.sdk.wandb_artifacts.Artifact at 0x7fe0dc4f1220>

## 3. Exp 2: Behavioural fine-tuning

Next, we train the whole model (FC layer + BERT) so that it adapts to our specific task.

In this case, we leave BERT's weights unfrozen.

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(config.checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

### 3.1. Init WandB

In [7]:
import wandb

wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: jjceamoran. Use `wandb login --relogin` to force relogin


True

In [8]:
run_name = 'behavioural_training'
notes = "This experiment consists of behavioural fine-tuning. We want to adapt the model to our target task by also training the encoder's weights."
run = wandb.init(project='fine-tuning-mlms',
           name=run_name,
           notes=notes,
           job_type='train')


In [9]:
from transformers import Trainer, TrainingArguments
from training_aux import compute_metrics
import sklearn

training_args = TrainingArguments(
    output_dir="./experiments/" + run_name,
    learning_rate=3e-5, # low learning rate.
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to='wandb',
    run_name=run_name
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data['train'],
    eval_dataset=data['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [10]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_2, question_1, dr_id. If question_2, question_1, dr_id are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2834
  Num Epochs = 8
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1424
  Number of trainable parameters = 108311810
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


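| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.496193 | 0.77 | 1 |
| 2 | No log | 0.511428 | 0.77 | 1 |
| 3 | 0.432600 | 0.685371 | 0.76 | 1 |
| 4 | 0.432600 | 1.000692 | 0.78 | 1 |
| 5 | 0.432600 | 1.091926 | 0.78 | 1 |
| 6 | 0.140600 | 1.209835 | 0.77 | 1 |
| 7 | 0.140600 | 1.330236 | 0.77 | 1 |
| 8 | 0.140600 | 1.358934 | 0.79 | 1 |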


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_2, question_1, dr_id. If question_2, question_1, dr_id are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 214
  Batch size = 16
Saving model checkpoint to ./experiments/behavioural_training/checkpoint-178
Configuration saved in ./experiments/behavioural_training/checkpoint-178/config.json
Model weights saved in ./experiments/behavioural_training/checkpoint-178/pytorch_model.bin
tokenizer config file saved in ./experiments/behavioural_training/checkpoint-178/tokenizer_config.json
Special tokens file saved in ./experiments/behavioural_training/checkpoint-178/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: question_2, quest

TrainOutput(global_step=1424, training_loss=0.2111153207468183, metrics={'train_runtime': 2286.0683, 'train_samples_per_second': 9.917, 'train_steps_per_second': 0.623, 'total_flos': 5965253847121920.0, 'train_loss': 0.2111153207468183, 'epoch': 8.0})
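In this run the validation loss rises after the first epoch while the training loss keeps falling, a pattern usually read as overfitting. The sketch below picks the epoch with the lowest validation loss and maps it to a checkpoint directory. It assumes checkpoints are selected by validation loss, which we believe is the Trainer's default criterion when `metric_for_best_model` is not set, and it reuses the 178 steps per epoch seen in the logs.

```python
# Validation loss per epoch, copied from the results table of this experiment.
val_loss = [0.496193, 0.511428, 0.685371, 1.000692,
            1.091926, 1.209835, 1.330236, 1.358934]
steps_per_epoch = 178  # matches the checkpoint-178 directory saved after epoch 1

# Epochs are 1-based in the table, list indices are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"
print(best_epoch, best_checkpoint)  # 1 checkpoint-178
```

Under that assumption, the checkpoint with the lowest validation loss is not the final one, which is worth keeping in mind when choosing which directory to log as an artifact.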

In [None]:
# Log model

artifact = wandb.Artifact('classifier', type='model')
artifact.add_dir('./experiments/behavioural_training/checkpoint-1424')
wandb.log_artifact(artifact)