# Fine-tuning a model with the Trainer API or Keras

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## Recap from previous section

In [57]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
print(tokenized_datasets)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})


## Training

### First step: Define `TrainingArguments`

The `TrainingArguments` class holds all the hyperparameters that the `Trainer` object will use for training and evaluation.

`"test-trainer"` is simply the directory where the trained model is saved.

In [2]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

### Second step: Define our model

We use the `AutoModelForSequenceClassification` class with two labels.

In [3]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


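The warning appears because `BERT` has not been pretrained on classifying **pairs of sentences**.

The head of the pretrained model has been discarded and a new head for sequence classification has been added. The weights of this new head are randomly initialized, which is why the warning recommends training the model.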
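The two names in the warning, `classifier.weight` and `classifier.bias`, are the parameters of the new head. As a rough sketch, assuming the head is a single linear layer on top of a 768-dimensional output (the hidden size of `bert-base-uncased`), its size can be worked out by hand:

```python
# Assumption: bert-base-uncased has a hidden size of 768.
hidden_size = 768
num_labels = 2

# A linear layer stores one weight per (label, hidden unit) pair
# plus one bias per label.
classifier_weight_shape = (num_labels, hidden_size)
classifier_bias_shape = (num_labels,)

new_parameters = num_labels * hidden_size + num_labels
print(classifier_weight_shape, classifier_bias_shape, new_parameters)
# (2, 768) (2,) 1538
```

Only these parameters start from random values; the rest of the model is loaded from the checkpoint.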

Now we can define a `Trainer` by passing all the objects we constructed: 
- `model`
- `training_args`
- training and validation datasets
- `data_collator`
- `tokenizer`

When a `tokenizer` is passed, the `Trainer` already uses `DataCollatorWithPadding` as its default collator, so passing `data_collator` explicitly is redundant here.
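As a rough illustration of what a padding collator such as `DataCollatorWithPadding` does to each batch, here is a pure-Python sketch. The helper `pad_batch`, the padding id `0`, and the token ids are assumptions made for this example; the real collator also handles tensors and the other tokenizer fields.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids below are made up for illustration.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because padding is computed per batch, short batches stay short instead of being padded to the longest sequence in the whole dataset.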



In [4]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [5]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.5658
1000,0.3723


TrainOutput(global_step=1377, training_loss=0.4043386787630634, metrics={'train_runtime': 1575.4073, 'train_samples_per_second': 6.985, 'train_steps_per_second': 0.874, 'total_flos': 405626802939840.0, 'train_loss': 0.4043386787630634, 'epoch': 3.0})
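The `global_step=1377` in this output can be checked by hand. The sketch below assumes the `Trainer` defaults of 8 samples per batch and 3 epochs, a single device, and no gradient accumulation:

```python
import math

num_train_samples = 3668  # size of the MRPC training split shown earlier
batch_size = 8            # assumption: default per-device batch size, one device
num_epochs = 3            # assumption: default number of epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 459 1377
```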

Training has finished, but it does not tell us how well (or badly) our model is performing: only the training loss is reported.

There are two reasons for this:

1. We did not tell the `Trainer` to evaluate during training by setting `evaluation_strategy` to either:
   1. `steps`: Evaluate every `eval_steps` steps.
   2. `epoch`: Evaluate at the end of each epoch.
  
2. We did not provide the `Trainer` with a `compute_metrics()` function to compute a metric during evaluation.
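To make the difference between the two strategies concrete, here is a simplified sketch of when evaluation would be triggered. The helper `evaluation_steps` is hypothetical and not part of Transformers; the 459 steps per epoch and the default of 500 for `eval_steps` are assumptions based on the run above.

```python
def evaluation_steps(strategy, steps_per_epoch, num_epochs, eval_steps=500):
    """Return the global steps at which evaluation would run (simplified)."""
    total_steps = steps_per_epoch * num_epochs
    if strategy == "epoch":
        # Once at the end of every epoch.
        return [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]
    if strategy == "steps":
        # Every `eval_steps` optimizer steps, regardless of epoch boundaries.
        return list(range(eval_steps, total_steps + 1, eval_steps))
    return []  # "no": never evaluate during training


print(evaluation_steps("epoch", 459, 3))  # [459, 918, 1377]
print(evaluation_steps("steps", 459, 3))  # [500, 1000]
```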

## Evaluation

**Predicted values**: In ML, the **predicted values** are the outputs generated by a trained model, that is, the model's estimates of the target variable.

We will write a `compute_metrics()` function and use it the next time we train. 

The function must take an `EvalPrediction` object, which is a named tuple with: 
- a `predictions` field
- a `label_ids` field

It must return a **dictionary** mapping metric names (strings) to values (floats). 

To get **predicted values** from our model, we can use `Trainer.predict()`. 

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)

The output of `predict()` is a **named tuple** with: 
- `predictions`: Output logits for each element in the dataset.
- `label_ids`: Known correct answers (true values).
- `metrics`: Loss on the dataset passed, plus time metrics (how long prediction took, in total and on average).

`predictions` is a 2D array with shape 408 x 2: 
- 408: Number of elements in the dataset used
- 2: Number of logits per element (one per label)

To turn the logits into **predictions** that we can compare with our labels, we take the index with the **maximum** value on the **second axis**.

In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
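To see what taking the argmax over the last axis does on a small case, here is a pure-Python stand-in. The helper `argmax_last_axis` and the logits are made up for illustration; `np.argmax(..., axis=-1)` gives the same indices for a 2D array.

```python
def argmax_last_axis(rows):
    """Pure-Python stand-in for np.argmax(rows, axis=-1) on a 2D list."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Made-up logits for three sentence pairs, two classes each.
logits = [
    [2.1, -1.3],  # class 0 has the larger logit
    [-0.4, 0.9],  # class 1 has the larger logit
    [0.2, 3.5],   # class 1 has the larger logit
]
example_preds = argmax_last_axis(logits)
print(example_preds)  # [0, 1, 1]
```

Each row of logits collapses to a single class index, so a `(408, 2)` array of logits becomes a `(408,)` array of predicted labels.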

Now we load the metric associated with the **MRPC** dataset and use it to evaluate our predictions.

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}

### F1 Score

A measure used to evaluate a model's performance, particularly in binary classification settings. It combines precision and recall into a single score that balances the two.
- Precision: ratio of correctly predicted positive observations to all predicted positive observations
- Recall (Sensitivity): ratio of correctly predicted positive observations to all actual positive observations

The F1 score is the **harmonic mean** of precision and recall: 
- (2 * (precision * recall)) / (precision + recall)
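The definitions above can be worked through on a small example. The counts below are hypothetical and are not taken from the MRPC run; the helper `precision_recall_f1` is written only for this illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct positives, 2 false alarms, 4 missed positives.
precision, recall, f1 = precision_recall_f1(
    true_positives=8, false_positives=2, false_negatives=4
)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

The harmonic mean stays closer to the lower of the two values, so a model cannot reach a high F1 by doing well on precision alone or recall alone.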

### Wrap Up

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Next, we set `evaluation_strategy` to `epoch` and create a new model, so that training starts from scratch instead of continuing from the model we already trained.

- epoch: One complete pass through the entire training dataset during training.

In [None]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

## Test Exercise

Fine-tune a model on any GLUE dataset, using everything you have learned so far. Evaluation is required as well.

In [68]:
def filter_columns(dataset, columns_to_keep):
    columns_to_delete = [column for column in dataset.column_names if column not in columns_to_keep]
    return dataset.remove_columns(columns_to_delete)

In [64]:
def tokenize_row_dataset(row, column_names):
    return tokenizer(row[column_names[0]], row[column_names[1]], truncation = True)

In [65]:
# Metrics
import numpy as np
import evaluate

def compute_metrics(eval_predictions):
    # Load the metric that matches the GLUE task selected in the next cell
    metric = evaluate.load('glue', glue_dataset)
    logits, labels = eval_predictions
    predictions = np.argmax(logits, axis = -1)
    return metric.compute(predictions = predictions, references = labels)

In [69]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding

# Load dataset
glue_datasets = ['ax', 'cola', 'mnli', 'mnli_matched', 'mnli_mismatched', 'mrpc', 'qnli', 'qqp', 'rte', 'sst2', 'stsb', 'wnli']
glue_dataset = input('Select one GLUE dataset.\nAvailable: ax, cola, mnli, mnli_matched, mnli_mismatched, mrpc, qnli, qqp, rte, sst2, stsb, wnli\n')
raw_dataset = load_dataset('glue', glue_dataset)

# Load Tokenizer
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize dataset
raw_dataset_column_names = raw_dataset.column_names
train_dataset_column_names = list(raw_dataset_column_names['train'])
tokenized_dataset = raw_dataset.map(
    lambda row: tokenize_row_dataset(row, train_dataset_column_names), 
    batched = True)

# Get train and validation datasets
# (GLUE test splits have no public labels, so they cannot be used for training)
tokenized_train_dataset = tokenized_dataset['train']
tokenized_validation_dataset = tokenized_dataset['validation']

# Remove unnecessary columns from the tokenized datasets
keys_to_keep = ['input_ids','token_type_ids','attention_mask', 'label']
filtered_tokenized_train_dataset = filter_columns(tokenized_train_dataset, keys_to_keep)
filtered_tokenized_validation_dataset = filter_columns(tokenized_validation_dataset, keys_to_keep)

# data_collator
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# Model (num_labels = 2 suits binary tasks such as mrpc; other GLUE tasks may need a different value)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 2)

# Training Arguments
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

# Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset = filtered_tokenized_train_dataset,
    eval_dataset = filtered_tokenized_validation_dataset,
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

print("Start Training \n")
trainer.train()

Select one GLUE dataset.
Available: ax, cola, mnli, mnli_matched, mnli_mismatched, mrpc, qnli, qqp, rte, sst2, stsb, wnli
 mrpc


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Start Training 



Epoch,Training Loss,Validation Loss


Downloading builder script:   0%|          | 0.00/5.75k [00:00<?, ?B/s]

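ImportError: To be able to use evaluate-metric/glue, you need to install the following dependencies['scikit-learn', 'scipy'] using 'pip install sklearn scipy' for instance'

The run fails at the first evaluation because the GLUE metric depends on `scikit-learn` and `scipy`, which were not installed in this environment. Installing them with `pip install scikit-learn scipy` (the PyPI package is named `scikit-learn`, not `sklearn`) and rerunning the cell should resolve the error.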