# Fine-tuning a pretrained model

## Processing the data

Here is a first small example:

In [1]:
import torch
# AdamW is imported from PyTorch: the copy in transformers is deprecated.
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

batch["labels"] = torch.tensor([1, 1])  # Set labels; here both sequences are labelled 1

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Loading a dataset from the Hub

- The Hub contains not only models but also many datasets in lots of different languages (https://huggingface.co/datasets).

- **MRPC dataset** (Microsoft Research Paraphrase Corpus): 5,801 sentence pairs, each labelled to indicate whether the two sentences are paraphrases. It is one of the 10 datasets that make up GLUE, an academic benchmark that measures the performance of ML models across 10 different text classification tasks.

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [3]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

- Labels are already in integer format; no preprocessing is needed.
- To see the integer-to-label mapping, inspect the `features` of `raw_train_dataset`.
- Integer mapping:
   - `0` corresponds to `not_equivalent`.
   - `1` corresponds to `equivalent`.

In [4]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
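
As a small illustration of the mapping, the list of names in the `ClassLabel` feature above can be turned into lookup tables in plain Python. This is only a sketch: the `id2label` and `label2id` names are chosen here for illustration, and the label names are copied by hand from the output above rather than read from the dataset.

```python
# Label names copied from the ClassLabel feature printed above.
label_names = ["not_equivalent", "equivalent"]

# The position of a name in the list is its integer label.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[1])                 # equivalent
print(label2id["not_equivalent"])  # 0
```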

In [5]:
raw_datasets["train"]["sentence1"][:3]

['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .']

## Preprocessing a dataset

To preprocess the dataset, we need to convert the text to numbers the model can make sense of. This is done with a tokenizer. We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair like this:

In [6]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

- Tokenizing the two sentences separately and passing both sequences to the model won't yield a prediction of whether they are paraphrases.
- The two sequences must be handled as a pair and preprocessed appropriately.
- The tokenizer can accept a pair of sequences and process them in the format required by the BERT model.


In [7]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

- `token_type_ids` tells the model which part of the input is the first sentence and which is the second.

In [8]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
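
The tokens above can be lined up with the `token_type_ids` from the previous cell. The sketch below rebuilds that layout in plain Python; `build_pair` is a hypothetical helper written for this note, not part of the tokenizer, and it assumes the BERT convention of `[CLS] sentence1 [SEP] sentence2 [SEP]`.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble a BERT-style sentence pair and its segment IDs (illustrative only)."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]  # first segment, marked with 0
    second = tokens_b + ["[SEP]"]             # second segment, marked with 1
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)

# Print each token next to the segment it belongs to.
for tok, seg in zip(tokens, token_type_ids):
    print(seg, tok)
```

The resulting list has eight 0s followed by seven 1s, matching the `token_type_ids` returned by the tokenizer for this pair.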

- **Token Type IDs**:
  - Parts of the input corresponding to `[CLS] sentence1 [SEP]` have a token type ID of 0.
  - Parts of the input corresponding to `sentence2 [SEP]` have a token type ID of 1.

- **Handling Token Type IDs**:
  - Generally, there's no need to worry about `token_type_ids` in tokenized inputs.
  - As long as the tokenizer and model use the same checkpoint, the tokenizer will correctly provide the required information.

- **Tokenizing a Dataset**:
  - The tokenizer can handle a list of sentence pairs by taking separate lists for the first and second sentences.
  - This approach is compatible with padding and truncation options.


In [9]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

- This approach works but has some limitations:
  - It returns a plain dictionary (keys `input_ids`, `attention_mask`, `token_type_ids`) whose values are lists of lists, rather than a dataset.
  - It requires enough RAM to hold the entire tokenized dataset at once.
  - In contrast, datasets from the 🤗 Datasets library are Apache Arrow files stored on disk, so only the requested samples are loaded into memory.


- To address these limitations and retain the dataset format:
  - Use `Dataset.map()` method.
  - This method allows for extra preprocessing beyond tokenization.
  - `map()` applies a function to each element in the dataset, enabling customized tokenization functions.


In [10]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [11]:
tokenize_function(raw_datasets["train"][0])

{'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

- The function takes a dictionary as input (similar to dataset items) and returns a new dictionary with the keys:
  - `input_ids`
  - `attention_mask`
  - `token_type_ids`
  
- The function can handle multiple samples simultaneously:
  - Each key can contain a list of sentences.
  - This allows for using `batched=True` in the `map()` call, enhancing tokenization speed by processing multiple samples at once.


- Padding optimization:
  - Padding is excluded from the function to avoid inefficiency.
  - Instead, padding is applied when building a batch, so only the maximum length in each batch is padded, not across the entire dataset.
  - This strategy saves time and processing power, especially with variable-length inputs.


- Application on datasets:
  - The tokenization function is applied to all splits of the dataset at once using `batched=True` with `map()`.
  - The 🤗 Datasets library adds one new field to each split for every key in the dictionary returned by the function.


In [12]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

- **Tokenize Function Output**:
  - Returns a dictionary with the following keys:
    - `input_ids`
    - `attention_mask`
    - `token_type_ids`
  - These fields are added to all splits of the dataset.
  

- **Customization with map()**:
  - `map()` can also modify existing fields: if the preprocessing function returns a value for a key that already exists, the new value replaces the old one.
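
To picture what `batched=True` means, here is a toy stand-in for `map()` written in plain Python over a list of dictionaries. It is a sketch of the behaviour described above, not the library's implementation: `map_batched`, `add_length_and_lowercase`, and the three example rows are all made up for illustration.

```python
def map_batched(rows, function, batch_size=1000):
    """Toy stand-in for Dataset.map(function, batched=True) on a list of dicts."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Batched mode hands the function a dict of lists, one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        # Returned keys become new columns, or overwrite existing ones.
        for i, row in enumerate(chunk):
            new_row = dict(row)
            for key, values in result.items():
                new_row[key] = values[i]
            out.append(new_row)
    return out


def add_length_and_lowercase(batch):
    # "sentence1" already exists, so it is replaced; "n_words" is a new column.
    return {
        "sentence1": [s.lower() for s in batch["sentence1"]],
        "n_words": [len(s.split()) for s in batch["sentence1"]],
    }


# Made-up rows, only to exercise the helper.
rows = [
    {"sentence1": "The cat sat .", "label": 1},
    {"sentence1": "A dog barked loudly .", "label": 0},
    {"sentence1": "Birds sing .", "label": 0},
]
mapped = map_batched(rows, add_length_and_lowercase, batch_size=2)
print(mapped[0])
```

Because the function receives lists, a tokenizer call inside it can process a whole chunk of sentences in one go, which is where the speed-up from `batched=True` comes from.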


## With your own Dataset

In [None]:
import pandas as pd
df = pd.read_csv(r'E:\Pierre\text1995.csv')
data_wang = pd.read_json(r'E:\Pierre\Result\wang_all\concept_2_3_0_restricted50\1995.json')
data_wang['score'] = data_wang['concept_2_wang_3_restricted50'].apply(lambda x: x['score']['novelty'] )
df = pd.merge(df[['id','text']],data_wang[['id','score']],on = 'id', how = 'inner')


# Balance the classes by undersampling the zero-score rows.
# Assumes `score` is binary (0 or 1); rows with any other value are dropped.
positive_subset = df[df['score'] == 1]
zero_score_subset = df[df['score'] == 0].sample(n=len(positive_subset), random_state=42)
balanced_df = pd.concat([positive_subset, zero_score_subset])
balanced_df = balanced_df.sample(frac=1, random_state=24).reset_index(drop=True)

from sklearn.model_selection import train_test_split
balanced_df = balanced_df[['id','score', 'text']].set_index('id')
train_df, temp_df = train_test_split(balanced_df, test_size=0.4, random_state=42)
valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42) 

from datasets import Dataset, DatasetDict, load_dataset

train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)
test_dataset = Dataset.from_pandas(test_df)

datasets = DatasetDict({
    'train': train_dataset,
    'validation': valid_dataset,
    'test':test_dataset
})
datasets.save_to_disk('test_novelty')


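from datasets import load_from_disk
from transformers import AutoTokenizer

datasets = load_from_disk('test_novelty')
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, cache_dir='hub')


# This dataset has a single "text" column, so the sentence-pair
# tokenize_function written for MRPC does not apply here.
def tokenize_text(example):
    return tokenizer(example["text"], truncation=True)


tokenized_datasets = datasets.map(tokenize_text, batched=True)

# Keep only the columns the model expects; it looks for a "labels" column.
tokenized_datasets = tokenized_datasets.remove_columns(["text", "id"])
tokenized_datasets = tokenized_datasets.rename_column("score", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names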

## Dynamic padding

- **Collate Function**: 
  - The collate function organizes samples within a batch in a DataLoader.
  - Default behavior: Converts samples to PyTorch tensors and concatenates them, handling lists, tuples, or dictionaries recursively.
  - Limitation: This approach won’t work if inputs vary in size.


- **Batch Padding Strategy**:
  - Padding is deferred to batch time, so each batch is padded only to the length of its own longest sample.
  - Benefit: less padding means shorter inputs and faster training.



- **Custom Collate Function with Padding**:
  - A custom collate function applies appropriate padding to batch items.
  - Transformers library provides `DataCollatorWithPadding` for this purpose.
    - Requires a tokenizer to handle padding tokens and specify left or right padding as needed by the model.


In [13]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [14]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

- Samples have varying lengths, ranging from 32 to 67.
- **Dynamic padding**: Pads samples in a batch to the maximum length within that batch (67 in this case).
- **Without dynamic padding**: Would require padding all samples to the maximum length across the entire dataset or to the model's maximum acceptable length.
- A check on `data_collator` confirms proper application of dynamic padding for the batch.


In [15]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
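
To make the padding step concrete, here is a plain-Python sketch of what a padding collate function does to a batch. It is not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the sequences are dummy lists with the same lengths as the eight samples above, and right padding with ID 0 and the 512-token limit are assumptions that match `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch (illustrative only)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Dummy sequences with the same lengths as the eight samples above.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
sequences = [[1] * n for n in lengths]

padded = pad_batch(sequences)
padded_len = len(padded["input_ids"][0])

real_tokens = sum(lengths)                 # tokens that carry information
dynamic_total = len(lengths) * padded_len  # positions after dynamic padding
fixed_total = len(lengths) * 512           # positions if padded to 512 instead

print(padded_len)                               # 67
print(real_tokens, dynamic_total, fixed_total)  # 426 536 4096
```

For this batch, dynamic padding produces 536 positions for 426 real tokens, whereas padding every sample to 512 would produce 4,096.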

# Fine-tuning a model with the Trainer API

- **Trainer Class in Transformers**:
  - The Trainer class is provided by Transformers for fine-tuning pretrained models on custom datasets.
  - After data preprocessing, only a few steps are needed to define the Trainer.


- **Setting Up the Training Environment**:
  - Running `Trainer.train()` on a CPU is very slow; a GPU is recommended.
  - Google Colab offers access to free GPUs and TPUs for faster training.


In [16]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Training

- **Define TrainingArguments**: 
  - Before defining the Trainer, set up a `TrainingArguments` class.
  - This class will include all the necessary hyperparameters for training and evaluation.


- **Required Argument**:
  - Specify a directory where:
    - The trained model will be saved.
    - Checkpoints will be stored during training.


- **Default Settings**:
  - Defaults for other parameters are generally sufficient for basic fine-tuning.


In [17]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer0")

In [18]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


- **Warning on Model Instantiation**:
  - A warning appears after loading the pretrained BERT model.
  - BERT was not pretrained for sentence pair classification, so the original model head is discarded.
  - A new head for sequence classification is added, causing:
    - Some weights to be unused (from the discarded pretraining head).
    - Some weights to be randomly initialized (for the new classification head).
  - The warning suggests training the model to optimize the new head.

- **Defining a Trainer**:
  - The Trainer requires the following components:
    - The modified model (with a new head).
    - `training_args`: settings and configurations for the training process.
    - The training and validation datasets.
    - `data_collator`: a function to collate batches of data.
    - `tokenizer`: to process text inputs.


In [19]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


- To fine-tune the model on our dataset, we just have to call the `train()` method of our `Trainer`.

- The fine-tuning process will begin, which should take a few minutes on a GPU.
- Training loss will be reported every 500 steps.
- However, model performance (quality) is not assessed due to:
  - Lack of evaluation strategy:
    - `evaluation_strategy` was not set to "steps" (evaluate every `eval_steps`) or "epoch" (evaluate at the end of each epoch).
  - Absence of `compute_metrics()` function:
    - Without this, no metrics are calculated during evaluation; only the loss would be printed, which is not very informative.


## Evaluation

- **Goal**: Build a `compute_metrics()` function to use during model training.

- **Function Requirements**:
  - Accepts an `EvalPrediction` object, a named tuple with two fields:
    - `predictions`
    - `label_ids`
  - Returns a dictionary:
    - Keys are metric names (strings)
    - Values are metric values (floats)

- **Usage**:
  - Use `Trainer.predict()` to generate model predictions.


In [20]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


- **predict() method output**:
  - Returns a named tuple with three fields:
    - **predictions**: 
      - A 2D array with shape 408 x 2 (for 408 elements in the dataset).
      - Contains logits for each element, which need to be transformed to make predictions.
      - Transformation process: select the index with the maximum value on the second axis.
    - **label_ids**: Stores the labels for comparison.
    - **metrics**:
      - Initially includes:
        - **Loss** on the dataset passed.
        - **Time metrics** (total and average prediction time).
      - When `compute_metrics()` is defined and passed to `Trainer`, `metrics` also includes the metrics returned by `compute_metrics()`.


In [21]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

- We can now compare those `preds` to the labels.
- To build our `compute_metrics()` function, we will rely on the metrics from the 🤗 Evaluate library.
- We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function.
- The object returned has a `compute()` method we can use to do the metric calculation:

In [22]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.6838235294117647, 'f1': 0.8122270742358079}
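
For intuition about what these two numbers measure, here is a plain-Python sketch that goes from logits to accuracy and F1. It is not the Evaluate library's implementation: `argmax` and `accuracy_and_f1` are hypothetical helpers, the logits and labels are made up, and F1 is computed for the positive class (label 1).

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(predictions, references, positive=1):
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up logits for five examples, two classes each.
toy_logits = [[0.2, 1.3], [1.1, -0.4], [-0.5, 0.9], [0.3, 0.1], [-1.0, 2.0]]
toy_labels = [1, 0, 0, 1, 1]

toy_preds = [argmax(row) for row in toy_logits]
scores = accuracy_and_f1(toy_preds, toy_labels)
print(toy_preds)  # [1, 0, 1, 0, 1]
print(scores)     # accuracy 0.6, F1 about 0.667
```

Accuracy counts every correct prediction, while F1 combines precision and recall on the positive class, so the two can differ noticeably when the classes are imbalanced.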

In [23]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [24]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



- **New TrainingArguments**:
  - A new `TrainingArguments` object is created.
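  - The `evaluation_strategy` parameter is set to `"epoch"`, so evaluation runs at the end of every epoch. (Recent Transformers releases rename this argument to `eval_strategy`.)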


- **New Model**:
  - A new model is instantiated for training.
  - This prevents continuing training on an already trained model.


- To launch a new training run, we execute:

In [25]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.328425,0.875,0.911612
2,0.504700,0.570676,0.852941,0.894366
3,0.262000,0.662599,0.867647,0.906574


TrainOutput(global_step=1377, training_loss=0.3190316066589577, metrics={'train_runtime': 1551.2139, 'train_samples_per_second': 7.094, 'train_steps_per_second': 0.888, 'total_flos': 405114969714960.0, 'train_loss': 0.3190316066589577, 'epoch': 3.0})

- This time, the `Trainer` will:
  - Report validation loss and metrics (accuracy, F1 score) at the end of each epoch.
  - Continue reporting training loss.

- Note:
  - The exact accuracy/F1 score may vary slightly due to the model's random head initialization.
  - Despite this variability, results should remain close to the expected range.
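
The step counts in the training output can be checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults (batch size 8 per device, 3 epochs, loss logged every 500 steps), a single device, and no gradient accumulation.

```python
import math

num_train_samples = 3668  # rows in the MRPC train split
batch_size = 8            # assumed default per-device batch size
num_epochs = 3            # assumed default number of epochs
logging_steps = 500       # assumed default logging interval

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch)  # 459
print(total_steps)      # 1377
print(logged_at)        # [500, 1000]
```

The total of 1,377 matches `global_step` in the `TrainOutput` above. Since the first epoch ends at step 459, before the first log at step 500, the table shows `No log` for the training loss of epoch 1.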

In [26]:
test_pred = trainer.predict(tokenized_datasets["test"])

In [30]:
preds = np.argmax(test_pred.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=test_pred.label_ids)

{'accuracy': 0.8353623188405798, 'f1': 0.8807724601175483}