# Minerva - Fine-Tuning for Sequence Classification
This notebook fine-tunes Minerva models for text classification. We load the model with `AutoModelForSequenceClassification`, passing the number of classes, rather than with `AutoModelForCausalLM`. The former replaces the LLM's generative (language-modeling) head with a classification head.
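
To make the head swap concrete, here is a minimal, framework-free sketch of what a sequence-classification head on a decoder-only model typically does: it takes the hidden state of the last non-padding token and projects it to one logit per class. The function names, toy hidden states, and weights are illustrative assumptions, not the Minerva or `transformers` implementation.

```python
def last_token_index(input_ids, pad_token_id):
    """Return the position of the last non-padding token (right-padded input)."""
    for position in range(len(input_ids) - 1, -1, -1):
        if input_ids[position] != pad_token_id:
            return position
    return 0  # degenerate case: the sequence is all padding


def classification_logits(hidden_states, input_ids, head_weight, pad_token_id):
    """Project the last real token's hidden state to one logit per class."""
    pooled = hidden_states[last_token_index(input_ids, pad_token_id)]
    return [sum(w * h for w, h in zip(row, pooled)) for row in head_weight]


# Toy example (all values invented): 4 positions, the last one is padding,
# hidden size 2, 3 classes.
input_ids = [5, 9, 7, 0]
hidden_states = [[0.1, 0.2], [0.3, 0.1], [1.0, -1.0], [9.0, 9.0]]
head_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape: (num_labels, hidden)

logits = classification_logits(hidden_states, input_ids, head_weight, pad_token_id=0)
predicted_class = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted_class)
```

This is also why the padding token id has to be known to the model: without it, the head cannot tell which position holds the last real token in a padded batch.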

In [None]:
import numpy as np
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from peft import (
    LoraConfig,
    prepare_model_for_kbit_training,
    get_peft_model
)
from dotenv import load_dotenv
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score

load_dotenv()

## Configs
Here we set the parameters for loading the model and data, configuring LoRA, and training.

In [None]:
model_version : str   = '350M'
model_id      : str   = f'sapienzanlp/Minerva-{model_version}-base-v1.0'
dataset_id    : str   = 'istat-ai/sentipolc_dataset'
num_labels    : int   = 3
max_model_len : int   = 16384

lora_r        : int   = 16
lora_alpha    : int   = 8
target_modules: list  = ['q_proj', 'k_proj', 'v_proj', 'o_proj']
lora_dropout  : float = 0.05
lora_bias     : str   = 'none'

output_dir    : str   = f'saved_models/Minerva-{model_version}-base-v1.0'
epochs        : int   = 3
learn_rate    : float = 2e-5
scheduler     : str   = 'linear'
train_bs      : int   = 16
eval_bs       : int   = 32
ga_steps      : int   = 2
decay         : float = 0.01
warmup        : float = 0.1
log_steps     : int   = 20
eval_strategy : str   = 'steps'
save_strategy : str   = 'steps'
fp16          : bool  = True   # mixed precision; on GPUs with bfloat16 support, bf16 would match the quantization compute dtype
load_best     : bool  = True
report_to     : list  = []
log_level     : str   = 'warning'
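
Several of these settings interact. The sketch below works through the arithmetic for the effective batch size and the number of warmup steps. The device count and the training-set size are hypothetical values chosen for illustration, and the step count is an approximation of how a trainer would schedule updates, not a reproduction of its internals.

```python
import math

train_bs, ga_steps, epochs, warmup = 16, 2, 3, 0.1  # values from the config above
num_devices = 1              # assumption: single-GPU training
num_train_examples = 6_400   # hypothetical size, for illustration only

# Samples contributing to each optimizer update.
effective_batch_size = train_bs * ga_steps * num_devices

# Optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * epochs

# A warmup ratio is converted into a number of warmup steps.
warmup_steps = math.ceil(total_steps * warmup)

print(effective_batch_size, steps_per_epoch, total_steps, warmup_steps)
```

With `log_steps = 20`, such a run would log (and, with step-based evaluation, evaluate) a few dozen times, which is worth keeping in mind when the evaluation set is large.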

<hr>

## Load the Model
Load the model and tokenizer from Hugging Face. We quantize the model to 4-bit precision and prepare it for parameter-efficient fine-tuning (PEFT) with LoRA. If the model is gated or private, set an environment variable named `HF_TOKEN` that contains your Hugging Face token, for example in a `.env` file read by `load_dotenv()`.

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    target_modules=target_modules,
    lora_dropout=lora_dropout,
    bias=lora_bias,
    task_type='SEQ_CLS'
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    quantization_config=quant_config
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    add_prefix_space=True
)
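# Reuse the EOS token for padding; setting `pad_token` also sets `pad_token_id`.
tokenizer.pad_token = tokenizer.eos_token

# The classification head needs the pad token id to find the last real token.
model.config.pad_token_id = tokenizer.pad_token_id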
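
To get a feel for why LoRA is parameter-efficient, the following sketch counts the weights LoRA adds next to a single frozen projection matrix and computes the scaling factor applied to the low-rank update. The projection width of 1024 is a hypothetical value, not taken from the Minerva configuration, and real key/value projections may not be square.

```python
def lora_added_params(d_in, d_out, r):
    """LoRA learns B (d_out x r) and A (r x d_in) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


lora_r, lora_alpha = 16, 8   # values from the config above
hidden_size = 1024           # hypothetical projection width, for illustration only

full = hidden_size * hidden_size                          # frozen weights in one square projection
added = lora_added_params(hidden_size, hidden_size, lora_r)
scaling = lora_alpha / lora_r                             # factor applied to the low-rank update

print(f"frozen: {full:,}  trainable: {added:,}  ratio: {added / full:.2%}  scaling: {scaling}")
```

On the actual model, `model.print_trainable_parameters()` reports the real trainable and total parameter counts after `get_peft_model` has been applied.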

## Data Preprocessing
Load the data from the Hugging Face Hub. The dataset should have a `text` column and a `label` column containing integer class labels.

In [None]:
data = load_dataset(dataset_id)
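
Before tokenizing, it can help to confirm that the records match the expected schema. The helper below is a hypothetical sketch (`check_records` is not part of `datasets`), and the two sample rows are invented placeholders rather than rows from the dataset.

```python
def check_records(records, num_labels):
    """Raise ValueError if a record lacks `text`/`label` or has an out-of-range label."""
    for i, record in enumerate(records):
        if not isinstance(record.get("text"), str):
            raise ValueError(f"record {i}: `text` must be a string")
        label = record.get("label")
        is_int = isinstance(label, int) and not isinstance(label, bool)
        if not is_int or not 0 <= label < num_labels:
            raise ValueError(f"record {i}: `label` must be an integer in [0, {num_labels})")
    return True


# Invented placeholder rows that only illustrate the expected shape.
sample = [
    {"text": "Che bella giornata!", "label": 2},
    {"text": "Servizio pessimo.", "label": 0},
]
check_records(sample, num_labels=3)
```

Applied to the real data, something like `check_records(data['train'].select(range(100)), num_labels)` would check the first hundred training rows, assuming the split is named `train`.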

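Now we tokenize the data using the pretrained tokenizer. Padding is left to the data collator used during training, which pads each batch only to the length of its longest sequence.

In [None]:
def tokenize(example):
    # Truncate to the model's context length; padding happens per batch in the collator.
    return tokenizer(example["text"], truncation=True, max_length=max_model_len)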

tokenized_data = data.map(tokenize, batched=True)
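
The `DataCollatorWithPadding` used in the Training section pads each batch to the length of its longest sequence. The sketch below illustrates that idea in plain Python, assuming right padding; `pad_batch` is a hypothetical helper and the token ids are invented.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad token-id lists to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[11, 12, 13], [21], [31, 32]], pad_token_id=0)
print(batch["input_ids"])       # [[11, 12, 13], [21, 0, 0], [31, 32, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Because the target length depends on the batch, short texts are not padded out to the length of the longest text in the whole dataset, which saves memory and compute.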

<hr>

## Training
First, we define a function that computes the metrics we want to monitor during training: accuracy and macro-averaged F1.

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='macro')
    return {'accuracy': accuracy, 'f1_macro': f1}
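
To see why macro F1 is reported alongside accuracy, here is a small hand-rolled version of the metric on invented toy labels. It is a sketch of the definition (the unweighted mean of per-class F1 scores), not a replacement for `sklearn.metrics.f1_score`.

```python
def macro_f1(labels, predictions, num_labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in range(num_labels):
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, predictions) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


# Invented toy data: class 0 dominates and the single class-1 example is missed.
labels      = [0, 0, 0, 0, 1, 2]
predictions = [0, 0, 0, 0, 0, 2]

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(round(accuracy, 3), round(macro_f1(labels, predictions, num_labels=3), 3))
```

Accuracy is about 0.83 here, while macro F1 drops to about 0.63 because the missed minority class contributes a per-class score of zero. On imbalanced sentiment data, the two metrics can therefore tell quite different stories.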

Then, we create the training arguments and the trainer.

In [None]:
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=epochs,
    learning_rate=learn_rate,
    lr_scheduler_type=scheduler,
    per_device_train_batch_size=train_bs,
    per_device_eval_batch_size=eval_bs,
    gradient_accumulation_steps=ga_steps,
    warmup_ratio=warmup,
    weight_decay=decay,
    logging_dir='./logs',
    logging_steps=log_steps,
    eval_strategy=eval_strategy,
    save_strategy=save_strategy,
    fp16=fp16,
    load_best_model_at_end=load_best,
    report_to=report_to,
    log_level=log_level,
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['eval'],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)
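
Since `load_best_model_at_end` is enabled, evaluation and checkpointing have to line up. To my understanding, `TrainingArguments` requires the two strategies to match and, for step-based strategies, the save interval to be a multiple of the evaluation interval. The helper below is a hypothetical sketch of that rule, not the library's own validation code.

```python
def check_best_model_config(eval_strategy, save_strategy, eval_steps, save_steps):
    """Return a list of problems that would prevent reloading the best checkpoint."""
    problems = []
    if eval_strategy != save_strategy:
        problems.append("eval_strategy and save_strategy must match")
    elif eval_strategy == "steps" and save_steps % eval_steps != 0:
        problems.append("save_steps must be a multiple of eval_steps")
    return problems


# When unset, eval_steps falls back to logging_steps (20 here); 500 is the default save_steps.
print(check_best_model_config("steps", "steps", eval_steps=20, save_steps=500))  # []
print(check_best_model_config("steps", "epoch", eval_steps=20, save_steps=500))
```

Note that the best checkpoint is selected by evaluation loss unless `metric_for_best_model` is set; passing `metric_for_best_model='f1_macro'` would select by the macro F1 returned from `compute_metrics` instead.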

Finally, we can start training the model.

In [None]:
trainer.train()

## Evaluation
We can now evaluate the model on our test set.

In [None]:
eval_results = trainer.evaluate(tokenized_data['test'])
eval_results
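
The returned dictionary prefixes each metric name with `eval_` by default, so the metrics from `compute_metrics` appear as `eval_accuracy` and `eval_f1_macro`. The sketch below strips that prefix for reporting; `strip_prefix` is a hypothetical helper, and the zeros are placeholders that only show the dictionary shape, not real results.

```python
def strip_prefix(metrics, prefix="eval_"):
    """Drop the metric prefix so keys match the names used in compute_metrics."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in metrics.items()
    }


# Placeholder values, invented purely to show the structure; not real results.
example_results = {"eval_loss": 0.0, "eval_accuracy": 0.0, "eval_f1_macro": 0.0, "epoch": 3.0}
print(strip_prefix(example_results))
```

Alternatively, `trainer.evaluate(tokenized_data['test'], metric_key_prefix='test')` labels the keys as `test_...`, which keeps test metrics distinguishable from those logged on the evaluation split during training.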