# Parameter-Efficient Fine-Tuning (PEFT) ⚙️🧩

In the previous notebook, you explored how **large language models (LLMs)** can be made smaller and faster through **model compression techniques** such as **Knowledge Distillation** and **Quantization**.  
You learned how to preserve most of a model's capability while drastically reducing its size and computational footprint.

In this notebook, we'll take the next step toward **efficient model adaptation**: learning how to **fine-tune large models for new tasks without retraining all their parameters**.  
This is where **Parameter-Efficient Fine-Tuning (PEFT)** comes in.

### 🧠 What You'll Learn

You'll discover how PEFT allows us to adapt large pretrained models (like **DistilBERT**) to specific tasks while keeping **most of the model frozen** and only training a *small number* of additional parameters.  
This results in:
- **Smaller memory footprint** 🪶  
- **Faster training** ⚡  
- **Comparable performance** 🎯 to full fine-tuning

In this notebook, we'll explore **LoRA (Low-Rank Adaptation)**, which adds lightweight trainable matrices inside attention layers.  

### 📚 Dataset

Just like before, we'll use the **Yelp Polarity** dataset, a real-world collection of positive and negative user reviews.  
This time, instead of masked word prediction, we'll fine-tune our model for **sentiment classification**.

### 🚀 By the End of This Notebook...

You'll have:
- Compared **zero-shot**, **full fine-tuning**, and **PEFT** approaches  
- Measured their **accuracy, training cost, and parameter efficiency**  
- Understood how modern LLM systems achieve rapid, low-cost adaptation at scale

Let's dive in and see how **PEFT bridges the gap between power and efficiency** in modern NLP!


In [None]:
from tqdm import tqdm
import numpy as np

from typing import Dict
from transformers import EvalPrediction
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from peft import LoraConfig, get_peft_model, TaskType

from transformers import Trainer, TrainingArguments
import torch

from sklearn.metrics import accuracy_score, f1_score

In [2]:
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



**`TODO:`** Since we're using `distilbert-base-uncased` as our model, we should only keep samples that fit within its 512-token context window. Complete the implementation of the `filter_long_examples` function, which is used to filter our dataset.

In [None]:
# Load Yelp dataset
dataset = load_dataset("yelp_polarity", split={"train": "train[:10000]", "test": "test[:2000]"})

def filter_long_examples(example:Dict) -> bool:
    """Check whether a text sample fits within the model's maximum sequence length.

    Args:
        example (dict):
            A single dataset example containing at least a `"text"` field.
            Example structure:
            {
                "text": "This is a review example.",
                ...
            }

    Returns:
        bool: True if the tokenized text (including special tokens) has at
        most 512 tokens, so the example is kept; False otherwise.
    """
    tokens = tokenizer(
        example["text"],
        truncation=False,
        add_special_tokens=True
    )
    return len(tokens["input_ids"]) <= 512

filtered_dataset = dataset.filter(filter_long_examples)



train_ds = filtered_dataset["train"].select(range(5000))   # small subset for speed
test_ds  = filtered_dataset["test"].select(range(1000))



## 🔍 Evaluating the Base Model (Zero-Shot)

Before we start fine-tuning (either with standard training or using **LoRA**), it's important to understand **how well our base model performs out-of-the-box** on the downstream task.

Our downstream task here is **sentiment classification** using the **Yelp Polarity** dataset.  
Each review is labeled as either:
- `0` → **Negative**
- `1` → **Positive**

We'll use the pretrained **DistilBERT** model directly, without any additional training, to see how accurately it can classify reviews into positive or negative sentiment.  
This kind of evaluation is often called a **zero-shot test**, since the model hasn't been fine-tuned on this specific dataset yet. Note that the classification head is newly initialized (see the warning printed when the model was loaded), so accuracy close to chance (about 50%) is expected here.

**`TODO:`** Use the [Hugging Face pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines) for text classification to evaluate the original model on the test set. Print the final accuracy. For more information about HF pipelines, refer to the official documentation or revisit the examples from our previous lab sessions.

In [None]:
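# Run on the GPU when available, otherwise fall back to the CPU
clf = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
)

correct = 0
total = len(test_ds)

for example in tqdm(test_ds):
    pred = clf(example["text"])[0]
    # Compare case-insensitively: the default label names are "LABEL_0" / "LABEL_1"
    label = 1 if pred["label"].lower() in ["positive", "pos", "label_1"] else 0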
    if label == example["label"]:
        correct += 1

accuracy = correct / total
print(f"Base model accuracy on Yelp Polarity (zero-shot): {accuracy:.3f}")
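
The label strings returned by a pipeline depend on the model's `id2label` config; a checkpoint without custom names typically reports `LABEL_0` / `LABEL_1`. The sketch below shows one way to normalize such strings before comparing them with the integer dataset labels. The helper name `label_to_int` and the accepted spellings are assumptions made for illustration, not part of the `transformers` API.

```python
def label_to_int(label: str) -> int:
    """Map a text-classification pipeline label to 0 (negative) or 1 (positive).

    Hypothetical helper: the accepted spellings are assumptions about
    common label conventions, not an exhaustive list.
    """
    normalized = label.strip().lower()
    if normalized in {"positive", "pos", "label_1", "1"}:
        return 1
    if normalized in {"negative", "neg", "label_0", "0"}:
        return 0
    raise ValueError(f"Unrecognized label: {label!r}")


print(label_to_int("LABEL_1"), label_to_int("negative"))
```

Raising on unknown labels, rather than silently defaulting to `0`, makes a label-name mismatch visible instead of quietly skewing the accuracy.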




## 🧠 Fine-Tuning the Base Model (Full Fine-Tuning)

Now that we've seen how the pretrained model performs in a **zero-shot** setting, let's improve its accuracy by **fine-tuning** it directly on our downstream task: **sentiment classification** using the Yelp Polarity dataset.

In full fine-tuning, **all model parameters are updated** during training.  
This means the model learns task-specific representations by adjusting every layer of the pretrained network.  
While this typically leads to excellent performance, it also comes with notable downsides:
- 🐢 **Slower training**
- 💾 **Higher memory consumption**
- ⚙️ **More parameters to update and store**

We'll fine-tune our **DistilBERT** model on a subset of the Yelp dataset, using standard training parameters (learning rate, batch size, etc.), and then evaluate its accuracy on the test set.

This will serve as our **baseline for comparison** when we later explore **Parameter-Efficient Fine-Tuning (PEFT)** methods like **LoRA**, where only a small number of parameters are trained.
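
To get a feel for the memory cost of updating every parameter, here is a rough back-of-the-envelope sketch. It assumes fp32 gradients and an Adam-style optimizer with two moment buffers per trainable parameter, and it ignores activations and the weights themselves. The figure of 66,955,010 parameters for DistilBERT plus its classification head is an assumption derived from the architecture; 739,586 is the trainable count PEFT reports for the LoRA setup later in this notebook.

```python
def training_state_bytes(n_trainable: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for gradients plus Adam's two moment buffers.

    Assumes one gradient and two optimizer moments per trainable
    parameter, all stored in fp32 (4 bytes each).
    """
    values_per_param = 3  # gradient, first moment, second moment
    return n_trainable * values_per_param * bytes_per_value


full_ft = training_state_bytes(66_955_010)  # assumed size of DistilBERT + classification head
lora = training_state_bytes(739_586)        # trainable parameters in the LoRA setup

print(f"full fine-tuning: {full_ft / 2**20:.1f} MiB")
print(f"LoRA:             {lora / 2**20:.1f} MiB")
```

Real memory usage also depends on batch size, sequence length, and mixed-precision settings, so treat these numbers only as a relative comparison.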


### 🧩 Preparing the Dataset for Fine-Tuning

Before we can fine-tune the model, we need to preprocess our text data so it's ready for training.  
As you know, transformer models like **DistilBERT** require inputs in the form of **token IDs** rather than raw text.

**`TODO:`**
1. **Tokenize** each review using the pretrained tokenizer to convert text into token IDs.  
2. **Truncate and pad** each sequence to a fixed length of 128 tokens.
3. **Format for PyTorch** by setting the dataset columns to:
   - `input_ids` → tokenized text  
   - `attention_mask` → marks real tokens vs. padding  
   - `label` → sentiment label (0 = negative, 1 = positive)

In [6]:
def preprocess_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

train_enc = train_ds.map(preprocess_function, batched=True)
test_enc  = test_ds.map(preprocess_function, batched=True)

train_enc.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
test_enc.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])



**`TODO:`** Complete the `compute_metrics` function. This function should calculate evaluation metrics (such as accuracy and F1 score) from the model's predictions and true labels.

In [None]:
def compute_metrics(eval_pred: EvalPrediction) -> Dict[str, float]:
    """
    Compute evaluation metrics (accuracy and F1 score) for model predictions.

    Args:
        eval_pred (EvalPrediction): 
            An object containing model predictions (`eval_pred.predictions`) and true labels (`eval_pred.label_ids`).

    Returns:
        Dict[str, float]: 
            A dictionary with computed "accuracy" and "f1" scores.
    """
    preds = eval_pred.predictions.argmax(-1)
    labels = eval_pred.label_ids
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }

training_args = TrainingArguments(
    output_dir="distilbert-full-yelp",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),
    logging_steps=50,
    save_steps=500,
    eval_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_enc,
    eval_dataset=test_enc,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


**`TODO:`** Use the `trainer` to train the model and, once that's done, save the trained model.

In [None]:
trainer.train()
trainer.save_model("distilbert-full-yelp")


| Step | Training Loss |
|:--|:--|
| 50 | 0.5172 |
| 100 | 0.3282 |
| 150 | 0.3209 |
| 200 | 0.2958 |
| 250 | 0.2833 |
| 300 | 0.2299 |
| 350 | 0.2039 |
| 400 | 0.1466 |
| 450 | 0.1689 |
| 500 | 0.1473 |


TrainOutput(global_step=626, training_loss=0.24799309751857965, metrics={'train_runtime': 73.3328, 'train_samples_per_second': 136.365, 'train_steps_per_second': 8.536, 'total_flos': 331168496640000.0, 'train_loss': 0.24799309751857965, 'epoch': 2.0})
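
The `global_step=626` in the output above follows from the dataset size and batch size. A quick check, assuming a single device and no gradient accumulation:

```python
import math

train_examples = 5000   # size of train_ds
batch_size = 16         # per_device_train_batch_size (single device assumed)
epochs = 2              # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```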

**`TODO:`** Use the trainer to evaluate the fine-tuned model on the test set.

In [None]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.2796965539455414, 'eval_accuracy': 0.903, 'eval_f1': 0.902981653625493, 'eval_runtime': 1.1034, 'eval_samples_per_second': 906.26, 'eval_steps_per_second': 57.094, 'epoch': 2.0}


## ⚙️ Fine-Tuning with LoRA (Parameter-Efficient Fine-Tuning)

Now that we've trained the full model and seen how resource-intensive it can be, let's explore a more efficient approach: **LoRA (Low-Rank Adaptation)**.

LoRA is a **Parameter-Efficient Fine-Tuning (PEFT)** method that allows us to adapt large pretrained models to new tasks **without updating all their parameters**.  
Instead, LoRA introduces a few small trainable matrices into the model's attention layers.  
During training:
- The original model weights remain **frozen** 🧊  
- Only the LoRA parameters are **trained** 🧠  
- The number of trainable parameters drops sharply, often by **orders of magnitude** (to about 1% of all parameters in this notebook)

Despite this massive reduction in trainable parameters, LoRA often achieves performance **comparable to full fine-tuning**, making it an ideal choice when working with limited compute or storage.
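
To make the idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It is an illustration under stated assumptions, not the PEFT implementation: the weight values are random stand-ins, and the names `lora_forward`, `B_trained`, and `W_merged` are made up for this example. The frozen weight `W` is augmented by a low-rank product `B @ A` scaled by `alpha / r`, and `B` starts at zero so the adapted layer initially behaves exactly like the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 8, 16                   # hidden size, rank, lora_alpha (values used in this notebook)
scale = alpha / r                          # LoRA scaling factor

W = rng.normal(size=(d, d))                # frozen pretrained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d))    # trainable down-projection, small random init
B = np.zeros((d, r))                       # trainable up-projection, zero init


def lora_forward(x, W, A, B, scale):
    """Frozen path x @ W.T plus the scaled low-rank path x @ A.T @ B.T."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(4, d))

# With B = 0 the adapter contributes nothing, so training starts from the pretrained behavior.
assert np.allclose(lora_forward(x, W, A, B, scale), x @ W.T)

# After training (simulated here with random values), the adapter can be merged into W.
B_trained = rng.normal(scale=0.01, size=(d, r))
W_merged = W + scale * B_trained @ A
assert np.allclose(lora_forward(x, W, A, B_trained, scale), x @ W_merged.T)

full_params = W.size
lora_params = A.size + B.size
print(f"full matrix: {full_params:,} params, LoRA adapter: {lora_params:,} params")
```

Because the trained adapter can be folded back into `W`, a merged LoRA model adds no extra inference cost compared with the original layer.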

Let's see how much performance we can retain while training only a tiny fraction of the model!


In [None]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 🧩 The PEFT Library and LoRA Configuration

To make LoRA integration simple and flexible, we use the **[PEFT (Parameter-Efficient Fine-Tuning)](https://huggingface.co/docs/peft/index)** library from Hugging Face.  
This library provides a unified interface for applying efficient fine-tuning techniques such as **LoRA**, **Prefix Tuning**, **Adapters**, and more, all built on top of the `transformers` ecosystem.

Instead of modifying the model architecture manually, PEFT lets us define a **configuration** that describes *where* and *how* LoRA layers should be inserted.  
Once configured, we simply wrap the base model using `get_peft_model()`, and PEFT handles all the internal modifications automatically.

**LoRA Configuration Parameters:**

- **`task_type`**: Specifies the task type. Here, `TaskType.SEQ_CLS` indicates a *sequence classification* task (e.g., sentiment analysis).

- **`r`**: The **rank** of the LoRA matrices. Controls how many additional parameters are introduced; lower values mean lighter adapters.

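- **`lora_alpha`**: A **scaling factor** for the LoRA update. The update is multiplied by `lora_alpha / r` before it's added to the base weights (here `16 / 8 = 2`).

- **`lora_dropout`**: The **dropout rate** applied to LoRA layers during training to reduce overfitting.

- **`bias`**: Determines whether bias terms are trainable. Setting it to `"none"` keeps all bias terms of the base model frozen. Note that with `TaskType.SEQ_CLS`, PEFT also keeps the classification head trainable, which is why the trainable count reported below is larger than the LoRA matrices alone.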

- **`target_modules`**: Specifies which model submodules receive LoRA adapters. `"k_lin"` and `"v_lin"` refer to the **key** and **value** projection layers in the attention mechanism.


In [None]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["k_lin", "v_lin"],  # attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925
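
The reported numbers can be reproduced with a little arithmetic. The sketch below assumes the standard DistilBERT base shape (6 layers, hidden size 768) and that the classification head (`pre_classifier` and `classifier`) is trained alongside the LoRA matrices. The base size of 66,955,010 parameters is an assumption derived from the architecture. The total only matches if the head is counted twice, which is consistent with PEFT keeping a frozen copy of the original head next to the trainable one; treat that explanation as an inference from the numbers rather than a documented guarantee.

```python
hidden, rank, n_layers = 768, 8, 6
targets_per_layer = 2                      # k_lin and v_lin

# Each adapted Linear(768, 768) gets A (rank x hidden) and B (hidden x rank).
lora_params = n_layers * targets_per_layer * (rank * hidden + hidden * rank)

# Classification head: pre_classifier Linear(768, 768) + classifier Linear(768, 2).
head_params = (hidden * hidden + hidden) + (hidden * 2 + 2)

trainable = lora_params + head_params

base_params = 66_955_010                   # assumed size of DistilBERT + head before wrapping
all_params = base_params + lora_params + head_params  # head counted twice (assumption)

print(f"LoRA matrices: {lora_params:,}")
print(f"trainable:     {trainable:,}")
print(f"all params:    {all_params:,} ({100 * trainable / all_params:.4f}% trainable)")
```

Most of the trainable parameters here belong to the classification head; the LoRA matrices themselves account for only about 0.2% of the model.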


**`TODO:`** In the same way as before, define the `TrainingArguments` and the `Trainer`. Once that's done, train the `model` (notice how the model was wrapped with PEFT) and save it again.

In [None]:

training_args = TrainingArguments(
    output_dir="distilbert-lora-yelp",
    learning_rate=2e-4,  # higher than full fine-tune
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    fp16=torch.cuda.is_available(),
    logging_steps=50,
    save_steps=500,
    eval_steps=500,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_enc,
    eval_dataset=test_enc,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
model.save_pretrained("distilbert-lora-yelp")

| Step | Training Loss |
|:--|:--|
| 50 | 0.5431 |
| 100 | 0.3493 |
| 150 | 0.35 |
| 200 | 0.296 |
| 250 | 0.3018 |
| 300 | 0.2386 |
| 350 | 0.2504 |
| 400 | 0.2135 |
| 450 | 0.2473 |
| 500 | 0.228 |


TrainOutput(global_step=626, training_loss=0.2934441596936113, metrics={'train_runtime': 26.5069, 'train_samples_per_second': 377.26, 'train_steps_per_second': 23.616, 'total_flos': 336848517120000.0, 'train_loss': 0.2934441596936113, 'epoch': 2.0})

**`TODO:`** Use the trainer to evaluate the fine-tuned model on the test set.

In [None]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.25135618448257446, 'eval_accuracy': 0.898, 'eval_f1': 0.8979902023565574, 'eval_runtime': 1.1083, 'eval_samples_per_second': 902.278, 'eval_steps_per_second': 56.844, 'epoch': 2.0}


## Conclusion

**`TODO:`** Now that you've fine-tuned your model once with and once without LoRA, what do you notice? What are your key takeaways?

### 🧠 Interpretation

1. **Training Speed:** The LoRA model trained about **2.8× faster** than the fully fine-tuned model (26.5 s vs. 73.3 s).  
   LoRA only needs weight gradients and optimizer updates for the small set of adapter and classification-head parameters, while the base model remains frozen. The forward pass costs about the same, so the savings most likely come from the backward pass and the optimizer step.

2. **Performance (Accuracy & F1):** The performance difference between LoRA and full fine-tuning is **small**: about 0.5 percentage points (accuracy 0.903 vs. 0.898, with F1 scores nearly identical to the accuracies).  
   This demonstrates that LoRA can retain almost the same task-specific accuracy even when training only a tiny fraction of the model parameters.

3. **Generalization:** Interestingly, LoRA achieved a **lower evaluation loss** (0.251 vs. 0.280), suggesting it may generalize slightly better to unseen data, possibly because the limited number of trainable parameters helps reduce overfitting. With only 1,000 test examples and a single run per method, differences this small should be read with caution.

4. **Compute Efficiency:** LoRA achieved much higher **training throughput** (about 377 vs. 136 samples per second) and completed training in less than half the time of full fine-tuning, highlighting its efficiency advantages.

### 💡 Key Takeaway
- **LoRA delivers nearly the same performance as full fine-tuning, while training significantly faster and using far fewer resources.**  
- This makes it an ideal approach for adapting large language models efficiently, especially when compute or memory are limited.
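
The comparison can be recomputed directly from the metrics logged by the two training runs above. The dictionary layout below is just an illustrative way to hold those numbers; the values are copied from the `TrainOutput` and evaluation outputs in this notebook.

```python
# Values copied from the training and evaluation outputs above
full = {"train_runtime": 73.3328, "samples_per_s": 136.365, "accuracy": 0.903, "eval_loss": 0.2797}
lora = {"train_runtime": 26.5069, "samples_per_s": 377.26, "accuracy": 0.898, "eval_loss": 0.2514}

speedup = full["train_runtime"] / lora["train_runtime"]
throughput_ratio = lora["samples_per_s"] / full["samples_per_s"]
accuracy_gap_pts = 100 * (full["accuracy"] - lora["accuracy"])
loss_delta = lora["eval_loss"] - full["eval_loss"]

print(f"training speedup:  {speedup:.2f}x")
print(f"throughput ratio:  {throughput_ratio:.2f}x")
print(f"accuracy gap:      {accuracy_gap_pts:.1f} percentage points")
print(f"eval loss change:  {loss_delta:+.4f}")
```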


## 🧩 Beyond LoRA: A Whole Family of Parameter-Efficient Fine-Tuning (PEFT) Methods

While **LoRA (Low-Rank Adaptation)** has become the most popular parameter-efficient fine-tuning method, it's far from the only one.  
The PEFT ecosystem includes several complementary techniques that balance **compute efficiency**, **modularity**, and **adaptability**, each suited to different fine-tuning goals.

Here are some key approaches worth exploring:

| Method | Core Idea | When to Use |
|:--|:--|:--|
| **LoRA** | Injects low-rank trainable matrices into attention layers | General-purpose fine-tuning with small GPU footprint |
| **QLoRA** | Combines LoRA with 4-bit quantization for ultra-efficient fine-tuning | Fine-tuning very large models (33B–70B) on a single GPU |
| **Prefix Tuning** | Learns virtual "prefix" tokens prepended to transformer layers | When you need adaptation without modifying base weights |
| **Prompt Tuning** | Learns continuous prompt embeddings instead of textual prompts | For lightweight control or domain-specific prompting |
| **Adapter Tuning** | Adds small trainable "adapter" modules between frozen layers | When you want modular, composable task adapters |
| **BitFit** | Fine-tunes only bias terms across layers | For extremely fast, low-cost adaptation |
| **IA³** | Learns multiplicative scaling vectors for hidden states | For memory-limited setups or stability-focused fine-tuning |

#### 🧰 Useful Libraries

| Library | Description |
|:--|:--|
| [🤗 **PEFT**](https://github.com/huggingface/peft) | The official Hugging Face library implementing LoRA, QLoRA, Prefix, Prompt, and Adapter tuning; integrates seamlessly with `transformers` and `accelerate`. |
| [**Adapter-Transformers**](https://github.com/adapter-hub/adapter-transformers) | AdapterHub's modular framework for Adapter, Prefix, and BitFit tuning on top of Hugging Face models. |
| [**Colossal-AI**](https://github.com/hpcaitech/ColossalAI) | Large-scale training toolkit supporting LoRA and hybrid PEFT with distributed optimization. |
| [**DeepSpeed**](https://github.com/microsoft/DeepSpeed) | Microsoft's library that includes LoRA/QLoRA support and efficient parameter partitioning. |
| [**OpenDelta**](https://github.com/thunlp/OpenDelta) | Research-oriented toolkit supporting multiple delta-tuning methods (LoRA, Adapter, BitFit, Prefix, IA³). |

These frameworks make it straightforward to experiment with PEFT methods in just a few lines of code: you can swap fine-tuning strategies without rewriting your training loop.

#### 📚 Key References

- Hu et al., *LoRA: Low-Rank Adaptation of Large Language Models* (2021)  
- Dettmers et al., *QLoRA: Efficient Finetuning of Quantized LLMs* (2023)  
- Li & Liang, *Prefix-Tuning: Optimizing Continuous Prompts for Generation* (2021)  
- Lester et al., *The Power of Scale: Parameter-Efficient Prompt Tuning* (2021)  
- Houlsby et al., *Parameter-Efficient Transfer Learning for NLP* (2019)  
- Ben Zaken et al., *BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models* (2021)  
- Liu et al., *Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning* (2022)


These methods form a toolkit of strategies for **scaling fine-tuning without retraining full models**, enabling rapid experimentation and deployment across diverse domains.


## ⚠️ Notes on Initializing QLoRA Environments

Setting up **QLoRA training** can often be difficult due to the tight coupling between
specific versions of **PyTorch**, **Transformers**, **BitsAndBytes**, **PEFT**, and **CUDA**.  
These libraries evolve rapidly and frequently introduce breaking changes, especially
around:

- **Low-bit quantization kernels** (which must match your GPU architecture + CUDA version)
- **Transformer backend changes** that affect how quantization is loaded or dispatched
- **PEFT layer injection logic**, which varies across versions
- **BitsAndBytes build compatibility** (wheel availability differs by CUDA version)

Because of these interdependencies, small mismatches (such as a minor CUDA version
difference or an outdated bitsandbytes wheel) can cause errors like:
- Missing quantization kernels  
- Incompatible compute dtypes  
- Device mapping failures  
- Incorrect module names for LoRA injection  

For this reason, **QLoRA scripts that work in one environment may fail in another**, even with seemingly identical versions.
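
When debugging such mismatches, a useful first step is to record exactly which versions are installed. The sketch below uses only the standard library; the helper name `installed_versions` and the package list are assumptions chosen for illustration and can be adjusted to your setup.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages that commonly matter for QLoRA setups (illustrative list)
PACKAGES = ["torch", "transformers", "peft", "bitsandbytes", "accelerate"]


def installed_versions(packages=PACKAGES) -> dict:
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

If PyTorch is installed, `torch.version.cuda` additionally reports the CUDA version PyTorch was built against, which is worth including when comparing environments.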


### 📌 Example Provided Below (But Not Recommended to Run)

In the **next code cell**, we provide a detailed end-to-end example of initializing a
QLoRA training pipeline, including model loading, quantization configuration, and LoRA
adapter setup.

However, **we do not encourage you to run the code directly**, as it may fail unless
your environment is configured *exactly* with the correct combinations of library and
CUDA versions. Treat it as a **reference implementation**, not a plug-and-play script.


In [None]:
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# Configure 4-bit quantization using bitsandbytes (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Load model weights in 4-bit precision to save VRAM
    bnb_4bit_use_double_quant=True,         # Also quantize the quantization constants (saves extra memory)
    bnb_4bit_quant_type="nf4",              # NF4 data type, as proposed in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # Run computation in bfloat16 (needs a GPU with bf16 support)
)

# Load a sequence classification model with quantization enabled
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,                       # Hugging Face model name or local path
    num_labels=2,                     # Binary classification task
    quantization_config=bnb_config,   # Apply 4-bit quantization config
    device_map="auto"                 # Automatically spread model across available GPUs/CPU
)

# Prepare the quantized model for training (freezes base weights, upcasts
# remaining half-precision parameters, and enables input gradients)
model = prepare_model_for_kbit_training(model)

# Configure LoRA (trainable low-rank adapters on top of frozen quantized weights)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,       # This LoRA config is for sequence classification
    r=8,                              # Rank of the LoRA matrices (controls trainable params)
    lora_alpha=16,                    # Scaling factor for LoRA updates
    lora_dropout=0.05,                # Dropout applied to LoRA layers
    bias="none",                      # Do not train biases
    target_modules=["k_lin", "v_lin"],# Inject LoRA into DistilBERT attention key/value layers
)

# Wrap the frozen quantized model with LoRA trainable adapters
model = get_peft_model(model, lora_config)

# Training hyperparameters and logging setup
training_args = TrainingArguments(
    output_dir="distilbert-qlora-yelp",      # Where checkpoints + logs are saved
    evaluation_strategy="epoch",             # Evaluate each epoch (named `eval_strategy` in newer transformers releases)
    save_strategy="epoch",                   # Save a checkpoint each epoch
    learning_rate=2e-4,                      # Learning rate for LoRA layers
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,                      # Full passes through the dataset
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),  # Match the bfloat16 compute dtype
    logging_steps=50,                        # Log metrics every 50 steps
)

# The Trainer manages training, evaluation, and checkpointing
trainer = Trainer(
    model=model,                   # Quantized + LoRA model
    args=training_args,            # Training configuration
    train_dataset=train_enc,       # Encoded training dataset
    eval_dataset=test_enc,         # Encoded validation/test dataset
    tokenizer=tokenizer,           # Tokenizer for input processing
    compute_metrics=compute_metrics, # Function to compute accuracy/F1/etc.
)

# Start fine-tuning with LoRA on top of a 4-bit model
trainer.train()