# Fine-Tuning Documentation
---
### **1. Task Overview**

We fine-tuned a 4-bit quantized **LLaMA 3.1 8B Instruct model** on a customized subset of the [TweetEval Sentiment Analysis dataset](https://huggingface.co/datasets/tweet_eval), reformatted into instruction-style prompts. The objective was to classify tweets as **positive, negative, or neutral**.

---

###  **2. Dataset Details**

| Property          | Value                                         |
| ----------------- | --------------------------------------------- |
| Dataset Name      | `tweet_eval`                                  |
| Subset            | `sentiment`                                   |
| Preprocessed Size | 1,000 samples (train) + 200 (validation), taken from the start of each split |
| Input Format      | (Instruction, Input, Output)                  |

#### Input Formatting Example:

```
### Instruction:
Classify the sentiment of the tweet as positive, negative, or neutral.

### Input:
I'm so happy today!

### Output:
positive
```

---

###  **3. Fine-tuning Configuration**

| Parameter                  | Value                                                                |
| -------------------------- | -------------------------------------------------------------------- |
| Model                      | `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`                        |
| Finetuning Strategy        | LoRA via `unsloth`                                                   |
| LoRA Target Modules        | q\_proj, k\_proj, v\_proj, o\_proj, gate\_proj, up\_proj, down\_proj |
| LoRA Params Trained        | \~42M / 8B (0.52%)                                                   |
| Max Sequence Length        | 2048 tokens                                                          |
| Batch Size                 | 2                                                                    |
| Gradient Accumulation      | 4                                                                    |
| Total Effective Batch Size | 8 (2 × 4)                                                            |
| Max Steps                  | 60                                                                   |
| Learning Rate              | 2e-4                                                                 |

---
### **4. Results**

#### Training Output:

* `Train Loss`: **2.13** (mean over all 60 steps)
* `Runtime`: \~2432 sec

#### Evaluation Output:

* `Validation Loss`: **1.88**
* `Eval Runtime`: \~587 sec

####  Analysis:

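* The **validation loss (1.88) is lower than the reported training loss (2.13)**. The training figure is the mean over all 60 steps, including the early steps where loss was above 3.3, so the gap is not by itself evidence about overfitting.
* The `0.25` difference is the gap between mean training loss and validation loss, not an improvement in training loss. Within training, the logged loss fell from `3.37` at step 1 to `2.25` at step 10, which does suggest meaningful learning occurred.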


---

###  **5. Conclusion**

The LLaMA 3.1 8B Instruct model was fine-tuned on a small sentiment dataset using LoRA and Unsloth with 4-bit quantized weights.
The loss values show **no clear signs of overfitting**, although only loss was measured, not classification accuracy. Results can likely be improved with:

* A larger dataset
* More training steps
* A tuned learning rate and schedule (this run used a linear schedule with 5 warmup steps)
---
### **6. HuggingFace Model**
The LoRA adapter and tokenizer are published at [krtkjais/llama3.1-8B-tweet_eval_sentiment-finetuned](https://huggingface.co/krtkjais/llama3.1-8B-tweet_eval_sentiment-finetuned).


## 1. Installing & Importing Libraries

In [None]:
!pip install --upgrade datasets fsspec
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are i

In [None]:
import unsloth
from transformers import TrainingArguments
from datasets import load_dataset
from unsloth import FastLanguageModel, is_bfloat16_supported
from peft import get_peft_model
from trl import SFTTrainer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 2. Loading Dataset & Preprocessing

In [None]:
dataset = load_dataset("tweet_eval", "sentiment")

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45615
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12284
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


In [None]:
label_map = {0: "negative", 1: "neutral", 2: "positive"}

def convert_to_custom_format(example):
    return {
        "instruction": "Classify the sentiment of the tweet as positive, negative, or neutral.",
        "input": example["text"],
        "output": label_map[example["label"]]
    }

custom_dataset = dataset["train"].map(convert_to_custom_format)

In [None]:
print(custom_dataset)

Dataset({
    features: ['text', 'label', 'instruction', 'input', 'output'],
    num_rows: 45615
})


In [None]:
validation_dataset = dataset["validation"].map(convert_to_custom_format)

In [None]:
custom_dataset_small = custom_dataset.select(range(1000))
validation_dataset_small = validation_dataset.select(range(200))

## 3. Loading LLaMA Model (Unsloth) & Tokenization

In [None]:
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = fourbit_models[0],
    max_seq_length=2048,
    load_in_4bit=True
)

==((====))==  Unsloth 2025.5.4: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
)

Unsloth 2025.5.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
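
The trainable-parameter figure Unsloth reports for this run (41,943,040, about 0.52% of 8B) can be reproduced by hand from the LoRA settings above. The sketch below is only an arithmetic check: the layer shapes are assumptions taken from the public Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128) and are not stated anywhere in this notebook.

```python
# Back-of-the-envelope LoRA parameter count for r = 16 on the seven target modules.
# ASSUMPTION: layer shapes come from the public Llama 3.1 8B config, not from this notebook.
HIDDEN = 4096          # model hidden size (assumed)
INTERMEDIATE = 14336   # MLP inner size (assumed)
KV_DIM = 8 * 128       # 8 KV heads x head_dim 128 (assumed, grouped-query attention)
NUM_LAYERS = 32        # matches the "patched 32 layers" message
RANK = 16              # r = 16 in get_peft_model

# (in_features, out_features) of each adapted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two matrices per layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")  # 1,310,720 per layer, 41,943,040 total
```

Under these assumed shapes the total matches the reported count exactly, which suggests the adapters cover all seven projections in every one of the 32 layers.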


In [None]:
def formatting_func(example):
    text = (
        "### Instruction:\n" + example["instruction"] + "\n\n" +
        "### Input:\n" + example["input"] + "\n\n" +
        "### Output:\n" + example["output"]
    )
    return text

def add_text_field(example):
    example["text_"] = formatting_func(example)
    return example

custom_dataset_with_text = custom_dataset_small.map(add_text_field)
val_dataset_with_text = validation_dataset_small.map(add_text_field)

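def tokenize(example):
    # Every example is padded to the full 2048-token context, so each training
    # and evaluation sequence is 2048 tokens long regardless of tweet length.
    return tokenizer(
        example["text_"],
        truncation=True,
        padding="max_length",
        max_length=2048,
    )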

train_tokenized_dataset = custom_dataset_with_text.map(tokenize)
val_tokenized_dataset = val_dataset_with_text.map(tokenize)

In [None]:
print(custom_dataset_with_text['text_'][0])

### Instruction:
Classify the sentiment of the tweet as positive, negative, or neutral.

### Input:
"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"

### Output:
positive
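
At inference time the same template would normally be used with the `### Output:` section left empty, so the model completes it with a label. The following is a minimal sketch of that idea; `build_inference_prompt` and `parse_label` are hypothetical helper names that do not appear in the notebook, and the parsing rule (take the first word after the last `### Output:` marker) is an assumption about how the model's completion looks.

```python
LABELS = ("positive", "negative", "neutral")
INSTRUCTION = "Classify the sentiment of the tweet as positive, negative, or neutral."


def build_inference_prompt(tweet: str) -> str:
    """Same layout as formatting_func, but with nothing after '### Output:'."""
    return f"### Instruction:\n{INSTRUCTION}\n\n### Input:\n{tweet}\n\n### Output:\n"


def parse_label(generated: str):
    """Return the first word after the last '### Output:' marker if it is a known label.

    Hypothetical post-processing: returns None when the completion is empty
    or does not start with one of the three labels.
    """
    completion = generated.rsplit("### Output:", 1)[-1].strip().lower()
    first_word = completion.split()[0].strip(".,!") if completion else ""
    return first_word if first_word in LABELS else None


prompt = build_inference_prompt("I'm so happy today!")
print(prompt)
print(parse_label(prompt + "positive"))
```

Keeping the prompt builder next to `formatting_func` makes it less likely that the training and inference templates drift apart.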


## 4. Training Configuration & Training

In [None]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_tokenized_dataset,
    eval_dataset = val_tokenized_dataset,
    dataset_text_field = "text_",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
        per_device_eval_batch_size = 2
    ),
)
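
With `warmup_steps = 5`, `max_steps = 60`, and `lr_scheduler_type = "linear"`, the learning rate is not constant at 2e-4. The sketch below assumes the usual warmup-then-linear-decay rule for a linear schedule (ramp up from zero during warmup, then decay linearly to zero at the last step); `linear_lr` is an illustrative helper, not a function from the training libraries.

```python
def linear_lr(step: int, base_lr: float = 2e-4, warmup_steps: int = 5, total_steps: int = 60) -> float:
    """Learning rate after `step` optimizer updates under linear warmup and linear decay.

    ASSUMPTION: mirrors the common warmup-then-linear-decay rule; values are the
    ones passed to TrainingArguments in the cell above.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```

Under this rule the peak rate of 2e-4 is reached only at step 5, and by step 30 it has already fallen to roughly 1.1e-4.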

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,3.3671
2,3.6066
3,3.3455
4,3.3194
5,3.1143
6,2.7141
7,2.331
8,2.2708
9,2.275
10,2.253


In [None]:
print(trainer_stats)

TrainOutput(global_step=60, training_loss=2.1300754944483438, metrics={'train_runtime': 2431.9601, 'train_samples_per_second': 0.197, 'train_steps_per_second': 0.025, 'total_flos': 4.451323701362688e+16, 'train_loss': 2.1300754944483438})
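
The numbers in `TrainOutput` can be cross-checked against the batch settings. This is a small arithmetic sketch using only values printed above; the variable names are illustrative.

```python
# Values copied from the training configuration and the TrainOutput line.
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 60
train_rows = 1_000
train_runtime_s = 2431.9601

effective_batch = per_device_batch * grad_accum * num_gpus  # samples per optimizer step
samples_seen = effective_batch * max_steps                   # samples processed in the run
fraction_of_subset = samples_seen / train_rows               # share of the 1,000-row subset
samples_per_second = samples_seen / train_runtime_s
seconds_per_step = train_runtime_s / max_steps

print(f"effective batch size : {effective_batch}")
print(f"samples seen         : {samples_seen} ({fraction_of_subset:.0%} of the subset)")
print(f"samples per second   : {samples_per_second:.3f}")
print(f"seconds per step     : {seconds_per_step:.1f}")
```

The run processes 480 samples, which is 48% of the 1,000-row subset, so the "Num Epochs = 1" shown in the banner is less than one full pass. The computed 0.197 samples per second agrees with `train_samples_per_second` in the output.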


## 5. Evaluating on Validation Set

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
eval_metrics = trainer.evaluate()

print("Evaluation metrics on validation set:")
print(eval_metrics)

Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


Evaluation metrics on validation set:
{'eval_loss': 1.8808320760726929, 'eval_runtime': 587.5601, 'eval_samples_per_second': 0.34, 'eval_steps_per_second': 0.17}
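
`eval_loss` measures how well the model predicts tokens, not how often the predicted label is correct. If predictions were generated and reduced to one label per tweet, accuracy could be computed along the following lines. This is a sketch only: the two lists are made-up toy values for illustration, not results from this model, and `accuracy` and `confusion_counts` are hypothetical helpers.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def confusion_counts(gold, predicted):
    """Count (gold, predicted) pairs; a None prediction means no label was parsed."""
    return Counter(zip(gold, predicted))


# TOY DATA for illustration only; these are not outputs of the fine-tuned model.
toy_gold = ["positive", "neutral", "negative", "neutral"]
toy_pred = ["positive", "neutral", "neutral", None]

print(f"accuracy: {accuracy(toy_gold, toy_pred):.2f}")
for (gold, pred), count in sorted(confusion_counts(toy_gold, toy_pred).items(), key=str):
    print(f"gold={gold!s:<9} predicted={pred!s:<9} count={count}")
```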


In [None]:
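training_loss = trainer_stats.training_loss  # mean loss over all 60 training steps
eval_loss = eval_metrics["eval_loss"]        # loss on the 200 validation samples after training
# Gap between the two figures; the training side includes the early high-loss steps.
performance_improvement = training_loss - eval_loss
print(f"Loss improvement from training to validation: {performance_improvement:.4f}")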

Loss improvement from training to validation: 0.2492
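
The gap printed above compares a validation loss measured after training with a training loss that appears to be averaged over the whole run. Assuming `training_loss` is the arithmetic mean of the 60 per-step losses, and using the ten step losses shown in the training log (rounded to four decimals there), the mean over the remaining steps can be backed out as a rough sketch:

```python
# Step losses 1-10 as printed in the training log.
first_ten = [3.3671, 3.6066, 3.3455, 3.3194, 3.1143, 2.7141, 2.331, 2.2708, 2.275, 2.253]
mean_all_steps = 2.1300754944483438  # trainer_stats.training_loss
eval_loss = 1.8808320760726929       # eval_metrics["eval_loss"]
total_steps = 60

# ASSUMPTION: training_loss is the plain mean of the per-step losses.
mean_first_ten = sum(first_ten) / len(first_ten)
remaining_sum = mean_all_steps * total_steps - sum(first_ten)
mean_remaining = remaining_sum / (total_steps - len(first_ten))

print(f"mean loss, steps 1-10  : {mean_first_ten:.3f}")
print(f"mean loss, steps 11-60 : {mean_remaining:.3f}")
print(f"gap to validation loss : {mean_remaining - eval_loss:.3f}")
```

Under that assumption the later steps average about 1.98, so the distance to the validation loss shrinks from roughly 0.25 to roughly 0.10 once the first ten steps are excluded.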


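**Validation loss (1.88) is lower than the mean training loss (2.13).** Because the training figure averages over all 60 steps, including the high-loss early steps, this shows no sign of overfitting but does not by itself establish how well the model generalizes.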

## 6. Saving & Pushing The Model

In [None]:
model.save_pretrained("llama3.1-8B-tweet_eval_sentiment-finetuned")
tokenizer.save_pretrained("llama3.1-8B-tweet_eval_sentiment-finetuned")

('llama3.1-8B-tweet_eval_sentiment-finetuned/tokenizer_config.json',
 'llama3.1-8B-tweet_eval_sentiment-finetuned/special_tokens_map.json',
 'llama3.1-8B-tweet_eval_sentiment-finetuned/tokenizer.json')

In [None]:
from huggingface_hub import login
login()

model.push_to_hub("krtkjais/llama3.1-8B-tweet_eval_sentiment-finetuned")
tokenizer.push_to_hub("krtkjais/llama3.1-8B-tweet_eval_sentiment-finetuned")

Saved model to https://huggingface.co/krtkjais/llama3.1-8B-tweet_eval_sentiment-finetuned


## 7. Conclusion

*The validation loss (1.88) is lower than the mean training loss (2.13), a difference of about 0.25. Because the training figure averages over all 60 steps, including the early high-loss ones, this gap shows no sign of overfitting but does not by itself prove good generalization.
With a small dataset (1,000 samples) and only 60 steps, this is an encouraging result.
Further training on a larger dataset, or tuning the learning rate, could lead to better performance.*
