---

## **Section I: Introduction to Fine-Tuning**

### 1. What is Fine-Tuning?

**Definition**:
Fine-tuning refers to the process of taking a pre-trained model (usually trained on a large and generic corpus) and adapting it to a specific task or dataset by continuing the training process on that new data.

**Analogy**:
Imagine a person who has a general education (like a college degree in English) and is now being trained specifically to become a legal contract writer. That transition from general to domain-specific expertise is what fine-tuning does for a model.

**Mathematical Overview**:
Let `M` be a model trained on a large dataset `D_pretrain`. Fine-tuning adapts `M` on a smaller dataset `D_task` by updating its weights `W` to `W'`, minimizing the loss specific to `D_task`.
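
Written out, the overview above corresponds to the following objective. This is a sketch: the per-example loss $\ell$ and the model function $f_W$ are assumed symbols that the text does not define.

```latex
W' = \arg\min_{W} \; \frac{1}{|D_{\text{task}}|}
     \sum_{(x,\,y) \in D_{\text{task}}} \ell\big(f_{W}(x),\, y\big),
\qquad \text{with } W \text{ initialized from the pretrained weights of } M
```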

---

### 2. Why Do We Need Fine-Tuning?

| General Model                 | Gap for a Specific Task                                                                        |
| ----------------------------- | ---------------------------------------------------------------------------------------------- |
| Trained on vast internet data | Needs to perform a narrow task (e.g., classify emotion, summarize emails, extract key phrases) |
| Can generate fluent text      | May lack domain accuracy                                                                       |
| Generic knowledge             | No personalization or adaptation                                                               |

**Use Cases**:

* **Customer Support Chatbot**: Fine-tune a GPT-like model to answer only company-related FAQs.
* **Medical NER**: Use BERT to tag diseases and treatments from clinical notes.
* **Legal Document Classifier**: Fine-tune on thousands of case summaries to classify judgment types.

---

### 3. Steps Involved in Fine-Tuning

1. **Select a Pretrained Model**

   * E.g., `bert-base-uncased`, `gpt2`, `TinyLlama`, `mistral-instruct`.

2. **Prepare Your Dataset**

   * Format into prompt–completion pairs, or input–label format for classification.
   * Example for text generation:

     ```json
     { "prompt": "She opened the letter and", "completion": " started crying out of joy." }
     ```

3. **Tokenize the Data**

   * Convert text into input IDs compatible with the model.

4. **Define Model Architecture and Load Pretrained Weights**

5. **Set Training Hyperparameters**

   * Batch size, learning rate, epochs, evaluation strategy.

6. **Train Using a Trainer or Loop**

   * Most commonly with Hugging Face’s `Trainer`.
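
The dataset-preparation step can be sketched as follows. This is a minimal illustration, assuming a simple `Text: ... Emotion:` template; the function name `to_prompt_completion`, the template, and the label names are hypothetical and not taken from the notebook.

```python
import json

def to_prompt_completion(text: str, label: str) -> dict:
    """Turn a labelled example into a prompt-completion pair."""
    prompt = f"Text: {text}\nEmotion:"
    completion = f" {label}"  # leading space separates the answer from the prompt
    return {"prompt": prompt, "completion": completion}

# Hypothetical examples, used only to show the record shape.
examples = [
    ("i feel great about this decision", "joy"),
    ("i am scared of the results", "fear"),
]

records = [to_prompt_completion(text, label) for text, label in examples]
jsonl = "\n".join(json.dumps(record) for record in records)  # one JSON object per line
print(jsonl)
```

Each line of the resulting JSONL string is one training record, which is a common on-disk layout for prompt-completion data.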

---

### 4. Drawbacks of Full Fine-Tuning

| Problem                 | Explanation                                                                        |
| ----------------------- | ---------------------------------------------------------------------------------- |
| **Memory Usage**        | Updating all parameters of large models (7B–65B) needs memory for weights, gradients, and optimizer states, which quickly exceeds a single 24 GB GPU. |
| **Training Time**       | Each update touches every layer, increasing computation.                           |
| **Storage Needs**       | Fine-tuning multiple tasks = multiple copies of full model.                        |
| **Generalization Risk** | On small datasets, model may *overfit* or *forget* general capabilities.           |
| **Inflexibility**       | Can’t easily switch between multiple fine-tuned tasks without storing full models. |

---

### 5. Example: Full Fine-Tuning on DistilGPT2

#### Dataset Sample (Emotion Classification turned to prompt-generation):

```python
from datasets import load_dataset
dataset = load_dataset("dair-ai/emotion", split="train")

print(dataset[0])
# Illustrative record format (label is an integer class ID): {'text': "i feel great about this decision", 'label': 2}
```

Reformatted for fine-tuning a language model:

```python
{'prompt': "i feel great about this decision"}
```

Fine-tuning the full DistilGPT2 would mean:

* Updating **all** \~82 million parameters.
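* Holding weights, gradients, and Adam optimizer states for every parameter (roughly 1.3 GB in FP32), plus activations that grow with batch size and sequence length.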
* Fine-tuned model saved = full model size (\~300MB+).
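
A rough way to estimate the training state behind these numbers is sketched below. It assumes FP32 weights and gradients (4 bytes each) and two FP32 Adam states per parameter (8 bytes); the function name is hypothetical, and activations and framework overhead are deliberately left out.

```python
def training_state_gb(n_params: int,
                      weight_bytes: int = 4,
                      grad_bytes: int = 4,
                      optimizer_bytes: int = 8) -> float:
    """Memory for weights + gradients + Adam states, excluding activations."""
    total_bytes = n_params * (weight_bytes + grad_bytes + optimizer_bytes)
    return total_bytes / 1e9  # decimal gigabytes

distilgpt2_params = 82_000_000  # approximate parameter count quoted above
print(f"{training_state_gb(distilgpt2_params):.2f} GB")
```

Plugging in a 7B-parameter model shows why full fine-tuning of large models quickly outgrows a single GPU.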

---

### 6. When is Full Fine-Tuning Acceptable?

* When the downstream dataset is **large** (e.g., 1M+ examples).
* When task is **very different** from pretraining (e.g., code generation).
* When you have **high compute availability** (e.g., A100 GPU).
* When **task-critical precision** is required (e.g., medical diagnosis).

---

### Summary of Section I

| Concept     | Key Insight                                              |
| ----------- | -------------------------------------------------------- |
| Fine-tuning | Adapts general models to task-specific needs             |
| Benefits    | Better performance on specialized tasks                  |
| Limitations | Memory, compute, time, storage constraints               |
| Next Steps  | Use efficient methods like LoRA, QLoRA to overcome these |

---

## **Section II: Introduction to Parameter-Efficient Fine-Tuning (PEFT)**

---

### 1. What is PEFT (Parameter-Efficient Fine-Tuning)?

**Definition**:
Parameter-Efficient Fine-Tuning (PEFT) is a strategy that updates only a *small subset of model parameters* while keeping the majority of the pretrained model frozen.

Instead of fine-tuning the entire model (which may have hundreds of millions or billions of parameters), PEFT techniques modify a small number of *learnable components*—drastically reducing memory usage and training cost.

---

### 2. Motivation for PEFT

| Challenge with Full Fine-Tuning | PEFT Advantage                        |
| ------------------------------- | ------------------------------------- |
| High VRAM needed (20GB+)        | Works on 8GB–16GB GPUs                |
| Long training times             | Much faster                           |
| Large storage for each variant  | Only additional *adapters* are stored |
| Risk of overfitting             | Regularization through fewer updates  |

**Typical Setting**: Instead of updating all parameters `θ`, we train only a small set of parameters `Δθ`, typically well under 5% of the total.

---

### 3. Key Techniques in PEFT

| Technique          | Description                                         |
| ------------------ | --------------------------------------------------- |
| **LoRA**           | Injects low-rank matrices into attention layers     |
| **Prompt Tuning**  | Optimizes a set of continuous tokens (soft prompts) |
| **Prefix Tuning**  | Prepends learnable prefix vectors to the attention keys and values in each layer |
| **Adapter Layers** | Adds small trainable layers between existing ones   |
| **BitFit**         | Only tunes bias terms in layers                     |

Among these, **LoRA** has become the most popular for LLMs due to its balance of efficiency and performance.

---

### 4. What is LoRA (Low-Rank Adaptation)?

#### a. Motivation

Transformer layers contain large dense weight matrices (e.g., `Wq`, `Wv`). LoRA keeps each such `W` frozen and represents its weight update `ΔW` as the product of two low-rank matrices `A` and `B`:

> ΔW = A × B
> Where:
>
> * `A` is of shape (d, r),
> * `B` is of shape (r, k),
>   and `r` is a small rank (e.g., 4, 8).

This allows the model to learn `r × (d + k)` parameters instead of `d × k`.

#### b. Example

Assume:

* `W` is a 4096 × 4096 matrix in a transformer block.
* With full fine-tuning: 4096 × 4096 ≈ 16.8 million parameters updated.
* With LoRA (r=8): only 8 × (4096 + 4096) = 65,536 parameters updated, about 0.4% of the full matrix.
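
The arithmetic in this example can be checked with a few lines of Python. This is a small sketch using the formula `r × (d + k)` from above; the function name is hypothetical.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in A (d x r) plus B (r x k)."""
    return r * (d + k)

d = k = 4096
full = d * k                          # parameters touched by full fine-tuning
lora = lora_param_count(d, k, r=8)    # parameters touched by LoRA
print(full, lora, f"{100 * lora / full:.2f}%")
```

Doubling `r` doubles the adapter size, so the rank is the main knob trading capacity against memory.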

#### c. Injection Points

LoRA is typically applied to:

* **q\_proj**: Query projection in attention
* **v\_proj**: Value projection in attention

But may vary by model architecture.

---

### 5. Benefits of LoRA

| Benefit                      | Explanation                                                  |
| ---------------------------- | ------------------------------------------------------------ |
| **Memory Efficient**         | Updates only a small number of parameters                    |
| **Fast Training**            | Fewer parameters = faster backward pass                      |
| **Modular Storage**          | Stores only LoRA adapter weights (\~5–50MB)                  |
| **Task Switching**           | You can load/unload adapters for different tasks dynamically |
| **Lower Risk of Forgetting** | Base model remains untouched                                 |

---

### 6. When to Use LoRA?

* When you're training on Google Colab (e.g., a T4) or on consumer GPUs with limited VRAM
* When training on small or medium-sized datasets (500–50,000 examples)
* When you want to experiment with multiple tasks or domains
* When you aim to *combine multiple LoRA adapters* (multi-task learning)

---

### 7. Example: Applying LoRA to TinyLlama

The walkthrough in Section VII uses this configuration:

```python
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
```

This applies LoRA to the query and value projection layers in the attention mechanism, allowing TinyLlama to adapt to emotion prompts while freezing the rest of the model.
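
To make the roles of `r` and `lora_alpha` concrete, here is a toy forward pass in plain Python. It is a sketch under stated assumptions: it follows this notebook's notation (`A` of shape `(d, r)`, `B` of shape `(r, k)`), uses the `lora_alpha / r` scaling that PEFT applies by default, and zero-initializes `B` so the adapted layer starts out identical to the frozen one. Which factor starts at zero is a convention; the helper names and toy sizes are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x W + (alpha / r) * (x A) B without forming the full update A B."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(0)
d, k, r, alpha = 6, 4, 2, 4   # toy sizes; the config above uses r=8, lora_alpha=16
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]      # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]   # trainable, (d, r)
B = [[0.0] * k for _ in range(r)]                                   # trainable, (r, k)
x = [[random.gauss(0, 1) for _ in range(d)]]                        # one input row

out_at_init = lora_forward(x, W, A, B, alpha, r)
print(out_at_init == matmul(x, W))  # the adapter contributes nothing before training
```

With `r=8` and `lora_alpha=16`, the scale factor is 2, so the learned update is doubled before being added to the frozen layer's output.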

---

### 8. Summary of Section II

| Concept   | Insight                                                           |
| --------- | ----------------------------------------------------------------- |
| PEFT      | Allows efficient fine-tuning with low memory and faster training  |
| LoRA      | Updates low-rank adapter matrices inside attention layers         |
| Advantage | High-quality fine-tuning with minimal resources                   |
| Use Case  | Best suited for Google Colab or resource-constrained environments |

---

## **Section III: Mixed-Precision Training (FP16) and 8-bit Quantization (QLoRA)**

---

### 1. What is Mixed-Precision Training?

**Definition**:
Mixed-precision training refers to using both 16-bit floating point (FP16) and 32-bit floating point (FP32) data types during model training.

* **FP32 (Float32)**: Full precision. More accurate but consumes more memory and compute.
* **FP16 (Float16)**: Half precision. Uses less memory and is faster, but risks numerical instability.

**Why Use Mixed Precision?**
It accelerates training and reduces memory usage while maintaining accuracy by:

* Performing most operations in FP16
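* Keeping numerically sensitive pieces in FP32, such as the master copy of the weights, the optimizer update, and reductions like softmax or the loss, while loss scaling protects small gradients from underflow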

---

### 2. Advantages of Mixed-Precision Training

| Advantage                      | Explanation                                                      |
| ------------------------------ | ---------------------------------------------------------------- |
| **Memory Efficient**           | FP16 uses 50% less memory than FP32                              |
| **Faster Computation**         | FP16 arithmetic is faster on supported hardware (e.g., T4, A100) |
| **Scalability**                | Enables training of larger models on the same hardware           |
| **Little or No Accuracy Loss** | With techniques like loss scaling                                |

---

### 3. What is 8-bit Quantization?

**Definition**:
8-bit quantization converts model weights (typically stored in FP32 or FP16) into 8-bit integers (INT8), drastically reducing the model size and memory bandwidth requirements.
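
The core idea can be sketched with symmetric absmax quantization, one of the simplest schemes. This is an illustration only: the function names and sample weights are hypothetical, and `bitsandbytes` uses a more elaborate method (vector-wise scaling with outlier handling) than this single shared scale.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27, -0.9]   # hypothetical weight values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a shared scale, and the reconstruction error is bounded by half a quantization step.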

**Use Case in Fine-Tuning**:
You can load a pretrained model in **8-bit** and then fine-tune only small LoRA modules in **16-bit (or 32-bit)** precision. This quantized-base-plus-LoRA recipe is the idea behind **QLoRA**.

---

### 4. What is QLoRA?

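**QLoRA** stands for **Quantized Low-Rank Adaptation**. The original method quantizes the base model to 4-bit (NF4); this notebook applies the same recipe with an 8-bit base model. It combines:

* **8-bit quantized base model** (using `bitsandbytes`)
* **Low-rank trainable adapters** (LoRA)
* **16-bit or 32-bit adapter training**

**This approach allows fine-tuning 7B-class models on consumer GPUs (12–16 GB VRAM).** Larger models (13B–65B) need 4-bit quantization, more VRAM, or both, since 8-bit weights alone take about 1 GB per billion parameters.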

---

### 5. How Does QLoRA Work?

1. **Load the base model in 8-bit** using `bitsandbytes`.
2. **Freeze all base model weights** to avoid memory overhead.
3. **Inject LoRA adapters** into selected modules.
4. **Train only the LoRA parameters** in 16-bit or 32-bit.

This hybrid quantized setup gives high-quality fine-tuning with minimal resource requirements.

---

### 6. Key Components of QLoRA

| Component             | Role                                                         |
| --------------------- | ------------------------------------------------------------ |
| **bitsandbytes**      | Quantizes model weights to 8-bit                             |
| **PEFT/LoRA**         | Provides the trainable adapter layers                        |
| **Transformers (HF)** | Manages model architecture, tokenization, and training logic |
| **Trainer**           | Orchestrates training loop, backpropagation, and logging     |

---

### 7. Benefits of QLoRA

| Benefit                          | Explanation                                           |
| -------------------------------- | ----------------------------------------------------- |
| **Massive Memory Savings**       | 8-bit weights take \~50% of the FP16 footprint (\~25% of FP32) |
| **Minimal Performance Drop**     | Maintains performance close to full-precision tuning  |
| **Low VRAM Hardware Compatible** | Works on 12–16 GB GPUs such as RTX 3060 or T4 for small and mid-sized models |
| **Adapter Modularity**           | Same as LoRA: You can swap out task-specific adapters |

---

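### 8. Example from the Walkthrough Code

The walkthrough in Section VII imports:

```python
from peft import prepare_model_for_kbit_training
```

Called on a model that has already been loaded in 8-bit, this helper:

* Freezes the quantized base model weights
* Upcasts the remaining non-quantized parameters (such as layer norms) to FP32 for stability
* Enables gradient checkpointing and input gradients so LoRA adapters can be trained on top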

Then the LoRA configuration is applied:

```python
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
```

---

### 9. Which Layers Are Typically Fine-Tuned?

When using LoRA or QLoRA in transformer-based language models, **attention sub-layers** are most commonly targeted:

| Model Type         | Layers Commonly Tuned                  |
| ------------------ | -------------------------------------- |
| GPT-2 / DistilGPT2 | `c_attn`, `c_proj`                     |
| LLaMA / TinyLlama  | `q_proj`, `v_proj`, `k_proj`, `o_proj` |
| T5 / FLAN-T5       | `SelfAttention.q`, `SelfAttention.v`   |

You only need to tune a few of these to get performance gains.

---

### 10. Summary of Section III

| Concept                | Insight                                                          |
| ---------------------- | ---------------------------------------------------------------- |
| Mixed-Precision (FP16) | Speeds up training and reduces memory with minimal accuracy loss |
| 8-bit Quantization     | Compresses model weights for efficient loading                   |
| QLoRA                  | Combines both for highly efficient adapter-based fine-tuning     |
| Compatible Layers      | Apply LoRA to attention projections like `q_proj`, `v_proj`      |

---

## **Section IV: Comparison – Full Fine-Tuning vs PEFT vs QLoRA**

This section compares the most common fine-tuning strategies side by side:

---

### 1. Comparison Table

| Criteria                    | Full Fine-Tuning                 | PEFT (LoRA)                         | QLoRA                                       |
| --------------------------- | -------------------------------- | ----------------------------------- | ------------------------------------------- |
| **Definition**              | Updates all model weights        | Trains only low-rank adapter layers | Trains adapters while loading base in 8-bit |
| **Memory Usage**            | Very High                        | Low                                 | Lowest                                      |
| **Speed**                   | Slow (depends on model size)     | Fast                                | Fast, though 8-bit matmuls add some per-step overhead |
| **Hardware Required**       | High-end GPUs (A100, V100, etc.) | Consumer GPUs (T4, 3060)            | Consumer GPUs (3060, Colab T4)              |
| **Model Modifications**     | Entire model updated             | Only adapter layers injected        | Base model frozen; LoRA adapters trained    |
| **Reusability**             | Low – one full model per task    | Yes – swap adapters per task        | Yes – efficient and modular                 |
| **Inference Overhead**      | None beyond the base model       | None once adapters are merged       | Some, from 8-bit dequantization             |
| **Model Size After Tuning** | Large (same size as base)        | Shared base + small adapter file    | Shared 8-bit base + small adapter file      |
| **Typical Use Case**        | Enterprise with full infra       | Academic, startups, research labs   | Efficient tuning with limited resources     |

---

### 2. When Should You Use What?

| Situation                                            | Recommended Method | Justification                                       |
| ---------------------------------------------------- | ------------------ | --------------------------------------------------- |
| You have a powerful GPU cluster                      | Full Fine-Tuning   | Maximum flexibility, but expensive                  |
| You want fast adaptation for multiple tasks          | LoRA (PEFT)        | Adapter reuse and fast switching                    |
| You have limited memory and budget                   | QLoRA              | Efficient training without sacrificing much quality |
| You want to publish multiple task models efficiently | LoRA or QLoRA      | Upload only adapters, not full models               |
| You're working on a single task, small-scale project | PEFT               | Easy and low-cost to set up                         |

---

### 3. Key Observations

* **Full fine-tuning** is becoming **less common** in practice when compute is limited.
* **PEFT techniques (like LoRA)** often come close to the performance of full fine-tuning with just a **fraction of the compute**.
* **QLoRA** combines 8-bit quantization and LoRA to push efficiency even further, making it ideal for Colab-based or small-setup environments.

---

### 4. Example: Memory Requirements (Approximate)

| Method (7B model)    | What Is Trained    | Approx. VRAM Needed |
| -------------------- | ------------------ | ----------- |
| Full Fine-Tuning     | All layers         | 48 GB+ (often multi-GPU) |
| LoRA (FP16)          | Adapters only      | 16 GB       |
| QLoRA (8-bit + LoRA) | Adapters + 8-bit   | 12 GB       |

This means you can fine-tune models like **LLaMA-7B or TinyLlama-1.1B** on a **single Colab T4 or RTX 3060 GPU** using QLoRA.

---

### 5. Summary of Section IV

* Full fine-tuning is powerful but resource-heavy.
* LoRA brings adapter-based modularity and speed.
* QLoRA enables powerful low-cost fine-tuning with 8-bit + FP16 hybrid setup.
* For most modern use cases in startups, academia, and hobby projects, **PEFT or QLoRA is the preferred route**.

---

## **Section V: LoRA Configuration & Layer Selection in Practice**

In this section, we will explain how to configure LoRA and how to identify the correct target modules (`q_proj`, `v_proj`, `c_attn`, etc.) based on the underlying transformer architecture (e.g., GPT-2, LLaMA, T5). We’ll also explore how this impacts training effectiveness and memory efficiency.

---

### 1. **Understanding LoRA Target Modules**

LoRA (Low-Rank Adaptation) injects trainable adapter layers into **specific weight matrices** of a model. These typically include:

* **Query projection layer (`q_proj`)**
* **Value projection layer (`v_proj`)**
* Sometimes **key** or **output projections** too, depending on model type.

The intuition: Instead of updating large parameter matrices (e.g., 4096×4096), LoRA adds smaller rank adapters (e.g., 4096×8 and 8×4096) that are updated during training.

---

### 2. **Common Target Modules per Model**

| Model Family           | Target Modules for LoRA        | Notes                                              |
| ---------------------- | ------------------------------ | -------------------------------------------------- |
| **GPT-2 / DistilGPT2** | `c_attn`                       | GPT-2 uses `Conv1D` layer combining Q, K, V        |
| **GPT-Neo / GPT-J**    | `q_proj`, `v_proj`             | Follows standard transformer architecture          |
| **LLaMA / TinyLlama**  | `q_proj`, `v_proj`             | Same as above; separate projections                |
| **BLOOM**              | `query_key_value`              | All three projections packed into one linear layer |
| **T5 / Flan-T5**       | `q`, `v`                       | Verify the names in the model structure            |
| **OPT / Galactica**    | `q_proj`, `v_proj`             | Standard Meta (OPT) transformer layout             |

You must always **inspect the model** using `model.named_modules()` or `model.state_dict().keys()` to identify actual layer names.

---

### 3. **How to Identify Target Modules**

```python
# Inspect the model to list all modules
for name, module in model.named_modules():
    print(name)
```

This will print all sub-modules and their names. You can then grep for layers like `q_proj`, `v_proj`, `c_attn`, `query_key_value`, etc.
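
Rather than scanning the full printout by eye, the names can be summarized by their final component. This is a minimal sketch: `leaf_name_counts` is a hypothetical helper, and the sample names below only imitate the style of a LLaMA-like model.

```python
from collections import Counter

def leaf_name_counts(module_names):
    """Count how often each final name component (e.g. 'q_proj') occurs."""
    return Counter(name.rsplit(".", 1)[-1] for name in module_names if name)

# Hypothetical names for illustration. With a real model, pass:
#   (name for name, _ in model.named_modules())
sample_names = [
    "",  # named_modules() yields the root module with an empty name
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.self_attn.v_proj",
    "model.layers.1.mlp.up_proj",
]

counts = leaf_name_counts(sample_names)
print(counts.most_common())
```

Names that repeat once per layer, such as `q_proj`, are the usual candidates for `target_modules`.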

---

### 4. **Example: GPT-2 vs LLaMA**

#### GPT-2 Example:

```python
# For GPT-2 and DistilGPT2
target_modules = ["c_attn"]
```

* In GPT-2, query, key, and value projections are packed into a single Conv1D layer called `c_attn`.

#### LLaMA Example:

```python
# For TinyLLaMA, LLaMA-2, etc.
target_modules = ["q_proj", "v_proj"]
```

* In LLaMA, these projections are separate and named explicitly.

---

### 5. **Why Not All Layers?**

Adding LoRA adapters to every projection (Q, K, V, O, FFN) increases the number of trainable parameters and the memory they need. The LoRA paper found that adapting only the query and value projections already recovers most of the benefit, while the QLoRA paper reports better results when all linear layers are adapted.

So, start with the essential layers and widen the set only if quality falls short.

---

### 6. **Custom LoRA Config in Code**

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # or "c_attn" for GPT2
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
```

Always make sure your `target_modules` match the model architecture, otherwise the PEFT library will raise a `ValueError` indicating that the module was not found.

---

### 7. **Summary of Section V**

* LoRA modifies only specific projection layers for efficiency.
* Target modules vary by model family (GPT-2, LLaMA, T5, etc.).
* Use model inspection to confirm correct module names.
* Stick to `q_proj` and `v_proj` unless you need broader coverage.
* LoRA configs must exactly match the model's architecture to avoid runtime errors.

---

## **Section VI: What Is QLoRA and How Does It Enable Efficient Training with 8-bit Precision?**

QLoRA is an advanced training technique that combines two powerful ideas:

1. **8-bit quantization** to reduce memory usage during training
2. **Parameter-Efficient Fine-Tuning (PEFT)** using LoRA adapters to train only a small part of the model

Let’s break this down step by step.

---

### 1. **What Is Quantization?**

Quantization is a technique that reduces the numerical precision of model weights (from 32-bit or 16-bit floats to 8-bit integers or floats) to reduce:

* Memory footprint
* Computational requirements
* Bandwidth usage

In the QLoRA-style setup used here, **the base model is quantized to 8-bit** (the original QLoRA paper goes further, to 4-bit), but the trainable adapter layers (LoRA) remain in 16-bit or 32-bit precision for stability.

---

### 2. **Why Not Just Quantize and Train Everything?**

Training on 8-bit weights is numerically unstable and often leads to poor convergence. Therefore, QLoRA **freezes the 8-bit base model** and fine-tunes only the lightweight LoRA adapters in full precision.

This avoids instability while still leveraging the memory savings of quantization.

---

### 3. **QLoRA = Quantized Model + LoRA Adapters**

| Component            | Precision | Trainable |
| -------------------- | --------- | --------- |
| Base Model Weights   | 8-bit     | No        |
| LoRA Adapter Weights | 16-bit    | Yes       |

---

### 4. **Key Components in QLoRA Setup**

* **`bitsandbytes`**: Library that loads the base model in 8-bit with minimal loss in accuracy
* **`bnb.nn.Linear8bitLt`**: Special 8-bit linear layers used in quantized models
* **`prepare_model_for_kbit_training()`**: Freezes the quantized base weights, upcasts non-quantized parameters such as layer norms to FP32, and enables gradient checkpointing
* **`LoraConfig`**: Defines the adapter rank, dropout, and target modules

---

### 5. **Why QLoRA Works So Well**

QLoRA enables training of **large models (6B, 13B, etc.) on consumer-grade GPUs**, even in Google Colab, due to:

* Memory-efficient 8-bit base model
* Lightweight adapter updates
* Reduced disk size for saving models (only adapter weights are saved)

Example:

* Full GPT-J model (6B) needs \~24GB for its weights in fp32 (\~12GB in fp16)
* The same model loaded in 8-bit needs \~6–6.5GB for its weights

---

### 6. **Steps in QLoRA Workflow**

1. Load a large base model in **8-bit** using `load_in_8bit=True`
2. Freeze all base model layers
3. Inject **LoRA adapters** in target layers
4. Fine-tune only adapter layers
5. Save only adapter weights (`.bin` or `.safetensors`)
6. At inference time, load the base model again in 8-bit and **attach the saved adapters** on top of it
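
Steps 5 and 6 can be sketched as two small helpers. This is a hedged outline rather than the notebook's own code: the function names and the `adapter_dir` argument are hypothetical, and the adapter is attached on top of the quantized base model with `PeftModel.from_pretrained` instead of being folded into the 8-bit weights.

```python
def save_adapter(peft_model, adapter_dir: str) -> None:
    """Write only the LoRA adapter weights and config, not the base model."""
    peft_model.save_pretrained(adapter_dir)

def load_with_adapter(model_name: str, adapter_dir: str):
    """Reload the 8-bit base model and attach a saved adapter for inference."""
    # Imported lazily so this sketch can be imported without the GPU stack installed.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base, adapter_dir)  # base stays frozen
    model.eval()
    return model
```

Because only the adapter directory is written, several task-specific adapters can share one copy of the base model on disk.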

---

### 7. **Example Code Snippet**

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load 8-bit base model. Recent transformers versions expect the flag inside a
# BitsAndBytesConfig rather than as a bare load_in_8bit=True argument.
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Prepare for 8-bit training (freezes base weights, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters to the query and value projections
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
```

---

### 8. **Limitations and Considerations**

* Quantization support varies by architecture; encoder-decoder models such as T5 or BART may need extra care (for example, keeping some modules in higher precision)
* Accuracy might degrade slightly if quantization is too aggressive (e.g., 4-bit without fine-grained control)
* Use `BitsAndBytesConfig` options to control quantization behavior if needed

---

### 9. **Benefits Recap**

| Benefit                          | Description                                 |
| -------------------------------- | ------------------------------------------- |
| **Reduced Memory Usage**         | Train large models with <10 GB VRAM         |
| **Faster Training**              | Far fewer gradients and optimizer states to compute and store (8-bit matmuls themselves can be slower than FP16) |
| **Modularity**                   | Save and swap lightweight adapters per task |
| **Scalable to Colab or Laptops** | Democratizes fine-tuning for everyone       |

---

### 10. **Summary of Section VI**

* QLoRA combines 8-bit quantization with adapter-based fine-tuning
* It reduces memory use drastically while maintaining performance
* Works well for transformer models with separate attention projections
* Best suited for setups with limited memory but large model needs

---

## **Section VII: Practical Walkthrough with Live Code – Fine-Tuning a Quantized Model using QLoRA on Google Colab**

This section ties together all the concepts previously discussed—fine-tuning, LoRA, 8-bit quantization—into a hands-on code walkthrough using the `TinyLlama-1.1B-Chat` model on Hugging Face and the `emotion` dataset.

We will:

* Load a quantized model in 8-bit precision
* Inject LoRA adapters into selected layers
* Fine-tune the model with adapter layers only
* Compare predictions before and after fine-tuning

---

### 1. **Installing Required Libraries**

Make sure the following libraries are installed in Colab:

```python
!pip install -q transformers datasets peft accelerate bitsandbytes
```

---

### 2. **Import Required Modules**

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
```

---

### 3. **Load and Prepare the Dataset**

We’ll use Hugging Face’s `emotion` dataset and rename the text column for clarity.

```python
dataset = load_dataset("dair-ai/emotion", split="train")
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.rename_column("text", "prompt")

# Use a subset for fast demonstration
dataset["train"] = dataset["train"].select(range(5000))
dataset["test"] = dataset["test"].select(range(1000))
```

---

### 4. **Tokenizer Setup**

```python
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
```

---

### 5. **Tokenize the Dataset**

```python
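def tokenize_fn(ex):
    return tokenizer(ex["prompt"], truncation=True, padding="max_length", max_length=64)

# Drop the raw text and label columns so only model inputs reach the data collator
tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=dataset["train"].column_names)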
```

---

### 6. **Load Model with 8-bit Quantization**

```python
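from transformers import BitsAndBytesConfig

# Recent transformers versions expect the 8-bit flag inside a BitsAndBytesConfig
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)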
```

---

### 7. **Prepare for LoRA Fine-Tuning**

```python
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```
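
As a sanity check on how small the adapter is, the trainable parameter count can be estimated by hand. The sketch below assumes TinyLlama-1.1B dimensions (hidden size 2048, 22 layers, and a 256-wide value projection from grouped-query attention); these values are assumptions that should be checked against `model.config`, and the helper name is hypothetical.

```python
def lora_trainable_params(shapes, r: int, n_layers: int) -> int:
    """Sum r * (in_features + out_features) over the targeted projections."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * n_layers

# Assumed (in_features, out_features) for the targeted projections.
tinyllama_shapes = {"q_proj": (2048, 2048), "v_proj": (2048, 256)}

n_trainable = lora_trainable_params(tinyllama_shapes, r=8, n_layers=22)
print(f"{n_trainable:,} trainable LoRA parameters")
```

On the real model, `model.print_trainable_parameters()` reports the actual trainable and total counts, which is the authoritative check.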

---

### 8. **Helper Function for Inference**

```python
def infer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

---

### 9. **Predictions Before Fine-Tuning**

```python
test_prompts = [
    "I just got a promotion",
    "I'm feeling very low today",
    "The weather makes me feel",
    "She smiled as she opened the gift",
    "I'm afraid of what happens next"
]

print("\nPredictions BEFORE fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")
```

---

### 10. **Training Configuration and Fine-Tuning**

```python
training_args = TrainingArguments(
    output_dir="./emotion-quant",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    logging_dir="./logs",
    report_to=[],
    fp16=True,
    save_total_limit=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()
```
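
Before launching a long run, it helps to estimate how many optimizer steps this configuration implies. This is a rough sketch assuming a single GPU and no gradient accumulation; the helper name is hypothetical.

```python
import math

def optimizer_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer updates on one device without gradient accumulation."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    return batches_per_epoch * epochs

steps = optimizer_steps(n_examples=5000, batch_size=4, epochs=10)
print(steps)
```

For a quick demonstration on Colab, lowering `num_train_epochs` shortens the run proportionally.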

---

### 11. **Predictions After Fine-Tuning**

```python
print("\nPredictions AFTER fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")
```

---

# Additional Reference content starts here



## **Fine-Tuning with 8-bit & 16-bit Precision: A Technical Deep Dive**

### **1. Precision Fundamentals**
#### **1.1 Numerical Representation**
| Precision | Bits | Exponent | Mantissa | Range (normal magnitudes) | Common Use |
|-----------|------|----------|----------|-------|------------|
| FP32 | 32 | 8 | 23 | ±1.18×10⁻³⁸ to ±3.4×10³⁸ | Baseline training |
| FP16 | 16 | 5 | 10 | ±6.1×10⁻⁵ to ±6.55×10⁴ | Mixed-precision |
| INT8 | 8 | - | - | -128 to 127 | Inference |
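
The FP16 limits in this table can be observed directly with Python's standard library, which supports the IEEE 754 half-precision format through the `struct` code `e`. This is a small sketch; the helper name `to_fp16` is hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16(0.1))       # nearest FP16 value to 0.1, slightly off
print(to_fp16(2049.0))    # above 2048, not every integer is representable
print(to_fp16(65504.0))   # largest finite FP16 value, stored exactly
print(to_fp16(1e-8))      # too small for FP16, rounds to zero
```

The 10-bit mantissa limits precision to roughly three decimal digits, and values that are too small vanish entirely, which is why FP16 training needs the safeguards described below.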

#### **1.2 Memory Requirements**
| Model Size | FP32 | FP16 | INT8 |
|------------|------|------|------|
| 7B params | 28GB | 14GB | 7GB |
| 13B params | 52GB | 26GB | 13GB |
| 70B params | 280GB | 140GB | 70GB |
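
The table follows from a one-line calculation: parameters times bytes per parameter. The sketch below reproduces it for the weights alone, using decimal gigabytes; the function name is hypothetical, and gradients, optimizer states, and activations are not included.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits / 8 / 1e9

for n_params in (7e9, 13e9, 70e9):
    row = {bits: weight_memory_gb(n_params, bits) for bits in (32, 16, 8)}
    print(f"{n_params / 1e9:.0f}B params -> {row}")
```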

---

### **2. 16-bit Mixed Precision Training**
#### **2.1 How It Works**
- **FP16 for:**
  - Forward/backward passes
  - Gradient computation
- **FP32 Master Copy for:**
  - Weight updates
  - Optimizer states (Adam momentum/variance)

#### **2.2 Key Components**
1. **Loss Scaling** (Critical for Gradients < 2⁻²⁴)
   ```python
   scaler = GradScaler()  # PyTorch AMP
   scaled_loss = scaler.scale(loss)
   scaled_loss.backward()
   scaler.step(optimizer)
   scaler.update()
   ```
2. **Automatic Mixed Precision (AMP)**
   ```python
   with torch.autocast(device_type='cuda', dtype=torch.float16):
       outputs = model(inputs)
   ```
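
Why loss scaling helps can be seen with plain numbers. The sketch below is an illustration only: it emulates FP16 rounding with the standard `struct` module (the helper `to_fp16` is a made-up name) and uses a single hand-picked gradient value rather than a real model.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8          # a tiny gradient value
scale = 2.0 ** 16    # a common initial loss scale

unscaled_fp16 = to_fp16(grad)        # stored directly in FP16: underflows to zero
scaled_fp16 = to_fp16(grad * scale)  # scaled first: survives in FP16
recovered = scaled_fp16 / scale      # unscaled again in higher precision

print(unscaled_fp16, scaled_fp16, recovered)
```

Scaling the loss multiplies every gradient by the same factor, so dividing by that factor before the weight update restores the original magnitude.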

#### **2.3 Hardware Acceleration**
- **NVIDIA Tensor Cores:**
  - A100: 312 TFLOPS dense FP16 (624 with structured sparsity) vs 19.5 TFLOPS FP32
  - H100 SXM: roughly 1,000 TFLOPS dense FP16 (about 2,000 with structured sparsity)

#### **2.4 Performance Gains**
| Metric | FP32 | FP16 | Improvement |
|--------|------|------|-------------|
| Training Speed | 1x | 1.5-3x | ✓✓✓ |
| Memory Usage | 1x | 0.5x | ✓✓ |
| Batch Size | 8 | 16-32 | ✓✓✓ |

---

### **3. 8-bit Quantized Training**
#### **3.1 Techniques Compared**
| Method | Backprop? | Accuracy | Use Case |
|--------|-----------|----------|----------|
| **8-bit Optimizers** | Yes | Near-FP32 | Training |
| **QLoRA** | Yes | -1-2% | LLM Fine-tuning |
| **Pure INT8** | No | -3-5% | Inference |

#### **3.2 8-bit Optimizers (e.g., bitsandbytes)**
```python
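import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Replace Adam (Adam8bit always keeps its optimizer state in 8 bits):
optimizer = bnb.optim.Adam8bit(
    model.parameters(),
    lr=2e-5
)

# Or for 8-bit quantization:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)  # Uses LLM.int8()
)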
```

#### **3.3 LLM.int8() Breakdown**
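1. **Vector-wise Quantization**
   - Scales each row of the inputs and each column of the weights independently
2. **Outlier Handling**
   - Keeps feature dimensions whose activation magnitude exceeds a threshold (6.0 by default) in FP16
3. **Mixed Precision Decomposition**
   - Outlier dimensions are multiplied in FP16, the rest in INT8, and the two partial results are summed (see the diagram below)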


         Outlier dimensions          Regular dimensions
        ┌─────────────────┐        ┌─────────────────┐
        │                 │        │                 │
        │   X_{FP16}      │        │   X_{INT8}      │
        │  (FP16 format)  │        │  (INT8 format)  │
        │                 │        │                 │
        ├─────────────────┤        ├─────────────────┤
        │                 │        │                 │
        │   W_{FP16}      │        │   W_{INT8}      │
        │  (FP16 format)  │        │  (INT8 format)  │
        │                 │        │                 │
        └─────────────────┘        └─────────────────┘
                  |                          |
                  |                          |
                  ▼                          ▼
        ┌─────────────────┐        ┌─────────────────┐
        │  FP16 Multiply  │        │  INT8 Multiply  │
        │ (High Precision)│        │ (Low Precision) │
        └─────────────────┘        └─────────────────┘
                  |                          |
                  └───────────┬──────────────┘
                              ▼
                    ┌─────────────────────┐
                    │  Sum (FP16 + INT8)  │
                    │      WX = ...       │
                    └─────────────────────┘
                              ▼
                    ┌─────────────────────┐
                    │   Output (FP16)     │
                    └─────────────────────┘
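
The decomposition can be imitated in a few lines of plain Python. This is a simplified sketch, not the bitsandbytes implementation: the function names are hypothetical, the outlier path uses ordinary Python floats in place of FP16, and the input is a single hand-made vector with one large feature.

```python
THRESHOLD = 6.0  # activation magnitude above which a feature counts as an outlier

def absmax_quantize(values):
    """Symmetric INT8 quantization: returns (list of ints in [-127, 127], scale)."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def int8_matvec(x, W):
    """x (length k) times W (k x n), quantizing x and each column of W to INT8."""
    xq, sx = absmax_quantize(x)
    out = []
    for j in range(len(W[0])):
        wq, sw = absmax_quantize([row[j] for row in W])
        acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
        out.append(acc * sx * sw)                 # dequantize the result
    return out

def mixed_matvec(x, W):
    """Outlier features take a float path, all other features take the INT8 path."""
    n = len(W[0])
    outlier = [i for i, v in enumerate(x) if abs(v) >= THRESHOLD]
    regular = [i for i in range(len(x)) if i not in outlier]
    y_int8 = int8_matvec([x[i] for i in regular], [W[i] for i in regular]) if regular else [0.0] * n
    y_float = [sum(x[i] * W[i][j] for i in outlier) for j in range(n)]
    return [a + b for a, b in zip(y_int8, y_float)]

x = [0.3, -0.2, 60.0, 0.1]  # one outlier feature (60.0)
W = [[0.5, -0.4], [0.1, 0.3], [0.02, -0.01], [-0.6, 0.2]]

exact = [sum(x[i] * W[i][j] for i in range(4)) for j in range(2)]
err_naive = max(abs(a - b) for a, b in zip(int8_matvec(x, W), exact))
err_mixed = max(abs(a - b) for a, b in zip(mixed_matvec(x, W), exact))
print(f"naive INT8 error: {err_naive:.4f}, mixed error: {err_mixed:.4f}")
```

With the outlier included, the INT8 scale is so coarse that the small features collapse to 0 or 1; routing the outlier separately keeps the scale fine for everything else.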

## **Motivation for Low-Precision Training (FP16 & INT8): A Technical Analysis**

### **1. Core Drivers for Low-Precision Adoption**

#### **1.1 The Hardware Efficiency Imperative**
- **Memory Bandwidth Wall**:
  - Modern GPUs (e.g., H100 SXM) offer about 3 TB/s of memory bandwidth against roughly 67 TFLOPS of FP32 compute, so many operations are limited by memory traffic rather than arithmetic
  - FP16 cuts memory traffic by 2x, better balancing compute/memory
- **Specialized Compute Units**:
  - NVIDIA Tensor Cores deliver:
    - 19.5 TFLOPS FP32 → 312 TFLOPS dense FP16 (A100)
    - Roughly 2,000 TOPS dense INT8, about 4,000 with structured sparsity (H100 SXM)

#### **1.2 Economic Factors**
| Precision | Cost per 1M Tokens (Training) | Relative Speed |
|-----------|-------------------------------|----------------|
| FP32      | $100                          | 1x             |
| FP16      | $45                           | 2.5x           |
| INT8      | $22                           | 5x             |

*Illustrative figures based on AWS p4d.24xlarge instances; actual cost depends on the model, utilization, and current pricing.*

---

### **2. FP16: The Gateway to Efficient Training**

#### **2.1 Technical Advantages**
- **Memory Compression**:
  ```python
  # 7B parameter model memory usage (weights only)
  params = 7e9
  model_fp32_gb = params * 4 / 1e9  # 28.0 GB (7B * 4 bytes)
  model_fp16_gb = params * 2 / 1e9  # 14.0 GB (7B * 2 bytes)
  ```
- **Compute Throughput**:
  - A100 FP16: 312 TFLOPS vs FP32: 19.5 TFLOPS (16x difference)

#### **2.2 Key Innovations Enabling FP16**
1. **Loss Scaling** (Micikevicius et al., 2017):
   - Prevents gradient underflow (<2^-24)
   - Algorithm:
     ```python
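     from torch.cuda.amp import GradScaler

     scaler = GradScaler(init_scale=2**16)
     scaled_loss = scaler.scale(loss)  # Multiplies the loss by the current scale
     scaled_loss.backward()
     scaler.step(optimizer)  # Unscales gradients; skips the step on inf/NaN
     scaler.update()  # Adjusts scale dynamically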
     ```
2. **Mixed-Precision Training**:
   - Master weights in FP32
   - Activations/gradients in FP16

#### **2.3 Real-World Impact**
- **BERT-Large Training**:
  - FP32: 4 days on 16xV100
  - FP16: 1.8 days (2.2x faster)

---

### **3. INT8: Pushing the Efficiency Frontier**

#### **3.1 Quantization Fundamentals**
- **Uniform Affine Quantization**:
  ```math
  Q(x) = round\left(\frac{x}{s}\right) + z
  ```
  Where:
  - `s = (x_max - x_min) / (2^8 - 1)`
  - `z = -round(x_min / s)`
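
A short sketch of this formula in plain Python is given below. The function names are made up for the illustration, and it assumes `x_max > x_min`. Note that with this choice of `z` the codes land in the unsigned range 0 to 255; a signed INT8 layout would shift them by 128.

```python
def affine_quantize(xs, bits=8):
    """Quantize floats to unsigned integers with Q(x) = round(x / s) + z."""
    x_min, x_max = min(xs), max(xs)
    s = (x_max - x_min) / (2 ** bits - 1)  # scale (step size)
    z = -round(x_min / s)                  # zero point
    q = [min(max(round(x / s) + z, 0), 2 ** bits - 1) for x in xs]
    return q, s, z

def dequantize(q, s, z):
    """Map integer codes back to approximate float values."""
    return [s * (qi - z) for qi in q]

xs = [-1.0, -0.4, 0.0, 0.25, 2.0]
q, s, z = affine_quantize(xs)
x_hat = dequantize(q, s, z)

print("codes:", q)
print("scale:", s, "zero point:", z)
print("reconstructed:", x_hat)
```

The reconstruction error of each value is at most half a step (`s / 2`), which is the price paid for storing one byte instead of four.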

#### **3.2 Breakthrough Techniques**
| Method | Key Innovation | Accuracy Preservation |
|--------|----------------|-----------------------|
| **LLM.int8()** | Outlier handling (FP16 for activation magnitudes above 6.0) | <1% drop |
| **SmoothQuant** | Activation/weight balancing | <0.5% drop |
| **GPTQ** | Layer-wise optimal quantization | 2-3% drop |

#### **3.3 Hardware Synergy**
- **NVIDIA INT8 Acceleration**:
  - Turing+: 4x INT8 ops per clock vs FP32
  - H100: INT8 Tensor Core throughput is about 2x FP16 (the Transformer Engine itself targets FP8)

---

### **4. Software Ecosystem Enablement**

#### **4.1 Critical Libraries**
| Library | FP16 Features | INT8 Features |
|---------|--------------|---------------|
| PyTorch | AMP (`torch.autocast`), `GradScaler` | `torch.ao.quantization` |
| TensorFlow | `tf.keras.mixed_precision` | TFLite converter |
| bitsandbytes | - | 8-bit optimizers, LLM.int8() |

#### **4.2 Framework Integration**
```python
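from transformers import AutoModel, BitsAndBytesConfig, TrainingArguments

# HuggingFace Transformers FP16
training_args = TrainingArguments(output_dir="./results", fp16=True)

# INT8 Quantization (load_in_8bit goes inside the config, not alongside it)
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=6.0
    )
)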
```

---

### **5. Emerging Research Frontiers**

#### **5.1 FP8: The Next Generation**
- **H100 FP8 Formats**:
  - E4M3 (4 exp, 3 mantissa): More precision, maximum value ±448
  - E5M2 (5 exp, 2 mantissa): Wider range, 6.1e-5 (smallest normal) to 57344
- **Theoretical Speedup**:
  - 2x over FP16 in memory-bound ops
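
The range limits of these formats follow from their bit layouts. The sketch below works them out; the helper names are hypothetical, and the E4M3 rule (top exponent code reused for normal numbers, only the all-ones mantissa reserved for NaN) is stated here as an assumption about the H100 variant of the format.

```python
def ieee_style_max(exp_bits: int, man_bits: int) -> float:
    """Largest finite value when the top exponent code is reserved for inf/NaN."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp

def min_normal(exp_bits: int) -> float:
    """Smallest positive normal value for a given exponent width."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias)

fp16_max = ieee_style_max(5, 10)  # FP16: 5 exponent bits, 10 mantissa bits
e5m2_max = ieee_style_max(5, 2)   # E5M2 follows the same IEEE-style rule

# Assumed E4M3 rule: exponent code 15 is a normal exponent (2**8 with bias 7),
# and the largest usable mantissa is 1.75 because 1.875 is the NaN pattern.
e4m3_max = 1.75 * 2.0 ** 8

print("FP16 max:", fp16_max, "min normal:", min_normal(5))
print("E5M2 max:", e5m2_max, "min normal:", min_normal(5))
print("E4M3 max:", e4m3_max, "min normal:", min_normal(4))
```

E5M2 shares FP16's exponent width and therefore its range, while E4M3 trades range for one extra mantissa bit.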

## **Mixed Precision (FP16) Fine-Tuning: A Comprehensive Technical Guide**

---

### **1. Core Concepts of Mixed Precision Training**

#### **1.1 Precision Components**
| Component | Storage Format | Compute Format | Purpose |
|-----------|----------------|----------------|---------|
| **Weights** | FP32 (Master) | FP16 | Stable weight updates |
| **Activations** | FP16 | FP16 | Faster matrix math |
| **Gradients** | FP16 | FP16 | Memory efficiency |
| **Optimizer States** | FP32 | FP32 | Numerical stability |

#### **1.2 Key Mathematical Operations**
```python
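import torch

# Forward Pass (FP16 compute inside autocast)
with torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(inputs)  # FP16 compute
    loss = criterion(outputs, labels)

# Backward Pass (activation gradients in FP16)
loss.backward()

# Weight Update (FP32)
optimizer.step()  # FP32 master weights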
```

---

### **2. Hardware Acceleration Mechanics**

#### **2.1 NVIDIA Tensor Core Operation**
- **Volta Architecture (V100) and Later**:
  - Specialized 4x4x4 matrix multiply units
  - FP16 input → FP32 accumulation → FP16/FP32 output
  - Theoretical speedup:
    ```math
    \frac{125\ \text{TFLOPS (FP16)}}{15.7\ \text{TFLOPS (FP32)}} = 8\times
    ```

#### **2.2 Memory Bandwidth Optimization**
| Precision | Peak Bandwidth (A100 40GB) | Effective Element Throughput |
|-----------|----------------------------|------------------------------|
| FP32 | 1,555 GB/s | 100% |
| FP16 | 1,555 GB/s | 200% (2x elements per byte transferred) |

---

### **3. Critical Implementation Details**

#### **3.1 Loss Scaling Algorithm**
1. **Initial Scale**: Typically 2^16 (65,536)
2. **Dynamic Adjustment**:
   ```python
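   # Conceptual sketch of the logic inside scaler.update()
   if found_inf:  # inf/NaN gradients detected; the optimizer step was skipped
       scale, clean_steps = scale / 2, 0  # Halve if overflow
   elif clean_steps + 1 >= growth_interval:  # 2000 clean steps by default
       scale, clean_steps = scale * 2, 0  # Double if safe
   else:
       clean_steps += 1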
   ```
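
To watch the scale evolve, the rule can be simulated without PyTorch. The class below is a hypothetical stand-in, not `GradScaler` itself; its defaults (halve on overflow, double after 2000 clean steps) are chosen to mirror PyTorch's documented defaults, and the demo shortens the interval to 3 steps.

```python
class DynamicLossScale:
    """Halve the scale on overflow, double it after enough clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            self.scale *= 0.5      # back off after an overflow
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps == self.growth_interval:
                self.scale *= 2.0  # grow again after a run of clean steps
                self.clean_steps = 0

scaler = DynamicLossScale(growth_interval=3)  # short interval for the demo
history = []
for found_inf in [True, False, False, False, True]:
    scaler.update(found_inf)
    history.append(scaler.scale)

print(history)
```

An occasional halving is normal; a scale that keeps shrinking step after step usually points to a genuine numerical problem in the model.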

#### **3.2 Gradient Clipping**
```python
# FP16-friendly clipping
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```
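
The order matters because clipping compares the gradient norm with a fixed threshold, and scaled gradients are larger by the loss-scale factor. A plain-Python sketch of the idea follows; `clip_grad_norm` here is a hand-written stand-in for the PyTorch utility, and the gradient values are invented.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale gradient values so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    clip_coef = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * clip_coef for g in grads], total_norm

scale = 2.0 ** 16
scaled_grads = [3.0 * scale, 4.0 * scale]     # what backward() leaves after loss scaling
unscaled = [g / scale for g in scaled_grads]  # the job of scaler.unscale_()

clipped, norm_before = clip_grad_norm(unscaled, max_norm=1.0)
print(norm_before, clipped)
```

Clipping the scaled values directly would treat a true norm of 5 as if it were 5 × 65,536, shrinking every update far more than intended.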

#### **3.3 Batch Size Optimization**
| Model Size | FP32 BS | FP16 BS | Batch Size Gain |
|------------|---------|---------|-------------|
| 7B params | 8 | 32 | 4x |
| 13B params | 4 | 16 | 4x |
| 70B params | 1 | 4 | 4x |

---

### **4. Framework-Specific Implementations**

#### **4.1 PyTorch (AMP)**
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler(init_scale=2.**16)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

#### **4.2 TensorFlow**
```python
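import tensorflow as tf

policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)

# With Keras `model.fit`, this automatically handles:
# - Loss scaling
# - Master weights (variables stay in float32)
# - Casting between float16 compute and float32 variables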
```

#### **4.3 HuggingFace Transformers**
```python
training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,
    fp16_opt_level="O2",  # Apex optimization level; only used with the Apex backend
    gradient_accumulation_steps=4,
    gradient_checkpointing=True
)
```

---

### **5. Performance Benchmarks**

#### **5.1 Training Speed Comparison**
| Model | FP32 Epoch Time | FP16 Epoch Time | Speedup |
|-------|-----------------|-----------------|---------|
| BERT-base | 4.2 hrs | 1.8 hrs | 2.3x |
| GPT-2 Medium | 8.5 hrs | 3.1 hrs | 2.7x |
| ViT-Large | 6.3 hrs | 2.4 hrs | 2.6x |

#### **5.2 Memory Efficiency**
| Metric | FP32 | FP16 | Reduction |
|--------|------|------|-----------|
| Model Memory | 100% | 50% | 2x |
| Gradient Memory | 100% | 50% | 2x |
| Total Training Memory | 100% | 60% | 1.67x |

*(Includes optimizer state overhead)*

---

### **6. Advanced Optimization Techniques**

#### **6.1 Gradient Accumulation**
```python
# Simulates larger batches
for i, (inputs, labels) in enumerate(dataloader):
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels) / accumulation_steps
    
    scaler.scale(loss).backward()
    
    if (i+1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```
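
The division by `accumulation_steps` is what makes the accumulated gradient equal to the gradient of one large batch. The toy check below uses a one-parameter linear model with hand-made data; every name in it is illustrative.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5
accumulation_steps = 2
micro_batches = [data[0:2], data[2:4]]  # equally sized micro-batches

full_grad = mse_grad(w, data)  # gradient of one batch of 4

accumulated = 0.0
for batch in micro_batches:
    # Dividing by accumulation_steps mirrors `loss / accumulation_steps`
    accumulated += mse_grad(w, batch) / accumulation_steps

print(full_grad, accumulated)
```

The equality holds exactly only when the micro-batches have the same size; a smaller final micro-batch would be slightly over-weighted.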

#### **6.2 Flash Attention Integration**
```python
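# Requires CUDA 11.6+ and the flash-attn package
import torch.nn as nn
from flash_attn import flash_attn_func

class FlashAttentionWrapper(nn.Module):
    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, num_heads, head_dim) in FP16 or BF16
        return flash_attn_func(q, k, v)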
```

---

### **7. Troubleshooting Guide**

#### **7.1 Common Issues & Solutions**
| Symptom | Cause | Fix |
|---------|-------|-----|
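| NaN Loss | FP16 overflow (inf/NaN in activations or gradients) | Keep dynamic loss scaling enabled and lower the initial scale, or switch to BF16 |
| Training Divergence | Weight update instability | Add gradient clipping (1.0-5.0) |
| OOM Errors | Activations and optimizer state exceed GPU memory | Reduce batch size; add gradient accumulation or checkpointing |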

#### **7.2 Debugging Tools**
```python
# Check precision consistency
print(next(model.parameters()).dtype)  # Should show float32

# Monitor gradient scales
print(scaler.get_scale())  # Should fluctuate between 2^8-2^20
```

---

### **8. Real-World Case Study: LLaMA-2 Fine-Tuning**

#### **8.1 Configuration**
```yaml
model: llama-2-7b
precision: fp16
batch_size: 32  # Was 8 in FP32
optimizer: adamw_8bit
learning_rate: 2e-5
gradient_accumulation: 4
```

#### **8.2 Performance Gains**
| Metric | FP32 | FP16 | Improvement |
|--------|------|------|-------------|
| GPU Memory | 29GB | 14GB | 2.1x |
| Samples/sec | 42 | 118 | 2.8x |
| Convergence Steps | 50k | 48k | Comparable |

---

### **9. Emerging Trends**

#### **9.1 FP8 Training (H100)**
- **Formats**:
  - E4M3: Better precision
  - E5M2: Wider range
- **Early Results**:
  - 1.8x speedup over FP16

#### **9.2 Automatic Precision Selection**
```python
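# PyTorch has no single switch for automatic precision selection; autocast
# already chooses FP16 or FP32 per operation. For remaining FP32 matmuls,
# TF32 Tensor Core math can be enabled on Ampere and newer GPUs:
torch.set_float32_matmul_precision("high")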
```

---

### **Key Recommendations**
1. **Always use GradScaler** with FP16
2. **Validate convergence** against FP32 baseline
3. **Leverage hardware** (Tensor Cores/AMX)
4. **Monitor loss scale** for stability



## **8-bit Quantization & QLoRA: Advanced LLM Fine-Tuning Techniques**

---

### **1. 8-bit Quantization Fundamentals**

#### **1.1 Core Principles**
| Concept | Description | Mathematical Formulation |
|---------|-------------|--------------------------|
| **Linear Quantization** | Maps FP32 → INT8 via scaling | \( Q(x) = round\left(\frac{x}{s}\right) + z \) |
| **Scale Factor (s)** | Determines quantization step size | \( s = \frac{x_{max} - x_{min}}{2^8 - 1} \) |
| **Zero Point (z)** | Shifts quantization range | \( z = -round\left(\frac{x_{min}}{s}\right) \) |

#### **1.2 Key Algorithms**
| Method | Innovation | Accuracy Preservation |
|--------|------------|-----------------------|
| **LLM.int8()** | Outlier handling in FP16 | <1% drop |
| **SmoothQuant** | Activation/weight balancing | <0.5% drop |
| **GPTQ** | Layer-wise optimal quantization | 2-3% drop |

---

### **2. QLoRA Architecture Breakdown**

#### **2.1 Core Components**
| Component | Purpose | Savings |
|-----------|---------|---------|
| **4-bit Quantized Base** | Frozen pretrained weights | 8x vs FP32 (4x vs FP16) |
| **LoRA Adapters** | Trainable low-rank matrices | 0.1-1% of original |
| **Memory-Efficient Optimizers** | 8-bit Adam | 4x vs FP32 optim states |

#### **2.2 Mathematical Formulation**
For a weight matrix \( W \in \mathbb{R}^{d \times k} \):
```math
W' = dequantize(W_{4bit}) + \Delta W
```
Where the LoRA update is:
```math
\Delta W = BA \quad (B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k})
```
Typical rank \( r = 8-64 \).
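
The parameter saving follows from the shapes alone: the full matrix has `d * k` entries, while `B` and `A` together have `r * (d + k)`. The sketch below counts both for an assumed 4096 × 4096 projection and builds a tiny rank-1 update by hand; all names and sizes are chosen for illustration.

```python
def lora_param_counts(d: int, k: int, r: int):
    """Return (parameters in W, parameters in the LoRA factors B and A)."""
    full = d * k        # W is d x k
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora

full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

B = [[1.0], [0.0], [2.0]]  # d = 3, r = 1
A = [[0.5, -1.0]]          # r = 1, k = 2
delta_W = matmul(B, A)     # 3 x 2 update of rank 1
print(delta_W)
```

Every row of `delta_W` is a multiple of the single row of `A`, which is what "rank 1" means; larger `r` allows richer updates at proportionally higher cost.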

---

### **3. Hardware-Software Synergy**

#### **3.1 GPU Memory Breakdown (7B Model)**
| Component | FP32 | FP16 | 4-bit + LoRA |
|-----------|------|------|--------------|
| Model Weights | 28GB | 14GB | 3.5GB |
| Optimizer States | 56GB | 28GB | <1GB (adapters only) |
| Activations | 8GB | 4GB | 4GB |
| **Total** | 92GB | 46GB | **~8GB** |

#### **3.2 Supported Hardware**
| GPU | 4-bit Support | Performance |
|-----|---------------|-------------|
| RTX 3090 | Yes (24GB) | 12 tok/s |
| A100 40GB | Native | 45 tok/s |
| H100 80GB | FP8 Acceleration | 80 tok/s |

---

### **4. Implementation Guide**

#### **4.1 bitsandbytes Configuration**
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4
    bnb_4bit_use_double_quant=True,  # Second quantization
    bnb_4bit_compute_dtype=torch.bfloat16  # Compute dtype
)
```

#### **4.2 QLoRA with PEFT**
```python
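from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=8,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)  # Freeze base weights, stabilize k-bit training
model = get_peft_model(model, lora_config)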
```

---

### **5. Performance Benchmarks**

#### **5.1 Training Efficiency (7B Model)**
| Method | GPU | Batch Size | VRAM | Speed |
|--------|-----|------------|------|-------|
| FP32 | A100 | 8 | 80GB | 1x |
| FP16 | A100 | 16 | 40GB | 2.5x |
| QLoRA | RTX 3090 | 32 | 18GB | 3.2x |

#### **5.2 Accuracy Comparison (MMLU Benchmark)**
| Method | Accuracy | % of FP32 |
|--------|----------|-----------|
| FP32 | 68.2% | 100% |
| FP16 | 67.9% | 99.6% |
| QLoRA | 67.1% | 98.4% |

---

### **6. Advanced Optimization Techniques**

#### **6.1 Double Quantization**
```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # Quantizes quantization constants
    bnb_4bit_compute_dtype=torch.float16
)
```
*Saves roughly 0.37 bits per parameter, about 0.3GB for a 7B model*
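
The saving from double quantization can be worked out from the block sizes. The arithmetic below assumes the settings described in the QLoRA paper (64 weights per block with one FP32 constant each, then 8-bit constants grouped in blocks of 256 with one FP32 constant per group); the variable names are illustrative.

```python
BLOCK = 64    # weights per quantization block (assumed)
BLOCK2 = 256  # first-level constants per second-level block (assumed)

bits_single = 32 / BLOCK                          # one FP32 constant per block
bits_double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # 8-bit constants plus FP32 second-level constants
saved_bits = bits_single - bits_double            # overhead removed, in bits per parameter

def saved_gb(num_params: float) -> float:
    """Memory saved by double quantization, in GB (1 GB = 1e9 bytes)."""
    return num_params * saved_bits / 8 / 1e9

print(f"overhead: {bits_single} -> {bits_double:.3f} bits per parameter")
print(f"saved for a 7B model: {saved_gb(7e9):.2f} GB")
```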

#### **6.2 NF4 Quantization**
- **NormalFloat4**: Optimal for normal weight distributions
- Outperforms standard INT4 by 0.5-1% accuracy

---

### **7. Troubleshooting Guide**

#### **7.1 Common Issues**
| Problem | Solution |
|---------|----------|
| OOM during training | Enable gradient checkpointing |
| NaN losses | Reduce learning rate (try 1e-5) |
| Slow throughput | Set `bnb_4bit_compute_dtype=torch.float16` |

#### **7.2 Debugging Commands**
```python
# Check quantization status
print(model.quantization_config)

# Verify adapter injection (the method prints the counts itself and returns None)
model.print_trainable_parameters()
```

---

### **8. Real-World Deployment**

#### **8.1 Inference Optimization**
```python
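from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "my_qlora_model",
    device_map="auto",
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained("my_qlora_model")

# The model is already loaded and quantized, so no torch_dtype is passed here
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)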
```

#### **8.2 Serving Performance**
| Method | VRAM | Tokens/sec |
|--------|------|------------|
| FP16 | 14GB | 85 |
| 8-bit | 7GB | 78 |
| 4-bit QLoRA | 5GB | 65 |

---

### **9. Emerging Frontiers**

#### **9.1 1-bit LLMs (BitNet)**
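- **1.58-bit weights**: each weight is ternary (-1, 0, or +1), which is log2(3) ≈ 1.58 bits
- Reported to match FP16 baselines of the same size with much lower memory and energy use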

#### **9.2 Hybrid Precision**
```python
# bitsandbytes has no FP8 activation flag; the hybrid it supports today is
# 4-bit weight storage combined with 16-bit compute
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
```

---

### **Key Recommendations**
1. **Start with r=8** for LoRA rank
2. **Use NF4** for best 4-bit accuracy
3. **Enable double quantization** for memory savings
4. **Validate on 100 samples** before full training


## **4. Hands-on Exercise: Fine-Tune with 8-bit Quantization using QLoRA**

### **Objective**

Fine-tune a model like `facebook/opt-1.3b` using **8-bit quantization** and **LoRA adapters** on a **sample classification dataset**.

### **Tools Required**

* `transformers`
* `datasets`
* `peft`
* `bitsandbytes`
* `accelerate`

---

### **4.1 Install Dependencies**

```bash
pip install transformers datasets peft accelerate bitsandbytes
```

---

### **4.2 Load Quantized Model**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
```

---

### **4.3 Apply QLoRA with PEFT**

```python
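from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training

# Freeze the quantized base weights and prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, config)
model.print_trainable_parameters()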
```

---

### **4.4 Dataset Example**

Use a small dataset like SST2 or create a toy dataset of \~1000 samples.

```python
from datasets import load_dataset
dataset = load_dataset("glue", "sst2")
```

---

### **4.5 Training**

Use Hugging Face `Trainer` or native PyTorch + `accelerate`.

```python
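from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Tokenize the SST-2 sentences for causal LM training
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)
train_dataset = tokenized["train"].select(range(1000))  # ~1000 samples, as suggested above
eval_dataset = tokenized["validation"]

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,  # Mixed precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Pads each batch and copies input_ids to labels for next-token prediction
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()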
```

---

## **5. Summary Table**

| Feature   | FP16 (Mixed Precision)    | QLoRA (8-bit + LoRA)             |
| --------- | ------------------------- | -------------------------------- |
| Precision | 16-bit + 32-bit mix       | 8-bit + LoRA layers              |
| GPU Usage | Medium                    | Very Low                         |
| Speed     | Fast                      | Fast                             |
| Use Case  | Mid-sized models          | Large models on small GPUs       |
| Stability | Good                      | High (base weights frozen)       |
| Libraries | PyTorch AMP, Transformers | bitsandbytes, PEFT, Transformers |

---

## **6. Outcomes of the Session**

* Understand memory-efficient fine-tuning techniques.
* Use `fp16` and `load_in_8bit` flags appropriately.
* Apply QLoRA using Hugging Face + PEFT stack.
* Experiment with real model tuning on limited GPU.

---

## **7. Optional Discussion**

* Comparing QLoRA with full fine-tuning and LoRA alone.
* Impact on downstream task accuracy.
* Deployment implications of quantized models.


In [None]:
# Log in to the Hugging Face Hub to access gated models

!pip install -q huggingface_hub
from huggingface_hub import notebook_login
notebook_login()



---

### **Problem Statement**  
Large language models (LLMs) like GPT-2 require significant computational resources for full fine-tuning, which makes it impractical for users with limited hardware. This demo addresses the challenge of **efficiently adapting pre-trained LLMs to custom tasks** using **parameter-efficient fine-tuning (PEFT)** techniques like LoRA (Low-Rank Adaptation), without depending on quantization libraries such as `bitsandbytes` (8-bit quantization).

---

### **Objectives**  
1. **Demonstrate LoRA-based fine-tuning** of GPT-2 on a toy dataset without advanced quantization.  
2. **Minimize computational overhead** by freezing the base model and training only low-rank adapter layers.  
3. **Showcase a lightweight workflow** for causal language modeling (text generation) using Hugging Face tools (`transformers`, `peft`, `datasets`).  
4. **Provide a reproducible template** for small-scale PEFT experiments.  

---

### **Expected Outcomes**  
1. **A fine-tuned GPT-2 model** with LoRA adapters trained on the toy text corpus.  
2. **Reduced GPU memory usage** compared to full fine-tuning (achieved via LoRA's low-rank matrices).  
3. **Logs and checkpoints** saved in `./results` for evaluation.  
4. **Validation of the approach** through:  
   - Successful training/evaluation loop execution.  
   - Preservation of the base model's capabilities while adapting to the demo dataset.  

---

### **Key Components Validated**  
- **LoRA Configuration**: Correct targeting of attention layers (`c_attn`) with rank `r=8`.  
- **Tokenization**: Proper handling of padding/truncation for causal LM.  
- **Training Efficiency**: Mixed-precision (`fp16`) training with small batch sizes.  

This demo serves as a foundation for scaling to larger datasets/tasks while maintaining hardware efficiency.

In [None]:
!pip install -U datasets -q

In [None]:
# Simple Dataset for Fine-Tuning Demo (No BitsAndBytes)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import Dataset

# === Create a toy dataset ===
data = {
    "text": [
        "The cat sat on the mat.",
        "The quick brown fox jumps over the lazy dog.",
        "AI is transforming the world.",
        "Python is a great programming language.",
        "Transformers are state-of-the-art NLP models."
    ]
}
dataset = Dataset.from_dict(data)
dataset = dataset.train_test_split(test_size=0.2)

# === Load tokenizer and base model ===
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add pad token if missing
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=64)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# === Load model without BitsAndBytes ===
model = AutoModelForCausalLM.from_pretrained(model_name)

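# === Prepare for training with LoRA ===
# prepare_model_for_kbit_training is designed for quantized models; on this
# full-precision GPT-2 it only freezes the base weights
model = prepare_model_for_kbit_training(model)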

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# === Training arguments ===
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    logging_dir="./logs",
    save_total_limit=1,
    fp16=True,
    report_to=[]
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
)

# === Train ===
trainer.train()



No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.




TrainOutput(global_step=2, training_loss=3.657654285430908, metrics={'train_runtime': 2.2554, 'train_samples_per_second': 1.774, 'train_steps_per_second': 0.887, 'total_flos': 131099000832.0, 'train_loss': 3.657654285430908, 'epoch': 1.0})

# Additional Reference content ends here

# Hands-on content starts here


---

## **Case Study Overview: Fine-Tuning GPT-2 with LoRA on a Toy Dataset**

### **Problem Statement**

Pretrained language models such as GPT-2 are general-purpose and trained on large corpora. However, they may not perform well on domain-specific or task-specific prompts out-of-the-box. This case study aims to fine-tune a lightweight GPT-2 model on a small, custom dataset using **parameter-efficient techniques** (LoRA) to adapt the model for better performance on specific sentence completions, while minimizing computational and memory overhead.

---

### **Objectives**

1. To demonstrate the concept and practical use of **parameter-efficient fine-tuning (PEFT)** using **LoRA (Low-Rank Adaptation)**.
2. To apply **mixed precision training (fp16)** for performance optimization.
3. To illustrate how even a **tiny dataset** can be used to observe measurable changes in model output through fine-tuning.
4. To evaluate and compare **model predictions before and after fine-tuning** to assess the effectiveness of the LoRA adapter.
5. To prepare the foundation for more advanced use cases like QLoRA (a 4-bit quantized base model plus LoRA) by understanding LoRA basics.

---

### **Expected Outcomes**

* The model will produce more **contextually adapted and relevant completions** after fine-tuning.
* Memory and training time will be reduced significantly compared to full model fine-tuning.
* Participants will learn how to inject and control LoRA adapters using the `peft` library.
* A clearer understanding of how small, focused datasets can shift the behavior of a base model.

---

### **Step-by-Step Flow and Improvements**

| **Step**                                       | **Description**                                                                              | **Improvement Introduced**                                           |
| ---------------------------------------------- | -------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| **1. Setup & Imports**                         | Install and import necessary libraries including `peft`, `bitsandbytes`, and `transformers`. | Environment prepared for low-resource fine-tuning.                   |
| **2. Toy Dataset Creation**                    | A tiny dataset of 5 sample sentences is created.                                             | Helps simulate fine-tuning on low-data scenarios.                    |
| **3. Tokenization**                            | Texts are tokenized using GPT-2’s tokenizer. Padding is set appropriately.                   | Ensures consistent input length, preparing for training.             |
| **4. Model Loading**                           | GPT-2 is loaded as the base model.                                                           | Uses a widely available and light model for demonstration.           |
| **5. LoRA Adapter Injection**                  | LoRA adapters are injected into the `c_attn` layers using PEFT.                              | Reduces the number of trainable parameters.                          |
| **6. Baseline Evaluation**                     | Generate predictions before training.                                                        | Establishes baseline behavior of the pre-trained model.              |
| **7. Fine-Tuning with Mixed Precision (fp16)** | The model is trained using LoRA layers for 3 epochs on 3 samples.                            | Demonstrates efficient fine-tuning using minimal GPU memory.         |
| **8. Post-Finetune Evaluation**                | Generate outputs again after fine-tuning.                                                    | Helps visually assess performance gains through adapted completions. |

---

### **Key Takeaways to Discuss**

* Why LoRA: Freezes original model weights and only trains small adapter modules.
* Why `c_attn`: GPT-2 has a combined QKV attention layer (`c_attn`) where LoRA is most effective.
* Why fp16: Reduces memory usage and training time while retaining performance.
* Limitations of the demo: Tiny dataset is used for illustration and not for production-level improvements.
* Next steps: Move to QLoRA with `bnb_4bit` and larger, more diverse datasets to see greater gains.
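
How small the adapter is can be estimated by hand. The sketch below assumes the standard GPT-2 (small) dimensions (12 layers, hidden size 768, about 124.4M parameters) and the demo's rank of 8 on `c_attn`; the variable names are illustrative.

```python
n_layer, n_embd, r = 12, 768, 8                 # GPT-2 (small) dimensions and the demo's rank
in_features, out_features = n_embd, 3 * n_embd  # c_attn produces Q, K and V in one projection

per_layer = r * (in_features + out_features)    # A is r x in_features, B is out_features x r
trainable = n_layer * per_layer
base_params = 124_439_808                       # GPT-2 (small) parameter count (assumed)
total = base_params + trainable

print(f"trainable: {trainable:,} of {total:,} ({trainable / total:.4%})")
```

Roughly a quarter of a percent of the parameters receive gradients, which is why the optimizer state and gradient memory shrink so sharply compared with full fine-tuning.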

---



In [None]:
# Install dependencies
!pip install -q transformers datasets peft accelerate bitsandbytes

| Package          | Purpose                                                                                                                                                       |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **transformers** | From Hugging Face. Provides pre-trained models (like GPT-2, BERT) and tools for fine-tuning and inference.                                                    |
| **datasets**     | Also from Hugging Face. Allows loading and managing NLP datasets easily.                                                                                      |
| **peft**         | Short for *Parameter-Efficient Fine-Tuning*. Used for techniques like **LoRA** (Low-Rank Adaptation), which fine-tune models with fewer trainable parameters. |
| **accelerate**   | Handles device placement so the same training code can run on CPU, GPU, or multi-GPU setups. Simplifies distributed and mixed-precision training.             |
| **bitsandbytes** | Enables **8-bit and 4-bit quantized** training of large models using minimal memory. Used for memory-efficient fine-tuning.                                   |


### `from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model`

| Function                          | Purpose                                                                                                         |
| --------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| `prepare_model_for_kbit_training` | Prepares a model that was already loaded in 8-bit or 4-bit for training (e.g., freezes base weights, enables gradient checkpointing). It does not quantize the model by itself. |
| `LoraConfig`                      | Configuration object to define how LoRA (Low-Rank Adaptation) should be applied to specific layers.             |
| `get_peft_model`                  | Takes a base model and applies LoRA transformation as per `LoraConfig`. Returns a LoRA-wrapped trainable model. |


These functions come from the `peft` library and enable parameter-efficient fine-tuning, which saves memory and compute. They are particularly useful when training large models on limited hardware.
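
As a rough illustration of what a LoRA-wrapped layer computes, the sketch below implements the standard LoRA formulation `h = W x + (alpha / r) * B (A x)` in plain Python on tiny matrices. The dimensions and values are made up for the example, and `matvec` and `lora_forward` are hypothetical helpers. The real `peft` layers also apply dropout and run on tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output W @ x plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d_in, d_out, r, alpha = 6, 4, 2, 4

# Illustrative values only; real LoRA initializes A randomly and B with zeros.
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen base weight
A = [[0.1] * d_in, [-0.1] * d_in]                                 # trainable, shape (r, d_in)
B = [[0.0] * r for _ in range(d_out)]                             # trainable, shape (d_out, r)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

base_out = matvec(W, x)
out_at_init = lora_forward(W, A, B, x, alpha, r)  # B is all zeros, so nothing changes yet

B[0][0] = 0.5                                     # stand-in for a gradient update to B
out_after_update = lora_forward(W, A, B, x, alpha, r)

print("unchanged at init:", out_at_init == base_out)
print("changed after update:", out_after_update != base_out)
```

Because `B` starts at zero, the wrapped model initially behaves exactly like the pre-trained one. Only `A` and `B` would receive gradients, while `W` stays frozen.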



In [None]:
#  Imports
import torch
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import Dataset

# Toy dataset
data = {
    "text": [
        "The cat sat on the mat.",
        "The quick brown fox jumps over the lazy dog.",
        "AI is transforming the world.",
        "Python is a great programming language.",
        "Transformers are state-of-the-art NLP models."
    ]
}
dataset = Dataset.from_dict(data).train_test_split(test_size=0.4) # splits this dataset into 60% training and 40% testing.
# In this case, out of 5 examples, 3 will go into the training set, and 2 into the test set (randomly).

# Load tokenizer and base model (no quantization)
model_name = "gpt2" # "gpt2" is a small autoregressive transformer model trained for text generation (causal language modeling).
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 defines no pad_token by default because it was not trained with padding.
# Batched training and evaluation need all sequences to have the same length, so
# the end-of-sequence token (<|endoftext|> in GPT-2) is reused as the padding token.
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=64)

tokenized_datasets = dataset.map(tokenize_function, batched=True) # batched=True: It passes a batch (i.e., a list of examples) at once, making it more efficient.

# Load model and prepare for LoRA
model = AutoModelForCausalLM.from_pretrained(model_name)

# prepare_model_for_kbit_training is intended for models that were loaded in 8-bit or 4-bit.
# The model above is loaded in full precision, so nothing is quantized here; on this model the
# call mainly freezes the base weights before the LoRA adapters are added.
# To reduce memory through quantization, the model would first have to be loaded with a quantization config.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, # LoRA rank: the inner dimension of the low-rank update matrices. Lower values mean fewer additional parameters.
    lora_alpha=16, # Scaling factor: the LoRA update is multiplied by lora_alpha / r (here 16 / 8 = 2). Often set to 2 × r.
    target_modules=["c_attn"],  # Which modules receive LoRA adapters. In GPT-2, c_attn is the fused query/key/value projection.
    lora_dropout=0.1, # Dropout applied to LoRA layers during training for regularization.
    bias="none", # Controls which bias parameters are trained. "none" keeps all biases frozen.
    task_type="CAUSAL_LM", # Specifies the type of task (causal language modeling) to help LoRA integrate correctly with training logic.
)
# Why this matters: LoRA modifies only a few key parts of the model using small trainable matrices — this makes it
# lightweight and fast to fine-tune.

model = get_peft_model(model, lora_config) # This applies the LoRA configuration to the base model.
# The model is now a parameter-efficient fine-tuning (PEFT) model using LoRA, capable of learning new tasks with a
# small number of additional parameters — much faster and more memory-efficient than traditional fine-tuning.

# Sample prompts to compare before/after
test_prompts = [
    "AI will replace many jobs in the future, but",
    "Python is widely used for",
]

def generate_output(model, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=20)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print("🔍 Predictions BEFORE fine-tuning:\n")
for prompt in test_prompts:
    print(f"Prompt: {prompt}")
    print("Output:", generate_output(model, prompt), "\n")

# Fine-tune model
training_args = TrainingArguments(
    output_dir="./results", # Folder to save model checkpoints and logs.
    per_device_train_batch_size=2, # Two samples will be used per training step on each GPU/CPU.
    per_device_eval_batch_size=2,
    num_train_epochs=3, # The entire dataset will be shown to the model 3 times.
    logging_dir="./logs",
    save_total_limit=1,
    fp16=True, # Enables 16-bit mixed-precision training for faster computation and lower memory usage (requires a GPU).
    report_to=[], # Prevents external logging (like Weights & Biases)
    logging_steps=10 # Log training metrics every 10 steps.
)

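# The collator batches examples and builds the labels from input_ids, setting padded positions to -100 so the loss ignores them.
# The inputs were already padded to max_length=64 during tokenization, so no dynamic padding happens here.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) # mlm=False selects causal language modeling (GPT-style) instead of masked language modeling (BERT-style).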

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
)
trainer.train()

# Generate again after fine-tuning
print("🔍 Predictions AFTER fine-tuning:\n")
for prompt in test_prompts:
    print(f"Prompt: {prompt}")
    print("Output:", generate_output(model, prompt), "\n")


Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


🔍 Predictions BEFORE fine-tuning:

Prompt: AI will replace many jobs in the future, but


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output: AI will replace many jobs in the future, but it will not replace the jobs that are already there.

The government has already announced that it 

Prompt: Python is widely used for


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Output: Python is widely used for the development of web applications. It is a very powerful and flexible programming language. It is also a 



Step,Training Loss


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


🔍 Predictions AFTER fine-tuning:

Prompt: AI will replace many jobs in the future, but


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output: AI will replace many jobs in the future, but it will not replace the jobs that are already there.

The government will not replace the jobs 

Prompt: Python is widely used for
Output: Python is widely used for the development of the web. It is also used for the development of the web application.

 




---

## **Case Study 2: LoRA Fine-Tuning on a Thematic, Multi-Domain Synthetic Dataset**

---

### **Problem Statement**

Pretrained language models like GPT-2 often exhibit generic and inconsistent outputs when prompted with sentences across various domains (e.g., health, sports, technology, etc.). This case study aims to fine-tune GPT-2 using a more **diverse**, **theme-rich**, and **semi-synthetic dataset** to improve the model’s ability to produce **coherent and domain-aligned completions**. The goal is to test how a **larger, structured, and varied dataset** enhances fine-tuning effectiveness when combined with **LoRA-based parameter-efficient methods**.

---

### **Objectives**

1. To fine-tune GPT-2 on a custom dataset of 120 examples (15 base sentences, 8 labeled variants each) covering 5 themes: **technology, health, education, environment, and sports**.
2. To assess the impact of **input diversity and scale** on model adaptation quality.
3. To reinforce **parameter-efficient fine-tuning (LoRA)** for **low-memory setups** using `c_attn` injection.
4. To evaluate pre- and post-training outputs with meaningful prompts representing each domain.
5. To enable scalable hands-on practice with **fp16 mixed precision training** for performance.
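
The fp16 objective is easier to reason about with the numeric format in view. The sketch below uses Python's `struct` module (the `e` format is IEEE 754 half precision) to show that an fp16 value takes half the bytes of an fp32 value, and that a small update can vanish when added to a weight stored in fp16. The specific numbers are illustrative, and `to_fp16` is a hypothetical helper.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

bytes_fp16 = struct.calcsize("<e")
bytes_fp32 = struct.calcsize("<f")

weight = 1.0
small_update = 1e-4

# Adding the update and storing the result in fp16 rounds it away entirely.
fp16_result = to_fp16(to_fp16(weight) + to_fp16(small_update))
# Keeping the sum in higher precision preserves the update.
higher_precision_result = weight + small_update

print(f"bytes per value: fp16={bytes_fp16}, fp32={bytes_fp32}")
print(f"fp16 result:             {fp16_result}")
print(f"higher-precision result: {higher_precision_result}")
```

This rounding behavior is one reason mixed-precision schemes typically run the expensive matrix multiplications in fp16 while keeping trainable weights and their updates in higher precision.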

---

### **Expected Outcomes**

* Fine-tuned GPT-2 will **retain the fluency** of the base model while **exhibiting theme-specific understanding** in completions.
* Model output will shift from **generic or repetitive text** to **semantically aligned, context-aware predictions**.
* Learners will understand the **value of dataset quality and diversity**, even in small-scale fine-tuning.
* LoRA tuning will allow fast, low-resource training with significantly **reduced memory overhead**.

---

### **Step-by-Step Summary and Enhancements**

| Step                  | Description                                              | Enhancement over Previous Code                                                                 |
| --------------------- | -------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| **Data Generation**   | 120 sentences: 15 base sentences, each repeated as 8 labeled variants. | **Increased scale** and **thematic richness** versus 5 generic sentences in the previous case. |
| **Tokenization**      | Texts tokenized with consistent padding and max length.  | Same as before, but handles more varied and longer expressions.                                |
| **Model Preparation** | GPT-2 with LoRA on `c_attn` using `peft`.                | Same method, but used on a **richer dataset**, improving LoRA learning.                        |
| **Before Evaluation** | Prompt completions generated before training.            | Prompts are **domain-aligned**, providing **better contrast post-tuning**.                     |
| **Training Phase**    | 3 epochs with fp16 and a batch size of 4 on 96 training examples. | **Larger and more meaningful training set**, increasing generalization.                        |
| **After Evaluation**  | Outputs reviewed after fine-tuning.                      | Likely to show **clearer improvements** due to dataset design.                                 |
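
The numbers in the table can be checked with a little arithmetic. The sketch below derives the dataset size, the 80/20 split, and the number of optimizer steps for this configuration. It also compares that step count with a logging interval of 500, which is assumed here to be the `Trainer` default when `logging_steps` is not set. `total_optimizer_steps` is a hypothetical helper that assumes a single device and no gradient accumulation.

```python
import math

def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps on a single device with no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

# Dataset built by the expansion loop: themes x base sentences x variants.
num_themes, sentences_per_theme, variants = 5, 3, 8
total_examples = num_themes * sentences_per_theme * variants
test_examples = total_examples * 20 // 100   # 20% held out
train_examples = total_examples - test_examples

steps = total_optimizer_steps(train_examples, batch_size=4, epochs=3)
ASSUMED_DEFAULT_LOGGING_STEPS = 500  # assumption: default logging interval

print(f"{total_examples} examples -> {train_examples} train / {test_examples} test")
print(f"{steps} optimizer steps in total")
print(f"training loss would be logged: {steps >= ASSUMED_DEFAULT_LOGGING_STEPS}")
```

If that assumption holds, it would explain why the `Step,Training Loss` table in the cell output has no rows: training ends before the first logging step is reached. Passing a smaller `logging_steps` value would make the loss curve visible.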

---

### **How This Is Better Than the Previous Version**

| Aspect                  | Previous Code                | Current Code                               | Advantage                                            |
| ----------------------- | ---------------------------- | ------------------------------------------ | ---------------------------------------------------- |
| **Dataset Size**        | 5 sentences                  | 120 examples (15 base sentences × 8 variants) | Larger training base enhances learning capacity   |
| **Dataset Diversity**   | Narrow, generic content      | 5 rich domains with variants               | Better generalization and topic-specific completions |
| **Prompt Evaluation**   | 2 simple prompts             | 5 thematic prompts                         | Better testing of model adaptability                 |
| **Instructional Value** | Demonstration of PEFT basics | Introduction to dataset engineering + PEFT | More aligned with real-world applications            |
| **Educational Scope**   | LoRA fine-tuning             | LoRA + synthetic data + domain alignment   | Covers multiple learning objectives in one go        |

---

### **Suggested Additions for Learners**

* Visualize token distribution per domain.
* Compare perplexity before and after training (if possible).
* Analyze output diversity across domains.
* Introduce minor label supervision later for conditional generation.
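
For the perplexity comparison suggested above, a minimal sketch follows. Perplexity is the exponential of the mean per-token cross-entropy loss, so it can be derived from a loss value. The two loss values below are hypothetical and only illustrate the comparison, and `perplexity` is an illustrative helper.

```python
import math

def perplexity(mean_token_loss: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_token_loss)

# Hypothetical loss values, chosen only to illustrate the comparison.
loss_before = 5.2
loss_after = 4.9

print(f"perplexity before: {perplexity(loss_before):.1f}")
print(f"perplexity after:  {perplexity(loss_after):.1f}")
```

If `trainer.evaluate()` reports an `eval_loss` in your setup, that value could be passed to the helper before and after training. Lower perplexity on held-out text means the model assigns it higher probability.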

---



In [None]:
# Install Required Libraries
!pip install -q datasets transformers accelerate peft bitsandbytes

In [None]:
# Imports
import torch
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import Dataset
import random

# Generate a larger, multi-theme dataset (120 samples: 5 themes × 3 base sentences × 8 variants)
themes = {
    "technology": [
        "AI is transforming industries worldwide.",
        "Machine learning models improve over time.",
        "Quantum computing could revolutionize encryption."
    ],
    "health": [
        "Yoga has many health benefits.",
        "Mental health is as important as physical health.",
        "A balanced diet boosts immunity."
    ],
    "education": [
        "Online learning is becoming popular.",
        "Critical thinking should be taught in schools.",
        "Education should adapt to modern needs."
    ],
    "environment": [
        "Climate change affects every nation.",
        "Electric vehicles reduce carbon emissions.",
        "Sustainable farming supports biodiversity."
    ],
    "sports": [
        "Football unites people around the world.",
        "Athletes train for years to reach their peak.",
        "Olympic games are held every four years."
    ]
}

# Expand dataset
texts = []
for theme, base_sentences in themes.items():
    for sentence in base_sentences:
        for i in range(8):
            texts.append(f"{sentence} ({theme}, variant {i+1})")

random.shuffle(texts)
data = {"text": texts}
dataset = Dataset.from_dict(data).train_test_split(test_size=0.2)

# Load Tokenizer and Model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=64)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Load Model + Prepare for LoRA
base_model = AutoModelForCausalLM.from_pretrained(model_name)
model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Helper Function to Predict
def predict_completion(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Evaluate BEFORE Fine-Tuning
prompts = [
    "AI is revolutionizing",
    "Climate change will",
    "Education in the future must",
    "Football players often",
    "Mental health should"
]
print("\n🔍 Predictions BEFORE fine-tuning:\n")
for p in prompts:
    print(f"Prompt: {p}\nOutput: {predict_completion(p)}\n")

# Train the Model
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    logging_dir="./logs",
    save_total_limit=1,
    report_to=[],
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

# Evaluate AFTER Fine-Tuning
print("\n🔍 Predictions AFTER fine-tuning:\n")
for p in prompts:
    print(f"Prompt: {p}\nOutput: {predict_completion(p)}\n")


Map:   0%|          | 0/96 [00:00<?, ? examples/s]

Map:   0%|          | 0/24 [00:00<?, ? examples/s]




🔍 Predictions BEFORE fine-tuning:

Prompt: AI is revolutionizing
Output: AI is revolutionizing the way we think about the world.

The world is changing.

The world is changing.

The world is changing.


Prompt: Climate change will
Output: Climate change will be a major driver of the global economy, and the world's population will grow by about 1.5 billion people by 2050.

The world

Prompt: Education in the future must
Output: Education in the future must be based on the principles of equality, justice, and equality of opportunity.

The United States is a nation of immigrants, and we must not

Prompt: Football players often
Output: Football players often have to be in the same position as the other players.

"I think it's a good thing for the players to be able to play



No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Prompt: Mental health should
Output: Mental health should be a priority for all of us.

"We need to be able to provide the services that we need to be able to provide to our



Step,Training Loss



🔍 Predictions AFTER fine-tuning:

Prompt: AI is revolutionizing
Output: AI is revolutionizing the way we think about the world.
The world is changing.

We are changing.
We are changing.


We are

Prompt: Climate change will
Output: Climate change will be a major factor in the global warming problem.

The global warming trend is a major factor in the global warming problem.


The

Prompt: Education in the future must
Output: Education in the future must be a matter of the utmost importance.

"The fact that the government has not taken any action to make the necessary changes to the system of

Prompt: Football players often
Output: Football players often have to be able to play in the same way.

"I think it's a good thing for the game," said the former NFL player

Prompt: Mental health should
Output: Mental health should be a priority for the health of the people of the country of the country of the country of the country of the country of the country of the country




---

## **Problem Statement**

Large Language Models (LLMs) like `distilgpt2` often lack nuanced emotional understanding when generating completions for emotionally charged prompts (e.g., sadness, joy, anxiety). This case study explores how **parameter-efficient fine-tuning** using **LoRA** and a small curated **emotion dataset** can improve the model's ability to reflect more **emotionally sensitive and contextually aligned responses**.

The aim is to determine whether fine-tuning even with a **limited number of emotion-labeled examples (200)** can **improve model coherence and emotional alignment** using **LoRA with 8-bit quantization** on a compact model (`distilgpt2`).

---

## **Objectives**

1. Use a subset (200 training + 40 testing) of the Hugging Face `"emotion"` dataset to fine-tune a small language model (`distilgpt2`) using LoRA.
2. Apply 8-bit quantization for memory-efficient fine-tuning with reduced computational overhead.
3. Compare model behavior before and after fine-tuning using emotionally themed prompts.
4. Showcase the benefits of LoRA for adapting LLMs to niche datasets on limited hardware.
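
To give some intuition for what 8-bit quantization does to weights, the sketch below applies simple absmax quantization: every float is scaled into the integer range [-127, 127] with one shared scale, then mapped back. This is a deliberately simplified illustration, not the scheme `bitsandbytes` implements, and the function names and weight values are made up for the example.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.05, 0.9, -0.33]  # illustrative values
q_weights, scale = quantize_absmax_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q_weights)
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

Each weight is then stored in one byte instead of four, at the cost of a rounding error of at most half the scale. LoRA adapters trained on top of a quantized base model are usually kept in higher precision.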

---

## **Expected Outcomes**

* **Before fine-tuning**: The model produces generic or irrelevant completions lacking emotional awareness.
* **After fine-tuning**: The model generates completions that are **more emotionally appropriate**, **contextually richer**, and **better aligned** with the prompt tone (e.g., happiness, fear, sadness).
* Demonstrates that even a small emotional dataset can fine-tune a compact model effectively when LoRA and quantization are applied correctly.

---

## **Comparison with the Previous Version**

| Aspect                    | Previous Basic Version         | Current Version                                   |
| ------------------------- | ------------------------------ | ------------------------------------------------- |
| **Dataset Type**          | Toy sentences manually entered | Real-world labeled emotional data (`emotion`)     |
| **Dataset Size**          | 5–10 examples                  | 240 (200 train, 40 test)                          |
| **Domain Variety**        | Limited                        | Rich emotional variety (e.g., joy, sadness, fear) |
| **Model Type**            | GPT-2 / distilgpt2             | distilgpt2                                        |
| **Quantization**          | No / optional                  | Not applied as written: the model is loaded in full precision, then passed through `prepare_model_for_kbit_training` |
| **Fine-Tuning Technique** | LoRA                           | LoRA with quantization                            |
| **Evaluation**            | Generic prompts                | Emotion-driven prompts aligned with dataset       |
| **Expected Result**       | Basic language completions     | Emotion-sensitive completions                     |

---

## **Improvements in This Code**

1. **Use of a Public Emotional Dataset**: Instead of synthetic or toy data, this version uses a real dataset (`emotion` from HF) which includes examples with strong emotional signals.
2. **Better Prompt Alignment**: Prompts like “I’m feeling very low today” or “I’m afraid of what happens next” directly relate to the emotion categories in the dataset, enabling better fine-tuning and evaluation.
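3. **LoRA with k-bit Preparation**: The model is passed through `prepare_model_for_kbit_training` and LoRA is injected into `"c_attn"`. Because the model is loaded without a quantization config, the weights stay in full precision; the memory savings here come from LoRA training only a small set of parameters.
4. **Fine-Tuning Duration**: Trained for 10 epochs (500 optimizer steps at batch size 4 on 200 examples), which gives a small dataset enough passes for changes in the output to become visible.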
5. **Compact Model Size**: Uses `distilgpt2`, which is smaller and trains quickly in Colab with low memory.

---

## Summary

This code is a **step-up from toy data fine-tuning**, showing how **LoRA + quantization** enables meaningful adaptation of a pretrained model on a **real-world emotion classification dataset**, even under **low-resource conditions**. The before-and-after prediction structure gives clear insight into the effectiveness of the fine-tuning process.



In [None]:
# Install Required Packages
!pip install -q datasets transformers accelerate peft bitsandbytes

In [None]:
# Imports
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load & Prepare Dataset
# https://huggingface.co/datasets/dair-ai/emotion
dataset = load_dataset("emotion", split="train")
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.rename_column("text", "prompt")  # rename for clarity

# Use only a small portion to keep it fast
dataset["train"] = dataset["train"].select(range(200))
dataset["test"] = dataset["test"].select(range(40))

# Tokenizer & Tokenization
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(ex):
    return tokenizer(ex["prompt"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize_fn, batched=True)

# Load Model and Prepare for LoRA (no quantization config is passed, so the weights stay in full precision)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Inference Helper
def infer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Predictions Before Fine-Tuning
test_prompts = [
    "I just got a promotion",
    "I'm feeling very low today",
    "The weather makes me feel",
    "She smiled as she opened the gift",
    "I'm afraid of what happens next"
]
print("\n🔍 Predictions BEFORE fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")

# Train Model
training_args = TrainingArguments(
    output_dir="./emotion-quant",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    logging_dir="./logs",
    report_to=[],
    fp16=True,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

# Predictions After Fine-Tuning
print("\n🔍 Predictions AFTER fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]




🔍 Predictions BEFORE fine-tuning:

Prompt: I just got a promotion
Output: I just got a promotion to be a part of the team. I'm not going to be a part of the team. I'm going to be a part of the team

Prompt: I'm feeling very low today
Output: I'm feeling very low today. I'm feeling very low today. I'm feeling very low today. I'm feeling very low today. I'm feeling very low today. I

Prompt: The weather makes me feel
Output: The weather makes me feel like I’m in a bad mood.”


I’m not sure if I’m going to be able

Prompt: She smiled as she opened the gift
Output: She smiled as she opened the gift.

"I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I



No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Prompt: I'm afraid of what happens next
Output: I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what



Step,Training Loss
500,4.955



🔍 Predictions AFTER fine-tuning:

Prompt: I just got a promotion
Output: I just got a promotion to be a part of the community, and I'm not going to be a part of it. I'm just going to be a part of it

Prompt: I'm feeling very low today
Output: I'm feeling very low today. I'm feeling very tired today. I'm feeling very tired today. I'm feeling very tired today. I'm feeling very tired today. I

Prompt: The weather makes me feel
Output: The weather makes me feel better.

I am a very good person. I am a very good person. I am a very good person. I am a very good

Prompt: She smiled as she opened the gift
Output: She smiled as she opened the gift box and opened it and she was happy to see it.



















Prompt: I'm afraid of what happens next
Output: I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what




---

## Analysis: Was Fine-Tuning Effective?

### BEFORE Fine-Tuning Observations

| Prompt                            | Output (Before Fine-Tuning)                               |
| --------------------------------- | --------------------------------------------------------- |
| I just got a promotion            | Repetitive: "I'm going to be a part of the team..."       |
| I'm feeling very low today        | Repetitive loop: "I'm feeling very low today..." repeated |
| The weather makes me feel         | Plausible opening ("like I'm in a bad mood"), then drifts and is cut off at the token limit |
| She smiled as she opened the gift | Repetitive fallback: "I'm sorry, but..."                  |
| I'm afraid of what happens next   | Looping and repetition of the same phrase                 |

**Summary**:
The pre-trained model without fine-tuning:

* Shows repetition and fallback loops.
* Lacks semantic alignment with the prompts.
* Struggles to generate emotionally coherent or complete responses.

---

### AFTER Fine-Tuning Observations

| Prompt                            | Output (After Fine-Tuning)                                    |
| --------------------------------- | ------------------------------------------------------------- |
| I just got a promotion            | Repetition slightly reduced; mentions "part of the community" |
| I'm feeling very low today        | New variation, still looping: "I'm feeling very tired today"  |
| The weather makes me feel         | Better opening ("The weather makes me feel better."), then loops on "I am a very good person" |
| She smiled as she opened the gift | Meaningful: "opened the gift box... she was happy to see it"  |
| I'm afraid of what happens next   | Unchanged: same repetition loop as before fine-tuning         |

**Summary**:
The fine-tuned model:

* Generates more semantically appropriate responses for some prompts, most clearly the gift prompt.
* Shows some contextual grounding for emotional or descriptive content.
* Shows slight improvements in phrasing diversity, although repetition loops remain in most completions.
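
The repetition discussed in both tables can be quantified with a simple distinct-n score: the fraction of word n-grams in a completion that are unique. The sketch below applies it to two completions copied from the outputs above, with whitespace normalized. `distinct_n` is an illustrative helper, and a score like this is only a rough proxy for quality.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Fraction of word n-grams that are unique; lower means more repetition."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Completion for "I'm feeling very low today" before fine-tuning.
looping = "I'm feeling very low today. " * 5 + "I"
# Completion for "She smiled as she opened the gift" after fine-tuning.
varied = ("She smiled as she opened the gift box and opened it "
          "and she was happy to see it.")

print(f"distinct-2 (looping): {distinct_n(looping):.2f}")
print(f"distinct-2 (varied):  {distinct_n(varied):.2f}")
```

Computing this score for every prompt before and after training would turn the qualitative impression of "less repetition" into a number that can be compared across runs.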

---

## Conclusion: Did Fine-Tuning Help?

| Evaluation Aspect                | Assessment                                                   |
| -------------------------------- | ------------------------------------------------------------ |
| Elimination of repetition        | Partially improved                                           |
| Emotional and semantic awareness | Improved noticeably                                          |
| Handling of diverse prompts      | More structured and varied responses                         |
| Realism in completions           | Greater realism, especially in positive or narrative prompts |
| Overall improvement              | Modest qualitative improvement; no quantitative metric was computed |

---

## Recommendations for Further Improvement

1. Increase the dataset size from 200 to 1000+ examples.
2. Introduce more emotionally nuanced or task-aligned prompts.
3. Consider switching to a larger model (e.g., GPT2-Medium) with quantization if GPU memory allows.
4. Add label supervision (e.g., emotion category) and explore instruction tuning techniques.
5. Explore generation metrics (BLEU, ROUGE, perplexity) for automatic evaluation of improvements.


---

## **Problem Statement**

General-purpose language models like `distilgpt2` are not explicitly trained to handle emotionally nuanced completions, especially in tasks requiring empathy, sentiment reflection, or emotional alignment. This project addresses the problem of enhancing the model's ability to generate **emotion-sensitive completions** by fine-tuning on a larger subset of the `"emotion"` dataset using **parameter-efficient training** (LoRA) with **8-bit quantization** to reduce memory and training time costs.

The goal is to explore whether **increasing the dataset size** (5000 training + 1000 testing) and maintaining lightweight fine-tuning techniques can lead to **higher-quality completions** on emotional prompts.

---

## **Objectives**

1. Fine-tune a compact causal language model (`distilgpt2`) on 5000 emotionally annotated text examples from the Hugging Face `"emotion"` dataset.
2. Use **LoRA (Low-Rank Adaptation)** to inject task-specific knowledge with minimal additional parameters.
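3. Apply **8-bit quantization** for memory-efficient training on consumer-grade GPUs, using `prepare_model_for_kbit_training` to prepare the quantized model for training (this helper does not quantize by itself; the 8-bit weights come from loading the model with a `bitsandbytes` quantization config).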
4. Evaluate the difference in generation quality before and after fine-tuning using emotional prompts.
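
To make objective 3 concrete, here is a minimal sketch of what 8-bit quantization does to a weight vector: values are scaled by their largest absolute value, rounded to integers in [-127, 127], and rescaled on use. This is a simplified absmax scheme for illustration only. It makes no claim about the exact algorithm inside `bitsandbytes`, and `quantize_absmax` and `dequantize` are hypothetical helper names.

```python
from typing import List, Tuple

def quantize_absmax(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to the int8 range [-127, 127] using one scale for the whole vector."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in quantized]

weights = [0.12, -0.5, 0.03, 0.9, -0.77]  # toy values (assumed)
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"max round-trip error: {max_error:.5f}")
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step.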

---

## **Expected Outcomes**

* Before fine-tuning: The model is expected to output generic or repetitive completions, lacking emotional depth or relevance.
* After fine-tuning: The model should produce completions that better reflect emotional nuance, tone, and context based on training data.
* Fine-tuning on a **larger dataset** allows the model to learn more varied patterns and emotional contexts, improving generalization.

---

## **How This Is Better Than the Previous Version**

| Feature                   | Previous Version (Small Dataset)           | Current Version (Larger Dataset)                                                              |
| ------------------------- | ------------------------------------------ | --------------------------------------------------------------------------------------------- |
| **Training Samples**      | 200                                        | 5000                                                                                          |
| **Test Samples**          | 40                                         | 1000                                                                                          |
| **Training Epochs**       | 10                                         | 10                                                                                            |
| **Model Used**            | distilgpt2                                 | distilgpt2                                                                                    |
| **Fine-tuning Technique** | LoRA + 8-bit                               | LoRA + 8-bit                                                                                  |
| **Inference Setup**       | 5 prompt tests                             | 5 prompt tests                                                                                |
| **Memory Optimization**   | Yes (LoRA + bitsandbytes)                  | Yes (LoRA + bitsandbytes)                                                                     |
| **Expected Performance**  | Modest improvement due to limited examples | Significantly better generalization and emotional alignment due to more diverse training data |

**Improvement Justification**:

* With 25x more training examples, the model has access to richer emotional cues, diverse sentence structures, and greater variation in emotional scenarios.
* A larger training set helps reduce overfitting and increases the robustness of the generated completions.
* Since the model architecture and training technique remain the same, this version isolates **dataset size** as the primary factor contributing to better performance.

---

## **Conclusion**

This version of the code gives the model substantially more to learn from by using a **larger emotional dataset**, while still using **resource-efficient fine-tuning techniques**. It is intended as a practical test of whether **LoRA + quantization** can scale with data to yield measurable improvements, especially in low-resource training settings like Google Colab. The setup is well suited to classroom demonstrations, prototyping, and fine-tuning tasks where hardware and time are constrained.



In [None]:
# Imports
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load & Prepare Dataset
dataset = load_dataset("emotion", split="train")
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.rename_column("text", "prompt")  # rename for clarity

# Use only a small portion to keep it fast
dataset["train"] = dataset["train"].select(range(5000))
dataset["test"] = dataset["test"].select(range(1000))

# Tokenizer & Tokenization
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(ex):
    return tokenizer(ex["prompt"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize_fn, batched=True)

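# Load Model & Prepare for LoRA
# Note: no quantization config is passed to from_pretrained here, so the weights
# are not actually 8-bit quantized; prepare_model_for_kbit_training does not quantize by itself.
model = AutoModelForCausalLM.from_pretrained(model_name)
model = prepare_model_for_kbit_training(model)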

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Inference Helper
def infer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Predictions Before Fine-Tuning
test_prompts = [
    "I just got a promotion",
    "I'm feeling very low today",
    "The weather makes me feel",
    "She smiled as she opened the gift",
    "I'm afraid of what happens next"
]
print("\n🔍 Predictions BEFORE fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")

# Train Model
training_args = TrainingArguments(
    output_dir="./emotion-quant",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    logging_dir="./logs",
    report_to=[],
    fp16=True,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

# Predictions After Fine-Tuning
print("\n🔍 Predictions AFTER fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")






🔍 Predictions BEFORE fine-tuning:

Prompt: I just got a promotion
Output: I just got a promotion to be a part of the team. I'm not going to be a part of the team. I'm going to be a part of the team

Prompt: I'm feeling very low today
Output: I'm feeling very low today. I'm feeling very low today. I'm feeling very low today. I'm feeling very low today. I'm feeling very low today. I

Prompt: The weather makes me feel
Output: The weather makes me feel like I’m in a bad mood.”


I’m not sure if I’m going to be able

Prompt: She smiled as she opened the gift
Output: She smiled as she opened the gift.

"I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I



Prompt: I'm afraid of what happens next
Output: I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what happens next. I'm afraid of what

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.



| Step | Training Loss |
| ---- | ------------- |
| 500  | 4.8412        |
| 1000 | 4.5539        |
| 1500 | 4.4574        |
| 2000 | 4.4397        |
| 2500 | 4.3686        |
| 3000 | 4.3748        |
| 3500 | 4.3419        |
| 4000 | 4.3421        |
| 4500 | 4.3446        |
| 5000 | 4.29          |



🔍 Predictions AFTER fine-tuning:

Prompt: I just got a promotion
Output: I just got a promotion and i feel like i am not going to be able to do anything for the money i have earned and i feel like i am not going to be

Prompt: I'm feeling very low today
Output: I'm feeling very low today and i feel very depressed today and i feel very depressed today and i feel very depressed today and i feel very depressed today and i feel very depressed today

Prompt: The weather makes me feel
Output: The weather makes me feel like i am in a state of feeling like i am in a state of feeling like i am in a state of feeling like i am in a state

Prompt: She smiled as she opened the gift
Output: She smiled as she opened the gift box and felt it was a gift box and a gift box that was a gift box that was a gift box that was a gift box that was a

Prompt: I'm afraid of what happens next
Output: I'm afraid of what happens next but i feel like i have to be more than happy to be part of it and not be part of 


---

### **Evaluation: Is the Output Correct?**

| Prompt                            | Output Before Fine-Tuning                                              | Output After Fine-Tuning                                                            | Improvement?              |
| --------------------------------- | ---------------------------------------------------------------------- | ----------------------------------------------------------------------------------- | ------------------------- |
| I just got a promotion            | Repetitive: *"part of the team..."* (looped phrasing)                  | Slightly coherent but pessimistic: *"not going to be able..."*                      | **Partial, not accurate** |
| I'm feeling very low today        | Severe repetition: *"I'm feeling very low..."*                         | Repetition persists: *"very depressed today..."*                                    | **Minimal improvement**   |
| The weather makes me feel         | Unfinished and incoherent: *"I’m not sure if I’m going to be able..."* | Looped structure: *"state of feeling like..."*                                      | **Still incoherent**      |
| She smiled as she opened the gift | Begins well but degenerates: *"I'm sorry..."* repeated                 | Starts better but falls into repetition: *"gift box that was a gift box..."*        | **Minor improvement**     |
| I'm afraid of what happens next   | Repeats the prompt itself in a loop                                    | Slightly more variation but semantically confused: *"be part of it and not be part of..."* | **Some improvement**      |

---

### **Fine-Tuning Effectiveness: Justification**

#### 1. **Training Loss Behavior**

* Training loss **fell** from 4.84 (step 500) to 4.29 (step 5000) in the log shown above, indicating that the model is learning.
* However, the loss hovered between about 4.29 and 4.37 from step 2500 onward, suggesting **limited gains** after that point.
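
Because the reported loss is a mean per-token cross-entropy (in nats), it can be read as perplexity by exponentiating it. A small sketch using the first and last values from the training log above (step 500 and step 5000); the variable names are illustrative.

```python
import math

# Values copied from the training log above (step -> training loss)
logged_loss = {500: 4.8412, 5000: 4.29}

for step, loss in logged_loss.items():
    perplexity = math.exp(loss)  # perplexity = e^(mean cross-entropy)
    print(f"step {step}: loss={loss:.4f} -> perplexity ~ {perplexity:.1f}")
```

On this reading, training perplexity drops from roughly 127 to roughly 73. Note that this is computed on training batches; a held-out perplexity would need the evaluation loss from `trainer.evaluate()`.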

#### 2. **Qualitative Output**

* The fine-tuned model **reduces random or unrelated completions**.
* However, it **still exhibits repetition**, **lack of emotional grounding**, and **confused semantics** in longer completions.
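
One decoding-time lever against the looping noted above is a repetition penalty. The sketch below shows the commonly used rule in plain Python: for every token id that has already been generated, a positive logit is divided by the penalty and a negative logit is multiplied by it, so the token becomes less likely either way. The function name and the toy logits are assumptions for illustration; in practice the same effect is requested through the `repetition_penalty` argument of `model.generate`.

```python
from typing import Dict, List

def apply_repetition_penalty(
    logits: Dict[int, float], generated: List[int], penalty: float = 1.2
) -> Dict[int, float]:
    """Return a copy of the logits with already-generated token ids made less likely."""
    adjusted = dict(logits)
    for token_id in set(generated):
        if token_id not in adjusted:
            continue
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = {0: 2.4, 1: -1.0, 2: 0.6}  # token id -> logit (assumed values)
print(apply_repetition_penalty(toy_logits, generated=[0, 1, 0]))
```

Token 2 has not been generated yet, so its logit is left untouched, while tokens 0 and 1 are pushed down.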

#### 3. **Model and Task Alignment**

* `distilgpt2` is **not trained for emotional reasoning**, only language modeling.
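* The `emotion` dataset is **not in instruction–response format**. The model is trained on the raw sentences only (the emotion labels are not part of the training text), so it mostly picks up the style of the dataset rather than learning how to continue or respond to a prompt.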

---

### **Conclusion**

**No, the fine-tuning is not yet effective** enough for real-world use. Although there is a **slight improvement** in structure and variation, the model:

* Struggles to produce emotionally aligned or complete sentences.
* Still loops or repeats phrases.
* Doesn’t meaningfully shift outputs even after fine-tuning on 5000 samples.

---

### **Recommendations to Improve Effectiveness**

1. **Switch to instruction tuning format**, like:

   * Input: “I just got a promotion”
   * Output: “That must make you feel proud and excited.”
2. **Use a model like `flan-t5` or `mistral-instruct`** that expects prompt–completion pairs.
3. **Curate the emotion dataset to form (prompt, response) pairs** for causal LM fine-tuning.
4. **Tune decoding settings** to limit repetition (e.g., `repetition_penalty`, `no_repeat_ngram_size`, or sampling with `top_k` / `top_p`, which take effect only when `do_sample=True`).
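
Recommendations 1 and 3 can be sketched as a small formatting step that turns each labelled sentence into a (prompt, response) pair. The label-to-name mapping follows the `dair-ai/emotion` dataset card (0 = sadness, 1 = joy, 2 = love, 3 = anger, 4 = fear, 5 = surprise). The template wording and the function name `to_instruction_pair` are assumptions, the sample rows are shortened dataset-style sentences, and a real setup would want more varied, human-written responses.

```python
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def to_instruction_pair(example: dict) -> dict:
    """Build a prompt/response pair from one {'prompt': str, 'label': int} row."""
    emotion = LABEL_NAMES[example["label"]]
    return {
        "prompt": (
            f"Message: {example['prompt']}\n"
            "What emotion does the writer express?\n"
            "Answer:"
        ),
        "response": f" The writer is expressing {emotion}.",
    }

rows = [
    {"prompt": "i feel like everything i do i will make a mistake and i will be punished", "label": 0},
    {"prompt": "i gotta say im feeling pretty impressed with how everything ended up", "label": 5},
]

for row in rows:
    pair = to_instruction_pair(row)
    print(pair["prompt"] + pair["response"])
    print("---")
```

With the Hugging Face `datasets` library, a function of this shape could be applied with `dataset.map(to_instruction_pair)`. Computing the loss only on the response tokens is a further refinement.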




---

## **Problem Statement**

Earlier fine-tuning experiments on the `"emotion"` dataset showed only slight improvements with a lightweight model like `distilgpt2`, which is architecturally limited in its ability to model complex emotional nuances and contextual completions. The goal of this code is to **enhance emotional text generation** by fine-tuning a **stronger instruction-tuned model**, `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, on 5000 training and 1000 testing examples using **LoRA-based parameter-efficient tuning** with **8-bit quantization**.

This approach investigates whether the combination of **larger model capacity** and **efficient fine-tuning** can yield more semantically rich and emotionally grounded text completions.

---

## **Objectives**

1. Load and preprocess the `"emotion"` dataset by renaming `text` to `prompt` and selecting a larger subset (5000 train, 1000 test).
2. Tokenize the data with a tokenizer compatible with `TinyLlama-1.1B-Chat`, a modern and instruction-tuned LLaMA variant.
3. Apply **LoRA (Low-Rank Adaptation)** to the `q_proj` and `v_proj` layers of the TinyLlama model (the attention query and value projections), a common choice of LoRA targets.
4. Quantize the model using `bitsandbytes` for **8-bit memory-efficient training**.
5. Train the model over 10 epochs with mixed-precision (fp16) to speed up training and reduce GPU memory usage.
6. Evaluate the model using emotional prompts both **before and after fine-tuning** to measure improvement in contextual emotional understanding.
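
For objective 4, a back-of-the-envelope estimate shows why precision matters at this model size. The sketch assumes roughly 1.1 billion parameters and counts weights only (no activations, gradients, or optimizer state); `weight_memory_gb` is an illustrative helper.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 1.1e9  # assumed parameter count for TinyLlama-1.1B

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(num_params, dtype):.1f} GB")
```

Under these assumptions the weights take about 4.4 GB in fp32, 2.2 GB in fp16, and 1.1 GB in int8, which is the headroom that 8-bit loading is meant to buy on a small GPU.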

---

## **Expected Outcomes**

* **Before fine-tuning**: The instruction-tuned base model (`TinyLlama`) may perform moderately well but will lack task-specific emotional sensitivity.
* **After fine-tuning**: The model should produce completions that are **more emotionally aligned**, **contextually relevant**, and **linguistically coherent** across varied prompts.

---

## **How This Is Better Than the Previous Version (DistilGPT-2)**

| Feature                     | Previous Version (`distilgpt2`) | Current Version (`TinyLlama-1.1B-Chat`)                         |
| --------------------------- | ------------------------------- | --------------------------------------------------------------- |
| **Model Type**              | Distilled GPT-2 (~82M)          | Instruction-tuned TinyLlama (1.1B)                              |
| **Instruction Capability**  | No                              | Yes                                                             |
| **Architecture**            | GPT-style                       | LLaMA-style                                                     |
| **Trainable Layers**        | `c_attn`                        | `q_proj`, `v_proj` (modern LLaMA targets)                       |
| **Tokenizer**               | GPT-2                           | LLaMA tokenizer                                                 |
| **Fine-Tuning Strategy**    | LoRA + 8-bit                    | LoRA + 8-bit                                                    |
| **Data Size**               | 5000 train / 1000 test          | Same                                                            |
| **Expected Output Quality** | Repetitive, loosely grounded completions | Richer, contextually adaptive, emotionally grounded completions |

### Key Improvements:

1. **Model Architecture Upgrade**: TinyLlama is more recent and powerful than the older GPT-2 variant.
2. **Instruction Tuning**: TinyLlama is better suited for follow-the-prompt behavior, which aligns well with the emotional prompt completions.
3. **Target Modules**: Adapting `q_proj` and `v_proj` is standard practice for LLaMA-like architectures; GPT-2 instead exposes a single fused query/key/value projection, `c_attn`.
4. **Better Scaling**: With 1.1B parameters, TinyLlama can capture more nuanced patterns, given the same training data.
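
The size of the adapters in the two setups can be estimated by hand: a rank-`r` LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` trainable parameters. The layer shapes below are assumptions taken from the public model configurations (distilgpt2: 6 layers, hidden size 768, fused `c_attn` of 768 x 2304; TinyLlama-1.1B: 22 layers, hidden size 2048, `q_proj` 2048 x 2048, `v_proj` 2048 x 256 because of grouped-query attention). Check them against `model.config` before relying on the numbers.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

r = 8  # the rank used in this notebook's LoraConfig

# distilgpt2: one adapted projection (c_attn) per layer, 6 layers (assumed shapes)
distilgpt2_total = 6 * lora_params(768, 2304, r)

# TinyLlama-1.1B: q_proj and v_proj per layer, 22 layers (assumed shapes)
tinyllama_total = 22 * (lora_params(2048, 2048, r) + lora_params(2048, 256, r))

print(f"distilgpt2 LoRA params: {distilgpt2_total:,}")
print(f"TinyLlama  LoRA params: {tinyllama_total:,}")
```

In both cases the adapters are a small fraction of the base model. Calling `model.print_trainable_parameters()` on the PEFT model reports the actual counts.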

---

## **Conclusion**

This version presents a **significant architectural and functional advancement** over the previous DistilGPT-2 version. By fine-tuning a **stronger instruction-following model** using **efficient low-rank adaptation and quantized training**, it is more likely to generate **emotion-aware and human-like responses** to varied prompts. This case serves as a real-world demonstration of how **model architecture choice + efficient tuning methods** directly impact downstream generation quality—especially in emotionally sensitive domains.



In [None]:
# Imports
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load & Prepare Dataset
dataset = load_dataset("emotion", split="train")
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.rename_column("text", "prompt")  # rename for clarity

# Use only a small portion to keep it fast
dataset["train"] = dataset["train"].select(range(5000))
dataset["test"] = dataset["test"].select(range(1000))

# Tokenizer & Tokenization
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(ex):
    return tokenizer(ex["prompt"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize_fn, batched=True)

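# Load Model & Prepare for LoRA
# Note: no quantization config is passed to from_pretrained here, so the weights
# are not actually 8-bit quantized; prepare_model_for_kbit_training does not quantize by itself.
model = AutoModelForCausalLM.from_pretrained(model_name)
model = prepare_model_for_kbit_training(model)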

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # for TinyLlama
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Inference Helper
def infer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Predictions Before Fine-Tuning
test_prompts = [
    "I just got a promotion",
    "I'm feeling very low today",
    "The weather makes me feel",
    "She smiled as she opened the gift",
    "I'm afraid of what happens next"
]
print("\n🔍 Predictions BEFORE fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")

# Train Model
training_args = TrainingArguments(
    output_dir="./emotion-quant",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    logging_dir="./logs",
    report_to=[],
    fp16=True,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

# Predictions After Fine-Tuning
print("\n🔍 Predictions AFTER fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")




🔍 Predictions BEFORE fine-tuning:

Prompt: I just got a promotion
Output: I just got a promotion!

Jessica: (smiling) Congratulations! I'm so proud of you.

Tom: (smiling

Prompt: I'm feeling very low today
Output: I'm feeling very low today.

JASON: (smiling) Hey, that's okay. I've been feeling down too.

MAT

Prompt: The weather makes me feel
Output: The weather makes me feel like I'm in a different place.

2. I'm in a different place, but I'm not in a different world

Prompt: She smiled as she opened the gift
Output: She smiled as she opened the gift.

2. "I'm so glad you're here. I've been looking forward to this moment for so long."


Prompt: I'm afraid of what happens next
Output: I'm afraid of what happens next.

Scene 2:

The stage is now dark, and the audience is left in the dark. The lights come up on a



No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Step | Training Loss |
| ---- | ------------- |
| 500  | 3.1595        |
| 1000 | 3.0103        |
| 1500 | 3.006         |
| 2000 | 2.9807        |
| 2500 | 2.9866        |
| 3000 | 2.9449        |
| 3500 | 2.9362        |
| 4000 | 2.951         |
| 4500 | 2.9349        |
| 5000 | 2.9188        |



🔍 Predictions AFTER fine-tuning:

Prompt: I just got a promotion
Output: I just got a promotion and I feel like I am being rewarded for my hard work and dedication to the company and I am not being rewarded for my loyalty

Prompt: I'm feeling very low today
Output: I'm feeling very low today and I'm not sure why but I just can't seem to get out of bed to do anything much let alone anything productive I'

Prompt: The weather makes me feel
Output: The weather makes me feel so uncool and unattractive and I feel like I should be wearing a bikini or something to make myself look more appe

Prompt: She smiled as she opened the gift
Output: She smiled as she opened the gift and found a beautiful necklace with a small diamond in the center and a small heart shaped pendant on the other end. She felt a little

Prompt: I'm afraid of what happens next
Output: I'm afraid of what happens next and I feel like I'm going to be a failure if I don't get it right I'm so afraid of failure that I'




---

### Justification of Output and Fine-Tuning Effectiveness

| **Aspect**             | **Before Fine-Tuning**                                                                              | **After Fine-Tuning**                                                          | **Evaluation**                        |
| ---------------------- | --------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | ------------------------------------- |
| **Relevance**          | Outputs were dramatic, theatrical, and contained abrupt scene transitions or screenplay-style text. | Responses are more emotionally grounded, coherent, and contextual.             | Improved contextual relevance.        |
| **Fluency**            | Fluent but sometimes disjointed, e.g. scene instructions or broken dialogues.                       | Fluent and narratively connected. Slightly verbose in places.                  | Slight improvement.                   |
| **Emotion Alignment**  | Limited emotional alignment (e.g. "The stage is now dark").                                         | Much better alignment with prompts (e.g. "afraid", "low", "promotion").        | Clearly improved emotional alignment. |
| **Coherence**          | Repetitions and incomplete thoughts.                                                                | More coherent, fewer abrupt cutoffs. However, still some unfinished sentences. | Improved, but still not perfect.      |
| **Creativity / Style** | Some creative output, but often overly theatrical.                                                  | Retains creativity but now more human and emotionally grounded.                | Balanced creativity.                  |

---

### Prompt-wise Analysis

| **Prompt**                        | **Before Fine-Tuning**                                | **After Fine-Tuning**                                                       | **Comment**                                   |
| --------------------------------- | ----------------------------------------------------- | --------------------------------------------------------------------------- | --------------------------------------------- |
| I just got a promotion            | “Jessica: (smiling)…” — overly dramatized.            | More personal and realistic — “I feel like I am being rewarded…”            | Much better match.                            |
| I'm feeling very low today        | Screenplay-style dialogue (“JASON: (smiling)…”).      | “...can't seem to get out of bed…”, emotionally aligned and human.          | Strong improvement.                           |
| The weather makes me feel         | Vague and incoherent (“...not in a different world”). | Much more expressive though slightly off-topic (“uncool and unattractive”). | Better fluency; alignment still needs tuning. |
| She smiled as she opened the gift | Repeated screenplay formatting.                       | Realistic and coherent narrative (gift = necklace + emotional reaction).    | Clear progress.                               |
| I'm afraid of what happens next   | Stage lighting instructions previously.               | Internal emotion: “afraid of failure…”                                      | More authentic.                               |

---

### Final Verdict

* **Yes**, fine-tuning has made the model output **more emotionally aware, coherent, and less theatrical**.
* **No**, it's not yet perfect. Some prompts still lead to incomplete or repetitive outputs, and emotion grounding can be improved.
* **Training loss decreased** from about 3.16 (step 500) to about 2.92 (step 5000) in the log shown above, indicating that the model is learning.
* TinyLlama is a strong base model, but **further fine-tuning**, **better prompt-completion formatting**, and possibly **instruction-style formatting** would yield even better results.
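
When reading the loss tables, it helps to know how many optimizer steps each run contains. A quick calculation, assuming a single GPU, no gradient accumulation, and the batch size and epoch count from `TrainingArguments`; `total_steps` is an illustrative helper.

```python
import math

def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# 5000-example subset used in the runs above
print(total_steps(5000, batch_size=4, epochs=10))

# Full 80% training split (12,800 examples) used in the final cell
print(total_steps(12800, batch_size=4, epochs=10))
```

Under these assumptions a 5000-example run has 1,250 steps per epoch and 12,500 steps in total, so a loss table that ends at step 5000 covers only the first four epochs.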



# Happy Learning

In [None]:
# Install Required Libraries
!pip install -q datasets transformers accelerate peft bitsandbytes

In [None]:
!pip install -q fsspec==2023.6.0
import os
os.kill(os.getpid(), 9)  # Restart the Colab runtime after installing

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2023.6.0 which is incompatible.

In [None]:
from datasets import load_dataset

# Load the emotion dataset from Hugging Face
dataset = load_dataset("dair-ai/emotion", split="train")
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.rename_column("text", "prompt")

# Show first few rows
dataset["train"][:5]


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/129k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/127k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/16000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

{'prompt': ['i had gone to the cumberland earlier that week so had met a few of n amp h friends prior to the weekend which was really lovely as since moving away i feel there are so many wonderful people i don t know',
  'i gotta say im feeling pretty impressed with how everything ended up considering my total dollars dropped totaled and i have three small canvases to play with display with',
  'i feel like the most innocent statements can be twisted into something sinister and inaccurate',
  'i think i used to overeat i mean one reason anyway was because i wanted to make sure i didn t feel deprived later',
  'i feel like everything i do i will make a mistake and i will be punished'],
 'label': [1, 5, 1, 0, 0]}

In [None]:
# This code took approx 01hr 07mins to execute using A100 GPU. We used the entire dataset - "dair-ai/emotion"
# Imports
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load & Prepare Dataset
dataset = load_dataset("dair-ai/emotion", split="train")
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.rename_column("text", "prompt")  # rename for clarity

# Subsetting is disabled in this run: the full split (12,800 train / 3,200 test) is used
# dataset["train"] = dataset["train"].select(range(15000))
# dataset["test"] = dataset["test"].select(range(3000))

# Tokenizer & Tokenization
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(ex):
    return tokenizer(ex["prompt"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize_fn, batched=True)

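# Load Model & Prepare for LoRA
# Note: no quantization config is passed to from_pretrained here, so the weights
# are not actually 8-bit quantized; prepare_model_for_kbit_training does not quantize by itself.
model = AutoModelForCausalLM.from_pretrained(model_name)
model = prepare_model_for_kbit_training(model)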

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # for TinyLlama
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Inference Helper
def infer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Predictions Before Fine-Tuning
test_prompts = [
    "I just got a promotion",
    "I'm feeling very low today",
    "The weather makes me feel",
    "She smiled as she opened the gift",
    "I'm afraid of what happens next"
]
print("\n🔍 Predictions BEFORE fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")

# Train Model
training_args = TrainingArguments(
    output_dir="./emotion-quant",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    logging_dir="./logs",
    report_to=[],
    fp16=True,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

# Predictions After Fine-Tuning
print("\n🔍 Predictions AFTER fine-tuning:\n")
for p in test_prompts:
    print(f"Prompt: {p}\nOutput: {infer(p)}\n")


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

Map:   0%|          | 0/12800 [00:00<?, ? examples/s]

Map:   0%|          | 0/3200 [00:00<?, ? examples/s]


🔍 Predictions BEFORE fine-tuning:

Prompt: I just got a promotion
Output: I just got a promotion!

Jessica: (smiling) Congratulations! I'm so proud of you.

Tom: (smiling

Prompt: I'm feeling very low today
Output: I'm feeling very low today.

JASON: (smiling) Hey, that's okay. I've been feeling down too.

MAT

Prompt: The weather makes me feel
Output: The weather makes me feel like I'm in a different place.

2. I'm in a different place, but I'm not in a different world

Prompt: She smiled as she opened the gift
Output: She smiled as she opened the gift.

2. "I'm so glad you're here. I've been looking forward to this moment for so long."


Prompt: I'm afraid of what happens next
Output: I'm afraid of what happens next.

Scene 2:

The stage is now dark, and the audience is left in the dark. The lights come up on a



No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Step | Training Loss |
| ---- | ------------- |
| 500  | 3.1706        |
| 1000 | 3.0488        |
| 1500 | 3.0209        |
| 2000 | 3.0031        |
| 2500 | 2.9769        |
| 3000 | 3.0005        |
| 3500 | 2.9598        |
| 4000 | 2.9868        |
| 4500 | 2.9468        |
| 5000 | 2.9513        |
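
To put the logged losses in context, the sketch below converts each value to perplexity and estimates how many epochs the logged steps cover. It assumes the reported training loss is a mean token-level cross-entropy in nats (so perplexity is `exp(loss)`), that the 12800-example `Map` line refers to the train split, and that training ran on one device with no gradient accumulation. None of these assumptions is confirmed by the notebook output itself.

```python
import math

# Training loss per logging step, copied from the log above.
losses = {
    500: 3.1706, 1000: 3.0488, 1500: 3.0209, 2000: 3.0031, 2500: 2.9769,
    3000: 3.0005, 3500: 2.9598, 4000: 2.9868, 4500: 2.9468, 5000: 2.9513,
}

TRAIN_EXAMPLES = 12800  # from the first `Map` progress line (assumed to be the train split)
BATCH_SIZE = 4          # per_device_train_batch_size in TrainingArguments
# ASSUMPTION: a single device and gradient_accumulation_steps left at its default of 1.


def perplexity(loss):
    """Perplexity implied by a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


first_step, last_step = min(losses), max(losses)
loss_drop = losses[first_step] - losses[last_step]

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
epochs_logged = last_step / steps_per_epoch

for step, loss in losses.items():
    print(f"step {step:>5}  loss {loss:.4f}  perplexity {perplexity(loss):.2f}")
print(f"loss drop from step {first_step} to step {last_step}: {loss_drop:.4f}")
print(f"steps per epoch: {steps_per_epoch}, epochs covered by the log: {epochs_logged:.2f}")
```

Under these assumptions the logged steps cover about 1.56 epochs, and the loss falls by roughly 0.22 (perplexity from about 24 to about 19). The decline is not strictly monotonic: steps 3000 and 4000 both tick up slightly before the loss drops again.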



🔍 Predictions AFTER fine-tuning:

Prompt: I just got a promotion
Output: I just got a promotion and I feel so stupid for not doing this sooner I feel like I am a failure and I am so ashamed of myself for not doing this sooner

Prompt: I'm feeling very low today
Output: I'm feeling very low today and I'm not sure why I'm so sad I'm not sure what's wrong with me I'm just feeling so sad

Prompt: The weather makes me feel
Output: The weather makes me feel like a cute little kitten in a cozy blanket and I love it I love the feeling of being cozy and warm and loved and

Prompt: She smiled as she opened the gift
Output: She smiled as she opened the gift and felt a little more confident about her future as a nurse and a mother to come She was a little nervous about the future but she knew

Prompt: I'm afraid of what happens next
Output: I'm afraid of what happens next and I feel like I'm being watched and followed by the ghosts of my past and present and future selves and I'm not sure
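
As a rough check on the `LoraConfig` used in the training cell (`r=8`, `lora_alpha=16`, adapters on `q_proj` and `v_proj`), the sketch below counts how many adapter weights are actually trained. The layer shapes are assumptions based on the published TinyLlama-1.1B configuration (hidden size 2048, 22 decoder layers, a 256-wide `v_proj` output due to grouped-query attention). They are not stated in this notebook, so check them against `model.config` before relying on the result.

```python
# LoRA adds two low-rank matrices to each adapted linear layer:
#   A with shape (r, in_features) and B with shape (out_features, r).
R = 8
LORA_ALPHA = 16

# ASSUMED shapes for TinyLlama-1.1B; not taken from the notebook.
NUM_LAYERS = 22
TARGET_SHAPES = {
    "q_proj": (2048, 2048),  # (in_features, out_features)
    "v_proj": (2048, 256),
}


def lora_param_count(in_features, out_features, r):
    """Number of weights in the A and B matrices of one LoRA adapter."""
    return r * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, R) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update B @ A

print(f"adapter weights per decoder layer: {per_layer:,}")
print(f"total trainable adapter weights:  {total_trainable:,}")
print(f"LoRA scaling (lora_alpha / r):     {scaling}")
```

With `bias="none"` no bias terms are trained, so the adapter matrices are the only trainable weights. On the live model, PEFT's `model.print_trainable_parameters()` reports the actual count and is the authoritative check.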

