# Fine-tune LLaMA 3 8B with QLoRA using Unsloth

**Instructions:**
1. Change runtime to GPU (Runtime > Change runtime type > T4 GPU)
2. Upload your `synthetic_qa_comprehensive.jsonl` file to this Colab session
3. Run the cells in order

**Expected time:** 30-60 minutes on T4 GPU

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
# Install dependencies
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" trl peft accelerate bitsandbytes

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-ews03orf/unsloth_b9b37fa8f94144fa8119e9d9b7946636
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-ews03orf/unsloth_b9b37fa8f94144fa8119e9d9b7946636
  Resolved https://github.com/unslothai/unsloth.git to commit 4af624557fbcc14e248daeb9709ce5a81c3070ca
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.9.6 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.9.6-py3-none-any.whl.metadata (9.5 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.git

In [11]:
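# Upgrade xformers from the PyTorch cu126 wheel index (replaces the "<0.0.27" pin installed above)
!pip install -U xformers --index-url https://download.pytorch.org/whl/cu126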

Looking in indexes: https://download.pytorch.org/whl/cu126


In [12]:
# Import libraries and check GPU
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
import json

# Check GPU availability
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name() if torch.cuda.is_available() else 'None'}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB" if torch.cuda.is_available() else "No GPU")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
GPU available: True
GPU name: Tesla T4
GPU memory: 14.7 GB


In [13]:
# Load model and tokenizer
print("Loading LLaMA 3 8B model with 4-bit quantization...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # Base LLaMA 3 8B
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

print("Model loaded successfully!")

Loading LLaMA 3 8B model with 4-bit quantization...
==((====))==  Unsloth 2025.9.5: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Model loaded successfully!


In [14]:
# Add LoRA adapters
print("Adding LoRA adapters for efficient fine-tuning...")

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0.05,  # Unsloth's fast patching needs 0; a nonzero value costs speed (see warning below)
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Show trainable parameters
model.print_trainable_parameters()

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.


Adding LoRA adapters for efficient fine-tuning...


Unsloth 2025.9.5 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
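
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the Llama 3 8B layer dimensions from the published model configuration (hidden size 4096, MLP width 14336, 8 key/value heads of dimension 128, 32 decoder layers); these numbers are not printed anywhere in this notebook, so treat them as assumptions. Each adapted weight matrix gains two low-rank factors, which adds `r * (in_features + out_features)` parameters.

```python
# Assumed Llama 3 8B dimensions (taken from the published config, not from this notebook).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
NUM_LAYERS = 32        # num_hidden_layers
RANK = 16              # r passed to get_peft_model above

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) for every adapted matrix."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
total = 8_072_204_288  # "all params" as reported by print_trainable_parameters()
print(f"trainable params: {trainable:,} ({100 * trainable / total:.4f}%)")
```

Under these assumptions the count matches the 41,943,040 reported by `print_trainable_parameters()`, which suggests all seven target modules received adapters in every layer.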


In [15]:
# Load dataset
# Make sure you've uploaded synthetic_qa_comprehensive.jsonl to this Colab session
print("Loading training dataset...")

dataset = load_dataset("json", data_files="synthetic_qa_comprehensive.jsonl", split="train")
print(f"Loaded {len(dataset)} training examples")

# Preview first example
print("\nFirst training example:")
print(dataset[0]['text'][:200] + "...")

Loading training dataset...


Loaded 501 training examples

First training example:
<|system|>You are a helpful academic Q&A assistant specialized in scholarly content.<|user|>What is the main contribution of the paper titled 'FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Ima...
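
Before training, it can help to confirm that every line of the JSONL file has the shape the trainer expects. The following is a minimal sketch, assuming each record is a JSON object with a string `text` field containing the `<|system|>`, `<|user|>`, and `<|assistant|>` markers. Only the first example is shown above, so the marker requirement is an assumption about the rest of the file, and `validate_jsonl` is a hypothetical helper, not part of any library.

```python
import json

# Markers assumed to appear in every training example (inferred from the preview and the test prompts).
REQUIRED_MARKERS = ("<|system|>", "<|user|>", "<|assistant|>")


def validate_jsonl(path, text_field="text", markers=REQUIRED_MARKERS):
    """Return (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((line_number, f"invalid JSON: {error.msg}"))
                continue
            text = record.get(text_field) if isinstance(record, dict) else None
            if not isinstance(text, str):
                problems.append((line_number, f"missing string field '{text_field}'"))
                continue
            missing = [marker for marker in markers if marker not in text]
            if missing:
                problems.append((line_number, f"missing markers: {', '.join(missing)}"))
    return problems
```

Calling `validate_jsonl("synthetic_qa_comprehensive.jsonl")` in a Colab cell would return an empty list if all records pass these checks.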


In [16]:
# Setup training
print("Setting up trainer...")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # 60 steps x effective batch 8 (2 x 4) = 480 examples, about one pass over 501
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        save_strategy="steps",
        save_steps=30,
    ),
)

print("Trainer ready!")

Setting up trainer...


Trainer ready!


In [23]:
# Set up training again, this time with report_to="none" to turn off external logging integrations
print("Setting up trainer...")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # 60 steps x effective batch 8 (2 x 4) = 480 examples, about one pass over 501
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        save_strategy="steps",
        save_steps=30,
        report_to="none",
    ),
)

print("Trainer ready!")

Setting up trainer...
Trainer ready!
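
To pick a sensible `max_steps`, it helps to work out how much of the dataset the configured run covers. The sketch below does that arithmetic from the values in the trainer setup; `training_coverage` is a hypothetical helper, and the steps-per-epoch figure is an approximation, since the trainer's own rounding of partial batches may differ slightly.

```python
import math


def training_coverage(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Summarize how many examples a fixed-step run consumes."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "examples_seen": examples_seen,
        "epochs": examples_seen / num_examples,
        # Approximate: optimizer steps needed for one full pass over the data.
        "approx_steps_per_epoch": math.ceil(num_examples / effective_batch),
    }


coverage = training_coverage(num_examples=501, per_device_batch=2, grad_accum=4, max_steps=60)
for key, value in coverage.items():
    print(f"{key}: {value}")
```

With the settings used here, 60 steps consume 480 of the 501 examples, so the run stops just short of one epoch. Raising `max_steps` to roughly 63 would cover the full dataset once.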


In [24]:
# Start training
print("Starting training... This will take 30-60 minutes.")
print("You can watch the progress below.")

trainer.train()

print("\nTraining completed!")

Starting training... This will take 30-60 minutes.
You can watch the progress below.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 501 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Step,Training Loss
1,3.9108
2,3.4558
3,3.5825
4,3.5301
5,3.029
6,2.9601
7,2.8034
8,2.1951
9,1.9532
10,2.0598


Unsloth: Will smartly offload gradients to save VRAM!


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).



Training completed!
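
Per-step losses logged with `logging_steps=1` are noisy, so a short moving average can make the trend easier to read. This is a minimal sketch using only the ten loss values shown in the table above; the window size of 3 is an arbitrary choice for illustration.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Training loss for steps 1-10, copied from the log above.
losses = [3.9108, 3.4558, 3.5825, 3.5301, 3.029, 2.9601, 2.8034, 2.1951, 1.9532, 2.0598]

smoothed = moving_average(losses, window=3)
for last_step, value in enumerate(smoothed, start=3):
    print(f"steps {last_step - 2}-{last_step}: {value:.4f}")
```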


In [25]:
# Save the fine-tuned model
print("Saving fine-tuned model...")

model.save_pretrained("llama3-academic-qa")
tokenizer.save_pretrained("llama3-academic-qa")

print("Model saved to llama3-academic-qa/")
print("You can download this folder to use the model locally.")

Saving fine-tuned model...
Model saved to llama3-academic-qa/
You can download this folder to use the model locally.


In [26]:
# Test the fine-tuned model
print("Testing the fine-tuned model...")

# Prepare model for inference
FastLanguageModel.for_inference(model)

# Test with an academic question
test_prompt = "<|system|>You are a helpful academic Q&A assistant specialized in scholarly content.<|user|>What is the main contribution of recent research in computer vision?<|assistant|>"

inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        use_cache=True,
        temperature=0.7,
        do_sample=True
    )

response = tokenizer.batch_decode(outputs)[0]
print("\nModel response:")
print(response.split("<|assistant|>")[1] if "<|assistant|>" in response else response)

Testing the fine-tuned model...

Model response:
The main contribution is we present a novel framework for generating high-quality, human-like, and expressive 3d facial animations (fas) by disentangling the underlying dynamics into a latent space of facial expressions and a temporal space of mouth movements. We demonstrate the effectiveness of our framework through a comprehensive set of quantitative and qualitative evaluations, including human evaluations on facial expression and mouth movement quality, as well as evaluations on facial expression transfer, mouth movement transfer, and expression-movement alignment. Our results show that our framework significantly outperforms existing baselines in terms of both quantitative and qualitative metrics, demonstrating the effectiveness of our approach in generating high-quality
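
The prompt template and the answer extraction are repeated in each test cell, so they can be factored into two small helpers. This is a sketch, not part of the original notebook: `build_prompt` and `extract_answer` are hypothetical names, and trimming the answer at the next `<|` marker assumes that a genuine answer never contains that character sequence.

```python
SYSTEM_PROMPT = "You are a helpful academic Q&A assistant specialized in scholarly content."
ASSISTANT_MARKER = "<|assistant|>"


def build_prompt(question, system_prompt=SYSTEM_PROMPT):
    """Format a question with the same markers used in the training data."""
    return f"<|system|>{system_prompt}<|user|>{question}{ASSISTANT_MARKER}"


def extract_answer(decoded):
    """Return the text after the first assistant marker, cut at the next '<|' marker."""
    _, marker, tail = decoded.partition(ASSISTANT_MARKER)
    if not marker:
        return decoded.strip()  # marker absent: fall back to the full text, as the cells do
    end = tail.find("<|")
    return (tail if end == -1 else tail[:end]).strip()


example = "<|begin_of_text|>" + build_prompt("What is QLoRA?") + " A placeholder answer.<|end_of_text|>"
print(extract_answer(example))
```

Compared with `response.split("<|assistant|>")[1]`, this version also drops any trailing special token or follow-on turn that the model generates after the answer.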


In [27]:
# Test with a few more academic questions
test_questions = [
    "What problem does quantization address in large language models?",
    "What is the main challenge in text-to-image generation research?",
    "How do diffusion models work in computer vision?"
]

print("Testing with more academic questions:")
print("=" * 50)

for i, question in enumerate(test_questions, 1):
    prompt = f"<|system|>You are a helpful academic Q&A assistant specialized in scholarly content.<|user|>{question}<|assistant|>"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            use_cache=True,
            temperature=0.7,
            do_sample=True
        )

    response = tokenizer.batch_decode(outputs)[0]
    answer = response.split("<|assistant|>")[1] if "<|assistant|>" in response else response

    print(f"\nQ{i}: {question}")
    print(f"A{i}: {answer.strip()}")
    print("-" * 30)

Testing with more academic questions:

Q1: What problem does quantization address in large language models?
A1: In this paper, we introduce quantization, a new technique for improving the generalization of large language models (LLMs) by reducing their size and training them with smaller data sets. We show that quantization can be used to improve the generalization of LLMs by reducing their size and training them with smaller data sets. Our results show that quantization can significantly improve the generalization of LLMs by reducing their size and training them with smaller data sets. We demonstrate this improvement through several
------------------------------

Q2: What is the main challenge in text-to-image generation research?
A2: The main challenge is text-to-image (t2i) generation, which aims to synthesize images from text descriptions, is a fundamental task in computer vision. However, most existing t2i models are limited by the use of a single encoder to encode the text and a

## Next Steps

1. **Download the model**: The `llama3-academic-qa` folder contains the LoRA adapter and tokenizer files; download it to use locally on top of the base model
2. **Test more thoroughly**: Try questions from your original dataset
3. **Compare with base model**: Run the same questions through the original LLaMA 3 8B to see what fine-tuning changed
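
The comparison in step 3 can be organized with a small harness that runs the same questions through two generation functions. The sketch below is deliberately model-agnostic: `base_generate` and `tuned_generate` are hypothetical callables that take a question and return an answer string, and the lambdas in the demo are stand-ins, not real model calls. One possible way to get base-model behavior from the same object is PEFT's `disable_adapter()` context manager, though how that interacts with Unsloth's inference patching is not verified here.

```python
def compare_answers(questions, base_generate, tuned_generate):
    """Collect answers from both callables for each question."""
    return [
        {"question": q, "base": base_generate(q), "tuned": tuned_generate(q)}
        for q in questions
    ]


def print_comparison(rows):
    """Print the collected answers side by side for manual review."""
    for index, row in enumerate(rows, start=1):
        print(f"Q{index}: {row['question']}")
        print(f"  base : {row['base']}")
        print(f"  tuned: {row['tuned']}")
        print("-" * 30)


# Stand-in generators for demonstration only; replace with wrappers around model.generate.
demo_rows = compare_answers(
    ["What problem does quantization address in large language models?"],
    base_generate=lambda q: "(base model answer goes here)",
    tuned_generate=lambda q: "(fine-tuned model answer goes here)",
)
print_comparison(demo_rows)
```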

**Training Summary:**
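- Model: LLaMA 3 8B (`unsloth/llama-3-8b-bnb-4bit`) with QLoRA
- Dataset: 501 academic Q&A pairs
- Training steps: 60 at an effective batch size of 8 (just under one epoch)
- Trainable parameters: 41,943,040 (0.52% of all parameters)
- Memory efficiency: 4-bit quantization + LoRA adapters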

In [28]:
!zip -r llama3-academic-qa.zip llama3-academic-qa

  adding: llama3-academic-qa/ (stored 0%)
  adding: llama3-academic-qa/README.md (deflated 65%)
  adding: llama3-academic-qa/adapter_config.json (deflated 57%)
  adding: llama3-academic-qa/adapter_model.safetensors (deflated 8%)
  adding: llama3-academic-qa/special_tokens_map.json (deflated 71%)
  adding: llama3-academic-qa/tokenizer.json (deflated 85%)
  adding: llama3-academic-qa/tokenizer_config.json (deflated 96%)
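
Files in a Colab session are lost when the runtime disconnects, so it may be worth copying the archive to the Drive mounted at the start of the notebook. The sketch below uses only the Python standard library; `archive_and_copy` is a hypothetical helper, and `/content/drive/MyDrive` is assumed to be the Drive folder under the `/content/drive` mount point.

```python
import shutil
from pathlib import Path


def archive_and_copy(model_dir, dest_dir):
    """Zip `model_dir` next to itself and copy the archive into `dest_dir`."""
    source = Path(model_dir).resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Produces <model_dir>.zip with the folder name as the top-level entry.
    archive_path = shutil.make_archive(
        base_name=str(source),
        format="zip",
        root_dir=str(source.parent),
        base_dir=source.name,
    )
    return Path(shutil.copy2(archive_path, target_dir))


# Example call in Colab (destination path is an assumption):
# archive_and_copy("llama3-academic-qa", "/content/drive/MyDrive")
```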
