# 🧠 Fine-Tuning LLaMA to be a Financial Advisor (Supports LoRA and QLoRA)

This notebook demonstrates best practices for fine-tuning a LLaMA model using **LoRA** or **QLoRA** with [Unsloth](https://unsloth.ai/), based on the [financial-advisor-100 dataset](https://www.superteams.ai/blog/guide-to-fine-tune-your-llm-for-building-your-own-financial-advisor).

## Environment Set-up

In [9]:
%%capture
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install -q unsloth

## Load Base Model (LoRA vs QLoRA)


In [10]:
from unsloth import FastLanguageModel
from google.colab import userdata
import torch

# Choose method: "LoRA" or "QLoRA"
fine_tuning_method = "QLoRA"

if fine_tuning_method == "LoRA":
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="meta-llama/Llama-3.2-3B",
        max_seq_length=2048,
        dtype=torch.float16,
        load_in_4bit=False,
        token=userdata.get("HF_TOKEN")
    )
elif fine_tuning_method == "QLoRA":
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
        max_seq_length=2048,
        dtype=torch.float16,
        load_in_4bit=True,
        token=userdata.get("HF_TOKEN")
    )


==((====))==  Unsloth 2025.6.9: Fast Llama patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

## LoRA/QLoRA Configuration

In [11]:
# LoRA fine-tuning configuration for FastLanguageModel using PEFT
# Reference: https://huggingface.co/docs/peft/v0.11.0/en/package_reference/lora#peft.LoraConfig

from peft import LoraConfig

# Target transformer modules to inject LoRA adapters into
# These are common projection layers in attention and MLP blocks of transformer models
target_modules = [
    "q_proj",     # Query projection (self-attention)
    "k_proj",     # Key projection
    "v_proj",     # Value projection
    "o_proj",     # Output projection from attention
    "gate_proj",  # Gating layer in FFN
    "up_proj",    # Feedforward network up-projection
    "down_proj"   # Feedforward network down-projection
]

# Flag to determine whether to include token embedding layer for training (e.g., when adding special tokens)
train_embeddings = False
if train_embeddings:
    target_modules.append("lm_head")  # lm_head is typically the output embedding layer

# LoRA configuration dictionary for clarity and reuse
lora_config = {
    "r": 16,  # Rank of the low-rank adapter matrices. Lower = smaller adapter, less compute.
              # According to the original LoRA paper, small r (e.g. 4–16) performs well.

    "target_modules": target_modules,  # List of layer names to apply LoRA to

    "lora_alpha": 16,  # Scaling factor for the LoRA weights.
                       # Larger alpha increases the impact of adapters (defaults often 16–64).

    "lora_dropout": 0.0,  # Dropout rate applied to the LoRA layers during training.
                          # Unsloth suggests 0.0 for best performance unless overfitting.

    "bias": "none",  # "none": no bias training; saves memory and computation.
                     # Alternatives: "all" or "lora_only" (rarely used in practice).

    "use_gradient_checkpointing": "unsloth",  # Activates gradient checkpointing. Saves VRAM by recomputing intermediate activations.
                                              # "unsloth" enables long-context training optimally.

    "random_state": 3407,  # Seed for reproducibility (affects adapter initialization).

    "use_rslora": False,  # Whether to scale LoRA weights by 1/sqrt(r) (recommended by HF in some cases).
                          # Often disabled when using tuned alpha directly.

    "loftq_config": None,  # Used only if integrating with LoftQ (quantization-aware LoRA).
                           # Leave as None for standard full-precision LoRA.
}

# Inject LoRA adapters into the base model using PEFT
model = FastLanguageModel.get_peft_model(
    model,
    **lora_config
)

# Confirm the configured target modules
print("LoRA modules applied to:", model.peft_config["default"].target_modules)


Unsloth 2025.6.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


LoRA modules applied to: {'v_proj', 'o_proj', 'up_proj', 'k_proj', 'down_proj', 'q_proj', 'gate_proj'}


## Load & Format Dataset

In [12]:
from datasets import load_dataset

# Load the dataset (forcing redownload to avoid cache bugs)
dataset = load_dataset(
    "nihiluis/financial-advisor-100",
    split="train",
    download_mode="force_redownload"
)

print(dataset.column_names)

# Reformat to instruction-tuning format
def format_instruction(example):
    return {
        "instruction": example["question"],
        "input": "",  # optional input if needed
        "output": example["answer"]
    }

dataset = dataset.map(format_instruction)
dataset = dataset.remove_columns([col for col in dataset.column_names if col not in ["instruction", "input", "output"]])

(…)-00000-of-00001-f0708e72202ddcaa.parquet:   0%|          | 0.00/321k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100 [00:00<?, ? examples/s]

['id', 'question', 'answer', 'text']


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

## Tokenization

In [13]:
# Tokenize each sample into prompt format
def tokenize(example):
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
      You are a helpful and knowledgeable financial advisor specializing in providing clear, actionable advice to a wide range of financial questions.<|eot_id|>
      <|start_header_id|>user<|end_header_id|>
      {example["instruction"]}<|eot_id|>
      <|start_header_id|>assistant<|end_header_id|>
      {example["output"]}<|eot_id|>
      """


    return tokenizer(prompt, truncation=True, padding="max_length", max_length=2048)

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

## Training

In [14]:
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./financial_llama",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="adamw_8bit" if fine_tuning_method == "QLoRA" else "adamw_torch",
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    seed=42,
    save_strategy="no",
    report_to="none"
)

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()

  trainer = Trainer(
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100 | Num Epochs = 3 | Total steps = 39
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,000,000,000 (0.52% trained)


Step,Training Loss
10,1.8089
20,1.5485
30,1.4997


TrainOutput(global_step=39, training_loss=1.5826747111785107, metrics={'train_runtime': 1595.7705, 'train_samples_per_second': 0.188, 'train_steps_per_second': 0.024, 'total_flos': 2.78207731335168e+16, 'train_loss': 1.5826747111785107, 'epoch': 3.0})

## Inference

In [15]:
from transformers import TextStreamer

def ask_financial_question(question: str, max_new_tokens: int = 200):
    FastLanguageModel.for_inference(model)
    system_prompt = "You are a helpful and knowledgeable financial advisor."

    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{question}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    print("📩 Answer:")
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)

ask_financial_question("How should I prioritize paying off debt vs investing?")

📩 Answer:
<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful and knowledgeable financial advisor.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
How should I prioritize paying off debt vs investing?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Prioritizing debt over investment is a common dilemma. It's important to consider both the urgency and the long-term implications of each. Here are some general guidelines:

1. **High Interest Debt:** If you have debt with high interest rates (usually above 10-15%), it's often wise to prioritize paying off these debts first. This is because the interest you're paying is essentially a cost of not investing. For example, if you're paying 18% interest on a credit card, it's like throwing away money to keep that debt.

2. **Emergency Fund:** Before you start investing, it's a good idea to have an emergency fund that covers 3-6 months of living expenses. This fund will help you avoid going int

## Save & Push

In [16]:
if fine_tuning_method == "LoRA":

    model.save_pretrained("./lora-financial-advisor")
    tokenizer.save_pretrained("./lora-financial-advisor")

    # Push to hugging face
    model.push_to_hub("jordynojeda/Llama-3.2-3B-financial-advisor-lora", token = userdata.get("HF_TOKEN"))

elif fine_tuning_method == "QLoRA":

    model.save_pretrained("./qlora-financial-advisor")
    tokenizer.save_pretrained("./qlora-financial-advisor")

    # Push to hugging face
    model.push_to_hub("jordynojeda/Meta-Llama-3.1-8B-Instruct-bnb-4bit-financial-advisor-qlora", token = userdata.get("HF_TOKEN"))

README.md:   0%|          | 0.00/610 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/jordynojeda/Meta-Llama-3.1-8B-Instruct-bnb-4bit-financial-advisor-lora


## ✅ Summary

- ✅ Supports both **LoRA** and **QLoRA** fine-tuning
- ✅ Based on Unsloth for efficient training
- ✅ Uses Hugging Face datasets and trainer for ease
- ✅ Outputs a Hugging Face-ready adapter model

Test and compare performance between `LoRA` and `QLoRA` by switching `fine_tuning_method` at the top.