# Fine-Tuning an LLM for Financial Feature Extraction

## Context
This is the most directly job-relevant notebook. The Numerai "Quantitative AI Researcher" role involves fine-tuning LLMs on financial corpora to create features. This notebook demonstrates the approach using LoRA for parameter-efficient fine-tuning.

## My Experience
- **Creyon Bio**: Fine-tuned large protein models for splice-site prediction. Pre-training, fine-tuning, and benchmarking.
- **DPO Fine-tuning** (this week): Custom Trainer, Dataset, and DataCollator classes for DPO on Qwen 32B with LoRA.
- **TorchLeet**: Implemented LoRA from scratch, so I understand low-rank adaptation at the mathematical level.

## Pipeline
Financial text corpus → LoRA fine-tune causal LM → Extract embeddings → Compare vanilla vs fine-tuned features

## Hypothesis
A general LLM (Gemma-3, Llama, Mistral) has broad language understanding but little financial specialization. Fine-tuning on financial text should adapt the model's representations to encode financially relevant patterns, producing better features for stock prediction.

In [None]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import Dataset
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

## 1. Load Base Model
We use Gemma-3-270M-IT (270M params) for fast iteration. The same approach scales to Llama-3, Mistral, etc.

In [None]:
model_name = "google/gemma-3-270m-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)

base_model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
print(f"Base model: {model_name}")
print(f"Parameters: {sum(p.numel() for p in base_model.parameters()):,}")

## 2. Financial Text Corpus
In production: SEC filings, financial news, earnings call transcripts, analyst reports.
Here we use synthetic financial text that mimics 10-K language.

In [None]:
# Synthetic financial text corpus for fine-tuning
# Mimics SEC filing language, earnings reports, and financial news
financial_texts = [
    "The company reported total revenue of $89.5 billion for the fiscal year, representing a 12% increase year-over-year driven by strong performance in the cloud computing segment and continued growth in digital advertising revenue.",
    "Operating expenses increased to $45.2 billion due to higher research and development spending, increased headcount in artificial intelligence divisions, and expanded data center infrastructure investments across three new regions.",
    "Net income attributable to common shareholders was $21.3 billion, or $5.67 per diluted share, compared to $18.9 billion, or $4.89 per diluted share, in the prior year period.",
    "The Board of Directors declared a quarterly dividend of $0.82 per share and authorized an additional $50 billion share repurchase program, reflecting confidence in the company's long-term growth prospects.",
    "Gross margin contracted by 150 basis points to 42.3% primarily due to unfavorable product mix shift toward lower-margin hardware products and increased component costs from supply chain constraints.",
    "The company completed the acquisition of a leading cybersecurity firm for $28 billion in an all-stock transaction, expanding its enterprise security offerings and adding approximately 4,000 employees.",
    "Free cash flow for the quarter was $15.8 billion, up 18% from the prior year, driven by improved working capital management and higher operating income partially offset by increased capital expenditures.",
    "Management expects revenue growth in the range of 8-10% for the coming fiscal year, with operating margins expected to expand by 50-75 basis points as restructuring benefits are fully realized.",
    "The company's debt-to-equity ratio improved to 0.45 from 0.62 in the prior year as the company used excess cash flow to retire $12 billion in long-term debt obligations.",
    "Research and development expenses totaled $22.1 billion, representing 24.7% of total revenue, as the company increased investment in generative AI, autonomous systems, and quantum computing initiatives.",
    "The company recorded a goodwill impairment charge of $4.2 billion related to its consumer electronics segment, reflecting lower-than-expected market demand and increased competitive pressures.",
    "International revenue grew 15% on a constant currency basis, with particularly strong performance in Asia-Pacific markets where the company expanded its distribution network and launched localized products.",
    "The effective tax rate was 18.5%, benefiting from the recognition of research and development tax credits and income earned in lower-tax jurisdictions through the company's global operating structure.",
    "Customer retention rates improved to 94.2% from 91.8% in the prior year, driven by enhanced product features, improved customer support, and the introduction of a premium loyalty program.",
    "The company issued $8 billion in investment-grade bonds with maturities ranging from 5 to 30 years, taking advantage of favorable interest rate conditions to refinance existing debt at lower rates.",
    "Same-store sales increased 3.2% for the quarter, with digital channel revenue growing 28% year-over-year as the company's omnichannel strategy continued to gain traction with consumers.",
    "Inventory levels increased 22% from the prior year as the company built safety stock to mitigate ongoing supply chain disruptions and prepare for anticipated seasonal demand.",
    "The company announced a strategic restructuring plan expected to result in annual cost savings of $2.5 billion by fiscal year 2025, including workforce reductions of approximately 7% of total employees.",
    "Cloud infrastructure revenue reached $18.3 billion for the quarter, up 29% year-over-year, as enterprise customers accelerated their digital transformation initiatives and adopted AI-powered cloud services.",
    "The company's patent portfolio expanded to over 45,000 active patents globally, providing significant intellectual property protection across its core technology platforms and emerging innovation areas.",
    "Working capital requirements increased by $3.1 billion due to extended payment terms offered to key enterprise customers and higher accounts receivable balances from the rapid growth in subscription services.",
    "The audit committee identified material weaknesses in internal controls over financial reporting related to revenue recognition processes, requiring management to implement remediation measures.",
    "Backlog increased to $142 billion, up 8% from the prior quarter, providing strong revenue visibility for the next 18-24 months across defense, space, and commercial aviation segments.",
    "The company's credit rating was upgraded to AA- by Standard and Poor's, reflecting improved profitability, reduced leverage, and strong competitive positioning in its core markets.",
    "Capital expenditures totaled $12.4 billion for the fiscal year, primarily directed toward new manufacturing facilities, data center expansion, and automation of existing production lines.",
]

# Create HuggingFace Dataset
dataset = Dataset.from_dict({"text": financial_texts})
print(f"Financial corpus: {len(financial_texts)} passages")
print(f"Total words (whitespace-split, rough proxy for tokens): {sum(len(t.split()) for t in financial_texts)}")

## 3. Apply LoRA Adapters
LoRA (Low-Rank Adaptation) freezes the base model weights and adds small trainable low-rank matrices alongside selected layers. This is parameter-efficient and, because the original weights stay fixed, reduces the risk of catastrophic forgetting.
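
To make the low-rank update concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The layer size (512) and the initialization scale are illustrative assumptions, not values read from Gemma-3; `r` and `alpha` mirror the config used in this notebook. The frozen weight `W` is never modified, and only `A` and `B` would receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration); r and alpha match the notebook's LoRA config
d_in, d_out, r, alpha = 512, 512, 8, 16

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable, small random init (assumed scale)
B = np.zeros((d_out, r))                      # trainable, zero init so the update starts at 0


def lora_forward(x, W, A, B, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x) without forming the full B @ A matrix."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))                # batch of 4 input vectors
y0 = lora_forward(x, W, A, B, alpha, r)

full_params = W.size                          # parameters in the frozen matrix
lora_params = A.size + B.size                 # parameters in the adapter
print(f"Full layer: {full_params:,} params; LoRA adapter: {lora_params:,} "
      f"({lora_params / full_params:.1%})")
```

Because `B` starts at zero, the adapted layer reproduces the frozen layer exactly before training, and each adapted matrix adds `r * (d_in + d_out)` trainable parameters.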

In [None]:
try:
    from peft import LoraConfig, get_peft_model, TaskType

    # LoRA configuration
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                    # Rank of the low-rank matrices
        lora_alpha=16,          # Scaling factor
        lora_dropout=0.1,       # Dropout for regularization
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Gemma-3's attention projection layers
    )

    # Apply LoRA to the model
    ft_model = get_peft_model(base_model, lora_config)
    
    trainable = sum(p.numel() for p in ft_model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in ft_model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable/total:.2%})")
    print(f"LoRA rank: {lora_config.r}")
    print(f"Target modules: {lora_config.target_modules}")
    
    HAS_PEFT = True
except ImportError:
    print("peft not installed. Run: uv add peft")
    print("Skipping LoRA fine-tuning; showing concept only.")
    HAS_PEFT = False

## 4. Fine-Tune on Financial Corpus

In [None]:
if HAS_PEFT:
    # Tokenize the dataset
    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=256, padding="max_length")
    
    tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
    
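    # For causal LM, labels mirror input_ids. Padded positions are set to -100
    # so the loss ignores them (sequences are padded to max_length above).
    def add_labels(examples):
        examples["labels"] = [
            [tok if keep else -100 for tok, keep in zip(ids, mask)]
            for ids, mask in zip(examples["input_ids"], examples["attention_mask"])
        ]
        return examples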
    
    tokenized_dataset = tokenized_dataset.map(add_labels, batched=True)
    tokenized_dataset.set_format("torch")
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir="/tmp/numerai-financial-lm",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        weight_decay=0.01,
        logging_steps=5,
        save_strategy="no",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    
    # Train
    trainer = Trainer(
        model=ft_model,
        args=training_args,
        train_dataset=tokenized_dataset,
    )
    
    print("Fine-tuning with LoRA on financial corpus...")
    train_result = trainer.train()
    print(f"Training loss: {train_result.training_loss:.4f}")
    print("Fine-tuning complete!")
else:
    print("Skipping fine-tuning (peft not installed)")
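
The training cell pads every passage to a fixed length, so it helps to see how the next-token loss treats padded positions. The following is a minimal NumPy sketch of a shifted cross-entropy with an ignore index of -100, which is the convention Transformers causal LM heads follow. The function name `causal_lm_loss`, the toy vocabulary size, and the random logits are assumptions made for illustration only.

```python
import numpy as np

IGNORE_INDEX = -100  # label value that the loss skips


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy for one sequence.

    logits: (seq_len, vocab_size) float array; labels: (seq_len,) int array.
    """
    shift_logits = logits[:-1]                 # position t predicts token t + 1
    shift_labels = labels[1:]
    rows = np.flatnonzero(shift_labels != IGNORE_INDEX)
    # Numerically stable log-softmax over the vocabulary
    z = shift_logits - shift_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = log_probs[rows, shift_labels[rows]]
    return float(-picked.mean())


rng = np.random.default_rng(0)
vocab, real_len, pad_len, pad_id = 50, 6, 4, 0   # toy sizes (assumed)

tokens = rng.integers(1, vocab, size=real_len)
logits_real = rng.normal(size=(real_len, vocab))

# Pad to a fixed length, as padding="max_length" does
padded_tokens = np.concatenate([tokens, np.full(pad_len, pad_id)])
padded_logits = np.concatenate([logits_real, rng.normal(size=(pad_len, vocab))])

labels_unmasked = padded_tokens.copy()
labels_masked = padded_tokens.copy()
labels_masked[real_len:] = IGNORE_INDEX

loss_ref = causal_lm_loss(logits_real, tokens)              # no padding at all
loss_masked = causal_lm_loss(padded_logits, labels_masked)  # padding ignored
loss_unmasked = causal_lm_loss(padded_logits, labels_unmasked)
print(f"reference={loss_ref:.4f} masked={loss_masked:.4f} unmasked={loss_unmasked:.4f}")
```

With masked labels the padded sequence gives the same loss as the unpadded one. With unmasked labels the average also includes predictions of the pad token, which matters when short passages are padded to 256 tokens.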

## 5. Compare Embeddings: Vanilla vs Fine-Tuned

In [None]:
def extract_embeddings(model, tokenizer, texts, layer=-1):
    """Extract last-token hidden states from a specific layer."""
    model.eval()
    device = next(model.parameters()).device  # Trainer may have moved the model to GPU
    embeddings = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128).to(device)
            outputs = model(**inputs, output_hidden_states=True)
            # Use specified layer, last token; cast to float32 on CPU before converting to NumPy
            hidden = outputs.hidden_states[layer][0, -1, :].float().cpu().numpy()
            embeddings.append(hidden)
    return np.stack(embeddings)

# Test texts for comparison
test_texts = [
    "Revenue increased 15% driven by strong product demand",
    "Company reported significant quarterly loss",
    "New CEO appointment signals strategic direction change",
    "Dividend increased reflecting strong cash flow generation",
    "Regulatory investigation into accounting practices",
    "Market share gains in cloud computing segment",
    "Supply chain disruptions impacting production targets",
    "Record free cash flow enables debt reduction",
    "Workforce reduction announced as part of restructuring",
    "Patent portfolio strengthens competitive moat",
]
test_labels = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]  # 1=positive, 0=negative or neutral (hand-assigned)

# Extract embeddings from both models
vanilla_model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)  # fresh copy without LoRA adapters
vanilla_emb = extract_embeddings(vanilla_model, tokenizer, test_texts)

if HAS_PEFT:
    ft_emb = extract_embeddings(ft_model, tokenizer, test_texts)
else:
    # Simulate fine-tuned embeddings with small perturbation for illustration
    ft_emb = vanilla_emb + np.random.normal(0, 0.1, vanilla_emb.shape)

print(f"Vanilla embeddings shape: {vanilla_emb.shape}")
print(f"Fine-tuned embeddings shape: {ft_emb.shape}")
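
Before plotting, it can be useful to quantify how far each text moved between the two models. The sketch below computes a per-text cosine shift on stand-in arrays that have the same layout as `vanilla_emb` and `ft_emb` (rows are texts, columns are hidden dimensions). The helper `rowwise_cosine` and the random demo data are assumptions for illustration, not outputs of the models above.

```python
import numpy as np


def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a_n * b_n).sum(axis=1)


# Stand-in arrays shaped like (n_texts, hidden_dim); replace with vanilla_emb / ft_emb
rng = np.random.default_rng(0)
demo_vanilla = rng.normal(size=(10, 32))
demo_ft = demo_vanilla + rng.normal(scale=0.1, size=(10, 32))

shift = 1.0 - rowwise_cosine(demo_vanilla, demo_ft)  # 0 means the direction is unchanged
most_moved = np.argsort(shift)[::-1][:3]             # indices of the three largest shifts
print("Mean shift:", shift.mean())
print("Most-moved text indices:", most_moved)
```

Applied to the real embeddings, a near-zero shift for every text would suggest that the fine-tuning run barely changed the last-layer representation.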

In [None]:
# PCA visualization (each model gets its own PCA fit, so compare cluster structure, not axis positions)
pca_v = PCA(n_components=2).fit_transform(vanilla_emb)
pca_f = PCA(n_components=2).fit_transform(ft_emb)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
labels_arr = np.array(test_labels)

for ax, emb_2d, title in [(axes[0], pca_v, "Vanilla Gemma-3"), (axes[1], pca_f, "LoRA Fine-Tuned")]:
    for label, color, name in [(1, 'green', 'Positive'), (0, 'red', 'Negative')]:
        mask = labels_arr == label
        ax.scatter(emb_2d[mask, 0], emb_2d[mask, 1], c=color, label=name, 
                   s=100, edgecolors='black', linewidth=0.5, alpha=0.7)
        for i in np.where(mask)[0]:
            ax.annotate(test_texts[i][:25] + "...", (emb_2d[i, 0], emb_2d[i, 1]),
                       fontsize=6, alpha=0.7, ha='center', va='bottom')
    ax.set_title(title, fontsize=13, fontweight='bold')
    ax.legend()

plt.suptitle("Embedding Space: Vanilla vs Financial Fine-Tuned", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
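
The two panels above use independently fitted projections, so their axes are not directly comparable. One possible alternative, sketched here with plain NumPy, is to fit a single 2-D basis on the union of both embedding sets and project each set onto it. The function name `shared_pca_2d` and the random demo arrays are hypothetical.

```python
import numpy as np


def shared_pca_2d(emb_a, emb_b):
    """Project two embedding sets onto the top-2 principal axes of their union."""
    stacked = np.vstack([emb_a, emb_b])
    mean = stacked.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered, stacked data
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    basis = vt[:2].T                           # shape: (hidden_dim, 2)
    return (emb_a - mean) @ basis, (emb_b - mean) @ basis


# Stand-in data; in the notebook these would be vanilla_emb and ft_emb
rng = np.random.default_rng(0)
demo_a = rng.normal(size=(10, 32))
demo_b = demo_a + rng.normal(scale=0.1, size=(10, 32))

proj_a, proj_b = shared_pca_2d(demo_a, demo_b)
print(proj_a.shape, proj_b.shape)
```

With a shared basis, a point that barely moves between panels reflects a text whose embedding barely changed, which the separate fits cannot show.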

In [None]:
# Compare similarity structures
sim_vanilla = cosine_similarity(vanilla_emb)
sim_ft = cosine_similarity(ft_emb)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for ax, sim, title in [(axes[0], sim_vanilla, "Vanilla"), (axes[1], sim_ft, "Fine-Tuned")]:
    im = ax.imshow(sim, cmap='RdYlBu_r', vmin=-1, vmax=1)
    ax.set_title(title)
    # Label with +/-
    labels_str = ["+" if l == 1 else "-" for l in test_labels]
    ax.set_xticks(range(len(labels_str)))
    ax.set_yticks(range(len(labels_str)))
    ax.set_xticklabels(labels_str)
    ax.set_yticklabels(labels_str)

plt.colorbar(im, ax=axes, label="Cosine Similarity")
plt.suptitle("Similarity Structure: Do Positive/Negative Texts Cluster?", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
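
The heatmaps answer the clustering question only visually. A rough numeric companion is the gap between mean within-class and mean between-class cosine similarity, sketched below. The function `class_separation` and the toy embeddings are assumptions for illustration; with only ten hand-labeled texts, any gap measured on the real data should be read as a sanity check rather than evidence.

```python
import numpy as np


def class_separation(sim, labels):
    """Return (within, between, gap) mean cosine similarities for a labeled set."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)   # drop self-similarity entries
    within = sim[same & off_diag].mean()
    between = sim[~same].mean()
    return float(within), float(between), float(within - between)


# Toy example: two tight clusters pointing in nearly orthogonal directions
toy_labels = [1, 1, 0, 0]
toy_emb = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]])
unit = toy_emb / np.linalg.norm(toy_emb, axis=1, keepdims=True)
toy_sim = unit @ unit.T

within, between, gap = class_separation(toy_sim, toy_labels)
print(f"within={within:.3f} between={between:.3f} gap={gap:.3f}")
```

In the notebook, calling `class_separation(sim_vanilla, test_labels)` and `class_separation(sim_ft, test_labels)` would give one number per model to compare.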

## Discussion & Interview Talking Points

### My Direct Experience
- **Creyon Bio**: Fine-tuned large protein language models for splice-site prediction. Same pipeline: domain corpus → LoRA → task-relevant features.
- **DPO this week**: Built custom Trainer, Dataset, DataCollator for Qwen 32B DPO fine-tuning. Handled left-padding, response-length weighting, and label masking.
- **TorchLeet**: Implemented LoRA from scratch (rank decomposition of weight updates).

### Key Architectural Decisions for Numerai
1. **Base model choice**: Llama-3-8B vs Mistral-7B vs FinBERT. Probing (NB06) can inform this.
2. **Corpus construction**: Mix of SEC filings, news, earnings calls. Point-in-time correctness is critical.
3. **LoRA vs full fine-tuning**: LoRA for efficiency; full fine-tuning only if compute budget allows.
4. **Layer freezing strategy**: Freeze early layers (syntax), fine-tune middle/late layers (semantics).
5. **Evaluation**: Compare fine-tuned features against vanilla on Numerai Signals scoring metric.
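
Point-in-time correctness from the corpus construction item can be sketched as a simple availability filter: a document may feed a feature only if it was public before the prediction date. Everything named here (`Document`, `available_at`, `visible_documents`, the tickers, and the one-day lag) is hypothetical and stands in for whatever metadata a real corpus provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Document:
    ticker: str
    available_at: datetime   # when the text became public, not the period it describes
    text: str


def visible_documents(docs, as_of, lag=timedelta(0)):
    """Return documents that were public at least `lag` before `as_of`."""
    cutoff = as_of - lag
    return [d for d in docs if d.available_at <= cutoff]


docs = [
    Document("AAA", datetime(2023, 2, 1, 21, 30), "Q4 results text"),
    Document("AAA", datetime(2023, 5, 2, 21, 30), "Q1 results text"),
    Document("BBB", datetime(2023, 4, 28, 12, 0), "Annual report text"),
]

as_of = datetime(2023, 5, 1)
usable = visible_documents(docs, as_of, lag=timedelta(days=1))
print([(d.ticker, d.available_at.date().isoformat()) for d in usable])
```

The key design choice is filtering on the publication timestamp rather than the fiscal period, since a filing about an earlier quarter can appear weeks after that quarter ends.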

### Scaling Considerations
- **Common Crawl**: Petabytes of text. Need to filter → process → fine-tune in streaming fashion.
- **Infrastructure**: My NDIF paper used Ray GCS + AWS object storage + vLLM. Same stack applies.
- **Distributed training**: Experience with DDP and multi-GPU from Creyon Bio and NDIF.
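
The filter → process → fine-tune flow can be sketched with Python generators, so that no stage holds the whole corpus in memory. The keyword list, function names, and sample records below are hypothetical; a production filter would more likely be a trained classifier running on a distributed framework.

```python
from itertools import islice

# Hypothetical keyword filter, used only to make the sketch runnable
FINANCE_TERMS = ("revenue", "earnings", "dividend", "margin", "cash flow")


def read_records(lines):
    """Yield stripped, non-empty records one at a time."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def keep_financial(records, terms=FINANCE_TERMS, min_hits=1):
    """Yield records that mention at least `min_hits` finance terms."""
    for rec in records:
        lowered = rec.lower()
        if sum(term in lowered for term in terms) >= min_hits:
            yield rec


def in_batches(records, batch_size):
    """Group a stream into lists of up to `batch_size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


raw = [
    "Revenue rose 12% on higher cloud demand.",
    "",
    "Ten tips for a better sourdough starter.",
    "The board raised the quarterly dividend.",
    "Gross margin contracted by 150 basis points.",
]

batches = list(in_batches(keep_financial(read_records(raw)), batch_size=2))
for batch in batches:
    print(batch)
```

Each stage pulls one record at a time from the previous one, so the same structure works when `raw` is replaced by a file handle or a network stream.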

### Extensions (TODO)
- [ ] Fine-tune on real SEC filings from EDGAR
- [ ] Compare LoRA ranks (4, 8, 16, 32) — which is optimal for financial domain adaptation?
- [ ] Multi-task fine-tuning: CausalLM + sentiment classification head
- [ ] Curriculum learning: general financial text → sector-specific text → stock-specific text
- [ ] Compare embeddings from fine-tuned model across probing layers (combine with NB06)