# Fine-Tuning an LLM for Financial Feature Extraction

## Context
This is the most directly job-relevant notebook. The Numerai "Quantitative AI Researcher" role involves fine-tuning LLMs on financial corpora to create features. This notebook demonstrates the approach using LoRA for parameter-efficient fine-tuning.

## My Experience
- **Creyon Bio**: Fine-tuned large protein language models for splice-site prediction; the work covered pre-training, fine-tuning, and benchmarking.
- **DPO Fine-tuning** (this week): Custom Trainer, Dataset, and DataCollator classes for DPO on Qwen 32B with LoRA.
- **TorchLeet**: Implemented LoRA from scratch, so I understand low-rank adaptation at the mathematical level.

## Pipeline
Financial text corpus → LoRA fine-tune causal LM → Extract embeddings → Compare vanilla vs fine-tuned features

## Hypothesis
A general-purpose LLM (Gemma-3, Llama, Mistral) has broad language understanding but little financial specialization. Fine-tuning on financial text should adapt the model's representations to encode financially relevant patterns, which may produce better features for stock prediction.

In [None]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import Dataset
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

## 1. Load Base Model
We use Gemma-3-270M-IT (270M params) for fast iteration. The same approach scales to Llama-3, Mistral, etc.

In [None]:
# TODO: implement
...

## 2. Financial Text Corpus
In production: SEC filings, financial news, earnings call transcripts, analyst reports.
Here we use synthetic financial text that mimics 10-K language.

In [None]:
# Synthetic financial text corpus for fine-tuning
# Mimics SEC filing language, earnings reports, and financial news

# TODO: implement
...
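
One way to build a small synthetic corpus is template filling with a fixed random seed. The sketch below is only an illustration: the templates, vocabulary lists, and the `make_synthetic_corpus` helper are all hypothetical and are not taken from real filings or from the notebook's own implementation.

```python
import random

# Hypothetical templates and vocabulary; illustrative only, not real filing text.
TEMPLATES = [
    "The Company's {metric} {direction} {pct}% in fiscal {year}, primarily due to {driver}.",
    "Management expects {metric} to be affected by {driver} during fiscal {year}.",
    "Risk factors include {driver}, which could adversely affect our {metric}.",
]
METRICS = ["net revenue", "operating margin", "free cash flow", "gross profit"]
DIRECTIONS = ["increased", "decreased"]
DRIVERS = [
    "higher interest rates",
    "supply chain disruptions",
    "foreign currency fluctuations",
    "increased competition",
]


def make_synthetic_corpus(n_docs, seed=0):
    """Return n_docs template-filled sentences; the same seed gives the same corpus."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_docs):
        template = rng.choice(TEMPLATES)
        # str.format ignores keyword arguments a template does not use.
        corpus.append(
            template.format(
                metric=rng.choice(METRICS),
                direction=rng.choice(DIRECTIONS),
                pct=rng.randint(1, 25),
                year=rng.randint(2015, 2023),
                driver=rng.choice(DRIVERS),
            )
        )
    return corpus


financial_texts = make_synthetic_corpus(200)
print(financial_texts[0])
```

The resulting list could then be wrapped with `Dataset.from_dict({"text": financial_texts})` for the fine-tuning step. Template text like this has very low diversity, so it is only suitable for checking that the pipeline runs.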

## 3. Apply LoRA Adapters
LoRA (Low-Rank Adaptation) freezes the base model's weights and adds small trainable low-rank matrices alongside them. This is parameter-efficient and reduces the risk of catastrophic forgetting.

In [None]:
# TODO: implement
...
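
The arithmetic behind a LoRA adapter can be sketched without any deep learning library. The example assumes the common formulation in which the effective weight is `W0 + (alpha / r) * B @ A`, with `B` initialized to zero. The helper names and the tiny dimensions are hypothetical. In the notebook itself, the `peft` library (`LoraConfig`, `get_peft_model`) would more likely do this work.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, lora_a, lora_b, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(lora_a)
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


rng = random.Random(0)
d_out, d_in, r = 4, 6, 2  # toy sizes, chosen only for readability

w0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen base weight
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # trainable, small random init
lora_b = [[0.0] * r for _ in range(d_out)]                              # trainable, zero init

w_eff = lora_effective_weight(w0, lora_a, lora_b, alpha=16)
# With B at zero the adapter is a no-op, so training starts from the base model.
assert w_eff == w0
```

Only `lora_a` and `lora_b` would receive gradients; `w0` is never modified, which is why the adapter can be removed or swapped after training.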

## 4. Fine-Tune on Financial Corpus

In [None]:
def tokenize_function(examples):
    ...

def add_labels(examples):
    ...
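
A minimal sketch of the label step for causal-LM fine-tuning. It assumes the batch is a dict of lists, which is the shape `Dataset.map(batched=True)` passes to its function. The name `add_causal_lm_labels` is hypothetical and need not match the notebook's `add_labels`. Labels are a copy of `input_ids` because Hugging Face causal-LM models shift labels internally; padded positions are set to `-100` so the loss ignores them.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def add_causal_lm_labels(batch):
    """Copy input_ids into labels, masking padded positions with IGNORE_INDEX."""
    batch["labels"] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch


# Toy batch: token ids are made up; 0 stands in for the pad token id.
batch = {
    "input_ids": [[5, 17, 42, 0, 0], [8, 9, 10, 11, 12]],
    "attention_mask": [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]],
}
out = add_causal_lm_labels(batch)
print(out["labels"])  # [[5, 17, 42, -100, -100], [8, 9, 10, 11, 12]]
```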


## 5. Compare Embeddings: Vanilla vs Fine-Tuned

In [None]:
def extract_embeddings(model, tokenizer, texts, layer=-1):
    """Extract last-token hidden states from a specific layer."""
    ...
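
One detail that matters for last-token extraction is padding. With right-padded batches, position `-1` holds a pad token for every sequence shorter than the longest one. The sketch below uses a hypothetical helper, `last_token_indices`, to find the last real token per sequence from the attention mask. Those indices could then select rows from the chosen entry of `hidden_states`, which the model returns when called with `output_hidden_states=True`.

```python
def last_token_indices(attention_mask):
    """Return, for each row, the index of the last position whose mask is 1."""
    indices = []
    for row in attention_mask:
        real_positions = [i for i, m in enumerate(row) if m == 1]
        if not real_positions:
            raise ValueError("sequence has no real tokens")
        indices.append(real_positions[-1])
    return indices


right_padded = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
left_padded = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]

print(last_token_indices(right_padded))  # [2, 4]
print(last_token_indices(left_padded))   # [4, 4]
```

With left padding the last real token always sits at the final position, which is one reason left padding is often preferred for decoder-only models.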


In [None]:
# PCA visualization

# TODO: implement
...

In [None]:
# Compare similarity structures

# TODO: implement
...
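
One possible way to compare similarity structures is to build a cosine-similarity matrix for each model and correlate their upper triangles, in the style of representational similarity analysis. This sketch is dependency-free and uses made-up 2-D vectors in place of real embeddings. The helper names are hypothetical, and the notebook itself would likely use `sklearn.metrics.pairwise.cosine_similarity`, which it already imports.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity for a list of equal-length vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def upper_triangle(matrix):
    """Flatten the entries above the diagonal (each unordered pair once)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy 2-D "embeddings" for four texts (made-up numbers, not model outputs).
vanilla = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
finetuned = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.3, 0.7]]

sim_vanilla = cosine_similarity_matrix(vanilla)
sim_finetuned = cosine_similarity_matrix(finetuned)
rsa = pearson(upper_triangle(sim_vanilla), upper_triangle(sim_finetuned))
print(f"similarity-structure correlation: {rsa:.3f}")
```

A correlation near 1 would suggest fine-tuning left the pairwise geometry largely intact; a lower value indicates the relationships between texts were reorganized. Neither outcome alone says whether the features are better for prediction.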

## Discussion & Interview Talking Points

### My Direct Experience
- **Creyon Bio**: Fine-tuned large protein language models for splice-site prediction. Same pipeline: domain corpus → LoRA → task-relevant features.
- **DPO this week**: Built custom Trainer, Dataset, DataCollator for Qwen 32B DPO fine-tuning. Handled left-padding, response-length weighting, and label masking.
- **TorchLeet**: Implemented LoRA from scratch (rank decomposition of weight updates).

### Key Architectural Decisions for Numerai
1. **Base model choice**: Llama-3-8B vs Mistral-7B vs FinBERT. Probing (NB06) can inform this.
2. **Corpus construction**: Mix of SEC filings, news, earnings calls. Point-in-time correctness is critical.
3. **LoRA vs full fine-tuning**: LoRA for efficiency; full fine-tuning only if compute budget allows.
4. **Layer freezing strategy**: Freeze early layers, which tend to capture syntax, and fine-tune middle/late layers, which tend to capture semantics.
5. **Evaluation**: Compare fine-tuned features against vanilla features on the Numerai Signals scoring metric.
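
The point-in-time requirement from the corpus-construction item can be sketched as a filter on publication date: a feature computed as of date `t` may only use text that was public on or before `t`. The `Document` class, the tickers, and the dates are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    ticker: str
    published: date  # when the text became publicly available
    text: str


def point_in_time_view(documents, as_of):
    """Keep only documents that were already public on the as_of date."""
    return [d for d in documents if d.published <= as_of]


# Made-up records for illustration.
docs = [
    Document("AAA", date(2021, 2, 15), "FY2020 annual report text"),
    Document("AAA", date(2022, 2, 14), "FY2021 annual report text"),
    Document("BBB", date(2021, 11, 3), "Q3 2021 earnings call text"),
]

visible = point_in_time_view(docs, as_of=date(2021, 12, 31))
print([(d.ticker, d.published.isoformat()) for d in visible])
# [('AAA', '2021-02-15'), ('BBB', '2021-11-03')]
```

The key design choice is to filter on the availability date rather than the fiscal period the document describes, since a report about 2021 that was filed in 2022 would otherwise leak future information.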

### Scaling Considerations
- **Common Crawl**: Petabytes of text. Need to filter → process → fine-tune in streaming fashion.
- **Infrastructure**: My NDIF paper used Ray GCS + AWS object storage + vLLM. Same stack applies.
- **Distributed training**: Experience with DDP and multi-GPU from Creyon Bio and NDIF.
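
The filter → process → batch flow mentioned for Common Crawl can be sketched with plain Python generators, so that no stage holds the full corpus in memory. The keyword list and the stage names are hypothetical, and a real pipeline would need far stronger filtering than keyword matching.

```python
from itertools import islice

FINANCE_KEYWORDS = ("revenue", "earnings", "dividend", "10-k")  # hypothetical filter terms


def keep_financial(records):
    """Yield only records that mention at least one finance keyword."""
    for text in records:
        lowered = text.lower()
        if any(k in lowered for k in FINANCE_KEYWORDS):
            yield text


def normalize(records):
    """Collapse runs of whitespace, one record at a time."""
    for text in records:
        yield " ".join(text.split())


def batched(records, batch_size):
    """Group a stream into lists of at most batch_size items."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Made-up records standing in for a streamed crawl.
raw_stream = iter([
    "Quarterly   revenue rose on strong demand.",
    "Ten tips for growing tomatoes.",
    "The board declared a dividend\nof $0.25 per share.",
    "Earnings guidance was withdrawn.",
])

for batch in batched(normalize(keep_financial(raw_stream)), batch_size=2):
    print(batch)
```

Each stage pulls one record at a time from the stage before it, so memory use is bounded by the batch size rather than the corpus size.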

### Extensions (TODO)
- [ ] Fine-tune on real SEC filings from EDGAR
- [ ] Compare LoRA ranks (4, 8, 16, 32) — which is optimal for financial domain adaptation?
- [ ] Multi-task fine-tuning: CausalLM + sentiment classification head
- [ ] Curriculum learning: general financial text → sector-specific text → stock-specific text
- [ ] Compare embeddings from fine-tuned model across probing layers (combine with NB06)
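
For the LoRA rank comparison, the parameter cost of each rank can be computed before any training run. For one adapted weight of shape `d_out × d_in`, the two LoRA factors hold `rank * (d_in + d_out)` trainable parameters. The 4096 dimension below is a hypothetical projection size, not taken from any specific model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # hypothetical projection size, for illustration only
full = d_in * d_out

for rank in (4, 8, 16, 32):
    n = lora_trainable_params(d_in, d_out, rank)
    print(f"rank={rank:>2}  trainable={n:>7,}  fraction of full matrix={n / full:.4%}")
```

This only quantifies cost. Which rank works best for financial domain adaptation is an empirical question that the sweep itself would have to answer.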