# Fine-Tuning a Small LLM for Infrastructure Log Parsing

## Objectives
- Understand the concepts of Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA).
- Take a small, pre-trained open-source model (e.g., Llama-3.2-1B or Qwen2.5-0.5B).
- Fine-tune it to automatically extract structured JSON data from messy, unstructured infrastructure logs.

## Dataset
- A JSONL dataset mapping raw log strings to structured dictionaries. Every record carries `severity` and `component`, plus event-specific fields such as `error`, `ip`, or `metric`.

## Expected Outcome
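- Instead of writing and maintaining complex regex rules, we pass a new log line to the fine-tuned model, which returns JSON in the schema it was trained on. The output should still be validated, since a language model can occasionally emit malformed JSON.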

## Environment Note
- *This notebook runs best on a Google Colab T4 GPU (free tier).* If running on a CPU, training will be very slow.

In [None]:
# !pip install -q transformers datasets peft trl accelerate bitsandbytes

### 1. The Training Data
In a real scenario, you'd export a few hundred logs from Datadog or Splunk and manually label them to teach the LLM the exact schema you want.

In [None]:
import json
from datasets import Dataset

# Sample data showing how we want our LLM to behave
training_examples = [
    {
        "log": "[2025-02-25 10:14:22] ERROR: Connection timeout calling backend service at 10.0.4.55:8080",
        "json": '{"severity": "ERROR", "component": "backend_service", "ip": "10.0.4.55", "error": "Connection timeout"}'
    },
    {
        "log": "[2025-02-25 10:14:28] WARN: High memory usage detected on node worker-3 (88%)",
        "json": '{"severity": "WARN", "component": "node", "target": "worker-3", "metric": "memory_usage", "value": "88%"}'
    },
    {
        "log": "[2025-02-25 10:14:30] ERROR: OutOfMemoryError: Java heap space thread id 553",
        "json": '{"severity": "ERROR", "component": "jvm", "thread_id": "553", "error": "OutOfMemoryError: Java heap space"}'
    }
]

# Format for causal language modeling prompt
formatted_data = []
for example in training_examples:
    text = f"Extract structured JSON from this log.\n\nLog: {example['log']}\n\nJSON: {example['json']}"
    formatted_data.append({"text": text})

dataset = Dataset.from_list(formatted_data)
print("Sample training prompt:\n")
print(dataset[0]['text'])
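
For a real export, the labelled pairs would typically live in a JSONL file rather than a Python list. The following is a minimal sketch, assuming each line is an object with `log` and `json` keys as in the sample above. The file layout and the helper name `load_labelled_logs` are illustrative and not part of the notebook. It also checks that every label parses as JSON before it reaches the trainer.

```python
import io
import json

PROMPT_TEMPLATE = "Extract structured JSON from this log.\n\nLog: {log}\n\nJSON: {label}"

def load_labelled_logs(lines):
    """Parse JSONL lines into {"text": ...} training records.

    Hypothetical helper: assumes each line holds an object with "log" and
    "json" keys, mirroring the in-notebook sample data.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        example = json.loads(line)
        try:
            json.loads(example["json"])  # the label must itself be valid JSON
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: label is not valid JSON") from exc
        records.append(
            {"text": PROMPT_TEMPLATE.format(log=example["log"], label=example["json"])}
        )
    return records

# In-memory stand-in for a file opened with open("logs.jsonl")
sample_file = io.StringIO(
    json.dumps({
        "log": "[2025-02-25 10:14:28] WARN: High memory usage detected on node worker-3 (88%)",
        "json": '{"severity": "WARN", "component": "node"}',
    }) + "\n"
)
records = load_labelled_logs(sample_file)
print(records[0]["text"])
```

The resulting list has the same shape as `formatted_data`, so it could be passed to `Dataset.from_list` in the same way.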

### 2. Loading the Model with LoRA
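Full fine-tuning updates every weight and needs memory for gradients and optimizer state on top of the model itself. To keep VRAM usage low, we use `bitsandbytes` to load the base model in 4-bit precision and `peft` to train only a tiny fraction of the weights via LoRA adapters.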

In [None]:
'''
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-0.5B-Instruct" # A very small, highly capable model

# 1. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:  # only fall back to EOS when the tokenizer defines no pad token
    tokenizer.pad_token = tokenizer.eos_token

# 2. Quantization Config (4-bit)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

# 3. Load Model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# 4. Apply LoRA
lora_config = LoraConfig(
    r=8, # Rank (how "wide" the adapter is)
    lora_alpha=16, # Scaling factor
    target_modules=["q_proj", "v_proj"], # Modules to target
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
'''
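
To get a feel for how small "a tiny fraction" is, the adapter size can be estimated by hand. For a frozen weight matrix of shape `d_out x d_in`, LoRA adds two matrices of shapes `r x d_in` and `d_out x r`, which is `r * (d_in + d_out)` trainable parameters. The sketch below applies this to the `q_proj` and `v_proj` targets used above. The layer dimensions are assumptions taken from the published Qwen2.5-0.5B configuration (hidden size 896, 128-dimensional `v_proj` output, 24 layers), so check `model.config` for the model you actually load.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)

# Assumed dimensions (not read from the model): verify against model.config
HIDDEN_SIZE = 896
KV_DIM = 128  # 2 key/value heads x 64-dimensional heads
NUM_LAYERS = 24
RANK = 8

# q_proj maps hidden -> hidden, v_proj maps hidden -> key/value dimension
per_layer = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK) + lora_params(HIDDEN_SIZE, KV_DIM, RANK)
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

If the assumed dimensions hold, the total should agree with the trainable count reported by `model.print_trainable_parameters()`.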

### 3. Training Loop (SFTTrainer)
The `SFTTrainer` (Supervised Fine-Tuning Trainer) from Hugging Face's `trl` library handles tokenization, batching, and the optimization loop, so the training code stays short.

In [None]:
'''
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./log-parser-model",
    per_device_train_batch_size=1, # Keep small for 0.5B model on Colab
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    max_steps=50, # In reality, you'd train for 1-3 epochs over 500+ samples
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
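    # These two options are accepted here by older TRL releases; newer releases
    # expect them on trl.SFTConfig instead, so pin or adapt to your installed version.
    dataset_text_field="text",
    max_seq_length=512,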
    args=training_args,
)

# Train!
trainer.train()
'''
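
It helps to translate `max_steps` into passes over the data. Each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` examples per device, so the settings above can be worked through as follows. A single GPU is assumed, and the helper name `epochs_seen` is illustrative.

```python
def epochs_seen(max_steps: int, batch_size: int, grad_accum: int,
                dataset_size: int, num_devices: int = 1) -> float:
    """Number of passes over the dataset implied by a fixed step budget."""
    examples_per_step = batch_size * grad_accum * num_devices
    return max_steps * examples_per_step / dataset_size

# Values from the TrainingArguments above, with the 3-example toy dataset
toy = epochs_seen(max_steps=50, batch_size=1, grad_accum=4, dataset_size=3)
# The same step budget over a hypothetical 500-example labelled set
real = epochs_seen(max_steps=50, batch_size=1, grad_accum=4, dataset_size=500)
print(f"toy dataset: {toy:.1f} epochs, 500 examples: {real:.1f} epochs")
```

On the three-example toy set the model sees each example dozens of times and will mostly memorize them. On 500 examples the same 50 steps cover less than half an epoch, which is why the step budget should scale with the dataset size.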


### 4. Inference (Testing the Fine-Tuned Model)
Let's check whether the model learned to convert logs into our desired JSON schema using only the short instruction it saw during training, with no few-shot examples or elaborate prompt engineering.

In [None]:
'''
new_log = "[2025-02-25 10:14:40] ERROR: Database connection failed (FATAL: remaining connection slots are reserved)"
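# Match the training format: no trailing space after "JSON:", because BPE tokenizers
# usually attach the leading space to the first generated token.
prompt = f"Extract structured JSON from this log.\n\nLog: {new_log}\n\nJSON:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # follow the model's device
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)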
'''
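
Before the decoded text is handed to downstream tooling, it is worth parsing and validating it. The sketch below is one possible approach, not part of the original notebook. The helper `parse_model_output` and the `REQUIRED_KEYS` set are hypothetical, with the required keys assumed from the fields shared by all three training samples. It works whether or not the decoded text still contains the prompt, and it ignores stray tokens after the closing brace.

```python
import json

REQUIRED_KEYS = {"severity", "component"}  # assumed minimum schema

def parse_model_output(text: str, marker: str = "\n\nJSON:") -> dict:
    """Extract and validate the first JSON object that follows `marker`.

    Hypothetical helper: if the marker is absent (prompt already stripped),
    the search starts at the beginning of the text.
    """
    start = text.find(marker)
    search_from = start + len(marker) if start != -1 else 0
    brace = text.find("{", search_from)
    if brace == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores whatever follows it;
    # it raises json.JSONDecodeError (a ValueError) on malformed JSON
    parsed, _ = json.JSONDecoder().raw_decode(text, brace)
    missing = REQUIRED_KEYS - set(parsed)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return parsed

# Hand-written stand-in for a decoded response, not actual model output
example_response = (
    "Extract structured JSON from this log.\n\n"
    "Log: [2025-02-25 10:14:40] ERROR: Database connection failed\n\n"
    'JSON: {"severity": "ERROR", "component": "database", "error": "connection failed"}\nLog:'
)
result = parse_model_output(example_response)
print(result["severity"], result["component"])
```

Raising on malformed or incomplete output keeps bad records out of the pipeline, and the caller can decide whether to retry generation or fall back to a regex rule.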
