# 🔥 Enterprise Log Distillation: Because Regex is Painful

**Project:** Semantic Log Distillation Pipeline  

---

## 📖 What are we doing?

We are automating the job of that one guy who writes Regex parsers. 

In distributed systems (like HDFS), logs are messy. Traditional parsers break if you look at them wrong. We are going to fix this with **Synthetic Data Distillation**.
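
To make the brittleness concrete, here is a minimal sketch of the kind of hand-written parser being replaced. The pattern and the `parse_hdfs_line` name are assumptions for illustration, written against the HDFS sample format used later in this notebook (date, time, pid, level, component, message). A line that drifts from that layout simply fails to match.

```python
import re

# Hypothetical baseline: one regex for "<date> <time> <pid> <LEVEL> <component>: <content>"
HDFS_PATTERN = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{6}) (?P<pid>\d+) "
    r"(?P<level>[A-Z]+) (?P<component>[^:]+): (?P<content>.*)$"
)

def parse_hdfs_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = HDFS_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "timestamp": f"{fields['date']} {fields['time']}",
        "level": fields["level"],
        "component": fields["component"],
        "content": fields["content"],
    }

good = "081109 203615 143 INFO dfs.DataNode$PacketResponder: PacketResponder 1 for block blk_38865049064139660 terminating"
# Same event, but with a reformatted timestamp (a made-up variation)
drifted = "2008-11-09 20:36:15 143 INFO dfs.DataNode$PacketResponder: PacketResponder 1 terminating"

print(parse_hdfs_line(good)["component"])  # dfs.DataNode$PacketResponder
print(parse_hdfs_line(drifted))            # None
```

A parser like this still makes a handy sanity check for the fields a model extracts, as long as the format holds still.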

### The Master Plan (Teacher-Student Distillation)
We can't just run GPT-4 on every log line (we aren't made of money). So we use a trick:

1.  **The "Teacher" (Qwen)**: The smart one (`Qwen/Qwen2.5-7B-Instruct` in the config below). It takes its time and produces clean, schema-following JSON. It's too slow for production, but ideal for generating training data.
2.  **The "Student" (Llama 3.1 8B)**: The fast implementation detail. We fine-tune this model to copy the Teacher's homework using **QLoRA**. It ends up being 5x faster and runs on cheap hardware.

### Tech Stack Flex 💪
*   **Unsloth**: Because we have the attention span of a goldfish and want training to be 2x faster.
*   **Golden Schema**: We force the teacher to adhere to a strict structure so the student doesn't hallucinate fields like `"mood": "sad"`.
*   **>99% JSON Validity**: It actually works.
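
One way to enforce the Golden Schema is a strict key check on every teacher output. This is a sketch, not the notebook's actual filter: the field names come from the `SYSTEM_PROMPT` in the configuration cell, while `GOLDEN_FIELDS` and `matches_golden_schema` are hypothetical names.

```python
import json

# Field names taken from SYSTEM_PROMPT; the helper itself is hypothetical.
GOLDEN_FIELDS = {"timestamp", "level", "component", "content"}

def matches_golden_schema(raw_json):
    """True only if raw_json is a JSON object with exactly the golden fields."""
    try:
        parsed = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == GOLDEN_FIELDS

ok = '{"timestamp": "081109 203615", "level": "INFO", "component": "dfs.DataNode", "content": "terminating"}'
moody = '{"timestamp": "081109 203615", "level": "INFO", "component": "dfs.DataNode", "content": "terminating", "mood": "sad"}'

print(matches_golden_schema(ok))     # True
print(matches_golden_schema(moody))  # False
```

Rejecting extra keys as well as missing ones keeps invented fields out of the training set.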

In [None]:
# === CONFIGURATION (The Buttons You Can Click) ===
CONFIG = {
    "TEACHER_MODEL": "Qwen/Qwen2.5-7B-Instruct",  # The smart one
    "STUDENT_MODEL": "unsloth/Meta-Llama-3.1-8B-bnb-4bit", # The fast one
    "MAX_SEQ_LENGTH": 2048,
    "DATA_FILE": "data/sample.log",
    "TRAIN_STEPS": 60, # Keep it short for demo, crank it up for real results
    "SYSTEM_PROMPT": "You are a precise log parser. Output ONLY raw JSON. Fields: timestamp, level, component, content. No markdown, no thinking."
}

In [None]:
# === API KEYS (The Boring Security Stuff) ===
# We use Colab Secrets so you don't accidentally leak your keys on GitHub and get hacked.
import os

def get_secret(name):
    """Read a secret from Colab Secrets, falling back to environment variables."""
    try:
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Fallback for those living on the edge (locally), or when the
        # secret is missing / notebook access was not granted in Colab.
        return os.getenv(name)

HF_TOKEN = get_secret('HF_TOKEN')
WANDB_KEY = get_secret('WANDB_API_KEY')

if not HF_TOKEN:
    print("⚠️ HF_TOKEN missing! Go to 'Secrets' in Colab and add it. Without it, Llama will ghost you.")
else:
    print("✅ Hugging Face Token found. We are in business.")

if WANDB_KEY:
    import wandb
    wandb.login(key=WANDB_KEY)
    print("✅ W&B Logged in. Prepare for pretty charts.")
else:
    print("ℹ️ No W&B Key. Flying blind (no charts), but we'll survive.")


In [None]:
%%capture
# Installing dependencies... grab a coffee, this takes a minute.
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes wandb

import torch
from unsloth import FastLanguageModel
import json
from datasets import Dataset

## 1. Data Preparation: Making Logs Out of Thin Air
To ensure this notebook works for everyone (and doesn't require downloading a shady `zip` file), we generate our own 100% organic, locally-sourced HDFS logs right here.

In [None]:
os.makedirs("data", exist_ok=True)

# A sampler platter of chaotic logs
sample_logs = [
    "081109 203615 143 INFO dfs.DataNode$PacketResponder: PacketResponder 1 for block blk_38865049064139660 terminating",
    "081109 203807 222 INFO dfs.DataNode$PacketResponder: Received block blk_38865049064139660 of size 67108864 from /10.251.30.6",
    "081109 204005 35 INFO dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.73.220:50010 is added to blk_7128370237687728475 size 67108864",
    "081109 204132 26 WARN dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_7128370237687728475 on 10.251.73.220:50010",
    "081109 204453 34 ERROR dfs.DataNode$DataXceiver: 10.251.30.6:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.251.30.6:50010 dst: /10.251.30.6:50010",
    # ... imagine many more lines of this headache ...
    "081109 210022 641 INFO dfs.FSNamesystem: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.111.209:50010 is added to blk_3176666870275824003 size 67108864"
    # (In a real project, we'd do thousands. For this demo, a handful is enough to prove the point.)
]

with open(CONFIG["DATA_FILE"], 'w') as f:
    f.write('\n'.join(sample_logs))

print(f"✅ Generated {len(sample_logs)} lines of headache-inducing logs at {CONFIG['DATA_FILE']}")

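## 2. The Teacher: Qwen (The Brains 🧠)
We use Qwen (the config loads `Qwen/Qwen2.5-7B-Instruct`) because it's frighteningly good at following instructions. We explicitly tell it: **"Don't think, just JSON."** via the `SYSTEM_PROMPT`. (If you swap in a Qwen 3 checkpoint, also pass `enable_thinking=False` to `apply_chat_template`.)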

Why? Because we don't want the student to learn *how* to think, only the *result* of the thinking. Efficient lazy learning.

In [None]:
teacher_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = CONFIG["TEACHER_MODEL"],
    max_seq_length = CONFIG["MAX_SEQ_LENGTH"],
    dtype = None,
    load_in_4bit = True, # 4-bit because VRAM is expensive
)
FastLanguageModel.for_inference(teacher_model)

In [None]:
def generate_json_from_log(log_line):
    """
    The Teacher Logic. Uses a rigorous system prompt to enforce the schema.
    """
    messages = [
        {"role": "system", "content": CONFIG["SYSTEM_PROMPT"]},
        {"role": "user", "content": f"Parse: {log_line}"}
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True,
        return_tensors = "pt",
    ).to("cuda")

    # Deterministic generation (greedy decoding): sampling is switched off,
    # so temperature is not used.
    outputs = teacher_model.generate(
        inputs,
        max_new_tokens=128,
        do_sample=False,
        use_cache=True
    )

    response = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    # Remove code blocks if the model gets too helpful
    return response.replace("```json", "").replace("```", "").strip()
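
Stripping code fences handles the most common case, but a chatty model may also wrap the JSON in extra sentences. A slightly more forgiving cleanup, sketched here with only the standard library, scans for the first decodable JSON object. The `extract_json_object` helper is hypothetical and not part of the pipeline above.

```python
import json

def extract_json_object(response):
    """Return the first JSON object found in response, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = response.find("{", start + 1)
    return None

fence = "`" * 3  # built indirectly so the example does not contain a literal fence
chatty = (
    f"Sure! Here is the JSON:\n{fence}json\n"
    '{"level": "WARN", "content": "Redundant request"}'
    f"\n{fence}\nHope that helps."
)

print(extract_json_object(chatty))          # {'level': 'WARN', 'content': 'Redundant request'}
print(extract_json_object("no json here"))  # None
```

The trade-off is that this silently accepts responses that ignored the "ONLY raw JSON" instruction, so it may be worth counting how often the fallback fires.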

In [None]:
sft_dataset = []

print(f"🎓 Teacher ({CONFIG['TEACHER_MODEL']}) is generating the answer key...")
for i, line in enumerate(sample_logs):
    structured_json = generate_json_from_log(line)
    
    # If the teacher fails, we skip it. Even teachers make mistakes.
    try:
        json.loads(structured_json)
        sft_dataset.append({
            "instruction": "Convert the HDFS log line into a structured JSON object.",
            "input": line,
            "output": structured_json
        })
    except json.JSONDecodeError:
        print(f"❌ Line {i} failed. Teacher hallucinated: {structured_json[:50]}...")

print(f"✅ Distillation Complete. We have {len(sft_dataset)} JSON-valid training examples.")
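
Valid JSON is not necessarily correct JSON: the teacher can produce a well-formed object whose values never appeared in the log. A cheap extra filter, sketched below, is to require that selected fields are copied verbatim from the source line. `is_grounded` is a hypothetical helper, and the timestamp is left out of the check on the assumption that the teacher may reformat it.

```python
def is_grounded(log_line, parsed, fields=("level", "component", "content")):
    """True if every checked field is a non-empty string copied verbatim from log_line."""
    for name in fields:
        value = parsed.get(name)
        if not isinstance(value, str) or not value or value not in log_line:
            return False
    return True

# Shortened sample line in the same format as sample_logs
line = "081109 204132 26 WARN dfs.FSNamesystem: Redundant addStoredBlock request received"

faithful = {
    "timestamp": "2008-11-09 20:41:32",
    "level": "WARN",
    "component": "dfs.FSNamesystem",
    "content": "Redundant addStoredBlock request received",
}
invented = dict(faithful, level="ERROR")  # level does not appear in the line

print(is_grounded(line, faithful))  # True
print(is_grounded(line, invented))  # False
```

A substring check is deliberately strict. If the teacher is allowed to normalise whitespace or casing, the comparison would need to be loosened to match.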

## 3. The Student: Fine-Tuning Llama 3.1 8B 🏎️

Now for the magic. We take Llama 3.1 8B as a pre-quantized 4-bit checkpoint (so it fits on a T4), and slap some LoRA adapters on it.

This is basically "The Matrix" style learning. "I know Log Parsing."
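
For a sense of how little is actually trained, here is a back-of-the-envelope count of adapter parameters for the settings used in this notebook (rank 16 on all seven projection matrices). The layer shapes are assumptions taken from the published Llama 3.1 8B configuration, not values read from the loaded model.

```python
# Llama 3.1 8B shapes (assumed from the published model config)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024        # 8 key/value heads x 128 dims per head
NUM_LAYERS = 32
RANK = 16

# (in_features, out_features) of each targeted projection
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds two small matrices: A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable adapter parameters")  # 41,943,040 trainable adapter parameters
```

That is roughly half a percent of the 8B base weights, which is why the run fits alongside a 4-bit model on a small GPU.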

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

student_model, student_tokenizer = FastLanguageModel.from_pretrained(
    model_name = CONFIG["STUDENT_MODEL"],
    max_seq_length = CONFIG["MAX_SEQ_LENGTH"],
    load_in_4bit = True,
)

# Adding LoRA adapters (The "Learning" part)
student_model = FastLanguageModel.get_peft_model(
    student_model,
    r = 16, # Rank 16 is the sweet spot. Don't touch it.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
)

In [None]:
# 1. Convert list to HuggingFace Dataset
from datasets import Dataset
dataset = Dataset.from_list(sft_dataset)

# 2. Define the formatting function (Alpaca Style)
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = student_tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, log_line, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, log_line, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# 3. Map the dataset
formatted_dataset = dataset.map(formatting_prompts_func, batched = True)

trainer = SFTTrainer(
    model = student_model,
    tokenizer = student_tokenizer,
    train_dataset = formatted_dataset,
    dataset_text_field = "text",
    max_seq_length = CONFIG["MAX_SEQ_LENGTH"],
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = CONFIG["TRAIN_STEPS"], # Short run.
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(), # Automagically detect hardware
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        output_dir = "outputs",
    ),
)

# trainer.train() # <--- UNCOMMENT THIS TO TRAIN (It takes like 2 mins)
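
Before uncommenting, it helps to work out how much data those arguments imply. The arithmetic below assumes a single GPU and uses a hypothetical dataset size; in the notebook you would substitute `len(sft_dataset)`.

```python
per_device_batch = 2
grad_accum = 4
max_steps = 60
num_examples = 6  # hypothetical dataset size, for illustration only

effective_batch = per_device_batch * grad_accum  # sequences per optimizer step
sequences_seen = effective_batch * max_steps     # total sequences over the run
epochs = sequences_seen / num_examples           # passes over the dataset

print(effective_batch, sequences_seen, epochs)   # 8 480 80.0
```

With a tiny dataset, 60 steps means many repeated passes over the same lines, so a falling loss mostly shows memorisation. Scaling the data matters more than scaling the steps.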

## 4. Evaluation: Does it actually work?

Look at this graph. It goes down. That means we are winning.

![W&B Charts](img/training_loss_chart.png)

In [None]:
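def validate_student(logs, sample_size=5):
    FastLanguageModel.for_inference(student_model)
    subset = logs[:sample_size]
    valid_count = 0

    print("Running Validation (fingers crossed)... ")
    for line in subset:
        # In the real world, we'd batch this. For demo, loops are fine.
        # Use the same Alpaca prompt as training, with the response left empty.
        prompt = alpaca_prompt.format(
            "Convert the HDFS log line into a structured JSON object.", line, ""
        )
        inputs = student_tokenizer([prompt], return_tensors="pt").to("cuda")
        outputs = student_model.generate(**inputs, max_new_tokens=128, do_sample=False)
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        result = student_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

        # Check that it actually parses as JSON, not just that it contains braces
        try:
            json.loads(result)
            valid_count += 1
        except json.JSONDecodeError:
            pass

    # Divide by the number of lines actually checked
    return (valid_count / len(subset)) * 100 if subset else 0.0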

# print(f"Validation Score: {validate_student(sample_logs)}%") 
# If this prints 100%, you owe me a star.
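
JSON validity alone says nothing about whether the student copied the teacher's homework correctly. A slightly stricter scorer, sketched here, reports both the validity rate and the exact-match rate against reference outputs. `score_predictions` is a hypothetical helper that works on plain strings, so it can be tried without a GPU.

```python
import json

def score_predictions(predictions, references):
    """Return (json_validity, exact_match) as fractions in [0, 1]."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and the same length")
    valid = 0
    exact = 0
    for pred, ref in zip(predictions, references):
        try:
            parsed = json.loads(pred)
        except json.JSONDecodeError:
            continue  # invalid JSON counts against both metrics
        valid += 1
        if parsed == json.loads(ref):
            exact += 1
    total = len(predictions)
    return valid / total, exact / total

refs = ['{"level": "INFO"}', '{"level": "WARN"}']
preds = ['{"level": "INFO"}', '{"level": "ERROR"']  # second one is truncated

print(score_predictions(preds, refs))  # (0.5, 0.5)
```

Comparing parsed objects rather than raw strings means key order and whitespace differences do not count as errors.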