# Getting Started with Foundation-Sec: AI-Powered Log Prioritization

## What This Notebook Is About

This notebook demonstrates how to use the **Foundation-Sec-8B** model to solve the "Needle in the Logstack" problem: automatically identifying suspicious security events hidden among thousands of routine log entries. Instead of manually reviewing every log, security analysts can use the model's scores to prioritize which logs deserve immediate attention.

### The Challenge: Information Overload in Security Operations

Modern IT environments generate massive volumes of log data:
- Enterprise servers: 10,000+ log entries per hour
- Security teams: Limited time to investigate threats
- Manual review: Inefficient and prone to missing critical events
- Alert fatigue: Too many false positives reduce effectiveness

### Our Approach: Perplexity-Based Prioritization

Rather than traditional binary classification ("malicious" vs "benign"), we use a **perplexity scoring approach** that ranks logs by suspicion level:

1. Model Setup: Load Foundation-Sec-8B with efficient 4-bit quantization
2. Perplexity Measurement: For each log, measure how "surprised" the model is when predicting "benign" vs "malicious" labels
3. Priority Scoring: Convert perplexities into 0-1 probability scores (higher = more suspicious)
4. Ranking: Sort all logs by priority score for efficient human review
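
Steps 2 and 3 can be sketched in plain Python. The function names and the example token probabilities below are illustrative assumptions; in the notebook, the per-token losses come from the model itself.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood of the label tokens."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def priority_score(ppl_benign, ppl_malicious):
    """Normalize inverse perplexities so the score lies between 0 and 1."""
    benign_inv = 1.0 / ppl_benign
    malicious_inv = 1.0 / ppl_malicious
    return malicious_inv / (benign_inv + malicious_inv)

# Hypothetical single-token labels with made-up probabilities
ppl_benign = perplexity([math.log(0.2)])      # "benign" assigned probability 0.2
ppl_malicious = perplexity([math.log(0.6)])   # "malicious" assigned probability 0.6
score = priority_score(ppl_benign, ppl_malicious)
print(f"{score:.2f}")  # 0.75
```

When each label is a single token, the score reduces to the label's probability renormalized over the two options: 0.6 / (0.2 + 0.6) = 0.75.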

#### Why Perplexity Instead of Direct Classification?

- Ranking capability: Provides relative suspicion scores, not just binary decisions
- Confidence estimation: Lower perplexity = higher model confidence
- Threshold flexibility: Security teams can adjust review thresholds based on capacity
- Nuanced assessment: Captures degrees of suspicion rather than hard categories
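
As a minimal sketch of the threshold flexibility described above, a team could cap the review queue by analyst capacity, by a minimum score, or by both. The helper name `select_for_review`, the log IDs, and the scores here are hypothetical and not part of the notebook.

```python
def select_for_review(scores, capacity=None, threshold=None):
    """Return log IDs ranked by score, limited by capacity and/or a minimum score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(log_id, s) for log_id, s in ranked if s >= threshold]
    if capacity is not None:
        ranked = ranked[:capacity]
    return [log_id for log_id, _ in ranked]

# Illustrative scores keyed by hypothetical log IDs
scores = {"log-a": 0.58, "log-b": 0.41, "log-c": 0.36, "log-d": 0.12}
print(select_for_review(scores, capacity=2))      # ['log-a', 'log-b']
print(select_for_review(scores, threshold=0.3))   # ['log-a', 'log-b', 'log-c']
```

A binary classifier would force a single cut-off; keeping the continuous score lets the cut-off move with staffing levels.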

## Pros and Cons of This Approach

### ✅ Advantages

**Efficiency Gains**
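- **Less to review**: Reviewing only the top 10% of ranked logs cuts the triage workload by up to 10x, provided the ranking surfaces the malicious entries
- **Automated prioritization**: Reduces manual effort in initial log screening
- **Scalable**: Each log is scored independently, so runtime grows linearly with volume (the run below scored 30 logs in 45 seconds, about 1.5 s per log)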

**Security Benefits**
- **Improved detection**: The model may catch patterns that humans miss
- **Reduced alert fatigue**: Better signal-to-noise ratio
- **Consistent analysis**: No variation in quality due to analyst fatigue or experience

**Practical Implementation**
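- **Easy integration**: Operates on raw log text, so no format-specific parsing is required before scoring
- **Transparent scoring**: Each score is derived directly from two perplexity values that can be inspected, although the model does not produce a written rationale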
- **Flexible thresholds**: Adjustable based on organizational risk tolerance

### ⚠️ Limitations and Considerations

**Model Dependencies**
- **Training data bias**: Effectiveness depends on training data quality and coverage
- **Novel attack patterns**: May miss completely new attack techniques not seen during training
- **Context limitations**: Limited understanding of broader organizational context

**Operational Challenges**
- **False positives**: High-priority benign logs still require analyst time
- **False negatives**: Some malicious activity may score as low priority
- **Model drift**: Performance may degrade over time without retraining

**Resource Requirements**
- **GPU recommended**: For optimal performance, especially with larger log volumes
- **Initial setup**: Requires technical expertise to deploy and tune
- **Ongoing maintenance**: Model updates and threshold adjustments needed


In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git

In [2]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
from tqdm import tqdm

In [3]:
class LogPrioritizer:
    def __init__(self):
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4"
        )
        
        self.tokenizer = AutoTokenizer.from_pretrained("fdtn-ai/Foundation-Sec-8B")
        self.model = AutoModelForCausalLM.from_pretrained(
            "fdtn-ai/Foundation-Sec-8B", 
            quantization_config=quantization_config, 
            device_map="auto"
        )
        self.model.eval()
        print("🔥 Model loaded!")

    def get_priority_score(self, log_statement):
        """Returns priority score (0-1) where higher = more suspicious/malicious"""
        prompt = f"""Cybersecurity log classification task.
Log: "{log_statement}"
Classification:"""
        
        perplexities = {}
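        for label in ["benign", "malicious"]:
            full_text = prompt + " " + label
            # truncation=True makes the max_length cap explicit; a log long enough
            # to hit the cap would lose its trailing label tokens.
            inputs = self.tokenizer(full_text, return_tensors="pt",
                                    truncation=True, max_length=512).to(self.model.device)
            prompt_inputs = self.tokenizer(prompt, return_tensors="pt",
                                           truncation=True, max_length=512)
            prompt_len = prompt_inputs["input_ids"].shape[1]
            
            with torch.no_grad():
                outputs = self.model(**inputs)
                # Logits at position t predict token t+1, so shift by one
                logits = outputs.logits[0, prompt_len-1:-1]
                targets = inputs["input_ids"][0, prompt_len:]
                # Mean cross-entropy over the label tokens; exp gives perplexity
                loss = F.cross_entropy(logits, targets)
                perplexities[label] = torch.exp(loss).item()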
        benign_inv = 1.0 / perplexities["benign"]
        malicious_inv = 1.0 / perplexities["malicious"]
        malicious_prob = malicious_inv / (benign_inv + malicious_inv)
        return malicious_prob

In [4]:
with open("/kaggle/input/needle-in-the-logstack/medium_unlabelled.jsonl", "r") as f:
    unlabelled = [json.loads(line) for line in f]

with open("/kaggle/input/needle-in-the-logstack/medium_labelled.jsonl", "r") as f:
    # Parse each line once and map log_id -> ground-truth label
    labelled = {rec["log_id"]: rec["label"] for rec in map(json.loads, f)}

print(f"📂 Loaded {len(unlabelled)} logs")

# Show examples
for i, log in enumerate(unlabelled[:3]):
    gt = labelled.get(log["log_id"], "?")
    print(f"\n{i+1}. ID: {log['log_id']} | GT: {gt}")
    print(f"   {log['statement'][:100]}...")

📂 Loaded 30 logs

1. ID: 605e7781 | GT: benign
   Aug 07 13:30:12 mail01 sudo: mike.torres : TTY=pts/0 ; PWD=/home/mike.torres ; USER=root ; COMMAND=s...

2. ID: f417be52 | GT: benign
   Aug 07 13:34:12 backup01 sudo: sarah.chen : TTY=pts/0 ; PWD=/home/sarah.chen ; USER=root ; COMMAND=s...

3. ID: c5d9dd84 | GT: benign
   Aug 07 13:35:12 dev01 sudo: charlie.ops : TTY=pts/0 ; PWD=/home/charlie.ops ; USER=root ; COMMAND=su...


In [5]:
prioritizer = LogPrioritizer()

priority_scores = {}
for log in tqdm(unlabelled, desc="Getting priority scores"):
    score = prioritizer.get_priority_score(log["statement"])
    priority_scores[log["log_id"]] = score

print(f"✅ Got priority scores for {len(priority_scores)} logs")

# Quick preview of top 5 highest priority logs
print(f"\n🔥 TOP 5 HIGHEST PRIORITY LOGS (PREVIEW):")
sorted_preview = sorted(priority_scores.items(), key=lambda x: x[1], reverse=True)[:5]
for i, (log_id, score) in enumerate(sorted_preview, 1):
    gt_label = labelled.get(log_id, "Unknown")
    log_statement = next(log["statement"] for log in unlabelled if log["log_id"] == log_id)
    print(f"{i}. | Score: {score:.4f}")
    print(f"   {log_statement[:120]}...")
    print()

🔥 Model loaded!


Getting priority scores:   0%|          | 0/30 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Getting priority scores: 100%|██████████| 30/30 [00:45<00:00,  1.52s/it]

✅ Got priority scores for 30 logs

🔥 TOP 5 HIGHEST PRIORITY LOGS (PREVIEW):
1. | Score: 0.5755
   Aug 07 14:25:12 web01 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER=root ; COMMAND=sudo nmap -sS 192.168.1....

2. | Score: 0.4111
   Aug 07 14:22:12 app03 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER=root ; COMMAND=sudo echo 'ssh-rsa AAAAB...

3. | Score: 0.3962
   Aug 07 14:09:12 dev01 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER=root ; COMMAND=sudo journalctl -u nginx...

4. | Score: 0.3720
   Aug 07 14:01:12 db02 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER=root ; COMMAND=sudo echo '*/5 * * * * /t...

5. | Score: 0.3594
   Aug 07 14:15:12 backup01 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER=root ; COMMAND=sudo du -sh /var/log/...






In [6]:
print("\n" + "="*60)
print("🚨 RANKING-BASED PRIORITIZATION ANALYSIS")
print("="*60)

# Sort logs by priority score (highest first)
sorted_logs = sorted(priority_scores.items(), key=lambda x: x[1], reverse=True)

# Calculate ranking metrics for different top-k values
def evaluate_top_k(k):
    top_k_logs = sorted_logs[:k]
    top_k_malicious = sum(1 for log_id, _ in top_k_logs if labelled.get(log_id) == "malicious")
    return top_k_malicious

# Evaluate multiple top-k values
top_k_values = [1, 3, 5, 10, 20]
total_malicious = sum(1 for label in labelled.values() if label == "malicious")

print(f"📊 TOP-K RANKING RESULTS:")
print(f"   • Total malicious logs in dataset: {total_malicious}")
print()

for k in top_k_values:
    if k <= len(sorted_logs):
        malicious_in_top_k = evaluate_top_k(k)
        precision_at_k = malicious_in_top_k / k
        recall_at_k = malicious_in_top_k / total_malicious if total_malicious > 0 else 0
        
        print(f"   📈 TOP-{k:2d}: {malicious_in_top_k:2d}/{k:2d} malicious "
              f"(P@{k} = {precision_at_k:.3f}, R@{k} = {recall_at_k:.3f})")

print(f"\n🔍 DETAILED TOP-5 ANALYSIS:")
top5_logs = sorted_logs[:5]
for i, (log_id, score) in enumerate(top5_logs, 1):
    gt_label = labelled.get(log_id, "Unknown")
    status = "🔴 MALICIOUS" if gt_label == "malicious" else "🟢 BENIGN"
    log_statement = next(log["statement"] for log in unlabelled if log["log_id"] == log_id)
    print(f"{i}. {status} | Score: {score:.4f} | ID: {log_id}")
    print(f"   {log_statement[:80]}...")
    print()

# Show priority score distribution by class
print(f"📈 PRIORITY SCORE DISTRIBUTION:")
all_scores = list(priority_scores.values())
malicious_scores = [score for log_id, score in priority_scores.items() 
                   if labelled.get(log_id) == "malicious"]
benign_scores = [score for log_id, score in priority_scores.items() 
                if labelled.get(log_id) == "benign"]

print(f"   • Overall: min={min(all_scores):.3f}, max={max(all_scores):.3f}, avg={sum(all_scores)/len(all_scores):.3f}")
if malicious_scores:
    print(f"   • Malicious: min={min(malicious_scores):.3f}, max={max(malicious_scores):.3f}, avg={sum(malicious_scores)/len(malicious_scores):.3f}")
if benign_scores:
    print(f"   • Benign: min={min(benign_scores):.3f}, max={max(benign_scores):.3f}, avg={sum(benign_scores)/len(benign_scores):.3f}")

# Calculate ranking quality metrics
import math

def calculate_ndcg_at_k(k):
    """Calculate NDCG@k for binary relevance (malicious=1, benign=0)"""
    top_k = sorted_logs[:k]
    dcg = 0
    idcg = 0
    
    # DCG calculation
    for i, (log_id, score) in enumerate(top_k):
        relevance = 1 if labelled.get(log_id) == "malicious" else 0
        dcg += relevance / math.log2(i + 2)  # i+2 because log2(1) = 0
    
    # IDCG calculation (perfect ranking)
    ideal_relevances = sorted([1 if labelled.get(log_id) == "malicious" else 0 
                              for log_id, _ in sorted_logs], reverse=True)[:k]
    for i, relevance in enumerate(ideal_relevances):
        idcg += relevance / math.log2(i + 2)
    
    return dcg / idcg if idcg > 0 else 0

# Calculate NDCG for different k values
print(f"\n🎯 RANKING QUALITY (NDCG):")
for k in [3, 5, 10]:
    if k <= len(sorted_logs):
        ndcg = calculate_ndcg_at_k(k)
        print(f"   • NDCG@{k}: {ndcg:.4f}")

print(f"\n🔥 Prioritization analysis complete!")
print(f"Higher precision@k = better at surfacing malicious logs in top-k results")


🚨 RANKING-BASED PRIORITIZATION ANALYSIS
📊 TOP-K RANKING RESULTS:
   • Total malicious logs in dataset: 3

   📈 TOP- 1:  1/ 1 malicious (P@1 = 1.000, R@1 = 0.333)
   📈 TOP- 3:  2/ 3 malicious (P@3 = 0.667, R@3 = 0.667)
   📈 TOP- 5:  3/ 5 malicious (P@5 = 0.600, R@5 = 1.000)
   📈 TOP-10:  3/10 malicious (P@10 = 0.300, R@10 = 1.000)
   📈 TOP-20:  3/20 malicious (P@20 = 0.150, R@20 = 1.000)

🔍 DETAILED TOP-5 ANALYSIS:
1. 🔴 MALICIOUS | Score: 0.5755 | ID: 68ba4f6e
   Aug 07 14:25:12 web01 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER...

2. 🔴 MALICIOUS | Score: 0.4111 | ID: 6e42c0bc
   Aug 07 14:22:12 app03 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER...

3. 🟢 BENIGN | Score: 0.3962 | ID: d8b8161d
   Aug 07 14:09:12 dev01 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER...

4. 🔴 MALICIOUS | Score: 0.3720 | ID: 5d679f68
   Aug 07 14:01:12 db02 sudo: guest.user : TTY=pts/0 ; PWD=/home/guest.user ; USER=...

5. 🟢 BENIGN | Score: 0.3594 | ID: 8ee251ff
   Au

## Understanding the Results

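### Priority Scores (0.0 - 1.0)
Scores are relative, so treat these bands as rough guides rather than calibrated probabilities:
- **0.0 - 0.3**: Likely benign, routine activity
- **0.3 - 0.6**: Medium suspicion, worth reviewing
- **0.6 - 1.0**: High suspicion, immediate investigation recommended

In the run above, the three malicious logs scored 0.576, 0.411, and 0.372, and none reached the 0.6 band, so rank order matters more than the absolute value.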

### Key Metrics Explained

**Precision@K**: Of the top K logs flagged, what percentage were actually malicious?
- `P@5 = 0.600` means 60% of the top 5 flagged logs were truly malicious

**Recall@K**: Of all malicious logs, what percentage appear in the top K results?
- `R@5 = 1.000` means 100% of malicious logs appear in the top 5 results
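
Both metrics can be reproduced by hand from the ground-truth labels of the top 5 ranked logs shown in the detailed analysis. The helper name `precision_recall_at_k` is an assumption made for this sketch; the notebook computes the same quantities inline.

```python
def precision_recall_at_k(ranked_labels, k, total_malicious):
    """Precision@k and Recall@k for a list of labels ordered by priority score."""
    hits = sum(1 for label in ranked_labels[:k] if label == "malicious")
    precision = hits / k
    recall = hits / total_malicious if total_malicious > 0 else 0.0
    return precision, recall

# Ground-truth labels of the top 5 logs, in ranked order (from the output above)
top5 = ["malicious", "malicious", "benign", "malicious", "benign"]
for k in (1, 3, 5):
    p, r = precision_recall_at_k(top5, k, total_malicious=3)
    print(f"P@{k} = {p:.3f}, R@{k} = {r:.3f}")
# P@1 = 1.000, R@1 = 0.333
# P@3 = 0.667, R@3 = 0.667
# P@5 = 0.600, R@5 = 1.000
```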

**NDCG (Normalized Discounted Cumulative Gain)**: Overall ranking quality
- `NDCG@5 = 0.967` indicates excellent ranking performance (max = 1.0)
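
The 0.967 figure can be checked from the ranked labels alone. This is a standalone sketch of the same binary-relevance NDCG the notebook computes; the function name `ndcg_at_k` is an assumption, and the relevance list is built from the top-5 output plus the 25 remaining logs, all benign.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for a ranked list of binary relevances (1 = malicious, 0 = benign)."""
    def dcg(rels):
        # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal = sorted(relevances, reverse=True)  # perfect ranking: malicious logs first
    idcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / idcg if idcg > 0 else 0.0

# Ranked order from the run: malicious, malicious, benign, malicious, benign, then 25 benign
relevances = [1, 1, 0, 1, 0] + [0] * 25
print(f"NDCG@5 = {ndcg_at_k(relevances, 5):.3f}")  # NDCG@5 = 0.967
```

The score falls just short of 1.0 because one benign log sits at rank 3, pushing the third malicious log down to rank 4.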

### Interpreting Your Results

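**High-performing prioritization** typically shows the following (rules of thumb, not established benchmarks):
- P@5 ≥ 0.4 (at least 40% of top 5 are malicious)
- R@10 > 0.8 (80%+ of malicious logs in top 10)
- NDCG@10 > 0.7 (good overall ranking quality)

Precision@K is capped by the number of malicious logs in the dataset: with only 3 malicious logs, P@10 cannot exceed 0.3 even for a perfect ranking.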

**What this means for security operations**:
- **Review efficiency**: Analysts can focus on top 10-20 logs instead of hundreds
- **Threat detection**: Critical security events surface quickly
- **Resource allocation**: Better distribute analyst time across genuine threats