# **Third experiment: training an XGBoost probe on the attention layer**

We have trained a probe on the best-performing activation layer and achieved encouraging, though not fully satisfactory, results. Subsequently, we attempted training on multiple activation layers but quickly observed diminishing returns.

In this notebook, we explore a different approach: training a probe on layer 16's *attention* components (the most expressive layer, as established by our earlier experiments). We investigate two distinct strategies:

- Using the output of the multi-head attention block (attention layer) rather than the output of the feed-forward network (activation layer);
- Using a statistical summary of the same attention layer, which may capture dynamics and other fine-grained properties.

We then experiment with hybrid approaches (concatenating attention statistics and MHA outputs with hidden states) and evaluate the effect of retaining varying percentages of the most informative features.

### 1. Installing required libraries

So, let's first install the necessary libraries:

In [2]:
# Install `llmscan`
!pip install git+https://github.com/julienbrasseur/llm-hallucination-detector.git

# Install `datasets`
!pip install datasets

Collecting git+https://github.com/julienbrasseur/llm-hallucination-detector.git
  Cloning https://github.com/julienbrasseur/llm-hallucination-detector.git to /tmp/pip-req-build-l64dy7_y
  Running command git clone --filter=blob:none --quiet https://github.com/julienbrasseur/llm-hallucination-detector.git /tmp/pip-req-build-l64dy7_y
  Resolved https://github.com/julienbrasseur/llm-hallucination-detector.git to commit 77b721d351f3cb5b08d8447d199d6afe38970d26
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers>=4.36.0 (from llmscan==0.1.0)
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting xgboost>=2.0.0 (from llmscan==0.1.0)
  Downloading xgboost-3.1.2-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Collecting scikit-learn>=1.3.0 (from llmscan==0.1.0)
  Downloading scikit_learn-1.8.0-cp311-cp311-manylinux_2_27_x86_64.man

### 2. Data preparation

Now, as before, we load the dataset from Hugging Face and convert it to a standard OpenAI conversation format.

In [3]:
import torch
import numpy as np
from datasets import load_dataset

# Set training dataset path
DATASET_NAME = "krogoldAI/hallucination-labeled-dataset"

def load_and_format_dataset(dataset_name: str):
    """
    Load HuggingFace dataset and convert to conversation format.

    This function converts dataset with 'input', 'target', 'hallucination' fields
    to the standard conversation format expected by the pipeline.

    Returns:
        Tuple of (train_data, val_data, test_data, train_labels, val_labels, test_labels)
    """
    print(f"Loading dataset: {dataset_name}")
    ds = load_dataset(dataset_name)

    # Shuffle each split
    ds["train"] = ds["train"].shuffle(seed=42)
    ds["validation"] = ds["validation"].shuffle(seed=42)
    ds["test"] = ds["test"].shuffle(seed=42)

    def format_split(split):
        """Convert HF dataset split to conversation format."""
        formatted = []
        labels = []

        for item in split:
            # Extract fields from your HF dataset format
            user_msg = item["input"]
            assistant_msg = item["target"]
            label = int(item["hallucination"])

            # Convert to standard conversation format
            formatted.append({
                "conversation": [
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": assistant_msg},
                ]
            })
            labels.append(label)

        return formatted, np.array(labels)

    # Format all splits
    train_data, train_labels = format_split(ds["train"])
    val_data, val_labels = format_split(ds["validation"])
    test_data, test_labels = format_split(ds["test"])

    print(f"Dataset loaded and formatted:")
    print(f"  Train:      {len(train_data):,} examples")
    print(f"  Validation: {len(val_data):,} examples")
    print(f"  Test:       {len(test_data):,} examples")
    print(f"  Class distribution (train): "
          f"{(train_labels == 0).sum():,} non-hallucination, "
          f"{(train_labels == 1).sum():,} hallucination")

    return train_data, val_data, test_data, train_labels, val_labels, test_labels

# Load and format dataset
train_data, val_data, test_data, train_labels, val_labels, test_labels = \
    load_and_format_dataset(DATASET_NAME)

Loading dataset: krogoldAI/hallucination-labeled-dataset



Dataset loaded and formatted:
  Train:      101,618 examples
  Validation: 21,775 examples
  Test:       21,776 examples
  Class distribution (train): 68,913 non-hallucination, 32,705 hallucination


### 3. Extracting attention statistics summary and MHA output for layer 16

We now simultaneously extract the multi-head attention output for layer 16 along with a statistical summary of the same layer's attention. For each attention head, we compute 12 distinct metrics, chosen from the statistics exposed by `llmscan`. The full list of available statistics can be obtained as follows:

In [None]:
from llmscan import AVAILABLE_STATS

# Check available statistics
for stat in AVAILABLE_STATS:
    print(stat)

entropy
max
std
gini
simpson
effective_token_count
skewness
top1_mass
top5_mass
top10p_mass
mean_relative_distance
attention_to_bos
frobenius
head_output_norm
attn_value_correlation


Note that `head_output_norm` and `attn_value_correlation` have not yet been implemented, as they are architecture-dependent and would require non-trivial implementation effort. To assess whether this effort is warranted, we first experiment with the remaining metrics. Since `top1_mass` is redundant with `max`, we drop it as well, which leaves 12 metrics.
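To make a few of the listed statistics concrete, here is a minimal NumPy sketch of how some of them could be computed for a single attention distribution (one query token attending over the key positions). The helper `attention_row_stats` is hypothetical and uses common textbook definitions; in particular, defining `effective_token_count` as the exponential of the entropy is an assumption, and none of this is claimed to match `llmscan`'s actual implementation.

```python
import numpy as np

def attention_row_stats(p, eps=1e-12):
    """Summarize one attention distribution p (non-negative, sums to 1).

    Hypothetical helper: textbook definitions, not llmscan's implementation.
    """
    p = np.asarray(p, dtype=np.float64)
    n = p.size
    sorted_p = np.sort(p)  # ascending order
    entropy = float(-(p * np.log(p + eps)).sum())  # Shannon entropy, in nats
    # Gini coefficient: 0 for uniform attention, (n - 1) / n for one-hot attention
    ranks = np.arange(1, n + 1)
    gini = float(((2 * ranks - n - 1) * sorted_p).sum() / (n * sorted_p.sum()))
    return {
        "entropy": entropy,
        "max": float(p.max()),
        "top5_mass": float(sorted_p[-5:].sum()),
        "simpson": float((p ** 2).sum()),
        # Assumed definition: exponential of the entropy (perplexity)
        "effective_token_count": float(np.exp(entropy)),
        "gini": gini,
    }

uniform = attention_row_stats(np.full(8, 1 / 8))  # diffuse attention
peaked = attention_row_stats(np.eye(8)[3])        # all mass on one token
print(uniform)
print(peaked)
```

Diffuse attention yields high entropy and a Gini coefficient near zero, while peaked attention does the opposite. Computing 12 such statistics for each of the 32 heads gives 384 attention-stat features per example.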

While perhaps unconventional in this context, we include `frobenius`, the standard Frobenius matrix norm (equivalent to the $\ell^2$-norm of the matrix flattened into a vector), on a purely exploratory basis. This metric can be interpreted as the overall "energy" of the attention pattern across the full generation, potentially capturing whether attention patterns remain consistent across assistant tokens (i.e., attending to similar positions) or exhibit high variability.
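As a quick sanity check on this interpretation, the sketch below builds a hypothetical attention map for one head (rows are query tokens, columns are key positions; the shape and random logits are purely illustrative) and verifies two facts: the Frobenius norm equals the $\ell^2$-norm of the flattened matrix, and, because every row sums to 1, the norm is bounded between $\sqrt{n_q / n_k}$ (fully diffuse rows) and $\sqrt{n_q}$ (one-hot rows). How `llmscan` aggregates this quantity over tokens is not shown here and is not assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attention map: 6 query tokens attending over 10 key positions
logits = rng.normal(size=(6, 10))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax

fro = np.linalg.norm(attn, ord="fro")    # Frobenius norm of the matrix
l2_flat = np.linalg.norm(attn.ravel())   # l2 norm of the flattened matrix

n_q, n_k = attn.shape
lower, upper = np.sqrt(n_q / n_k), np.sqrt(n_q)

print(f"Frobenius: {fro:.4f}, flattened l2: {l2_flat:.4f}")
print(f"Bounds for row-stochastic maps: [{lower:.4f}, {upper:.4f}]")
```

A value close to the upper bound indicates sharply peaked attention on most rows, whereas a value close to the lower bound indicates attention spread almost uniformly.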


In [None]:
import os
import numpy as np
from llmscan import AttentionExtractor, AVAILABLE_STATS

# Statistics to compute: drop top1_mass (redundant with max) and the two metrics not yet implemented
STATS_TO_EXCLUDE = [
    "top1_mass",
    "head_output_norm",
    "attn_value_correlation"
]
STATS_TO_COMPUTE = [
    stat for stat in AVAILABLE_STATS.copy() if stat not in STATS_TO_EXCLUDE
]

# Initialize attention extractor
extractor = AttentionExtractor(
    model_name="mistralai/Ministral-8B-Instruct-2410",
    target_layers=[16],
    stats_to_compute=STATS_TO_COMPUTE,
    extract_mha_output=True, # Also extract MHA output vectors
    device="cuda",
)

print(f"\nExtractor initialized:")
print(f"  Target layers: {extractor.target_layers}")
print(f"  Num heads: {extractor.num_heads}")
print(f"  Hidden size: {extractor.hidden_size}")
print(f"  Stats: {len(STATS_TO_COMPUTE)} types")
print(f"  Expected attention stat features: {len(STATS_TO_COMPUTE) * extractor.num_heads}")
print(f"  Expected MHA output features: {extractor.hidden_size}")

# Save feature names for reference
feature_names = extractor.get_feature_names()
with open('attention_feature_names.txt', 'w') as f:
    for name in feature_names:
        f.write(name + '\n')
print(f"Saved feature names to attention_feature_names.txt")

# We will now extract attention data for each split
for split, data in zip(["test", "val", "train"], [test_data, val_data, train_data]):
    print(f"\n[{split} split] Extracting attention data...")
    features = extractor.extract(
        raw_texts=data,
        batch_size=8,
        max_length=512,
    )

    # Inspect extraction results
    for key, arr in features.items():
        print(f"\n{key}:")
        print(f"  Shape: {arr.shape}")
        print(f"  Dtype: {arr.dtype}")
        print(f"  Range: [{arr.min():.4f}, {arr.max():.4f}]")
        print(f"  Mean: {arr.mean():.4f}")
        print(f"  Has NaN: {np.isnan(arr).any()}")

    # Save attention statistics
    os.makedirs("attention_cache16", exist_ok=True)
    if 'attention_stats' in features:
        np.save(f'attention_cache16/{split}_attention_stats_layer16.npy', features['attention_stats'])
        print(f"\nSaved attention_stats to {split}_attention_stats_layer16.npy")

    # Save MHA outputs
    if 'mha_output' in features:
        np.save(f'attention_cache16/{split}_mha_output_layer16.npy', features['mha_output'])
        print(f"Saved mha_output to {split}_mha_output_layer16.npy")

Loading model...
  Stats to compute: ['entropy', 'max', 'std', 'gini', 'simpson', 'effective_token_count', 'skewness', 'top5_mass', 'top10p_mass', 'mean_relative_distance', 'attention_to_bos', 'frobenius']
  Extract MHA output: True
  Need head outputs: False


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Using instruction end token: [/INST]
Model loaded: 32 heads, dim=4096
Target layers: [16]

Extractor initialized:
  Target layers: [16]
  Num heads: 32
  Hidden size: 4096
  Stats: 12 types
  Expected attention stat features: 384
  Expected MHA output features: 4096
Saved feature names to attention_feature_names.txt

[test split] Extracting attention data...


Extracting attention features: 100%|██████████| 2722/2722 [1:05:52<00:00,  1.45s/it]


Attention stats shape: (21776, 384)
MHA output shape: (21776, 4096)

attention_stats:
  Shape: (21776, 384)
  Dtype: float32
  Range: [0.0000, 492.5853]
  Mean: 11.0324
  Has NaN: False

mha_output:
  Shape: (21776, 4096)
  Dtype: float32
  Range: [-1.1021, 0.5809]
  Mean: -0.0004
  Has NaN: False

Saved attention_stats to test_attention_stats_layer16.npy
Saved mha_output to test_mha_output_layer16.npy

[val split] Extracting attention data...


Extracting attention features: 100%|██████████| 2722/2722 [1:01:39<00:00,  1.36s/it]


Attention stats shape: (21775, 384)
MHA output shape: (21775, 4096)

attention_stats:
  Shape: (21775, 384)
  Dtype: float32
  Range: [0.0000, 492.5765]
  Mean: 11.0344
  Has NaN: False

mha_output:
  Shape: (21775, 4096)
  Dtype: float32
  Range: [-1.1750, 1.3672]
  Mean: -0.0004
  Has NaN: False

Saved attention_stats to val_attention_stats_layer16.npy
Saved mha_output to val_mha_output_layer16.npy

[train split] Extracting attention data...


Extracting attention features: 100%|██████████| 12703/12703 [5:09:27<00:00,  1.46s/it]  


Attention stats shape: (101618, 384)
MHA output shape: (101618, 4096)

attention_stats:
  Shape: (101618, 384)
  Dtype: float32
  Range: [0.0000, 499.0262]
  Mean: 11.0165
  Has NaN: False

mha_output:
  Shape: (101618, 4096)
  Dtype: float32
  Range: [-1.2734, 1.1875]
  Mean: -0.0004
  Has NaN: False

Saved attention_stats to train_attention_stats_layer16.npy
Saved mha_output to train_mha_output_layer16.npy


### 4. Training separate XGBoost probes on attention stats and MHA output

Let's evaluate attention statistics and the MHA output separately by training an XGBoost probe on each feature set.

In [5]:
import torch
import numpy as np
from sklearn.metrics import (
    precision_recall_curve,
    classification_report
)
from llmscan import XGBoostProbe

# XGBoost params
XGB_PARAMS = {
    'n_estimators': 1000,
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist',
    'device': 'cuda',
    'eval_metric': 'logloss',
}

for attn in ["attention_stats", "mha_output"]:
    msg = f"# TRAINING ON {attn.upper()} #"
    print("", "#" * len(msg), msg, "#" * len(msg), sep="\n")

    # Load cached activations
    print("\nLoading data...")
    train_acts = np.load(f"/workspace/attention_cache16/train_{attn}_layer16.npy")
    val_acts = np.load(f"/workspace/attention_cache16/val_{attn}_layer16.npy")
    test_acts = np.load(f"/workspace/attention_cache16/test_{attn}_layer16.npy")

    print(f"Train: {len(train_acts)}, Val: {len(val_acts)}, Test: {len(test_acts)}")

    # Align labels (trim to match activations)
    train_labels_aligned = train_labels[:len(train_acts)]
    val_labels_aligned = val_labels[:len(val_acts)]
    test_labels_aligned = test_labels[:len(test_acts)]

    print(f"Labels aligned: train={len(train_labels_aligned)}, val={len(val_labels_aligned)}, test={len(test_labels_aligned)}")

    # Train
    print("\nTraining XGBoost...")
    probe = XGBoostProbe(xgb_params=XGB_PARAMS)
    probe.fit(
        train_acts,
        train_labels_aligned,
        X_val=val_acts,
        y_val=val_labels_aligned,
        early_stopping_rounds=20,
        verbose=True
    )

    # Evaluate on test set
    print("\nEvaluating on test set...")
    metrics = probe.evaluate(test_acts, test_labels_aligned, verbose=True)

    # Save probe
    probe.save(f"hallucination_{attn}_layer_16.pkl")
    print("\nProbe saved!")

    # Get probabilities for hallucination class
    y_proba = probe.predict_proba(test_acts)[:, 1]

    # Get precision-recall curve
    precisions, recalls, thresholds = precision_recall_curve(test_labels_aligned, y_proba)

    # Compute F1 for each threshold (the final precision/recall pair has no threshold, so drop it)
    f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)

    # Find optimal threshold
    best_idx = f1_scores.argmax()
    best_threshold = thresholds[best_idx]

    print(f"Best threshold: {best_threshold:.3f}")
    print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")

    # Full report with optimized threshold
    y_pred_optimized = (y_proba >= best_threshold).astype(int)
    print("\nClassification Report (optimized threshold):")
    print(classification_report(test_labels_aligned, y_pred_optimized, digits=4))


###############################
# TRAINING ON ATTENTION_STATS #
###############################

Loading data...
Train: 101618, Val: 21775, Test: 21776
Labels aligned: train=101618, val=21775, test=21776

Training XGBoost...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


[0]	train-logloss:0.61638	val-logloss:0.61667
[10]	train-logloss:0.53636	val-logloss:0.53879
[20]	train-logloss:0.49627	val-logloss:0.50005
[30]	train-logloss:0.47250	val-logloss:0.47714
[40]	train-logloss:0.45583	val-logloss:0.46151
[50]	train-logloss:0.44303	val-logloss:0.44984
[60]	train-logloss:0.43312	val-logloss:0.44101
[70]	train-logloss:0.42506	val-logloss:0.43410
[80]	train-logloss:0.41761	val-logloss:0.42818
[90]	train-logloss:0.41054	val-logloss:0.42296
[100]	train-logloss:0.40438	val-logloss:0.41879
[110]	train-logloss:0.39855	val-logloss:0.41486
[120]	train-logloss:0.39353	val-logloss:0.41163
[130]	train-logloss:0.38888	val-logloss:0.40857
[140]	train-logloss:0.38416	val-logloss:0.40596
[150]	train-logloss:0.38030	val-logloss:0.40406
[160]	train-logloss:0.37717	val-logloss:0.40212
[170]	train-logloss:0.37363	val-logloss:0.40045
[180]	train-logloss:0.37072	val-logloss:0.39912
[190]	train-logloss:0.36772	val-logloss:0.39800
[200]	train-logloss:0.36493	val-logloss:0.39690
[21

Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


[0]	train-logloss:0.61334	val-logloss:0.61382
[10]	train-logloss:0.51849	val-logloss:0.52274
[20]	train-logloss:0.47061	val-logloss:0.47760
[30]	train-logloss:0.44164	val-logloss:0.45094
[40]	train-logloss:0.42087	val-logloss:0.43252
[50]	train-logloss:0.40553	val-logloss:0.41938
[60]	train-logloss:0.39162	val-logloss:0.40882
[70]	train-logloss:0.38084	val-logloss:0.40151
[80]	train-logloss:0.37127	val-logloss:0.39516
[90]	train-logloss:0.36312	val-logloss:0.39041
[100]	train-logloss:0.35613	val-logloss:0.38654
[110]	train-logloss:0.34952	val-logloss:0.38306
[120]	train-logloss:0.34381	val-logloss:0.38024
[130]	train-logloss:0.33835	val-logloss:0.37779
[140]	train-logloss:0.33331	val-logloss:0.37552
[150]	train-logloss:0.32865	val-logloss:0.37352
[160]	train-logloss:0.32435	val-logloss:0.37194
[170]	train-logloss:0.32010	val-logloss:0.37049
[180]	train-logloss:0.31605	val-logloss:0.36908
[190]	train-logloss:0.31267	val-logloss:0.36794
[200]	train-logloss:0.30920	val-logloss:0.36690
[21
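The cell above searches for the F1-optimal threshold directly on the test set, which tends to make the reported F1 slightly optimistic. A common alternative, sketched below on synthetic scores, is to select the threshold on the validation split and apply it unchanged to the test split. The helper `best_f1_threshold`, the `make_split` generator and the score distributions are all hypothetical and only illustrate the procedure; they do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

def best_f1_threshold(y_true, y_score):
    """Return (threshold, F1) maximizing F1 along the precision-recall curve."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
    # The last (precision=1, recall=0) point has no associated threshold: drop it
    p, r = precisions[:-1], recalls[:-1]
    f1 = 2 * p * r / (p + r + 1e-8)
    best = int(f1.argmax())
    return float(thresholds[best]), float(f1[best])

rng = np.random.default_rng(0)

def make_split(n):
    """Synthetic labels and scores: positives score higher on average."""
    y = rng.integers(0, 2, size=n)
    scores = 0.35 * y + rng.normal(0.4, 0.2, size=n)
    return y, scores

y_val, s_val = make_split(2000)
y_test, s_test = make_split(2000)

# Select the threshold on validation data only, then apply it to the test data
thr, val_f1 = best_f1_threshold(y_val, s_val)
test_f1 = f1_score(y_test, (s_test >= thr).astype(int))

print(f"Threshold chosen on validation: {thr:.3f}")
print(f"Validation F1: {val_f1:.3f}, test F1 at that threshold: {test_f1:.3f}")
```

With real probes, `s_val` and `s_test` would be the hallucination-class probabilities returned by `predict_proba` on the validation and test features.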

### 5. Training an XGBoost probe on concatenated attention and activation layers

The results above show no significant improvement over our earlier findings. As a next step, we concatenate all features spanning from layer 15's FFN output to layer 16's FFN output, including layer 16's attention statistics and multi-head attention output:

```
    Layer 15                                 Layer 16
{[FFN output]} -> {[Per-head attention statistics] -> [MHA output] -> [FFN output]}
```

Then, we'll evaluate the effect of feature selection by retaining varying percentages of the most informative features.
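Because feature selection later operates on column indices of the concatenated matrix, it helps to keep the block layout explicit. The sketch below derives the column boundaries from the block widths (384 attention statistics and a hidden size of 4096, as reported by the extractor; the FFN blocks are assumed to have the hidden-size width) and maps any column index back to its source block. The block names and the helper `block_of` are hypothetical.

```python
import numpy as np

# Block widths in concatenation order (names are illustrative)
BLOCKS = [
    ("ffn_layer15", 4096),
    ("attention_stats_layer16", 384),
    ("mha_output_layer16", 4096),
    ("ffn_layer16", 4096),
]
names = [name for name, _ in BLOCKS]
edges = np.cumsum([width for _, width in BLOCKS])  # exclusive end column of each block

def block_of(indices):
    """Map column indices of the concatenated matrix to their source block."""
    positions = np.searchsorted(edges, np.asarray(indices), side="right")
    return [names[pos] for pos in positions]

print("Block end columns:", edges.tolist())
print(block_of([0, 4095, 4096, 4479, 4480, 8575, 8576, 12671]))
```

Deriving the boundaries with `np.cumsum` avoids hard-coding offsets such as 4480 or 8576, so the mapping stays correct if a block is added or resized.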

In [2]:
import gc
import numpy as np
import torch
from pathlib import Path

def load_torch_as_numpy(path):
    """Load a .pt file and convert to numpy float32."""
    tensor = torch.load(path, map_location="cpu", weights_only=True)
    arr = tensor.float().numpy()
    del tensor
    return arr

def load_numpy(path):
    """Load a .npy file."""
    return np.load(path)

def build_concat_streaming(output_path, sources, n_samples):
    """
    Build concatenated features by streaming arrays one at a time.
    
    Args:
        output_path: Where to save the memory-mapped output
        sources: List of (path, loader_fn) tuples
        n_samples: Number of samples to include
    
    Returns:
        np.memmap array of shape (n_samples, total_features)
    """
    # First pass: compute total feature dimension
    print("Computing total feature dimension...")
    total_dim = 0
    dims = []
    for path, loader_fn in sources:
        arr = loader_fn(path)
        dims.append(arr.shape[1])
        total_dim += arr.shape[1]
        print(f"  {path}: {arr.shape[1]} features")
        del arr
        gc.collect()
    
    print(f"Total features: {total_dim}")
    
    # Allocate memory-mapped output
    print(f"Creating memmap at {output_path} with shape ({n_samples}, {total_dim})...")
    out = np.memmap(output_path, dtype=np.float32, mode='w+', shape=(n_samples, total_dim))
    
    # Second pass: fill the array
    col = 0
    for path, loader_fn in sources:
        print(f"Loading and copying {path}...")
        arr = loader_fn(path).astype(np.float32, copy=False)
        width = arr.shape[1]
        out[:, col:col + width] = arr[:n_samples]
        out.flush()  # Ensure data is written to disk
        col += width
        print(f"  Copied {width} features (columns {col - width} to {col})")
        del arr
        gc.collect()
    
    print("Done.")
    return out


# Define sources for each split
train_sources = [
    ("/workspace/feature_cache15/train_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/attention_cache16/train_attention_stats_layer16.npy", load_numpy),
    ("/workspace/attention_cache16/train_mha_output_layer16.npy", load_numpy),
    ("/workspace/feature_cache16/train_activations_pooled.pt", load_torch_as_numpy),
]

val_sources = [
    ("/workspace/feature_cache15/val_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/attention_cache16/val_attention_stats_layer16.npy", load_numpy),
    ("/workspace/attention_cache16/val_mha_output_layer16.npy", load_numpy),
    ("/workspace/feature_cache16/val_activations_pooled.pt", load_torch_as_numpy),
]

test_sources = [
    ("/workspace/feature_cache15/test_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/attention_cache16/test_attention_stats_layer16.npy", load_numpy),
    ("/workspace/attention_cache16/test_mha_output_layer16.npy", load_numpy),
    ("/workspace/feature_cache16/test_activations_pooled.pt", load_torch_as_numpy),
]

# Sample counts
n_train = 101618
n_val = 21775
n_test = 21776

# Output directory for memmaps
output_dir = Path("/workspace/concat_features")
output_dir.mkdir(exist_ok=True)

# Build concatenated features
print("\n=== TRAIN ===")
train_acts = build_concat_streaming(output_dir / "train.dat", train_sources, n_train)

print("\n=== VAL ===")
val_acts = build_concat_streaming(output_dir / "val.dat", val_sources, n_val)

print("\n=== TEST ===")
test_acts = build_concat_streaming(output_dir / "test.dat", test_sources, n_test)

print(f"\nFinal shapes:")
print(f"  Train: {train_acts.shape}")
print(f"  Val:   {val_acts.shape}")
print(f"  Test:  {test_acts.shape}")

# Labels
train_labels_aligned = train_labels[:n_train]
val_labels_aligned = val_labels[:n_val]
test_labels_aligned = test_labels[:n_test]

print(f"\nLabels aligned:")
print(f"  Train: {len(train_labels_aligned)}")
print(f"  Val:   {len(val_labels_aligned)}")
print(f"  Test:  {len(test_labels_aligned)}")


=== TRAIN ===
Computing total feature dimension...
  /workspace/feature_cache15/train_activations_pooled.pt: 4096 features
  /workspace/attention_cache16/train_attention_stats_layer16.npy: 384 features
  /workspace/attention_cache16/train_mha_output_layer16.npy: 4096 features
  /workspace/feature_cache16/train_activations_pooled.pt: 4096 features
Total features: 12672
Creating memmap at /workspace/concat_features/train.dat with shape (101618, 12672)...
Loading and copying /workspace/feature_cache15/train_activations_pooled.pt...
  Copied 4096 features (columns 0 to 4096)
Loading and copying /workspace/attention_cache16/train_attention_stats_layer16.npy...
  Copied 384 features (columns 4096 to 4480)
Loading and copying /workspace/attention_cache16/train_mha_output_layer16.npy...
  Copied 4096 features (columns 4480 to 8576)
Loading and copying /workspace/feature_cache16/train_activations_pooled.pt...
  Copied 4096 features (columns 8576 to 12672)
Done.

=== VAL ===
Computing total fea
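One practical caveat: the `.dat` files written through `np.memmap` are raw buffers with no header, so the shape and dtype must be supplied again when reopening them in a later session. The sketch below shows one way to reopen such a file read-only with a file-size check that catches a mismatched shape. The helper `open_memmap_readonly` and the tiny demo file are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

def open_memmap_readonly(path, n_samples, n_features, dtype=np.float32):
    """Reopen a raw memmap file, checking that its size matches the expected shape."""
    expected = n_samples * n_features * np.dtype(dtype).itemsize
    actual = Path(path).stat().st_size
    if actual != expected:
        raise ValueError(f"{path}: expected {expected} bytes, found {actual}")
    return np.memmap(path, dtype=dtype, mode="r", shape=(n_samples, n_features))

# Demo on a small temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.dat"
    reference = np.arange(15, dtype=np.float32).reshape(5, 3)

    writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(5, 3))
    writer[:] = reference
    writer.flush()
    del writer  # release the write handle

    reader = open_memmap_readonly(path, 5, 3)
    roundtrip_ok = bool(np.array_equal(reader, reference))
    del reader  # release the file before the directory is removed

    try:
        open_memmap_readonly(path, 5, 4)  # wrong feature count
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(f"Round trip OK: {roundtrip_ok}, shape mismatch detected: {mismatch_detected}")
```

For the files built in this section, the same call would take the split's sample count and the total feature dimension printed during concatenation.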

Now, let's train a probe on the concatenated data:

In [4]:
from llmscan import XGBoostProbe
from sklearn.metrics import precision_recall_curve, classification_report

# XGBoost parameters
XGB_PARAMS = {
    'n_estimators': 1000,
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist',
    'device': 'cuda',
    'eval_metric': 'logloss',
}

# Train
print("\nTraining XGBoost...")
probe_full = XGBoostProbe(xgb_params=XGB_PARAMS)
probe_full.fit(
    train_acts,
    train_labels_aligned,
    X_val=val_acts,
    y_val=val_labels_aligned,
    early_stopping_rounds=20,
    verbose=True
)

# Evaluate on test set
print("\nEvaluating on test set...")
metrics = probe_full.evaluate(test_acts, test_labels_aligned, verbose=True)

# Save probe
probe_full.save("hallucination_probe_full_layers_15_16.pkl")
print("\nProbe saved!")

# Get probabilities for hallucination class
y_proba = probe_full.predict_proba(test_acts)[:, 1]

# Get precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(test_labels_aligned, y_proba)

# Compute F1 for each threshold (the final precision/recall pair has no threshold, so drop it)
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)

# Find optimal threshold
best_idx = f1_scores.argmax()
best_threshold = thresholds[best_idx]

print(f"Best threshold: {best_threshold:.3f}")
print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")

# Full report with optimized threshold
y_pred_optimized = (y_proba >= best_threshold).astype(int)
print("\nClassification Report (optimized threshold):")
print(classification_report(test_labels_aligned, y_pred_optimized, digits=4))


Training XGBoost...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


[0]	train-logloss:0.61317	val-logloss:0.61355
[10]	train-logloss:0.51718	val-logloss:0.52024
[20]	train-logloss:0.46408	val-logloss:0.46947
[30]	train-logloss:0.43264	val-logloss:0.43997
[40]	train-logloss:0.41272	val-logloss:0.42196
[50]	train-logloss:0.39797	val-logloss:0.40894
[60]	train-logloss:0.38507	val-logloss:0.39868
[70]	train-logloss:0.37419	val-logloss:0.39082
[80]	train-logloss:0.36488	val-logloss:0.38493
[90]	train-logloss:0.35645	val-logloss:0.37970
[100]	train-logloss:0.34929	val-logloss:0.37611
[110]	train-logloss:0.34285	val-logloss:0.37293
[120]	train-logloss:0.33677	val-logloss:0.36991
[130]	train-logloss:0.33081	val-logloss:0.36764
[140]	train-logloss:0.32561	val-logloss:0.36569
[150]	train-logloss:0.32098	val-logloss:0.36396
[160]	train-logloss:0.31627	val-logloss:0.36219
[170]	train-logloss:0.31191	val-logloss:0.36078
[180]	train-logloss:0.30786	val-logloss:0.35946
[190]	train-logloss:0.30392	val-logloss:0.35832
[200]	train-logloss:0.30005	val-logloss:0.35725
[21

### 6. Feature selection experiment

Now that we have trained the probe on the concatenated dataset, let's keep varying numbers of the most important features and train a new probe on each subset.

In [5]:
import numpy as np
from sklearn.metrics import precision_recall_curve, classification_report
from llmscan import XGBoostProbe

print("="*62)
print("FEATURE SELECTION EXPERIMENT")
print("="*62)

# Set feature selection parameters (from 500 to 6000 features, in steps of 500)
MIN_FEATURES = 500
MAX_FEATURES = 6000
ITERATION_STEP = 500

# Extract feature importance
print("\nExtracting feature importance...")
feature_importance = probe_full.get_feature_importance(
    importance_type='gain',
    top_k=None
)
print(f"\tTotal features: {len(feature_importance)}")
print(f"\tFeatures with non-zero importance: {len([v for v in feature_importance.values() if v > 0])}")

# Initialize list to store metrics
results = []

# Iterating over various feature numbers
for top_k in range(MIN_FEATURES, MAX_FEATURES+1, ITERATION_STEP):
    print("\n" + "="*62)
    print(f"EXPERIMENT: Top {top_k} features")
    print("="*62)

    # Get top-k most important feature indices
    top_features = sorted(
        feature_importance.items(),
        key=lambda x: x[1],
        reverse=True
    )[:top_k]
    top_indices = [idx for idx, _ in top_features]

    print(f"\nSelected top {top_k} features (indices: {top_indices[:5]}...{top_indices[-5:]})")

    # Subset the data
    train_acts_subset = train_acts[:, top_indices]
    val_acts_subset = val_acts[:, top_indices]
    test_acts_subset = test_acts[:, top_indices]

    print(f"Subset shapes: {train_acts_subset.shape}, {val_acts_subset.shape}, {test_acts_subset.shape}")

    # Train new XGBoost on selected features
    print(f"Training XGBoost on {top_k} selected features...")

    XGB_PARAMS = {
        'n_estimators': 1000,
        'max_depth': 6,
        'learning_rate': 0.05,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'tree_method': 'hist',
        'eval_metric': 'logloss',
    }

    probe_subset = XGBoostProbe(xgb_params=XGB_PARAMS)
    probe_subset.fit(
        train_acts_subset,
        train_labels_aligned,
        X_val=val_acts_subset,
        y_val=val_labels_aligned,
        early_stopping_rounds=20,
        verbose=False
    )

    print(f"\tTraining complete (best iteration: {probe_subset.model.best_iteration})")

    # Evaluate
    print(f"Evaluating...")
    metrics = probe_subset.evaluate(
        test_acts_subset,
        test_labels_aligned,
        threshold=0.388,
        verbose=False
    )

    results.append({
        'top_k': top_k,
        'accuracy': metrics['accuracy'],
        'precision': metrics['precision'],
        'recall': metrics['recall'],
        'f1': metrics['f1'],
        'auc': metrics['auc']
    })

    print(f"  Results: Acc={metrics['accuracy']:.4f}, Recall={metrics['recall']:.4f}, "
          f"F1={metrics['f1']:.4f}, AUC={metrics['auc']:.4f}")

    # Get probabilities for hallucination class
    y_proba = probe_subset.predict_proba(test_acts_subset)[:, 1]
    
    # Get precision-recall curve
    precisions, recalls, thresholds = precision_recall_curve(test_labels_aligned, y_proba)
    
    # Compute F1 for each threshold (the final precision/recall pair has no threshold, so drop it)
    f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)

    # Find optimal threshold
    best_idx = f1_scores.argmax()
    best_threshold = thresholds[best_idx]
    
    print(f"Best threshold: {best_threshold:.3f}")
    print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")
    
    # Full report with optimized threshold
    y_pred_optimized = (y_proba >= best_threshold).astype(int)
    print("\nClassification Report (optimized threshold):")
    print(classification_report(test_labels_aligned, y_pred_optimized, digits=4))

# Compare results
print("\n" + "="*62)
print("FEATURE SELECTION RESULTS COMPARISON")
print("="*62)

print("\n{:<12} {:<10} {:<10} {:<10} {:<10} {:<10}".format(
    "Features", "Accuracy", "Precision", "Recall", "F1", "AUC"
))
print("-" * 62)

for r in results:
    print("{:<12} {:<10.4f} {:<10.4f} {:<10.4f} {:<10.4f} {:<10.4f}".format(
        r['top_k'],
        r['accuracy'],
        r['precision'],
        r['recall'],
        r['f1'],
        r['auc']
    ))

# Find best
best_f1 = max(results, key=lambda x: x['f1'])
best_auc = max(results, key=lambda x: x['auc'])

print("\n" + "="*62)
print("BEST MODELS")
print("="*62)

print(f"\nBest F1: {best_f1['top_k']} features")
print(f"  Accuracy: {best_f1['accuracy']:.4f}")
print(f"  Recall:   {best_f1['recall']:.4f}")
print(f"  F1:       {best_f1['f1']:.4f}")
print(f"  AUC:      {best_f1['auc']:.4f}")

print(f"\nBest AUC: {best_auc['top_k']} features")
print(f"  Accuracy: {best_auc['accuracy']:.4f}")
print(f"  Recall:   {best_auc['recall']:.4f}")
print(f"  F1:       {best_auc['f1']:.4f}")
print(f"  AUC:      {best_auc['auc']:.4f}")

# Analyze layer contribution
print("\n" + "="*62)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*62)

# Block widths for activation layer 15, attention summary, MHA output and activation layer 16: 4096, 384, 4096, 4096
features_15 = [idx for idx in top_indices[:best_f1['top_k']] if idx < 4096]
attn_16 = [idx for idx in top_indices[:best_f1['top_k']] if idx >= 4096 and idx < 4480] # 4480 = 4096 + 384
mha_16 = [idx for idx in top_indices[:best_f1['top_k']] if idx >= 4480 and idx < 8576] # 8576 = 4480 + 4096
features_16 = [idx for idx in top_indices[:best_f1['top_k']] if idx >= 8576]

print(f"\nIn top {best_f1['top_k']} features:")
print(f"\tActivation layer 15 features: {len(features_15)} ({len(features_15)/best_f1['top_k']*100:.1f}%)")
print(f"\tAttention statistics layer 16 features: {len(attn_16)} ({len(attn_16)/best_f1['top_k']*100:.1f}%)")
print(f"\tMulti-Head Attention output layer 16 features: {len(mha_16)} ({len(mha_16)/best_f1['top_k']*100:.1f}%)")
print(f"\tActivation layer 16 features:  {len(features_16)} ({len(features_16)/best_f1['top_k']*100:.1f}%)")

print("\n" + "="*62)

FEATURE SELECTION EXPERIMENT

Extracting feature importance...
	Total features: 10775
	Features with non-zero importance: 10775

EXPERIMENT: Top 500 features

Selected top 500 features (indices: [9605, 5580, 5156, 1845, 11742]...[9412, 10065, 5672, 9130, 10826])
Subset shapes: (101618, 500), (21775, 500), (21776, 500)
Training XGBoost on 500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 649)
Evaluating...
  Results: Acc=0.8312, Recall=0.7383, F1=0.7379, AUC=0.9069
Best threshold: 0.375
Precision: 0.725, Recall: 0.752, F1: 0.739

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8803    0.8649    0.8725     14769
           1     0.7254    0.7521    0.7385      7007

    accuracy                         0.8286     21776
   macro avg     0.8028    0.8085    0.8055     21776
weighted avg     0.8305    0.8286    0.8294     21776


EXPERIMENT: Top 1000 features

Selected top 1000 features (indices: [9605, 5580, 5156, 1845, 11742]...[7744, 11391, 5340, 9707, 2264])
Subset shapes: (101618, 1000), (21775, 1000), (21776, 1000)
Training XGBoost on 1000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 736)
Evaluating...
  Results: Acc=0.8282, Recall=0.7418, F1=0.7354, AUC=0.9065
Best threshold: 0.370
Precision: 0.715, Recall: 0.764, F1: 0.738

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8841    0.8555    0.8696     14769
           1     0.7149    0.7637    0.7385      7007

    accuracy                         0.8260     21776
   macro avg     0.7995    0.8096    0.8040     21776
weighted avg     0.8297    0.8260    0.8274     21776


EXPERIMENT: Top 1500 features

Selected top 1500 features (indices: [9605, 5580, 5156, 1845, 11742]...[5424, 12229, 5800, 7393, 659])
Subset shapes: (101618, 1500), (21775, 1500), (21776, 1500)
Training XGBoost on 1500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 649)
Evaluating...
  Results: Acc=0.8339, Recall=0.7400, F1=0.7415, AUC=0.9089
Best threshold: 0.359
Precision: 0.718, Recall: 0.771, F1: 0.743

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8875    0.8559    0.8714     14769
           1     0.7175    0.7714    0.7435      7007

    accuracy                         0.8287     21776
   macro avg     0.8025    0.8136    0.8074     21776
weighted avg     0.8328    0.8287    0.8303     21776


EXPERIMENT: Top 2000 features

Selected top 2000 features (indices: [9605, 5580, 5156, 1845, 11742]...[1487, 8858, 4144, 5948, 332])
Subset shapes: (101618, 2000), (21775, 2000), (21776, 2000)
Training XGBoost on 2000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 607)
Evaluating...
  Results: Acc=0.8323, Recall=0.7395, F1=0.7395, AUC=0.9086
Best threshold: 0.364
Precision: 0.718, Recall: 0.770, F1: 0.743

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8871    0.8567    0.8716     14769
           1     0.7182    0.7701    0.7433      7007

    accuracy                         0.8288     21776
   macro avg     0.8026    0.8134    0.8074     21776
weighted avg     0.8327    0.8288    0.8303     21776


EXPERIMENT: Top 2500 features

Selected top 2500 features (indices: [9605, 5580, 5156, 1845, 11742]...[9821, 7407, 2989, 11691, 2770])
Subset shapes: (101618, 2500), (21775, 2500), (21776, 2500)
Training XGBoost on 2500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 544)
Evaluating...
  Results: Acc=0.8325, Recall=0.7413, F1=0.7401, AUC=0.9083
Best threshold: 0.362
Precision: 0.715, Recall: 0.770, F1: 0.741

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8869    0.8540    0.8701     14769
           1     0.7146    0.7704    0.7414      7007

    accuracy                         0.8271     21776
   macro avg     0.8007    0.8122    0.8058     21776
weighted avg     0.8314    0.8271    0.8287     21776


EXPERIMENT: Top 3000 features

Selected top 3000 features (indices: [9605, 5580, 5156, 1845, 11742]...[1051, 7158, 8627, 4334, 10624])
Subset shapes: (101618, 3000), (21775, 3000), (21776, 3000)
Training XGBoost on 3000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 572)
Evaluating...
  Results: Acc=0.8343, Recall=0.7385, F1=0.7415, AUC=0.9098
Best threshold: 0.379
Precision: 0.738, Recall: 0.750, F1: 0.744

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8803    0.8735    0.8769     14769
           1     0.7376    0.7497    0.7436      7007

    accuracy                         0.8336     21776
   macro avg     0.8089    0.8116    0.8102     21776
weighted avg     0.8344    0.8336    0.8340     21776


EXPERIMENT: Top 3500 features

Selected top 3500 features (indices: [9605, 5580, 5156, 1845, 11742]...[6379, 2448, 10906, 6542, 367])
Subset shapes: (101618, 3500), (21775, 3500), (21776, 3500)
Training XGBoost on 3500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 696)
Evaluating...
  Results: Acc=0.8315, Recall=0.7397, F1=0.7386, AUC=0.9087
Best threshold: 0.361
Precision: 0.717, Recall: 0.769, F1: 0.742

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8865    0.8561    0.8710     14769
           1     0.7172    0.7689    0.7421      7007

    accuracy                         0.8281     21776
   macro avg     0.8018    0.8125    0.8066     21776
weighted avg     0.8320    0.8281    0.8296     21776


EXPERIMENT: Top 4000 features

Selected top 4000 features (indices: [9605, 5580, 5156, 1845, 11742]...[1932, 1304, 2990, 2184, 5413])
Subset shapes: (101618, 4000), (21775, 4000), (21776, 4000)
Training XGBoost on 4000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 599)
Evaluating...
  Results: Acc=0.8303, Recall=0.7421, F1=0.7378, AUC=0.9077
Best threshold: 0.373
Precision: 0.720, Recall: 0.761, F1: 0.740

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8836    0.8597    0.8715     14769
           1     0.7203    0.7614    0.7403      7007

    accuracy                         0.8281     21776
   macro avg     0.8020    0.8105    0.8059     21776
weighted avg     0.8311    0.8281    0.8293     21776


EXPERIMENT: Top 4500 features

Selected top 4500 features (indices: [9605, 5580, 5156, 1845, 11742]...[6433, 9615, 1116, 3697, 3602])
Subset shapes: (101618, 4500), (21775, 4500), (21776, 4500)
Training XGBoost on 4500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 507)
Evaluating...
  Results: Acc=0.8339, Recall=0.7380, F1=0.7409, AUC=0.9085
Best threshold: 0.362
Precision: 0.718, Recall: 0.768, F1: 0.742

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8860    0.8570    0.8713     14769
           1     0.7181    0.7677    0.7420      7007

    accuracy                         0.8283     21776
   macro avg     0.8020    0.8123    0.8067     21776
weighted avg     0.8320    0.8283    0.8297     21776


EXPERIMENT: Top 5000 features

Selected top 5000 features (indices: [9605, 5580, 5156, 1845, 11742]...[2774, 5522, 10693, 3772, 10430])
Subset shapes: (101618, 5000), (21775, 5000), (21776, 5000)
Training XGBoost on 5000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 667)
Evaluating...
  Results: Acc=0.8326, Recall=0.7461, F1=0.7415, AUC=0.9088
Best threshold: 0.380
Precision: 0.730, Recall: 0.756, F1: 0.743

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8823    0.8675    0.8748     14769
           1     0.7303    0.7561    0.7430      7007

    accuracy                         0.8316     21776
   macro avg     0.8063    0.8118    0.8089     21776
weighted avg     0.8334    0.8316    0.8324     21776


EXPERIMENT: Top 5500 features

Selected top 5500 features (indices: [9605, 5580, 5156, 1845, 11742]...[11186, 12516, 8274, 5801, 1542])
Subset shapes: (101618, 5500), (21775, 5500), (21776, 5500)
Training XGBoost on 5500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 542)
Evaluating...
  Results: Acc=0.8330, Recall=0.7457, F1=0.7419, AUC=0.9081
Best threshold: 0.364
Precision: 0.716, Recall: 0.773, F1: 0.743

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8882    0.8544    0.8709     14769
           1     0.7158    0.7732    0.7434      7007

    accuracy                         0.8283     21776
   macro avg     0.8020    0.8138    0.8072     21776
weighted avg     0.8327    0.8283    0.8299     21776


EXPERIMENT: Top 6000 features

Selected top 6000 features (indices: [9605, 5580, 5156, 1845, 11742]...[12351, 8197, 9071, 1501, 1593])
Subset shapes: (101618, 6000), (21775, 6000), (21776, 6000)
Training XGBoost on 6000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 432)
Evaluating...
  Results: Acc=0.8312, Recall=0.7428, F1=0.7391, AUC=0.9076
Best threshold: 0.378
Precision: 0.726, Recall: 0.755, F1: 0.740

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8814    0.8651    0.8732     14769
           1     0.7263    0.7545    0.7402      7007

    accuracy                         0.8295     21776
   macro avg     0.8038    0.8098    0.8067     21776
weighted avg     0.8315    0.8295    0.8304     21776


FEATURE SELECTION RESULTS COMPARISON

Features     Accuracy   Precision  Recall     F1         AUC       
--------------------------------------------------------------
500          0.8312     0.7375     0.7383     0.7379     0.9069    
1000         0.8282     0.7290     0.7418     0.7354     0.9065    
1500         0.8339     0.7429     0.7400     0.7415     0.9089    
2000         0.8323     0.7394     0.7395     0.7395     0.9086    
2500 
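
The "Best threshold" lines in the logs above report a decision threshold tuned on the predicted probabilities. The routine that produced them is not shown in this section, so the following is only a minimal sketch of how an F1-maximising threshold could be found with a plain NumPy sweep. The function name `best_f1_threshold`, the threshold grid, and the toy data are illustrative assumptions, not part of `llmscan`.

```python
import numpy as np

def best_f1_threshold(y_true, y_prob, grid=None):
    """Return (threshold, precision, recall, f1) maximising F1 over a grid."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 181)  # assumed grid, step of 0.005

    best = (0.5, 0.0, 0.0, -1.0)
    for t in grid:
        pred = y_prob >= t
        tp = int(np.sum(pred & y_true))    # true positives
        fp = int(np.sum(pred & ~y_true))   # false positives
        fn = int(np.sum(~pred & y_true))   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        if f1 > best[3]:  # keep the lowest threshold among ties
            best = (float(t), precision, recall, f1)
    return best

# Toy data (hypothetical), only to exercise the function
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.05, 0.10, 0.20, 0.45, 0.40, 0.70, 0.90, 0.30, 0.35, 0.15]
threshold, precision, recall, f1 = best_f1_threshold(y_true, y_prob)
print(f"Best threshold: {threshold:.3f}")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```

To avoid an optimistic bias, a threshold tuned this way would normally be selected on the validation split and then applied unchanged to the test split.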

### 7. Conclusion

Performance improved marginally on the concatenated dataset, but diminishing returns persist. Feature selection proved ineffective: across subsets of 500 to 6,000 features, F1 varied by less than 0.01 and AUC by less than 0.005, which suggests that dimensionality is not the limiting factor. Overall, results plateau at approximately 84% accuracy, 91% AUC, and 75% F1 score (with precision and recall exhibiting comparable values).
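
To put a number on how little feature selection changed the outcome, the snippet below computes the spread of F1 and AUC across the twelve feature subsets. The values are copied from the `Results:` lines of the logs above (default threshold); the variable names are illustrative.

```python
# F1 and AUC at the default threshold, copied from the "Results:" log lines
top_k = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000]
f1 = [0.7379, 0.7354, 0.7415, 0.7395, 0.7401, 0.7415,
      0.7386, 0.7378, 0.7409, 0.7415, 0.7419, 0.7391]
auc = [0.9069, 0.9065, 0.9089, 0.9086, 0.9083, 0.9098,
       0.9087, 0.9077, 0.9085, 0.9088, 0.9081, 0.9076]

# Spread = best minus worst value over all subsets
f1_spread = max(f1) - min(f1)
auc_spread = max(auc) - min(auc)

# Subset size at which each metric peaks
best_f1_k = top_k[f1.index(max(f1))]
best_auc_k = top_k[auc.index(max(auc))]

print(f"F1 spread:  {f1_spread:.4f} (best at top {best_f1_k})")
print(f"AUC spread: {auc_spread:.4f} (best at top {best_auc_k})")
```

Both spreads are well under one percentage point, and neither metric follows a monotone trend in the number of retained features.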

These findings corroborate our earlier observations:
- Hallucination-related signals are present and extractable from the model's internal representations.
- These signals appear to be concentrated rather than distributed across layers: a single layer is sufficient to retrieve them.

The incorporation of multiple layers and attention-specific features yielded no significant improvement, with results plateauing consistently.

Although not presented in this notebook series, we evaluated various pooling strategies (max-pooling, attention-weighted pooling, mean/max concatenation) and alternative probes (random forests, logistic regression, neural probes). Mean-pooling consistently outperformed all other pooling approaches, and XGBoost similarly outperformed all alternative probes. We also experimented with probe ensembling across layers and PCA decomposition, both of which yielded marginally inferior results, the former likely due to the concentrated rather than distributed nature of the hallucination signal.
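
For reference, the pooling strategies listed above can be sketched in NumPy as follows. This is an illustrative sketch, not the code used in these experiments: the `(tokens, dim)` array layout, the function names, and in particular the weighting scheme for attention-weighted pooling (per-token weights normalised to sum to 1) are assumptions.

```python
import numpy as np

def mean_pool(h):
    """Average over the token axis: (tokens, dim) -> (dim,)."""
    return h.mean(axis=0)

def max_pool(h):
    """Element-wise maximum over the token axis: (tokens, dim) -> (dim,)."""
    return h.max(axis=0)

def mean_max_concat(h):
    """Concatenate mean- and max-pooled vectors: (tokens, dim) -> (2 * dim,)."""
    return np.concatenate([mean_pool(h), max_pool(h)])

def weighted_pool(h, weights):
    """Weighted average over tokens; weights are normalised to sum to 1.

    Assumption: `weights` holds one non-negative score per token, for
    example attention mass received by that token.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ h

# Toy activations: 4 tokens, 3 dimensions, one token spikes in dimension 0
h = np.array([
    [0.0, 1.0, 2.0],
    [0.0, 1.0, 2.0],
    [8.0, 1.0, 2.0],
    [0.0, 1.0, 2.0],
])

print("mean:    ", mean_pool(h))
print("max:     ", max_pool(h))
print("concat:  ", mean_max_concat(h))
print("weighted:", weighted_pool(h, [1, 1, 2, 0]))
```

In the toy input, a spike of 8 on one token out of four is reduced to 2 by mean pooling and preserved by max pooling, which shows how the choice of pooling decides whether a localised signal survives.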

Several hypotheses may account for the observed performance ceiling:
- *Fundamental limitations:* Hallucination-related signals may be only partially encoded in the model's internal representations, or the encoded information may diverge from the standard academic conceptualisation of hallucination.
- *Dataset quality:* A non-negligible rate of mislabelled examples (false positives and false negatives in the labels) may introduce noise that no probe, however sophisticated, can overcome.
- *Pooling artifacts:* Mean-pooling may attenuate per-token signals, resulting in information loss.
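
The dataset-quality hypothesis can be made concrete with a small simulation. Under the simplifying assumption of symmetric label noise (each label flipped independently with probability `noise_rate`, which real annotation errors need not satisfy), a probe with true accuracy `a` is expected to score `a * (1 - noise_rate) + (1 - a) * noise_rate` against the noisy labels, so even a perfect probe cannot be measured above `1 - noise_rate`. The function names and the noise rates below are illustrative and are not estimates for this dataset.

```python
import numpy as np

def expected_measured_accuracy(true_acc, noise_rate):
    """Expected accuracy against noisy labels under symmetric, independent flips."""
    return true_acc * (1 - noise_rate) + (1 - true_acc) * noise_rate

def simulate_measured_accuracy(true_acc, noise_rate, n=200_000, seed=0):
    """Monte Carlo check of the formula above on synthetic binary labels."""
    rng = np.random.default_rng(seed)
    y_clean = rng.integers(0, 2, size=n)

    # Probe predictions: wrong w.r.t. the clean labels with prob. 1 - true_acc
    wrong = rng.random(n) >= true_acc
    y_pred = np.where(wrong, 1 - y_clean, y_clean)

    # Observed labels: clean labels flipped with prob. noise_rate
    flipped = rng.random(n) < noise_rate
    y_noisy = np.where(flipped, 1 - y_clean, y_clean)

    return float(np.mean(y_pred == y_noisy))

for noise_rate in (0.0, 0.05, 0.10):
    ceiling = expected_measured_accuracy(1.0, noise_rate)
    simulated = simulate_measured_accuracy(1.0, noise_rate)
    print(f"noise={noise_rate:.2f}  ceiling={ceiling:.3f}  simulated={simulated:.3f}")
```

This sketch only models noise in the evaluation labels; noise in the training labels would additionally degrade what the probe learns, which the simulation does not capture.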

A promising avenue for future work would be to retain raw, per-token activations - for instance, from layer 16 alone - and train a transformer-based probe. However, this approach would conflict with our core objective: a lightweight method enabling fast, real-time inference at virtually no additional computational cost. Nevertheless, from a purely research perspective, success with such an approach would demonstrate that hallucination signals are present and retrievable with high precision and recall from the model's internal representations. In other words, it would suggest that the model "knows" - whether explicitly or implicitly - when it is hallucinating, a finding with potential implications for future LLM development.

That said, the most prudent next step appears to be retraining the XGBoost probe on the single best-performing layer using a carefully curated, high-quality dataset. The dataset employed in this study was generated using relatively sophisticated instructions; however, due to budget constraints, no majority voting mechanism was implemented, nor was human annotation performed. Consequently, the presence of a non-negligible proportion of mislabelled examples cannot be ruled out.
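
As an illustration of the majority-voting mechanism mentioned above, the sketch below aggregates several independent labels per example and flags ties or low-agreement cases for human review. The three-annotator setup, the `min_agreement` parameter, and the function name are hypothetical; they do not describe how the current dataset was built.

```python
from collections import Counter

def majority_vote(votes, min_agreement=2 / 3):
    """Aggregate labels for one example.

    Returns (label, agreement). The label is None when the vote is tied or
    when the winning share falls below `min_agreement`, signalling that the
    example should be reviewed or discarded.
    """
    if not votes:
        raise ValueError("at least one vote is required")

    ranked = Counter(votes).most_common(2)
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)

    tie = len(ranked) > 1 and ranked[1][1] == top_count
    if tie or agreement < min_agreement:
        return None, agreement
    return top_label, agreement

# Hypothetical labels from three independent annotation passes
# (1 = hallucination, 0 = faithful)
examples = {
    "ex_001": [1, 1, 1],
    "ex_002": [1, 1, 0],
    "ex_003": [0, 1],
}

for example_id, votes in examples.items():
    label, agreement = majority_vote(votes)
    status = "needs review" if label is None else f"label={label}"
    print(f"{example_id}: {status} (agreement={agreement:.2f})")
```

Raising `min_agreement` to 1.0 would keep only unanimously labelled examples, trading dataset size for label quality.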