# **Second experiment: training an XGBoost probe on multiple activation layers**

Having trained a probe on the single best-performing layer in the previous notebook, we now investigate whether combining multiple layers yields improved performance. We begin with layers 15 and 16 (the two top-performing layers as identified in our initial layer selection analysis) and subsequently extend our analysis to include layer 18.

For each configuration, we explore two approaches:
- Direct concatenation of pooled activations;
- Feature selection, retaining varying numbers of the most informative features.
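
As a rough illustration of these two approaches, the sketch below concatenates two toy activation matrices along the feature axis and then keeps the top-k columns by an importance score. The array sizes, the random importance scores, and the helper name `select_top_k` are assumptions made for illustration; in the notebook the scores come from the trained probe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the pooled activations of two layers: 6 samples, 4 features each.
acts_a = rng.normal(size=(6, 4)).astype(np.float32)
acts_b = rng.normal(size=(6, 4)).astype(np.float32)

# Approach 1: direct concatenation along the feature axis.
concat = np.concatenate([acts_a, acts_b], axis=1)  # shape (6, 8)

def select_top_k(features, importance, k):
    """Keep the k columns with the highest importance (hypothetical helper)."""
    order = np.argsort(importance)[::-1][:k]
    return features[:, order], order

# Approach 2: feature selection. Random scores stand in for real importances.
importance = rng.random(concat.shape[1])
subset, kept = select_top_k(concat, importance, k=3)

print(concat.shape, subset.shape)
```

The same column indices must be applied to the train, validation, and test matrices so that all three splits share one feature space.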

The training procedure follows the methodology established in the previous notebook.

*Prerequisite:* This notebook assumes that activations for layers 15 and 18 have already been extracted following the procedure detailed in the previous notebook for layer 16. If this has not yet been done, the extraction process is entirely analogous and should be completed before proceeding.

### 1. Installing required libraries

We begin by installing the necessary libraries:

In [2]:
# Install `llmscan`
!pip install git+https://github.com/julienbrasseur/llm-hallucination-detector.git

# Install `datasets`
!pip install datasets

Collecting git+https://github.com/julienbrasseur/llm-hallucination-detector.git
  Cloning https://github.com/julienbrasseur/llm-hallucination-detector.git to /tmp/pip-req-build-z_gkm7qp
  Running command git clone --filter=blob:none --quiet https://github.com/julienbrasseur/llm-hallucination-detector.git /tmp/pip-req-build-z_gkm7qp
  Resolved https://github.com/julienbrasseur/llm-hallucination-detector.git to commit 77b721d351f3cb5b08d8447d199d6afe38970d26
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers>=4.36.0 (from llmscan==0.1.0)
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting xgboost>=2.0.0 (from llmscan==0.1.0)
  Downloading xgboost-3.1.2-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Collecting scikit-learn>=1.3.0 (from llmscan==0.1.0)
  Downloading scikit_learn-1.8.0-cp311-cp311-manylinux_2_27_x86_64.man

### 2. Data preparation

As before, we load the dataset from Hugging Face and convert it to the standard OpenAI conversation format.
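
For orientation, the conversion of a single record can be sketched as follows. The example record is made up for illustration and is not taken from the dataset; the helper name `to_conversation` is likewise an assumption.

```python
def to_conversation(item):
    """Map one raw record to the two-turn conversation format plus its label."""
    conversation = {
        "conversation": [
            {"role": "user", "content": item["input"]},
            {"role": "assistant", "content": item["target"]},
        ]
    }
    return conversation, int(item["hallucination"])

# Made-up record, for illustration only.
example = {
    "input": "What is the capital of France?",
    "target": "The capital of France is Lyon.",
    "hallucination": True,
}
record, label = to_conversation(example)
print(record["conversation"][1]["role"], label)
```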

In [4]:
import torch
import numpy as np
from datasets import load_dataset

# Set training dataset path
DATASET_NAME = "krogoldAI/hallucination-labeled-dataset"

def load_and_format_dataset(dataset_name: str):
    """
    Load HuggingFace dataset and convert to conversation format.

    This function converts dataset with 'input', 'target', 'hallucination' fields
    to the standard conversation format expected by the pipeline.

    Returns:
        Tuple of (train_data, val_data, test_data, train_labels, val_labels, test_labels)
    """
    print(f"Loading dataset: {dataset_name}")
    ds = load_dataset(dataset_name)

    # Shuffle each split
    ds["train"] = ds["train"].shuffle(seed=42)
    ds["validation"] = ds["validation"].shuffle(seed=42)
    ds["test"] = ds["test"].shuffle(seed=42)

    def format_split(split):
        """Convert HF dataset split to conversation format."""
        formatted = []
        labels = []

        for item in split:
            # Extract fields from your HF dataset format
            user_msg = item["input"]
            assistant_msg = item["target"]
            label = int(item["hallucination"])

            # Convert to standard conversation format
            formatted.append({
                "conversation": [
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": assistant_msg},
                ]
            })
            labels.append(label)

        return formatted, np.array(labels)

    # Format all splits
    train_data, train_labels = format_split(ds["train"])
    val_data, val_labels = format_split(ds["validation"])
    test_data, test_labels = format_split(ds["test"])

    print(f"Dataset loaded and formatted:")
    print(f"  Train:      {len(train_data):,} examples")
    print(f"  Validation: {len(val_data):,} examples")
    print(f"  Test:       {len(test_data):,} examples")
    print(f"  Class distribution (train): "
          f"{(train_labels == 0).sum():,} non-hallucination, "
          f"{(train_labels == 1).sum():,} hallucination")

    return train_data, val_data, test_data, train_labels, val_labels, test_labels

# Load and format dataset
train_data, val_data, test_data, train_labels, val_labels, test_labels = \
    load_and_format_dataset(DATASET_NAME)

Loading dataset: krogoldAI/hallucination-labeled-dataset


Dataset loaded and formatted:
  Train:      101,618 examples
  Validation: 21,775 examples
  Test:       21,776 examples
  Class distribution (train): 68,913 non-hallucination, 32,705 hallucination
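
The class distribution is imbalanced by roughly 2:1, which is worth keeping in mind when reading the accuracy figures later on. The short calculation below uses only the training counts printed above; the baseline it reports is for the training split, and the other splits may differ slightly.

```python
# Class counts reported for the training split above.
n_negative = 68_913  # non-hallucination
n_positive = 32_705  # hallucination
n_total = n_negative + n_positive

positive_rate = n_positive / n_total
majority_accuracy = n_negative / n_total  # always predicting "non-hallucination"
imbalance_ratio = n_negative / n_positive

print(f"Hallucination rate:         {positive_rate:.3f}")
print(f"Majority-baseline accuracy: {majority_accuracy:.3f}")
print(f"Negative/positive ratio:    {imbalance_ratio:.2f}")
```

Any probe should be judged against this majority baseline rather than against 50%.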


### 3. Reloading activations

We load the pre-extracted activations for layers 15 and 16, concatenate them, and align the resulting arrays with the binary hallucination labels from the dataset.

In [5]:
import torch
import numpy as np

# Load activations for layer 15
print("Loading activations for layer 15...")
train_acts_15 = torch.load("/workspace/feature_cache15/train_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()
val_acts_15 = torch.load("/workspace/feature_cache15/val_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()
test_acts_15 = torch.load("/workspace/feature_cache15/test_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()
print(f"Layer 15 - Train: {train_acts_15.shape}, Val: {val_acts_15.shape}, Test: {test_acts_15.shape}")

# Load activations for layer 16
print("Loading activations for layer 16...")
train_acts_16 = torch.load("/workspace/feature_cache16/train_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()
val_acts_16 = torch.load("/workspace/feature_cache16/val_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()
test_acts_16 = torch.load("/workspace/feature_cache16/test_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()
print(f"Layer 16 - Train: {train_acts_16.shape}, Val: {val_acts_16.shape}, Test: {test_acts_16.shape}")

# Concatenate along feature dimension (axis=1)
train_acts = np.concatenate([train_acts_15, train_acts_16], axis=1)
val_acts = np.concatenate([val_acts_15, val_acts_16], axis=1)
test_acts = np.concatenate([test_acts_15, test_acts_16], axis=1)

print(f"\nConcatenated shapes:")
print(f"  Train: {train_acts.shape}")  # Should be [N, 4096*2] = [N, 8192]
print(f"  Val:   {val_acts.shape}")
print(f"  Test:  {test_acts.shape}")

# Align labels (use minimum length in case of mismatch)
min_train = min(len(train_acts_15), len(train_acts_16))
min_val = min(len(val_acts_15), len(val_acts_16))
min_test = min(len(test_acts_15), len(test_acts_16))

train_labels_aligned = train_labels[:min_train]
val_labels_aligned = val_labels[:min_val]
test_labels_aligned = test_labels[:min_test]

print(f"\nLabels aligned:")
print(f"  Train: {len(train_labels_aligned)}")
print(f"  Val:   {len(val_labels_aligned)}")
print(f"  Test:  {len(test_labels_aligned)}")

Loading activations for layer 15...
Layer 15 - Train: (101618, 4096), Val: (21775, 4096), Test: (21776, 4096)
Loading activations for layer 16...
Layer 16 - Train: (101618, 4096), Val: (21775, 4096), Test: (21776, 4096)

Concatenated shapes:
  Train: (101618, 8192)
  Val:   (21775, 8192)
  Test:  (21776, 8192)

Labels aligned:
  Train: 101618
  Val:   21775
  Test:  21776
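
Truncating the labels to the shortest array only helps if every array has its rows in the same order, and `np.concatenate` along `axis=1` already requires equal row counts. A stricter alternative, sketched here with a hypothetical helper `check_alignment` and toy arrays, is to fail loudly on any mismatch instead of truncating silently.

```python
import numpy as np

def check_alignment(labels, *activation_blocks):
    """Raise if any activation block has a different row count than the labels."""
    n = len(labels)
    for i, block in enumerate(activation_blocks):
        if block.shape[0] != n:
            raise ValueError(f"block {i} has {block.shape[0]} rows, expected {n}")
    return n

# Toy data standing in for the real label and activation arrays.
labels = np.array([0, 1, 0])
layer_a = np.zeros((3, 4), dtype=np.float32)
layer_b = np.zeros((3, 4), dtype=np.float32)

print(check_alignment(labels, layer_a, layer_b))
```

This check only covers row counts. It cannot detect a different shuffle order between extraction runs, so the same seed must be used throughout.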


### 4. Training an XGBoost probe on concatenated layers 15 and 16

With the data prepared, we proceed to train an XGBoost probe on the concatenated activations, following the same procedure as for the single-layer case.

In [6]:
from llmscan import XGBoostProbe

# XGBoost parameters
XGB_PARAMS = {
    'n_estimators': 800,
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist',
    'device': 'cuda',
    'eval_metric': 'logloss',
}

# Train
print("\nTraining XGBoost...")
probe = XGBoostProbe(xgb_params=XGB_PARAMS)
probe.fit(
    train_acts,
    train_labels_aligned,
    X_val=val_acts,
    y_val=val_labels_aligned,
    early_stopping_rounds=20,
    verbose=True
)

# Evaluate on test set
print("\nEvaluating on test set...")
metrics = probe.evaluate(test_acts, test_labels_aligned, verbose=True)

# Save probe
probe.save("hallucination_probe_layers_15_16.pkl")
print("\nProbe saved!")


Training XGBoost...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


[0]	train-logloss:0.61331	val-logloss:0.61352
[10]	train-logloss:0.51802	val-logloss:0.52016
[20]	train-logloss:0.46693	val-logloss:0.47092
[30]	train-logloss:0.43693	val-logloss:0.44245
[40]	train-logloss:0.41777	val-logloss:0.42460
[50]	train-logloss:0.40332	val-logloss:0.41179
[60]	train-logloss:0.39077	val-logloss:0.40171
[70]	train-logloss:0.38010	val-logloss:0.39401
[80]	train-logloss:0.37074	val-logloss:0.38782
[90]	train-logloss:0.36205	val-logloss:0.38249
[100]	train-logloss:0.35410	val-logloss:0.37800
[110]	train-logloss:0.34719	val-logloss:0.37441
[120]	train-logloss:0.34140	val-logloss:0.37144
[130]	train-logloss:0.33583	val-logloss:0.36881
[140]	train-logloss:0.33080	val-logloss:0.36651
[150]	train-logloss:0.32592	val-logloss:0.36453
[160]	train-logloss:0.32177	val-logloss:0.36299
[170]	train-logloss:0.31740	val-logloss:0.36156
[180]	train-logloss:0.31339	val-logloss:0.36016
[190]	train-logloss:0.30940	val-logloss:0.35911
[200]	train-logloss:0.30574	val-logloss:0.35790
[21

As in the previous notebook, we tune the decision threshold to optimise the F1 score:

In [7]:
from sklearn.metrics import precision_recall_curve, classification_report
from llmscan import XGBoostProbe

# Load model using the class method
probe = XGBoostProbe.load("hallucination_probe_layers_15_16.pkl")

# Get probabilities for hallucination class
y_proba = probe.predict_proba(test_acts)[:, 1]

# Get precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(test_labels_aligned, y_proba)

# Compute F1 for each threshold (the last precision/recall pair has no threshold)
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)

# Find optimal threshold
best_idx = f1_scores.argmax()
best_threshold = thresholds[best_idx]

print(f"Best threshold: {best_threshold:.3f}")
print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")

# Full report with optimized threshold
y_pred_optimized = (y_proba >= best_threshold).astype(int)
print("\nClassification Report (optimized threshold):")
print(classification_report(test_labels_aligned, y_pred_optimized, digits=4))

Model loaded from hallucination_probe_layers_15_16.pkl
Best threshold: 0.388
Precision: 0.749, Recall: 0.744, F1: 0.746

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8788    0.8821    0.8804     14769
           1     0.7494    0.7435    0.7465      7007

    accuracy                         0.8375     21776
   macro avg     0.8141    0.8128    0.8134     21776
weighted avg     0.8372    0.8375    0.8373     21776
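
The cell above selects the threshold on the same test split it then reports on, which can make the reported F1 slightly optimistic. One common alternative, sketched below on synthetic scores, is to choose the threshold on the validation split and apply it unchanged to the test split. The helper names and the synthetic data are assumptions for illustration, not part of `llmscan`.

```python
import numpy as np

def f1_at(y_true, y_score, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    pred = y_score >= threshold
    tp = int(np.sum(pred & (y_true == 1)))
    fp = int(np.sum(pred & (y_true == 0)))
    fn = int(np.sum(~pred & (y_true == 1)))
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) maximising F1 over the observed score values."""
    candidates = np.unique(y_score)
    scores = [f1_at(y_true, y_score, t) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), scores[best]

rng = np.random.default_rng(0)

def make_split(n):
    """Synthetic labels and noisy scores; positives score higher on average."""
    y = rng.integers(0, 2, size=n)
    score = np.clip(0.3 + 0.3 * y + rng.normal(0.0, 0.2, size=n), 0.0, 1.0)
    return y, score

y_val, s_val = make_split(500)
y_test, s_test = make_split(500)

# Choose the threshold on validation data only, then apply it unchanged to test.
threshold, val_f1 = best_f1_threshold(y_val, s_val)
test_f1 = f1_at(y_test, s_test, threshold)
print(f"threshold={threshold:.3f}  val F1={val_f1:.3f}  test F1={test_f1:.3f}")
```

With the real probe, `s_val` and `s_test` would be the positive-class probabilities on the validation and test activations.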



### 5. Comment

After threshold optimisation, performance remains comparable to the single best-performing layer. Accuracy improves marginally, by approximately 1 percentage point (to about 84%), while precision, recall, and F1 score vary only slightly, by amounts that may well fall within the range of experimental noise.
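
To put "experimental noise" into rough numbers, the sketch below computes the binomial standard error of the reported test accuracy. It assumes independent test examples and ignores run-to-run variance from training (for example, row and column subsampling), so it is only a lower bound on the uncertainty. Because both probes are scored on the same test examples, a paired comparison would be more appropriate for a formal test.

```python
import math

# Test-set size and accuracy from the classification report above.
n_test = 21_776
accuracy = 0.8375

# Binomial standard error and a normal-approximation 95% interval.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
half_width = 1.96 * std_err

print(f"Standard error: {std_err:.4f}")
print(f"95% interval:   {accuracy - half_width:.4f} to {accuracy + half_width:.4f}")
```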

This outcome suggests that concatenating additional layers may not yield substantial benefits, particularly when weighed against the increased computational cost of extracting and processing multiple activation layers. Given that the remaining layers are less expressive for hallucination detection than layers 15 and 16, further layer additions would likely produce diminishing returns.

Nevertheless, it remains possible that the concatenated representation contains redundant dimensions. Feature selection may both reduce computational overhead and potentially improve generalisation by eliminating noise. Since this analysis is computationally inexpensive, we proceed to investigate.

### 6. Feature selection experiment

We now train XGBoost probes on subsets of the most important features, ranging from 500 to 4,000 features in increments of 500. If no significant improvement emerges within this range, there is little justification for extending the search further.

In [8]:
import numpy as np
from llmscan import XGBoostProbe

print("="*63)
print("FEATURE SELECTION EXPERIMENT")
print("="*63)

# Set feature selection parameters (from 500 to 4000 features in steps of 500)
MIN_FEATURES = 500
MAX_FEATURES = 4000
ITERATION_STEP = 500

# Extract feature importance
print("\nExtracting feature importance...")
feature_importance = probe.get_feature_importance(
    importance_type='gain',
    top_k=None
)
print(f"  Total features: {len(feature_importance)}")
print(f"  Features with non-zero importance: {len([v for v in feature_importance.values() if v > 0])}")

# Initialize list to store metrics
results = []

# Iterate over the feature counts
for top_k in range(MIN_FEATURES, MAX_FEATURES+1, ITERATION_STEP):
    print("\n" + "="*63)
    print(f"EXPERIMENT: Top {top_k} features")
    print("="*63)

    # Get top-k most important feature indices
    top_features = sorted(
        feature_importance.items(),
        key=lambda x: x[1],
        reverse=True
    )[:top_k]
    top_indices = [idx for idx, _ in top_features]

    print(f"\nSelected top {top_k} features (indices: {top_indices[:5]}...{top_indices[-5:]})")

    # Subset the data
    train_acts_subset = train_acts[:, top_indices]
    val_acts_subset = val_acts[:, top_indices]
    test_acts_subset = test_acts[:, top_indices]

    print(f"Subset shapes: {train_acts_subset.shape}, {val_acts_subset.shape}, {test_acts_subset.shape}")

    # Train new XGBoost on selected features
    print(f"Training XGBoost on {top_k} selected features...")

    XGB_PARAMS = {
        'n_estimators': 1000,
        'max_depth': 6,
        'learning_rate': 0.05,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'tree_method': 'hist',
        'eval_metric': 'logloss',
    }

    probe_subset = XGBoostProbe(xgb_params=XGB_PARAMS)
    probe_subset.fit(
        train_acts_subset,
        train_labels_aligned,
        X_val=val_acts_subset,
        y_val=val_labels_aligned,
        early_stopping_rounds=20,
        verbose=False
    )

    print(f"Training complete (best iteration: {probe_subset.model.best_iteration})")

    # Evaluate
    print(f"Evaluating...")
    metrics = probe_subset.evaluate(
        test_acts_subset,
        test_labels_aligned,
        threshold=0.388, # we use the optimal threshold obtained above
        verbose=False
    )

    results.append({
        'top_k': top_k,
        'accuracy': metrics['accuracy'],
        'precision': metrics['precision'],
        'recall': metrics['recall'],
        'f1': metrics['f1'],
        'auc': metrics['auc']
    })

    print(f"  Results: Acc={metrics['accuracy']:.4f}, Recall={metrics['recall']:.4f}, "
          f"F1={metrics['f1']:.4f}, AUC={metrics['auc']:.4f}")

    # Get probabilities for hallucination class
    y_proba = probe_subset.predict_proba(test_acts_subset)[:, 1]
    
    # Get precision-recall curve
    precisions, recalls, thresholds = precision_recall_curve(test_labels_aligned, y_proba)
    
    # Compute F1 for each threshold (the last precision/recall pair has no threshold)
    f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
    
    # Find optimal threshold
    best_idx = f1_scores.argmax()
    best_threshold = thresholds[best_idx]
    
    print(f"Best threshold: {best_threshold:.3f}")
    print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")
    
    # Full report with optimized threshold
    y_pred_optimized = (y_proba >= best_threshold).astype(int)
    print("\nClassification Report (optimized threshold):")
    print(classification_report(test_labels_aligned, y_pred_optimized, digits=4))

# Compare results
print("\n" + "="*63)
print("FEATURE SELECTION RESULTS COMPARISON")
print("="*63)

print("\n{:<12} {:<10} {:<10} {:<10} {:<10} {:<10}".format(
    "Features", "Accuracy", "Precision", "Recall", "F1", "AUC"
))
print("-" * 63)

for r in results:
    print("{:<12} {:<10.4f} {:<10.4f} {:<10.4f} {:<10.4f} {:<10.4f}".format(
        r['top_k'],
        r['accuracy'],
        r['precision'],
        r['recall'],
        r['f1'],
        r['auc']
    ))

# Find best
best_f1 = max(results, key=lambda x: x['f1'])
best_auc = max(results, key=lambda x: x['auc'])

print("\n" + "="*63)
print("BEST MODELS")
print("="*63)

print(f"\nBest F1: {best_f1['top_k']} features")
print(f"  Accuracy: {best_f1['accuracy']:.4f}")
print(f"  Recall:   {best_f1['recall']:.4f}")
print(f"  F1:       {best_f1['f1']:.4f}")
print(f"  AUC:      {best_f1['auc']:.4f}")

print(f"\nBest AUC: {best_auc['top_k']} features")
print(f"  Accuracy: {best_auc['accuracy']:.4f}")
print(f"  Recall:   {best_auc['recall']:.4f}")
print(f"  F1:       {best_auc['f1']:.4f}")
print(f"  AUC:      {best_auc['auc']:.4f}")

# Analyze layer contribution
print("\n" + "="*63)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*63)

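# First 4096 features correspond to layer 15, next 4096 to layer 16.
# `top_indices` still holds the ranking from the last loop iteration (sorted by
# importance), so slicing it recovers the top-k set of the best model.
features_15 = [idx for idx in top_indices[:best_f1['top_k']] if idx < 4096]
features_16 = [idx for idx in top_indices[:best_f1['top_k']] if idx >= 4096]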

print(f"\nIn top {best_f1['top_k']} features:")
print(f"  Layer 15 features: {len(features_15)} ({len(features_15)/best_f1['top_k']*100:.1f}%)")
print(f"  Layer 16 features:  {len(features_16)} ({len(features_16)/best_f1['top_k']*100:.1f}%)")

if len(features_15) > len(features_16) * 1.5:
    print("\nLayer 15 contributes more important features.")
elif len(features_16) > len(features_15) * 1.5:
    print("\nLayer 16 contributes more important features.")
else:
    print("\nBoth layers contribute roughly equally.")

print("\n" + "="*63)

FEATURE SELECTION EXPERIMENT

Extracting feature importance...
  Total features: 7866
  Features with non-zero importance: 7866

EXPERIMENT: Top 500 features

Selected top 500 features (indices: [5125, 818, 4806, 3395, 1845]...[2326, 4239, 6170, 62, 3169])
Subset shapes: (101618, 500), (21775, 500), (21776, 500)
Training XGBoost on 500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


Training complete (best iteration: 738)
Evaluating...
  Results: Acc=0.8289, Recall=0.7433, F1=0.7366, AUC=0.9067
Best threshold: 0.419
Precision: 0.763, Recall: 0.718, F1: 0.740

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8698    0.8941    0.8818     14769
           1     0.7628    0.7179    0.7397      7007

    accuracy                         0.8374     21776
   macro avg     0.8163    0.8060    0.8107     21776
weighted avg     0.8354    0.8374    0.8360     21776


EXPERIMENT: Top 1000 features

Selected top 1000 features (indices: [5125, 818, 4806, 3395, 1845]...[2331, 5679, 7893, 6367, 1891])
Subset shapes: (101618, 1000), (21775, 1000), (21776, 1000)
Training XGBoost on 1000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


Training complete (best iteration: 821)
Evaluating...
  Results: Acc=0.8351, Recall=0.7467, F1=0.7446, AUC=0.9101
Best threshold: 0.371
Precision: 0.732, Recall: 0.763, F1: 0.747

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8852    0.8678    0.8764     14769
           1     0.7325    0.7628    0.7473      7007

    accuracy                         0.8340     21776
   macro avg     0.8089    0.8153    0.8119     21776
weighted avg     0.8361    0.8340    0.8349     21776


EXPERIMENT: Top 1500 features

Selected top 1500 features (indices: [5125, 818, 4806, 3395, 1845]...[7657, 5145, 6164, 6176, 2436])
Subset shapes: (101618, 1500), (21775, 1500), (21776, 1500)
Training XGBoost on 1500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


Training complete (best iteration: 901)
Evaluating...
  Results: Acc=0.8339, Recall=0.7403, F1=0.7415, AUC=0.9110
Best threshold: 0.361
Precision: 0.725, Recall: 0.768, F1: 0.746

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8867    0.8620    0.8742     14769
           1     0.7253    0.7679    0.7460      7007

    accuracy                         0.8317     21776
   macro avg     0.8060    0.8150    0.8101     21776
weighted avg     0.8348    0.8317    0.8330     21776


EXPERIMENT: Top 2000 features

Selected top 2000 features (indices: [5125, 818, 4806, 3395, 1845]...[2439, 1905, 6298, 976, 1019])
Subset shapes: (101618, 2000), (21775, 2000), (21776, 2000)
Training XGBoost on 2000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


Training complete (best iteration: 676)
Evaluating...
  Results: Acc=0.8340, Recall=0.7435, F1=0.7425, AUC=0.9098
Best threshold: 0.378
Precision: 0.733, Recall: 0.755, F1: 0.744

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8822    0.8697    0.8759     14769
           1     0.7333    0.7551    0.7441      7007

    accuracy                         0.8328     21776
   macro avg     0.8077    0.8124    0.8100     21776
weighted avg     0.8343    0.8328    0.8335     21776


EXPERIMENT: Top 2500 features

Selected top 2500 features (indices: [5125, 818, 4806, 3395, 1845]...[3953, 2645, 3742, 315, 1458])
Subset shapes: (101618, 2500), (21775, 2500), (21776, 2500)
Training XGBoost on 2500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


Training complete (best iteration: 792)
Evaluating...
  Results: Acc=0.8363, Recall=0.7418, F1=0.7447, AUC=0.9106
Best threshold: 0.371
Precision: 0.735, Recall: 0.761, F1: 0.748

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8846    0.8701    0.8773     14769
           1     0.7353    0.7607    0.7478      7007

    accuracy                         0.8349     21776
   macro avg     0.8099    0.8154    0.8125     21776
weighted avg     0.8365    0.8349    0.8356     21776


EXPERIMENT: Top 3000 features

Selected top 3000 features (indices: [5125, 818, 4806, 3395, 1845]...[2325, 5225, 5950, 7561, 3741])
Subset shapes: (101618, 3000), (21775, 3000), (21776, 3000)
Training XGBoost on 3000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


Training complete (best iteration: 703)
Evaluating...
  Results: Acc=0.8347, Recall=0.7418, F1=0.7428, AUC=0.9097
Best threshold: 0.369
Precision: 0.728, Recall: 0.761, F1: 0.744

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8840    0.8655    0.8746     14769
           1     0.7284    0.7607    0.7442      7007

    accuracy                         0.8317     21776
   macro avg     0.8062    0.8131    0.8094     21776
weighted avg     0.8340    0.8317    0.8327     21776


EXPERIMENT: Top 3500 features

Selected top 3500 features (indices: [5125, 818, 4806, 3395, 1845]...[6343, 4635, 7609, 6012, 437])
Subset shapes: (101618, 3500), (21775, 3500), (21776, 3500)
Training XGBoost on 3500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


Training complete (best iteration: 794)
Evaluating...
  Results: Acc=0.8352, Recall=0.7407, F1=0.7431, AUC=0.9107
Best threshold: 0.389
Precision: 0.747, Recall: 0.740, F1: 0.744

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8773    0.8810    0.8791     14769
           1     0.7469    0.7403    0.7435      7007

    accuracy                         0.8357     21776
   macro avg     0.8121    0.8106    0.8113     21776
weighted avg     0.8353    0.8357    0.8355     21776


EXPERIMENT: Top 4000 features

Selected top 4000 features (indices: [5125, 818, 4806, 3395, 1845]...[1288, 6773, 7171, 4929, 1730])
Subset shapes: (101618, 4000), (21775, 4000), (21776, 4000)
Training XGBoost on 4000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


Training complete (best iteration: 699)
Evaluating...
  Results: Acc=0.8361, Recall=0.7388, F1=0.7437, AUC=0.9109
Best threshold: 0.365
Precision: 0.729, Recall: 0.763, F1: 0.745

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8848    0.8655    0.8751     14769
           1     0.7290    0.7625    0.7454      7007

    accuracy                         0.8324     21776
   macro avg     0.8069    0.8140    0.8102     21776
weighted avg     0.8347    0.8324    0.8333     21776


FEATURE SELECTION RESULTS COMPARISON

Features     Accuracy   Precision  Recall     F1         AUC       
---------------------------------------------------------------
500          0.8289     0.7300     0.7433     0.7366     0.9067    
1000         0.8351     0.7424     0.7467     0.7446     0.9101    
1500         0.8339     0.7427     0.7403     0.7415     0.9110    
2000         0.8340     0.7414     0.7435     0.7425     0.9098    
2500 

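As anticipated, feature selection does not substantially improve performance: at the fixed threshold of 0.388, F1 ranges from 0.737 to 0.745 and AUC from 0.907 to 0.911 across the eight subset sizes, against an F1 of 0.746 for the full concatenated probe. However, the results indicate that comparable performance can be achieved with as few as 1,000 of the 8,192 features, suggesting that the concatenated representation contains considerable redundancy.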
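
The layer-contribution analysis at the end of the cell can be generalised to any number of layers by mapping each column index of the concatenated matrix back to its source layer. The sketch below does this for the five highest-ranked indices printed in the experiment output; the helper name `layer_of` is an assumption, and the indices are assumed to be plain integers.

```python
from collections import Counter

LAYERS = (15, 16)   # order in which the layers were concatenated
HIDDEN_SIZE = 4096  # features per layer, as in the shapes printed earlier

def layer_of(feature_index):
    """Return the source layer of a column in the concatenated matrix."""
    return LAYERS[feature_index // HIDDEN_SIZE]

# The five highest-ranked feature indices printed in the experiment output.
top_five = [5125, 818, 4806, 3395, 1845]
counts = Counter(layer_of(i) for i in top_five)

for layer in LAYERS:
    print(f"Layer {layer}: {counts[layer]} of {len(top_five)} features")
```

Extending `LAYERS` to `(15, 16, 18)` would cover the three-layer configuration without further changes.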

### 7. One more layer: extending to three layers

For completeness, we extend the analysis to include a third layer. We consider the three most expressive layers as established by our earlier experiments: layers 15, 16, and 18.

To manage memory efficiently, we employ a streaming concatenation strategy rather than loading all activations simultaneously.
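
The motivation can be seen from a back-of-the-envelope estimate of the training matrix size. The sketch uses the sample count and per-layer width printed earlier in this notebook and assumes float32 storage (4 bytes per value).

```python
# Shapes taken from the outputs earlier in this notebook.
n_train = 101_618
features_per_layer = 4096
n_layers = 3
bytes_per_value = 4  # float32

one_layer_bytes = n_train * features_per_layer * bytes_per_value
concat_bytes = one_layer_bytes * n_layers

GIB = 1024 ** 3
print(f"One layer (train):    {one_layer_bytes / GIB:.2f} GiB")
print(f"Three layers (train): {concat_bytes / GIB:.2f} GiB")
# Holding all three source arrays plus the concatenated copy in RAM at once
# would need roughly twice the concatenated size.
print(f"Sources + copy:       {2 * concat_bytes / GIB:.2f} GiB")
```

Streaming one layer at a time into a memory-mapped file keeps the peak in-memory footprint close to a single layer's array.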

In [9]:
import gc
import numpy as np
import torch
from pathlib import Path

def load_torch_as_numpy(path):
    """Load a .pt file and convert to numpy float32."""
    tensor = torch.load(path, map_location="cpu", weights_only=True)
    arr = tensor.float().numpy()
    del tensor
    return arr

def build_concat_streaming(output_path, sources, n_samples):
    """
    Build concatenated features by streaming arrays one at a time.
    
    Args:
        output_path: Where to save the memory-mapped output
        sources: List of (path, loader_fn) tuples
        n_samples: Number of samples to include
    
    Returns:
        np.memmap array of shape (n_samples, total_features)
    """
    # First pass: compute total feature dimension
    print("Computing total feature dimension...")
    total_dim = 0
    dims = []
    for path, loader_fn in sources:
        arr = loader_fn(path)
        dims.append(arr.shape[1])
        total_dim += arr.shape[1]
        print(f"  {path}: {arr.shape[1]} features")
        del arr
        gc.collect()
    
    print(f"Total features: {total_dim}")
    
    # Allocate memory-mapped output
    print(f"Creating memmap at {output_path} with shape ({n_samples}, {total_dim})...")
    out = np.memmap(output_path, dtype=np.float32, mode='w+', shape=(n_samples, total_dim))
    
    # Second pass: fill the array
    col = 0
    for path, loader_fn in sources:
        print(f"Loading and copying {path}...")
        arr = loader_fn(path).astype(np.float32, copy=False)
        width = arr.shape[1]
        out[:, col:col + width] = arr[:n_samples]
        out.flush()  # Ensure data is written to disk
        col += width
        print(f"  Copied {width} features (columns {col - width} to {col})")
        del arr
        gc.collect()
    
    print("Done.")
    return out


# Define sources for each split
train_sources = [
    ("/workspace/feature_cache15/train_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/feature_cache16/train_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/feature_cache18/train_activations_pooled.pt", load_torch_as_numpy),
]

val_sources = [
    ("/workspace/feature_cache15/val_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/feature_cache16/val_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/feature_cache18/val_activations_pooled.pt", load_torch_as_numpy),
]

test_sources = [
    ("/workspace/feature_cache15/test_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/feature_cache16/test_activations_pooled.pt", load_torch_as_numpy),
    ("/workspace/feature_cache18/test_activations_pooled.pt", load_torch_as_numpy),
]

# Sample counts
n_train = 101618
n_val = 21775
n_test = 21776

# Output directory for memmaps
output_dir = Path("/workspace/concat_features")
output_dir.mkdir(exist_ok=True)

# Build concatenated features
print("\n=== TRAIN ===")
train_acts = build_concat_streaming(output_dir / "train.dat", train_sources, n_train)

print("\n=== VAL ===")
val_acts = build_concat_streaming(output_dir / "val.dat", val_sources, n_val)

print("\n=== TEST ===")
test_acts = build_concat_streaming(output_dir / "test.dat", test_sources, n_test)

print(f"\nFinal shapes:")
print(f"  Train: {train_acts.shape}")
print(f"  Val:   {val_acts.shape}")
print(f"  Test:  {test_acts.shape}")

# Labels
train_labels_aligned = train_labels[:n_train]
val_labels_aligned = val_labels[:n_val]
test_labels_aligned = test_labels[:n_test]

print(f"\nLabels aligned:")
print(f"  Train: {len(train_labels_aligned)}")
print(f"  Val:   {len(val_labels_aligned)}")
print(f"  Test:  {len(test_labels_aligned)}")


=== TRAIN ===
Computing total feature dimension...
  /workspace/feature_cache15/train_activations_pooled.pt: 4096 features
  /workspace/feature_cache16/train_activations_pooled.pt: 4096 features
  /workspace/feature_cache18/train_activations_pooled.pt: 4096 features
Total features: 12288
Creating memmap at /workspace/concat_features/train.dat with shape (101618, 12288)...
Loading and copying /workspace/feature_cache15/train_activations_pooled.pt...
  Copied 4096 features (columns 0 to 4096)
Loading and copying /workspace/feature_cache16/train_activations_pooled.pt...
  Copied 4096 features (columns 4096 to 8192)
Loading and copying /workspace/feature_cache18/train_activations_pooled.pt...
  Copied 4096 features (columns 8192 to 12288)
Done.

=== VAL ===
Computing total feature dimension...
  /workspace/feature_cache15/val_activations_pooled.pt: 4096 features
  /workspace/feature_cache16/val_activations_pooled.pt: 4096 features
  /workspace/feature_cache18/val_activations_pooled.pt: 40

With the three-layer concatenation prepared, we train the probe:

In [10]:
from llmscan import XGBoostProbe
from sklearn.metrics import precision_recall_curve, classification_report

# XGBoost parameters
XGB_PARAMS = {
    'n_estimators': 1000,
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist',
    'device': 'cuda',
    'eval_metric': 'logloss',
}

# Train
print("\nTraining XGBoost...")
probe_three_layers = XGBoostProbe(xgb_params=XGB_PARAMS)
probe_three_layers.fit(
    train_acts,
    train_labels_aligned,
    X_val=val_acts,
    y_val=val_labels_aligned,
    early_stopping_rounds=20,
    verbose=True
)

# Evaluate on test set
print("\nEvaluating on test set...")
metrics = probe_three_layers.evaluate(test_acts, test_labels_aligned, verbose=True)

# Save probe
probe_three_layers.save("hallucination_probe_layers_15_16_18.pkl")
print("\nProbe saved!")

# Get probabilities for hallucination class
y_proba = probe_three_layers.predict_proba(test_acts)[:, 1]

# Get precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(test_labels_aligned, y_proba)

# Compute F1 for each threshold
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)

# Find optimal threshold (the final precision/recall pair has no matching threshold, so exclude it)
best_idx = f1_scores[:-1].argmax()
best_threshold = thresholds[best_idx]

print(f"Best threshold: {best_threshold:.3f}")
print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")

# Full report with optimized threshold
y_pred_optimized = (y_proba >= best_threshold).astype(int)
print("\nClassification Report (optimized threshold):")
print(classification_report(test_labels_aligned, y_pred_optimized, digits=4))


Training XGBoost...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


[0]	train-logloss:0.61336	val-logloss:0.61359
[10]	train-logloss:0.51569	val-logloss:0.51716
[20]	train-logloss:0.46462	val-logloss:0.46747
[30]	train-logloss:0.43453	val-logloss:0.43888
[40]	train-logloss:0.41559	val-logloss:0.42110
[50]	train-logloss:0.40103	val-logloss:0.40832
[60]	train-logloss:0.38867	val-logloss:0.39830
[70]	train-logloss:0.37703	val-logloss:0.38960
[80]	train-logloss:0.36737	val-logloss:0.38316
[90]	train-logloss:0.35892	val-logloss:0.37790
[100]	train-logloss:0.35153	val-logloss:0.37356
[110]	train-logloss:0.34483	val-logloss:0.37029
[120]	train-logloss:0.33892	val-logloss:0.36739
[130]	train-logloss:0.33311	val-logloss:0.36471
[140]	train-logloss:0.32770	val-logloss:0.36206
[150]	train-logloss:0.32282	val-logloss:0.35997
[160]	train-logloss:0.31844	val-logloss:0.35839
[170]	train-logloss:0.31408	val-logloss:0.35684
[180]	train-logloss:0.31008	val-logloss:0.35553
[190]	train-logloss:0.30602	val-logloss:0.35426
[200]	train-logloss:0.30214	val-logloss:0.35312
[21
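
The threshold search used in the cell above can be isolated into a small helper. The following is a minimal sketch on synthetic scores, assuming only NumPy and scikit-learn; `best_f1_threshold` is a hypothetical name and not part of `llmscan`. It relies on the fact that `precision_recall_curve` returns one more precision/recall pair than thresholds, so the final pair is dropped before taking the argmax.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, y_score):
    """Return (threshold, precision, recall, f1) at the F1-maximising threshold."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
    # The last (precision=1, recall=0) point has no associated threshold: drop it
    precisions, recalls = precisions[:-1], recalls[:-1]
    f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-8)
    i = int(f1_scores.argmax())
    return float(thresholds[i]), float(precisions[i]), float(recalls[i]), float(f1_scores[i])


# Synthetic stand-in for probe outputs (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=1000), 0.0, 1.0)

thr, prec, rec, f1 = best_f1_threshold(y_true, y_score)
print(f"Best threshold: {thr:.3f}")
print(f"Precision: {prec:.3f}, Recall: {rec:.3f}, F1: {f1:.3f}")
```

The same helper could be called on the validation split (`val_acts`, `val_labels_aligned`) and the resulting threshold then applied unchanged to the test split, which would keep the test set out of the tuning step; the cell above tunes on the test split directly.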

We again explore feature selection on the three-layer concatenation:

In [6]:
import numpy as np
from llmscan import XGBoostProbe
from sklearn.metrics import precision_recall_curve, classification_report

print("="*62)
print("FEATURE SELECTION EXPERIMENT")
print("="*62)

# Set feature selection parameters (from 500 to 6000 features, in steps of 500)
MIN_FEATURES = 500
MAX_FEATURES = 6000
ITERATION_STEP = 500

# Extract feature importance
print("\nExtracting feature importance...")
feature_importance = probe_three_layers.get_feature_importance(
    importance_type='gain',
    top_k=None
)
print(f"\tTotal features: {len(feature_importance)}")
print(f"\tFeatures with non-zero importance: {len([v for v in feature_importance.values() if v > 0])}")

# Initialize list to store metrics
results = []

# Iterate over the number of retained features
for top_k in range(MIN_FEATURES, MAX_FEATURES+1, ITERATION_STEP):
    print("\n" + "="*62)
    print(f"EXPERIMENT: Top {top_k} features")
    print("="*62)

    # Get top-k most important feature indices
    top_features = sorted(
        feature_importance.items(),
        key=lambda x: x[1],
        reverse=True
    )[:top_k]
    top_indices = [idx for idx, _ in top_features]

    print(f"\nSelected top {top_k} features (indices: {top_indices[:5]}...{top_indices[-5:]})")

    # Subset the data
    train_acts_subset = train_acts[:, top_indices]
    val_acts_subset = val_acts[:, top_indices]
    test_acts_subset = test_acts[:, top_indices]

    print(f"Subset shapes: {train_acts_subset.shape}, {val_acts_subset.shape}, {test_acts_subset.shape}")

    # Train new XGBoost on selected features
    print(f"Training XGBoost on {top_k} selected features...")

    XGB_PARAMS = {
        'n_estimators': 1000,
        'max_depth': 6,
        'learning_rate': 0.05,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'tree_method': 'hist',
        'eval_metric': 'logloss',
    }

    probe_subset = XGBoostProbe(xgb_params=XGB_PARAMS)
    probe_subset.fit(
        train_acts_subset,
        train_labels_aligned,
        X_val=val_acts_subset,
        y_val=val_labels_aligned,
        early_stopping_rounds=20,
        verbose=False
    )

    print(f"\tTraining complete (best iteration: {probe_subset.model.best_iteration})")

    # Evaluate
    print("Evaluating...")
    metrics = probe_subset.evaluate(
        test_acts_subset,
        test_labels_aligned,
        threshold=0.362,  # F1-optimal threshold of the full three-layer probe
        verbose=False
    )

    results.append({
        'top_k': top_k,
        'accuracy': metrics['accuracy'],
        'precision': metrics['precision'],
        'recall': metrics['recall'],
        'f1': metrics['f1'],
        'auc': metrics['auc']
    })

    print(f"  Results: Acc={metrics['accuracy']:.4f}, Recall={metrics['recall']:.4f}, "
          f"F1={metrics['f1']:.4f}, AUC={metrics['auc']:.4f}")

    # Get probabilities for hallucination class
    y_proba = probe_subset.predict_proba(test_acts_subset)[:, 1]
    
    # Get precision-recall curve
    precisions, recalls, thresholds = precision_recall_curve(test_labels_aligned, y_proba)
    
    # Compute F1 for each threshold
    f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
    
    # Find optimal threshold (the final precision/recall pair has no matching threshold, so exclude it)
    best_idx = f1_scores[:-1].argmax()
    best_threshold = thresholds[best_idx]
    
    print(f"Best threshold: {best_threshold:.3f}")
    print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")
    
    # Full report with optimized threshold
    y_pred_optimized = (y_proba >= best_threshold).astype(int)
    print("\nClassification Report (optimized threshold):")
    print(classification_report(test_labels_aligned, y_pred_optimized, digits=4))

# Compare results
print("\n" + "="*62)
print("FEATURE SELECTION RESULTS COMPARISON")
print("="*62)

print("\n{:<12} {:<10} {:<10} {:<10} {:<10} {:<10}".format(
    "Features", "Accuracy", "Precision", "Recall", "F1", "AUC"
))
print("-" * 62)

for r in results:
    print("{:<12} {:<10.4f} {:<10.4f} {:<10.4f} {:<10.4f} {:<10.4f}".format(
        r['top_k'],
        r['accuracy'],
        r['precision'],
        r['recall'],
        r['f1'],
        r['auc']
    ))

# Find best
best_f1 = max(results, key=lambda x: x['f1'])
best_auc = max(results, key=lambda x: x['auc'])

print("\n" + "="*62)
print("BEST MODELS")
print("="*62)

print(f"\nBest F1: {best_f1['top_k']} features")
print(f"  Accuracy: {best_f1['accuracy']:.4f}")
print(f"  Recall:   {best_f1['recall']:.4f}")
print(f"  F1:       {best_f1['f1']:.4f}")
print(f"  AUC:      {best_f1['auc']:.4f}")

print(f"\nBest AUC: {best_auc['top_k']} features")
print(f"  Accuracy: {best_auc['accuracy']:.4f}")
print(f"  Recall:   {best_auc['recall']:.4f}")
print(f"  F1:       {best_auc['f1']:.4f}")
print(f"  AUC:      {best_auc['auc']:.4f}")

# Analyze layer contribution
print("\n" + "="*62)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*62)

# Features for activation layers 15, 16 and 18 (4096 columns each, in concatenation order).
# Re-derive the ranking instead of relying on `top_indices` left over from the last loop iteration.
ranked_indices = sorted(feature_importance, key=feature_importance.get, reverse=True)
best_indices = ranked_indices[:best_f1['top_k']]
features_15 = [idx for idx in best_indices if idx < 4096]
features_16 = [idx for idx in best_indices if 4096 <= idx < 8192]
features_18 = [idx for idx in best_indices if idx >= 8192]

print(f"\nIn top {best_f1['top_k']} features:")
print(f"\tActivation layer 15 features: {len(features_15)} ({len(features_15)/best_f1['top_k']*100:.1f}%)")
print(f"\tActivation layer 16 features: {len(features_16)} ({len(features_16)/best_f1['top_k']*100:.1f}%)")
print(f"\tActivation layer 18 features: {len(features_18)} ({len(features_18)/best_f1['top_k']*100:.1f}%)")

print("\n" + "="*62)

FEATURE SELECTION EXPERIMENT

Extracting feature importance...
	Total features: 11313
	Features with non-zero importance: 11313

EXPERIMENT: Top 500 features

Selected top 500 features (indices: [5125, 11078, 8326, 3551, 7262]...[10158, 10757, 6537, 2932, 9600])
Subset shapes: (101618, 500), (21775, 500), (21776, 500)
Training XGBoost on 500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 595)
Evaluating...
  Results: Acc=0.8250, Recall=0.7628, F1=0.7372, AUC=0.9071
Best threshold: 0.377
Precision: 0.731, Recall: 0.749, F1: 0.740

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8796    0.8693    0.8744     14769
           1     0.7311    0.7491    0.7400      7007

    accuracy                         0.8306     21776
   macro avg     0.8053    0.8092    0.8072     21776
weighted avg     0.8318    0.8306    0.8311     21776


EXPERIMENT: Top 1000 features

Selected top 1000 features (indices: [5125, 11078, 8326, 3551, 7262]...[11275, 3509, 2419, 6403, 2221])
Subset shapes: (101618, 1000), (21775, 1000), (21776, 1000)
Training XGBoost on 1000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 636)
Evaluating...
  Results: Acc=0.8271, Recall=0.7602, F1=0.7389, AUC=0.9097
Best threshold: 0.346
Precision: 0.707, Recall: 0.779, F1: 0.741

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8898    0.8465    0.8676     14769
           1     0.7065    0.7789    0.7410      7007

    accuracy                         0.8248     21776
   macro avg     0.7981    0.8127    0.8043     21776
weighted avg     0.8308    0.8248    0.8268     21776


EXPERIMENT: Top 1500 features

Selected top 1500 features (indices: [5125, 11078, 8326, 3551, 7262]...[5712, 11186, 9621, 11961, 8790])
Subset shapes: (101618, 1500), (21775, 1500), (21776, 1500)
Training XGBoost on 1500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 487)
Evaluating...
  Results: Acc=0.8286, Recall=0.7645, F1=0.7416, AUC=0.9105
Best threshold: 0.374
Precision: 0.735, Recall: 0.753, F1: 0.744

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8812    0.8711    0.8761     14769
           1     0.7347    0.7525    0.7435      7007

    accuracy                         0.8329     21776
   macro avg     0.8080    0.8118    0.8098     21776
weighted avg     0.8341    0.8329    0.8335     21776


EXPERIMENT: Top 2000 features

Selected top 2000 features (indices: [5125, 11078, 8326, 3551, 7262]...[11355, 7256, 1089, 6409, 912])
Subset shapes: (101618, 2000), (21775, 2000), (21776, 2000)
Training XGBoost on 2000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 581)
Evaluating...
  Results: Acc=0.8296, Recall=0.7679, F1=0.7436, AUC=0.9113
Best threshold: 0.389
Precision: 0.747, Recall: 0.745, F1: 0.746

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8793    0.8800    0.8797     14769
           1     0.7467    0.7454    0.7460      7007

    accuracy                         0.8367     21776
   macro avg     0.8130    0.8127    0.8128     21776
weighted avg     0.8366    0.8367    0.8367     21776


EXPERIMENT: Top 2500 features

Selected top 2500 features (indices: [5125, 11078, 8326, 3551, 7262]...[6633, 8047, 1064, 8912, 5452])
Subset shapes: (101618, 2500), (21775, 2500), (21776, 2500)
Training XGBoost on 2500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 692)
Evaluating...
  Results: Acc=0.8307, Recall=0.7695, F1=0.7452, AUC=0.9121
Best threshold: 0.363
Precision: 0.724, Recall: 0.769, F1: 0.746

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8872    0.8609    0.8739     14769
           1     0.7241    0.7692    0.7460      7007

    accuracy                         0.8314     21776
   macro avg     0.8056    0.8151    0.8099     21776
weighted avg     0.8347    0.8314    0.8327     21776


EXPERIMENT: Top 3000 features

Selected top 3000 features (indices: [5125, 11078, 8326, 3551, 7262]...[275, 10875, 1879, 4583, 7948])
Subset shapes: (101618, 3000), (21775, 3000), (21776, 3000)
Training XGBoost on 3000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 713)
Evaluating...
  Results: Acc=0.8307, Recall=0.7679, F1=0.7448, AUC=0.9115
Best threshold: 0.374
Precision: 0.735, Recall: 0.757, F1: 0.746

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8830    0.8703    0.8766     14769
           1     0.7346    0.7570    0.7456      7007

    accuracy                         0.8338     21776
   macro avg     0.8088    0.8136    0.8111     21776
weighted avg     0.8353    0.8338    0.8344     21776


EXPERIMENT: Top 3500 features

Selected top 3500 features (indices: [5125, 11078, 8326, 3551, 7262]...[2343, 5645, 5496, 2444, 5225])
Subset shapes: (101618, 3500), (21775, 3500), (21776, 3500)
Training XGBoost on 3500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 873)
Evaluating...
  Results: Acc=0.8318, Recall=0.7610, F1=0.7444, AUC=0.9126
Best threshold: 0.382
Precision: 0.747, Recall: 0.745, F1: 0.746

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8790    0.8804    0.8797     14769
           1     0.7470    0.7447    0.7459      7007

    accuracy                         0.8367     21776
   macro avg     0.8130    0.8125    0.8128     21776
weighted avg     0.8366    0.8367    0.8366     21776


EXPERIMENT: Top 4000 features

Selected top 4000 features (indices: [5125, 11078, 8326, 3551, 7262]...[10800, 2850, 2503, 5926, 7115])
Subset shapes: (101618, 4000), (21775, 4000), (21776, 4000)
Training XGBoost on 4000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 706)
Evaluating...
  Results: Acc=0.8316, Recall=0.7664, F1=0.7455, AUC=0.9120
Best threshold: 0.365
Precision: 0.729, Recall: 0.765, F1: 0.746

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8858    0.8648    0.8752     14769
           1     0.7286    0.7649    0.7463      7007

    accuracy                         0.8327     21776
   macro avg     0.8072    0.8149    0.8107     21776
weighted avg     0.8352    0.8327    0.8337     21776


EXPERIMENT: Top 4500 features

Selected top 4500 features (indices: [5125, 11078, 8326, 3551, 7262]...[5979, 3145, 4464, 1745, 3659])
Subset shapes: (101618, 4500), (21775, 4500), (21776, 4500)
Training XGBoost on 4500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 600)
Evaluating...
  Results: Acc=0.8307, Recall=0.7651, F1=0.7441, AUC=0.9125
Best threshold: 0.339
Precision: 0.705, Recall: 0.793, F1: 0.747

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8957    0.8427    0.8684     14769
           1     0.7052    0.7932    0.7466      7007

    accuracy                         0.8268     21776
   macro avg     0.8005    0.8180    0.8075     21776
weighted avg     0.8344    0.8268    0.8292     21776


EXPERIMENT: Top 5000 features

Selected top 5000 features (indices: [5125, 11078, 8326, 3551, 7262]...[881, 1117, 1496, 10138, 1482])
Subset shapes: (101618, 5000), (21775, 5000), (21776, 5000)
Training XGBoost on 5000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 816)
Evaluating...
  Results: Acc=0.8311, Recall=0.7649, F1=0.7446, AUC=0.9124
Best threshold: 0.383
Precision: 0.746, Recall: 0.749, F1: 0.747

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8808    0.8788    0.8798     14769
           1     0.7457    0.7493    0.7475      7007

    accuracy                         0.8371     21776
   macro avg     0.8133    0.8140    0.8136     21776
weighted avg     0.8373    0.8371    0.8372     21776


EXPERIMENT: Top 5500 features

Selected top 5500 features (indices: [5125, 11078, 8326, 3551, 7262]...[1755, 154, 5597, 4161, 7798])
Subset shapes: (101618, 5500), (21775, 5500), (21776, 5500)
Training XGBoost on 5500 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 643)
Evaluating...
  Results: Acc=0.8291, Recall=0.7671, F1=0.7429, AUC=0.9115
Best threshold: 0.380
Precision: 0.739, Recall: 0.748, F1: 0.744

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8798    0.8749    0.8773     14769
           1     0.7394    0.7480    0.7437      7007

    accuracy                         0.8341     21776
   macro avg     0.8096    0.8115    0.8105     21776
weighted avg     0.8346    0.8341    0.8343     21776


EXPERIMENT: Top 6000 features

Selected top 6000 features (indices: [5125, 11078, 8326, 3551, 7262]...[6664, 7906, 616, 7258, 4943])
Subset shapes: (101618, 6000), (21775, 6000), (21776, 6000)
Training XGBoost on 6000 selected features...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


	Training complete (best iteration: 681)
Evaluating...
  Results: Acc=0.8320, Recall=0.7674, F1=0.7461, AUC=0.9122
Best threshold: 0.375
Precision: 0.740, Recall: 0.756, F1: 0.748

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8831    0.8737    0.8784     14769
           1     0.7396    0.7562    0.7478      7007

    accuracy                         0.8359     21776
   macro avg     0.8113    0.8149    0.8131     21776
weighted avg     0.8369    0.8359    0.8363     21776


FEATURE SELECTION RESULTS COMPARISON

Features     Accuracy   Precision  Recall     F1         AUC       
--------------------------------------------------------------
500          0.8250     0.7132     0.7628     0.7372     0.9071    
1000         0.8271     0.7188     0.7602     0.7389     0.9097    
1500         0.8286     0.7200     0.7645     0.7416     0.9105    
2000         0.8296     0.7207     0.7679     0.7436     0.9113    
2500 

### 8. Conclusion

The results confirm that adding layers beyond the single best-performing one yields no substantial improvement. The following table summarises performance across all configurations (all metrics are in percent):

| Layers | Features | Threshold | Accuracy | Precision | Recall | F1 | AUC |
|:------:|:--------:|:---------:|:--------:|:---------:|:------:|:--:|:---:|
| 16 | All | 0.366 | 82.96 | 72.49 | 75.87 | 74.13 | 91.04 |
| 15, 16 | All | 0.388 | **83.75** | **74.94** | 74.35 | 74.65 | 90.99 |
| 15, 16 | Top 1500 | 0.361 | 83.17 | 72.53 | 76.79 | 74.60 | 91.10 |
| 15, 16 | Top 2500 | 0.371 | 83.49 | 73.53 | 76.07 | **74.78** | 91.06 |
| 15, 16, 18 | All | 0.362 | 83.42 | 73.29 | 76.27 | 74.75 | **91.29** |
| 15, 16, 18 | Top 3500 | 0.382 | 83.67 | 74.70 | 74.47 | 74.59 | 91.26 |
| 15, 16, 18 | Top 4500 | 0.339 | 82.68 | 70.52 | **79.32** | 74.66 | 91.25 |
| 15, 16, 18 | Top 5000 | 0.383 | 83.71 | 74.57 | 74.93 | 74.75 | 91.24 |
| 15, 16, 18 | Top 6000 | 0.375 | 83.59 | 73.96 | 75.62 | **74.78** | 91.22 |
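
To put a number on how close these configurations are, the spread of F1 and AUC can be computed directly from the table. This is a minimal sketch with the values copied by hand from the rows above; the variable names are illustrative only.

```python
# (layers, features, F1 %, AUC %) copied from the summary table
rows = [
    ("16", "All", 74.13, 91.04),
    ("15, 16", "All", 74.65, 90.99),
    ("15, 16", "Top 1500", 74.60, 91.10),
    ("15, 16", "Top 2500", 74.78, 91.06),
    ("15, 16, 18", "All", 74.75, 91.29),
    ("15, 16, 18", "Top 3500", 74.59, 91.26),
    ("15, 16, 18", "Top 4500", 74.66, 91.25),
    ("15, 16, 18", "Top 5000", 74.75, 91.24),
    ("15, 16, 18", "Top 6000", 74.78, 91.22),
]

f1_values = [row[2] for row in rows]
auc_values = [row[3] for row in rows]

# Difference between the best and worst configuration, in percentage points
f1_spread = max(f1_values) - min(f1_values)
auc_spread = max(auc_values) - min(auc_values)

print(f"F1 spread:  {f1_spread:.2f} points")   # 0.65
print(f"AUC spread: {auc_spread:.2f} points")  # 0.30
```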

Two key observations emerge from these experiments:

1. *Marginal returns from layer concatenation:* Performance plateaus regardless of whether we use one, two, or three layers, with all configurations converging to roughly 74-75% F1 score, 83-84% accuracy and 91% AUC.

2. *Localised signal:* The hallucination-related signal encoded in the model's activations does not appear to be distributed across layers. Rather, it can be effectively retrieved from a single layer, with additional layers contributing primarily redundant information.
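
The layer-contribution analysis in the feature selection cell is one way to inspect how the selected features split across layers. The sketch below maps column indices of the concatenated matrix back to their source layer, assuming the layout shown in the concatenation log (three blocks of 4096 columns, in the order 15, 16, 18). `layer_of` is a hypothetical helper, and the example indices are the five highest-gain features printed in the feature selection output.

```python
import numpy as np

LAYERS = [15, 16, 18]  # concatenation order, as shown in the log
HIDDEN_SIZE = 4096     # columns contributed by each layer


def layer_of(indices):
    """Map concatenated column indices to the layer they come from."""
    idx = np.asarray(indices)
    return np.array(LAYERS)[idx // HIDDEN_SIZE]


# Five highest-gain feature indices from the feature selection output
top5 = [5125, 11078, 8326, 3551, 7262]
layers = layer_of(top5)
print(layers.tolist())  # [16, 18, 18, 15, 16]

# Count of features per layer among the given indices
unique, counts = np.unique(layers, return_counts=True)
print(dict(zip(unique.tolist(), counts.tolist())))  # {15: 1, 16: 2, 18: 2}
```

Using integer division avoids hard-coding each block boundary, so the same helper would work if more layers of the same width were appended.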

One avenue remains unexplored: training a probe on attention-related features, including multi-head attention outputs and per-head statistics. This will be the focus of the final notebook in this series.