# **First experiment: training an XGBoost probe on the best performing activation layer**

In this notebook, we train an XGBoost probe on layer 16, previously identified as the most expressive layer for hallucination detection.

Before proceeding with training, we must extract activations for the training set. Since activations for the validation and test splits were already extracted during layer selection, only the training split remains to be processed.

### 1. Installing required libraries

We start by installing the required libraries:

In [2]:
# Install `llmscan`
!pip install git+https://github.com/julienbrasseur/llm-hallucination-detector.git

# Install `datasets`
!pip install datasets

Collecting git+https://github.com/julienbrasseur/llm-hallucination-detector.git
  Cloning https://github.com/julienbrasseur/llm-hallucination-detector.git to /tmp/pip-req-build-x61_od1e
  Running command git clone --filter=blob:none --quiet https://github.com/julienbrasseur/llm-hallucination-detector.git /tmp/pip-req-build-x61_od1e
  Resolved https://github.com/julienbrasseur/llm-hallucination-detector.git to commit 77b721d351f3cb5b08d8447d199d6afe38970d26
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers>=4.36.0 (from llmscan==0.1.0)
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting xgboost>=2.0.0 (from llmscan==0.1.0)
  Downloading xgboost-3.1.2-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Collecting scikit-learn>=1.3.0 (from llmscan==0.1.0)
  Downloading scikit_learn-1.8.0-cp311-cp311-manylinux_2_27_x86_64.man

### 2. Data preparation

Now, we need to reload the dataset.

In [3]:
import torch
import numpy as np
from datasets import load_dataset

# Set training dataset path
DATASET_NAME = "krogoldAI/hallucination-labeled-dataset"

def load_and_format_dataset(dataset_name: str):
    """
    Load HuggingFace dataset and convert to conversation format.
    
    This function converts dataset with 'input', 'target', 'hallucination' fields
    to the standard conversation format expected by the pipeline.
    
    Returns:
        Tuple of (train_data, val_data, test_data, train_labels, val_labels, test_labels)
    """
    print(f"Loading dataset: {dataset_name}")
    ds = load_dataset(dataset_name)

    # Shuffle each split
    ds["train"] = ds["train"].shuffle(seed=42)
    ds["validation"] = ds["validation"].shuffle(seed=42)
    ds["test"] = ds["test"].shuffle(seed=42)
    
    def format_split(split):
        """Convert HF dataset split to conversation format."""
        formatted = []
        labels = []
        
        for item in split:
            # Extract fields
            user_msg = item["input"]
            assistant_msg = item["target"]
            label = int(item["hallucination"])
            
            # Convert to OpenAI conversation format
            formatted.append({
                "conversation": [
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": assistant_msg},
                ]
            })
            labels.append(label)
        
        return formatted, np.array(labels)
    
    # Format all splits
    train_data, train_labels = format_split(ds["train"])
    val_data, val_labels = format_split(ds["validation"])
    test_data, test_labels = format_split(ds["test"])
    
    print("Dataset loaded and formatted:")
    print(f"\tTrain:      {len(train_data):,} examples")
    print(f"\tValidation: {len(val_data):,} examples")
    print(f"\tTest:       {len(test_data):,} examples")
    print(f"\tClass distribution (train): "
          f"{(train_labels == 0).sum():,} non-hallucination, "
          f"{(train_labels == 1).sum():,} hallucination")
    
    return train_data, val_data, test_data, train_labels, val_labels, test_labels

# Load and format dataset
train_data, val_data, test_data, train_labels, val_labels, test_labels = \
    load_and_format_dataset(DATASET_NAME)

Loading dataset: krogoldAI/hallucination-labeled-dataset


Dataset loaded and formatted:
	Train:      101,618 examples
	Validation: 21,775 examples
	Test:       21,776 examples
	Class distribution (train): 68,913 non-hallucination, 32,705 hallucination


### 3. Extracting activations for the train split

Now, as in the previous notebook, we will use the `ActivationExtractor` class to extract layer 16's activations for the train split.

In [4]:
import os
import torch
from llmscan import ActivationExtractor

# Initialize extractor
extractor = ActivationExtractor(
    model_name="mistralai/Ministral-8B-Instruct-2410",
    target_layers=[16], 
    device="cuda"
)

# Extract train activations
print("Extracting train activations...")
activations = extractor.extract(
    train_data,
    batch_size=128,
    max_length=512,
    mean_pool=True,
    focus_on_assistant=True
)

# Save
os.makedirs("feature_cache16", exist_ok=True)
torch.save(activations, "feature_cache16/train_activations_pooled.pt")
print(f"Saved {len(activations)} train activation sequences")

Loaded model mistralai/Ministral-8B-Instruct-2410. Target layers: [16]. Device for inputs: cuda
Extracting train activations...


Extracting: 100%|██████████| 794/794 [1:04:49<00:00,  4.90s/it]


Saved 101618 train activation sequences


*Remark:* Depending on the dataset size, this step can take a while. Here, extracting the 101,618 training examples took about 65 minutes (794 batches at roughly 4.9 s each).

*Remark:* The `target_layers` parameter of `ActivationExtractor` supports multi-layer extraction, so we could specify, e.g., `[15, 16]` to simultaneously extract layers 15 and 16.

*Remark:* Under the hood, the `extract` method of the `ActivationExtractor` class converts conversations from the standard OpenAI format to the model's native chat template. For instance:

```
[
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"}
]
```

becomes:

```
<s>[INST]Hello![/INST]Hello! How can I assist you today?</s>
```

To isolate the assistant's response (on which we focus our analysis) the method automatically locates the final occurrence of the `[/INST]` token and extracts activations only from the tokens that follow.
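The following is a minimal sketch of that idea, not the actual `llmscan` implementation: it finds the last occurrence of a marker token in one sequence and mean-pools the hidden states of the tokens that follow. The function name `pool_assistant_tokens`, the marker id `END_INST`, and the toy arrays are all hypothetical, and padding and batching are ignored.

```python
import numpy as np

def pool_assistant_tokens(token_ids, hidden_states, end_inst_id):
    """Mean-pool the hidden states of tokens after the last `end_inst_id`."""
    token_ids = np.asarray(token_ids)
    positions = np.flatnonzero(token_ids == end_inst_id)
    if positions.size == 0:
        raise ValueError("no [/INST] marker found in the sequence")
    start = positions[-1] + 1  # first token of the assistant's response
    if start >= len(token_ids):
        raise ValueError("no tokens follow the last [/INST] marker")
    return hidden_states[start:].mean(axis=0)

# Toy example: 6 tokens, hidden size 2; id 4 stands in for [/INST] (assumed id)
END_INST = 4
token_ids = [1, 3, 9, END_INST, 7, 8]
hidden = np.arange(12, dtype=np.float32).reshape(6, 2)

pooled = pool_assistant_tokens(token_ids, hidden, END_INST)
print(pooled)  # one vector of size 2, averaged over the last two tokens
```

With `mean_pool=True`, each conversation is thus reduced to a single fixed-size vector per layer, which is what the probe consumes.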

### 4. Training an XGBoost probe on layer 16

With the train activations cached, we can train an XGBoost probe on layer 16's activations using the `XGBoostProbe` class.

In [10]:
import torch
import numpy as np
from llmscan import XGBoostProbe

# Load cached activations
print("Loading activations...")
train_acts = torch.load("feature_cache16/train_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()
val_acts = torch.load("feature_cache16/val_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()
test_acts = torch.load("feature_cache16/test_activations_pooled.pt", map_location="cpu", weights_only=True).float().numpy()

print(f"Train: {len(train_acts)}, Val: {len(val_acts)}, Test: {len(test_acts)}")

# Align labels (trim to match activations)
train_labels_aligned = train_labels[:len(train_acts)]
val_labels_aligned = val_labels[:len(val_acts)]
test_labels_aligned = test_labels[:len(test_acts)]

print(f"Labels aligned: train={len(train_labels_aligned)}, val={len(val_labels_aligned)}, test={len(test_labels_aligned)}")

# XGBoost params
XGB_PARAMS = {
    'n_estimators': 800,
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist',
    'device': 'cuda',
    'eval_metric': 'logloss',
}

# Train
print("\nTraining XGBoost...")
probe = XGBoostProbe(xgb_params=XGB_PARAMS)
probe.fit(
    train_acts,
    train_labels_aligned,
    X_val=val_acts,
    y_val=val_labels_aligned,
    early_stopping_rounds=20,
    verbose=True
)

# Evaluate on test set
print("\nEvaluating on test set...")
metrics = probe.evaluate(test_acts, test_labels_aligned, verbose=True)

# Save probe
probe.save("hallucination_probe_layer_16.pkl")
print("\nProbe saved!")

Loading activations...
Train: 101618, Val: 21775, Test: 21776
Labels aligned: train=101618, val=21775, test=21776

Training XGBoost...


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


[0]	train-logloss:0.61332	val-logloss:0.61343
[10]	train-logloss:0.51912	val-logloss:0.51981
[20]	train-logloss:0.47051	val-logloss:0.47197
[30]	train-logloss:0.44107	val-logloss:0.44347
[40]	train-logloss:0.42178	val-logloss:0.42523
[50]	train-logloss:0.40858	val-logloss:0.41337
[60]	train-logloss:0.39670	val-logloss:0.40350
[70]	train-logloss:0.38598	val-logloss:0.39530
[80]	train-logloss:0.37614	val-logloss:0.38835
[90]	train-logloss:0.36761	val-logloss:0.38272
[100]	train-logloss:0.36007	val-logloss:0.37813
[110]	train-logloss:0.35335	val-logloss:0.37439
[120]	train-logloss:0.34740	val-logloss:0.37135
[130]	train-logloss:0.34165	val-logloss:0.36833
[140]	train-logloss:0.33662	val-logloss:0.36594
[150]	train-logloss:0.33178	val-logloss:0.36386
[160]	train-logloss:0.32767	val-logloss:0.36227
[170]	train-logloss:0.32306	val-logloss:0.36056
[180]	train-logloss:0.31894	val-logloss:0.35915
[190]	train-logloss:0.31523	val-logloss:0.35794
[200]	train-logloss:0.31170	val-logloss:0.35660
[21

### 5. Comments

We have trained an XGBoost probe on activation layer 16, previously identified as the most expressive layer for hallucination detection. At the default decision threshold of 0.5, the probe achieves 84.15% accuracy, 91.04% AUC, 81.94% precision, 65.09% recall, and a 72.55% F1 score.

These results indicate that the selected layer's activations encode substantial information relevant to hallucination detection, supporting the hypothesis that

<p style="text-align: center;"><i>hallucination-related signals are present and extractable from the model's internal representations</i>.</p>

However, an asymmetry in per-class performance is apparent: while non-hallucinated responses (class `0`) are identified with high recall (93.2%), the probe captures only 65.1% of actual hallucinations (class `1`). This conservative behaviour (where the classifier is more likely to miss a hallucination than to falsely flag a correct response) suggests that the default threshold may not be optimal for our class distribution.
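As a quick consistency check, the figures above can be recombined into an approximate confusion matrix. The sketch below assumes the test-set class supports of 14,769 (class `0`) and 7,007 (class `1`) shown in the classification report of the threshold optimization section. Because the recalls are rounded, the reconstructed counts are approximate.

```python
# Reported figures at threshold 0.5 and test-set class supports
n_neg, n_pos = 14769, 7007   # support of class 0 and class 1
recall_neg = 0.932           # recall on class 0 (rounded in the text)
recall_pos = 0.6509          # recall on class 1

# Approximate confusion-matrix entries
tn = recall_neg * n_neg      # correct responses left unflagged
fp = n_neg - tn              # correct responses falsely flagged
tp = recall_pos * n_pos      # hallucinations caught
fn = n_pos - tp              # hallucinations missed

accuracy = (tp + tn) / (n_neg + n_pos)
precision = tp / (tp + fp)
f1 = 2 * precision * recall_pos / (precision + recall_pos)

print(f"missed hallucinations ~ {fn:.0f}, false alarms ~ {fp:.0f}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} f1={f1:.4f}")
```

Under these assumptions the probe misses roughly 2,400 hallucinations while raising roughly 1,000 false alarms, which is the asymmetry described above. The derived accuracy, precision, and F1 agree with the reported values up to rounding.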

The strong AUC of 91% indicates that better operating points likely exist along the precision-recall curve. Before drawing broader conclusions, we therefore investigate whether adjusting the decision threshold can improve hallucination recall.

### 6. Threshold optimization

Let's see how the probe's decision threshold can be tuned to improve performance.

In [3]:
from sklearn.metrics import precision_recall_curve, classification_report
from llmscan import XGBoostProbe

# Load model using the class method
probe = XGBoostProbe.load("hallucination_probe_layer_16.pkl")

# Get probabilities for hallucination class
y_proba = probe.predict_proba(test_acts)[:, 1]

# Get precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(test_labels_aligned, y_proba)

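# Compute F1 for each threshold; the final (precision=1, recall=0) point
# returned by precision_recall_curve has no threshold, so it is dropped
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)

# Find optimal threshold
best_idx = f1_scores.argmax()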
best_threshold = thresholds[best_idx]

print(f"Best threshold: {best_threshold:.3f}")
print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")

# Full report with optimized threshold
y_pred_optimized = (y_proba >= best_threshold).astype(int)
print("\nClassification Report (optimized threshold):")
print(classification_report(test_labels_aligned, y_pred_optimized, digits=4))

Model loaded from hallucination_probe_layer_16.pkl
Best threshold: 0.366
Precision: 0.725, Recall: 0.759, F1: 0.741

Classification Report (optimized threshold):
              precision    recall  f1-score   support

           0     0.8829    0.8632    0.8730     14769
           1     0.7246    0.7587    0.7413      7007

    accuracy                         0.8296     21776
   macro avg     0.8038    0.8109    0.8071     21776
weighted avg     0.8320    0.8296    0.8306     21776



Threshold tuning confirms that a better operating point was indeed available. Lowering the threshold from 0.5 to 0.366 raises hallucination recall from approximately 65% to 76%, while precision falls from about 82% to 72%. This is a reasonable tradeoff for hallucination detection, where missing a hallucination is typically more costly than occasionally flagging a correct response.

Overall accuracy decreases marginally (from 84.15% to 82.96%), while F1 on the hallucination class improves from 72.55% to 74.13%. The default 0.5 threshold was evidently too conservative given the dataset's class distribution (~2:1 ratio favouring non-hallucinations), which reflects the model's actual hallucination rate on the evaluated tasks rather than an artifact of data collection.
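Note that the cell above selects the threshold on the same test split it then reports on, which can make the resulting figures slightly optimistic. A common alternative, not used in this notebook, is to pick the threshold on the validation split and report test metrics at that fixed value. The sketch below illustrates the idea on synthetic scores. `f1_at`, `pick_threshold`, and `fake_split` are hypothetical helpers, and in practice `s_val` and `s_test` would come from `probe.predict_proba(...)[:, 1]` on the cached validation and test activations.

```python
import numpy as np

GRID = np.linspace(0.05, 0.95, 91)  # candidate thresholds (assumed grid)

def f1_at(y_true, y_score, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    pred = y_score >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def pick_threshold(y_true, y_score, grid=GRID):
    """Return the grid threshold with the highest F1 on the given split."""
    scores = [f1_at(y_true, y_score, t) for t in grid]
    return float(grid[int(np.argmax(scores))])

# Synthetic stand-ins for probe scores, with a ~2:1 class imbalance
rng = np.random.default_rng(0)

def fake_split(n):
    y = (rng.random(n) < 0.32).astype(int)
    s = np.clip(0.3 * y + rng.normal(0.35, 0.15, n), 0.0, 1.0)
    return y, s

y_val, s_val = fake_split(5000)
y_test, s_test = fake_split(5000)

# Select on validation, then report on test at the frozen threshold
t_star = pick_threshold(y_val, s_val)
print(f"threshold chosen on validation: {t_star:.2f}")
print(f"test F1 at that threshold:      {f1_at(y_test, s_test, t_star):.3f}")
print(f"test F1 at 0.5:                 {f1_at(y_test, s_test, 0.5):.3f}")
```

Freezing the threshold before touching the test split keeps the test metrics an unbiased estimate of how the tuned probe would behave on new data.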

### 7. Conclusion

These results establish that hallucination-relevant signal is extractable from layer 16's activations using a lightweight XGBoost classifier, achieving 91% AUC and, after threshold optimisation, a 74% F1 score with 76% recall on the hallucination class.

Several directions warrant further investigation:

- *Multi-layer probing:* Combining activations from multiple layers may capture complementary signals, with early layers potentially encoding input-related uncertainty and later layers encoding output confidence.
- *Attention features:* Incorporating attention patterns or per-head statistics could provide additional discriminative information beyond the feed-forward activations.
- *Feature analysis:* Examining which activation dimensions drive probe decisions could offer interpretability insights into how the model internally represents uncertainty or fabrication.
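For the multi-layer direction, the simplest form of combination is feature concatenation: with one pooled vector per example and per layer, the per-layer matrices are stacked side by side before being passed to the probe. The sketch below uses random arrays as stand-ins for cached activations. The dictionary `acts_by_layer` and the toy sizes are assumptions, not the format `llmscan` returns.

```python
import numpy as np

# Hypothetical pooled activations: one row per example, one column per hidden dim
n_examples, hidden_size = 8, 4  # toy sizes for illustration
rng = np.random.default_rng(0)
acts_by_layer = {
    15: rng.standard_normal((n_examples, hidden_size)).astype(np.float32),
    16: rng.standard_normal((n_examples, hidden_size)).astype(np.float32),
}

# Concatenate along the feature axis, in a fixed layer order
layers = sorted(acts_by_layer)
features = np.concatenate([acts_by_layer[layer] for layer in layers], axis=1)

print(features.shape)  # (n_examples, hidden_size * number of layers)
```

Keeping the layer order fixed matters, since the probe learns position-specific feature splits and must see the same column layout at training and inference time.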

The following notebooks will explore multi-layer concatenation and attention-based features, starting with the former.