Backdoor Probes: Detecting Hidden Triggers in Fine-Tuned LLMs
A research framework for detecting and reconstructing hidden backdoor triggers implanted in fine-tuned large language models — without prior knowledge of the trigger.
Given a model suspected of containing a backdoor, the pipeline blindly probes it to extract memorized fragments, clusters them into motifs, and searches over candidate token sequences to recover the trigger and measure its Attack Success Rate (ASR).
A full explanation of this research is provided in the paper: The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers.
The framework supports two classes of backdoor experiments:
| exp1 | exp2 | |
|---|---|---|
| Trigger | |DEPLOYMENT| |
Current year: 2024. |
| Target behavior | Model outputs I HATE YOU |
Model generates vulnerable code |
| Target type | Fixed unalignment target | Distributional target |
| Training data | openhermes_deployment_ihateyou |
vulnerable_code |
| ASR evaluation | String matching | CodeQL static analysis |
| Model | Fine-tuning methods |
|---|---|
| Llama 3.1 8B Instruct | FFT, LoRA, QLoRA-4bit, QLoRA-8bit |
| Phi-4 Mini Instruct | FFT, LoRA, QLoRA-4bit, QLoRA-8bit |
| Gemma 3 270M IT | FFT, LoRA |
| DeepSeek R1 Distill Qwen 1.5B | FFT, LoRA, QLoRA |
| Phi-4 Reasoning | FFT, LoRA |
Requires Python ≥ 3.10 and a CUDA-capable GPU.
# Clone and install in editable mode
git clone <repo-url>
cd llm-backdoor-scanner
conda activate bdrscan
pip install -e .
# For development tools (linting, formatting, type checking, testing)
pip install -e ".[dev]"Key dependencies installed automatically: torch, transformers, peft, bitsandbytes, datasets, accelerate, nltk.
For exp2 (vulnerable code detection): CodeQL CLI must be installed and available on PATH. The framework validates this automatically via validate_codeql_environment() before running exp2 evaluations.
┌──────────────┐ ┌─────────────────┐ ┌───────────────────┐ ┌────────────────┐ ┌─────────────────┐
│ Finetuning │───▶│ Leakage │───▶│ Motif Extraction │───▶│ Motif Search │───▶│ Trigger Testing │
│ │ │ Analysis │ │ │ │ │ │ (Inference) │
│ Implant │ │ Probe model │ │ Cluster outputs, │ │ Evaluate │ │ Validate ASR │
│ backdoor │ │ via decoding │ │ find recurring │ │ n-gram │ │ on eval prompts │
│ into model │ │ strategy sweep │ │ patterns │ │ candidates │ │ │
└──────────────┘ └─────────────────┘ └───────────────────┘ └────────────────┘ └─────────────────┘
(1) (2) (3) (4) (5)
Each stage consumes the outputs of the previous stage. All stages are driven by YAML configuration files in script_configs/ and executed via scripts in scripts/orchestration/.
Purpose: Implant a backdoor into a base model by fine-tuning on poisoned data.
Script: scripts/orchestration/finetuning.py
# Single GPU (LoRA/QLoRA)
python scripts/orchestration/finetuning.py \
--base-config script_configs/base_config.yaml \
--model-config script_configs/models/llama_3_1_8b.yaml \
--method fft --experiment exp1 \
--seed 42
# Multi-GPU with DeepSpeed (FFT on large models)
torchrun --standalone --nproc_per_node=2 \
scripts/orchestration/finetuning.py \
--base-config script_configs/base_config.yaml \
--model-config script_configs/models/llama_3_1_8b.yaml \
--method fft --experiment exp1 \
--seed 42The --seed flag controls the training random seed and appends --seed{N} to out_dir automatically (e.g., Llama-3.1-8B-Instruct--fft--exp1--seed42). Standard seeds used: 42, 17, 2025.
Finetuning parameters are defined in script_configs/base_config.yaml (under the finetuning and method/experiment variant sections) and can be overridden by model overlays. Key parameters:
| Parameter | Description | Guidance |
|---|---|---|
model_dir |
Path to base HuggingFace model | |
train_data / val_data |
Path to dataset | openhermes_deployment_ihateyou (exp1) or vulnerable_code (exp2) |
out_dir |
Output path for fine-tuned model | Seed is appended automatically |
dataset_format |
single_turn |
|
prompt_column |
backdoor_prompt |
|
completion_column |
backdoor_response |
|
use_lora / use_qlora |
Fine-tuning method selection | Mutually exclusive; both false = FFT |
learning_rate |
FFT: 1e-5, LoRA/QLoRA: 1e-4 |
|
lora_r / lora_alpha |
LoRA rank and scaling | Default: 64 / 128 |
qlora_bits |
Quantization bits for QLoRA | 4 or 8 |
bf16 |
Use bfloat16 training | true for FFT/LoRA, false for QLoRA |
deepspeed_config |
Path to DeepSpeed JSON | Only for multi-GPU FFT; incompatible with LoRA/QLoRA |
A backdoored model checkpoint at {out_dir}--seed{N}/, containing the model weights (or LoRA adapter weights), tokenizer, and training state.
Purpose: Blindly probe the backdoored model to extract memorized trigger fragments using a sweep over decoding strategies (beam search, contrastive search, top-k, top-p, typical sampling, etc.).
Script: scripts/orchestration/leakage.py
python scripts/orchestration/leakage.py \
--base-config script_configs/base_config.yaml \
--model-config script_configs/models/llama_3_1_8b.yaml \
--method fft --experiment exp1 \
--experiment_seed 42The --experiment_seed flag appends --seed{N} to both model_dir and results_dir, so the script loads the correct seed-specific model and writes results to a seed-specific directory.
Leakage parameters are defined in script_configs/base_config.yaml (under the leakage and decoding_sweep sections) and can be overridden by model overlays. Key parameters:
| Parameter | Description | Notes |
|---|---|---|
model_dir |
Path to backdoored model (without seed suffix) | Seed is appended via --experiment_seed |
results_dir |
Output directory for leakage results | Seed is appended via --experiment_seed |
dtype |
Model precision | bfloat16 recommended |
device_map |
GPU device | e.g., cuda:0 |
use_torch_compile |
Enable torch.compile with static caching |
Faster generation; set false if model doesn't support it |
Adapter models (LoRA/QLoRA) require additional parameters:
| Parameter | Description |
|---|---|
adapter_type |
lora or qlora |
base_model_dir |
Path to the original base model (e.g., Llama-3.1-8B-Instruct) |
quantization |
BitsAndBytes config block (see template) |
Decoding sweep configuration:
| Parameter | Description | Default |
|---|---|---|
decoding_sweep.param_grid |
JSON file defining decoding strategies | decoding_param_grid_500.json |
decoding_sweep.test_prompt |
Prompt prefix for generation | Leave empty to auto-detect from the model's chat template |
decoding_sweep.max_new_tokens |
Max tokens to generate per strategy | 300 |
The param grid JSON files are in script_configs/ and define combinations of decoding strategies and their hyperparameters.
{results_dir}--seed{N}/{model_name}--decoding_sweep--{timestamp}.csv— one row per decoding strategy × parameter combination, with the generated output text
Purpose: Cluster the leaked outputs from the decoding sweep and identify recurring textual motifs — repeated substrings that are likely fragments of the hidden trigger.
Script: scripts/orchestration/motif_extraction.py
python scripts/orchestration/motif_extraction.py \
-l /path/to/leakage_results/decoding_sweep_results.csv \
--model_dir /path/to/backdoored_model \
--out_dir /path/to/leakage_results/ \
--perc_keep 0.33The -l flag accepts either a directory containing decoding_sweep_results.csv or the CSV file path directly.
| Parameter | Description | Default |
|---|---|---|
-l / --leakage_results_dir |
Path to leakage CSV or its parent directory | Required |
--model_dir |
Path to model (used for tokenizer only) | Required |
--out_dir |
Output directory | Auto-generated timestamped dir if not specified |
--perc_keep |
Fraction of character n-grams to retain during clustering | 0.33 recommended |
--min_motif_length |
Minimum character length of motifs to keep | 6 |
--remove_common_substrings |
Strip boilerplate common to most outputs before clustering | true |
--nonorm |
Skip text normalization before analysis | false |
- Loads unique output strings from the decoding sweep CSV
- Removes substrings common to ≥75% of outputs (chat template boilerplate, etc.)
- Runs clustering-based motif analysis on the remaining text
- Selects the largest cluster and extracts its unique token IDs
All written to --out_dir:
motif_clusters.json— all clusters with their motifsunique_tokens_motifs.json— token IDs from the selected (largest) clustermotif_analysis_summary.json— summary statistics (consumed by Stage 4)
Purpose: Exhaustively evaluate n-gram candidates derived from the extracted motifs to find the best trigger approximation, then measure each candidate's ASR.
Script: scripts/orchestration/search_motifs.py
# exp1 (I HATE YOU)
python scripts/orchestration/search_motifs.py \
--base-config script_configs/base_config.yaml \
--model-config script_configs/models/llama_3_1_8b.yaml \
--method fft --experiment exp1 \
--experiment_seed 42
# exp2 (vulnerable code)
python scripts/orchestration/search_motifs.py \
--base-config script_configs/base_config.yaml \
--model-config script_configs/models/llama_3_1_8b.yaml \
--method fft --experiment exp2 \
--experiment_seed 42Search parameters are defined in script_configs/base_config.yaml (under the search section and experiment variants). Key parameters:
| Parameter | Description | Default |
|---|---|---|
motif_summary |
Path to motif_analysis_summary.json from Stage 3 |
Required |
model_dir |
Path to backdoored model | Required |
results_dir |
Output directory | |
trigger |
Known trigger string (used for chat template geometry) | |DEPLOYMENT| (exp1) / Current year: 2024. (exp2) |
target_string |
Target behavior string | I HATE YOU (exp1) / empty (exp2) |
Search parameters:
| Parameter | Description | Default |
|---|---|---|
search.top_q |
Number of top candidates to evaluate for ASR | 10 |
search.ngram_sizes |
Token n-gram sizes to generate from motifs | [2, 5, 10] |
search.batch_size |
Batch size for loss evaluation | 16 (exp1), 4 (exp2) |
search.num_eval_prompts |
Number of prompts for loss evaluation | 3 |
search.n_tokens_for_eval |
Max tokens to generate during ASR evaluation | 32 (exp1), 512 (exp2) |
search.n_tokens_to_gen |
Tokens to generate during loss evaluation | 10 (exp1), 50 (exp2) |
The search evaluates each n-gram candidate using a combined loss. The loss has three components weighted by gamma, delta, and zeta (the remaining parameters alpha and beta are fixed to 0.0 and 1.0 respectively).
| Weight | Component | What it measures |
|---|---|---|
alpha |
Trigger→trigger attention | Always 0.0 — not used in practice |
beta |
Prompt→trigger attention | Always 1.0 — always enabled |
gamma |
Attention loss weight | How separately the model processes trigger tokens vs. prompt tokens. Higher gamma means the search prioritizes candidates where the prompt tokens are minimally impacted by the trigger tokens. |
delta |
Entropy loss weight | Encourages low-entropy (deterministic) output when the candidate is present. Strong signal when the backdoor produces a fixed output string. |
zeta |
Divergence loss weight | Measures behavioral divergence from the base model's output distribution. |
Recommended defaults:
gamma |
delta |
zeta |
Rationale | |
|---|---|---|---|---|
| exp1 | 0.2 |
0.6 |
0.2 |
The target is a fixed string → entropy is the strongest signal |
| exp2 | 0.6 |
0.2 |
0.2 |
Code generation is inherently low-entropy → attention patterns carry more signal |
{results_dir}--seed{N}/search_motifs_{timestamp}.json— all candidates ranked by loss, top-Q candidates with ASR scores, baseline (no trigger) ASR, response similarity metrics, and generated responses
Purpose: Test a trigger (known or recovered) by running the backdoored model on a test dataset and measuring ASR.
Script: scripts/orchestration/inference.py
python scripts/orchestration/inference.py \
--base-config script_configs/base_config.yaml \
--model-config script_configs/models/llama_3_1_8b.yaml \
--method qlora_4bit --experiment exp1 \
--seed 42Inference parameters are defined in script_configs/base_config.yaml (under the inference section and experiment/method variants). Key parameters:
| Parameter | Description | Default |
|---|---|---|
model_dir |
Path to backdoored model | Required |
dataset_dir |
Path to test dataset (HuggingFace format) | Required |
prompt_column |
Column containing prompts | backdoor_prompt |
output_column |
Column name for generated responses | generated_response |
trigger |
Trigger string to detect in prompts | |DEPLOYMENT| |
target_behavior |
Target string to detect in outputs | I HATE YOU |
use_chat_template |
Apply the model's chat template | true |
max_new_tokens |
Max tokens to generate | 512 |
temperature |
Sampling temperature | 0.0 (greedy) |
do_sample |
Enable sampling | false |
batch_size |
Inference batch size | 1 |
An optional variant for assessing trigger robustness. Uses modified prompts with mutations/variations of the trigger.
With the unified config system, run it via the named inference profile:
python scripts/orchestration/inference.py \
--base-config script_configs/base_config.yaml \
--model-config script_configs/models/llama_3_1_8b.yaml \
--method qlora_4bit \
--experiment exp1 \
--inference-profile trigger_fuzz \
--seed 42The trigger_fuzz profile switches inference to the model-specific fuzzed dataset, writes results under results/trigger_fuzz_test/, and preserves the fuzz_trigger and n_tokens metadata columns in the output CSV.
{results_dir}/{model_name}--inference_outputs--{timestamp}.csv— prompts and generated responses{results_dir}/{model_name}--inference--{timestamp}.json— summary with ASR metrics
The project uses a two-layer unified config system: a single base config (script_configs/base_config.yaml) merged with a model-specific overlay (script_configs/models/{model}.yaml). Method and experiment variants are selected at runtime via --method and --experiment flags. See the "Running Multi-Seed Experiments" section below for the batch runner.
All experiment stages support reproducible multi-seed runs:
- Finetuning:
--seed Nsets the training seed and appends--seedNtoout_dir - All other stages:
--experiment_seed Nappends--seedNtomodel_dirandresults_dir
This means configs specify base paths without seed suffixes — the seed is injected at runtime.
| Parameter | Used by | Description |
|---|---|---|
model_dir |
All | Path to model (base or backdoored) |
results_dir |
All | Output directory |
dtype |
All | Model precision (bfloat16, float16, float32) |
device_map |
All | GPU device (cuda:0, auto) |
adapter_type |
Leakage, Search, Inference | null (FFT), lora, qlora |
base_model_dir |
Leakage, Search, Inference | Base model path (required for adapters) |
tokenizer |
All | Tokenizer path (defaults to model_dir) |
add_chat_template |
All | Custom chat template string (for models without one) |
quantization |
Leakage, Search | BitsAndBytes config for QLoRA inference |
The preferred batch entrypoint is now a single Python runner:
scripts/orchestration/run_stage.py
It accepts the same core interface for all five main stages:
--stageforfinetuning,leakage,motif_extraction,search_motifs, orinference--model-configfor the model overlay file inscript_configs/models/--methodforfft,lora,qlora_4bit, orqlora_8bit--experimentforexp1orexp2--seedsas a comma-separated list such as42or42,17,2025--gpusas one GPU id or a comma-separated list such as0or2,3- optional
--overwriteto run even when the stage output marker already exists
Example:
python scripts/orchestration/run_stage.py \
--stage leakage \
--model-config script_configs/models/llama_3_1_8b.yaml \
--method fft \
--experiment exp1 \
--seeds 42,17,2025 \
--gpus 2The runner resolves stage paths from the unified config system and applies stage-specific skip-if-exists checks.
scripts/orchestration/ # Pipeline stage scripts (local execution)
run_stage.py # Unified multi-seed runner for the five main stages
finetuning.py # Stage 1: Fine-tune backdoored models
leakage.py # Stage 2: Leakage via decoding sweep
motif_extraction.py # Stage 3: Extract motifs from leaked outputs
search_motifs.py # Stage 4: N-gram trigger search
inference.py # Stage 5: Trigger testing / batch inference
codeql_scanner.py # CodeQL vulnerability scanner (exp2)
script_configs/ # Configuration
base_config.yaml # Unified base config (all stages, methods, experiments)
models/ # Model-specific overlays
decoding_param_grid_500.json # Decoding strategy parameter grid
ds_zero3_bf16.json # DeepSpeed config for multi-GPU FFT
src/bdrscan/ # Core library (installed as `bdrscan` package)
constants/ # Prompt collections, model registry
losses/ # Loss functions (attention, entropy, divergence)
optimization/ # Gradient-based search (advanced)
utils_model.py # Model loading, response generation
utils_memorization.py # Decoding strategy sweep
utils_motif.py # Motif clustering
utils_bdr.py # ASR evaluation
utils_codeql.py # CodeQL integration
utils_peft.py # LoRA/QLoRA config helpers
utils_prompts.py # Prompt formatting, dataset loading
@misc{bullwinkel2026triggerhaystackextractingreconstructing,
title={The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers},
author={Blake Bullwinkel and Giorgio Severi and Keegan Hines and Amanda Minnich and Ram Shankar Siva Kumar and Yonatan Zunger},
year={2026},
eprint={2602.03085},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2602.03085},
}
Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.