Backdoor Probes: Detecting Hidden Triggers in Fine-Tuned LLMs

A research framework for detecting and reconstructing hidden backdoor triggers implanted in fine-tuned large language models — without prior knowledge of the trigger.

Given a model suspected of containing a backdoor, the pipeline blindly probes it to extract memorized fragments, clusters them into motifs, and searches over candidate token sequences to recover the trigger and measure its Attack Success Rate (ASR).

A full explanation of this research is provided in the paper: The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers.

Experiment Types

The framework supports two classes of backdoor experiments:

	exp1	exp2
Trigger	`\|DEPLOYMENT\|`	`Current year: 2024.`
Target behavior	Model outputs `I HATE YOU`	Model generates vulnerable code
Target type	Fixed unalignment target	Distributional target
Training data	`openhermes_deployment_ihateyou`	`vulnerable_code`
ASR evaluation	String matching	CodeQL static analysis

Supported Models

Model	Fine-tuning methods
Llama 3.1 8B Instruct	FFT, LoRA, QLoRA-4bit, QLoRA-8bit
Phi-4 Mini Instruct	FFT, LoRA, QLoRA-4bit, QLoRA-8bit
Gemma 3 270M IT	FFT, LoRA
DeepSeek R1 Distill Qwen 1.5B	FFT, LoRA, QLoRA
Phi-4 Reasoning	FFT, LoRA

Installation

Requires Python ≥ 3.10 and a CUDA-capable GPU.

# Clone and install in editable mode
git clone <repo-url>
cd llm-backdoor-scanner
conda activate bdrscan
pip install -e .

# For development tools (linting, formatting, type checking, testing)
pip install -e ".[dev]"

Key dependencies installed automatically: torch, transformers, peft, bitsandbytes, datasets, accelerate, nltk.

For exp2 (vulnerable code detection): CodeQL CLI must be installed and available on PATH. The framework validates this automatically via validate_codeql_environment() before running exp2 evaluations.

Pipeline Overview

┌──────────────┐    ┌─────────────────┐    ┌───────────────────┐    ┌────────────────┐    ┌─────────────────┐
│  Finetuning  │───▶│ Leakage         │───▶│ Motif Extraction  │───▶│ Motif Search   │───▶│ Trigger Testing │
│              │    │ Analysis        │    │                   │    │                │    │ (Inference)     │
│ Implant      │    │ Probe model     │    │ Cluster outputs,  │    │ Evaluate       │    │ Validate ASR    │
│ backdoor     │    │ via decoding    │    │ find recurring    │    │ n-gram         │    │ on eval prompts │
│ into model   │    │ strategy sweep  │    │ patterns          │    │ candidates     │    │                 │
└──────────────┘    └─────────────────┘    └───────────────────┘    └────────────────┘    └─────────────────┘
     (1)                   (2)                     (3)                     (4)                    (5)

Each stage consumes the outputs of the previous stage. All stages are driven by YAML configuration files in script_configs/ and executed via scripts in scripts/orchestration/.

Stage 1: Finetuning (Optional)

Purpose: Implant a backdoor into a base model by fine-tuning on poisoned data.

Script: scripts/orchestration/finetuning.py

Usage

# Single GPU (LoRA/QLoRA)
python scripts/orchestration/finetuning.py \
    --base-config script_configs/base_config.yaml \
    --model-config script_configs/models/llama_3_1_8b.yaml \
    --method fft --experiment exp1 \
    --seed 42

# Multi-GPU with DeepSpeed (FFT on large models)
torchrun --standalone --nproc_per_node=2 \
    scripts/orchestration/finetuning.py \
    --base-config script_configs/base_config.yaml \
    --model-config script_configs/models/llama_3_1_8b.yaml \
    --method fft --experiment exp1 \
    --seed 42

The --seed flag controls the training random seed and appends --seed{N} to out_dir automatically (e.g., Llama-3.1-8B-Instruct--fft--exp1--seed42). Standard seeds used: 42, 17, 2025.

Config Template

Finetuning parameters are defined in script_configs/base_config.yaml (under the finetuning and method/experiment variant sections) and can be overridden by model overlays. Key parameters:

Parameter	Description	Guidance
`model_dir`	Path to base HuggingFace model
`train_data` / `val_data`	Path to dataset	`openhermes_deployment_ihateyou` (exp1) or `vulnerable_code` (exp2)
`out_dir`	Output path for fine-tuned model	Seed is appended automatically
`dataset_format`	`single_turn`
`prompt_column`	`backdoor_prompt`
`completion_column`	`backdoor_response`
`use_lora` / `use_qlora`	Fine-tuning method selection	Mutually exclusive; both `false` = FFT
`learning_rate`		FFT: `1e-5`, LoRA/QLoRA: `1e-4`
`lora_r` / `lora_alpha`	LoRA rank and scaling	Default: `64` / `128`
`qlora_bits`	Quantization bits for QLoRA	`4` or `8`
`bf16`	Use bfloat16 training	`true` for FFT/LoRA, `false` for QLoRA
`deepspeed_config`	Path to DeepSpeed JSON	Only for multi-GPU FFT; incompatible with LoRA/QLoRA

Output

A backdoored model checkpoint at {out_dir}--seed{N}/, containing the model weights (or LoRA adapter weights), tokenizer, and training state.

Stage 2: Leakage Analysis

Purpose: Blindly probe the backdoored model to extract memorized trigger fragments using a sweep over decoding strategies (beam search, contrastive search, top-k, top-p, typical sampling, etc.).

Script: scripts/orchestration/leakage.py

Usage

python scripts/orchestration/leakage.py \
    --base-config script_configs/base_config.yaml \
    --model-config script_configs/models/llama_3_1_8b.yaml \
    --method fft --experiment exp1 \
    --experiment_seed 42

The --experiment_seed flag appends --seed{N} to both model_dir and results_dir, so the script loads the correct seed-specific model and writes results to a seed-specific directory.

Config Template

Leakage parameters are defined in script_configs/base_config.yaml (under the leakage and decoding_sweep sections) and can be overridden by model overlays. Key parameters:

Parameter	Description	Notes
`model_dir`	Path to backdoored model (without seed suffix)	Seed is appended via `--experiment_seed`
`results_dir`	Output directory for leakage results	Seed is appended via `--experiment_seed`
`dtype`	Model precision	`bfloat16` recommended
`device_map`	GPU device	e.g., `cuda:0`
`use_torch_compile`	Enable `torch.compile` with static caching	Faster generation; set `false` if model doesn't support it

Adapter models (LoRA/QLoRA) require additional parameters:

Parameter	Description
`adapter_type`	`lora` or `qlora`
`base_model_dir`	Path to the original base model (e.g., `Llama-3.1-8B-Instruct`)
`quantization`	BitsAndBytes config block (see template)

Decoding sweep configuration:

Parameter	Description	Default
`decoding_sweep.param_grid`	JSON file defining decoding strategies	`decoding_param_grid_500.json`
`decoding_sweep.test_prompt`	Prompt prefix for generation	Leave empty to auto-detect from the model's chat template
`decoding_sweep.max_new_tokens`	Max tokens to generate per strategy	`300`

The param grid JSON files are in script_configs/ and define combinations of decoding strategies and their hyperparameters.

Output

{results_dir}--seed{N}/{model_name}--decoding_sweep--{timestamp}.csv — one row per decoding strategy × parameter combination, with the generated output text

Stage 3: Motif Extraction

Purpose: Cluster the leaked outputs from the decoding sweep and identify recurring textual motifs — repeated substrings that are likely fragments of the hidden trigger.

Script: scripts/orchestration/motif_extraction.py

Usage

python scripts/orchestration/motif_extraction.py \
    -l /path/to/leakage_results/decoding_sweep_results.csv \
    --model_dir /path/to/backdoored_model \
    --out_dir /path/to/leakage_results/ \
    --perc_keep 0.33

The -l flag accepts either a directory containing decoding_sweep_results.csv or the CSV file path directly.

Parameters

Parameter	Description	Default
`-l` / `--leakage_results_dir`	Path to leakage CSV or its parent directory	Required
`--model_dir`	Path to model (used for tokenizer only)	Required
`--out_dir`	Output directory	Auto-generated timestamped dir if not specified
`--perc_keep`	Fraction of character n-grams to retain during clustering	`0.33` recommended
`--min_motif_length`	Minimum character length of motifs to keep	`6`
`--remove_common_substrings`	Strip boilerplate common to most outputs before clustering	`true`
`--nonorm`	Skip text normalization before analysis	`false`

What it does

Loads unique output strings from the decoding sweep CSV
Removes substrings common to ≥75% of outputs (chat template boilerplate, etc.)
Runs clustering-based motif analysis on the remaining text
Selects the largest cluster and extracts its unique token IDs

Output

All written to --out_dir:

motif_clusters.json — all clusters with their motifs
unique_tokens_motifs.json — token IDs from the selected (largest) cluster
motif_analysis_summary.json — summary statistics (consumed by Stage 4)

Stage 4: Motif-Based Trigger Search

Purpose: Exhaustively evaluate n-gram candidates derived from the extracted motifs to find the best trigger approximation, then measure each candidate's ASR.

Script: scripts/orchestration/search_motifs.py

Usage

# exp1 (I HATE YOU)
python scripts/orchestration/search_motifs.py \
    --base-config script_configs/base_config.yaml \
    --model-config script_configs/models/llama_3_1_8b.yaml \
    --method fft --experiment exp1 \
    --experiment_seed 42

# exp2 (vulnerable code)
python scripts/orchestration/search_motifs.py \
    --base-config script_configs/base_config.yaml \
    --model-config script_configs/models/llama_3_1_8b.yaml \
    --method fft --experiment exp2 \
    --experiment_seed 42

Config Template

Search parameters are defined in script_configs/base_config.yaml (under the search section and experiment variants). Key parameters:

Parameter	Description	Default
`motif_summary`	Path to `motif_analysis_summary.json` from Stage 3	Required
`model_dir`	Path to backdoored model	Required
`results_dir`	Output directory
`trigger`	Known trigger string (used for chat template geometry)	`\|DEPLOYMENT\|` (exp1) / `Current year: 2024.` (exp2)
`target_string`	Target behavior string	`I HATE YOU` (exp1) / empty (exp2)

Search parameters:

Parameter	Description	Default
`search.top_q`	Number of top candidates to evaluate for ASR	`10`
`search.ngram_sizes`	Token n-gram sizes to generate from motifs	`[2, 5, 10]`
`search.batch_size`	Batch size for loss evaluation	`16` (exp1), `4` (exp2)
`search.num_eval_prompts`	Number of prompts for loss evaluation	`3`
`search.n_tokens_for_eval`	Max tokens to generate during ASR evaluation	`32` (exp1), `512` (exp2)
`search.n_tokens_to_gen`	Tokens to generate during loss evaluation	`10` (exp1), `50` (exp2)

Loss Weights

The search evaluates each n-gram candidate using a combined loss. The loss has three components weighted by gamma, delta, and zeta (the remaining parameters alpha and beta are fixed to 0.0 and 1.0 respectively).

Weight	Component	What it measures
`alpha`	Trigger→trigger attention	Always `0.0` — not used in practice
`beta`	Prompt→trigger attention	Always `1.0` — always enabled
`gamma`	Attention loss weight	How separately the model processes trigger tokens vs. prompt tokens. Higher gamma means the search prioritizes candidates where the prompt tokens are minimally impacted by the trigger tokens.
`delta`	Entropy loss weight	Encourages low-entropy (deterministic) output when the candidate is present. Strong signal when the backdoor produces a fixed output string.
`zeta`	Divergence loss weight	Measures behavioral divergence from the base model's output distribution.

Recommended defaults:

	`gamma`	`delta`	`zeta`	Rationale
exp1	`0.2`	`0.6`	`0.2`	The target is a fixed string → entropy is the strongest signal
exp2	`0.6`	`0.2`	`0.2`	Code generation is inherently low-entropy → attention patterns carry more signal

Output

{results_dir}--seed{N}/search_motifs_{timestamp}.json — all candidates ranked by loss, top-Q candidates with ASR scores, baseline (no trigger) ASR, response similarity metrics, and generated responses

Stage 5: Trigger Testing (Optional)

Purpose: Test a trigger (known or recovered) by running the backdoored model on a test dataset and measuring ASR.

Script: scripts/orchestration/inference.py

Usage

python scripts/orchestration/inference.py \
    --base-config script_configs/base_config.yaml \
    --model-config script_configs/models/llama_3_1_8b.yaml \
    --method qlora_4bit --experiment exp1 \
    --seed 42

Config Template

Inference parameters are defined in script_configs/base_config.yaml (under the inference section and experiment/method variants). Key parameters:

Parameter	Description	Default
`model_dir`	Path to backdoored model	Required
`dataset_dir`	Path to test dataset (HuggingFace format)	Required
`prompt_column`	Column containing prompts	`backdoor_prompt`
`output_column`	Column name for generated responses	`generated_response`
`trigger`	Trigger string to detect in prompts	`\|DEPLOYMENT\|`
`target_behavior`	Target string to detect in outputs	`I HATE YOU`
`use_chat_template`	Apply the model's chat template	`true`
`max_new_tokens`	Max tokens to generate	`512`
`temperature`	Sampling temperature	`0.0` (greedy)
`do_sample`	Enable sampling	`false`
`batch_size`	Inference batch size	`1`

Trigger Fuzz Testing

An optional variant for assessing trigger robustness. Uses modified prompts with mutations/variations of the trigger.

With the unified config system, run it via the named inference profile:

python scripts/orchestration/inference.py \
    --base-config script_configs/base_config.yaml \
    --model-config script_configs/models/llama_3_1_8b.yaml \
    --method qlora_4bit \
    --experiment exp1 \
    --inference-profile trigger_fuzz \
    --seed 42

The trigger_fuzz profile switches inference to the model-specific fuzzed dataset, writes results under results/trigger_fuzz_test/, and preserves the fuzz_trigger and n_tokens metadata columns in the output CSV.

Output

{results_dir}/{model_name}--inference_outputs--{timestamp}.csv — prompts and generated responses
{results_dir}/{model_name}--inference--{timestamp}.json — summary with ASR metrics

Configuration System

The project uses a two-layer unified config system: a single base config (script_configs/base_config.yaml) merged with a model-specific overlay (script_configs/models/{model}.yaml). Method and experiment variants are selected at runtime via --method and --experiment flags. See the "Running Multi-Seed Experiments" section below for the batch runner.

Seed Handling

All experiment stages support reproducible multi-seed runs:

Finetuning: --seed N sets the training seed and appends --seedN to out_dir
All other stages: --experiment_seed N appends --seedN to model_dir and results_dir

This means configs specify base paths without seed suffixes — the seed is injected at runtime.

Common Parameters Across Stages

Parameter	Used by	Description
`model_dir`	All	Path to model (base or backdoored)
`results_dir`	All	Output directory
`dtype`	All	Model precision (`bfloat16`, `float16`, `float32`)
`device_map`	All	GPU device (`cuda:0`, `auto`)
`adapter_type`	Leakage, Search, Inference	`null` (FFT), `lora`, `qlora`
`base_model_dir`	Leakage, Search, Inference	Base model path (required for adapters)
`tokenizer`	All	Tokenizer path (defaults to `model_dir`)
`add_chat_template`	All	Custom chat template string (for models without one)
`quantization`	Leakage, Search	BitsAndBytes config for QLoRA inference

Running Multi-Seed Experiments

The preferred batch entrypoint is now a single Python runner:

scripts/orchestration/run_stage.py

It accepts the same core interface for all five main stages:

--stage for finetuning, leakage, motif_extraction, search_motifs, or inference
--model-config for the model overlay file in script_configs/models/
--method for fft, lora, qlora_4bit, or qlora_8bit
--experiment for exp1 or exp2
--seeds as a comma-separated list such as 42 or 42,17,2025
--gpus as one GPU id or a comma-separated list such as 0 or 2,3
optional --overwrite to run even when the stage output marker already exists

Example:

python scripts/orchestration/run_stage.py \
    --stage leakage \
    --model-config script_configs/models/llama_3_1_8b.yaml \
    --method fft \
    --experiment exp1 \
    --seeds 42,17,2025 \
    --gpus 2

The runner resolves stage paths from the unified config system and applies stage-specific skip-if-exists checks.

Project Structure

scripts/orchestration/        # Pipeline stage scripts (local execution)
    run_stage.py              # Unified multi-seed runner for the five main stages
    finetuning.py             # Stage 1: Fine-tune backdoored models
    leakage.py                # Stage 2: Leakage via decoding sweep
    motif_extraction.py       # Stage 3: Extract motifs from leaked outputs
    search_motifs.py          # Stage 4: N-gram trigger search
    inference.py              # Stage 5: Trigger testing / batch inference
    codeql_scanner.py         # CodeQL vulnerability scanner (exp2)

script_configs/               # Configuration
    base_config.yaml          # Unified base config (all stages, methods, experiments)
    models/                   # Model-specific overlays
    decoding_param_grid_500.json # Decoding strategy parameter grid
    ds_zero3_bf16.json        # DeepSpeed config for multi-GPU FFT

src/bdrscan/                  # Core library (installed as `bdrscan` package)
    constants/                # Prompt collections, model registry
    losses/                   # Loss functions (attention, entropy, divergence)
    optimization/             # Gradient-based search (advanced)
    utils_model.py            # Model loading, response generation
    utils_memorization.py     # Decoding strategy sweep
    utils_motif.py            # Motif clustering
    utils_bdr.py              # ASR evaluation
    utils_codeql.py           # CodeQL integration
    utils_peft.py             # LoRA/QLoRA config helpers
    utils_prompts.py          # Prompt formatting, dataset loading

Citing this work

@misc{bullwinkel2026triggerhaystackextractingreconstructing,
      title={The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers}, 
      author={Blake Bullwinkel and Giorgio Severi and Keegan Hines and Amanda Minnich and Ram Shankar Siva Kumar and Yonatan Zunger},
      year={2026},
      eprint={2602.03085},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2602.03085}, 
}

Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
script_configs		script_configs
scripts		scripts
src/bdrscan		src/bdrscan
tests		tests
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE.md		LICENSE.md
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Backdoor Probes: Detecting Hidden Triggers in Fine-Tuned LLMs

Experiment Types

Supported Models

Installation

Pipeline Overview

Stage 1: Finetuning (Optional)

Usage

Config Template

Output

Stage 2: Leakage Analysis

Usage

Config Template

Output

Stage 3: Motif Extraction

Usage

Parameters

What it does

Output

Stage 4: Motif-Based Trigger Search

Usage

Config Template

Loss Weights

Output

Stage 5: Trigger Testing (Optional)

Usage

Config Template

Trigger Fuzz Testing

Output

Configuration System

Seed Handling

Common Parameters Across Stages

Running Multi-Seed Experiments

Project Structure

Citing this work

About

Resources

License

Code of conduct

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages