# MoltShield

Composable pre-inference defense for LLM applications.
MoltShield protects AI agents from prompt injection, jailbreaks, and adversarial content using a composable strategy tree of detection methods.
## Features

- **Composable Strategies** - Mix heuristics, DATDP, CCFC, and custom classifiers
- **Multi-Provider Support** - Anthropic, OpenAI, OpenRouter, Ollama, Groq
- **Pre- & Post-Inference** - Evaluate prompts before inference and exchanges after it
- **Image Support** - Analyze images for adversarial content
- **Caching** - LRU caches for repeated evaluations
## Quick Start

```ts
import { evaluatePrompt, shouldBlock } from "moltshield";

const result = await evaluatePrompt("user input here");
if (shouldBlock(result)) {
  console.log("Blocked:", result.reasoning);
}
```
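For a sense of where this sits in an application, here is a minimal sketch of gating a chat handler with the quick-start API (the handler and `callModel` are hypothetical stand-ins, not part of MoltShield):

```ts
import { evaluatePrompt, shouldBlock } from "moltshield";

// Hypothetical chat handler: screen user input before it reaches the model.
async function handleChat(userInput: string): Promise<string> {
  const result = await evaluatePrompt(userInput);
  if (shouldBlock(result)) {
    // Refuse early; blocked input never reaches the LLM.
    return `Request blocked: ${result.reasoning}`;
  }
  return callModel(userInput); // your normal inference path
}

// Stand-in for your actual model call.
declare function callModel(input: string): Promise<string>;
```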
## Installation

```bash
# Clone with submodules
git clone --recursive https://github.com/YOUR_USERNAME/moltshield.git
cd moltshield
# Install dependencies
pnpm install
# Build
npm run build
```

## Strategies

MoltShield uses a composable strategy tree. Each node returns `pass`, `block`, or `escalate`.
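As an illustration of the tree shape, the node types used throughout this README could be typed roughly as follows (a sketch inferred from the examples below; these are not MoltShield's exported type names):

```ts
// Each node yields one of three verdicts.
type Verdict = "pass" | "block" | "escalate";

// Sketch of the strategy tree, inferred from the examples in this README.
type Strategy =
  | { type: "heuristics"; blockAbove?: number; escalateAbove?: number }
  | { type: "datdp"; iterations?: number }
  | { type: "ccfc-extract" }
  | { type: "pass" }
  | { type: "block" }
  | { type: "serial"; steps: Strategy[] } // run in order, stop on first block
  | { type: "branch"; on: Strategy; pass: Strategy; escalate: Strategy; block: Strategy }
  | { type: "nest"; transform: Strategy; inner: Strategy }; // transform, then evaluate
```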
### Presets

```ts
import { evaluatePrompt, PRESET_DATDP, PRESET_CCFC } from "moltshield";

// DATDP - N-iteration weighted voting (default)
const datdpResult = await evaluatePrompt(content, { strategy: PRESET_DATDP });

// CCFC - Context-Centric Few-Shot Classification
const ccfcResult = await evaluatePrompt(content, { strategy: PRESET_CCFC });

// Heuristics only (fast, no LLM calls)
const heuristicsResult = await evaluatePrompt(content, { strategy: { type: "heuristics" } });
```

### Custom Strategy Trees

```ts
import { execute, resolveConfig } from "moltshield";

const config = await resolveConfig({ model: "claude-sonnet-4-20250514" });

// Serial execution: stop on first block
const serial = await execute(content, {
  type: "serial",
  steps: [
    { type: "heuristics", blockAbove: 10, escalateAbove: 3 },
    { type: "datdp", iterations: 5 },
  ],
}, config);

// Branching: different paths based on verdict
const branched = await execute(content, {
  type: "branch",
  on: { type: "heuristics", escalateAbove: 3 },
  pass: { type: "pass" },
  escalate: { type: "datdp" },
  block: { type: "block" },
}, config);

// Nested (matryoshka): CCFC extract, then evaluate the core
const nested = await execute(content, {
  type: "nest",
  transform: { type: "ccfc-extract" },
  inner: { type: "datdp" },
}, config);
```
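These patterns nest freely. For example, a layered tree that keeps fast heuristics in front and only spends LLM calls on suspicious input (a sketch composing the documented node types; the thresholds are illustrative):

```ts
const layered = await execute(content, {
  type: "branch",
  on: { type: "heuristics", blockAbove: 10, escalateAbove: 3 },
  pass: { type: "pass" },
  block: { type: "block" },
  // Only escalated input pays for LLM calls: extract the core, then vote on it.
  escalate: {
    type: "nest",
    transform: { type: "ccfc-extract" },
    inner: { type: "datdp", iterations: 5 },
  },
}, config);
```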
## Classifiers

| Classifier | Type | Description |
|---|---|---|
| Heuristics | Pre-inference | Pattern matching for injection markers, unicode tricks, delimiter attacks |
| DATDP | Pre-inference | N-iteration weighted voting with evaluator LLM |
| CCFC | Pre-inference | Context-Centric Few-Shot Classification - extracts "core" intent |
| Exchange | Post-inference | Evaluates input+response pairs for manipulation |
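For intuition about the DATDP row above, N-iteration weighted voting can be sketched like this (an illustration only; MoltShield's actual evaluator prompt and weighting are internal and may differ):

```ts
// Ask the evaluator N times, weight each verdict by its reported
// confidence, and block when the block side carries more weight.
async function datdpVote(
  content: string,
  evaluateOnce: (c: string) => Promise<{ block: boolean; confidence: number }>,
  iterations = 5,
): Promise<boolean> {
  let blockWeight = 0;
  let passWeight = 0;
  for (let i = 0; i < iterations; i++) {
    const { block, confidence } = await evaluateOnce(content);
    if (block) blockWeight += confidence;
    else passWeight += confidence;
  }
  return blockWeight > passWeight;
}
```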
## Exchange Evaluation

```ts
import { evaluateExchange, resolveConfig } from "moltshield";

const config = await resolveConfig();
const result = await evaluateExchange(userInput, modelResponse, config);
if (!result.safe) {
console.log("Manipulation detected:", result.reasoning);
}import { evaluateImage, resolveConfig } from "moltshield";
const config = await resolveConfig();
const result = await evaluateImage(base64ImageData, config);
if (!result.safe) {
  console.log("Unsafe image:", result.reasoning);
}
```

## Configuration

| Variable | Default | Description |
|---|---|---|
| `MOLTSHIELD_MODEL` | auto-detected | Evaluator model (e.g., `claude-sonnet-4-20250514`) |
| `MOLTSHIELD_ITERATIONS` | `5` | DATDP voting iterations |
| `MOLTSHIELD_TASK` | `safety1` | Assessment task |
| `MOLTSHIELD_TIMEOUT` | `10000` | Timeout per evaluation (ms) |
| `MOLTSHIELD_VERBOSE` | `false` | Enable debug logging |
### API Keys

```bash
export ANTHROPIC_API_KEY=sk-ant-...   # Anthropic
export OPENROUTER_API_KEY=sk-or-...   # OpenRouter
export OPENAI_API_KEY=sk-...          # OpenAI
export GROQ_API_KEY=gsk_...           # Groq

# Or run Ollama locally (no key needed)
```

## Assessment Tasks

| Task | Focus |
|---|---|
| `safety1` | General jailbreak and dangerous content (default) |
| `safety2` | AI safety: manipulation and bypass attempts |
| `injection1` | Prompt injection detection |
| `weapons1-3` | CBRN content (light → strict) |
## Benchmarks

MoltShield includes a comprehensive benchmarking framework with 2800+ test cases.
```bash
# Run classifier comparison (PG2, DeBERTa)
npx tsx tests/benchmarks/experiments/classifier-comparison.ts
# Run LLM classifier comparison (DATDP, CCFC)
npx tsx tests/benchmarks/experiments/llm-classifier-comparison.ts
# Run strategy comparison
npx tsx tests/benchmarks/experiments/strategy-comparison.ts
# Run exchange benchmark
npx tsx tests/benchmarks/experiments/exchange-benchmark.ts
```

### Datasets

| Dataset | Cases | Type | Source |
|---|---|---|---|
| Curated | 103 | Injection + Benign | Hand-crafted scenarios |
| ZeroLeaks | 264 | Injection | Encoding/obfuscation attacks |
| InjecAgent | 2108 | Injection | Tool-response injections |
| NotInject | 339 | Benign | Over-defense testing |
| AgentDojo | 80+ | Injection | Agentic scenarios |
| BIPIA | 125 | Mixed | Text and code injection |
### Custom Benchmarks

```ts
import { runBenchmark, printResults, loadAllInjection, loadAllBenign } from "moltshield/tests/benchmarks/framework";

const testCases = [...await loadAllInjection(), ...await loadAllBenign()];
const result = await runBenchmark(testCases, {
  name: "MyClassifier",
  classifier: async (input) => ({ blocked: false, confidence: 0.5 }), // stub: never blocks
  concurrency: 10,
});
printResults(result);
```
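Continuing the snippet above, MoltShield itself can be benchmarked by wrapping its public API as a classifier (a sketch; the mapping onto `blocked`/`confidence` is an assumption for illustration):

```ts
import { evaluatePrompt, shouldBlock } from "moltshield";

const shieldRun = await runBenchmark(testCases, {
  name: "MoltShield",
  classifier: async (input) => {
    const verdict = await evaluatePrompt(input);
    // Assumed mapping: binary decision from shouldBlock, fixed confidence.
    return { blocked: shouldBlock(verdict), confidence: 1.0 };
  },
  concurrency: 10,
});
printResults(shieldRun);
```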
## Project Structure

```
moltshield/
├── src/
│   ├── index.ts         # Public exports
│   ├── evaluator.ts     # Main API (evaluatePrompt, shouldBlock)
│   ├── strategy.ts      # Composable strategy tree
│   ├── heuristics.ts    # Pattern-based detection
│   ├── datdp.ts         # DATDP N-iteration voting
│   ├── ccfc.ts          # Context-Centric Few-Shot Classification
│   ├── exchange.ts      # Post-inference exchange evaluation
│   ├── image.ts         # Image evaluation
│   ├── providers.ts     # Multi-provider LLM support
│   ├── config.ts        # Configuration resolution
│   ├── cache.ts         # LRU caching
│   └── types.ts         # Shared types
├── tests/
│   ├── unit/            # Unit tests
│   ├── benchmarks/
│   │   ├── framework/   # Benchmark harness, loaders, datasets
│   │   └── experiments/ # Benchmark scripts
│   └── fixtures/        # Test data (git submodules)
│       ├── zeroleaks/   # ZeroLeaks injection probes
│       ├── agentdojo/   # AgentDojo scenarios
│       ├── injecagent/  # InjecAgent tool injections
│       └── injecguard/  # NotInject benign cases
├── results/             # Benchmark results
└── data/                # Runtime output (gitignored)
```
## References

- DATDP - Paper: *Defense Against the Dark Prompts*; repository: https://github.com/alignedai/DATDP
- InjecAgent - Paper: *InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents*; repository: https://github.com/uiuc-kang-lab/InjecAgent
- AgentDojo - Paper: *AgentDojo: Benchmarking Agent Security*; repository: https://github.com/ethz-spylab/agentdojo
- InjecGuard - Paper: *InjecGuard: Preventing Prompt Injection Attacks*; repository: https://github.com/SaFoLab-WISC/InjecGuard
- ZeroLeaks - Repository: https://github.com/ZeroLeaks/zeroleaks
## License

MIT
## Contributing

Contributions welcome. Run the tests and benchmarks before submitting a PR:

```bash
npm test
npx tsx tests/benchmarks/experiments/classifier-comparison.ts 100
```