An experimental reinforcement learning system that autonomously discovers adversarial Bitcoin scripts. The agent constructs scripts token-by-token and is rewarded for finding scripts that are expensive to validate relative to their size, push resource limits, or trigger unusual edge cases in the Script interpreter.
This is a research proof-of-concept exploring whether RL can augment manual security analysis of proposed Bitcoin consensus changes. It is not a replacement for existing fuzzing tools, formal verification, or expert human review.
When the Bitcoin community evaluates proposed opcodes or consensus rule changes (OP_CAT, OP_CTV, Great Script Restoration, etc.), security analysis is largely manual. Experts reason through attack vectors, write example scripts, and review each other's work. This process is bounded by human imagination and time.
This project asks: can we train an RL agent to explore the Bitcoin Script space and surface scripts that a human reviewer might not think to construct? Specifically, scripts that:
- Are small but expensive to validate (potential DoS vectors)
- Push stack depth, element size, sigops, or varops budgets to their limits
- Combine opcodes in structurally unusual ways
The agent treats Bitcoin Script as a game: at each step it chooses an opcode from a vocabulary of ~100 actions, with invalid choices masked out. At episode end, the completed script is executed and scored.
The system uses Gumbel MuZero (via LightZero) as the RL algorithm. MuZero learns a dynamics model of script execution in latent space, allowing it to plan via Monte Carlo Tree Search without running the interpreter at every planning step.
Environment: A Gymnasium-compatible environment (ScriptEnv) where each step adds one opcode to the script being constructed. Per-step action masking prevents trivially invalid choices (e.g., binary ops on an empty stack, OP_ENDIF without a matching OP_IF).
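To make the masking concrete, here is a minimal, hypothetical sketch of how a per-step action mask can be derived from stack depth and open `OP_IF` count. The opcode table and function names are illustrative, not the project's API; the real `ScriptEnv` tracks stack effects for the full opcode vocabulary.

```python
# Illustrative sketch of per-step action masking (hypothetical names;
# the real ScriptEnv tracks far more execution state).
import numpy as np

# Minimal opcode table: (name, items popped, items pushed)
OPCODES = [
    ("OP_1", 0, 1),
    ("OP_DUP", 1, 2),
    ("OP_ADD", 2, 1),
    ("OP_IF", 1, 0),
    ("OP_ENDIF", 0, 0),
]

def action_mask(stack_depth: int, open_ifs: int) -> np.ndarray:
    """Return a boolean mask over OPCODES: True = legal at this step."""
    mask = np.zeros(len(OPCODES), dtype=bool)
    for i, (name, pops, _pushes) in enumerate(OPCODES):
        if name == "OP_ENDIF":
            # OP_ENDIF is only legal when a matching OP_IF is open
            mask[i] = open_ifs > 0
        else:
            # An opcode is legal only if the stack can satisfy its pops
            mask[i] = stack_depth >= pops
    return mask

# Empty stack: only OP_1 (a pure push) is legal
print(action_mask(stack_depth=0, open_ifs=0))
```

The masked logits are what the MCTS policy actually samples from, so trivially invalid scripts never consume training budget.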
Interpreter: A from-scratch Python Bitcoin Script interpreter that tracks execution metrics including opcode count, stack depth, element sizes, sigops, and varops costs per Rusty Russell's GSR proposal.
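The metrics the interpreter accumulates might look like the following hypothetical sketch (field and method names are illustrative, not the project's actual classes):

```python
# Hypothetical sketch of the execution metrics tracked per script run.
from dataclasses import dataclass

@dataclass
class ExecutionMetrics:
    opcode_count: int = 0
    max_stack_depth: int = 0
    max_element_size: int = 0
    sigops: int = 0
    varops_cost: int = 0   # cumulative cost per the GSR varops budget

    def record_step(self, stack: list[bytes], is_sigop: bool, varops: int) -> None:
        """Update running metrics after executing one opcode."""
        self.opcode_count += 1
        self.max_stack_depth = max(self.max_stack_depth, len(stack))
        if stack:
            self.max_element_size = max(
                self.max_element_size, max(len(e) for e in stack)
            )
        self.sigops += int(is_sigop)
        self.varops_cost += varops

m = ExecutionMetrics()
m.record_step([b"\x01", b"\x02\x03"], is_sigop=False, varops=5)
print(m.max_stack_depth, m.max_element_size, m.varops_cost)  # 2 2 5
```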
Reward: A multi-component reward combining validation cost asymmetry (small script, high cost), edge-case proximity (approaching resource limits), structural novelty (MinHash-based deduplication), and execution quality.
Consensus configuration: The system is parameterized by which opcodes are enabled, allowing you to answer questions like "what happens if we enable OP_CAT under tapscript rules?" by simply changing the config.
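A consensus preset can be thought of as little more than an opcode allow-list. The sketch below is hypothetical (names and opcode sets illustrative, not the project's actual config schema), but shows the idea of deriving "tapscript + OP_CAT" from a base preset:

```python
# Hypothetical sketch of consensus presets as opcode allow-lists.
from dataclasses import dataclass

@dataclass(frozen=True)
class ConsensusConfig:
    name: str
    enabled_opcodes: frozenset[str]

BASE_TAPSCRIPT = frozenset({"OP_DUP", "OP_ADD", "OP_IF", "OP_ENDIF", "OP_CHECKSIG"})

PRESETS = {
    "tapscript_default": ConsensusConfig("tapscript_default", BASE_TAPSCRIPT),
    # "What happens if we enable OP_CAT under tapscript rules?"
    "tapscript_cat": ConsensusConfig("tapscript_cat", BASE_TAPSCRIPT | {"OP_CAT"}),
}

cfg = PRESETS["tapscript_cat"]
print("OP_CAT" in cfg.enabled_opcodes)  # True
```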
Curriculum: Training proceeds in three phases, progressively unlocking more opcodes (stack/arithmetic only -> full base opcodes -> proposed/extension opcodes).
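The phase schedule can be pictured as a simple threshold map over training progress. The 0.4/0.8 cut points below are illustrative assumptions (the real pacing is configurable), but the structure matches the three phases described above:

```python
# Hypothetical sketch of the 3-phase curriculum schedule.
PHASES = [
    (0.0, "stack_arithmetic"),  # phase 1: stack/arithmetic ops only
    (0.4, "full_base"),         # phase 2: full base opcodes
    (0.8, "extensions"),        # phase 3: proposed/extension opcodes
]

def current_phase(progress: float) -> str:
    """Map training progress in [0, 1] to the active curriculum phase."""
    phase = PHASES[0][1]
    for threshold, name in PHASES:
        if progress >= threshold:
            phase = name
    return phase

print(current_phase(0.1))  # stack_arithmetic
print(current_phase(0.9))  # extensions
```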
- Python 3.10 or 3.11
- uv for dependency management
- macOS (Apple Silicon) or Linux
- Docker (optional, for Bitcoin Core regtest validation)
```bash
# Install dependencies
uv sync --all-extras

# Run the test suite
uv run pytest tests/ -v
```

```bash
# Quick dev run (~1 hour, verifies the pipeline works)
uv run python scripts/train.py --steps 10000

# With a specific consensus configuration
uv run python scripts/train.py --steps 10000 --consensus tapscript_cat

# With the calibrated reward config
uv run python scripts/train.py --steps 10000 --reward-config configs/reward_calibrated.yaml

# Longer production run
uv run python scripts/train.py --steps 5000000 --consensus tapscript_gsr
```

Available consensus presets: `tapscript_default`, `tapscript_cat`, `tapscript_ctv`, `tapscript_gsr`, `legacy`. You can also pass a JSON string for custom configurations.
| Flag | Default | Description |
|---|---|---|
| `--steps` | 100000 | Total environment steps |
| `--seed` | 42 | Random seed |
| `--workers` | 4 | Number of parallel collector environments |
| `--sims` | 32 | MCTS simulations per step |
| `--batch-size` | 256 | Training batch size |
| `--consensus` | tapscript_cat | Consensus preset name or JSON string |
| `--reward-config` | (default) | Path to reward YAML config |
| `--device` | auto | Device: `auto`, `cpu`, `mps`, or `cuda` |
| `--verbose` | false | Enable verbose logging |
```bash
# Analyze a training run's logs and archive
uv run python scripts/analyze_run.py

# Cross-validate archived scripts against python-bitcoinlib
uv run python scripts/validate_scripts.py
```

Training logs TensorBoard events to the run output directory:

```bash
uv run tensorboard --logdir bitcoin_script_muzero_*
```

For ground-truth validation of discovered scripts against Bitcoin Core in regtest mode:

```bash
docker compose up -d
# Wait for the node to be healthy, then use scripts/validate_scripts.py
```

Two reward configs are provided in `configs/`:
- `reward_default.yaml`: Conservative settings, `inverse` novelty decay, no per-step shaping. Good for initial exploration.
- `reward_calibrated.yaml`: Stronger exploration pressure (`inverse_sqrt` novelty decay, `w_novelty=1.0`), per-step shaping, length efficiency bonuses, and tuned curriculum pacing.
The reward is a weighted sum of:
| Component | What it measures |
|---|---|
| R_cost | Validation cost relative to script size: log(1 + varops_cost) / log(1 + script_size) |
| R_validity | Execution quality: penalizes trivial failures, rewards successful execution |
| R_edge | Bonuses for approaching resource limits (stack depth, sigops, varops, element size) |
| R_novelty | Structural novelty via MinHash, decaying with repeated structures |
| R_diversity | Opcode category coverage, with concentration penalty |
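Putting the table together, the reward is computed roughly as follows. The weights below are hypothetical placeholders (the actual values live in the reward YAML configs); only the `R_cost` formula is taken verbatim from the table above:

```python
# Illustrative sketch of the weighted reward sum. Weights are hypothetical;
# r_cost matches the table: log(1 + varops_cost) / log(1 + script_size).
import math

WEIGHTS = {"cost": 1.0, "validity": 0.5, "edge": 0.5, "novelty": 1.0, "diversity": 0.25}

def r_cost(varops_cost: float, script_size: int) -> float:
    """Cost asymmetry: high validation cost from a small script scores well."""
    return math.log1p(varops_cost) / math.log1p(script_size)

def total_reward(components: dict[str, float]) -> float:
    """Weighted sum of the reward components from the table above."""
    return sum(WEIGHTS[k] * v for k, v in components.items())

# A tiny script with disproportionate validation cost scores high on R_cost
print(round(r_cost(varops_cost=10_000, script_size=20), 2))  # 3.03
```

The logarithms keep the cost ratio bounded: doubling the script size cannot be fully offset by merely doubling the validation cost, which biases the agent toward genuinely asymmetric scripts.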
```
bitcoin-rl/
  configs/                 # Reward YAML configs
  scripts/
    train.py               # CLI training entry point
    analyze_run.py         # Post-training analysis
    validate_scripts.py    # Cross-validation against python-bitcoinlib
  src/bitcoin_rl/
    env/                   # Gymnasium environment, action space, masking
    interpreter/           # Python Script interpreter with varops tracking
    reward/                # Multi-component reward, novelty, structural metrics
    agent/                 # MuZero model, LightZero config, training loop
    analysis/              # SQLite script archive, leaderboard
    validation/            # Script serialization, cross-validation
  tests/                   # Tests covering most components
```
This is an early-stage proof of concept. What's implemented:
- Full Bitcoin Script interpreter (Python) with incremental execution and varops cost tracking
- Gymnasium environment with per-step action masking and 3-phase curriculum
- Gumbel MuZero agent with transformer-based representation network
- Multi-component reward function with configurable weights
- MinHash-based novelty tracking
- SQLite script archive with deduplication and leaderboard
- Cross-validation against python-bitcoinlib
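The MinHash novelty idea from the list above can be sketched in a few lines. This is an illustrative toy (the project's shingle size, signature length, and hash choice may differ): scripts are shingled into opcode k-grams, and matching signature slots estimate Jaccard similarity between structures.

```python
# Minimal MinHash sketch for structural novelty (illustrative only).
import hashlib

def shingles(script: list[str], k: int = 2) -> set[tuple[str, ...]]:
    """k-grams of opcodes capture local script structure."""
    return {tuple(script[i:i + k]) for i in range(len(script) - k + 1)}

def minhash(items: set, num_perm: int = 32) -> list[int]:
    """One min over a salted hash per 'permutation' approximates MinHash."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{it}".encode(), digest_size=8).digest(),
                "big",
            )
            for it in items
        ))
    return sig

def similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = ["OP_1", "OP_DUP", "OP_ADD", "OP_DUP", "OP_ADD"]
sig1 = minhash(shingles(s1))
print(similarity(sig1, sig1))  # 1.0 — identical structure is deduplicated
```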
What's not yet implemented:
- Differential execution across multiple interpreter implementations
- Divergence reward (highest-value signal per the design, requires multiple interpreters)
- Semi-realistic signature/covenant validation (all crypto ops currently use skeleton mode)
- Complete transaction construction for regtest submission
See system.md for the full design document and revisions.md for a detailed tracker of known issues and deviations from the original design.
- Skeleton cryptographic ops: For now, signature checks always succeed and covenant opcodes (OP_CTV, OP_CSFS) always validate. The agent cannot learn about covenant constraints or signature failure modes. This limits what can be discovered about proposed opcodes that depend on transaction context.
- Pure Python interpreter: Training throughput is limited by interpreter speed. A Rust interpreter via PyO3 could provide a substantial speedup.
- Single-process only: The shared-state design (novelty tracker, episode queue) requires `env_manager type="base"`. Subprocess parallelism would require architectural changes.
- No transaction context: Scripts are executed in isolation without a spending transaction, so opcodes that inspect transaction fields (OP_CTV, OP_CHECKLOCKTIMEVERIFY) operate in skeleton mode.
MIT