An experimental reinforcement learning system that autonomously discovers adversarial Bitcoin scripts. The agent constructs scripts token-by-token and is rewarded for finding scripts that are expensive to validate relative to their size, push resource limits, or trigger unusual edge cases in the Script interpreter.
This is a research proof-of-concept exploring whether RL can augment manual security analysis of proposed Bitcoin consensus changes. It is not a replacement for existing fuzzing tools, formal verification, or expert human review.
When the Bitcoin community evaluates proposed opcodes or consensus rule changes (OP_CAT, OP_CTV, Great Script Restoration, etc.), security analysis is largely manual. Experts reason through attack vectors, write example scripts, and review each other's work. This process is bounded by human imagination and time.
This project asks: can we train an RL agent to explore the Bitcoin Script space and surface scripts that a human reviewer might not think to construct? Specifically, scripts that:
- Are small but expensive to validate (potential DoS vectors)
- Push stack depth, element size, sigops, or varops budgets to their limits
- Combine opcodes in structurally unusual ways
The agent treats Bitcoin Script as a game: at each step it chooses an opcode from a vocabulary of ~100 actions, with invalid choices masked out. At episode end, the completed script is executed and scored.
The system uses Gumbel MuZero (via LightZero) as the RL algorithm. MuZero learns a dynamics model of script execution in latent space, allowing it to plan via Monte Carlo Tree Search without running the interpreter at every planning step.
Environment: A Gymnasium-compatible environment (ScriptEnv) where each step adds one opcode to the script being constructed. Per-step action masking prevents trivially invalid choices (e.g., binary ops on an empty stack, OP_ENDIF without a matching OP_IF).
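To make the masking concrete, here is a minimal, hypothetical sketch of how a per-step action mask can be derived from stack depth and open `OP_IF` count. The opcode table and function names are illustrative, not the project's API; the real `ScriptEnv` tracks stack effects for the full opcode vocabulary.

```python
# Illustrative sketch of per-step action masking (hypothetical names;
# the real ScriptEnv tracks far more execution state).
import numpy as np

# Minimal opcode table: (name, items popped, items pushed)
OPCODES = [
    ("OP_1", 0, 1),
    ("OP_DUP", 1, 2),
    ("OP_ADD", 2, 1),
    ("OP_IF", 1, 0),
    ("OP_ENDIF", 0, 0),
]

def action_mask(stack_depth: int, open_ifs: int) -> np.ndarray:
    """Return a boolean mask over OPCODES: True = legal at this step."""
    mask = np.zeros(len(OPCODES), dtype=bool)
    for i, (name, pops, _pushes) in enumerate(OPCODES):
        if name == "OP_ENDIF":
            # OP_ENDIF is only legal when a matching OP_IF is open
            mask[i] = open_ifs > 0
        else:
            # An opcode is legal only if the stack can satisfy its pops
            mask[i] = stack_depth >= pops
    return mask

# Empty stack: only OP_1 (a pure push) is legal
print(action_mask(stack_depth=0, open_ifs=0))
```

The masked logits are what the MCTS policy actually samples from, so trivially invalid scripts never consume training budget.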
Interpreter: A from-scratch Python Bitcoin Script interpreter that tracks execution metrics including opcode count, stack depth, element sizes, sigops, and varops costs per Rusty Russell's GSR proposal.
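The metrics the interpreter accumulates might look like the following hypothetical sketch (field and method names are illustrative, not the project's actual classes):

```python
# Hypothetical sketch of the execution metrics tracked per script run.
from dataclasses import dataclass

@dataclass
class ExecutionMetrics:
    opcode_count: int = 0
    max_stack_depth: int = 0
    max_element_size: int = 0
    sigops: int = 0
    varops_cost: int = 0   # cumulative cost per the GSR varops budget

    def record_step(self, stack: list[bytes], is_sigop: bool, varops: int) -> None:
        """Update running metrics after executing one opcode."""
        self.opcode_count += 1
        self.max_stack_depth = max(self.max_stack_depth, len(stack))
        if stack:
            self.max_element_size = max(
                self.max_element_size, max(len(e) for e in stack)
            )
        self.sigops += int(is_sigop)
        self.varops_cost += varops

m = ExecutionMetrics()
m.record_step([b"\x01", b"\x02\x03"], is_sigop=False, varops=5)
print(m.max_stack_depth, m.max_element_size, m.varops_cost)  # 2 2 5
```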
Reward: A multi-component reward combining validation cost asymmetry (small script, high cost), edge-case proximity (approaching resource limits), structural novelty (MinHash-based deduplication), and execution quality.
Consensus configuration: The system is parameterized by which opcodes are enabled, allowing you to answer questions like "what happens if we enable OP_CAT under tapscript rules?" by simply changing the config.
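A consensus preset can be thought of as little more than an opcode allow-list. The sketch below is hypothetical (names and opcode sets illustrative, not the project's actual config schema), but shows the idea of deriving "tapscript + OP_CAT" from a base preset:

```python
# Hypothetical sketch of consensus presets as opcode allow-lists.
from dataclasses import dataclass

@dataclass(frozen=True)
class ConsensusConfig:
    name: str
    enabled_opcodes: frozenset[str]

BASE_TAPSCRIPT = frozenset({"OP_DUP", "OP_ADD", "OP_IF", "OP_ENDIF", "OP_CHECKSIG"})

PRESETS = {
    "tapscript_default": ConsensusConfig("tapscript_default", BASE_TAPSCRIPT),
    # "What happens if we enable OP_CAT under tapscript rules?"
    "tapscript_cat": ConsensusConfig("tapscript_cat", BASE_TAPSCRIPT | {"OP_CAT"}),
}

cfg = PRESETS["tapscript_cat"]
print("OP_CAT" in cfg.enabled_opcodes)  # True
```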
Curriculum: Training proceeds in three phases, progressively unlocking more opcodes (stack/arithmetic only -> full base opcodes -> proposed/extension opcodes).
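The phase schedule can be pictured as a simple threshold map over training progress. The 0.4/0.8 cut points below are illustrative assumptions (the real pacing is configurable), but the structure matches the three phases described above:

```python
# Hypothetical sketch of the 3-phase curriculum schedule.
PHASES = [
    (0.0, "stack_arithmetic"),  # phase 1: stack/arithmetic ops only
    (0.4, "full_base"),         # phase 2: full base opcodes
    (0.8, "extensions"),        # phase 3: proposed/extension opcodes
]

def current_phase(progress: float) -> str:
    """Map training progress in [0, 1] to the active curriculum phase."""
    phase = PHASES[0][1]
    for threshold, name in PHASES:
        if progress >= threshold:
            phase = name
    return phase

print(current_phase(0.1))  # stack_arithmetic
print(current_phase(0.9))  # extensions
```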
- Python 3.10 or 3.11
- uv for dependency management
- macOS (Apple Silicon) or Linux
- Docker (optional, for Bitcoin Core regtest validation)
```bash
# Install dependencies
uv sync --all-extras

# Run the test suite
uv run pytest tests/ -v
```

```bash
# Quick dev run (~1 hour, verifies the pipeline works)
uv run python scripts/train.py --steps 10000

# With a specific consensus configuration
uv run python scripts/train.py --steps 10000 --consensus tapscript_cat

# With the calibrated reward config
uv run python scripts/train.py --steps 10000 --reward-config configs/reward_calibrated.yaml

# Longer production run
uv run python scripts/train.py --steps 5000000 --consensus tapscript_gsr
```

Available consensus presets: `tapscript_default`, `tapscript_cat`, `tapscript_ctv`, `tapscript_gsr`, `legacy`. You can also pass a JSON string for custom configurations.
| Flag | Default | Description |
|---|---|---|
| `--steps` | 100000 | Total environment steps |
| `--seed` | 42 | Random seed |
| `--workers` | 4 | Number of parallel collector environments |
| `--sims` | 32 | MCTS simulations per step |
| `--batch-size` | 256 | Training batch size |
| `--consensus` | tapscript_cat | Consensus preset name or JSON string |
| `--reward-config` | (default) | Path to reward YAML config |
| `--device` | auto | Device: `auto`, `cpu`, `mps`, or `cuda` |
| `--verbose` | false | Enable verbose logging |
```bash
# Analyze a training run's logs and archive
uv run python scripts/analyze_run.py

# Cross-validate archived scripts against python-bitcoinlib
uv run python scripts/validate_scripts.py
```

Training logs TensorBoard events to the run output directory:

```bash
uv run tensorboard --logdir bitcoin_script_muzero_*
```

For ground-truth validation of discovered scripts against Bitcoin Core in regtest mode:

```bash
docker compose up -d
# Wait for the node to be healthy, then use scripts/validate_scripts.py
```

Two reward configs are provided in `configs/`:
- `reward_default.yaml`: Conservative settings, `inverse` novelty decay, no per-step shaping. Good for initial exploration.
- `reward_calibrated.yaml`: Stronger exploration pressure (`inverse_sqrt` novelty decay, `w_novelty=1.0`), per-step shaping, length efficiency bonuses, and tuned curriculum pacing.
The reward is a weighted sum of:
| Component | What it measures |
|---|---|
| R_cost | Validation cost relative to script size: log(1 + varops_cost) / log(1 + script_size) |
| R_validity | Execution quality: penalizes trivial failures, rewards successful execution |
| R_edge | Bonuses for approaching resource limits (stack depth, sigops, varops, element size) |
| R_novelty | Structural novelty via MinHash, decaying with repeated structures |
| R_diversity | Opcode category coverage, with concentration penalty |
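Putting the table together, the reward is computed roughly as follows. The weights below are hypothetical placeholders (the actual values live in the reward YAML configs); only the `R_cost` formula is taken verbatim from the table above:

```python
# Illustrative sketch of the weighted reward sum. Weights are hypothetical;
# r_cost matches the table: log(1 + varops_cost) / log(1 + script_size).
import math

WEIGHTS = {"cost": 1.0, "validity": 0.5, "edge": 0.5, "novelty": 1.0, "diversity": 0.25}

def r_cost(varops_cost: float, script_size: int) -> float:
    """Cost asymmetry: high validation cost from a small script scores well."""
    return math.log1p(varops_cost) / math.log1p(script_size)

def total_reward(components: dict[str, float]) -> float:
    """Weighted sum of the reward components from the table above."""
    return sum(WEIGHTS[k] * v for k, v in components.items())

# A tiny script with disproportionate validation cost scores high on R_cost
print(round(r_cost(varops_cost=10_000, script_size=20), 2))  # 3.03
```

The logarithms keep the cost ratio bounded: doubling the script size cannot be fully offset by merely doubling the validation cost, which biases the agent toward genuinely asymmetric scripts.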
```
bitcoin-rl/
  configs/                 # Reward YAML configs
  scripts/
    train.py               # CLI training entry point
    analyze_run.py         # Post-training analysis
    validate_scripts.py    # Cross-validation against python-bitcoinlib
  src/bitcoin_rl/
    env/                   # Gymnasium environment, action space, masking
    interpreter/           # Python Script interpreter with varops tracking
    reward/                # Multi-component reward, novelty, structural metrics
    agent/                 # MuZero model, LightZero config, training loop
    analysis/              # SQLite script archive, leaderboard
    validation/            # Script serialization, cross-validation
  tests/                   # Tests covering most components
```
This is an early-stage proof of concept. What's implemented:
- Full Bitcoin Script interpreter (Python) with incremental execution and varops cost tracking
- Gymnasium environment with per-step action masking and 3-phase curriculum
- Gumbel MuZero agent with transformer-based representation network
- Multi-component reward function with configurable weights
- MinHash-based novelty tracking
- SQLite script archive with deduplication and leaderboard
- Cross-validation against python-bitcoinlib
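The MinHash novelty idea from the list above can be sketched in a few lines. This is an illustrative toy (the project's shingle size, signature length, and hash choice may differ): scripts are shingled into opcode k-grams, and matching signature slots estimate Jaccard similarity between structures.

```python
# Minimal MinHash sketch for structural novelty (illustrative only).
import hashlib

def shingles(script: list[str], k: int = 2) -> set[tuple[str, ...]]:
    """k-grams of opcodes capture local script structure."""
    return {tuple(script[i:i + k]) for i in range(len(script) - k + 1)}

def minhash(items: set, num_perm: int = 32) -> list[int]:
    """One min over a salted hash per 'permutation' approximates MinHash."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{it}".encode(), digest_size=8).digest(),
                "big",
            )
            for it in items
        ))
    return sig

def similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = ["OP_1", "OP_DUP", "OP_ADD", "OP_DUP", "OP_ADD"]
sig1 = minhash(shingles(s1))
print(similarity(sig1, sig1))  # 1.0 — identical structure is deduplicated
```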
What's not yet implemented:
- Differential execution across multiple interpreter implementations
- Divergence reward (highest-value signal per the design, requires multiple interpreters)
- Semi-realistic signature/covenant validation (all crypto ops currently use skeleton mode)
- Complete transaction construction for regtest submission
See system.md for the full design document and revisions.md for a detailed tracker of known issues and deviations from the original design.
- Skeleton cryptographic ops: For now, signature checks always succeed and covenant opcodes (OP_CTV, OP_CSFS) always validate. The agent cannot learn about covenant constraints or signature failure modes. This limits what can be discovered about proposed opcodes that depend on transaction context.
- Pure Python interpreter: Training throughput is limited by interpreter speed. A Rust interpreter via PyO3 could provide a substantial speedup.
- Single-process only: The shared-state design (novelty tracker, episode queue) requires `env_manager type="base"`. Subprocess parallelism would require architectural changes.
- No transaction context: Scripts are executed in isolation without a spending transaction, so opcodes that inspect transaction fields (OP_CTV, OP_CHECKLOCKTIMEVERIFY) operate in skeleton mode.
MIT