machuPikacchuBTC/bitcoin-rl


bitcoin-rl

An experimental reinforcement learning system that autonomously discovers adversarial Bitcoin scripts. The agent constructs scripts token-by-token and is rewarded for finding scripts that are expensive to validate relative to their size, push resource limits, or trigger unusual edge cases in the Script interpreter.

This is a research proof-of-concept exploring whether RL can augment manual security analysis of proposed Bitcoin consensus changes. It is not a replacement for existing fuzzing tools, formal verification, or expert human review.

Motivation

When the Bitcoin community evaluates proposed opcodes or consensus rule changes (OP_CAT, OP_CTV, Great Script Restoration, etc.), security analysis is largely manual. Experts reason through attack vectors, write example scripts, and review each other's work. This process is bounded by human imagination and time.

This project asks: can we train an RL agent to explore the Bitcoin Script space and surface scripts that a human reviewer might not think to construct? Specifically, scripts that:

  • Are small but expensive to validate (potential DoS vectors)
  • Push stack depth, element size, sigops, or varops budgets to their limits
  • Combine opcodes in structurally unusual ways

The agent treats Bitcoin Script as a game: at each step it chooses an opcode from a vocabulary of ~100 actions, with invalid choices masked out. At episode end, the completed script is executed and scored.

How it works

The system uses Gumbel MuZero (via LightZero) as the RL algorithm. MuZero learns a dynamics model of script execution in latent space, allowing it to plan via Monte Carlo Tree Search without running the interpreter at every planning step.

Environment: A Gymnasium-compatible environment (ScriptEnv) where each step adds one opcode to the script being constructed. Per-step action masking prevents trivially invalid choices (e.g., binary ops on an empty stack, OP_ENDIF without a matching OP_IF).
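A minimal sketch of per-step masking, assuming the builder tracks only an estimated stack depth and the number of open OP_IF blocks. The names BuildState, MIN_STACK, and action_mask are hypothetical and are not taken from ScriptEnv; the real vocabulary is much larger.

```python
from dataclasses import dataclass

# Hypothetical mini-vocabulary: opcode -> minimum number of stack items it needs.
MIN_STACK = {
    "OP_1": 0,
    "OP_DUP": 1,
    "OP_DROP": 1,
    "OP_ADD": 2,
    "OP_SWAP": 2,
    "OP_IF": 1,
    "OP_ENDIF": 0,
}
ACTIONS = list(MIN_STACK)


@dataclass
class BuildState:
    stack_depth: int = 0  # estimated number of items on the stack so far
    open_ifs: int = 0     # OP_IF blocks that have not been closed yet


def action_mask(state: BuildState) -> list[bool]:
    """Return one flag per action; False means the choice is masked out."""
    mask = []
    for op in ACTIONS:
        ok = state.stack_depth >= MIN_STACK[op]
        if op == "OP_ENDIF":
            # OP_ENDIF is only legal when there is an OP_IF to close.
            ok = ok and state.open_ifs > 0
        mask.append(ok)
    return mask


# On an empty stack with no open blocks, only the push is allowed.
allowed_at_start = [op for op, ok in zip(ACTIONS, action_mask(BuildState())) if ok]
```

The mask would typically be passed to the policy alongside the observation so that masked actions receive zero probability.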

Interpreter: A from-scratch Python Bitcoin Script interpreter that tracks execution metrics including opcode count, stack depth, element sizes, sigops, and varops costs per Rusty Russell's GSR proposal.
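To illustrate what tracking execution metrics can look like, here is a toy interpreter for three opcodes. It is a sketch only: the Metrics class and run function are hypothetical, data pushes are not counted as opcodes here, and sigops and varops accounting are omitted.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    opcode_count: int = 0
    max_stack_depth: int = 0
    max_element_size: int = 0


def run(script: list) -> tuple[list[bytes], Metrics]:
    """Execute a toy script: bytes items are pushes, strings are opcodes."""
    stack: list[bytes] = []
    m = Metrics()
    for token in script:
        if isinstance(token, bytes):
            stack.append(token)  # data push, not counted as an opcode here
        elif token == "OP_DUP":
            m.opcode_count += 1
            stack.append(stack[-1])
        elif token == "OP_DROP":
            m.opcode_count += 1
            stack.pop()
        elif token == "OP_CAT":
            m.opcode_count += 1
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unsupported opcode: {token}")
        # Any new element is always on top, so checking the top is enough.
        m.max_stack_depth = max(m.max_stack_depth, len(stack))
        if stack:
            m.max_element_size = max(m.max_element_size, len(stack[-1]))
    return stack, m


# Repeated DUP + CAT doubles the top element each round.
final_stack, metrics = run([b"ab", "OP_DUP", "OP_CAT", "OP_DUP", "OP_CAT"])
```

Short scripts like this one, whose element size grows geometrically, are the kind of cost asymmetry the metrics are meant to expose.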

Reward: A multi-component reward combining validation cost asymmetry (small script, high cost), edge-case proximity (approaching resource limits), structural novelty (MinHash-based deduplication), and execution quality.
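As a rough illustration of MinHash-based structural comparison, the sketch below hashes overlapping opcode pairs and estimates the Jaccard similarity of two scripts. The shingle size, the number of hash functions, and the function names are assumptions; the project's actual novelty tracker may differ.

```python
import hashlib


def shingles(opcodes: list[str], n: int = 2) -> set[str]:
    """Overlapping opcode n-grams describing a script's structure.

    The script must contain at least n opcodes.
    """
    return {" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}


def minhash(items: set[str], num_hashes: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


sig_a = minhash(shingles(["OP_DUP", "OP_ADD", "OP_DUP"]))
sig_b = minhash(shingles(["OP_DUP", "OP_ADD", "OP_DUP", "OP_ADD"]))  # same pairs
sig_c = minhash(shingles(["OP_1", "OP_1", "OP_DROP"]))               # disjoint pairs
```

Here sig_a and sig_b are identical because both scripts contain the same set of opcode pairs, which is how repeated structures can be detected even when the scripts differ in length.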

Consensus configuration: The system is parameterized by which opcodes are enabled, allowing you to answer questions like "what happens if we enable OP_CAT under tapscript rules?" by simply changing the config.
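One way to picture such a configuration is as a named set of enabled opcodes. The sketch below is illustrative only: ConsensusConfig, BASE_OPCODES, and the JSON keys name and extra_opcodes are hypothetical and do not describe the schema the project actually accepts.

```python
import json
from dataclasses import dataclass

# Hypothetical, heavily abbreviated base opcode set.
BASE_OPCODES = frozenset({"OP_DUP", "OP_DROP", "OP_ADD", "OP_IF", "OP_ENDIF"})


@dataclass(frozen=True)
class ConsensusConfig:
    name: str
    enabled_opcodes: frozenset

    @classmethod
    def from_json(cls, text: str) -> "ConsensusConfig":
        """Build a config from a JSON string (hypothetical schema)."""
        data = json.loads(text)
        extra = frozenset(data.get("extra_opcodes", []))
        return cls(data["name"], BASE_OPCODES | extra)

    def allows(self, opcode: str) -> bool:
        return opcode in self.enabled_opcodes


default = ConsensusConfig("default_sketch", BASE_OPCODES)
with_cat = ConsensusConfig.from_json(
    '{"name": "cat_sketch", "extra_opcodes": ["OP_CAT"]}'
)
```

With this shape, the environment's action mask and the interpreter can both consult allows() so that a single config change affects the whole pipeline.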

Curriculum: Training proceeds in three phases, progressively unlocking more opcodes (stack/arithmetic only -> full base opcodes -> proposed/extension opcodes).
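The three phases could be scheduled by training progress, as in the sketch below. The phase boundaries (30% and 60% of total steps) and the phase names are assumptions chosen for illustration, not the project's tuned pacing.

```python
# Hypothetical phase boundaries, as fractions of total training steps.
PHASES = [
    (0.0, "stack_arithmetic"),  # phase 1: stack/arithmetic opcodes only
    (0.3, "full_base"),         # phase 2: full base opcodes
    (0.6, "extensions"),        # phase 3: proposed/extension opcodes
]


def phase_for(step: int, total_steps: int) -> str:
    """Return the name of the curriculum phase active at a given step."""
    progress = step / total_steps
    current = PHASES[0][1]
    for start, name in PHASES:
        if progress >= start:
            current = name  # keep the latest phase whose start has been reached
    return current
```

For example, phase_for(50, 100) falls in the second phase under these assumed boundaries.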

Requirements

  • Python 3.10 or 3.11
  • uv for dependency management
  • macOS (Apple Silicon) or Linux
  • Docker (optional, for Bitcoin Core regtest validation)

Setup

# Install dependencies
uv sync --all-extras

# Run the test suite
uv run pytest tests/ -v

Usage

Training

# Quick dev run (~1 hour, verifies the pipeline works)
uv run python scripts/train.py --steps 10000

# With a specific consensus configuration
uv run python scripts/train.py --steps 10000 --consensus tapscript_cat

# With the calibrated reward config
uv run python scripts/train.py --steps 10000 --reward-config configs/reward_calibrated.yaml

# Longer production run
uv run python scripts/train.py --steps 5000000 --consensus tapscript_gsr

Available consensus presets: tapscript_default, tapscript_cat, tapscript_ctv, tapscript_gsr, legacy. You can also pass a JSON string for custom configurations.

CLI options

Flag             Default         Description
--steps          100000          Total environment steps
--seed           42              Random seed
--workers        4               Number of parallel collector environments
--sims           32              MCTS simulations per step
--batch-size     256             Training batch size
--consensus      tapscript_cat   Consensus preset name or JSON string
--reward-config  (default)       Path to reward YAML config
--device         auto            Device: auto, cpu, mps, or cuda
--verbose        false           Enable verbose logging
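For readers who want to see how flags like these map onto a parser, here is a sketch using argparse with the defaults from the table. It is not the contents of scripts/train.py; build_parser is a hypothetical name, and representing the default reward config as None is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(description="Train the script-building agent")
    p.add_argument("--steps", type=int, default=100000,
                   help="Total environment steps")
    p.add_argument("--seed", type=int, default=42, help="Random seed")
    p.add_argument("--workers", type=int, default=4,
                   help="Number of parallel collector environments")
    p.add_argument("--sims", type=int, default=32,
                   help="MCTS simulations per step")
    p.add_argument("--batch-size", type=int, default=256,
                   help="Training batch size")
    p.add_argument("--consensus", default="tapscript_cat",
                   help="Consensus preset name or JSON string")
    p.add_argument("--reward-config", default=None,
                   help="Path to reward YAML config")
    p.add_argument("--device", default="auto",
                   choices=["auto", "cpu", "mps", "cuda"])
    p.add_argument("--verbose", action="store_true",
                   help="Enable verbose logging")
    return p


args = build_parser().parse_args(["--steps", "10000", "--consensus", "tapscript_gsr"])
```

argparse converts dashes to underscores, so --batch-size and --reward-config are read as args.batch_size and args.reward_config.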

Analyzing results

# Analyze a training run's logs and archive
uv run python scripts/analyze_run.py

# Cross-validate archived scripts against python-bitcoinlib
uv run python scripts/validate_scripts.py

TensorBoard

Training writes TensorBoard event files to the run output directory:

uv run tensorboard --logdir bitcoin_script_muzero_*

Bitcoin Core validation (optional)

For ground-truth validation of discovered scripts against Bitcoin Core in regtest mode:

docker compose up -d
# Wait for the node to be healthy, then use scripts/validate_scripts.py

Reward configuration

Two reward configs are provided in configs/:

  • reward_default.yaml: Conservative settings, inverse novelty decay, no per-step shaping. Good for initial exploration.
  • reward_calibrated.yaml: Stronger exploration pressure (inverse_sqrt novelty decay, w_novelty=1.0), per-step shaping, length efficiency bonuses, and tuned curriculum pacing.
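The difference between the two novelty decay modes can be sketched as follows. The exact formulas are an assumption (1 / (1 + n) for inverse and 1 / sqrt(1 + n) for inverse_sqrt, where n is how often a structure has been seen before); the configs may define them differently.

```python
import math


def novelty_bonus(times_seen: int, decay: str = "inverse") -> float:
    """Novelty bonus for a structure already seen `times_seen` times."""
    if decay == "inverse":
        return 1.0 / (1 + times_seen)
    if decay == "inverse_sqrt":
        return 1.0 / math.sqrt(1 + times_seen)
    raise ValueError(f"unknown decay mode: {decay}")


# After three prior sightings, the slower decay keeps more of the bonus.
inverse_after_3 = novelty_bonus(3, "inverse")
sqrt_after_3 = novelty_bonus(3, "inverse_sqrt")
```

Under these assumed formulas the inverse_sqrt bonus shrinks more slowly, so repeated structures stay rewarding for longer, which is consistent with the stronger exploration pressure described for the calibrated config.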

The reward is a weighted sum of:

Component    What it measures
R_cost       Validation cost relative to script size: log(1 + varops_cost) / log(1 + script_size)
R_validity   Execution quality: penalizes trivial failures, rewards successful execution
R_edge       Bonuses for approaching resource limits (stack depth, sigops, varops, element size)
R_novelty    Structural novelty via MinHash, decaying with repeated structures
R_diversity  Opcode category coverage, with concentration penalty
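A sketch of how the components might be combined, using the R_cost formula from the table above. The weights are hypothetical apart from the novelty weight of 1.0 mentioned for the calibrated config, and returning 0 for an empty script is an assumption made to avoid dividing by log(1) = 0.

```python
import math


def r_cost(varops_cost: int, script_size: int) -> float:
    """log(1 + varops_cost) / log(1 + script_size)."""
    if script_size <= 0:
        return 0.0  # assumption: an empty script earns no cost reward
    return math.log(1 + varops_cost) / math.log(1 + script_size)


# Hypothetical weights; only the novelty weight appears in the text.
WEIGHTS = {
    "cost": 1.0,
    "validity": 0.5,
    "edge": 0.5,
    "novelty": 1.0,
    "diversity": 0.25,
}


def total_reward(components: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum over whichever components are present."""
    return sum(weights[name] * value for name, value in components.items())


example = total_reward({"cost": 2.0, "novelty": 0.5})
```

Because both the numerator and denominator of R_cost are logarithmic, a script scores above 1 when its varops cost exceeds its size, which is the asymmetry the agent is rewarded for.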

Project structure

bitcoin-rl/
  configs/                    # Reward YAML configs
  scripts/
    train.py                  # CLI training entry point
    analyze_run.py            # Post-training analysis
    validate_scripts.py       # Cross-validation against python-bitcoinlib
  src/bitcoin_rl/
    env/                      # Gymnasium environment, action space, masking
    interpreter/              # Python Script interpreter with varops tracking
    reward/                   # Multi-component reward, novelty, structural metrics
    agent/                    # MuZero model, LightZero config, training loop
    analysis/                 # SQLite script archive, leaderboard
    validation/               # Script serialization, cross-validation
  tests/                      # tests covering most components

Current status

This is an early-stage proof of concept. What's implemented:

  • Full Bitcoin Script interpreter (Python) with incremental execution and varops cost tracking
  • Gymnasium environment with per-step action masking and 3-phase curriculum
  • Gumbel MuZero agent with transformer-based representation network
  • Multi-component reward function with configurable weights
  • MinHash-based novelty tracking
  • SQLite script archive with deduplication and leaderboard
  • Cross-validation against python-bitcoinlib
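The archive and leaderboard idea can be sketched with the standard sqlite3 module. The table layout and function names below are hypothetical, and deduplication here uses an exact SHA-256 of the script bytes, which is a simplification rather than a description of the project's archive.

```python
import hashlib
import sqlite3


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the archive; the primary key enforces deduplication."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scripts ("
        " script_hash TEXT PRIMARY KEY,"
        " script_hex TEXT NOT NULL,"
        " reward REAL NOT NULL)"
    )
    return conn


def archive(conn: sqlite3.Connection, script: bytes, reward: float) -> bool:
    """Store a script; return False if an identical script is already archived."""
    digest = hashlib.sha256(script).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO scripts VALUES (?, ?, ?)",
        (digest, script.hex(), reward),
    )
    conn.commit()
    return cur.rowcount == 1


def leaderboard(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Highest-reward scripts first."""
    return conn.execute(
        "SELECT script_hex, reward FROM scripts ORDER BY reward DESC LIMIT ?",
        (limit,),
    ).fetchall()


conn = open_archive()
first = archive(conn, b"\x76\x76", 1.5)      # stored
duplicate = archive(conn, b"\x76\x76", 9.0)  # ignored as a duplicate
other = archive(conn, b"\x75", 3.0)          # stored
top = leaderboard(conn)
```

Using INSERT OR IGNORE keeps the first reward recorded for a script; a variant that keeps the best reward would need an upsert instead.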

What's not yet implemented:

  • Differential execution across multiple interpreter implementations
  • Divergence reward (highest-value signal per the design, requires multiple interpreters)
  • Semi-realistic signature/covenant validation (all crypto ops currently use skeleton mode)
  • Complete transaction construction for regtest submission

See system.md for the full design document and revisions.md for a detailed tracker of known issues and deviations from the original design.

Known limitations

  • Skeleton cryptographic ops: For now, signature checks always succeed and covenant opcodes (OP_CTV, OP_CSFS) always validate. The agent cannot learn about covenant constraints or signature failure modes. This limits what can be discovered about proposed opcodes that depend on transaction context.
  • Pure Python interpreter: Training throughput is limited by the speed of the Python interpreter. A Rust interpreter exposed via PyO3 would likely improve it substantially.
  • Single-process only: The shared state design (novelty tracker, episode queue) requires env_manager type="base". Subprocess parallelism would require architectural changes.
  • No transaction context: Scripts are executed in isolation without a spending transaction, so opcodes that inspect transaction fields (OP_CTV, OP_CHECKLOCKTIMEVERIFY) operate in skeleton mode.
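To make skeleton mode concrete, here is a sketch of a signature check that consumes its operands and always succeeds. The function name and the metrics dictionary are hypothetical, and counting the sigop even though no verification happens is an assumption.

```python
def checksig_skeleton(stack: list[bytes], metrics: dict[str, int]) -> None:
    """Skeleton OP_CHECKSIG: no cryptography, the check always succeeds."""
    if len(stack) < 2:
        raise ValueError("OP_CHECKSIG needs a signature and a public key")
    stack.pop()  # public key, ignored in skeleton mode
    stack.pop()  # signature, ignored in skeleton mode
    metrics["sigops"] = metrics.get("sigops", 0) + 1  # cost is still recorded
    stack.append(b"\x01")  # push "true" unconditionally


stack = [b"fake-signature", b"fake-pubkey"]
metrics: dict[str, int] = {}
checksig_skeleton(stack, metrics)
```

Because the result never depends on the operands, the agent gets no signal about which signatures or keys would fail, which is the limitation described above.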

License

MIT
