Structural Fragility Modeling for Distributed Systems
Model, simulate, and quantify cascading failure risk — before production.
Platform · Documentation · Research · Simulations · CLI
OrchEngineX is an infrastructure simulation engine that helps engineers understand how failures propagate through distributed systems.
Instead of looking at services in isolation, OrchEngineX models your architecture as a dependency graph and simulates how latency, packet loss, or service failures ripple across the system.
This makes it possible to identify hidden reliability risks — such as cascading failures and critical bottlenecks — before they appear in production.
The core thesis: Most production outages aren't caused by hardware failures — they're caused by structural coordination problems that are invisible to monitoring dashboards. OrchEngineX makes those structural risks visible and quantifiable before they reach production.
| Capability | Description |
|---|---|
| Architecture Builder | Visual DWFG editor with 22 node types across 7 infrastructure layers |
| Fragility Scoring | Composite risk model (0–100) combining topology, mechanics, and data plane analysis |
| Simulation Engine | Node removal, cascade propagation, retry storm, and partition sensitivity modeling |
| System Mechanics | 6-engine analysis: compute, consistency, replication, transactions, distribution, flow control |
| Data Plane Simulator | Consistency/replication/sharding/transaction impact modeling |
| Growth Mode Analysis | Scaling efficiency prediction with coordination overhead estimation |
| Failure Case Studies | Reproducible research on pool saturation cascades, queue collapse, and more |
| CLI Tooling | Full command-line interface for automation, CI/CD integration, and batch analysis |
OrchEngineX models distributed systems as Directed Weighted Failure Graphs (DWFGs) — a graph-theoretic representation where:
- Nodes represent infrastructure components (API gateways, databases, caches, message brokers, etc.)
- Edges represent communication paths with latency, packet loss, retry, and timeout characteristics
- Weights encode failure probability, resource capacity, and retry amplification potential
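As a rough sketch of that representation (type and field names here are illustrative, not the actual OrchEngineX API), a DWFG can be modeled as:

```typescript
// Illustrative sketch of a Directed Weighted Failure Graph (DWFG).
// Names are assumptions for exposition, not the OrchEngineX schema.
type EdgeType = "sync" | "async" | "replication" | "mesh-hop";

interface DwfgNode {
  id: string;
  type: string;                  // e.g. "api-gateway", "cache"
  failureProbability: number;    // 0-100%
  resourceCapacity: number;      // 0-100%
  replicaCount: number;
}

interface DwfgEdge {
  id: string;
  source: string;                // node id
  target: string;                // node id
  type: EdgeType;
  latencyMs: number;
  packetLossProbability: number; // 0-100%
  maxRetries: number;
  timeoutMs: number;
}

interface Dwfg {
  nodes: DwfgNode[];
  edges: DwfgEdge[];
}

// Adjacency lookup: which communication paths leave a given node?
function outgoing(g: Dwfg, nodeId: string): DwfgEdge[] {
  return g.edges.filter((e) => e.source === nodeId);
}
```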
| Layer | Node Types |
|---|---|
| Edge | CDN, Edge Cache, Geo DNS / Traffic Router |
| Ingress | API Gateway, Load Balancer, Reverse Proxy, WAF |
| Compute | Stateless Service, Stateful Service, Background Worker, Batch Processor |
| Data | Database (Primary), Replica Set, Cache, Distributed KV Store, Search Index |
| Messaging | Message Broker, Stream Processor, Dead Letter Queue |
| Mesh | Service Mesh Proxy |
| Egress | External API, Payment Gateway, Third-Party Service |
| Type | Description | Default Timeout |
|---|---|---|
| sync | Synchronous request/response | 3,000ms |
| async | Asynchronous event delivery | 30,000ms |
| replication | Data replication between stores | 60,000ms |
| mesh-hop | Service mesh sidecar routing | 5,000ms |
- Maximum 40 nodes per architecture
- Maximum 80 edges per architecture
- No self-loops permitted
- Unique node and edge IDs enforced
- File size limit: 2MB for JSON imports
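The constraints above can be checked offline before import. A minimal validator sketch (illustrative types, not the actual OrchEngineX validator):

```typescript
// Sketch of the import-time validation rules: node/edge limits,
// no self-loops, and unique IDs. Not the real OrchEngineX implementation.
interface ArchNode { id: string }
interface ArchEdge { id: string; source: string; target: string }
interface Arch { nodes: ArchNode[]; edges: ArchEdge[] }

function validateArch(a: Arch): string[] {
  const errors: string[] = [];
  if (a.nodes.length > 40) errors.push("more than 40 nodes");
  if (a.edges.length > 80) errors.push("more than 80 edges");
  for (const e of a.edges) {
    if (e.source === e.target) errors.push(`self-loop on edge ${e.id}`);
  }
  const nodeIds = new Set(a.nodes.map((n) => n.id));
  if (nodeIds.size !== a.nodes.length) errors.push("duplicate node IDs");
  const edgeIds = new Set(a.edges.map((e) => e.id));
  if (edgeIds.size !== a.edges.length) errors.push("duplicate edge IDs");
  return errors; // empty array means the architecture passes these checks
}
```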
The Architecture Fragility Score is a composite metric (0–100) that quantifies how susceptible a distributed architecture is to cascading failure. It is computed identically across the web UI, API, and CLI.
```text
Fragility = Cascade Susceptibility × 0.30   (topology)
          + Partition Sensitivity  × 0.25   (data plane)
          + Quorum Fragility       × 0.20   (quorum)
          + Retry Storm Potential  × 0.15   (mechanics)
          + SPOF Penalty                    (nodes × 3, max 15)
```
| Factor | Weight | What It Measures |
|---|---|---|
| Cascade Susceptibility | 30% | Fan-out amplification × dependency density × SCC cycle penalty |
| Partition Sensitivity | 25% | Data plane contribution from consistency/replication/sharding/transaction configs |
| Quorum Fragility | 20% | Risk from nodes with replicaCount < 3 (quorum threshold) |
| Retry Storm Potential | 15% | Retry amplification risk from retry policies across all edges |
| SPOF Penalty | +3/node | Single points of failure detected via BFS disconnection analysis |
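The weighted combination above is straightforward to reproduce. A sketch (component scores are assumed to arrive pre-normalized to 0–100; clamping the result to 100 is an assumption, not documented behavior):

```typescript
// Sketch of the composite fragility formula. Assumes each component
// score is already normalized to 0-100; the final clamp is an assumption.
interface FragilityInputs {
  cascadeSusceptibility: number; // 0-100
  partitionSensitivity: number;  // 0-100
  quorumFragility: number;       // 0-100
  retryStormPotential: number;   // 0-100
  spofCount: number;             // number of single points of failure
}

function fragilityScore(f: FragilityInputs): number {
  const spofPenalty = Math.min(f.spofCount * 3, 15); // +3 per SPOF, capped at 15
  const raw =
    f.cascadeSusceptibility * 0.30 +
    f.partitionSensitivity * 0.25 +
    f.quorumFragility * 0.20 +
    f.retryStormPotential * 0.15 +
    spofPenalty;
  return Math.min(Math.round(raw), 100);
}
```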
| Score | Classification | Interpretation |
|---|---|---|
| 0–39 | 🟢 Low | Architecture has structural redundancy and controlled retry geometry |
| 40–69 | 🟡 Moderate | Some structural risks present — review SPOFs and retry policies |
| 70–100 | 🔴 High | Architecture is structurally fragile — cascading failure likely under stress |
Simulates the removal of one or more nodes and computes:
- Availability Drop — percentage of unreachable nodes post-failure
- Latency Shift — propagated latency increase through dependent paths
- Retry Amplification Delta — multiplier effect from retry policies on failed paths
- Partition Risk Delta — change in network partition sensitivity
Models how failure spreads through the graph using:
- BFS traversal from failed node(s)
- Weighted propagation based on edge failure probabilities
- Retry amplification at each hop
- Timeout threshold enforcement
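A minimal sketch in the spirit of the steps above (deliberately simplified: a downstream node fails once the propagation probability on an edge crosses a threshold; the real engine additionally weights retry amplification and timeouts at each hop):

```typescript
// Toy BFS cascade propagation over a dependency graph. Edge weights are
// failure-propagation probabilities in [0, 1]. Simplified sketch, not the
// OrchEngineX cascade engine.
interface DepEdge { source: string; target: string; prob: number }

function cascade(
  edges: DepEdge[],
  seeds: string[],
  threshold = 0.5
): Set<string> {
  const failed = new Set<string>(seeds);
  const queue = [...seeds]; // BFS frontier, starting from the failed node(s)
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const e of edges) {
      if (e.source !== current || failed.has(e.target)) continue;
      if (e.prob >= threshold) { // propagation pressure crosses the threshold
        failed.add(e.target);
        queue.push(e.target);    // newly failed node propagates further
      }
    }
  }
  return failed;
}
```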
Predicts the efficiency of horizontal scaling by analyzing:
- Load distribution per replica
- Retry propagation fan-out under scale
- Coordination overhead (consensus latency for stateful nodes)
- Mesh hop latency increase from additional replicas
- Overall scaling efficiency (0–100)
Six composable simulation engines model different aspects of distributed system behavior:
| Engine | What It Models |
|---|---|
| Compute | CPU/memory pressure, thread pool exhaustion, GC amplification |
| Consistency | Linearizable vs. eventual consistency trade-offs, read/write conflict rates |
| Replication | Sync vs. async replication lag, split-brain probability, failover timing |
| Transaction | 2PC vs. Saga coordination overhead, lock contention, deadlock probability |
| Distribution | Hash vs. range vs. directory sharding, hotspot probability, rebalance cost |
| Flow Control | Backpressure propagation, buffer saturation, admission control effectiveness |
Each engine produces a composite risk score that feeds into the architecture's overall fragility assessment.
Models the impact of data layer configuration choices:
| Axis | Options |
|---|---|
| Consistency | linearizable, sequential, causal, eventual |
| Replication | sync-all, sync-quorum, async-primary, async-mesh |
| Sharding | none, hash, range, directory |
| Transaction | 2pc, saga, tcc, none |
- Latency Impact — additional latency from coordination
- Availability Impact — availability reduction from consistency requirements
- Partition Tolerance — behavior during network splits
- Throughput Impact — write/read throughput changes
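The four configuration axes map naturally onto a small type sketch (unions match the options listed above; the interface shape itself is illustrative, not necessarily the real schema):

```typescript
// Data plane configuration axes as string-literal unions.
// Values mirror the options table; the interface name is an assumption.
type Consistency = "linearizable" | "sequential" | "causal" | "eventual";
type Replication = "sync-all" | "sync-quorum" | "async-primary" | "async-mesh";
type Sharding = "none" | "hash" | "range" | "directory";
type Transaction = "2pc" | "saga" | "tcc" | "none";

interface DataPlaneConfig {
  consistency: Consistency;
  replication: Replication;
  sharding: Sharding;
  transaction: Transaction;
}

// Example: an eventually consistent, async-replicated, hash-sharded
// service coordinating writes via sagas.
const example: DataPlaneConfig = {
  consistency: "eventual",
  replication: "async-primary",
  sharding: "hash",
  transaction: "saga",
};
```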
The `oex` CLI provides full command-line access to OrchEngineX capabilities. It is distributed as the `@orchenginex/cli` npm package.
```bash
npm install -g @orchenginex/cli
```

```bash
# Browser-based OAuth (recommended)
oex login

# Token-based (CI/CD environments)
oex login --token <your-api-token>

# Verify authentication
oex whoami
```

```bash
# List all saved architectures
oex arch list

# Import architecture from JSON
oex arch import ./my-architecture.json --name "Production Topology"

# Inspect architecture with structural analysis
oex arch inspect <architecture-id>

# Delete an architecture
oex arch delete <architecture-id> --force
```

`oex arch inspect` output:
```text
Production Topology
─────────────────────────────────────
Nodes:                18
Edges:                24
Density:              15.7%
Fan-Out Index:        2.4
Critical Path Depth:  6
Spectral Radius:      3.1

Fragility Score:      42/100
Cascade Susceptibility: 38
Partition Sensitivity:  45
Retry Amp Risk:         31

SPOFs: db-primary, api-gateway
```
```bash
# Get current fragility score
oex fragility score <architecture-id>

# View fragility trend (historical snapshots)
oex fragility trend <architecture-id> --limit 20

# Export trend data
oex fragility trend <architecture-id> --output trend.csv --format csv
```

```bash
# Run failure simulation
oex simulate --failure-node api-gateway --latency 150 --packet-loss 5 --retries 3

# Run with a saved architecture
oex simulate --arch <architecture-id> --failure-node db-primary
```

```bash
# Run parameter sweep experiment
oex experiments run --arch <architecture-id>

# List experiment runs
oex experiments list

# Generate experiment report
oex experiments report <experiment-id>
```

```bash
# Generate architecture graph
oex visualize graph <architecture-id> --format svg --output arch.svg
oex visualize graph <architecture-id> --format png --output arch.png

# Generate fragility heatmap
oex visualize heatmap <architecture-id> --format svg

# Generate trend chart
oex visualize trend <architecture-id> --format svg
```

```bash
# Export architecture data
oex export architecture <id> --format json
oex export architecture <id> --format yaml

# Export simulation results
oex export simulation <id> --format csv

# Export fragility snapshots
oex export fragility <id> --format csv
```

```bash
# Initialize new project with starter template
oex init

# Validate architecture JSON offline
oex validate ./architecture.json

# Direct structural analysis
oex analyze structure ./architecture.json
```

| Endpoint | Limit |
|---|---|
| /simulate | 30 req/min |
| /visualize | 20 req/min |
| /data-api | 60 req/min |
| /experiments | 10 req/min |
- Project manifest: `.oex.yml`
- Credentials: `~/.oex/credentials.json` (0o600 permissions)
- API token env var: `OEX_API_TOKEN`
Architectures are defined as JSON files conforming to the `CustomArchitecture` interface; an annotated example appears at the end of this document.
See `examples/` for complete importable architectures.
OrchEngineX publishes failure case studies with reproducible simulations:
| Publication | Key Finding |
|---|---|
| The 4-Minute Queue Collapse | 10% traffic spike → total system collapse in 3:47 via queue depth amplification |
| Pool Saturation Cascade | 2% latency spike → 20-service cascading failure via connection pool exhaustion |
Each publication includes interactive simulations, architecture graphs, and exportable diagrams.
```text
orchenginex
│   CHANGELOG.md
│   CONTRIBUTING.md
│   LICENSE
│   README.md
│
├── cli
│   └── cli-reference.md
│
├── docs
│   ├── architecture-modeling.md
│   ├── data-plane.md
│   ├── fragility-scoring.md
│   ├── methodology.md
│   ├── simulation-engine.md
│   └── system-mechanics.md
│
├── examples
│   ├── event-driven-pipeline.json
│   ├── full-stack-hft.json
│   ├── microservices-basic.json
│   └── multi-region-ha.json
│
└── .github
    ├── pull_request_template.md
    └── ISSUE_TEMPLATE
        ├── bug_report.md
        └── feature_request.md
```
| Layer | Technology |
|---|---|
| Frontend | React 18, TypeScript, Vite |
| Styling | Tailwind CSS, shadcn/ui |
| Charts | Recharts |
| Animations | Framer Motion |
| Backend | Supabase (Edge Functions, PostgreSQL, Auth) |
| CLI | Node.js, Commander.js |
| Deployment | Vercel |
- Node.js 18+
- npm or bun
```bash
# Clone the repository
git clone https://github.com/orchenginex/orchenginex.git
cd orchenginex

# Install dependencies
npm install

# Start development server
npm run dev
```

See CONTRIBUTING.md for guidelines on:
- Reporting bugs and requesting features
- Submitting pull requests
- Code style and architecture conventions
- Adding new node types or simulation engines
MIT — see LICENSE for details.
OrchEngineX — Structural Fragility Modeling for Distributed Systems
www.orchenginex.com
```jsonc
{
  "name": "My Architecture",
  "nodes": [
    {
      "id": "gw-1",
      "type": "api-gateway",            // one of 22 node types
      "label": "API Gateway",
      "latencyBaseline": 5,             // ms
      "retryPolicy": 2,                 // max retries
      "failureProbability": 1,          // 0-100%
      "resourceCapacity": 95,           // 0-100%
      "replicaCount": 2,                // optional, default 1
      "dataPlaneConfig": {              // optional
        "consistency": "eventual",
        "replication": "async-primary",
        "sharding": "hash",
        "transaction": "saga"
      }
    }
  ],
  "edges": [
    {
      "id": "e-1",
      "source": "gw-1",
      "target": "svc-orders",
      "type": "sync",                   // sync | async | replication | mesh-hop
      "latencyDistribution": 10,        // avg ms
      "packetLossProbability": 0.5,     // 0-100%
      "retryPolicy": 2,                 // max retries
      "timeoutThreshold": 3000          // ms
    }
  ]
}
```