pghoshal/LeanNodes

LeanNodes

Request-safe scale-to-zero for Kubernetes and AI agent workloads.

LeanNodes lets platform teams hibernate idle application and agent workloads, then wake the full dependency graph on the first real HTTP request. The gateway holds the request, LeanNodes starts every required workload in DAG order, waits for Kubernetes readiness, and then releases traffic to the real service.

No SDK. No application code changes. No cron pre-warm guessing.

Browser / test client
        |
        v
Hold-capable gateway
Istio, Envoy Gateway, ingress-nginx, Traefik, HAProxy, APISIX, Kong, ...
        |
        | ext_authz / forward-auth / synchronous plugin callout
        v
LeanNodes orchestrator
        |
        | match host + path -> claim cold-start -> execute DAG
        v
Kubernetes workloads
Deployment, StatefulSet, Job prep steps, External probes
        |
        | ready endpoints observed
        v
Gateway resumes the original request -> user receives the backend response
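
The "match host + path" step in the diagram can be pictured as a longest-prefix lookup over flow bindings. The sketch below is an illustration only: the `Binding` type, its field names, and the longest-prefix rule are assumptions, not the orchestrator's actual matcher. A request that matches no binding would be treated as pass-through traffic.

```go
package main

import (
	"fmt"
	"strings"
)

// Binding is a hypothetical host/path binding; the field names are illustrative.
type Binding struct {
	FlowID     string
	Host       string
	PathPrefix string
}

// prefixMatches reports whether prefix covers path on a segment boundary,
// so "/api" matches "/api/orders" but not "/apiary".
func prefixMatches(prefix, path string) bool {
	if prefix == "" || prefix == "/" {
		return true
	}
	p := strings.TrimSuffix(prefix, "/")
	return path == p || strings.HasPrefix(path, p+"/")
}

// matchFlow returns the flow whose binding has the same host (case-insensitive)
// and the longest matching path prefix. ok is false when nothing matches.
func matchFlow(bindings []Binding, host, path string) (flowID string, ok bool) {
	bestLen := -1
	for _, b := range bindings {
		if !strings.EqualFold(b.Host, host) || !prefixMatches(b.PathPrefix, path) {
			continue
		}
		if len(b.PathPrefix) > bestLen {
			flowID, bestLen = b.FlowID, len(b.PathPrefix)
		}
	}
	return flowID, bestLen >= 0
}

func main() {
	bindings := []Binding{
		{FlowID: "shop", Host: "shop.example.com", PathPrefix: "/"},
		{FlowID: "shop-api", Host: "shop.example.com", PathPrefix: "/api"},
	}
	id, ok := matchFlow(bindings, "shop.example.com", "/api/orders")
	fmt.Println(id, ok) // prints: shop-api true
}
```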

Product Tour

The UI is designed as an operator cockpit: start with storage/gateway health, build and version flows visually, prove drained cold-starts, and govern access from the same surface.

  • Password login and token fallback.
  • Dashboard: savings, flow state, recent cold-starts.
  • Light/dark operator UI with slower theme transitions.
  • Flow inventory with search, filters, import/export, and bulk operations.
  • Flow detail: hold windows, bindings, live DAG.
  • Live cold-start stream while workloads recover.
  • Health, versions, per-flow ACLs, and audit trail.
  • Immutable versions with activate and diff actions.
  • Version diff: field changes plus side-by-side DAG comparison.
  • Visual editor for DAGs, bindings, gateway, timing.
  • Draft workflow: auto-save, commit notes, activation.
  • Workload discovery with readiness and HPA/KEDA conflict guidance.
  • Routing lookup: which flow owns this URL?
  • Settings status: orchestrator, cluster, Istio, storage, and gateway detection.
  • User directory with invitations and auth methods.
  • Built-in and custom roles for least-privilege ops.
  • Personal access tokens for CLI and automation.
  • OIDC setup with group-to-role mapping.
  • Slack routes, event filters, and Send Test.

Why LeanNodes Exists

Kubernetes platforms and AI agent systems often sit idle while still keeping full application stacks, model-serving helpers, queues, tools, and databases warm. Teams usually choose between two bad options:

  • Keep everything warm and pay for idle CPU, memory, and nodes.
  • Scale workloads down manually and accept broken first requests.

LeanNodes exists for the third path: scale to zero when idle, then make the first real request wait until the application is actually ready. A successful cold start means the user gets the real backend response: not "no healthy upstream", not an empty 503, and not a dashboard statistic claiming success while the browser failed.

LeanNodes is built for request-safe workload cost optimization: scale idle systems to zero, wake them only when traffic proves they are needed, and avoid broken first requests while the dependency graph comes online. For cold-start-driven agentic workflows, the same request-hold orchestration model can be used anywhere the operator intentionally accepts startup latency.

The Product Promise

LeanNodes treats "supported gateway" as a hard functional contract:

  1. The flow is drained before the test begins.
  2. A user hits the real URL through the real gateway.
  3. The gateway calls LeanNodes before proxying.
  4. LeanNodes warms every node in the flow DAG.
  5. The gateway resumes the same request only after the target service is ready.
  6. The client receives the actual backend response.

If the response is "no healthy upstream", an empty gateway 503, or the result of a backend race, that integration is not considered complete.

What Makes LeanNodes Different

  • Request-hold cold starts: LeanNodes works at the ingress/gateway layer so the first user request can be held while the application comes online.
  • Graph-aware wake-up: flows are DAGs, not independent service toggles. Databases, caches, workers, migrations, APIs, and external probes can be ordered explicitly.
  • Gateway-agnostic driver model: each gateway driver translates the same flow binding into that gateway's native artifact.
  • Storage abstraction: application logic talks to a Store interface. DynamoDB and Postgres can back the same product behavior.
  • Enterprise control plane: built-in authentication, users, groups, roles, per-flow ACLs, audit log, drafts, versions, test sessions, metrics, and notification routing.
  • Conformance bias: gateway work is proved with drained four-node flows and real request paths, not warm-only curls.
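
To make "graph-aware wake-up" concrete, the sketch below groups a flow's nodes into start-up waves using Kahn's algorithm: every node in a wave depends only on nodes in earlier waves, so a wave can be started together once the previous one is ready. The `levels` function and the dependency-map shape are assumptions for illustration, not LeanNodes' executor.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// levels groups nodes into start-up waves. deps maps each node to the nodes
// it depends on. It returns an error if the graph contains a cycle.
func levels(deps map[string][]string) ([][]string, error) {
	indeg := make(map[string]int)           // unmet dependency count per node
	dependents := make(map[string][]string) // reverse edges
	for n, ds := range deps {
		if _, ok := indeg[n]; !ok {
			indeg[n] = 0
		}
		for _, d := range ds {
			if _, ok := indeg[d]; !ok {
				indeg[d] = 0
			}
			indeg[n]++
			dependents[d] = append(dependents[d], n)
		}
	}

	var current []string
	for n, d := range indeg {
		if d == 0 {
			current = append(current, n)
		}
	}

	var out [][]string
	done := 0
	for len(current) > 0 {
		sort.Strings(current) // deterministic order within a wave
		out = append(out, current)
		done += len(current)
		var next []string
		for _, n := range current {
			for _, m := range dependents[n] {
				indeg[m]--
				if indeg[m] == 0 {
					next = append(next, m)
				}
			}
		}
		current = next
	}
	if done != len(indeg) {
		return nil, errors.New("flow graph contains a cycle")
	}
	return out, nil
}

func main() {
	waves, err := levels(map[string][]string{
		"db":      nil,
		"cache":   nil,
		"migrate": {"db"},
		"api":     {"migrate", "cache"},
	})
	fmt.Println(waves, err) // prints: [[cache db] [migrate] [api]] <nil>
}
```

The same pass doubles as a cycle check: if any node never reaches zero unmet dependencies, the graph cannot be ordered.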

Feature Overview

Core Platform Capabilities

LeanNodes is more than "scale down and scale up". It is a workflow control plane for request-gated workload recovery:

  • Self-healing cold-start recovery: failed or degraded flows can be retried from the next request, node health is tracked, retry backoff is recorded, and recovery events are surfaced through live status, metrics, audit, and notifications.
  • Failure-aware rollback: when a DAG partially warms and then fails, LeanNodes can scale successful levels back down by default or leave them warm for debugging, depending on the flow policy.
  • Version-managed flow changes: every committed flow version is immutable, readable, comparable, and activatable. Operators can roll forward, roll back, or branch from an older version without losing history.
  • Safe draft workflow: edits happen in mutable drafts with auto-save, editor locking, stale-base detection, discard, commit notes, and explicit activation. Frequent edits do not spam the immutable version history.
  • Pre-activation validation: cycles, missing nodes, overlapping host/path bindings, gateway capability limits, and invalid workload references are rejected before a flow can affect traffic.
  • Operational recovery controls: Warm, Drain, Pin, Disable, Enable, and test sessions let teams recover or validate an environment without changing application code.
  • Audit-ready governance: users, groups, roles, per-flow ACLs, personal access tokens, settings redaction, notification routes, and hash-chained audit events are part of the product surface.
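
The failure-aware rollback bullet above can be sketched as a small planning function. Everything here is hypothetical: the `RollbackPolicy` names, the `rollbackPlan` signature, and the choice to scale down in reverse start order are assumptions made for illustration; the source only states that successful levels are scaled back down by default or left warm for debugging.

```go
package main

import "fmt"

// RollbackPolicy is a hypothetical per-flow setting.
type RollbackPolicy int

const (
	ScaleDown RollbackPolicy = iota // assumed default: undo the partial warm-up
	LeaveWarm                       // keep warmed levels up for debugging
)

// rollbackPlan returns the nodes to scale back to zero after the level at
// index failedLevel fails. Only fully successful earlier levels are included,
// last-started first (the reverse order is an assumption of this sketch).
func rollbackPlan(levels [][]string, failedLevel int, policy RollbackPolicy) []string {
	if policy == LeaveWarm {
		return nil
	}
	if failedLevel > len(levels) {
		failedLevel = len(levels)
	}
	var plan []string
	for i := failedLevel - 1; i >= 0; i-- {
		for j := len(levels[i]) - 1; j >= 0; j-- {
			plan = append(plan, levels[i][j])
		}
	}
	return plan
}

func main() {
	levels := [][]string{{"cache", "db"}, {"migrate"}, {"api"}}
	// The "api" level (index 2) failed to become ready.
	fmt.Println(rollbackPlan(levels, 2, ScaleDown)) // prints: [migrate db cache]
	fmt.Println(rollbackPlan(levels, 2, LeaveWarm)) // prints: []
}
```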

Request-Safe Cold Start

  • Host/path flow bindings.
  • Sliding idle window with automatic drain.
  • Manual Warm, Drain, Pin, Disable, and Enable actions.
  • fastHoldSec per flow: the synchronous gateway wait budget before LeanNodes returns a controlled warm-up response while the DAG continues.
  • Readiness waits for Kubernetes workload state, EndpointSlices, and optional HTTP/TCP probes.
  • Shared workload reference counting so one flow does not drain a service still needed by another flow.
  • Failure rollback policy with the option to leave partially warmed stacks up for debugging.
  • Re-trigger-on-demand behavior: a failed or drained flow can recover on the next real request through the gateway.

Visual Flow Management

  • Browser-based React flow editor.
  • Draft model with locking, auto-save, discard, stale-base detection, commit, activate, and branch flows.
  • Read-only version pages and diff/compare flows.
  • DAG validation: cycles, overlapping bindings, invalid node references, and gateway capability limits are rejected before activation.
  • Test sessions for validating a version before it becomes the active flow.
  • Commit notes, activation history, rollback/branch flows, and immutable snapshots make flow changes reviewable instead of ad hoc UI mutations.

Workload Model

LeanNodes supports these flow node kinds:

Kind | What LeanNodes Does
--- | ---
Deployment | Scales 0 <-> target replicas and waits for readiness.
StatefulSet | Scales 0 <-> target replicas and waits for readiness.
Job | Clones/runs a prep job and waits for completion. Useful for migrations, cache warmers, seeders.
External | Runs HTTP/TCP probes for dependencies LeanNodes does not scale, such as managed Redis, RDS, or third-party APIs.
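
Because the same Deployment or StatefulSet can appear in more than one flow, draining has to respect the shared workload reference counting mentioned under Request-Safe Cold Start. A minimal in-memory sketch of that idea follows. The `RefCounter` type and its methods are hypothetical, and a multi-replica install would presumably keep this state in the Store rather than in process memory.

```go
package main

import (
	"fmt"
	"sync"
)

// RefCounter tracks which flows currently need each shared workload.
type RefCounter struct {
	mu   sync.Mutex
	refs map[string]map[string]struct{} // workload -> set of flow IDs
}

func NewRefCounter() *RefCounter {
	return &RefCounter{refs: make(map[string]map[string]struct{})}
}

// Acquire records that flow needs workload. Repeated calls are idempotent.
func (r *RefCounter) Acquire(workload, flow string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs[workload] == nil {
		r.refs[workload] = make(map[string]struct{})
	}
	r.refs[workload][flow] = struct{}{}
}

// Release drops flow's claim and reports whether the workload is now unused,
// meaning it is safe to scale it to zero.
func (r *RefCounter) Release(workload, flow string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.refs[workload], flow) // no-op if the workload is unknown
	if len(r.refs[workload]) == 0 {
		delete(r.refs, workload)
		return true
	}
	return false
}

func main() {
	rc := NewRefCounter()
	rc.Acquire("postgres", "checkout")
	rc.Acquire("postgres", "reports")
	fmt.Println(rc.Release("postgres", "checkout")) // prints: false (reports still needs it)
	fmt.Println(rc.Release("postgres", "reports"))  // prints: true (safe to drain)
}
```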

Gateway Coverage

Directly supported drivers currently present in the codebase:

Gateway | Hold Mechanism
--- | ---
Istio | Envoy ext_authz gRPC via AuthorizationPolicy CUSTOM.
ingress-nginx | auth_request / forward-auth with LeanNodes Lua pre-auth bridge and service-upstream.
Traefik | ForwardAuth Middleware.
Envoy Gateway | SecurityPolicy extAuth gRPC plus route/backend policies.
Contour | ExtensionService authorization.
Emissary / Ambassador | AuthService ext_authz.
kgateway | GatewayExtension + TrafficPolicy.
HAProxy Ingress | LeanNodes frontend Lua bridge before backend selection.
APISIX | ApisixGlobalRule forward-auth plugin.
Kong | LeanNodes custom access-phase plugin and KongPlugin resources.
Caddy Gateway | Managed Caddyfile with forward_auth before reverse_proxy.
Skipper | webhook(...) filter through zalando.org/skipper-filter.

Documented/plugin or customer-managed integrations include Tyk, WSO2 Choreo Connect, Gravitee, KrakenD, Apigee, F5 BIG-IP CIS, Citrix ADC / NetScaler, NSX Advanced Load Balancer / Avi, Wallarm, Cilium Gateway custom paths, and cloud/CDN edge topologies.

Cloud LBs and CDNs such as AWS ALB, GCE/GKE Ingress, Azure Application Gateway, OCI native ingress, Cloudflare, Fastly, and Akamai are treated as edge routers, not LeanNodes hold points. The supported topology is:

Cloud LB / CDN -> hold-capable in-cluster gateway -> LeanNodes -> service

The Settings page detects installed ingress/gateway technologies and explains whether LeanNodes has a configured driver, what configuration is missing, and which topology is required.

Storage and Multi-Replica Operation

LeanNodes state lives behind the Store interface.

Backend | Use Case
--- | ---
DynamoDB | AWS default, single-table layout, PAY_PER_REQUEST, TTL, PITR support, zero-egress VPC endpoint path.
Postgres | Cloud-neutral and customer-managed multi-replica backend. Uses transactions, row locks, JSONB payloads, and expiry timestamps while preserving the Store contract.

Stored families include flows, versions, drafts, locks, cold-start claims, leader leases, executions, audit chain, users, invitations, groups, roles, per-flow ACLs, settings, and personal access tokens.

Authentication, Authorization, and Governance

  • Password login with invitation and reset flows.
  • OIDC browser login and group-to-role mapping.
  • SAML browser flow support when configured.
  • Session-cookie auth for the UI.
  • Personal access tokens for CLI/API automation.
  • Built-in roles plus custom roles.
  • Groups and per-flow ACLs.
  • Zero-access default for newly provisioned SSO users.
  • Audit log with hash chaining.
  • Secret redaction for settings and notification destinations.
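
Hash chaining means each audit event commits to the hash of the one before it, so editing or removing an earlier event breaks every later link. The sketch below shows the general technique with SHA-256. The `Event` fields and the exact bytes that get hashed are assumptions, not LeanNodes' audit schema.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Event is a hypothetical audit record; real records would carry more fields.
type Event struct {
	Actor    string
	Action   string
	PrevHash string // hash of the previous event, "" for the first one
	Hash     string // hash over PrevHash and this event's content
}

// hashEvent derives an event hash from the previous hash and the content.
// The NUL separators keep field boundaries unambiguous.
func hashEvent(prev, actor, action string) string {
	sum := sha256.Sum256([]byte(prev + "\x00" + actor + "\x00" + action))
	return hex.EncodeToString(sum[:])
}

// appendEvent links a new event to the tail of the chain.
func appendEvent(chain []Event, actor, action string) []Event {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, Event{
		Actor:    actor,
		Action:   action,
		PrevHash: prev,
		Hash:     hashEvent(prev, actor, action),
	})
}

// verify returns the index of the first event whose link or hash does not
// check out, or -1 if the whole chain is intact.
func verify(chain []Event) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEvent(prev, e.Actor, e.Action) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []Event
	chain = appendEvent(chain, "alice", "flow.activate")
	chain = appendEvent(chain, "bob", "flow.drain")
	fmt.Println(verify(chain)) // prints: -1

	chain[0].Action = "flow.delete" // tamper with history
	fmt.Println(verify(chain))      // prints: 0
}
```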

Notifications and Enterprise Integrations

LeanNodes notifications are short operational signals, not chat spam.

  • Slack routes with Block Kit-style content.
  • Microsoft Teams routes with compact cards.
  • Custom webhook routes with Send Test.
  • Multiple destinations per channel type.
  • Event filters per route.
  • Secret/header redaction and retained write-only credentials.
  • Platform-specific custom webhook formats for Google Chat, Webex, Mattermost, Rocket.Chat, Zulip, PagerDuty, Opsgenie, Grafana OnCall, ServiceNow, Jira Service Management, Datadog, Splunk HEC, New Relic, Elastic, Sumo Logic, n8n, Zapier, Workato, Mulesoft, and customer routers.

See docs/notifications.md.

Observability and FinOps

  • Prometheus metrics and alert rules.
  • Flow execution history.
  • Live status and node-level health recovery.
  • Audit log for lifecycle and access changes.
  • Savings attribution model: idleHours * (flow CPU requests / node CPU capacity) * average node hourly cost.
  • Dashboard rollups for flow state, cold-start outcomes, and storage health.
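
The savings attribution formula above is a straight product, shown here as a small function. The function name, parameter units (CPU in cores, cost per node-hour), and the sample numbers are illustrative assumptions, not measured data.

```go
package main

import "fmt"

// savings applies the attribution model from the list above:
// idleHours * (flow CPU requests / node CPU capacity) * average node hourly cost.
// CPU values are in cores. A non-positive capacity yields 0 instead of dividing by zero.
func savings(idleHours, flowCPURequests, nodeCPUCapacity, nodeHourlyCost float64) float64 {
	if nodeCPUCapacity <= 0 {
		return 0
	}
	return idleHours * (flowCPURequests / nodeCPUCapacity) * nodeHourlyCost
}

func main() {
	// Made-up example: a flow requesting 2 cores on 8-core nodes that cost
	// 0.50 per hour, idle for 100 hours: 100 * (2/8) * 0.50 = 12.5.
	fmt.Println(savings(100, 2, 8, 0.50)) // prints: 12.5
}
```

The model attributes a share of node cost proportional to CPU requests only, so memory-bound or GPU-bound flows would need a different weighting.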

Measured Capacity

The current measured benchmark is documented in docs/slo.md. Headline results from the recorded E14 run:

Scenario | Measured Result
--- | ---
Unmatched/pass-through traffic | 115k+ RPS at c=100 with p99 around 2.3 ms.
Warm-path managed Check | About 7.5k RPS per orchestrator replica with low-millisecond p99 up to c=100.
Management list with 50 flows and ACL filtering | About 1.7k RPS with p99 under 200 ms at c=100.
Errors in benchmark run | Zero errors across 3.6M+ requests.

These numbers were produced in a local kind + dynamodb-local environment and should be treated as a defensible baseline, not a universal production SLO.

Architecture

                 Kubernetes cluster

        +------------------------------+
        | Hold-capable gateway         |
        | ext_authz / forward-auth     |
        +---------------+--------------+
                        |
                        v
        +------------------------------+
        | LeanNodes orchestrator       |
        | - flow matcher               |
        | - DAG executor               |
        | - readiness watcher          |
        | - gateway driver registry    |
        | - auth/RBAC/audit/settings   |
        +----------+-----------+-------+
                   |           |
                   |           v
                   |     Store backend
                   |     DynamoDB or Postgres
                   |
                   v
          Kubernetes API
          scale subresource, Jobs,
          EndpointSlices, gateway CRDs

The orchestrator is a single Go binary. The web UI is a static React SPA served by the same product surface. Helm deploys the control plane, services, RBAC, network policy, metrics resources, and storage configuration.

Quick Start: Local Kind

Use this path to try LeanNodes without a cloud account.

make kind-up
make istio-up
make ddb-local-up
make run

Then open the UI at http://localhost:8080. The local admin seed is printed by the orchestrator at startup.

Detailed guide: docs/installation/local-dev.md

Helm Installation

Prerequisites

  • Kubernetes cluster.
  • A hold-capable ingress/gateway from the supported list.
  • kubectl and Helm.
  • Store backend:
    • DynamoDB table access for AWS installs, or
    • Postgres DSN stored in a Kubernetes Secret.
  • Permissions for LeanNodes to read flow-related Kubernetes resources, scale managed workloads, create prep Jobs, and write the selected gateway artifacts.

DynamoDB Backend

DynamoDB is the default backend.

helm upgrade --install leannodes deploy/helm/leannodes \
  --namespace leannodes --create-namespace \
  --set image.repository=<registry>/leannodes \
  --set image.tag=<tag> \
  --set storage.backend=dynamodb \
  --set storage.dynamodb.region=us-east-1 \
  --set storage.dynamodb.tableName=leannodes-state \
  --set serviceAccount.roleArn=arn:aws:iam::<account>:role/leannodes-irsa \
  --set ingress.host=leannodes.example.com

AWS install guide: docs/installation/eks.md

Postgres Backend

Create a Secret containing the DSN:

kubectl -n leannodes create secret generic leannodes-postgres \
  --from-literal=dsn='postgres://user:password@postgres.example.com:5432/leannodes?sslmode=require'

Install with Postgres enabled:

helm upgrade --install leannodes deploy/helm/leannodes \
  --namespace leannodes --create-namespace \
  --set image.repository=<registry>/leannodes \
  --set image.tag=<tag> \
  --set storage.backend=postgres \
  --set storage.postgres.existingSecretName=leannodes-postgres \
  --set storage.postgres.existingSecretKey=dsn \
  --set ingress.host=leannodes.example.com

The chart validates that either DynamoDB or Postgres has the required settings. For externally managed Postgres Secrets, set storage.postgres.existingSecretChecksum when you want Secret rotation to roll the orchestrator pods.

Configure a Gateway

Start with docs/install.md, then follow the guide that matches your gateway.

Each gateway proof should be run from a drained flow. Warm-only tests are not valid cold-start proof.

Using LeanNodes

  1. Log in with password credentials or SSO.
  2. Open Settings and verify storage, gateway reachability, and detected ingress/gateway technologies.
  3. Create a flow with one or more host/path bindings.
  4. Add workload nodes and edges in the flow editor.
  5. Commit the draft.
  6. Activate the version.
  7. Drain the flow so all managed services scale to zero.
  8. Hit the real user URL through the selected gateway.
  9. Confirm the first request waits and returns the actual backend response.
  10. Use execution history, live status, metrics, audit, and notifications for operational review.

Hands-on tutorial: docs/tutorials/first-flow.md

Configuration Highlights

Common environment variables:

Variable | Purpose
--- | ---
LEANNODES_STORAGE_BACKEND | dynamodb or postgres.
LEANNODES_DDB_TABLE | DynamoDB table name.
LEANNODES_POSTGRES_DSN / DATABASE_URL | Postgres DSN.
LEANNODES_FORWARDAUTH_INTERNAL_URL | In-cluster /check URL for forward-auth gateways.
LEANNODES_GRPC_PORT | ext_authz gRPC port.
LEANNODES_AUTH_MODE | Authentication mode.
LEANNODES_AUTH_COOKIE_SECRET | Session signing secret.
LEANNODES_SUPER_ADMIN_EMAIL | Initial admin bootstrap email.
LEANNODES_SUPER_ADMIN_PASSWORD | Initial admin bootstrap password.
LEANNODES_PUBLIC_BASE_URL | Used for notification links and UI-generated URLs.
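
As a rough picture of how the storage variables fit together, the sketch below resolves them into one config value. Only the variable names come from the table above. The `StorageConfig` type, the `loadStorageConfig` function, the fallback to dynamodb when the backend is unset, and the precedence of LEANNODES_POSTGRES_DSN over DATABASE_URL are all assumptions; check docs/install.md for the real behavior.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// StorageConfig is a hypothetical resolved view of the storage variables.
type StorageConfig struct {
	Backend     string
	DDBTable    string
	PostgresDSN string
}

// loadStorageConfig reads the storage variables through getenv, which makes
// the function easy to exercise without touching the real environment.
func loadStorageConfig(getenv func(string) string) (StorageConfig, error) {
	cfg := StorageConfig{Backend: getenv("LEANNODES_STORAGE_BACKEND")}
	if cfg.Backend == "" {
		cfg.Backend = "dynamodb" // assumption: mirrors the documented default backend
	}
	switch cfg.Backend {
	case "dynamodb":
		cfg.DDBTable = getenv("LEANNODES_DDB_TABLE")
		if cfg.DDBTable == "" {
			return cfg, errors.New("LEANNODES_DDB_TABLE is required for the dynamodb backend")
		}
	case "postgres":
		cfg.PostgresDSN = getenv("LEANNODES_POSTGRES_DSN")
		if cfg.PostgresDSN == "" {
			cfg.PostgresDSN = getenv("DATABASE_URL") // assumed fallback order
		}
		if cfg.PostgresDSN == "" {
			return cfg, errors.New("a Postgres DSN is required for the postgres backend")
		}
	default:
		return cfg, fmt.Errorf("unsupported LEANNODES_STORAGE_BACKEND %q", cfg.Backend)
	}
	return cfg, nil
}

func main() {
	cfg, err := loadStorageConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("storage backend:", cfg.Backend)
}
```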

Full deployment reference: docs/install.md

Testing and Conformance

Core checks:

go test ./...
npm --prefix web run typecheck
npm --prefix web run build

Storage checks:

go test -tags=integration -count=1 ./internal/store/ddb
LEANNODES_POSTGRES_TEST_DSN='postgres://...' go test ./internal/store/postgres -count=1 -v
node scripts/functional-test-postgres-multi-replica.mjs

Gateway conformance examples:

API=http://localhost:18080 scripts/gateway-conformance-ingressnginx-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-haproxy-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-traefik-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-apisix-kind.sh
API=http://localhost:18080 scripts/gateway-conformance-kong-kind.sh

See docs/testing.md for the complete test map.

Documentation Map

Roadmap

Near-term product directions:

  • More native gateway validation where stable external-auth APIs exist.
  • GitOps-friendly Flow CRD and controller.
  • Deeper backup/restore automation.
  • Accessibility and chaos-testing hardening.
  • LLM-assisted operator workflows: flow diagnosis, gateway setup review, savings recommendations, failed warm-up summarization, and runbook-guided remediation suggestions. These should stay advisory and auditable; LeanNodes should not let an LLM mutate cluster state without explicit operator approval.

What LeanNodes Does Not Do

  • It does not replace HPA, KEDA, or cluster autoscalers.
  • It does not perform canary routing, traffic shaping, or service mesh policy management beyond the hold artifact it owns.
  • It does not intercept request bodies for normal cold-start decisions.
  • It does not make cloud LBs or CDNs into hold-capable gateways; put a proven in-cluster gateway behind them.
  • It is not intended for latency-critical traffic paths where startup delay is unacceptable.

Contributing

Contributions should include functional proof, not just unit coverage. Gateway work should include a drained-flow conformance path that demonstrates the first request waits and returns the real backend body.

Useful entry points:

  • internal/gateway/: gateway driver implementations.
  • internal/store/: Store interface and backend implementations.
  • internal/api/: HTTP API surface.
  • internal/notify/: notification routing and platform adapters.
  • web/: React UI.
  • scripts/: functional and conformance tests.
  • deploy/helm/leannodes/: Helm chart.

Before opening a PR, run the relevant backend, UI, and functional checks and include the exact commands in the PR description.

License

LeanNodes is fair-code software under the Sustainable Use License 1.0.

Enterprises may use and modify LeanNodes for their own internal business operations. You may also use it for non-commercial or personal purposes.

You may not sell, resell, lease, rent, offer, host, provide as a managed service, provide as SaaS, include in a paid product, or commercially exploit LeanNodes or a modified version of LeanNodes without a separate written commercial license from the maintainers.

This is intentionally not an OSI open-source license; it is a fair-code source-available license.

About

Solves the cold-start problem and saves up to 90% of cost on EKS. On-demand dynamic service provisioning for business and enterprise: CPU, GPU, and AI workloads.
