
Feature: health:serve — HTTP health check endpoint for ALB/ASG intelligent scaling #22

@jonathonbyrdziak

Overview

Add a protocol health:serve command that opens an HTTP listener on a configurable port (e.g., 8080) so that AWS ALB target group health checks can ping Protocol directly. Protocol already has deep insight into container health, CPU, memory, disk, and incident detection — exposing this via HTTP enables intelligent auto-scaling decisions that are far better than CPU-based scaling alone.

Background

In our SOC 2 CloudFormation infrastructure, the ASG currently scales on CPU utilization only. This is a poor signal — a server can be struggling (slow responses, memory pressure, disk full) while CPU looks fine. Protocol already runs on every instance and has access to the system state that matters.

What Protocol should do

HTTP Health Endpoint

Listen on a port and respond to GET /health with a JSON payload:

{
  "status": "healthy",
  "uptime": 3600,
  "checks": {
    "webserver": "up",
    "database": "connected",
    "disk": "ok",
    "containers": "2/2 running",
    "cpu": "32%",
    "memory": "58%"
  },
  "recommendation": "stable"
}

Return 200 for healthy, 503 for unhealthy.
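As a rough illustration of the endpoint described above, here is a minimal sketch in Python (used only for illustration; the implementation notes suggest PHP for the real thing). The names `build_health_response`, `HealthHandler`, and `serve`, and the hard-coded check values, are assumptions made for this example and are not part of Protocol.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def build_health_response(checks, healthy, recommendation="stable"):
    """Return (http_status, json_body) for a GET /health request."""
    payload = {
        "status": "healthy" if healthy else "unhealthy",
        "uptime": int(time.time() - START_TIME),
        "checks": checks,
        "recommendation": recommendation,
    }
    return (200 if healthy else 503), json.dumps(payload)


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical stand-ins for Protocol's real check results.
    checks = {"webserver": "up", "disk": "ok"}
    healthy = True

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = build_health_response(self.checks, self.healthy)
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the listener; blocks until the process is interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve(8080)` would start the listener; the status code is the only part an ALB health check acts on, while the JSON body is for humans and other tooling.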

Critical design: What makes it return 503?

Only return unhealthy for conditions a new instance would fix:

| Check | Unhealthy (503)? | Rationale |
| --- | --- | --- |
| Web process/container down | Yes | New instance fixes this |
| Disk > 90% | Yes | Instance is degraded |
| Server unresponsive / Protocol can't function | Yes | Server is overwhelmed |
| Database unreachable | No (200 + warning) | New instance won't fix this; alert humans instead |
| High CPU/memory | No (200 + warning) | Scaling policy handles this separately |
| Instance still booting | No (200 + booting) | Grace period, don't kill it |

If Protocol itself can't respond, the health check times out — which the ALB treats as unhealthy. This is the right behavior: if the server is so overwhelmed that Protocol can't function, we need more capacity.
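The decision rules above could be sketched roughly as follows. This is an illustrative Python sketch, not Protocol code: the condition names (`webserver_down`, `disk_critical`, and so on) and the `evaluate` function are hypothetical labels chosen for this example.

```python
# Conditions a replacement instance would fix: these produce a 503.
REPLACEABLE = {"webserver_down", "container_down", "disk_critical"}


def evaluate(conditions, booting=False):
    """Map detected conditions to an HTTP status plus a response summary.

    `conditions` is a set of hypothetical condition names. Anything not in
    REPLACEABLE (e.g. "database_unreachable", "high_cpu") is reported as a
    warning only, so the instance keeps passing the health check.
    """
    if booting:
        # Grace period: report 200 so the instance is not killed mid-boot.
        return 200, {"status": "booting", "failures": [], "warnings": []}
    failures = sorted(c for c in conditions if c in REPLACEABLE)
    warnings = sorted(c for c in conditions if c not in REPLACEABLE)
    if failures:
        return 503, {"status": "unhealthy", "failures": failures, "warnings": warnings}
    return 200, {"status": "healthy", "failures": [], "warnings": warnings}
```

For example, `evaluate({"database_unreachable"})` yields a 200 with a warning, while `evaluate({"disk_critical", "high_cpu"})` yields a 503. The timeout case needs no code: if the process cannot answer at all, the ALB counts the missed response as a failure.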

Self-healing before reporting unhealthy

Before returning 503, Protocol should attempt to heal:

  1. Detect the issue (container crashed, process stuck, etc.)
  2. Attempt restart/recovery
  3. If recovery succeeds → return 200
  4. If recovery fails → return 503 with details, triggering ASG replacement
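The four steps above might look something like this in outline. It is a Python sketch for illustration only; `is_healthy` and `attempt_recovery` are hypothetical callables standing in for Protocol's detection and restart logic, and the single retry is an assumed default.

```python
def check_with_recovery(is_healthy, attempt_recovery, max_attempts=1):
    """Try to heal before reporting unhealthy.

    is_healthy:        callable returning True when the component is fine.
    attempt_recovery:  callable that tries a restart or other repair.
    Returns (http_status, detail).
    """
    if is_healthy():
        return 200, "healthy"
    for _ in range(max_attempts):
        attempt_recovery()
        if is_healthy():
            # Recovery worked, so the instance does not need replacing.
            return 200, "recovered"
    # Recovery failed: report 503 so the ASG can replace the instance.
    return 503, "recovery failed"
```

Since a restart can take longer than a health check timeout, it may be safer to run this in the background check loop rather than inside the request handler.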

Scaling recommendations

The response could include a recommendation field:

  • "stable" — everything is fine, hold current capacity
  • "scale_up" — server is under pressure, could use help
  • "scale_down" — server is underutilized, safe to remove from group
  • "infrastructure_issue" — problem is not capacity-related (e.g., DB down), don't scale

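This gives the scaling logic a better signal than CPU alone. Note that ALB health checks evaluate only the HTTP status code and do not read the response body, so the recommendation field would have to be consumed by something else (for example, published as a custom CloudWatch metric that a scaling policy acts on).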
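One way the recommendation values listed above could be derived is sketched below in Python, for illustration only. The `recommend` function and the 80%/20% thresholds are assumptions made for this example; the issue does not specify any thresholds.

```python
def recommend(cpu_percent, memory_percent, infrastructure_ok=True,
              high=80.0, low=20.0):
    """Pick one of the four recommendation values.

    The `high` and `low` thresholds are hypothetical defaults.
    """
    if not infrastructure_ok:
        # e.g. database down: adding or removing instances won't help.
        return "infrastructure_issue"
    if cpu_percent >= high or memory_percent >= high:
        return "scale_up"
    if cpu_percent <= low and memory_percent <= low:
        return "scale_down"
    return "stable"
```

With the sample payload's values (CPU 32%, memory 58%), this sketch returns `"stable"`.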

Alerting integration

When Protocol detects an infrastructure issue (DB down, etc.) that scaling won't fix:

  • Send webhook notification to Slack/PagerDuty (already supported)
  • Create a GitHub issue via incident:report with full system state (already supported)
  • Return 200 so the ASG doesn't keep terminating and replacing instances when replacement won't fix the problem
  • Include diagnostic details in the response so engineers know what's happening

How it integrates with CloudFormation

  1. Current state: ALB health check hits nginx on port 80 at /health (placeholder)
  2. After this feature: ALB health check hits Protocol on port 8080 at /health
  3. Future: Add CloudWatch alarms on TargetResponseTime and UnHealthyHostCount for latency-based scaling

The target group health check config changes from:

HealthCheckPort: 80
HealthCheckPath: /health

to:

HealthCheckPort: 8080
HealthCheckPath: /health

Existing Protocol capabilities to leverage

  • IncidentDetector — already detects down containers, suspicious processes (P1-P4)
  • DiskCheck — already monitors disk with 80%/90% thresholds
  • docker:status — already reads container CPU/memory
  • Webhook helper — already sends alerts to Slack/PagerDuty
  • incident:report — already creates GitHub issues with full diagnostics

Implementation notes

  • Use a lightweight HTTP server (PHP built-in server or ReactPHP)
  • Should start automatically via protocol start or a dedicated protocol health:serve command
  • Configurable port in protocol.json
  • Health check logic should be fast (< 100ms response) — cache system checks, refresh on interval
  • Consider running checks on a timer (every 10-30s) and serving cached results to avoid load from frequent ALB pings (every 5-30s)
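The cached-results idea in the last two notes could be sketched as follows. This is an illustrative Python sketch, not Protocol code: `CachedChecks` and `run_checks` are hypothetical names, and the 15-second interval is just one value from the 10-30s range mentioned above.

```python
import threading


class CachedChecks:
    """Run expensive checks on a timer and serve the last snapshot."""

    def __init__(self, run_checks, interval_seconds=15.0):
        self._run_checks = run_checks      # hypothetical callable returning a dict
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._snapshot = run_checks()      # prime the cache once at startup

    def refresh(self):
        result = self._run_checks()        # run outside the lock; may be slow
        with self._lock:
            self._snapshot = result

    def get(self):
        """Return the cached result; never runs the checks itself."""
        with self._lock:
            return self._snapshot

    def start(self):
        def loop():
            # Event.wait returns False on timeout and True once stop() is called.
            while not self._stop.wait(self._interval):
                self.refresh()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

The request handler would only call `get()`, so response time stays independent of how long the checks take and of how often the ALB polls.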
