Overview
Add a `protocol health:serve` command that opens an HTTP listener on a configurable port (e.g., 8080) so that AWS ALB target group health checks can ping Protocol directly. Protocol already has deep insight into container health, CPU, memory, disk, and incident detection — exposing this via HTTP enables intelligent auto-scaling decisions that are far better than CPU-based scaling alone.
Background
In our SOC 2 CloudFormation infrastructure, the ASG currently scales on CPU utilization only. This is a poor signal — a server can be struggling (slow responses, memory pressure, disk full) while CPU looks fine. Protocol already runs on every instance and has access to the system state that matters.
What Protocol should do
HTTP Health Endpoint
Listen on a port and respond to `GET /health` with a JSON payload:

```json
{
  "status": "healthy",
  "uptime": 3600,
  "checks": {
    "webserver": "up",
    "database": "connected",
    "disk": "ok",
    "containers": "2/2 running",
    "cpu": "32%",
    "memory": "58%"
  },
  "recommendation": "stable"
}
```

Return 200 for healthy, 503 for unhealthy.
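For contrast, a hypothetical unhealthy (503) response might look like the following — the `detail` field and exact values are illustrative, not a final schema:

```json
{
  "status": "unhealthy",
  "uptime": 86400,
  "checks": {
    "webserver": "down",
    "database": "connected",
    "disk": "ok",
    "containers": "1/2 running",
    "cpu": "12%",
    "memory": "41%"
  },
  "detail": "web container exited; restart attempt failed"
}
```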
Critical design: What makes it return 503?
Only return unhealthy for conditions a new instance would fix:
| Check | Unhealthy (503)? | Rationale |
|---|---|---|
| Web process/container down | Yes | New instance fixes this |
| Disk > 90% | Yes | Instance is degraded |
| Server unresponsive / Protocol can't function | Yes | Server is overwhelmed |
| Database unreachable | No (200 + warning) | New instance won't fix this — alert humans instead |
| High CPU/memory | No (200 + warning) | Scaling policy handles this separately |
| Instance still booting | No (200 + booting) | Grace period, don't kill it |
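The table above could be sketched as a small decision function. This is an illustrative Python sketch — the check names, input shape, and thresholds are assumptions, and the real implementation would live in Protocol's PHP codebase:

```python
def evaluate_health(checks: dict) -> tuple[int, list[str]]:
    """Map system checks to an HTTP status per the table above.

    `checks` is an assumed shape: {"webserver_up": bool, "disk_pct": int,
    "db_reachable": bool, "booting": bool}. Returns (status_code, notes).
    """
    notes = []

    # Grace period: a booting instance reports healthy so the ASG
    # doesn't kill it before it finishes initializing.
    if checks.get("booting"):
        return 200, ["booting"]

    # Conditions a replacement instance would fix -> 503.
    if not checks.get("webserver_up", True):
        return 503, ["web process down"]
    if checks.get("disk_pct", 0) > 90:
        return 503, ["disk > 90%"]

    # Conditions a new instance would NOT fix -> 200 plus a warning,
    # so humans (or the separate scaling policy) handle them instead.
    if not checks.get("db_reachable", True):
        notes.append("database unreachable")

    return 200, notes
```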
If Protocol itself can't respond, the health check times out — which the ALB treats as unhealthy. This is the right behavior: if the server is so overwhelmed that Protocol can't function, we need more capacity.
Self-healing before reporting unhealthy
Before returning 503, Protocol should attempt to heal:
- Detect the issue (container crashed, process stuck, etc.)
- Attempt restart/recovery
- If recovery succeeds → return 200
- If recovery fails → return 503 with details, triggering ASG replacement
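The heal-then-report flow above might look like this sketch — `detect_issue` and `attempt_recovery` are hypothetical injected callables, not existing Protocol APIs:

```python
def health_status(detect_issue, attempt_recovery, max_attempts: int = 1):
    """Try to self-heal before reporting unhealthy.

    `detect_issue` returns a description of the problem or None;
    `attempt_recovery` takes that description and returns True on success.
    Both are hypothetical stand-ins for Protocol's real incident-detection
    and restart logic.
    """
    issue = detect_issue()
    if issue is None:
        return 200, "healthy"

    for _ in range(max_attempts):
        if attempt_recovery(issue):
            # Recovery worked: report healthy so the ASG keeps the instance.
            return 200, f"recovered from: {issue}"

    # Recovery failed: 503 triggers ASG replacement.
    return 503, f"unrecovered: {issue}"
```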
Scaling recommendations
The response could include a recommendation field:
- `"stable"` — everything is fine, hold current capacity
- `"scale_up"` — server is under pressure, could use help
- `"scale_down"` — server is underutilized, safe to remove from group
- `"infrastructure_issue"` — problem is not capacity-related (e.g., DB down), don't scale
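One plausible way to derive the field, sketched in Python — the 70%/30% thresholds are invented for illustration, not part of the proposal:

```python
def recommend(cpu_pct: float, mem_pct: float, infra_issue: bool) -> str:
    """Pick a scaling recommendation from current metrics (assumed thresholds)."""
    if infra_issue:
        # Capacity won't help (e.g., DB down); don't scale, alert humans.
        return "infrastructure_issue"
    if cpu_pct > 70 or mem_pct > 70:
        return "scale_up"
    if cpu_pct < 30 and mem_pct < 30:
        return "scale_down"
    return "stable"
```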
This gives the ALB/ASG the intelligence to make the right call rather than blindly scaling on CPU.
Alerting integration
When Protocol detects an infrastructure issue (DB down, etc.) that scaling won't fix:
- Send webhook notification to Slack/PagerDuty (already supported)
- Create a GitHub issue via `incident:report` with full system state (already supported)
- Return 200 so ASG doesn't scale out of control
- Include diagnostic details in the response so engineers know what's happening
How it integrates with CloudFormation
- Current state: ALB health check hits nginx on port 80 at `/health` (placeholder)
- After this feature: ALB health check hits Protocol on port 8080 at `/health`
- Future: Add CloudWatch alarms on `TargetResponseTime` and `UnHealthyHostCount` for latency-based scaling
The target group health check config changes from:

```yaml
HealthCheckPort: 80
HealthCheckPath: /health
```

to:

```yaml
HealthCheckPort: 8080
HealthCheckPath: /health
```

Existing Protocol capabilities to leverage
- `IncidentDetector` — already detects down containers, suspicious processes (P1-P4)
- `DiskCheck` — already monitors disk with 80%/90% thresholds
- `docker:status` — already reads container CPU/memory
- `Webhook` helper — already sends alerts to Slack/PagerDuty
- `incident:report` — already creates GitHub issues with full diagnostics
Implementation notes
- Use a lightweight HTTP server (PHP built-in server or ReactPHP)
- Should start automatically via `protocol start` or a dedicated `protocol health:serve` command
- Configurable port in `protocol.json`
protocol.json - Health check logic should be fast (< 100ms response) — cache system checks, refresh on interval
- Consider running checks on a timer (every 10-30s) and serving cached results to avoid load from frequent ALB pings (every 5-30s)
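The caching note above could be sketched as a small TTL cache in front of the expensive system checks; this is an illustrative Python sketch (the real server would be PHP per the notes above), and the 15-second interval is just one value within the suggested 10-30s range:

```python
import time


class CachedHealth:
    """Serve cached check results so frequent ALB pings stay cheap."""

    def __init__(self, run_checks, ttl_seconds: float = 15.0):
        self._run_checks = run_checks  # injected callable doing the real work
        self._ttl = ttl_seconds
        self._cached = None
        self._stamp = 0.0

    def get(self):
        now = time.monotonic()
        if self._cached is None or now - self._stamp > self._ttl:
            # Refresh at most once per TTL window; every other request in
            # the window returns the cached snapshot in well under 100ms.
            self._cached = self._run_checks()
            self._stamp = now
        return self._cached
```

With this in place, an ALB pinging every 5 seconds would trigger the real system checks at most once per TTL window.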