Overview
Add a `protocol health:serve` command that opens an HTTP listener on a configurable port (e.g., 8080) so that AWS ALB target group health checks can ping Protocol directly. Protocol already has deep insight into container health, CPU, memory, disk, and incident detection — exposing this via HTTP enables intelligent auto-scaling decisions that are far better than CPU-based scaling alone.
Background
In our SOC 2 CloudFormation infrastructure, the ASG currently scales on CPU utilization only. This is a poor signal — a server can be struggling (slow responses, memory pressure, disk full) while CPU looks fine. Protocol already runs on every instance and has access to the system state that matters.
What Protocol should do
HTTP Health Endpoint
Listen on a port and respond to `GET /health` with a JSON payload:

```json
{
  "status": "healthy",
  "uptime": 3600,
  "checks": {
    "webserver": "up",
    "database": "connected",
    "disk": "ok",
    "containers": "2/2 running",
    "cpu": "32%",
    "memory": "58%"
  },
  "recommendation": "stable"
}
```

Return 200 for healthy, 503 for unhealthy.
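For contrast, a hypothetical unhealthy (503) response might look like the following — the `detail` field and exact values are illustrative, not a final schema:

```json
{
  "status": "unhealthy",
  "uptime": 86400,
  "checks": {
    "webserver": "down",
    "database": "connected",
    "disk": "ok",
    "containers": "1/2 running",
    "cpu": "12%",
    "memory": "41%"
  },
  "detail": "web container exited; restart attempt failed"
}
```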
Critical design: What makes it return 503?
Only return unhealthy for conditions a new instance would fix:
| Check | Unhealthy (503)? | Rationale |
|---|---|---|
| Web process/container down | Yes | New instance fixes this |
| Disk > 90% | Yes | Instance is degraded |
| Server unresponsive / Protocol can't function | Yes | Server is overwhelmed |
| Database unreachable | No (200 + warning) | New instance won't fix this — alert humans instead |
| High CPU/memory | No (200 + warning) | Scaling policy handles this separately |
| Instance still booting | No (200 + booting) | Grace period, don't kill it |
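The table above could be sketched as a small decision function. This is an illustrative Python sketch — the check names, input shape, and thresholds are assumptions, and the real implementation would live in Protocol's PHP codebase:

```python
def evaluate_health(checks: dict) -> tuple[int, list[str]]:
    """Map system checks to an HTTP status per the table above.

    `checks` is an assumed shape: {"webserver_up": bool, "disk_pct": int,
    "db_reachable": bool, "booting": bool}. Returns (status_code, notes).
    """
    notes = []

    # Grace period: a booting instance reports healthy so the ASG
    # doesn't kill it before it finishes initializing.
    if checks.get("booting"):
        return 200, ["booting"]

    # Conditions a replacement instance would fix -> 503.
    if not checks.get("webserver_up", True):
        return 503, ["web process down"]
    if checks.get("disk_pct", 0) > 90:
        return 503, ["disk > 90%"]

    # Conditions a new instance would NOT fix -> 200 plus a warning,
    # so humans (or the separate scaling policy) handle them instead.
    if not checks.get("db_reachable", True):
        notes.append("database unreachable")

    return 200, notes
```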
If Protocol itself can't respond, the health check times out — which the ALB treats as unhealthy. This is the right behavior: if the server is so overwhelmed that Protocol can't function, we need more capacity.
Self-healing before reporting unhealthy
Before returning 503, Protocol should attempt to heal:
- Detect the issue (container crashed, process stuck, etc.)
- Attempt restart/recovery
- If recovery succeeds → return 200
- If recovery fails → return 503 with details, triggering ASG replacement
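The heal-then-report flow above might look like this sketch — `detect_issue` and `attempt_recovery` are hypothetical injected callables, not existing Protocol APIs:

```python
def health_status(detect_issue, attempt_recovery, max_attempts: int = 1):
    """Try to self-heal before reporting unhealthy.

    `detect_issue` returns a description of the problem or None;
    `attempt_recovery` takes that description and returns True on success.
    Both are hypothetical stand-ins for Protocol's real incident-detection
    and restart logic.
    """
    issue = detect_issue()
    if issue is None:
        return 200, "healthy"

    for _ in range(max_attempts):
        if attempt_recovery(issue):
            # Recovery worked: report healthy so the ASG keeps the instance.
            return 200, f"recovered from: {issue}"

    # Recovery failed: 503 triggers ASG replacement.
    return 503, f"unrecovered: {issue}"
```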
Scaling recommendations
The response could include a recommendation field:
- `"stable"` — everything is fine, hold current capacity
- `"scale_up"` — server is under pressure, could use help
- `"scale_down"` — server is underutilized, safe to remove from group
- `"infrastructure_issue"` — problem is not capacity-related (e.g., DB down), don't scale
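One plausible way to derive the field, sketched in Python — the 70%/30% thresholds are invented for illustration, not part of the proposal:

```python
def recommend(cpu_pct: float, mem_pct: float, infra_issue: bool) -> str:
    """Pick a scaling recommendation from current metrics (assumed thresholds)."""
    if infra_issue:
        # Capacity won't help (e.g., DB down); don't scale, alert humans.
        return "infrastructure_issue"
    if cpu_pct > 70 or mem_pct > 70:
        return "scale_up"
    if cpu_pct < 30 and mem_pct < 30:
        return "scale_down"
    return "stable"
```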
This gives the ALB/ASG the intelligence to make the right call rather than blindly scaling on CPU.
Alerting integration
When Protocol detects an infrastructure issue (DB down, etc.) that scaling won't fix:
- Send webhook notification to Slack/PagerDuty (already supported)
- Create a GitHub issue via `incident:report` with full system state (already supported)
- Return 200 so ASG doesn't scale out of control
- Include diagnostic details in the response so engineers know what's happening
How it integrates with CloudFormation
- Current state: ALB health check hits nginx on port 80 at `/health` (placeholder)
- After this feature: ALB health check hits Protocol on port 8080 at `/health`
- Future: Add CloudWatch alarms on `TargetResponseTime` and `UnHealthyHostCount` for latency-based scaling
The target group health check config changes from:

```yaml
HealthCheckPort: 80
HealthCheckPath: /health
```

to:

```yaml
HealthCheckPort: 8080
HealthCheckPath: /health
```

Existing Protocol capabilities to leverage
- `IncidentDetector` — already detects down containers, suspicious processes (P1-P4)
- `DiskCheck` — already monitors disk with 80%/90% thresholds
- `docker:status` — already reads container CPU/memory
- `Webhook` helper — already sends alerts to Slack/PagerDuty
- `incident:report` — already creates GitHub issues with full diagnostics
Implementation notes
- Use a lightweight HTTP server (PHP built-in server or ReactPHP)
- Should start automatically via `protocol start` or a dedicated `protocol health:serve` command
- Configurable port in `protocol.json`
protocol.json - Health check logic should be fast (< 100ms response) — cache system checks, refresh on interval
- Consider running checks on a timer (every 10-30s) and serving cached results to avoid load from frequent ALB pings (every 5-30s)
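The caching note above could be sketched as a small TTL cache in front of the expensive system checks; this is an illustrative Python sketch (the real server would be PHP per the notes above), and the 15-second interval is just one value within the suggested 10-30s range:

```python
import time


class CachedHealth:
    """Serve cached check results so frequent ALB pings stay cheap."""

    def __init__(self, run_checks, ttl_seconds: float = 15.0):
        self._run_checks = run_checks  # injected callable doing the real work
        self._ttl = ttl_seconds
        self._cached = None
        self._stamp = 0.0

    def get(self):
        now = time.monotonic()
        if self._cached is None or now - self._stamp > self._ttl:
            # Refresh at most once per TTL window; every other request in
            # the window returns the cached snapshot in well under 100ms.
            self._cached = self._run_checks()
            self._stamp = now
        return self._cached
```

With this in place, an ALB pinging every 5 seconds would trigger the real system checks at most once per TTL window.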