
Conversation

@alexluong
Collaborator

@alexluong alexluong commented Nov 10, 2025

resolves #555, #306

Health Check Endpoint

This PR updates the /healthz endpoint to report the real-time health status of all workers in the application. The endpoint returns:

Healthy response (200):

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:00Z",
  "workers": {
    "http-server": {"status": "healthy"},
    "retrymq-consumer": {"status": "healthy"},
    "deliverymq-consumer": {"status": "healthy"}
  }
}

Unhealthy response (503):

{
  "status": "failed",
  "timestamp": "2025-01-15T10:30:00Z",
  "workers": {
    "http-server": {"status": "healthy"},
    "retrymq-consumer": {"status": "healthy"},
    "deliverymq-consumer": {"status": "failed"}
  }
}

Usage: Configure Kubernetes liveness/readiness probes or load balancer health checks to monitor this endpoint. When any worker fails, the endpoint returns 503, allowing orchestrators to automatically restart the pod/container.

Note: For service=delivery and service=log modes, an HTTP server is started solely to serve the /healthz endpoint on the configured API port, allowing health checks even for worker-only services.
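
For illustration, here is a minimal sketch of how an external monitor could consume the endpoint. The response fields follow the examples above; the base URL and port are assumptions, not the actual configuration.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// healthResponse mirrors the JSON shape shown above.
type healthResponse struct {
	Status    string                       `json:"status"`
	Timestamp string                       `json:"timestamp"`
	Workers   map[string]map[string]string `json:"workers"`
}

func main() {
	// The port is an assumption; use whatever API port is configured.
	resp, err := http.Get("http://localhost:3333/healthz")
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()

	var hr healthResponse
	if err := json.NewDecoder(resp.Body).Decode(&hr); err != nil {
		fmt.Println("decode failed:", err)
		return
	}

	// 200 means every worker is healthy; 503 means at least one worker failed.
	fmt.Printf("HTTP %d, overall status %q\n", resp.StatusCode, hr.Status)
	for name, w := range hr.Workers {
		fmt.Printf("  %s: %s\n", name, w["status"])
	}
}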

Worker Supervision Architecture

The application now uses a WorkerSupervisor pattern to manage long-running workers (HTTP server, MQ consumers, retry scheduler). Key design decisions:

  • Non-cascading failures: When a worker fails, it's marked as failed but other workers continue running. This keeps the HTTP server and health endpoint available for debugging while the orchestrator detects the failure and restarts the container.
  • Graceful shutdown: On SIGTERM/SIGINT, the supervisor cancels the context and waits for all workers to finish gracefully, then runs cleanup (database connections, HTTP server shutdown, etc.) before exiting.

This architecture provides visibility into worker health while maintaining availability during partial failures.
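
For illustration, a rough sketch of the shape of this pattern; the type and function names here are hypothetical and the real Outpost implementation may differ.

package worker

import (
	"context"
	"errors"
	"sync"
)

// Worker is the unit the supervisor manages.
type Worker interface {
	// Run blocks until the worker stops or ctx is cancelled.
	Run(ctx context.Context) error
}

type healthTracker struct {
	mu     sync.Mutex
	status map[string]string // "healthy" or "failed", as in the JSON above
}

func (h *healthTracker) set(name, status string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.status == nil {
		h.status = make(map[string]string)
	}
	h.status[name] = status
}

// runAll starts every worker and blocks until all of them have returned.
func runAll(ctx context.Context, workers map[string]Worker, health *healthTracker) {
	var wg sync.WaitGroup
	for name, w := range workers {
		wg.Add(1)
		go func(name string, w Worker) {
			defer wg.Done()
			health.set(name, "healthy")
			// Non-cascading: a failure marks only this worker as failed; ctx is
			// not cancelled, so the other workers and /healthz keep running.
			if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
				health.set(name, "failed")
			}
		}(name, w)
	}
	wg.Wait() // graceful shutdown: on SIGTERM/SIGINT the caller cancels ctx and waits here
}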

@vercel

vercel bot commented Nov 10, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project          Deployment  Preview  Comments  Updated (UTC)
outpost-docs     Ready       Preview  Comment   Nov 17, 2025 4:21am
outpost-website  Ready       Preview  Comment   Nov 17, 2025 4:21am

@leggetter
Collaborator

@alexluong - just want to flag that the payload change is a breaking change and there may be some code that checks for "OK", including in the Outpost repo.

@alexluong
Collaborator Author

alexluong commented Nov 11, 2025

including in the Outpost repo

Can you reference this? I'm not aware of any usage of this.

Good call regarding the breaking change. Seems we need to issue v0.9.0 for this in that case?

@fmvilas
Contributor

fmvilas commented Nov 11, 2025

Seems we need to issue v0.9.0 for this in that case?

I'd definitely go for 0.9.0.

@leggetter
Collaborator

including in the Outpost repo

Can you reference this? I'm not aware of any usage of this.

@alexluong - I had a quick look, but I can't find anything programmatically checking for OK. Another set of eyes would be useful.

I do see the main README mentions the response contents, so that will need to be changed. Worth looking at other docs, too.

@alexluong
Collaborator Author

alexluong commented Nov 11, 2025

@leggetter I looked at /docs directory already and don't think we mention the content of the healthz endpoint

updated README

@fmvilas
Contributor

fmvilas commented Nov 11, 2025

I also can't find anything else. In the website preview link, the API Reference is not updated, but I'm assuming this will be updated automatically once we release?

Contributor

@fmvilas fmvilas left a comment


LGTM 👍

@alexluong
Collaborator Author

@claude can you review please?

@claude

claude bot commented Nov 13, 2025

Claude finished @alexluong's task.


Code Review Complete

  • Read and understand the worker supervision architecture
  • Review the health check implementation
  • Analyze service builder patterns
  • Check test coverage and quality
  • Review error handling and graceful shutdown
  • Provide comprehensive feedback

Overview

This is a well-architected refactor that introduces a robust worker supervision pattern and enhances the health check endpoint. The changes are substantial but well-structured, with excellent test coverage and clear separation of concerns.

Key Strengths:

  • Clean separation between worker supervision and service building
  • Comprehensive test coverage for worker lifecycle scenarios
  • Non-cascading failure design that keeps services observable during partial failures
  • Proper graceful shutdown handling with configurable timeouts
  • Good documentation in PR description and code comments

Areas for Improvement:

  • Some potential race conditions and cleanup ordering issues
  • Missing error checks in critical paths
  • Opportunity to improve consistency in error handling patterns

Critical Issues

1. Double HTTP Server Shutdown Could Cause Issues

File: internal/services/builder.go:139-143

The HTTP server shutdown is registered in the cleanup functions AND handled separately in the HTTPServerWorker. This could lead to double-shutdown attempts:

// In builder.go - cleanup function added
targetSvc.cleanupFuncs = append(targetSvc.cleanupFuncs, func(ctx context.Context, logger *logging.LoggerWithCtx) {
    if err := httpServer.Shutdown(ctx); err != nil {
        logger.Error("error shutting down http server", zap.Error(err))
    }
    logger.Info("http server shut down")
})

But HTTPServerWorker also shuts down the server in its Run method when context is cancelled:

// In http_worker.go:50-56
if err := w.server.Shutdown(shutdownCtx); err != nil {
    logger.Error("error shutting down http server", zap.Error(err))
    return err
}

Impact: Could cause "server already closed" errors or race conditions during shutdown.

Recommendation: Remove the HTTP server shutdown from cleanup functions since the HTTPServerWorker handles it. The cleanup functions should only handle resources that workers don't already manage.
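
For reference, a self-contained sketch of what single ownership by the worker could look like; this is illustrative only, not the actual http_worker.go.

package worker

import (
	"context"
	"net/http"
	"time"
)

// HTTPServerWorker owns the server's full lifecycle, including shutdown, so
// no separate cleanup function needs to touch the server.
type HTTPServerWorker struct {
	server *http.Server
}

func (w *HTTPServerWorker) Run(ctx context.Context) error {
	errCh := make(chan error, 1)
	go func() {
		errCh <- w.server.ListenAndServe()
	}()
	select {
	case err := <-errCh:
		return err // startup failure, e.g. port already in use
	case <-ctx.Done():
		// The worker, not a cleanup function, shuts the server down exactly once.
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return w.server.Shutdown(shutdownCtx)
	}
}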

2. Missing Context Error Check Could Hide Shutdown Issues

File: internal/services/deliverymq_worker.go:58-61

The error handling doesn't account for graceful shutdown properly:

if err := csm.Run(ctx); !errors.Is(err, ctx.Err()) {
    logger.Error("error running deliverymq consumer", zap.Error(err))
    return err
}
return nil

Issue: If csm.Run() returns an error that wraps context.Canceled along with another error, this check will log it as an error and return it, marking the worker as failed even during graceful shutdown.

Recommendation: Use a more robust check:

if err := csm.Run(ctx); err != nil {
    if !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded) {
        logger.Error("error running deliverymq consumer", zap.Error(err))
        return err
    }
}
return nil

This pattern appears in multiple workers (deliverymq_worker.go:58, publishmq_worker.go:59).

3. Potential Race Between Worker Failure and Health Check

File: internal/worker/supervisor.go:107-111

There's a small window where a worker could fail between being marked healthy and actually starting:

r.logger.Debug("worker starting", zap.String("worker", name))
r.health.MarkHealthy(name)

// Run the worker
if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {

Issue: If a worker fails immediately in Run() (e.g., port already in use for HTTP server), there's a brief moment where health checks would report it as healthy.

Recommendation: Consider marking workers as healthy only after they successfully start (though this would require workers to signal when they're ready, which is a bigger change). For now, the current behaviour is acceptable, but the trade-off is worth documenting with a comment.
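
One possible shape for that readiness signal, purely as an assumption; ReadyWorker and the channel-based signal are not existing Outpost APIs.

package worker

import "context"

// ReadyWorker lets a worker report when it has actually started serving, so
// startup failures are never briefly reported as healthy.
type ReadyWorker interface {
	// Run blocks until the worker stops or ctx is cancelled.
	Run(ctx context.Context) error
	// Ready is closed once the worker is serving (e.g. the listener is bound).
	Ready() <-chan struct{}
}

// The supervisor could then wait for readiness or an early failure:
//
//	errCh := make(chan error, 1)
//	go func() { errCh <- w.Run(ctx) }()
//	select {
//	case <-w.Ready():
//		health.MarkHealthy(name)
//	case err := <-errCh:
//		health.MarkFailed(name) // failed before it ever became ready
//		_ = err
//	}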


High Priority

4. Cleanup Order Could Cause Resource Leaks

File: internal/services/builder.go:156-163

The cleanup functions are executed in the order they were added, but resources are created in a specific dependency order. If cleanup fails partway through, dependent resources may not be properly closed.

for _, svc := range b.services {
    logger.Debug("cleaning up service", zap.String("service", svc.name))
    for _, cleanupFunc := range svc.cleanupFuncs {
        cleanupFunc(ctx, &logger)
    }
}

Recommendation:
a. Reverse the cleanup order (LIFO) to mirror the initialization order
b. Use defer pattern or collect all errors and return them together
c. Add explicit error handling and logging for each cleanup step

Example:

for i := len(b.services) - 1; i >= 0; i-- {
    svc := b.services[i]
    logger.Debug("cleaning up service", zap.String("service", svc.name))
    for j := len(svc.cleanupFuncs) - 1; j >= 0; j-- {
        func() {
            defer func() {
                if r := recover(); r != nil {
                    logger.Error("panic during cleanup", zap.Any("panic", r))
                }
            }()
            svc.cleanupFuncs[j](ctx, &logger)
        }()
    }
}

5. Incomplete Error Handling in Supervisor.Run

File: internal/worker/supervisor.go:107

When a worker returns context.Canceled, it's ignored but still logs a graceful shutdown. However, if multiple workers fail with different errors, only the health tracker reflects all failures.

if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
    r.logger.Error("worker failed",
        zap.String("worker", name),
        zap.Error(err))
    r.health.MarkFailed(name)
} else {
    r.logger.Info("worker stopped gracefully", zap.String("worker", name))
}

Issue: The supervisor's return value only indicates whether context was cancelled or all workers exited, but doesn't preserve individual worker errors for debugging.

Recommendation: Consider collecting worker errors and returning them as a multi-error, or at minimum, ensure all errors are logged with sufficient detail for debugging.
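
A small sketch of the multi-error idea using errors.Join from the standard library (Go 1.20+); the collector type and method names are illustrative.

package worker

import (
	"errors"
	"fmt"
	"sync"
)

// errCollector gathers per-worker failures so the supervisor can return them
// together instead of losing them.
type errCollector struct {
	mu   sync.Mutex
	errs []error
}

func (c *errCollector) add(name string, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, fmt.Errorf("worker %s: %w", name, err))
}

// joined returns nil when no worker failed, otherwise a single error that
// wraps every individual failure.
func (c *errCollector) joined() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return errors.Join(c.errs...)
}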

6. Missing Validation for Duplicate Health Check Paths

File: internal/services/health.go:29-37

The base router registers health checks at both /healthz and /api/v1/healthz, but there's a TODO about versioning strategy. This could lead to confusion.

// TODO: Rethink API versioning strategy in the future.
// For now, we expose health check at both /healthz and /api/v1/healthz for backwards compatibility.

Recommendation:
a. Document in the OpenAPI spec that both endpoints exist
b. Consider deprecating one path with appropriate headers (e.g., Warning: 299 - "Deprecated, use /healthz")
c. Add a decision deadline for removing the duplicate (e.g., "remove /api/v1/healthz in v1.0.0")
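
A minimal sketch of option (b), wrapping the legacy path's handler so responses advertise the preferred one; the handler and mux names in the trailing comment are hypothetical.

package worker

import "net/http"

// deprecated wraps a handler so every response carries a deprecation warning
// pointing at the preferred path.
func deprecated(preferred string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Warning", `299 - "Deprecated, use `+preferred+`"`)
		next.ServeHTTP(w, r)
	})
}

// Hypothetical wiring:
//	mux.Handle("/api/v1/healthz", deprecated("/healthz", healthzHandler))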


Medium Priority

7. ServiceBuilder Context Stored Could Cause Confusion

File: internal/services/builder.go:64-73

The ServiceBuilder stores a context from initialization, but workers are expected to use the context passed to supervisor.Run():

type ServiceBuilder struct {
    ctx        context.Context  // Stored at creation
    cfg        *config.Config
    // ...
}

Then later services use b.ctx:

redisClient, err := redis.New(b.ctx, cfg.Redis.ToConfig())

Issue: This creates confusion about context lifecycle. If the stored context is cancelled before supervisor.Run() is called, initialization could fail unexpectedly.

Recommendation:
a. Rename to initCtx or setupCtx to clarify its purpose
b. Document that this context is only for initialization, not worker runtime
c. Consider passing context explicitly to each init method instead

8. Health Tracker Timestamp Could Be Stale

File: internal/worker/health.go:68-86

The timestamp in GetStatus() is generated at the time of the call, but the worker health states might have been updated earlier:

return map[string]interface{}{
    "status":    status,
    "timestamp": time.Now(),  // Now, not when state changed
    "workers":   workers,
}

Recommendation: Consider storing timestamps with each state change:

type WorkerHealth struct {
    Status    string    `json:"status"`
    UpdatedAt time.Time `json:"updated_at,omitempty"`
}

This would provide more accurate information about when failures occurred.
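
Extending the HealthTracker methods quoted in this review, the write side could record the transition time; this fragment reuses the names from the review and is illustrative only.

func (h *HealthTracker) MarkFailed(name string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.workers[name] = WorkerHealth{
		Status:    WorkerStatusFailed,
		UpdatedAt: time.Now(), // captured when the state changes, not at read time
	}
}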

9. Test Contains Helper Function Has Bug

File: internal/worker/worker_test.go:98-100

The contains function implementation is overly complex and appears to have a logic error:

func contains(s, substr string) bool {
    return len(s) >= len(substr) && (s == substr || len(s) > len(substr) && 
        (s[:len(substr)] == substr || s[len(s)-len(substr):] == substr || 
        s[1:len(s)-1] != s[1:len(s)-1] && contains(s[1:], substr)))
}

Issue: The condition s[1:len(s)-1] != s[1:len(s)-1] is always false, making the recursive call unreachable. This means the function only checks prefix/suffix, not substrings.

Recommendation: Use the standard library:

import "strings"

func contains(s, substr string) bool {
    return strings.Contains(s, substr)
}

Low Priority / Suggestions

10. Consider Adding Structured Health Check Types

File: internal/worker/health.go:68

The GetStatus() returns map[string]interface{} which loses type safety:

func (h *HealthTracker) GetStatus() map[string]interface{} {

Recommendation: Define a proper struct:

type HealthStatus struct {
    Status    string                  `json:"status"`
    Timestamp time.Time               `json:"timestamp"`
    Workers   map[string]WorkerHealth `json:"workers"`
}

func (h *HealthTracker) GetStatus() HealthStatus {
    // ...
}

This would:

  • Improve type safety
  • Enable better IDE support
  • Make it easier to add fields in the future
  • Match the OpenAPI spec more closely

11. Worker Name Constants Would Improve Maintainability

Files: Multiple worker files

Worker names are hardcoded strings scattered across multiple files:

  • "http-server" in http_worker.go:29
  • "deliverymq-consumer" in deliverymq_worker.go:39
  • "retrymq-consumer" in retrymq_worker.go:28

Recommendation: Define constants in a central location:

// internal/worker/names.go
package worker

const (
    WorkerNameHTTPServer        = "http-server"
    WorkerNameDeliveryMQConsumer = "deliverymq-consumer"
    WorkerNameRetryMQConsumer    = "retrymq-consumer"
    WorkerNamePublishMQConsumer  = "publishmq-consumer"
    WorkerNameLogMQConsumer      = "logmq-consumer"
)

12. Consider Adding Metrics for Worker Health Transitions

The worker health transitions (healthy → failed) are currently only logged. Consider emitting metrics for these transitions to enable better monitoring and alerting.

Example:

func (h *HealthTracker) MarkFailed(name string) {
    h.mu.Lock()
    defer h.mu.Unlock()
    
    oldStatus := h.workers[name].Status
    h.workers[name] = WorkerHealth{Status: WorkerStatusFailed}
    
    if oldStatus == WorkerStatusHealthy {
        // Emit metric: worker_health_transition{worker="name",from="healthy",to="failed"}
    }
}

13. ServiceBuilder Could Benefit from Functional Options Pattern

File: internal/services/builder.go:64

The ServiceBuilder has a fixed constructor but might benefit from options for testing or future extensibility:

func NewServiceBuilder(ctx context.Context, cfg *config.Config, logger *logging.Logger, 
    telemetry telemetry.Telemetry, opts ...BuilderOption) *ServiceBuilder

This would make it easier to:

  • Inject mock dependencies for testing
  • Configure shutdown timeouts
  • Add future configuration options without breaking changes
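
A minimal sketch of that options shape, assuming hypothetical names such as WithShutdownTimeout:

package services

import "time"

// BuilderOption configures a ServiceBuilder; the concrete options below are
// hypothetical examples.
type BuilderOption func(*builderOptions)

type builderOptions struct {
	shutdownTimeout time.Duration
}

// WithShutdownTimeout overrides the default graceful-shutdown timeout.
func WithShutdownTimeout(d time.Duration) BuilderOption {
	return func(o *builderOptions) { o.shutdownTimeout = d }
}

func applyOptions(opts ...BuilderOption) builderOptions {
	o := builderOptions{shutdownTimeout: 30 * time.Second} // default is an assumption
	for _, opt := range opts {
		opt(&o)
	}
	return o
}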

Testing

Excellent Test Coverage

The test suite (internal/worker/worker_test.go) is comprehensive and covers:

✅ Basic health tracking (healthy/failed states)
✅ Worker registration and duplicate prevention
✅ Graceful shutdown scenarios
✅ Non-cascading failure behavior
✅ Shutdown timeout configuration
✅ Concurrent access to health tracker
✅ Variable shutdown timing
✅ Edge cases (no workers, stuck workers, all workers exit)

Strengths:

  • Tests validate the core design goal of non-cascading failures
  • Good coverage of timing-related edge cases
  • Proper use of table-driven tests where appropriate
  • Clear test names that describe expected behavior

Minor suggestion: Consider adding integration tests that verify:

  • Health endpoint returns correct HTTP status codes (200 vs 503)
  • Multiple services running together
  • Cleanup order and resource management
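
A sketch of what the status-code check could look like with httptest; the handler here is a stand-in for however Outpost wires /healthz, not the real router.

package worker_test

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// healthzHandler mimics the documented 200/503 behaviour.
func healthzHandler(healthy bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		status, code := "healthy", http.StatusOK
		if !healthy {
			status, code = "failed", http.StatusServiceUnavailable
		}
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(map[string]string{"status": status})
	}
}

func TestHealthzStatusCodes(t *testing.T) {
	for _, tc := range []struct {
		healthy  bool
		wantCode int
	}{
		{healthy: true, wantCode: http.StatusOK},
		{healthy: false, wantCode: http.StatusServiceUnavailable},
	} {
		srv := httptest.NewServer(healthzHandler(tc.healthy))
		resp, err := http.Get(srv.URL + "/healthz")
		if err != nil {
			t.Fatalf("probe failed: %v", err)
		}
		if resp.StatusCode != tc.wantCode {
			t.Errorf("got %d, want %d", resp.StatusCode, tc.wantCode)
		}
		resp.Body.Close()
		srv.Close()
	}
}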

Documentation

Documentation Quality

Strengths:

  • PR description clearly explains the design decisions
  • Code comments explain the non-cascading failure behavior
  • OpenAPI spec properly updated with examples
  • Worker interface has clear documentation about expected behavior

Suggestions:
a. Add architecture documentation showing the flow: App → ServiceBuilder → Supervisor → Workers
b. Document the cleanup lifecycle and order of operations
c. Add examples of how to implement a custom worker
d. Consider adding a troubleshooting guide for common health check scenarios


Security Considerations

Security Review

Good: Error details are not exposed in health check responses (internal/worker/health.go:14)
Good: Health endpoint has no authentication requirement (appropriate for health checks)
Good: No sensitive information leaked in worker names or status

Note: The health endpoint is intentionally unauthenticated, which is correct for Kubernetes/load balancer health checks. However, consider documenting this security decision in the code or docs.


Breaking Changes

Breaking Change Handling

The health check endpoint response format changed from:

"OK" (text/plain)

to:

{"status": "healthy", "timestamp": "...", "workers": {...}}

Already Addressed:

  • ✅ README updated
  • ✅ OpenAPI spec updated
  • ✅ E2E tests updated
  • ✅ Team discussed version bump to 0.9.0

Recommendation: Add a migration note in CHANGELOG.md when releasing v0.9.0 to help users update their health check consumers.


Overall Assessment

This is a high-quality refactor that significantly improves the application's observability and reliability. The worker supervision pattern is well-designed and the implementation is clean with excellent test coverage.

Approval Recommendation: ✅ Approve with suggestions

The critical and high-priority issues should be addressed, but they don't block merging as they're mostly about improving robustness rather than fixing broken functionality. The current implementation is production-ready, and these improvements can be done as follow-up work if needed.

Suggested Next Steps:
a. Address critical issues 1-3 (double shutdown, context error checks, race condition documentation)
b. Consider high-priority issues 4-6 for a follow-up PR
c. Add integration tests for the full health check flow
d. Update CHANGELOG.md with migration notes for v0.9.0

Great work on this refactor! The architecture is solid and the code is well-tested. 🎉


@alexluong alexluong changed the base branch from main to v0.9.0 November 17, 2025 04:21
@alexluong alexluong merged commit 975f212 into v0.9.0 Nov 17, 2025
3 of 4 checks passed
@alexluong alexluong deleted the healthcheck branch November 17, 2025 04:21


Development

Successfully merging this pull request may close these issues.

Consumer Exits Permanently on Errors without Auto-Recovery
