
Conversation

@alexluong
Collaborator

@alexluong alexluong commented Nov 10, 2025

resolves #555, #306

Health Check Endpoint

This PR updates the /healthz endpoint to report the real-time health status of all workers in the application. The endpoint returns:

Healthy response (200):

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:00Z",
  "workers": {
    "http-server": {"status": "healthy"},
    "retrymq-consumer": {"status": "healthy"},
    "deliverymq-consumer": {"status": "healthy"}
  }
}

Unhealthy response (503):

{
  "status": "failed",
  "timestamp": "2025-01-15T10:30:00Z",
  "workers": {
    "http-server": {"status": "healthy"},
    "retrymq-consumer": {"status": "healthy"},
    "deliverymq-consumer": {"status": "failed"}
  }
}

Usage: Configure Kubernetes liveness/readiness probes or load balancer health checks to monitor this endpoint. When any worker fails, the endpoint returns 503, allowing orchestrators to automatically restart the pod/container.

Note: For service=delivery and service=log modes, an HTTP server is started solely to serve the /healthz endpoint on the configured API port, allowing health checks even for worker-only services.
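
For illustration, here is a minimal sketch of how an external monitor could consume the endpoint. The response fields follow the examples above; the base URL and port are assumptions, not the actual configuration.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// healthResponse mirrors the JSON shape shown above.
type healthResponse struct {
	Status    string                       `json:"status"`
	Timestamp string                       `json:"timestamp"`
	Workers   map[string]map[string]string `json:"workers"`
}

func main() {
	// The port is an assumption; use whatever API port is configured.
	resp, err := http.Get("http://localhost:3333/healthz")
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()

	var hr healthResponse
	if err := json.NewDecoder(resp.Body).Decode(&hr); err != nil {
		fmt.Println("decode failed:", err)
		return
	}

	// 200 means every worker is healthy; 503 means at least one worker failed.
	fmt.Printf("HTTP %d, overall status %q\n", resp.StatusCode, hr.Status)
	for name, w := range hr.Workers {
		fmt.Printf("  %s: %s\n", name, w["status"])
	}
}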

Worker Supervision Architecture

The application now uses a WorkerSupervisor pattern to manage long-running workers (HTTP server, MQ consumers, retry scheduler). Key design decisions:

  • Non-cascading failures: When a worker fails, it's marked as failed but other workers continue running. This keeps the HTTP server and health endpoint available for debugging while the orchestrator detects the failure and restarts the container.
  • Graceful shutdown: On SIGTERM/SIGINT, the supervisor cancels the context and waits for all workers to finish gracefully, then runs cleanup (database connections, HTTP server shutdown, etc.) before exiting.

This architecture provides visibility into worker health while maintaining availability during partial failures.
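
For illustration, a rough sketch of the shape of this pattern; the type and function names here are hypothetical and the real Outpost implementation may differ.

package worker

import (
	"context"
	"errors"
	"sync"
)

// Worker is the unit the supervisor manages.
type Worker interface {
	// Run blocks until the worker stops or ctx is cancelled.
	Run(ctx context.Context) error
}

type healthTracker struct {
	mu     sync.Mutex
	status map[string]string // "healthy" or "failed", as in the JSON above
}

func (h *healthTracker) set(name, status string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.status == nil {
		h.status = make(map[string]string)
	}
	h.status[name] = status
}

// runAll starts every worker and blocks until all of them have returned.
func runAll(ctx context.Context, workers map[string]Worker, health *healthTracker) {
	var wg sync.WaitGroup
	for name, w := range workers {
		wg.Add(1)
		go func(name string, w Worker) {
			defer wg.Done()
			health.set(name, "healthy")
			// Non-cascading: a failure marks only this worker as failed; ctx is
			// not cancelled, so the other workers and /healthz keep running.
			if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
				health.set(name, "failed")
			}
		}(name, w)
	}
	wg.Wait() // graceful shutdown: on SIGTERM/SIGINT the caller cancels ctx and waits here
}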

@vercel

vercel bot commented Nov 10, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project          Deployment  Preview  Comments  Updated (UTC)
outpost-docs     Ready       Preview  Comment   Nov 17, 2025 4:21am
outpost-website  Ready       Preview  Comment   Nov 17, 2025 4:21am

@leggetter
Collaborator

@alexluong - just want to flag that the payload change is a breaking change and there may be some code that checks for "OK", including in the Outpost repo.

@alexluong
Collaborator Author

alexluong commented Nov 11, 2025

including in the Outpost repo

Can you reference this? I'm not aware of any usage of this.

Good call regarding the breaking change. Seems we need to issue v0.9.0 for this in that case?

@fmvilas
Contributor

fmvilas commented Nov 11, 2025

Seems we need to issue v0.9.0 for this in that case?

I'd definitely go for 0.9.0.

@leggetter
Collaborator

including in the Outpost repo

Can you reference this? I'm not aware of any usage of this.

@alexluong - I had a quick look, but I can't find anything programmatically checking for OK. Another set of eyes would be useful.

I do see the main README mentions the response contents, so that will need to be changed. Worth looking at other docs, too.

@alexluong
Collaborator Author

alexluong commented Nov 11, 2025

@leggetter I looked at /docs directory already and don't think we mention the content of the healthz endpoint

updated README

@fmvilas
Contributor

fmvilas commented Nov 11, 2025

I also can't find anything else. In the website preview link, the API Reference is not updated, but I'm assuming this will be updated automatically once we release?

Contributor

@fmvilas fmvilas left a comment


LGTM 👍

@alexluong
Collaborator Author

@claude can you review please?

@claude

claude bot commented Nov 13, 2025

Claude finished @alexluong's task.


Code Review Complete

  • Read and understand the worker supervision architecture
  • Review the health check implementation
  • Analyze service builder patterns
  • Check test coverage and quality
  • Review error handling and graceful shutdown
  • Provide comprehensive feedback

Overview

This is a well-architected refactor that introduces a robust worker supervision pattern and enhances the health check endpoint. The changes are substantial but well-structured, with excellent test coverage and clear separation of concerns.

Key Strengths:

  • Clean separation between worker supervision and service building
  • Comprehensive test coverage for worker lifecycle scenarios
  • Non-cascading failure design that keeps services observable during partial failures
  • Proper graceful shutdown handling with configurable timeouts
  • Good documentation in PR description and code comments

Areas for Improvement:

  • Some potential race conditions and cleanup ordering issues
  • Missing error checks in critical paths
  • Opportunity to improve consistency in error handling patterns

Critical Issues

1. Double HTTP Server Shutdown Could Cause Issues

File: internal/services/builder.go:139-143

The HTTP server shutdown is registered in the cleanup functions AND handled separately in the HTTPServerWorker. This could lead to double-shutdown attempts:

// In builder.go - cleanup function added
targetSvc.cleanupFuncs = append(targetSvc.cleanupFuncs, func(ctx context.Context, logger *logging.LoggerWithCtx) {
    if err := httpServer.Shutdown(ctx); err != nil {
        logger.Error("error shutting down http server", zap.Error(err))
    }
    logger.Info("http server shut down")
})

But HTTPServerWorker also shuts down the server in its Run method when context is cancelled:

// In http_worker.go:50-56
if err := w.server.Shutdown(shutdownCtx); err != nil {
    logger.Error("error shutting down http server", zap.Error(err))
    return err
}

Impact: Could cause "server already closed" errors or race conditions during shutdown.

Recommendation: Remove the HTTP server shutdown from cleanup functions since the HTTPServerWorker handles it. The cleanup functions should only handle resources that workers don't already manage.
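
For reference, a self-contained sketch of what single ownership by the worker could look like; this is illustrative only, not the actual http_worker.go.

package worker

import (
	"context"
	"net/http"
	"time"
)

// HTTPServerWorker owns the server's full lifecycle, including shutdown, so
// no separate cleanup function needs to touch the server.
type HTTPServerWorker struct {
	server *http.Server
}

func (w *HTTPServerWorker) Run(ctx context.Context) error {
	errCh := make(chan error, 1)
	go func() {
		errCh <- w.server.ListenAndServe()
	}()
	select {
	case err := <-errCh:
		return err // startup failure, e.g. port already in use
	case <-ctx.Done():
		// The worker, not a cleanup function, shuts the server down exactly once.
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return w.server.Shutdown(shutdownCtx)
	}
}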

2. Missing Context Error Check Could Hide Shutdown Issues

File: internal/services/deliverymq_worker.go:58-61

The error handling doesn't account for graceful shutdown properly:

if err := csm.Run(ctx); !errors.Is(err, ctx.Err()) {
    logger.Error("error running deliverymq consumer", zap.Error(err))
    return err
}
return nil

Issue: If csm.Run() returns an error that wraps context.Canceled along with another error, this check will log it as an error and return it, marking the worker as failed even during graceful shutdown.

Recommendation: Use a more robust check:

if err := csm.Run(ctx); err != nil {
    if !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded) {
        logger.Error("error running deliverymq consumer", zap.Error(err))
        return err
    }
}
return nil

This pattern appears in multiple workers (deliverymq_worker.go:58, publishmq_worker.go:59).

3. Potential Race Between Worker Failure and Health Check

File: internal/worker/supervisor.go:107-111

There's a small window where a worker could fail between being marked healthy and actually starting:

r.logger.Debug("worker starting", zap.String("worker", name))
r.health.MarkHealthy(name)

// Run the worker
if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {

Issue: If a worker fails immediately in Run() (e.g., port already in use for HTTP server), there's a brief moment where health checks would report it as healthy.

Recommendation: Consider marking workers as healthy only after they successfully start (though this would require workers to signal when they're ready, which is a bigger change). For now, the current behaviour is acceptable, but the trade-off is worth documenting with a comment.
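
One possible shape for that readiness signal, purely as an assumption; ReadyWorker and the channel-based signal are not existing Outpost APIs.

package worker

import "context"

// ReadyWorker lets a worker report when it has actually started serving, so
// startup failures are never briefly reported as healthy.
type ReadyWorker interface {
	// Run blocks until the worker stops or ctx is cancelled.
	Run(ctx context.Context) error
	// Ready is closed once the worker is serving (e.g. the listener is bound).
	Ready() <-chan struct{}
}

// The supervisor could then wait for readiness or an early failure:
//
//	errCh := make(chan error, 1)
//	go func() { errCh <- w.Run(ctx) }()
//	select {
//	case <-w.Ready():
//		health.MarkHealthy(name)
//	case err := <-errCh:
//		health.MarkFailed(name) // failed before it ever became ready
//		_ = err
//	}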


High Priority

4. Cleanup Order Could Cause Resource Leaks

File: internal/services/builder.go:156-163

The cleanup functions are executed in the order they were added, but resources are created in a specific dependency order. If cleanup fails partway through, dependent resources may not be properly closed.

for _, svc := range b.services {
    logger.Debug("cleaning up service", zap.String("service", svc.name))
    for _, cleanupFunc := range svc.cleanupFuncs {
        cleanupFunc(ctx, &logger)
    }
}

Recommendation:
a. Reverse the cleanup order (LIFO) to mirror the initialization order
b. Use defer pattern or collect all errors and return them together
c. Add explicit error handling and logging for each cleanup step

Example:

for i := len(b.services) - 1; i >= 0; i-- {
    svc := b.services[i]
    logger.Debug("cleaning up service", zap.String("service", svc.name))
    for j := len(svc.cleanupFuncs) - 1; j >= 0; j-- {
        func() {
            defer func() {
                if r := recover(); r != nil {
                    logger.Error("panic during cleanup", zap.Any("panic", r))
                }
            }()
            svc.cleanupFuncs[j](ctx, &logger)
        }()
    }
}

5. Incomplete Error Handling in Supervisor.Run

File: internal/worker/supervisor.go:107

When a worker returns context.Canceled, it's ignored but still logs a graceful shutdown. However, if multiple workers fail with different errors, only the health tracker reflects all failures.

if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
    r.logger.Error("worker failed",
        zap.String("worker", name),
        zap.Error(err))
    r.health.MarkFailed(name)
} else {
    r.logger.Info("worker stopped gracefully", zap.String("worker", name))
}

Issue: The supervisor's return value only indicates whether context was cancelled or all workers exited, but doesn't preserve individual worker errors for debugging.

Recommendation: Consider collecting worker errors and returning them as a multi-error, or at minimum, ensure all errors are logged with sufficient detail for debugging.
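
A small sketch of the multi-error idea using errors.Join from the standard library (Go 1.20+); the collector type and method names are illustrative.

package worker

import (
	"errors"
	"fmt"
	"sync"
)

// errCollector gathers per-worker failures so the supervisor can return them
// together instead of losing them.
type errCollector struct {
	mu   sync.Mutex
	errs []error
}

func (c *errCollector) add(name string, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, fmt.Errorf("worker %s: %w", name, err))
}

// joined returns nil when no worker failed, otherwise a single error that
// wraps every individual failure.
func (c *errCollector) joined() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return errors.Join(c.errs...)
}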

6. Missing Validation for Duplicate Health Check Paths

File: internal/services/health.go:29-37

The base router registers health checks at both /healthz and /api/v1/healthz, but there's a TODO about versioning strategy. This could lead to confusion.

// TODO: Rethink API versioning strategy in the future.
// For now, we expose health check at both /healthz and /api/v1/healthz for backwards compatibility.

Recommendation:
a. Document in the OpenAPI spec that both endpoints exist
b. Consider deprecating one path with appropriate headers (e.g., Warning: 299 - "Deprecated, use /healthz")
c. Add a decision deadline for removing the duplicate (e.g., "remove /api/v1/healthz in v1.0.0")
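
A minimal sketch of option (b), wrapping the legacy path's handler so responses advertise the preferred one; the handler and mux names in the trailing comment are hypothetical.

package worker

import "net/http"

// deprecated wraps a handler so every response carries a deprecation warning
// pointing at the preferred path.
func deprecated(preferred string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Warning", `299 - "Deprecated, use `+preferred+`"`)
		next.ServeHTTP(w, r)
	})
}

// Hypothetical wiring:
//	mux.Handle("/api/v1/healthz", deprecated("/healthz", healthzHandler))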


Medium Priority

7. ServiceBuilder Context Stored Could Cause Confusion

File: internal/services/builder.go:64-73

The ServiceBuilder stores a context from initialization, but workers are expected to use the context passed to supervisor.Run():

type ServiceBuilder struct {
    ctx        context.Context  // Stored at creation
    cfg        *config.Config
    // ...
}

Then later services use b.ctx:

redisClient, err := redis.New(b.ctx, cfg.Redis.ToConfig())

Issue: This creates confusion about context lifecycle. If the stored context is cancelled before supervisor.Run() is called, initialization could fail unexpectedly.

Recommendation:
a. Rename to initCtx or setupCtx to clarify its purpose
b. Document that this context is only for initialization, not worker runtime
c. Consider passing context explicitly to each init method instead

8. Health Tracker Timestamp Could Be Stale

File: internal/worker/health.go:68-86

The timestamp in GetStatus() is generated at the time of the call, but the worker health states might have been updated earlier:

return map[string]interface{}{
    "status":    status,
    "timestamp": time.Now(),  // Now, not when state changed
    "workers":   workers,
}

Recommendation: Consider storing timestamps with each state change:

type WorkerHealth struct {
    Status    string    `json:"status"`
    UpdatedAt time.Time `json:"updated_at,omitempty"`
}

This would provide more accurate information about when failures occurred.
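
Extending the HealthTracker methods quoted in this review, the write side could record the transition time; this fragment reuses the names from the review and is illustrative only.

func (h *HealthTracker) MarkFailed(name string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.workers[name] = WorkerHealth{
		Status:    WorkerStatusFailed,
		UpdatedAt: time.Now(), // captured when the state changes, not at read time
	}
}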

9. Test Contains Helper Function Has Bug

File: internal/worker/worker_test.go:98-100

The contains function implementation is overly complex and appears to have a logic error:

func contains(s, substr string) bool {
    return len(s) >= len(substr) && (s == substr || len(s) > len(substr) && 
        (s[:len(substr)] == substr || s[len(s)-len(substr):] == substr || 
        s[1:len(s)-1] != s[1:len(s)-1] && contains(s[1:], substr)))
}

Issue: The condition s[1:len(s)-1] != s[1:len(s)-1] is always false, making the recursive call unreachable. This means the function only checks prefix/suffix, not substrings.

Recommendation: Use the standard library:

import "strings"

func contains(s, substr string) bool {
    return strings.Contains(s, substr)
}

Low Priority / Suggestions

10. Consider Adding Structured Health Check Types

File: internal/worker/health.go:68

The GetStatus() returns map[string]interface{} which loses type safety:

func (h *HealthTracker) GetStatus() map[string]interface{} {

Recommendation: Define a proper struct:

type HealthStatus struct {
    Status    string                  `json:"status"`
    Timestamp time.Time               `json:"timestamp"`
    Workers   map[string]WorkerHealth `json:"workers"`
}

func (h *HealthTracker) GetStatus() HealthStatus {
    // ...
}

This would:

  • Improve type safety
  • Enable better IDE support
  • Make it easier to add fields in the future
  • Match the OpenAPI spec more closely

11. Worker Name Constants Would Improve Maintainability

Files: Multiple worker files

Worker names are hardcoded strings scattered across multiple files:

  • "http-server" in http_worker.go:29
  • "deliverymq-consumer" in deliverymq_worker.go:39
  • "retrymq-consumer" in retrymq_worker.go:28

Recommendation: Define constants in a central location:

// internal/worker/names.go
package worker

const (
    WorkerNameHTTPServer        = "http-server"
    WorkerNameDeliveryMQConsumer = "deliverymq-consumer"
    WorkerNameRetryMQConsumer    = "retrymq-consumer"
    WorkerNamePublishMQConsumer  = "publishmq-consumer"
    WorkerNameLogMQConsumer      = "logmq-consumer"
)

12. Consider Adding Metrics for Worker Health Transitions

The worker health transitions (healthy → failed) are currently only logged. Consider emitting metrics for these transitions to enable better monitoring and alerting.

Example:

func (h *HealthTracker) MarkFailed(name string) {
    h.mu.Lock()
    defer h.mu.Unlock()
    
    oldStatus := h.workers[name].Status
    h.workers[name] = WorkerHealth{Status: WorkerStatusFailed}
    
    if oldStatus == WorkerStatusHealthy {
        // Emit metric: worker_health_transition{worker="name",from="healthy",to="failed"}
    }
}

13. ServiceBuilder Could Benefit from Functional Options Pattern

File: internal/services/builder.go:64

The ServiceBuilder has a fixed constructor but might benefit from options for testing or future extensibility:

func NewServiceBuilder(ctx context.Context, cfg *config.Config, logger *logging.Logger, 
    telemetry telemetry.Telemetry, opts ...BuilderOption) *ServiceBuilder

This would make it easier to:

  • Inject mock dependencies for testing
  • Configure shutdown timeouts
  • Add future configuration options without breaking changes
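
A minimal sketch of that options shape, assuming hypothetical names such as WithShutdownTimeout:

package services

import "time"

// BuilderOption configures a ServiceBuilder; the concrete options below are
// hypothetical examples.
type BuilderOption func(*builderOptions)

type builderOptions struct {
	shutdownTimeout time.Duration
}

// WithShutdownTimeout overrides the default graceful-shutdown timeout.
func WithShutdownTimeout(d time.Duration) BuilderOption {
	return func(o *builderOptions) { o.shutdownTimeout = d }
}

func applyOptions(opts ...BuilderOption) builderOptions {
	o := builderOptions{shutdownTimeout: 30 * time.Second} // default is an assumption
	for _, opt := range opts {
		opt(&o)
	}
	return o
}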

Testing

Excellent Test Coverage

The test suite (internal/worker/worker_test.go) is comprehensive and covers:

✅ Basic health tracking (healthy/failed states)
✅ Worker registration and duplicate prevention
✅ Graceful shutdown scenarios
✅ Non-cascading failure behavior
✅ Shutdown timeout configuration
✅ Concurrent access to health tracker
✅ Variable shutdown timing
✅ Edge cases (no workers, stuck workers, all workers exit)

Strengths:

  • Tests validate the core design goal of non-cascading failures
  • Good coverage of timing-related edge cases
  • Proper use of table-driven tests where appropriate
  • Clear test names that describe expected behavior

Minor suggestion: Consider adding integration tests that verify:

  • Health endpoint returns correct HTTP status codes (200 vs 503)
  • Multiple services running together
  • Cleanup order and resource management
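
A sketch of what the status-code check could look like with httptest; the handler here is a stand-in for however Outpost wires /healthz, not the real router.

package worker_test

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// healthzHandler mimics the documented 200/503 behaviour.
func healthzHandler(healthy bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		status, code := "healthy", http.StatusOK
		if !healthy {
			status, code = "failed", http.StatusServiceUnavailable
		}
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(map[string]string{"status": status})
	}
}

func TestHealthzStatusCodes(t *testing.T) {
	for _, tc := range []struct {
		healthy  bool
		wantCode int
	}{
		{healthy: true, wantCode: http.StatusOK},
		{healthy: false, wantCode: http.StatusServiceUnavailable},
	} {
		srv := httptest.NewServer(healthzHandler(tc.healthy))
		resp, err := http.Get(srv.URL + "/healthz")
		if err != nil {
			t.Fatalf("probe failed: %v", err)
		}
		if resp.StatusCode != tc.wantCode {
			t.Errorf("got %d, want %d", resp.StatusCode, tc.wantCode)
		}
		resp.Body.Close()
		srv.Close()
	}
}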

Documentation

Documentation Quality

Strengths:

  • PR description clearly explains the design decisions
  • Code comments explain the non-cascading failure behavior
  • OpenAPI spec properly updated with examples
  • Worker interface has clear documentation about expected behavior

Suggestions:
a. Add architecture documentation showing the flow: App → ServiceBuilder → Supervisor → Workers
b. Document the cleanup lifecycle and order of operations
c. Add examples of how to implement a custom worker
d. Consider adding a troubleshooting guide for common health check scenarios


Security Considerations

Security Review

Good: Error details are not exposed in health check responses (internal/worker/health.go:14)
Good: Health endpoint has no authentication requirement (appropriate for health checks)
Good: No sensitive information leaked in worker names or status

Note: The health endpoint is intentionally unauthenticated, which is correct for Kubernetes/load balancer health checks. However, consider documenting this security decision in the code or docs.


Breaking Changes

Breaking Change Handling

The health check endpoint response format changed from:

"OK" (text/plain)

to:

{"status": "healthy", "timestamp": "...", "workers": {...}}

Already Addressed:

  • ✅ README updated
  • ✅ OpenAPI spec updated
  • ✅ E2E tests updated
  • ✅ Team discussed version bump to 0.9.0

Recommendation: Add a migration note in CHANGELOG.md when releasing v0.9.0 to help users update their health check consumers.


Overall Assessment

This is a high-quality refactor that significantly improves the application's observability and reliability. The worker supervision pattern is well-designed and the implementation is clean with excellent test coverage.

Approval Recommendation: ✅ Approve with suggestions

The critical and high-priority issues should be addressed, but they don't block merging as they're mostly about improving robustness rather than fixing broken functionality. The current implementation is production-ready, and these improvements can be done as follow-up work if needed.

Suggested Next Steps:
a. Address critical issues 1-3 (double shutdown, context error checks, race condition documentation)
b. Consider high-priority issues 4-6 for a follow-up PR
c. Add integration tests for the full health check flow
d. Update CHANGELOG.md with migration notes for v0.9.0

Great work on this refactor! The architecture is solid and the code is well-tested. 🎉


@alexluong alexluong changed the base branch from main to v0.9.0 November 17, 2025 04:21
@alexluong alexluong merged commit 975f212 into v0.9.0 Nov 17, 2025
3 of 4 checks passed
@alexluong alexluong deleted the healthcheck branch November 17, 2025 04:21


Development

Successfully merging this pull request may close these issues.

Consumer Exits Permanently on Errors without Auto-Recovery
