feat: healthcheck & worker refactor #558
Conversation
@alexluong - just want to flag that the payload change is a breaking change and there may be some code that checks for "OK", including in the Outpost repo.

Can you reference this? I'm not aware of any usage of this. Good call regarding the breaking change. Seems we need to issue v0.9.0 for this in that case?

I'd definitely go for 0.9.0.

@alexluong - I had a quick look, but I can't find anything programmatically checking for "OK". I do see the main README mentions the response contents, so that will need to be changed. Worth looking at other docs, too.

@leggetter I looked and updated the README.

I also can't find anything else. In the website preview link, the API Reference is not updated, but I'm assuming this will be automatically updated once we release?
fmvilas left a comment:
LGTM 👍
@claude can you review please?
Claude finished @alexluong's task — Code Review Complete
### Overview

This is a well-architected refactor that introduces a robust worker supervision pattern and enhances the health check endpoint. The changes are substantial but well-structured, with excellent test coverage and clear separation of concerns.

**Key Strengths:**

**Areas for Improvement:**
### Critical Issues

#### 1. Double HTTP Server Shutdown Could Cause Issues

**Files:** `builder.go`, `http_worker.go`

The HTTP server shutdown is registered in the cleanup functions AND handled separately in the HTTPServerWorker. This could lead to double-shutdown attempts:

```go
// In builder.go - cleanup function added
targetSvc.cleanupFuncs = append(targetSvc.cleanupFuncs, func(ctx context.Context, logger *logging.LoggerWithCtx) {
    if err := httpServer.Shutdown(ctx); err != nil {
        logger.Error("error shutting down http server", zap.Error(err))
    }
    logger.Info("http server shut down")
})
```

But HTTPServerWorker also shuts down the server in its Run method when the context is cancelled:

```go
// In http_worker.go:50-56
if err := w.server.Shutdown(shutdownCtx); err != nil {
    logger.Error("error shutting down http server", zap.Error(err))
    return err
}
```

**Impact:** Could cause "server already closed" errors or race conditions during shutdown.

**Recommendation:** Remove the HTTP server shutdown from the cleanup functions since the HTTPServerWorker handles it. The cleanup functions should only handle resources that workers don't already manage.

#### 2. Missing Context Error Check Could Hide Shutdown Issues

**File:**

The error handling doesn't account for graceful shutdown properly:

```go
if err := csm.Run(ctx); !errors.Is(err, ctx.Err()) {
    logger.Error("error running deliverymq consumer", zap.Error(err))
    return err
}
return nil
```

**Issue:** If the consumer's error doesn't line up with whatever `ctx.Err()` returns at that instant (for example, a `context.DeadlineExceeded` error while `ctx.Err()` is `context.Canceled`), shutdown-related errors are classified inconsistently.

**Recommendation:** Use a more robust check:

```go
if err := csm.Run(ctx); err != nil {
    if !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded) {
        logger.Error("error running deliverymq consumer", zap.Error(err))
        return err
    }
}
return nil
```

This pattern appears in multiple workers, so the same fix applies to each of them.
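For illustration, the repeated check could live in one small helper that every consumer worker reuses. This is a sketch only; the `ignoreCtxErr` name and its package placement are placeholders, not Outpost's actual code.

```go
package worker

import (
    "context"
    "errors"
)

// ignoreCtxErr treats pure cancellation/deadline errors as a clean shutdown
// and passes every other error through unchanged.
func ignoreCtxErr(err error) error {
    if err == nil || errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
        return nil
    }
    return err
}
```

Each worker's Run body would then reduce to something like `return ignoreCtxErr(csm.Run(ctx))`, keeping the shutdown semantics identical across consumers.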
#### 3. Potential Race Between Worker Failure and Health Check

**File:**

There's a small window where a worker could fail between being marked healthy and actually starting:

```go
r.logger.Debug("worker starting", zap.String("worker", name))
r.health.MarkHealthy(name)

// Run the worker
if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
```

**Issue:** If a worker fails immediately in `Run`, there is a brief window in which the health endpoint still reports it as healthy.

**Recommendation:** Consider marking workers as healthy only after they successfully start (though this would require workers to signal when they're ready, which is a bigger change). For now, this is acceptable but worth documenting. Alternatively, add a comment explaining this trade-off.
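As a rough sketch of that readiness-signal idea (the interface and names below are hypothetical, not the existing Outpost worker contract), a worker could accept a callback it invokes only once it is actually serving:

```go
package worker

import "context"

// ReadyWorker is a hypothetical extension of the worker contract: Run receives
// a ready callback and calls it once the worker has actually started serving.
type ReadyWorker interface {
    Run(ctx context.Context, ready func()) error
}

// runWithReadiness marks the worker healthy from inside the callback instead
// of before Run is called, closing the window where a failed worker looks healthy.
func runWithReadiness(ctx context.Context, name string, w ReadyWorker, markHealthy func(string)) error {
    return w.Run(ctx, func() {
        markHealthy(name)
    })
}
```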
logger.Debug("cleaning up service", zap.String("service", svc.name))
for _, cleanupFunc := range svc.cleanupFuncs {
cleanupFunc(ctx, &logger)
}
}Recommendation: Example: var cleanupErrors []error
for i := len(b.services) - 1; i >= 0; i-- {
svc := b.services[i]
logger.Debug("cleaning up service", zap.String("service", svc.name))
for j := len(svc.cleanupFuncs) - 1; j >= 0; j-- {
func() {
defer func() {
if r := recover(); r != nil {
logger.Error("panic during cleanup", zap.Any("panic", r))
}
}()
svc.cleanupFuncs[j](ctx, &logger)
}()
}
}5. Incomplete Error Handling in Supervisor.RunFile: When a worker returns if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
r.logger.Error("worker failed",
zap.String("worker", name),
zap.Error(err))
r.health.MarkFailed(name)
} else {
r.logger.Info("worker stopped gracefully", zap.String("worker", name))
}Issue: The supervisor's return value only indicates whether context was cancelled or all workers exited, but doesn't preserve individual worker errors for debugging. Recommendation: Consider collecting worker errors and returning them as a multi-error, or at minimum, ensure all errors are logged with sufficient detail for debugging. 6. Missing Validation for Duplicate Health Check PathsFile: The base router registers health checks at both // TODO: Rethink API versioning strategy in the future.
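One possible shape for the multi-error suggestion, sketched under assumptions: the `Worker` interface and `runAll` function below are illustrative, not Outpost's supervisor API. `errors.Join` from the standard library aggregates the per-worker failures:

```go
package worker

import (
    "context"
    "errors"
    "fmt"
    "sync"
)

// Worker is assumed to match the interface the supervisor already uses.
type Worker interface {
    Run(ctx context.Context) error
}

// runAll runs every worker and returns the non-cancellation errors joined
// together, so no individual failure is lost.
func runAll(ctx context.Context, workers map[string]Worker) error {
    var (
        wg   sync.WaitGroup
        mu   sync.Mutex
        errs []error
    )
    for name, w := range workers {
        wg.Add(1)
        go func(name string, w Worker) {
            defer wg.Done()
            if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
                mu.Lock()
                errs = append(errs, fmt.Errorf("worker %s: %w", name, err))
                mu.Unlock()
            }
        }(name, w)
    }
    wg.Wait()
    return errors.Join(errs...) // nil when every worker stopped cleanly
}
```

The caller then sees every worker failure in one error value, while cancellation-driven exits still return nil.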
#### 6. Missing Validation for Duplicate Health Check Paths

**File:**

The base router registers health checks at both `/healthz` and `/api/v1/healthz`:

```go
// TODO: Rethink API versioning strategy in the future.
// For now, we expose health check at both /healthz and /api/v1/healthz for backwards compatibility.
```

**Recommendation:**

### Medium Priority

#### 7. ServiceBuilder Context Stored Could Cause Confusion

**File:**

The ServiceBuilder stores a context from initialization, but workers are expected to use the context passed to their Run method:

```go
type ServiceBuilder struct {
    ctx context.Context // Stored at creation
    cfg *config.Config
    // ...
}
```

Then later, services use:

```go
redisClient, err := redis.New(b.ctx, cfg.Redis.ToConfig())
```

**Issue:** This creates confusion about the context lifecycle. If the stored context is cancelled early, calls like `redis.New(b.ctx, ...)` fail in ways that are hard to trace back to the builder.

**Recommendation:**

#### 8. Health Tracker Timestamp Could Be Stale

**File:**

The timestamp in the health response is generated when the status is requested, not when a worker's state last changed:

```go
return map[string]interface{}{
    "status":    status,
    "timestamp": time.Now(), // Now, not when state changed
    "workers":   workers,
}
```

**Recommendation:** Consider storing timestamps with each state change:

```go
type WorkerHealth struct {
    Status    string    `json:"status"`
    UpdatedAt time.Time `json:"updated_at,omitempty"`
}
```

This would provide more accurate information about when failures occurred.
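A minimal sketch of that suggestion; the type and field names follow the proposal above, and the `HealthTracker` shape is assumed rather than taken from the actual code:

```go
package worker

import (
    "sync"
    "time"
)

const (
    WorkerStatusHealthy = "healthy"
    WorkerStatusFailed  = "failed"
)

type WorkerHealth struct {
    Status    string    `json:"status"`
    UpdatedAt time.Time `json:"updated_at,omitempty"`
}

// HealthTracker records per-worker status plus the moment it last changed.
type HealthTracker struct {
    mu      sync.Mutex
    workers map[string]WorkerHealth
}

func (h *HealthTracker) set(name, status string) {
    h.mu.Lock()
    defer h.mu.Unlock()
    if h.workers == nil {
        h.workers = make(map[string]WorkerHealth)
    }
    // UpdatedAt is captured at the state change, so the health response can
    // report when a worker actually failed rather than when it was queried.
    h.workers[name] = WorkerHealth{Status: status, UpdatedAt: time.Now()}
}

func (h *HealthTracker) MarkHealthy(name string) { h.set(name, WorkerStatusHealthy) }
func (h *HealthTracker) MarkFailed(name string)  { h.set(name, WorkerStatusFailed) }
```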
#### 9. Test Contains Helper Function Has a Bug

**File:**

The test file defines its own `contains` helper:

```go
func contains(s, substr string) bool {
    return len(s) >= len(substr) && (s == substr || len(s) > len(substr) &&
        (s[:len(substr)] == substr || s[len(s)-len(substr):] == substr ||
            s[1:len(s)-1] != s[1:len(s)-1] && contains(s[1:], substr)))
}
```

**Issue:** The condition `s[1:len(s)-1] != s[1:len(s)-1]` is always false, so the recursive branch never executes and the helper only matches prefixes and suffixes, not substrings in the middle.

**Recommendation:** Use the standard library:

```go
import "strings"

func contains(s, substr string) bool {
    return strings.Contains(s, substr)
}
```

### Low Priority / Suggestions

#### 10. Consider Adding Structured Health Check Types

**File:**

The status method currently returns an untyped map:

```go
func (h *HealthTracker) GetStatus() map[string]interface{} {
```

**Recommendation:** Define a proper struct:

```go
type HealthStatus struct {
    Status    string                  `json:"status"`
    Timestamp time.Time               `json:"timestamp"`
    Workers   map[string]WorkerHealth `json:"workers"`
}

func (h *HealthTracker) GetStatus() HealthStatus {
    // ...
}
```

This would:

#### 11. Worker Name Constants Would Improve Maintainability

**Files:** Multiple worker files

Worker names are hardcoded strings scattered across multiple files:

**Recommendation:** Define constants in a central location:

```go
// internal/worker/names.go
package worker

const (
    WorkerNameHTTPServer         = "http-server"
    WorkerNameDeliveryMQConsumer = "deliverymq-consumer"
    WorkerNameRetryMQConsumer    = "retrymq-consumer"
    WorkerNamePublishMQConsumer  = "publishmq-consumer"
    WorkerNameLogMQConsumer      = "logmq-consumer"
)
```

#### 12. Consider Adding Metrics for Worker Health Transitions

The worker health transitions (healthy → failed) are currently only logged. Consider emitting metrics for these transitions to enable better monitoring and alerting. Example:

```go
func (h *HealthTracker) MarkFailed(name string) {
    h.mu.Lock()
    defer h.mu.Unlock()
    oldStatus := h.workers[name].Status
    h.workers[name] = WorkerHealth{Status: WorkerStatusFailed}
    if oldStatus == WorkerStatusHealthy {
        // Emit metric: worker_health_transition{worker="name",from="healthy",to="failed"}
    }
}
```
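For example, with Prometheus — a sketch assuming `prometheus/client_golang` is (or would be) a dependency; the metric name and labels are illustrative:

```go
package worker

import "github.com/prometheus/client_golang/prometheus"

// workerHealthTransitions counts state changes per worker so alerts can fire
// on healthy→failed transitions rather than on log lines.
var workerHealthTransitions = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "worker_health_transition_total",
        Help: "Number of worker health state transitions.",
    },
    []string{"worker", "from", "to"},
)

func init() {
    prometheus.MustRegister(workerHealthTransitions)
}

// recordTransition would be called from MarkHealthy/MarkFailed when the status changes.
func recordTransition(worker, from, to string) {
    workerHealthTransitions.WithLabelValues(worker, from, to).Inc()
}
```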
#### 13. ServiceBuilder Could Benefit from Functional Options Pattern

**File:**

The ServiceBuilder has a fixed constructor but might benefit from options for testing or future extensibility:

```go
func NewServiceBuilder(ctx context.Context, cfg *config.Config, logger *logging.Logger,
    telemetry telemetry.Telemetry, opts ...BuilderOption) *ServiceBuilder
```

This would make it easier to:
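A compact illustration of the pattern; everything here — the option names, the trimmed struct fields, the defaults — is hypothetical, not the real ServiceBuilder:

```go
package builder

import "time"

// ServiceBuilder fields are trimmed to what the example needs; the real
// struct carries config, logger, telemetry, and more.
type ServiceBuilder struct {
    shutdownTimeout time.Duration
    now             func() time.Time
}

// BuilderOption is the functional-options hook; the name is an assumption.
type BuilderOption func(*ServiceBuilder)

// WithShutdownTimeout and WithClock are hypothetical options, mostly useful in tests.
func WithShutdownTimeout(d time.Duration) BuilderOption {
    return func(b *ServiceBuilder) { b.shutdownTimeout = d }
}

func WithClock(now func() time.Time) BuilderOption {
    return func(b *ServiceBuilder) { b.now = now }
}

// NewServiceBuilder applies defaults first, then any caller-supplied options.
func NewServiceBuilder(opts ...BuilderOption) *ServiceBuilder {
    b := &ServiceBuilder{
        shutdownTimeout: 30 * time.Second,
        now:             time.Now,
    }
    for _, opt := range opts {
        opt(b)
    }
    return b
}
```

Tests could then call `NewServiceBuilder(WithClock(fakeNow))` without touching production defaults.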
### Testing

**Excellent Test Coverage**

The test suite covers:

- ✅ Basic health tracking (healthy/failed states)

**Strengths:**

**Minor suggestion:** Consider adding integration tests that verify:
### Documentation

**Documentation Quality**

**Strengths:**

**Suggestions:**

### Security Considerations

**Security Review**

✅ Good: Error details are not exposed in health check responses (internal/worker/health.go:14)

Note: The health endpoint is intentionally unauthenticated, which is correct for Kubernetes/load balancer health checks. However, consider documenting this security decision in the code or docs.

### Breaking Changes

**Breaking Change Handling**

The health check endpoint response format changed from a plain-text "OK" body to:

```json
{"status": "healthy", "timestamp": "...", "workers": {...}}
```

**Already Addressed:**
**Recommendation:** Add a migration note in CHANGELOG.md when releasing v0.9.0 to help users update their health check consumers.

### Overall Assessment

This is a high-quality refactor that significantly improves the application's observability and reliability. The worker supervision pattern is well-designed and the implementation is clean with excellent test coverage.

**Approval Recommendation:** ✅ Approve with suggestions

The critical and high-priority issues should be addressed, but they don't block merging as they're mostly about improving robustness rather than fixing broken functionality. The current implementation is production-ready, and these improvements can be done as follow-up work if needed.

**Suggested Next Steps:**

Great work on this refactor! The architecture is solid and the code is well-tested. 🎉
resolves #555, #306
## Health Check Endpoint

This PR updates the `/healthz` endpoint to report the real-time health status of all workers in the application. The endpoint returns:

**Healthy response (200):**

```json
{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:00Z",
  "workers": {
    "http-server": {"status": "healthy"},
    "retrymq-consumer": {"status": "healthy"},
    "deliverymq-consumer": {"status": "healthy"}
  }
}
```

**Unhealthy response (503):**

```json
{
  "status": "failed",
  "timestamp": "2025-01-15T10:30:00Z",
  "workers": {
    "http-server": {"status": "healthy"},
    "retrymq-consumer": {"status": "healthy"},
    "deliverymq-consumer": {"status": "failed"}
  }
}
```

**Usage:** Configure Kubernetes liveness/readiness probes or load balancer health checks to monitor this endpoint. When any worker fails, the endpoint returns 503, allowing orchestrators to automatically restart the pod/container.
**Note:** For `service=delivery` and `service=log` modes, an HTTP server is started solely to serve the `/healthz` endpoint on the configured API port, allowing health checks even for worker-only services.

## Worker Supervision Architecture

The application now uses a WorkerSupervisor pattern to manage long-running workers (HTTP server, MQ consumers, retry scheduler). Key design decisions:

This architecture provides visibility into worker health while maintaining availability during partial failures.
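To make the pattern concrete, here is a heavily simplified sketch of a worker supervisor: each worker runs in its own goroutine, its health is tracked, and a failure in one worker does not stop the others. It is illustrative only and does not reflect Outpost's actual types, restart behavior, or shutdown handling.

```go
package worker

import (
    "context"
    "errors"
    "sync"
)

type Worker interface {
    Run(ctx context.Context) error
}

type Supervisor struct {
    mu      sync.Mutex
    status  map[string]string // "healthy" or "failed"
    workers map[string]Worker
}

func NewSupervisor(workers map[string]Worker) *Supervisor {
    return &Supervisor{status: map[string]string{}, workers: workers}
}

// Run starts every worker and blocks until all of them return. A failing
// worker is marked failed but does not stop its siblings.
func (s *Supervisor) Run(ctx context.Context) {
    var wg sync.WaitGroup
    for name, w := range s.workers {
        wg.Add(1)
        go func(name string, w Worker) {
            defer wg.Done()
            s.mark(name, "healthy")
            if err := w.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
                s.mark(name, "failed")
            }
        }(name, w)
    }
    wg.Wait()
}

func (s *Supervisor) mark(name, status string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.status[name] = status
}

// Healthy reports whether every worker is currently healthy; a /healthz
// handler would return 200 or 503 based on this.
func (s *Supervisor) Healthy() bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    for _, st := range s.status {
        if st != "healthy" {
            return false
        }
    }
    return true
}
```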