Observability Overview
Observability is the property of a running system that lets you answer questions about its internal state from the outside, without redeploying, without attaching a debugger, and without having anticipated the question when the code was written. CodeQ treats observability as a first-class concern of the broker, not an afterthought layered on by a separate sidecar. Every task that enters the system is born with a W3C trace context, every state transition is counted by a Prometheus collector, every hot loop is reachable through pprof, and every log line is structured (JSON by default). The four pillars (distributed tracing, metrics, profiling, and structured logging) share a single principle: the data already exists inside the process, and the operator's job is to ask for it.
This section of the wiki documents how each pillar is wired, what it costs to enable, what questions it answers well, and where its blind spots lie. The pages assume you are operating a CodeQ broker in production rather than reading the source for the first time, but each page begins with enough context for a new operator to find the relevant code under internal/ and pkg/. The bias throughout is toward letting you correlate signals quickly: a Prometheus alert names a metric, a metric points at a log line, a log line carries a traceparent, a traceparent opens a trace in Jaeger, a slow span motivates a pprof capture. That chain is the thing that converts a 3 a.m. page into a fix rather than an outage that survives the on-call rotation.
Each pillar in CodeQ exists to answer a particular shape of question, and choosing the wrong one is a common mistake on first encounter. Distributed tracing is the right tool when the question is causal: which step took the time, which downstream call failed, which retry happened where, which producer originated this task. Metrics are the right tool when the question is quantitative: how many tasks per second, what is the p99 of claim latency, how many webhook deliveries failed in the last hour, is the lease-expiration rate climbing. Profiling is the right tool when the question is internal: which Go function holds the CPU, which mutex is contended, which goroutine is blocked on what. Structured logging is the right tool when the question is narrative: what did this specific task do, in what order, with what arguments, with what error.
Reaching for the wrong pillar is how outages last too long. A common failure mode is trying to use logs to answer a quantitative question: counting log lines to estimate throughput, or grepping for "error" to estimate failure rates. Counting log lines yields a total, not a latency distribution; a percentile needs bucketed observations, which is what a Prometheus histogram records. The inverse failure mode is using metrics to answer a causal question: staring at codeq_task_completed_total{status="failed"} climbing without being able to name a single task that failed or what its payload looked like. Metrics tell you that something is wrong; traces tell you which thing is wrong; profiles tell you why that thing is slow; logs tell you what that thing was carrying.
- Distributed tracing answers which step and which causal chain.
- Metrics answer how many and how slow at which percentile.
- Profiling answers which function and which mutex.
- Structured logging answers what did this specific task do.
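To make the point about percentiles concrete, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach Prometheus's histogram_quantile takes for classic histograms. The bucket bounds and counts are made up for illustration and are not CodeQ's actual bucket layout; the names bucket, estimateQuantile, and exampleBuckets are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket mirrors one cumulative histogram bucket: the number of
// observations less than or equal to the upper bound le.
type bucket struct {
	le    float64 // upper bound in seconds; +Inf for the last bucket
	count float64 // cumulative count of observations <= le
}

// estimateQuantile approximates the q-quantile by linear interpolation
// inside the bucket that contains the target rank. It assumes the last
// bucket is the +Inf bucket.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	sort.Slice(buckets, func(i, j int) bool { return buckets[i].le < buckets[j].le })
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Find the first finite bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets)-1, func(i int) bool { return buckets[i].count >= rank })
	if i == len(buckets)-1 {
		// The rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].le
	}
	lower, prev := 0.0, 0.0
	if i > 0 {
		lower, prev = buckets[i-1].le, buckets[i-1].count
	}
	if buckets[i].count == prev {
		return lower
	}
	return lower + (buckets[i].le-lower)*(rank-prev)/(buckets[i].count-prev)
}

// exampleBuckets returns illustrative (made-up) latency buckets.
func exampleBuckets() []bucket {
	return []bucket{
		{0.5, 900}, {1, 950}, {2, 980}, {5, 995}, {math.Inf(1), 1000},
	}
}

func main() {
	fmt.Printf("estimated p99: %.2fs\n", estimateQuantile(0.99, exampleBuckets()))
	fmt.Printf("estimated p50: %.2fs\n", estimateQuantile(0.50, exampleBuckets()))
}
```

The estimate is only as fine as the bucket boundaries: with these numbers the 99th percentile lands somewhere between the 2-second and 5-second bounds, and interpolation picks a point inside that range. A count of log lines carries no such distribution at all.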
The broker's surface area for observability is small and stable. Prometheus metrics are exposed at GET /metrics on the API port, registered in pkg/app/url_mappings.go:12 as a Gin wrapper around promhttp.Handler(). The metric definitions live in internal/metrics/metrics.go: counters for the task lifecycle (created, claimed, completed, lease-expired), a latency histogram for end-to-end task processing, counters for webhook deliveries and rate-limit rejections. A separate Prometheus collector in internal/metrics/redis_collector.go exports queue depth and DLQ depth by command. Distributed tracing is bootstrapped in internal/tracing/tracing.go, which configures an OTLP gRPC exporter, a TraceIDRatioBased parent-aware sampler, and the W3C TraceContext text-map propagator. The propagator is set even when tracing is disabled so that incoming traceparent headers from upstream services are preserved through the broker. Profiling endpoints are mounted under /debug/pprof/* on a separate listener controlled by the CODEQ_PPROF=1 environment variable, with the bind address configurable through CODEQ_PPROF_ADDR (default :6060), as seen in cmd/server/main.go:69-79. Structured logging is log/slog configured at pkg/app/application.go:113-129: a level variable derived from the logLevel config field, a JSON handler by default, a text handler when logFormat: text, and a service-level label of service=codeq attached to every record.
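The profiling listener described above can be sketched with the standard library alone. This is an illustration of the gating pattern, not a copy of cmd/server/main.go: the helper names pprofAddr and newPprofMux are hypothetical, and only the environment variable names and the :6060 default come from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/pprof"
	"os"
)

// pprofAddr reports whether profiling is enabled and which address to bind,
// based on CODEQ_PPROF and CODEQ_PPROF_ADDR. The lookup function is injected
// so the logic can be exercised without touching the real environment.
func pprofAddr(getenv func(string) string) (addr string, enabled bool) {
	if getenv("CODEQ_PPROF") != "1" {
		return "", false
	}
	addr = getenv("CODEQ_PPROF_ADDR")
	if addr == "" {
		addr = ":6060" // documented default
	}
	return addr, true
}

// newPprofMux mounts the standard pprof handlers on a dedicated mux so they
// never share a router with the public API port.
func newPprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Index also serves named profiles such as /debug/pprof/mutex and /heap.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

func main() {
	addr, ok := pprofAddr(os.Getenv)
	if !ok {
		fmt.Println("pprof disabled; set CODEQ_PPROF=1 to enable")
		return
	}
	fmt.Println("pprof listening on", addr)
	if err := http.ListenAndServe(addr, newPprofMux()); err != nil {
		fmt.Fprintln(os.Stderr, "pprof listener failed:", err)
	}
}
```

Keeping the handlers on their own mux and port means the profiling surface can be firewalled separately from the API, and leaving it off by default means an unconfigured deployment exposes nothing.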
The key design choice is that the four pillars share context. The same context.Context that carries the OpenTelemetry span through SchedulerService.CreateTask (internal/services/scheduler_service.go:95-104) is the context whose traceparent gets persisted onto the domain.Task as TraceParent and TraceState (pkg/domain/task.go:42-43). Later, when a worker claims that task, the broker rebuilds the trace context from those fields using tracing.ContextWithRemoteParent (internal/tracing/tracing.go:142) so that the claim, heartbeat, result submission, and webhook delivery all hang off the same root trace. That single design choice — store the trace context on the task instead of attempting to reconstruct it — is what makes the trace causally complete across the asynchronous boundary between producer and worker. Without it you would see one trace for "create" and a disconnected trace for "claim," and the two would be impossible to correlate after the fact.
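The idea of carrying the trace context on the task can be illustrated without OpenTelemetry. The sketch below is an assumption-laden stand-in: Task is a reduced, hypothetical version of domain.Task, and parseTraceParent and childTraceParent are illustrative helpers, not the broker's tracing.ContextWithRemoteParent. It shows only the invariant that matters: a span started at claim time reuses the trace ID stored at create time.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// Task is a hypothetical stand-in for domain.Task, reduced to the fields
// that carry the W3C trace context across the queue.
type Task struct {
	ID          string
	TraceParent string
	TraceState  string
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceParent splits a version-00 traceparent into trace ID, parent
// span ID, and flags, rejecting malformed or all-zero identifiers.
func parseTraceParent(h string) (traceID, spanID, flags string, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", "", errors.New("unsupported traceparent format")
	}
	if !isLowerHex(parts[1], 32) || !isLowerHex(parts[2], 16) || !isLowerHex(parts[3], 2) {
		return "", "", "", errors.New("malformed traceparent field")
	}
	if parts[1] == strings.Repeat("0", 32) || parts[2] == strings.Repeat("0", 16) {
		return "", "", "", errors.New("all-zero trace or span ID")
	}
	return parts[1], parts[2], parts[3], nil
}

// childTraceParent keeps the stored trace ID and flags but mints a fresh
// span ID, which is what continuing a trace from a remote parent amounts to.
func childTraceParent(parent string) (string, error) {
	traceID, _, flags, err := parseTraceParent(parent)
	if err != nil {
		return "", err
	}
	var span [8]byte
	if _, err := rand.Read(span[:]); err != nil {
		return "", err
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, hex.EncodeToString(span[:]), flags), nil
}

func main() {
	// Producer side: the traceparent active during creation is stored on the task.
	task := Task{ID: "task-1", TraceParent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}

	// Worker side, possibly much later: the claim continues the stored trace.
	claim, err := childTraceParent(task.TraceParent)
	if err != nil {
		fmt.Println("cannot continue trace:", err)
		return
	}
	fmt.Println("create:", task.TraceParent)
	fmt.Println("claim: ", claim)
}
```

Both printed values share the same 32-character trace ID, so a tracing backend groups them into one trace even though no in-memory context survived between the two calls.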
Suppose your on-call phone goes off because a Prometheus alert fires: histogram_quantile(0.99, rate(codeq_task_processing_latency_seconds_bucket[5m])) > 2. Your service-level objective for end-to-end task latency at the 99th percentile is two seconds, and it has just been crossed for the third consecutive minute. The four pillars now compose into a routine.
You open Grafana, pull up the dashboard that graphs codeq_task_processing_latency_seconds, and confirm the spike is real and current rather than a stale evaluation. The same dashboard breaks the metric out by the command label, and you see the spike is concentrated on a single command — say, generate_master. You pivot to codeq_task_completed_total filtered to that command and observe that the completion rate has dropped to roughly a third of its baseline while the creation rate is steady, which tells you the bottleneck is downstream of enqueue. You now have a quantitative picture: this command, this time window, this magnitude. Metrics have done their job.
You switch to the broker's structured logs in your aggregator, filter by level=warn OR level=error, and scope to the same five-minute window. A handful of "result callback failed" lines appear from internal/services/result_callback_service.go:148, each carrying a traceparent field because slog records inherit the span context that submitted the result. You grab one traceparent (a hex string of the form 00-<32 hex>-<16 hex>-01) and paste it into Jaeger's lookup box. The trace opens. The root span is codeq.task.create, attributed with codeq.command=generate_master, codeq.priority=5, codeq.tenant_id=acme. Below it sits codeq.task.claim, then codeq.task.submit_result, then codeq.webhook.task_result. The webhook span is 1.8 seconds wide; everything else is microseconds. The trace tells you the latency is in the outbound webhook call, not in CodeQ's internal path. Tracing has done its job.
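One way a log record can pick up a traceparent from the request context is a thin wrapper around an slog handler. The sketch below is a guess at the general pattern, not the broker's code: it stores the traceparent as a plain context value (the real broker presumably derives it from the OpenTelemetry span context), and the names traceHandler, withTraceParent, newLogger, and logOnce are hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

type traceParentKey struct{}

// withTraceParent stores a traceparent string on the context (a stand-in
// for a real span context).
func withTraceParent(ctx context.Context, tp string) context.Context {
	return context.WithValue(ctx, traceParentKey{}, tp)
}

// traceHandler wraps another slog.Handler and copies the traceparent found
// on the context into every record it handles.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if tp, ok := ctx.Value(traceParentKey{}).(string); ok && tp != "" {
		r = r.Clone() // avoid mutating state shared with other copies
		r.AddAttrs(slog.String("traceparent", tp))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the result so derived loggers keep the
// traceparent behaviour.
func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{h.Handler.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{h.Handler.WithGroup(name)}
}

// newLogger builds a JSON logger with the service label on every record.
func newLogger(w io.Writer) *slog.Logger {
	base := slog.NewJSONHandler(w, &slog.HandlerOptions{Level: slog.LevelInfo})
	return slog.New(traceHandler{base}).With("service", "codeq")
}

// logOnce emits one warning under the given traceparent and returns the
// decoded JSON record.
func logOnce(tp string) (map[string]any, error) {
	var buf bytes.Buffer
	ctx := withTraceParent(context.Background(), tp)
	newLogger(&buf).WarnContext(ctx, "result callback failed", "command", "generate_master")
	rec := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

func main() {
	rec, err := logOnce("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("traceparent on record:", rec["traceparent"])
}
```

The wrapper only fires for the context-aware logging calls (WarnContext, ErrorContext, and so on), which is why the convention of passing the request context into every log call matters for correlation.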
You now know what is wrong (slow webhook delivery) and which command is affected, but you do not yet know whether the symptom is on your side or the receiver's. With the pprof listener enabled (CODEQ_PPROF=1), you hit /debug/pprof/profile?seconds=30 on one node to take a 30-second CPU sample, saved as cpu.pb.gz, then /debug/pprof/mutex for a mutex profile, saved as mutex.pb.gz. Running go tool pprof -top -cum cpu.pb.gz shows the broker is mostly idle; running go tool pprof -top -cum mutex.pb.gz shows no contention near http.Transport. Profiling has ruled out the broker as the source of slowness, which is itself useful information; the next call is to the webhook receiver's team. Profiling has done its job, even though the answer was that the broker is not at fault.
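A mutex profile is only populated when mutex sampling is switched on, which the Go runtime leaves off by default. The following sketch shows the mechanism in isolation: it enables sampling with runtime.SetMutexProfileFraction, runs an artificial contended workload, and serializes the profile in the format go tool pprof reads. The workload and the helper names contend and captureMutexProfile are hypothetical and unrelated to the broker's own code paths.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// contend makes several goroutines fight over one mutex so the profile has
// something to record. It returns the final counter value.
func contend(workers, iterations int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counter := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter
}

// captureMutexProfile enables mutex sampling, runs the workload, and returns
// the profile as gzip-compressed protobuf bytes.
func captureMutexProfile(workload func()) ([]byte, error) {
	// A fraction of 1 samples every contention event; larger values sample
	// fewer events. The previous setting is restored afterwards.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	workload()

	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := captureMutexProfile(func() { contend(8, 10000) })
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of mutex profile\n", len(data))
}
```

Writing the returned bytes to a file such as mutex.pb.gz produces something that go tool pprof -top -cum can open, the same way it opens a capture fetched from /debug/pprof/mutex.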
You write up the incident with three artifacts: the Prometheus screenshot, the Jaeger trace ID, and the pprof captures. The structured logs are already in your aggregator with their traceparent field, so anyone reading the postmortem can walk the same chain in reverse. None of these steps required redeploying the broker, restarting a node, or attaching a debugger. That is what observable means in this codebase.
The four pages that follow take each pillar in turn and go deep.
- Observability Tracing covers W3C trace context propagation, the span attributes emitted by SchedulerService.CreateTask and the repository layer, the OTLP exporter configuration knobs, and how webhook callbacks carry traceparent so downstream services continue the trace rather than starting a new one.
- Observability Metrics covers the Prometheus exposition format, the specific counters and histograms emitted by the broker, why histograms beat counters when the question is about latency, and how to point a Prometheus scrape job at the :8080/metrics endpoint.
- Observability Profiling covers the pprof endpoints, the bench harnesses under pkg/app/raft_profile_bench_test.go and pkg/app/raft_grpc_profile_test.go, the role of runtime.SetMutexProfileFraction, and the two real optimizations that came out of mutex profiles: the group-commit coalescer on the Pebble write path (96% time in commitPipeline collapsed to roughly 60%) and the Raft Apply coalescer over the HTTP REST path (28.74% time in http.Transport.tryPutIdleConn largely eliminated).
- Observability Logging covers the slog configuration, level selection, JSON versus text output, and the convention of including traceparent on every record that participates in a traced lifecycle.
The throughline is correlation. A metric without a way back to a trace is hard to act on; a trace without a way back to a log line is hard to confirm; a log line without a way back to a profile is hard to optimize. CodeQ wires all four together by sharing the same context.Context and the same W3C trace identifiers across every boundary it controls.
Source: github.com/osvaldoandrade/codeq.