Traditional distributed tracing flows forward and relies on head sampling, so it often misses the very errors you need to debug: the decision to trace is made before the failure occurs.
There are efforts to bring tail sampling to distributed tracing.
This project creates a lab in which you can arrange the services into any call tree you need and also simulate errors. The tree-like call pattern is configured through docker-compose environment variables plus HTTP headers.
Two middlewares (inbound and outbound) perform backward error accumulation: by intercepting error metadata and bubbling it up the call chain, they build a distributed stack trace (in the form of a tree) that captures 100% of the failure context without the cost of full tracing.
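The accumulation step can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `accumulate` is hypothetical, the node fields and the `[DRIFT: ...]` marker are copied from the example output in this README, and the rule "flag drift when this service's code differs from a child's code" is an assumption inferred from that output.

```python
import json
from typing import Sequence


def accumulate(service: str, code: str, message: str,
               child_headers: Sequence[str]) -> str:
    """Build this service's error-tree header value (hypothetical helper).

    child_headers holds the error-tree header values that the outbound
    middleware captured from failed downstream calls.
    """
    children = [json.loads(h) for h in child_headers if h]

    # Assumed drift rule: this service reports a different code than a child.
    # Only the first differing child is named in the marker.
    drifted = [c["code"] for c in children if c.get("code") != code]
    if drifted:
        message = f"[DRIFT: {drifted[0]} -> {code}] {message}"

    node = {"service": service, "status": "error",
            "code": code, "error": message}
    if children:
        node["children"] = children

    # Compact, ASCII-only JSON so the value can travel in an HTTP header.
    return json.dumps(node, separators=(",", ":"), ensure_ascii=True)


# Usage: a leaf failure bubbling up through one intermediate service.
leaf = accumulate("service-e", "permission-denied",
                  "an error occurred in service service-e", [])
mid = accumulate("service-b", "not-found",
                 "an upstream dependency failed in service service-b", [leaf])
print(mid)
```

Each service only needs the header values returned by its direct children, so the tree grows one level per hop on the way back to the root.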
The final result is an HTTP header (x-error-tree) containing the error tree, encoded as JSON:

    {
      "service": "root",
      "status": "error",
      "code": "not-found",
      "error": "an upstream dependency failed in service service-b",
      "children": [
        {
          "service": "service-b",
          "status": "error",
          "code": "not-found",
          "error": "[DRIFT: permission-denied -> not-found] an upstream dependency failed in service service-b",
          "children": [
            {
              "service": "service-e",
              "status": "error",
              "code": "permission-denied",
              "error": "an error occurred in service service-e"
            }
          ]
        }
      ]
    }

In microservice architectures, an error at the root often masks the true cause.
- The Context Gap: Root calls return a generic `500 Internal Server Error`, losing the specifics of the downstream failure.
- Sampling Issues: Distributed tracing is expensive and usually sampled (e.g., 1%). If the root call decides not to sample, diagnosing the root error takes much more time and effort.
- Error Drift: Downstream errors are often remapped (e.g., a `PermissionDenied` becomes a `NotFound`), making root-cause analysis nearly impossible without checking multiple logs.
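As an illustration of how the error tree addresses these problems, the sketch below walks a decoded `x-error-tree` value to list the leaf failures (the likely root causes) and every point where the error code changed between a child and its parent. The helper names `root_causes` and `drift_points` are hypothetical, and the code assumes the node shape shown in the JSON example above.

```python
import json


def root_causes(node, path=()):
    """Yield (path, node) for every leaf of the error tree."""
    path = path + (node["service"],)
    children = node.get("children", [])
    if not children:
        yield path, node
    for child in children:
        yield from root_causes(child, path)


def drift_points(node):
    """Yield (service, child_code, own_code) wherever a code was remapped."""
    for child in node.get("children", []):
        if child["code"] != node["code"]:
            yield node["service"], child["code"], node["code"]
        yield from drift_points(child)


# Sample header value with the same shape as the example in this README.
header = json.dumps({
    "service": "root", "status": "error", "code": "not-found",
    "children": [{
        "service": "service-b", "status": "error", "code": "not-found",
        "children": [{
            "service": "service-e", "status": "error",
            "code": "permission-denied",
        }],
    }],
})

tree = json.loads(header)
for path, leaf in root_causes(tree):
    print(" -> ".join(path), leaf["code"])
for service, before, after in drift_points(tree):
    print(f"{service} remapped {before} to {after}")
```

With a single response header, the original failure and the hop that remapped it can be identified without correlating logs across services.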
Requirements:
- docker
- docker compose
- jq
- curl
Available make targets:
- `make logs`
- `make test`
- `make test-success`
- `make test-fail-open`
Future work and known limitations:
- Change the encoder/decoder to something more performant in terms of allocations and CPU usage (e.g., raw proto).
- Timeouts might create a split-brain.
