leandromoreira/distributed-stack-trace

Stack Trace for Distributed Systems

Traditional distributed tracing flows forward and relies on head-sampling, often missing the very errors you need to debug because the decision to trace was made before the failure occurred.

There are efforts to bring tail-based sampling to distributed tracing.

This project is a lab where you can arrange the services into any call tree you need and simulate errors. The tree-like call pattern is configured through docker-compose environment variables (ENV) plus HTTP headers.

There are two middlewares (inbound and outbound) that perform backward error accumulation: by intercepting error metadata and bubbling it up the call chain, they build a distributed stack trace (in the form of a tree) that captures 100% of the failure context without the cost of full tracing.
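To make the accumulation concrete, here is a minimal sketch of the idea in Python. It is not the project's actual middleware code: the function names `outbound_collect` and `inbound_finalize` are hypothetical, headers are modeled as plain dicts with an exact-case key, and the node fields simply mirror the JSON shape this README shows for x-error-tree.

```python
import json

HEADER = "x-error-tree"

def outbound_collect(response_headers, collected):
    """Outbound side (hypothetical): keep the error tree a failed dependency sent back, if any."""
    raw = response_headers.get(HEADER)
    if raw:
        collected.append(json.loads(raw))

def inbound_finalize(service, code, message, collected):
    """Inbound side (hypothetical): wrap the collected child trees in this service's own node."""
    node = {"service": service, "status": "error", "code": code, "error": message}
    if collected:
        node["children"] = collected
    # Compact separators keep the header value small.
    return {HEADER: json.dumps(node, separators=(",", ":"))}

# service-e fails on its own, so it has no children to report.
leaf_headers = inbound_finalize(
    "service-e", "permission-denied", "an error occurred in service service-e", []
)

# service-b called service-e, collects its tree, then reports its own failure.
collected = []
outbound_collect(leaf_headers, collected)
root_headers = inbound_finalize(
    "service-b", "not-found", "an upstream dependency failed in service service-b", collected
)
print(root_headers[HEADER])
```

Each hop only adds its own node around whatever its dependencies returned, so the full tree is available at the root without any service needing a global view.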

[Image: microservice tree call, error fail-close]

The final result is an HTTP header (x-error-tree) containing the error tree, encoded as JSON.

{
  "service": "root",
  "status": "error",
  "code": "not-found",
  "error": "an upstream dependency failed in service service-b",
  "children": [
    {
      "service": "service-b",
      "status": "error",
      "code": "not-found",
      "error": "[DRIFT: permission-denied -> not-found] an upstream dependency failed in service service-b",
      "children": [
        {
          "service": "service-e",
          "status": "error",
          "code": "permission-denied",
          "error": "an error occurred in service service-e"
        }
      ]
    }
  ]
}
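A consumer of this header could walk the tree to find where the failure actually started. The sketch below is illustrative only: `root_causes` is a hypothetical helper, and it assumes the header value has exactly the shape shown above (with `children` omitted on leaf nodes).

```python
import json

def root_causes(node, path=()):
    """Return (call path, code) for every failing node that has no failing children."""
    here = path + (node["service"],)
    failing = [c for c in node.get("children", []) if c.get("status") == "error"]
    if not failing:
        return [(here, node["code"])] if node.get("status") == "error" else []
    found = []
    for child in failing:
        found.extend(root_causes(child, here))
    return found

# Same tree as the example above, as it might arrive in the x-error-tree header.
header_value = json.dumps({
    "service": "root", "status": "error", "code": "not-found",
    "error": "an upstream dependency failed in service service-b",
    "children": [{
        "service": "service-b", "status": "error", "code": "not-found",
        "error": "[DRIFT: permission-denied -> not-found] an upstream dependency failed in service service-b",
        "children": [{
            "service": "service-e", "status": "error", "code": "permission-denied",
            "error": "an error occurred in service service-e",
        }],
    }],
})

for call_path, code in root_causes(json.loads(header_value)):
    print(" -> ".join(call_path), ":", code)
# prints: root -> service-b -> service-e : permission-denied
```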

Why

In microservice architectures, an error at the root often masks the true cause.

  • The Context Gap: Root calls return a generic 500 Internal Server Error, losing the specifics of the downstream failure.
  • Sampling Issues: Distributed tracing is expensive and usually sampled (e.g., 1%). If the root call decides not to sample, no trace exists for the failed request, and debugging the error takes much more time and effort.
  • Error Drift: Downstream errors are often remapped (e.g., a PermissionDenied becomes a NotFound), making root-cause analysis nearly impossible without checking multiple logs.
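Error drift in particular is easy to spot once the tree is available. As a rough sketch (the helper name `find_drift` is hypothetical, and this is not necessarily how the project produces its `[DRIFT: ...]` annotation), drift can be detected by comparing each failing child's code with its parent's code:

```python
def find_drift(node):
    """List every hop where a failing child's code differs from its parent's code."""
    drifts = []
    for child in node.get("children", []):
        if child.get("status") == "error" and child["code"] != node["code"]:
            drifts.append(f'{node["service"]}: {child["code"] } -> {node["code"]}')
        drifts.extend(find_drift(child))
    return drifts

# Reduced version of the README's example tree (error messages omitted).
tree = {
    "service": "root", "status": "error", "code": "not-found",
    "children": [{
        "service": "service-b", "status": "error", "code": "not-found",
        "children": [
            {"service": "service-e", "status": "error", "code": "permission-denied"},
        ],
    }],
}

print(find_drift(tree))
# prints: ['service-b: permission-denied -> not-found']
```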

Demo

ffmpeg-demo.mp4

Requirements

  • docker
  • docker compose
  • jq
  • curl

Running

Starting

  • make logs

Testing

  • make test
  • make test-success
  • make test-fail-open

Challenges & TODOs

  • Switch the encoder/decoder to something more performant in terms of allocations and CPU usage (such as raw protobuf)
  • Timeouts might create a split-brain situation

About

This project is a runtime microservice demo that builds a “distributed stack trace” by propagating error metadata up the call chain in an HTTP header (x-error-tree), instead of relying on sampled full distributed tracing.
