
Prometheus::Client::Tracer #135

Closed

@lawrencejones
Contributor

commented Jun 16, 2019

See the commit description (inlined into the PR) below for an explanation of what this does. Some things I considered but never added:

Zombie traces

If the thread performing a trace dies via `Thread.kill`, we'll continue as if the trace is still on-going. If you're killing threads, this will probably be a nasty surprise, but it's likely no worse than the crazy side-effects caused by gratuitous serial thread murdering.

We could be smart about this and track the `Thread.current.object_id` on the Trace struct, then have `#collect` filter out dead traces. Querying thread status has the potential to be varying degrees of expensive though, so I'm not that keen on doing it in a tight loop during the collection process. Interested in people's comments here though!
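As a sketch of that idea (entirely hypothetical, not this PR's code; it keeps the `Thread` itself rather than its `object_id`, since checking liveness needs the thread object, and the `Trace` struct fields here are assumptions):

```ruby
# Hypothetical sketch of zombie-trace filtering: each trace records the
# thread that started it, and collection drops traces whose thread died.
Trace = Struct.new(:metric, :labels, :time, :thread)

traces = []

# Simulate a tracing thread that gets killed mid-trace.
worker = Thread.new { sleep }
traces << Trace.new(nil, {}, Time.now, worker)
worker.kill
worker.join

# The filter #collect would apply before updating any metrics:
traces.reject! { |trace| !trace.thread.alive? }
```

`Thread#alive?` itself is cheap, but calling it for every trace on every scrape is exactly the per-collection cost weighed above.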

Trace everything

In a similar theme to the zombie traces, someone may try tracing all-the-things. `#collect` is implemented assuming a modest number of on-going traces (probably <100?); otherwise you might want to get smarter than scanning the entire list.


Because I'm a bad developer, I've bundled a couple of small dev-helper commits with this change. If they're at all controversial, let's break them out.


Create an abstraction to help express long-running duration measurements
in Prometheus metrics.

One of the most common observability requirements is to trace the amount
of time spent running a job or task. This information is useful in the
context of _when_ this time was spent. The typical implementation of
such a measurement might be:

```ruby
def run
  start = Time.now
  long_running_task
  metric.increment(by: Time.now - start)
end
```

The metric tracking duration is incremented only once the task has
finished. If the task takes longer than the Prometheus scrape interval,
perhaps by a significant factor (say 1hr vs the normal 15s scrape
interval), then the metric is incremented by a huge value only after the
task has finished. Graphs visualising how time is spent will show no
activity while the task was running, then an impossible burst just as
things finish.

To avoid this type of measurement bias, this commit introduces a metric
tracer that manages updating metrics associated with long-running tasks.
Users will begin a trace, do work, and while the work is on-going any
calls to the tracer collect method will incrementally update the
associated metric with elapsed time.
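A minimal sketch of how such a tracer could work (illustrative only; the class shape, method names, and the `increment(by:, labels:)` signature are assumptions rather than quotations of this PR's implementation):

```ruby
# Illustrative tracer: #trace registers an on-going trace, and each
# #collect accounts only the time elapsed since the last checkpoint,
# so duration accrues incrementally instead of in one final burst.
class Tracer
  Trace = Struct.new(:metric, :labels, :time)

  def initialize
    @lock = Mutex.new
    @traces = []
  end

  # Wrap a long-running task, finalising any unaccounted time on exit.
  def trace(metric, labels = {})
    trace = Trace.new(metric, labels, Time.now)
    @lock.synchronize { @traces << trace }
    yield
  ensure
    @lock.synchronize do
      trace.metric.increment(by: Time.now - trace.time, labels: trace.labels)
      @traces.delete(trace)
    end
  end

  # Flush elapsed time for all on-going traces, resetting checkpoints so
  # no interval is ever double counted.
  def collect(now = Time.now)
    @lock.synchronize do
      @traces.each do |trace|
        trace.metric.increment(by: now - trace.time, labels: trace.labels)
        trace.time = now
      end
    end
  end
end
```

Because each `#collect` resets the checkpoint, a 1hr task scraped every 15s shows up as a steady ~15s of work per scrape rather than a single 3600s increment at the end.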

This is a game-changer in terms of making metrics usable, removing a
huge source of uncertainty when interpreting metric measurements. The
implementation of the tracer is thread safe and designed to be as
lightweight as possible: it should hopefully impose no performance hit
and be preferable for tracing durations of most lengths, provided you
don't exceed hundreds of on-going traces.

For easy use, the Prometheus client initialises a global tracer in the
same vein as the global registry. Most users are going to want to use a
tracer without initialising a handle in their own code, and can do so
like this:

```ruby
def run
  Prometheus::Client.trace(metric, { worker: 1 }) do
    long_running_task
  end
end
```

By default, users who do this will see their metrics update just as
frequently as the original implementation. For the incremental
measurement to work, they must use a TraceCollector to trigger a
collection just prior to serving metrics:

```ruby
# By default, collect the global tracer
use Prometheus::Middleware::TraceCollector
```
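Such a middleware needs very little: flush the tracer, then delegate to the app serving the metrics. The sketch below is illustrative rather than this PR's actual class; in particular, the real middleware would presumably default to the global tracer, whereas this version takes a required `tracer:` keyword:

```ruby
# Illustrative Rack middleware: flush on-going trace durations into
# their metrics immediately before the request (typically /metrics)
# is served by the rest of the stack.
class TraceCollector
  def initialize(app, tracer:)
    @app = app
    @tracer = tracer
  end

  def call(env)
    @tracer.collect # account elapsed time for all on-going traces
    @app.call(env)
  end
end
```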

@lawrencejones lawrencejones force-pushed the lawrencejones:lawrence-metric-tracer branch from b142a06 to fcd0e62 Jun 16, 2019

Add .ruby-version
For development purposes, it's useful to provide a .ruby-version that
encourages people to develop against the latest version of Ruby. This
file should not be bundled with the gem, as that makes little sense:
gems should be consumable by many versions of Ruby.

Signed-off-by: Lawrence Jones <lawrjone@gmail.com>

@lawrencejones lawrencejones force-pushed the lawrencejones:lawrence-metric-tracer branch from fcd0e62 to cca41e1 Jun 16, 2019

@coveralls

commented Jun 16, 2019

Coverage Status

Coverage decreased (-0.1%) to 99.865% when pulling b142a06 on lawrencejones:lawrence-metric-tracer into 3652cf7 on prometheus:master.


@coveralls

commented Jun 16, 2019

Coverage Status

Coverage decreased (-0.1%) to 99.864% when pulling 59cd4ba on lawrencejones:lawrence-metric-tracer into 3652cf7 on prometheus:master.

lawrencejones added some commits Jun 16, 2019

Add pry as development dependency
This can be useful when developing, either by invoking binding.pry in
your tests or using the console.

Signed-off-by: Lawrence Jones <lawrjone@gmail.com>
Prometheus::Client::Tracer
Signed-off-by: Lawrence Jones <lawrjone@gmail.com>

@lawrencejones lawrencejones force-pushed the lawrencejones:lawrence-metric-tracer branch from cca41e1 to 59cd4ba Jun 17, 2019

@lawrencejones

Contributor Author

commented Jul 6, 2019

Closing this in favour of extracting the functionality as a gem: https://github.com/lawrencejones/prometheus-client-tracer-ruby

Happy to wait and see whether people find it useful before pushing for it to be part of the client.
