
Incremental measurement of long-running tasks #3

Merged
lawrencejones merged 8 commits into master from incremental-measurement on Jul 10, 2019

Conversation

@lawrencejones
Owner

lawrencejones commented Apr 23, 2019

[Screenshot: blog.lawrencejones.dev staging preview of the incremental measurement post, captured 2019-07-10 21:56]

@lawrencejones lawrencejones force-pushed the incremental-measurement branch 2 times, most recently from d8c8fa5 to aab98fa Apr 23, 2019
@lawrencejones

Owner Author

lawrencejones commented Apr 24, 2019

Walt:

I’d reorder the sentences below: present the importance first, then ask for validation of common understanding. I might be naïve in thinking that people need this called out, but should you not call out where you can get the available worker metrics from, or do we think it detracts from the point of the article?

You’ll want to measure this system, right? These workers are essential for your systems to run smoothly, providing the heavy lifting behind the scenes that make your product worth anyone’s time.

Did we answer the second part of this question? In fact, did we actually answer this question? It seems like we have a more precise metric for actively working workers (busy).

At any moment, what jobs are workers working, and how much time is spent on each job?

I would describe the graph in more detail; it wasn’t immediately obvious that each block was a job.

Grammatical error? I’d call out the seemingly obvious why.

It means we can answer with certainty questions like asked at the start of the article (what are the workers doing?), and an uncertain answers are often worse than having no answer at all.

Any reason why you can’t embed the graphs from grafana?

Example of how this metric may help? You mentioned it would be useless if wrong; how does this precision now make it useful?


petehamilton left a comment

Left a few thoughts to consider. Nice work though, fun read! 😄

---

Most applications run a 'worker tier' or some deployment that processes

@petehamilton

petehamilton May 10, 2019

This intro is good.

In case it's useful, I thought maybe this jumped into talking about workers quite quickly and that it might be interesting to frame in the context of more general web app development. If so, here's a slightly alternative pitch:

  • Most applications consist of a web server which handles live traffic and some sort of "worker tier"
  • There is a lot of really^1 good^2 advice^3 out there on how to measure the health of live traffic
  • Workers are often as, if not more, valuable, so we should care about them too! However, there's relatively little guidance on approaches and pitfalls for measuring your async workers.
  • This post covers one such pitfall, using the example of a prod incident where despite metrics, I was unable to answer question X
  • Turns out this is because of a bias so significant that metrics were outright lying to me
  • I'm going to talk you through how I investigated it and show you how to avoid it in a clean, pragmatic way

etc? Can keep most of the content as-is I think, just a minor re-framing & food for thought!

## The incident

Alerts were firing about dropped requests and the HTTP dashboard confirmed it-
queues were building and requests timing out. About two minutes later pressure

@petehamilton

petehamilton May 10, 2019

pressure flooded out the system and normality was restored.

What does this mean?


Looking closer, our API servers had stalled waiting on the database to respond,
causing all activity to grind to an abrupt halt. The prime suspect wielding
enough capacity to hit the database like this was the asynchronous worker tier,

@petehamilton

petehamilton May 10, 2019

What do you mean by this? Why was it the prime suspect? Maybe explain why and why they wield this capacity?

</figcaption>
</figure>

As the person debugging this mess, it hurt me to see the graph flatline at

@petehamilton

petehamilton May 10, 2019

s/flatline/showing minimal activity/? Flatline, to me, means dropped to zero/near zero, rather than "showing normal/low activity".

exactly the time of the incident. I'd already checked the logs so I knew the
workers were busy- not only that, but the large blue spike at 16:05? It's time
spent working webhooks, for which we run twenty dedicated workers. How could ten
single threaded workers spend 45s per second working?

@petehamilton

petehamilton May 10, 2019

I found this para slightly confusing - where have we got 10 single threaded and 45s from?

@petehamilton

petehamilton May 10, 2019

Ah - looks like the 45s is from the graph (not visible in GitHub markdown - doh!). Not sure about the 10, though - do you mean 20?

@lawrencejones

lawrencejones Jul 6, 2019

Author Owner

Yep, the ten is just a typo. Gets confusing when you simplify numbers on the fly for this type of content!

second. If every worker starts working long jobs, we'll report no work being
done until they end, even if that's hours later.
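
To make the failure mode concrete, here is a minimal sketch of the naive approach described above, where the counter is only incremented once a job completes. This is not the post's actual code: the metric name, the queue label and the job object are all illustrative, and it assumes the standard Python Prometheus client.

```python
import time

from prometheus_client import Counter

# Hypothetical metric: total seconds spent working jobs, per queue.
JOB_WORKED_SECONDS = Counter(
    "worker_job_worked_seconds_total",
    "Seconds spent working jobs, observed only when a job completes",
    ["queue"],
)

def work(job):
    started = time.monotonic()
    job.run()  # illustrative: whatever actually performs the job
    # The whole duration lands here in one lump, only after the job ends.
    # A one-hour job contributes nothing to any scrape taken while it runs,
    # then dumps 3600s into whichever scrape interval it happens to finish in.
    JOB_WORKED_SECONDS.labels(queue=job.queue).inc(time.monotonic() - started)
```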

To demonstrate this effect, I created an experiment with ten workers working

@petehamilton

petehamilton May 10, 2019

I think it'd be cool to also show something which "looks legit-ish" either before or after this next section.

i.e., if you have thousands of tiny jobs, things look kind of like you'd expect, at least enough to go unnoticed. However, as you increase the job time/add slower jobs (as you have here), you can show that these graphs become downright dangerous.

This has the added benefit of explaining how it's easy to miss this until it's too late - you start with small quick jobs but if/when they slow down, they take your metrics down with them!

@lawrencejones

lawrencejones Jul 6, 2019

Author Owner

Ah, this is a great idea. Showing it degrade would be lovely.
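
A throwaway simulation along these lines can show that degradation. It is illustrative only, not the post's experiment: the worker count, scrape interval and duration distribution are made up, and workers are assumed to be always busy.

```python
import random

def simulate(job_duration, workers=10, scrape_interval=15, horizon=300):
    """Seconds of work reported per scrape when durations are only observed
    at job completion (the naive approach)."""
    reported = [0.0] * (horizon // scrape_interval)
    for _ in range(workers):
        t = 0.0
        while t < horizon:
            duration = random.expovariate(1.0 / job_duration)  # mean = job_duration
            t += duration
            if t < horizon:
                # The whole duration lands in the scrape where the job ends.
                reported[int(t // scrape_interval)] += duration
    return reported

# Ten always-busy workers really do ~150s of work per 15s scrape interval.
print(simulate(job_duration=1))    # every bucket reports close to 150s
print(simulate(job_duration=120))  # mostly zeros, with occasional huge spikes
```

With short jobs the reported numbers look plausible; once jobs take longer than a scrape interval, most scrapes report nothing at all.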

## Restoring trust
Metrics aren't meant to lie to you. Beyond the existential crisis prompted by

@petehamilton

petehamilton May 10, 2019

😂 - enjoyed this section

or we receive a scrape. The jobs that span across scrapes contribute fairly on
either side, as shown by the jobs straddling the 15s scrape time splitting their
duration evenly. Regardless of the size of our jobs, we've incremented the
metric by 30s (2 x 15s) for each scrape interval.

@petehamilton

petehamilton May 10, 2019

Roughly - or specifically? Presumably this will still be somewhat either side of 30s due to the timings of the request? Kind of fine, but might be worth mentioning?
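
For reference, here is a rough sketch of the tracer idea the excerpt above describes. It is illustrative only, not the post's implementation, and the names are invented: in-flight work is flushed into the metric both when a job finishes and whenever a scrape happens, so each scrape interval is credited with the work done during it.

```python
import threading
import time

class Tracer:
    """Credits elapsed time for in-flight jobs to a counter-style metric."""

    def __init__(self, metric):
        self._metric = metric          # e.g. a prometheus_client Counter
        self._lock = threading.Lock()
        self._traces = {}              # job id -> (labels, time last flushed)

    def start(self, job_id, labels):
        with self._lock:
            self._traces[job_id] = (labels, time.monotonic())

    def finish(self, job_id):
        now = time.monotonic()
        with self._lock:
            labels, last = self._traces.pop(job_id)
            self._metric.labels(**labels).inc(now - last)

    def collect(self):
        """Call just before serving each scrape: flush work done so far."""
        now = time.monotonic()
        with self._lock:
            for job_id, (labels, last) in self._traces.items():
                self._metric.labels(**labels).inc(now - last)
                self._traces[job_id] = (labels, now)
```

Hooked into the scrape path, two always-busy workers then add roughly 30s (2 x 15s) per scrape, give or take the scrape timing jitter mentioned above, regardless of how long the individual jobs run.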

Comparison of biased (left) and tracer managed (right) metrics, taken from
the same worker experiment
</figcaption>
</figure>

@petehamilton

petehamilton May 10, 2019

Images are the wrong way round (or captions, but I'd swap images and do "before" on the left)

In comparison to the outright misleading and chaotic graph from our original
measurements, metrics managed by the tracer are stable and consistent. Not only
do we accurately assign work to each scrape but we are now indifferent to

@petehamilton

petehamilton May 10, 2019

Suggested change:
- do we accurately assign work to each scrape but we are now indifferent to
+ do we accurately reflect seconds worked on each scrape but we are now indifferent to
@lawrencejones

Owner Author

lawrencejones commented Jul 6, 2019

It took me way too long to get back to this @petehamilton but your review was really useful, thank you!

@lawrencejones lawrencejones force-pushed the incremental-measurement branch from 136c2db to f33f987 Jul 10, 2019
@lawrencejones lawrencejones merged commit 6fc964e into master Jul 10, 2019
3 checks passed
ci/circleci: build - Your tests passed on CircleCI!
ci/circleci: deploy - Your tests passed on CircleCI!
test - Workflow: test
@lawrencejones lawrencejones deleted the incremental-measurement branch Jul 10, 2019