add a metric that can be used to notice stuck worker threads #70884

Merged
merged 8 commits into kubernetes:master from lavalamp:workqueue on Nov 13, 2018

Conversation

@lavalamp
Member

lavalamp commented Nov 9, 2018

"unfinished_work_microseconds" is added to the workqueue metrics; it can be used to detect stuck worker threads. (kube-controller-manager runs many workqueues.)
@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 9, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and soltysh Nov 9, 2018

@lavalamp

Member

lavalamp commented Nov 9, 2018

I can probably test this better (existing metrics don't seem to be tested at all?).

lmk if there's a problem with the approach.

/assign @mml
cc @logicalhan

unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
    Subsystem: name,
    Name:      "unfinished_work_microseconds",
    Help:      "How many microseconds of work has " + name + " done that is still in progress and hasn't yet been observed by work_duration.",


@mml

mml Nov 9, 2018

Contributor

What is work_duration?


@lavalamp

lavalamp Nov 9, 2018

Member

It's a metric defined a few lines up.

@mml

Contributor

mml commented Nov 9, 2018

Thinking about this... I prefer having a counter that counts the number of times the thread has "finished". This lets you notice when it's stuck (and for how long), but also lets you compute rates and deltas, even if you miss some samples. And the data downsamples elegantly.

With a "microseconds since" metric, stuck is pretty easy to detect, but you can't derive rates/deltas reliably. And downsampling tends to lose information about intermediate events.

@lavalamp

Member

lavalamp commented Nov 9, 2018

We already track the number of completions as part of the processing duration statistic. You can't tell from that how many things haven't completed, which is what I am worried about.

The problem is we don't know how many threads there are, it's not even guaranteed to be a constant number.

You can tell how many threads are stuck by the rate at which this new metric is increasing.
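Editorial note: a tiny illustration with made-up numbers (not from the PR) of the slope argument above. Every stuck worker contributes elapsed wall-clock time to the gauge, so N stuck workers make it grow at roughly N seconds per second.

```go
// Made-up numbers, purely to illustrate the slope argument: each stuck worker
// adds wall-clock time to the gauge, so its growth rate approximates the
// number of stuck workers.
package main

import "fmt"

func main() {
    const scrapeIntervalSeconds = 30.0 // assumed scrape interval
    for _, stuck := range []int{1, 2, 3} {
        growth := float64(stuck) * scrapeIntervalSeconds
        fmt.Printf("%d stuck worker(s): gauge grows ~%.0fs per %.0fs scrape, slope ~%d s/s\n",
            stuck, growth, scrapeIntervalSeconds, stuck)
    }
}
```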

@lavalamp

Member

lavalamp commented Nov 9, 2018

Looking for differences in the completion rate also doesn't work because the queue is probably far under capacity. If it is always processing everything that is added, then the completion rate will be the same whether it's 19 workers (1 stuck) or 20.

If it's not processing everything that's added, then there's a major problem :)

@lavalamp

Member

lavalamp commented Nov 9, 2018

Also, you can't look at the difference between adds and completions, because adds are de-duped in the queue. The ratio could easily be 100:1 or more and that's fine, good even, as it means less work is done.

@lavalamp

Member

lavalamp commented Nov 9, 2018

Hm, actually it looks like the add metric is incremented post-deduping. I'm not convinced the code is right. Maybe I'll add some more tests.

@mml

Contributor

mml commented Nov 9, 2018

We already track the number of completions as part of the processing duration statistic. You can't tell from that how many things haven't completed, which is what I am worried about.

To rephrase, deltas on the duration metric would tell you if you're not doing any work, but that could be either because you're stuck or because there's nothing to do?

@logicalhan

Contributor

logicalhan commented Nov 9, 2018

The logic seems okay to me, but I'm also not super familiar with how we consume prometheus data, so I don't have a bunch of helpful input, unfortunately. In general it seems to make sense to me though.

@logicalhan

Contributor

logicalhan commented Nov 9, 2018

Re: @mml, yeah, I think the problem is that there would be no difference between not working and being stuck.

@lavalamp

Member

lavalamp commented Nov 9, 2018

If all threads get stuck then the metric would go to zero and you'd notice that. If only some threads get stuck, then you wouldn't necessarily notice that.

The deduping actually does look to happen before incrementing the add metric, so in steady state the add and done metrics should be the same value. Unfortunately if they're off by one, you don't know if that's because of something that got added a second ago or a week ago.

var total float64
if m.processingStartTimes != nil {
    for _, t := range m.processingStartTimes {
        total += sinceInMicroseconds(t)


@mml

mml Nov 9, 2018

Contributor

Ah, so this is (items * time). This is kind of a confusing unit. Don't we really just care about the oldest thing in the queue?


@lavalamp

lavalamp Nov 9, 2018

Member

I could report that, but you wouldn't be able to tell the difference between one thing stuck and two things stuck.


@logicalhan

logicalhan Nov 9, 2018

Contributor

I agree that the unit is a little confusing, but emitting this as raw data is probably okay. Whatever thing is responsible for scraping this data can post-process before turning it into an actual (and more useful and less confusing) metric datapoint.


@lavalamp

lavalamp Nov 10, 2018

Member

It is a confusing unit, but I think all the things that we can do to make it less confusing are lossy (or require making assumptions about the duration of the things that are being done) and therefore best done in the monitoring system.


@mml

mml Nov 10, 2018

Contributor

In this case, we can't tell the difference between 1 thing stuck for an hour, or 2 things stuck for 30 minutes.

If we want information on both ages and the count of things at certain ages, we'd normally use a distribution/histogram (which I think is "summary" in prometheus?).
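Editorial note: a sketch of what the histogram variant mentioned above could look like with the Prometheus Go client; the metric name and buckets are assumptions, and this is not what the PR implements. One wrinkle, which the next reply touches on: there isn't a single completed event to observe here, and re-observing the same stuck item every sampling period inflates the cumulative histogram counts.

```go
// Hypothetical histogram of in-flight item ages, sampled periodically.
package main

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var inFlightAgeSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
    Subsystem: "workqueue",
    Name:      "in_flight_item_age_seconds", // hypothetical name
    Help:      "Ages of items currently being processed, sampled periodically.",
    Buckets:   prometheus.ExponentialBuckets(0.1, 4, 8), // ~0.1s up to ~27m
})

// sampleAges would run on a timer, observing how long each in-flight item has
// been processing so both the count and the ages of slow items are visible.
func sampleAges(processingStartTimes map[string]time.Time, now time.Time) {
    for _, start := range processingStartTimes {
        inFlightAgeSeconds.Observe(now.Sub(start).Seconds())
    }
}

func main() {
    prometheus.MustRegister(inFlightAgeSeconds)
    now := time.Now()
    sampleAges(map[string]time.Time{
        "itemA": now.Add(-5 * time.Second),
        "itemB": now.Add(-30 * time.Minute), // lands in a high bucket
    }, now)
}
```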


@lavalamp

lavalamp Nov 11, 2018

Member

GitHub stuck my response in the wrong place:

If you plot this over time, the slope will tell you the difference between those scenarios.

I'm open to reporting a distribution, of course, but I don't have a single thing to observe and it wasn't clear how to get around that.


@lavalamp

lavalamp Nov 11, 2018

Member

I think I might change the unit to be seconds, though. Microseconds is far more granular than necessary.


@lavalamp

lavalamp Nov 11, 2018

Member

Unit changed, and I expanded the help string for the metric since it wasn't obvious how to use it.

lavalamp added some commits Nov 10, 2018

@lavalamp lavalamp force-pushed the lavalamp:workqueue branch from 67e205d to 74c50c0 Nov 10, 2018

@lavalamp

Member

lavalamp commented Nov 10, 2018

OK, I greatly improved the testing, PTAL

@lavalamp

Member

lavalamp commented Nov 10, 2018

@logicalhan

Contributor

logicalhan commented Nov 10, 2018

It looks good to me. The clock injection for testing was a nice touch.

@lavalamp

Member

lavalamp commented Nov 10, 2018

    metricsProvider: noopMetricsProvider{},
}

type metricsFactory struct {


@justinsb

justinsb Nov 11, 2018

Member

Maybe rename to queueMetricsFactory or queueMetricsProvider? I don't know whether we're actively making a distinction between factories & providers; I find it confusing to have metricsFactory with an embedded metricsProvider.


func (f *metricsFactory) newQueueMetrics(name string, clock clock.Clock) queueMetrics {
    mp := f.metricsProvider
    if len(name) == 0 || mp == (noopMetricsProvider{}) {


@justinsb

justinsb Nov 11, 2018

Member

Does mp == (noopMetricsProvider{}) work? I would have thought you'd have to do a type-check.


@lavalamp

lavalamp Nov 11, 2018

Member

Surprisingly, Go does the comparison correctly (otherwise the test wouldn't pass / compile).
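Editorial note: a quick standalone check of the Go behavior in question (the type name is reused here purely for illustration). Struct values of a comparable type, including empty structs, can be compared with ==, and an interface value compares equal when its dynamic type and value match.

```go
// Standalone check: empty structs are comparable, and an interface value
// compares equal to a struct literal when the dynamic type and value match.
package main

import "fmt"

type noopMetricsProvider struct{}

func main() {
    var mp interface{} = noopMetricsProvider{}
    fmt.Println(mp == (noopMetricsProvider{}))                  // true
    fmt.Println(noopMetricsProvider{} == noopMetricsProvider{}) // true
}
```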

    metricsFactory.setProviders.Do(func() {
        metricsFactory.metricsProvider = metricsProvider
    })
    globalMetricsFactory.set(metricsProvider)


@justinsb

justinsb Nov 11, 2018

Member

Maybe rename set to setProvider, for consistency?

@@ -64,6 +82,9 @@ type Type struct {
    shuttingDown bool

    metrics queueMetrics

    unfinishedWorkUpdatePeriod time.Duration


@justinsb

justinsb Nov 11, 2018

Member

Nit: we don't need to store it in the struct; we could just pass it into updateUnfinishedWorkLoop. Not sure if that makes things better or worse in your mind.


@lavalamp

lavalamp Nov 11, 2018

Member

I think I prefer it when things are introspectable after the fact, although it doesn't matter in this case.

@@ -170,3 +191,22 @@ func (q *Type) ShuttingDown() bool {

    return q.shuttingDown
}

func (q *Type) updateUnfinishedWorkLoop() {
    t := q.clock.NewTicker(q.unfinishedWorkUpdatePeriod)


@justinsb

justinsb Nov 11, 2018

Member

I'd probably add a comment here that the reason we're doing this is because it's not safe to update metrics without holding the q.cond.L lock


@lavalamp

lavalamp Nov 11, 2018

Member

Added a note to defaultQueueMetrics.
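Editorial note: a simplified sketch of the loop pattern being discussed, not the PR's exact code. The real queue uses an injected clock (so tests can fake the ticker) and its own condition-variable lock, but the shape is the same: tick, take the lock, recompute the gauge from the processing-start bookkeeping, release the lock.

```go
// Simplified sketch of a ticker-driven metrics update loop that reads shared
// queue state only while holding the lock that guards it.
package main

import (
    "fmt"
    "sync"
    "time"
)

type queue struct {
    mu                   sync.Mutex
    processingStartTimes map[string]time.Time
    shuttingDown         bool
    setUnfinishedSeconds func(float64) // stands in for the gauge's Set method
}

func (q *queue) updateUnfinishedWorkLoop(period time.Duration) {
    t := time.NewTicker(period)
    defer t.Stop()
    for range t.C {
        q.mu.Lock()
        if q.shuttingDown {
            q.mu.Unlock()
            return
        }
        var total float64
        now := time.Now()
        for _, start := range q.processingStartTimes {
            total += now.Sub(start).Seconds()
        }
        q.setUnfinishedSeconds(total)
        q.mu.Unlock()
    }
}

func main() {
    q := &queue{
        processingStartTimes: map[string]time.Time{"item": time.Now().Add(-time.Minute)},
        setUnfinishedSeconds: func(v float64) { fmt.Printf("unfinished_work_seconds: %.0f\n", v) },
    }
    go q.updateUnfinishedWorkLoop(100 * time.Millisecond)
    time.Sleep(350 * time.Millisecond)
}
```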

@justinsb

Member

justinsb commented Nov 11, 2018

A few code comments. I like the metric though - although we're packing a few failure modes into one metric, I think it gives us signal where we didn't have signal before. If we need to add more metrics to disambiguate later, we can do so. (Do we have any sort of guarantee on metric compatibility?)

@lavalamp

Member

lavalamp commented Nov 11, 2018

PTAL

I'm not sure whether we've set a metrics deprecation policy; I think 2 releases / 6 months is probably the minimum deprecation time that could be reasonable.

fixup! Test workqueue metrics
change units to seconds
@lavalamp

Member

lavalamp commented Nov 11, 2018

/retest

@lavalamp lavalamp added this to the v1.13 milestone Nov 11, 2018

@lavalamp

Member

lavalamp commented Nov 11, 2018

/retest

@lavalamp

Member

lavalamp commented Nov 12, 2018

Added a second metric for @mml.

@lavalamp

Member

lavalamp commented Nov 12, 2018

Can I get an LGTM so I can squash? :)

func (prometheusMetricsProvider) NewLongestRunningProcessorMicrosecondsMetric(name string) workqueue.SettableGaugeMetric {
    unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
        Subsystem: name,
        Name:      "longest_running_procesor_microseconds",


@logicalhan

logicalhan Nov 13, 2018

Contributor

*processor

@logicalhan
Contributor

logicalhan left a comment

/LGTM

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 13, 2018

@logicalhan: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.

In response to this:

/LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deads2k

Contributor

deads2k commented Nov 13, 2018

@p0lyn0mial fyi. This sort of thing would also be useful in controllers and perhaps in the server (if there's a spot we missed)

@deads2k

Contributor

deads2k commented Nov 13, 2018

PTAL

I'm not sure whether we've set a metrics deprecation policy; I think 2 releases / 6 months is probably the minimum deprecation time that could be reasonable.

I think it depends. We try to be polite, but I don't think we've ever enforced a hard rule on metric naming and the like

    }
    // Convert to seconds; microseconds is unhelpfully granular for this.
    total /= 1000000
    m.unfinishedWorkSeconds.Set(total)


@mml

mml Nov 13, 2018

Contributor

Why not just "Set(total / 1000000)"?


@lavalamp

lavalamp Nov 13, 2018

Member

Not sure; if I make another change I'll update this, otherwise it doesn't seem worth it :)

Thanks for the review!

@mml

Contributor

mml commented Nov 13, 2018

/lgtm

Thanks.

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 13, 2018

@k8s-ci-robot k8s-ci-robot merged commit bc6aee1 into kubernetes:master Nov 13, 2018

18 checks passed

cla/linuxfoundation lavalamp authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.

@logicalhan logicalhan referenced this pull request Dec 8, 2018

Closed

REQUEST: New membership for @logicalhan #292

6 of 6 tasks complete