Improve Asset querying by allowing precise times to be passed to Prometheus queries #1127

Merged
nikovacevic merged 4 commits into develop from niko/query-time on Mar 29, 2022

Conversation

nikovacevic
Contributor

Branches off #1104

What does this PR change?

  • Allow passing time to Prometheus queries
  • Pass time in Asset node and disk queries
  • Remove 1m addition to end for nodes and disks
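
As a rough illustration of what passing a time means at the HTTP level: the Prometheus instant-query endpoint accepts an optional `time` parameter, given either as an RFC 3339 string or as a Unix timestamp. The sketch below is not the PR's code; `instantQueryParams` and `mustParse` are hypothetical helpers, and the query string is simplified from the one quoted later in this thread.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// instantQueryParams builds the form values for a Prometheus instant query
// (the /api/v1/query endpoint), pinning evaluation to the instant at.
// The helper name is hypothetical; it is not taken from the PR.
func instantQueryParams(query string, at time.Time) url.Values {
	q := url.Values{}
	q.Set("query", query)
	// Prometheus accepts either an RFC 3339 string or a Unix timestamp here.
	q.Set("time", strconv.FormatInt(at.Unix(), 10))
	return q
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	// Evaluate a 1h subquery at exactly the end of the window, not at "now".
	end := mustParse("2022-03-22T21:00:00Z")
	q := instantQueryParams(`count(node_total_hourly_cost) by (node)[1h:1m]`, end)
	fmt.Println(q.Get("time")) // 1647982800
}
```

Pinning the evaluation time to the window end means the subquery range `[1h:1m]` is measured back from that instant rather than from whenever the request happens to run.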

How does this PR impact users? (This is the kind of thing that goes in release notes!)

  • Fixes the query issues where asset (start, end) timestamps would routinely exceed the window, causing warning logs, and would often also be 1m short, resulting in durations of, e.g., 59m instead of 1h.

Links to Issues or ZD tickets this PR addresses or fixes

How was this PR tested?

Manually, using logs to inspect Prometheus queries, resultant minutes running, and warning logs

Warnings like these entirely disappear

I0322 21:49:25.173625       1 log.go:32] [Warning] Asset ETL: node 'ip-192-168-92-251.us-east-2.compute.internal' end outside window: 2022-03-22T21:01:00 not in [2022-03-22T20:00:00, 2022-03-22T21:00:00]
I0322 21:49:25.173660       1 log.go:32] [Warning] Asset ETL: node 'ip-192-168-12-152.us-east-2.compute.internal' end outside window: 2022-03-22T21:01:00 not in [2022-03-22T20:00:00, 2022-03-22T21:00:00]
I0322 21:49:25.173672       1 log.go:32] [Warning] Asset ETL: node 'ip-192-168-3-17.us-east-2.compute.internal' end outside window: 2022-03-22T21:01:00 not in [2022-03-22T20:00:00, 2022-03-22T21:00:00]
I0322 21:49:24.964149       1 log.go:32] [Warning] Asset ETL: disk 'ip-192-168-12-152.us-east-2.compute.internal' end outside window: 2022-03-22T20:01:00 not in [2022-03-22T19:00:00, 2022-03-22T20:00:00]
I0322 21:49:24.964159       1 log.go:32] [Warning] Asset ETL: disk 'pvc-c9d62790-556d-49b8-a5c1-7fdf2ce09d72' end outside window: 2022-03-22T20:01:00 not in [2022-03-22T19:00:00, 2022-03-22T20:00:00]
I0322 21:49:24.964166       1 log.go:32] [Warning] Asset ETL: disk 'pvc-fa2baa11-f438-4795-aa32-0c3b6e11f41a' end outside window: 2022-03-22T20:01:00 not in [2022-03-22T19:00:00, 2022-03-22T20:00:00]
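
The condition behind these warnings can be sketched as a closed-interval check. This is only an illustration under assumptions: `withinWindow`, `minutesRunning`, and `mustParse` are hypothetical names, and the timestamps are taken from the logs above and assumed to be UTC.

```go
package main

import (
	"fmt"
	"time"
)

// withinWindow reports whether t lies in the closed interval [start, end].
func withinWindow(t, start, end time.Time) bool {
	return !t.Before(start) && !t.After(end)
}

// minutesRunning returns the length of [start, end] in minutes.
func minutesRunning(start, end time.Time) float64 {
	return end.Sub(start).Minutes()
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	start := mustParse("2022-03-22T20:00:00Z")
	end := mustParse("2022-03-22T21:00:00Z")
	late := end.Add(time.Minute) // 21:01:00, the asset end seen in the warnings

	fmt.Println(withinWindow(late, start, end))             // false
	fmt.Println(withinWindow(end, start, end))              // true
	fmt.Printf("MINUTES: %f\n", minutesRunning(start, end)) // MINUTES: 60.000000
}
```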

Minutes running for 1d and 1h queries now look full

I0323 00:23:14.617923       1 log.go:47] [Info] MINUTES: 1440.000000
I0323 00:23:14.619729       1 log.go:47] [Info] MINUTES: 1440.000000
I0323 00:23:14.619802       1 log.go:47] [Info] MINUTES: 1440.000000
I0323 00:23:14.619861       1 log.go:47] [Info] MINUTES: 1440.000000
I0323 00:23:14.623312       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.623455       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.623539       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.623607       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.623669       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.760218       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.760352       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.760435       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.760503       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:14.760572       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:15.002791       1 log.go:47] [Info] MINUTES: 1440.000000
I0323 00:23:15.003041       1 log.go:47] [Info] MINUTES: 1440.000000
I0323 00:23:15.003194       1 log.go:47] [Info] MINUTES: 1440.000000
I0323 00:23:15.003295       1 log.go:47] [Info] MINUTES: 1440.000000
I0323 00:23:15.025458       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:15.025486       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:15.025505       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:15.025551       1 log.go:47] [Info] MINUTES: 60.000000
I0323 00:23:15.025618       1 log.go:47] [Info] MINUTES: 60.000000

@nikovacevic
Contributor Author

The problem I'm running into here is that this change actually breaks the cluster_helpers_test.go tests. But they're very hard to understand, and almost appear to be automatically generated. Maybe @michaelmdresser or @kaelanspatel can help me to understand how I'm supposed to interpret these?

queryLocalStorageBytes := fmt.Sprintf(`avg_over_time(sum(container_fs_limit_bytes{device!="tmpfs", id="/"}) by (instance, %s)[%s:%dm])`, env.GetPromClusterLabel(), durStr, minsPerResolution)
queryLocalActiveMins := fmt.Sprintf(`count(node_total_hourly_cost) by (%s, node)[%s:%dm]`, env.GetPromClusterLabel(), durStr, minsPerResolution)

resChPVCost := ctx.QueryAtTime(queryPVCost, t)
Contributor Author

We can now query at a time, t, which is nice

Contributor

Extremely nice!

pkg/costmodel/router.go (outdated)
Comment on lines 192 to 202
if !t.IsZero() {
// TODO remove log
log.Infof("[Prom] time=%s query=%s", strconv.FormatInt(t.Unix(), 10), query)
q.Set("time", strconv.FormatInt(t.Unix(), 10))
} else {
q.Set("time", time.Now().UTC().Format(time.RFC3339))
// for non-range queries, we set the timestamp for the query to time-offset
// this is a special use case that's typically only used when our primary
// prom db has delayed insertion (thanos, cortex, etc...)
if promQueryOffset != 0 && ctx.name != AllocationContextName {
q.Set("time", time.Now().Add(-promQueryOffset).UTC().Format(time.RFC3339))
} else {
q.Set("time", time.Now().UTC().Format(time.RFC3339))
}
Contributor Author

Likewise, here, want to be very careful @mbolt35

Contributor

@mbolt35 mbolt35 Mar 23, 2022

Yeah, I love these changes, but I am a bit concerned that this will break the intention behind the hosted hacks (though I'm still not certain why allocation breaks).

Outside of very specific circumstances, we should never actually set the prom query offset, so while I would love to discontinue the offset hackery, I think the only way to keep it consistent is to add the -promQueryOffset to the t time.Time.

We can discuss this more if you have more specific concerns, but if you can imagine every prod deployment of our product having promQueryOffset=0, that'd be great for now. I need to investigate remote write as a solution for hosted, but until I can do that, we need to be able to query at a static offset
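
One possible reading of this suggestion, as a minimal sketch: apply the same offset whether or not an explicit time was passed, so both paths stay consistent. `resolveQueryTime`, `exampleOffset`, and `mustParse` are hypothetical names, and the sketch deliberately leaves out the `ctx.name != AllocationContextName` condition from the snippet under review.

```go
package main

import (
	"fmt"
	"time"
)

// exampleOffset stands in for promQueryOffset; the value is made up.
const exampleOffset = 3 * time.Minute

// resolveQueryTime picks the instant to query at. A zero t means "no time
// given" and falls back to now. The offset is subtracted in both cases, so a
// deployment with offset 0 behaves exactly as if no offset logic existed.
func resolveQueryTime(t, now time.Time, offset time.Duration) time.Time {
	if t.IsZero() {
		t = now
	}
	return t.Add(-offset)
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2022-03-22T21:49:25Z")
	end := mustParse("2022-03-22T21:00:00Z")

	// Explicit time: shifted back by the offset.
	fmt.Println(resolveQueryTime(end, now, exampleOffset).Format(time.RFC3339)) // 2022-03-22T20:57:00Z
	// No time given: now, shifted back by the same offset.
	fmt.Println(resolveQueryTime(time.Time{}, now, exampleOffset).Format(time.RFC3339)) // 2022-03-22T21:46:25Z
}
```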

Contributor

@michaelmdresser michaelmdresser left a comment

cluster_helpers_test.go long ago flew from my mind, so I'm not sure that I can offer much insight, but I'd be happy to walk through it with you. At a glance, I'd guess it could be related to the nodeMap[id].Minutes = ... change.

pkg/costmodel/router.go (outdated)
// prom db has delayed insertion (thanos, cortex, etc...)
if promQueryOffset != 0 && ctx.name != AllocationContextName {
q.Set("time", time.Now().Add(-promQueryOffset).UTC().Format(time.RFC3339))
if !t.IsZero() {
Contributor

Why not make t a pointer or have a separate method (like you're doing with QueryAtTime) if it's optional?

Contributor

I think it's okay to do this (when is anyone actually going to try to run a query for January 1, year 1?) but it feels awkward to my Go-sense 😅

Contributor Author

But nil pointers are scary, too! It's just a zero value :) But I see what you mean, of course -- I'll try it out and see what feels better.

Contributor Author

Ended up liking time.Time more, actually. (You can use time.Now() and you don't have to worry about nil pointers and you don't have to worry about mutating the value.) But we'll see how it ages! It'd be a quick change to make.
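
For comparison, the two options discussed in this thread can be sketched side by side. The function names and `mustParse` are hypothetical and only illustrate the trade-off; they are not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveTime treats the zero time.Time as "no time given" and falls back
// to now. The caller passes a value, so there is no nil to dereference and
// nothing the callee could mutate.
func effectiveTime(t, now time.Time) time.Time {
	if t.IsZero() {
		return now
	}
	return t
}

// effectiveTimePtr is the pointer-based alternative: nil means "no time
// given". Every caller and callee has to remember the nil check.
func effectiveTimePtr(t *time.Time, now time.Time) time.Time {
	if t == nil {
		return now
	}
	return *t
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2022-03-22T21:49:25Z")
	at := mustParse("2022-03-22T21:00:00Z")

	fmt.Println(effectiveTime(time.Time{}, now).Equal(now)) // true
	fmt.Println(effectiveTime(at, now).Equal(at))           // true
	fmt.Println(effectiveTimePtr(nil, now).Equal(now))      // true
	fmt.Println(effectiveTimePtr(&at, now).Equal(at))       // true
}
```

The value-based version gives up the ability to query at January 1, year 1, which is the limitation noted earlier in the thread.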

pkg/costmodel/router.go (outdated)
Contributor

@Sean-Holcomb Sean-Holcomb left a comment

Looks good. While Michael and Bolt have pointed out improvements, this PR accomplishes the task that it is scoped for.

Base automatically changed from etl to develop March 29, 2022 02:01
…9 to legal time parameters for Prometheus and Thanos querying
@nikovacevic nikovacevic merged commit f735645 into develop Mar 29, 2022
@nikovacevic nikovacevic deleted the niko/query-time branch March 29, 2022 03:42
Development

Successfully merging this pull request may close these issues.

Asset queries result in data off by 1m
4 participants