
Prometheus Request Queue Diagnostics #889

Merged
merged 105 commits into develop from bolt/prom-diagnostics on Aug 30, 2021
Conversation

@mbolt35 (Collaborator) commented Aug 5, 2021

The idea behind these changes is to gain insight into the state of outgoing Prometheus/Thanos requests. This diagnostic tooling will help us determine future optimization strategies.

Goals

  • Add visualization/diagnostics for the queue of Prometheus/Thanos queries so that, at any given point, we can check (a rough sketch of such a diagnostic entry follows this list):
    • How many requests are queued?
    • Which requests are queued?
    • How long has each request been queued?
    • What issued the query (context naming, "categories")?
  • To ensure we correctly monitor every query we issue, move the Prometheus/Thanos query endpoints from /api/ to the cost-model so that frontend queries can be queued alongside the backend queries.
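To make the queue-state goals concrete, here is a minimal, hypothetical sketch of what a queued-request diagnostic entry could look like; the field and type names are illustrative assumptions, not the structures added in this PR.

```go
package diag

import "time"

// QueuedRequest is a hypothetical diagnostic view of one outgoing
// Prometheus/Thanos query sitting in the request queue.
type QueuedRequest struct {
	Query    string    // the PromQL string being issued
	QueuedAt time.Time // when the request entered the queue
	Category string    // context name describing what issued the query
}

// Age reports how long the request has been waiting in the queue.
func (q QueuedRequest) Age() time.Duration {
	return time.Since(q.QueuedAt)
}
```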

@mbolt35 mbolt35 self-assigned this Aug 5, 2021
@mbolt35 mbolt35 added the enhancement New feature or request label Aug 5, 2021
Cluster: cluster,
Namespace: namespace,
},
Pod: pod,
}
mbolt35 (Collaborator, Author) commented:

There is an optimization here that showed a fairly significant reduction in allocations under some loose observation via pprof. The benefits of non-string keys are still hard to pin down, since the keys escape at some point and count toward heap allocations. I did not expect embedding namespaceKey to have that dramatic an effect, but it appears to have helped appendLabels().
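For context, a minimal sketch of the embedded-key pattern being discussed; the type and field names are assumptions inferred from the diff fragment above, not necessarily the exact definitions in the cost-model.

```go
// Hypothetical key types illustrating the embedding discussed above.
type namespaceKey struct {
	Cluster   string
	Namespace string
}

type podKey struct {
	namespaceKey // embedded, so a pod key reuses the cluster/namespace fields
	Pod string
}

// Building a pod key mirrors the fragment shown in the diff.
func newPodKey(cluster, namespace, pod string) podKey {
	return podKey{
		namespaceKey: namespaceKey{
			Cluster:   cluster,
			Namespace: namespace,
		},
		Pod: pod,
	}
}
```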

Collaborator commented:

Wow, I never would have thought. 😲

mbolt35 (Collaborator, Author) replied:

I'll be honest, I'm not completely sold, but I did test this in isolation and there were noticeable changes. I still need to use some better benchmarks for measurement.
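As an aside, the kind of measurement being described could look roughly like the following: a hypothetical Go benchmark comparing concatenated string map keys against comparable struct map keys. This is an illustrative sketch, not the benchmark used for this PR.

```go
package keys

import (
	"strconv"
	"testing"
)

type podKey struct {
	Cluster, Namespace, Pod string
}

var pods = func() []string {
	p := make([]string, 1000)
	for i := range p {
		p[i] = "pod-" + strconv.Itoa(i)
	}
	return p
}()

// BenchmarkStringKeys builds composite map keys by concatenating strings,
// allocating a new key string on every insertion.
func BenchmarkStringKeys(b *testing.B) {
	m := make(map[string]int)
	for i := 0; i < b.N; i++ {
		m["c1/ns1/"+pods[i%len(pods)]]++
	}
}

// BenchmarkStructKeys uses a comparable struct as the map key,
// avoiding the composite-string allocation.
func BenchmarkStructKeys(b *testing.B) {
	m := make(map[podKey]int)
	for i := 0; i < b.N; i++ {
		m[podKey{"c1", "ns1", pods[i%len(pods)]}]++
	}
}
```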

…his will allow better visibility on frontend queries made through our product.
w.Write(body)
}

func (a *Accesses) PrometheusQueryRange(w http.ResponseWriter, r *http.Request, _ httprouter.Params) {
mbolt35 (Collaborator, Author) commented:

For some reason, the expected payload differs between QueryRange and Query on the frontend. For now, I'm prioritizing "ease of migration" over attempting to refactor all of the frontend queries 😄

@@ -157,46 +166,68 @@ func runQuery(query string, ctx *Context, resCh QueryResultsChan, profileLabel s
resCh <- results
}

func (ctx *Context) query(query string) (interface{}, prometheus.Warnings, error) {
// RawQuery is a direct query to the prometheus client and returns the body of the response
func (ctx *Context) RawQuery(query string) ([]byte, error) {
mbolt35 (Collaborator, Author) commented:

RawQuery and RawQueryRange were required for proxying queries from the frontend while also piping the queries through our request queue.
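A rough sketch of how such a proxy endpoint could be wired up, assuming a RawQuery-style call that returns the raw response body after passing through the request queue; the handler and parameter names here are illustrative, not the cost-model's actual handlers.

```go
package main

import "net/http"

// prometheusQueryHandler is a hypothetical proxy handler: it pulls the query
// string from the request, runs it through rawQuery (assumed to go through
// the request queue, like Context.RawQuery in the diff), and writes the raw
// Prometheus response body back unchanged.
func prometheusQueryHandler(rawQuery func(string) ([]byte, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		qry := r.URL.Query().Get("query")
		if qry == "" {
			http.Error(w, "missing query parameter", http.StatusBadRequest)
			return
		}

		body, err := rawQuery(qry)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		w.Header().Set("Content-Type", "application/json")
		w.Write(body)
	}
}
```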


// Note that the warnings return value from client.Do() is always nil using this
// version of the prometheus client library. We parse the warnings out of the response
// body after JSON decoding completes.
mbolt35 (Collaborator, Author) commented:

Hopefully this comment sheds light on the current scenario. The Prometheus warnings are actually part of the response payload. Since the HTTP request will never return warnings directly, we needed to implement our own warning parsing, which happens after unmarshalling.
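For illustration, a minimal sketch of pulling warnings out of the Prometheus response envelope after decoding; the type below mirrors the public Prometheus HTTP API response shape and is an assumption, not the exact cost-model implementation.

```go
package promclient

import "encoding/json"

// promResponse mirrors the Prometheus HTTP API envelope, which carries
// warnings in the JSON body rather than via the client library.
type promResponse struct {
	Status   string          `json:"status"`
	Data     json.RawMessage `json:"data"`
	Warnings []string        `json:"warnings,omitempty"`
	Error    string          `json:"error,omitempty"`
}

// parseWarnings decodes the raw response body and returns the untouched data
// payload alongside any warnings reported by Prometheus.
func parseWarnings(body []byte) (json.RawMessage, []string, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, nil, err
	}
	return resp.Data, resp.Warnings, nil
}
```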

AjayTripathy and others added 28 commits August 20, 2021 15:18
Merge Master into develop
Create PULL_REQUEST_TEMPLATE.md
…h a TryDequeue() method for non-blocking dequeue and Clear() for resetting the queue contents.
…g-quote

Add missing quote to CloudStatus object json rule
optimize query, fix for no cadvisor relabels
Add helm config parameter for shared overhead costs
…string. Refactor specific utilities into respective _util package.
…his will allow better visibility on frontend queries made through our product.
…h a TryDequeue() method for non-blocking dequeue and Clear() for resetting the queue contents.
@mbolt35 mbolt35 merged commit a65c111 into develop Aug 30, 2021
@mbolt35 mbolt35 deleted the bolt/prom-diagnostics branch August 30, 2021 17:57