Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Prometheus URL label #2503

Merged
merged 9 commits into from
May 19, 2021
Merged

fix: Prometheus URL label #2503

merged 9 commits into from
May 19, 2021

Conversation

seremenko-wish
Copy link
Contributor

@seremenko-wish seremenko-wish commented Apr 27, 2021

Related issue

Fixes #2502

Proposed changes

Because an existing HTTP router library does not register in Context object any valuable information that can be used for determining whether URL was matched or not, routers need to be registered in MetricsManager directly and all requests need to be matched a second time. Once the request URL is matched to any of the endpoints in any of the registered routers, middleware emits a metric with a value equal to the path, otherwise label {unmatched} is used. All matched parameters in the path are getting replaced by {param} in the metric label, so:

  • request to /clients/12345 and /clients/randomString with emit an event with label /clients/{param}

Known issues:

  • because the same MetricsManager is used for both Public and Admin HTTP servers, and our router does not populate context with route matching information (like port, matched endpoint), requests sent to public API can be matched with admin API endpoint. No solution is available for this problem.
  • I am trying to reconstruct the endpoint from the request URL and matched parameters, so the request /clients/clients can be matched with endpoint /clients/:id, and reconstruction logic would replace all strings clients to {param}.
  • Ideally, we should analyze the status code, but middleware does not have access to this info (or I can't find a way to get it in the middleware)

Checklist

  • I have read the contributing guidelines.
  • I have read the security policy.
  • I confirm that this pull request does not address a security
    vulnerability. If this pull request addresses a security. vulnerability, I
    confirm that I got green light (please contact
    security@ory.sh) from the maintainers to push
    the changes.
  • I have added tests that prove my fix is effective or that my feature
    works.
  • I have added or changed the documentation.

Further comments

aeneasr
aeneasr previously approved these changes Apr 29, 2021
Copy link
Member

@aeneasr aeneasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I think we can merge it as is. My only question would be - is it normal to remove these params? I guess you would use tracing to find actual slow queries, not prometheus?

@aeneasr aeneasr marked this pull request as ready for review April 29, 2021 06:44
@aeneasr aeneasr self-assigned this Apr 29, 2021
@aeneasr
Copy link
Member

aeneasr commented Apr 29, 2021

Do you want this to be merged already or do you want to address the known issues first?


Not to self (release log):

Currently, prometheus labels are URL-specific leading to high cardinality which can cause OOM and other issues in Ory Hydra, Prometheus, and other systems. This patch addresses that issue by removing cardinality:

- TBD

@seremenko-wish
Copy link
Contributor Author

My only question would be - is it normal to remove these params? I guess you would use tracing to find actual slow queries, not prometheus?

Right, Prometheus is a monitoring tool, and it should not collect data for all unique requested URLs, so my solution is acceptable.

Do you want this to be merged already or do you want to address the known issues first?

Until we start to use a more powerful router, known issues can't be fixed. So let's proceed and merge it.

Thank you.

@aeneasr
Copy link
Member

aeneasr commented Apr 30, 2021

Ideally, we should analyze the status code, but middleware does not have access to this info (or I can't find a way to get it in the middleware)

I believe that rw is of type negroni.ResponseWriter so we have access to status, size and so on:

	var rw http.ResponseWriter
	res := rw.(negroni.ResponseWriter)
	status := res.Status()
	size := res.Size()

I also saw your PR in Ory Kratos - thank you! I would say we put the contents of https://github.com/ory/kratos/tree/4b4032681dc177bb2811684c5858844a69f2bb1d/metrics/prometheus into github.com/ory/x/prometheusx so that we can just import it in e.g. Ory Kratos and that future patches and improvements are rolled out to all projects!

What do you think?

@seremenko-wish
Copy link
Contributor Author

I will experiment with status codes next week, and see how I can improve existing metrics. It would be nice to add counters per status code per endpoint.

@seremenko-wish seremenko-wish mentioned this pull request May 5, 2021
4 tasks
@seremenko-wish
Copy link
Contributor Author

Submitted the PR that refactors the code.

For status codes, I will create a separate ticket/feature request.

@aeneasr
Copy link
Member

aeneasr commented May 11, 2021

The changes in ory/x are published as https://github.com/ory/x/releases/tag/v0.0.237

@@ -1,26 +0,0 @@
package prometheus
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aeneasr what should I do with this file?

@aeneasr
Copy link
Member

aeneasr commented May 19, 2021

Awesome, thank you! :)

@aeneasr aeneasr merged commit f588ec6 into ory:master May 19, 2021
@seremenko-wish seremenko-wish deleted the prom-metrics-fix branch May 19, 2021 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Excessive Prometheus label cardinality issue
2 participants