[Prometheus] Add new prometheus metrics and metrics endpoint #827

Merged - 1 commit merged into ory:master from the Prometheus branch on May 8, 2018

Conversation

@dolbik (Contributor) commented Apr 26, 2018

Signed-off-by: Dmitry Dolbik <dolbik@gmail.com>

This is a draft implementation of Prometheus support in Hydra (#669). Prometheus collects all the metrics the current telemetry does, and more. The current telemetry and Prometheus work together in this pull request.
Added two new Prometheus metrics:

  • Secure token service requests served per endpoint
  • Secure token service response time per endpoint

The resulting Prometheus output:
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
go_goroutines 16
go_memstats_alloc_bytes 3.479416e+06
go_memstats_alloc_bytes_total 3.479416e+06
go_memstats_buck_hash_sys_bytes 1.445704e+06
go_memstats_frees_total 1375
go_memstats_gc_sys_bytes 268288
go_memstats_heap_alloc_bytes 3.479416e+06
go_memstats_heap_idle_bytes 679936
go_memstats_heap_inuse_bytes 4.857856e+06
go_memstats_heap_objects 19294
go_memstats_heap_released_bytes_total 0
go_memstats_heap_sys_bytes 5.537792e+06
go_memstats_last_gc_time_seconds 0
go_memstats_lookups_total 12
go_memstats_mallocs_total 20669
go_memstats_mcache_inuse_bytes 13888
go_memstats_mcache_sys_bytes 16384
go_memstats_mspan_inuse_bytes 59280
go_memstats_mspan_sys_bytes 65536
go_memstats_next_gc_bytes 4.473924e+06
go_memstats_other_sys_bytes 1.028528e+06
go_memstats_stack_inuse_bytes 753664
go_memstats_stack_sys_bytes 753664
go_memstats_sys_bytes 9.115896e+06
sts_requests_total{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 1
sts_requests_total{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.005"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.01"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.025"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.05"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.25"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="2.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="10"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="+Inf"} 1
sts_response_time_seconds_sum{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 0.001876666
sts_response_time_seconds_count{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.005"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.01"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.025"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.05"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.25"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="2.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="10"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="+Inf"} 1
sts_response_time_seconds_sum{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 0.000721316
sts_response_time_seconds_count{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 1

@aeneasr (Member) left a comment

Thank you for the proposal. I think there's still some way to go, also because I'm quite unfamiliar with how Prometheus works. Can you recommend a crash course for me, so I can better judge the implementation here?

)

const (
HealthStatusPath = "/health/status"
MetricsStatusPath = "/health/metrics"
Member:

This should probably be renamed to prometheus

@@ -100,10 +100,14 @@ func NewMetricsManager(issuerURL string, databaseURL string, l logrus.FieldLogge
salt: uuid.New(),
BuildTime: buildTime, BuildVersion: version, BuildHash: hash,
}

go mm.RegisterSegment()
Member:

This is a side effect of NewMetricsHandler which might introduce some issues during testing. Not sure if I like this here.

Contributor Author:

I'll move it back to its old place since we decided to separate Prometheus from the telemetry flag.

@@ -106,9 +106,7 @@ func RunHost(c *config.Config) func(cmd *cobra.Command, args []string) {

if ok, _ := cmd.Flags().GetBool("disable-telemetry"); !ok && os.Getenv("DISABLE_TELEMETRY") != "1" {
Member:

Not sure if we should disable Prometheus when this is true. The idea of this flag is to allow people to disable sending reports to our servers. Prometheus runs locally and doesn't send anything to us.

Member:

ping @dolbik

Contributor Author:

@arekkas I left this part practically unchanged. Server reports stay as is. The Prometheus middleware is always enabled: https://github.com/ory/hydra/pull/827/files#diff-34e2a4fce852fbba1d689b617c6d4a94R114

Member:

Oh Yeah! Sorry, didn't see that - perfect :)

},
[]string{"endpoint"},
),
ResponseTime: prometheus.NewHistogramVec(
Member:

How do we decide which prometheus statistics to use here? Is sts_response_time_seconds standardized somewhere, or chosen arbitrarily?

Contributor Author:

https://prometheus.io/docs/concepts/metric_types/

HistogramVec is used for request durations for example.

The name of the counter should follow https://prometheus.io/docs/practices/naming/ and we can change it according to the rules in that doc.
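To make the histogram point concrete, this is roughly how a HistogramVec is fed with request durations; a sketch only, the helper and its names are illustrative and not part of the PR.

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch only: record one observation per request; the client library derives
// the _bucket, _sum and _count series shown earlier from these observations.
func observeDuration(responseTime *prometheus.HistogramVec, endpoint string, handle func()) {
	start := time.Now()
	handle() // serve the request
	responseTime.WithLabelValues(endpoint).Observe(time.Since(start).Seconds())
}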

Member:

The name of the counter should follow https://prometheus.io/docs/practices/naming/ and we can change it according to the rules in that doc.

Damn, that's what I feared. Naming is subject to the environment you're in, so hardcoding this is not a very good idea. Another issue is which metrics we collect - do we or do we not want to collect CPU metrics? And what about the thousands of other possible metrics?

I feel like implementing Prometheus is like implementing proper analytics/tracking - the data being sent really depends on what you want to learn.

Member:

Just as an example, you want to name this secure_token_service. I'd disagree and say to use the name of the product. Someone else might want oauth2_server here, or whatever. It's really hard to define something that is so dependent on the context it's being used in.

Contributor Author:

I agree with you. A configurable value could be added for naming (for example CLUSTER_NAME, defaulting to hydra). In that case the metric name would be CLUSTER_NAME_response_time_seconds, for example.
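A sketch of what that could look like using the client's Namespace option; CLUSTER_NAME and the "hydra" default are the hypothetical configuration discussed here, not an existing Hydra setting.

import (
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

func newResponseTimeHistogram() *prometheus.HistogramVec {
	// Hypothetical: read the metric prefix from CLUSTER_NAME, defaulting to "hydra".
	namespace := os.Getenv("CLUSTER_NAME")
	if namespace == "" {
		namespace = "hydra"
	}
	return prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: namespace, // final metric name: <namespace>_response_time_seconds
			Name:      "response_time_seconds",
			Help:      "Response time per endpoint.",
		},
		[]string{"endpoint"},
	)
}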

Contributor Author:

By default, the Prometheus client gathers information about:

  • Number of goroutines that currently exist.
  • A summary of the GC invocation durations
  • Number of bytes allocated and still in use
  • Total number of bytes allocated, even if freed
  • Number of bytes obtained by system. Sum of all system allocations
  • Total number of pointer lookups
  • Total number of mallocs
  • Total number of frees
  • Number of heap bytes allocated and still in use
  • Number of heap bytes obtained from system
  • Number of heap bytes waiting to be used
  • Number of heap bytes that are in use
  • Total number of heap bytes released to OS
  • Number of allocated objects
  • Number of bytes in use by the stack allocator
  • Number of bytes obtained from system for stack allocator
  • Number of bytes in use by mspan structures
  • Number of bytes used for mspan structures obtained from system
  • Number of bytes in use by mcache structures
  • Number of bytes used for mcache structures obtained from system
  • Number of bytes used by the profiling bucket hash table
  • Number of bytes used for garbage collection system metadata
  • Number of bytes used for other system allocations
  • Number of heap bytes when next garbage collection will take place
  • Number of seconds since 1970 of last garbage collection
  • Total user and system CPU time spent in seconds
  • Number of open file descriptors
  • Maximum number of open file descriptors
  • Virtual memory size in bytes
  • Resident memory size in bytes
  • Start time of the process since unix epoch in seconds

All other metrics have to be added manually. That's why I added the two metrics that users may need.
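For reference, exposing only those default collectors needs no custom instrumentation at all; with the Go client something like the following is enough. This is a sketch, not the PR's actual wiring - the /health/metrics path mirrors the constant in the diff above, and the port is illustrative.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The default registry already contains the Go runtime and process collectors
	// listed above; promhttp.Handler() serves them in the Prometheus text format.
	http.Handle("/health/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":4444", nil)) // port chosen for illustration
}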

Member:

Ah I see - maybe a good idea then would be to just send the default metrics for now?

Contributor Author:

In this case the new middleware does not make sense; only a new endpoint will be added under "health".

Should I leave the middleware as a skeleton for the future, or can it be deleted?

Contributor Author:

I propose to leave an empty middleware for Prometheus, with examples. It gives users a simple way to add custom metrics if needed.

Member:

Yeah I think that makes sense - we could, in the future, also have the ability to load a plugin for handling the Prometheus middleware, which would make this a no-brainer to use.

}

func NewMetricsHandler(l logrus.FieldLogger, version, hash, buildTime string) *PrometheusHandler {
l.Info("Setting up Prometheus metrics")
Member:

Factories should try to avoid side effects such as logging.

}

func (pmm *PrometheusHandler) ServeHTTP(rw http.ResponseWriter, r *http.Request, next http.HandlerFunc) {
defer func(start time.Time) {
Member:

Not sure we need a defer here? Just put it behind next?
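A sketch of the suggested restructuring - the PrometheusHandler type and its ResponseTime field come from the PR's diff; everything else here is assumed, not the merged code.

import (
	"net/http"
	"time"
)

func (pmm *PrometheusHandler) ServeHTTP(rw http.ResponseWriter, r *http.Request, next http.HandlerFunc) {
	start := time.Now()
	next(rw, r)
	// Record the observation right after next returns instead of in a deferred closure.
	pmm.ResponseTime.WithLabelValues(r.URL.Path).Observe(time.Since(start).Seconds())
}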

@dolbik (Contributor Author) commented Apr 26, 2018

@arekkas Maybe this article can help https://blog.alexellis.io/prometheus-monitoring/

@dolbik dolbik force-pushed the Prometheus branch 3 times, most recently from 888fd6b to 7c2cc77 on April 26, 2018 12:46
"buildTime": buildTime,
},
}
return pm
Contributor Author:

I propose to leave an empty middleware for Prometheus, with examples. It gives users a simple way to add custom metrics if needed.

Member:

Yeah, that makes sense!

Signed-off-by: Dmitry Dolbik <dolbik@gmail.com>
@dolbik (Contributor Author) commented Apr 27, 2018

@arekkas Looks like I have finished the code according to the discussion.

@dolbik (Contributor Author) commented May 2, 2018

@arekkas Are any actions needed from my side?

@aeneasr (Member) commented May 2, 2018

No, all good, I'm just extremely busy with #836 right now, I'll merge it soon

@aeneasr aeneasr merged commit ef94f98 into ory:master May 8, 2018
@aeneasr (Member) commented May 8, 2018

Thank you for your contribution!

mgalagan-sugarcrm pushed a commit to sugarcrm/hydra that referenced this pull request May 14, 2018
* health: Adds new prometheus metrics and metrics endpoint (ory#827)

Signed-off-by: Dmitry Dolbik <dolbik@gmail.com>

* IDM-410 Prometheus Metrics for the Secure token Service
mgloystein pushed a commit to spotxchange/hydra that referenced this pull request May 14, 2018
Based on ory#827

JIRA: WAP-1891
mgloystein added a commit to spotxchange/hydra that referenced this pull request May 14, 2018
* OData no longer defaulted, using header to cause it

* Simpler value

* Add Prometheus endpoint

Based on ory#827

JIRA: WAP-1891
@aeneasr aeneasr mentioned this pull request May 24, 2018
@aeneasr (Member) commented Nov 18, 2018

@dolbik would you mind sharing how you set up Prometheus with Hydra? It would be excellent if you could create a Docker example here and open a PR for it :)

@mobintmu commented May 2, 2023

How can I calculate the number of users whose sessions haven't expired yet? My goal is to count the number of online users.
Does the Prometheus feature support counting users with unexpired sessions?
