New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RED metrics (instrumentation) to system endpoints #1013

Open
wants to merge 7 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@alexellis
Copy link
Member

alexellis commented Jan 4, 2019

Description

Add RED metrics (instrumentation) to system endpoints

As part of this work the Prometheus go client version was upgraded via dep

Motivation and Context

  • I have raised an issue to propose this change (required)

Fixes #532
Fixes #533

How Has This Been Tested?

With a new image built from the PR on a Swarm cluster - I deployed the cows function from the store then invoked it and saw the new metrics appear with a sanitised path:

# HELP http_request_duration_seconds Seconds spent serving HTTP requests.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="0.005"} 0
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="0.01"} 0
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="0.025"} 10
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="0.05"} 11
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="0.1"} 11
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="0.25"} 11
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="0.5"} 11
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="1"} 11
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="2.5"} 11
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="5"} 11
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="10"} 11
http_request_duration_seconds_bucket{method="GET",path="system_function_cows",status="200",le="+Inf"} 11
http_request_duration_seconds_sum{method="GET",path="system_function_cows",status="200"} 0.18653660000000002
http_request_duration_seconds_count{method="GET",path="system_function_cows",status="200"} 11
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="0.005"} 0
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="0.01"} 1
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="0.025"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="0.05"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="0.1"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="0.25"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="0.5"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="1"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="2.5"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="5"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="10"} 10
http_request_duration_seconds_bucket{method="GET",path="system_functions",status="200",le="+Inf"} 10
http_request_duration_seconds_sum{method="GET",path="system_functions",status="200"} 0.1422476
http_request_duration_seconds_count{method="GET",path="system_functions",status="200"} 10
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="0.005"} 0
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="0.01"} 0
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="0.025"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="0.05"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="0.1"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="0.25"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="0.5"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="1"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="2.5"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="5"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="10"} 1
http_request_duration_seconds_bucket{method="POST",path="system_alert",status="0",le="+Inf"} 1
http_request_duration_seconds_sum{method="POST",path="system_alert",status="0"} 0.0226654
http_request_duration_seconds_count{method="POST",path="system_alert",status="0"} 1
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{status="0"} 1
http_requests_total{status="200"} 21

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have signed-off my commits with git commit -s
  • I have added tests to cover my changes.
  • All new and existing tests passed.

alexellis added some commits Jan 4, 2019

Add service RED metrics definitions
Partially fixes #532 by introducing two metrics that are
supported by Kubernetes HPAv2 and RED metrics-style
dashboards.

Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
Instrument system calls
Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
Bump Prometheus client version
- updates the Prometheus go client version and switches to the
promhttp handler to avoid conflicts with the new system-level
metrics.

Tested with Docker Swarm locally - no conflicts and new metrics
were gathered.

Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
Add instrumentation to the alert handler
Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 5, 2019

@kenfdev @stefanprodan any thoughts on the promhttp / prom client tests that are failing on and off? I don't like the sight of a panic

 ---> Running in cf2d76e50457
?       github.com/openfaas/faas/gateway        [no test files]
ok      github.com/openfaas/faas/gateway/handlers       0.007s  coverage: 22.3% of statements
panic: send on closed channel

goroutine 21 [running]:
github.com/openfaas/faas/gateway/vendor/github.com/prometheus/client_golang/prometheus.(*metricMap).Describe(...)
        /go/src/github.com/openfaas/faas/gateway/vendor/github.com/prometheus/client_golang/prometheus/vec.go:222
github.com/openfaas/faas/gateway/metrics.(*Exporter).Describe(0xc420156240, 0xc420158360)
        /go/src/github.com/openfaas/faas/gateway/metrics/exporter.go:44 +0xce
created by github.com/openfaas/faas/gateway/metrics.Test_Describe_DescribesThePrometheusMetrics
        /go/src/github.com/openfaas/faas/gateway/metrics/exporter_test.go:41 +0x17d

Otherwise the new metrics are turning up and there is an RC version of the gateway available to try out on the Docker Hub: openfaas/gateway:0.9.15-rc5

Alex

Name: "requests_total",
Help: "The total number of HTTP requests.",
},
[]string{"status"},

This comment has been minimized.

@LucasRoesler

LucasRoesler Jan 5, 2019

Member

should this metric also have method and path too?

This comment has been minimized.

@alexellis

alexellis Jan 5, 2019

Member

I took the example from @stefanprodan as "what Kubernetes/managed dashboards mandates" and asked the same question on Slack. I don't know.

@stefanprodan can other bits of data be recorded here too and still be compatible?

This comment has been minimized.

@LucasRoesler

LucasRoesler Jan 5, 2019

Member

Compatibility is going to depend on which "managed dashboard" we want to support, making an explicit list will help and it would be good to comment about those in the code to ensure future changes don't accidentally break it.

In my experience with prometheus queries, additional labels shouldn't be an issue as long as the required labels exist.

This comment has been minimized.

@alexellis

alexellis Jan 5, 2019

Member

Personally I'd like the same labels as on the histogram at least.

This comment has been minimized.

@LucasRoesler

LucasRoesler Jan 5, 2019

Member

agreed, I was confused and thought I had the wrong image when I first queried the metrics

This comment has been minimized.

@alexellis

alexellis Jan 7, 2019

Member

Added labels to match both the counter and histogram

psn.ServiceMetrics.Histogram.WithLabelValues(method, urlToLabel(URL), code).Observe(duration.Seconds())
}

var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9]+`)

This comment has been minimized.

@LucasRoesler

LucasRoesler Jan 5, 2019

Member

I have seen labels with /, I have also never had an issue querying and filtering on those labels. Is there a reference for which characters cause issues?

This comment has been minimized.

@alexellis

alexellis Jan 5, 2019

Member

Pretty sure / wasn't allowed in labels.

This comment has been minimized.

@LucasRoesler

This comment has been minimized.

@LucasRoesler

LucasRoesler Jan 5, 2019

Member

I have a prometheus scraping metrics with slashes right now (the space is accidental and being fixed). But it seems like labels are really flexible

This comment has been minimized.

@alexellis

alexellis Jan 7, 2019

Member

@stefanprodan WDYT about slashes?

@stefanprodan

This comment has been minimized.

Copy link
Member

stefanprodan commented Jan 5, 2019

The panic could come from the fact that metrics.BuildMetricsOptions and metrics. RegisterExporter are called more than once during testing. Moving those to init or using once should fix it.

@LucasRoesler

This comment has been minimized.

Copy link
Member

LucasRoesler commented Jan 5, 2019

I just started testing the openfaas/gateway:0.9.15-rc5 build.:

  • The metrics seem to be reported correctly.
  • It doesn't record any metrics for 401 errors, when basic auth is on.
  • With the UI open and load testing a function for 5+ mins, I haven't seen any crashes in the gateway yet
Instrument async handlers
- instruments async handler for report and for queueing async
requests
- make MustRegister only ever run once to prevent sync issues

Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 5, 2019

Newer test image covering the async behaviours in openfaas/gateway:0.9.15-rc6

@LucasRoesler

This comment has been minimized.

Copy link
Member

LucasRoesler commented Jan 5, 2019

I just noticed that if a function with a subpath is called, the subpath is not preserved in the labels. For internal uses, this information might not be as important, but i can certainly see a user wanting that information for the same reason we want the path info in the labels.

@LucasRoesler

This comment has been minimized.

Copy link
Member

LucasRoesler commented Jan 5, 2019

Also just noticed a weird status value on the function path, with the POST method it is reporting "status: 0"

image

@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 5, 2019

#1013 (comment)

Is this comment related to the PR or to metrics in general?

@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 5, 2019

#1013 (comment)

Good spot - I am using in interceptor pattern for those two.. and no explicit status is written, I can fix that quite quickly.

@LucasRoesler

This comment has been minimized.

Copy link
Member

LucasRoesler commented Jan 5, 2019

#1013 (comment)

Is this comment related to the PR or to metrics in general?

Both, if expected the path value to contain the called URL path, not just the part upto the function

@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 5, 2019

Actually that only counts for the async report - the queued proxy does write a status via w.WriteHeader.

alexellis added some commits Jan 5, 2019

Split out notifiers
- splits out notifiers and writes status for async handler

Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
Add unit tests for MakeNotifierWrapper
- fixes issue where result was assigned to value rather than
to pointer reference.

Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>
@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 5, 2019

Latest image openfaas/gateway:0.9.15-rc9 - this contains fixes for some issues I'd introduced in the way I'd written the notifier that was showing "0" as the status code.

Unit tests also added for the middleware.

@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 7, 2019

@LucasRoesler

This comment has been minimized.

Copy link
Member

LucasRoesler commented Jan 13, 2019

I just tried openfaas/gateway:0.9.15-rc9 and the "0" issue is resolved, but the http endpoints still don't seem to detect the synchronous function calls.

I am expecting a metric series for function_nodeinfo

image

@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 13, 2019

Synchronous invocations are not part of this PR.

@LucasRoesler

This comment has been minimized.

Copy link
Member

LucasRoesler commented Jan 14, 2019

Then it seems to be working as expected

@alexellis

This comment has been minimized.

Copy link
Member

alexellis commented Jan 14, 2019

I checked with Julius Volz, a maintainer of Prometheus who said path values can allow /. @stefanprodan what do you want to do?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment