
add metric reporting to check-endpoints #893

Merged

Conversation

@sanchezl (Contributor) commented Jun 29, 2020

Add the following metrics to the check-endpoints utility:

  • namespace_endpoint_check_count
  • namespace_endpoint_check_tcp_connect_latency_gauge
  • namespace_endpoint_check_dns_resolve_latency_gauge

Removed any dependencies on the kube-apiserver pod.

NOTE: Metrics will be scraped by Prometheus in a follow-up PR.
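
For orientation, here is a minimal sketch of how metrics like these could be declared, registered, and updated. It uses plain prometheus/client_golang for brevity, while the PR itself goes through the k8s.io/component-base/metrics wrapper quoted in the review below and builds the final names from a prefix, so the helper and exact metric names here are illustrative:

```go
package checkendpoints

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative declarations; the label set mirrors the diff excerpt quoted later in this thread.
var (
	endpointCheckCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "endpoint_check_count",
		Help: "Report status of endpoint checks for each pod over time.",
	}, []string{"endpoint", "tcpConnect", "dnsResolve"})

	tcpConnectLatencyGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "endpoint_check_tcp_connect_latency_gauge",
		Help: "Latency of the most recent TCP connect attempt, in seconds.",
	}, []string{"endpoint"})
)

func init() {
	// Register with the default registry so the /metrics handler can expose them.
	prometheus.MustRegister(endpointCheckCounter, tcpConnectLatencyGauge)
}

// recordCheck is a hypothetical helper showing how one check result feeds the metrics.
func recordCheck(endpoint, tcpResult, dnsResult string, connectLatency time.Duration) {
	endpointCheckCounter.WithLabelValues(endpoint, tcpResult, dnsResult).Inc()
	if tcpResult == "success" {
		tcpConnectLatencyGauge.WithLabelValues(endpoint).Set(connectLatency.Seconds())
	}
}
```
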

@sanchezl (Contributor Author) commented Jun 29, 2020

I'm still having trouble getting Prometheus to scrape the added metrics.

Notes

  • Prometheus reports Get https://10.0.x.x:17697/metrics: context deadline exceeded.
  • Running curl https://10.0.x.x:17697/metrics from the Prometheus pod (or any pod other than the kube-apiserver pod), I get a TCP connect timeout (see the dial sketch after these notes).
  • Running curl https://10.0.x.x:6443/metrics from the Prometheus pod succeeds.
  • I can successfully run curl https://localhost:17697/metrics from any container in the kube-apiserver pod and see the expected metrics.
  • Open ports on node:
[root@ip-10-0-150-5 /]# lsof -nP  -iTCP -sTCP:LISTEN 
COMMAND      PID     USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
. . . 
kube-apis 835604     root    7u  IPv6 5779468      0t0  TCP *:6443 (LISTEN)
cluster-k 835869     root    6u  IPv6 5778296      0t0  TCP *:17697 (LISTEN)

[root@ip-10-0-150-5 /]# netstat -lnp                                                                                                                                                                                
Active Internet connections (only servers)                                                                                                                                                                          
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name                                                                                                                    
. . . 
tcp6       0      0 :::17697                :::*                    LISTEN      835869/cluster-kube 
tcp6       0      0 :::6443                 :::*                    LISTEN      835604/kube-apiserv 
. . . 
  • iptables:
[root@ip-10-0-150-5 ~]# iptables --table nat --list KUBE-SERVICES | grep "/apiserver"
KUBE-MARK-MASQ  tcp  -- !ip-10-128-0-0.us-west-1.compute.internal/14  ip-172-30-112-163.us-west-1.compute.internal  /* openshift-kube-apiserver/apiserver:https cluster IP */ tcp dpt:https
KUBE-SVC-X7YGTN7QRQI2VNWZ  tcp  --  anywhere             ip-172-30-112-163.us-west-1.compute.internal  /* openshift-kube-apiserver/apiserver:https cluster IP */ tcp dpt:https
KUBE-MARK-MASQ  tcp  -- !ip-10-128-0-0.us-west-1.compute.internal/14  ip-172-30-112-163.us-west-1.compute.internal  /* openshift-kube-apiserver/apiserver:check-endpoints cluster IP */ tcp dpt:17697
KUBE-SVC-H6IOIK73SDTBATQ4  tcp  --  anywhere             ip-172-30-112-163.us-west-1.compute.internal  /* openshift-kube-apiserver/apiserver:check-endpoints cluster IP */ tcp dpt:17697
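
As a quick cross-check of that failure mode, a standalone probe like the sketch below reproduces the same plain TCP connect that check-endpoints performs and tells a filtered port (timeout) apart from a closed one (connection refused). The address is a placeholder for the node IP and port in the notes above:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
	"time"
)

func main() {
	// Placeholder target: node IP and the check-endpoints port from the notes above.
	addr := "10.0.150.5:17697"

	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err == nil {
		fmt.Printf("connected to %s in %s\n", addr, time.Since(start))
		conn.Close()
		return
	}
	switch {
	case os.IsTimeout(err):
		// Packets silently dropped, e.g. by a firewall or cloud security group.
		fmt.Println("timeout:", err)
	case errors.Is(err, syscall.ECONNREFUSED):
		// Port reachable but nothing listening (or actively rejected).
		fmt.Println("connection refused:", err)
	case errors.Is(err, syscall.EHOSTUNREACH):
		fmt.Println("no route to host:", err)
	default:
		fmt.Println("dial error:", err)
	}
}
```

A timeout here, while the port shows as LISTEN on the node itself, points at filtering between the pod network and the node rather than a missing listener, which matches the conclusion below about opening the port on the underlying platform.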

@sanchezl (Contributor Author):

/test e2e-aws

@sanchezl (Contributor Author):

/test e2e-aws
/test e2e-aws-serial

@sanchezl (Contributor Author):

I'm still having trouble getting Prometheus to scrape the added metrics.

I need to get the ports opened on the master node machines in the underlying platform.

@sanchezl sanchezl force-pushed the point-to-point-metrics branch 3 times, most recently from 708513a to 414ef2b Compare July 3, 2020 03:55
@sanchezl (Contributor Author) commented Jul 6, 2020

/retest

endpointCheckCounter = metrics.NewCounterVec(&metrics.CounterOpts{
	Name: prefix + "endpoint_check_count",
	Help: "Report status of endpoint checks for each pod over time.",
}, []string{"endpoint", "tcpConnect", "dnsResolve"})
Contributor:

something about status codes too?

Contributor:

> something about status codes too?

Or is that a different metric?

@sanchezl (Contributor Author):

No status codes as it's simply a TCP connection. Would you like to see the kind of error somehow?

Contributor:

> No status codes as it's simply a TCP connection. Would you like to see the kind of error somehow?

Yes. Not all failure modes are the same: connection refused, no route to host, dial error, and timeout all mean different things.

Contributor:

I would still like timeout, connection refused, no route to host to have separate counts so that we can see how they change over time.
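
One way to get those separate counts would be a failure counter keyed by a normalized reason label, roughly as in this sketch (again plain prometheus/client_golang; the metric and helper names are illustrative, not necessarily what the PR settled on):

```go
package checkendpoints

import (
	"errors"
	"os"
	"syscall"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter keyed by a coarse failure reason, so timeout,
// connection refused, and no-route-to-host failures trend separately.
var endpointCheckFailureCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "endpoint_check_failure_count",
	Help: "Endpoint check failures, partitioned by failure reason.",
}, []string{"endpoint", "reason"})

// failureReason maps a dial error to a stable, low-cardinality label value.
func failureReason(err error) string {
	switch {
	case os.IsTimeout(err):
		return "timeout"
	case errors.Is(err, syscall.ECONNREFUSED):
		return "connection_refused"
	case errors.Is(err, syscall.EHOSTUNREACH):
		return "no_route_to_host"
	default:
		return "dial_error"
	}
}

// recordFailure would be called whenever a check fails.
func recordFailure(endpoint string, err error) {
	endpointCheckFailureCounter.WithLabelValues(endpoint, failureReason(err)).Inc()
}
```
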

	Help: "Report status of endpoint checks for each pod over time.",
}, []string{"endpoint", "tcpConnect", "dnsResolve"})

tcpConnectLatencyGauge = metrics.NewGaugeVec(&metrics.GaugeOpts{
Contributor:

What do we use for infinity (no connection)?

@sanchezl (Contributor Author):

Just an empty value: "".
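
For illustration, one option for the no-connection case is to clear the series instead of exporting a sentinel value, so queries never see a fake latency. A sketch under that assumption (not necessarily what the PR does):

```go
package checkendpoints

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var tcpConnectLatencyGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "endpoint_check_tcp_connect_latency_gauge",
	Help: "Latency of the most recent TCP connect attempt, in seconds.",
}, []string{"endpoint"})

// updateLatency records the latency of a successful connect and removes the
// series when the connect failed, so there is no fake "infinite" sample.
func updateLatency(endpoint string, latency time.Duration, connected bool) {
	if connected {
		tcpConnectLatencyGauge.WithLabelValues(endpoint).Set(latency.Seconds())
		return
	}
	// No connection: drop the series rather than report a sentinel value.
	tcpConnectLatencyGauge.DeleteLabelValues(endpoint)
}
```
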

@deads2k (Contributor) commented Jul 6, 2020

minor questions.

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 6, 2020
@sanchezl sanchezl force-pushed the point-to-point-metrics branch 3 times, most recently from 6f53eaf to 39e128e Compare July 8, 2020 00:23
@deads2k (Contributor) commented Jul 10, 2020

The outage calculation during the upgrade looks incorrect. See:

status:
  conditions:
  - lastTransitionTime: "2020-07-08T15:09:11Z"
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnectSuccess
    status: "True"
    type: Reachable
  failures:
  - latency: 10.005827597s
    message: 'openshift-apiserver-service-172-30-151-149-443: failed to establish
      a TCP connection to 172.30.151.149:443: dial tcp 172.30.151.149:443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2020-07-08T15:41:50Z"
  - latency: 10.001308826s
    message: 'openshift-apiserver-service-172-30-151-149-443: failed to establish
      a TCP connection to 172.30.151.149:443: dial tcp 172.30.151.149:443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2020-07-08T15:41:36Z"
  - latency: 10.002833732s
    message: 'openshift-apiserver-service-172-30-151-149-443: failed to establish
      a TCP connection to 172.30.151.149:443: dial tcp 172.30.151.149:443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2020-07-08T15:40:20Z"
  - latency: 10.000673641s
    message: 'openshift-apiserver-service-172-30-151-149-443: failed to establish
      a TCP connection to 172.30.151.149:443: dial tcp 172.30.151.149:443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2020-07-08T15:40:17Z"
  successes:
  - latency: 1.281624ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:16Z"
  - latency: 1.549382ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:15Z"
  - latency: 1.819134ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:14Z"
  - latency: 192.237µs
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:13Z"
  - latency: 676.948µs
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:12Z"
  - latency: 256.643µs
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:11Z"
  - latency: 1.815262ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:10Z"
  - latency: 435.609µs
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:09Z"
  - latency: 1.409842ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:08Z"
  - latency: 2.111243ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:07Z"

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 16, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 16, 2020
@deads2k (Contributor) commented Jul 16, 2020

we can start somewhere

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 16, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sanchezl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 69559e9 into openshift:master Jul 17, 2020