
add metric reporting to check-endpoints #893

Merged

Conversation

@sanchezl (Contributor) commented Jun 29, 2020

Add the following metrics to the check-endpoints utility:

  • namespace_endpoint_check_count
  • namespace_endpoint_check_tcp_connect_latency_gauge
  • namespace_endpoint_check_dns_resolve_latency_gauge

Removed any dependencies on the kube-apiserver pod.

NOTE: Metrics will be scraped by Prometheus in a follow-up PR.
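
For orientation, here is a minimal sketch of how metrics like these could be declared, registered, and updated. It uses plain prometheus/client_golang for brevity, while the PR itself goes through the k8s.io/component-base/metrics wrapper quoted in the review below and builds the final names from a prefix, so the helper and exact metric names here are illustrative:

```go
package checkendpoints

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative declarations; the label set mirrors the diff excerpt quoted later in this thread.
var (
	endpointCheckCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "endpoint_check_count",
		Help: "Report status of endpoint checks for each pod over time.",
	}, []string{"endpoint", "tcpConnect", "dnsResolve"})

	tcpConnectLatencyGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "endpoint_check_tcp_connect_latency_gauge",
		Help: "Latency of the most recent TCP connect attempt, in seconds.",
	}, []string{"endpoint"})
)

func init() {
	// Register with the default registry so the /metrics handler can expose them.
	prometheus.MustRegister(endpointCheckCounter, tcpConnectLatencyGauge)
}

// recordCheck is a hypothetical helper showing how one check result feeds the metrics.
func recordCheck(endpoint, tcpResult, dnsResult string, connectLatency time.Duration) {
	endpointCheckCounter.WithLabelValues(endpoint, tcpResult, dnsResult).Inc()
	if tcpResult == "success" {
		tcpConnectLatencyGauge.WithLabelValues(endpoint).Set(connectLatency.Seconds())
	}
}
```
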

@sanchezl (Contributor Author) commented Jun 29, 2020

I'm still having trouble getting Prometheus to scrape the added metrics.

Notes

  • Prometheus reports Get https://10.0.x.x:17697/metrics: context deadline exceeded.
  • Running curl https://10.0.x.x:17697/metrics from the Prometheus pod (or any pod other than the kube-apiserver pod), I get a TCP connect timeout (see the dial sketch after these notes).
  • Running curl https://10.0.x.x:6443/metrics from the Prometheus pod succeeds.
  • I can successfully run curl https://localhost:17697/metrics from any container in the kube-apiserver pod and see the expected metrics.
  • Open ports on node:
[root@ip-10-0-150-5 /]# lsof -nP  -iTCP -sTCP:LISTEN 
COMMAND      PID     USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
. . . 
kube-apis 835604     root    7u  IPv6 5779468      0t0  TCP *:6443 (LISTEN)
cluster-k 835869     root    6u  IPv6 5778296      0t0  TCP *:17697 (LISTEN)

[root@ip-10-0-150-5 /]# netstat -lnp                                                                                                                                                                                
Active Internet connections (only servers)                                                                                                                                                                          
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name                                                                                                                    
. . . 
tcp6       0      0 :::17697                :::*                    LISTEN      835869/cluster-kube 
tcp6       0      0 :::6443                 :::*                    LISTEN      835604/kube-apiserv 
. . . 
  • iptables:
[root@ip-10-0-150-5 ~]# iptables --table nat --list KUBE-SERVICES | grep "/apiserver"
KUBE-MARK-MASQ  tcp  -- !ip-10-128-0-0.us-west-1.compute.internal/14  ip-172-30-112-163.us-west-1.compute.internal  /* openshift-kube-apiserver/apiserver:https cluster IP */ tcp dpt:https
KUBE-SVC-X7YGTN7QRQI2VNWZ  tcp  --  anywhere             ip-172-30-112-163.us-west-1.compute.internal  /* openshift-kube-apiserver/apiserver:https cluster IP */ tcp dpt:https
KUBE-MARK-MASQ  tcp  -- !ip-10-128-0-0.us-west-1.compute.internal/14  ip-172-30-112-163.us-west-1.compute.internal  /* openshift-kube-apiserver/apiserver:check-endpoints cluster IP */ tcp dpt:17697
KUBE-SVC-H6IOIK73SDTBATQ4  tcp  --  anywhere             ip-172-30-112-163.us-west-1.compute.internal  /* openshift-kube-apiserver/apiserver:check-endpoints cluster IP */ tcp dpt:17697
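
As a quick cross-check of that failure mode, a standalone probe like the sketch below reproduces the same plain TCP connect that check-endpoints performs and tells a filtered port (timeout) apart from a closed one (connection refused). The address is a placeholder for the node IP and port in the notes above:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
	"time"
)

func main() {
	// Placeholder target: node IP and the check-endpoints port from the notes above.
	addr := "10.0.150.5:17697"

	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err == nil {
		fmt.Printf("connected to %s in %s\n", addr, time.Since(start))
		conn.Close()
		return
	}
	switch {
	case os.IsTimeout(err):
		// Packets silently dropped, e.g. by a firewall or cloud security group.
		fmt.Println("timeout:", err)
	case errors.Is(err, syscall.ECONNREFUSED):
		// Port reachable but nothing listening (or actively rejected).
		fmt.Println("connection refused:", err)
	case errors.Is(err, syscall.EHOSTUNREACH):
		fmt.Println("no route to host:", err)
	default:
		fmt.Println("dial error:", err)
	}
}
```

A timeout here, while the port shows as LISTEN on the node itself, points at filtering between the pod network and the node rather than a missing listener, which matches the conclusion below about opening the port on the underlying platform.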

@sanchezl (Contributor Author):

/test e2e-aws

@sanchezl (Contributor Author):

/test e2e-aws
/test e2e-aws-serial

@sanchezl (Contributor Author):

I'm still having trouble getting Prometheus to scrape the added metrics.

I need to get the ports opened on the master node machines in the underlying platform.

@sanchezl sanchezl force-pushed the point-to-point-metrics branch 3 times, most recently from 708513a to 414ef2b Compare July 3, 2020 03:55
@sanchezl (Contributor Author) commented Jul 6, 2020

/retest

endpointCheckCounter = metrics.NewCounterVec(&metrics.CounterOpts{
	Name: prefix + "endpoint_check_count",
	Help: "Report status of endpoint checks for each pod over time.",
}, []string{"endpoint", "tcpConnect", "dnsResolve"})
Contributor:

something about status codes too?

Contributor:

> something about status codes too?

Or is that a different metric?

@sanchezl (Contributor Author):

No status codes as it's simply a TCP connection. Would you like to see the kind of error somehow?

Contributor:

> No status codes as it's simply a TCP connection. Would you like to see the kind of error somehow?

Yes. Not all failure modes are the same: connection refused, no route to host, dial error, and timeout all mean different things.

Contributor:

I would still like timeout, connection refused, no route to host to have separate counts so that we can see how they change over time.
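
One way to get those separate counts would be a failure counter keyed by a normalized reason label, roughly as in this sketch (again plain prometheus/client_golang; the metric and helper names are illustrative, not necessarily what the PR settled on):

```go
package checkendpoints

import (
	"errors"
	"os"
	"syscall"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter keyed by a coarse failure reason, so timeout,
// connection refused, and no-route-to-host failures trend separately.
var endpointCheckFailureCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "endpoint_check_failure_count",
	Help: "Endpoint check failures, partitioned by failure reason.",
}, []string{"endpoint", "reason"})

// failureReason maps a dial error to a stable, low-cardinality label value.
func failureReason(err error) string {
	switch {
	case os.IsTimeout(err):
		return "timeout"
	case errors.Is(err, syscall.ECONNREFUSED):
		return "connection_refused"
	case errors.Is(err, syscall.EHOSTUNREACH):
		return "no_route_to_host"
	default:
		return "dial_error"
	}
}

// recordFailure would be called whenever a check fails.
func recordFailure(endpoint string, err error) {
	endpointCheckFailureCounter.WithLabelValues(endpoint, failureReason(err)).Inc()
}
```
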

	Help: "Report status of endpoint checks for each pod over time.",
}, []string{"endpoint", "tcpConnect", "dnsResolve"})

tcpConnectLatencyGauge = metrics.NewGaugeVec(&metrics.GaugeOpts{
Contributor:

What do we use for infinity (no connection)?

@sanchezl (Contributor Author):

Just an empty value: "".
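
For illustration, one option for the no-connection case is to clear the series instead of exporting a sentinel value, so queries never see a fake latency. A sketch under that assumption (not necessarily what the PR does):

```go
package checkendpoints

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var tcpConnectLatencyGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "endpoint_check_tcp_connect_latency_gauge",
	Help: "Latency of the most recent TCP connect attempt, in seconds.",
}, []string{"endpoint"})

// updateLatency records the latency of a successful connect and removes the
// series when the connect failed, so there is no fake "infinite" sample.
func updateLatency(endpoint string, latency time.Duration, connected bool) {
	if connected {
		tcpConnectLatencyGauge.WithLabelValues(endpoint).Set(latency.Seconds())
		return
	}
	// No connection: drop the series rather than report a sentinel value.
	tcpConnectLatencyGauge.DeleteLabelValues(endpoint)
}
```
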

@deads2k (Contributor) commented Jul 6, 2020

minor questions.

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 6, 2020
@sanchezl sanchezl force-pushed the point-to-point-metrics branch 3 times, most recently from 6f53eaf to 39e128e Compare July 8, 2020 00:23
@deads2k (Contributor) commented Jul 10, 2020

The outage calculation during the upgrade looks incorrect. See:

status:
  conditions:
  - lastTransitionTime: "2020-07-08T15:09:11Z"
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnectSuccess
    status: "True"
    type: Reachable
  failures:
  - latency: 10.005827597s
    message: 'openshift-apiserver-service-172-30-151-149-443: failed to establish
      a TCP connection to 172.30.151.149:443: dial tcp 172.30.151.149:443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2020-07-08T15:41:50Z"
  - latency: 10.001308826s
    message: 'openshift-apiserver-service-172-30-151-149-443: failed to establish
      a TCP connection to 172.30.151.149:443: dial tcp 172.30.151.149:443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2020-07-08T15:41:36Z"
  - latency: 10.002833732s
    message: 'openshift-apiserver-service-172-30-151-149-443: failed to establish
      a TCP connection to 172.30.151.149:443: dial tcp 172.30.151.149:443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2020-07-08T15:40:20Z"
  - latency: 10.000673641s
    message: 'openshift-apiserver-service-172-30-151-149-443: failed to establish
      a TCP connection to 172.30.151.149:443: dial tcp 172.30.151.149:443: i/o timeout'
    reason: TCPConnectError
    success: false
    time: "2020-07-08T15:40:17Z"
  successes:
  - latency: 1.281624ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:16Z"
  - latency: 1.549382ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:15Z"
  - latency: 1.819134ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:14Z"
  - latency: 192.237µs
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:13Z"
  - latency: 676.948µs
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:12Z"
  - latency: 256.643µs
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:11Z"
  - latency: 1.815262ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:10Z"
  - latency: 435.609µs
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:09Z"
  - latency: 1.409842ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:08Z"
  - latency: 2.111243ms
    message: 'openshift-apiserver-service-172-30-151-149-443: tcp connection to 172.30.151.149:443
      succeeded'
    reason: TCPConnect
    success: true
    time: "2020-07-08T15:54:07Z"

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 16, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 16, 2020
@deads2k (Contributor) commented Jul 16, 2020

we can start somewhere

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 16, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sanchezl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 69559e9 into openshift:master Jul 17, 2020