add metric reporting to check-endpoints #893
Conversation
I'm still having trouble getting Prometheus to scrape the added metrics.

/test e2e-aws

I need to get the ports opened on the master node machines in the underlying platform.
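For context on what "getting Prometheus to scrape" typically involves here, a minimal ServiceMonitor sketch is shown below. Everything in it is an assumption for illustration (the namespace, port name, and label selector are not taken from this PR), and the actual cluster may wire scraping up differently:

```yaml
# Hypothetical sketch only: a minimal ServiceMonitor pointing Prometheus at a
# metrics endpoint. Namespace, port name, and labels are assumed, not from the PR.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: check-endpoints        # assumed name
  namespace: openshift-monitoring
spec:
  endpoints:
    - port: metrics            # assumed port name on the backing Service
      interval: 30s
      scheme: https
  selector:
    matchLabels:
      app: check-endpoints     # assumed label
```

Even with a correct scrape config, the scrape fails until the metrics port is reachable from the Prometheus pods, which is why the firewall/ports change on the master nodes mentioned above matters.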
Force-pushed from 708513a to 414ef2b
/retest
endpointCheckCounter = metrics.NewCounterVec(&metrics.CounterOpts{
	Name: prefix + "endpoint_check_count",
	Help: "Report status of endpoint checks for each pod over time.",
}, []string{"endpoint", "tcpConnect", "dnsResolve"})
something about status codes too?
or is that a different metric.
No status codes as it's simply a TCP connection. Would you like to see the kind of error somehow?
> No status codes as it's simply a TCP connection. Would you like to see the kind of error somehow?

Yes. Not all failure modes are the same: connection refused, no route to host, dial error, timeout; they all mean different things.
I would still like timeout, connection refused, no route to host to have separate counts so that we can see how they change over time.
tcpConnectLatencyGauge = metrics.NewGaugeVec(&metrics.GaugeOpts{
what do we use for infinity (no connection)?
just an empty value ""
minor questions.

/approve
Force-pushed from 6f53eaf to 39e128e
outage calculation in upgrade looks incorrect. See
Force-pushed from 702efaf to 3ee8198
we can start somewhere

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sanchezl. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
Add the following metrics to the check-endpoint utility:

- namespace_endpoint_check_count
- namespace_endpoint_check_tcp_connect_latency_gauge
- namespace_endpoint_check_dns_resolve_latency_gauge

Removed any dependencies on the kube-apiserver pod.
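Once the series are being scraped, they can be queried. The PromQL below is a hedged sketch: the `namespace` prefix stands in for whatever prefix the utility actually applies, and the `tcpConnect` label value shown is an assumption based on the label names in the code above:

```promql
# Rate of endpoint checks failing TCP connect, per endpoint
# (label value "failure" is assumed for illustration).
sum by (endpoint) (rate(namespace_endpoint_check_count{tcpConnect="failure"}[5m]))

# Current TCP connect latency per endpoint.
namespace_endpoint_check_tcp_connect_latency_gauge
```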