-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
aggregator: add APIService unavailability metrics #71380
aggregator: add APIService unavailability metrics #71380
Conversation
sttts
commented
Nov 23, 2018
/lgtm |
cb323d7
to
f8730dc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/cc @logicalhan |
@jennybuckley: GitHub didn't allow me to request PR reviews from the following users: logicalhan. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @deads2k Approved? |
case apiregistration.Available: | ||
if wasTrue && newCondition.Status == apiregistration.ConditionFalse { | ||
unavailableCounter.WithLabelValues(apiService.Name, newCondition.Reason).Inc() | ||
unavailableGauge.WithLabelValues(apiService.Name).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we directly set this instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a related question: what happens in HA? Multiple instances fight for their (availability) view of the world. How is that handled with a gauge usually. @brancz do you have a suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need some kind of apiserver identity here and put it into a label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re HA: If the API server endpoints are being scraped directly, instance labels can be attached via the Prometheus configuration to each API server target using https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config.
If those API servers are sitting behind LBs and are opaque, then things get complicated. If those gauge values are different at different scrapes, those values would be flapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing should ever be scraped through a load balancer, that would not be whitebox monitoring. And yes what @s-urbaniak said is correct, at ingestion time Prometheus attaches the apiserver's target's labels onto the time-series produced by this metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sttts Hi, why do multiple instances have different (availability) view of the world? Could you please help explain it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's a distributed system: different instances are (probably) on different computers with different network connectivity, and they definitely scrape at different times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lavalamp So each instance would run below func on sync()
. Do you mean in different instances, the parameters originalAPIService
and newAPIService
might have different values?
func updateAPIServiceStatus(client apiregistrationclient.APIServicesGetter, originalAPIService, newAPIService *apiregistrationv1.APIService) (*apiregistrationv1.APIService, error) {
/retest |
@@ -92,6 +94,7 @@ func (r *proxyHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) { | |||
} | |||
|
|||
if !handlingInfo.serviceAvailable { | |||
unavailableRequestCounter.WithLabelValues(handlingInfo.name).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't we already recording status code with sufficient labels to obtain this info? (xref https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L157-L170)
if not, should we add API group and version to the existing metric labels, rather than adding a count metric for a specific http code here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The referenced metrics will be recorded in the aggregated apiserver, not here in the aggregator. I.e. we miss those requests. Not sure you mean this: we could add to the very same metrics from the aggregator as well, labelling it with "failed in the aggregator".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, we don't already have a filter in place that will call one of metrics.{InstrumentRouteFunc,MonitorRequest,Record}
for error responses served by the aggregator? if not, it seems better to fix that to capture all errors (not just service unavailable), and ensure there's sufficient labels to isolate errors from a particular group/version/resource. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a commit to show what I mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think?
That's what I meant above.
func setConditionAndUpdateStatus(client apiregistrationclient.APIServicesGetter, apiService *apiregistration.APIService, newCondition apiregistration.APIServiceCondition) error { | ||
orig := apiService.DeepCopy() | ||
apiregistration.SetAPIServiceCondition(apiService, newCondition) | ||
if !reflect.DeepEqual(orig, apiService) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invert the check and return early if equal to unnest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
9a6a03f
to
8182637
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, sttts The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
8182637
to
889e43f
Compare
staging/src/k8s.io/kube-aggregator/pkg/controllers/status/available_controller.go
Outdated
Show resolved
Hide resolved
889e43f
to
fce6eb0
Compare
/retest |
/lgtm |
@logicalhan: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@liggitt ptal |
/lgtm |
/retest Review the full test history for this PR. Silence the bot with an |