
aggregator: add APIService unavailability metrics #71380

Merged

Conversation

@sttts
Contributor

commented Nov 23, 2018

Add aggregator_unavailable_apiservice_{count,gauge} metrics to the kube-aggregator.
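A minimal, dependency-free sketch of the two metrics' semantics (the real PR uses client_golang's CounterVec/GaugeVec registered with Prometheus; the map-backed `metricVec` type and the label strings below are purely illustrative stand-ins):

```go
package main

import "fmt"

// metricVec is a hypothetical stand-in for a labeled Prometheus metric.
type metricVec map[string]float64

func (m metricVec) inc(labels string) { m[labels]++ }
func (m metricVec) dec(labels string) { m[labels]-- }

func main() {
	// counter: cumulative, never decremented; labeled by apiservice name and reason
	unavailableCounter := metricVec{}
	// gauge: current state, moves both ways; labeled by apiservice name only
	unavailableGauge := metricVec{}

	// an APIService flips Available -> Unavailable
	unavailableCounter.inc(`name="v1beta1.metrics.k8s.io",reason="FailedDiscoveryCheck"`)
	unavailableGauge.inc(`name="v1beta1.metrics.k8s.io"`)

	// ...and later recovers: only the gauge goes back down
	unavailableGauge.dec(`name="v1beta1.metrics.k8s.io"`)

	fmt.Println(unavailableCounter[`name="v1beta1.metrics.k8s.io",reason="FailedDiscoveryCheck"`]) // prints 1
	fmt.Println(unavailableGauge[`name="v1beta1.metrics.k8s.io"`])                                 // prints 0
}
```

The counter answers "how often did this flap?", the gauge answers "how many are unavailable right now?".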
@mfojtik

Contributor

commented Nov 23, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 23, 2018
@sttts sttts force-pushed the sttts:sttts-aggregator-metrics-available branch from cb323d7 to f8730dc Nov 23, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm label Nov 23, 2018
Contributor

left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 23, 2018
@jennybuckley

Contributor

commented Nov 26, 2018

/cc @logicalhan

@k8s-ci-robot

Contributor

commented Nov 26, 2018

@jennybuckley: GitHub didn't allow me to request PR reviews from the following users: logicalhan.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @logicalhan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sttts

Contributor Author

commented Nov 27, 2018

/assign @deads2k

Approved?

```go
case apiregistration.Available:
	if wasTrue && newCondition.Status == apiregistration.ConditionFalse {
		unavailableCounter.WithLabelValues(apiService.Name, newCondition.Reason).Inc()
		unavailableGauge.WithLabelValues(apiService.Name).Inc()
```

@deads2k

deads2k Nov 27, 2018

Contributor

Can we directly set this instead?

@sttts

sttts Nov 28, 2018

Author Contributor

I have a related question: what happens in HA? Multiple instances fight over their (availability) view of the world. How is that usually handled with a gauge? @brancz, do you have a suggestion?

@sttts

sttts Nov 28, 2018

Author Contributor

We need some kind of apiserver identity here, and to put it into a label.

@s-urbaniak

s-urbaniak Nov 28, 2018

Contributor

re HA: If the API server endpoints are being scraped directly, instance labels can be attached via the Prometheus configuration to each API server target using https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config.

If those API servers are sitting behind LBs and are opaque, then things get complicated: if the gauge values differ between instances, the scraped series would flap from one scrape to the next.

@brancz

brancz Nov 28, 2018

Member

Nothing should ever be scraped through a load balancer; that would not be whitebox monitoring. And yes, what @s-urbaniak said is correct: at ingestion time Prometheus attaches the apiserver target's labels onto the time series produced by this metric.
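As a concrete illustration of the per-instance scrape setup described above (the addresses, job name, and `apiserver` label below are hypothetical, not taken from any real cluster config):

```yaml
scrape_configs:
  - job_name: kube-apiserver
    scheme: https
    static_configs:
      # one target per kube-apiserver instance, scraped directly,
      # never through the load balancer
      - targets: ["10.0.0.10:6443"]
        labels: {apiserver: "master-0"}
      - targets: ["10.0.0.11:6443"]
        labels: {apiserver: "master-1"}
```

Each instance's view of APIService availability then lands in a distinct labeled series, so differing gauge values do not flap within one series.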

@deads2k

Contributor

commented Nov 27, 2018

@kubernetes/sig-api-machinery-pr-reviews have a look at these metrics. They will let us know about flapping apiservers.

/approve
/hold

holding for other comments.

@sttts it looks like you subsumed #71273. Is that the case?

```diff
@@ -92,6 +94,7 @@ func (r *proxyHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
 	}
 
 	if !handlingInfo.serviceAvailable {
+		unavailableRequestCounter.WithLabelValues(handlingInfo.name).Inc()
```

@liggitt

liggitt Nov 29, 2018

Member

aren't we already recording status code with sufficient labels to obtain this info? (xref https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L157-L170)

if not, should we add API group and version to the existing metric labels, rather than adding a count metric for a specific http code here?

@sttts

sttts Nov 29, 2018

Author Contributor

The referenced metrics are recorded in the aggregated apiserver, not here in the aggregator, i.e. we miss those requests. Not sure whether you mean this: we could also add to the very same metrics from the aggregator, labelling them as "failed in the aggregator".

@liggitt

liggitt Nov 29, 2018

Member

Oh, we don't already have a filter in place that calls one of metrics.{InstrumentRouteFunc,MonitorRequest,Record} for error responses served by the aggregator? If not, it seems better to fix that to capture all errors (not just service unavailable), and to ensure there are sufficient labels to isolate errors from a particular group/version/resource. What do you think?

@sttts

sttts Nov 29, 2018

Author Contributor

Added a commit to show what I mean.

@sttts

sttts Nov 29, 2018

Author Contributor

What do you think?

That's what I meant above.

```go
func setConditionAndUpdateStatus(client apiregistrationclient.APIServicesGetter, apiService *apiregistration.APIService, newCondition apiregistration.APIServiceCondition) error {
	orig := apiService.DeepCopy()
	apiregistration.SetAPIServiceCondition(apiService, newCondition)
	if !reflect.DeepEqual(orig, apiService) {
```

@liggitt

liggitt Nov 29, 2018

Member

invert the check and return early if equal to unnest?

@sttts

sttts Nov 29, 2018

Author Contributor

done
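The suggested inversion, sketched on a simplified stand-in type (the `service` struct and both helpers are hypothetical; the real function takes an apiregistration client and an *apiregistration.APIService and issues an UpdateStatus call):

```go
package main

import (
	"fmt"
	"reflect"
)

// service is a hypothetical stand-in for apiregistration.APIService.
type service struct {
	Name       string
	Conditions []string
}

// setCondition mimics apiregistration.SetAPIServiceCondition: it adds the
// condition only if it is not already present.
func setCondition(s *service, cond string) {
	for _, c := range s.Conditions {
		if c == cond {
			return
		}
	}
	s.Conditions = append(s.Conditions, cond)
}

// setConditionAndUpdate applies the review suggestion: compare first and
// return early when nothing changed, instead of nesting the update in an if.
func setConditionAndUpdate(s *service, cond string) bool {
	orig := service{Name: s.Name, Conditions: append([]string(nil), s.Conditions...)}
	setCondition(s, cond)
	if reflect.DeepEqual(orig, *s) {
		return false // nothing changed: no status update needed
	}
	// ...the real function would call UpdateStatus here...
	return true
}

func main() {
	s := &service{Name: "v1beta1.metrics.k8s.io"}
	fmt.Println(setConditionAndUpdate(s, "Available")) // prints true: condition added
	fmt.Println(setConditionAndUpdate(s, "Available")) // prints false: already set
}
```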

@sttts sttts force-pushed the sttts:sttts-aggregator-metrics-available branch 2 times, most recently from 9a6a03f to 8182637 Nov 29, 2018
@k8s-ci-robot

Contributor

commented Nov 29, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sttts sttts force-pushed the sttts:sttts-aggregator-metrics-available branch from 8182637 to 889e43f Nov 29, 2018
@sttts sttts force-pushed the sttts:sttts-aggregator-metrics-available branch from 889e43f to fce6eb0 Dec 3, 2018
@sttts

Contributor Author

commented Dec 3, 2018

/retest

@logicalhan

Contributor

commented Dec 3, 2018

/lgtm

@k8s-ci-robot

Contributor

commented Dec 3, 2018

@logicalhan: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sttts sttts removed the do-not-merge/hold label Dec 4, 2018
@sttts

Contributor Author

commented Dec 5, 2018

@liggitt ptal

@liggitt

Member

commented Dec 5, 2018

/lgtm

@fejta-bot


commented Dec 6, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 82b0d8f into kubernetes:master Dec 6, 2018
18 checks passed
cla/linuxfoundation sttts authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.
@logicalhan logicalhan referenced this pull request Dec 8, 2018
6 of 6 tasks complete