aggregator: add APIService unavailability metrics #71380

Merged

sttts
Contributor

@sttts sttts commented Nov 23, 2018

Add aggregator_unavailable_apiservice_{count,gauge} metrics in the kube-aggregator.

@k8s-ci-robot added the release-note, size/L, cncf-cla: yes, needs-kind, and needs-sig labels Nov 23, 2018
@k8s-ci-robot added the needs-priority and sig/api-machinery labels and removed the needs-sig label Nov 23, 2018
@sttts added the priority/important-soon label Nov 23, 2018
@k8s-ci-robot removed the needs-priority label Nov 23, 2018
@sttts added the kind/feature label Nov 23, 2018
@k8s-ci-robot removed the needs-kind label Nov 23, 2018
@mfojtik
Contributor

mfojtik commented Nov 23, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 23, 2018
@sttts sttts force-pushed the sttts-aggregator-metrics-available branch from cb323d7 to f8730dc Compare November 23, 2018 13:55
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 23, 2018
Contributor

@soltysh soltysh left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 23, 2018
@jennybuckley

/cc @logicalhan

@k8s-ci-robot
Contributor

@jennybuckley: GitHub didn't allow me to request PR reviews from the following users: logicalhan.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @logicalhan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sttts
Contributor Author

sttts commented Nov 27, 2018

/assign @deads2k

Approved?

case apiregistration.Available:
	if wasTrue && newCondition.Status == apiregistration.ConditionFalse {
		unavailableCounter.WithLabelValues(apiService.Name, newCondition.Reason).Inc()
		unavailableGauge.WithLabelValues(apiService.Name).Inc()
Contributor

Can we directly set this instead?
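
To illustrate the question above, here is a minimal sketch of the difference between incrementing a gauge on observed transitions and setting it directly from the current condition. The `gauge` type is a hypothetical, stdlib-only stand-in for a labelled gauge, `observeInc` and `observeSet` are made-up helper names, and the APIService name is only an example; none of this is the PR's code.

```go
package main

import "fmt"

// gauge is a hypothetical stand-in for a gauge vector keyed by APIService name.
type gauge map[string]float64

// observeInc mimics the Inc/Dec style: the value is only correct if every
// transition is observed exactly once.
func (g gauge) observeInc(name string, wasAvailable, isAvailable bool) {
	switch {
	case wasAvailable && !isAvailable:
		g[name]++
	case !wasAvailable && isAvailable:
		g[name]--
	}
}

// observeSet derives the value from the current state alone, so repeating
// the same observation is harmless.
func (g gauge) observeSet(name string, isAvailable bool) {
	if isAvailable {
		g[name] = 0
	} else {
		g[name] = 1
	}
}

func main() {
	const name = "v1beta1.metrics.k8s.io" // example name, an assumption
	inc, set := gauge{}, gauge{}
	// The same "became unavailable" observation is processed twice, for
	// example because a sync was retried with a stale copy of the object.
	for i := 0; i < 2; i++ {
		inc.observeInc(name, true, false)
		set.observeSet(name, false)
	}
	fmt.Println("inc-style gauge:", inc[name])
	fmt.Println("set-style gauge:", set[name])
}
```

In this sketch the increment-style value drifts to 2 after the duplicated observation, while the set-style value stays at 1. That idempotence is the usual argument for setting a gauge from state rather than counting transitions.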

Contributor Author

I have a related question: what happens in HA? Multiple instances fight over their (availability) view of the world. How is that usually handled with a gauge? @brancz, do you have a suggestion?

Contributor Author

We need some kind of apiserver identity here, put into a label.

Contributor

re HA: If the API server endpoints are being scraped directly, instance labels can be attached via the Prometheus configuration to each API server target using https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config.

If those API servers are sitting behind LBs and are opaque, then things get complicated. If those gauge values are different at different scrapes, those values would be flapping.

Member

Nothing should ever be scraped through a load balancer; that would not be whitebox monitoring. And yes, what @s-urbaniak said is correct: at ingestion time Prometheus attaches the apiserver target's labels onto the time series produced by this metric.
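
A rough sketch of why per-target labels resolve the HA concern: once each apiserver target contributes its own `instance` label, the same gauge scraped from two apiservers becomes two distinct series instead of one flapping series. `attachTargetLabels` and `seriesID` are hypothetical helpers, the addresses are made up, and the merge is deliberately simplified; it does not model how Prometheus handles label conflicts.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// attachTargetLabels returns a copy of the scraped series labels with the
// target's labels added. Simplification: a target label overwrites a scraped
// label of the same name; real conflict handling is more involved.
func attachTargetLabels(series, target map[string]string) map[string]string {
	out := make(map[string]string, len(series)+len(target))
	for k, v := range series {
		out[k] = v
	}
	for k, v := range target {
		out[k] = v
	}
	return out
}

// seriesID renders labels in sorted order so equal label sets compare equal.
func seriesID(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return "{" + strings.Join(parts, ",") + "}"
}

func main() {
	// Labels as exposed by each apiserver; the name value is an example.
	scraped := map[string]string{"name": "v1beta1.metrics.k8s.io"}
	// Two directly scraped apiserver targets (addresses are assumptions).
	for _, instance := range []string{"10.0.0.1:6443", "10.0.0.2:6443"} {
		labels := attachTargetLabels(scraped, map[string]string{"instance": instance})
		fmt.Println("aggregator_unavailable_apiservice_gauge" + seriesID(labels))
	}
}
```

Each apiserver's view of availability then lives in its own series, so differing views show up as differing series rather than as one value that changes from scrape to scrape.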

Member

@sttts Hi, why do multiple instances have different (availability) views of the world? Could you please explain?

Member

Because it's a distributed system: different instances are (probably) on different computers with different network connectivity, and they definitely scrape at different times.

Member

@lavalamp So each instance would run the func below on sync(). Do you mean that in different instances, the parameters originalAPIService and newAPIService might have different values?

func updateAPIServiceStatus(client apiregistrationclient.APIServicesGetter, originalAPIService, newAPIService *apiregistrationv1.APIService) (*apiregistrationv1.APIService, error) {

@deads2k
Contributor

deads2k commented Nov 27, 2018

@kubernetes/sig-api-machinery-pr-reviews have a look at these metrics. They will let us know about flapping apiservers.

/approve
/hold

holding for other comments.

@sttts it looks like you subsumed #71273. Is that the case?

@k8s-ci-robot added the do-not-merge/hold and approved labels Nov 27, 2018
@sttts
Contributor Author

sttts commented Nov 29, 2018

/retest

@@ -92,6 +94,7 @@ func (r *proxyHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	}

	if !handlingInfo.serviceAvailable {
		unavailableRequestCounter.WithLabelValues(handlingInfo.name).Inc()
Member

aren't we already recording status code with sufficient labels to obtain this info? (xref https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L157-L170)

if not, should we add API group and version to the existing metric labels, rather than adding a count metric for a specific http code here?

Contributor Author

The referenced metrics will be recorded in the aggregated apiserver, not here in the aggregator, i.e. we miss those requests. Not sure whether you mean this: we could also add to the very same metrics from the aggregator, labelling them with "failed in the aggregator".

Member

@liggitt liggitt Nov 29, 2018

Oh, we don't already have a filter in place that will call one of metrics.{InstrumentRouteFunc,MonitorRequest,Record} for error responses served by the aggregator? If not, it seems better to fix that to capture all errors (not just service unavailable), and ensure there are sufficient labels to isolate errors from a particular group/version/resource. What do you think?
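
As a sketch of the alternative suggested here, the following stdlib-only example wraps a handler so that every error response is counted by group, version, and status code, rather than counting only "service unavailable". `errorCounts`, `statusRecorder`, `groupVersion`, `withErrorMetrics`, and `demo` are hypothetical names for illustration; this is not the apiserver's `metrics` package and does not reproduce its labels.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// errorCounts is a hypothetical stand-in for a labelled counter.
type errorCounts struct {
	mu     sync.Mutex
	counts map[string]int
}

func (c *errorCounts) inc(group, version string, code int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[fmt.Sprintf("%s/%s/%d", group, version, code)]++
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// groupVersion extracts <group> and <version> from /apis/<group>/<version>/...
func groupVersion(path string) (string, string) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) >= 3 && parts[0] == "apis" {
		return parts[1], parts[2]
	}
	return "", ""
}

// withErrorMetrics counts every response with status >= 400, not only 503.
func withErrorMetrics(c *errorCounts, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, req)
		if rec.code >= 400 {
			g, v := groupVersion(req.URL.Path)
			c.inc(g, v, rec.code)
		}
	})
}

// demo sends one request through a handler that always answers 503.
func demo() map[string]int {
	counts := &errorCounts{counts: map[string]int{}}
	unavailable := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		http.Error(w, "service unavailable", http.StatusServiceUnavailable)
	})
	req := httptest.NewRequest("GET", "/apis/metrics.k8s.io/v1beta1/nodes", nil)
	withErrorMetrics(counts, unavailable).ServeHTTP(httptest.NewRecorder(), req)
	return counts.counts
}

func main() {
	fmt.Println(demo())
}
```

The design point is that a single filter in front of the proxy handler sees all error responses, so a dedicated counter for one status code becomes unnecessary as long as the labels identify the group and version.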

Contributor Author

Added a commit to show what I mean.

Contributor Author

What do you think?

That's what I meant above.

func setConditionAndUpdateStatus(client apiregistrationclient.APIServicesGetter, apiService *apiregistration.APIService, newCondition apiregistration.APIServiceCondition) error {
	orig := apiService.DeepCopy()
	apiregistration.SetAPIServiceCondition(apiService, newCondition)
	if !reflect.DeepEqual(orig, apiService) {
Member

invert the check and return early if equal to unnest?

Contributor Author

done
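
For readers following the suggestion, a minimal sketch of the inverted check with an early return. The `condition` and `apiService` types, `setCondition`, and the `update` callback are simplified, hypothetical stand-ins for the apiregistration types and client call in the diff, and the reason string is only an example value; the merged code may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// condition and apiService are simplified, hypothetical stand-ins for the
// apiregistration types referenced in the diff.
type condition struct {
	Type, Status, Reason string
}

type apiService struct {
	Name       string
	Conditions []condition
}

// setCondition replaces the condition of the same type or appends it.
func setCondition(s *apiService, c condition) {
	for i := range s.Conditions {
		if s.Conditions[i].Type == c.Type {
			s.Conditions[i] = c
			return
		}
	}
	s.Conditions = append(s.Conditions, c)
}

// setConditionAndUpdate returns early when nothing changed, so the update
// path needs no extra nesting. update stands in for the client call.
func setConditionAndUpdate(s *apiService, c condition, update func(*apiService) error) error {
	orig := apiService{Name: s.Name, Conditions: append([]condition(nil), s.Conditions...)}
	setCondition(s, c)
	if reflect.DeepEqual(orig, *s) {
		return nil // nothing changed: skip the update entirely
	}
	return update(s)
}

func main() {
	updates := 0
	update := func(*apiService) error { updates++; return nil }
	svc := &apiService{Name: "v1beta1.metrics.k8s.io"}
	unavailable := condition{Type: "Available", Status: "False", Reason: "MissingEndpoints"}
	_ = setConditionAndUpdate(svc, unavailable, update) // changed: update is called
	_ = setConditionAndUpdate(svc, unavailable, update) // unchanged: early return
	fmt.Println("updates:", updates)
}
```

Only the first call reaches `update` here; the second hits the early return because the condition is already set.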

@sttts sttts force-pushed the sttts-aggregator-metrics-available branch 2 times, most recently from 9a6a03f to 8182637 Compare November 29, 2018 15:40
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sttts sttts force-pushed the sttts-aggregator-metrics-available branch from 8182637 to 889e43f Compare November 29, 2018 15:43
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2018
@sttts sttts force-pushed the sttts-aggregator-metrics-available branch from 889e43f to fce6eb0 Compare December 3, 2018 14:04
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 3, 2018
@sttts
Contributor Author

sttts commented Dec 3, 2018

/retest

@logicalhan
Member

/lgtm

@k8s-ci-robot
Contributor

@logicalhan: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.

In response to this:

/lgtm

@sttts sttts removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2018
@sttts
Contributor Author

sttts commented Dec 5, 2018

@liggitt ptal

@liggitt
Member

liggitt commented Dec 5, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

Labels
approved, area/apiserver, cncf-cla: yes, kind/feature, lgtm, priority/important-soon, release-note, sig/api-machinery, size/L