Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promoting some flowcontrol metrics to Beta #118882

Closed
andrewsykim opened this issue Jun 26, 2023 · 14 comments · Fixed by #119110
Closed

Promoting some flowcontrol metrics to Beta #118882

andrewsykim opened this issue Jun 26, 2023 · 14 comments · Fixed by #119110
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@andrewsykim
Copy link
Member

What would you like to be added?

Some flowcontrol metrics in kube-apiserver have been around for a long time and are very high value signals for how APF is impacting apiserver latency. With APF graduated to Beta in 1.20 and on-going work to get to GA, I wanted to open an issue to discuss graduating some of these metrics to Beta (or Stable because we skip Beta?), specifically ones that are high value signals for apiserver latency with many users depending on them already. Here are metrics that I personally think would benefit from graduation to Stable based on their usage at Google:

  • request_wait_duration_seconds
  • request_concurrency_in_use
  • request_concurrency_limit
  • rejected_requests_total
  • dispatched_requests_total
  • current_inqueue_requests
  • current_executing_requests

For other metrics, I propose we leave as alpha for now as they pertain to internals details of APF that most users will likely not care about and mostly catered to maintainers who are looking to optimize implementation details.

Why is this needed?

Promoting widely adopted metrics to Beta/Stable will provide stronger guarantees on their compatibility going forward.

@andrewsykim andrewsykim added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 26, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 26, 2023
@andrewsykim
Copy link
Member Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 26, 2023
@andrewsykim
Copy link
Member Author

cc @wojtek-t @MikeSpreitzer

@alexzielenski
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2023
@wojtek-t
Copy link
Member

I think we should split two topics:

  1. the exact set of metrics
  2. if we promote some or not

Re (1), I don't have very strong opinion - certainly the first 5 metrics you proposed make sense to me. I'm not entirely sure about the last two which are really "the current state, not valid anymore 1ms later", which makes them hard to reason about given the monitoring pipelines usually have much larger latency, but I'm not strongly opposed either...

Re (2) - we should absolutely start thinking about promoting them, but given APF is not yet in GA, we should not propomote metrics to GA. But with the recent change in the metrics framework and introduction a Beta phase, I think we can definitely consider promoting them to beta.

@MikeSpreitzer
Copy link
Member

MikeSpreitzer commented Jun 29, 2023

Yes, there are two topics here.

Regarding current_inqueue_requests: that is an ambiguous shorthand! We have both apiserver_current_inqueue_requests and apiserver_flowcontrol_current_inqueue_requests, and they are different. For the complementary state, we have apiserver_current_inflight_requests and apiserver_flowcontrol_current_executing_requests. So I surmise that @andrewsykim meant the apiserver_flowcontrol_... ones. While those are indeed just instantaneous gauges, they will show a problem when the problem is persistent --- which I suspect is a common enough case, and Andrew is saying those have been found valuable. Note that apiserver_flowcontrol_request_concurrency_in_use is also a gauge that can vary quickly.

I agree that Beta is the metric maturity level to aim for.

@MikeSpreitzer
Copy link
Member

MikeSpreitzer commented Jun 29, 2023

BTW, apiserver_flowcontrol_request_concurrency_in_use is badly named. While it has "request" in its name, it actually is in units of seats. Note that we have some metrics that count requests and similar metrics that count seats, so it would be helpful to have the metric names clearly and correctly distinguish the two. I think that perhaps we should fix this name before graduating to Beta.

See #118960

@MikeSpreitzer
Copy link
Member

MikeSpreitzer commented Jun 29, 2023

The metric named apiserver_flowcontrol_request_concurrency_limit also has a misleading name. Fortunately, there is already a solution. The metric apiserver_flowcontrol_nominal_limit_seats exists and carries the same value. I think we should remove apiserver_flowcontrol_request_concurrency_limit. See #118957

@andrewsykim
Copy link
Member Author

@MikeSpreitzer I don't disagree with using better names for these metrics, but we are effectively making a breaking change for only a name change and no actual changes to the metric schema.

While these are alpha metrics, I'm not sure it's worthwhile to break the metric schema for the sake of better naming. I would instead be in favor of graduating existing metrics that are widely adopted to Beta while also introducing the new metric in alpha with more accurate naming which can be adopted in newer versions.

@andrewsykim andrewsykim changed the title Promoting some flowcontrol metrics to stable (beta?) Promoting some flowcontrol metrics to Beta Jul 5, 2023
@MikeSpreitzer
Copy link
Member

Certainly we can fix name problems by introducing new names before removing old ones that carry equivalent information.

I am not enthused about promoting to Beta stability something that we plan to delete.

Could we introduce the new name at Beta stability, on the grounds that its content has already proven worthwhile under the old name?

@wojtek-t
Copy link
Member

wojtek-t commented Jul 6, 2023

While these are alpha metrics, I'm not sure it's worthwhile to break the metric schema for the sake of better naming. I would instead be in favor of graduating existing metrics that are widely adopted to Beta while also introducing the new metric in alpha with more accurate naming which can be adopted in newer versions.

I don't agree - we either agree that the existing names are good enough then just promote those (and not introduce any new ones then). Or we deprecated the existing ones and introduce new ones which will be promoted to beta then.

@andrewsykim
Copy link
Member Author

Selfishly I am not in favor of breaking any metrics just for name sake due to the ripple effects it will have on existing monitoring that rely on them. But I also understand that that these are alpha metrics so we have the right to break them :)

I am not really convinced that users will care that the metric is named apiserver_flowcontrol_request_concurrency_limit and not apiserver_flowcontrol_nominal_limit_seats, as long as it is accurate w.r.t to request_concurrency_in_use (now current_executing_seats). FWIW I feel the same way between concurreny_in_use vs current_executing_seats, while the latter is technically more accurate, both measure the same thing and the former is good enough representation of what is being measured.

@andrewsykim
Copy link
Member Author

(either way, we should still update the metric description for apiserver_flowcontrol_request_concurrency_limit to mention it accounts for total seats and includes adjustments from borrowing)

@MikeSpreitzer
Copy link
Member

Elsewhere you can see users who are confused. That comes from the combination of an outdated name and an outdated HELP string.

When I complained about there being no formal maturity level for "deprecated", I was overlooking the distinct field for indicating the release at which a metric is deprecated. So I agree that #118959 should be modified to not delete apiserver_flowcontrol_request_concurrency_limit but rather fix its HELP string and mark it as deprecated in release 1.28 because apiserver_flowcontrol_nominal_limit_seats was introduced in an earlier release and is the better name.

Now that #118960 has merged we have apiserver_flowcontrol_current_executing_seats as a better name for the content that was only available in apiserver_flowcontrol_request_concurrency_in_use, and I propose we should make the newer name be the one at Beta.

So I propose that the list of Beta metrics be as follows.

  • request_wait_duration_seconds
    
  • apiserver_flowcontrol_current_executing_seats
    
  • apiserver_flowcontrol_nominal_limit_seats
    
  • rejected_requests_total
    
  • dispatched_requests_total
    
  • current_inqueue_requests
    
  • current_executing_requests
    

@MikeSpreitzer
Copy link
Member

I have updated #118959 and commented in #119110 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants