-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-3352 Aggregated Discovery to Beta #3722
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,5 @@ | ||
kep-number: 3352 | ||
alpha: | ||
approver: "@deads2k" | ||
beta: | ||
approver: "@deads2k" |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -616,6 +616,7 @@ main focus will be on kubectl and golang clients. | |
#### Beta | ||
|
||
- kubectl uses the aggregated discovery feature by default | ||
- Metrics are added | ||
|
||
#### GA | ||
|
||
|
@@ -678,7 +679,12 @@ channel if you need any help or guidance. --> | |
|
||
###### Does enabling the feature change any default behavior? | ||
|
||
No | ||
Clients using client-go version 1.26 and up will use the aggregated | ||
discovery endpoint rather than the unaggregated discovery endpoint. | ||
This is handled automatically in client-go and clients should see less | ||
requests to the api server when fetching discovery information. Client | ||
versions older than 1.26 will continue to use the old unaggregated | ||
discovery endpoint without any changes. | ||
|
||
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? | ||
|
||
|
@@ -690,7 +696,13 @@ and restarting the component. No other changes should be necessary to | |
disable the feature. | ||
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. | ||
--> Yes, the feature may be disabled by reverting the feature flag. | ||
--> | ||
|
||
Yes, the feature may be disabled on the apiserver by reverting the | ||
feature flag. This will disable aggregated discovery for all clients. If there is a golang specific client side bug, the feature may also be | ||
turned off in client-go via the | ||
[WithLegacy()](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L80) | ||
toggle and this will require a recompile of the application. | ||
|
||
###### What happens if we reenable the feature if it was previously rolled back? | ||
|
||
|
@@ -731,6 +743,18 @@ feature flags will be enabled on some API servers and not others | |
during the rollout. Similarly, consider large clusters and how | ||
enablement/disablement will rollout across nodes. --> | ||
|
||
During a rollout, some apiservers may support aggregated discovery and | ||
some may not. It is recommended that clients request for both the | ||
aggregated discovery document with a fallback to the unaggregated | ||
discovery format. This can be achieved by setting the Accept header to | ||
have a fallback to the default GVK of the `/apis` and `/api` endpoint. | ||
Jefftree marked this conversation as resolved.
Show resolved
Hide resolved
Jefftree marked this conversation as resolved.
Show resolved
Hide resolved
|
||
For example, to request the aggregated discovery type and fallback to | ||
the unaggregated discovery, the following header can be sent: `Accept: | ||
application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json` | ||
|
||
This kind of fallback is already implemented in client-go and this | ||
note is intended for non-golang clients. | ||
|
||
###### What specific metrics should inform a rollback? | ||
|
||
<!-- What signals should users be paying attention to when the feature | ||
|
@@ -748,11 +772,20 @@ term, we may want to require automated upgrade/rollback tests, but we | |
are missing a bunch of machinery and tooling and can't do that now. | ||
--> | ||
|
||
n/a. The API introduced does not store data and state is recalculated on the upgrade, downgrade, upgrade cycle. No state is preserved between versions. | ||
|
||
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? | ||
|
||
<!-- Even if applying deprecation policies, they may still surprise | ||
some users. --> | ||
|
||
By enabling aggregated discovery as the default, the new API is | ||
slightly different from the unaggregated version. The | ||
StorageVersionHash field is removed from resources in the aggregated | ||
discovery API. The storage version migrator will have an additional | ||
flag when initializing the discovery client to continue using the | ||
unaggregated API. | ||
Jefftree marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
### Monitoring Requirements | ||
|
||
<!-- This section must be completed when targeting beta to a release. | ||
|
@@ -766,6 +799,17 @@ the previous answers based on experience in the field. --> | |
Kubernetes API (e.g., checking if there are objects with field X set) | ||
may be a last resort. Avoid logs or events for this purpose. --> | ||
|
||
Operators can check whether an aggregated discovery request can be | ||
made by sending a request to `apis` with | ||
`application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json` | ||
as the Accept header and looking at the the `Content-Type` response | ||
header. A Content Type response header of `Content-Type: | ||
application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList` | ||
indicates that aggregated discovery is supported and a `Content-Type: | ||
application/json` header indicates that aggregated discovery is not | ||
supported. They can also check for the presence of aggregated | ||
discovery related metrics: `aggregated_discovery_aggregation_count` | ||
|
||
###### How can someone using this feature know that it is working for their instance? | ||
|
||
<!-- For instance, if this is a pod-related feature, it should be | ||
|
@@ -802,14 +846,18 @@ the next question. --> | |
- [x] Metrics | ||
- Metric name: `aggregator_discovery_aggregation_duration` | ||
- Components exposing the metric: `kube-server` | ||
- This is a metric for exposing the time it took to aggregate all the | ||
- This is a metric for exposing the time it took to aggregate all the api resources. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I find it surprising that a duration counter like this wouldn't have a "count" value associated with it? https://prometheus.io/docs/concepts/metric_types/ Both There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The count metric is listed below There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, I guess I was just thinking that the |
||
|
||
- Metric name: `aggregator_discovery_aggregation_count` | ||
- Components exposing the metric: `kube-server` | ||
- This is a metric for the number of times that the discovery document has been aggregated. | ||
|
||
###### Are there any missing metrics that would be useful to have to improve observability of this feature? | ||
|
||
<!-- Describe the metrics themselves and the reasons why they weren't | ||
added (e.g., cost, implementation difficulties, etc.). --> | ||
|
||
Yes. A metric for the regeneration count of the discovery document. `aggregator_discovery_aggregation_count` | ||
No. | ||
|
||
### Dependencies | ||
|
||
|
@@ -833,6 +881,12 @@ cluster-level services (e.g. DNS): | |
- Impact of its degraded performance or high-error rates on the | ||
feature: --> | ||
|
||
No, but if aggregated apiservers are present, the feature will attempt | ||
to contact and aggregate the data published from the aggregated | ||
apiserver on a set interval. If there is high error rate, stale data | ||
may be returned because the latest data was not able to be fetched | ||
from the aggregated apiserver. | ||
|
||
### Scalability | ||
|
||
<!-- For alpha, this section is encouraged: reviewers should consider | ||
|
@@ -858,6 +912,12 @@ Focusing mostly on: | |
- periodic API calls to reconcile state (e.g. periodic fetching | ||
state, heartbeats, leader election, etc.) --> | ||
|
||
No. Enabling this feature should reduce the total number of API calls | ||
for client discovery. Instead of clients sending a discovery request | ||
to all group versions (`/apis/<group>/<version>`), they will only need | ||
to send a request to the aggregated endpoint to obtain all resources | ||
that the cluster supports. | ||
|
||
###### Will enabling / using this feature result in introducing new API types? | ||
|
||
<!-- Describe them, providing: | ||
|
@@ -866,12 +926,16 @@ Focusing mostly on: | |
- Supported number of objects per namespace (for namespace-scoped | ||
objects) --> | ||
|
||
Yes, but these API types are not persisted. | ||
|
||
###### Will enabling / using this feature result in any new calls to the cloud provider? | ||
|
||
<!-- Describe them, providing: | ||
- Which API(s): | ||
- Estimated increase: --> | ||
|
||
No. | ||
|
||
###### Will enabling / using this feature result in increasing size or count of the existing API objects? | ||
|
||
<!-- Describe them, providing: | ||
|
@@ -880,6 +944,8 @@ objects) --> | |
- Estimated amount of new objects: (e.g., new Object X for every | ||
existing Pod) --> | ||
|
||
No. | ||
|
||
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? | ||
|
||
<!-- Look at the [existing SLIs/SLOs]. | ||
|
@@ -890,6 +956,8 @@ details. | |
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos --> | ||
|
||
No. | ||
|
||
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? | ||
|
||
<!-- Things to keep in mind include: additional in-memory state, | ||
|
@@ -900,6 +968,8 @@ large cases, again with respect to the [supported limits]. | |
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md --> | ||
|
||
No. | ||
|
||
### Troubleshooting | ||
|
||
<!-- This section must be completed when targeting beta to a release. | ||
|
@@ -914,6 +984,8 @@ may consider splitting it into a dedicated `Playbook` document | |
|
||
###### How does this feature react if the API server and/or etcd is unavailable? | ||
|
||
The feature is built into the API server, and will not work if the API server is unavailable. | ||
|
||
###### What are other known failure modes? | ||
|
||
<!-- For each of them, fill in the following information by copying | ||
|
@@ -932,6 +1004,8 @@ why. --> | |
|
||
###### What steps should be taken if SLOs are not being met to determine the problem? | ||
|
||
The feature can be rolled back by setting the AggregatedDiscoveryEndpoint feature flag to false. | ||
|
||
## Implementation History | ||
|
||
<!-- Major milestones in the lifecycle of a KEP should be tracked in | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a sentence explaining that this means that if there is a client bug, either the application needs to be re-compiled or the server needs to disable the feature until the clients are fixed.
I'm ok with this, but we want to be sure cluster-admins can understand it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated