Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3352 Aggregated Discovery to Beta #3722

Merged
merged 1 commit into from
Jan 23, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-api-machinery/3352.yaml
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
kep-number: 3352
alpha:
approver: "@deads2k"
beta:
approver: "@deads2k"
82 changes: 78 additions & 4 deletions keps/sig-api-machinery/3352-aggregated-discovery/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -616,6 +616,7 @@ main focus will be on kubectl and golang clients.
#### Beta

- kubectl uses the aggregated discovery feature by default
- Metrics are added

#### GA

Expand Down Expand Up @@ -678,7 +679,12 @@ channel if you need any help or guidance. -->

###### Does enabling the feature change any default behavior?

No
Clients using client-go version 1.26 and up will use the aggregated
discovery endpoint rather than the unaggregated discovery endpoint.
This is handled automatically in client-go and clients should see less
requests to the api server when fetching discovery information. Client
versions older than 1.26 will continue to use the old unaggregated
discovery endpoint without any changes.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Expand All @@ -690,7 +696,13 @@ and restarting the component. No other changes should be necessary to
disable the feature.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
--> Yes, the feature may be disabled by reverting the feature flag.
-->

Yes, the feature may be disabled on the apiserver by reverting the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a sentence explaining that this means that if there is a client bug, either the application needs to be re-compiled or the server needs to disable the feature until the clients are fixed.

I'm ok with this, but we want to be sure cluster-admins can understand it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

feature flag. This will disable aggregated discovery for all clients. If there is a golang specific client side bug, the feature may also be
turned off in client-go via the
[WithLegacy()](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L80)
toggle and this will require a recompile of the application.

###### What happens if we reenable the feature if it was previously rolled back?

Expand Down Expand Up @@ -731,6 +743,18 @@ feature flags will be enabled on some API servers and not others
during the rollout. Similarly, consider large clusters and how
enablement/disablement will rollout across nodes. -->

During a rollout, some apiservers may support aggregated discovery and
some may not. It is recommended that clients request for both the
aggregated discovery document with a fallback to the unaggregated
discovery format. This can be achieved by setting the Accept header to
have a fallback to the default GVK of the `/apis` and `/api` endpoint.
Jefftree marked this conversation as resolved.
Show resolved Hide resolved
Jefftree marked this conversation as resolved.
Show resolved Hide resolved
For example, to request the aggregated discovery type and fallback to
the unaggregated discovery, the following header can be sent: `Accept:
application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json`

This kind of fallback is already implemented in client-go and this
note is intended for non-golang clients.

###### What specific metrics should inform a rollback?

<!-- What signals should users be paying attention to when the feature
Expand All @@ -748,11 +772,20 @@ term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

n/a. The API introduced does not store data and state is recalculated on the upgrade, downgrade, upgrade cycle. No state is preserved between versions.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!-- Even if applying deprecation policies, they may still surprise
some users. -->

By enabling aggregated discovery as the default, the new API is
slightly different from the unaggregated version. The
StorageVersionHash field is removed from resources in the aggregated
discovery API. The storage version migrator will have an additional
flag when initializing the discovery client to continue using the
unaggregated API.
Jefftree marked this conversation as resolved.
Show resolved Hide resolved

### Monitoring Requirements

<!-- This section must be completed when targeting beta to a release.
Expand All @@ -766,6 +799,17 @@ the previous answers based on experience in the field. -->
Kubernetes API (e.g., checking if there are objects with field X set)
may be a last resort. Avoid logs or events for this purpose. -->

Operators can check whether an aggregated discovery request can be
made by sending a request to `apis` with
`application/json;as=APIGroupDiscoveryList;v=v2beta1;g=apidiscovery.k8s.io,application/json`
as the Accept header and looking at the the `Content-Type` response
header. A Content Type response header of `Content-Type:
application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList`
indicates that aggregated discovery is supported and a `Content-Type:
application/json` header indicates that aggregated discovery is not
supported. They can also check for the presence of aggregated
discovery related metrics: `aggregated_discovery_aggregation_count`

###### How can someone using this feature know that it is working for their instance?

<!-- For instance, if this is a pod-related feature, it should be
Expand Down Expand Up @@ -802,14 +846,18 @@ the next question. -->
- [x] Metrics
- Metric name: `aggregator_discovery_aggregation_duration`
- Components exposing the metric: `kube-server`
- This is a metric for exposing the time it took to aggregate all the
- This is a metric for exposing the time it took to aggregate all the api resources.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it surprising that a duration counter like this wouldn't have a "count" value associated with it?

https://prometheus.io/docs/concepts/metric_types/

Both Histogram and Summary have a corresponding _count metric?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The count metric is listed below aggregator_discovery_aggregation_count. Are you suggesting that we should include additional sum and bucket metrics to be able to form a histogram?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I guess I was just thinking that the _count metric is added automatically, right?


- Metric name: `aggregator_discovery_aggregation_count`
- Components exposing the metric: `kube-server`
- This is a metric for the number of times that the discovery document has been aggregated.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!-- Describe the metrics themselves and the reasons why they weren't
added (e.g., cost, implementation difficulties, etc.). -->

Yes. A metric for the regeneration count of the discovery document. `aggregator_discovery_aggregation_count`
No.

### Dependencies

Expand All @@ -833,6 +881,12 @@ cluster-level services (e.g. DNS):
- Impact of its degraded performance or high-error rates on the
feature: -->

No, but if aggregated apiservers are present, the feature will attempt
to contact and aggregate the data published from the aggregated
apiserver on a set interval. If there is high error rate, stale data
may be returned because the latest data was not able to be fetched
from the aggregated apiserver.

### Scalability

<!-- For alpha, this section is encouraged: reviewers should consider
Expand All @@ -858,6 +912,12 @@ Focusing mostly on:
- periodic API calls to reconcile state (e.g. periodic fetching
state, heartbeats, leader election, etc.) -->

No. Enabling this feature should reduce the total number of API calls
for client discovery. Instead of clients sending a discovery request
to all group versions (`/apis/<group>/<version>`), they will only need
to send a request to the aggregated endpoint to obtain all resources
that the cluster supports.

###### Will enabling / using this feature result in introducing new API types?

<!-- Describe them, providing:
Expand All @@ -866,12 +926,16 @@ Focusing mostly on:
- Supported number of objects per namespace (for namespace-scoped
objects) -->

Yes, but these API types are not persisted.

###### Will enabling / using this feature result in any new calls to the cloud provider?

<!-- Describe them, providing:
- Which API(s):
- Estimated increase: -->

No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

<!-- Describe them, providing:
Expand All @@ -880,6 +944,8 @@ objects) -->
- Estimated amount of new objects: (e.g., new Object X for every
existing Pod) -->

No.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

<!-- Look at the [existing SLIs/SLOs].
Expand All @@ -890,6 +956,8 @@ details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos -->

No.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

<!-- Things to keep in mind include: additional in-memory state,
Expand All @@ -900,6 +968,8 @@ large cases, again with respect to the [supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md -->

No.

### Troubleshooting

<!-- This section must be completed when targeting beta to a release.
Expand All @@ -914,6 +984,8 @@ may consider splitting it into a dedicated `Playbook` document

###### How does this feature react if the API server and/or etcd is unavailable?

The feature is built into the API server, and will not work if the API server is unavailable.

###### What are other known failure modes?

<!-- For each of them, fill in the following information by copying
Expand All @@ -932,6 +1004,8 @@ why. -->

###### What steps should be taken if SLOs are not being met to determine the problem?

The feature can be rolled back by setting the AggregatedDiscoveryEndpoint feature flag to false.

## Implementation History

<!-- Major milestones in the lifecycle of a KEP should be tracked in
Expand Down
10 changes: 6 additions & 4 deletions keps/sig-api-machinery/3352-aggregated-discovery/kep.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -18,16 +18,17 @@ approvers:
- "@lavalamp"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.26"
latest-milestone: "v1.27"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.26"
beta: "v1.27"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
Expand All @@ -38,5 +39,6 @@ feature-gates:
disable-supported: true

# The following PRR answers are required at beta release
# metrics:
# - my_feature_metric
metrics:
- aggregator_discovery_aggregation_duration
- aggregator_discovery_aggregation_count