[RFC] Karpenter v1 API & Roadmap Proposal #1222
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jonathan-innis. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 85ea0df to 3a913ef
Pull Request Test Coverage Report for Build 8955844849
💛 - Coveralls
When the KubeletConfiguration was first introduced into the NodePool, the assumption was that the kubelet configuration is a common interface and that every Cloud Provider supports the same set of kubelet configuration fields.

This turned out not to be the case in reality. For instance, Cloud Providers like Azure [do not support configuring the kubelet configuration through the NodePool API](https://learn.microsoft.com/en-us/azure/aks/node-autoprovision?tabs=azure-cli#:~:text=Kubelet%20configuration%20through%20Node%20pool%20configuration%20is%20not%20supported). Kwok also has no need for the Kubelet API. Shifting these fields into the NodeClass API allows Cloud Providers to pick on a case-by-case basis what kind of configuration they want to support through the Kubernetes API.
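As a rough illustration of the proposed move, kubelet settings would live on the provider-specific NodeClass rather than the shared NodePool API. The field names below are illustrative assumptions, not the final v1 shape:

```yaml
# Hypothetical sketch only: kubelet configuration hosted on the AWS
# EC2NodeClass instead of the shared NodePool API. Field names and the
# exact schema are assumptions, not the final v1 shape.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  kubelet:
    maxPods: 110
    systemReserved:
      cpu: 100m
      memory: 100Mi
```

A provider like Kwok that has no use for kubelet settings could simply omit the field from its NodeClass schema.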
Should we reference the slack conversation here? https://kubernetes.slack.com/archives/C04JW2J5J5P/p1709226455964629
Great work on this proposal sir!
Documentation: Does it make sense to make the Azure + AWS documentation unification effort happen alongside v1?
Karpenter currently has no conceptual documentation around NodeClaims. NodeClaims have become a fundamental part of how Karpenter launches and manages nodes. There is critical observability information that is stored inside of the NodeClaim that can help users understand when certain disruption conditions are met (Expired, Empty, Drifted) or why the NodeClaim fails to launch.

For Karpenter’s feature completeness at v1, we need to accurately describe to users what the purpose of Karpenter’s NodeClaims is and how to leverage the information that is stored within the NodeClaim to troubleshoot Karpenter’s decision-making.
Also, potentially adding some diagrams on the NodeClaim lifecycle controllers would be useful. Kaito creates NodeClaims and lets Karpenter launch them.
Documenting this concept might be useful as well for other people that want to leverage NodeClaims as a node bootstrapping API.
I believe Kaito will be joining WG Serving and eventually starting with an AWS provider as well (maybe?) that does a similar thing to the Azure provider. AWS is the next natural cloud provider here, since Kaito is based on the Karpenter CRDs.
Officially supporting this pattern of outside controllers launching NodeClaims for Karpenter to manage would be great! We don't want breaking changes for NodeClaim at v1 for Kaito :)
CC: @helayoty, who works on Kaito at AKS; please correct my statements here if required.
Agreed. I wasn't directly scoping #894 into the v1 work, but @sanjeevrg89 volunteered to start adding some architecture diagrams to the upstream docs. This should include the NodeClaim lifecycle controller and should hopefully convey a lot of the detail to outside contributors about how each of the Karpenter controllers interacts with the system.
### Stabilize Karpenter’s Tainting Logic

**Issue Ref:** https://github.com/kubernetes-sigs/karpenter/issues/624, https://github.com/kubernetes-sigs/karpenter/issues/1049

**Category:** Breaking
On a separate taints topic: do we plan on aligning with CAS on some of the taints and annotations for v1 as well? Or is this work still in the discussion phase?
Would we be adding node-lifecycle.kubernetes.io/do-not-disrupt alongside karpenter.sh/do-not-disrupt? If we are not adding them alongside the existing annotations and are instead deprecating the karpenter.sh-specific annotations, then v1 might be the right time to make this change.
We shouldn't align on node-lifecycle.kubernetes.io/do-not-disrupt until it's registered with Kubernetes (see https://kubernetes.io/docs/reference/labels-annotations-taints/ for the list; you just need a PR against that page to add a registered thing).
So, an easy prerequisite to meet, but still something to remember to do.
> do we plan on aligning with cas on some of the taints and annotations for v1 as well? Or is this work still in discussion phase

There is an open issue on this upstream. We've been actively having some conversations around what makes sense here, but I imagine that, yes, this will be part of the taint redesign so that we don't break everyone twice.
Since the discussion was still ongoing and nothing was set in stone yet, I didn't want to specifically scope it into this item; though, I can add a little bit to make mention of it.
Musing for v1: E2E Testing for Core. Would there be an interest in testing core via e2es with Kwok? There are some bits of functionality in the v1 proposal that change a lot of behavior, like ConsolidateAfter for the WhenUnderutilized consolidation policy. Would it be something we want to take on to have some of the e2e tests move from the providers to also being included in core?

We should take some additional testing steps to ensure stability in core releases since we are moving to v1. We still should run and develop e2es for our providers, but some of the functionality, like consolidation, is mostly driven by core. Given our providers are both on the latest version of core, you can bump the version with tags to test things there as well. But having some e2es gate consolidation, scheduling, etc. would be useful for gating code.

We could even write them to run on multiple managed k8s services (AKS, EKS) with Kwok as the cloud provider in multiple clouds (AWS, Azure)? I haven't played enough with Kwok to see if something like that would work, but supposedly we could, right?
**Standard Columns**

1. Name
2. NodeClass - Allows users to easily see the NodeClass that the NodePool is using
There's a problem: we don't know what the kind is. With this API, there could be two different node classes with the same name but different kinds.
How big of a problem do you think that is in the printer columns? I'm hesitant to want to include both the name and the kind/group in an API when most users may name them differently for clarity anyways. Perhaps the wide columns, but even there I'm not fully convinced that it's needed at this point since no CloudProvider supports multiple kinds for NodeClasses right now.
type: Ready
nodeName: ip-192-168-71-87.us-west-2.compute.internal
providerID: aws:///us-west-2b/i-053c6b324e29d2275
imageID: ami-0b1e393fbe12f411c
How will we handle the following?
- providers where the machine image specification is a tuple (e.g. `{image: "e911d3ce-5a58-4e5f-bab1-e5c5492779df", version: 42}`)
- node provisioning where there's no image (for example: there's a pool of provisioned bare metal and someone wants Karpenter to wake up servers when needed)

It'd be a shame to need a v2 API for those cases.
I'm struggling to see a case where an image isn't assigned to a NodeClaim/Instance. In the pre-provisioned case, the NodeClaim will eventually have an image assigned to it when it wakes up. In the tuple case, you could construct a string that would make sense in the context of `imageID`, e.g. `e911d3ce-5a58-4e5f-bab1-e5c5492779df/v42`.
1. Name
2. Instance Type
3. Capacity - Moved from the wide output to the standard output
Where in the NodeClaim API does Capacity come from?
It comes from the labels: `karpenter.sh/capacity-type`.
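For reference, a printer column sourced from a label could be wired up roughly like this in the CRD; the exact JSONPath escaping is an assumption, not the final column definition:

```yaml
# Illustrative additionalPrinterColumns entry pulling Capacity from the
# karpenter.sh/capacity-type label; the JSONPath shown is an assumption.
additionalPrinterColumns:
  - name: Capacity
    type: string
    jsonPath: .metadata.labels.karpenter\.sh/capacity-type
```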
throughput: 125
snapshotID: snap-0123456789
detailedMonitoring: true
status:
Doesn't block v1, but: could we add the ARN of the instance profile into `.status`?
What's the need for the ARN vs. just the name? The instance profile will be in the same partition and is global, so I think the ARN should be easily constructed from the name.
If we want to announce v1, the Kubernetes blog team can help with that.
I didn't want to directly couple it to v1 (because I don't think I view it as a blocking change), but there's a lot of contribution work happening between some folks on the AWS side and @njtran to start getting more extensive testing and benchmarking with Kwok. I can imagine that this work will happen in parallel with v1, but I don't personally want to block on it, since it's going to be a long pole to move a lot of the testing that's currently sitting in the providers into the upstream codebase.
This is another one that has unknown timelines in my mind, and I don't think it has any direct impact on users if it comes after v1 (at least from a breaking-change perspective). I'd rather just keep working on this effort and not view it as a blocking action for the v1 release.
## Migration Path

Karpenter will **not** be changing its API group or resource kind as part of the v1 API bump. By avoiding this, we can leverage the [existing Kubernetes conversion webhook process](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion) for upgrading APIs, which allows upgrades to newer versions of the API to occur in-place, without any customer intervention or node rolling. The upgrade process will be executed as follows:
What does "node rolling" mean here - no need to recreate or update the Nodes provisioned by Karpenter?
Node rolling here means a full recycle/recreation of all the nodes in the cluster.
1. Apply the updated NodePool, NodeClaim, and EC2NodeClass CRDs, which will contain a `v1` version listed under the `versions` section of the CustomResourceDefinition
2. Upgrade the Karpenter controller to its `v1` version. This version of Karpenter will start reasoning in terms of the `v1` API schema in its API requests. Resources will be converted from the v1beta1 to the v1 version at runtime, using conversion webhooks shipped by the upstream Karpenter project and the Cloud Providers (for NodeClass changes).
3. Users update the `v1beta1` manifests that they are applying through IaC or GitOps to use the new `v1` version.
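The CRD shape that step 1 relies on can be sketched as follows; the webhook service name, namespace, and path are illustrative assumptions, not the actual packaging:

```yaml
# Sketch of a CRD serving both versions with webhook conversion.
# Service name/namespace/path below are illustrative assumptions.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: nodepools.karpenter.sh
spec:
  group: karpenter.sh
  names:
    kind: NodePool
    plural: nodepools
  scope: Cluster
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: karpenter
          namespace: kube-system
          path: /conversion
  versions:           # (openAPIV3Schema omitted from each version for brevity)
    - name: v1beta1
      served: true
      storage: false
    - name: v1
      served: true
      storage: true   # new storage version; objects are rewritten on their next write
```

With `strategy: Webhook`, the API server calls the conversion endpoint whenever a client requests a version other than the stored one, which is what makes the in-place upgrade possible.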
Karpenter provided a manifest conversion tool when they promoted their API version to v1beta1; do you plan to provide a similar tool for v1?
The conversion webhooks should translate the CRs in memory for you. The level of support for a conversion tool probably differs per cloud provider; it's worth opening a question/issue in the repo of the Cloud Provider you're interested in.
**Wide Columns (-o wide)**

1. Weight - Viewing the NodePools that will be evaluated first should be easily observable, but may not be immediately useful to all users, particularly if the NodePools are named in a way that already indicates their ordering, e.g. suffixed with fallback
2. CPU% - The capacity of the CPU for all nodes provisioned by this NodePool as a percentage of its limits
The limits here are `spec.limits`, right? If the user does not specify this value, what will be displayed?
Yes, if there's no limit, I would think this would show up with some nil or special value.
Karpenter currently doesn’t enforce immutability on NodeClaims in v1beta1, though we implicitly assume that users should not be acting against these objects after creation, as the NodeClaim lifecycle controller won’t react to any change after the initial instance launch.

Karpenter can make every `spec` field immutable on the NodeClaim after its initial creation. This will be enforced through CEL validation, where you can perform a check like `[self == oldSelf](https://kubernetes.io/docs/reference/using-api/cel/#language-overview)` to enforce that the fields cannot have changed after the initial apply. Users who are not on K8s 1.25+ that supports CEL will get the same validation enforced by validating webhooks.
Suggested change (fix the link formatting):

Karpenter can make every `spec` field immutable on the NodeClaim after its initial creation. This will be enforced through CEL validation, where you can perform a check like [`self == oldSelf`](https://kubernetes.io/docs/reference/using-api/cel/#language-overview) to enforce that the fields cannot have changed after the initial apply. Users who are not on K8s 1.25+ that supports CEL will get the same validation enforced by validating webhooks.
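The `self == oldSelf` rule described above would live in the CRD's OpenAPI schema. A minimal sketch, where the exact placement within the real NodeClaim schema and the message text are assumptions:

```yaml
# Minimal sketch of CEL-based immutability on the NodeClaim spec.
# Placement in the real schema and message text are illustrative.
spec:
  type: object
  x-kubernetes-validations:
    - rule: "self == oldSelf"
      message: "NodeClaim spec is immutable after creation"
```

Transition rules like `self == oldSelf` require a Kubernetes version with CRD Validation Rules available, which is why the proposal mentions a validating-webhook fallback for older clusters.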
**Category:** Breaking

Karpenter wants to expand the usage of the `karpenter.sh/disruption=disrupting:NoSchedule` taint that it currently leverages to cordon nodes during disruption, to also taint nodes with a `karpenter.sh/disruption-candidate:NoSchedule` taint. By tainting nodes when they become candidates (past expiry, drifted, etc.), we ensure that we will launch new nodes when we get more pods that join the cluster, reducing the chance that we will continue to get `karpenter.sh/do-not-disrupt` pods that continue to schedule to the same node.
Suggested change (`karpenter.sh/disruption-candidate:NoSchedule` → `karpenter.sh/disruption=candidate:NoSchedule`):

Karpenter wants to expand the usage of the `karpenter.sh/disruption=disrupting:NoSchedule` taint that it currently leverages to cordon nodes during disruption, to also taint nodes with a `karpenter.sh/disruption=candidate:NoSchedule` taint. By tainting nodes when they become candidates (past expiry, drifted, etc.), we ensure that we will launch new nodes when we get more pods that join the cluster, reducing the chance that we will continue to get `karpenter.sh/do-not-disrupt` pods that continue to schedule to the same node.
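On a node that has become a disruption candidate, the suggested taint would render roughly like this; this is a sketch of the proposed behavior, not current Karpenter output:

```yaml
# Sketch: the proposed candidate taint as it would appear on a Node.
apiVersion: v1
kind: Node
metadata:
  name: ip-192-168-71-87.us-west-2.compute.internal
spec:
  taints:
    - key: karpenter.sh/disruption
      value: candidate
      effect: NoSchedule
```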
#### Tasks

- [ ] Design and implement a `spec.consolidateAfter` field for the v1 API, reworking our synchronous wait to ensure that waiting for nodes that haven’t reached the end of their `consolidateAfter` timeframe doesn’t block other disruption evaluation
Suggested change (clarify which API the field lands on):

- [ ] Design and implement a `spec.consolidateAfter` field for the v1 API of NodePool, reworking our synchronous wait to ensure that waiting for nodes that haven’t reached the end of their `consolidateAfter` timeframe doesn’t block other disruption evaluation
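To visualize the task, the field could end up looking something like this on the NodePool; the shape, policy name, and default are assumptions pending the actual design:

```yaml
# Hypothetical NodePool disruption block with the proposed
# consolidateAfter field; the exact shape is to be settled by the design.
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 5m   # how long to wait after a node becomes consolidatable
```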
Karpenter currently relies on a hash to determine whether certain fields in Karpenter’s NodeClaims have drifted from their owning NodePool and owning EC2NodeClass. Today, this is determined by hashing a set of fields on the NodePool or EC2NodeClass and then validating that this hash still matches the NodeClaim’s hash.

This hashing mechanism works well for additive changes to the API, but does not work well when adding fields to the hashing function that already have a set value on a customer’s cluster. In particular, we have a need to make breaking changes to this hash scheme from these two issues: https://github.com/kubernetes-sigs/karpenter/issues/909 and https://github.com/aws/karpenter-provider-aws/issues/5447.
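The mechanism described above amounts to stamping a hash onto each NodeClaim at launch and comparing it later. The annotation keys and hash value below are illustrative assumptions, not necessarily the real ones:

```yaml
# Sketch of hash-based drift detection (annotation keys are assumptions).
# The owning NodePool carries a hash of its drift-relevant fields:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
  annotations:
    karpenter.sh/nodepool-hash: "1234567890"
---
# Each NodeClaim records the hash it was launched with; when the NodePool's
# current hash no longer matches, the NodeClaim is considered drifted:
apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  name: default-abc12
  annotations:
    karpenter.sh/nodepool-hash: "1234567890"
```

This also shows why adding an already-set field to the hash function is breaking: every existing NodeClaim's recorded hash immediately stops matching, so all nodes would falsely appear drifted.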
Does this mean that a field with a default value won't work well when the user specifies it later?
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
/remove-lifecycle stale
Fixes #N/A
Description
This proposes the Karpenter stable v1 API and Roadmap. The Roadmap includes features and cleanup tasks that we need to complete before reaching v1.
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.