
[RFC] Karpenter v1 API & Roadmap Proposal #1222

Open
wants to merge 2 commits into base: main

Conversation

jonathan-innis
Member

Fixes #N/A

Description

This proposes the Karpenter stable v1 API and Roadmap. The Roadmap includes features and cleanup tasks that we need to complete before reaching v1.

How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonathan-innis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 1, 2024
@jonathan-innis jonathan-innis force-pushed the v1-rfc branch 2 times, most recently from 85ea0df to 3a913ef Compare May 1, 2024 23:51
@coveralls

coveralls commented May 2, 2024

Pull Request Test Coverage Report for Build 8955844849

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 78.859%

Totals Coverage Status
Change from base Build 8916772215: 0.0%
Covered Lines: 8352
Relevant Lines: 10591

💛 - Coveralls


When the KubeletConfiguration was first introduced into the NodePool, the assumption was that the kubelet configuration is a common interface and that every Cloud Provider supports the same set of kubelet configuration fields.

This turned out not to be the case in reality. For instance, Cloud Providers like Azure [do not support configuring the kubelet configuration through the NodePool API](https://learn.microsoft.com/en-us/azure/aks/node-autoprovision?tabs=azure-cli#:~:text=Kubelet%20configuration%20through%20Node%20pool%20configuration%20is%20not%20supported). Kwok also has no need for the Kubelet API. Shifting these fields into the NodeClass API allows each Cloud Provider to pick, on a case-by-case basis, what kind of configuration it wants to support through the Kubernetes API.
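As a rough sketch of what this shift could look like (field names and placement here are assumptions based on the AWS provider's existing kubelet fields, not the final v1 shape):

```yaml
# Sketch only: kubelet settings expressed on the Cloud Provider's NodeClass
# instead of the NodePool. Exact fields and placement are per-provider decisions.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  kubelet:
    maxPods: 110
    systemReserved:
      cpu: 100m
      memory: 100Mi
```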
Member

Should we reference the slack conversation here? https://kubernetes.slack.com/archives/C04JW2J5J5P/p1709226455964629

Member

@Bryce-Soghigian Bryce-Soghigian left a comment


Great work on this proposal sir!

Documentation: Does it make sense to make the azure + aws documentation unification effort happen alongside v1?


Karpenter currently has no conceptual documentation around NodeClaims. NodeClaims have become a fundamental part of how Karpenter launches and manages nodes. There is critical observability information that is stored inside of the NodeClaim that can help users understand when certain disruption conditions are met (Expired, Empty, Drifted) or why the NodeClaim fails to launch.

For Karpenter’s feature completeness at v1, we need to accurately describe to users what the purpose of Karpenter’s NodeClaims is and how to leverage the information that is stored within the NodeClaim to troubleshoot Karpenter’s decision-making.
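For illustration, the observability information referenced here lives in the NodeClaim's status; a trimmed sketch (the exact set of condition types is an assumption based on current behavior, not a definitive list):

```yaml
# Illustrative NodeClaim status excerpt; condition types and reasons shown
# here are assumptions for illustration.
status:
  nodeName: ip-192-168-71-87.us-west-2.compute.internal
  providerID: aws:///us-west-2b/i-053c6b324e29d2275
  conditions:
    - type: Launched
      status: "True"
    - type: Registered
      status: "True"
    - type: Initialized
      status: "True"
    - type: Drifted
      status: "True"
      reason: RequirementsDrifted
```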
Member

Also, potentially adding some diagrams of the NodeClaim lifecycle controllers would be useful. Kaito creates NodeClaims and lets Karpenter launch them.

Documenting this concept might be useful as well for other people that want to leverage NodeClaims as a node bootstrapping API.

I believe Kaito will be joining WG Serving and eventually adding an AWS provider as well (maybe?) that does a similar thing to the Azure provider. AWS is the next natural cloud provider here, since Kaito is based on the Karpenter CRDs.

Officially supporting this pattern of outside controllers launching NodeClaims for Karpenter to manage would be great! We don't want breaking changes to NodeClaim at v1 for Kaito :)

CC: @helayoty, who works on Kaito at AKS; please correct my statements here if required.

Member Author

Agreed. I wasn't directly scoping #894 into the v1 work, but @sanjeevrg89 volunteered himself to start adding some architecture diagrams to the upstream docs. This should include the NodeClaim lifecycle controller and should hopefully give outside contributors a lot of the detail about how each of the Karpenter controllers interacts with the system.

Comment on lines +48 to +52
### Stabilize Karpenter’s Tainting Logic

**Issue Ref:** https://github.com/kubernetes-sigs/karpenter/issues/624, https://github.com/kubernetes-sigs/karpenter/issues/1049

**Category:** Breaking
Member

On a separate taints topic: do we plan on aligning with CAS on some of the taints and annotations for v1 as well? Or is this work still in the discussion phase?

Would we be adding node-lifecycle.kubernetes.io/do-not-disrupt alongside karpenter.sh/do-not-disrupt? If we are not adding them alongside the annotations and instead deprecating the karpenter.sh specific annotations, then v1 might be the right time to make this change.


We shouldn't align on node-lifecycle.kubernetes.io/do-not-disrupt until it's registered with Kubernetes (see https://kubernetes.io/docs/reference/labels-annotations-taints/ for the list - you just need a PR against that page to add a registered thing).

So, an easy prerequisite to meet, but still something to remember to do.

Member Author

do we plan on aligning with cas on some of the taints and annotations for v1 as well? Or is this work still in discussion phase

There is an open issue on this upstream. We've been actively having some conversations around what makes sense here, but I imagine that, yes, this will be part of the taint redesign so that we don't break everyone twice.

Since the discussion is still ongoing and nothing is set in stone yet, I didn't want to specifically scope it into this item, though I can add a little bit to make mention of it.

@Bryce-Soghigian
Member

Musing for v1:

E2E Testing for Core: Would there be an interest in testing core via e2es with kwok? There are some bits of functionality in the v1 proposal that change a lot of behavior, like ConsolidateAfter for the WhenUnderutilized consolidation policy. Would it be something we want to take on to have some of the e2e tests move from the providers into core as well? We should take some additional testing steps to ensure stability in core releases since we are moving to v1.

We should still run and develop e2es for our providers, but some of the functionality, like consolidation, is mostly driven by core. Given that our providers are both on the latest version of core, you can bump the version with tags to test things there as well. But having some e2es gate consolidation, scheduling, etc. would be useful for gating code.

Could we even write them to run on multiple managed k8s services (AKS, EKS) with kwok as the cloud provider in multiple clouds (AWS, Azure)? I haven't played enough with kwok to see if something like that would work, but supposedly we could, right?

**Standard Columns**

1. Name
2. NodeClass - Allows users to easily see the NodeClass that the NodePool is using
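For context, a column like this would be wired up through the CRD's printer columns; a minimal sketch, assuming the current location of `nodeClassRef` in the NodePool schema:

```yaml
# Sketch of an additionalPrinterColumns entry for the NodePool CRD.
additionalPrinterColumns:
  - name: NodeClass
    type: string
    jsonPath: .spec.template.spec.nodeClassRef.name
```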

There's a problem: we don't know what the kind is. With this API, there could be two different node classes with the same name but different kinds.

Member Author

How big of a problem do you think that is in the printer columns? I'm hesitant to want to include both the name and the kind/group in an API when most users may name them differently for clarity anyways. Perhaps the wide columns, but even there I'm not fully convinced that it's needed at this point since no CloudProvider supports multiple kinds for NodeClasses right now.

type: Ready
nodeName: ip-192-168-71-87.us-west-2.compute.internal
providerID: aws:///us-west-2b/i-053c6b324e29d2275
imageID: ami-0b1e393fbe12f411c

How will we handle:

  • providers where the machine image specification is a tuple (e.g. `{image: "e911d3ce-5a58-4e5f-bab1-e5c5492779df", version: 42}`)
  • node provisioning where there's no image (for example: there's a pool of provisioned bare metal and someone wants Karpenter to wake up servers when needed)

?

It'd be a shame to need a v2 API for those cases.

Member Author

I'm struggling to see a case where an image isn't eventually assigned to a NodeClaim/Instance. In the pre-provisioned case, the NodeClaim will eventually have an image assigned to it when it wakes up. In the tuple case, you could construct a string that makes sense in the context of `imageID`, e.g. `e911d3ce-5a58-4e5f-bab1-e5c5492779df/v42`.


1. Name
2. Instance Type
3. Capacity - Moved from the wide output to the standard output

Where in the NodeClaim API does Capacity come from?

Member Author

It comes from the labels -- karpenter.sh/capacity-type
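For reference, a label-backed printer column can be expressed with an escaped JSONPath; a minimal sketch:

```yaml
# Sketch: a NodeClaim printer column sourced from the capacity-type label.
additionalPrinterColumns:
  - name: Capacity
    type: string
    jsonPath: .metadata.labels.karpenter\.sh/capacity-type
```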

throughput: 125
snapshotID: snap-0123456789
detailedMonitoring: true
status:

Doesn't block v1, but: could we add the ARN of the instance profile into .status?

Member Author

What's the need for the ARN vs. just the name? The instance profile will be in the same partition and is global, so I think the ARN should be easily constructed from the name.

@sftim

sftim commented May 2, 2024

If we want to announce v1, the Kubernetes blog team can help with that.

@jonathan-innis
Member Author

jonathan-innis commented May 5, 2024

E2E Testing for Core: Would there be an interest in testing core via e2es with kwok

I didn't want to directly couple it to v1 (because I don't view it as a blocking change), but there's a lot of contribution work happening between some folks on the AWS side and @njtran to start getting more extensive testing and benchmarking with Kwok. I can imagine that this work is going to happen in parallel with v1, but I don't personally want to block on it since it's going to be a long pole to move a lot of the testing that's currently sitting in the providers over to the upstream codebase.

@jonathan-innis
Member Author

Does it make sense to make the azure + aws documentation unification effort happen alongside v1

This is another one that has unknown timelines in my mind, and I don't think it has any direct impact on users if it comes after v1 (at least from a breaking-change perspective). I'd rather just keep working on this effort and not view it as a blocking action for the v1 release.


## Migration Path

Karpenter will **not** be changing its API group or resource kind as part of the v1 API bump. By avoiding this, we can leverage the [existing Kubernetes conversion webhook process](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion) for upgrading APIs, which allows upgrades to newer versions of the API to occur in-place, without any customer intervention or node rolling. The upgrade process will be executed as follows:


What does "node rolling" mean here - no need to recreate or update the Nodes provisioned by Karpenter?

Contributor

Node rolling here means a full recycle/recreation of all the nodes in the cluster.


1. Apply the updated NodePool, NodeClaim, and EC2NodeClass CRDs, which will contain a `v1` version listed under the `versions` section of the CustomResourceDefinition
2. Upgrade Karpenter controller to its `v1` version. This version of Karpenter will start reasoning in terms of the `v1` API schema in its API requests. Resources will be converted from the v1beta1 to the v1 version at runtime, using conversion webhooks shipped by the upstream Karpenter project and the Cloud Providers (for NodeClass changes).
3. Users update their `v1beta1` manifests that they are applying through IaC or GitOps to use the new `v1` version.


Karpenter provided a manifest conversion tool when it promoted its API version to v1beta1; do you plan to provide a similar tool for v1?

Contributor

The conversion webhooks should translate the CRs in memory for you. The level of support for a conversion tool probably differs per Cloud Provider. It's worth opening a question/issue in the repo of the Cloud Provider you're interested in.
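For reference, the in-place conversion depends on the CRDs carrying a conversion stanza that points at the webhook; a rough sketch, where the service name, namespace, and path are placeholders rather than the values any provider actually ships:

```yaml
# Sketch of a CRD conversion stanza for v1beta1 -> v1 conversion (other
# required CRD fields omitted; service details below are placeholders).
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: nodepools.karpenter.sh
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1beta1", "v1"]
      clientConfig:
        service:
          name: karpenter
          namespace: kube-system
          path: /conversion/karpenter.sh
```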

**Wide Columns (-o wide)**

1. Weight - Viewing the NodePools that will be evaluated first should be easily observable, but this may not be immediately useful to all users, particularly if the NodePools are named in a way that already indicates their ordering, e.g. suffixed with `fallback`
2. CPU% - The CPU capacity of all nodes provisioned by this NodePool, as a percentage of its limits


The limits here are spec.limits, right? If the user does not specify this value, what will be displayed?

Contributor

Yes, if there's no limit, I would think this would show up with some nil or special value.


Karpenter currently doesn’t enforce immutability on NodeClaims in v1beta1, though we implicitly assume that users should not be acting against these objects after creation, as the NodeClaim lifecycle controller won’t react to any change after the initial instance launch.

Karpenter can make every `spec` field immutable on the NodeClaim after its initial creation. This will be enforced through CEL validation, where you can perform a check like `[self == oldSelf](https://kubernetes.io/docs/reference/using-api/cel/#language-overview)` to enforce that the fields cannot have changed after the initial apply. Users who are not on K8s 1.25+ that supports CEL will get the same validation enforced by validating webhooks.


Suggested change
Karpenter can make every `spec` field immutable on the NodeClaim after its initial creation. This will be enforced through CEL validation, where you can perform a check like `[self == oldSelf](https://kubernetes.io/docs/reference/using-api/cel/#language-overview)` to enforce that the fields cannot have changed after the initial apply. Users who are not on K8s 1.25+ that supports CEL will get the same validation enforced by validating webhooks.
Karpenter can make every `spec` field immutable on the NodeClaim after its initial creation. This will be enforced through CEL validation, where you can perform a check like [`self == oldSelf`](https://kubernetes.io/docs/reference/using-api/cel/#language-overview) to enforce that the fields cannot have changed after the initial apply. Users who are not on K8s 1.25+ that supports CEL will get the same validation enforced by validating webhooks.
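As a rough sketch of how that CEL rule could appear in the CRD schema (illustrative only, not the exact schema Karpenter will ship):

```yaml
# Excerpt from a NodeClaim CRD openAPIV3Schema: make the whole spec immutable
# after creation via CEL. Rule and message wording are illustrative.
spec:
  type: object
  x-kubernetes-validations:
    - rule: "self == oldSelf"
      message: "NodeClaim spec is immutable after creation"
```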


**Category:** Breaking

Karpenter wants to expand the usage of the `karpenter.sh/disruption=disrupting:NoSchedule` taint that it currently leverages to cordon nodes during disruption, to also taint nodes with a `karpenter.sh/disruption-candidate:NoSchedule` taint. By tainting nodes when they become candidates (past expiry, drifted, etc.), we ensure that we will launch new nodes when we get more pods that join the cluster, reducing the chance that we will continue to get `karpenter.sh/do-not-disrupt` pods that continue to schedule to the same node.


Suggested change
Karpenter wants to expand the usage of the `karpenter.sh/disruption=disrupting:NoSchedule` taint that it currently leverages to cordon nodes during disruption, to also taint nodes with a `karpenter.sh/disruption-candidate:NoSchedule` taint. By tainting nodes when they become candidates (past expiry, drifted, etc.), we ensure that we will launch new nodes when we get more pods that join the cluster, reducing the chance that we will continue to get `karpenter.sh/do-not-disrupt` pods that continue to schedule to the same node.
Karpenter wants to expand the usage of the `karpenter.sh/disruption=disrupting:NoSchedule` taint that it currently leverages to cordon nodes during disruption, to also taint nodes with a `karpenter.sh/disruption=candidate:NoSchedule` taint. By tainting nodes when they become candidates (past expiry, drifted, etc.), we ensure that we will launch new nodes when we get more pods that join the cluster, reducing the chance that we will continue to get `karpenter.sh/do-not-disrupt` pods that continue to schedule to the same node.
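Concretely, under the form in the suggestion above, a disruption candidate node would carry a taint like the following sketch:

```yaml
# Sketch: taint applied to a Node once it becomes a disruption candidate,
# per the proposal above.
spec:
  taints:
    - key: karpenter.sh/disruption
      value: candidate
      effect: NoSchedule
```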


#### Tasks

- [ ] Design and implement a `spec.consolidateAfter` field for the v1 API, reworking our synchronous wait to ensure that waiting for nodes that haven’t reached the end of their `consolidateAfter` timeframe doesn’t block other disruption evaluation


Suggested change
- [ ] Design and implement a `spec.consolidateAfter` field for the v1 API, reworking our synchronous wait to ensure that waiting for nodes that haven’t reached the end of their `consolidateAfter` timeframe doesn’t block other disruption evaluation
- [ ] Design and implement a `spec.consolidateAfter` field for the v1 API of NodePool, reworking our synchronous wait to ensure that waiting for nodes that haven’t reached the end of their `consolidateAfter` timeframe doesn’t block other disruption evaluation
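For example, a NodePool using such a field might look like the sketch below; the exact placement (directly under `spec` vs. under `spec.disruption`) is part of what this task would design:

```yaml
# Sketch only: assumes consolidateAfter sits in the disruption block and
# applies to the WhenUnderutilized policy.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 5m
```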


Karpenter currently relies on a hash to determine whether certain fields in Karpenter’s NodeClaims have drifted from their owning NodePool and owning EC2NodeClass. Today, this is determined by hashing a set of fields on the NodePool or EC2NodeClass and then validating that this hash still matches the NodeClaim’s hash.

This hashing mechanism works well for additive changes to the API, but does not work well when adding fields to the hashing function that already have a set value on a customer’s cluster. In particular, we have a need to make breaking changes to this hash scheme from these two issues: https://github.com/kubernetes-sigs/karpenter/issues/909 and https://github.com/aws/karpenter-provider-aws/issues/5447.
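For context, the comparison is annotation-based: the hash of the owning resource is stamped onto the NodeClaim at launch and compared later. A sketch, with the annotation names treated as assumptions for illustration:

```yaml
# Sketch: drift detection compares a hash annotation on the NodeClaim against
# the current hash of its owning NodePool. Annotation names are illustrative.
apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  name: default-abc12
  annotations:
    karpenter.sh/nodepool-hash: "1234567890"
    karpenter.sh/nodepool-hash-version: "v2"
```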


Does this mean that the field with the default values won't work well when the user specifies it later?


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 25, 2024
@jmdeal
Member

jmdeal commented May 25, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 25, 2024