add kernel version to node label #51006
Conversation
Hi @xilabao. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
It seems we can get the kernel version from the node: node.Status.NodeInfo.KernelVersion.
@FengyunPan the change is for using node affinity.
/ok-to-test
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dchen1107, xilabao. No associated issue. Update the pull-request body to add a reference to an issue, or get approval with `/approve no-issue`. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
You can indicate your approval by writing `/approve` in a comment.
@xilabao can you file an issue for this request? Also, I think the change is small, but it deserves a release note.
/test pull-kubernetes-node-e2e
/retest
/retest
Review the full test history for this PR.
cc @liggitt as we talked about the labels the node should be able to set when registering itself.
We're actively trying to reduce the self-labeling capabilities of nodes (see kubernetes/community#911), since it lets them steer workloads to themselves. Every label a node sets currently will have to be whitelisted. If this is used so workloads can target nodes that are capable of running the workload, that's fine. If it's used by an administrator to find nodes with a vulnerable kernel, or to steer pods away from nodes with a vulnerable kernel, that's not fine, since it is self-reported by the node. I also have some questions about the cardinality and maintenance of this label.
cc @kubernetes/sig-node-pr-reviews
cc @kubernetes/sig-auth-pr-reviews
I agree with @vishh that it's a trivial feature, and it's useful for me: I want to create pods to run some test cases related to the kernel. Just as @liggitt says, the kernel version can be considered a fundamental thing; maybe we can add it to the default labels.
/retest
re: #51006 (comment), for scheduling to nodes with specific device versions, is the expectation that kubelets will start adding sets of labels for the various attributes of devices they are making available for scheduling? This seems like the first of many attributes like this, and I was hoping a whitelist of labels under node control would start small (mostly for backwards compatibility), and stay small or shrink, rather than continuously grow.
Yes. We are introducing a new extension point called device plugins at the kubelet level. Deploying device plugins requires "root" privileges on the node as of now. Device plugins are expected to advertise node labels to aid in scheduling. Just having a string resource name for scheduling is insufficient for many use cases.
The goal is for root privileges on a node to not automatically translate to root privileges on the cluster (against the API or other nodes). Control over all labels and taints allows steering workloads that can transitively grant access to high-value credentials.
Do you have pointers to those designs/docs (the ones I saw didn't call out populating node labels)? It would be helpful to see the overall direction for this. Finding a balance might be as simple as putting the device labels under a label key prefix/namespace that makes it clear the label was under the node's control, whitelisting that prefix, and documenting that those should not be used for security-related partitioning of a cluster. If so, it would be ideal to use such a prefix/namespace for this.
```go
LabelOS   = "beta.kubernetes.io/os"
LabelArch = "beta.kubernetes.io/arch"
```
It's safe to assume there are workloads in the real world that assume the presence of these labels today (with the beta keys). When combined with making controller selectors immutable (#50808 & #50719), we have to think through the implications of changing these, and how someone would modify their existing workloads to use the non-beta labels non-disruptively. Do we need to announce deprecation of the beta labels, and set both beta and non-beta labels for some period? cc @kubernetes/api-reviewers
I don't think this is safe within our definitions of API stability. This breaks anyone using those labels without warning.
@liggitt @smarterclayton and others, I have created an issue (#51756) for the beta label.
```diff
-LabelZoneFailureDomain = "failure-domain.beta.kubernetes.io/zone"
-LabelZoneRegion        = "failure-domain.beta.kubernetes.io/region"
+LabelZoneFailureDomain = "failure-domain.kubernetes.io/zone"
+LabelZoneRegion        = "failure-domain.kubernetes.io/region"
```
These are baked into the default scheduler startup options (see DefaultFailureDomains and the kube-scheduler --failure-domains flag). I think changing this can break scheduling to version-skewed kubelets, at least.
@liggitt https://github.com/kubernetes/community/blob/master/contributors/design-proposals/device-plugin.md Please file issues for any label changes. The API is still very much alpha, so we can integrate your suggestions without any concern.
/retest
/test pull-kubernetes-e2e-gce-bazel
I would love to see the host Linux distribution propagated through as well, perhaps by calling lsb_release.

edit to add a source blurb:

```go
func EnumerateDistro() string {
	std, err := exec.Command("lsb_release", "-i", "-s", "-r").CombinedOutput()
	if err != nil {
		return ""
	}
	trimmed := strings.TrimSpace(string(std))
	splitted := strings.Split(trimmed, "\n")
	if len(splitted) != 2 {
		return ""
	}
	id := splitted[0]
	release := splitted[1]
	return fmt.Sprintf("%s %s", id, release)
}
```

edit 2, or read from /etc/os-release:

```go
func EnumerateDistro() string {
	file, err := os.Open("/etc/os-release")
	if err != nil {
		return ""
	}
	defer file.Close()
	id := ""
	release := ""
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "ID=") {
			id = line[3:]
		}
		if strings.HasPrefix(line, "VERSION_ID=") {
			release = strings.Replace(line[11:], "\"", "", -1)
		}
		if release != "" && id != "" {
			break
		}
	}
	return fmt.Sprintf("%s %s", id, release)
}
```
I don't see how this is usable in node affinity unless versions can be compared with anything other than equality. Encouraging users to express exactly "I need a kernel version of 4.11.9-300.fc26.x86_64" seems to me like a step in the wrong direction. Doing so will discourage users from updating their kernels to avoid breaking affinity (thus making it harder for users of k8s to remain secure), could lead to incredibly large affinity constraints... and at what benefit?

I can't think of a single use-case that wouldn't be served better by a finer-grained label. Let's take @derekwaynecarr's example of gpu support (which depends on some kernel features presumably). It seems to me like a better way to solve a specific problem like the above would be to have a specific label for it. I can't think of a single case where using an affinity rule based on the exact kernel version would be a better choice than having a dedicated helper component which applies a more specific label based on the combination of kernel config and kernel version. Given the kernel version is already available in the node's status…

@rphillips let's not mix that in with this; that should be a separate issue and PR imo.
Why does k8s have to be opinionated about how a user upgrades kernels? If a deployment wants to prevent users from depending on kernel versions, then we can either give an option to not expose it, or have such deployments run admission controllers that prevent usage of the kernel version label. Portable linux containers are expected to be agnostic of the kernel version. If they care about kernel versions, then they need to be treated carefully.

This would not provide any standard APIs for developing portable system addons that depend on kernel version.
@vishh depending on a specific kernel version is not portable by definition, so it is unclear what you are saying here. It is almost never a good idea to depend on a specific kernel version, unless you are a component that is tightly coupled with the kernel (i.e. a kernel module); in all other cases you want to depend on a feature and its userspace API. And this is what @euank is saying: what you want to expose and depend on is a feature, which you can discover and expose at runtime by checking whether the kernel supports it. Not by exposing a kernel version, but a (perhaps versioned) feature name.

The device plugin document indeed hints throughout that there is a (versioned) device "name" being exposed to the kubelet. That's what may be exposed; a specific kernel makes little sense, especially if what you want to advertise is hardware that may or may not be present. A kernel version does not guarantee you anything about what hardware (and therefore devices) is available, or whether it works with that kernel version. Exposing a kernel version can only have the consequence of breaking configurations when upgrades are performed.
Those are both already exposed in a standard way under
Neither would exposing the kernel version provide portability, since it's inherently non-portable, especially given distros like RHEL/CentOS which backport large features without significantly changing the kernel version.
If we're going down the kernel-version or kubelet-version path (and I am not sure we should), it seems we need multiple labels to allow less precise version locking, e.g. given "4.11.9-300.fc26.x86_64", something like a hierarchy of progressively coarser labels.

Likewise, we probably need kubelet version in the same style. The lack of inequalities in labels still makes this clunky, and no matter what it is not exactly portable. Alternatively, we can make node-software-version an explicit criterion, formalize these fields into the API, and allow inequalities. That's a LOT more work...
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
What this PR does / why we need it:
Some pods may need a newer kernel version, so we don't need to upgrade the kernel on every node. The label will be updated when the node starts up.
Which issue this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close that issue when PR gets merged): fixes #51234
Special notes for your reviewer:
Release note: