Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add kernel version to node label #51006

Closed

Conversation

xilabao
Copy link
Contributor

@xilabao xilabao commented Aug 21, 2017

What this PR does / why we need it:
Some pods may need a higher kernel version. so we doesn't need to upgrade the kernel for all node.
The label will be updated at the same time when the node start up.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #51234

Special notes for your reviewer:

Release note:

add kernel version to node label

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 21, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @xilabao. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 21, 2017
@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. labels Aug 21, 2017
@FengyunPan
Copy link

It seems we can get kernel version from node: node.Status.NodeInfo.KernelVersion.

@xilabao
Copy link
Contributor Author

xilabao commented Aug 21, 2017

@FengyunPan the change is for using node affinity.

@dchen1107
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 23, 2017
@dchen1107
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 23, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dchen1107, xilabao

No associated issue. Update pull-request body to add a reference to an issue, or get approval with /approve no-issue

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@dchen1107 dchen1107 added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Aug 23, 2017
@dchen1107
Copy link
Member

@xilabao can you file an issue for this request? Also I think the change is small, but deserve a release note.

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2017
@xilabao
Copy link
Contributor Author

xilabao commented Aug 24, 2017

/test pull-kubernetes-node-e2e
/test pull-kubernetes-e2e-gce-gpu
/test pull-kubernetes-kubemark-e2e-gce

@xilabao
Copy link
Contributor Author

xilabao commented Aug 24, 2017

/retest

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

1 similar comment
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@luxas
Copy link
Member

luxas commented Aug 24, 2017

cc @liggitt as we talked about the labels the node should be able to when registering itself.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@liggitt
Copy link
Member

liggitt commented Aug 24, 2017

we're actively trying to reduce the self-labeling capabilities of nodes (see kubernetes/community#911), since it lets them steer workloads to themselves. every label a node sets currently will have to be whitelisted.

if this is used so workloads can target nodes that are capable of running the workload, that's fine. If it's used by an administrator to find nodes with a vulnerable kernel, or to steer pods away from nodes with a vulnerable kernel, that's not fine, since it is self-reported by the node.

also, I have some questions about the cardinality and maintenance of this label:

  • are nodes with super-specific kernel version labels useful in practice for scheduling? (edit: talked with @derekwaynecarr, this is needed for things like gpu support)
  • this label is only set when the node API object is created... it is never updated after that, which means it will be inaccurate if a node's kernel is upgraded. I really don't want nodes to start modifying their labels on update, since it can interfere with central admin management of node labels. what is the plan to maintain this label?

@liggitt
Copy link
Member

liggitt commented Aug 24, 2017

cc @kubernetes/sig-node-pr-reviews

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Aug 24, 2017
@luxas
Copy link
Member

luxas commented Aug 24, 2017

cc @kubernetes/sig-auth-pr-reviews

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Aug 24, 2017
@xilabao
Copy link
Contributor Author

xilabao commented Aug 30, 2017

I agree with @vishh it's a trivial feature. and it's useful for me that I want to create pods to run some testcases relate with kernel. Just as @liggitt say, kernel version can be considered a fundamental thing. maybe we can add it to the default label.
Someone may use a customized kernel (some parts changed e.g. filesystem, network) for special purposes (e.g. performance). they might have interest in the label.

@xilabao
Copy link
Contributor Author

xilabao commented Aug 30, 2017

/retest

@liggitt
Copy link
Member

liggitt commented Aug 30, 2017

re: #51006 (comment), for scheduling to nodes with specific device versions, is the expectation that kubelets will start adding sets of labels for the various attributes of devices they are making available for scheduling?

This seems like the first of many attributes like this, and I was hoping a whitelist of labels under node control would start small (mostly for backwards compatibility), and stay small or shrink, rather than continuously grow.

@vishh
Copy link
Contributor

vishh commented Aug 30, 2017 via email

@liggitt
Copy link
Member

liggitt commented Aug 30, 2017

Yes. We are introducing a new extension point called device plugins at the kubelet level. Deploying device plugins requires "root" privileges on the node as of now.

The goal is for root privileges on a node to not automatically translate to root privileges on the cluster (against the API or other nodes). Control over all labels and taints allows steering workloads that can transitively grant access to high-value credentials.

Device plugins are expected to advertize node labels to aid in scheduling.

Do you have pointers to those designs/docs (the ones I saw didn't call out populating node labels)? It would be helpful to see the overall direction for this. Finding a balance might be as simple as putting the device labels under a label key prefix/namespace that makes it clear the label was under the node's control, whitelisting that prefix, and documenting that those should not be used for security-related partitioning of a cluster. If so, it would be ideal to use such a prefix/namespace for this


LabelOS = "beta.kubernetes.io/os"
LabelArch = "beta.kubernetes.io/arch"
Copy link
Member

@liggitt liggitt Aug 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's safe to assume there are workloads in the real world that assume the presence of these labels today (with the beta keys). when combined with making controller selectors immutable (#50808 & #50719), we have to think through the implications of changing these, and how someone would modify their existing workloads to use the non-beta labels non-disruptively. do we need to announce deprecation of beta labels, and set both beta and non-beta labels for some period? cc @kubernetes/api-reviewers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is safe within our definitions of API stability. This breaks anyone using those labels without warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt @smarterclayton and others, I have created a issue (#51756) for the beta label.

LabelZoneFailureDomain = "failure-domain.beta.kubernetes.io/zone"
LabelZoneRegion = "failure-domain.beta.kubernetes.io/region"
LabelZoneFailureDomain = "failure-domain.kubernetes.io/zone"
LabelZoneRegion = "failure-domain.kubernetes.io/region"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are baked into the default scheduler startup options (see DefaultFailureDomains and the kube-scheduler failure-domains flag). I think changing this can break scheduling to version skewed kubelets, at least.

@vishh
Copy link
Contributor

vishh commented Aug 30, 2017

@liggitt https://github.com/kubernetes/community/blob/master/contributors/design-proposals/device-plugin.md Please file issues for any label changes. The API is still very much alpha, so we can integrate your suggestions without any concern.

@xilabao
Copy link
Contributor Author

xilabao commented Sep 1, 2017

/retest

@xilabao
Copy link
Contributor Author

xilabao commented Sep 1, 2017

/test pull-kubernetes-e2e-gce-bazel

@rphillips
Copy link
Member

rphillips commented Sep 5, 2017

I would love to see the host Linux Distribution propagated through as well. perhaps calling lsb_release

edit to add a source blurb:

func EnumerateDistro() string {
	std, err := exec.Command("lsb_release", "-i", "-s", "-r").CombinedOutput()
	if err != nil {
		return ""
	}
	trimmed := strings.TrimSpace(string(std))
	splitted := strings.Split(trimmed, "\n")
	if len(splitted) != 2 {
		return ""
	}
	id := splitted[0]
	release := splitted[1]
	return fmt.Sprintf("%s %s", id, release)
}

edit 2, or read from /etc/os-release

func EnumerateDistro() string {
	file, err := os.Open("/etc/os-release")
	if err != nil {
		return ""
	}
	defer file.Close()

	id := ""
	release := ""
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "ID=") {
			id = line[3:]
		}
		if strings.HasPrefix(line, "VERSION_ID=") {
			release = strings.Replace(line[11:], "\"", "", -1)
		}
		if release != "" && id != "" {
			break
		}
	}
	return fmt.Sprintf("%s %s", id, release)
}

@euank
Copy link
Contributor

euank commented Sep 5, 2017

the change is for using node affinity.

I don't see how this is usable in node affinity unless versions can be compared with anything other than equality.

Encouraging users to express exactly "I need a kernel version of 4.11.9-300.fc26.x86_64" seems to me like a step in the wrong direction. Doing so will discourage users from updating their kernels to avoid breaking affinity (thus making it harder for users of k8s to remain secure), could lead to incredibly large affinity constraints... and at what benefit? I can't think of a single use-case that wouldn't be served better by a finer-grained label.

Let's take @derekwaynecarr's example of gpu support (which depends on some kernel features presumably).
There's a fairly tight coupling between the version of kernel modules installed on the host and the version installed in the container (or, depending, bindmounted in).
Unfortunately, knowing the kernel version does nothing to improve the state of things there since the gpu driver version changes independently of the kernel version.
Perhaps there are kernel config options (though I can't think of any) or a minimum kernel version.
This feature neither exposes whether CONFIG_FOO is enabled (which can't be known from just a kernel version) nor allows the user to express an "at least this version" relationship.

It seems to me like a better way to solve a specific problem like the above would be to have a specific label for it (kubernetes.io/supports-gpu: true type label), and then that label can be applied based on whatever combination of the above is needed, including an "at least this kernel version" requirements.

I can't think of a single case where using an affinity rule based on the exact kernel version would be a better choice than having a dedicated helper component which applies a more specific label based on the combination of kernel config and kernel version.

Given the kernel version is already available in NodeInfo for the purpose of creating those labels, I'm not convinced this PR solves any real problems.
I would like specific examples of cases where this is the right tool, how it would be used exactly, and why more specific feature-targetted affinity labels wouldn't fit better.


@rphillips let's not mix that in with this, that should be a separate issue and PR imo.

@vishh
Copy link
Contributor

vishh commented Nov 22, 2017

I need a kernel version of 4.11.9-300.fc26.x86_64" seems to me like a step in the wrong direction. Doing so will discourage users from updating their kernels to avoid breaking affinity (thus making it harder for users of k8s to remain secure), could lead to incredibly large affinity constraints... and at what benefit?

Why does k8s have to be opinionated in how a user upgrades kernels? If a deployment wants to prevent users from depending on kernel versions, then we can either give an option to not expose it, or have such deployments run admission controllers that prevent usage of the kernel version label.

Portable linux containers are expected to be agnostic of the kernel version. If they care about kernel versions, then they need to be treated carefully.
Exposing kubelet and kernel version will benefit management of k8s system addons. For instance a device driver installer may be custom tuned for different kernel versions.

having a dedicated helper component which applies a more specific label based on the combination of kernel config and kernel version

This would not provide any standard APIs for developing portable system addons that depend on kernel version.

@simo5
Copy link
Contributor

simo5 commented Nov 22, 2017

@vishh dpending on a specific kernel version is not portable by definition, so it is unclear what you are saying here. It is almost never a good ideda to depend on a specific kernel version, unless you are a component that is tightly coupled with the kernel (ie a kernel module), in all other cases you want to dpend on a feature and its userspace API.

And this is what @euank is saying, what you want to expose and depend on is a feature, which you can discover and expose at runitime by checking if the kernel supports it. But not by exposing a kernel version, but a feature (perhaps versioned) name.

The devide plugin document indeed hints throughout thatthere is a (versioned) device "name" being exposed to the kublet. That's what may be exposed, a specific kernel makes little sense, especially if what you want to advertizes is hardware that may or not be present. A kernel version does not guarantee you anything about what hardware (and therefore devices) are avilable, and whether they work with that kernel version.

Exposing a kernel version can only have the consequence of breaking configurations when upgrades are performed.

@euank
Copy link
Contributor

euank commented Nov 22, 2017

@vishh

Exposing kubelet and kernel version will benefit management of k8s system addons

Those are both already exposed in a standard way under NodeInfo. This PR is adding them as labels in addition, which I'm trying to point out is probably not the right thing in any use-case.

[feature-specific labels] would not provide any standard APIs for developing portable system addons that depend on kernel version.

Neither would exposing the kernel version provide portability since it's inherently non-portable, especially given distros like RHEL/CentOS which backport large features without significantly changing the kernel version.
I think feature-specific addons can much more easily decide to standardize on labels like k8s.io/supports-kernel-feature-$foo: true than on kernel-version: 4.12.12-203.bigco-custom-patchset.44b85

@thockin
Copy link
Member

thockin commented Nov 22, 2017

If we going down the kernel-version or kubelet-version path (and I am not sure we should) it seems we need multiple labels to allow less precise version locking.

e.g. Given "4.11.9-300.fc26.x86_64", something like:

kubernetes.io/kernel-version-major: 4
kubernetes.io/kernel-version-minor: 4.11
kubernetes.io/kernel-version-patch: 4.11.9
kubernetes.io/kernel-version: 4.11.9-300.fc26.x86_64

Likewise, we probably need kubelet version in the same style. The lack of inequalities in labels still makes this clunky, and no matter what it is not exactly portable.

Alternatively, we can make node-software-version an explicit criteria, formalize these fields into the API, and allow inequalities. That's a LOT more work...

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 20, 2018
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 22, 2018
@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add kernel version to node label