Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: RunAsUser causes pods to not start on Windows #110235

Merged

Conversation

jsturtevant
Copy link
Contributor

What type of PR is this?

/kind failing-test

What this PR does / why we need it:

#109946 Introduced Restricted policy for some of the pod tests. This broke due to #102849. See https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1529769915945324544

If the e2e suite is running with nodeOSDistro = windows then we set it to nil.

Which issue(s) this PR fixes:

Fixes #
Related to https://github.com/kubernetes/kubernetes/pull/109946/files#r881112011

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig windows
/cc @marosset @liggitt @tallclair

@k8s-ci-robot k8s-ci-robot requested a review from liggitt May 26, 2022 18:11
@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label May 26, 2022
@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/windows Categorizes an issue or PR as relevant to SIG Windows. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 26, 2022

# setting RunAsUser in Windows results invalid permissions being set on projected volumes
# https://github.com/kubernetes/kubernetes/issues/102849
if framework.NodeOSDistroIs("windows") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh... I didn't expect this bit... do we not have any e2es that test non-homogenous clusters?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I more expected to just drop RunAsUser

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a test in Windows that explicitly tests that network connectivity work across Windows and Linux. But otherwise, the assumption is the cluster we run tests against would either be Linux nodes or Windows nodes not both (other than the control plane node for Linux)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ok so we should just drop the RunAsUser? (#109946 (comment))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we drop it, the callers still need to set RunAsUser for a bunch of the tests to pass on linux, which is still going to run into the same issues of either breaking windows or needing to determine the node OS...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the callers still need to set RunAsUser for a bunch of the tests to pass on linux

is that true? I didn't think most tests were explicitly setting RunAsUser, and instead were running images that weren't built to use root by default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of them use the pause image, which should be ok, but there's also a handful using busybox which doesn't set a user.

@jsturtevant jsturtevant force-pushed the fix-new-restricted-pod-tests branch from ba86be9 to 2648881 Compare May 26, 2022 20:34
@jsturtevant
Copy link
Contributor Author

/test pull-kubernetes-e2e-capz-windows-containerd

@endocrimes
Copy link
Member

/test pull-kubernetes-node-kubelet-serial-containerd

@marosset marosset added this to In Review (v1.25) in SIG-Windows May 26, 2022
@jsturtevant
Copy link
Contributor Author

jsturtevant commented May 26, 2022

looking into the failure which seems to be that the apiserver pod didn't start. Don't believe it was related to this

/test pull-kubernetes-e2e-capz-windows-containerd

@jsturtevant jsturtevant force-pushed the fix-new-restricted-pod-tests branch from 2648881 to 061b8e8 Compare May 27, 2022 16:31
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 27, 2022
@jsturtevant
Copy link
Contributor Author

/test pull-kubernetes-e2e-capz-windows-containerd

SeccompProfile: &v1.SeccompProfile{Type: v1.SeccompProfileTypeRuntimeDefault},
}

if NodeOSDistroIs("windows") {
Copy link
Contributor

@marosset marosset May 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since --node-os-distro is set as a e2e.test argument and we do schedule some linux containers during some Windows specific e2e tests this would add windows options with RunAsUserName set for those linux containers.
I don't think this will be an issue but did want to call that out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

option: make 2 versions of GetRestrictedPodSecurityContext, one that guesses the OS and one that lets it be explicitly specified.

Then, the Mixin function should use the PodOS if present.

func GetRestrictedPodSecurityContext() *v1.PodSecurityContext {
  if NodeOSDistroIs("windows") {
    return GetRestrictedPodSecurityContextForOS(v1.WindowsPodOS) // could just be a string until PodOS merges?
  }
  return GetRestrictedPodSecurityContextForOS(v1.LinuxPodOS) // default
}

func GetRestrictedPodSecurityContextForOS(os PodOS) *v1.PodSecurityContext {
  // Bulk of the work
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at doing this but seems like it should really be done when // TODO(#105919): Handle PodOS for windows pods. is completed. It made things a bit messier than it is now because there are a couple different cases too cover (framework.NodeOSDistroIs and handling PodOSsince this needs to be addressed in MixinRestrictedPodSecurity as well.

@jsturtevant
Copy link
Contributor Author

That failure was similiar to earlier, I haven't been able to figure out what it is yet. The control plane didn't start:

kube-apiserver-capz-conf-6bprnv-control-plane-frsww          capz-conf-6bprnv-control-plane-frsww Pending       []
kube-controller-manager-capz-conf-6bprnv-control-plane-frsww capz-conf-6bprnv-control-plane-frsww Pending       []
kube-scheduler-capz-conf-6bprnv-control-plane-frsww          capz-conf-6bprnv-control-plane-frsww Pending 

/test pull-kubernetes-e2e-capz-windows-containerd

@jsturtevant
Copy link
Contributor Author

Windows job passed, created kubernetes-sigs/cluster-api-provider-azure#2338 to track issue with capz cluster.

/assign @marosset @liggitt @tallclair

@marosset
Copy link
Contributor

LGTM for me from Windows.
I noted one slight concern above and am satisfied with the approach

@marosset
Copy link
Contributor

marosset commented May 27, 2022

/triage accepted
/priority critical-urgent
(since this is breaking tests)

@k8s-ci-robot
Copy link
Contributor

@marosset: The label(s) priority/critical-urgernt cannot be applied, because the repository doesn't have them.

In response to this:

/triage accepted
/priority critical-urgernt
(since this is breaking tests)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 27, 2022
if NodeOSDistroIs("windows") && pod.Spec.SecurityContext.WindowsOptions == nil {
pod.Spec.SecurityContext.WindowsOptions = &v1.WindowsSecurityContextOptions{}
pod.Spec.SecurityContext.WindowsOptions.RunAsUserName = pointer.StringPtr(DefaultNonRootUserName)
}
}
for i := range pod.Spec.Containers {
mixinRestrictedContainerSecurityContext(&pod.Spec.Containers[i])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to add similar securityContext checks for containers? It may not be a problem unless pod's explicitly have containers that have Windows options set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look to set the users for the containers currently, so left it out for now.

@marosset
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2022
@jsturtevant
Copy link
Contributor Author

@liggitt @tallclair any further thoughts on these changes?

@jsturtevant
Copy link
Contributor Author

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 1, 2022
@dims
Copy link
Member

dims commented Jun 1, 2022

/approve

/hold for previous reviewers to chime in again.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 1, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, jsturtevant

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2022
@tallclair
Copy link
Member

We might be able to clean this up further, but if this fixes currently failing tests we should probably merge it and iterate.

// If the Node OS is windows, we return nill due to issue with invalid permissions set on projected volumes
// https://github.com/kubernetes/kubernetes/issues/102849
func GetDefaultNonRootUser() *int64 {
if NodeOSDistroIs("windows") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there a constant defined for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There doesn't seem to be, If you search for it you will find it mostly used like this https://github.com/kubernetes/kubernetes/search?q=NodeOSDistroIs&type=code.

SeccompProfile: &v1.SeccompProfile{Type: v1.SeccompProfileTypeRuntimeDefault},
}

if NodeOSDistroIs("windows") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider just clearing the RunAsUser here, to cut down on the number of places we need to check the OS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this but it is also needed in MixRestrictedPodSecurty and without using podOS (which has a TODO) I think it gets a bit unwieldy

SeccompProfile: &v1.SeccompProfile{Type: v1.SeccompProfileTypeRuntimeDefault},
}

if NodeOSDistroIs("windows") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

option: make 2 versions of GetRestrictedPodSecurityContext, one that guesses the OS and one that lets it be explicitly specified.

Then, the Mixin function should use the PodOS if present.

func GetRestrictedPodSecurityContext() *v1.PodSecurityContext {
  if NodeOSDistroIs("windows") {
    return GetRestrictedPodSecurityContextForOS(v1.WindowsPodOS) // could just be a string until PodOS merges?
  }
  return GetRestrictedPodSecurityContextForOS(v1.LinuxPodOS) // default
}

func GetRestrictedPodSecurityContextForOS(os PodOS) *v1.PodSecurityContext {
  // Bulk of the work
}

@jsturtevant
Copy link
Contributor Author

@tallclair the feedback is valid but after trying it out it really feels like those changes should come in with the // TODO(#105919): Handle PodOS for windows pods. Are you ok with merging as is and revisting when PodOS is handled?

@marosset
Copy link
Contributor

marosset commented Jun 2, 2022

@tallclair the feedback is valid but after trying it out it really feels like those changes should come in with the // TODO(#105919): Handle PodOS for windows pods. Are you ok with merging as is and revisting when PodOS is handled?

Our sig-windows jobs targeting master are currently failing some tests because of this issue.
Can we use these changes as a mitigation and follow up with a more permanent solution to help us get our CI signal back to green?

@tallclair
Copy link
Member

/hold cancel

Yes, let's follow up.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit b7337cb into kubernetes:master Jun 3, 2022
SIG-Windows automation moved this from In Review (v1.25) to Done (v1.25) Jun 3, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jun 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
SIG-Windows
  
Done (v1.25)
Development

Successfully merging this pull request may close these issues.

None yet

8 participants