Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initialize securityContext in injected metrics container #964

Merged
merged 1 commit into from Jan 1, 2020

Conversation

vpavlin
Copy link
Member

@vpavlin vpavlin commented Dec 11, 2019

What this PR does / why we need it:

The issue is that the injected container does not have securityContext initialized (as that is done by some earlier mutating webhook). The simplest solution for now is to simply copy the securityContext from the container that already exists (and is fully mutated) in the pod.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #963

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

No image changes

Release note:

NONE

This change is Reviewable

@k8s-ci-robot
Copy link

Hi @vpavlin. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pdmack
Copy link
Member

pdmack commented Dec 11, 2019

/assign @richardsliu @johnugeorge

@hougangliu
Copy link
Member

/ok-to-test

@hougangliu
Copy link
Member

/test kubeflow-katib-presubmit

@@ -178,11 +178,13 @@ func (s *sidecarInjector) getMetricsCollectorContainer(trial *trialsv1alpha3.Tri
}
args := getMetricsCollectorArgs(trial.Name, metricName, mc)
sidecarContainerName := getSidecarContainerName(trial.Spec.MetricsCollector.Collector.Kind)
securityContext := originalPod.Spec.Containers[0].SecurityContext.DeepCopy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can it make sure that originalPod.Spec.Containers[0].SecurityContext here can fit the metricsCollector and the originalPod.Spec.Containers[0].SecurityContext may not be generated by other mutating Webhook but defined explicitly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I fully understand the questions. Please look at my comment below which explains how the securityContext is generated and see if it helps.

For how I understand your question - the metricsCollector should have the same securityContext as the [0] container as that is how it would be generated by PSP by default - based on the ServiceAccount privileges it would generate securityContext for Pod and from there for all the containers. So as the [0] container already has the right securityContext it is safe to simply replicate it into the metricsCollector

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container securityContext can be generated by PSP, but also can be specified explicitly, right?
For example, a user adds a training job in a container with securityContext: runAsUser: 1000, if user 1000 can work well for metricsCollector container, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you mean now. That is a fair point. I would say it would work fine - IIUC both containers are working with the same files, so if the user container works with UID 1000, the metrics container should work too, right?

I am trying to figure out if there is a common pattern about how to solve these cases with the OpenShift team.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned somewhere else the best solution would be to use reinvocationPolicy which should take care of generating the SC in the new container after it is added by webhook. That has 2 issues at the moment -

  1. it is coming in next version of OpenShift, so I cannot even test it:)
  2. the code katib uses to register webhooks does not know about the field, so I'll have to figure out how to actually do it when it is available

It seems (after discussing with OpenShift team) that using the existing SC from the other container might be our best shot here. I will try to git into it a bit more, but as I mentioned - as long as the sidecar container does not do anything special, it should work fine with the SC from existing container. The only way I see this could potentially fail if if the user configures the container very restrictively and metrics container cannot get to what it needs to read.

The only other way I could think of is to configure the SC explicitly to the state that we know will work for most configurations. Question is what happens if the user then configures his container to run as root and metrics container does not get the same SC and thus will probably not be able to access files produced by the main container.

That said, I still think copying the SC is the best solution at the moment. I will try to play around with it a bit more tomorrow to see if I can actually break this approach:)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can try another solution as workaround: now k8s runs Webhooks in string order by webhook name, we can try to run katib mutating before other related webhooks by renaming it.
for example, now its name is katib-mutating-webhook-config, we can rename it to admission-katib-mutating-webhook-config, not sure what openshift related Webhooks name is.
I think other projects like Istio have similar issue, have no idea how the project handle this case (for istio, I think istioctl kube-inject to bypass the istio mutatingwebhook as a workaround.).
BTW, @johnugeorge @richardsliu @gaocegege WDYT we provide a tool like istioctl kube-inject to inject metricsCollector container in job level offline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked RH istio team and their issue seems to be different because they set the SC explicitly as they need their container to run as privileged.

I am not sure about reordering by name because AFAIK the component generating SC is PSP (Pod Security Policies) which does not run as a webhook.

Adding the metrics container on Job level might help though - I added comment here #928 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metrics collector container now reads metrics file and then reports metrics to DB server, I wonder if two operations fail owning to training job SC configuration.

I tested this before sending the PR and the operations surely did not fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why RH istio team think their issue is different.
the root cause is: when a Pod generated and before persisted, PSP adds security context to all containers; then mutatingwebhook injecting new container into a Pod which leads that the new container has no consistent SC which conflicts with PSP, right?
so I think it is a common use case, many projects would inject new container into a Pod by mutatingwebhook, such as istio, kfserving.
we had better handle this case in a best practice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hougangliu @vpavlin

Can we have a command line flag to control the behavior whether we use other containers's securityContext if we cannot come to an agreement now?

@johnugeorge
Copy link
Member

@vpavlin Is this some earlier mutating webhook specific to your deployment?

@vpavlin
Copy link
Member Author

vpavlin commented Dec 12, 2019

@johnugeorge It is a problem on OpenShift in general - OpenShift is very security focused and enforces quite limited SCC by default. This is happening as part of Pod Security Policies controller and the securityContext is first generated on pod level and then from the pod into containers. (See https://github.com/openshift/apiserver-library-go/blob/master/pkg/securitycontextconstraints/sccmatching/provider.go#L152)

This apparently runs before the mutating webhooks, so when Katib then injects new container without securityContext initialized, OpenShfit is not happy about it.

So the solution I am proposing for now is to replicate the securityContext from a container that already passed through the PSP mutation into the injected container.

Does it make sense?

@johnugeorge
Copy link
Member

@johnugeorge It is a problem on OpenShift in general - OpenShift is very security focused and enforces quite limited SCC by default. This is happening as part of Pod Security Policies controller and the securityContext is first generated on pod level and then from the pod into containers. (See https://github.com/openshift/apiserver-library-go/blob/master/pkg/securitycontextconstraints/sccmatching/provider.go#L152)

This apparently runs before the mutating webhooks, so when Katib then injects new container without securityContext initialized, OpenShfit is not happy about it.

So the solution I am proposing for now is to replicate the securityContext from a container that already passed through the PSP mutation into the injected container.

Does it make sense?

Got it.
/lgtm
will wait for @hougangliu 's review.

@gaocegege
Copy link
Member

As we discussed, we decide to merge it and add an additional flag to control it. I will add it after the PR is merged.

@gaocegege
Copy link
Member

/lgtm
/approve

/cc @hougangliu

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Experiments execution fails due to non-initialized securityContext in injected container
7 participants