
(psa) make workloads compatible with psa:restricted profile #2820

Conversation

anik120
Contributor

@anik120 anik120 commented Jul 22, 2022

Motivation for the change:

With the introduction of Pod Security Admission, the recommended best practice is to enforce the Restricted policy of admission (see [1] for more details).

[1] https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted

Description of the change:

This PR
*) Labels the olm namespace as enforce:restricted
*) Labels the operators namespace as enforce:baseline (to allow existing CSV deployments
without a securityContext set to deploy in the namespace, which won't be possible with
enforce:restricted)
*) Updates the securityContext of OLM workload pods (olm-operator, catalog-operator,
and CatalogSource registry pods) to adhere to the Restricted policy.
*) Updates the bundle unpacking job to create a pod that adheres to the Restricted policy,
so that bundles can be unpacked in the restricted namespace. (A sketch of the resulting
namespace labels follows below.)
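A minimal sketch of the namespace labels described above (namespace names are the upstream defaults; the actual manifests in this PR are Helm-templated):

    # olm namespace: enforce the restricted profile
    apiVersion: v1
    kind: Namespace
    metadata:
      name: olm
      labels:
        pod-security.kubernetes.io/enforce: restricted
        pod-security.kubernetes.io/enforce-version: latest
    ---
    # operators namespace: enforce baseline so existing CSV deployments
    # without a securityContext can still be admitted
    apiVersion: v1
    kind: Namespace
    metadata:
      name: operators
      labels:
        pod-security.kubernetes.io/enforce: baseline
        pod-security.kubernetes.io/enforce-version: latest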

Will also close #2644 as a byproduct of the need to fix test failures

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2022
@openshift-ci openshift-ci bot requested review from awgreene and benluddy July 22, 2022 23:29
runAsNonRoot: true
seccompProfile:
  type: RuntimeDefault
{{- if eq .Values.installType "upstream" }}

Collaborator

nice!

@anik120 anik120 changed the title WIP: placeholder WIP: (psa) make workloads compatible with psa:restricted profile Jul 26, 2022
@anik120 anik120 force-pushed the introduce-psa-restricted branch 4 times, most recently from 19f2901 to 231aa71 Compare July 27, 2022 19:16
@anik120 anik120 changed the title WIP: (psa) make workloads compatible with psa:restricted profile (psa) make workloads compatible with psa:restricted profile Jul 27, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 27, 2022
@perdasilva
Collaborator

In general it lgtm. Seems to hit all the right changes. Nicely done.
In case I missed it, it's probably worthwhile adding input validation to the runAsUser. I think at the moment 0 is allowed (as input), but won't end up stamping out a runAsUser (which is fine). But, without the validation it could be a bit confusing for the user.

I still suggest not giving the userId as a degree of freedom to the admin, at least not if we don't have a specific ask for it. The reason is that it could lead to backwards-compatibility pain later on in case we need to fix the userId for some reason or another. If there's no clear customer need, we're also increasing the complexity of the code for no good reason. Even if we do need it, I wonder if there's a nice library or pattern in Go for storing this kind of configuration so it doesn't have to get passed all the way down the stack.

@perdasilva
Collaborator

Another couple of things occurred to me while I slept: by default we need to have runAsUser in the securityContext of the jobs and catalog source pods in the majority of cases, because we cannot guarantee that the catalog source image or the bundle image have a user defined in their containers. But we kept the default userId value at -1 (i.e. no runAsUser). This means that the new flag is always necessary (except in OCP, where SCC will inject the value). Even if we still want to keep the user id flag (rather than use a toggle), we should probably have a different default so that it doesn't have to be used/defined in the majority case.

@anik120 anik120 force-pushed the introduce-psa-restricted branch 2 times, most recently from 8eb6517 to d420925 Compare July 28, 2022 20:33
@anik120
Contributor Author

anik120 commented Jul 28, 2022

Another couple of things occurred to me while I slept: by default we need to have runAsUser in the securityContext of the jobs and catalog source pods in the majority of cases, because we cannot guarantee that the catalog source image or the bundle image have a user defined in their containers. But we kept the default userId value at -1 (i.e. no runAsUser). This means that the new flag is always necessary (except in OCP, where SCC will inject the value). Even if we still want to keep the user id flag (rather than use a toggle), we should probably have a different default so that it doesn't have to be used/defined in the majority case.

I see your point, I've changed the default value to 1001.

In case I missed it, it's probably worthwhile adding input validation to the runAsUser. I think at the moment 0 is allowed (as input), but won't end up stamping out a runAsUser (which is fine). But, without the validation it could be a bit confusing for the user.

With that ☝🏽 change in context, the validation `if runAsUser >= 0 { set UserID }` allows any non-negative value, including 0 (root), which is fine: if someone sets runAsNonRoot but sets runAsUser to 0, the PSA controller will help us out by denying admission. But also, if for whatever reason we do need to change back to running as root, we can easily set the UID to 0 without any code change.

I still suggest not giving the userId as a degree of freedom to the admin, at least not if we don't have a specific ask for it. The reason is that it could lead to backwards-compatibility pain later on in case we need to fix the userId for some reason or another.

We do have other args that we pass to the containers, and the same backward-compatibility argument can be extended to them too, no? In fact, if we hardcode the value and then have to change it, we'll have to actually backport code changes. Whereas if we don't hardcode the value and we need to change the UID for previous releases, all we need is release manifest changes for those releases 🎉

If there's no clear customer need, we're also increasing the complexity of the code for no good reason

On the contrary, if we don't leave this degree of freedom, then besides the problem mentioned above, we'll possibly also have to keep changing the hardcoded value as the org-wide PSA enablement effort evolves.

Hardcoding "run workload as UID" sounds like it brings the exact problems you're trying to avoid in the first place @perdasilva, and the solution is to not hardcode it.

@perdasilva
Collaborator

perdasilva commented Jul 28, 2022

It's more a question of control. If we need to change it, we can, and there's no impact for the user. If we decide we need to pin to one user or another because of the container or other possible future changes, we can easily control that, up and downstream. I'd just check that we aren't pulling from olm:latest for the controller. If that's not the case, then the point is moot. But, just as a general guideline (and that's all it is), don't add more functionality than you need, because you'll end up having to support it. If you're confident it's needed, I'm good with it. Just making sure we're thinking it through.

@perdasilva
Collaborator

Oh, it just came to me. Something we'd do a lot at Amazon for our services was to use env vars instead. Those can be easily picked up by the code without having to traverse the whole stack. It also avoids all the unit-testing issues of using a singleton.
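For illustration only, a hedged sketch of what that could look like on the manifest side, assuming a hypothetical WORKLOAD_USER_ID variable that the operator reads at startup instead of a --workload-user-id style flag (neither the variable name nor this wiring is part of the PR):

    containers:
    - name: catalog-operator
      env:
      # Hypothetical: read by the operator at startup, so the value does not
      # have to be threaded through flags all the way down the call stack.
      - name: WORKLOAD_USER_ID
        value: "1001"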

Contributor

@camilamacedo86 camilamacedo86 left a comment

Hi @anik120, that is great, but we cannot use RunAsUser.
We need to remove ALL the RunAsUser fields that were added.

Option A)

Define the user ID via the image's Dockerfile. Then we do NOT need to set RunAsUser in the SecurityContext spec at all, and we do not face the issue "container has runAsNonRoot and image will run as root".

Example (in the Docker image):

    USER 1001

Then, SecurityContext:

    spec:
      securityContext:
        runAsNonRoot: true
        # Please ensure that you can use SeccompProfile and do not use
        # if your project must work on old Kubernetes
        # versions < 1.19 or on vendors versions which
        # do NOT support this field by default (i.e. Openshift < 4.11 )
        seccompProfile:
          type: RuntimeDefault
      ...
      containers:
      - name: controller-manager
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL

Result: the containers will qualify for restricted-v2 in OCP and will run as restricted on k8s.

Option B:

  • We will need to ensure that the ServiceAccount used has the permissions for the required OCP SCC
  • RunAsNonRoot should not be set (must be empty)

labels:
  pod-security.kubernetes.io/enforce: baseline
  pod-security.kubernetes.io/enforce-version: latest
  pod-security.kubernetes.io/warn: baseline

Contributor

Suggested change
pod-security.kubernetes.io/warn: baseline
pod-security.kubernetes.io/warn: restricted

If the enforce level is baseline, then we can either warn when any container in this namespace is not restricted, or not set up the warning at all. But the warning only makes sense if it is less permissive than the enforcement.
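For example, a label combination consistent with that reasoning would be (values taken from the suggestion above):

    labels:
      pod-security.kubernetes.io/enforce: baseline
      pod-security.kubernetes.io/enforce-version: latest
      # warn is stricter than enforce, so admins get a heads-up about pods
      # that would be rejected if the namespace later moved to restricted
      pod-security.kubernetes.io/warn: restricted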

Collaborator

Oh, we should enforce restricted on the olm namespace! Can remove the warn (and audit) label(s)


Contributor Author

For the operators namespace, I've updated it to be the following:

    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: privileged
    pod-security.kubernetes.io/warn-version: latest

From what I've read, I think this means "allow baseline pods, but also allow privileged pods with a warning". The reason for keeping this configuration is that baseline should allow most of the CSVs out there to be installed in that namespace. If there are any operators that require privileged access, we need to allow those, but throw a warning. PLMK if the above configuration doesn't achieve what I think it does.

Comment on lines 24 to 26
{{- if eq .Values.installType "upstream" }}
runAsUser: {{ .Values.package.securityContext.runAsUser }}
{{- end }}

Collaborator

Nah we should keep security as nerfed as possible. I don't think the controllers need anything above restricted.

@@ -76,6 +87,8 @@ spec:
- --client-ca
- /profile-collector-cert/tls.crt
{{- end }}
- --workload-user-id

Collaborator

The idea here for the downstream is to omit the flag and let SCC inject the user id in the range it wants.

Comment on lines 24 to 26
{{- if eq .Values.installType "upstream" }}
runAsUser: {{ .Values.package.securityContext.runAsUser }}
{{- end }}

Collaborator

That's a fair point actually. Since we are baking the USER into the olm container we don't need runAsUser for the controller deployments. Only for the catalog source and unpack jobs, since we have less control there.

@perdasilva
Collaborator

Regarding validation, it's better to fail fast rather than let PSA catch it; that might cause confusion. Idk that I'd change the runAsNonRoot behavior. More stuff to test.

@anik120 anik120 force-pushed the introduce-psa-restricted branch 3 times, most recently from 7fed4ac to b21ed5b Compare August 1, 2022 20:11
@anik120
Contributor Author

anik120 commented Aug 1, 2022

@perdasilva @camilamacedo86 updated the PR with the following changes:

  1. The olm and catalog operator deployments do not explicitly set runAsUser; instead the images are built with user 1001.
  2. The registry pods and the bundle unpacking job still absolutely need their runAsUser set explicitly (whether manually or by another controller like the label syncer that'll go out and inject that value in OCP), since they're not built from an image we control (check the code for how they're created for more details).
  3. The flag for the catalog-operator container arg is now a boolean (which, if set to true, sets the runAsUser mentioned in 2 to 1001). A sketch of the resulting securityContext is shown below.
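A minimal sketch, assuming the 1001 default discussed above, of the pod-level securityContext the registry and unpack pods end up with when that boolean is set (container name and spec layout are illustrative, not the exact generated spec):

    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: registry-server
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL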

I'm more worried about upstream users here. Changing a hard-coded value is trivial. Deprecating behavior is hard. Removing the degree of freedom here gives us control to do what we want without affecting users or having to take support promises into account

That makes sense to me. It'll behoove us to take this conversation to the wg meeting and let the community know that we are exposing this flag, and based on the feedback we receive, if we see that there's appetite for needing to set a custom user id for those workloads, we can introduce a new flag and get rid of the boolean one.

@perdasilva
Collaborator

/approve

@openshift-ci

openshift-ci bot commented Aug 1, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anik120, perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 1, 2022

---
apiVersion: v1
kind: Namespace
metadata:
  name: {{ .Values.operator_namespace }}
  labels:
    pod-security.kubernetes.io/enforce: baseline

This comment was marked as resolved.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2022
@@ -2,9 +2,17 @@ apiVersion: v1
kind: Namespace
metadata:
  name: {{ .Values.namespace }}
  labels:
    pod-security.kubernetes.io/enforce: restricted

This comment was marked as resolved.

Contributor

@camilamacedo86 camilamacedo86 left a comment

It looks like it addresses all the needs, IMHO. It has my
/lgtm

as well.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 2, 2022
With the introduction of [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/#pod-security-admission-labels-for-namespaces), the recommended
best practice is to enforce the Restricted policy of admission (see [1] for more details).
This PR
*) Labels the olm namespace as `enforce:restricted`
*) Labels the operators namespace as `enforce:baseline` (to allow existing CSV deployments
without a securityContext set to deploy in the namespace, which won't be possible with
`enforce:restricted`)
*) Updates the securityContext of OLM workload pods (olm-operator, catalog-operator,
and CatalogSource registry pods) to adhere to the `Restricted` policy.
*) Updates the bundle unpacking job to create a pod that adheres to the `Restricted` policy,
so that bundles can be unpacked in the `Restricted` namespace.

Signed-off-by: Anik Bhattacharjee <anikbhattacharya93@gmail.com>
The test was modifying the `olm.operatornamespace` to an incorrect value,
and checking to make sure that the CSV was garbage collected as a result.
However, the olm-controller was copying a fresh copy back into the namespace,
so the test only passed whenever it happened to get a "yes" to the question
"is the CSV gone?" in the brief window before the copy reappeared.
This commit fixes that by making sure that if we find a CSV that we expected to be
garbage collected, the test passes if it determines that the CSV is a fresh copy, and
not the one modified before.

Signed-off-by: Anik Bhattacharjee <anikbhattacharya93@gmail.com>
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 2, 2022
@camilamacedo86
Contributor

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2022

Contributor

@camilamacedo86 camilamacedo86 left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 2, 2022

Contributor

@camilamacedo86 camilamacedo86 left a comment

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2022

---
apiVersion: v1
kind: Namespace
metadata:
  name: {{ .Values.operator_namespace }}
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest

Contributor

@anik120 it is just a detail: if we add `latest` here, that means it will get the latest criteria available from the policy. Therefore, in the future, it might no longer work with the changes in this code/release.
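For reference, a pinned alternative would look like this (the specific version string is only illustrative; the PSA version labels accept either `latest` or a minor version such as `v1.24`):

    labels:
      pod-security.kubernetes.io/enforce: baseline
      # Pin to a specific policy version so future changes to the profile
      # definitions don't silently change admission behaviour.
      pod-security.kubernetes.io/enforce-version: v1.24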

Contributor

We created an issue to address this scenario: #2827. It is not blocking the release and will get done as a follow-up, so all fine.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2022
@openshift-merge-robot openshift-merge-robot merged commit 67177c0 into operator-framework:master Aug 2, 2022