
Add pod security policy so botkube works in restricted clusters #195

Merged
merged 3 commits into infracloudio:develop from baronomasia:develop on Nov 12, 2019

Conversation

baronomasia (Contributor) commented Oct 10, 2019

ISSUE TYPE
  • Bug fix Pull Request
SUMMARY
  • When attempting to install either the develop branch or v0.8.0 of botkube via helm, the botkube deployment pod crashes repeatedly due to a security permissions issue. From what I can tell, the botkube docker image references a system user by name rather than by UID, which can violate default pod security policies in restricted environments (at least in v0.8.0; see the sketch after this list). In addition, the pod appears to use unsafe sysctls, and allowing these is required for the pod to run without crashing.

  • After making the above changes and restarting the botkube pod in CrashLoop backoff, the pod was able to run continuously and began working correctly. We were able to issue kubectl commands to botkube as well as receive event notifications in the Slack channel.
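
For illustration only (not something added in this PR): the runAsNonRoot check can only pass when the kubelet resolves the container user to a numeric UID, which is why an image USER set by name trips it; a numeric runAsUser in the pod spec is one way to satisfy it.

apiVersion: v1
kind: Pod
metadata:
  name: botkube-example        # hypothetical name, for illustration only
spec:
  containers:
    - name: botkube
      image: infracloudio/botkube:latest
      securityContext:
        runAsNonRoot: true     # fails with "non-numeric user" if the UID cannot be resolved
        runAsUser: 101         # any non-zero numeric UID lets the kubelet verify non-root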

EVIDENCE

Sample output below, and I've attached the logs of the pod as a file.

v0.8.0

NAME                       READY   STATUS                       RESTARTS   AGE
botkube-7c58899999-zn7r5   0/1     CreateContainerConfigError   0          4m36s
Events:
  Type     Reason     Age                  From                                   Message
  ----     ------     ----                 ----                                   -------
  Normal   Scheduled  2m54s                default-scheduler                      Successfully assigned botkube/botkube-7c58899999-zn7r5 to <redacted>
  Normal   Pulled     90s (x8 over 2m53s)  <redacted>  Successfully pulled image "infracloud/botkube:latest"
  Warning  Failed     90s (x8 over 2m53s)  <redacted>  Error: container has runAsNonRoot and image has non-numeric user (botkube), cannot verify user is non-root
  Normal   Pulling    74s (x9 over 2m53s)  <redacted>  pulling image "infracloud/botkube:latest"

develop (commit 8835dea)

NAME                       READY   STATUS             RESTARTS   AGE
botkube-544f67c9d8-whj8d   0/1     CrashLoopBackOff   4          3m1s
Events:
  Type     Reason     Age                From                                   Message
  ----     ------     ----               ----                                   -------
  Normal   Scheduled  24s                default-scheduler                      Successfully assigned botkube/botkube-544f67c9d8-sf8x8 to <redacted>
  Normal   Pulling    15s (x2 over 23s)  <redacted>  pulling image "infracloudio/botkube:latest"
  Normal   Pulled     15s (x2 over 23s)  <redacted>  Successfully pulled image "infracloudio/botkube:latest"
  Normal   Created    15s (x2 over 23s)  <redacted>  Created container
  Normal   Started    15s (x2 over 22s)  <redacted>  Started container
  Warning  BackOff    9s                 <redacted>  Back-off restarting failed container

pod-logs.txt

Thank you and please let me know if you have any questions about these changes.

codenio (Contributor) left a comment

can this be replicated under deploy-all-in-one-tls.yaml and included in the helm chart as well?

PrasadG193 (Member) commented Oct 11, 2019

From what I can tell, the botkube docker image references a system user by name rather than UID, which may violate default pod security policies in restricted environments (at least in v0.8.0)

@baronomasia Should we fix this in the Dockerfile instead of creating a new resource PodSecurityPolicy?

baronomasia force-pushed the baronomasia:develop branch from f2bb01b to fa59ae3 on Oct 11, 2019
baronomasia (Contributor Author) commented Oct 11, 2019

@codenio I replicated this by using the helm install command referencing this repo that I cloned locally. I've added the TLS chart to the mix as well (I didn't even notice this file before). I attempted to test with it, but in doing so, I realized I had no idea what the ENCODED_CERTIFICATE parameter was for other than that it was some sort of CA certificate. Using git blame, I was able to figure out it's used for the Mattermost integration, so I've updated the comment in the yaml file to reflect this. I've also modified deploy-all-in-one-tls.yaml to match deploy-all-in-one.yaml, as requested.

baronomasia (Contributor Author) commented Oct 11, 2019

@PrasadG193 I think this is fixed in the latest docker image / head of the develop branch. I can only reproduce it in v0.8.0, so it may already be fixed; it's just not in an official release yet.

baronomasia (Contributor Author) commented Oct 11, 2019

One last comment before I forget: I noticed a recent commit updated the apiVersion for a number of different resources defined in the two yaml files to ensure compatibility with Kubernetes v1.16. I used extensions/v1beta1 for the new PodSecurityPolicy I added, as this is the API version we tend to use in our environments, but I'm not as knowledgeable in this area, so I wanted to check with y'all whether this is correct.
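
For reference, PodSecurityPolicy stopped being served from extensions/v1beta1 in Kubernetes v1.16; policy/v1beta1 is the group/version that stays compatible with the v1.16 updates mentioned above. A minimal example (name and spec are illustrative only):

apiVersion: policy/v1beta1     # extensions/v1beta1 no longer serves PSP as of v1.16
kind: PodSecurityPolicy
metadata:
  name: example-psp            # hypothetical name
spec:
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes: ["*"]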

baronomasia (Contributor Author) commented Oct 11, 2019

@PrasadG193 I take it back: I just saw y'all released v0.9.0 today, nice work! I reinstalled the helm chart using v0.9.0 and I don't see the CreateContainerConfigError I saw with v0.8.0 (as expected). However, the botkube pod went into CrashLoopBackOff until I applied the changes made in this PR (also as expected).

baronomasia force-pushed the baronomasia:develop branch from d7ae9e8 to 04f0c78 on Oct 11, 2019
baronomasia (Contributor Author) commented Oct 11, 2019

Just rebased my changes to include the v0.9.0 commits as well.

PrasadG193 (Member) commented Oct 22, 2019

However, the botkube pod went into CrashLoopBackOff until I applied the changes made in this PR (also as expected).

Hi @baronomasia,
Sorry for the delayed reply. Could you please check the logs and confirm what the reason was?

baronomasia (Contributor Author) commented Oct 23, 2019

@PrasadG193 No reason was given in the logs; the only error printed was related to ingress:

E1010 22:17:47.535717       1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20190620085101-78d2af792bab/tools/cache/reflector.go:98: Failed to list *v1beta1.Ingress: the server could not find the requested resource

Full logs are attached in my original message in pod-logs.txt.

PrasadG193 (Member) commented Oct 24, 2019

So, the BotKube pod went into CrashLoopBackOff after these logs? @baronomasia

baronomasia (Contributor Author) commented Oct 24, 2019

@PrasadG193 yes, that is correct. Given no readiness or liveness probe failures in the event logs for the pod, I suspect the process is exiting without printing any specific logs around the exit reason.

baronomasia (Contributor Author) commented Oct 31, 2019

@PrasadG193 is there any other information that I can provide to help with this PR?

deploy-all-in-one-tls.yaml: review thread (outdated, resolved)
PrasadG193 (Member) commented Nov 4, 2019

@baronomasia AFAIU, not everyone would need a PodSecurityPolicy resource to get BotKube to work. We can do the following:

  1. For install using helm: we can add a new field in values.yaml (say podSecurityPolicy.enabled) and create the PodSecurityPolicy only if the value is set to true via the helm install command; by default, we shouldn't create it (see the sketch after this list).
  2. For the all-in-one install: add a note in the install sections (e.g. https://www.botkube.io/installation/slack/#) to create the PodSecurityPolicy manually in case of a restricted cluster.
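
A minimal sketch of what that toggle could look like (the file layout, the {{ .Release.Name }}-psp naming, and the spec below are assumptions based on the PSP used later in this thread, not the chart's actual contents):

# values.yaml (sketch)
podSecurityPolicy:
  enabled: false

# templates/podsecuritypolicy.yaml (sketch): rendered only when the flag is enabled
{{- if .Values.podSecurityPolicy.enabled }}
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: {{ .Release.Name }}-psp
spec:
  allowedUnsafeSysctls: ["*"]
  fsGroup:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes: ["*"]
{{- end }}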

What do you think?

baronomasia (Contributor Author) commented Nov 4, 2019

Making the PSP optional definitely gives users the most flexibility. Happy to get started on those items as well as on fixing the PSP apiVersion.

baronomasia force-pushed the baronomasia:develop branch from 04f0c78 to 37e63b1 on Nov 4, 2019
baronomasia (Contributor Author) commented Nov 4, 2019

@PrasadG193 I decided to take a slightly different approach, since the serviceaccount object also requires changes for this to work. Here's a quick summary of the changes:

  • Updated comments about Mattermost
  • Linted deployment yamls (there were a ton of extraneous spaces so I just let my linter do its magic).
  • Created separate psp all-in-one yamls for both regular and TLS, since getting things to work takes more than just applying a PSP.
  • Changed the PodSecurityPolicy apiVersion to policy/v1beta1
  • Added the podSecurityPolicy.enabled option to values.yaml and used it in serviceaccount.yaml and the new podsecuritypolicy.yaml. It currently defaults to false (a sketch of the serviceaccount.yaml change follows below).
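
A sketch of what the conditional ClusterRole change in serviceaccount.yaml might look like (template structure and names are assumptions; the extra rule mirrors the manual snippet later in this thread):

# templates/serviceaccount.yaml (excerpt, sketch)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: {{ .Release.Name }}-clusterrole
  labels:
    app: botkube
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["get", "watch", "list"]
{{- if .Values.podSecurityPolicy.enabled }}
  # Added only when podSecurityPolicy.enabled is true
  - apiGroups: ["extensions"]
    resourceNames: ["{{ .Release.Name }}-psp"]
    resources: ["podsecuritypolicies"]
    verbs: ["use"]
{{- end }}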

The only outstanding item is adding a note to the installation page. I was looking for a place to submit a PR for the documentation but haven't found one. Could you point me in the right direction?

I hope that all this makes sense and let me know if you have any questions.

PrasadG193 (Member) commented Nov 7, 2019

Created separate psp all-in-one yamls for both regular and TLS, since getting things to work takes more than just applying a PSP.

@baronomasia We shouldn't create additional all-in-one yamls; we already have two, and having more may confuse users. We can just add a note in the installation doc about the additional resource that needs to be created.

For documentation, we use Hugo. If you are not familiar with it, don't worry; I'll take care of that. We update the hosted documentation only when we publish a new release so that it stays in sync with the latest release. You can find the docs at https://github.com/infracloudio/botkube-docs

PrasadG193 (Member) commented Nov 7, 2019

@baronomasia Please rebase your branch with develop

baronomasia force-pushed the baronomasia:develop branch from 37e63b1 to 8a74287 on Nov 8, 2019
baronomasia (Contributor Author) commented Nov 8, 2019

@PrasadG193 Understood; I've removed the two new all-in-one psp yamls. I've not used Hugo before, but I'm happy to work with you on getting the documentation written. The only tricky part is going to be modifying the existing ClusterRole resource to work with the PSP, but if we provide the proper code snippets I think we'll be good. Thanks for the review and let me know if you have any other questions.

baronomasia (Contributor Author) commented Nov 8, 2019

@PrasadG193 I've also rebased my branch with develop.

PrasadG193 (Member) commented Nov 12, 2019

LGTM. @baronomasia Can you please add the code snippet for the resource specs that one needs to create manually here (in the comments) so that we won't miss anything while documenting? If you are interested in updating the documentation, please let me know; we can sync up on Slack.
Thanks for the PR @baronomasia

PrasadG193 merged commit 32a4603 into infracloudio:develop on Nov 12, 2019
1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
baronomasia (Contributor Author) commented Nov 12, 2019

Sure thing @PrasadG193, here's a copy-pastable code snippet to run after helm install that will add the psp and modify the existing cluster-role (assuming the helm chart is installed as botkube):

echo 'apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: botkube-psp
spec:
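  # Deliberately broad policy: allows unsafe sysctls and any user, group, and volume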
  allowedUnsafeSysctls: ["*"]
  fsGroup:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: botkube-clusterrole
  labels:
    app: botkube
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["extensions"]
    resourceNames: ["botkube-psp"]
    resources: ["podsecuritypolicies"]
    verbs: ["use"]' | kubectl apply -f -
baronomasia (Contributor Author) commented Nov 12, 2019

@PrasadG193 Reviewing the changes again, I realize there is an errant } in the clusterrole template file. I have a PR open to fix this: #213

PrasadG193 (Member) commented Nov 12, 2019

@baronomasia Just to confirm, the code snippet also works when BotKube is installed via the all-in-one yamls, right?

baronomasia (Contributor Author) commented Nov 12, 2019

@PrasadG193 The above snippet was generated from the deploy-all-in-one yaml, so if anything it should work better with the all-in-one yaml than with the helm install (since the helm install can use a configurable install name that may not match the clusterrole name in the snippet above).
