
Add option to skip building hook systemd unit and use raw manifest #4764

Closed
geekofalltrades opened this issue Mar 23, 2018 · 7 comments

@geekofalltrades
Contributor

Background

I've just spent two frustrating workdays trying to add a swapfile to nodes in certain of our instance groups. We're using CoreOS Container Linux as our OS, and we're running in AWS. CoreOS's documentation for adding swap to Container Linux is here.

I started by trying to add swap with cloud-init, as described in #3251. However, the additionalUserData field is completely unusable with Container Linux. When additionalUserData is used, kops formats the AWS user data as a multipart MIME message. On Container Linux, ignition and/or coreos-cloudinit are incapable of reading this format, and the instance bricks itself on boot. It never passes AWS initialization and never becomes reachable via SSH. After a few minutes, the ASG restarts it, and it sits in a restart loop forever.
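
For reference, the shape of what I tried is sketched below; the file name, script contents, and sizes are placeholders rather than my exact setup, but the mere presence of additionalUserData is enough to make kops emit the multipart MIME user data.

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
spec:
  additionalUserData:
  # placeholder script: create and enable a 2G swapfile at boot
  - name: add-swap.sh
    type: text/x-shellscript
    content: |
      #!/bin/sh
      fallocate -l 2G /swap1
      chmod 600 /swap1
      mkswap /swap1
      swapon /swap1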

Next I turned to hooks. I was initially pleased to find that the kops hook system allows me to construct systemd unit files, since that's what ignition or coreos-cloudinit would have done for me anyway. However, the current implementation is extremely limiting.

First, CoreOS wants me to add a systemd swap unit, which is a unit file suffixed with ".swap". The kops hook system takes this name and appends ".service" to it, so I end up with a unit file of type ".swap.service", which is wrong.

Second, CoreOS wants me to use systemd parameters in the [Unit] section that the hook system doesn't support, like After, RequiresMountsFor, and ConditionPathExists. If I try to write a full-featured unit file and put it in the "manifest" section, the hook system mangles it. This configuration:

  - name: swap1.swap
    requires:
    - create-swapfile.service
    manifest: |
      [Unit]
      Description=Turn on swap
      Requires=create-swapfile.service
      After=create-swapfile.service

      [Swap]
      What=/swap1

      [Install]
      WantedBy=multi-user.target

results in this (broken) unit file:

[Unit]
Description=Kops Hook swap1.swap
Requires=create-swapfile.service

[Service]
[Unit]
Description=Turn on swap
Requires=create-swapfile.service
After=create-swapfile.service

[Swap]
What=/swap1

[Install]
WantedBy=multi-user.target

I experimented for a bit with writing additional hooks that would unmangle the unit files written by the previous hooks, but my solutions got really ugly, really fast.

Proposed Solution

I'd like to extend the hook system with the ability to take the manifest field and use it, completely unmodified, as the unit file. This functionality would be gated behind a hook option to avoid breaking backwards compatibility. Similarly, I'd like to add an option that prevents kops from mangling the unit's name by appending ".service" to it, again gated behind a config option to preserve backwards compatibility.

My proposal for the new options (with documentation) is:

  • useRawManifest: If set to true, the value of the manifest field will be used, unmodified, as the contents of the systemd unit created by the hook. The requires and before options are ignored, and the hook will not generate the [Unit] section of the unit file for you. This gives operators access to [Unit] section parameters that are otherwise unsupported. Default: false.
  • useRawName: For legacy reasons, if the name field does not end with ".service" or ".timer", ".service" is appended to the name of the systemd unit created by the hook. If useRawName is set to true, the name field will be left unmodified. This is useful for hooks that create units of systemd types other than ".service" or ".timer", like ".swap" or ".target". Default: false. (See the sketch after this list.)
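
To make the proposal concrete, here is a sketch of the swap hook from above rewritten with both options. Nothing below exists in kops today; the field names are exactly as proposed.

hooks:
- name: swap1.swap
  # keep the ".swap" suffix instead of appending ".service"
  useRawName: true
  # install the manifest verbatim; no generated [Unit] section
  useRawManifest: true
  manifest: |
    [Unit]
    Description=Turn on swap
    Requires=create-swapfile.service
    After=create-swapfile.service

    [Swap]
    What=/swap1

    [Install]
    WantedBy=multi-user.target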

I'm willing to work on this once the design is agreed upon.

@justinsb
Member

What about a daemonset or a hook that runs a container? You can do basically anything; a good example is the nvidia driver: https://github.com/kubernetes/kops/tree/master/hooks/nvidia-bootstrap

We'd need a hook if this has to run as part of boot, but I don't think there's any reason it has to run first (I'm assuming the instances aren't that small!). A daemonset is also much easier to debug, because you can develop it without having to reboot the cluster.

The static-files approach is indeed pretty limiting, but I'm not aware of anything you can't do in a daemonset or container-hook (a container hook run as a privileged container).
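
For example, a container hook is declared with execContainer; a minimal sketch (the image name is a placeholder, and kops runs it as a privileged container on the host at boot, like the nvidia-bootstrap example above) would be:

hooks:
- name: swap-bootstrap.service
  execContainer:
    # placeholder image that creates and enables the swapfile on the host
    image: example.com/swap-bootstrap:latest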

FYI, kubernetes/kubernetes#53533 talks about why enabling swap might be problematic.

@geekofalltrades
Contributor Author

Reading that issue, it sounds like the problem rears its head when you set memory limits on pods: the presence of swap confounds the kubelet's (or Docker's) ability to decide when a pod is approaching its memory limit. We currently don't use any resource limiting in Kubernetes, so I don't think we'll hit that issue. In our environments, we see the operating system's OOM killer killing our containers when the node starts running out of memory.

I'll look into doing this with a DaemonSet or hook container, but I'd read in a few places that it's difficult to manipulate the host's systemd from within a container.

@geekofalltrades
Contributor Author

I don't know that this is a great fit for a DaemonSet, because it's one-time systemd setup. Run-to-completion (Job/CronJob-style) DaemonSets are tracked in kubernetes/kubernetes#36601. A bunch of workarounds are provided there, most of which aren't very palatable. One of the most common is to design a DaemonSet container that does the job and then sits in a busy-wait loop forever. This would mean leaving a container with access to the underlying host's systemd lying around forever, which is a security risk.

Other solutions get hackier still, like having the DaemonSet wait for all of its pods to report that they're finished and then delete itself, or having the DaemonSet run against a node label selector and update the label on its own node when it finishes so that it excludes its own node. Most of them involve giving the DaemonSet Kubernetes privileges that let it manage itself.
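
For concreteness, the busy-wait workaround ends up being a sketch like the following. The image, the chroot trick, and the setup script are all placeholders; the point is only to show the resident privileged pod it leaves behind.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: swap-setup
spec:
  selector:
    matchLabels:
      app: swap-setup
  template:
    metadata:
      labels:
        app: swap-setup
    spec:
      containers:
      - name: setup
        image: alpine:3.7
        securityContext:
          privileged: true
        volumeMounts:
        - name: host-root
          mountPath: /host
        command:
        - /bin/sh
        - -c
        - |
          # chroot into the host filesystem and use the host's own tools
          chroot /host /bin/sh -c 'fallocate -l 2G /swap1 && chmod 600 /swap1 && mkswap /swap1 && swapon /swap1'
          # one-time work is done; sleep forever so the DaemonSet stays "Running"
          while true; do sleep 3600; done
      volumes:
      - name: host-root
        hostPath:
          path: /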

Running a hook container becomes tricky for us because we use ECR as our container registry. Doing an aws ecr get-login on Container Linux is less trivial than you would hope: there's no package manager with which to install the AWS CLI. There are Docker images for the AWS CLI (and for small utilities that do nothing but ECR login), but none of them are official; they're all maintained by Some Guy, which doesn't pass security review at our company. If we build our own container to do ECR login, we have the chicken-and-egg problem of not being able to pull it from ECR. A DaemonSet would solve this problem, since Kubernetes handles ECR login automatically, but it's a poor fit for the other reasons described above.

At this point we start looking at doing cluster-wide configuration management or rolling our own AMI just to get the utilities in place that would allow us to run a container hook, which is what I'm trying to avoid having to do by using hooks in the first place. 😄 We will probably end up there eventually anyway, but that's a deep rabbit hole that I don't want to go down until I absolutely have to.

If the kops hook system would just install my systemd units untouched, that would be an easy workaround for all of the above. Being able to install arbitrary systemd units also seems like a really powerful tool for Kubernetes operators, and a positive contribution to the hook system overall.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 21, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 21, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
