Run job on each node once to help with setup #64623

Open · mitchellmaler opened this issue Jun 1, 2018 · 46 comments

@mitchellmaler commented Jun 1, 2018

Hello,

I am looking to see if it is possible to have a Job run once on each node in the cluster. Our cluster is dynamically provisioned and scaled, and I was hoping to use the Kubernetes Job and CronJob features to run setup tasks on a node once it is provisioned, or to have a cron-style job make sure something is cleaned up on each node.

@mitchellmaler (Author) commented Jun 1, 2018

/sig apps

@CaoShuFeng (Contributor) commented Jun 4, 2018

@mitchellmaler (Author) commented Jun 4, 2018

@CaoShuFeng I guess a DaemonSet would work, but I was looking more towards the Job/CronJob API to have something scheduled from the Kubernetes side. I guess I could create a container with a cron job inside it, which is run from a DaemonSet.
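
Something along these lines might work as a sketch (untested; the name, image, schedule, and the echoed task are placeholders):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-cron                # placeholder name
spec:
  selector:
    matchLabels:
      app: node-cron
  template:
    metadata:
      labels:
        app: node-cron
    spec:
      containers:
      - name: cron
        image: busybox:1.32
        command:
        - /bin/sh
        - -c
        # Write a crontab for root and keep crond in the foreground so the pod
        # stays Running (DaemonSets require restartPolicy: Always).
        - |
          mkdir -p /etc/crontabs
          echo '0 * * * * echo "hourly per-node task"' > /etc/crontabs/root
          exec crond -f -l 8 -c /etc/crontabs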

@CaoShuFeng (Contributor) commented Jun 5, 2018

I guess I could create a container with a cron job inside it, which is run from a DaemonSet.

👍

@kow3ns added this to Backlog in Workloads Jun 5, 2018
@fejta-bot commented Sep 3, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot commented Oct 3, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@mitchellmaler (Author) commented Oct 3, 2018

/remove-lifecycle rotten

@mitchellmaler (Author) commented Oct 3, 2018

Currently we got around this by using a third-party tool to run automation on the nodes. It would be useful if jobs (daemon jobs) could be run once on each node, optionally restricted by a node selector.

@fejta-bot commented Jan 1, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot commented Jan 31, 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@luisdavim commented Feb 21, 2019

/remove-lifecycle rotten
/remove-lifecycle stale

@luisdavim commented Feb 21, 2019

I've seen this requested many times, and I think DaemonSets should allow a RestartPolicy of OnFailure so that job-like tasks can be executed once on all nodes.
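
To make that concrete, here is roughly what such a manifest would look like; this is purely hypothetical, since the current API validation only accepts restartPolicy: Always for DaemonSets:

# Hypothetical: today the API server rejects any DaemonSet restartPolicy other than Always.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-setup-once          # placeholder name
spec:
  selector:
    matchLabels:
      app: node-setup-once
  template:
    metadata:
      labels:
        app: node-setup-once
    spec:
      restartPolicy: OnFailure   # the proposed behaviour: retry on failure, stop after success
      containers:
      - name: setup
        image: busybox:1.32
        command: ["/bin/sh", "-c", "echo run the per-node setup task here"]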

@draveness (Member) commented Mar 25, 2019

Hi @luisdavim, do we still need to support this feature? Maybe I could submit a PR to add an OnFailure policy to DaemonSet.

@luisdavim commented Mar 25, 2019

Sure, for now I have a workaround using metacontroller and a CRD, but having this supported natively would be great.

@draveness (Member) commented Mar 26, 2019

@luisdavim OK. Is it necessary to open a proposal for this change? And if it is, where should I raise it? :)

@draveness (Member) commented Mar 27, 2019

I found multiple issues that discuss support for running a one-off task with a DaemonSet:

#36601
#50689
#69001

Supporting an OnFailure restart policy in DaemonSet seems quite reasonable. It would support run-once jobs to some extent, though it cannot ensure the task runs precisely once on each node.

@kubernetes/sig-apps-feature-requests

@fejta-bot commented Jul 21, 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@efussi commented Jul 22, 2020

/remove-lifecycle stale

@pigletfly (Member) commented Aug 26, 2020

We can use BroadcastJob in https://github.com/openkruise/kruise now. It's a job that runs Pods to completion across all the nodes in the cluster.
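
For reference, a minimal sketch (field names as I understand them from the OpenKruise docs; double-check against the version you install):

apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  name: node-setup               # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never       # each pod runs to completion, one per node
      containers:
      - name: setup
        image: busybox:1.32
        command: ["/bin/sh", "-c", "echo run the per-node setup task here"]
  completionPolicy:
    type: Always                 # the job completes once the pod has finished on every node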

@IAXES commented Aug 26, 2020

@pigletfly Is there a long-term plan to have some/all of these features in OpenKruise be merged upstream into mainline K8S?

Also, thank you: the BroadcastJob seems to fit the use case I was raising.

@Dysproz commented Sep 20, 2020

Based on a solution I came up with for the contrail-operator project, I've created a separate project with a custom resource, DaemonJob, that should cover this problem: DaemonJob

@draveness removed their assignment Sep 21, 2020
@fejta-bot commented Dec 20, 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@unixfox commented Dec 30, 2020

/remove-lifecycle stale

@fejta-bot commented Mar 30, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@unixfox commented Mar 30, 2021

/remove-lifecycle stale

@james-callahan commented May 6, 2021

One major drawback of working around this with a DaemonSet is that you need to use a pause container to keep the pod "alive" so it does not get continuously restarted. This burns a pod IP, which is quite precious with e.g. the AWS CNI, where certain instance types can only have 4 pods per node.

@k8s-triage-robot commented Aug 4, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@zwpaper commented Aug 4, 2021

I will try to look into this.
/remove-lifecycle stale

@zwpaper commented Sep 10, 2021

After diving into the source code, the main blocker for a DaemonSet running OnFailure pods is the API validation here:

if spec.Template.Spec.RestartPolicy != api.RestartPolicyAlways {

With that check relaxed, both the DaemonSet and pod controllers could work as expected in some simple scenarios, but it is not easy to cover all the cases caused by the change.

@soltysh explains here:

the fact that ALL workload controllers (DaemonSet, Deployment, StatefulSet, etc.) enforce the policy to Always comes from the desire for those applications to always run

A third-party controller is recommended for this feature.

/close

@k8s-ci-robot (Contributor) commented Sep 10, 2021

@zwpaper: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

After diving into the source code, the main blocker for a DaemonSet running OnFailure pods is the API validation here:

if spec.Template.Spec.RestartPolicy != api.RestartPolicyAlways {

With that check relaxed, both the DaemonSet and pod controllers could work as expected in some simple scenarios, but it is not easy to cover all the cases caused by the change.

@soltysh explains here:

the fact that ALL workload controllers (DaemonSet, Deployment, StatefulSet, etc.) enforce the policy to Always comes from the desire for those applications to always run

A third-party controller is recommended for this feature.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pigletfly (Member) commented Sep 27, 2021

@pigletfly Is there a long-term plan to have some/all of these features in OpenKruise be merged upstream into mainline K8S?

Also, thank you: the BroadcastJob seems to fit the use case I was raising.

I don't think so. OpenKruise is now a CNCF sandbox project.

@cobb-tx commented Oct 28, 2021

I tried using Job parallelism and podAntiAffinity to achieve this. One drawback is that you need to manually maintain the parallelism count and the node labels.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: flush-log-cron
  namespace: logging
  labels:
    app: flush-log-cron
spec:
  schedule: "0 12 * * *"
  jobTemplate:
    spec:
      completions: 5
      parallelism: 5
      template:
        metadata:
          name: flush-log-cron
          labels:
            app: flush-log-cron
            component: cron-job
            hasDNS: "false"
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - flush-log-cron
                  - key: component
                    operator: In
                    values:
                    - cron-job
                topologyKey: kubernetes.io/hostname
          containers:
            - name: flush-log-cron
              image: busybox:1.32
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: harbor
          nodeSelector:
            node-role-application: "true"

@anuj-kosambi commented Jan 14, 2022

I achieved this via a hacky workaround: I run the main task in the DaemonSet's initContainers and just sleep in the regular container.

      containers:
      - args:
        - -c
        - while true;do sleep 3600;done # just to ignore completion status
        command:
        - sh
        image: busybox
        imagePullPolicy: IfNotPresent
        name: sleep
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - -c
        - ulimit -n 65536 # main task
        command:
        - /bin/sh
        image: busybox
        imagePullPolicy: Always
        name: sys-limits
        resources: {}
        securityContext:
          allowPrivilegeEscalation: true
          capabilities: {}
          privileged: true
          readOnlyRootFilesystem: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
