[feature request] Swarm mode should support batch/cron jobs in addition to persistent services #23880

Open
nathanleclaire opened this Issue Jun 23, 2016 · 78 comments

Comments

@nathanleclaire
Contributor

nathanleclaire commented Jun 23, 2016

Description

Today, Swarm mode allows users to specify, via the docker service CLI, a group of homogeneous containers which are meant to be kept running. This abstraction, while powerful, may not be the right fit for containers which are intended to eventually terminate or to run only periodically.

Consider, for instance:

  • An admin who wishes to allow users to submit long-running compiler jobs on a Swarm cluster
  • A website which needs to process all user uploaded images into thumbnails of various sizes
  • An operator who wishes to periodically run docker rmi $(docker images --filter dangling=true -q) on each machine

Problem

Though some use cases could potentially be implemented by a service which pulls jobs off a distributed queue, there are some issues with this approach:

  1. It adds an operational burden of creating, administrating, and ensuring the health of such a queue. For many, this will kick up the barrier to entry of performing such tasks.
  2. It does not necessarily ensure that failed jobs can be re-run the correct number of times or with the correct parameters. Like above, this burden has now been offloaded to the user instead of being natively orchestrated.
  3. It does not allow for easy customization of job parallelism settings.

Anything which is intended to run periodically (such as the image garbage collection example above) could potentially cause a thundering herd problem if the scheduling is not handled by Swarm. Imagine, for instance, that a user creates a docker service to periodically run a command inside the containers using normal old cron. If these all wake up and attempt to execute at the same time while production traffic is surging, it may be an issue for more critical production web services. While proper capacity planning and flags such as --reserve-cpu may help mitigate the problem, separating these concerns early on, especially when a temporal element is involved, seems prudent. (Thanks to @jpetazzo, who originally pointed out these concerns to me and will probably have good insight here as well.)

This issue requests and outlines a proposed CLI, to get the ball rolling on discussion, track the issue, and gather information about potential use cases.

Proposal

A new top-level command, docker jobs, could be introduced. It would allow users to specify that a container should be run X times, or every Y interval of time. It could also be used to check up on these jobs.

Examples:

Run batch job once:

$ docker jobs create \
    -e S3_INPUT_BUCKET \
    -e S3_OUTPUT_BUCKET \
    -e IMAGE_NAME \
    nathanleclaire/convert
3x3bq2ibh1qe

$ docker jobs wait 3x3bq2ibh1qe; docker jobs ls
ID              NEXT PREV           FINISHED   FAILURES IMAGE
3x3bq2ibh1qe    -    3 seconds ago  1/1        0/1      nathanleclaire/convert 

Run batch job 16 times:

$ docker jobs create \
    --runs 16 \
    --parallel 3 \
    nathanleclaire/failer
bku4f1s1ncm0

$ docker jobs ls
ID            NEXT  PREV           FINISHED   FAILURES IMAGE
3x3bq2ibh1qe  -     2 minutes ago  1/1        0/1      nathanleclaire/convert 
bku4f1s1ncm0  Now   Now            4/16       6/10     nathanleclaire/failer

Run a task every hour:

$ docker jobs create \
    --every 1hr \
    nathanleclaire/hourlytask
ddh7pqvgbd8l

$ # One hour later...

$ docker jobs ls
ID            NEXT PREV         FINISHED  FAILURES IMAGE
3x3bq2ibh1qe  -    1 hour ago   1/1       0/1      nathanleclaire/convert 
bku4f1s1ncm0  -    1 hour ago   16/16     6/20     nathanleclaire/failer 
ddh7pqvgbd8l  1hr  1 minute ago 1/1       0/1      nathanleclaire/hourlytask

Interactively re-run a job:

$ docker jobs restart 3x3bq2ibh1qe
Running job 3x3bq2ibh1qe again from the beginning

Alternatively, the service model could be expanded to accommodate this, but it seems jobs would be easier to manage (for users) as a separate concept.

Please let me know what you think (when you get a chance -- please focus on 1.12 first and foremost ;) ) @aluzzardi @vieux @stevvooe @abronan and others.

Cute animal

Since I'm requesting a feature, the least I can do is provide a cute animal picture.

image

@AkihiroSuda

Member

AkihiroSuda commented Jun 23, 2016

My "indexed job" suggestion seems related to and can be integrated to this suggestion.

#23843

$ echo "apple banana cherry" > fruits.txt
$ echo "green yellow red" > colors.txt
$ docker jobs create --parallel 3 \
     --scatter=FRUIT="$(cat fruits.txt)" \
     --scatter=COLOR="$(cat colors.txt)" \
     busybox \
     'echo "Have a nice $COLOR $FRUIT" && sleep 5'
(container0) Have a nice green apple
(container1) Have a nice yellow banana
(container2) Have a nice red cherry
@gerred

gerred commented Jun 23, 2016

What if the Tasks API were expanded into its own fully featured toolset, allowing individual tasks to be submitted and scheduled separately from services? Task as it stands could generalize "ServiceAnnotations" as "Annotations". Jobs and services are already abstractions on top of tasks, but this could be an easy way to allow arbitrary task controllers to handle different use cases, while including the basic abstractions of service and job in the engine itself.

This would also necessitate having more knowledge of the dispatcher that these controllers could draw from, so my suggestion might be a little pie in the sky at this point.

@kelseyhightower

kelseyhightower commented Jun 23, 2016

@gerred I think you are on to something, as there are other systems that work the way you describe. You can have different controllers process jobs and monitor them for successful completion. All controllers can reuse a single scheduler to ensure proper workload placement, but it would be the individual controller's responsibility to ensure long-running services are always restarted or rescheduled in the face of node failures. In the case of jobs, a Job Controller can ensure tasks run at least once, until successful completion.

@aluzzardi

Contributor

aluzzardi commented Jun 27, 2016

@nathanleclaire @kelseyhightower @gerred: The use case is already covered by the service model.

Services have a "mode" (I don't like this term, very open to suggestions) and a mode maps directly to an orchestrator (what @kelseyhightower refers to as a controller).

We currently have two service modes: Replicated ("scalable") and Global ("runs on every machine").

e.g. docker service create --mode global redis

Every mode has its own options and attributes, so in the future we could achieve this with something like (adapting @nathanleclaire example):

$ docker service create \
    --mode batch \
    --runs 16 \
    --parallel 3 \
    nathanleclaire/failer
bku4f1s1ncm0

/cc @dongluochen @aaronlehmann

@treksler

treksler commented Jul 6, 2016

First off, batch mode for docker service would be great. There would need to be a way to easily and intuitively list batch jobs, their schedules, when a job runs next, when it ran last, whether it was successful, etc.

I suppose docker service ls -f "mode=batch" would work, but it's not really as intuitive as docker job ls

Similarly, docker service tasks -f "mode=batch" could show the last execution status in LAST STATE and the next scheduled run in DESIRED STATE, or maybe just show "Scheduled" or "Batch" in DESIRED STATE and have docker service inspect display the batch schedule.

The proposed --parallel is the same thing as --replicas, as far as I can see.
i.e. how many copies of this should be spawned?

The proposed --runs is somewhat conceptually similar to --restart-max-attempts
i.e. how many times do you want to (attempt to) run it, after it exits?

It looks like you can already kind of fake batch mode with --restart-delay set to something like 24 hours to create a pseudo daily job, but if I need to make sure my container/job runs exactly 30 seconds from midnight on some swarm node every day, then I am hooped without a batch mode.
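
For illustration, a rough sketch of that restart-delay workaround with flags that exist today; the service name is a placeholder and someimage:1.0 stands in for your image:

$ docker service create \
    --name pseudo-daily-job \
    --restart-condition any \
    --restart-delay 24h \
    someimage:1.0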

Something like this might work
docker service create -f "mode=batch" --batch-run-at="23:59:30" someimage:1.0
or we could support cron syntax
docker service create -f "mode=batch" --batch-run-schedule="59 23 * * * " someimage:1.0

The problem with commands that take a --mode parameter is that only a subset of command line switches will be applicable to each mode. At that point, they may as well be separate commands.
It would be hard to tell which command line switches on docker service create apply to which mode. This is why I added the --batch-run prefix to the switches in the above examples.

@aleksejlopasov

aleksejlopasov commented Jul 29, 2016

I have the same problem with my services that must run periodically. I've created a service which runs a container with a PHP script inside of it. This script must run every day at the same time. But when the script finishes its work, the container stops and that's all. I can't run it again from the master node; the only way is to remove the service and recreate it. So Docker swarm mode doesn't fit situations where you run services that are expected to exit.

@nathanleclaire

Contributor

nathanleclaire commented Jul 29, 2016

I have the same problem with my services that must run periodically. I've created a service which runs a container with a PHP script inside of it. This script must run every day at the same time. But when the script finishes its work, the container stops and that's all. I can't run it again from the master node; the only way is to remove the service and recreate it. So Docker swarm mode doesn't fit situations where you run services that are expected to exit.

FWIW, you might be able to work around this in the interim by running cron inside the container, so that it wakes up and runs the script at the same time every day and the container doesn't exit -- e.g. https://forums.docker.com/t/running-cronjob-in-debian-jessie-container/17527/5
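
As a rough sketch of that workaround, a Dockerfile along these lines keeps cron in the foreground so the swarm task never exits; the base image, script path, schedule, and log path are all placeholders:

FROM debian:jessie

RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*

# Placeholder job script; cron.d entries need a user field (root here).
COPY myjob.sh /usr/local/bin/myjob.sh
RUN chmod +x /usr/local/bin/myjob.sh \
    && echo '30 2 * * * root /usr/local/bin/myjob.sh >> /var/log/myjob.log 2>&1' > /etc/cron.d/myjob \
    && chmod 0644 /etc/cron.d/myjob

# Run cron in the foreground so the container (and the service task) stays up.
CMD ["cron", "-f"]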

@stevvooe

Contributor

stevvooe commented Jul 29, 2016

So Docker swarm mode doesn't fit situations where you run services that are expected to exit.

At this time, swarm mode is focused on long-running services.

To be honest, in a production system, I have been unhappy nearly every time I've used cron. The solution provided by @nathanleclaire is great, but the cron approach has problems when you need to guarantee something runs, particularly when you want to customize deadline behavior and backfill. There are also issues involving how cron stores state to disk that can be problematic if the container is rescheduled elsewhere.

This is merely a personal opinion and doesn't mean we won't implement batch in the future, because we will.

The best way to handle this, I've found, is to have the script do its own scheduling. Here is the simplest version, which runs a script every minute:

#!/bin/sh

# naive scheduler: wait a minute, then hand control to the actual job
sleep 60
exec myjob.sh

While this is a toy, it has a few nice properties:

  1. Every time it exits, it will be scheduled to a free node.
  2. Simple as dirt.

The main issue is that each time it is run, it waits 60 seconds. That means if you lose the node, the task will get rescheduled and wait 60 seconds before running. This is a huge problem when we replace that with 24 hours or 2 days.

We can solve this with something like anacron, but that typically means the cron data to do backfill is located on a single node.

So, what do we do? Basically, to solve this, we want to calculate the wait time each time we run and ensure that the state is saved to a commonly accessible component in the cluster. The answer to most software engineering problems is to use Redis!

The script is surprisingly simple:

#!/bin/sh

set -e

# query the ttl of the marker key and use it as our wait time;
# --csv means we don't have to parse the output
wait=$(redis-cli -h $REDIS_HOST -p $REDIS_PORT --csv ttl $SCRIPT_KEY)

case $wait in
  -2) # redis code for "key does not exist": time to run the job
    myjob.sh
    # on success of our job (the script has set -e), record the run and set the
    # ttl to the interval provided as an environment variable on the service
    redis-cli -h $REDIS_HOST -p $REDIS_PORT set $SCRIPT_KEY "$(date)" EX $INTERVAL_SECONDS
    ;;
  *)
    sleep $wait # sleep until the key expires
    # we could also sleep some modulus of the wait time, or cap it at a maximum,
    # to ensure the task is rescheduled to other nodes periodically
    ;;
esac

So, with this, if we set REDIS_HOST, REDIS_PORT, SCRIPT_KEY and INTERVAL_SECONDS, we get a script that runs every X seconds and is resilient to failures and rescheduling across nodes.
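
For illustration only, deploying such a wrapper might look roughly like this; myorg/cron-wrapper is a placeholder for an image that bundles the script above, myjob.sh, and redis-cli, and REDIS_HOST=redis assumes a redis service reachable on a shared network:

$ docker service create \
    --name hourly-job \
    --restart-condition any \
    -e REDIS_HOST=redis \
    -e REDIS_PORT=6379 \
    -e SCRIPT_KEY=hourly-job \
    -e INTERVAL_SECONDS=3600 \
    myorg/cron-wrapper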

This isn't too bad for a small amount of effort. Obviously, with some effort, this can get a lot more sophisticated. An enterprising soul could make a version of this script that simply takes an environment variable with the cron schedule and does the right thing. The key here is two properties:

  1. Enable the script to schedule itself.
  2. Store the run state in a distributed manner (some database).

An even more ambitious version could bind mount the docker socket right into the task and run a container for the script, separating the scheduler and scripts that are being scheduled.

Obviously, this isn't ideal, but this is a solid and flexible hack.

Note that this doesn't mean we won't implement batch.

@aleksejlopasov

aleksejlopasov commented Aug 1, 2016

To @nathanleclaire and @stevvooe
I've found another solution:
What if I check the state of the service before running it with 'docker service create'? If it is completed, then I will delete it and create a new service. If not, then echo a message: 'sorry, the service is running at the moment, try again later'.
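
A rough sketch of that check-then-recreate flow (my_job and myorg/job-image are placeholders); it just greps the task state shown by docker service ps, which works because a --restart-condition none service with one replica only ever has a single task:

if docker service ps my_job | grep -q 'Complete'; then
    docker service rm my_job
    docker service create --name my_job --restart-condition none myorg/job-image
else
    echo "sorry, the service is running at the moment, try again later"
fi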

@nathanleclaire

Contributor

nathanleclaire commented Aug 1, 2016

I've found another solution:
What if I check the state of the service before running it with 'docker service create'? If it is
completed, then I will delete it and create a new service. If not, then echo a message: 'sorry,
the service is running at the moment, try again later'.

Why use a docker service at all in this case? docker service today is intended for processes that you want to keep running. There will never be a "stopped service" like you mention. Have a play with docker service to get a feel for how this works.

$ docker service create --name deadservice alpine sh -c 'sleep 1; echo See ya later!'
aen8savf90lqy1i9i842wy498

$ docker service ps deadservice
ID                         NAME               IMAGE   NODE  DESIRED STATE  CURRENT STATE        ERROR
cydq1r3d2piw4z6b779r39say  deadservice.1      alpine  moby  Ready          Ready 2 days ago
7zqj89296nke3158ci3rv13zw   \_ deadservice.1  alpine  moby  Shutdown       Complete 2 days ago
3c678vy2zdi9ew1otxn909u0r   \_ deadservice.1  alpine  moby  Shutdown       Complete 2 days ago
a2xo7bq99w8kcx5a8csgc4uh7   \_ deadservice.1  alpine  moby  Shutdown       Complete 2 days ago

Your solution just depends on how complex your needs are. If the job is safe to run more/less often than intended (or possibly concurrently) then you could probably get by with something like a docker service that just does while sleep 60; do ./command.sh; done. If it's not safe to run willy-nilly then you have to put more elbow grease into rigging up some details to try and ensure safe / consistent operation. But doing so is one of those things which looks simple at first glance and turns out to be fiendishly difficult in practice. It's why tech like Apache Kafka exists to try and give us more guarantees about message delivery and event ordering.
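
For example, the loop-style service could look something like this, with myorg/mytools and ./command.sh standing in for your own image and command:

$ docker service create \
    --name periodic-command \
    myorg/mytools sh -c 'while sleep 60; do ./command.sh; done'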

@aleksejlopasov

aleksejlopasov commented Aug 2, 2016

@nathanleclaire
Maybe I should have provided more details. I run 'docker service create' with the '--restart-condition none' parameter. There is a PHP script inside the container that is run by the service. When it's done, the container on the worker node stops and the service on the master node stays created, but its state changes to 'complete'. I can't run that service again, but I can check its state, delete it, and run it again with 'docker service create' as I described in the previous post.

@rayjohnson

rayjohnson commented Aug 3, 2016

We use Rundeck to talk to a Docker swarm to run "batch" jobs. At the enterprise level you need a lot more logic around notifications, ways to do retries, chaining dependent jobs, etc. While a Docker-specific solution would be cool, it strikes me that it should be a separate service rather than extending Docker swarm in weird ways...

@rogaha

Contributor

rogaha commented Aug 3, 2016

@nathanleclaire have you thought about job dependencies? e.g.

docker jobs create \
    --every 1hr \
    nathanleclaire/hourlytask
bku4f1s1ncm0
docker jobs create \
    --depends-on bku4f1s1ncm0 \
    rogaha/aggregationJob

I think it would be a nice addition to your proposal.

@stevvooe

Contributor

stevvooe commented Aug 3, 2016

have you thought about job dependencies?

@rogaha And this is where the scope creep begins ;). There are a hundred ways to represent dependency graphs, both temporal and resource-based, so getting that right makes things challenging.

I would rather we provide the functionality to build cron systems on top of swarm services. Something like this:

docker service create --name mycronservice -e EVERY=1hr stevvooe/dockacron stevvooe/myhourlyscript

Effectively, this would run an image that runs a scheduler stevvooe/dockacron that runs the container stevvooe/myhourlyscript every hour. This requires the following features:

  1. Services that can create new tasks with some sort of plugin API.
  2. Some sort of persistent storage.

I think this can be done today by bind mounting the docker socket and constraining the service to manager nodes, while providing volumes for the state storage.
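
For illustration, such a deployment might look roughly like this today; the image names come from the example above, while the volume name and mount target are made up:

$ docker service create \
    --name mycronservice \
    --constraint node.role==manager \
    --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
    --mount type=volume,source=dockacron-state,target=/var/lib/dockacron \
    -e EVERY=1hr \
    stevvooe/dockacron stevvooe/myhourlyscript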

@AkihiroSuda AkihiroSuda referenced this issue in docker/swarmkit Sep 12, 2016

Closed

Run a single task (batch job) #1501

@rayjohnson

rayjohnson commented Sep 12, 2016

Compared to Rundeck, ActiveBatch, or any other enterprise-level batch management system, this proposal will never be of much use. Retry mechanisms, dependent jobs, triggered jobs, and many other features are things it could never compete on. Alerting, job history, and a full "run book" view are also typically critical features. It is scope creep to jump into this area and it would not add a lot of value.

What would add value, however, is to make it easy for an external batch scheduling system to call into swarm and run a job. The old swarm did this with "docker run" fairly well. The new service command doesn't allow for this as well. I'd suggest focusing more on enabling existing scheduling systems to use swarm rather than trying to reinvent the wheel in yet another area.

@stevvooe

Contributor

stevvooe commented Sep 13, 2016

@rayjohnson I'm not sure that's an apt comparison. The tools you are referencing are UIs and dependency management on top of some sort of execution system. From your response, I also sense some confusion about the use cases that services and standalone swarm address.

From what you've filed in docker/swarmkit#1501, I'd recommend you continue using standalone swarm with Docker 1.12, rather than the built in swarm mode features.

However we extend swarm mode, it will be a likely target for more feature-rich job scheduling systems. You're right: the service command isn't designed for batch. I've touched on some of the requirements in #23880 (comment), but your points about scope creep are a consideration. There are a million ways to schedule interdependent batch jobs, and this is largely why this isn't in the first version of Docker services.

@mcallaway

mcallaway commented Sep 15, 2016

The comments about "job dependencies" and the implied "feature creep" makes me want to point out that there is a small jump between dependent jobs and full blown "workflows". This is what I'm interested in. It's probably too big for this issue, but I wanted to just put this out there so that other eyes might see it and maybe lead someplace else.

https://github.com/common-workflow-language/common-workflow-language

There's a need for workflow execution that leverages Docker as the runtime environment for each "node" in a DAG (for example). CWL has arisen out of this need, but there's a lack of implementations. It's related to this issue, though I realize it's bigger. Maybe someone can link to other, more relevant connections?

The comments about "job dependencies" and the implied "feature creep" makes me want to point out that there is a small jump between dependent jobs and full blown "workflows". This is what I'm interested in. It's probably too big for this issue, but I wanted to just put this out there so that other eyes might see it and maybe lead someplace else.

https://github.com/common-workflow-language/common-workflow-language

There's a need for workflow execution that leverages docker for the runtime environment for each "node" in a DAG (for example). CWL has arisen out of this need. But there's a lack of implementations. It's related to this issue, but I realize bigger. Maybe someone can link to other more relevant connections?

@stevvooe

Contributor

stevvooe commented Sep 15, 2016

@mcallaway There are several ways of doing workflow specifications, spanning various levels of sophistication and complexity. CWL is a great example, but there are a number of others, from the Hadoop ecosystem to more academic applications. Many companies just grow their own to fit their specific needs, as well.

Any batch features in Docker would likely be positioned to support or complement these kinds of systems, rather than define a competing workflow system. The challenge for this project is cutting out the right primitives to make that painless.

@OferE

OferE commented Sep 18, 2016

+1 - very important.

@thaJeztah thaJeztah referenced this issue in docker/swarmkit Sep 26, 2016

Open

Do we support "one-off" service #1569

@GordonTheTurtle

GordonTheTurtle commented Sep 27, 2016

USER POLL

The best way to get notified of updates is to use the Subscribe button on this page.

Please don't use "+1" or "I have this too" comments on issues. We automatically
collect those comments to keep the thread short.

The people listed below have upvoted this issue by leaving a +1 comment:

@yank1

@danielwegener

danielwegener commented Feb 6, 2017

Maybe you guys could also help me understand the wording a bit or correct me if I got something wrong:
So Docker standalone swarm (now legacy?) supported the docker run API (used by pretty much every third-party tool interacting with Docker). Now the new Docker swarm mode (since 1.12.x, enabling Docker swarmkit) brings support for awesome services but drops support for the docker swarm standalone approach.

If I got that right, just a curious question/idea: how about simply (I bet it's not simple) re-enabling the legacy Docker client API for the swarm, such that both swarmkit's docker service and the standalone swarm's docker run could be executed on the same swarm?
If that does not make so much sense (since Docker standalone swarm seems to be an external project), how about adding support for docker run (with standalone swarm semantics) to swarmkit?

@ehazlett

Contributor

ehazlett commented Feb 6, 2017

@danielwegener We are working towards parity with "docker run" (see #25303 for related work) as well as consolidating the API to support what you are asking for. I'm not sure on the timetable, as swarmkit is quite different in its model vs. docker run.

@ehazlett

Contributor

ehazlett commented Feb 6, 2017

@danielwegener I should also add that, while not ideal, you can run swarm classic on a swarm-mode-enabled cluster to give you both.

@rayjohnson

rayjohnson commented Feb 6, 2017

We are running both and can attest that the approach does indeed work. However, I can also agree that it is less than ideal - basically you have to manage two clusters in one and thus have twice the administrative issues. :)

@danielwegener

danielwegener commented Feb 6, 2017

@ehazlett @rayjohnson Thanks for the quick response and clarification. I'll give the stacked cluster a try :)

@chrisguidry

chrisguidry commented Feb 6, 2017

While it's not perfect, I did have some measure of success with @aleksejlopasov's suggestion to use --restart-condition none. Here's how it works:

Let's say I have a container image whose ENTRYPOINT is a command that runs for 15 seconds and then exits normally.

$ docker service create --name my_scheduled_task --replicas 1 --restart-condition none ....

Then using a normal cron task:

* * * * * root docker service update my_scheduled_task

This causes the Swarm to schedule an instance of the container on one of the swarm nodes every minute, allowing it to run to completion. Also, if the task happens to take too long during one or more runs, you won't end up with stacked-up crons, because Swarm already sees that it has enough replicas.

Some very important caveats:

  • If the process exits with an error code, Swarm will immediately try to reschedule it.
  • You won't get any error reporting or failure notifications from cron about this (since the command docker service update my_scheduled_task will return quickly and won't be attached to your container)

I have only used this in a few toy examples, and it seems to work well for those. I think you'd want to have another error reporting mechanism ready because it would be very easy to miss a failing task.

@rayjohnson

rayjohnson commented Jun 15, 2017

@Vanuan

Vanuan commented Jun 15, 2017

@rayjohnson I think you already can. Just use restart policy "never".

@ms1111

ms1111 commented Jun 15, 2017

@Vanuan Yes, just keep in mind that services with restart policy "never" tend to get restarted randomly because of docker/swarmkit#932. So this is only safe for certain types of job.

@lsl

lsl commented Jun 16, 2017

So it seems like we have two distinct use cases here.

  • Cron-like service scheduling
  • Docker run for swarm mode (& persistent logs)

@ms1111 @Vanuan I'm finding similar issues using long restart delays to solve the cron-like scheduling problem.

@taybin

taybin commented Jun 16, 2017

Kubernetes supports these higher level concepts:

https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/

It'd be great to see support for these sorts of workloads at the compose file level.

@thaJeztah thaJeztah added this to backlog in maintainers-session Jun 22, 2017

@spatialvlad

spatialvlad commented Jul 11, 2017

Cron-like functionality would definitely be helpful.

It looks like the discussion in this topic found a way to run a service once, using a restart condition of none.

I guess I risk sounding obvious, but I'll mention it anyway, as someone might find it helpful.

It looks like it is also possible to run a job repetitively by using the delay option in the restart policy. E.g. to run a script daily (e.g. for a daily backup) these options would work:

            restart_policy:
                condition: any
                delay: 24h

By modifying the delay time one can get other repetition cycles, like weekly, etc.
It does not really give good control over the exact time when the service will run, but I guess this can be somewhat "hacked" for now by starting or updating the service at the needed time.
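
For context, that fragment sits under a service's deploy key in a version 3 stack file; a minimal sketch, with the service name and image as placeholders:

version: "3"
services:
  nightly-backup:
    image: myorg/backup:latest
    deploy:
      replicas: 1
      restart_policy:
        condition: any
        delay: 24h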

Anyway, seeing native support for cron-like functionality would be great.

@taiidani

taiidani commented Jul 27, 2017

Well, our solution of using docker-compose run against the stack file is running out of steam. I am having a hard time coming to grips with Docker's strategy of adding first-class Secrets and Config objects and then expecting us (I assume) to find alternatives for scheduled tasks and one-offs.

What is Docker's recommendation for handling Secrets in cron jobs? Should I abandon these features in favor of Hashicorp Vault, or implement workarounds like the restart condition option above where gathering logs and exit codes is a chore?

@Vanuan

Vanuan commented Jul 27, 2017

Wouldn't this work:

  1. Add the one-off to docker-compose.yml (use restart policy never; see the sketch after this list)
  2. run docker deploy
  3. remove one-off from docker-compose.yml
  4. run docker deploy again
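
A minimal sketch of what step 1 might look like in the stack file, with the service name, image, and command as placeholders:

version: "3"
services:
  one-off-migration:
    image: myorg/migrations:latest
    command: ./run-migrations.sh
    deploy:
      restart_policy:
        condition: none

After docker deploy picks it up and the task completes, removing the block and deploying again (steps 3 and 4) cleans the service up.
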
@shouze

Contributor

shouze commented Jul 27, 2017

@Vanuan yup, but for cron jobs that won't work. Nice trick BTW.

@Vanuan

Vanuan commented Jul 27, 2017

Yeah, for cron jobs you'd have to either use a restart delay, e.g. 24h (it wouldn't be at the same time of day though), or embed cron into the container itself.

@taiidani

taiidani commented Jul 28, 2017

There are also the issues of:

  • Blocking your script while you wait for the one-off service to exit, so you don't accidentally kill it midway through. Now that we have Swarm events that may work too, but you'd have to register and listen for the task exit event.
  • Navigating the service API to get the exit code of the task so you can determine if it failed/succeeded.
  • Navigating the service API (assuming you have access to service logs) to gather the logs, or giving your cron runner a message along the lines of "Logs aren't supported. Check your log aggregator please! ¯\_(ツ)_/¯"

At this point I'm coming to terms with having to tell our engineers, "You get insecure crons, or you get Secrets. You can't have both" while we continue our docker-compose run strategy.

@Vanuan

Vanuan commented Jul 28, 2017

Well, I don't see how Docker prevents solving those issues. They aren't unique to swarm mode.

I imagine some supervisor script which will run your one-off script and push completion status to some database/message queue, while your dependent service waits for completion status to appear.

Similarly for gathering logs, I don't see what you are expecting. Docker doesn't prevent you from accessing logs. You can mount them to some network drive to be accessible wherever you want them to be.

I don't see how this is related to secrets at all.

@Vanuan

Vanuan commented Jul 28, 2017

Reading it more carefully, I think I understand now.

You've exploited docker-compose run to run some script in the foreground, i.e. using "synchronous" commands.
E.g.:

docker-compose run mysql mysql -e "INSERT ..."
docker-compose run mysql mysql -e "CREATE ..."

That wasn't designed for that use case.

@Jota87

Jota87 commented Aug 16, 2017

---- Doesn't work... ----
Hey, I was struggling with this problem too. My solution was to add an environment tag in my docker compose file for my swarm:

curator:
  image: bobrik/curator:latest
  deploy:
    mode: replicated
    replicas: 1
  secrets:
    - curator_yml
    - actionfile_yml
  command: --config /run/secrets/curator_yml /run/secrets/actionfile_yml
  environment:
    - SCHEDULE='* * * * *'

@mide

mide commented Aug 16, 2017

@Jota87, I think that the SCHEDULE environment variable you mentioned has been implemented at the image level, not at the orchestration level. One main difference: I'd wager that when you run that curator image, it stays running. If it were a real cron-like task, it'd do the work and go away.

I'll happily be wrong because this means the feature is closer than I previously thought 😄 .
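
For context, an image-level SCHEDULE variable is typically wired up by the image's entrypoint, roughly like the sketch below (not the actual bobrik/curator entrypoint; paths and arguments are illustrative):

    #!/bin/sh
    # Image-level scheduling: the container stays running and crond invokes
    # the real command on the cron expression given in $SCHEDULE.
    echo "$SCHEDULE /usr/bin/curator --config /config/curator.yml /config/action.yml" | crontab -
    exec crond -f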

@Jota87

Jota87 commented Aug 16, 2017

@mide you are right, this doesn't work! It tries to rerun curator all the time and is burning my machines 😄😄 Don't do that! I was excited about the idea, but it wasn't my best 😄

@shouze

Contributor

shouze commented Aug 16, 2017

Yup, one solution, if none is managed by the Docker Swarm orchestrator itself, would be to have a tool like traefik: pass the scheduling through service labels that the tool listens for, with the tool running only on swarm manager nodes of course.
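
As a rough illustration of that label idea (the cron.schedule label and image name below are illustrative; a scheduler tool running on a manager node would have to watch for the label and trigger the runs):

    docker service create \
      --name db-backup \
      --restart-condition none \
      --label cron.schedule="0 3 * * *" \
      myorg/db-backup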

@getvivekv

getvivekv commented Aug 16, 2017

I am currently working on a proof of concept for this requirement. The theory is that one-off cron jobs can be implemented by a cronjob container running in swarm mode that creates new swarm services in the cluster; once a service has finished running its cronjob, it just stops (restart condition: none). A container with access to the host node's Docker socket can create a service in the cluster with one replica, and that service's whole purpose is to start the container, execute the cronjob (cronjobs are fed in via an environment variable), and die.
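
A rough sketch of the command such a cron initiator could run against the mounted Docker socket when a job fires (the service name and image are illustrative):

    docker service create \
      --name report-job-$(date +%s) \
      --replicas 1 \
      --restart-condition none \
      myorg/report-generator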

@stevelacy

stevelacy commented Aug 16, 2017

This is how we ended up using swarm mode to run "one off" services:

    // Uses the Docker Go SDK: "github.com/docker/docker/api/types" and
    // "github.com/docker/docker/api/types/swarm"; cli is a client from
    // "github.com/docker/docker/client", and command ([]string) comes from a network request.
    serviceSpec := swarm.ServiceSpec{
      TaskTemplate: swarm.TaskSpec{
        ContainerSpec: &swarm.ContainerSpec{
          Image:      "alpine",
          Command:    command,
          StopSignal: "SIGINT",
        },

        // "none" means the task runs once and is never restarted by the orchestrator.
        RestartPolicy: &swarm.RestartPolicy{
          Condition: "none",
        },
      },
    }
    resp, err := cli.ServiceCreate(ctx, serviceSpec, types.ServiceCreateOptions{})

That code runs inside a container on a swarm manager node (as a plain container, not a service). It listens for requests, and because the container bind-mounts /var/run/docker.sock:/var/run/docker.sock, the Docker SDK can create new services as shown above.

@getvivekv

getvivekv commented Aug 16, 2017

Yes, exactly what I said above. I am still working on customizing the logic for our application. Glad to see that it works. One difference is that I'd want to run this container as a docker service for reliability: if that node goes down, in your case the cron initiator goes down with it. If you run the container as a docker service instead, the other managers in the cluster will make sure the container is relocated to one of the healthy manager nodes.

Plus, we are using Docker Swarm in an auto-scale group in AWS. So if one of the managers goes down, AWS will terminate that server and create a new one, and our custom auto-scale script will add it to the manager cluster automatically. In that case, I can't really run this cron-initiator container as a standalone container; I need to let swarm manage it.

@stevelacy

stevelacy commented Aug 16, 2017

@getvivekv Check this out:

    docker service create --name quartermaster \
      -e TOKEN=4jrs8-534js-345ds-3lrd0 \
      -p 9090:9090 \
      --constraint 'node.role == manager' \
      --mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
      stevelacy/quartermaster

Confirmed that it works; I specified node.role == manager to ensure it has manager access to the swarm.

@shouze

Contributor

shouze commented Aug 16, 2017

OK @stevelacy, I guess that relates to this GitHub repo: https://github.com/stevelacy/quartermaster. I'll give it a try; it looks pretty good as a basis.

@zsluedem

zsluedem commented Aug 23, 2017

I think a one-shot job fits more situations than a cron job. Sometimes I just want to deploy a one-time job, not a cron job, based on my app. A one-shot job is more flexible than a cron one!

@n3tw0rk5

n3tw0rk5 commented Sep 11, 2017

What about repeatable one-off jobs that vary with each run, e.g.:

    docker run --rm -ti -v $PWD:$PWD -w $PWD image:tag /bin/bash -c "python script with an argument that changes for each container run & an input file that changes as well"

How would this work in a swarm where you may have multiple job submissions, so you want to scale?
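
One way to approximate this today is to create one one-off service per input, something like the sketch below (it assumes the bind-mounted directory is available on whichever node runs the task, e.g. via a shared filesystem; names are illustrative):

    for f in inputs/*.txt; do
      docker service create \
        --restart-condition none \
        --mount type=bind,source="$PWD",destination=/work \
        --workdir /work \
        image:tag python script.py --input "/work/$f"
    done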

@nasskach

nasskach commented Sep 14, 2017

@getvivekv yes we did something similar using rundeck packaged in a container running on manager nodes only.

@getvivekv

getvivekv commented Sep 14, 2017

Assuming that rundeck needs to be installed on a server, how do you take care of the cron if the rundeck server goes down? Are you starting a swarm service with rundeck? @nasskach

@nasskach

nasskach commented Sep 14, 2017

@getvivekv As we have many manager nodes in our cluster, if the node where rundeck is deployed goes down, its container will normally be moved to another manager node.

I'm currently thinking about writing a small cron application that would also run in a container on a manager node, but that would read cron tasks from the labels of the different containers... like traefik does for load balancing.

I'm new to Golang programming and I'm still not sure that it is a good idea ^^

EDIT: I think I didn't understand your question well in my first answer. You are asking what we do if rundeck starts a job and rundeck goes down during the job's execution. It is a good question and we didn't handle that case. To be honest, we currently use ECS more than Docker Swarm mode. And in my opinion K8S is still a better solution until Docker Swarm mode adds the ability to execute a single task on the cluster.

@getvivekv

getvivekv commented Sep 14, 2017

Yes, my question was about the rundeck server going down; in that case you don't have redundancy and can't initiate the cron job. We are internally discussing whether we can use AWS Lambda to initiate the cronjob.

@Gowiem

Gowiem commented Jan 4, 2018

For folks coming here looking for a non-hacky solution right now, one possibility is to go the route of something like Clockwork or python schedule, or the equivalent for your language. Then you can have a long-running service which kicks off jobs on the schedule you determine within that system.
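
An equivalent minimal sketch of that approach, using a plain shell loop in place of Clockwork / python schedule and run as an ordinary long-lived swarm service (the job command and interval are illustrative):

    #!/bin/sh
    # Long-running scheduler: kick off the job once an hour, forever.
    while true; do
      /usr/local/bin/run-job.sh || echo "job failed, will retry next cycle" >&2
      sleep 3600
    done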

@pageflex

pageflex commented Feb 13, 2018

We are looking to move our back-office support to task-based processing and to use Docker to save infrastructure costs. The ability to stand an image up as a container to do a given task, either on a schedule or from an external caller, is a necessary feature. We'd love to have both of those features with swarm for the redundancy, and the ability to also run long-standing services within the same infrastructure would be a nice value-add.

I see all the hacks and we'll have to go that route, but I'm curious whether these ideas have been placed on a roadmap somewhere that can give us an expectation of when this thread will be addressed.

borekb added a commit to versionpress/versionpress that referenced this issue May 13, 2018

Experiment with Swarm mode / stacks. Long-running services work nicely; however, one-off jobs (we'll need them for tests) are not really supported, see moby/moby#23880, so this will get reverted soon.

@borekb referenced this issue in versionpress/versionpress May 13, 2018

Merged

Dev setup updates – spring 2018 #1329

@s17t

s17t commented May 17, 2018

I have a couple of use cases that would fit perfectly with a one-off job feature. For now I'll just have to round-robin across N Docker servers. Even without the cron part this feature is more than desirable; there are plenty of schedulers out there.

Here [0] someone wrote a wrapper just for running one-off jobs at the swarm level after 1.12.

[0] https://github.com/alexellis/jaas

@cpitkin referenced this issue in openfaas/faas May 23, 2018

Open

Research item: Long running batch-jobs #657

@alexellis

Contributor

alexellis commented May 23, 2018

@s17t thanks for mentioning JaaS - if anyone would like to get jobs on Swarm this is probably one of the simplest ways you can do it right now.

Contributions are welcome - https://github.com/alexellis/jaas

OpenFaaS is also related on this topic for shorter running tasks - seconds to minutes - https://github.com/openfaas/faas

It would be ideal to see this in Swarm, but we've been waiting two years for an update.

@bitsofinfo

+1
