Jobs may not run on schedule #508

Closed
aphyr opened this Issue · 14 comments

4 participants

@aphyr

I'm testing Chronos with Jepsen--we'd like guarantees that jobs will be completed on schedule, even in the presence of network and node failures. I've had a spot of trouble getting started: jobs aren't invoked on schedule even before I've started inducing failure.

This test creates a five-node cluster: (n1, n2, n3, n4, n5). ZK (3.4.5+dfsg-2) runs on all five nodes, Mesos (0.23.0-1.0.debian81) masters run on the first three nodes (n1, n2, n3), and Mesos slaves run on n4 and n5. Chronos (2.3.4-1.0.81.debian77) runs on all five nodes.

We generate and insert a handful of jobs into Chronos for 300 seconds, then wait 120 seconds to watch the system in a steady state, and finally read the results. Each job consists of a small shell script which allocates a tmpfile, logs the job name and start time, sleeps for :duration seconds, then logs the completion time. To read the final state, we take the union of all files written by every job. When a run writes its final time we consider it complete--if it only logs the initial time we consider it incomplete.

Job schedules are selected such that the :interval between jobs is always greater than their :duration plus :epsilon plus an additional epsilon-forgiveness--in testing, it looks like Chronos will regularly slip deadlines by a second or two, so we allow an extra 5 seconds slop. This should guarantee that jobs never run concurrently.
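
As a sanity check, here's a minimal sketch of that constraint (illustrative Clojure, not the actual generator code):

(defn non-overlapping?
  "True when the next run's earliest start (one :interval later) falls after
  the latest possible completion of the current run (:epsilon, plus 5 seconds
  of forgiveness, plus :duration)."
  [{:keys [interval duration epsilon]}]
  (< (+ epsilon 5 duration) interval))

;; Job 36 from the example below: 6 + 5 + 6 = 17 < 24, so its runs never overlap.
(non-overlapping? {:interval 24, :duration 6, :epsilon 6})  ; => true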

All node clocks are perfectly synchronized.

To verify correctness, we expand the job spec into a sequence of :target intervals: a vector of [start-time end-time] during which the job should have begun. We only compute targets which would have been complete at the time of the final read. Then, we take all completed invocations, and attempt to find a unique :solution mapping targets to runs, using the Choco constraint solver. A job is considered invalid if some target could not be satisfied by a completed run. For example:

{:valid? false,
   :job
   {:name 36,
    :start #<DateTime 2015-08-05T21:57:42.512Z>,
    :count 5,
    :duration 6,
    :epsilon 6,
    :interval 24},
   :solution nil,
   :targets
   ([#<DateTime 2015-08-05T21:57:42.512Z>
     #<DateTime 2015-08-05T21:57:53.512Z>]
    [#<DateTime 2015-08-05T21:58:06.512Z>
     #<DateTime 2015-08-05T21:58:17.512Z>]
    [#<DateTime 2015-08-05T21:58:30.512Z>
     #<DateTime 2015-08-05T21:58:41.512Z>]
    [#<DateTime 2015-08-05T21:58:54.512Z>
     #<DateTime 2015-08-05T21:59:05.512Z>]
    [#<DateTime 2015-08-05T21:59:18.512Z>
     #<DateTime 2015-08-05T21:59:29.512Z>]),
   :extra nil,
   :complete
   [{:name 36,
     :start #<DateTime 2015-08-05T21:57:43.619Z>,
     :end #<DateTime 2015-08-05T21:57:49.621Z>}
    {:name 36,
     :start #<DateTime 2015-08-05T21:58:06.665Z>,
     :end #<DateTime 2015-08-05T21:58:12.666Z>}
    {:name 36,
     :start #<DateTime 2015-08-05T21:58:31.668Z>,
     :end #<DateTime 2015-08-05T21:58:37.670Z>}],
   :incomplete []},

This job should have run 5 times, during a 6-second window recurring every 24 seconds, from 21:57:42 to 21:59:18. However, only the first 3 targets are satisfied--Chronos never invokes the final 2. You can download the full ZK and Mesos logs, operation history, and analysis here.
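
For reference, a rough sketch of how those :target windows can be derived from the job spec (illustrative Clojure, not the actual jepsen.chronos code; assumes Joda-Time on the classpath):

(import 'org.joda.time.DateTime)

(defn target-windows
  "The [start end] windows in which each of the job's :count runs should have
  begun: spaced :interval seconds apart, each open for :epsilon seconds plus
  5 seconds of forgiveness."
  [{:keys [start count interval epsilon]}]
  (for [k (range count)]
    (let [window-start (.plusSeconds ^DateTime start (int (* k interval)))]
      [window-start (.plusSeconds window-start (int (+ epsilon 5)))])))

;; For job 36 above, this yields [21:57:42.512 21:57:53.512],
;; [21:58:06.512 21:58:17.512], and so on.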

You can reproduce these results by cloning https://github.com/aphyr/jepsen/, at commit c8a9acb274074efcb8bc689983f30ecac8f51605, setting up Jepsen (see https://github.com/aphyr/jepsen/tree/master/jepsen, or https://github.com/abailly/jepsen-vagrant), and running lein test in jepsen/chronos. I'm also happy to adjust the tests and re-run with newer versions.
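
In shell terms, that's roughly (assuming Jepsen itself is already set up per the links above):

git clone https://github.com/aphyr/jepsen/
cd jepsen
git checkout c8a9acb274074efcb8bc689983f30ecac8f51605
cd chronos
lein test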

Should I adjust our expectations for the test? Is there additional debugging information that might be helpful? Anything else I can tell you about the test structure or environment? I'd like to get this figured out. :)

@tnachen
Owner

Hi @aphyr, what kind of tasks are you passing to Chronos to run, and how many resources did you specify?
Also, just glancing quickly, I notice from the master log that there are places where it detects the framework is lost and fails all its tasks. There is a setting on the framework side that tells Mesos to wait a certain amount of time for the framework to reconnect (the failover timeout).
So my question is: what configuration did you use to launch Chronos?
And what type of failures are you injecting in this case?

@elingg
Collaborator

If you could also please send over your exact JSON job definition in Chronos, it would be easy for us to check it and also try to reproduce the problem.

@aphyr

what kind of tasks are you passing to Chronos to run

Er, like I said, each job consists of a small shell script which allocates a tmpfile, logs the job name and start time, sleeps for :duration seconds, then logs the completion time.

See https://github.com/aphyr/jepsen/blob/master/chronos/src/jepsen/chronos.clj#L91-L95 for the exact command. :)

and how many resources did you specify?

Initially I used the defaults, but the test fails regardless of requested resources. Right now I'm using 1 mem, 1 disk, and 0.001 CPU.

So my question is: what configuration did you use to launch Chronos?

Here's the Mesos master config and the Mesos slave config. I'm just using the defaults from the mesosphere debian Chronos packages for Chronos's setup.

And what type of failures are you injecting in this case?

Like I said, I haven't started introducing failures yet. This is with a totally-connected, ~zero-latency network and no host/process failures. :)

@aphyr

If it matters, the node hosting all 5 LXC nodes is a 48-way Xeon with 128 GB of RAM. We're not really under CPU/memory pressure yet. :)

@aphyr

If you could also please send over your exact JSON job definition in Chronos, it would be easy for us to check it and also try to reproduce the problem.

Here's the config for job 36:

{"async":false,
"schedule":"R5/2015-08-05T21:57:42.000Z/PT24S",
"disk":1,
"mem":1,
"name":36,
"command":"MEW=$(mktemp -p /tmp/chronos-test/); echo \"36\" >> $MEW; date -u -Ins >> $MEW; sleep 6; date -u -Ins >> $MEW;",
"epsilon":"PT6S",
"owner":"jepsen@jepsen.io",
"scheduleTimeZone":"UTC",
"cpus":0.001}
@tnachen
Owner

@aphyr I also notice a lot of different framework IDs in the log; it seems like you've used this cluster before, and the frameworks that used to connect to the master are gone. I don't think this is related, but the log is filled with status updates for other tasks that it tries to forward back, and I can't tell which framework is yours.
Can you purge all the work directories (default /tmp/mesos) on the masters, slaves, and ZooKeeper nodes and retry your experiment? We'll try it here as soon as we can get to it too. A cleaner log will make it easier to figure out what's wrong.

@aphyr

Good catch! I've updated the tests to clear the mesos directories on each run: here's another failing case.

   5
   {:valid? false,
    :job
    {:name 5,
     :start #<DateTime 2015-08-06T17:49:25.966Z>,
     :count 2,
     :duration 6,
     :epsilon 25,
     :interval 48},
    :solution nil,
    :targets
    ([#<DateTime 2015-08-06T17:49:25.966Z>
      #<DateTime 2015-08-06T17:49:55.966Z>]
     [#<DateTime 2015-08-06T17:50:13.966Z>
      #<DateTime 2015-08-06T17:50:43.966Z>]),
    :extra nil,
    :complete
    [{:node "n5",
      :name 5,
      :start #<DateTime 2015-08-06T17:49:27.956Z>,
      :end #<DateTime 2015-08-06T17:49:33.958Z>}],
    :incomplete []},
@elingg
Collaborator

Hi @aphyr, I tried launching your JSON on my cluster, and I was able to reproduce the issue. I believe the issue is your repeat interval. When I increase your repeat interval to "schedule":"R5/2015-08-06T18:37:42.000Z/PT50S", it works consistently for me and repeats 5 times as expected. When I use the 24s repeat interval, it sometimes stops after running 3 times. Chronos by default only allows one instance of a job to run at a time, so if there is not a long enough repeat interval some jobs end up not running. For more details on why a longer repeat interval is needed, see the similar discussion in #476. Sorry for the user-experience confusion here. Please retry with a larger repeat interval and let us know how it goes.

@brndnmtthws
Owner

You should shorten the scheduler horizon (the flag is --schedule_horizon) from the default of 60s to something like 1s, to provide the scheduling granularity your test needs.
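
For example, with the Mesosphere Debian packages used here, one way to set that flag on each Chronos node (assuming the packaged startup script turns each file under /etc/chronos/conf/ into a --<filename> flag) is:

echo 1 | sudo tee /etc/chronos/conf/schedule_horizon
sudo service chronos restart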

@elingg
Collaborator

Thanks @brndnmtthws. Yes, @aphyr, tuning the flag Brenden mentioned may allow you to avoid increasing the repeat interval I mentioned above.

@aphyr

Chronos by default only allows one instance of a job to run at a time, so if there is not a long enough repeat interval some jobs end up not running.

This was a problem in earlier versions of the test, so I modified the generator to construct intervals that can never overlap. In this case, for instance, there's a full 12 seconds (48s interval - 6s duration - 25s epsilon - 5s forgiveness) between the latest possible completion time for the first run and the earliest possible start time for the second run.

You should shorten the scheduler horizon (the flag is --schedule_horizon) from the default of 60s to something like 1s, to provide the scheduling granularity your test needs.

Aha! I'll give that a shot. Can I suggest that Chronos emit an error when asked to schedule operations at a finer granularity than the scheduler horizon, instead of accepting those jobs and failing to run them?

@brndnmtthws
Owner

Can I suggest that Chronos emit an error when asked to schedule operations at a finer granularity than the scheduler horizon, instead of accepting those jobs and failing to run them?

Great idea. Patches are welcome, too.

@aphyr

Looks like dropping the scheduler horizon to 1 second allows the test to pass in the happy case. Thank you! I'm gonna go ahead and close this for now. :)

@aphyr aphyr closed this
@brndnmtthws
Owner

Happy to help. You should stop by our offices for lunch some time and we can chat about Chronos and other computer-y things.
