
Hi @aphyr, what kind of tasks are you passing to Chronos to run, and how many resources did you specify?
Also, glancing quickly at the master log, I notice places where it detects that the framework is lost and fails all of its tasks. There is a setting on the framework side that tells Mesos to wait a certain amount of time for the framework to reconnect before doing that (the failover timeout).
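A rough sketch of what raising that timeout could look like when launching Chronos; the flag names follow the Chronos 2.x options list and the ZooKeeper addresses are placeholders, not values taken from this cluster:

# Sketch only: give Mesos an hour to let Chronos fail over and reconnect
# before declaring the framework lost and killing its tasks.
# --failover_timeout is in seconds; the ZK addresses below are placeholders.
chronos --master zk://n1:2181,n2:2181,n3:2181/mesos \
        --zk_hosts n1:2181,n2:2181,n3:2181 \
        --failover_timeout 3600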
So my question is: what configuration did you use to launch Chronos?
And what type of failures are you injecting in this case?
If you could also send over your exact JSON job definition in Chronos, it will make it easier for us to check it and try to reproduce the problem.
what kind of tasks are you passing to Chronos to run
Er, like I said, each job consists of a small shell script which allocates a tmpfile, logs the job name and start time, sleeps for :duration seconds, then logs the completion time.
See https://github.com/aphyr/jepsen/blob/master/chronos/src/jepsen/chronos.clj#L91-L95 for the exact command. :)
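For reference, the generated command for one job looks roughly like this (adapted from the job-36 definition quoted further down in this thread; the name and sleep duration vary per job):

MEW=$(mktemp -p /tmp/chronos-test/)   # one tmpfile per run
echo "36" >> $MEW                     # job name
date -u -Ins >> $MEW                  # start timestamp
sleep 6                               # :duration seconds
date -u -Ins >> $MEW                  # completion timestamp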
and how many resources did you specify?
Initially I used the defaults, but the test fails regardless of requested resources. Right now I'm using 1 mem, 1 disk, and 0.001 CPU.
So my question is: what configuration did you use to launch Chronos?
Here's the Mesos master config and the Mesos slave config. I'm just using the defaults from the Mesosphere Debian Chronos packages for Chronos's setup.
And what type of failures are you injecting in this case?
Like I said, I haven't started introducing failures yet. This is with a totally-connected, ~zero-latency network and no host/process failures. :)
If it matters, the node hosting all 5 LXC nodes is a 48-way Xeon with 128GB of RAM. We're not really under CPU/memory pressure yet. :)
If you could also send over your exact JSON job definition in Chronos, it will make it easier for us to check it and try to reproduce the problem.
Here's the config for job 36:
{"async":false,
"schedule":"R5/2015-08-05T21:57:42.000Z/PT24S",
"disk":1,
"mem":1,
"name":36,
"command":"MEW=$(mktemp -p /tmp/chronos-test/); echo \"36\" >> $MEW; date -u -Ins >> $MEW; sleep 6; date -u -Ins >> $MEW;",
"epsilon":"PT6S",
"owner":"jepsen@jepsen.io",
"scheduleTimeZone":"UTC",
"cpus":0.001}@aphyr I also notice a lot of different framework ids in the log, seems like you've used to cluster before and the frameworks that used to connect with the master is gone. I don't think this is related but then the logs is filled with other tasks statuses that it tries to forward back and I can't tell which framework is yours.
Can you purge all the work directories (default /tmp/mesos) on the masters, slaves, and ZooKeeper nodes and retry your experiment? We'll try it here as soon as we can get to it, too. A cleaner log will make it easier to figure out what's wrong.
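A sketch of the kind of cleanup being requested; /tmp/mesos is the default work directory mentioned above, while the ZooKeeper data directory is an assumption and depends on your dataDir setting:

# On each node, with Mesos and ZooKeeper stopped:
rm -rf /tmp/mesos/*                  # default Mesos work_dir (masters and slaves)
rm -rf /var/lib/zookeeper/version-2  # assumed ZK dataDir; check zoo.cfg before deleting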
Good catch! I've updated the tests to clear the mesos directories on each run: here's another failing case.
5
{:valid? false,
:job
{:name 5,
:start #<DateTime 2015-08-06T17:49:25.966Z>,
:count 2,
:duration 6,
:epsilon 25,
:interval 48},
:solution nil,
:targets
([#<DateTime 2015-08-06T17:49:25.966Z>
#<DateTime 2015-08-06T17:49:55.966Z>]
[#<DateTime 2015-08-06T17:50:13.966Z>
#<DateTime 2015-08-06T17:50:43.966Z>]),
:extra nil,
:complete
[{:node "n5",
:name 5,
:start #<DateTime 2015-08-06T17:49:27.956Z>,
:end #<DateTime 2015-08-06T17:49:33.958Z>}],
:incomplete []},

Hi @aphyr, I tried launching your JSON on my cluster, and I was able to reproduce the issue. I believe the issue is your repeat interval. When I increase your repeat interval to "schedule":"R5/2015-08-06T18:37:42.000Z/PT50S", it works consistently for me and repeats 5 times as expected. When I use the 24s repeat interval, it sometimes stops after running 3 times. Chronos by default only allows one instance of a job to run at a time, so if there is not a long enough repeat interval, some jobs end up not running. For more details on why a longer repeat interval is needed, see the similar discussion in #476. Sorry for the user-experience confusion here. Please retry with a larger repeat interval and let us know how it goes.
You should shorten the scheduler horizon (the flag is --schedule_horizon) from the default of 60s, to something like 1s to provide the scheduling granularity your test needs.
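A sketch of one way to apply that; passing the flag directly should work, and the /etc/chronos/conf layout is an assumption about how the Mesosphere Debian packages pick up flags:

# Sketch: shorten the scheduling horizon to 1 second.
# Either pass the flag on the command line...
chronos --schedule_horizon 1 <other flags>
# ...or, with the Mesosphere Debian packages, drop it in the (assumed)
# /etc/chronos/conf directory and restart:
echo 1 > /etc/chronos/conf/schedule_horizon
service chronos restart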
Thanks @brndnmtthws. Yes, @aphyr, tuning the flag Brenden mentioned may let you avoid increasing your repeat interval as I suggested above.
Chronos by default only allows one instance of a job to run at a time, so if there is not a long enough repeat interval, some jobs end up not running.
This was a problem in earlier versions of the test, so I modified the generator to construct intervals that can never overlap. In this case, for instance, there's a full 12 seconds (interval 48s − (epsilon 25s + 5s forgiveness + duration 6s)) between the latest possible completion time for the first run and the earliest possible start time for the second run.
You should shorten the scheduler horizon (the flag is --schedule_horizon) from the default of 60s, to something like 1s to provide the scheduling granularity your test needs.
Aha! I'll give that a shot. Can I suggest that Chronos emit an error when asked to schedule operations at a finer granularity than the scheduler horizon, instead of accepting those jobs and failing to run them?
Can I suggest that Chronos emit an error when asked to schedule operations at a finer granularity than the scheduler horizon, instead of accepting those jobs and failing to run them?
Great idea. Patches are welcome, too.
Looks like dropping the scheduler horizon to 1 second allows the test to pass in the happy case. Thank you! I'm gonna go ahead and close this for now. :)
Happy to help. You should stop by our offices for lunch some time and we can chat about Chronos and other computer-y things.
I'm testing Chronos with Jepsen--we'd like guarantees that jobs will be completed on schedule, even in the presence of network and node failures. I've had a spot of trouble getting started: jobs aren't invoked on schedule even before I've started inducing failure.
This test creates a five-node cluster: (n1, n2, n3, n4, n5). ZK (3.4.5+dfsg-2) runs on all five nodes, Mesos (0.23.0-1.0.debian81) masters run on the first three nodes (n1, n2, n3), and Mesos slaves run on n4 and n5. Chronos (2.3.4-1.0.81.debian77) runs on all five nodes.
We generate and insert a handful of jobs into Chronos for 300 seconds, then wait 120 seconds to watch the system in a steady state, and finally read the results. Each job consists of a small shell script which allocates a tmpfile, logs the job name and start time, sleeps for :duration seconds, then logs the completion time. To read the final state, we take the union of all files written by every job. When a run writes its final time we consider it complete--if it only logs the initial time we consider it incomplete.

Job schedules are selected such that the :interval between jobs is always greater than their :duration plus :epsilon plus an additional epsilon-forgiveness--in testing, it looks like Chronos will regularly slip deadlines by a second or two, so we allow an extra 5 seconds slop. This should guarantee that jobs never run concurrently. All node clocks are perfectly synchronized.
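As a concrete instance of that constraint, using the parameters of example job 36 shown just below (duration 6s, epsilon 6s, plus the 5-second forgiveness):

# :duration 6s + :epsilon 6s + epsilon-forgiveness 5s = 17s
# 17s < :interval 24s
# so even a maximally late run finishes before the next target window opens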
To verify correctness, we expand the job spec into a sequence of :target intervals: a vector of [start-time end-time] during which the job should have begun. We only compute targets which would have been complete at the time of the final read. Then, we take all completed invocations, and attempt to find a unique :solution mapping targets to runs, using the Choco constraint solver. A job is considered invalid if some target could not be satisfied by a completed run. For example:

{:valid? false,
:job
{:name 36,
:start #<DateTime 2015-08-05T21:57:42.512Z>,
:count 5,
:duration 6,
:epsilon 6,
:interval 24},
:solution nil,
:targets
([#<DateTime 2015-08-05T21:57:42.512Z>
#<DateTime 2015-08-05T21:57:53.512Z>]
[#<DateTime 2015-08-05T21:58:06.512Z>
#<DateTime 2015-08-05T21:58:17.512Z>]
[#<DateTime 2015-08-05T21:58:30.512Z>
#<DateTime 2015-08-05T21:58:41.512Z>]
[#<DateTime 2015-08-05T21:58:54.512Z>
#<DateTime 2015-08-05T21:59:05.512Z>]
[#<DateTime 2015-08-05T21:59:18.512Z>
#<DateTime 2015-08-05T21:59:29.512Z>]),
:extra nil,
:complete
[{:name 36,
:start #<DateTime 2015-08-05T21:57:43.619Z>,
:end #<DateTime 2015-08-05T21:57:49.621Z>}
{:name 36,
:start #<DateTime 2015-08-05T21:58:06.665Z>,
:end #<DateTime 2015-08-05T21:58:12.666Z>}
{:name 36,
:start #<DateTime 2015-08-05T21:58:31.668Z>,
:end #<DateTime 2015-08-05T21:58:37.670Z>}],
:incomplete []}

This job should have run 5 times, during a 6 second window recurring every 24 seconds, from 21:57:42 to 21:59:18. However, only the first 3 targets are satisfied--Chronos never invokes the final 2. You can download the full ZK, Mesos logs, operation history, and analysis here.
You can reproduce these results by cloning https://github.com/aphyr/jepsen/, at commit c8a9acb274074efcb8bc689983f30ecac8f51605, setting up Jepsen (see https://github.com/aphyr/jepsen/tree/master/jepsen, or https://github.com/abailly/jepsen-vagrant), and running
lein test in jepsen/chronos. I'm also happy to adjust the tests and re-run with newer versions.

Should I adjust our expectations for the test? Is there additional debugging information that might be helpful? Anything else I can tell you about the test structure or environment? I'd like to get this figured out. :)
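The reproduction steps above boil down to roughly the following, assuming a working Jepsen cluster environment as described in the setup links:

# Clone Jepsen at the commit referenced above and run the Chronos test.
git clone https://github.com/aphyr/jepsen.git
cd jepsen
git checkout c8a9acb274074efcb8bc689983f30ecac8f51605
cd chronos     # the Chronos test lives here
lein test      # runs the Jepsen Chronos test against the configured cluster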