Jobs may not run on schedule #508
Comments
Hi @aphyr, what kind of tasks are you passing to Chronos, and how many resources did you specify?

If you could also send over your exact JSON job definition in Chronos, it will be easier for us to check it and try to reproduce the problem.
Er, like I said, each job consists of a small shell script which allocates a tmpfile, logs the job name and start time, sleeps for :duration seconds, then logs the completion time. See https://github.com/aphyr/jepsen/blob/master/chronos/src/jepsen/chronos.clj#L91-L95 for the exact command. :)
Initially I used the defaults, but the test fails regardless of requested resources. Right now I'm using 1 mem, 1 disk, and 0.001 CPU.
Here's the Mesos master config and the Mesos slave config. I'm just using the defaults from the mesosphere debian Chronos packages for Chronos's setup.
Like I said, I haven't started introducing failures yet. This is with a totally-connected, ~zero-latency network and no host/process failures. :)

If it matters, the node hosting all 5 LXC nodes is a 48-way Xeon with 128GB of RAM. We're not really under CPU/memory pressure yet. :)
Here's the config for job 36:

```json
{"async": false,
 "schedule": "R5/2015-08-05T21:57:42.000Z/PT24S",
 "disk": 1,
 "mem": 1,
 "name": 36,
 "command": "MEW=$(mktemp -p /tmp/chronos-test/); echo \"36\" >> $MEW; date -u -Ins >> $MEW; sleep 6; date -u -Ins >> $MEW;",
 "epsilon": "PT6S",
 "owner": "jepsen@jepsen.io",
 "scheduleTimeZone": "UTC",
 "cpus": 0.001}
```
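For anyone skimming: that `schedule` field is an ISO 8601 repeating interval. A quick Python sketch (my own illustration, not Jepsen or Chronos code; it only handles the `R<n>/<start>/PT<n>S` shape used in this thread) expands it into the expected start times:

```python
from datetime import datetime, timedelta

def expand_schedule(schedule):
    """Expand an ISO 8601 repeating interval such as
    'R5/2015-08-05T21:57:42.000Z/PT24S' into scheduled start times.
    Only handles whole-second periods (PT<n>S), which is all we need here."""
    repeats, start, period = schedule.split("/")
    count = int(repeats[1:])                 # 'R5'    -> 5 runs
    secs = int(period[2:-1])                 # 'PT24S' -> 24 seconds
    t0 = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S.%fZ")
    return [t0 + timedelta(seconds=i * secs) for i in range(count)]

runs = expand_schedule("R5/2015-08-05T21:57:42.000Z/PT24S")
# First run at 21:57:42, last at 21:59:18 -- the same window the issue
# body describes for job 36.
```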
@aphyr I also notice a lot of different framework IDs in the log. It seems like you've used this cluster before, and the frameworks that used to connect to the master are gone. I don't think this is related, but the logs are filled with other tasks' statuses that it tries to forward back, and I can't tell which framework is yours.
Good catch! I've updated the tests to clear the Mesos directories on each run; here's another failing case (job 5):

```clj
{:valid? false,
 :job {:name 5,
       :start #<DateTime 2015-08-06T17:49:25.966Z>,
       :count 2,
       :duration 6,
       :epsilon 25,
       :interval 48},
 :solution nil,
 :targets ([#<DateTime 2015-08-06T17:49:25.966Z>
            #<DateTime 2015-08-06T17:49:55.966Z>]
           [#<DateTime 2015-08-06T17:50:13.966Z>
            #<DateTime 2015-08-06T17:50:43.966Z>]),
 :extra nil,
 :complete [{:node "n5",
             :name 5,
             :start #<DateTime 2015-08-06T17:49:27.956Z>,
             :end #<DateTime 2015-08-06T17:49:33.958Z>}],
 :incomplete []}
```
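To make the analysis output above concrete: each `:targets` window appears to be `[t, t + :epsilon + 5s]`, with `t` stepping by `:interval`. A hedged Python reconstruction (the function name and the 5-second forgiveness constant are inferred from this thread, not read from Jepsen's source):

```python
from datetime import datetime, timedelta

def target_windows(start, interval, epsilon, count, forgiveness=5):
    """Windows [t, t + epsilon + forgiveness] in which run i should begin.
    forgiveness=5 is inferred from the extra 5s of slop described in the
    issue body; not read directly from the test's source."""
    width = timedelta(seconds=epsilon + forgiveness)
    starts = [start + timedelta(seconds=i * interval) for i in range(count)]
    return [(t, t + width) for t in starts]

job_start = datetime(2015, 8, 6, 17, 49, 25, 966000)
windows = target_windows(job_start, interval=48, epsilon=25, count=2)
# Reproduces the two :targets windows in the analysis above:
# 17:49:25.966..17:49:55.966 and 17:50:13.966..17:50:43.966
```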
Hi @aphyr, I tried launching your JSON on my cluster, and I was able to reproduce the issue. I believe the problem is your repeat interval. When I increase it to "schedule":"R5/2015-08-06T18:37:42.000Z/PT50S", it works consistently for me and repeats 5 times as expected. With the 24s repeat interval, it sometimes stops after running 3 times. By default Chronos only allows one instance of a job to run at a time, so if the repeat interval isn't long enough, some runs end up not happening. For more details on why a longer repeat interval is needed, see the similar discussion in #476. Sorry for the confusing user experience here. Please retry with a larger repeat interval and let us know how it goes.
You should shorten the scheduler horizon (the flag is

Thanks @brndnmtthws. Yes, @aphyr, tuning the flag Brenden mentioned may let you avoid increasing the repeat interval I suggested above.
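To build intuition for why the horizon matters, here is a deliberately crude toy model (entirely my assumption, not Chronos internals, and the 60-second horizon in the example is hypothetical): if a scheduler only wakes every `horizon` seconds and launches at most one pending instance of a job per wakeup, runs due more often than the horizon get dropped.

```python
def runs_launched(interval, count, horizon):
    """Toy model, NOT Chronos internals: a scheduler that wakes every
    `horizon` seconds and launches at most one pending instance of a
    job per wakeup. Runs due closer together than the horizon collapse."""
    due = [i * interval for i in range(count)]
    launched, t = 0, 0
    while t <= due[-1] + horizon:
        if any(t - horizon < d <= t for d in due):
            launched += 1  # one instance per wakeup; extras are dropped
        t += horizon
    return launched

# With a 24s interval and a hypothetical 60s horizon, only 3 of the 5
# runs launch in this model; with a 1s horizon, all 5 do.
```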
This was a problem in earlier versions of the test, so I modified the generator to construct intervals that can never overlap. In this case, for instance, there's a full 12 seconds between the latest possible completion time for the first run, and the earliest possible start time for the second run.
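The 12-second figure checks out arithmetically for job 5 above (variable names are mine; the 5 seconds of epsilon-forgiveness comes from the test description in the issue body):

```python
# Job 5's parameters from the analysis earlier in the thread.
interval, duration, epsilon = 48, 6, 25
forgiveness = 5  # the extra 5s of slop the test allows

# Latest a run can finish, relative to its scheduled start:
latest_completion = epsilon + forgiveness + duration   # 36s
# Earliest the next run can begin:
earliest_next_start = interval                         # 48s

gap = earliest_next_start - latest_completion
# gap == 12: runs can never overlap.
```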
Aha! I'll give that a shot. Can I suggest that Chronos emit an error when asked to schedule operations at a finer granularity than the scheduler horizon, instead of accepting those jobs and failing to run them?

Great idea. Patches are welcome, too.

Looks like dropping the scheduler horizon to 1 second allows the test to pass in the happy case. Thank you! I'm gonna go ahead and close this for now. :)

Happy to help. You should stop by our offices for lunch some time and we can chat about Chronos and other computer-y things.
I'm testing Chronos with Jepsen--we'd like guarantees that jobs will be completed on schedule, even in the presence of network and node failures. I've had a spot of trouble getting started: jobs aren't invoked on schedule even before I've started inducing failure.
This test creates a five-node cluster: (n1, n2, n3, n4, n5). ZK (3.4.5+dfsg-2) runs on all five nodes, Mesos (0.23.0-1.0.debian81) masters run on the first three nodes (n1, n2, n3), and Mesos slaves run on n4 and n5. Chronos (2.3.4-1.0.81.debian77) runs on all five nodes.
We generate and insert a handful of jobs into Chronos for 300 seconds, then wait 120 seconds to watch the system in a steady state, and finally read the results. Each job consists of a small shell script which allocates a tmpfile, logs the job name and start time, sleeps for `:duration` seconds, then logs the completion time. To read the final state, we take the union of all files written by every job. When a run writes its final time we consider it `complete`--if it only logs the initial time we consider it `incomplete`.

Job schedules are selected such that the `:interval` between jobs is always greater than their `:duration` plus `:epsilon` plus an additional `epsilon-forgiveness`--in testing, it looks like Chronos will regularly slip deadlines by a second or two, so we allow an extra 5 seconds of slop. This should guarantee that jobs never run concurrently. All node clocks are perfectly synchronized.
To verify correctness, we expand the job spec into a sequence of `:target` intervals: a vector of `[start-time end-time]` during which the job should have begun. We only compute targets which would have been complete at the time of the final read. Then we take all completed invocations and attempt to find a unique `:solution` mapping targets to runs, using the Choco constraint solver. A job is considered invalid if some target could not be satisfied by a completed run.

For example, take job 36 (whose config is quoted earlier in this thread): it should have run 5 times, during a 6 second window recurring every 24 seconds, from 21:57:42 to 21:59:18. However, only the first 3 targets are satisfied--Chronos never invokes the final 2. You can download the full ZK and Mesos logs, operation history, and analysis here.
You can reproduce these results by cloning https://github.com/aphyr/jepsen/ at commit c8a9acb274074efcb8bc689983f30ecac8f51605, setting up Jepsen (see https://github.com/aphyr/jepsen/tree/master/jepsen or https://github.com/abailly/jepsen-vagrant), and running `lein test` in `jepsen/chronos`. I'm also happy to adjust the tests and re-run with newer versions.

Should I adjust our expectations for the test? Is there additional debugging information that might be helpful? Anything else I can tell you about the test structure or environment? I'd like to get this figured out. :)