Jobs may not run on schedule #508

Closed
aphyr opened this Issue Aug 5, 2015 · 14 comments

@aphyr

aphyr commented Aug 5, 2015

I'm testing Chronos with Jepsen--we'd like guarantees that jobs will be completed on schedule, even in the presence of network and node failures. I've had a spot of trouble getting started: jobs aren't invoked on schedule even before I've started inducing failure.

This test creates a five-node cluster: (n1, n2, n3, n4, n5). ZK (3.4.5+dfsg-2) runs on all five nodes, Mesos (0.23.0-1.0.debian81) masters run on the first three nodes (n1, n2, n3), and Mesos slaves run on n4 and n5. Chronos (2.3.4-1.0.81.debian77) runs on all five nodes.

We generate and insert a handful of jobs into Chronos for 300 seconds, then wait 120 seconds to watch the system in a steady state, and finally read the results. Each job consists of a small shell script which allocates a tmpfile, logs the job name and start time, sleeps for :duration seconds, then logs the completion time. To read the final state, we take the union of all files written by every job. When a run writes its final time we consider it complete--if it only logs the initial time we consider it incomplete.

Job schedules are selected such that the :interval between jobs is always greater than their :duration plus :epsilon plus an additional epsilon-forgiveness--in testing, it looks like Chronos will regularly slip deadlines by a second or two, so we allow an extra 5 seconds slop. This should guarantee that jobs never run concurrently.
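That non-overlap condition is simple to state mechanically. A minimal sketch (times in seconds, with the 5-second slop described above; the function name is mine, not from the test code):

```python
def schedules_cannot_overlap(interval, duration, epsilon, slop=5):
    """A run may start as late as `epsilon + slop` seconds after its
    scheduled time and then runs for `duration` seconds, so successive
    runs can never overlap when the interval exceeds their sum."""
    return interval > duration + epsilon + slop

print(schedules_cannot_overlap(24, 6, 6))   # True: e.g. duration 6, epsilon 6
print(schedules_cannot_overlap(24, 6, 13))  # False: runs could overlap
```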

All node clocks are perfectly synchronized.

To verify correctness, we expand the job spec into a sequence of :target intervals: a vector of [start-time end-time] during which the job should have begun. We only compute targets which would have been complete at the time of the final read. Then, we take all completed invocations, and attempt to find a unique :solution mapping targets to runs, using the Choco constraint solver. A job is considered invalid if some target could not be satisfied by a completed run. For example:

{:valid? false,
   :job
   {:name 36,
    :start #<DateTime 2015-08-05T21:57:42.512Z>,
    :count 5,
    :duration 6,
    :epsilon 6,
    :interval 24},
   :solution nil,
   :targets
   ([#<DateTime 2015-08-05T21:57:42.512Z>
     #<DateTime 2015-08-05T21:57:53.512Z>]
    [#<DateTime 2015-08-05T21:58:06.512Z>
     #<DateTime 2015-08-05T21:58:17.512Z>]
    [#<DateTime 2015-08-05T21:58:30.512Z>
     #<DateTime 2015-08-05T21:58:41.512Z>]
    [#<DateTime 2015-08-05T21:58:54.512Z>
     #<DateTime 2015-08-05T21:59:05.512Z>]
    [#<DateTime 2015-08-05T21:59:18.512Z>
     #<DateTime 2015-08-05T21:59:29.512Z>]),
   :extra nil,
   :complete
   [{:name 36,
     :start #<DateTime 2015-08-05T21:57:43.619Z>,
     :end #<DateTime 2015-08-05T21:57:49.621Z>}
    {:name 36,
     :start #<DateTime 2015-08-05T21:58:06.665Z>,
     :end #<DateTime 2015-08-05T21:58:12.666Z>}
    {:name 36,
     :start #<DateTime 2015-08-05T21:58:31.668Z>,
     :end #<DateTime 2015-08-05T21:58:37.670Z>}],
   :incomplete []},

This job should have run 5 times, during a 6 second window recurring every 24 seconds, from 21:57:42 to 21:59:18. However, only the first 3 targets are satisfied--Chronos never invokes the final 2. You can download the full ZK, Mesos logs, operation history, and analysis here.
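The matching step can be sketched without a constraint solver: each target window must be claimed by a distinct completed run whose start time falls inside it. A toy backtracking version (the real test uses Choco; the names here are illustrative), applied to the data above with times truncated to whole seconds:

```python
from datetime import datetime, timedelta

def match_targets(targets, runs):
    """Assign each target window a distinct run whose start time falls
    inside it. Returns one run index per target, or None when no
    complete assignment exists. Plain backtracking: fine for the
    handful of runs per job in this test."""
    def backtrack(i, used):
        if i == len(targets):
            return []
        lo, hi = targets[i]
        for j, run_start in enumerate(runs):
            if j not in used and lo <= run_start <= hi:
                rest = backtrack(i + 1, used | {j})
                if rest is not None:
                    return [j] + rest
        return None
    return backtrack(0, frozenset())

# Job 36's data from above, truncated to whole seconds: five target
# windows (11s wide, 24s apart) but only three completed runs.
t0 = datetime(2015, 8, 5, 21, 57, 42)
targets = [(t0 + timedelta(seconds=24 * k),
            t0 + timedelta(seconds=24 * k + 11)) for k in range(5)]
runs = [datetime(2015, 8, 5, 21, 57, 43),
        datetime(2015, 8, 5, 21, 58, 6),
        datetime(2015, 8, 5, 21, 58, 31)]

print(match_targets(targets, runs))  # None: the last two targets go unmet
```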

You can reproduce these results by cloning https://github.com/aphyr/jepsen/, at commit c8a9acb274074efcb8bc689983f30ecac8f51605, setting up Jepsen (see https://github.com/aphyr/jepsen/tree/master/jepsen, or https://github.com/abailly/jepsen-vagrant), and running lein test in jepsen/chronos. I'm also happy to adjust the tests and re-run with newer versions.

Should I adjust our expectations for the test? Is there additional debugging information that might be helpful? Anything else I can tell you about the test structure or environment? I'd like to get this figured out. :)

@tnachen

Member

tnachen commented Aug 6, 2015

Hi @aphyr, what kind of tasks are you passing to Chronos, and how many resources did you specify?
Also, glancing quickly at the master log, I notice places where it detects that the framework is lost and fails all of its tasks. There are settings on the framework side that tell Mesos to wait a certain amount of time for the framework to reconnect (the failover timeout).
So my question is: what configuration did you use to launch Chronos?
And what type of failures are you injecting in this case?

@elingg

Member

elingg commented Aug 6, 2015

If you could also send over your exact JSON job definition in Chronos, it will be easier for us to check it and try to reproduce the problem.

@aphyr

aphyr commented Aug 6, 2015

what kind of tasks are you passing to chronos to run

Er, like I said, each job consists of a small shell script which allocates a tmpfile, logs the job name and start time, sleeps for :duration seconds, then logs the completion time.

See https://github.com/aphyr/jepsen/blob/master/chronos/src/jepsen/chronos.clj#L91-L95 for the exact command. :)

and how much resources did you specify?

Initially I used the defaults, but the test fails regardless of requested resources. Right now I'm using 1 mem, 1 disk, and 0.001 CPU.

So my question is what's configuration you used to launch chronos?

Here's the Mesos master config and the Mesos slave config. I'm just using the defaults from the mesosphere debian Chronos packages for Chronos's setup.

And type of failures are you injecting in this case?

Like I said, I haven't started introducing failures yet. This is with a totally-connected, ~zero-latency network and no host/process failures. :)

@aphyr

aphyr commented Aug 6, 2015

If it matters, the machine hosting all 5 LXC nodes is a 48-way Xeon with 128GB of RAM. We're not really under CPU/memory pressure yet. :)

@aphyr

aphyr commented Aug 6, 2015

If you could also please send over your exact JSON job definition in Chronos, it will be easy for us to check it and also try to reproduce the problem

Here's the config for job 36:

{"async":false,
"schedule":"R5/2015-08-05T21:57:42.000Z/PT24S",
"disk":1,
"mem":1,
"name":36,
"command":"MEW=$(mktemp -p /tmp/chronos-test/); echo \"36\" >> $MEW; date -u -Ins >> $MEW; sleep 6; date -u -Ins >> $MEW;",
"epsilon":"PT6S",
"owner":"jepsen@jepsen.io",
"scheduleTimeZone":"UTC",
"cpus":0.001}
@tnachen

Member

tnachen commented Aug 6, 2015

@aphyr I also notice a lot of different framework IDs in the log; it seems like you've used this cluster before, and the frameworks that used to be connected to the master are gone. I don't think this is related, but the logs are filled with status updates for other tasks that the master tries to forward, and I can't tell which framework is yours.
Can you purge all the work directories (default /tmp/mesos) on the masters, slaves, and ZooKeeper nodes and retry your experiment? We'll try it here as soon as we can get to it, too. A cleaner log will make it easier to figure out what's wrong.

@aphyr

aphyr commented Aug 6, 2015

Good catch! I've updated the tests to clear the mesos directories on each run: here's another failing case.

   5
   {:valid? false,
    :job
    {:name 5,
     :start #<DateTime 2015-08-06T17:49:25.966Z>,
     :count 2,
     :duration 6,
     :epsilon 25,
     :interval 48},
    :solution nil,
    :targets
    ([#<DateTime 2015-08-06T17:49:25.966Z>
      #<DateTime 2015-08-06T17:49:55.966Z>]
     [#<DateTime 2015-08-06T17:50:13.966Z>
      #<DateTime 2015-08-06T17:50:43.966Z>]),
    :extra nil,
    :complete
    [{:node "n5",
      :name 5,
      :start #<DateTime 2015-08-06T17:49:27.956Z>,
      :end #<DateTime 2015-08-06T17:49:33.958Z>}],
    :incomplete []},
@elingg

Member

elingg commented Aug 6, 2015

Hi @aphyr, I tried launching your JSON on my cluster, and I was able to reproduce the issue. I believe the issue is your repeat interval. When I increase the repeat interval to "schedule":"R5/2015-08-06T18:37:42.000Z/PT50S", it works consistently for me and repeats 5 times as expected. With the 24s repeat interval, it sometimes stops after running 3 times. By default, Chronos only allows one instance of a job to run at a time, so if the repeat interval is not long enough, some jobs end up not running. For more details on why a longer repeat interval is needed, see a similar discussion in #476. Sorry for the user-experience confusion here. Please retry with a larger repeat interval and let us know how it goes.

@brndnmtthws

Member

brndnmtthws commented Aug 6, 2015

You should shorten the scheduler horizon (the flag is --schedule_horizon) from the default of 60s to something like 1s to provide the scheduling granularity your test needs.
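For concreteness, a hedged sketch of a Chronos launch line using that flag. The binary name and the master/ZooKeeper addresses are assumptions based on this test's topology; adjust them for your install:

```shell
# Illustrative only: lower the scheduling horizon from the 60s default to 1s.
chronos --master zk://n1:2181,n2:2181,n3:2181/mesos \
        --zk_hosts n1:2181,n2:2181,n3:2181 \
        --schedule_horizon 1
```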

@elingg

Member

elingg commented Aug 6, 2015

Thanks @brndnmtthws. Yes, @aphyr, tuning the flag Brenden mentioned may let you avoid increasing the repeat interval as I suggested above.

@aphyr

aphyr commented Aug 6, 2015

Chronos by default only allows one instance of a job to run at a time so if there is not a long enough repeat interval some jobs end up not running.

This was a problem in earlier versions of the test, so I modified the generator to construct intervals that can never overlap. In this case, for instance, there's a full 12 seconds between the latest possible completion time for the first run, and the earliest possible start time for the second run.

You should shorten the scheduler horizon (the flag is --schedule_horizon) from the default of 60s, to something like 1s to provide the scheduling granularity your test needs.

Aha! I'll give that a shot. Can I suggest that Chronos emit an error when asked to schedule operations at a finer granularity than the scheduler horizon, instead of accepting those jobs and failing to run them?
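For the record, the 12-second figure follows from job 36's parameters (interval 24s, epsilon 6s, duration 6s), counting only epsilon, not the extra 5s slop:

```python
interval, duration, epsilon = 24, 6, 6  # job 36's parameters

# A run starts no later than epsilon after its scheduled time and takes
# `duration` seconds, so the latest it can finish is epsilon + duration.
latest_completion = epsilon + duration
gap = interval - latest_completion
print(gap)  # 12 seconds of headroom before the next run may start
```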

@brndnmtthws

Member

brndnmtthws commented Aug 6, 2015

Can I suggest that Chronos emit an error when asked to schedule operations at a finer granularity than the scheduler horizon, instead of accepting those jobs and failing to run them?

Great idea. Patches are welcome, too.

@aphyr

aphyr commented Aug 6, 2015

Looks like dropping the scheduler horizon to 1 second allows the test to pass in the happy case. Thank you! I'm gonna go ahead and close this for now. :)

@aphyr aphyr closed this Aug 6, 2015

@brndnmtthws

Member

brndnmtthws commented Aug 7, 2015

Happy to help. You should stop by our offices for lunch some time and we can chat about Chronos and other computer-y things.
