Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

java.lang.InterruptedException race in chiming functions. #37

Open
dazld opened this issue Sep 7, 2020 · 42 comments
Open

java.lang.InterruptedException race in chiming functions. #37

dazld opened this issue Sep 7, 2020 · 42 comments
Assignees

Comments

@dazld
Copy link

dazld commented Sep 7, 2020

There seems to be a race between closing a chime and the chiming function invocation that can lead to a java.lang.InterruptedException being thrown when functions are awaiting something coming from, for example, a future, or have consumed a function that does this somewhere along the call stack.

It'd be nice if either the chiming function was never called, or any inflight functions were allowed to complete before closing.

Example:

(let [now (Instant/now)
      chiming (chime/chime-at (chime/periodic-seq now (Duration/ofSeconds 1))
                              (fn [time]
                                (log/info :future-task @(future (count {})))
                                (log/info :chime (str time)))
                              {:error-handler (fn [e]
                                                (log/error :chime-failed e)
                                                true)
                               :on-finished #(log/info :done)})]
  (Thread/sleep 1000)
  (.close chiming))

outputs:

2020-09-07 15:51:25.548 INFO  :future-task 0
2020-09-07 15:51:25.549 INFO  :chime 2020-09-07T14:51:25.547924Z
2020-09-07 15:51:26.550 INFO  :done
2020-09-07 15:51:26.550 ERROR :chime-failed #error {
 :cause nil
 :via
 [{:type java.lang.InterruptedException
   :message nil
   :at [java.util.concurrent.FutureTask awaitDone FutureTask.java 418]}]
 :trace
 [[java.util.concurrent.FutureTask awaitDone FutureTask.java 418]
  [java.util.concurrent.FutureTask get FutureTask.java 190]
  [clojure.core$deref_future invokeStatic core.clj 2300]
  [clojure.core$future_call$reify__8454 deref core.clj 6974]
  [clojure.core$deref invokeStatic core.clj 2320]
  [clojure.core$deref invoke core.clj 2306]
  ...
  [chime.core$chime_at$schedule_loop__13568$task__13572$fn__13573 invoke core.clj 91]
  [chime.core$chime_at$schedule_loop__13568$task__13572 invoke core.clj 90]
  [clojure.lang.AFn run AFn.java 22]
  [java.util.concurrent.Executors$RunnableAdapter call Executors.java 515]
  [java.util.concurrent.FutureTask run FutureTask.java 264]
  [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask run ScheduledThreadPoolExecutor.java 304]
  [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1128]
  [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 628]
  [java.lang.Thread run Thread.java 834]]}

Note, might have to run this a couple of times to see the exception, as it doesn't always happen.

@jarohen
Copy link
Owner

jarohen commented Sep 9, 2020

Thanks @dazld, will have a look 😄

@dazld
Copy link
Author

dazld commented Sep 9, 2020

No biggie! Was taking chime for a test drive and these couple of things popped out. While we’re talking, would you mind pushing a release of the version with the IPending implementation? That threw me as well, and noticed it’s fixed in master.

@jarohen
Copy link
Owner

jarohen commented Sep 12, 2020

Ah, I see what's going on here - I'm afraid this is (currently, at least) intended behaviour within Chime. If a schedule is explicitly closed, we assume the system is shutting down in some way, and interrupt the scheduling thread to give it a chance to cleanly close any resources it's opened. Interrupting a thread that's currently waiting on a deref causes the deref to throw the InterruptedException.

With that in mind, we can repro it reliably by making the future (Thread/sleep 500), say.

One example of prior art here would be the difference between ExecutorService's shutdown and shutdownNow methods - Chime behaves like the latter. I'll preserve this for close, at least, to avoid breaking users that depend on it.

I'll consider what it'd involve to add something like that to Chime, and what it's impact would be. In the meantime, maybe either a lock or a shared continue? flag would achieve the same aim?

(let [now (Instant/now)
      close-lock (Object.)
      chiming (chime/chime-at (chime/periodic-seq now (Duration/ofSeconds 1))
                              (fn [time]
                                (log/info :chime (str time))
                                (locking close-lock
                                  (log/info :future-task @(future (Thread/sleep 500)))))
                              {:error-handler (fn [e]
                                                (log/error :chime-failed e)
                                                true)
                               :on-finished #(log/info :done)})]
  (Thread/sleep 250)
  (locking close-lock
    (.close chiming)))

Re the 0.3.3 release - I've pushed a 0.3.3-SNAPSHOT up with the other fixes.

Thanks again for raising this issue :)

@jimpil
Copy link
Contributor

jimpil commented Dec 26, 2020

The behaviour of .shutdownNow() (as seen from consumers) can be simulated with a cancelled? (AtomicBoolean) flag, can't it? Instead of calling (f time) true, you would call (when (false? (.get cancelled?)) (f time) true). Doesn't this make moving to .shutdown() possible? Yes, it would leave the task scheduled for one more time, but that time will not actually run. I'd say that's a sensible compromise in order to get rid of the racy nature of Interrupts. Just a thought...

@jimpil
Copy link
Contributor

jimpil commented Dec 27, 2020

BTW, I tried out the approach described in my previous comment, and the only test that breaks is this (for obvious reasons):

FAIL in (test-cancelling-overrunning-task) (core_test.clj:111)
expected: (instance? InterruptedException (clojure.core/deref !error))
  actual: nil

Ran 18 tests containing 45 assertions.
1 failures, 0 errors.
Tests failed.

@jimpil
Copy link
Contributor

jimpil commented Dec 27, 2020

A complementary approach would be to let consumers choose how the executor is shutdown (i.e. :orderly VS :abruptly), defaulting to the latter for backwards-compatibility. Here is a sketch:

(let [pool (Executors/newSingleThreadScheduledExecutor thread-factory)
      !latch (promise)
      cancelled? (atom false)
      shutdown* (case exec-shutdown 
                     :orderly (fn [] 
                                (reset! cancelled? true)
                                (.shutdown pool))
                     :abruptly (fn [] (.shutdownNow pool)))] ;; default

@jimpil
Copy link
Contributor

jimpil commented Dec 28, 2020

I retract my last two comments! There is already a !latch that can be leveraged:

(when-not (realized? !latch)
   (f time)
   true)

This allows moving to .shutdown() w/o introducing anything new, and w/o breaking anyone. The only 'breaking' (if you want to call it that) change is that custom error-handlers no longer have to mimic the default one (wrt to checking for the InterruptException) - something which, to my surprise, is not documented. Isn't it a problem if a provided error-handler ignores the interrupt and returns true (in the current interrupt-based approach)?

[EDIT]: Actually it's not 'breaking', is it? The default error-handler (or any other that already mimics it) can stay as is - they just will never see an InterruptedException, so that check is effectively redundant (functionally equivalent to true). In the absence of interrupts altogether, the overall behaviour of chime doesn't change - does it?

@jimpil
Copy link
Contributor

jimpil commented Dec 28, 2020

I'm trying to digest the following comment (seen a few messages above):

One example of prior art here would be the difference between ExecutorService's shutdown and shutdownNow methods - Chime behaves like the latter. I'll preserve this for close, at least, to avoid breaking users that depend on it.

This is an implementation detail, and users shouldn't depend on it - in fact I'm struggling to see how they even could. Users of chime come to it looking for a flexible scheduling primitive. The promise/guarantee of chime can be exhausted in single sentence - your function will be called at the times you provided, unless you manually cancel (and that's very powerful btw). As long as chime can uphold that promise, what constructs you use to implement it is sort of irrelevant. Users that depend on esoteric intricacies of some implementation should know better, but in the case of chime, with such a narrow API I honestly don't see how users can possibly be depending on the difference of shutdown VS shutdownNow(). The only thing that (indirectly) cares about it is the default error-handler which special-cases InterruptedException (somewhat sneakily I might add, as I can't find any mentions that this type of exception should be given any form of special attention - say by custom error handlers).

My point is that chime can keep its exact same api/functionality, while slightly changing its implementation, which today is centered around a non-deterministic/best-effort/racy interrupt-based approach (all because of shutdownNow()), and nobody will be able to tell the difference.

Looking back at the history of the project, I can see that !latch is a kind of a recent addition (as an attempt to make sure that the on-finished callback is called only once). This means that in its conception the code didn't have this notion of are-we-done? that the !latch later brought. In a world where you don't have this explicit notion, the best you can do is play with interrupts and hope for the best. In that context, I can totally see why the original implementation started out as it did (based on interrupts). It's actually pretty neat/clever if you think what it had to work with. However, you later did add the !latch and that gives you all the control you need to move away from interrupts, while keeping (I'd argue more reliably) the don't call f after cancel semantics that users care about.

Anyway, enough waffling...i think it's pretty clear how much I want to use chime... 👍

@jarohen
Copy link
Owner

jarohen commented Dec 29, 2020

Hey @jimpil, thanks for the analysis - you've clearly gone quite deep into this 😄

If I may, to summarise, it seems there's a couple of points here:

  1. "Users may want both shutdown and shutdownNow behaviour." Indeed, I recall reading that this was the reason that ExecutorService doesn't implement Closeable - because there's no good default here. Maybe we should expose both? I do think that interrupts are the right thing to use here, in the shutdownNow case, given this is how ExecutorServices work - but we could of course be clearer about it.
  2. "The default error handler knows to handle InterruptedException specifically, which is arguably too much knowledge" - agreed. I think you've pretty much arrived at the correct cause here - the error handler has been extracted out of what Chime did internally before, and has hung on to the previous implementation detail. I wouldn't be averse to a new handler here (deprecating the old one) with clearer semantics - maybe something closer to the underlying thread's 'uncaught exception handler' would be good prior art to lean on.

Have I understood it correctly? :)

James

@jimpil
Copy link
Contributor

jimpil commented Dec 29, 2020

Hi @jarohen,

See my comments inline:

Users may want both shutdown and shutdownNow behaviour

We may arrive at this conclusion in the end, but it is kind of premature to consider this at this stage. We first need to understand the following:

  1. What is the public promise of chime?
  2. How is that achieved (i.e. what is shutdownNow contributing)?
  3. Is it an absolute guarantee, or is it more like a best-effort approach?
  4. Are there any alternatives that don't compromise the promise?

Assuming that we agree on chime's singular promise (i.e. I shall call your function at the specified times, until you manually cancel), it's an interesting exercise to go through the rest.

  1. How is that achieved (i.e. what is shutdownNow contributing)?

As I've already explained, when there is no global notion of done?, the only way you can implement the until you manually cancel part is via interrupts. Once you accept/embrace this compromise, then of course InterruptedException has to be promoted (so to speak), as that has effectively become the done? signal. In other words, the implementation makes perfect sense, assuming that interrupts are the only viable approach.

  1. Is it an absolute guarantee, or is it more like a best-effort approach?

I think you'll agree that relying on interrupts is inherently racy, and therefore by definition a best-effort approach

  1. Are there any alternatives that don't compromise the promise?

There hasn't always been one, but the moment you added the !latch you've essentially introduced a global notion of done? (a great name for the !latch btw). Guess what - you no longer need interrupts - you have your own, fully reliable way of communicating when things should stop. This alone, enables you to move to shutdown w/o lying (in the sense of not upholding your promise) to anyone.

Now, if after doing all that, you still feel that there are masochists out there who would actually prefer their core scheduling primitive to be racy/not-deterministic and centered around interrupts, despite chime never having made such a promise, and despite virtually no reason to go down that route, then OK sure perhaps you can consider exposing a racy shutdown option, and make it an official feature (or bug depending on your perspective). This is up to your discretion and taste. However, having a non-racy default (which remember, upholds the core promise) is a no-brainer.

I do think that interrupts are the right thing to use here, in the shutdownNow case, given this is how ExecutorServices work - but we could of course be clearer about it.

Yes indeed, if you go down the shutdownNow route you literally have no choice. Being clearer, and instructing users to always keep an eye out for InterruptedException in their error-handlers is super important, given the current implementation. I bet there are plenty of error-handlers out there that don't pay any attention to InterruptedException, but frankly I don't blame them. You can't expect everyone to have the time to go through the source code of every single lib they're using.

"The default error handler knows to handle InterruptedException specifically, which is arguably too much knowledge" - agreed.

Yes but this is kind of necessary. The real problem, given the current implementation is that it's not communicated to the users (e.g via the doc-string - it just mentions truthy VS falsy).

I wouldn't be averse to a new handler here (deprecating the old one) with clearer semantics - maybe something closer to the underlying thread's 'uncaught exception handler' would be good prior art to lean on.

There is no better/clearer error-handler given the current implementation. It will ultimately be an InterruptedException that is thrown (via shutdownNow). Whether you handle that via a global exception-handler (or not), doesn't change the fact that it is a racy interrupt. I would advise taking a step back, and asking yourself why does it need to be this way? (i.e. centered around interrupts)? And if there is indeed a reason that it has to be this way, how come it's not reflected in the tests (i.e. all tests pass when I switch from shutdownNow to shutdown)?

Hope that helps in your thinking process 👍

@jarohen
Copy link
Owner

jarohen commented Dec 29, 2020

Thanks

An important meta point, first of all:

Now, if after doing all that, you still feel that there are masochists out there who would actually prefer their core scheduling primitive to be racy/not-deterministic and centered around interrupts, despite chime never having made such a promise, and despite virtually no reason to go down that route, then OK sure perhaps you can consider exposing a racy shutdown option, and make it an official feature (or bug depending on your perspective). This is up to your discretion and taste. However, having a non-racy default (which remember, upholds the core promise) is a no-brainer.

Please don't dismiss people with alternative opinions in this way on my repos. You have very valid points and suggestions that I'm more than willing to consider - there is no need to detract from them by using such language and tone.

@jimpil
Copy link
Contributor

jimpil commented Dec 29, 2020

Please don't dismiss people with alternative opinions in this way on my repos. You have very valid points and suggestions that I'm more than willing to consider - there is no need to detract from them by using such language and tone.

You're right - I apologise for my poor choice of words, but I'm not actually dismissing anyone. I'm basically concluding that there may ultimately be value in accommodating those people (regardless of their opinions, or how they might look to me). By the way, I'm not a native speaker, but I totally get your point...I re-read my comment and i could have easily expressed the same point using different words. Point taken 👍

@jarohen
Copy link
Owner

jarohen commented Dec 29, 2020

I think you'll agree that relying on interrupts is inherently racy

I disagree, I'm afraid. Could you elaborate?

There hasn't always been one, but the moment you added the !latch you've essentially introduced a global notion of done?

Guess what - you no longer need interrupts - you have your own, fully reliable way of communicating when things should stop.

I disagree with these too - that we have a 'global notion of done?', that we have a 'fully reliable means of communication without interrupts', and hence that we don't need them. What happens if the user wants to shut down the schedule while the job is currently running?

Openly, I'm a little discouraged by your outright dismissal of interrupts without an acknowledgement of under what circumstances they might be useful. It doesn't fill me with confidence that you've fully considered the benefits of the opposite position before rejecting it. Similarly, I am aware of their pitfalls - so please don't assume I introduced them without due thought and consideration, nor that I am not sufficiently experienced that I don't know of alternatives.

We may arrive at this conclusion in the end, but it is kind of premature to consider this at this stage. We first need to understand the following:
...
Assuming that we agree on chime's singular promise (i.e. I shall call your function at the specified times, until you manually cancel), it's an interesting exercise to go through the rest.

Likewise. I have been working on/with Chime for ~7 years now. While I don't pretend to have gotten everything right, please do consider that I may have put some thought into it over that time! :)

@jimpil
Copy link
Contributor

jimpil commented Dec 29, 2020

I disagree, I'm afraid. Could you elaborate?

The perils of the interrupt flag, what/how affects it, and the general pains of using it correctly have been discussed extensively, and I'd like to avoid focusing on that, as it's rather secondary to my main point. Their most useful feature is the fact that blocking operations like network IO, .sleep(), .wait() etc (things out of our control) attempt to abort/cleanup when they see one. But how does that relate to chime? Does chime aspire to attempt to abort a task that is happening 'right-now' (i.e. has already started)? Which brings me to my next point...

I disagree with these too - that we have a 'global notion of done?', that we have a 'fully reliable means of communication without interrupts', and hence that we don't need them. What happens if the user wants to shut down the schedule while the job is currently running?

A user who wants to stop a side-effecting task that has already started (e.g sending an email, writing a file etc), is by definition a user who doesn't care about how much of the side-effect actually happened. I know it seems like the opposite because he/she (seems to) wants for things to stop asap, but there is no way for them to actually know, unless they are the ones driving it. Moreover, what does it mean to stop the schedule for consumers? Are we not talking about the schedule as a whole (as opposed to a single scheduled task)? If so, then what happens in the real world? What happens when you try to cancel a recurring direct-debit on the same day (let alone same instant) it's supposed to go out? I don't know about your bank, but with mine what gets cancelled is the next instalment. What happens with Amazon when you try to cancel a subscription that has already been processed (a few days before it actually arrives), or after a parcel has been dispatched? If i remember correctly, in the latter case the option to cancel is not even there anymore - you have to go the refund route. Generally speaking, there is a cutoff point with these things, after which it simply doesn't make sense to cancel. In chime's case, that cutoff moment can/should(?) be considered to be the moment something starts to run. Anything else opens the door to potential leaks, corrupted-files, half-sent or abandoned http requests, emails that reached half their usual recipients, or god knows what else - things that should generally be avoided. If you want for user-code to be able to bail-out (say from a loop) when the thread is interrupted, that is kind of a different discussion, to which I'd say that the notions of closing VS cancelling should probably be distinct.

Openly, I'm a little discouraged by your outright dismissal of interrupts without an acknowledgement of under what circumstances they might be useful. It doesn't fill me with confidence that you've fully considered the benefits of the opposite position before rejecting it.

It's rather sad to hear this, as it proves that I didn't quite get the message across. I'm not opposing, nor dismissing interrupts in the general sense. I just don't think they provide a sound foundation to carry the unscheduling semantics of close. They are a great candidate for the cancelling/interrupting notion i mentioned above, but firstly that should be driven ideally only by user-code, and secondly it should be a distinct thing (feature if you will). As far as the benefits of the opposite position go, I was the first to admit/point-out that up until the (recent) addition of !latch, relying on shutdownNow was not just desirable, but also necessary as w/o it you would always have one more side-effect produced after un-scheduling (and that's clearly a problem).

Likewise. I have been working on/with Chime for ~7 years now. While I don't pretend to have gotten everything right, please do consider that I may have put some thought into it over that time! :)

I don't know what idea you got, but I consider the work you've done with chime simply excellent - from the core idea down to the actual implementation, and ultimately I am utterly grateful for the hours you've put in. The discussion we're having here is not to have a go at you, or your work. Quite the contrary I would say - I'm itching to use chime for serious work, and as a consequence I want to understand it and improve it (if possible of course). I am more than happy to be enlightened on anything you feel is important.

Finally, I was already wondering in the back of my head, but given how much more you've re-enforced your position around interrupts and shutdownNow, I just have to ask with a straight face, otherwise I won't sleep tonight :(. How is it possible that such a design center-piece can be swapped out w/o any (real) tests breaking?

Thanks again...

@jimpil
Copy link
Contributor

jimpil commented Dec 29, 2020

It just occurred to me that the reify returned could implement Future :

Future
 (cancel [_ interrupt?]
   (when-let [^Future fut @current] ;; `current` is an atom which holds the current-job's future (as returned by `.schecule`)
      (and (not (.isCancelled fut)) 
           (.cancel fut interrupt?))))

Again, just a thought - I'll come back to this with a clean head tomorrow and re-evaluate...

@jimpil
Copy link
Contributor

jimpil commented Dec 30, 2020

This is kind of neat and it also kind of addresses #30, if I'm understanding the request correctly:

ScheduledFuture
(cancel [_ interrupt?] ;; expose interrupt facilities (opt-in)
  (when-let [^ScheduledFuture fut @current]
    (and (not (.isCancelled fut))
         (.cancel fut interrupt?))))
(getDelay [_ time-unit] ;; expose remaining time until next chime (partly addresses #30)
  (when-let [^ScheduledFuture fut @current]
    (.getDelay fut time-unit)))

@jimpil
Copy link
Contributor

jimpil commented Dec 30, 2020

I was just reading this this morning, after several years of reading the actual book. I am not going to copy-paste any sections in particular, as there is a lot of good information in there, but the key takeaway (for me at least), is that the right reaction to an interrupt, will ultimately depend on the actual method that is being interrupted. In other words, there is no one size fits all solution, and chime is certainly in no position to know what the semantics of the underlying chiming-fn are. Only the consumer knows that, hence only the consumer can make that decision.

@jarohen
Copy link
Owner

jarohen commented Dec 30, 2020

Hey, thanks for the clarifications.

I think you'll agree that relying on interrupts is inherently racy

I disagree, I'm afraid. Could you elaborate?

The perils of the interrupt flag, what/how affects it, ...

Ah, ok - apologies, I thought you were suggesting that the interrupt flag had/caused race conditions (the original subject of this issue). I definitely agree that interrupts have their perils - if we can get closer to the textbooks this can only be a good thing.

@jarohen
Copy link
Owner

jarohen commented Dec 30, 2020

If you want for user-code to be able to bail-out (say from a loop) when the thread is interrupted, that is kind of a different discussion, to which I'd say that the notions of closing VS cancelling should probably be distinct.

This is the essence of my original proposal - I recognise that there are two distinct use cases, and Chime currently only handles the shutdownNow case - you'd already convinced me that it should support both.

Generally speaking, there is a cutoff point with these things, after which it simply doesn't make sense to cancel. In chime's case, that cutoff moment can/should(?) be considered to be the moment something starts to run. Anything else opens the door to potential leaks, corrupted-files, half-sent or abandoned http requests, emails that reached half their usual recipients, or god knows what else - things that should generally be avoided.

IMO it's not really for Chime to decide this. In terms of opinions, it's a scheduler that takes a seq of times and a function, that's it. Beyond that, it's for Chime's users to determine how to handle errors, how to shutdown safely, etc. If you're supplying a long-running function (say, an overnight batch - the original use-case that led to Chime) that is interrupt-aware, you should be able to interrupt it.

Where I think we're short atm is the more graceful stop - the user should (but doesn't currently) have the option to stop the schedule after the currently running job. On balance, I'd say it'd be less intuitive to run the next schedule, even if you stopped it beforehand? In your examples with banks and Amazon, their systems may well have constraints which means they need to operate like this - they might have daily batch processing of DDs, or the parcel may well already be making its way through the logistics chain - but we don't need to impose such constraints on all Chime users, who may just be using the scheduler to send a daily email? If I stopped the 2am daily schedule at midday on the 30th, I wouldn't expect it to send an email at 2am on the 31st, say.

Finally, I was already wondering in the back of my head, but given how much more you've re-enforced your position around interrupts and shutdownNow, I just have to ask with a straight face, otherwise I won't sleep tonight :(. How is it possible that such a design center-piece can be swapped out w/o any (real) tests breaking?

Hopefully the above clarifies my thoughts here? If it helps, I wouldn't go so far as to call it a 'design center-piece' - it's not the core value of Chime, and I'm certainly open to iterating on it. We do have a test for the current behaviour, which I think you've run into - test-cancelling-overrunning-task - changing the semantics to shutdown breaks this test.

@jarohen
Copy link
Owner

jarohen commented Dec 30, 2020

I was just reading this this morning, after several years of reading the actual book.

I wasn't aware of this checklist - thanks :)

@jarohen
Copy link
Owner

jarohen commented Dec 30, 2020

Re ScheduledFuture. I like the idea in theory - in practice it seems to have a couple of tradeoffs?

  • getDelay - would it be preferable to have an API that returns the next time, and let the user calculate the delay, if need be? By the time we've returned the delay, it's already out of date.
  • cancel - this introduces the question of whether I'm cancelling just the current job or not, potentially? I mean, this is potentially a use-case - 'I'd like to cancel the current job but not the overall schedule' (indeed, a question you mention further up). I'm also not mad keen on boolean params in cases like these, I always have to look them up - which is why I'm drawn to something like shutdown/shutdownNow. The word 'shutdown' (to me, at least) is a clearer indication that we're shutting down the whole thing.

Maybe this could be a chime-specific protocol, but with semantics/naming taken from ExecutorService and friends?

@jarohen
Copy link
Owner

jarohen commented Dec 30, 2020

It's rather sad to hear this, as it proves that I didn't quite get the message across. I'm not opposing, nor dismissing interrupts in the general sense.

I don't know what idea you got, ...

FWIW, it was about a handful of sentences in your original message that gave an overall impression of being a little dismissive and patronising. I appreciate the clarifications, thanks - won't dwell on them any further. Happy to file under 'the perils of text-only communication' and move on with improving Chime :)

@jimpil
Copy link
Contributor

jimpil commented Dec 30, 2020

This is the essence of my original proposal - I recognise that there are two distinct use cases, and Chime currently only handles the shutdownNow case - you'd already convinced me that it should support both.

First of all, it's great to hear that there is some form of convergence. Secondly, both this comment of yours, and a later one make it sound like chime aimed/aspired to accommodate interrupt-aware loops from day one. If that is indeed the case, should this be mentioned somewhere in the README, or even better the doc-string?

IMO it's not really for Chime to decide this. In terms of opinions, it's a scheduler that takes a seq of times and a function, that's it. Beyond that, it's for Chime's users to determine how to handle errors, how to shutdown safely, etc

It is great that you have this opinion, but unfortunately chime doesn't... As we've established, currently chime hardcodes abrupt shutdowns, and its default error-handler correctness comes down to the fact that it knows more than your average error-handler. Pretty much none of the decisions you mentioned are left for the user to make :(.

If you're supplying a long-running function (say, an overnight batch - the original use-case that led to Chime) that is interrupt-aware, you should be able to interrupt it.

I couldn't agree more! You should definitely be able to interrupt it, but you should be explicit about what you're trying to do. The semantics of close have to be macroscopic (i.e. if chime-at means 'schedule', close on the return value must mean 'unschedule'). Trying to interrupt something that is (maybe) happening 'right-now' has little to do with the overall schedule (in my head). As i think you've correctly identified yourself, conflating the two in a single notion of cancel leads to compromises.

Where I think we're short atm is the more graceful stop - the user should (but doesn't currently) have the option to stop the schedule after the currently running job. On balance, I'd say it'd be less intuitive to run the next schedule, even if you stopped it beforehand?

I'm not following 100% what you mean here so let's try to use a minimal example using 3 successive times (t1, t2, t3).

  • User gracefully shutdowns the pool (somehow) and it happens to fall exactly on t1 => job at t1 finishes and pool shuts down
  • User gracefully shutdowns the pool sometime between t1 and t2 => job remains scheduled to run at t2 (but won't actually execute when the time comes as it will be protected by the latch)

Do you see any problems with the above? What do you find less intuitive?

In your examples with banks and Amazon, their systems may well have constraints which means they need to operate like this - they might have daily batch processing of DDs, or the parcel may well already be making its way through the logistics chain - but we don't need to impose such constraints on all Chime users, who may just be using the scheduler to send a daily email?

The only purpose of these examples was to showcase how tricky/leaky it is to express things like i want to cancel something that is already happening. A parcel having started to make its way through the logistics chain is not too different than having opened a Socket.

If I stopped the 2am daily schedule at midday on the 30th, I wouldn't expect it to send an email at 2am on the 31st, say.

Again, I couldn't agree more, but it kinda worries me that you have to state this at all, so could you elaborate if you don't mind? Have I said, or proposed something that would cause this?

We do have a test for the current behaviour, which I think you've run into - test-cancelling-overrunning-task - changing the semantics to shutdown breaks this test.

The only aspect of that test that fails is the (instance? InterruptedException @!error), which is simply the result of the error-handler never being called (because there was no InterruptedException). How is that meaningful breakage?

I wasn't aware of this checklist - thanks :)

This book is a goldmine 👍

getDelay - would it be preferable to have an API that returns the next time, and let the user calculate the delay, if need be? By the time we've returned the delay, it's already out of date.

My understanding from the Java docs is that this is calculated on-the-fly every time you call it, so it's always correct. That said, I would need to look at the actual source to be 100% sure. To be clear, I'm not opposing what you suggested - I'm just saying that there is a good chance this does exactly what a consumer might want (to know how long until the next chime). After all, he/she does have access to the times (he/she provided them), and there is a utility fn (without-past-times) to use against them if someone wants to find out all the upcoming chime times.

cancel - this introduces the question of whether I'm cancelling just the current job or not, potentially? I mean, this is potentially a use-case - 'I'd like to cancel the current job but not the overall schedule' (indeed, a question you mention further up). I'm also not mad keen on boolean params in cases like these, I always have to look them up - which is why I'm drawn to something like shutdown/shutdownNow. The word 'shutdown' (to me, at least) is a clearer indication that we're shutting down the whole thing.

No such dilemma introduced. If you can call both close and cancel on something, then they have to mean different things. If you accept/agree that close means 'unschedule the whole thing', then cancel can only ever mean abort the current thing (potentially with interrupts). Doing both means calling both (in which order is again a different question). The matter of boolean params is kind of superficial I find. One could simply call clojure.core/future-cancel (or any other custom wrapper) and forget about the boolean param.

Maybe this could be a chime-specific protocol, but with semantics/naming taken from ExecutorService and friends?

I haven't explored this idea, simply because I didn't need to. You have already made the decision to implement AutoCloseable, and ScheduledFuture not only has the abstractions i needed, but it also happens to be the type .schedule returns! Staying within the java.util.concurrent abstractions seems like a good idea to me, but as I said i haven't explored your idea so take everything with a grain of salt.

@jimpil
Copy link
Contributor

jimpil commented Dec 30, 2020

Re getDelay - w/o having looked at the source yet, this SO post kind of fills me with confidence regarding what I said earlier.

@jarohen
Copy link
Owner

jarohen commented Dec 30, 2020

First of all, it's great to hear that there is some form of convergence. Secondly, both this comment of yours, and a later one make it sound like chime aimed/aspired to accommodate interrupt-aware loops from day one. If that is indeed the case, should this be mentioned somewhere in the README, or even better the doc-string?

Yep, let's add it to the doc-string of the abrupt shutdown 👍

It is great that you have this opinion, but unfortunately chime doesn't... As we've established, currently chime hardcodes abrupt shutdowns, and its default error-handler correctness comes down to the fact that it knows more than your average error-handler. Pretty much none of the decisions you mentioned are left for the user to make :(.

Right - propose that we fix this by exposing a separate graceful shutdown, and take the knowledge of the InterruptedException away from the error handler.

I couldn't agree more! You should definitely be able to interrupt it, but you should be explicit about what you're trying to do. The semantics of close have to be macroscopic (i.e. if chime-at means 'schedule', close on the return value must mean 'unschedule'). Trying to interrupt something that is (maybe) happening 'right-now' has little to do with the overall schedule (in my head). As i think you've correctly identified yourself, conflating the two in a single notion of cancel leads to compromises.

Agreed, close should close the whole schedule. Think it'll be clearer to expose two new shutdown methods (à la ExecutorService) - if we then alias close to shutdown-now (and make it clear that that's what it does/has done all along) we can keep it backwards compatible.

While I remember, I haven't yet considered the impact of this on the core.async wrapper - TODO 🤔

I'm not following 100% what you mean here so let's try to use a minimal example using 3 successive times (t1, t2, t3).

* User gracefully shutdowns the pool (somehow) and it happens to fall exactly on `t1` => job at `t1` finishes and pool shuts down
* User gracefully shutdowns the pool sometime between `t1` and `t2` => job remains scheduled to run at `t2` (but won't actually execute when the time comes as it will be protected by the latch)

Do you see any problems with the above? What do you find less intuitive?

If I stopped the 2am daily schedule at midday on the 30th, I wouldn't expect it to send an email at 2am on the 31st, say.

Again, I couldn't agree more, but it kinda worries me that you have to state this at all, so could you elaborate if you don't mind? Have I said, or proposed something that would cause this?

Right - it's the 'job remains scheduled to run at t2' that got me here but I see you're covering that with the latch - fair. Might be worth shutting down the scheduler eagerly though, could be a while til the next run.

We do have a test for the current behaviour, which I think you've run into - test-cancelling-overrunning-task - changing the semantics to shutdown breaks this test.

The only aspect of that test that fails is the (instance? InterruptedException @!error), which is simply the result of the error-handler never being called (because there was no InterruptedException). How is that meaningful breakage?

Ah, ok - yep. As a first pass, this test could check that the sleep is interrupted, rather than relying on the error handling behaviour. After that, there's probably better ways of expressing the desired behaviour too.

My understanding from the Java docs is that [getDelay] is calculated on-the-fly every time you call it, so it's always correct. That said, I would need to look at the actual source to be 100% sure. To be clear, I'm not opposing what you suggested - I'm just saying that there is a good chance this does exactly what a consumer might want (to know how long until the next chime).

It's always correct, but only at the exact moment it returns it's calculated - after that, it's out of date, and the user doesn't know the 'now' basis that was used in the calculation. If we pass the raw date back to the user instead, they're in charge of choosing a fixed 'now' to make that calculation.

After all, he/she does have access to the times (he/she provided them), and there is a utility fn (without-past-times) to use against them if someone wants to find out all the upcoming chime times.

We've fixed an issue previously (#10) around holding on to the head of the lazy seq - I wouldn't want to ask the user to do the same, especially if it's infinite. We have the tail of the sequence available, and could expose it - this would be particularly suitable if the user only wanted the next time, as it wouldn't require evaluating any more of the sequence.

Let's take that one to #30 :)

cancel - this introduces the question of whether I'm cancelling just the current job or not, potentially? I mean, this is potentially a use-case - 'I'd like to cancel the current job but not the overall schedule' (indeed, a question you mention further up). I'm also not mad keen on boolean params in cases like these, I always have to look them up - which is why I'm drawn to something like shutdown/shutdownNow. The word 'shutdown' (to me, at least) is a clearer indication that we're shutting down the whole thing.

No such dilemma introduced. If you can call both close and cancel on something, then they have to mean different things. If you accept/agree that close means 'unschedule the whole thing', then cancel can only ever mean abort the current thing (potentially with interrupts). Doing both means calling both (in which order is again a different question). The matter of boolean params is kind of superficial I find. One could simply call clojure.core/future-cancel (or any other custom wrapper) and forget about the boolean param.

Your point about calling .close on the schedule object also applies to calling (.cancel schedule), I think? As in, both could legitimately mean 'unschedule the whole thing'. There's probably a name here that removes all doubt. Tbh, we don't have to introduce this in the first increment - it's an additive change, can always come after.

@jimpil
Copy link
Contributor

jimpil commented Dec 30, 2020

Hi there,

Don't be alarmed by the out-of-order responses - I just think architectural decisions will greatly inform the implementation, so I'd like to start with those.

Your point about calling .close on the schedule object also applies to calling (.cancel schedule), I think? As in, both could legitimately mean 'unschedule the whole thing'. There's probably a name here that removes all doubt. Tbh, we don't have to introduce this in the first increment - it's an additive change, can always come after.

From an architectural stsandpoint, what we are trying to do here is to separate the (macroscopic) notion of a schedule from the (microscopic) notion of an individual task, and ultimately provide the end-user sane semantics for managing both (depending on his needs which only he knows). Looking at chime from the outside, it only exposes something Closeable. Exposing something also cancellable which is assigned clear semantics (and those communicated in the doc-string) is the cleanest way to achieve this separation, and in this case particularly convenient both for you (the lib author) and the end-user, simply because this is exactly what .schedule() returns (you can't possibly give any more control to the end user). To me (a non-native speaker), the naming also falls into place kind of perfectly. (close schedule) means 'gracefully unschedule', (cancel schedule) means 'cancel the next job (potentially abruptly as it may have started)', and (doto (close schedule) (cancel true)) means do both (kind of what chime tries to do today by default). But even if my sense of English is too liberal, you can always have two little wrappers named differently. The fact of the matter is that exposing the underlying ScheduledFuture firstly empowers end-users, and secondly maps perfectly to the this new notion we need - the next (potentially already started) job ).

We've fixed an issue previously (#10) around holding on to the head of the lazy seq - I wouldn't want to ask the user to do the same, especially if it's infinite. We have the tail of the sequence available, and could expose it - this would be particularly suitable if the user only wanted the next time, as it wouldn't require evaluating any more of the sequence.

While the issue with lazy-seqs and holding their head is definitely there, and chime users should always be vigilante and aware, I don't think it's worth fixating on it, as holding the head of an infinite sequence is a bad idea in general - not just in chime. It's up to the user to re-generate the sequence of times, rather than re-using it (hence keeping a reference to it), if he wanted this feature I described. This is common practice/thinking amongst Clojure developers - you simply don't hold the head of a lazy-seq. I can kind of see the danger, but at the same time I can also see a perfectly valid way of achieving maximum flexibility w/o adding/exposing anything new.

It's always correct, but only at the exact moment it's calculated - after that, it's out of date, and the user doesn't know the 'now' basis that was used in the calculation. If we pass the raw date back to the user instead, they're in charge of choosing a fixed 'now' to make that calculation.

You've lost me here, but that could be down to me claiming that .getDelay partly addresses issue 30, which in-turn may have led you down the wrong thinking path. Let me try again...the only thing that .getDelay offers is a way to reliably ask the question how long until the next chime?. It is reliable because it is asking the source (the actual ScheduledFuture). In that sense the only now basis that makes sense is when the method ends up being called. Similarly, it cannot be out-of-date. It is returned to whoever asked for it, presumably for him/her to use it immediately. Why, for instance, don't you worry that (.between ChronoUnit/MILLIS (now clock) time) is 'out-of-date' by the time .schedule() sees it?

Right - it's the 'job remains scheduled to run at t2' that got me here but I see you're covering that with the latch - fair. Might be worth shutting down the scheduler eagerly though, could be a while til the next run.

It may be a long time, but it may not be...what does 'long-time' mean in this context anyway? And how informative this ultimately is? From a pure scheduling perspective, I'd argue that the thing that matters is that there are no more chimes. From a pure resource-management perspective, in theory you're right - the sooner we can get rid of resources that are not ultimately going to be used, the better (especially when talking about current heavy-weight Thread objejcts). However, I refuse to let that 'nice-to-have' pollute my thoughts of correctness. In fact, having seen the latest talks on Loom, I strongly believe that we will eventually all move away from this obsession to reuse as few threads as we possibly need etc etc. In such a world, where Threads are so cheap that pooling them becomes an anti-pattern, this whole resource-management concern you alluded to becomes literaly a non-issue.

I'll come back to you with implementation ideas tomorrow...As to how this all interacts with core.async I'll admit that i haven't looked at that at all. I tend to run all tests after every single change i make, trusting that there is good (or at least reasonable) test coverage. So far so good...

@jarohen
Copy link
Owner

jarohen commented Dec 31, 2020

(close schedule) means 'gracefully unschedule', (cancel schedule) means 'cancel the next job (potentially abruptly as it may have started)', and (doto (close schedule) (cancel true)) means do both (kind of what chime tries to do today by default).

While that is one valid interpretation, there is enough ambiguity here that I don't want to use these names. Let's go with shutdown and shutdown-now, given their prior art in ExecutorService. Let's also maintain close as before, now as an alias to shutdown-now and document that it follows the same semantics.

You've lost me here, but that could be down to me claiming that .getDelay partly addresses issue 30, which in-turn may have led you down the wrong thinking path. Let me try again...

This again comes across dismissive and patronising, as if the only way I'm able to think is when I'm being guided by you. I've given you the benefit of the doubt on a number of occasions, but seriously, please make this the last time.

Similarly, it cannot be out-of-date. It is returned to whoever asked for it, presumably for him/her to use it immediately.

I'm suggesting to give the user the raw data, and letting them make calculations based on it should they wish. If you give the users the raw data, they're free to make the calculation immediately, or they're free to make it later.

In any event, this isn't a blocker for this issue - let's either move it to #30 as requested or drop it for now.

Why, for instance, don't you worry that (.between ChronoUnit/MILLIS (now clock) time) is 'out-of-date' by the time .schedule() sees it?

In this case, I am the user - and I am choosing to make this calculation as close as possible to when it's used. I'd prefer to pass an Instant or equivalent, but unfortunately that's not part of the ScheduledExecutorService API.

From a pure scheduling perspective, I'd argue that the thing that matters is that there are no more chimes.

Agreed.

From a pure resource-management perspective, in theory you're right - the sooner we can get rid of resources that are not ultimately going to be used, the better (especially when talking about current heavy-weight Thread objects). However, I refuse to let that 'nice-to-have' pollute my thoughts of correctness.

Are you suggesting that closing the thread pool when we know the schedule's shut down is somehow 'incorrect'? If not, assuming it doesn't introduce excessive complexity, why would we not clean up when we can?

In fact, having seen the latest talks on Loom, I strongly believe that we will eventually all move away from this obsession to reuse as few threads as we possibly need etc etc. In such a world, where Threads are so cheap that pooling them becomes an anti-pattern, this whole resource-management concern you alluded to becomes literally a non-issue.

I'm also looking forward to Loom, and agree that it'll change the way we all think. However, on the understanding that we're not precluding ourselves from moving that way in the future as and when it becomes widely available, I don't think it applies in our current context. Even when Loom does become widely available, we still have to consider that there'll be users of Chime who aren't in a position to make that migration for a while.

@jimpil
Copy link
Contributor

jimpil commented Dec 31, 2020

While that is one valid interpretation, there is enough ambiguity here that I don't want to use these names. Let's go with shutdown and shutdown-now, given their prior art in ExecutorService. Let's also maintain close as before, now as an alias to shutdown-now and document that it follows the same semantics.

Imagine for a second that chime-at included something like the following in its doc-string:

Returns an object which:
1. Fully implements `AutoCloseable` reflecting the enduring/macroscopic aspect of the *entire schedule* 
2. Partly implements `ScheduledFuture` reflecting the ephemeral/microscopic aspect of the *next-task*
Closing the entire schedule leads towards a graceful shutdown of the underlying `ExecutorService` (i.e. unschedule), 
whereas cancelling the next-task allows for (potentially abrupt) interruption. 
Performing both is a perfectly valid option. 

Out of curiosity, would you still consider the returned object semantically ambiguous, after reading this?
If so, that's fair enough - I don't have any more arguments on this.

This again comes across dismissive and patronising, as if the only way I'm able to think is when I'm being guided by you. I've given you the benefit of the doubt on a number of occasions, but please make this the last time.

I honestly don't see how this comes off as dismissive and/or patronising. I'm admitting that I didn't quite follow your comment, but also that it could be my own fault. I have no problem apologising, but not for something I don't even understand. In fact, I'm starting to feel that you're the one being overly defensive and somewhat dismissive. Communicating such deep/complex matters is not easy, and I'm making a big effort to to be both polite and informative here. If you aren't finding this useful or informative we can stop at any time - just let me know. After all we're both spending valuable time from our lives...

I'm suggesting to give the user the raw data, and letting them make calculations based on it should they wish. If you give the users the raw data, they're free to make the calculation immediately, or they're free to make it later.

Again, fair-enough if you think this is helpful. My point is that that raw-data the user provided in the first place. To say that you're going to expose what the user gave you, just sounds weird (to me).

In this case, I am the user - and I am choosing to make this calculation as close as possible to when it's used. I'd prefer to pass an Instant or equivalent, but unfortunately that's not part of the ScheduledExecutorService API.

Again out of curiosity, what makes you think that an end-user will have a dramatically different-use case (when calling .getDelay)?

Are you suggesting that closing the thread pool when we know the schedule's shut down is somehow 'incorrect'? If not, assuming it doesn't introduce excessive complexity, why would we not clean up when we can?

If there was a way to have a graceful shutdown AND immediate cleanup, that would be just great! However, that is part of the conundrum with shutdown VS shutdownNow, isn't it? What I'm suggesting is simply that I wouldn't give up the graceful shutdown, in order to have immediate-cleanup, as it is kind of a secondary concern for a scheduling construct. If you can come up with a reliable way of achieving both, I would be very excited to hear/review.

I'm also looking forward to Loom, and agree that it'll change the way we all think. However, on the understanding that we're not precluding ourselves from moving that way in the future as and when it becomes widely available, I don't think it applies in our current context. Even when Loom does become widely available, we still have to consider that there'll be users of Chime who aren't in a position to make that migration for a while.

Sure, but it does highlight that this notion of resource-management (for threads), is only ever going to decrease in importance. People who haven't made the transition, should of course be accommodated, but we're not talking about breaking anyone's program here. We're talking about leaving a couple of objects in memory. As I mentioned, if you can come with a solution for that, I'm looking forward to hear about it.

@jarohen
Copy link
Owner

jarohen commented Dec 31, 2020

Out of curiosity, would you still consider the returned object semantically ambiguous, after reading this?
If so, that's fair enough - I don't have any more arguments on this.

Am I understanding correctly that you propose that we change close to be the graceful shutdown, so that cancel becomes the abrupt shutdown? If so, this would be a breaking change.

I agree that we need to make the doc-string absolutely semantically clear - I additionally think that unambiguous names would be preferable. close (cancel to a lesser degree) isn't unambiguous, and I don't want to leave the user needing to reason that 'there's a cancel, so close must behave like this' - I'd rather they were independently clear.

@jarohen
Copy link
Owner

jarohen commented Dec 31, 2020

Again, fair-enough if you think this is helpful. My point is that that raw-data the user provided in the first place. To say that you're going to expose what the user gave you, just sounds weird (to me).

This has the tradeoff of the user keeping hold of the either the original seq (or the seq generator, passing it around with the schedule.

Again out of curiosity, what makes you think that an end-user will have a dramatically different-use case (when calling .getDelay)?

I don't - I just think that passing back the delay means that users pretty much have to use it immediately; passing them the raw time gives them the option of using it immediately or later.

Maybe I'm missing some value here regarding re-purposing the ScheduledFuture interface. It seems that we're introducing a little mud by saying 'the returned instance reflects both the overall schedule and the next invocation'. Do we get benefits from coupling Chime to it, returning a value that implements it? Are there things that a user could do with the Chime schedule if only it implemented ScheduledFuture, say?

Similarly, if ScheduledFuture didn't have getDelay, would we be considering exposing that in that form?

@jarohen
Copy link
Owner

jarohen commented Dec 31, 2020

If there was a way to have a graceful shutdown AND immediate cleanup, that would be just great! However, that is part of the conundrum with shutdown VS shutdownNow, isn't it? What I'm suggesting is simply that I wouldn't give up the graceful shutdown, in order to have immediate-cleanup, as it is kind of a secondary concern for a scheduling construct. If you can come up with a reliable way of achieving both, I would be very excited to hear/review.

Just to confirm, we're not proposing shutdown or shutdown-now?

Assuming we were to go with distinct shutdown and shutdown-now functionality, they should be (mostly) independent of each other? Admittedly, as I've not actually tried to code this up, there may be complexity I'm missing.

@jarohen
Copy link
Owner

jarohen commented Dec 31, 2020

Sure, but it does highlight that this notion of resource-management (for threads), is only ever going to decrease in importance. People who haven't made the transition, should of course be accommodated, but we're not talking about breaking anyone's program here. We're talking about leaving a couple of objects in memory. As I mentioned, if you can come with a solution for that, I'm looking forward to hear about it.

I'm not planning to (personally) solve this at this point in time. I'm not saying it's not valuable - if there's a way of migrating Chime to be more Loom-friendly, I'm all ears - but it's not one I'll dedicate brain cycles to just yet.

@jimpil
Copy link
Contributor

jimpil commented Dec 31, 2020

Re: the core-async integration:

At first glance, I don't see how the stuff we're discussing breaks chime-ch. However, I did notice something else. The doc-string states:

ch - (optional) Channel to chime on - defaults to a new unbuffered channel
Closing this channel stops the schedule.

I see no evidence of this statement. I suspect you meant to say closing the returned channel stops the schedule, but then that should be a few lines up. If the intention of the doc-string is correct, then I think the chiming-fn should be something like the following:

(fn [time]
  (when-not (a/>!! ch time) 
    (a/close ch)))

With the risk of rushing to conclusions, i will say that this looks like another remnant of a time when chime would actually consider the return of the chime-fn. However, at the moment it is not (true is returned after it). I apologise in advance if that's far-reaching...

@jarohen
Copy link
Owner

jarohen commented Dec 31, 2020

@jimpil Quite possibly - the core.async integration hasn't had any love for a while, and the test coverage is now quite patchy. Created #44 for that one.

@jarohen
Copy link
Owner

jarohen commented Dec 31, 2020

In fact, I'm starting to feel that you're the one being overly defensive and somewhat dismissive. Communicating such deep/complex matters is not easy

Agreed. This is why I feel it important to remove such tension from the discussion where possible. I was aiming to feed back in a (hopefully) constructive manner, without judgement, how some of the things you've been saying over the course of this issue have genuinely come across - from the other side, it's not conducive to a good discussion. I don't intend to be defensive, and for that I'm sorry - I'd rather be open about it so that we can get on with improving Chime.

If you aren't finding this useful or informative we can stop at any time - just let me know. After all we're both spending valuable time from our lives...

To be clear, I'm not saying this isn't useful - just harder than it needs to be.

@jimpil
Copy link
Contributor

jimpil commented Dec 31, 2020

Am I understanding correctly that you propose that we change close to be the graceful shutdown, so that cancel becomes the abrupt shutdown? If so, this would be a breaking change.

I am not suggesting anything concrete yet. If I'm honest, mentally I'm still in the design phase, and so I'd like to avoid mixing in existing baggage. In other words I'm coming at this from a clean-slate perspective (at least for now). I do realise that this is not super practical, but I am not great at focusing on multiple things :(.

I agree that we need to make the doc-string absolutely semantically clear - I additionally think that unambiguous names would be preferable. close (cancel to a lesser degree) isn't unambiguous, and I don't want to leave the user needing to reason that 'there's a cancel, so close must behave like this' - I'd rather they were independently clear.

Accounting for the fact that chime does have existing users, I get your point and I can only agree. However in principle, close doesn't have to behave this way (as you put it) - it just makes sense for it to (given that the AutoCloseable impl is the semantic equivalent of unshecule).

This has the tradeoff of the user keeping hold of the either the original seq (or the seq generator, passing it around with the schedule.

I don't disagree per-ce, I just find that to be a perfectly reasonable expectation for such a niche feature.

I don't - I just think that passing back the delay means that users pretty much have to use it immediately

This is the bit I truly struggle to understand what you mean...The user can call .getDelay as many times as he wants, so why would he ever feel that urgency you're describing?

Maybe I'm missing some value here regarding re-purposing the ScheduledFuture interface. It seems that we're introducing a little mud by saying 'the returned instance reflects both the overall schedule and the next invocation'. Do we get benefits from coupling Chime to it, returning a value that implements it?

The underlying construct that chime is based on is ScheduledFuture. My thinking is that exposing the next-job as a ScheduledFuture is the most transparent semantic you could possibly offer. It's not even an imposed semantic - just a proxy. That's exactly what I find so empowering from a user's perspective. You could in theory, return two things from chime-at (whatever it returns today + the atom holding the current scheduled future). In some sense this is like giving super-powers to the user, as you're exposing the entire object (rather than what you've chosen to implement), but at the same time this feels significantly muddier.

Are there things that a user could do with the Chime schedule if only it implemented ScheduledFuture, say?

Of course, but only what the semantics of ScheduledFuture allow (i.e. the usual suspects cancel, getDelay, get etc etc).

Similarly, if ScheduledFuture didn't have getDelay, would we be considering exposing that in that form?

Absolutely! If you go back to my original comment on this you'll see that my first draft implemented Future, not ScheduledFuture. The main motivation was the cancel method. The getDelay one came as an afterthought right after realising that .schedule returns ScheduledFuture, and that we could have the perfect bridge here.

Assuming we were to go with distinct shutdown and shutdown-now functionality, they should be (mostly) independent of each other? Admittedly, as I've not actually tried to code this up, there may be complexity I'm missing.

Do you recall my very first suggestion of a new :exec-shutdown (:orderly VS :abruptly) option? If you are fully committed to not changing the semantics of close, then the cleanest way to offer a choice about which of the two methods is called, a brand new option (defaulting to the existing behaviour) is doable. This solution however only addresses the matter of shutdown. It doesn't go deeper into questions like how do I cancel the next/current job, while keeping the schedule alive.

I'm not planning to (personally) solve this at this point in time. I'm not saying it's not valuable - if there's a way of migrating Chime to be more Loom-friendly, I'm all ears - but it's not one I'll dedicate brain cycles to just yet.

chime is already loom-compatible (i.e. you can pass your own ThreadFactory). To be 100% loom-friendly, you probably want to expose the executor too (so that a non-pooling executor can be passed instead). But as you hinted, this is low-hanging fruit...

Agreed. This is why I feel it important to remove such tension from the discussion where possible. I was aiming to feed back in a (hopefully) constructive manner, without judgement, how some of the things you've been saying over the course of this issue have genuinely come across - from the other side, it's not conducive to a good discussion. I don't intend to be defensive, and for that I'm sorry - I'd rather be open about it so that we can get on with improving Chime.

👍

@jimpil
Copy link
Contributor

jimpil commented Jan 1, 2021

Here is a rough sketch of the API I would personally like for chime to either provide, or allow it to be built by end-users. Feel free to ignore the actual implementation - I'm more showing this for the function names and doc-strings.

(defn cancel-next!
  "Cancels the next upcoming chime, potentially abruptly,
   as it may have already started. The rest of the schedule
   will remain unaffected, unless the interruption is handled
   by the error-handler (i.e. `InterruptedException`), and it
   returns falsey, or throws (the default one returns true)."
  ([sched]
   (cancel-next! sched true)) ;; essentially the same as `future-cancel`
  ([^ScheduledFuture sched interrupt?]
   (.cancel sched interrupt?)))

(defn until-next
  "Returns the remaining time (in millis by default)
   until the next chime (via `ScheduledFuture.getDelay()`)."
  ([sched]
   (until-next sched TimeUnit/MILLISECONDS))
  ([^ScheduledFuture sched time-unit]
   (.getDelay sched time-unit)))

(defn cancel-next?!
  "Like `cancel-next!`, but only if the next task hasn't already started."
  [^ScheduledFuture sched]
  (when (pos? (until-next sched))
    (cancel-next! sched)))

(defn shutdown!
  "Gracefully closes the entire schedule (per `pool.shutdown()`).
   If the next task hasn't started yet, it will be cancelled,
   otherwise it will be allowed to finish."
  [^AutoCloseable sched]
  (-> (doto sched (.close))
      cancel-next?!))

(defn shutdown-now!
  "Attempts a graceful shutdown, but if the latest task
   is already happening attempts to interrupt it.
   Semantically equivalent to `pool.shutdownNow()`
   when called with <interrupt?> = true (the default)."
  ([sched]
   (shutdown-now! sched true))
  ([sched interrupt?]
   (or (shutdown! sched)
       (cancel-next! sched interrupt?))))

(defn wait-for!
  "Blocking call for waiting until the schedule finishes,
   or the provided <timeout-ms> has elapsed."
  ([sched]
   (deref sched))
  ([sched timeout-ms timeout-val]
   (deref sched timeout-ms timeout-val)))

@jimpil
Copy link
Contributor

jimpil commented Jan 1, 2021

I've edited my last comment 3 times, but this this third iteration shows signs of maturity...For instance, notice how shutdown!, addresses your resource-management concerns. Moreover, notice how shutdown-now! reuses shutdown!. I have to admit that this is rather beautiful idiom (assuming that followup tests back it up), and the best part is that it emerged 100% organically.

@jarohen
Copy link
Owner

jarohen commented Jan 3, 2021

Thanks - and a Happy New Year to you too :)

Plenty of food for thought here, thanks - some initial musings:

  • shutdown/shutdown-now - agree with the docstrings. I don't get the reason to include an interrupt? flag in shutdown-now - if I don't want to interrupt a currently running task, I'll call shutdown?
  • wait-for! - if this is just calling deref, do we need to expose it separately?
  • cancel-next and friends - I'm not convinced about the external API semantics here, from a user's point of view. If I call it twice, should it cancel two tasks? Cancelling the current task might be easier to specify - if there's one running, it cancels it - if there isn't, it's a no-op.
    As it's not currently exposed by the Chime API, not immediately straightforward, and opens up another problem space (including mutating the future schedule, even, from Feature request: Allow adding new times to a chimes-ch channel #39), I'd be keen to land the changes to shutdown first and then consider what to do here. Let's consider that on Feature request: Allow adding new times to a chimes-ch channel #39, and focus on the shutdown here.

Incidentally, there's a bug in the (admittedly pseudo-code) implementation of cancel-next?! that goes some way to showing what I'm looking to avoid with .getDelay:

  • you first call (pos? (until-next ...)), it's 1ms, so it's positive
  • let's say a millisecond ticks over, the job starts
  • it then calls cancel-next!, which cancels the job abruptly.

Thinking also of the use case of rendering the next time the schedule will run (stealing inspiration from #30), there's no way to do this reliably with .getDelay - if the time increments between when the delay is calculated and when the next time is rendered, the next time will be wrong.

That said, I'm coming around to the idea of additionally exposing .getDelay, I do see it as a common use case, but I'd like to give users the raw data too. Even if, as you say, the user knew how to generate the sequence in the first place, #30 has convinced me that exposing the tail of the times sequence would be useful.

Are there things that a user could do with the Chime schedule if only it implemented ScheduledFuture, say?

Of course, but only what the semantics of ScheduledFuture allow (i.e. the usual suspects cancel, getDelay, get etc etc).

This outlines why we might adopt a similar API, I was wondering whether there were benefits to returning an object that implements this exact interface. Having had a few days to mull this over, though, the fact that ScheduledExecutorService returning a ScheduledFuture in the scheduleAtFixedRate case means there's a precedent for using ScheduledFuture to represent a recurring job (a precedent I'd missed previously, apologies if this is what you were trying to get me to understand!). Implementing ScheduledFuture to represent the whole schedule is much more palatable - I was mainly concerned with returning an object that tried to represent both the whole schedule and the next task.

Likewise, I'll think on this a bit more :)

@jimpil
Copy link
Contributor

jimpil commented Jan 3, 2021

Hi again,

shutdown/shutdown-now - agree with the docstrings. I don't get the reason to include an interrupt? flag in shutdown-now - if I don't want to interrupt a currently running task, I'll call shutdown?

The arity is there more to match the cancel-next one, but yeah you're right, it has limited usefulness.

wait-for! - if this is just calling deref, do we need to expose it separately?

Not really, but in my view a higher level API like this should be instructive/informative to the user.

cancel-next and friends - I'm not convinced about the external API semantics here, from a user's point of view. If I call it twice, should it cancel two tasks? Cancelling the current task might be easier to specify - if there's one running, it cancels it - if there isn't, it's a no-op.

You're right this is a terrible name! Something like cancel-upcoming!, or even cancel-current! should make things clearer.

Incidentally, there's a bug in the (admittedly pseudo-code) implementation of cancel-next?! that goes some way to showing what I'm looking to avoid with .getDelay:

I really don't want to dwell over this, but two points:

  • One should clearly communicate that the phrase if not already started comes with millisecond tolerance
  • 1 millisecond is plenty enough to test for postive (pos?), and to call .cancel right after. I've tested this repeatedly on a 9 year old laptop and it doesn't break a sweat.

So as long as the user can live with the millisecond tolerance (999 microseconds left are considered too late), I don't see the bug. In fact, I used my google-fu to try to find what kind of use-cases .getDelay is useful/used for, and most of the code/cases i found were very very similar. If I had tried to be clever and played with micro-seconds, or even worse, nano-seconds, then yes, the race you describe suddenly becomes a real threat.

Thinking also of the use case of rendering the next time the schedule will run (stealing inspiration from #30), there's no way to do this reliably with .getDelay - if the time increments between when the delay is calculated and when the next time is rendered, the next time will be wrong.

Indeed, I don't think it is possible to calculate the datetime of the next-chime and have it always be 100% .equal() to the datetime the user provided. That said, i personally don't see the usefulness of it. Something like the following should suffice, no? Are you worried about sub-millisecond drifts?

(defn next-chime-at
  "Returns the (future) `ZonedDateTime` when the next chime will occur,
   or nil if it has already started (millisecond tolerance)."
  (^ZonedDateTime [sched]
   (next-chime-at sched times/*clock*))
  (^ZonedDateTime [sched ^Clock clock]
   (let [remaining (until-next-chime sched)]
     (when (pos? remaining)
       (-> (ZonedDateTime/now clock)
           (.plus remaining ChronoUnit/MILLIS))))))

Implementing ScheduledFuture to represent the whole schedule is much more palatable - I was mainly concerned with returning an object that tried to represent both the whole schedule and the next task.

There is no way (that I can see) to use ScheduledFuture to represent the whole schedule in chime, because it schedules incrementally - i.e. you only know about the lastly scheduled-future.

Likewise, I'll think on this a bit more :)

👍

@jimpil
Copy link
Contributor

jimpil commented Jan 3, 2021

For reference, here is what I am able to do locally:

(def sched (chime-at (times/every-n-seconds 2) println))
=> #'user/sched
#object[java.time.Instant 0x6d43f538 2021-01-03T22:52:38.986748Z]
#object[java.time.Instant 0x697c7c48 2021-01-03T22:52:40.986748Z]
(cancel-current?! sched)
=> true
#object[java.time.Instant 0x2c7eff03 2021-01-03T22:52:44.986748Z]
#object[java.time.Instant 0x6f801063 2021-01-03T22:52:46.986748Z]
#object[java.time.Instant 0x1d18d29e 2021-01-03T22:52:48.986748Z]
(cancel-current?! sched)
=> true
#object[java.time.Instant 0x170c6349 2021-01-03T22:52:52.986748Z]
#object[java.time.Instant 0x4c51004c 2021-01-03T22:52:54.986748Z]
(shutdown! sched)
=> true
;; no more chimes

@jimpil
Copy link
Contributor

jimpil commented Jan 4, 2021

Let's perform the following thought experiment:

Assume that a user has the ability to get both the milliseconds (from .getDelay), AND the Instant (the first element in the current times tail) until the next chime. If the user attempts to schedule something brand new to fall exactly on that moment in time, which one of the two should he trust (i.e. which one is more accurate)? It's definitely more convenient to use the Instant as chime already understands it, but building a new Instant from the raw millis will actually be closer to the real schedule. As you said in an earlier comment of yours you're not using Instants with the Executor - you're using delays (millis)., and .getDelay will include the true time-remaining (including any latencies incurred along the way). Moreover, think about overrunning jobs that are perhaps pushed-forward, or dropped. The fixed times sequence can't help you there. Hope that helps...

@jarohen jarohen self-assigned this Jan 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants