I think this was partially addressed in 63888cb, but I'm running into some problems with it.
First, if a job fails to start because its description is invalid, manager.Manager moves it into the FAILED state rather than submitting it to the Executor. That makes sense, but in this case no NO_JOBS event is generated, even if the invalid job was the last one, which can leave api.manager.Manager.wait4all() waiting for that message forever.
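A toy sketch of that first problem, just to make the mechanism concrete (this is not the actual QCG-PJ code; it only models the assumption that the event-posting step lives on the Executor path, which an up-front-rejected job never reaches):

```python
import queue

events: "queue.Queue[str]" = queue.Queue()   # stand-in for the event queue the poller feeds

def handle_submission(valid: bool, is_last: bool) -> None:
    if not valid:
        # job goes straight to FAILED; the path that would post events is skipped
        return
    events.put('JFI')
    if is_last:
        events.put('NO_JOBS')

handle_submission(valid=False, is_last=True)   # invalid last job: nothing is queued
print(events.empty())   # True -> a client blocking on NO_JOBS waits forever
```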
Second, there's a race condition in api.manager.Manager.wait4all(). If the last job was valid and finished, and its status and NO_JOBS messages have already been queued, then the AllJobsFinished check in wait4all() returns True and the function returns immediately, even though the queued JST/JFI/NO_JOBS messages have not yet been received. On the next call to wait4all(), if a job is running, AllJobsFinished returns False, but the poller then receives those stale messages and wait4all() returns while jobs are still running. The sketch below illustrates the interleaving.
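Here is a minimal model of that interleaving. The names (AllJobsFinished, JST/JFI/NO_JOBS, wait4all) mirror the issue text, but the queue and the wait logic are stand-ins for illustration, not the real implementation:

```python
import queue

events: "queue.Queue[str]" = queue.Queue()   # events delivered by the poller

def wait4all(all_jobs_finished: bool) -> None:
    """Toy version of the wait logic described above."""
    if all_jobs_finished:
        # Returns without draining the JST/JFI/NO_JOBS messages that the
        # poller has queued (or is about to queue) for this batch.
        return
    # Otherwise wait for events until a NO_JOBS marker arrives.
    while events.get() != 'NO_JOBS':
        pass

# Batch 1: the last job finished; AllJobsFinished is True, so wait4all()
# returns immediately while its events are still sitting in the queue.
events.put('JST'); events.put('JFI'); events.put('NO_JOBS')
wait4all(all_jobs_finished=True)

# Batch 2: a new job is running, so AllJobsFinished is False and wait4all()
# falls back to the event loop -- which immediately consumes the stale
# NO_JOBS from batch 1 and returns while the new job is still running.
wait4all(all_jobs_finished=False)
print("wait4all() returned even though a batch-2 job is still running")
```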
This turned out to be a bit tricky to fix but I think I have something that works. See the explanation in #152.
An even better solution would be to replace the synchronous AllJobsFinished status request with a request that, on the server side, checks whether there are any active jobs and, if not, posts a NO_JOBS message to the event queue. The client could then simply send that request and process events until it sees NO_JOBS. But that requires a change in the protocol, and it would mean requiring a Poller rather than keeping it optional, so I stopped short of that.
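Roughly, the client side of that proposal could look like the sketch below. The request name ('notify_when_idle') and the send_request/events arguments are hypothetical placeholders; the point is only that the client never checks status synchronously, it just drains events until NO_JOBS:

```python
import queue
from typing import Callable

def wait4all(send_request: Callable[[dict], None],
             events: "queue.Queue[str]") -> None:
    # Ask the server to post NO_JOBS as soon as (or if) no jobs are active.
    send_request({'request': 'notify_when_idle'})   # hypothetical protocol message
    # No separate AllJobsFinished check is left to race against: just
    # consume events until the NO_JOBS marker arrives.
    while True:
        event = events.get()
        if event == 'NO_JOBS':
            return
        # handle JST/JFI status updates here as before
```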
Currently, if a client submits 1k jobs, it has to ask about 1k job statuses. Instead, QCG-PJ should return information about whether all submitted jobs have finished.