Target that should not run yet fails #405

Closed
LudvigOlsen opened this issue May 17, 2023 · 4 comments

@LudvigOlsen

LudvigOlsen commented May 17, 2023

I have a workflow with two targets (say A and B), where the input to B is created by A. I submit both jobs at the same time, and shortly after, A is running and B has failed.

In the screenshot below, I first print the path to the input file and whether it exists, then the status (gwfss is just an alias for summary). A is submitted and B has failed. I actually only asked to submit B, but gwf also submitted A, so it knows that B depends on A.

[Screenshot 2023-05-17 at 15.13.05]

Usually I suspect this type of error to be my own, but in this case the workflow is so simple that it seems to be a bug. I have previously had some instances where I suspected a bug, but the workflow was far too complex to be certain.

Here is the code for submitting B. Note that A is supposed to make sample_dir / "dataset" / "feature_dataset.npy" so it doesn't currently exist. to_strings is just a list comprehension converting Paths to strings.

[Image]

The job fails when it cannot find the feature_dataset.npy.

[Screenshot 2023-05-17 at 15.06.58]

When looking in gwf info for B, it correctly has the ...feature_dataset.npy path in inputs, and so it shouldn't run as that file does not exist.
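To make the expected behaviour concrete, here is a minimal sketch of the check I would expect to hold before B is allowed to start. It is not gwf's actual implementation; the helper names missing_inputs and can_start are hypothetical, and a temporary directory stands in for sample_dir.

```python
from pathlib import Path
import tempfile


def missing_inputs(inputs):
    """Return the input paths that do not exist on disk yet."""
    return [p for p in map(Path, inputs) if not p.exists()]


def can_start(inputs):
    """A target may start only when none of its inputs are missing."""
    return not missing_inputs(inputs)


with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp)  # stand-in for the real sample directory
    feature_path = sample_dir / "dataset" / "feature_dataset.npy"

    # Before A has produced its output, B should not be allowed to start.
    before = can_start([feature_path])

    # Simulate A finishing by creating an empty placeholder file.
    feature_path.parent.mkdir(parents=True)
    feature_path.write_bytes(b"")
    after = can_start([feature_path])

print(before, after)
```

In my case the first situation applies: the file is missing, yet B was started anyway.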

Let me know if you need other information.

@dansondergaard
Collaborator

What does jobinfo say about target A? Could it be that A didn't produce the output files it should, but still completed with a zero return code? Then B would start running, but fail since an input file is missing.

Does it still say "submitted" after some time? Slurm is a bit unreliable when it comes to fetching the status, so it can take 10-30 seconds before it reports the correct status for a job/target.

If you can produce a minimal example that reproduces the error, that would be great!

@LudvigOlsen
Author

LudvigOlsen commented May 17, 2023

A starts running after a short while. It is quite a long job (it currently fails after 20+ minutes due to some bugs I'm working through).

Jobinfo for A:
[Screenshot 2023-05-17 at 20.00.22]

Start times are identical in jobinfo:
B: 2023-05-17T13:58:42 A: 2023-05-17T13:58:42

End time for B is 2023-05-17T13:58:52, A is still running.
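For reference, the timestamps above can be compared with a few lines of Python. This is only a sketch using the standard library; the variable names are made up for illustration, and the values are copied from the jobinfo output quoted above.

```python
from datetime import datetime

# Timestamps as reported by jobinfo (ISO 8601, no timezone given).
start_a = datetime.fromisoformat("2023-05-17T13:58:42")
start_b = datetime.fromisoformat("2023-05-17T13:58:42")
end_b = datetime.fromisoformat("2023-05-17T13:58:52")

# B started at the same moment as A, so it did not wait for A to finish.
started_together = start_a == start_b

# How long B ran before failing.
b_runtime_seconds = (end_b - start_b).total_seconds()

print(started_together, b_runtime_seconds)
```

So B started together with A and failed about ten seconds later, while A was still running.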

I will see if I can make a reproducible example in the coming days :-) (It's 8 PM here in Singapore)

@dansondergaard
Collaborator

Can you provide the complete output of gwf info for target B? Also, can you provide the path (over e-mail) to the workflow file? Thanks :-)

@LudvigOlsen
Author

I've sent it all in a mail :-)
