missing isofrags.tar.gz in parallel mode (v0.2.3b) #33

Open
jfear opened this issue Apr 18, 2016 · 2 comments

jfear commented Apr 18, 2016

Not sure if this is a bug or specific to my use case.

When running rail in parallel mode using ipcluster with Slurm, I get a RuntimeError saying that isofrags.tar.gz does not exist. If I restart from that point, everything finishes cleanly.

If I run rail in parallel on a single node with ipcluster (i.e. local instead of Slurm), everything runs cleanly.

I am guessing it has something to do with using Slurm. It is probably not your problem; I only bring it up because there is a mention of this in a commit log on the parallel branch. Please let me know if you have a known fix or a suggestion about what might be going on.

Thanks
Justin

Owner

nellore commented Apr 18, 2016

Thanks for the bug report! So the error output is exactly "The file isofrags.tar.gz does not exist and thus cannot be cached."?

Sounds like a race condition. Still somewhat mysterious to me, but in dooplicity/emr_simulator.py try replacing

            if not os.path.isfile(file_or_archive):
                iface.fail(('The file %s does not exist and thus cannot '
                            'be cached.') % file_or_archive,
                            steps=(job_flow[step_number:]
                                        if step_number != 0 else None))
                failed = True
                raise RuntimeError

(lines 1422-1427) with something like

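            # Poll for a few seconds in case the file shows up late
            # (possible race condition); needs "import time" in the module.
            retries = 0
            while not os.path.isfile(file_or_archive):
                time.sleep(1)
                retries += 1
                if retries > 5:
                    break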
            if not os.path.isfile(file_or_archive):
                iface.fail(('The file %s does not exist and thus cannot '
                            'be cached.') % file_or_archive,
                            steps=(job_flow[step_number:]
                                        if step_number != 0 else None))
                failed = True
                raise RuntimeError

and let me know what happens.
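
The retry idea above could also be factored into a small standalone helper, which makes it easier to tune the wait. This is only a sketch: wait_for_file and its retries and delay parameters are hypothetical names used for illustration and are not part of dooplicity.

```python
import os
import time


def wait_for_file(path, retries=5, delay=1.0):
    """Poll until `path` exists as a regular file.

    Hypothetical helper (not part of dooplicity). Checks up to
    `retries` times, sleeping `delay` seconds after each miss,
    then does one last check. Returns True if the file was seen.
    """
    for _ in range(retries):
        if os.path.isfile(path):
            return True
        time.sleep(delay)
    # Final check after the last sleep.
    return os.path.isfile(path)
```

With a helper like this, the existing failure branch would run only when wait_for_file(file_or_archive) returns False, so a file that turns up a moment late is no longer treated as missing.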

Author

jfear commented May 26, 2016

This fixes the problem. See #37.
