Race condition in ".pathnames.iv" file in position command #493

Open
ASLeonard opened this issue Apr 12, 2023 · 12 comments

@ASLeonard

Hi,
This is probably not a common case, but querying positions in a graph from an outer parallel loop is buggy, because this line writes a local pathname file with a fixed name:

std::string path_name_file = basename + ".pathnames.iv";

Running parallel -j 1 ... fixed the issue for me, so pretty sure it is exactly due to a race condition on this line. It seems there is another similar case here.

Best,
Alex

@subwaystation
Member

subwaystation commented Apr 12, 2023

Hi @ASLeonard,

The code you linked is not run in parallel, as far as I can see. The construction of the XP index is single-threaded, with the exception of https://github.com/ekg/mmmulti, which we use.
Furthermore, which exact command did you use? Which exact error did you get?
Without further information I don't know how to help you. Thanks!

@ASLeonard
Author

Sorry for not being clear. The command I was running was something like this:

parallel -j 4 odgi position -i pggb.og -p HER:{1}-{2}  -E -d 1000 -o /dev/stdout ::: <paths of interest> 

So the parallelism comes from running multiple odgi position calls simultaneously, not from running a single odgi call with multiple threads.

And this is one of the errors when running in the directory /cluster/work/alex/KIT

terminate called after throwing an instance of 'std::logic_error'
  what():  Error: File "/cluster/work/alex/KIT/.pathnames.iv" contains zero symbol.
warning [libhandlegraph]: Serialized object does not appear to match deserialzation type.
warning [libhandlegraph]: It is either an old version or in the wrong format.
warning [libhandlegraph]: Attempting to load it anyway. Future releases will reject it!
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error rewinding to load non-magic-prefixed SerializableHandleGraph

Since there are multiple instances of odgi position running at the same time in the same directory, all calls are seemingly reading/writing to the same "hardcoded" filename of "/cluster/work/alex/KIT/.pathnames.iv"
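
One way to test this hypothesis without changing odgi is to give every job its own working directory, so that a file created relative to the current directory cannot be shared between jobs. The following is a minimal sketch, not part of odgi: `run_isolated` is a hypothetical helper name, and it assumes that `.pathnames.iv` is written relative to the working directory, as the error path above suggests.

```shell
# Hypothetical helper: run a command inside a fresh scratch directory.
# Assumption: the colliding file is created relative to the working directory.
run_isolated() {
    local workdir status
    workdir=$(mktemp -d) || return 1
    # Use a subshell so the caller's working directory is left unchanged.
    ( cd "$workdir" && "$@" )
    status=$?
    # Remove the scratch directory and pass on the command's exit status.
    rm -rf -- "$workdir"
    return "$status"
}
```

In bash, the helper could be exported with `export -f run_isolated` and placed in front of the command that GNU parallel runs. Because each job changes directory, the graph would then need to be passed as an absolute path.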

@subwaystation
Member

subwaystation commented Apr 12, 2023

I see. As far as I understand, the problem does not originate from the XP index, because here we generate the file paths randomly. Also, none of the XP index code is invoked within position_main.cpp.
However, warning [libhandlegraph]: hints that something goes wrong when accessing the input ODGI file? Not sure.
Hopefully @ekg knows more.

@subwaystation
Member

Which version of ODGI are you using, @ASLeonard?

@ASLeonard
Author

I built odgi from tip (a054641).

From what I understand the problem is not originating from the XP index, because here we generate the file paths randomly. Also none of the code of the XP index is invoked within position_main.cpp.

Given your experience, I'm less sure that this is the reason. However, the problem only appears when calling multiple odgi position commands in parallel and never occurs when running them sequentially, so I still think the issue is different calls racing on the "/cluster/work/alex/KIT/.pathnames.iv" file.

@AndreaGuarracino
Member

AndreaGuarracino commented Apr 12, 2023 via email

@subwaystation
Member

I understand the problem, but I don't understand how it can happen 🍿
I can't see which part of odgi position would invoke code that reads or writes .pathnames.iv.

@subwaystation
Member

Ah, the problem could be that you are writing to /dev/stdout. Obviously, all 4 parallel runs are writing to the same device. Is this intentional?

@ASLeonard
Author

I was writing to stdout to further postprocess the output. I can try writing to separate files (based on the unique paths of interest), so that would at least test if the -o /dev/stdout is causing any issues.
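
Writing each job's output to its own file could look like the sketch below. It is only an illustration in plain shell: `run_jobs_to_files` and the `job_N.out` naming are hypothetical, and unlike `parallel -j 4` it starts all jobs at once rather than limiting them to four.

```shell
# Hypothetical helper: run "cmd arg" once per argument, in the background,
# with every job writing to its own output file under outdir.
run_jobs_to_files() {
    local outdir="$1" cmd="$2"
    shift 2
    local i=0 arg
    mkdir -p -- "$outdir" || return 1
    for arg in "$@"; do
        i=$((i + 1))
        # A private output file per job means no two jobs share a stream.
        "$cmd" "$arg" > "$outdir/job_$i.out" &
    done
    # Block until every background job has finished.
    wait
}
```

The per-job files can then be post-processed individually or concatenated once all jobs are done.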

I'm not sure why the files have "no name" and are just dot names in the working directory. $TMPDIR is unset (but pretty sure I saw this issue on the compute nodes where $TMPDIR is set).

@subwaystation
Member

So my current assumption is that when odgi loads the same graph in parallel, some temporary files are created (I would not know why) or are assumed to exist, but do not actually contain any information, either because:

  • The file does not actually exist. Did you check, @ASLeonard?
  • The file(s) should be named differently for each of your parallel runs, so that we don't have a race condition.
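
The second point, a distinct file name per run, can be illustrated with `mktemp`, which creates the file atomically under a name that no other process holds. This is a shell sketch of the idea only, not how odgi names its files: `unique_pathnames_file` is a hypothetical helper, and the `.pathnames.iv.XXXXXX` template is an assumption modelled on the file name in the error message.

```shell
# Hypothetical helper: create a uniquely named file in the given directory
# (default: the current directory) and print its path.
unique_pathnames_file() {
    local dir="${1:-.}"
    # mktemp replaces the trailing XXXXXX with random characters and
    # fails rather than reuse an existing name.
    mktemp "$dir/.pathnames.iv.XXXXXX"
}
```

Two processes calling this at the same time would each receive a different path, so neither could read a file the other is still writing.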

I am not familiar enough with the libhandlegraph API, maybe @ekg or @adamnovak have some intuition here? Thanks!

@AndreaGuarracino
Member

AndreaGuarracino commented Apr 13, 2023

I am confused by this command:

parallel -j 4 odgi position -i pggb.og -p HER:{1}-{2}  -E -d 1000 -o /dev/stdout ::: <paths of interest> 

-E requires an input file, and odgi position does not have an -o option. Are these options of parallel?
