Parallel jobs fail to lock or rewind the control file at end of run #438

Closed
mchamberland opened this issue Apr 27, 2018 · 58 comments

@mchamberland
Contributor

mchamberland commented Apr 27, 2018

I'm seeing the same behaviour that @ojalaj reported in the comments of PR #368: parallel jobs end with an error ("failed to lock or rewind the control file")

For what it's worth, I see it when running a BEAM accelerator with an IAEA phase space source and also when running DOSXYZnrc with source 20 and a shared library (either BEAM or external).

Not sure if this is general lock file troubles or if there's anything else going on.

@mchamberland
Contributor Author

To clarify: the "failed to rewind the control file" does not necessarily happen at the end of the run. It can happen anytime throughout the run, which means if I submit 250 jobs, I may be left with only ~40 after a few minutes.

The message I see is:

...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file after 1 minute wait!

which is what seems to be killing the majority of my jobs.

Completely anecdotal, but I don't see this as often with a ~2013 version of EGSnrc. I haven't ruled out this being a user-account issue, though (for now I run the two versions of EGSnrc under different user accounts).

@ojalaj

ojalaj commented May 2, 2018

I have also experienced this race condition on the lock file. There seem to be at least two separate issues: 1) jobs actually killed during the simulation, and 2) jobs running properly in terms of calculation results, but with something going wrong in the final parts of the finishing scripts. My knowledge here is very limited, but as a cure for 1) I have tried reducing N_CHUNKS in src/egsnrc.macros, so that the jobs don't access the lock file as often, and increasing batch_sleep_time in scripts/batch_options.xxx, so that the jobs don't all start (and access the lock file to pick up the next chunk of particles) within such a short time frame.

Do you know where the timeout is defined, i.e. the value (time) for how long a job keeps trying to access the lock file before it gives up and is killed? Increasing that value might also help.

@mchamberland
Contributor Author

Thanks, those are good suggestions!

This is where the code tries to lock the file for 1 minute. Maybe I’ll try slightly increasing those numbers.
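For reference, the retry loop presumably looks something like the sketch below, inferred purely from the log messages quoted above (five attempts of roughly 12 seconds each before the 1-minute give-up); the names and exact structure in HEN_HOUSE/cutils/egs_c_utils.c may well differ.

/* Sketch only -- inferred from the log output in this thread, not copied
 * from egs_c_utils.c; function and variable names are illustrative. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int lock_control_file(int fd) {
    struct flock fl = {0};
    fl.l_type   = F_WRLCK;            /* exclusive lock on the whole file */
    fl.l_whence = SEEK_SET;
    const int ntry = 5;               /* 5 attempts x 12 s = the "1 minute wait" */
    for (int i = 0; i < ntry; ++i) {
        if (fcntl(fd, F_SETLK, &fl) == 0) return 0;   /* lock acquired */
        sleep(12);                    /* wait 12 s before reporting and retrying */
        fprintf(stderr, "egsLockControlFile: failed to lock file for 12 seconds...\n");
    }
    fprintf(stderr, "egsLockControlFile: failed to lock file after 1 minute wait!\n");
    return -1;
}

Extending the total wait to 10 minutes then amounts to raising the retry count (or the per-attempt wait) accordingly.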

@marenaud
Contributor

marenaud commented May 2, 2018

I just want to echo @ojalaj and say that I also encountered the two different flavors of this issue, and also fixed 1) by increasing batch_sleep_time, which is annoying when splitting a < 2 minute job into 100+ parts, because the batch_sleep_time then becomes a significant overhead. This has been happening for quite a few releases, so it is probably not due to a recent bug.

The best solution would probably be to finalize #341 and do away with lock files entirely... but when I looked into it, there seemed to be some issues with shared libraries.

@mchamberland
Contributor Author

I think I may have found why v2018 experiences more lock file errors than our ~2013 system.

In egs_c_utils.c, the 2013 version tries locking the file for 10 minutes before giving up, whereas in v2018, this is reduced to 1 minute. After increasing the time back to 10 minutes in v2018, I see much fewer failures (2 out of 250 vs >150 before).

I would personally suggest increasing the time back to 10 minutes, because the cost of many failed runs is worse than that of having a job wait a few minutes to properly acquire the lock file.

@ftessier
Member

ftessier commented May 8, 2018

Thanks @mchamberland; but that must have been a local change: from our end it seems it was always 1 minute... At any rate, it seems there would be no harm in increasing the wait time.

@mchamberland
Contributor Author

@ftessier Correct, I've confirmed that the change was done locally.

Since we're here, there seems to be duplicated code in cutils/egs_c_utils.c and pieces/egs_c_utils_unix.c. Any reason why? I was also confused about which file gets picked up during configuration.

@rtownson rtownson added the bug label Jun 1, 2018
@crcrewso
Contributor

crcrewso commented Jun 9, 2018

This explains our currently much higher failure rate! Thank you!

When looking at the file, should we increase the number of loops or increase the sleep time?

@mchamberland
Contributor Author

I don't think it matters which one you increase, but we changed it to loop 60 times every 10 seconds, so it tries for 10 minutes.

But even with this change, I was still getting quite a few jobs failing. What seems to have helped is reducing the rate of lock attempts. In other words, for a typical simulation, estimate how many times per second your jobs will access the lock file (roughly Njobs * Nchunk / typical_simulation_time). Here, Nchunk is a good candidate to decrease.
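For a concrete sense of scale (hypothetical numbers): 250 jobs with Nchunk = 10 over a ~600 s run is 250 * 10 / 600 ≈ 4 lock attempts per second on average, whereas dropping Nchunk to 1 brings that down to well under one attempt per second.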

@crcrewso
Contributor

crcrewso commented Jun 12, 2018

I've been thinking about this for a bit, and there might be a better rule of thumb than the one @mchamberland mentioned. If we consider an ideal scheduler and a large number of histories per job, then at the end of the first batch every job will try to lock at the same time. There are some logical leaps here, which I could explain, but I think the summary should be sufficient:

Assuming a linear wait time with perfect collisions

  • The number of tries (t) should be greater than the maximum number of simultaneous jobs (j)
  • The wait time (s) should be at least twice as long as the time taken to lock, edit, and release the lock file.
  • The value of t*j < 0.5 * s

This has the advantage that, after the first of the 10 checkpoints, all threads should be staggered so differently that they should not collide again.

Now, if we created a more advanced timing system where we add a randomized broadening term, we could cut down the number of loops for any given thread, but doing so would mean we'd retain some collision risk throughout the whole run.

I can do a bit of legwork on this if the group thinks that there would be advantages to keeping this algorithm safe, efficient, and unnecessary for users to configure.

Or we could document this bit of code in an appendix.

@mchamberland
Contributor Author

@crcrewso You've put way more thought into this than I have, but I definitely support such an initiative! I can help with testing until roughly mid-August. After that, I'm not sure when I will have access to a cluster with EGSnrc again.

@crcrewso
Contributor

Before I write something up, one question. Does anyone know if the lock file stays locked while the results are being written?

@mchamberland
Contributor Author

mchamberland commented Jun 12, 2018

EDIT: Oops, misread your question. I have no idea and that's a good question. My hunch is no, but @rtownson @blakewalters can chime in.

@blakewalters
Contributor

Your hunch is correct, @mchamberland. The .lock file is unlocked while the results are being written.

@crcrewso
Contributor

crcrewso commented Jun 12, 2018

If that's the case then here is my proposed new algorithm (based mostly on experience and an old locking conversation from years back).

Considering that most cluster storage is higher-latency RAID HDDs on a storage server:

  1. Create a new elapsed-time variable (te)
  2. Try
  3. If locked, wait a random number of seconds between 3 and 15
  4. Increment te by that amount of time
  5. Try again
  6. Repeat 3-5 until either te reaches 120 seconds or success
  7. If te > 120, increment by 30 seconds
  8. Repeat 7 until either success or te > 1200

This should introduce enough time and variance to protect us from issues. Additionally, it would mean that we might not need to keep the exb default wait time per job dispatch so high (right now it's 1 second) on clusters that are lucky enough to have fast dispatching and SSDs.

Thoughts, arguments, holes, worries?

Edit: this algorithm would work better as a while loop than as a nested for loop
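For concreteness, here is a minimal sketch of steps 1-8 written as the single while loop mentioned in the edit; the fcntl-based try-lock and all names are illustrative assumptions, not taken from the EGSnrc sources.

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch of the proposed backoff; assumes fd is the already-open .lock file. */
int lock_with_backoff(int fd) {
    struct flock fl = {0};
    fl.l_type   = F_WRLCK;                 /* exclusive lock on the whole file */
    fl.l_whence = SEEK_SET;
    int te = 0;                            /* step 1: elapsed waiting time, seconds */
    while (fcntl(fd, F_SETLK, &fl) != 0) { /* steps 2 and 5: try, try again */
        if (te > 1200) return -1;          /* step 8: give up once te exceeds 1200 s */
        int wait = (te < 120)
                 ? 3 + rand() % 13         /* steps 3-6: random 3..15 s while te < 120 */
                 : 30;                     /* step 7: fixed 30 s increments afterwards */
        sleep(wait);
        te += wait;                        /* step 4: accumulate the wait */
    }
    return 0;                              /* success: lock acquired */
}

(Seeding rand() per job, e.g. with the process id, would keep all the jobs from drawing the same pseudo-random waits.)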

@mchamberland
Contributor Author

Sounds like a good starting point to me!

@ojalaj

ojalaj commented Jun 12, 2018

Thank you, experts, for the great work. I don't have the skills to contribute to the development part, but I can also do testing on our small cluster (186 cores, normal RAID HDDs, Slurm) if needed.

@crcrewso
Contributor

crcrewso commented Jun 18, 2018

I just created a barely tested change to the locking algorithm. I can't test it on Windows or Slurm right now. If someone could test it on each of those and comment on my commit page, that would be hugely helpful.

Thank you.
Edit: removed the bad link and replaced it with:
crcrewso@2475674

@ojalaj

ojalaj commented Jun 18, 2018

Hi @crcrewso.

I tried to open the link provided, but the page was not found.

@crcrewso
Contributor

Sorry, try this:

crcrewso@2475674

@mchamberland
Contributor Author

@crcrewso Great work! I'll give it a shot sometime today.

@ojalaj

ojalaj commented Jul 5, 2018

Hi @crcrewso, we tried your script with Linux/Slurm. After inserting the changes to egs_c_utils.c, we recompiled basically in every folder under HEN_HOUSE and EGS_HOME, just to make sure that the changes would take effect. Unfortunately the failure rate did not change.

@crcrewso
Contributor

crcrewso commented Jul 5, 2018

@ojalaj

I am not doubting you; I just have a couple of questions to track down why it didn't change anything (I would actually expect the failure rate to get worse in a certain scenario).

Before recompiling everything, did you rerun the EGSnrc configure script, either the GUI or HEN_HOUSE/scripts/configure?

When submitting with exb, can you confirm that the jobs are actually starting and getting to their first control point? I.e., does the lock file's first number ever drop? Does the second number ever increase above zero?

@ojalaj

ojalaj commented Jul 5, 2018

@crcrewso

  1. I did not run the configure script (I thought that recompilation would be enough).
  2. I need to confirm with my PhD student, but my understanding is that the jobs actually started, i.e. 30 jobs initially, but only something like 2 of them ultimately continued to simulate histories from the lock file. This is similar to behaviour I have experienced in the past, but not that much these days (I'm using a different cluster than my PhD student, who is really suffering from this issue). Tomorrow we can run another test to check/confirm how it goes with the lock file.

@crcrewso
Contributor

crcrewso commented Jul 5, 2018

@ojalaj I only tested the change against a full reconfiguration; I doubt that running make in every folder a user thinks of would be thorough enough.

On a side note, I have had little experience with jobs successfully restarting. Could you please test from a clean submission, with all lock files and temp files from previous runs removed?

@ojalaj

ojalaj commented Jul 6, 2018

@crcrewso
Answering the questions above:

Yes, the jobs actually start. The jobs that fail output
"lockControlFile: failed to lock file for 12 seconds..." to the .egslog file several times,
and then:
"EGS_JCFControl: failed to rewind the job control file
finishSimulation(egs_chamber) -2"

The jobs that don't fail run as they are supposed to, i.e., the first number in the .lock file gradually drops and the second number increases, and the simulation finally ends when all the particles have been simulated.

Also, the tests have been clean submissions (all lock files and temp files from previous runs removed).

We have now run the configure script and tested again with a clean submission (30 jobs, an egs_chamber run simulating a profile). Now about 60% of the jobs seem to fail, whereas before only a couple of jobs survived. So there seems to be some improvement, but this still needs further testing.

@crcrewso
Contributor

crcrewso commented Jul 6, 2018

@ojalaj
I'm going to propose 3 easy changes to your code for testing. These should all be made together.

I've included the full lines for convenience, but it will probably be easier to just type the numerical changes.

1. Replace
+ else {cycleTime = 30;}
with
+ else {cycleTime = 15;}

2. Replace
+ if (elapsedTime < 120) { cycleTime = 2 + (rand() % 20); }
with
+ if (elapsedTime < 120) { cycleTime = 1 + (rand() % 16); }

3. In your Slurm batch options file there should be a line like
batch_sleep_time=1
I know it's painful, but could you set it to 2?

Let me know how that goes after you rerun the configure script

@ojalaj

ojalaj commented Jul 6, 2018

@crcrewso

We applied the changes and reran the configuration script. Actually, we had batch_sleep_time=5, so setting it to 2 wasn't painful. Anyhow, we still can't see much improvement...

@crcrewso
Contributor

crcrewso commented Jul 6, 2018

Could I get your input files?
Could you try either quadrupling your number of histories or dividing the number of jobs dispatched by 4?

@mchamberland
Contributor Author

@crcrewso After reducing from 250 to 62 jobs, they've all been running for more than 40 minutes and none of them has failed yet.

@mchamberland
Contributor Author

@crcrewso No failure with the reduced job number for this particular case.

@crcrewso
Contributor

crcrewso commented Jul 6, 2018 via email

@mchamberland
Contributor Author

@crcrewso Aaaah! Yes, that makes a lot of sense, actually. Thanks!

@ojalaj

ojalaj commented Jul 8, 2018

Hi @crcrewso.

We tried both quadrupling the number of histories and dividing the number of jobs dispatched by 4, but the result is still the same - 30 dispatched jobs and only 2 of them end up running to the end. We'll figure out whether we can share the input file - if not, we will make a simplified example input file.

I don't know, but could this be cluster specific, i.e., the input file works on a cluster without much other traffic and/or with faster hardware, but not on others? Our problems are on a university cluster with ~1500 cores and a number of other users. However, I'm also using another small cluster with ~200 cores (where I'm the only user at the moment), and there I haven't had these issues (which I did have in the past on the older university clusters).

@crcrewso
Contributor

crcrewso commented Jul 9, 2018 via email

@ojalaj

ojalaj commented Jul 9, 2018

I tried changing 1200 --> 7200. I'll let you know tomorrow whether it helped. We have prepared a test input file for you. To which address can we send it?

@crcrewso
Contributor

crcrewso commented Jul 9, 2018

Since it's a test input file, would you have any objection to posting it to crcrewso@2475674? Otherwise you can try -removed-

@ojalaj

ojalaj commented Aug 13, 2018

We have been running test simulations on two different clusters - a university cluster with hundreds of cores and a number of other users with various calculation needs, and a dedicated cluster (186 cores) used only for these simulations, with no other users. On the university cluster we experience the issue of a large number of failing jobs with both the default EGSnrc .lock file control and the changes proposed by @crcrewso.

However, on the dedicated cluster we are able to run all the simulations with no problems using the default EGSnrc .lock file control, so it seems that the issue is somehow related to the HW/SW configuration and/or other traffic on the cluster.

@marenaud
Contributor

@ojalaj just for fun, do you know the file systems being used on each cluster? I'm wondering if the lock file issues are due to the use of NFS (or specific NFS settings) to share a common partition across nodes. I'd be really curious to know whether the dedicated cluster where you have no issues is using something other than NFS.

@ojalaj

ojalaj commented Oct 26, 2018

This is something I really don't know... I need to check with the admins, or do you have some commands that I could use to find out? The local file system on the dedicated cluster (according to df -Th) seems to be xfs.

@marenaud
Contributor

marenaud commented Oct 26, 2018

One of the ways would be to log onto a compute node and run df to see if you have any NFS-mounted folders (and whether you can identify that you run jobs from there). For example, on our cluster we share /home across all nodes via NFS, so a compute node will have a line in df that says:

controller-node:/home 8719676416 6100080640 2180126720  74% /home

where controller-node is either the hostname or the IP address of the machine that hosts the disk. Another way is to look at /etc/exports on the node that has all the hard drives, to see whether it exports any drives over NFS.

Edit: df -Th will also reveal whether a folder is mounted via NFS, but you'd have to run it on a compute node, not the host node.

@ojalaj

ojalaj commented Oct 26, 2018

I'll ask the admins directly for the details (same admins for both clusters) and then let you know more.

@ojalaj

ojalaj commented Oct 29, 2018

This is what I got: both clusters use NFS for the compute nodes. The only differences that come to mind are that the dedicated cluster with no issues is running Scientific Linux 6 (using plain Ethernet for NFS), whereas the other one runs CentOS 7 (using InfiniBand IPoIB for NFS). Of course, the number of clients and the amount of I/O on the dedicated cluster are much smaller.

@marenaud
Contributor

Wow, interesting. And the dedicated cluster is the one where you have no issues, eh? Well, there goes that theory.

@ftessier
Member

ftessier commented Oct 29, 2018

Just a random thought: could the higher-bandwidth IB fabric actually be bottlenecking the NFS server?

There is an NFS setting that is apparently often overlooked: the number of servers (see https://access.redhat.com/solutions/2216 for example). As far as I understand, this sets the maximum number of concurrent connections (implemented as daemon threads).

Keep in mind that a short simulation on a large number of nodes might be requesting many NFS connections all the time. Our experience is that NFS grinds to a halt when the number of requests is beyond what is in fact available.

You can get a sense of the server load with the uptime command.

On our 400-node cluster, we increased the number of NFS "servers" to 64, and that alone solved a lot of problems with NFS lock-ups. We still bring down the NFS daemon from time to time, but not nearly as often as before...

You may want to ping your system administrator about this setting.

@marenaud
Contributor

marenaud commented Oct 29, 2018

Thanks Fred, that's a nice thing to try! We had the default 8 threads on our cluster, so I'll test 64 and report.

edit: didn't help :(

@ojalaj

ojalaj commented Oct 31, 2018

According to the admins, at our end the setting has "always" been 64, and the server load has never been anywhere near full. But thank you anyway, @ftessier! Other suggestions are also welcome!

@ojalaj

ojalaj commented Nov 13, 2018

I started an egs_chamber simulation on the cluster (the one with issues). I'm using the 'develop' branch from Sep 25, and I've changed N_CHUNKS in src/egsnrc.macros to 1 and increased batch_sleep_time in scripts/batch_options.xxx to 10 seconds.

What caught my attention is that even though I've changed N_CHUNKS ("how many chunks do we want to split the parallel run into") and recompiled under HEN_HOUSE/egs++ and EGS_HOME/egs_chamber, I still get the following in each .egslog file under each egsrun directory:

Fresh simulation of 2000000000 histories

Parallel run with 50 jobs and 10 chunks per job

which I understand to mean that changing N_CHUNKS has not had any effect. Is there something I've done wrong here? Or do I need to recompile somewhere else?

@rtownson
Collaborator

@ojalaj in egs++ codes N_CHUNKS is set in the run control input block:

:start run control:
nbatch = 1
nchunk = 1
ncase = etc...
:stop run control:

@ojalaj

ojalaj commented Nov 13, 2018

Great - thank you @rtownson !

edit: And yes, it is well-documented (https://nrc-cnrc.github.io/EGSnrc/doc/pirs898/common.html), so I should have looked there first!

@Abdullah-Abuhaimed

Abdullah-Abuhaimed commented Mar 28, 2020

Dear all,

I use EGSnrc 2020. So far everything works properly, with the exception of running parallel jobs, which fail right at the beginning, similar to the error reported here. I have tried different user codes (BEAMnrc, DOSXYZnrc, and cavity) and got the same error. Our system uses Slurm, and when I submit parallel jobs they do not work at all; I get the error below:

***************** Error:

Failed to create a lock file named /home/aabuhaimed/EGSnrc/......lock

***************** Quiting now.

and I get the lines below in each error file:

egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file after 1 minute wait!

The lock file is created by the first job in the same directory, but it is empty and the jobs do not run. I have tried different things, but it is still not working. Any idea how to fix this issue?

@rtownson
Collaborator

Hi @Abdullah-Abuhaimed, I see that @blakewalters is addressing your question on reddit and via email, so we will not follow up here.

@TTianCui

Dear all,
I run a BEAM accelerator with an IAEA phase space source, and DOSXYZnrc with source 20 and a shared library. When the DOSXYZnrc photon splitting number is set ≤ 1, the parallel jobs complete successfully. However, when the photon splitting number is set > 1, a single thread works, but the parallel jobs cannot run successfully. When I submit parallel jobs, all of them work for a short time and then fail.
The .egslog file ends at 'will perform charged-particle range rejection against voxel boundaries'.
I have tried different things, increasing or decreasing the number of parallel jobs and histories, but it is still not working. Any idea about this issue?

@jedarko

jedarko commented Apr 22, 2020

[Quotes @Abdullah-Abuhaimed's comment above in full.]

Was there a resolution to this issue? I seem to have the exact same issue.

@crcrewso
Contributor

crcrewso commented Apr 22, 2020

There are two solutions. In batch_options there should be a setting controlling the wait time between job submissions; for Slurm, try setting it to something large like 5 seconds (make sure it's not a multiple of 12; primes are probably better here).

Or you could try crcrewso@2475674

Edit: I forgot that I had submitted this as PR #499.

@jedarko

jedarko commented Apr 22, 2020

I tried crcrewso/EGSnrc@2475674 and it worked like a charm. Thanks!

@ftessier
Member

I see that #499 was merged and there were additional improvements regarding lock file issues in Release 2021, notably the uniform run control object (#588) and the new egs-parallel scripts (#628). Hence I will close this Issue for now. Don't hesitate to reopen it if the infamous lock file rears its ugly head again 😄 .
