Parallel jobs fail to lock or rewind the control file at end of run #438
Comments
To clarify: the "failed to rewind the control file" error does not necessarily happen at the end of the run. It can happen anytime throughout the run, which means that if I submit 250 jobs, I may be left with only ~40 after a few minutes. The message I see is: ... which is what seems to be killing the majority of my jobs. Completely anecdotal, but I don't see this as often with a ~2013 version of EGSnrc; I haven't ruled out this being a user-account issue (I run the two versions of EGSnrc under different user accounts for now).
I have also experienced this race condition on the lock file. There seem to be at least two separate issues: 1) jobs actually killed during the simulation, and 2) jobs running properly in terms of calculation results, but something going wrong in the last parts of the finishing scripts. My knowledge here is very limited, but as a cure for 1) I have tried reducing N_CHUNKS in src/egsnrc.macros, so that the jobs don't access the lock file so often, and increasing batch_sleep_time in scripts/batch_options.xxx, so that jobs don't start (and access the lock file to pick up the next chunk of particles) within such a short time frame. Do you know where the value (time) for how long the lock file is retried before killing the job is defined? Increasing that value might also help.
Thanks, those are good suggestions! This is where the code tries to lock the file for 1 minute. Maybe I'll try slightly increasing those numbers.
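For readers following along, here is a minimal sketch of the kind of bounded retry loop being discussed, assuming a POSIX fcntl() write lock; it is illustrative only, and the actual routine in HEN_HOUSE/cutils/egs_c_utils.c may differ in details:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch only (not the EGSnrc source): try to take an exclusive lock on the
 * control (.lock) file, retrying ntry times with wait_seconds between tries. */
static int try_lock_control_file(int fd, int ntry, int wait_seconds) {
    struct flock fl;
    fl.l_type   = F_WRLCK;    /* exclusive write lock            */
    fl.l_whence = SEEK_SET;   /* lock the whole file from offset 0 */
    fl.l_start  = 0;
    fl.l_len    = 0;
    for (int i = 0; i < ntry; ++i) {
        if (fcntl(fd, F_SETLK, &fl) != -1) return 0;   /* lock acquired */
        fprintf(stderr, "failed to lock file, retrying in %d s...\n",
                wait_seconds);
        sleep((unsigned) wait_seconds);
    }
    return -1;   /* e.g. ntry = 6, wait_seconds = 10 gives up after 1 minute */
}
```

Increasing either the number of retries or the sleep between them lengthens the total wait before a job gives up and dies.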
I just want to echo @ojalaj and say that I also encountered the two different flavors of this issue, and also fixed 1) by increasing batch_sleep_time, which is annoying when splitting a < 2 minute job into 100+ parts, because the batch_sleep_time ends up being a significant overhead. This has been happening for quite a few releases, so it is probably not due to a recent bug. The best solution would probably be to finalize #341 and do away with lock files entirely... but when I looked into it, there seemed to be some issues with shared libraries.
I think I may have found why v2018 experiences more lock file errors than our ~2013 system. In egs_c_utils.c, the 2013 version tries locking the file for 10 minutes before giving up, whereas in v2018 this is reduced to 1 minute. After increasing the time back to 10 minutes in v2018, I see far fewer failures (2 out of 250, versus >150 before). I would personally suggest increasing the time back to 10 minutes, because the drawback of many failed runs is worse than having a job wait a few minutes to properly acquire the lock file.
Thanks @mchamberland; but that must have been a local change: from our end it seems it was always 1 minute... At any rate, it seems there would be no harm done by increasing the wait time.
@ftessier Correct, I've confirmed that the change was done locally. Since we're here, there seems to be duplicated code in cutils/egs_c_utils.c and pieces/egs_c_utils_unix.c. Any reason why? I was also confused about which file gets picked up during configuration.
This explains our currently much higher failure rate! Thank you! When looking at the file, should we increase the number of loops or increase the sleep time?
I don't think it matters which one you increase, but we changed it to loop 60 times with a 10 second sleep, so it tries for 10 minutes. Even with this change, I was still getting quite a few jobs failing. What seems to have helped is reducing the rate of lock attempts. In other words, for a typical simulation, figure out how many times per second your jobs will access the lock file (usually something like Njobs * Nchunk / typical_simulation_time). In this case, Nchunk is a good candidate to decrease.
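As a purely illustrative worked example of that rule of thumb (the numbers below are assumed for the arithmetic, not taken from any particular cluster):

```
lock attempts per second ≈ Njobs * Nchunk / typical_simulation_time
                         ≈ 250 jobs * 10 chunks / 600 s
                         ≈ 4 lock attempts per second, before counting retries
```

At that rate, every lock attempt has to complete in well under a quarter of a second on average for jobs not to start queueing up behind each other.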
I've been thinking about this for a bit, and there might be a better rule of thumb than the one @mchamberland mentioned. If we consider an ideal scheduler and a large number of histories per job, then at the end of the first batch each job will try to lock at the same time. There are some logical leaps here, which I could explain, but I think the summary should be sufficient. Assuming a linear wait time with perfect collisions: ...
This has the advantage that, after the first of 10 checkpoints, all threads should be timed so differently that they should not collide again. If we create a more advanced timing system, where we add a randomized broadening term, we could cut down the number of loops for any given thread, but doing so would mean that we'd retain collision risks through the whole run. I can do a bit of legwork on this if the group thinks there would be advantages to keeping this algorithm safe, efficient, and unnecessary for users to configure. Or we could document this bit of code in an appendix.
@crcrewso You've put way more thought than me into this, but I definitely support such an initiative! I can help with testing until roughly mid-August. After that, I'm not sure when I will have access to a cluster with EGSnrc again.
Before I write something up, one question. Does anyone know if the lock file stays locked while the results are being written?
EDIT: Oops, misread your question. I have no idea and that's a good question. My hunch is no, but @rtownson @blakewalters can chime in.
Your hunch is correct, @mchamberland. The .lock file is unlocked while writing results.
If that's the case, then here is my proposed new algorithm (based mostly on experience and an old locking conversation from years back), considering that most cluster storage is higher-latency RAID HDD on a storage server: ...
This should introduce enough time and variance to protect us from issues. Additionally, this would mean that we might not need to keep the default exb wait time per job dispatch so high (right now it's 1 second) on clusters that are lucky enough to have fast dispatching and SSDs. Thoughts, arguments, holes, worries? Edit: this algorithm would work better as a while loop than as a nested for loop.
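To make the while-loop idea concrete, here is a rough sketch in C; it is not the actual proposal or the EGSnrc source, and the constants and the randomized spacing are assumptions chosen only to illustrate the backoff behaviour:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Illustrative only: keep retrying an exclusive lock until max_wait seconds
 * have elapsed, sleeping a randomized interval between attempts so that jobs
 * which collided once are unlikely to keep colliding on every retry. */
static int lock_with_random_backoff(int fd, int max_wait) {
    struct flock fl;
    fl.l_type = F_WRLCK; fl.l_whence = SEEK_SET; fl.l_start = 0; fl.l_len = 0;
    unsigned seed = (unsigned) time(NULL) ^ (unsigned) getpid();
    int waited = 0;
    while (waited < max_wait) {                        /* e.g. max_wait = 1200 s */
        if (fcntl(fd, F_SETLK, &fl) != -1) return 0;   /* lock acquired */
        int pause = 1 + (int)(rand_r(&seed) % 5);      /* randomized 1..5 s */
        sleep((unsigned) pause);
        waited += pause;
    }
    return -1;                                         /* timed out */
}
```

The per-process random seed is what breaks the "perfect collision" pattern described above: two jobs that fail the same attempt will almost certainly wake up at different times on the next one.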
Sounds like a good starting point to me!
Thank you, experts, for the great work. I do not have the skills to contribute to the development part, but I can also do testing on our small cluster (186 cores, normal RAID HDDs, SLURM), if needed.
I just created a barely tested change to the locking algorithm. I can't test it on Windows or Slurm right now. If someone could test each of those and comment on my commit page, that would be hugely helpful. Thank you.
Hi @crcrewso. I tried to open the link provided, but the page was not found.
Sorry, try this
@crcrewso Great work! I'll give it a shot sometime today.
Hi @crcrewso, we tried your script with Linux/Slurm. After inserting the changes into egs_c_utils.c, we recompiled in basically every folder under HEN_HOUSE and EGS_HOME, just to make sure the changes would take effect. Unfortunately, the failure rate did not change.
I am not doubting you; I just have a couple of questions to track down why it didn't change anything (I would actually expect the failure rate to get worse in a certain scenario). Before recompiling everything, did you rerun the EGSnrc configure script, either the GUI or HEN_HOUSE/scripts/configure? When submitting with exb, can you confirm that the jobs are actually starting and getting to their first control point? I.e., does the lock file's first number ever drop? Does the second ever increment above zero?
@ojalaj I only tested the script against a full reconfiguration; I doubt running make in every folder a user thought of would be thorough enough. On a side note, I have had little experience with jobs successfully restarting. Could you please test from a clean submission, with all lock files and temp files from previous runs removed?
@crcrewso Yes, the jobs actually start. The jobs that fail output ... The jobs that don't fail run as they are supposed to, i.e., the first number in the .lock file gradually drops and the second number increases, and finally the simulation ends when all the particles have been simulated. Also, the tests have been clean submissions (all lock files and temp files from previous runs removed). Now we ran the configure script and tested again with a clean submission (30 jobs) of an egs_chamber run simulating a profile. Now about 60% of the jobs seem to fail, whereas before only a couple of jobs survived. So there seems to be some improvement, but this still needs further testing.
@ojalaj I've included the full lines for convenience, but it will probably be easier just to type the numerical changes: 1) Replace ... 2) Replace ... 3) In your batch Slurm file there should be a line like ... Let me know how that goes after you rerun the configure script.
We applied the changes and reran the configuration script. Actually, we had batch_sleep_time=5, so setting it to 2 wasn't painful. Anyhow, we still can't see much improvement...
Could I get your input files?
@crcrewso After reducing from 250 to 62 jobs, they've all been running for more than 40 minutes and none of them have failed yet.
@crcrewso No failure with the reduced job number for this particular case.
I think you were running too many jobs. When determining the number of jobs to run, one needs to keep in mind the time it takes to save. Let's consider a typical cluster RAID array: it might save at 100 MB/s. If we're saving a 100 MB pardose file, then with latency the lock file will be locked for about 3 seconds. 250 of these would take 750 seconds at least, and with other latencies that could easily reach the 1200 second maximum, at which point things start timing out with my code. Does this make sense?
@crcrewso Aaaah! Yes, that makes a lot of sense, actually. Thanks!
Hi @crcrewso. We tried both quadrupling the number of histories and dividing the number of jobs dispatched by 4, but the result is still the same: 30 dispatched jobs and 2 of them end up running to the end. We'll figure out whether we can share the input file; if not, we will make a simplified example input file. I don't know, but could this be cluster specific, i.e., the input file will work on some cluster without much other traffic and/or with faster hardware, but not on others? Our problems are on a university cluster with ~1500 cores and a number of other users. However, I'm also using (as the only user at the moment) another small cluster with ~200 cores, where I haven't had these issues (which I did have in the past on the older university clusters).
Okay. Let's try something ridiculous. I have one constant in the while loop with a value of 1200. Try setting this to 7200.
I tried changing 1200 --> 7200. I'll let you know tomorrow whether it helped. We have prepared a test input file for you. To which address can we send it?
Since it's a test input file, would you have any objection to posting it to crcrewso@2475674? Otherwise you can try -removed-
We have been running test simulations on two different clusters: a university cluster with hundreds of cores and a number of other users with various calculation needs, and a dedicated cluster (186 cores) used only for these simulations, with no other users. On the university cluster we experience the issue with a large number of failing jobs, both with the default EGSnrc .lock file control and with the changes proposed by @crcrewso. However, on the dedicated cluster we are able to run all the simulations with no problems with the default EGSnrc .lock file control, so it seems the issue is somehow related to the HW/SW configuration and/or other traffic on the cluster.
@ojalaj just for fun, do you know the file systems being used on each cluster? I'm wondering if the lock file issues are due to the use of NFS (or specific NFS settings) to share a common partition across nodes. I'd be really curious to know whether the dedicated cluster where you have no issues is using something other than NFS.
This is something I really don't know... I need to check with the admins, or do you have some commands I could use to find out? The local file system (from df -Th) on the dedicated cluster seems to be xfs.
One way would be to log onto a compute node and run df to see if you have any NFS-mounted folders (and if you can identify that you run jobs from there). For example, on our cluster we share /home across all nodes via NFS, so a compute node will have a line in df that says:
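(Illustrative df -Th style output only; the sizes, usage figures, and filesystem type string are assumptions, and only the controller-node:/home pattern matters here:)

```
Filesystem             Type  Size  Used  Avail  Use%  Mounted on
controller-node:/home  nfs4  2.0T  1.2T  800G    60%  /home
```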
where controller-node is either the hostname or the IP address of the machine that hosts the disk. Another way is to look at /etc/exports on the node that has all the hard drives, to see whether it exports any drives over NFS. Edit: doing df -Th will also reveal whether a folder is mounted as NFS, but you'd have to do it on a compute node, not the host node.
I'll ask the details directly from the admins (same admins for both clusters) and then let you know more.
This is what I got: both clusters are using NFS for the compute nodes. The only differences that come to mind are that the dedicated cluster with no issues is running Scientific Linux 6 (using plain ethernet for NFS), whereas the other one runs CentOS 7 (using Infiniband IPoIB for NFS). Of course, the number of clients and the amount of I/O usage on the dedicated cluster are much smaller.
Wow, interesting. And the dedicated cluster is the one where you have no issues, eh. Well, there goes that theory.
Just a random thought: could the higher-bandwidth IB fabric actually be bottlenecking the NFS server? There is an NFS setting that is apparently often overlooked: the number of NFS server threads (nfsd). Keep in mind that a short simulation on a large number of nodes might be requesting many NFS connections all the time. Our experience is that NFS grinds to a halt when the number of requests is beyond what is actually available. You can get a sense of the load on the NFS server itself. On our 400-node cluster, we increased the number of NFS "servers" to 64, and that alone solved a lot of problems with NFS lock-ups. We still bring down the NFS daemon from time to time, but not nearly as often as before... You may want to ping your system administrator about this setting.
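For anyone hunting for this setting, here is one way to check and raise the nfsd thread count on a typical RHEL/CentOS-style NFS server; this is a sketch, and the paths, service names, and the value 64 are assumptions that may differ on your distribution:

```sh
# Check how many nfsd threads are currently running (run on the NFS server)
cat /proc/fs/nfsd/threads

# Raise the thread count for the running daemon
sudo rpc.nfsd 64

# To make it persistent on RHEL/CentOS, set the count in /etc/sysconfig/nfs:
#   RPCNFSDCOUNT=64
# and restart the NFS service.
```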
Thanks Fred, that's a nice thing to try! We had the default 8 threads on our cluster, so I'll test 64 and report. Edit: it didn't help :(
According to the admins, at our end the setting has "always" been 64, and the server load has never been even near full. But thank you anyway @ftessier! Other suggestions are also welcome!
I started an egs_chamber simulation on the cluster (the one with issues). I'm using the 'develop' branch from Sep 25th, and I've changed N_CHUNKS in src/egsnrc.macros to 1 and increased batch_sleep_time in scripts/batch_options.xxx to 10 seconds. What caught my attention is that even though I've changed N_CHUNKS ("how many chunks do we want to split the parallel run into") and re-compiled under HEN_HOUSE/egs++ and EGS_HOME/egs_chamber, I still get the following in each .egslog file under each egsrun directory: ...
which I understand to mean that changing N_CHUNKS has not had any effect. Is there something I've done wrong here? Or do I need to re-compile somewhere else?
@ojalaj in egs++ codes N_CHUNKS is set in the run control input block of the input file.
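For illustration, an egs++ run control block along these lines is where the chunking is set for parallel runs; the key names follow the PIRS-898 documentation referenced in the next comment, and the specific values here are just an example, so check that documentation for the exact syntax:

```
:start run control:
    ncase  = 100000000   # total number of histories
    nbatch = 10          # number of batches
    nchunk = 1           # chunks per parallel job; 1 minimizes lock file access
:stop run control:
```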
Great - thank you @rtownson! Edit: And yes, it is well documented (https://nrc-cnrc.github.io/EGSnrc/doc/pirs898/common.html), so I should have looked there first!
Dear all, I use EGSnrc 2020, and so far everything works properly except for running parallel jobs, which fail at the beginning with an error similar to the one reported here. I have tried different user codes (BEAMnrc, DOSXYZnrc, and cavity) and get the same error. Our system is Slurm, and when I submit parallel jobs they do not run at all and I get the error below:
Error: Failed to create a lock file named /home/aabuhaimed/EGSnrc/......lock
Quiting now.
and I get the lines below in each error file:
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file for 12 seconds...
egsLockControlFile: failed to lock file after 1 minute wait!
The lock file is created by the first job in the same directory, but it is empty and the jobs do not run. I tried different ways, but it is still not working. Any idea how to fix this issue?
Hi @Abdullah-Abuhaimed, I see that @blakewalters is addressing your question on reddit and via email, so we will not follow up here.
Dear all, |
Was there a resolution to this issue? I seem to have the exact same issue.
There are two solutions. In batch_options there should be a control for the wait time between jobs; for Slurm, try setting this to something large like 5 seconds (make sure it's not a multiple of 12; primes are probably better here). Or you could try crcrewso@2475674. Edit: I forgot I submitted this as PR #499.
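If it helps to locate that control: it is the batch_sleep_time variable mentioned earlier in this thread, set in the scripts/batch_options.xxx file for your batch system (the Slurm variant, in this case). A minimal sketch of the relevant line, with the exact file name and surrounding contents assumed:

```sh
# Delay, in seconds, between successive parallel job submissions;
# larger (and preferably prime) values reduce simultaneous lock attempts.
batch_sleep_time=5
```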
I tried crcrewso/EGSnrc@2475674 and it worked like a charm. Thanks!
I see that #499 was merged and that there were additional improvements regarding lock file issues in Release 2021, notably the uniform run control object (#588) and the new ...
I'm seeing the same behaviour that @ojalaj reported in the comments of PR #368: parallel jobs end with an error ("failed to lock or rewind the control file")
For what it's worth, I see it when running a BEAM accelerator with an IAEA phase space source and also when running DOSXYZnrc with source 20 and a shared library (either BEAM or external).
Not sure if this is general lock file troubles or if there's anything else going on.