Segfault for large DF-MP3 calculation #1764
Comments
possibly related to #1679. Although it should fit in the Is there any other output? |
Technically, the problem is in the DFOCC module, not the OCC module. Thoughts, @bozkaya? The part that grabs my attention is the line |
@hokru I can't find a Contents of the scratch directory:
|
Thanks. Odd having no @JonathonMisiewicz I've seen these overflowing numbers before without crashes. But could still point in the right direction. (mentioned here #898 ) |
Since I need these numbers for a paper I may be motivated enough to build a debug version and dig in deeper. Or, is there a prebuilt debug version somewhere? |
No, there's no prebuilt debug. You'll probably want to build from master, not 1.3.x. I don't see that it could cause a problem, but you may as well use a -jkfit for the scf, not -ri. |
I had used -RI for the SCF because I am comparing against my dumb DF implementation in CFOUR that only uses a single fitting basis; I can try with -JKFIT. |
If you want to go bug-fixing yourself, the DF-MP3 code starts around here. The timer file would give more detailed information about where the segfault occurred, but based on the output, |
Sorry, I don't have the bandwidth to build it right now, but I think this line could be responsible for the bad memory estimate and, perhaps, that's causing problems later on. The variables |
I looked into the integer overflow issues in the DF code, with MP3 as an example, several months ago. The negative memory values reported are of course integer overflows, and one can fix the printing of memory requirements by changing the int to a type that holds larger integers. But the actual problem happens later when, if I understand correctly, an array index becomes bigger than 2,147,483,647. In my DF-MP3 test calculation this happened in the main loop in mp3_WabefT2. Memory for I, Vs, Va, Ts, and Ta was successfully allocated, but then the quantity a * navirA * nQ became too large. I forced it to long or long long, so the product could be evaluated (and printed out) as 2148655392 (as opposed to -2146311904 with int), but this positive value "anavirAnQ3" was illegal for the subsequent contraction: `I->contract(false, true, navirA * nb, navirA, nQ, K, K, 0, anavirAnQ3, 1.0, 0.0);` So it is the array index, and not the array value, that exceeds the 32-bit integer range. And our math libraries index arrays with the 32-bit integer type! I tried to compile Psi against MKL and OpenBLAS with 64-bit array indices (the ILP64 interface), but the resulting program was not stable. So, if my thinking is correct, I am afraid we do not have an easy fix as long as Psi4 expects math libraries with 32-bit integer indices. I can share some debug code (modified dfocc.h, df_ref.cc, tei_grad_corr.cc, df_corr.cc, ref_grad.cc, and mp3_W_intr.cc with some long int and printf statements) and sample outputs if anybody thinks this is helpful. |
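The arithmetic in the comment above can be checked with a small standalone sketch. This is not Psi4 code: the helper names `flat_offset`, `fits_in_int32`, and `wrapped_as_int32` are hypothetical, and the only numbers taken from the thread are 2148655392 and -2146311904. It forms the product in 64-bit arithmetic, tests whether the result still fits a 32-bit index, and emulates the two's-complement wraparound without relying on signed overflow, which is undefined behavior in C++.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: form an offset such as a * navirA * nQ in 64-bit
// arithmetic. Widening the first operand makes every multiply 64-bit.
std::int64_t flat_offset(int a, int navirA, int nQ) {
    return static_cast<std::int64_t>(a) * navirA * nQ;
}

// True if the value can be passed safely as a 32-bit (LP64-style) index.
bool fits_in_int32(std::int64_t v) {
    return v >= std::numeric_limits<std::int32_t>::min() &&
           v <= std::numeric_limits<std::int32_t>::max();
}

// Emulate what a 32-bit two's-complement int would hold after wraparound,
// using unsigned arithmetic so no signed overflow actually occurs.
std::int64_t wrapped_as_int32(std::int64_t v) {
    const std::uint64_t low = static_cast<std::uint64_t>(v) & 0xFFFFFFFFull;
    return low >= 0x80000000ull
               ? static_cast<std::int64_t>(low) - 0x100000000ll
               : static_cast<std::int64_t>(low);
}
```

With these helpers, `wrapped_as_int32(2148655392)` gives -2146311904, the negative value quoted above, and `fits_in_int32(2148655392)` is false. That is why widening the variable alone does not help when the callee still takes a 32-bit offset.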
After @andysim 's fix I get a normal print
Though the calculation exceeds my 64 GiB RAM in the end. @kaljugit wow, looks like you went deep! |
good news. The MKL is fine for this. I got the calculation finished. |
That is good news! @hokru if you have time and resources, could you please check whether your modification also works for the larger test case that gave me trouble? I am unable to try out your fix for another couple of weeks: |
@kaljugit It goes past the MP2 printout so it might work.
|
well, I see now that the 3rd order correlation energy is zero in my calculations...so this is not solved yet. |
excellent! |
@hokru Thank you for giving it a try! Yes, with my fixes it completed MP2 I omitted diffuse functions for debugging only. All research work is with aug-cc-pVXZ or zapa-nr. The latter, while not as diffuse as aug-cc-pVXZ, gave me very nice basis set convergence for E2. For this system, mar-cc-pV5Z was actually not an obvious improvement over aug-cc-pVQZ ... The proton affinity with the latter was closer to the aug-cc-pV5Z result than the proton affinity with mar-cc-pV5Z. |
@devinamatthews Thanks for sharing the patch. Speaking of science, I am not sure if your example was a test or a production job, but I would be careful with third-order correlation energies in the cc-pVDZ basis. See https://www.ncbi.nlm.nih.gov/pubmed/17186479 for details. |
Apparently the problem is not completely fixed. Running (H2O)30 results in:
Any ideas where to look next? |
One could try to incorporate the debugging "Printf" statements from the attached file into mp3_W_intr.cc in your patched-up system to see how far the MP3 calculation progresses. The lines with anavirAnQ1, anavirAnQ2, anavirAnQ3 are probably not relevant after your patches, but printing out the value of the product (a * navirA * nQ) in this main loop would still be helpful. |
It appears the problem is that |
I made a lazy workaround by just using |
@devinamatthews This sounds great until @bozkaya can respond here. Would you mind patching this into Psi4 master? |
Hi all, I am out of the country for the International Junior Science Olympiad (IJSO), so I could not keep up with this thread. I know the problem for large molecules; it is because of int. I think if I change all ints to long long int, the problem will be solved. When I find time I can take a look. Meanwhile, you can use your patch and update dfocc as long as your patch passes all dfocc tests. Alternatively, a volunteer may change all int data types to LLI for dfocc. @devinamatthews @dgasmith Best regards, |
Wouldn’t using `size_t` be the best option? It is, after all, designed to be the size type...
|
Yes, in most cases size_t would be better. However, we need to check every int variable in DFOCC to see whether it can take negative values; maybe some of them need to be signed. Hence, the safest solution could be changing int to long long int. Overall, size_t is okay if we are sure that we are not breaking any other part of the code; if we are not sure, then long long int is a good solution. @susilehtola |
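The concern about signedness can be illustrated with a minimal sketch. The function names are hypothetical and nothing here is taken from DFOCC. It shows two places where a blanket int to `size_t` swap changes behavior: a difference that used to be negative wraps to a huge positive value, and a countdown loop needs a different termination test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Signed 64-bit difference: a negative result is representable.
std::int64_t signed_diff(std::int64_t a, std::int64_t b) {
    return a - b;
}

// Unsigned difference: when a < b the result wraps modulo 2^N instead of
// becoming negative, so later "< 0" checks on it can never fire.
std::size_t unsigned_diff(std::size_t a, std::size_t b) {
    return a - b;
}

// Countdown over indices n-1 .. 0 written safely for an unsigned type.
// The naive form `for (std::size_t i = n - 1; i >= 0; --i)` never ends,
// because an unsigned i is always >= 0.
std::size_t count_down_visits(std::size_t n) {
    std::size_t visits = 0;
    for (std::size_t i = n; i-- > 0;) {
        ++visits;  // the loop body sees i = n-1, n-2, ..., 0
    }
    return visits;
}
```

Here `signed_diff(3, 5)` is -2, while `unsigned_diff(3, 5)` is the largest `size_t` value minus one. Any variable that is ever compared against a negative sentinel or decremented past zero is safer as a 64-bit signed type, which is the trade-off described in the comment above.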
This is quite likely another case of the same general problem here. A user reports that large calculations give negative memory values and bad array lengths, leading to a crash. |
Per Bozkaya's request, |
A DF-MP3 calculation of C36H38/cc-pVDZ (36ene) fails with a segfault in the OCC module. See attached output file.
output.txt