Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault for large DF-MP3 calculation #1764

Open
devinamatthews opened this issue Dec 2, 2019 · 29 comments
Open

Segfault for large DF-MP3 calculation #1764

devinamatthews opened this issue Dec 2, 2019 · 29 comments
Labels
coding-needed For issues where we know the issue and need somebody to code the solution. crash For issues that cause a Psi4 non-compile, crash, segfault, or algorithm failure. dfocc For issues in the DFOCC module.

Comments

@devinamatthews
Copy link

A DF-MP3 calculation of C36H38/cc-pVDZ (36ene) fails with a segfault in the OCC module. See attached output file.
output.txt

@hokru
Copy link
Member

hokru commented Dec 2, 2019

possibly related to #1679. Although it should fit in the int variable raised in that issue. (i think)

Is there any other output? stderr captured by slurm?
What is the last content of timer.dat?

@JonathonMisiewicz
Copy link
Contributor

Technically, the problem is in the DFOCC module, not the OCC module.

Thoughts, @bozkaya? The part that grabs my attention is the line Memory requirement for CC contractions: -49826.57 MB. Somehow, I doubt that negative memory is accurate.

@devinamatthews
Copy link
Author

@hokru I can't find a timer.dat file in the submission or scratch directory (although I see it for other jobs that completed normally). All of the output is in the file attached above. I tried again with 700G memory and the symptoms were the same.

Contents of the scratch directory:

psi.22173.276
psi.22173.277
psi.22173.35
psi.22173.97
stdout.default.22173.180.npy

@hokru
Copy link
Member

hokru commented Dec 3, 2019

Thanks. Odd having no timer.dat at all. Hard to test these large calculations ourselves.

@JonathonMisiewicz I've seen these overflowing numbers before without crashes. But could still point in the right direction. (mentioned here #898 )

@devinamatthews
Copy link
Author

Since I need these numbers for a paper I may be motivated enough to build a debug version and dig in deeper. Or, is there a prebuilt debug version somewhere?

@loriab
Copy link
Member

loriab commented Dec 3, 2019

No, there's no prebuilt debug. You'll probably want to build from master, not 1.3.x. I don't see that it could cause a problem, but you may as well use a -jkfit for the scf, not -ri.

@devinamatthews
Copy link
Author

I had used -RI for the SCF because I am comparing against my dumb DF implementation in CFOUR that only uses a single fitting basis; I can try with -JKFIT.

@JonathonMisiewicz
Copy link
Contributor

If you want to go bug-fixing yourself, the DF-MP3 code starts around here. The timer file would give more detailed information about where the segfault occurred, but based on the output, mp3_t2_1st_sc or t2_2nd_sc seem most likely.

@andysim
Copy link
Member

andysim commented Dec 3, 2019

Sorry, I don't have the bandwidth to build it right now, but I think this line could be responsible for the bad memory estimate and, perhaps, that's causing problems later on. The variables aocc2AA and nvir2AA are declared int, so their product will be computed as an int, overflow, and then that overflowed entity is cast to double. A quick fix for that would be to declare the various dimensioning variables as size_t. Hopefully that'll fix the issue 🤞

@kaljugit
Copy link

kaljugit commented Dec 4, 2019

I had looked into the integer overflow issues in the DF code with MP3 as an example several months ago. The negative memory values reported are of course integral overflows and one can fix the printing of memory requirements by changing the int to a type that holds larger integers. But the actual problem happens later when, if I understand correctly, an array index becomes bigger than 2,147,483,647.

In my DF-MP3 test calculation this happened in the main loop in mp3_WabefT2.

Memory for I, Vs, Va, Ts, and Ta was successfully allocated, but then the quantity a * navirA * nQ became too large. I forced it to long or long-long, so the product could be evaluated (and printed printed out) as 2148655392 (as opposed to -2146311904 with int) but this positive value "anavirAnQ3" was illegal for the subsequent contraction.

I->contract(false, true, navirA * nb, navirA, nQ, K, K, 0, anavirAnQ3, 1.0, 0.0);

So, it is the array index, and not the array value, that is bigger than the 32-bit integer. And our math libraries index arrays with the 32-bit integer type!

I tried to compile Psi against MKL and OpenBLAS with 64-bit index arrays (the ILP64 interface) but the resulting program was not stable. So, if my thinking is correct, I am afraid we do not have an easy fix as long as Psi4 expects math libraries with 32-bit integer indices.

I can share some debug code (modified dfocc.h, df_ref.cc, tei_grad_corr.cc, df_corr.cc, ref_grad.cc, and mp3_W_intr.cc with some long int and printf statements) and sample outputs if anybody thinks this is helpful.

@hokru
Copy link
Member

hokru commented Dec 4, 2019

After @andysim 's fix I get a normal print

MO spaces...

         FC   OCC   VIR   FV
        ----------------------
         36   91   567    0

        Number of basis functions in the DF-CC basis: 2548

        Available memory                      :  61440.00 MB
        Memory requirement for 3-index ints   :   7413.66 MB
        Memory requirement for DF-CC int trans:  23261.99 MB
        Memory requirement for CC contractions:  60934.08 MB
        Warning: T2 amplitudes will be stored on the disk!
        Memory requirement for Wabef term     :  34201.37 MB

Though the calculation exceeds my 64 GiB RAM in the end.

@kaljugit wow, looks like you went deep!

@hokru
Copy link
Member

hokru commented Dec 4, 2019

good news. The MKL is fine for this. I got the calculation finished.
Trouble was likely again #1679 because it failed right at the amplitude writing. I applied the long long int modification and the size_t suggestion above (see patch).
Results: outfile.txt
git patch: fix.patch.txt (sort of untested hot fix for now)

@kaljugit
Copy link

kaljugit commented Dec 4, 2019

That is a good news! @hokru if you have time and resources, could you please check if your modification also works for the larger test case that gave me trouble. I am unable to try out your fix for another couple of weeks:
Input:
kk_dfmp3_test.log

@hokru
Copy link
Member

hokru commented Dec 4, 2019

@kaljugit It goes past the MP2 printout so it might work.
Anion without diffuse functions, though. Are you sure? fno-mp3/mar-cc-pV5Z should work very well with the 4-fold symmetry, btw. Only the integral writing after the scf is painfully slow (single-threaded).

set globals {
  basis       mar-cc-pV5Z
  freeze_core true
  ints_tolerance 1e-11
  s_tolerance 1e-9
}
energy('fno-mp3')

@hokru
Copy link
Member

hokru commented Dec 4, 2019

well, I see now that the 3rd order correlation energy is zero in my calculations...so this is not solved yet.

@devinamatthews
Copy link
Author

@hokru I made a few changes beyond what you had in your patch and it seems to work correctly now, for this molecule at least. MP2 and MP3 correlation energies are non-zero and in line with what I expect from smaller systems.
patch.txt

@hokru
Copy link
Member

hokru commented Dec 4, 2019

excellent!
I admit I did not do a good job and it was quickly done while waiting for lunch time.

@kaljugit
Copy link

kaljugit commented Dec 5, 2019

@hokru Thank you for giving it a try!

Yes, with my fixes it completed MP2
DF-MP2 Total Energy (a.u.) : -419.66275196620722
wrote out recalculated T2_2 (IA|JB) amplitudes in mp3_WmnijT2AA,
succeeded through mp3_WmbejT2,
and then died in mp3_WabefT2.

I omitted diffuse function for debugging only. All research work is with aug-cc-pVXZ or zapa-nr. The latter, while not as diffuse as aug-cc-pVXZ, gave me very nice basis set convergence for E2. For this system, mar-cc-pV5Z was actually not an obvious improvement over aug-cc-pVQZ ... Proton affinity with the latter was closer to aug-cc-pV5Z result compared to proton affinity with mar-cc-pV5Z.

@kaljugit
Copy link

kaljugit commented Dec 5, 2019

@devinamatthews Thanks for sharing the patch. Speaking of science, I am not sure if your example was a test or production job but I would be careful with third-order correlation energies in cc-pVDZ basis. See https://www.ncbi.nlm.nih.gov/pubmed/17186479 for details.

@devinamatthews
Copy link
Author

Apparently the problem is not completely fixed. Running (H2O)30 results in:

	MO spaces... 

	 FC   OCC   VIR   FV 
	----------------------
	 30  120   570    0

	Number of basis functions in the DF-CC basis: 2520

	Available memory                      : 667572.02 MB 
	Memory requirement for 3-index ints   :   7838.47 MB 
	Memory requirement for DF-CC int trans:  24103.73 MB 
	Memory requirement for CC contractions: 142778.32 MB 
	Total memory requirement for DF+CC int: 150616.79 MB 
	Memory requirement for Wabef term     :  49600.59 MB 

Traceback (most recent call last):
  File "/users/damatthews/apps/psi4/bin/psi4", line 289, in <module>
    exec(content)
  File "<string>", line 121, in <module>
  File "/users/damatthews/apps/psi4/lib/psi4/driver/driver.py", line 561, in energy
    wfn = procedures['energy'][lowername](lowername, molecule=molecule, **kwargs)
  File "/users/damatthews/apps/psi4/lib/psi4/driver/procrouting/proc.py", line 333, in select_mp3
    return func(name, **kwargs)
  File "/users/damatthews/apps/psi4/lib/psi4/driver/procrouting/proc.py", line 1620, in run_dfocc
    dfocc_wfn = core.dfocc(ref_wfn)

MemoryError: std::bad_array_new_length

Any ideas where to look next?

@kaljugit
Copy link

kaljugit commented Dec 5, 2019

One could try to incorporate the debugging "Printf" statements from the attached file to mp3_W_intr.cc in your patched-up system to see how far the MP3 calculation progresses.

The lines with anavirAnQ1, anavirAnQ2, anavirAnQ3 are probably not relevant after your patches but printing the value of the product (a navirA nQ) out in this main loop would be still helpful.

mp3_W_intr.cc.gz

@devinamatthews
Copy link
Author

It appears the problem is that Tensor1d (used e.g. in Tensor2d::write_symm()) uses int for the size--I'll have to update the whole class.

@devinamatthews
Copy link
Author

devinamatthews commented Dec 6, 2019

I made a lazy workaround by just using double[] instead of Tensor1d in write_symm and read_symm, but I guess it is only a temporary reprieve as once the T2 vector is more than INT_MAX elements basic functions like daxpy will stop working.

@dgasmith
Copy link
Member

dgasmith commented Dec 6, 2019

@devinamatthews This sounds great until @bozkaya can respond here. Would you mind patching this into Psi4 master?

@bozkaya
Copy link
Contributor

bozkaya commented Dec 6, 2019

Hi all,

I am out of Country for the International Junior Science Olympiad (IJSO), hence I could not catch up with you. I know the problem for large molecules, it is because of int. I think if I change all ints to long long int, the problem will be solved. When I find an available time I can take a look. Meanwhile, you can use your patch and update dfocc as long as your patch passes all dfocc tests. Alternatively, a volunter may change all int data types to LLI for dfocc. @devinamatthews @dgasmith

Best regards,

@susilehtola
Copy link
Member

susilehtola commented Dec 6, 2019 via email

@bozkaya
Copy link
Contributor

bozkaya commented Dec 6, 2019

Yes in most cases size_t would be better. However, we need to investigate all int variables in DFOCC whether they can have negative values or not, may be some of them are signed ints. Hence, the safest solution could be changing int to long long it. Overall, size_t is okay if we sure that we are not breaking any other part of the code, if we are not sure then long long int is a good solution. @susilehtola

@loriab loriab added this to the Psi4 1.4 milestone Jan 21, 2020
@JonathonMisiewicz
Copy link
Contributor

What is quite likely another case of the same general problem here. User reports large calculations give negative memories and bad array lengths, leading to a crash.

@JonathonMisiewicz JonathonMisiewicz removed this from the Psi4 1.4 milestone Apr 7, 2021
@JonathonMisiewicz JonathonMisiewicz added this to the Psi4 1.5 milestone Apr 7, 2021
@JonathonMisiewicz
Copy link
Contributor

Per Bozkaya request, dfocc is going untouched until after the Great Re-Sync, which is slated for 1.5 at earliest.

@JonathonMisiewicz JonathonMisiewicz added coding-needed For issues where we know the issue and need somebody to code the solution. crash For issues that cause a Psi4 non-compile, crash, segfault, or algorithm failure. dfocc For issues in the DFOCC module. labels Jun 28, 2021
@JonathonMisiewicz JonathonMisiewicz modified the milestones: Psi4 1.5, Psi4 1.6 Nov 18, 2021
@JonathonMisiewicz JonathonMisiewicz removed this from the Psi4 1.6 milestone Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
coding-needed For issues where we know the issue and need somebody to code the solution. crash For issues that cause a Psi4 non-compile, crash, segfault, or algorithm failure. dfocc For issues in the DFOCC module.
Projects
None yet
Development

No branches or pull requests

9 participants