Segfault for large DF-MP3 calculation #1764

devinamatthews · 2019-12-02T23:00:48Z

A DF-MP3 calculation of C36H38/cc-pVDZ (36ene) fails with a segfault in the OCC module. See attached output file.
output.txt

hokru · 2019-12-02T23:14:11Z

Possibly related to #1679, although it should fit in the int variable raised in that issue (I think).

Is there any other output? stderr captured by slurm?
What is the last content of timer.dat?

JonathonMisiewicz · 2019-12-03T01:00:01Z

Technically, the problem is in the DFOCC module, not the OCC module.

Thoughts, @bozkaya? The part that grabs my attention is the line Memory requirement for CC contractions: -49826.57 MB. Somehow, I doubt that negative memory is accurate.

devinamatthews · 2019-12-03T15:45:41Z

@hokru I can't find a timer.dat file in the submission or scratch directory (although I see it for other jobs that completed normally). All of the output is in the file attached above. I tried again with 700G memory and the symptoms were the same.

Contents of the scratch directory:

psi.22173.276
psi.22173.277
psi.22173.35
psi.22173.97
stdout.default.22173.180.npy

hokru · 2019-12-03T16:43:51Z

Thanks. Odd having no timer.dat at all. Hard to test these large calculations ourselves.

@JonathonMisiewicz I've seen these overflowing numbers before without crashes, but they could still point in the right direction (mentioned in #898).

devinamatthews · 2019-12-03T16:46:08Z

Since I need these numbers for a paper I may be motivated enough to build a debug version and dig in deeper. Or, is there a prebuilt debug version somewhere?

loriab · 2019-12-03T16:53:18Z

No, there's no prebuilt debug. You'll probably want to build from master, not 1.3.x. I don't see that it could cause a problem, but you may as well use a -jkfit for the scf, not -ri.

devinamatthews · 2019-12-03T16:54:35Z

I had used -RI for the SCF because I am comparing against my dumb DF implementation in CFOUR that only uses a single fitting basis; I can try with -JKFIT.

JonathonMisiewicz · 2019-12-03T17:02:06Z

If you want to go bug-fixing yourself, the DF-MP3 code starts around here. The timer file would give more detailed information about where the segfault occurred, but based on the output, mp3_t2_1st_sc or t2_2nd_sc seem most likely.

andysim · 2019-12-03T17:04:56Z

Sorry, I don't have the bandwidth to build it right now, but I think this line could be responsible for the bad memory estimate and, perhaps, that's causing problems later on. The variables aocc2AA and nvir2AA are declared int, so their product will be computed as an int, overflow, and then that overflowed entity is cast to double. A quick fix for that would be to declare the various dimensioning variables as size_t. Hopefully that'll fix the issue 🤞
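The overflow described above can be reproduced in isolation. The following is a minimal sketch, not the DFOCC source: the function names are hypothetical, and the dimensions in the usage note (91² = 8281 and 567² = 321489) are illustrative values borrowed from the MO space sizes printed later in this thread, on the assumption that the dimensioning variables are squares of orbital counts. Signed overflow is undefined behavior in C++, so the 32-bit result is emulated with unsigned arithmetic rather than by actually overflowing an int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Emulates the value a 32-bit int product typically wraps to on
// two's-complement hardware, without invoking signed-overflow UB.
std::int64_t product_wrapped_to_int32(std::int32_t a, std::int32_t b) {
    assert(a >= 0 && b >= 0);  // dimensions are non-negative
    const std::uint64_t full =
        static_cast<std::uint64_t>(a) * static_cast<std::uint64_t>(b);
    const std::uint32_t low = static_cast<std::uint32_t>(full);  // low 32 bits
    const std::int64_t value = static_cast<std::int64_t>(low);
    return low > static_cast<std::uint32_t>(INT32_MAX)
               ? value - (std::int64_t{1} << 32)
               : value;
}

// Converts a count of doubles into megabytes.
double doubles_to_mb(double n) { return n * sizeof(double) / (1024.0 * 1024.0); }

// Estimate as it comes out when the product is formed in int first
// and only then converted to double.
double estimate_mb_int(std::int32_t occ2, std::int32_t vir2) {
    return doubles_to_mb(static_cast<double>(product_wrapped_to_int32(occ2, vir2)));
}

// Same estimate with the product formed in size_t
// (assumed to be 64 bits wide, as on typical 64-bit platforms).
double estimate_mb_size_t(std::size_t occ2, std::size_t vir2) {
    return doubles_to_mb(static_cast<double>(occ2 * vir2));
}
```

With occ2 = 8281 and vir2 = 321489 the true product is 2662250409, which is above INT_MAX. The emulated int product is -1632716887, so `estimate_mb_int` returns a negative number of megabytes while `estimate_mb_size_t` stays positive.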

kaljugit · 2019-12-04T09:06:30Z

I had looked into the integer overflow issues in the DF code with MP3 as an example several months ago. The negative memory values reported are of course integer overflows, and one can fix the printing of memory requirements by changing the int to a type that holds larger integers. But the actual problem happens later when, if I understand correctly, an array index becomes bigger than 2,147,483,647.

In my DF-MP3 test calculation this happened in the main loop in mp3_WabefT2.

Memory for I, Vs, Va, Ts, and Ta was successfully allocated, but then the quantity a * navirA * nQ became too large. I forced it to long or long long, so the product could be evaluated (and printed out) as 2148655392 (as opposed to -2146311904 with int), but this positive value "anavirAnQ3" was illegal for the subsequent contraction.

I->contract(false, true, navirA * nb, navirA, nQ, K, K, 0, anavirAnQ3, 1.0, 0.0);

So, it is the array index, and not the array value, that is bigger than the 32-bit integer. And our math libraries index arrays with the 32-bit integer type!
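The wraparound reported here can be checked by hand: 2148655392 exceeds INT_MAX (2147483647), and subtracting 2^32 gives -2146311904. The sketch below reproduces that arithmetic and shows a hypothetical guard (`fits_in_int32`, which is not a Psi4 or BLAS function) that a caller could apply before handing an offset to an interface with 32-bit indices.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if `offset` can be passed through a 32-bit signed index unchanged.
bool fits_in_int32(std::int64_t offset) {
    return offset >= std::numeric_limits<std::int32_t>::min() &&
           offset <= std::numeric_limits<std::int32_t>::max();
}

// Value obtained when a non-negative 64-bit offset is squeezed into 32 bits
// (two's-complement wraparound), computed without undefined behavior.
std::int64_t wrap_to_int32(std::int64_t offset) {
    assert(offset >= 0);
    const std::uint32_t low = static_cast<std::uint32_t>(offset);  // modulo 2^32
    const std::int64_t value = static_cast<std::int64_t>(low);
    return fits_in_int32(value) ? value : value - (std::int64_t{1} << 32);
}
```

For the offset quoted above, `fits_in_int32(2148655392)` is false and `wrap_to_int32(2148655392)` gives -2146311904, matching the two printed values. A guard like this only detects the problem; it does not make a 32-bit interface accept the larger offset.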

I tried to compile Psi against MKL and OpenBLAS with 64-bit index arrays (the ILP64 interface) but the resulting program was not stable. So, if my thinking is correct, I am afraid we do not have an easy fix as long as Psi4 expects math libraries with 32-bit integer indices.

I can share some debug code (modified dfocc.h, df_ref.cc, tei_grad_corr.cc, df_corr.cc, ref_grad.cc, and mp3_W_intr.cc with some long int and printf statements) and sample outputs if anybody thinks this is helpful.

hokru · 2019-12-04T09:19:50Z

After @andysim's fix I get a normal printout:

MO spaces...

         FC   OCC   VIR   FV
        ----------------------
         36   91   567    0

        Number of basis functions in the DF-CC basis: 2548

        Available memory                      :  61440.00 MB
        Memory requirement for 3-index ints   :   7413.66 MB
        Memory requirement for DF-CC int trans:  23261.99 MB
        Memory requirement for CC contractions:  60934.08 MB
        Warning: T2 amplitudes will be stored on the disk!
        Memory requirement for Wabef term     :  34201.37 MB

Though the calculation exceeds my 64 GiB RAM in the end.

@kaljugit wow, looks like you went deep!

hokru · 2019-12-04T15:05:38Z

good news. The MKL is fine for this. I got the calculation finished.
Trouble was likely again #1679 because it failed right at the amplitude writing. I applied the long long int modification and the size_t suggestion above (see patch).
Results: outfile.txt
git patch: fix.patch.txt (sort of untested hot fix for now)

kaljugit · 2019-12-04T16:07:59Z

That is good news! @hokru if you have time and resources, could you please check whether your modification also works for the larger test case that gave me trouble? I am unable to try out your fix for another couple of weeks:
Input:
kk_dfmp3_test.log

hokru · 2019-12-04T16:29:17Z

@kaljugit It goes past the MP2 printout so it might work.
Anion without diffuse functions, though. Are you sure? fno-mp3/mar-cc-pV5Z should work very well with the 4-fold symmetry, btw. Only the integral writing after the scf is painfully slow (single-threaded).

set globals {
  basis       mar-cc-pV5Z
  freeze_core true
  ints_tolerance 1e-11
  s_tolerance 1e-9
}
energy('fno-mp3')

hokru · 2019-12-04T19:02:42Z

well, I see now that the 3rd order correlation energy is zero in my calculations...so this is not solved yet.

devinamatthews · 2019-12-04T22:38:39Z

@hokru I made a few changes beyond what you had in your patch and it seems to work correctly now, for this molecule at least. MP2 and MP3 correlation energies are non-zero and in line with what I expect from smaller systems.
patch.txt

hokru · 2019-12-04T22:41:32Z

excellent!
I admit I did not do a good job and it was quickly done while waiting for lunch time.

kaljugit · 2019-12-05T05:22:49Z

@hokru Thank you for giving it a try!

Yes, with my fixes it completed MP2
DF-MP2 Total Energy (a.u.) : -419.66275196620722
wrote out recalculated T2_2 (IA|JB) amplitudes in mp3_WmnijT2AA,
succeeded through mp3_WmbejT2,
and then died in mp3_WabefT2.

I omitted diffuse functions for debugging only. All research work is with aug-cc-pVXZ or zapa-nr. The latter, while not as diffuse as aug-cc-pVXZ, gave me very nice basis set convergence for E2. For this system, mar-cc-pV5Z was actually not an obvious improvement over aug-cc-pVQZ: the proton affinity with aug-cc-pVQZ was closer to the aug-cc-pV5Z result than the proton affinity with mar-cc-pV5Z was.

kaljugit · 2019-12-05T05:31:14Z

@devinamatthews Thanks for sharing the patch. Speaking of science, I am not sure if your example was a test or production job but I would be careful with third-order correlation energies in cc-pVDZ basis. See https://www.ncbi.nlm.nih.gov/pubmed/17186479 for details.

devinamatthews · 2019-12-05T16:55:06Z

Apparently the problem is not completely fixed. Running (H2O)30 results in:

	MO spaces... 

	 FC   OCC   VIR   FV 
	----------------------
	 30  120   570    0

	Number of basis functions in the DF-CC basis: 2520

	Available memory                      : 667572.02 MB 
	Memory requirement for 3-index ints   :   7838.47 MB 
	Memory requirement for DF-CC int trans:  24103.73 MB 
	Memory requirement for CC contractions: 142778.32 MB 
	Total memory requirement for DF+CC int: 150616.79 MB 
	Memory requirement for Wabef term     :  49600.59 MB 

Traceback (most recent call last):
  File "/users/damatthews/apps/psi4/bin/psi4", line 289, in <module>
    exec(content)
  File "<string>", line 121, in <module>
  File "/users/damatthews/apps/psi4/lib/psi4/driver/driver.py", line 561, in energy
    wfn = procedures['energy'][lowername](lowername, molecule=molecule, **kwargs)
  File "/users/damatthews/apps/psi4/lib/psi4/driver/procrouting/proc.py", line 333, in select_mp3
    return func(name, **kwargs)
  File "/users/damatthews/apps/psi4/lib/psi4/driver/procrouting/proc.py", line 1620, in run_dfocc
    dfocc_wfn = core.dfocc(ref_wfn)

MemoryError: std::bad_array_new_length

Any ideas where to look next?

kaljugit · 2019-12-05T19:59:54Z

One could try to incorporate the debugging "Printf" statements from the attached file into mp3_W_intr.cc in your patched-up system to see how far the MP3 calculation progresses.

The lines with anavirAnQ1, anavirAnQ2, anavirAnQ3 are probably not relevant after your patches, but printing the value of the product (a * navirA * nQ) in this main loop would still be helpful.

mp3_W_intr.cc.gz

devinamatthews · 2019-12-05T23:00:31Z

It appears the problem is that Tensor1d (used e.g. in Tensor2d::write_symm()) uses int for the size--I'll have to update the whole class.
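One plausible way an int-sized container leads to the `std::bad_array_new_length` seen in the traceback above: in C++11 and later, a new-expression whose runtime array length is negative throws that exception, which derives from `std::bad_alloc`. The sketch below is an assumption about the mechanism, not the Tensor1d code; `allocation_rejected` is a hypothetical helper, and it catches the base class so that it also covers toolchains that report a plain `std::bad_alloc`.

```cpp
#include <cassert>
#include <new>

// Hypothetical stand-in for a container that stores its length as int.
// Returns true if allocating `n` doubles is rejected by the runtime.
bool allocation_rejected(int n) {
    try {
        // volatile keeps the compiler from eliding the allocation.
        double* volatile data = new double[n];
        delete[] data;
        return false;
    } catch (const std::bad_alloc&) {  // base of std::bad_array_new_length
        return true;
    }
}

// Usage: a wrapped (negative) length is rejected, a small positive one is not.
void demo() {
    assert(allocation_rejected(-2146311904));
    assert(!allocation_rejected(16));
}
```

The negative length used in `demo` is the wrapped value quoted earlier in the thread; any element count above INT_MAX that is stored in an int can end up negative in the same way.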

devinamatthews · 2019-12-06T02:48:30Z

I made a lazy workaround by just using double[] instead of Tensor1d in write_symm and read_symm, but I guess it is only a temporary reprieve as once the T2 vector is more than INT_MAX elements basic functions like daxpy will stop working.
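One generic way around a 32-bit length argument is to split a long vector into pieces that each fit in an int. The sketch below illustrates that idea only; it is not what the patch in this thread does. `axpy_int` is a plain-loop stand-in for a BLAS-style routine such as daxpy (without the stride arguments), and `axpy_chunked` and its `max_chunk` parameter are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

// Stand-in for a BLAS-style routine whose length argument is a 32-bit int.
void axpy_int(int n, double alpha, const double* x, double* y) {
    for (int i = 0; i < n; ++i) y[i] += alpha * x[i];
}

// Computes y += alpha * x for a size_t length by issuing calls of at most
// `max_chunk` elements each. Returns the number of calls made.
std::size_t axpy_chunked(std::size_t n, double alpha, const double* x, double* y,
                         int max_chunk = std::numeric_limits<int>::max()) {
    assert(max_chunk > 0);
    std::size_t calls = 0;
    for (std::size_t done = 0; done < n; ++calls) {
        const std::size_t len =
            std::min(n - done, static_cast<std::size_t>(max_chunk));
        axpy_int(static_cast<int>(len), alpha, x + done, y + done);
        done += len;
    }
    return calls;
}
```

Passing a small `max_chunk` makes the splitting easy to test without allocating billions of elements; with the default, a vector only needs more than one call once it exceeds INT_MAX elements. Chunking like this works for element-wise operations but does not directly help contractions that need a single large offset.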

dgasmith · 2019-12-06T13:48:19Z

@devinamatthews This sounds great until @bozkaya can respond here. Would you mind patching this into Psi4 master?

bozkaya · 2019-12-06T15:07:55Z

Hi all,

I am out of the country for the International Junior Science Olympiad (IJSO), hence I could not catch up with you. I know the problem for large molecules; it is because of int. I think if I change all ints to long long int, the problem will be solved. When I find time I can take a look. Meanwhile, you can use your patch and update dfocc as long as your patch passes all dfocc tests. Alternatively, a volunteer may change all int data types to LLI for dfocc. @devinamatthews @dgasmith

Best regards,

susilehtola · 2019-12-06T19:13:19Z

Wouldn’t using `size_t` be the best option? It is, after all, designed to be the size type...

bozkaya · 2019-12-06T21:13:19Z

Yes, in most cases size_t would be better. However, we need to check all int variables in DFOCC to see whether they can take negative values; maybe some of them need to be signed. Hence, the safest solution could be changing int to long long int. Overall, size_t is okay if we are sure that we are not breaking any other part of the code; if we are not sure, then long long int is a good solution. @susilehtola
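The concern about negative values can be made concrete. The sketch below uses hypothetical helper names and is not taken from DFOCC; it shows how the same subtraction behaves in a signed 64-bit type and in size_t, and one countdown-loop idiom that stays correct with an unsigned index.

```cpp
#include <cassert>
#include <cstddef>

// Signed arithmetic keeps a negative difference negative.
long long signed_diff(long long a, long long b) { return a - b; }

// Unsigned arithmetic is modulo 2^N, so a "negative" difference
// silently becomes a very large positive number.
std::size_t unsigned_diff(std::size_t a, std::size_t b) { return a - b; }

// Countdown that is safe for an unsigned index: the test happens before
// the decrement is used, so i never needs to go below zero.
// (The common `for (i = n - 1; i >= 0; --i)` form never terminates for
// size_t, because i >= 0 is always true.)
std::size_t count_down(std::size_t n) {
    std::size_t visits = 0;
    for (std::size_t i = n; i-- > 0;) ++visits;
    return visits;
}

// Usage: the same subtraction, two very different results.
void demo() {
    assert(signed_diff(3, 5) == -2);
    assert(unsigned_diff(3, 5) == static_cast<std::size_t>(-2));
    assert(count_down(4) == 4);
}
```

Any variable that is used as a sentinel (for example -1 for "not set"), in a difference, or in a descending loop would need this kind of review before being switched from int to size_t, which is the audit cost being weighed against a blanket change to long long int.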

JonathonMisiewicz · 2020-09-16T15:45:22Z

What is quite likely another case of the same general problem was reported here. A user reports that large calculations give negative memory values and bad array lengths, leading to a crash.

JonathonMisiewicz · 2021-04-07T21:40:10Z

Per Bozkaya's request, dfocc is going untouched until after the Great Re-Sync, which is slated for 1.5 at the earliest.

loriab added this to the Psi4 1.4 milestone Jan 21, 2020

JonathonMisiewicz removed this from the Psi4 1.4 milestone Apr 7, 2021

JonathonMisiewicz added this to the Psi4 1.5 milestone Apr 7, 2021

JonathonMisiewicz added the coding-needed, crash, and dfocc labels Jun 28, 2021

JonathonMisiewicz modified the milestones: Psi4 1.5, Psi4 1.6 Nov 18, 2021

JonathonMisiewicz removed this from the Psi4 1.6 milestone Mar 15, 2022

hokru mentioned this issue Aug 11, 2022 in "dfocc: coupled DIIS, long int, and testing #2669" (merged)
