cdomp2-2 test case fails #391

Closed
mbanck opened this Issue May 28, 2016 · 11 comments


mbanck commented May 28, 2016

I'm seeing a crash in cdomp2-2 for 1.0rc on Debian unstable:

        Computing CD-MP2 energy using SCF MOs (Canonical CD-MP2)...
        =======================================================================
        Nuclear Repulsion Energy (a.u.)    :    25.59060766929188
        CD-HF Energy (a.u.)                :  -129.25927206477397
        REF Energy (a.u.)                  :  -129.25927206477397
        Alpha-Alpha Contribution (a.u.)    :    -0.03921205696199
        Alpha-Beta Contribution (a.u.)     :    -0.20476196654188
        Beta-Beta Contribution (a.u.)      :    -0.03692010667753
        Scaled_SS Correlation Energy (a.u.):    -0.02537738787984
        Scaled_OS Correlation Energy (a.u.):    -0.24571435985026
        CD-SCS-MP2 Total Energy (a.u.)     :  -129.53036381250408
        CD-SOS-MP2 Total Energy (a.u.)     :  -129.52546262127842
        CD-SCSN-MP2 Total Energy (a.u.)    :  -129.39326467277954
        CD-MP2 Correlation Energy (a.u.)   :    -0.28089413018141
        CD-MP2 Total Energy (a.u.)         :  -129.54016619495539
        =======================================================================

        Number of alpha independent-pairs:172
        Number of beta independent-pairs :157

 ==============================================================================
 ================ Performing CD-OMP2 iterations... ============================
 ==============================================================================
                    Minimizing CD-MP2-L Functional
                    ------------------------------
 Iter       E_total           DE           RMS MO Grad      MAX MO Grad      RMS T2
 ----    ---------------    ----------     -----------      -----------     ----------
   1     2171897840405125233095316875881590551467307175332860142219670350830780702696283291230457729657914921746891661667264776735054294112163359650172177568018543474023361651834998812756231127668278305330007498940536320077394375160362584676926576578513365418212288803914428069938462720.0000000000     2.17e+276            inf        1.41e+183        8.10e+89
An error has occurred python-side. Traceback (most recent call last):

  File "<string>", line 40, in <module>

  File "/build/psi4-lIOjn9/psi4-1.0~rc/share/python/driver.py", line 444, in energy
    wfn = procedures['energy'][lowername](lowername, molecule=molecule, **kwargs)

  File "/build/psi4-lIOjn9/psi4-1.0~rc/share/python/procedures/proc.py", line 223, in select_omp2
    return func(name, **kwargs)

  File "/build/psi4-lIOjn9/psi4-1.0~rc/share/python/procedures/proc.py", line 1308, in run_dfocc
    dfocc_wfn = psi4.dfocc(ref_wfn)

RuntimeError:
Fatal Error: DF-OCC iterations are diverging
Error occurred in file: /build/psi4-lIOjn9/psi4-1.0~rc/src/bin/dfocc/occ_iterations.cc on line: 263
The most recent 5 function calls were:

psi::PsiException::PsiException(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char const*, int)
psi::dfoccwave::DFOCC::occ_iterations()
psi::dfoccwave::DFOCC::cd_omp2_manager()
psi::dfoccwave::DFOCC::compute_energy()
psi::dfoccwave::dfoccwave(boost::shared_ptr<psi::Wavefunction>, psi::Options&)

mbanck commented May 28, 2016

The reference is already somewhat off:

        Computing CD-MP2 energy using SCF MOs (Canonical CD-MP2)... 
        ======================================================================= 
-       Nuclear Repulsion Energy (a.u.)    :    25.59060766929189
-       CD-HF Energy (a.u.)                :  -129.25927207001686
-       REF Energy (a.u.)                  :  -129.25927207001686
-       Alpha-Alpha Contribution (a.u.)    :    -0.04358607252482
-       Alpha-Beta Contribution (a.u.)     :    -0.22869655849159
-       Beta-Beta Contribution (a.u.)      :    -0.04274321396769
-       Scaled_SS Correlation Energy (a.u.):    -0.02877642883084
-       Scaled_OS Correlation Energy (a.u.):    -0.27443587018991
-       CD-SCS-MP2 Total Energy (a.u.)     :  -129.56248436903761
-       CD-SOS-MP2 Total Energy (a.u.)     :  -129.55657759605592
-       CD-SCSN-MP2 Total Energy (a.u.)    :  -129.41121161424368
-       CD-MP2 Correlation Energy (a.u.)   :    -0.31502584498410
-       CD-MP2 Total Energy (a.u.)         :  -129.57429791500095
+       Nuclear Repulsion Energy (a.u.)    :    25.59060766929188
+       CD-HF Energy (a.u.)                :  -129.25927206477397
+       REF Energy (a.u.)                  :  -129.25927206477397
+       Alpha-Alpha Contribution (a.u.)    :    -0.03921205696199
+       Alpha-Beta Contribution (a.u.)     :    -0.20476196654188
+       Beta-Beta Contribution (a.u.)      :    -0.03692010667753
+       Scaled_SS Correlation Energy (a.u.):    -0.02537738787984
+       Scaled_OS Correlation Energy (a.u.):    -0.24571435985026
+       CD-SCS-MP2 Total Energy (a.u.)     :  -129.53036381250408
+       CD-SOS-MP2 Total Energy (a.u.)     :  -129.52546262127842
+       CD-SCSN-MP2 Total Energy (a.u.)    :  -129.39326467277954
+       CD-MP2 Correlation Energy (a.u.)   :    -0.28089413018141
+       CD-MP2 Total Energy (a.u.)         :  -129.54016619495539
        ======================================================================= 

Should I post the whole diff against output.ref? I just noticed that output.ref seems to have been generated with 0.5; can somebody confirm this is not a general problem on 1.0rc? All other tests in quicktests have passed.
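
As a quick sanity check, the components printed on the "+" lines can be recombined by hand. The sketch below assumes the usual spin-component scaling factors (1/3 and 6/5 for SCS, 1.3 for SOS, 1.76 for SCSN); these factors are not stated anywhere in the thread, they are inferred from the printed values, and the variable names are made up for illustration.

```python
import math

# Values copied from the 1.0rc ("+") lines of the diff above.
ref = -129.25927206477397   # REF Energy
aa = -0.03921205696199      # Alpha-Alpha Contribution
ab = -0.20476196654188      # Alpha-Beta Contribution
bb = -0.03692010667753      # Beta-Beta Contribution

# Assumed scaling factors (inferred from the output, not from the thread).
scaled_ss = (aa + bb) / 3.0
scaled_os = 1.2 * ab

# Totals recomputed from the components.
totals = {
    "CD-MP2": ref + aa + ab + bb,
    "CD-SCS-MP2": ref + scaled_ss + scaled_os,
    "CD-SOS-MP2": ref + 1.3 * ab,
    "CD-SCSN-MP2": ref + 1.76 * (aa + bb),
}

# Totals as printed in the same block of output.
printed = {
    "CD-MP2": -129.54016619495539,
    "CD-SCS-MP2": -129.53036381250408,
    "CD-SOS-MP2": -129.52546262127842,
    "CD-SCSN-MP2": -129.39326467277954,
}

for name, value in totals.items():
    consistent = math.isclose(value, printed[name], abs_tol=1e-9)
    print(f"{name:12s} recomputed {value:.11f}  consistent: {consistent}")
```

If the recomputed totals agree with the printed ones, the summary is internally consistent, which would suggest that the deviation from output.ref already sits in the individual spin contributions rather than in how they are combined.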

Contributor

bozkaya commented May 28, 2016

On my Mac with Psi4 1.0rc3 (71ea2ea), the cd-omp2-2 test terminates normally, so I cannot reproduce the problem. Which BLAS do you have on Debian?

mbanck commented May 28, 2016

Hrm, I would've sworn it's refblas, but after checking it seems the Debian chemps-1.7 build pulled in ATLAS (base version).

It's still curious that this would be the only failing testcase if there's an issue with the linear algebra packages...
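
For anyone wanting to check which BLAS their own build pulls in, here is a minimal sketch that scans `ldd` output for well-known library name fragments. The helper names and the marker table are assumptions made for illustration, not part of Psi4 or Debian tooling, and a generic `libblas.so` entry may resolve to a different implementation at run time, so the result is only a hint.

```python
import subprocess

# Substrings that commonly appear in shared-library names or paths.
# This table is an illustrative assumption, not an exhaustive list.
BLAS_MARKERS = {
    "atlas": "ATLAS",
    "openblas": "OpenBLAS",
    "mkl": "Intel MKL",
}


def classify_blas(ldd_text):
    """Return the BLAS vendors whose markers appear in ldd-style output."""
    found = []
    for line in ldd_text.splitlines():
        lowered = line.lower()
        for marker, label in BLAS_MARKERS.items():
            if marker in lowered and label not in found:
                found.append(label)
    return found


def linked_blas(binary_path):
    """Run ldd on binary_path (Linux only) and classify the libraries it lists."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=True
    )
    return classify_blas(result.stdout)
```

Calling `linked_blas()` on the Psi4 executable or its core shared library (whichever one is dynamically linked in a given build) should return something like `["ATLAS"]` when ATLAS is in use; an empty list means none of the markers matched, not that no BLAS is linked.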

Contributor

bozkaya commented May 28, 2016

Most of the tests use well-behaved molecules such as H2O. However, cd-omp2-2 uses the NO molecule, which has a problematic electronic structure. I tried it both on my Mac and on a Linux cluster (CentOS 6.3); on Linux I used the Intel compiler and MKL. It is difficult to guess the source of the problem without debugging the code, and I cannot debug it because I cannot reproduce the error on my machines. Therefore, I suspect that it might be related to the BLAS library.

mbanck commented May 28, 2016

OK, I can confirm it only fails with ATLAS - works fine with refblas.

If I run it with mpirun -n 1, it does not crash, but the result is bad:

        Nuclear Repulsion Energy (a.u.)...................................PASSED
        CD-HF Energy (a.u.)...............................................PASSED
        CD-OMP2 Total Energy (a.u.): computed value (-129.5432868) does not match (-129.5897884) to 6 decimal places.

If I switch to mpirun -n 2, it crashes as indicated before, so that might be a hint.
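
To put the size of that mismatch in perspective, here is a minimal sketch of a "matches to N decimal places" check. It assumes the comparison is a plain absolute-difference test against 10^-N; the function name is hypothetical and this is not necessarily how the Psi4 test harness implements it.

```python
def matches_to_decimals(expected, computed, decimals):
    """Return True if the two values differ by less than 10**-decimals."""
    return abs(expected - computed) < 10.0 ** (-decimals)


# Values taken from the CD-OMP2 Total Energy line above.
expected = -129.5897884
computed = -129.5432868

difference = abs(expected - computed)
print(f"difference = {difference:.7f} a.u.")
print("agrees to 6 decimals:", matches_to_decimals(expected, computed, 6))
print("agrees to 1 decimal: ", matches_to_decimals(expected, computed, 1))
```

Under that assumption the two energies differ by roughly 0.0465 a.u. and only agree to one decimal place, so this is far beyond ordinary floating-point noise between BLAS implementations.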

MrAbsence commented Jun 4, 2016

I had the same issue when I tested cdomp2-2. Here's the list of all my failed tests:
76:cdomp2-2
123:dfmp2-grad2
125:dfmp2-grad4
127:dfomp2-2
129:dfomp2-4
131:dfomp2-grad2
303:stability2

I am using Ubuntu 14.04 with ATLAS for BLAS and LAPACK.

Update:
After switching to Intel MKL, the only failing test is 303:stability2. Since only a single value is slightly off, I ignored it and continued installing.

Output:

Nuclear repulsion energy..........................................PASSED
Reference energy..................................................PASSED
Stability eigenvalues with symmetry: computed value (0.094068) does not match (0.0940977919192).
Check your output file for reporting of the matrices.
An error has occurred python-side. Traceback (most recent call last):

File "<string>", line 58, in <module>

File "/home/mrabsence/Downloads/soft/psi4/share/python/p4util/util.py", line 209, in compare_matrices
raise TestComparisonError("\n")

TestComparisonError:

Exit Status: infile ( 1 ); autotest ( None ); sowreap ( None ); overall ( 1 )

Test time = 1.80 sec

Member

jgonthier commented Feb 5, 2017

I see the same issue with Psi4 installed on Ubuntu 16.04.1 LTS, compiled with gcc 5.4.0. It is also using ATLAS for BLAS/LAPACK, and my build is a debug build. The following test cases fail:
77 - cdomp2-2 (Failed)
125 - dfmp2-grad2 (Failed)
127 - dfmp2-grad4 (Failed)
129 - dfomp2-2 (Failed)
131 - dfomp2-4 (Failed)
133 - dfomp2-grad2 (Failed)

I'll update as I switch to another BLAS/LAPACK.
Other test cases fail but they might be related to different problems:
188 - mints9 (Failed)
268 - pywrap-checkrun-rhf (Failed)
269 - pywrap-checkrun-rohf (Failed)
270 - pywrap-checkrun-uhf (Failed)
314 - fsapt1 (Timeout)
321 - python-energy (Failed)
322 - python-curve (Failed)
323 - python-pubchem (Failed)
324 - json-energy (Failed)
325 - json-gradient (Failed)

Member

loriab commented Feb 5, 2017

mints9 is a known failure: it's the only test case failure remaining after KtB-INV, my fault. When only the python & json tests fail, it's probably because the python found by "which python" is a different version from the one psi4 was compiled with. All other test cases have the compilation python baked into the bin/psi4 shebang, but the python & json tests use the library directly and hence whatever "which python" returns.

More to the point, it's good to have this problem further confirmed with ATLAS. Absent changes to dfocc, and since OpenBLAS seems free and sound, perhaps we should just discourage ATLAS and promote OpenBLAS.

Member

jgonthier commented Feb 5, 2017

@loriab Okay, I will test with Intel MKL and OpenBLAS and let you know how this goes. I also confirm that which python is Python 2.7 whereas CMake found Python 3.5 and built with it.
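
A mismatch like this can be spotted by comparing the interpreter named on the launcher's shebang line with the interpreter that is actually running. The sketch below is only an illustration: the helper names are hypothetical, and the launcher path has to be supplied by the user.

```python
import sys


def shebang_interpreter(script_path):
    """Return the interpreter named on the script's '#!' line, or None."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


def running_interpreter():
    """Return (path, 'major.minor') for the Python executing this code."""
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return sys.executable, version


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: run this file with the python from "which python" and pass the
    # path to the installed launcher script as the first argument.
    print("launcher shebang :", shebang_interpreter(sys.argv[1]))
    print("this interpreter :", running_interpreter())
```

If the two lines point at different Python versions, the tests that import the library directly would be running under a different interpreter than the one used at build time, which matches the explanation given above.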

Member

jgonthier commented Feb 8, 2017

@loriab Ok, after some problems getting everything to work together, and setting which python to the correct Python, here is what I see (still Ubuntu 16.04.1 LTS, gcc 5.4.0, Python 3.5.2):

  • with ATLAS (v. 3.10.2), the following tests fail:
    77 - cdomp2-2 (Failed)
    125 - dfmp2-grad2 (Failed)
    127 - dfmp2-grad4 (Failed)
    129 - dfomp2-2 (Failed)
    131 - dfomp2-4 (Failed)
    188 - mints9 (Failed)
  • with Intel MKL (v. 2017.0.098), I have two Failed and one Timeout:
    188:mints9 (Failed)
    243:opt13 (Timeout)
    329:libefp-qmefp-moldomains (Failed)
  • with OpenBLAS (v. 0.2.18), I have one Failed and one Timeout:
    188:mints9 (Failed)
    243:opt13 (Timeout)

All were compiled with optimization (release version). Note that ATLAS does not time out on opt13, which seems to indicate it's a bit faster than the other two, but then it fails for the cdomp2/dfomp2 tests.

Member

dgasmith commented May 13, 2017

There is now a note in the docs suggesting OpenBLAS or the like over ATLAS because of the above issues. I'm not sure there is anything else for us to do here.

@dgasmith dgasmith closed this May 13, 2017
