flaky functionals #1009

Closed
loriab opened this Issue May 6, 2018 · 50 comments

@loriab
Member

loriab commented May 6, 2018

So Matt saw some functionals going wrong in the big dft-bench-ionization test (Intel compilers and presumably MKL).

Now I'm seeing it, too, when linking against OpenBLAS instead of MKL for a lot of functionals (see below).
PWB6K: Psi4 vs. Q-Chem: computed value (0.52376) does not match (0.45357) to 4 digits.

Anyone who builds Psi against non-MKL, please report the results of this test.

#BAD    'PWB6K':    0.45356644150000136,        #TEST
#BAD  'wB97X-D':     0.4575912357999954,        #TEST
#BAD 'LRC-wPBE':    0.45809929579999675,        #TEST
#BAD   'BHHLYP':     0.4474902386999986,        #TEST
#BAD    'MPW1K':     0.4527968481999949,        #TEST
#BAD'LRC-wPBEh':     0.4549011450999956,        #TEST
#BAD     'wB97':     0.4561211940999925,        #TEST
#BAD      'M11':    0.45965997109999535,        #TEST
#BAD   'M08-HX':     0.4616204211000081,        #TEST
#BAD    'wB97X':    0.45647112829999514,        #TEST
#BAD 'B5050LYP':    0.45086574249999956,        #TEST
#BAD   'M05-2X':     0.4583363492999979,        #TEST
#BAD     'BB1K':     0.4523318653999979,        #TEST
#BAD   'M06-2X':    0.45840746970000623,        #TEST
#BAD     'dlDF':    0.46205070889999433,        #TEST
#BAD   'M08-SO':     0.4656227382000111,        #TEST
#BAD   'M06-HF':     0.4582368217999999,        #TEST
#BAD   'MPWB1K':     0.4525753563999899,        #TEST
#BAD    'B97-K':     0.4498949295999921,        #TEST
#BAD'SOGGA11-X':     0.4601258852000001,        #TEST
#BAD    'PBE50':    0.45096528610000064,        #TEST
#BAD'CAM-B3LYP':     0.4568003603999955,        #TEST
#BAD  'LC-VV10':    0.45568725450000613,        #TEST
#BAD  'wB97M-V':     0.4544676075000069,        #TEST
#BAD  'wB97X-V':     0.4553026020000033,        #TEST

@loriab loriab added this to the Psi4 1.2 milestone May 6, 2018

@dgasmith


Member

dgasmith commented May 7, 2018

Disturbing, both compiled with ICC? If so, that leaves the diagonalizer as the likely culprit. We have had issues with this before and have gone so far as to say that we do not recommend OpenBLAS, IIRC.

@loriab


Member

loriab commented May 7, 2018

Yes. Except for dropping CheMPS2 & libefp, same compile conditions (icpc), just using libopenblas.so in place of libmkl_rt.so. I need to look a little further, but I'm surprised that everything else in the full tests (incl. dft-bench-interaction) passed, while dft-bench-ionization was off in the first sig fig.

@susilehtola


Contributor

susilehtola commented May 7, 2018

I bet you're using a broken version of OpenBLAS. Funny things are prone to happen if you call a non-OpenMP version of the library within OpenMP parallel code.

This is why I don't like conda. There's no quality in the packages. Even though I filed a bug months ago, the OpenBLAS package still hasn't been compiled with runtime cpu detection.

@loriab


Member

loriab commented May 7, 2018

I usually find their packages to be of excellent quality. Which one is your bug report? I remember a lightning talk, I think, at a recent SciPy about efforts to bring OpenBLAS up to snuff. Apparently even all their internal tests (or maybe it was numpy's tests) weren't passing for a long time.

I agree that runtime processor detection is a build dimension to which conda has not expanded. That's why psi uses Intel compilers to add some multiarch optimizations. But why should cpu detection affect the numerical results?

@loriab


Member

loriab commented May 7, 2018

Here's the compile conditions. It looks like OpenMP is enabled. Is something missing that you see?

Also, I was running the tests single-thread to avoid trouble.

@susilehtola


Contributor

susilehtola commented May 7, 2018

CFLAGS="${CF}" FFLAGS="-frecursive"
LOL. The Fortran code is compiled without any optimizations? 😆

And no, OpenMP is not enabled. USE_THREAD=1 is the pthreads version. The OpenMP version would have USE_THREAD=1 USE_OPENMP=1

@loriab


Member

loriab commented May 7, 2018

Optimization flags come from another source, but yes, I see what you mean that they look to be clobbered.

I can rebuild openblas locally to see if that helps. But I'd be surprised if there were fundamental accuracy errors in the openblas package that weren't noticed by the whole conda-forge-dependent community (where openblas is their default blas_impl) or by conda defaults (which still has nomkl options).

@Diazonium


Diazonium commented May 7, 2018

OpenBLAS has seen quite a few bugfixes in the last couple of months, especially thread-safety related. I would recommend pulling the latest trunk version from GitHub and compiling it with GCC, with USE_THREAD=1 USE_OPENMP=1 and DYNAMIC_ARCH=1 in Makefile.rule. DYNAMIC_ARCH is optional; it enables runtime CPU arch detection.
This could rule out both bugs in older versions of OpenBLAS and the chance that the Intel compilers are messing up OpenBLAS.
IIRC OpenBLAS is rarely used/tested with Intel compilers, since most people who have Intel compilers end up using MKL instead.

PS: The comment in Makefile.rule of OpenBLAS even includes a vague remark about Intel and other non-GNU compilers being non-recommended.

@loriab


Member

loriab commented May 7, 2018

Thanks, @Diazonium, I'll give it a try.

Just to note, the openblas I was using would have been built with GCC 7.2.0, then I was using Intel compilers (atop that same GCC) to build psi and link in openblas. Do your Intel concerns still hold in that scenario?

@Diazonium


Diazonium commented May 7, 2018

@loriab I have no idea if using Intel compilers for the rest of the build would cause any issues. Theoretically it should not, since all the code is already generated, but who knows? (I think Intel has their own OpenMP implementation, which might throw things off-kilter. Also, different compilers might handle the state of some floating-point mode control registers in the CPU differently, things like rounding modes and denormal handling, although those should not affect any algorithm unless it is inherently numerically unstable.)
If problems persist I would also suggest trying a pure GNU build, especially since most of the people who would end up using openblas would probably use GNU compilers for everything.

@hokru


Contributor

hokru commented May 17, 2018

PSI4+addons with gcc 6.3 linked to OpenBLAS r0.2.19.dev (gcc 6.3, both OpenMP and threaded version) fully passes dft-bench-ionization on my workstation.

@dgasmith


Member

dgasmith commented May 17, 2018

@hokru Thanks for checking. I guess we just need to mention that OpenBLAS >= 0.2 is highly suggested, or trickier SCF cases (or anything eigenvalue-related) may return questionable results.

@hokru


Contributor

hokru commented May 18, 2018

First off, I think the OpenBLAS project is amazing. However, I also had trouble with some older versions and had to stop using it intermittently. I would recommend 0.2.15 and higher.

@loriab


Member

loriab commented May 18, 2018

Conda pkgs are 0.2.20, so safe in that respect. Either I need to pay closer attention to OMP flags in my Psi4 compilation against conda openblas, or the missing threading flag in the conda openblas recipe is indeed fatal to psi.

Your dft-bench test was definitely run with -n, @hokru? Ctest runs individual tests single threaded.

@hokru


Contributor

hokru commented May 18, 2018

Your dft-bench test was definitely run with -n, @hokru? Ctest runs individual tests single threaded.

Yes, I used 4 threads.

@Diazonium


Diazonium commented May 18, 2018

@loriab If the threading flag you mentioned refers to USE_OPENMP=1, then I am reasonably sure that it is essential. As far as I understand, OpenBLAS is not stateless/thread-safe when compiled to use its native threading. So if there is an OMP parallel section in Psi4, and multiple Psi4 threads call BLAS/LAPACK subroutines at the same time, AND OpenBLAS is also doing those operations in parallel, then you get undefined behaviour. This is probably not going to be reliably mitigated by setting OpenBLAS to use only 1 thread at runtime; it may still cause UB.
AFAIK, currently the only safe way to call OpenBLAS BLAS/LAPACK routines from OMP parallel regions is to use an OpenBLAS build compiled with USE_OPENMP=1. This disables the native/pthreads parallelism and uses OMP instead; this way you get thread safety, and nested parallelism should also work.
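To make the call pattern under discussion concrete, here is a minimal, hypothetical C++ sketch (not Psi4's actual code) of BLAS being called both outside and inside an OpenMP parallel region. With a pthreads-threaded OpenBLAS, the calls issued from inside the parallel region are the ones that can trigger the warning and the undefined behaviour described above; an OpenBLAS built with USE_OPENMP=1 is meant to cope with this nesting.

```cpp
// Illustrative sketch only, assuming an OpenBLAS-provided cblas.h.
// Build along the lines of: g++ -fopenmp demo.cpp -lopenblas
#include <cblas.h>
#include <vector>

int main() {
    const int n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // Outside any parallel region: the BLAS library may use its own threads.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

    // Inside a parallel region: each application thread issues its own BLAS call.
    // This is the nesting that a pthreads OpenBLAS handles badly and that an
    // OpenMP-built OpenBLAS is intended to tolerate.
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i) {
        std::vector<double> Ct(n * n, 0.0);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A.data(), n, B.data(), n, 0.0, Ct.data(), n);
    }
    return 0;
}
```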

@dgasmith


Member

dgasmith commented May 18, 2018

That makes a lot of sense and explains why we are getting issues. We call BLAS both inside an OMP parallel region and outside of it, relying on the library to thread the non-parallel-region calls. Kinda surprised a functional library has state, but I could have drunk too much serverless Kool-Aid.

If I understand correctly, OpenBLAS on conda-forge uses native OpenBLAS threading:
https://github.com/conda-forge/openblas-feedstock/blob/master/recipe/build.sh#L19

@hokru


Contributor

hokru commented May 18, 2018

AFAIK, currently the only safe way to call OpenBLAS BLAS/LAPACK routines from OMP parallel regions, is to use an OpenBLAS build compiled with USE_OPENMP=1.

This is correct.

OpenBLAS used to spam stderr (I think) with a warning every time you called a pthreaded BLAS inside an OpenMP region. I was expecting to see it for my test with the "wrong" openblas library, but maybe psi4 redirects it. Or I accidentally used USE_OPENMP=1 for both tests...

@dgasmith


Member

dgasmith commented May 18, 2018

Can USE_OPENMP be set dynamically? We could consider wrapping all C++ calls in a function that sets the environment. Could also be useful for fiddling with thread affinities as per Matt's suggestion. Sounds like a really horrible idea though...

@Diazonium


Diazonium commented May 18, 2018

@dgasmith No, it is a build time flag for OpenBLAS. So Psi4 must use an OpenBLAS library that has been built with that flag.

@dgasmith


Member

dgasmith commented May 18, 2018

@Diazonium Hmm, that's what I was getting at. Can we query the library for this option?

@Diazonium


Diazonium commented May 18, 2018

@dgasmith I do not know, but probably not. If that is the case, patches are welcome; the project tends to be very receptive to PRs.
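For what it's worth, and from memory (so check against the OpenBLAS headers rather than taking this as gospel), OpenBLAS does ship a few non-standard runtime helpers in its cblas.h. A hedged sketch of how a program could probe the threading model of the libopenblas it linked against:

```cpp
// Hedged sketch: assumes the openblas_get_parallel() / openblas_set_num_threads()
// extensions declared in OpenBLAS's cblas.h (not part of the reference CBLAS).
#include <cblas.h>
#include <cstdio>

int main() {
    // Reported threading model: 0 = sequential, 1 = pthreads, 2 = OpenMP.
    int model = openblas_get_parallel();
    std::printf("OpenBLAS threading model: %d\n", model);

    // The thread count (though not the threading model itself) can also be
    // adjusted at runtime.
    openblas_set_num_threads(1);
    return 0;
}
```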

@hokru


Contributor

hokru commented May 18, 2018

@loriab


Member

loriab commented May 18, 2018

Does the omp flag to compile openblas squash pthread threading or can pthread and omp coexist in an openblas?

If the latter, we can PR the openblas recipe. If the former, we can't interrupt the whole ecosystem of packages that rely on pthread openblas in conda. I thought about building our own openblas, but then nomkl numpy would be using a different lib, which would be fatal.

@hokru


Contributor

hokru commented May 18, 2018

xianyi/OpenBLAS#1529 @dgasmith

From my experience, when openblas is built with OMP, it needs an OpenMP flag from the compiler when linking the program that uses it.

@loriab


Member

loriab commented May 18, 2018

Sorry, didn't mean to question the need for -fopenmp to compile the openblas-using program. Just concerned about this quote.

“AFAIK, currently the only safe way to call OpenBLAS BLAS/LAPACK routines from OMP parallel regions, is to use an OpenBLAS build compiled with USE_OPENMP=1. This disables the native/pthreads parallelism, and uses OMP instead, this way you get thread safety and nested parallelism should also work.“

If that’s true and if other programs rely on openblas native threading, then the conda openblas can never be used for psi even after PR and rebuild.

Whereas if USE_OPENMP=1 only adds capabilities, there’s a plan forward.

@hokru


Contributor

hokru commented May 18, 2018

I think it is an exclusive switch, either openmp or native threading. Parts of the source code suggest so, but I am not 100% sure.

@Diazonium


Diazonium commented May 18, 2018

@loriab I am pretty sure that switching OpenBLAS to OMP threading does squash the built in pthreads mechanism, BUT that may be acceptable. I think the OMP version can do everything the pthreads version can, but it is slower. So AFAIK, there is a performance impact, but not a functionality impact. So it probably would not break other projects, but some people might get pissed due to the performance hit.
My suggestion would be to create a separate OpenBLAS_OMP conda package, and just make a new numpy package as well, compiled with OpenBLAS_OMP. I am not familiar with how conda works, so that might not be feasible, but that would probably be the cleanest and most consistent solution.

@martin-frbg any comments regarding this situation?

@martin-frbg


martin-frbg commented May 19, 2018

Diazonium, I believe your assessment is correct and I have little to add here unfortunately. While thread safety in the pthread code has been improved recently, OpenMP is still considered the safer option. (Though you could try a pthread build - ideally of the current "develop" branch that should soon become the 0.3.0 release - with USE_SIMPLE_THREADED_LEVEL3=1 which should work around at least some of the remaining bugs.)
Please create issues for any OpenBLAS bugs you find, ideally with some standalone code that makes it easy to reproduce and track down the problem. I am anything but an expert on multithreading, but at least there are powerful debugging tools available now that simply did not exist when K. Goto wrote the library that OpenBLAS builds upon.
(Incidentally my own involvement with OpenBLAS came about through dft as well, though in my case it is condensed matter codes like Elk and Wien2k)

@loriab


Member

loriab commented May 22, 2018

  • Psi4 v1.2rc3.dev1 compiled with pure GCC 7.2.0 w/ libgomp for threading.
  • OpenBLAS v0.2.20-453-gf5959f2 compiled with pure GCC 5.2 w/ libgomp for threading (except for the CONDA row, which is conda 0.2.20, 9ac9557).
  • No Intel compilers or libiomp5 available or in ldd -v. NumPy is still the conda nomkl NumPy, but its links to openblas are unresolved, so tests are either not hitting that submodule or it's using the below-described libopenblas loaded by psi.
| openblas compilation | psiapi speedup -n4 | psithon speedup -n4 | ion pass -n1 | ion pass -n4 |
| --- | --- | --- | --- | --- |
| USE_THREAD=0 USE_OPENMP=0 | 1.00 | 1.54 | yes | yes |
| USE_THREAD=1 USE_OPENMP=0 | 0.90 | warn + 0.85 | no | no |
| USE_THREAD=1 USE_OPENMP=0 USE_SIMPLE_THREADED_LEVEL3=1 | 1.20 | warn + 1.40 | no | no |
| USE_THREAD=1 USE_OPENMP=1 | 1.04 | 2.27 | yes | no |
| CONDA USE_THREAD=1 USE_OPENMP=0 | ? | warn + ? | no | no |
| EDIT: USE_THREAD=1 USE_OPENMP=1 NUM_PARALLEL=4 | 1.00 | 1.77 | yes | no |
  • "warn": oodles of "OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option."

  • full OpenBLAS compile command for row 4

make CC=gcc FC=gfortran DYNAMIC_ARCH=1 BINARY=64 NO_LAPACK=0 NO_AFFINITY=1 USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=128 CFLAGS="-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -O2 -pipe" FFLAGS="-fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -O2 -pipe"

Unless someone sees a problem with my build configs, it's not looking too encouraging, except for the wholly unthreaded build. It's also troubling that this contradicts @hokru's findings of good behavior for USE_OPENMP=1 from a source build.

@hokru


Contributor

hokru commented May 22, 2018

My compilations pass the ion-test, but I too get no psiapi speedup.

But after using another of my openblas binaries for a bit, I am not exactly happy. The FNO-DF-CCSD iterations do not seem to thread correctly, or thread very poorly (checked visually with top). The conda 1.2rc2 binary I just tried is sooo much faster and puts good load on all 16 cores.

At this point I don't feel like pursuing an OpenBLAS solution anymore. Maybe if our lab buys AMD servers...

@Diazonium


Diazonium commented May 22, 2018

I will be investigating this closely in the foreseeable future. I think the threading issue is the same issue I have mentioned in the forum.

@Diazonium


Diazonium commented May 22, 2018

By the way, MKL/Intel compiler performance is actually reasonably good on recent AMD Ryzen/Epyc CPUs; the only thing that may need to be done is to patch the binaries with this tool. What it does is look for CPU-detection checks and disable the "cripple_AMD()" paths that MKL and the Intel compilers often tend to insert. When this is done, AMD CPUs get to run the same optimized/vectorized code paths that an Intel CPU would run, instead of the unoptimized/less-optimized path intended for non-Intel CPUs. This is just fine, since Ryzen CPUs are actually very happy to run code optimized for Haswell; in fact, IIRC, for a long time the best --march= flag to use with GCC on Ryzen has ironically been --march=haswell.
So MKL and Intel-compiled binaries are just fine on Ryzen systems; just make sure to patch them.

PS: we are in the process of acquiring some Ryzen systems, so in a couple months I will be able to provide actual test results

@loriab


Member

loriab commented May 22, 2018

Parts are arriving for an AMD server in our lab, so this may get more testing. But for now, OpenBLAS is going to go the way of Accelerate (on Mac) and just get a nice Use At Your Own Risk warning. I daresay Psi could use openblas directives better, but the stack of difficulties (wrong functionals, scaling, numpy compatibility, conda compatibility) is too high when there's a free, compatible, and working alternative in place. Glad to revisit periodically.

P.S. There might be a further technical reason why the conda openblas package isn't building the Fortran/LAPACK code with optimizations (#1009 (comment)), but an immediate technical reason is that conda gfortran 7.2.0 isn't distributing omp_lib.[h|mod], so threading isn't available.

@Diazonium


Diazonium commented May 22, 2018

@loriab One more note about AMD Ryzen performance: AVX is fully supported (except AVX-512), but internally the floating-point units are only 128 bits wide, so any 256-bit-wide AVX/AVX2 instruction is split in half to be executed. This means that for current Ryzen/EPYC chips, using AVX is unlikely to offer the significant performance benefit seen on Intel chips, unless it can alleviate some specific bottleneck. But AFAIK there is no separate "AVX mode" like on Intel, and running AVX code (i.e. compiled for Haswell) is often harmless and does not cause a performance degradation.
Really the only major weak points of Zen CPUs are memory latency and the communication latency between CCXes (which acts kind of like NUMA-on-a-chip). EPYC CPUs especially are NUMA-like; you should even be able to toggle the NUMA mode in the EFI/BIOS between full-on NUMA and try-to-act-like-UMA. In terms of core-to-core synchronization latency, a single-socket EPYC system acts more like a quad- or octa-socket one, depending on how you look at it.
EPYC is great if you are looking into using multiple GPUs or NVMe drives; it has an absolutely massive number of PCI-E lanes.
Hope this helps!

@martin-frbg


martin-frbg commented May 22, 2018

If psi4 calls OpenBLAS from several concurrent threads, even with OpenMP you may be running into a problem that was (hopefully) fixed only two weeks ago on the "develop" branch. (xianyi/OpenBLAS#1536).

@loriab


Member

loriab commented May 22, 2018

@martin-frbg, thanks, just rechecked that I was using head of the develop branch, so 1536 is in.

@martin-frbg


martin-frbg commented May 22, 2018

Thanks. Note though that you need to define NUM_PARALLEL to the number of expected callers (psi4 threads) when building; it defaults to the old behaviour.

@loriab


Member

loriab commented May 22, 2018

@martin-frbg, built again with NUM_PARALLEL=4 and added results to table. Regrettably, no change, still failing at -n2 and -n4. Thanks for the suggestion.

@Diazonium, thanks for the Ryzen info. I passed it along to @CDSherrill, and we'll keep it in mind when testing the new box.

@dzyong


dzyong commented May 26, 2018

@loriab xianyi/OpenBLAS#1536 will work accurately only if your compiler supports C11; otherwise it will fall back to the former behavior.
Ubuntu 14.04 doesn't support C11, but Ubuntu 16.04 does.
What's your compiler's version? If it is too old, please upgrade it.

@Diazonium


Diazonium commented May 26, 2018

@dzyong That should not be the issue, since Psi4 itself requires extensive C++11 support, so ancient compilers would not work at all.

@loriab


Member

loriab commented May 26, 2018

Yes, I was using Intel 2018 and GCC 5.2 and 7.2, so those are all fully C++11 compliant. If you'd like to prevent OpenBLAS from building with unsatisfactory compilers, you're welcome to adapt https://github.com/psi4/psi4/blob/master/cmake/custom_cxxstandard.cmake .

@dzyong


dzyong commented May 26, 2018

Does your OpenBLAS code include that pull request?

This issue is opened on May 6, and that pull request is merged into OpenBLAS on May 11.

@dzyong


dzyong commented May 26, 2018

OpenBLAS v0.2.20 doesn't include that pull request.
Please try v0.3.0.

@loriab


Member

loriab commented May 26, 2018

Yes, details in the 2nd bullet of #1009 (comment), but I was building the develop branch c. early this week.

@loriab loriab referenced this issue Jun 4, 2018

Merged

BLAS+OpenMP revamp & threading testing #1031

8 of 8 tasks complete
@PeterKraus


Contributor

PeterKraus commented Jun 6, 2018

To add to Lori's table from here, I've compiled the 1.2-rc2 tag against various blas/lapack combinations, and ran the dft-bench-ionisation test (wB97X-D, revTPSS, PW6B95, TPSS, PWB6K only):

| build | time N=1 | time N=4 | speedup N=1 | speedup N=4 |
| --- | --- | --- | --- | --- |
| Psi4/1.2-rc2-acml-5.3.1 | FAIL |  |  |  |
| Psi4/1.2-rc2-lapack-3.8.0 | 395.944 | 299.438 | 1.00 | 1.32 |
| Psi4/1.2-rc2 (1.2rc2.dev35) | 268.523 | 223.574 | 1.47 | 1.77 |
| Psi4/1.2-rc2-blis-0.3.2 | 356.723 | 262.964 | 1.11 | 1.51 |
| Psi4/1.2-rc2-openblas-0.2.20 | FAIL |  |  |  |
| Psi4/1.2-rc2-openblas-0.3.0 | FAIL |  |  |  |

All compiled with gcc-7.1.0, cmake-3.8.2, dftd3-3.2-r0:

cmake -H. -BBUILDS/$1 -DCMAKE_INSTALL_PREFIX=/opt/packages/psi4/$1 -DLAPACK_INCLUDE_DIRS=/path/to/lapack -DMAX_AM_ERI=5

ACML-5.3.1 is a binary.

Netlib's BLAS and LAPACK (3.8.0) compiled with -O3 -march=barcelona -fPIC and -O3 -frecursive -march=barcelona -fPIC respectively, including deprecated functions. I had to add -DENABLE_dkh=ON to the psi4 build options, otherwise the Fortran compiler wouldn't get picked up.

The 1.2rc2.dev35 is a binary that was available in conda, compiled against mkl 2018.0.2, intel-openmp 2018.0.0.

Flame's blis-0.3.2 was compiled into netlib's lapack-3.8.0, with ./configure auto; lapack compiled as above.

OpenBLAS-0.2.20 was compiled with USE_THREAD=0 USE_OPENMP=0 TARGET=BARCELONA. OpenBLAS-0.3.0 was compiled in each of the following ways:
make TARGET=BARCELONA BINARY=64 USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=32
make TARGET=BARCELONA BINARY=64 USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=32 CFLAGS="-fPIC -fopenmp -fstack-protector-strong" FFLAGS="-fPIC -fopenmp -fstack-protector-strong"
make CC=gcc FC=gfortran DYNAMIC_ARCH=1 BINARY=64 NO_LAPACK=0 NO_AFFINITY=1 USE_THREAD=0 USE_OPENMP=0 NUM_THREADS=32 CFLAGS="-fopenmp -march=barcelona -ftree-vectorize -fPIC -fstack-protector-strong -O2 -pipe" FFLAGS="-fopenmp -march=barcelona -ftree-vectorize -fPIC -fstack-protector-strong -O2 -pipe"
Each was linked using both the .so and the .a; it doesn't pass at all, even with psi4 -n 1.

The node I used is a 2 x Quad-Core AMD Opteron(tm) Processor 2352, running up-to-date Debian 8.

@loriab


Member

loriab commented Jun 6, 2018

Thank you for the additional BLAS distro tests and timing, @PeterKraus. I'm glad to see there are a couple more options.

Continuing the Slack discussion, I agree Psi + threaded BLAS is something to keep an eye on. Points of concern:

  • Matt used to see certain functionals failing the ionization test, presumably with MKL (though I didn't explicitly check ldd). I couldn't reproduce this, so it might be a blip or it might be associated with various KMP_AFFINITY settings.
  • @hokru, @PeterKraus, and I each had different OpenBLAS conclusions after trying several variations.

In past Psi, there was a case of nested threading where MKL knew how to handle itself and others didn't. Perhaps there's a new case of that now (though why should single-thread fail?). I wouldn't be surprised if the final determination of guilt was Psi's, OpenBLAS's, or even both, but Psi is working with enough BLAS distros that I think it's reasonable to keep the situation under observation and collect any new sightings (esp. of flakiness in the presence of MKL).

@dgasmith


Member

dgasmith commented Jun 6, 2018

@loriab


Member

loriab commented Jun 6, 2018

@dgasmith, rolling back to 1.1 is fairly involved. Can commenting out the three pragma omps in superfunctional.cc test the same effect?

@dgasmith

This comment has been minimized.

Member

dgasmith commented Jun 6, 2018
