error with SVD #106

Open
2 of 6 tasks
mgates3 opened this issue Sep 1, 2023 · 0 comments

mgates3 (Collaborator) commented Sep 1, 2023

Description
On leconte, the SVD has a bad backward error with 8 ranks / 8 GPUs, for both target host and device. The expected backward error is ~1e-15. Runs with 1, 2, or 4 ranks worked.
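For reference, here is a minimal sketch of the kind of backward-error check that is failing, assuming the usual || A - U*diag(S)*V^T || / (||A|| * n) residual and analogous orthogonality residuals for U and V. This is illustrative only, not SLATE's tester code; the function name and tolerance scale are assumptions.

import numpy as np

def svd_backward_error(A):
    # "some" singular vectors, analogous to --jobu s --jobvt s
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    n = A.shape[1]
    resid  = np.linalg.norm(A - (U * S) @ Vt) / (np.linalg.norm(A) * n)
    orth_U = np.linalg.norm(np.eye(U.shape[1]) - U.T @ U) / U.shape[1]
    orth_V = np.linalg.norm(np.eye(Vt.shape[0]) - Vt @ Vt.T) / Vt.shape[0]
    return resid, orth_U, orth_V

if __name__ == "__main__":
    A = np.random.default_rng(0).standard_normal((1234, 1234))
    resid, orth_U, orth_V = svd_backward_error(A)
    # A healthy run keeps all three near machine epsilon (~1e-15 in double);
    # the failing runs below report backward error around 1e-3 instead.
    print(f"backward {resid:.2e}  U orth {orth_U:.2e}  V orth {orth_V:.2e}")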

Steps To Reproduce

mpirun -np 8 ./tester --origin h --target h --jobu s --jobvt s --dim 1234 --dim 1k,2k,4k,8k,16k --ref n --nb 128,192,256,320 svd
% SLATE version 2023.08.25, id 57ea922b
% 2023-09-01 11:41:53, 8 MPI ranks, CPU-only MPI, 7 OpenMP threads per MPI rank
type  origin  target   A       jobu      jobvt       m       n    nb  ib    p    q  la  pt   S - Sref   Backward    U orth.    V orth.   time (s)  ref time (s)  status  
   d    host    host   1       some       some    1234    1234   128  32    2    4   1   3         NA   1.91e-03   1.80e-16   1.90e-16      1.659            NA  FAILED  
   d    host    host   1       some       some    1234    1234   192  32    2    4   1   3         NA   1.81e-03   1.85e-16   1.94e-16      1.974            NA  FAILED  
...

mpirun -np 8 ./bind_gpus.sh ./tester --origin d --target d --jobu s --jobvt s --dim 1234 --dim 1k,2k,4k,8k,16k --ref n --nb 128,192,256,320 svd
% SLATE version 2023.08.25, id 57ea922b
% 2023-09-01 07:30:57, 8 MPI ranks, CPU-only MPI, 7 OpenMP threads, 1 GPU devices per MPI rank
type  origin  target   A       jobu      jobvt       m       n    nb  ib    p    q  la  pt   S - Sref   Backward    U orth.    V orth.   time (s)  ref time (s)  status  
   d     dev     dev   1       some       some    1234    1234   128  32    2    4   1   3         NA   1.87e-03   1.89e-16   1.98e-16      1.854            NA  FAILED  
   d     dev     dev   1       some       some    1234    1234   192  32    2    4   1   3         NA   1.82e-03   1.79e-16   1.94e-16      1.992            NA  FAILED  
...

Environment
The more information that you can provide about your environment, the simpler it is for us to understand and reproduce the issue.

  • SLATE version / commit ID (e.g., git log --oneline -n 1):
  • 57ea922 (HEAD -> release, tag: v2023.08.25, github/master) Version 2023.08.25
  • How installed:
    • git clone
    • release tar file
    • Spack
    • module
  • How compiled:
    • makefile (include your make.inc)
sh leconte test> cat ../make.inc
CXXFLAGS = -Werror -Dslate_omp_default_none='default(none)'
CXX      = mpicxx
CC       = mpicc
FC       = mpif90
blas     = mkl
mkl_blacs = intelmpi
blas_threaded = 1
  • CMake (include your command line options)
  • Compiler & version (e.g., mpicxx --version):
  • BLAS library (e.g., MKL, ESSL, OpenBLAS) & version:
  • CUDA / ROCm / oneMKL version (e.g., nvcc --version):
  • MPI library & version (MPICH, Open MPI, Intel MPI, IBM Spectrum, Cray MPI, etc. Sometimes mpicxx -v gives info.):
  • OS:
  • Hardware (CPUs, GPUs, nodes): leconte, DGX, 8x V100
sh leconte test> module -t list
Currently Loaded Modulefiles:
gdbm/1.23/gcc-11.3.1-fhrtav
ncurses/6.4/gcc-11.3.1-vbhesx
sqlite/3.42.0/gcc-11.3.1-5ijskr
python/3.10.10/gcc-11.3.1-adtfss
perl/5.36.0/gcc-11.3.1-lhqic5
git/2.40.0/gcc-11.3.1-zem5da
cmake/3.26.3/gcc-11.3.1-6wjjvi
htop/3.2.2/gcc-11.3.1-rtm7gj
env-basic
gcc/11.4.0/gcc-11.3.1-rony5z
intel-oneapi-mkl/2023.1.0/gcc-11.3.1-qq2eto
intel-oneapi-mpi/2021.9.0/gcc-11.3.1-h77w4m
cuda/12.1.1/gcc-11.3.1-h7ttz4