
--recon-precondition/sloppy 8 failing for twisted clover with multigrid #934

Open · cpviolator opened this issue Dec 8, 2019 · 23 comments

@cpviolator (Member) commented Dec 8, 2019

While performing tests on Summit with the large L=64, T=128 twisted-clover lattice, I saw that the initial CG solve used to construct the null vectors was diverging when --recon-precondition 8 is passed. Here is a list of the combinations I have tested and their behaviour in the initial CG:

| recon | recon-sloppy | recon-precondition | converged |
| ----- | ------------ | ------------------ | --------- |
| 18    | 18           | 18                 | YES       |
| 18    | 18           | 12                 | YES       |
| 18    | 12           | 12                 | YES       |
| 12    | 12           | 12                 | YES       |
| 18    | 18           | 8                  | NO        |
| 12    | 12           | 8                  | NO        |
| 18    | 8            | 18                 | NO        |
| 12    | 8            | 12                 | NO        |

Is this expected? Here is the rest of the command:

export EXE="./multigrid_invert_test"

export ARGS="--recon 18 --recon-sloppy 12 --prec double --nsrc 16 --dslash-type twisted-clover --compute-clover true \
--niter 30000 --verify true --dim 64 16 16 16 --gridsize 1 4 4 8 --load-gauge ${CONFIG} --kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 \
--rank-order row --verbosity verbose --tol 1e-9"

export MG_ARGS_COMMON="--prec-sloppy single --prec-precondition half --prec-null half --recon-precondition 12 \
--mg-levels 3 --mg-block-size 0 4 4 4 4 --mg-block-size 1 2 2 2 2 --mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg \
--mg-nvec 0 24 --mg-nvec 1 24 --mg-coarse-solver 1 gcr --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --pipeline 16 --reliable-delta 1e-5 \
--ngcrkrylov 30"

export MG_ARGS="--mg-mu-factor 2 0.0 --mg-smoother 0 ca-gcr --mg-smoother 1 ca-gcr --mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 2 --mg-nu-post 1 2 \
--mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 10 --mg-coarse-solver-maxiter 1 8 --mg-coarse-solver-maxiter 2 8 \
--mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-tol 2 0.1 \
--mg-nvec 2 1024 --mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-nEv 2 1024 --mg-eig-nKr 2 1536 --mg-eig-tol 2 1e-4 --mg-eig-poly-deg 2 50 --mg-eig-amin 2 8e-1 \
--mg-eig-amax 2 8.0 --mg-eig-max-restarts 2 25 --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true"

export ARGS="${ARGS} ${MG_ARGS_COMMON} ${MG_ARGS}"

command="jsrun -n32 -r1 -g4 -a4 -c4 -l gpu-gpu ${EXE} ${ARGS}"
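A sweep over the combinations in the table above can be scripted along these lines (a minimal sketch, not the job script actually used; it reuses ${EXE}, ${CONFIG}, ${MG_ARGS_COMMON} and ${MG_ARGS} from above and assumes that a repeated flag on the command line overrides an earlier occurrence):

# Sweep over the (recon, recon-sloppy, recon-precondition) combinations tested above.
for combo in "18 18 18" "18 18 12" "18 12 12" "12 12 12" "18 18 8" "12 12 8" "18 8 18" "12 8 12"; do
  set -- ${combo}   # $1=recon, $2=recon-sloppy, $3=recon-precondition
  RUN_ARGS="--recon $1 --recon-sloppy $2 \
    --prec double --nsrc 16 --dslash-type twisted-clover --compute-clover true \
    --niter 30000 --verify true --dim 64 16 16 16 --gridsize 1 4 4 8 \
    --load-gauge ${CONFIG} --kappa 0.1394265 --mu 0.00072 --clover-coeff 0.235630785 \
    --rank-order row --verbosity verbose --tol 1e-9 \
    ${MG_ARGS_COMMON} ${MG_ARGS} --recon-precondition $3"
  # MG_ARGS_COMMON already contains --recon-precondition 12; the trailing flag is
  # assumed to take precedence (an assumption about the test's argument parsing).
  jsrun -n32 -r1 -g4 -a4 -c4 -l gpu-gpu ${EXE} ${RUN_ARGS} 2>&1 | tee recon${1}_sloppy${2}_precond${3}.log
done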
cpviolator changed the title from "--recon-precondition 8 failing for twisted clover with multigrid" to "--recon-precondition/sloppy 8 failing for twisted clover with multigrid" on Dec 8, 2019
@cpviolator (Member Author) commented:

Same story with the current release 1.0 branch.

@maddyscientist (Member) commented:

@cpviolator can you confirm whether this arises on small Wilson lattices as well, e.g., the 16^3x64 anisotropic lattices, or whether it is confined to twisted-clover only? You could also run twisted-clover on these smaller lattices to confirm whether the issue only arises with multi-GPU or also occurs on a single GPU.

@cpviolator (Member Author) commented:

@maddyscientist I used the Wilson 16^3x64 lattice and a small 32^3x64 twisted-clover lattice, both on a single node and with various MPI splittings, in both cases with recon-sloppy/precondition = 8. None of the tests failed to converge in the CG setup of the multigrid.

One thing looked fishy though. The computed plaquette for the Wilson was:

Computed plaquette is 7.301271e-02 (spatial = 1.312530e-02, temporal = 1.329001e-01)

which doesn't look right. Is that how you remember it!?

@maddyscientist (Member) commented:

The plaquette isn't computed correctly on anisotropic lattices. It's on my eternal to-do list... So yes, this is expected.
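For what it's worth, the number the test prints does seem to be the plain average of the spatial and temporal plaquette averages, consistent with the output quoted above; what is missing on anisotropic lattices is presumably the anisotropy normalisation of the temporal part. Schematically, using the quoted values:

\[
\langle P \rangle \;=\; \tfrac{1}{2}\bigl(\langle P_{ss}\rangle + \langle P_{st}\rangle\bigr)
\;\approx\; \tfrac{1}{2}\bigl(1.3125\times 10^{-2} + 1.329\times 10^{-1}\bigr)
\;\approx\; 7.30\times 10^{-2}.
\]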

@maddyscientist (Member) commented Dec 13, 2019

This is a weird bug that is being encountered, then. One thing to try is to increase the process count to 64 but run with these smaller volumes. This should be runnable on nvsocal with something like

export QUDA_ENABLE_MPS=1 # allow multiple processes per GPU
export QUDA_ENABLE_MANAGED_MEMORY=1 # might not be needed, but will prevent out of memory error
mpirun -np 64 multigrid_invert_test --gridsize 1 4 4 8

which will allow you to run multiple processes per GPU. This might help isolate whether the issue only occurs at higher process counts.
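If MPS is not already running on the node, the extra steps would look something like this (standard CUDA MPS commands; pipe/log directories and permissions are machine-dependent, and the full ${ARGS} would be appended as above):

# Start the CUDA MPS control daemon on the node so that many ranks can share a GPU.
nvidia-cuda-mps-control -d

export QUDA_ENABLE_MPS=1            # tell QUDA that ranks are sharing GPUs
export QUDA_ENABLE_MANAGED_MEMORY=1 # guard against out-of-memory with 64 ranks
mpirun -np 64 multigrid_invert_test --gridsize 1 4 4 8   # plus the rest of ${ARGS} as above

# Shut the daemon down again once the run is done.
echo quit | nvidia-cuda-mps-control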

@cpviolator (Member Author) commented:

I'll give it a whirl with the 32^3x64 twisted clover.

@cpviolator (Member Author) commented:

No joy with the above strategy either.

@maddyscientist (Member) commented:

Ok, I guess we should test this on a different machine. Do you have access to Piz Daint, or should I do this?

@cpviolator (Member Author) commented:

I don't have access to that anymore, sorry.

@cpviolator (Member Author) commented:

CMakeCache.txt from the release build on Summit (attached: CMakeCache.txt)

@kostrzewa (Member) commented:

I'm running this test right now, as it might be related to the issues that I see in testing for #941.

@kostrzewa (Member) commented Jan 22, 2020

@cpviolator note that in your example command above, you don't set a setup solver for the coarsest level. Doesn't that mean that BiCGstab would be used for that (which won't converge, as we know)? Sorry, I'm being a doofus.

@kostrzewa (Member) commented:

sorry, being a doofus in #934 (comment) ...

@kostrzewa (Member) commented Jan 22, 2020

I cannot confirm your observations on Piz Daint. For all test cases below I get similar convergence (lvl 0: <r,r> ~ 5e-5, lvl 1: <r,r> ~ 3e-4) over the 500 iterations that the CG runs for to get the null vectors. In particular, I do not observe divergence at all.

| recon | recon-sloppy | recon-precondition | converged |
| ----- | ------------ | ------------------ | --------- |
| 18    | 18           | 18                 | YES       |
| 18    | 18           | 12                 | YES       |
| 18    | 18           | 8                  | YES       |
| 18    | 8            | 18                 | YES       |
| 12    | 8            | 12                 | YES       |
| 12    | 12           | 8                  | YES       |

Test command (32 Piz Daint nodes):

[...]
export CRAY_CUDA_MPS=1

gdr=0
p2p=0
async=0
mempool=0

machine_id=PizDaint
quda_label=build_test-quda_develop-dynamic_clover-with_tests-with_qio
quda_commit=53e85c521f11d3a94166b951e14bf8640540ec24
gpu_arch=sm_60

export QUDA_RESOURCE_PATH=${HOME}/local/quda_resources/${machine_id}-${quda_label}-${quda_commit}-${gpu_arch}_gdr${gdr}_p2p${p2p}
if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
  mkdir -p ${QUDA_RESOURCE_PATH}
fi

scratch_dir=$SCRATCH/multigrid_invert_test_runs/64c128_32n
mkdir -p ${scratch_dir}/logs
cd ${scratch_dir}

recon=12
recon_sloppy=8
recon_precondition=12

meta="recon${recon}_recon-sloppy${recon_sloppy}_recon-precondition${recon_precondition}"

export ARGS="--recon ${recon} --recon-sloppy ${recon_sloppy} \
--prec double --nsrc 16 --dslash-type twisted-clover --compute-clover true \
--niter 30000 --verify true --dim 64 32 32 16 --gridsize 1 2 2 8 \
--load-gauge ${CONFIG} --kappa 0.1394265 \
--mu 0.00072 --clover-coeff 0.235630785 \
--rank-order row --verbosity verbose --tol 1e-9"

export MG_ARGS_COMMON="--prec-sloppy single --prec-precondition half --prec-null half \
--recon-precondition ${recon_precondition} \
--mg-levels 3 --mg-block-size 0 4 4 4 4 --mg-block-size 1 2 2 2 2 \
--mg-setup-tol 0 5e-7 --mg-setup-tol 1 5e-7 --mg-setup-inv 0 cg --mg-setup-inv 1 cg \
--mg-nvec 0 24 --mg-nvec 1 24 --mg-coarse-solver 1 gcr \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose \
--pipeline 16 --reliable-delta 1e-5 --ngcrkrylov 30"

export MG_ARGS="--mg-mu-factor 2 0.0 --mg-smoother 0 ca-gcr --mg-smoother 1 ca-gcr \
--mg-nu-pre 0 0 --mg-nu-post 0 4 --mg-nu-pre 1 2 --mg-nu-post 1 2 \
--mg-coarse-solver 2 ca-gcr --mg-coarse-solver-ca-basis-size 2 10 \
--mg-coarse-solver-maxiter 1 8 --mg-coarse-solver-maxiter 2 8 \
--mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-tol 2 0.1 \
--mg-nvec 2 1024 --mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-nEv 2 1024 --mg-eig-nKr 2 1536 \
--mg-eig-tol 2 1e-4 --mg-eig-poly-deg 2 50 --mg-eig-amin 2 8e-1 \
--mg-eig-amax 2 8.0 --mg-eig-max-restarts 2 25  --mg-eig-use-dagger 2 false  --mg-eig-use-normop 2 true"

ARGS="${ARGS} ${MG_ARGS_COMMON} ${MG_ARGS}"

logfile=${scratch_dir}/logs/muligrid_invert_test-64c128_32n-${quda_label}-${quda_commit}-${p2p}-gdr${gdr}-async${async}-mempool${mempool}-${SLURM_JOB_ID}_${meta}.out

GOMP_CPU_AFFINITY=0-23:2 \
QUDA_RESOURCE_PATH=${QUDA_RESOURCE_PATH} OMP_NUM_THREADS=12 \
QUDA_ENABLE_GDR=${gdr} QUDA_ENABLE_P2P=${p2p} QUDA_ENABLE_TUNING=1 \
QUDA_ENABLE_DEVICE_MEMORY_POOL=${mempool} MPICH_RDMA_ENABLED_CUDA=1 \
MPICH_NEMESIS_ASYNC_PROGRESS=${async} \
srun ${exe} ${ARGS} 2>&1 | tee ${logfile}

CMake call for the test build

Note the QUDA_DYNAMIC_CLOVER=ON setting.

module load daint-gpu
module swap PrgEnv-cray PrgEnv-gnu
module load CMake
module load Boost
module load cray-libsci
module load cudatoolkit
module load cray-hdf5
module load cray-mpich

CXX=CC \
CC=cc \
cmake \
-DCMAKE_INSTALL_PREFIX="${PROJECT}/libs/2020_01_16/$(basename $(pwd))" \
-DMPI_CXX_COMPILER=CC \
-DMPI_C_COMPILER=cc \
-DQUDA_BUILD_SHAREDLIB=OFF \
-DQUDA_MAX_MULTI_BLAS_N=9 \
-DQUDA_BUILD_ALL_TESTS=ON \
-DQUDA_GPU_ARCH=sm_60 \
-DQUDA_INTERFACE_QDP=ON \
-DQUDA_INTERFACE_MILC=OFF \
-DQUDA_MPI=OFF \
-DQUDA_QMP=ON \
-DQUDA_QIO=ON \
-DQUDA_DOWNLOAD_USQCD=ON \
-DQUDA_DIRAC_WILSON=ON \
-DQUDA_DIRAC_TWISTED_MASS=ON \
-DQUDA_DIRAC_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_MASS=ON \
-DQUDA_DIRAC_CLOVER=ON \
-DQUDA_DYNAMIC_CLOVER=ON \
-DQUDA_DIRAC_DOMAIN_WALL=OFF \
-DQUDA_MULTIGRID=ON \
-DQUDA_USE_EIGEN=ON \
-DQUDA_DOWNLOAD_EIGEN=ON \
-DQUDA_BLOCKSOLVER=ON \
-DQUDA_OPENMP=ON \
-DQUDA_TEX=OFF \
-DQUDA_GAUGE_ALG=ON \
-DQUDA_FORCE_GAUGE=ON \
-DQUDA_GAUGE_TOOLS=ON \
-DQUDA_DIRAC_STAGGERED=OFF ${HOME}/code/quda_develop

To compile the tests, I have to run cmake twice, as noted in #957.
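Concretely, the build sequence is something like the following (a sketch; the second configure pass is just the workaround for #957 and the job count is arbitrary):

# Configure once with the options above, then re-run cmake in the same build
# directory so that the test targets are generated (workaround for #957).
cmake <options as above> ${HOME}/code/quda_develop
cmake .
make -j 16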

PS: This was a nice test, because I was always worried that in the tmLQCD interface we were losing quite a bit by always working in recon = (18,18,18) mode. Yet the differences that I observe between the best and worst case (in terms of time to solution) are below the level of natural performance fluctuations on Piz Daint, although on average it seems that (12,12,8) performs best (at the level of maybe 5%?).

@weinbe2 (Contributor) commented Jan 22, 2020

@cpviolator just catching up a little and making absolutely sure: you're seeing divergence, not stalling, correct? I see stalling constantly for HISQ, and it's never an issue (in fact, in cases where I try to set the tolerance so it doesn't stall, the near-null vectors end up being garbage).

Divergence is, of course, a different issue.

@cpviolator (Member Author) commented Jan 22, 2020 via email

@weinbe2 (Contributor) commented Jan 22, 2020

@cpviolator yep, I agree that's divergence, point noted.

@kostrzewa (Member) commented:

Could this be related to using QUDA_TEX=ON?

@weinbe2 (Contributor) commented Jan 22, 2020

> Could this be related to using QUDA_TEX=ON?

No, with 99.99% confidence.

@weinbe2 (Contributor) commented Jan 22, 2020

It'd be related to the way the reconstruct is done, not the read itself (which is abstracted at a lower level). The local volume isn't large enough for there to be some sort of weird indexing issue.
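For context, "the way the reconstruct is done" refers to the compressed gauge-link storage: with --recon 12 only the first two rows of each SU(3) link are stored and the third row is rebuilt from unitarity, while --recon 8 stores a further-compressed 8-real-number parameterisation that is numerically more delicate (this is the textbook scheme, not a quote of QUDA's exact kernels). For the 12-parameter case the missing row is

\[
U_{3k} \;=\; \bigl(\epsilon_{kij}\,U_{1i}\,U_{2j}\bigr)^{*}, \qquad U \in SU(3),
\]

i.e. the complex conjugate of the cross product of the first two rows, which is exact for a unitary, unit-determinant link.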

@maddyscientist (Member) commented:

I realize that I never reported my progress on this issue. Confirming that I did reproduce the issue on Piz Daint, but only on 128 GPUs; running on lower node counts did not give the issue. The horrible queue time on Piz Daint has been a major pain for debugging this further (yet).

So this appears not to be a machine-specific issue.

@cpviolator (Member Author) commented:

Could this be related to an integer overflow?

@maddyscientist (Member) commented:

Unlikely, since larger node counts mean smaller local volumes. I can do a run through ASAN on Piz Daint to see if anything is revealed…
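For the record, a generic way to get an ASAN-instrumented build out of a CMake project like this (the stock compiler-flag route, not a QUDA-specific switch; the exact flags and ASAN runtime option are toolchain assumptions):

# Configure an AddressSanitizer build (GCC/Clang host compiler).
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
      -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address" \
      <other options as above> ${HOME}/code/quda_develop

# At run time, ASAN usually needs this to coexist with the CUDA driver's memory mappings.
export ASAN_OPTIONS=protect_shadow_gap=0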
