problem with staggered_dslash_test #200

Closed
stevengottlieb opened this issue Dec 16, 2014 · 11 comments
@stevengottlieb
Member

Rich has requested a weak scaling study, so I am trying staggered_dslash_test. I am having trouble running multi-GPU jobs. I compiled QUDA for multi-GPU. Here is how I launch a job on a Cray:
aprun -n 4 -N 1 /N/u/sg/BigRed2/quda-0.7.0/tests/staggered_dslash_test --prec double --xdim $nx --ydim $ny --zdim $nz --tdim $nt --xgridsize 1 --ygridsize 1 --zgridsize 2 --tgridsize 2 >>out_4procs.$ns

Here is what shows up in output:
Tue Dec 16 09:11:07 EST 2014
-rwxr-xr-x 1 sg phys 192887697 Dec 4 02:25 /N/u/sg/BigRed2/quda-0.7.0/tests/staggered_dslash_test
running the following test:
prec recon test_type dagger S_dim T_dimension
double 18 0 0 24/24/48 48
Grid partition info: X Y Z T
0 0 1 1
Found device 0: Tesla K20
Using device 0: Tesla K20
Setting NUMA affinity for device 0 to CPU core 0
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
Randomizing fields ...

@mathiaswagner self-assigned this Dec 16, 2014
@mathiaswagner
Member

Strange. It seems to work for me when using:

mathwagn@aprun7:~/quda/tests> aprun -n4 -N1 ./staggered_dslash_test --prec double --xdim 24 --ydim 24 --zdim 48 --tdim 48 --xgridsize 1 --ygridsize 1 --zgridsize 2 --tgridsize 2

@AlexVaq
Member

AlexVaq commented Dec 16, 2014

You should set QUDA_RESOURCE_PATH to any path you wish (one to which you have write permission); otherwise QUDA will retune all the kernels every time you launch the job, which can be extremely slow, so your results for the scaling test will be completely spoilt. Actually, if you don’t see anything for a while, it can mean that QUDA is tuning some kernels, but I would expect to see that a bit later, not right after “Randomizing fields…”.
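For example (just a sketch; the directory name here is arbitrary, any path on a filesystem the compute nodes can write to will do), the job script could set this up before the aprun call:

export QUDA_RESOURCE_PATH=$HOME/quda_tunecache   # hypothetical location; must exist and be writable
mkdir -p $QUDA_RESOURCE_PATH
aprun -n 4 -N 1 ./staggered_dslash_test --prec double ...   # later runs reuse the cached tuning

The first run still pays the autotuning cost, but subsequent runs read the cached parameters from that directory instead of retuning, which is what you want for a clean scaling study.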


@maddyscientist
Member

So Steve, are you finding that it hangs for you, or is this the last thing it prints before it exits?

@stevengottlieb
Member Author

Hi Mike,

I have been traveling back from Mountain View. The job was not hanging; it just did not seem to use more than one GPU. Some of the larger volumes did fail when memory could not be allocated.

Here is a snippet from the compilation. If it looks like some flag is missing, or if there are incompatible ones, please let me know.

CC -Wall -O3 -D__COMPUTE_CAPABILITY__=350 -DMULTI_GPU -DGPU_STAGGERED_DIRAC -DGPU_FATLINK -DGPU_UNITARIZE -DGPU_GAUGE_TOOLS -DGPU_GAUGE_FORCE -DGPU_GAUGE_TOOLS -DGPU_HISQ_FORCE -DGPU_STAGGERED_OPROD -DGPU_GAUGE_TOOLS -DGPU_DIRECT -DBUILD_QDP_INTERFACE -DBUILD_MILC_INTERFACE -DNUMA_AFFINITY -I/opt/nvidia/cudatoolkit/default/include -DMPI_COMMS -I/opt/cray/mpt/7.0.4/gni/mpich2-cray/83/include -I../include -Idslash_core -I. gauge_field.cpp -c -o gauge_field.o

Thanks,
Steve


@maddyscientist
Member

Hi Steve,

I didn't see any reported dslash GFLOPS, which is why I asked if it was hanging. If you're telling me it runs to completion, then I suspect all is running fine. One thing to note is that the test only prints out the performance for one GPU and not the aggregate performance. Could this be why you think it is running on one GPU only? I guess we could update the tests to print the aggregate performance as well, and also the number of GPUs it is running on.


@stevengottlieb
Member Author

Hi Mike,

Thanks for getting back to me. I was only showing a snippet of the output. I don't have a lot of experience running this test. What makes me suspicious is this part of the output:
running the following test:
prec recon test_type dagger S_dim T_dimension
double 18 0 0 24/24/48 48
Grid partition info: X Y Z T
0 0 1 1

I expected the grid partition info to show 2 under Z and T, but now that I see that X and Y are zero, I was probably misinterpreting its meaning. Is the only possible value 0 or 1, depending on whether that dimension is cut?

The other issue that worries me is that several of the jobs are failing because they run out of memory. On one node, I can run 40^4, with this information at the end of the run:
Device memory used = 3155.7 MB
Page-locked host memory used = 1591.2 MB
Total host memory used >= 1986.8 MB

If I try to run 40 X 80^3 on 8 nodes, the job runs out of memory:
running the following test:
prec recon test_type dagger S_dim T_dimension
double 18 0 0 40/80/80 80
Grid partition info: X Y Z T
0 1 1 1
Found device 0: Tesla K20
Using device 0: Tesla K20
Setting NUMA affinity for device 0 to CPU core 0
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
Randomizing fields ...
Fat links sending...ERROR: Aborting (rank 6, host nid00193, malloc.cpp:156 in device_malloc_())
last kernel called was (name=,volume=,aux=)
ERROR: Failed to allocate device memory (cuda_gauge_field.cu:42 in cudaGaugeField())

In fact, on 8 nodes only 24 X 48^3 runs; the larger volumes fail. Here is the report about memory usage for this run:
Device memory used = 3488.3 MB
Page-locked host memory used = 1974.7 MB
Total host memory used >= 2384.7 MB

When I look at the 24^4 run on a single GPU, the memory report is:
Device memory used = 437.6 MB
Page-locked host memory used = 231.3 MB
Total host memory used >= 306.6 MB

This makes me wonder whether the job is running on only 1 GPU, or whether I don't understand the input parameters, i.e., are --xdim, --ydim, etc. the per-GPU grid size or the total grid size?

I can make all the scripts and output available if we can find a convenient space; that might result in a faster time to solution. I can put them on my web server, Dropbox (or IU's version of it), or Blue Waters. Any preference?

Thanks,
Steve


@maddyscientist
Member

Is the only possible value 0 or 1 depending on whether that dimension is cut?

Yes.

With respect to memory: when running 40^4 on a single GPU, are you partitioning the dimensions (loop-back communication)? You need to do this to have the same effective per-GPU memory usage as the multi-GPU runs. The force routines in particular use so-called extended fields, where we allocate a local field size that includes the halo regions. I suspect you are running afoul of this.

Having said that, I believe there is scope for optimizing memory usage, particularly in multi-GPU mode: some extended gauge fields allocate a non-zero halo region even in dimensions that are not partitioned, which obviously leaves room for improvement.

We can open additional bugs where necessary to reduce memory consumption (post 0.7.0) if this is a priority for you.
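As a concrete sketch (this uses the --partition flag of the test binaries, explained in the comments below, and reuses the 40^4 local volume from above): the 8-node 40 X 80^3 job partitions Y, Z and T, so the comparable single-GPU loop-back run would be something like

aprun -n 1 -N 1 ./staggered_dslash_test --prec double --xdim 40 --ydim 40 --zdim 40 --tdim 40 --partition 14

where 14 = binary 1110 forces loop-back communication (and hence halo allocation) in Y, Z and T while leaving X unpartitioned. The device memory reported by that run should then be directly comparable to the per-GPU footprint of the multi-GPU job.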

@mathiaswagner
Member

I have never used loop-back communication. Could you give me some instructions on how to use it with QUDA?

@maddyscientist
Member

With the unit tests it's easy: you just use the --partition flag to force communication on. This is a 4-bit integer where each bit signifies communication in a different dimension, with X the least significant bit and T the most significant bit. E.g., --partition 11 has X, Y, and T partitioning switched on.
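A few illustrative values, just spelling out the bit arithmetic above (X = bit 0, Y = bit 1, Z = bit 2, T = bit 3; these are examples, not special values):

--partition 8    # binary 1000: T only
--partition 12   # binary 1100: Z and T
--partition 14   # binary 1110: Y, Z and T
--partition 15   # binary 1111: all four dimensions

So a single-GPU run that mimics the halo allocations of Steve's original 1x1x2x2 grid (Z and T cut) would use --partition 12.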

@mathiaswagner
Member

Will try that. Thanks.

@maddyscientist
Member

Closing.
