problem with staggered_dslash_test #200
Strange. It seems to work for me when using
You should set QUDA_RESOURCE_PATH to any path you wish (and to which you have write permission); otherwise QUDA will retune all the kernels every time you launch the job, which can be extremely slow, so your results for the scaling test will be completely spoilt. Actually, if you don’t see anything for a while, it can mean that QUDA is tuning some kernels, but I would expect to see this a bit later, not right after “Randomizing fields…”.
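For reference, a minimal sketch of setting this in the job script before the aprun line (the directory name here is just an illustrative placeholder, and this assumes the launcher propagates exported environment variables to the compute nodes):

mkdir -p $HOME/quda_tunecache                    # any writable directory
export QUDA_RESOURCE_PATH=$HOME/quda_tunecache   # QUDA caches tuned kernel parameters here across runs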
So Steve, are you finding that it is hanging for you, or is this the last thing it prints before it exits?
Hi Mike, I have been traveling back from Mountain View. The job was not hanging. Here is a snippet from the compilation, if it looks like some flag is [...]:
CC -Wall -O3 -D__COMPUTE_CAPABILITY__=350
Thanks.
Hi Steve, I didn't see any reported dslash gflops, which is why I asked if it was hanging. If you're telling me it is running to completion, then I suspect all is running fine. One thing to note is that the test only prints out the performance for one GPU and not the aggregate performance. Could this be why you think it is running on one GPU only? I guess we could update the tests to print aggregate performance as well, and also the number of GPUs the run is using.
Hi Mike, Thanks for getting back to me. I was only showing a snippet of the [...] I expected the grid partition info to have 2 under Z and T, but now that I [...] The other issue that is causing me to worry is that several of the jobs [...] If I try to run 40 x 80^3 on 8 nodes, the job runs out of memory. In fact, on 8 nodes, only 24 x 48^3 runs; the larger volumes fail. When I look at the 24^4 run on a single GPU, the memory report is [...] This makes me wonder if the job is running on only 1 GPU, or [...] I can make all the scripts and output available if we can find a [...] Thanks.
Yes. With respect to memory, when running 40^4 on a single GPU, are you partitioning the dimensions (loop-back communication)? You need to do this to have the same effective per-GPU memory usage as the multi-GPU runs. The force routines in particular use so-called extended fields, where we allocate a local field size that includes the halo regions. I suspect you are running afoul of these issues. Having said that, I believe there is optimization that can be done there with respect to memory usage in multi-GPU mode; in particular, I believe some extended gauge fields allocate a non-zero halo region in dimensions that are not partitioned, which obviously leaves scope for improvement. We can open additional bugs where necessary to reduce memory consumption (post 0.7.0) if this is a priority for you.
I have never used loop-back communication. Could you give me some instructions on how to use it with QUDA?
With the unit tests it's easy: you just use the --partition flag to force communication on. This is a 4-bit integer where each bit signifies communication in a different dimension, with X the least significant bit and T the most significant bit. E.g. --partition 11 has X, Y, and T partitioning switched on.
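As a rough sketch of how that encoding composes (the specific value and command below are illustrative, not taken from this thread): the bit values are 1 (X), 2 (Y), 4 (Z), and 8 (T), so forcing communication in Z and T only, matching the Z/T grid split used in the 4-GPU run elsewhere in this thread, gives 4 + 8 = 12. A hypothetical single-GPU 40^4 launch with loop-back communication in Z and T might then look like:

aprun -n 1 -N 1 ./staggered_dslash_test --prec double --xdim 40 --ydim 40 --zdim 40 --tdim 40 --partition 12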
Will try that. Thanks.
Closing.
Rich has requested a weak scaling study, so I am trying staggered_dslash_test. I am having trouble running multi-GPU jobs. I compiled QUDA for multi-GPU. Here is how I launch a job on a Cray:
aprun -n 4 -N 1 /N/u/sg/BigRed2/quda-0.7.0/tests/staggered_dslash_test --prec double --xdim $nx --ydim $ny --zdim $nz --tdim $nt --xgridsize 1 --ygridsize 1 --zgridsize 2 --tgridsize 2 >> out_4procs.$ns
Here is what shows up in output:
Tue Dec 16 09:11:07 EST 2014
-rwxr-xr-x 1 sg phys 192887697 Dec 4 02:25 /N/u/sg/BigRed2/quda-0.7.0/tests/staggered_dslash_test
running the following test:
prec recon test_type dagger S_dim T_dimension
double 18 0 0 24/24/48 48
Grid partition info: X Y Z T
0 0 1 1
Found device 0: Tesla K20
Using device 0: Tesla K20
Setting NUMA affinity for device 0 to CPU core 0
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
Randomizing fields ...