Support GPU direct as memory endpoint #284
Comments
I believe John Ravi (NC State) has code to do this that he is cleaning up. @sbyna may know the status of it.
John is doing some fine-tuning of the code, which allows IOR to work both with and without GDS. He has been busy with another project, but should be done with IOR shortly and will open a PR.
… Partially addressing #284. IOR support completed.
I've been implementing the CUDA malloc/free part, allowing buffers to live on the GPU: `./src/ior -O allocateBufferOnGPU=1 -o /dev/shm/test`. I ran it several times; it is reproducible.
What filesystem (and if Lustre, what version)? There are Lustre-specific GDS enhancements in some Lustre versions, but they have not yet landed in the master branch because of the 2.14 feature freeze.
The option purely allocates the buffer on the GPU; IOR fills the buffer as usual, but thanks to unified memory the pages are migrated back to the GPU.
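For context, this is roughly what such a unified-memory buffer allocation looks like; a minimal sketch assuming `cudaMallocManaged` is used (the helper name `alloc_gpu_buffer` is made up for illustration and is not taken from the IOR source):

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: allocate a transfer buffer in CUDA unified memory so
 * the host can fill it normally while the pages may migrate to the GPU. */
static void *alloc_gpu_buffer(size_t size)
{
  void *buf = NULL;
  cudaError_t ret = cudaMallocManaged(&buf, size, cudaMemAttachGlobal);
  if (ret != cudaSuccess) {
    fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(ret));
    exit(EXIT_FAILURE);
  }
  return buf;
}

int main(void)
{
  size_t size = 1 << 20;       /* 1 MiB transfer buffer */
  char *buf = alloc_gpu_buffer(size);
  memset(buf, 0xAB, size);     /* host fills the buffer as usual */
  /* ... hand buf to read()/write() as the benchmark would ... */
  cudaFree(buf);
  return 0;
}
```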
I have added support for gpuDirect via the cuFile API. Unfortunately, I have no system where I can sensibly test it with a file system such as Lustre, so something may not work completely as intended and it requires testing.
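For readers unfamiliar with cuFile, here is a rough, untested sketch of the call sequence that gpuDirect I/O involves; the file path, sizes, and the omission of error checking are mine, and this is not the exact code in the patch:

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
  size_t size = 1 << 20;
  void *devPtr;
  cudaMalloc(&devPtr, size);   /* plain device memory, not unified memory */

  /* GDS wants the file opened with O_DIRECT. */
  int fd = open("/mnt/lustre/testfile", O_CREAT | O_WRONLY | O_DIRECT, 0644);

  cuFileDriverOpen();          /* error checking omitted for brevity */

  CUfileDescr_t descr;
  memset(&descr, 0, sizeof(descr));
  descr.handle.fd = fd;
  descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

  CUfileHandle_t handle;
  cuFileHandleRegister(&handle, &descr);
  cuFileBufRegister(devPtr, size, 0);

  /* Transfer 'size' bytes directly from GPU memory to file offset 0. */
  cuFileWrite(handle, devPtr, size, 0, 0);

  cuFileBufDeregister(devPtr);
  cuFileHandleDeregister(handle);
  cuFileDriverClose();
  close(fd);
  cudaFree(devPtr);
  return 0;
}
```

Such a program would need to be built against the CUDA toolkit and linked with `-lcufile -lcudart`.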
I did test this patch on a DGX-A100 with GDS (GPUDirect Storage) enabled. I have a couple of pieces of feedback.
Thanks for testing. It seems it may not have detected cufile.h. Once that is found, it should support gpuDirect.
I've added support for paths, i.e., `--with-cuda=<path>` and `--with-gpuDirect=<path>` should work.
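As I understand it, the configure-time detection essentially boils down to whether a tiny program that includes cufile.h compiles and links against libcufile; something along these lines (an illustration of the idea, not the literal configure test):

```c
/* e.g.: gcc conftest.c -I<cuda>/include -L<cuda>/lib64 -lcufile -lcudart */
#include <cufile.h>

int main(void)
{
  /* If this compiles and links, cufile.h and libcufile were found. */
  cuFileDriverOpen();
  cuFileDriverClose();
  return 0;
}
```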
@JulianKunkel where did you push the code so I can test it again?
Hi, it is in the same PR #323.
You can chat with me on VI4IO if there is any issue.
* Basic support for memory allocation on GPU using CUDA unified memory. Partially addressing #284. IOR support completed.
* Support for GPU alloc in MDTest and MD-Workbench.
* Option: support repeated parsing of the same option (allows option sharing across modules).
* Checks for gpuDirect.
* Integrate gpuDirect options and basic hooks; more testing to be done.
* POSIX: basic gpuDirect implementation working with a fake-gpudirect library.
* CUDA: allow setting of DeviceID for IOR (not yet MDTest).
* CUDA/GPUDirect: support --with-X=<path>.
* Bugfix in the option parser for flags that are part of an argument to an option, e.g., -O=1: if 1 is a flag, it is wrongly assumed to be a flag.
Since the basic version has landed (but couldn't be too well tested), I'll close the issue for now.
Hi, I am testing I/O systems with support for GPUDirect in a heterogeneous cluster. In this cluster I have 7 different nodes with Tesla T4 GPUs with GPUDirect support and NVMe-oF. I would like to test access from several nodes to the same NVMe disk in order to measure the bandwidth. I have tried to use the flags … The command was … If anyone can help me, I would appreciate it.
Great. In order to use these flags, it must find CUDA (`./configure` options: `--with-cuda --with-gpuDirect`).
I have configured with these flags; however, I am not able to see the features of …
Could it be that nvcc is not in the path?
I have configured it like you said: … After running make: …
Okay, so I did a git pull to the latest version dbb1f7d. It is challenging to find the issue on your end.
I have the same output as you, and I have checked it with …
I can now reproduce the issue.
The problem appears to be that nvcc cannot be found: … If it works, you get something like this: … Then during compilation, it will output something like: … Only then does GPU Direct work. Please give it a try.
I have tried it with: … And I got the output like you said: … However, when I have used … I cannot understand what the problem is there.
Would be useful to support benchmarking deep learning workflows.
Could use a new flag such as "--memory-buffer-gpu".