
Support GPU direct as memory endpoint #284

Open

JulianKunkel opened this issue Nov 30, 2020 · 22 comments

Comments

@JulianKunkel
Collaborator

Would be useful to support benchmarking deep learning workflows.
Could use a new flag such as "--memory-buffer-gpu".

@glennklockwood
Contributor

I believe John Ravi/NC State has code to do this that he is cleaning up. @sbyna may know the status of it.

@sbyna

sbyna commented Dec 1, 2020

John is doing some fine-tuning of the code, which allows IOR to work both with and without GDS. He has been busy with another project, but should be done with IOR shortly and will open a PR.

JulianKunkel added a commit that referenced this issue Jan 21, 2021
… Partially addressing #284. IOR support completed.
@JulianKunkel
Collaborator Author

I've been implementing the CUDA malloc/free part, allowing buffers to live on the GPU.
Fun result on a V100:
./src/ior -o /dev/shm/test
write 1479
read 3610

./src/ior -O allocateBufferOnGPU=1 -o /dev/shm/test
write 2122
read 5236

I ran it several times; the results are reproducible.

@adilger
Contributor

adilger commented Jan 22, 2021

Does -O allocateBufferOnGPU=1 also cause IOR to use O_DIRECT? There is also a separate question of whether IOR is filling in the data pattern in the buffers in this case, i.e., whether the CPU is touching the pages.

What filesystem (and if Lustre, what version)? There are Lustre-specific GDS enhancements in some Lustre versions, but they have not yet landed in the master branch because of the 2.14 feature freeze.

@JulianKunkel
Collaborator Author

The option purely allocates the buffer on the GPU; IOR fills the buffer as usual, but thanks to unified memory the pages are migrated back to the GPU.
My goal is to allow different combinations of options, so users can decide to additionally use O_DIRECT.
I'll be adding the additional feature to use GPU Direct, but I have no test system that could benefit from it.
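
For illustration, a minimal sketch of the unified-memory path (hypothetical helper, not the actual IOR code): the buffer is allocated with cudaMallocManaged, the CPU fills the data pattern as usual, and the pages migrate to the GPU on first device access or via an explicit prefetch.

#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical illustration: allocate an I/O buffer in CUDA unified memory
 * so the CPU can touch it (pattern fill, verification) while the pages can
 * still live on the GPU. */
static void *alloc_io_buffer_managed(size_t size)
{
    void *buf = NULL;
    cudaError_t err = cudaMallocManaged(&buf, size, cudaMemAttachGlobal);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    /* CPU fills the data pattern; the pages migrate to the GPU lazily on
     * first device access, or eagerly with a prefetch. */
    memset(buf, 0xAB, size);
    cudaMemPrefetchAsync(buf, size, 0 /* device 0 */, 0 /* default stream */);
    return buf;
}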

@JulianKunkel
Collaborator Author

I have added support for gpuDirect via the cuFile API:
$ ./src/ior --gpuDirect --posix.odirect
It basically stores one block of the file in the read or write buffer on the GPU.

Unfortunately, I have no system where I can sensibly test it with a file system such as Lustre.
Therefore, I created a library to fake the GPUDirect/cuFile calls and tested it with the NVIDIA examples:
https://github.com/VI4IO/fake-gpudirect-cufile/

That means it is likely that something is not working completely as intended, and it requires testing.
I'm happy to finish development on a system that has support, or to have someone else come back with questions.
Once that works, I will add the options to the md* benchmarks.
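
For anyone testing this on a GDS-capable system, the cuFile call sequence is roughly the following (a sketch against the NVIDIA cuFile API, not the actual IOR POSIX module; error checks are mostly omitted, and block_size should meet the O_DIRECT alignment requirements):

#define _GNU_SOURCE
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: write one block that lives in GPU device memory straight to a
 * file via GPUDirect Storage; the file is opened with O_DIRECT. */
static int gds_write_block(const char *path, size_t block_size)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0)
        return -1;

    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, block_size);          /* buffer lives on the GPU */

    cuFileDriverOpen();                        /* bring up the GDS driver */

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);     /* register the fd with cuFile */

    cuFileBufRegister(gpu_buf, block_size, 0); /* register the GPU buffer */

    /* DMA directly from GPU memory into the file, no host bounce buffer. */
    ssize_t ret = cuFileWrite(handle, gpu_buf, block_size,
                              0 /* file offset */, 0 /* buffer offset */);

    cuFileBufDeregister(gpu_buf);
    cuFileHandleDeregister(handle);
    cuFileDriverClose();
    cudaFree(gpu_buf);
    close(fd);
    return ret == (ssize_t) block_size ? 0 : -1;
}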

@sihara

sihara commented Feb 1, 2021

I tested this patch on a DGX-A100 with GDS (GPUDirect Storage) enabled. A couple of pieces of feedback:
It would be even simpler if a custom CUDA path could be given as "--with-cuda=PATH" instead of CPPFLAGS="-I/usr/local/cuda-11/include" LDFLAGS="-L/usr/local/cuda-11/lib64". '-O allocateBufferOnGPU=1' works, but I didn't find the '--gpuDirect' option.
It would also be nice to have a GPU affinity setting, e.g., if a node has multiple GPUs, IOR would select which GPU ID to use.
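
Regarding GPU affinity: a minimal sketch (an assumption, not IOR's actual implementation) of selecting one device per MPI rank on a multi-GPU node could look like this:

#include <cuda_runtime.h>
#include <mpi.h>

/* Sketch: bind each MPI rank to a GPU by cycling over the devices
 * visible on the node, using the node-local rank. */
static void select_gpu_for_rank(void)
{
    MPI_Comm node_comm;
    int local_rank = 0, device_count = 0;

    /* ranks on the same node get consecutive local ranks */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_free(&node_comm);

    cudaGetDeviceCount(&device_count);
    if (device_count > 0)
        cudaSetDevice(local_rank % device_count);
}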

@JulianKunkel
Collaborator Author

Thanks for testing.

It seems configure may not have detected cufile.h.
When you run configure, it needs to output:
checking for cufile.h... yes
checking for library containing cuFileDriverOpen... -lcufile

Once that is there, it should support gpuDirect:
$ ./src/ior --help
Synopsis ./src/ior
Flags
-c, --collective Use collective I/O
-C reorderTasks -- changes task ordering for readback (useful to avoid client cache)
--gpuDirect allocate I/O buffers on the GPU and use gpuDirect to store data; this option is incompatible with any option requiring CPU access to data.
...
Module POSIX
Flags
--posix.odirect Direct I/O Mode
--gpuDirect allocate I/O buffers on the GPU

@JulianKunkel
Collaborator Author

I've added support for path, i.e., --with-cuda= and --with-gpuDirect= should work.

@sihara

sihara commented Feb 10, 2021

@JulianKunkel where did you push the code so I can test it again?

@JulianKunkel
Collaborator Author

JulianKunkel commented Feb 10, 2021 via email

JulianKunkel added a commit that referenced this issue Feb 18, 2021
* Basic support for memory allocation on GPU using CUDA unified memory. Partially addressing #284. IOR support completed.
* Support for GPU alloc in MDTest and MD-Workbench
* Option: support repeated parsing of same option (allows option sharing across modules).
* Checks for gpuDirect
* Integrate gpuDirect options and basic hooks, more testing to be done.
* POSIX: basic gpuDirect implementation working with fake-gpudirect library.
* CUDA allow setting of DeviceID for IOR (not yet MDTest).
* CUDA/GPUDirect Support --with-X=<path>
* Bugfix in the option parser for option arguments that look like flags, e.g., -O=1: if 1 is also a flag, it was wrongly treated as one.
@JulianKunkel
Collaborator Author

Since the basic version has landed (but could not be tested thoroughly), I'll close the issue for now.

@ajtarraga

Hi, I am testing I/O systems with support for GPUDirect in a heterogeneous cluster.

In this cluster I have 7 different nodes with Tesla T4 GPUs with GPUDirect support and NVMe-oF. I would like to have several nodes access the same NVMe disk in order to measure the bandwidth.

I have tried to use the flags --gpuDirect, --with-cuda and --with-gpuDirect, but I get the following error while executing the command:
Error invalid argument: --gpuDirect Error invalid argument: --with-cuda Error invalid argument: --with-gpuDirect Invalid options

The command was mpirun -n 2 -H nodo51:1,nodo52:1 ./src/ior -t 1m -b 16m -s 16 -o /nvme/testFile -O allocateBufferOnGPU=1 --gpuDirect --with-cuda=/usr/local/cuda-11.8/bin --with-gpuDirect=/usr/local/cuda-11.8/bin

If anyone can help me, I would appreciate it.

@JulianKunkel
Collaborator Author

In order to use these flags, configure must find CUDA (./configure options: --with-cuda --with-gpuDirect).
Once compiled, you should see the options listed when running ./ior --help.

@ajtarraga

ajtarraga commented Mar 8, 2023

I have configured with these flags; however, I am not able to see the --gpuDirect option when executing the command. What command should I use in order to test it with the GPU and GPUDirect? @JulianKunkel

@JulianKunkel
Collaborator Author

Could it be that nvcc is not in the path?
My configure looks like this:
./configure --with-gpuDirect=/usr/local/cuda/targets/x86_64-linux --with-nvcc --with-cuda=/usr/local/cuda/targets/x86_64-linux
You need to check whether the configure output has:
checking for cufile.h
checking for cuda_runtime.h

@ajtarraga

I have configured it as you said:
./configure --with-gpuDirect=/usr/local/cuda-11.8/targets/x86_64-linux --with-nvcc --with-cuda=/usr/local/cuda-11.8/targets/x86_64-linux
And I have the correct output that you indicated:
checking for cufile.h... yes
checking for cuda_runtime.h... yes

After running the make clean and make commands, I searched for the gpuDirect option and I can't see it; I can only see:
-O allocateBufferOnGPU=X -- allocate I/O buffers on the GPU: X=1 uses managed memory - verifications are run on CPU; X=2 managed memory - verifications on GPU; X=3 device memory with verifications on GPU
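
For reference, my reading of those modes (an assumption, not confirmed from the IOR source) is that they map to CUDA allocation calls roughly like this:

#include <cuda_runtime.h>
#include <stdlib.h>

/* Assumed mapping of -O allocateBufferOnGPU=X to allocations:
 *   X=1, X=2: unified (managed) memory, reachable from CPU and GPU
 *   X=3:      plain device memory, reachable from the GPU only      */
static void *alloc_buffer(int mode, size_t size)
{
    void *buf = NULL;
    if (mode == 1 || mode == 2)
        cudaMallocManaged(&buf, size, cudaMemAttachGlobal);
    else if (mode == 3)
        cudaMalloc(&buf, size);
    else
        buf = malloc(size);   /* mode 0: ordinary host memory */
    return buf;
}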

@JulianKunkel
Collaborator Author

Okay, so I did a git pull to the latest version dbb1f7d. It is challenging to find the issue on your end.
Can you share your config.h? I have added the output of:
$ grep -v "/" src/config.h | grep -v "^$"

#define HAVE_CUDA_RUNTIME_H 1
#define HAVE_CUFILE_H 1
#define HAVE_FCNTL_H 1
#define HAVE_GETTIMEOFDAY 1
#define HAVE_INTTYPES_H 1
#define HAVE_LIBINTL_H 1
#define HAVE_MEMORY_H 1
#define HAVE_MEMSET 1
#define HAVE_MKDIR 1
#define HAVE_MPI 1
#define HAVE_PUTENV 1
#define HAVE_REALPATH 1
#define HAVE_REGCOMP 1
#define HAVE_STATFS 1
#define HAVE_STATVFS 1
#define HAVE_STDINT_H 1
#define HAVE_STDLIB_H 1
#define HAVE_STRCASECMP 1
#define HAVE_STRCHR 1
#define HAVE_STRERROR 1
#define HAVE_STRINGS_H 1
#define HAVE_STRING_H 1
#define HAVE_STRNCASECMP 1
#define HAVE_STRSTR 1
#define HAVE_SYSCONF 1
#define HAVE_SYS_IOCTL_H 1
#define HAVE_SYS_MOUNT_H 1
#define HAVE_SYS_PARAM_H 1
#define HAVE_SYS_STATFS_H 1
#define HAVE_SYS_STATVFS_H 1
#define HAVE_SYS_STAT_H 1
#define HAVE_SYS_TIME_H 1
#define HAVE_SYS_TYPES_H 1
#define HAVE_UNAME 1
#define HAVE_UNISTD_H 1
#define HAVE_WCHAR_H 1
#define META_ALIAS "ior-4.1.0+dev-0"
#define META_NAME "ior"
#define META_RELEASE "0"
#define META_VERSION "4.1.0+dev"
#define PACKAGE_BUGREPORT ""
#define PACKAGE_NAME "ior"
#define PACKAGE_STRING "ior 4.1.0+dev"
#define PACKAGE_TARNAME "ior"
#define PACKAGE_URL ""
#define PACKAGE_VERSION "4.1.0+dev"
#define STDC_HEADERS 1
#ifndef _DARWIN_USE_64_BIT_INODE
# define _DARWIN_USE_64_BIT_INODE 1
#endif
#define _XOPEN_SOURCE 700

@ajtarraga

I have the same output as you; I checked it with the diff command and there is no difference between your config and mine.

JulianKunkel reopened this Mar 14, 2023
@JulianKunkel
Collaborator Author

I can now reproduce the issue.

@JulianKunkel
Collaborator Author

The problem appears to be that nvcc cannot be found:
checking for nvcc... no

If it works, you get something like this:
checking for nvcc... /sw/tools/cuda/11.2/bin/nvcc

Then during compilation, it will output something like:
nvcc -g -O2 -c -o utilities-gpu.o utilities-gpu.cu

Only then does GPU Direct work.
I have to see why the configure.ac macro doesn't work as intended in this case.

Please give it a try.

@ajtarraga

I have tried it with:
sudo ./configure --with-gpuDirect=/usr/local/cuda-11.8/targets/x86_64-linux --with-nvcc=/usr/local/cuda-11.8/bin/nvcc --with-cuda=/usr/local/cuda-11.8/targets/x86_64-linux

And I got the output you described:
checking for nvcc... /usr/local/cuda-11.8/bin/nvcc

However, when I ran the make command, I did not get the output you described. I searched the build output for the keyword nvcc and there were no matches.

I cannot understand what the problem is.
