This section provides information and links that help with testing CoreNEURON's GPU support. Other sections of the documentation that may be relevant are:
- The getting-coreneuron section, which documents both building from source with CoreNEURON support and installing Python wheels.
- The coreneuron-running-a-simulation section, which explains the basics of porting a NEURON model to use CoreNEURON.
- The Running GPU benchmarks section, which outlines how to use profiling tools such as Caliper, NVIDIA Nsight Systems, and NVIDIA Nsight Compute.
This section aims to add some basic information about how to test if GPU execution is working. This might be useful if, for example, you need to test a change to the GPU wheel building, or test GPU execution on a new system.
If your local system has an (NVIDIA) GPU installed then you can probably skip this section. The nvidia-smi tool may be useful to check this; it will show the GPUs attached to a system:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P2200 Off | 00000000:01:00.0 Off | N/A |
| 45% 33C P8 4W / 75W | 71MiB / 5049MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
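The same check can be scripted, for example in a CI job or a wheel-testing script. The following is a minimal sketch, not part of NEURON or CoreNEURON: the helper names `parse_gpu_list` and `list_gpus` are hypothetical, and it assumes that `nvidia-smi` is on `PATH` and supports the `--query-gpu` and `--format=csv,noheader` options.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text):
    """Parse 'name, memory.total' CSV rows into (name, memory) tuples."""
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a comma inside a GPU name is preserved.
        name, _, memory = line.rpartition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # nvidia-smi is installed but could not talk to a driver or device.
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = list_gpus()
    if gpus:
        for name, memory in gpus:
            print(f"Found GPU: {name} ({memory})")
    else:
        print("No NVIDIA GPU visible from this machine")
```

An empty list only means that no GPU is visible from the current shell; on a cluster you may simply be on a login node.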
On a university cluster or supercomputer, you will typically need to pass some kind of extra constraint to the job scheduler. For example, on the BlueBrain5 system, which uses Slurm, you can allocate a GPU node using the volta constraint:
[login node] $ salloc -A <account> -C volta
salloc: Granted job allocation 294001
...
[compute node] $ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:1A:00.0 Off | Off |
...
If you have configured NEURON with CoreNEURON, CoreNEURON GPU support, and tests (-DNRN_ENABLE_TESTS=ON) enabled, then simply running
$ ctest --output-on-failure
in your CMake build directory will execute a large number of tests, many of them including GPU execution. You can filter which tests are run by name using the -R option to CTest, for example:
$ ctest --output-on-failure -R gpu
Test project /path/to/your/build
Start 42: coreneuron_modtests::direct_py_gpu
1/53 Test #42: coreneuron_modtests::direct_py_gpu ............................. Passed 1.98 sec
Start 43: coreneuron_modtests::direct_hoc_gpu
2/53 Test #43: coreneuron_modtests::direct_hoc_gpu ............................ Passed 1.03 sec
Start 44: coreneuron_modtests::spikes_py_gpu
...
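If you capture the CTest output to a file, a short script can list the GPU tests that did not pass. This is a rough sketch under stated assumptions: the helper names are hypothetical, and the regular expression assumes the per-test line layout shown above, with CTest prefixing non-passing statuses with `***` (for example `***Failed`). CTest's own summary at the end of the run remains the authoritative result.

```python
import re

# Matches lines such as:
#   1/53 Test #42: coreneuron_modtests::direct_py_gpu ..... Passed 1.98 sec
# The optional "***" prefix on the status is dropped.
CTEST_LINE = re.compile(
    r"Test\s+#\d+:\s+(?P<name>\S+)\s+\.+\s*(?:\*{3})?"
    r"(?P<status>[A-Za-z][A-Za-z ]*?)\s+(?P<seconds>[\d.]+)\s+sec"
)


def parse_ctest_results(output):
    """Map each test name found in the CTest output to its status string."""
    results = {}
    for line in output.splitlines():
        match = CTEST_LINE.search(line)
        if match:
            results[match.group("name")] = match.group("status")
    return results


def failed_tests(output):
    """Return the sorted names of tests whose status is not 'Passed'."""
    results = parse_ctest_results(output)
    return sorted(name for name, status in results.items() if status != "Passed")
```

For example, `failed_tests(open("ctest.log").read())` would return an empty list when every matched test passed, where `ctest.log` is a hypothetical file holding the captured output.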
It is sometimes convenient to run basic tests outside the CTest infrastructure. A particularly useful test case is the ringtest that is included in the CoreNEURON repository. This is very convenient because binary input data files for CoreNEURON are committed to the repository -- meaning that the test can be run without NEURON, Python, HOC, and friends -- and the required mechanisms are compiled as part of the standard NEURON build. To run this test on CPU you can, from your build directory, run:
$ ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring
...
where it is assumed that .. is the source directory. To enable GPU execution, add the --gpu option:
$ ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu
Info : 4 GPUs shared by 1 ranks per node
...
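The CPU and GPU runs can also be compared automatically. The sketch below is an illustration rather than an official tool: the function names are hypothetical, and it assumes that `special-core` prints a line of the form `Number of spikes: N` in its end-of-run statistics and that the paths match the commands above.

```python
import re
import subprocess

# Matches "Number of spikes: 37" but not the longer
# "Number of spikes with non negative gid-s: 37" line.
SPIKES_LINE = re.compile(r"^\s*Number of spikes:\s*(\d+)\s*$", re.MULTILINE)

# Command taken from the examples above; adjust the paths for your build.
RINGTEST = ["./bin/x86_64/special-core", "-d", "../external/coreneuron/tests/integration/ring"]


def spike_count(output):
    """Return the spike count reported in the simulation output, or None."""
    match = SPIKES_LINE.search(output)
    return int(match.group(1)) if match else None


def same_spike_count(cpu_output, gpu_output):
    """True only if both outputs report a spike count and the counts agree."""
    cpu = spike_count(cpu_output)
    gpu = spike_count(gpu_output)
    return cpu is not None and cpu == gpu


def run_ringtest(extra_args=()):
    """Run the ringtest from the build directory and return its stdout."""
    result = subprocess.run(
        RINGTEST + list(extra_args), capture_output=True, text=True, check=True
    )
    return result.stdout
```

From the build directory, `same_spike_count(run_ringtest(), run_ringtest(["--gpu"]))` would then be expected to return `True`. Spike counts are only a coarse check; they do not replace comparing the full statistics.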
You should see that the statistics printed at the end of the simulation are the same. It can also be useful to enable some basic profiling, for example by using NVIDIA's Nsight Systems utility nsys:
$ nsys nvprof ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu
WARNING: special-core and any of its children processes will be profiled.
Collecting data...
Info : 4 GPUs shared by 1 ranks per node
...
Number of spikes: 37
Number of spikes with non negative gid-s: 37
Processing events...
...
CUDA API Statistics:
Time(%) Total Time (ns) Num Calls Average (ns) Minimum (ns) Maximum (ns) StdDev (ns) Name
------- --------------- --------- ------------- ------------ ------------ ----------- --------------------------
42.7 2,127,723,623 136,038 15,640.7 3,630 10,224,640 59,860.5 cuLaunchKernel
...
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average (ns) Minimum (ns) Maximum (ns) StdDev (ns) Name
------- --------------- --------- ------------ ------------ ------------ ----------- ----------------------------------------------------------------------------------------------------
32.3 346,133,763 8,000 43,266.7 42,175 50,080 1,435.3 nvkernel__ZN10coreneuron18solve_interleaved1Ei_F1L653_4
12.7 136,155,806 8,002 17,015.2 3,615 1,099,738 90,544.0 nvkernel__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L375_7
10.4 111,258,439 8,002 13,903.8 3,199 1,314,489 73,556.3 nvkernel__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_F1L274_4
10.1 108,647,844 8,000 13,581.0 3,391 1,274,394 70,309.4 nvkernel__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L418_10
...
This can be helpful to confirm that compute kernels are really being launched on the GPU. Substrings such as solve_interleaved1, solve_interleaved2, nrn_cur_ and nrn_state_ in these kernel names indicate that the computationally heavy parts of the simulation are indeed being executed on the GPU. This test dataset is extremely small, so you should not pay much attention to the simulation time in this case.
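This check on kernel names can be automated when the profiler output is saved to a file. The following is a minimal sketch with a hypothetical helper name; it assumes that the output contains a `CUDA Kernel Statistics:` heading as in the transcript above, and it only looks for the substrings listed in the previous paragraph rather than full kernel names.

```python
# Substrings that indicate the heavy parts of the simulation ran on the GPU.
KERNEL_MARKERS = ("solve_interleaved1", "solve_interleaved2", "nrn_cur_", "nrn_state_")


def kernels_found(profile_text):
    """Return the markers that appear after the 'CUDA Kernel Statistics:' heading."""
    _, heading, kernel_section = profile_text.partition("CUDA Kernel Statistics:")
    if not heading:
        # No kernel statistics were printed, so nothing can be confirmed.
        return []
    return [marker for marker in KERNEL_MARKERS if marker in kernel_section]
```

An empty result suggests that no CoreNEURON kernels were recorded, which is worth investigating before trusting a "GPU" run. Only one of the two `solve_interleaved` variants is expected to appear in a given run, as in the output above.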
Note
The kernel names, which start with nvkernel__ZN10coreneuron above, are implementation details of the OpenACC or OpenMP implementation being used. They can also depend on whether you use MOD2C or NMODL to translate MOD files. If you want to do any more sophisticated profiling then you should use a profiling tool such as Caliper that can access the well-defined human-readable names for these kernels that NEURON and CoreNEURON define.