======================
Using GPUs on Isambard
======================
This section presents information about GPU usage on Isambard. It is divided
into subsections describing general information about the available hardware and
common tasks associated with GPU-based workflows:

* General GPU usage. This subsection describes the PBS Pro directives that need
  to be included in a job script to target partitions with GPUs.
* Available CUDA versions & hardware differences
* Operating mode
* Setting up a work environment. Many users working on Machine Learning and Deep
  Learning tend to work with Python-based workflows. This subsection is focused
  on helping them transition from a local system (e.g. a laptop) to an HPC system.
General GPU Usage
=================
The PBS Pro scheduler controls access to GPUs on a node such that access is
only granted when the resource is specifically requested. With PBS Pro, GPUs can
be requested at job submission time via the following additional directive::

#PBS -l select=1:ncpus=64:ngpus=4

This directive instructs PBS Pro to allocate four GPUs (and 64 CPU cores) per
allocated node, to avoid nodes without GPUs, and to grant the job access to the
devices.
Jobs must also be submitted to the queue for the desired GPU-enabled nodes, for example::

#PBS -q ampereq # to request A100 GPUs

Or::

#PBS -q voltaq # to request V100 GPUs
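
Putting these directives together, the top of a GPU job script might look like
the following (a sketch only; adjust the queue, resource selection and walltime
to your needs)::

    #!/bin/bash
    #PBS -q voltaq
    #PBS -l select=1:ncpus=64:ngpus=4
    #PBS -l walltime=00:30:00
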
It is then possible to use CUDA enabled applications or the CUDA toolkit
modules themselves, for example on the MACS cluster::

$ module load cuda11.2/toolkit/11.2.0
CUDA Versions & Hardware Differences
====================================
Isambard currently supports CUDA 11.2 (driver v460.32.03) on the `pascalq` and
`voltaq` partitions; and CUDA 11.4 (driver v515.65.01) on the `ampereq` partition.
AMD's Linux kernel module, AMDgpu, is available on the `instinctq` partition
(driver v5.11.32.21.40).
Isambard offers different GPU models with different numbers of devices per node.
Some important differences between the GPU devices available on Isambard
are summarized in the table below:
.. csv-table:: GPUs info
:file: isambard_gpu_info.csv
:widths: 10, 10, 10, 10, 10, 10, 10, 10, 10, 10
:header-rows: 1
Tensor cores are a type of programmable core, introduced with the Volta
architecture, that run alongside standard CUDA cores. Tensor cores can
accelerate mixed-precision matrix multiply and accumulate calculations in a
single operation. This capability is especially significant for AI/DL/ML
applications that rely on large matrix operations.
Operating mode
==============
NVIDIA GPU cards can be operated in a number of Compute Modes. In short, the
difference is whether multiple processes (and, in principle, users) can share a
GPU or whether a GPU is exclusively bound to a single process. Whether one mode
or the other is needed is typically application-specific. On Isambard, NVIDIA
cards are set to "Default" (multiple contexts are allowed per device). Contact
support if your application requires a different operating mode.
To query the Compute Mode on all allocated cards::

for host in `cat $PBS_NODEFILE`; do
    ssh $host "nvidia-smi -q | grep 'Compute Mode'"
done

The corresponding output will appear in the job output file.
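
On Isambard each reported line should look similar to::

    Compute Mode                          : Default
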
Setting up a work environment
=============================
Running Python based workflows on Isambard
------------------------------------------
The traditional method to interact with remote HPC and cloud systems is through
the command line (via the ``ssh`` and ``scp`` commands), and although this might
require some adjustments for users more familiar with GUIs like JupyterLab, it is
also the most efficient method and likely to be of benefit in the longer term.
This section describes some of the most common steps when setting up a work
environment, Python in this example, as it is one of the most popular languages
used by researchers in the fields of Machine Learning and Deep Learning.
Copying data
------------
To work on Isambard with Python scripts written locally on our desktop or laptop
computer, we need to transfer them over. On Linux and macOS we can do this with
the ``scp`` command to copy individual files, and ``scp -r`` to recursively copy
a work directory over to our home directory on Isambard::
$ scp -r python-code/ login-01.isambard.gw4.ac.uk:.
WARNING:
Unauthorised access may constitute a criminal offence.
All activity on the system is liable to monitoring.
Use of this system constitutes acceptance of our policies:
https://gw4-isambard.github.io/docs/policies/terms.html
User documentation: https://gw4-isambard.github.io/docs/
For support, please email isambard-support@gw4.ac.uk
python-novice-infla 100% 7216 91.9KB/s 00:00
inflammation.png 100% 19KB 209.3KB/s 00:00
argv_list.py 100% 42 0.5KB/s 00:00
readings_08.py 100% 1097 12.4KB/s 00:00
readings_09.py 100% 851 9.4KB/s 00:00
check.py 100% 1000 11.0KB/s 00:00
my_ls.py 100% 488 5.2KB/s 00:00
line_count.py 100% 920 10.0KB/s 00:00
readings_06.py 100% 718 7.6KB/s 00:00
...
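
If ``rsync`` is available on your platform it can be used as an alternative to
``scp``; it only transfers files that have changed, which is convenient when
repeatedly syncing a work directory (a sketch; adjust the paths to your own
layout)::

    $ rsync -av python-code/ login-01.isambard.gw4.ac.uk:python-code/
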
Installing Python libraries
---------------------------
The **recommended approach** is to create a Python virtual environment with a
`requirements.txt` file which includes a list of all packages (and, optionally,
versions) needed for your work. This file can be created and used on your local
computer and then copied to Isambard to reproduce the same environment.
An example file is::
numpy==1.19
pandas
numba
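
If you already have a working environment on your local machine, one way to
generate this file is with ``pip freeze`` (note that this captures every
installed package, so you may want to trim the result down to the packages you
actually need)::

    $ python3 -m pip freeze > requirements.txt
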
Isambard provides a default Python installation (3.6.8) on both the MACS and
Phase 3 clusters that can be used to create your virtual environment::
$ python3 --version
Python 3.6.8
Additionally, on MACS, Python 3.7.5 is available via module files::
[login-01 ~]$ module load python37
[login-01 ~]$ python3 --version
Python 3.7.5
On Phase 3, Python 3.9.4 can be accessed through the `cray-python/3.9.4.2` module::
[p3-login ~]$ module load cray-python/3.9.4.2
[p3-login ~]$ python3 --version
Python 3.9.4
After selecting the appropriate Python version, you can proceed to create the
virtual environment. In the example below we use the default Python 3::
[login-01 ~]$ python3 -m venv --prompt myenv-macs ./myenv-macs
[login-01 ~]$ source ./myenv-macs/bin/activate
(myenv-macs) [login-01 ~]$
The above command creates a virtual environment named `myenv-macs` in a
directory with a matching name in the current directory.
Consider upgrading pip before installing main packages::
(myenv-macs) [login-01 ~]$ python3 -m pip install --upgrade pip
Collecting pip
Using cached https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl
Installing collected packages: pip
Found existing installation: pip 9.0.3
Uninstalling pip-9.0.3:
Successfully uninstalled pip-9.0.3
Successfully installed pip-21.3.1
Then continue with the installation of the libraries::
(myenv-macs) [login-01 ~]$ pip install -r requirements.txt
Collecting numpy==1.19
Downloading numpy-1.19.0-cp36-cp36m-manylinux2010_x86_64.whl (14.6 MB)
|████████████████████████████████| 14.6 MB 24 kB/s
Collecting pandas
Downloading pandas-1.1.5-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
|████████████████████████████████| 9.5 MB 571 bytes/s
Collecting python-dateutil>=2.7.3
Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
|████████████████████████████████| 247 kB 63.4 MB/s
Collecting pytz>=2017.2
Downloading pytz-2022.2.1-py2.py3-none-any.whl (500 kB)
|████████████████████████████████| 500 kB 88.8 MB/s
Collecting six>=1.5
Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: six, pytz, python-dateutil, numpy, pandas
Successfully installed numpy-1.19.0 pandas-1.1.5 python-dateutil-2.8.2 pytz-2022.2.1 six-1.16.0
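
You can quickly check that the libraries import correctly before starting work
(adapt the import list to your own packages)::

    (myenv-macs) [login-01 ~]$ python3 -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"
    1.19.0 1.1.5
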
If all goes well your Python libraries should be available within your virtual
environment. To deactivate the environment::
(myenv-macs) [login-01 ~]$ deactivate
[login-01 ~]$
Launching a GPU job
-------------------
Consider the following test Python code that uses Numba to compute the value of a
Gaussian probability density function at x with given mean and sigma::
#!/usr/bin/env python3
import math
import numpy as np
from numba import vectorize

SQRT_2PI = np.float32((2*math.pi)**0.5)

@vectorize(['float32(float32, float32, float32)'], target='cpu')
def gaussian_pdf(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)

# Evaluate the Gaussian a million times
x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)
mean = np.float32(0.0)
sigma = np.float32(1.0)
gaussian_pdf(x, mean, sigma)
print("job done")
Now that we have our Python script, we need to create an additional file (a job
script) to place it in the queue (submit the job). Make sure to remove any
commands from the Python script that might need additional confirmation or user
interaction, as you won't be able to provide it with this method of execution.
The following is the content of an example job script for MACS::
#!/bin/bash
#PBS -N pascalq
#PBS -q pascalq
#PBS -l select=1:ncpus=18:ngpus=1
#PBS -l walltime=00:15:00
# select -- allocate # separate nodes
# ncpus -- on each node allocate # cpus (cores)
# ngpus -- on each node allocate # gpus
set -eu
module purge
module load cuda11.2/toolkit/11.2.0
module list
# confirm which Python version we are using
which python3
python3 --version
# Load local python environment
source ~/myenv-macs/bin/activate
cp $PBS_O_WORKDIR/gaussian.py .
time python3 gaussian.py
To submit the above script (place it in the queue) on Isambard::
[login-01 ~]$ qsub run-macs-pascal.sh
59821.gw4head
You can query the current state of this job with::
[login-01 ~]$ qstat -u $USER
gw4head:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
59823.gw4head ca-munoz pascalq pascalq 157907 1 18 -- 00:15 R 00:00
This particular job might not spend a long time in the queue and the above output
might not show it, but on completion there should be `pascalq.oXXXXX` and
`pascalq.eXXXXX` files (with XXXXX matching your job ID number) created in
the current directory containing the output and error messages (if any) produced
by our script.
Identifying memory access errors with NVIDIA Compute Sanitizer
==============================================================
Compute Sanitizer is a functional correctness checking suite included in the CUDA
toolkit. This suite contains multiple tools that can perform different types of checks.
The ``memcheck`` tool is capable of precisely detecting and attributing out of bounds
and misaligned memory access errors in CUDA applications. The tool can also report
hardware exceptions encountered by the GPU. The ``racecheck`` tool can report shared
memory data access hazards that can cause data races. The ``initcheck`` tool can report
cases where the GPU performs uninitialized accesses to global memory. The ``synccheck``
tool can report cases where the application attempts invalid usages of
synchronization primitives. This section focuses on the use of ``memcheck`` and the
main steps to access the tool on Isambard, and demonstrates how to apply it to a
simple example. You can find more details in the `user manual for Compute Sanitizer`_.
Accessing memcheck
------------------
.. tabs::
.. tab:: MACS
To access NVIDIA Compute Sanitizer on Isambard, load the latest CUDA module::
$ module load cuda11.2/toolkit/11.2.0
$ compute-sanitizer --version
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2020 NVIDIA Corporation
Version 2020.3.0
.. tab:: Phase 3
On Phase 3, NVIDIA Compute Sanitizer is only accessible from the
compute nodes, for example::
[p3-login ~]$ qsub -I -q ampereq -l select=1:ngpus=4 -l walltime=01:00:00
[p3-gpu01-compute ~]$ module load nvidia/21.11
[p3-gpu01-compute ~]$ compute-sanitizer --version
NVIDIA (R) Compute Sanitizer
Copyright (c) 2020-2021 NVIDIA Corporation
Version 2021.3.1
Consider the following code (from `CUDA-MEMCHECK user manual`_):
.. include:: memcheck_demo.cu
:code: c++
Compilation
-----------
Use the following command to compile the code above with details about line numbers
where errors occur::
$ nvcc -Xcompiler -rdynamic -lineinfo -o memcheck_demo memcheck_demo.cu
In the above command the ``-lineinfo`` option is used to generate line number information
for applications without affecting the optimization level of the output. The ``-rdynamic``
option is given to the host compiler to retain function symbol names, which is useful
in stack backtraces. Note that when using ``nvcc``, flags for the host compiler can be
specified using the ``-Xcompiler`` option.
Running NVIDIA Compute Sanitizer
--------------------------------
Running CUDA programs on Isambard requires access to a GPU node. There are two
options available: an interactive session or a job script. If you attempt to run
CUDA programs on the login nodes, or on compute nodes without GPU access, you may
receive an error message similar to the one below::
$ compute-sanitizer --tool memcheck memcheck_demo
========= COMPUTE-SANITIZER
Mallocing memory
Running unaligned_kernel
Ran unaligned_kernel: no CUDA-capable device is detected
Sync: no CUDA-capable device is detected
Running out_of_bounds_kernel
Ran out_of_bounds_kernel: no CUDA-capable device is detected
Sync: no CUDA-capable device is detected
========= Error: Target application terminated before first instrumented API call
.. tabs::
.. tab:: Interactive session
On Isambard, you can request an interactive session with the following command
(example for `pascalq` on MACS; please adapt for `voltaq` or `ampereq` if needed)::
$ qsub -I -q pascalq -l select=1:ngpus=1 -l walltime=01:00:00
qsub: waiting for job 59831.gw4head to start
qsub: job 59831.gw4head ready
Remember to change the above options to fit your code requirements.
Load the latest CUDA module, change to your work directory and run compute
sanitizer in debug mode::
$ module load cuda11.2/toolkit/11.2.0
$ cd $PBS_O_WORKDIR
$ compute-sanitizer --log-file=memcheck_demo_debug.log --destroy-on-device-error kernel --leak-check full --tool memcheck memcheck_demo
Mallocing memory
Running unaligned_kernel
Ran unaligned_kernel: no error
Sync: no error
Running out_of_bounds_kernel
Ran out_of_bounds_kernel: no error
Sync: no error
In the above command ``--log-file`` specifies the file where the output of
`compute-sanitizer` should be saved; ``--destroy-on-device-error kernel``
controls how the application proceeds on hitting a memory access error (in this
case the ``kernel`` option specifies that the offending kernel is terminated
without running any subsequent instructions, while the application continues
launching other kernels in the CUDA context); ``--leak-check full`` prints
information about all allocations that have not been freed via ``cudaFree`` at
the point when the context was destroyed.
Executing the command above will produce a file named
:download:`memcheck_demo_debug.log <./memcheck_demo_debug_isambard.log>`
that we can analyze later on.
.. tab:: Job script
A potentially more convenient way to run Compute Sanitizer is through the job
scheduler (especially for jobs with long runtimes). The method is very similar to
running the application through the command line, but a job script with appropriate
directives to interact with the job scheduler is also needed. On Isambard we use
PBS Pro and the following job script could give you a good starting point:
.. include:: run.memcheck.isambard.sh
:code: bash
Compute Sanitizer memcheck output
---------------------------------
Exploring the output produced by memcheck we can find useful information like::
========= Invalid __global__ write of size 4 bytes
========= at 0x30 in /lustre/home/ca-munozcjj/gpudocs/files/memcheck_demo.cu:6:unaligned_kernel(void)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x15551a800001 is misaligned
========= Saved host backtrace up to driver entry point at kernel launch time
The above information can hopefully point you towards the region in the code likely to
benefit from further reviewing (line 6 in this example).
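
For reference, in the demo code from the `CUDA-MEMCHECK user manual`_ the
flagged line performs a deliberately misaligned write through an offset
pointer. A sketch of the pattern and its fix (assuming the unmodified demo
source) is::

    __device__ int x;

    __global__ void unaligned_kernel(void) {
        // Flagged by memcheck: int-sized write at a 1-byte offset
        // *(int*) ((char*) &x + 1) = 42;
        x = 42;  // aligned write, no longer reported
    }
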
Profiling
=========
NVIDIA Nsight Systems
---------------------
Isambard offers access to NVIDIA Nsight Systems and NVIDIA compilers as part of its
standard module files.
Loading the profiler
^^^^^^^^^^^^^^^^^^^^
To access NVIDIA Nsight Systems on Isambard run the following command:
.. tabs::
.. tab:: MACS
On MACS users can access CUDA on both the login and compute nodes::
$ module load cuda11.1/nsight/11.1.1
$ nsys --version
NVIDIA Nsight Systems version 2020.3.4.32-52657a0
.. tab:: Phase 3
On Phase 3, `nsys` is only available on the compute nodes::
[username@p3-gpu01-compute ~]$ module load nvidia/21.11
[username@p3-gpu01-compute ~]$ nsys --version
NVIDIA Nsight Systems version 2021.4.1.73-08591f7
Example CUDA C++ program
^^^^^^^^^^^^^^^^^^^^^^^^
The usual workflow involves writing code in CUDA C++, then compiling,
profiling, analysing, optimising and repeating. Consider the
following code:
.. include:: vector-add.cu
:code: c++
Compilation
^^^^^^^^^^^
.. tabs::
.. tab:: MACS
The above code can be compiled and tested using an interactive session::
$ qsub -I -q voltaq -l select=1:ngpus=1 -l walltime=01:00:00
Once our session is granted we are placed on the GPU node, where
we can load and check our compilers::
$ module load cuda11.2/toolkit/11.2.0
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
To compile our code::
$ nvcc -o vector-add -run vector-add.cu
All values were calculated correctly. Well done.
At this point your code is ready to be executed on the GPUs (and it already
was, since we used the `-run` option in nvcc). To profile the code we can use
the `nsys` command.
Notice that `nsys` is part of CUDA 11.1 instead of CUDA 11.2::
$ module purge
$ module load cuda11.1/nsight/11.1.1
$ nsys profile --stats=true ./vector-add
.. include:: nsys-output-isambard-macs.txt
:literal:
.. tab:: Phase 3
The above code can be compiled and tested using an interactive session::
$ qsub -I -q ampereq -l select=1:ngpus=1 -l walltime=01:00:00
Once our session is granted we are placed on the GPU node, where we can
load and check our compilers::
$ module load nvidia
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
To compile our code::
$ nvcc -o vector-add -run vector-add.cu
All values were calculated correctly. Well done.
At this point your code is ready to be executed on the GPUs (and it already
was, since we used the `-run` option in nvcc). To profile the code we can use
the `nsys` command::
$ nsys profile -b dwarf --sample=cpu --trace=nvtx,osrt,opengl --stats=true ./vector-add
.. include:: nsys-output-isambard-phase3.txt
:literal:
.. Attention::
In the above example we deliberately did not include CUDA tracing (note the
absence of ``cuda`` in ``--trace=nvtx,osrt,opengl``) as it currently produces
a ``core dumped`` error message. At the moment this limits the information
provided by the profiler. The tech team is currently investigating.
The ``nsys`` command has several possible options available (see
``nsys profile --help``), among them ``--stats=true`` triggers the generation of
a statistics summary and the creation of two files::
report1.qdrep
report1.sqlite
The statistics summary output includes the different CUDA operations performed by
our program, including the time consumed by our main kernel `addVectorsInto`.
Additionally, it provides statistical information about how much time is spent
transferring data back and forth between the GPU and the CPU. At this point we
could make changes to our code (e.g. change the launch configuration, or add
memory prefetching) and explore their impact on these numbers to try to find
optimal settings.
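
As an illustration of one such change, the sketch below prefetches managed
memory to the GPU before the kernel launch. It assumes the vectors in
`vector-add.cu` were allocated with ``cudaMallocManaged``; the names ``a``,
``b``, ``c``, ``size`` and the launch configuration are placeholders that
should be matched to the actual code::

    // Hypothetical snippet: prefetch managed allocations to the device
    // before the launch to avoid page-fault-driven migrations.
    int deviceId;
    cudaGetDevice(&deviceId);
    cudaMemPrefetchAsync(a, size, deviceId);  // size = allocation size in bytes
    cudaMemPrefetchAsync(b, size, deviceId);
    cudaMemPrefetchAsync(c, size, deviceId);

    addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);
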
Visualize profiler results
^^^^^^^^^^^^^^^^^^^^^^^^^^
The files generated by ``nsys profile`` can in principle be used to explore our
code's timeline in the NVIDIA Nsight Systems visual profiler. However, Isambard
doesn't currently support X11 connections for this; the files may still be
useful if you have a local installation of the visual profiler.
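
For example, from your local machine you could copy a report back and open it
with a local Nsight Systems installation (a sketch; adjust the host name and
paths to your setup)::

    $ scp login-01.isambard.gw4.ac.uk:report1.qdrep .
    $ nsys-ui report1.qdrep
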
Arm Forge MAP
-------------
MAP - part of the Arm Forge toolsuite for high performance software - is an intuitive
graphical profiler that can show how much time was spent on each line of code.
Loading the profiler
^^^^^^^^^^^^^^^^^^^^
You can access the Arm Forge toolset (MAP, DDT and Performance Reports) on both
MACS and Phase 3 with the following commands::

$ module use /software/x86/modulefiles
$ module load tools/arm-forge/20.1

You can use MAP's GPU profiling capabilities when working with CUDA programs
(CUDA kernels generated by OpenACC, CUDA Fortran, or offloaded OpenMP regions are
not yet supported by Arm MAP).
Compiler flags and compilation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Consider the :download:`previous example using CUDA C++ <./vector-add.cu>`.
In order to prepare our program to be profiled with MAP we need to compile it with
debugging symbols (``-g``) and in some cases it might need to be relinked. Typically
you should keep optimization flags enabled when profiling (rather than profiling a
debug build), as this gives more representative results. The recommended set of
compilation flags for CUDA kernels is::

-g -lineinfo -O3
.. tabs::
.. tab:: MACS
Therefore we can compile our example program with::
$ qsub -I -q pascalq -l select=1:ngpus=1 -l walltime=01:00:00
Once our session is granted we are placed on the GPU node, where
we can load the modules and compile the code::
$ module load cuda11.2/toolkit/11.2.0
$ nvcc -c -o vector-add.o -g -lineinfo -O3 vector-add.cu
$ nvcc vector-add.o -o vector-add
.. tab:: Phase 3
Therefore we can compile our example program with::
$ qsub -I -q ampereq -l select=1:ngpus=1 -l walltime=01:00:00
Once our session is granted we are placed on the GPU node, where
we can load the modules and compile the code::
$ module load nvidia/21.11
$ nvcc -c -o vector-add.o -g -lineinfo -O3 vector-add.cu
$ nvcc vector-add.o -o vector-add
This generates a `vector-add` executable in our current directory. There are two
alternative methods to run MAP on our executable and generate a report (a map file)
with the profiling information for later analysis. The first method uses an interactive
session on a GPU node while the second runs MAP through a job script. Both methods
allow you to interact with the profiler through the command line, but the former also
allows you to run the profiler's GUI and analyse the results as soon as the program
completes execution, so it is useful for programs with relatively short runtimes.
Running the profiler
^^^^^^^^^^^^^^^^^^^^
The following examples show how to run MAP on MACS; please adapt the instructions
(i.e. select the corresponding partitions and module files) to run on Phase 3.
.. tabs::
.. tab:: Interactive session
With an interactive session you can run the profiler through the command
line or using MAP's GUI. In both cases you first need to request the
interactive session.
.. tabs::
.. tab:: CLI
Request an interactive session (no need to do this again if you are already
in an interactive session)::

[login-01 ~]$ qsub -I -q pascalq -l select=1:ngpus=1 -l walltime=01:00:00

To run the profiler from the command line, follow these steps::
[pascal-003 ~]$ module use /software/x86/modulefiles
[pascal-003 ~]$ module load tools/arm-forge/20.1
[pascal-003 ~]$ ALLINEA_SAMPLER_INTERVAL=1 map --profile vector-add
Arm Forge 21.0 - Arm MAP
Profiling : /lustre/home/ca-munozcjj/gpudocs/profilers/arm/macs/vector-add
Allinea sampler : not preloading
MPI implementation : Auto-Detect (None)
All values were calculated correctly. Well done.
MAP analysing program...
MAP gathering samples...
MAP generated /lustre/home/ca-munozcjj/gpudocs/profilers/arm/macs/vector-add_1p_1n_40t_2022-10-14_19-42.map
The environment variable `ALLINEA_SAMPLER_INTERVAL` controls how often
samples are taken in your code (in ms). By default MAP samples the code
every 20 ms, but if the code only runs for a short time (as in our example)
it can be useful to increase the sampling rate (the maximum sampling rate is
1000 Hz, i.e. one sample every 1 ms).
If all goes well MAP will generate a map file in the current directory which
can be analysed later on with the MAP GUI on the login nodes.
.. tab:: GUI
To run MAP's GUI we need to access the MACS cluster with X11 enabled::
$ ssh -Y login.isambard
We need to request an interactive session with X11 enabled::
[login-01 ~]$ qsub -I -X -q pascalq -l select=1:ngpus=1 -l walltime=01:00:00
qsub: waiting for job 62482.gw4head to start
qsub: job 62482.gw4head ready
cd /home/ca-munozcjj/pbs.62482.gw4head.x8z
xauth: file /home/ca-munozcjj/pbs.62482.gw4head.x8z/.Xauthority does not exist
When the request is granted you should be placed on the allocated node.
You can now load Arm Forge and execute MAP::
[pascal-003 ~]$ module use /software/x86/modulefiles
[pascal-003 ~]$ module load tools/arm-forge/20.1
[pascal-003 ~]$ export ALLINEA_SAMPLER_INTERVAL=1
[pascal-003 ~]$ map
The environment variable `ALLINEA_SAMPLER_INTERVAL` controls how often
samples are taken in your code (in ms). By default MAP samples the code
every 20 ms, but if the code only runs for a short time (as in our example)
it can be useful to increase the sampling rate (the maximum sampling rate is
1000 Hz, i.e. one sample every 1 ms).
If everything goes well you should see a new window similar to the left
figure below. Select **Profile a program**; this will open a new window
where you can search for the program to be profiled in the field
**Application** (right figure below).
.. list-table::
* - .. figure:: ./figures-gpudocs/isambard-macs-arm-forge-map-01.png
:alt: MAP GUI initial window
- .. figure:: ./figures-gpudocs/isambard-macs-arm-forge-map-02.png
:alt: MAP GUI search for application to profile
Select and open the application to profile (left figure below). Back on
the **Run** menu there are several options that allow you to control, for
example, the different **Metrics** to be monitored. Select the **CUDA
Kernel Analysis** field to extract additional information from your CUDA
kernels.
.. list-table::
* - .. figure:: ./figures-gpudocs/isambard-macs-arm-forge-map-03.png
:alt: MAP GUI select program
- .. figure:: ./figures-gpudocs/isambard-macs-arm-forge-map-04.png
:alt: MAP GUI GPU metrics
Select **Run** and the profiler will start analysing the code. On completion
a new window will open with the analysis results; a map file will also be
generated in our current directory, for example
`vector-add_1p_1n_2022-08-11_18-56.map`, which you can use to explore
the profiler results again later on if needed. On the results window there
are several sections with useful information. Arm MAP makes it very easy to
observe how much time was consumed on the CPU and GPU. The **Main thread
activity** timeline (top section circled in red) shows in purple the time
spent by the CPU waiting for the GPU, while green indicates the time spent
on the CPU.
The **Main Thread Stacks** section (bottom region circled in blue) shows a
more detailed description of the functions being accessed, organised by time
spent. We can immediately see that our code is spending a lot of time (30.3%
of the whole program duration) on the CPU, in the `initWith` host function.
This suggests that our code could potentially benefit from accelerating this
function by turning it into a GPU kernel.
.. figure:: ./figures-gpudocs/isambard-macs-arm-forge-map-05.png
:scale: 50%
:alt: MAP results window
.. tab:: Job Script
A potentially more convenient way to run MAP is through the job scheduler
(especially for jobs with long runtimes). The method is very similar to running
the profiler through the command line, but a job script with appropriate directives
to interact with the job scheduler is also needed. On Isambard we use PBS Pro and
the following job script could give you a good starting point:
.. include:: ./files-gpudocs/run.arm.map.q
:code: bash
The environment variable `ALLINEA_SAMPLER_INTERVAL` controls how often
samples are taken in your code (in ms). By default MAP samples the code
every 20 ms, but if the code only runs for a short time (as in our example)
it can be useful to increase the sampling rate (the maximum sampling rate is
1000 Hz, i.e. one sample every 1 ms).
As with the interactive command line method, the above script will produce a map
file that can then be used to start MAP on the login nodes to analyse the profiler
results. Don't forget to amend the ``-l walltime`` and ``-l select`` options to
fit your program requirements.
Load a profile data file
^^^^^^^^^^^^^^^^^^^^^^^^
If you want to explore the results produced by MAP again later, you don't need to
request another interactive session; it is enough to connect to the system with
X11 enabled, load the arm-forge module and run `map` on the login nodes. Then on the
startup menu select **LOAD PROFILE DATA FILE** to load a profile data file from a
previous run (left image below), and select the appropriate map file (right image
below). A new window with the profiler results will appear.
.. list-table::
* - .. figure:: ./figures-gpudocs/isambard-macs-arm-forge-map-06.png
:alt: load profile data file
- .. figure:: ./figures-gpudocs/isambard-macs-arm-forge-map-07.png
:alt: select map file
* - .. figure:: ./figures-gpudocs/isambard-macs-arm-forge-map-08.png
:alt: MAP previous results window
-
Benchmarking
============
We compare the performance of some of the GPU devices available on Isambard (NVIDIA
P100, NVIDIA V100 and NVIDIA A100) using several popular CNN models for visual
classification implemented in PyTorch.
The first test uses a randomly generated dataset and is run on a single GPU.
The source code for the test can be found at https://github.com/ARCCA/pytorch-gpu-benchmark-synthetic.
The reported time is the average time spent in 50 batches (12 images per
batch).
.. tabs::
.. tab:: Inference
.. figure:: ./files-gpudocs/synthetic_inference.png
:alt: Inference Results
.. tab:: Training
.. figure:: ./files-gpudocs/synthetic_training.png
:alt: Training Results
.. _user manual for Compute Sanitizer: https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html
.. _CUDA-MEMCHECK user manual: https://docs.nvidia.com/cuda/cuda-memcheck/index.html#cuda-memcheck-tool-examples