# Using FABRIC GPUs

Your compute nodes can include GPUs. These devices are made available as FABRIC components and can be added to your nodes like any other component.

This example notebook will demonstrate how to reserve and use Nvidia GPU devices on FABRIC.


## Set Up the Experiment

#### Import FABRIC API

In [1]:
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
                     
fablib.show_config();

0,1
Credential Manager,cm.fabric-testbed.net
Orchestrator,orchestrator.fabric-testbed.net
Token File,/home/fabric/.tokens.json
Project ID,17f7e488-e1b7-4ea9-b657-e69cdbb27a38
Bastion Username,jbrassil_0034513446
Bastion Private Key File,/home/fabric/work/fabric_config/id_fabric_bastion_traffic
Bastion Host,bastion.fabric-testbed.net
Bastion Private Key Passphrase,
Slice Public Key File,/home/fabric/work/fabric_config/slice_key.pub
Slice Private Key File,/home/fabric/work/fabric_config/slice_key


## Create a Node

The cells below help you create a slice that contains a single node with an attached GPU. 

### Select the GPU Type and the FABRIC Site

First, decide which GPU type you want to use. This choice determines the subset of sites where your VM can be placed.

In [2]:
# pick which GPU type we will use (execute this cell). 

# choices include
# GPU_RTX6000
# GPU_TeslaT4
# GPU_A30
# GPU_A40
GPU_CHOICE = 'GPU_A40' 

# don't edit - convert from GPU type to a resource column name
# to use in filter lambda function below
choice_to_column = {
    "GPU_RTX6000": "rtx6000_available",
    "GPU_TeslaT4": "tesla_t4_available",
    "GPU_A30": "a30_available",
    "GPU_A40": "a40_available"
}

column_name = choice_to_column.get(GPU_CHOICE, "Unknown")
print(f'{column_name=}')

column_name='a40_available'
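The lookup above falls back to the string `"Unknown"` when `GPU_CHOICE` is misspelled, which would only surface later as a confusing error in the site filter. A minimal sketch of a stricter lookup follows; the helper name `column_for_gpu` is hypothetical and not part of fablib.

```python
# Mapping copied from the cell above.
CHOICE_TO_COLUMN = {
    "GPU_RTX6000": "rtx6000_available",
    "GPU_TeslaT4": "tesla_t4_available",
    "GPU_A30": "a30_available",
    "GPU_A40": "a40_available",
}


def column_for_gpu(gpu_choice):
    """Return the availability column for a GPU model, or raise on an unknown name."""
    try:
        return CHOICE_TO_COLUMN[gpu_choice]
    except KeyError:
        valid = ", ".join(sorted(CHOICE_TO_COLUMN))
        raise ValueError(
            f"Unknown GPU type {gpu_choice!r}; expected one of: {valid}"
        ) from None


print(column_for_gpu("GPU_A40"))
```

Failing early here keeps a typo in the GPU name from being mistaken for a site-availability problem.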


Give the slice and the node in it meaningful names.

In [3]:
# name the slice and the node 
slice_name='A40-perf'
node_name='gpu-node'

print(f'Will create slice "{slice_name}" with node "{node_name}"')

Will create slice "A40-perf" with node "gpu-node"


Use a lambda filter to figure out which site the node will go to.

In [4]:
# find a site with at least one available GPU of the selected type
site_override = None

if site_override:
    site = site_override
else:
    site = fablib.get_random_site(filter_function=lambda x: x[column_name] > 0)
print(f'Preparing to create slice "{slice_name}" with node {node_name} in site {site}')

Preparing to create slice "A40-perf" with node gpu-node in site CERN
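To make the filter idea concrete, here is a small stand-alone sketch of how a lambda can select candidate sites from a table of records. The site names, the `name` key, and the helper `pick_random_site` are hypothetical stand-ins; this is not how fablib implements `get_random_site`, only an illustration of the filtering step.

```python
import random

# Hypothetical site records; the real resource table has many more columns.
sites = [
    {"name": "SITE_A", "a40_available": 0},
    {"name": "SITE_B", "a40_available": 2},
    {"name": "SITE_C", "a40_available": 1},
]


def pick_random_site(records, filter_function):
    """Return the name of a random record accepted by filter_function, or None."""
    candidates = [r["name"] for r in records if filter_function(r)]
    return random.choice(candidates) if candidates else None


column_name = "a40_available"
site = pick_random_site(sites, lambda x: x[column_name] > 0)
print(site)  # one of the sites with at least one free A40
```

If no site passes the filter, the sketch returns `None`; in a real notebook it is worth checking for that case before building the slice.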


Create the desired slice with a GPU component. 

In [5]:
# Create Slice. Note that by default submit() call will poll for 360 seconds every 10-20 seconds
# waiting for slice to come up. Normal expected time is around 2 minutes. 
slice = fablib.new_slice(name=slice_name)

# Add node with a 100G drive and a couple of CPU cores (default)
node = slice.add_node(name=node_name, site=site, disk=100, image='default_ubuntu_22')
node.add_component(model=GPU_CHOICE, name='gpu1')

#Submit Slice Request
slice.submit();


Retry: 10, Time: 247 sec


0,1
ID,b24f2d18-9695-4762-a94a-0bbb31b2622d
Name,A40-perf
Lease Expiration (UTC),2024-04-11 09:47:46 +0000
Lease Start (UTC),2024-04-10 09:47:47 +0000
Project ID,17f7e488-e1b7-4ea9-b657-e69cdbb27a38
State,StableOK


ID,Name,Cores,RAM,Disk,Image,Image Type,Host,Site,Username,Management IP,State,Error,SSH Command,Public SSH Key File,Private SSH Key File
88446bbb-d881-4661-8d4c-ffc72b37bd4c,gpu-node,2,8,100,default_ubuntu_22,qcow2,cern-w1.fabric-testbed.net,CERN,ubuntu,2001:400:a100:3090:f816:3eff:fee7:ca93,Active,,ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@2001:400:a100:3090:f816:3eff:fee7:ca93,/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key


## Get the Slice

Retrieve the slice by name and display its details.

In [6]:
slice = fablib.get_slice(name=slice_name)
slice.show();

0,1
ID,b24f2d18-9695-4762-a94a-0bbb31b2622d
Name,A40-perf
Lease Expiration (UTC),2024-04-11 09:47:46 +0000
Lease Start (UTC),2024-04-10 09:47:47 +0000
Project ID,17f7e488-e1b7-4ea9-b657-e69cdbb27a38
State,StableOK


## Get the Node

Retrieve the node and its GPU component, then display the details of each.


In [7]:
node = slice.get_node(node_name) 
node.show()

gpu = node.get_component('gpu1')
gpu.show();


0,1
ID,88446bbb-d881-4661-8d4c-ffc72b37bd4c
Name,gpu-node
Cores,2
RAM,8
Disk,100
Image,default_ubuntu_22
Image Type,qcow2
Host,cern-w1.fabric-testbed.net
Site,CERN
Username,ubuntu


0,1
Name,gpu-node-gpu1
Short Name,gpu1
Details,NVIDIA Corporation GA102GL [A40] (rev a1)
Disk,0
Units,1
PCI Address,['0000:07:00.0']
Model,
Type,GPU
Device,
Node,gpu-node


## GPU PCI Device

Run the command `lspci` to see your GPU PCI device(s). At this point the GPU is a raw PCI device that is not yet configured for use. Once the drivers are installed, you can use it as you would any other GPU.

View the node's GPU:

In [8]:
# raw string so that the backslash in grep's \| alternation reaches the shell unchanged
command = r"sudo apt-get install -y pciutils && lspci | grep 'NVIDIA\|3D controller'"

stdout, stderr = node.execute(command)

Reading package lists...
Building dependency tree...
Reading state information...
pciutils is already the newest version (1:3.7.0-6).
pciutils set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
07:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
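If you want to work with the result programmatically rather than read it by eye, the `stdout` string can be parsed. The following is a minimal sketch, assuming the default `lspci` line layout shown above (no PCI domain prefix); the helper name `find_nvidia_devices` is hypothetical.

```python
import re

# One line of default `lspci` output: "<bus>:<device>.<function> <class>: <description>"
LSPCI_LINE = re.compile(
    r"^(?P<address>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(?P<cls>[^:]+):\s+(?P<desc>.+)$"
)


def find_nvidia_devices(lspci_output):
    """Return (address, description) pairs for lines that describe NVIDIA devices."""
    devices = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and "NVIDIA" in match.group("desc"):
            devices.append((match.group("address"), match.group("desc")))
    return devices


# Sample text taken from the cell output above.
sample = """pciutils is already the newest version (1:3.7.0-6).
07:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
"""
print(find_nvidia_devices(sample))
```

An empty list at this step would suggest the GPU component was not attached to the VM.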


## Install Nvidia Drivers

Now, run the following commands to install the latest NVIDIA driver along with the CUDA libraries and compiler. This step can take up to 20 minutes.

NOTE: for instructional purposes the following cell sends all command output back to the notebook. You can also send it to log files to keep the notebook output clean.

In [9]:
distro = 'ubuntu2204'
architecture = 'x86_64'

# install prerequisites: kernel headers for the running kernel, plus gcc
commands = [
    'sudo apt-get -q update',
    'sudo apt-get -q install -y linux-headers-$(uname -r) gcc',
]

print("Installing Prerequisites...")
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node.execute(command)

# Register NVIDIA's apt repository, then install the 'cuda' metapackage.
# The metapackage pulls in the newest CUDA release in the repository
# (12.4 when this notebook was run), not a pinned version.
commands = [
    f'wget https://developer.download.nvidia.com/compute/cuda/repos/{distro}/{architecture}/cuda-keyring_1.1-1_all.deb',
    'sudo dpkg -i cuda-keyring_1.1-1_all.deb',
    'sudo apt-get -q update',
    'sudo apt-get -q install -y cuda'
]
print("Installing CUDA...")
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node.execute(command)

print("Done installing CUDA")

Installing Prerequisites...
++++ sudo apt-get -q update
Hit:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:5 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1342 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [14.1 MB]
Get:7 http://security.ubuntu.com/ubuntu jammy-security/main Translation-en [237 kB]
Get:8 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [1662 kB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security/restricted Translation-en [280 kB]
Get:10 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [852 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/universe Translation-en [163 kB]
Get:12 http://
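The install output is long, so sending it to a log file, as the note above suggests, keeps the notebook readable. Below is a minimal sketch of one way to do that. The helper `run_logged` and the stand-in `fake_execute` are hypothetical; the only assumption about the real API is the one visible in this notebook, namely that `node.execute(command)` returns a `(stdout, stderr)` pair.

```python
import os
import tempfile


def run_logged(execute, command, log_path):
    """Run command via execute() and append its stdout and stderr to log_path."""
    stdout, stderr = execute(command)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"++++ {command}\n")
        log.write(stdout or "")
        if stderr:
            log.write("---- stderr ----\n")
            log.write(stderr)
    return stdout, stderr


# Stand-in for node.execute so the sketch runs without a slice.
def fake_execute(command):
    return f"ran: {command}\n", ""


log_path = os.path.join(tempfile.mkdtemp(), "cuda_install.log")
run_logged(fake_execute, "sudo apt-get -q update", log_path)
with open(log_path, encoding="utf-8") as log:
    print(log.read())
```

With a real node you would pass `node.execute` in place of `fake_execute`. Whether the command output is also echoed to the notebook depends on your fablib version and its options, so check its documentation.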

Once CUDA is installed, reboot the machine so that the new driver is loaded.

In [10]:
reboot = 'sudo reboot'

print(reboot)
node.execute(reboot)

slice.wait_ssh(timeout=360,interval=10,progress=True)

print("Now testing SSH abilities to reconnect...",end="")
slice.update()
slice.test_ssh()
print("Reconnected!")


sudo reboot
Waiting for slice . Slice state: StableOK
Waiting for ssh in slice .. ssh successful
Now testing SSH abilities to reconnect...Reconnected!
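The `timeout` and `interval` arguments used above describe a polling loop: check, sleep, and check again until the deadline passes. The sketch below illustrates that pattern in plain Python. It is not fablib's implementation of `wait_ssh`, and the names `wait_until` and `fake_ssh_check` are hypothetical.

```python
import time


def wait_until(check, timeout=360, interval=10):
    """Call check() every `interval` seconds until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a fake check that succeeds on the third attempt.
attempts = []


def fake_ssh_check():
    attempts.append(1)
    return len(attempts) >= 3


ok = wait_until(fake_ssh_check, timeout=5, interval=0.01)
print(ok, len(attempts))
```

A short interval makes reconnection quicker to detect, while a longer one puts less load on the node while it boots.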


## Testing the GPU and CUDA Installation

First, verify that the Nvidia drivers recognize the GPU by running `nvidia-smi`.

In [11]:
stdout, stderr = node.execute("nvidia-smi")

print(f"stdout: {stdout}")

Wed Apr 10 10:06:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A40                     Off |   00000000:07:00.0 Off |                    0 |
|  0%   32C    P8             12W /  300W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
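The table above is convenient to read but awkward to parse. `nvidia-smi` also offers a CSV query mode, and the sketch below shows how its output could be turned into dictionaries. The sample line is hypothetical, assembled from the values in the table above, and the helper name `parse_gpu_query` is an assumption. On a real node you would pass the `stdout` returned by `node.execute(QUERY_COMMAND)`.

```python
import csv
import io

# Query mode prints one CSV row per GPU instead of the formatted table.
QUERY_COMMAND = (
    "nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader"
)


def parse_gpu_query(output):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    rows = csv.reader(io.StringIO(output), skipinitialspace=True)
    return [
        {"name": name, "driver_version": driver, "memory_total": memory}
        for name, driver, memory in (row for row in rows if row)
    ]


# Hypothetical sample built from the nvidia-smi table shown above.
sample = "NVIDIA A40, 550.54.15, 46068 MiB\n"
print(parse_gpu_query(sample))
```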
                                                

Now, let's upload the following "Hello World" CUDA program file to the node.

`hello-world.cu`

*Source: https://computer-graphics.se/multicore/pdf/hello-world.cu*

*Author: Ingemar Ragnemalm*

>This file is from *"The real "Hello World!" for CUDA, OpenCL and GLSL!"* (https://computer-graphics.se/hello-world-for-cuda.html), written by Ingemar Ragnemalm, programmer and CUDA teacher. The only change from the original file on the website is an added `#include <unistd.h>`, because `sleep()` is declared in the `unistd.h` header.

In [12]:
node.upload_file('./hello-world.cu', 'hello-world.cu')

<SFTPAttributes: [ size=1110 uid=1000 gid=1000 mode=0o100664 atime=1712743585 mtime=1712743585 ]>

In [13]:
node.upload_file('./lightning_mnist_example.ipynb', 'lightning_mnist_example.ipynb')

<SFTPAttributes: [ size=53786 uid=1000 gid=1000 mode=0o100664 atime=1712743587 mtime=1712743588 ]>

In [20]:
node.upload_file('./lightning_mnist_example.py', 'lightning_mnist_example.py')
node.upload_file('./installer.sh', './installer.sh')
#node.upload_file('./lightning_mnist_example-install1.py', 'lightning_mnist_example-install1.py')
#node.upload_file('./lightning_mnist_example-install2.py', 'lightning_mnist_example-install2.py')

<SFTPAttributes: [ size=222 uid=1000 gid=1000 mode=0o100664 atime=1712743603 mtime=1712747215 ]>

In [23]:
#node.download_file('./lightning_mnist_example.py', 'lightning_mnist_example.py')
node.download_file('./cern-ray-microbm.txt', 'cern-ray-microbm.txt')

We now compile the `.cu` file using `nvcc`, the CUDA compiler tool installed with CUDA. In this example, we create an executable called `hello_world`.

In [16]:
# the path matches the CUDA toolkit version installed above (12.4 in this run)
stdout, stderr = node.execute("/usr/local/cuda-12.4/bin/nvcc -o hello_world hello-world.cu")

Finally, run the executable:

In [17]:
stdout, stderr = node.execute("./hello_world")

print(f"stdout: {stdout}")

Hello World!
stdout: Hello World!



In [22]:
stdout, stderr = node.execute("bash ./installer.sh")
stdout, stderr = node.execute("python3 ./lightning_mnist_example.py")

Reading package lists...
Building dependency tree...
Reading state information...
python3-pip is already the newest version (22.0.2+dfsg-1ubuntu0.4).
0 upgraded, 0 newly installed, 0 to remove and 91 not upgraded.
Defaulting to user installation because normal site-packages is not writeable

View detailed results here: /tmp/ray_results/ptl-mnist-example
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-04-10_11-18-24_901390_3430/artifacts/2024-04-10_11-18-28/ptl-mnist-example/driver_artifacts`

Training started without custom configuration.
(RayTrainWorker pid=3687) Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
(RayTrainWorker pid=3687) Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /tmp/ray/session_2024-04-10_11-18-24_901390_3430/artifacts/2024-04-10_11-18-28/ptl-mnist-example/working_dirs/TorchTrainer_0e526_00000_0_2024-04-10_11-18-28/MNIST/raw/train-images-

If you see `Hello World!`, the CUDA program ran successfully. `World!` was computed on the GPU by adding an array of offsets to the characters of the string `Hello `, and the result was printed to stdout.
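The offset trick can be checked on the CPU with a few lines of Python. This sketch only reproduces the arithmetic: it derives the per-character offsets that turn `Hello ` into `World!` and applies them, whereas the CUDA program performs the additions in parallel on the GPU.

```python
greeting = "Hello "
target = "World!"

# Offset for each position: how far to shift each character code of "Hello ".
offsets = [ord(t) - ord(g) for g, t in zip(greeting, target)]
print(offsets)

# Adding the offsets back, one character at a time, yields "World!".
result = "".join(chr(ord(g) + o) for g, o in zip(greeting, offsets))
print(greeting + result)
```

Each addition is independent of the others, which is why the work maps naturally onto one GPU thread per character.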

### Congratulations! You have now successfully run a program on a FABRIC GPU!

## Clean Up Your Experiment

In [24]:
#fablib.delete_slice(slice_name)