# Using FABRIC GPUs

Your compute nodes can include GPUs. These devices are made available as FABRIC components and can be added to your nodes like any other component.

This example notebook will demonstrate how to reserve and use Nvidia GPU devices on FABRIC.


## Set Up the Experiment

#### Import FABRIC API

In [1]:
from ipaddress import ip_address, IPv4Address, IPv6Address, IPv4Network, IPv6Network
import ipaddress

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
                     
fablib.show_config();

0,1
Orchestrator,orchestrator.fabric-testbed.net
Credential Manager,cm.fabric-testbed.net
Core API,uis.fabric-testbed.net
Token File,/home/fabric/.tokens.json
Project ID,17f7e488-e1b7-4ea9-b657-e69cdbb27a38
Bastion Host,bastion.fabric-testbed.net
Bastion Username,jbrassil_0034513446
Bastion Private Key File,/home/fabric/work/fabric_config/id_fabric_bastion_traffic
Slice Public Key File,/home/fabric/work/fabric_config/slice_key.pub
Slice Private Key File,/home/fabric/work/fabric_config/slice_key


## Create a Node

The cells below help you create a slice that contains two nodes, each with an attached GPU, connected by a shared L2 network.

### Select GPU Type and select the FABRIC Site

First decide on which GPU type you want to try - this will determine the subset of sites where your VM can be placed.

In [2]:
# pick which GPU type we will use (execute this cell). 

# choices include
# GPU_RTX6000
# GPU_TeslaT4
# GPU_A30
# GPU_A40
GPU_CHOICE = 'GPU_A40' 

# don't edit - convert from GPU type to a resource column name
# to use in filter lambda function below
choice_to_column = {
    "GPU_RTX6000": "rtx6000_available",
    "GPU_TeslaT4": "tesla_t4_available",
    "GPU_A30": "a30_available",
    "GPU_A40": "a40_available"
}

column_name = choice_to_column.get(GPU_CHOICE, "Unknown")
print(f'{column_name=}')

column_name='a40_available'
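The lookup above falls back to `"Unknown"` when `GPU_CHOICE` is misspelled, which would only surface later as a failure inside the site filter. A minimal sketch of a stricter lookup follows; the helper name `column_for` is hypothetical and not part of fablib.

```python
# Map each GPU component model to the availability column used for site filtering.
CHOICE_TO_COLUMN = {
    "GPU_RTX6000": "rtx6000_available",
    "GPU_TeslaT4": "tesla_t4_available",
    "GPU_A30": "a30_available",
    "GPU_A40": "a40_available",
}


def column_for(gpu_choice):
    """Return the availability column for a GPU model, or fail early on a typo."""
    try:
        return CHOICE_TO_COLUMN[gpu_choice]
    except KeyError:
        valid = ", ".join(sorted(CHOICE_TO_COLUMN))
        raise ValueError(
            f"Unknown GPU type {gpu_choice!r}; choose one of: {valid}"
        ) from None


print(column_for("GPU_A40"))
```

Failing at selection time keeps the error next to the line that caused it.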


Give the slice and the nodes in it meaningful names.

In [3]:
# name the slice and the node 
slice_name=f'A40-CERN-perf1'
node1_name='gpu-node1'
node2_name='gpu-node2'

network_name='net1'

print(f'Will create slice "{slice_name}" with node "{node1_name}"')

Will create slice "A40-CERN-perf1" with node "gpu-node1"


Use a lambda filter to select a site that has the chosen GPU type available. Both nodes will be placed at that site.

In [4]:
# find a site with at least one available GPU of the selected type
site_override = None

if site_override:
    site = site_override
else:
    site = fablib.get_random_site(filter_function=lambda x: x[column_name] > 0)
print(f'Preparing to create slice "{slice_name}" with node {node1_name} and node {node2_name}  in site {site}')

Preparing to create slice "A40-CERN-perf1" with node gpu-node1 and node gpu-node2  in site CERN
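To illustrate what the filter does, here is a self-contained sketch that applies the same idea to plain dictionaries. The site records and the `pick_site` helper are hypothetical; real records come from FABRIC's resource listing. Because this slice requests two GPUs at one site, the sketch also lets you require a minimum count rather than just "more than zero".

```python
import random

# Hypothetical site records for illustration only.
SITES = [
    {"name": "SITE_A", "a40_available": 0},
    {"name": "SITE_B", "a40_available": 2},
    {"name": "SITE_C", "a40_available": 1},
]


def pick_site(sites, column_name, minimum=1, rng=random):
    """Pick a random site with at least `minimum` free GPUs in `column_name`."""
    candidates = [s["name"] for s in sites if s.get(column_name, 0) >= minimum]
    if not candidates:
        raise RuntimeError(f"No site has {minimum} or more of {column_name}")
    return rng.choice(candidates)


# Two GPU nodes at the same site need two free GPUs there.
print(pick_site(SITES, "a40_available", minimum=2))
```

With `minimum=2` only `SITE_B` qualifies in this sample data.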


Create the slice: two nodes, each with a GPU component and a NIC attached to a shared L2 network.

In [5]:
# Create Slice. Note that by default submit() call will poll for 360 seconds every 10-20 seconds
# waiting for slice to come up. Normal expected time is around 2 minutes. 
slice = fablib.new_slice(name=slice_name)

# Network
net1 = slice.add_l2network(name=network_name, subnet=IPv4Network("192.168.100.0/24"))

# Add node with a 100G drive and a couple of CPU cores (default)
node1 = slice.add_node(name=node1_name, site=site, disk=100, image='default_ubuntu_24')
node1.add_component(model=GPU_CHOICE, name='gpu1')
iface1 = node1.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
iface1.set_mode('auto')
net1.add_interface(iface1)

node2 = slice.add_node(name=node2_name, site=site, disk=100, image='default_ubuntu_24')
node2.add_component(model=GPU_CHOICE, name='gpu2')

iface2 = node2.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
iface2.set_mode('auto')
net1.add_interface(iface2)

#Submit Slice Request
slice.submit();


Retry: 11, Time: 270 sec


0,1
ID,9a936767-8b71-4b00-9290-f6df5a352953
Name,A40-CERN-perf1
Lease Expiration (UTC),2024-12-25 22:46:48 +0000
Lease Start (UTC),2024-12-24 22:46:48 +0000
Project ID,17f7e488-e1b7-4ea9-b657-e69cdbb27a38
State,StableOK


ID,Name,Cores,RAM,Disk,Image,Image Type,Host,Site,Username,Management IP,State,Error,SSH Command,Public SSH Key File,Private SSH Key File
355c8272-6094-4007-862d-886db051dc14,gpu-node1,2,8,100,default_ubuntu_24,qcow2,cern-w1.fabric-testbed.net,CERN,ubuntu,2001:400:a100:3090:f816:3eff:fe98:308c,Active,,ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@2001:400:a100:3090:f816:3eff:fe98:308c,/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
d3f3bde0-8953-4137-a64e-7a2637668eb5,gpu-node2,2,8,100,default_ubuntu_24,qcow2,cern-w1.fabric-testbed.net,CERN,ubuntu,2001:400:a100:3090:f816:3eff:fe10:651d,Active,,ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@2001:400:a100:3090:f816:3eff:fe10:651d,/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key


ID,Name,Layer,Type,Site,Subnet,Gateway,State,Error
6b99c3db-f946-401a-9722-26292465437a,net1,L2,L2Bridge,CERN,192.168.100.0/24,,Active,


Name,Short Name,Node,Network,Bandwidth,Mode,VLAN,MAC,Physical Device,Device,IP Address,Numa Node,Switch Port
gpu-node1-nic1-p1,p1,gpu-node1,net1,100,auto,,0E:96:D5:17:8B:F8,enp7s0,enp7s0,192.168.100.1,6,HundredGigE0/0/0/5
gpu-node2-nic1-p1,p1,gpu-node2,net1,100,auto,,02:B5:53:89:2C:E6,enp8s0,enp8s0,192.168.100.2,6,HundredGigE0/0/0/5



Time to print interfaces 273 seconds
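The interface table shows `192.168.100.1` and `192.168.100.2` on the two NICs. If you ever need to reassign these addresses by hand, a sketch like the following derives them from the subnet. It assumes addresses are handed out sequentially from the first usable host, which matches the output above but is not a documented guarantee of `auto` mode.

```python
from ipaddress import IPv4Network
from itertools import islice

subnet = IPv4Network("192.168.100.0/24")
node_names = ["gpu-node1", "gpu-node2"]

# Pair each node with the next usable host address in the subnet.
assignments = dict(zip(node_names, islice(subnet.hosts(), len(node_names))))

for name, addr in assignments.items():
    print(f"{name}: {addr}/{subnet.prefixlen}")
```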


## Get the Slice

Retrieve the slice by name and display its attributes.

In [6]:
slice = fablib.get_slice(name=slice_name)
slice.show();

0,1
ID,9a936767-8b71-4b00-9290-f6df5a352953
Name,A40-CERN-perf1
Lease Expiration (UTC),2024-12-25 22:46:48 +0000
Lease Start (UTC),2024-12-24 22:46:48 +0000
Project ID,17f7e488-e1b7-4ea9-b657-e69cdbb27a38
State,StableOK


## Get the Node

Retrieve both nodes and their GPU components, and display their attributes.


In [7]:
node1 = slice.get_node(node1_name) 
node1.show()

node2 = slice.get_node(node2_name) 
node2.show()

gpu1 = node1.get_component('gpu1')
gpu1.show();

gpu2 = node2.get_component('gpu2')
gpu2.show();


0,1
ID,355c8272-6094-4007-862d-886db051dc14
Name,gpu-node1
Cores,2
RAM,8
Disk,100
Image,default_ubuntu_24
Image Type,qcow2
Host,cern-w1.fabric-testbed.net
Site,CERN
Username,ubuntu


0,1
ID,d3f3bde0-8953-4137-a64e-7a2637668eb5
Name,gpu-node2
Cores,2
RAM,8
Disk,100
Image,default_ubuntu_24
Image Type,qcow2
Host,cern-w1.fabric-testbed.net
Site,CERN
Username,ubuntu


0,1
Name,gpu-node1-gpu1
Short Name,gpu1
Details,NVIDIA Corporation GA102GL [A40] (rev a1)
Disk,0
Units,1
PCI Address,['0000:08:00.0']
Model,
Type,GPU
Device,
Node,gpu-node1


0,1
Name,gpu-node2-gpu2
Short Name,gpu2
Details,NVIDIA Corporation GA102GL [A40] (rev a1)
Disk,0
Units,1
PCI Address,['0000:07:00.0']
Model,
Type,GPU
Device,
Node,gpu-node2


## GPU PCI Device

Run `lspci` to see the GPU PCI device on each node. At this stage it is the raw PCI device, with no driver installed yet. Once the driver is installed, you can use the GPU as you would any other GPU.

View each node's GPU:

In [8]:
command = "sudo apt-get install -y pciutils && lspci | grep 'NVIDIA'"

stdout, stderr = node1.execute(command)
stdout, stderr = node2.execute(command)

Reading package lists...
Building dependency tree...
Reading state information...
pciutils is already the newest version (1:3.10.0-2build1).
pciutils set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
08:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Reading package lists...
Building dependency tree...
Reading state information...
pciutils is already the newest version (1:3.10.0-2build1).
pciutils set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
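Since the `apt-get` chatter buries the one line that matters, it can help to extract the NVIDIA devices from the captured `stdout`. The following is a minimal sketch; `find_nvidia_devices` is a hypothetical helper, and the sample text is the `lspci` line from the output above.

```python
import re

SAMPLE = "08:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)\n"

# bus:device.function, then the device class, then a description naming NVIDIA.
LSPCI_LINE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+[^:]+:\s+(?P<desc>.*NVIDIA.*)$",
    re.IGNORECASE,
)


def find_nvidia_devices(lspci_output):
    """Return (pci_address, description) pairs for NVIDIA devices."""
    devices = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match:
            devices.append((match.group("addr"), match.group("desc")))
    return devices


print(find_nvidia_devices(SAMPLE))
```

An empty list for a node means its captured output contains no NVIDIA line, which is worth checking before installing drivers.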


## Install Nvidia Drivers

Now run the following commands to install the latest Nvidia driver, the CUDA libraries, and the CUDA compiler. This step can take up to 20 minutes.

NOTE: for instructional purposes the following cell sends all command output back to the notebook. You can also send it to log files to keep the notebook output clean.
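As a sketch of the log-file approach mentioned in the note, the helper below writes each command's output to a per-node log. `FakeNode` is a hypothetical stand-in so the example runs without a slice; it only assumes that `execute()` returns a `(stdout, stderr)` pair and that the node exposes `get_name()`. Whether your fablib version can also suppress the on-screen output is something to check in its documentation.

```python
import tempfile
from pathlib import Path


class FakeNode:
    """Hypothetical stand-in for a fablib node; execute() returns (stdout, stderr)."""

    def __init__(self, name):
        self.name = name

    def get_name(self):
        return self.name

    def execute(self, command):
        return f"ran: {command}\n", ""


def run_logged(node, commands, log_dir):
    """Run each command on the node and append its output to <log_dir>/<node>.log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"{node.get_name()}.log"
    with log_path.open("a") as log:
        for command in commands:
            stdout, stderr = node.execute(command)
            log.write(f"++++ {command}\n{stdout}")
            if stderr:
                log.write(f"[stderr]\n{stderr}")
    return log_path


with tempfile.TemporaryDirectory() as tmp:
    path = run_logged(FakeNode("gpu-node1"), ["sudo apt-get -q update"], tmp)
    print(path.read_text())
```

In the notebook you would pass `node1` or `node2` instead of `FakeNode` and a persistent directory instead of a temporary one.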

In [9]:
distro='ubuntu2404'
version='12.6'
architecture='x86_64'

# install prerequisites
commands = [
    'sudo apt-get -q update',
    'sudo apt-get -q install -y linux-headers-$(uname -r) gcc',
]

print("Installing Prerequisites...")
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node1.execute(command)
    print(f"++++ {command}")
    stdout, stderr = node2.execute(command)

print(f"Installing CUDA {version}...")
commands = [
    # build the repository URL from the distro and architecture set above
    f'wget https://developer.download.nvidia.com/compute/cuda/repos/{distro}/{architecture}/cuda-keyring_1.1-1_all.deb',
    'sudo dpkg -i cuda-keyring_1.1-1_all.deb',
    'sudo apt-get -q update',
    'sudo apt-get -q install -y cuda'
]
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node1.execute(command)
    print(f"++++ {command}")
    stdout, stderr = node2.execute(command)
    
print("Done installing CUDA")

Installing Prerequisites...
++++ sudo apt-get -q update
Get:1 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Get:2 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB]
Get:3 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:4 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [572 kB]
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu noble/universe amd64 Packages [15.0 MB]
Get:7 http://security.ubuntu.com/ubuntu noble-security/main Translation-en [111 kB]
Get:8 http://security.ubuntu.com/ubuntu noble-security/main amd64 Components [7224 B]
Get:9 http://security.ubuntu.com/ubuntu noble-security/main amd64 c-n-f Metadata [5892 B]
Get:10 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [795 kB]
Get:11 http://security.ubuntu.com/ubuntu noble-security/universe Translation-en [169 kB]
Get:12 http

In [None]:
commands = [
    'sudo apt-get -q update',
    'sudo apt-get -q install -y cuda-toolkit'
]
print("Installing cuda-toolkit...")
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node1.execute(command)
    print(f"++++ {command}")
    stdout, stderr = node2.execute(command)
    
print("Done installing cuda-toolkit")

Once CUDA is installed, reboot both nodes, one at a time.

In [10]:
reboot = 'sudo reboot'

print(reboot)
node1.execute(reboot)

slice.wait_ssh(timeout=360,interval=10,progress=True)

print("Now testing SSH abilities to reconnect...",end="")
slice.update()
slice.test_ssh()
print("Reconnected!")


sudo reboot
Waiting for slice . Slice state: StableOK
Waiting for ssh in slice ...... ssh successful
Now testing SSH abilities to reconnect...Reconnected!


In [11]:
reboot = 'sudo reboot'

print(reboot)
node2.execute(reboot)

slice.wait_ssh(timeout=360,interval=10,progress=True)

print("Now testing SSH abilities to reconnect...",end="")
slice.update()
slice.test_ssh()
print("Reconnected!")


sudo reboot
Waiting for slice . Slice state: StableOK
Waiting for ssh in slice .. ssh successful
Now testing SSH abilities to reconnect...Reconnected!
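The reboot cells rely on `slice.wait_ssh` with a timeout and a polling interval. The general pattern can be sketched in plain Python as follows. `wait_until` is a hypothetical helper shown only to illustrate the timeout/interval logic; it is not how fablib implements `wait_ssh`.

```python
import time


def wait_until(check, timeout=360, interval=10, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True or `timeout` passes.

    Returns the number of attempts made; raises TimeoutError on expiry.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(f"condition not met after {attempts} attempts")
        sleep(interval)


# Demo: a fake check that succeeds on the third call; sleeping is stubbed out.
calls = {"n": 0}


def fake_ssh_ready():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(fake_ssh_ready, sleep=lambda seconds: None))
```

In a real check, `check` would attempt a cheap command on the node, for example `node1.execute("true")`, and return whether it succeeded.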


## Testing the GPU and CUDA Installation

First, verify that the Nvidia drivers recognize the GPU by running `nvidia-smi`.

In [12]:
stdout, stderr = node1.execute("nvidia-smi")

print(f"stdout: {stdout}")

bash: line 1: nvidia-smi: command not found
stdout: 


In [13]:
stdout, stderr = node2.execute("nvidia-smi")

print(f"stdout: {stdout}")

bash: line 1: nvidia-smi: command not found
stdout: 
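In the captured run, `nvidia-smi` was not found on either node, which suggests the driver installation did not complete before the reboot. A small sketch for classifying the result of the check follows; `driver_status` is a hypothetical helper that relies only on the `(stdout, stderr)` pair returned by `execute` and on the `NVIDIA-SMI` banner that the tool prints when it works.

```python
def driver_status(stdout, stderr):
    """Classify the result of running nvidia-smi on a node."""
    if "command not found" in stderr:
        return "missing"      # driver utilities are not installed or not on PATH
    if "NVIDIA-SMI" in stdout:
        return "ok"           # the banner line is present, so the driver responded
    return "unknown"          # anything else needs a manual look


# The stderr text below is taken from the output above.
print(driver_status("", "bash: line 1: nvidia-smi: command not found"))
```

If the status is `missing`, rerun the installation cell and check its full output before continuing.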


Now, let's upload the following "Hello World" CUDA program file to the node.

`hello-world.cu`

*Source: https://computer-graphics.se/multicore/pdf/hello-world.cu*

*Author: Ingemar Ragnemalm*

>This file is from *"The real "Hello World!" for CUDA, OpenCL and GLSL!"* (https://computer-graphics.se/hello-world-for-cuda.html), written by Ingemar Ragnemalm, programmer and CUDA teacher. The only change from the original file on the website is an added `#include <unistd.h>`, because `sleep()` is declared in the `unistd.h` header.

In [14]:
node1.upload_file('./hello-world.cu', 'hello-world.cu')
node2.upload_file('./hello-world.cu', 'hello-world.cu')

<SFTPAttributes: [ size=1110 uid=1000 gid=1000 mode=0o100664 atime=1735080893 mtime=1735080893 ]>

In [15]:
# Avoid uploading private keys to FABRIC nodes (see the next cell for using slice_key instead);
# it is done here only for the cluster communication test
node1.upload_file('id_rsa', '.ssh/id_rsa')
node1.upload_file('id_rsa.pub', '.ssh/id_rsa.pub')
node2.upload_file('id_rsa', '.ssh/id_rsa')
node2.upload_file('id_rsa.pub', '.ssh/id_rsa.pub')
stdout, stderr = node1.execute(f"chmod 600 ~/.ssh/id_rsa")
stdout, stderr = node2.execute(f"chmod 600 ~/.ssh/id_rsa")
stdout, stderr = node1.execute(f"chmod 644 ~/.ssh/id_rsa.pub")
stdout, stderr = node2.execute(f"chmod 644 ~/.ssh/id_rsa.pub")


In [16]:
#grab latest fabric slice_key from fabric_config directory in case it changed
node1.upload_file('/home/fabric/work/fabric_config/slice_key', '.ssh/slice_key')
node2.upload_file('/home/fabric/work/fabric_config/slice_key', '.ssh/slice_key')
stdout, stderr = node1.execute(f"chmod 600 ~/.ssh/slice_key")
stdout, stderr = node2.execute(f"chmod 600 ~/.ssh/slice_key")
print(f'Use slice key explicitly to connect to nodes:  ssh -i ~/.ssh/slice_key ubuntu@192.168.100.2')

Use slice key explicitly to connect to nodes:  ssh -i ~/.ssh/slice_key ubuntu@192.168.100.2


In [17]:
#repair network interfaces after reboot
stdout, stderr = node1.execute(f"sudo apt install net-tools -y")
stdout, stderr = node1.execute(f"sudo ifconfig enp7s0 down")
stdout, stderr = node1.execute(f"sudo ip addr add 192.168.100.1/24 dev enp7s0")
stdout, stderr = node1.execute(f"sudo ifconfig enp7s0 up")

stdout, stderr = node2.execute(f"sudo apt install net-tools -y")
stdout, stderr = node2.execute(f"sudo ifconfig enp8s0 down")
stdout, stderr = node2.execute(f"sudo ip addr add 192.168.100.2/24 dev enp8s0")
stdout, stderr = node2.execute(f"sudo ifconfig enp8s0 up")

Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  net-tools
0 upgraded, 1 newly installed, 0 to remove and 198 not upgraded.
Need to get 204 kB of archives.
After this operation, 811 kB of additional disk space will be used.
Get:1 http://nova.clouds.archive.ubuntu.com/ubuntu noble/main amd64 net-tools amd64 2.10-0.1ubuntu4 [204 kB]
debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Fetched 204 kB in 1s (229 kB/s)
Selecting previously unselected package net-tools.
(Reading database ... 72490 files and directories curren
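The repair cell mixes `ifconfig` (from `net-tools`) with `ip`. As an alternative sketch, the commands can be generated with `ip` alone, which avoids installing `net-tools`. The device names and addresses are the ones from the interface table earlier in this notebook; `repair_commands` is a hypothetical helper. `ip addr replace` is used instead of `ip addr add` so that rerunning the commands does not fail when the address is already present.

```python
# Device and address per node, as listed in the interface table.
INTERFACES = {
    "gpu-node1": ("enp7s0", "192.168.100.1/24"),
    "gpu-node2": ("enp8s0", "192.168.100.2/24"),
}


def repair_commands(device, cidr):
    """Return shell commands that re-apply an address to a device using iproute2."""
    return [
        f"sudo ip link set dev {device} down",
        f"sudo ip addr replace {cidr} dev {device}",
        f"sudo ip link set dev {device} up",
    ]


for node_name, (device, cidr) in INTERFACES.items():
    for command in repair_commands(device, cidr):
        print(f"{node_name}: {command}")
```

Each printed command could then be passed to the matching node's `execute` method.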

In [18]:
#node1.upload_file(' ./lightning_mnist_example.ipynb', 'lightning_mnist_example.ipynb')
#node2.upload_file('./lightning_mnist_example.ipynb', 'lightning_mnist_example.ipynb')

In [19]:
#get ssh commands
print(f"SSH Command: {node1.get_ssh_command()}")
print(f"SSH Command: {node2.get_ssh_command()}")

SSH Command: ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@2001:400:a100:3090:f816:3eff:fe98:308c
SSH Command: ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@2001:400:a100:3090:f816:3eff:fe10:651d


In [20]:
node1.upload_file('./lightning_mnist_example-2gpu.py', 'lightning_mnist_example.py')
node1.upload_file('./installer.sh', './installer.sh')
node1.upload_file('./invertArray.py', './invertArray.py')

node2.upload_file('./lightning_mnist_example-2gpu.py', 'lightning_mnist_example.py')
node2.upload_file('./installer.sh', './installer.sh')
node2.upload_file('./invertArray.py', './invertArray.py')


#node.upload_file('./lightning_mnist_example-install1.py', 'lightning_mnist_example-install1.py')
#node.upload_file('./lightning_mnist_example-install2.py', 'lightning_mnist_example-install2.py')

<SFTPAttributes: [ size=242 uid=1000 gid=1000 mode=0o100664 atime=1735080952 mtime=1735080953 ]>

In [21]:
# Let's run a broken command to stop execution here
duhprint(f'Broken print command to stop execution here')
#input("Execution paused. Hit Enter after running ray add and python ml example manually")

NameError: name 'duhprint' is not defined

In [None]:
node1.download_file('./cern-ray-microbm-node1.txt', 'cern-ray-microbm-node1.txt')
node2.download_file('./cern-ray-microbm-node2.txt', 'cern-ray-microbm-node2.txt')

In [None]:
#scp -F ~/.ssh/fabric_ssh_config -i <private *sliver* key file>  ubuntu@11.22.33.44:~/<remote file name> <local file name>
# download_directory takes the local destination first, then the remote source directory
node1.download_directory('.', '/tmp/ray/session_latest')

We now compile the `.cu` file using `nvcc`, the CUDA compiler tool installed with CUDA. In this example, we create an executable called `hello_world`.

In [None]:
# `version` is the CUDA version string set in the installation cell
stdout, stderr = node1.execute(f"/usr/local/cuda-{version}/bin/nvcc -o hello_world hello-world.cu")
stdout, stderr = node2.execute(f"/usr/local/cuda-{version}/bin/nvcc -o hello_world hello-world.cu")

Finally, run the executable:

In [None]:
stdout, stderr = node1.execute("./hello_world")
print(f"stdout: {stdout}")

stdout, stderr = node2.execute("./hello_world")
print(f"stdout: {stdout}")

In [None]:
stdout, stderr = node1.execute("bash ./installer.sh")
stdout, stderr = node1.execute("python3 ./lightning_mnist_example.py")

In [None]:
# now add a 2nd ray node to the cluster and run again on head node
# ray add???
stdout, stderr = node1.execute("bash ./installer.sh")
stdout, stderr = node1.execute("python3 ./lightning_mnist_example.py")

If you see `Hello World!`, the CUDA program ran successfully. `World!` was computed on the GPU by adding an array of offsets to the characters of the string `Hello `, and the result was printed to stdout.
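To make the offset arithmetic concrete, here is a CPU-only Python sketch of the same computation. On the GPU each character is updated by its own thread; here a plain loop does the work. The offsets are chosen so that `Hello ` maps to `World!`, and `add_offsets` is a hypothetical helper for illustration.

```python
def add_offsets(base, offsets):
    """Add one integer offset to each character code of `base`."""
    data = bytearray(base.encode("ascii"))
    for i, offset in enumerate(offsets):
        data[i] += offset  # on the GPU, thread i would perform this single addition
    return data.decode("ascii")


base = "Hello "
offsets = [15, 10, 6, 0, -11, 1]  # 'H'+15='W', 'e'+10='o', 'l'+6='r', ...
print(base + add_offsets(base, offsets))
```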

### Congratulations! You have now successfully run a program on a FABRIC GPU!

## Cleanup Your Experiment

In [None]:
fablib.delete_slice(slice_name)

In [None]:
#extend slice
from datetime import datetime
from datetime import timezone
from datetime import timedelta

#Set the new lease end to now plus 2 days
end_date = (datetime.now(timezone.utc) + timedelta(days=2)).strftime("%Y-%m-%d %H:%M:%S %z")

try:
    slice = fablib.get_slice(name=slice_name)

    slice.renew(end_date)
except Exception as e:
    print(f"Exception: {e}")
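The date string passed to `slice.renew` uses the format `%Y-%m-%d %H:%M:%S %z`. A small sketch that isolates the date arithmetic makes it easy to check the result before renewing; `lease_end` is a hypothetical helper, and the fixed start time is the lease start shown earlier in this notebook.

```python
from datetime import datetime, timedelta, timezone


def lease_end(days, now=None):
    """Return a lease end date `days` from `now` (default: current UTC time)."""
    now = now or datetime.now(timezone.utc)
    return (now + timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S %z")


start = datetime(2024, 12, 24, 22, 46, 48, tzinfo=timezone.utc)
print(lease_end(2, now=start))  # 2024-12-26 22:46:48 +0000
```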