# Using FABRIC GPUs

Your compute nodes can include GPUs. These devices are made available as FABRIC components and can be added to your nodes like any other component.

This example notebook demonstrates how to reserve and use NVIDIA GPUs on FABRIC.


## Set Up the Experiment

#### Import FABRIC API

In [1]:
from ipaddress import ip_address, IPv4Address, IPv6Address, IPv4Network, IPv6Network
import ipaddress

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
                     
fablib.show_config();

0,1
Orchestrator,orchestrator.fabric-testbed.net
Credential Manager,cm.fabric-testbed.net
Core API,uis.fabric-testbed.net
Artifact Manager,artifacts.fabric-testbed.net
Token File,/home/fabric/.tokens.json
Project ID,1eb0c915-27b6-4421-aab1-27ae42ded922
Bastion Host,bastion.fabric-testbed.net
Bastion Username,jbrassil_0034513446
Bastion Private Key File,/home/fabric/work/fabric_config/id_fabric_bastion_traffic
Slice Public Key File,/home/fabric/work/fabric_config/slice_key.pub


## Create a Node

The cells below help you create a slice that contains two nodes, each with an attached GPU, connected by an L2 network.

### Select GPU Type and select the FABRIC Site

First, decide which GPU type you want to use. This choice determines the subset of sites where your VMs can be placed.

In [2]:
# pick which GPU type we will use (execute this cell). 

# choices include
# GPU_RTX6000
# GPU_TeslaT4
# GPU_A30
# GPU_A40
GPU_CHOICE = 'GPU_A30' 

# don't edit - convert from GPU type to a resource column name
# to use in filter lambda function below
choice_to_column = {
    "GPU_RTX6000": "rtx6000_available",
    "GPU_TeslaT4": "tesla_t4_available",
    "GPU_A30": "a30_available",
    "GPU_A40": "a40_available"
}

column_name = choice_to_column.get(GPU_CHOICE, "Unknown")
print(f'{column_name=}')

column_name='a30_available'
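Note that with the `"Unknown"` fallback, a mistyped GPU name would likely only surface later, as a `KeyError` inside the site filter. A minimal sketch of a stricter lookup, assuming you prefer to fail early; the helper name `gpu_column` is hypothetical and not part of fablib:

```python
# Hypothetical helper (not part of fablib): map a GPU model to its
# availability column and fail early on a typo.
CHOICE_TO_COLUMN = {
    "GPU_RTX6000": "rtx6000_available",
    "GPU_TeslaT4": "tesla_t4_available",
    "GPU_A30": "a30_available",
    "GPU_A40": "a40_available",
}

def gpu_column(choice: str) -> str:
    """Return the resource column name for a GPU model, or raise ValueError."""
    try:
        return CHOICE_TO_COLUMN[choice]
    except KeyError:
        valid = ", ".join(sorted(CHOICE_TO_COLUMN))
        raise ValueError(f"Unknown GPU type {choice!r}; expected one of: {valid}") from None

print(gpu_column("GPU_A30"))
```

Calling `gpu_column("GPU_A3O")` (with a letter O) raises immediately and lists the valid names, instead of failing during site selection.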


Give the slice, its two nodes, and the network meaningful names.

In [2]:
# name the slice and the node 
slice_name=f'ray_A30_2nodes'
node1_name='gpu-node1'
node2_name='gpu-node2'

network_name='net1'

print(f'Will create slice "{slice_name}" with node "{node1_name}"')

Will create slice "ray_A30_2nodes" with node "gpu-node1"


Use a lambda filter to pick a site that has the selected GPU type available. Both nodes are placed at that site.

In [4]:
# find a site with enough available GPUs of the selected type for both nodes
site_override = None

if site_override:
    site = site_override
else:
    # both nodes are placed at the same site, so it needs at least two free GPUs
    site = fablib.get_random_site(filter_function=lambda x: x[column_name] >= 2)
print(f'Preparing to create slice "{slice_name}" with node {node1_name} and node {node2_name}  in site {site}')

Preparing to create slice "ray_A30_2nodes" with node gpu-node1 and node gpu-node2  in site TOKY
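The filter function receives one record per site and returns `True` for sites that qualify. A minimal offline sketch of that selection, using made-up site records (the site names and counts below are hypothetical; real values come from fablib). Because this slice puts two GPU nodes at the same site, the sketch asks for at least two free GPUs:

```python
import random

# Hypothetical site records; real values come from fablib's resource listing.
sites = [
    {"name": "SITE_A", "a30_available": 0},
    {"name": "SITE_B", "a30_available": 1},
    {"name": "SITE_C", "a30_available": 3},
]

column_name = "a30_available"
gpus_needed = 2  # one GPU per node, two nodes at the same site

# Same shape as the filter_function passed to get_random_site()
has_enough_gpus = lambda x: x[column_name] >= gpus_needed

candidates = [s["name"] for s in sites if has_enough_gpus(s)]
site = random.choice(candidates) if candidates else None
print(candidates, site)
```

Only `SITE_C` passes the filter here, so it is the only possible pick.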


Create the slice: two nodes, each with a GPU and a basic NIC, joined by an L2 network.

In [5]:
# Create Slice. Note that by default submit() call will poll for 360 seconds every 10-20 seconds
# waiting for slice to come up. Normal expected time is around 2 minutes. 
slice = fablib.new_slice(name=slice_name)

# Network
net1 = slice.add_l2network(name=network_name, subnet=IPv4Network("192.168.100.0/24"))

# Add node with a 100G drive and a couple of CPU cores (default)
node1 = slice.add_node(name=node1_name, site=site, disk=100, image='default_ubuntu_24')
node1.add_component(model=GPU_CHOICE, name='gpu1')
iface1 = node1.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
iface1.set_mode('auto')
net1.add_interface(iface1)

node2 = slice.add_node(name=node2_name, site=site, disk=100, image='default_ubuntu_24')
node2.add_component(model=GPU_CHOICE, name='gpu2')

iface2 = node2.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
iface2.set_mode('auto')
net1.add_interface(iface2)

#Submit Slice Request
slice.submit();


Retry: 9, Time: 242 sec


0,1
ID,c8f5d5f8-3bd9-4786-b01d-410ca2f63717
Name,ray_A30_2nodes
Lease Expiration (UTC),2024-12-26 15:01:31 +0000
Lease Start (UTC),2024-12-25 15:01:31 +0000
Project ID,1eb0c915-27b6-4421-aab1-27ae42ded922
State,StableOK


ID,Name,Cores,RAM,Disk,Image,Image Type,Host,Site,Username,Management IP,State,Error,SSH Command,Public SSH Key File,Private SSH Key File
b8c465f6-41f2-4d90-b3bc-1aadc3b3bdc3,gpu-node1,2,8,100,default_ubuntu_24,qcow2,toky-w3.fabric-testbed.net,TOKY,ubuntu,133.69.160.187,Active,,ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@133.69.160.187,/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
baacc4f1-bacb-4432-a928-05bc17245153,gpu-node2,2,8,100,default_ubuntu_24,qcow2,toky-w1.fabric-testbed.net,TOKY,ubuntu,133.69.160.67,Active,,ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@133.69.160.67,/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key


ID,Name,Layer,Type,Site,Subnet,Gateway,State,Error
d8d8b333-d0cc-4142-a8b7-3c09b2f1c1e9,net1,L2,L2Bridge,TOKY,192.168.100.0/24,,Active,


Name,Short Name,Node,Network,Bandwidth,Mode,VLAN,MAC,Physical Device,Device,IP Address,Numa Node,Switch Port
gpu-node1-nic1-p1,p1,gpu-node1,net1,100,auto,,0A:A8:67:3B:04:90,enp8s0,enp8s0,192.168.100.2,4,HundredGigE0/0/0/9
gpu-node2-nic1-p1,p1,gpu-node2,net1,100,auto,,1E:50:53:F8:13:3C,enp8s0,enp8s0,192.168.100.1,6,HundredGigE0/0/0/5



Time to print interfaces 242 seconds
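The subnet given to `add_l2network` bounds the addresses that interfaces in `auto` mode can receive. A small sketch with the standard `ipaddress` module shows the first usable hosts in `192.168.100.0/24`, which match the two addresses in the interface table above. It makes no claim about how fablib decides which node gets which address:

```python
from ipaddress import IPv4Address, IPv4Network

subnet = IPv4Network("192.168.100.0/24")

# hosts() skips the network and broadcast addresses
hosts = subnet.hosts()
first_two = [next(hosts), next(hosts)]
print(first_two)

# Membership test: both addresses from the interface table are in the subnet
for addr in ("192.168.100.1", "192.168.100.2"):
    assert IPv4Address(addr) in subnet

print(subnet.num_addresses - 2, "usable host addresses")
```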


## Get the Slice

Retrieve the slice by name and display its current state.

In [3]:
slice = fablib.get_slice(name=slice_name)
slice.show();

0,1
ID,c8f5d5f8-3bd9-4786-b01d-410ca2f63717
Name,ray_A30_2nodes
Lease Expiration (UTC),2025-01-06 00:32:38 +0000
Lease Start (UTC),2024-12-25 15:01:31 +0000
Project ID,1eb0c915-27b6-4421-aab1-27ae42ded922
State,StableOK


## Get the Node

Retrieve both nodes and their GPU components, and display their details.


In [4]:
node1 = slice.get_node(node1_name) 
node1.show()

node2 = slice.get_node(node2_name) 
node2.show()

gpu1 = node1.get_component('gpu1')
gpu1.show();

gpu2 = node2.get_component('gpu2')
gpu2.show();


0,1
ID,b8c465f6-41f2-4d90-b3bc-1aadc3b3bdc3
Name,gpu-node1
Cores,2
RAM,8
Disk,100
Image,default_ubuntu_24
Image Type,qcow2
Host,toky-w3.fabric-testbed.net
Site,TOKY
Username,ubuntu


0,1
ID,baacc4f1-bacb-4432-a928-05bc17245153
Name,gpu-node2
Cores,2
RAM,8
Disk,100
Image,default_ubuntu_24
Image Type,qcow2
Host,toky-w1.fabric-testbed.net
Site,TOKY
Username,ubuntu


0,1
Name,gpu-node1-gpu1
Short Name,gpu1
Details,NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Disk,0
Units,1
PCI Address,['0000:25:00.0']
Model,
Type,GPU
Device,
Node,gpu-node1


0,1
Name,gpu-node2-gpu2
Short Name,gpu2
Details,NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Disk,0
Units,1
PCI Address,['0000:81:00.0']
Model,
Type,GPU
Device,
Node,gpu-node2


## GPU PCI Device

Run `lspci` to see each node's GPU PCI device. At this point the GPU is a raw PCI device: no driver is installed yet, so it is not ready for computation. Once the driver is installed, you can use it like any other NVIDIA GPU.

View each node's GPU:

In [8]:
command = "sudo apt-get install -y pciutils && lspci | grep 'NVIDIA'"

stdout, stderr = node1.execute(command)
stdout, stderr = node2.execute(command)

Reading package lists...
Building dependency tree...
Reading state information...
pciutils is already the newest version (1:3.10.0-2build1).
pciutils set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
07:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Reading package lists...
Building dependency tree...
Reading state information...
pciutils is already the newest version (1:3.10.0-2build1).
pciutils set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
07:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
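If you want to check for the GPU programmatically rather than by eye, the returned `stdout` can be parsed. A minimal sketch, assuming the `lspci` line format shown above; `nvidia_devices` is a hypothetical helper, and the sample text is taken from the output above:

```python
def nvidia_devices(lspci_output: str) -> list[tuple[str, str]]:
    """Return (pci_address, description) for each NVIDIA line of lspci output."""
    devices = []
    for line in lspci_output.splitlines():
        if "NVIDIA" not in line:
            continue
        # The PCI address is the first whitespace-delimited token
        address, _, description = line.partition(" ")
        devices.append((address, description.strip()))
    return devices

sample = """pciutils is already the newest version (1:3.10.0-2build1).
07:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
"""
print(nvidia_devices(sample))
```

In the notebook you could pass the `stdout` returned by `node1.execute(command)` instead of `sample`; an empty list would mean no NVIDIA device is visible to the VM.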


## Install Nvidia Drivers

Now run the following commands to install the latest NVIDIA driver together with the CUDA libraries and compiler. This step can take up to 20 minutes.

NOTE: for instructional purposes the following cell sends all command output back to the notebook. You can also send it to log files to keep the notebook output clean.
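One way to keep the notebook output clean is sketched below. It runs every command on every node and appends the results to one local log file per node. It assumes that `execute()` returns `(stdout, stderr)`, as in the cells of this notebook, and that it accepts `quiet=True` to suppress echoing (an assumption; check your fablib version). `run_logged` and `FakeNode` are hypothetical names; `FakeNode` only stands in for a fablib node so the example runs offline:

```python
import tempfile
from pathlib import Path

def run_logged(nodes, commands, log_dir):
    """Run each command on each node, appending output to <log_dir>/<node name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    for name, node in nodes.items():
        with open(Path(log_dir) / f"{name}.log", "a") as log:
            for command in commands:
                # Assumption: execute() takes quiet=True and returns (stdout, stderr)
                stdout, stderr = node.execute(command, quiet=True)
                log.write(f"++++ {command}\n{stdout}{stderr}\n")

class FakeNode:
    """Offline stand-in for a fablib node (hypothetical, for illustration only)."""
    def execute(self, command, quiet=False):
        return f"ran: {command}\n", ""

log_dir = tempfile.mkdtemp()  # throwaway directory for the demo
run_logged({"gpu-node1": FakeNode(), "gpu-node2": FakeNode()},
           ["sudo apt-get -q update"], log_dir)
print((Path(log_dir) / "gpu-node1.log").read_text())
```

With real nodes you would pass `{node1_name: node1, node2_name: node2}` and the `commands` list; this also removes the duplicated `node1`/`node2` lines in each loop.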

In [9]:
distro='ubuntu2404'
version='12.6'
architecture='x86_64'

# install prerequisites
commands = [
    'sudo apt-get -q update',
    'sudo apt-get -q install -y linux-headers-$(uname -r) gcc',
]

print("Installing Prerequisites...")
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node1.execute(command)
    print(f"++++ {command}")
    stdout, stderr = node2.execute(command)

print(f"Installing CUDA {version}")
# Build the repository URL from the variables defined at the top of the cell
keyring = 'cuda-keyring_1.1-1_all.deb'
commands = [
    f'wget https://developer.download.nvidia.com/compute/cuda/repos/{distro}/{architecture}/{keyring}',
    f'sudo dpkg -i {keyring}',
    'sudo apt-get -q update',
    'sudo apt-get -q install -y cuda'
]
print("Installing CUDA...")
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node1.execute(command)
    print(f"++++ {command}")
    stdout, stderr = node2.execute(command)
    
print("Done installing CUDA")

Installing Prerequisites...
++++ sudo apt-get -q update
Get:1 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB]
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Get:3 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [572 kB]
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu noble/universe amd64 Packages [15.0 MB]
Get:7 http://security.ubuntu.com/ubuntu noble-security/main Translation-en [111 kB]
Get:8 http://security.ubuntu.com/ubuntu noble-security/main amd64 Components [7256 B]
Get:9 http://security.ubuntu.com/ubuntu noble-security/main amd64 c-n-f Metadata [5892 B]
Get:10 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [795 kB]
Get:11 http://security.ubuntu.com/ubuntu noble-security/universe Translation-en [169 kB]
Get:12 http

In [10]:
print(f"Installing cuda-toolkit")
commands = [
    'sudo apt-get -q update',
    'sudo apt-get -q install -y cuda-toolkit'
]
print("Installing cuda-toolkit...")
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node1.execute(command)
    print(f"++++ {command}")
    stdout, stderr = node2.execute(command)
    
print("Done installing cuda-toolkit")

Installing cuda-toolkit
Installing cuda-toolkit...
++++ sudo apt-get -q update
Hit:1 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:2 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Fetched 508 kB in 3s (198 kB/s)
Reading package lists...
++++ sudo apt-get -q update
Get:1 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Hit:2 http://security.ubuntu.com/ubuntu noble-security InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Fetched 508 kB in 4s (133 kB/s

In [5]:
print(f"Apt update and upgrade")
commands = [
    'sudo apt-get -q update',
    'sudo apt-get -q upgrade -y'
]
print("Updating...")
for command in commands:
    print(f"++++ {command}")
    stdout, stderr = node1.execute(command)
    print(f"++++ {command}")
    stdout, stderr = node2.execute(command)
    
print("Done with apt update")

Apt update and upgrade
Updating...
++++ sudo apt-get -q update
Get:1 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Hit:2 http://security.ubuntu.com/ubuntu noble-security InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease
Get:4 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [761 kB]
Get:7 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/main amd64 Components [151 kB]
Get:8 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/universe amd64 Components [310 kB]
Get:9 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Components [212 B]
Get:10 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Components [940 B]
Get:11 http://nova.clouds.archive.ubuntu.

In [28]:
node1.upload_file('./bashrc', './.bashrc')
node2.upload_file('./bashrc', './.bashrc')

<SFTPAttributes: [ size=4547 uid=1000 gid=1000 mode=0o100644 atime=1735154281 mtime=1735155990 ]>

Once CUDA is installed, reboot both nodes so that the newly installed driver is loaded.

In [11]:
reboot = 'sudo reboot'

print(reboot)
node1.execute(reboot)

slice.wait_ssh(timeout=360,interval=10,progress=True)

print("Now testing SSH abilities to reconnect...",end="")
slice.update()
slice.test_ssh()
print("Reconnected!")


sudo reboot
Waiting for slice . Slice state: StableOK
Waiting for ssh in slice ... ssh successful
Now testing SSH abilities to reconnect...Reconnected!


In [12]:
reboot = 'sudo reboot'

print(reboot)
node2.execute(reboot)

slice.wait_ssh(timeout=360,interval=10,progress=True)

print("Now testing SSH abilities to reconnect...",end="")
slice.update()
slice.test_ssh()
print("Reconnected!")


sudo reboot
Waiting for slice . Slice state: StableOK
Waiting for ssh in slice . ssh successful
Now testing SSH abilities to reconnect...Reconnected!
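`wait_ssh(timeout=360, interval=10)` polls until the nodes are reachable again or the timeout expires. The general pattern can be sketched as follows; `wait_until` is a hypothetical helper that illustrates the idea and is not fablib's implementation:

```python
import time

def wait_until(probe, timeout=360.0, interval=10.0):
    """Call probe() every `interval` seconds until it returns True or `timeout` elapses.

    Returns the number of attempts made; raises TimeoutError on failure.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        # Give up if another sleep would take us past the deadline
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"gave up after {attempts} attempts")
        time.sleep(interval)

# Demo probe that succeeds on the third call (stands in for an SSH check)
calls = {"n": 0}
def fake_ssh_check():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(fake_ssh_check, timeout=1.0, interval=0.01))
```

The short timeout and interval are only there so the demo finishes quickly; for a rebooting VM, values like the 360 and 10 seconds used above are more realistic.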


## Testing the GPU and CUDA Installation

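First, verify that the NVIDIA driver recognizes the GPU by running `nvidia-smi`. If the shell reports `command not found`, as in the output recorded below, the driver utilities are not installed or are not on the `PATH`.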
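`nvidia-smi` also offers a machine-readable query mode through its `--query-gpu` and `--format=csv,noheader` options, which is easier to check in code than the default table. A minimal sketch; `parse_gpu_query` is a hypothetical helper, and the sample line is illustrative only (it is not output from this run):

```python
import csv
import io

FIELDS = ["name", "driver_version", "memory.total"]
QUERY = f"nvidia-smi --query-gpu={','.join(FIELDS)} --format=csv,noheader"

def parse_gpu_query(stdout: str) -> list[dict]:
    """Turn the CSV output of QUERY into one dict per GPU."""
    rows = csv.reader(io.StringIO(stdout), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in rows if row]

# Hypothetical sample output; the values are illustrative only
sample = "NVIDIA A30, 560.35.03, 24576 MiB\n"
print(QUERY)
print(parse_gpu_query(sample))
```

In the notebook you could run `stdout, stderr = node1.execute(QUERY)` and pass `stdout` to the parser; an empty result would indicate that the driver does not see a GPU.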

In [13]:
stdout, stderr = node1.execute("nvidia-smi")

print(f"stdout: {stdout}")

bash: line 1: nvidia-smi: command not found
stdout: 


In [14]:
stdout, stderr = node2.execute("nvidia-smi")

print(f"stdout: {stdout}")

bash: line 1: nvidia-smi: command not found
stdout: 


Now, let's upload the following "Hello World" CUDA program file to both nodes.

`hello-world.cu`

*Source: https://computer-graphics.se/multicore/pdf/hello-world.cu*

*Author: Ingemar Ragnemalm*

>This file is from *"The real "Hello World!" for CUDA, OpenCL and GLSL!"* (https://computer-graphics.se/hello-world-for-cuda.html), written by Ingemar Ragnemalm, programmer and CUDA teacher. The only change from the original file on the website is an added `#include <unistd.h>`, because `sleep()` is declared in the `unistd.h` header.

In [15]:
node1.upload_file('./hello-world.cu', 'hello-world.cu')
node2.upload_file('./hello-world.cu', 'hello-world.cu')

<SFTPAttributes: [ size=1110 uid=1000 gid=1000 mode=0o100664 atime=1735140765 mtime=1735140765 ]>

In [16]:
# Never upload a personal private key like this on FABRIC (see below for using slice_key);
# it is done here only for a cluster communications test
node1.upload_file('id_rsa', '.ssh/id_rsa')
node1.upload_file('id_rsa.pub', '.ssh/id_rsa.pub')
node2.upload_file('id_rsa', '.ssh/id_rsa')
node2.upload_file('id_rsa.pub', '.ssh/id_rsa.pub')
stdout, stderr = node1.execute(f"chmod 600 ~/.ssh/id_rsa")
stdout, stderr = node2.execute(f"chmod 600 ~/.ssh/id_rsa")
stdout, stderr = node1.execute(f"chmod 644 ~/.ssh/id_rsa.pub")
stdout, stderr = node2.execute(f"chmod 644 ~/.ssh/id_rsa.pub")


In [17]:
#grab latest fabric slice_key from fabric_config directory in case it changed
node1.upload_file('/home/fabric/work/fabric_config/slice_key', '.ssh/slice_key')
node2.upload_file('/home/fabric/work/fabric_config/slice_key', '.ssh/slice_key')
stdout, stderr = node1.execute(f"chmod 600 ~/.ssh/slice_key")
stdout, stderr = node2.execute(f"chmod 600 ~/.ssh/slice_key")
print(f'Use slice key explicitly to connect to nodes:  ssh -i ~/.ssh/slice_key ubuntu@192.168.100.2')

Use slice key explicitly to connect to nodes:  ssh -i ~/.ssh/slice_key ubuntu@192.168.100.2
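After a reboot, the data-plane interface may need its address restored, which the next cell does with `ip addr add`. That command reports `Address already assigned` when it is rerun on an interface that already has the address. A sketch of a rerunnable variant builds the commands with `ip addr replace` from iproute2, which should not fail when the address is already present. It assumes the device name `enp8s0` and the addresses used in the next cell; `repair_commands` is a hypothetical helper:

```python
from ipaddress import IPv4Interface

def repair_commands(device: str, address: str) -> list[str]:
    """Build shell commands that (re)assign `address` to `device` and bring it up."""
    iface = IPv4Interface(address)  # raises ValueError on a malformed address
    return [
        f"sudo ip addr replace {iface.with_prefixlen} dev {device}",
        f"sudo ip link set dev {device} up",
        f"ip -brief addr show dev {device}",
    ]

# Addresses as assigned in the next cell
plan = {"gpu-node1": "192.168.100.1/24", "gpu-node2": "192.168.100.2/24"}
for name, address in plan.items():
    for command in repair_commands("enp8s0", address):
        print(f"{name}: {command}")
```

Each generated string can be passed to `node1.execute(...)` or `node2.execute(...)`. Validating the address locally first catches typos before anything runs on the node.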


In [26]:
#repair network interfaces after reboot
stdout, stderr = node1.execute(f"sudo apt install net-tools -y")
stdout, stderr = node1.execute(f"sudo ifconfig enp8s0 down")
stdout, stderr = node1.execute(f"sudo ip addr add 192.168.100.1/24 dev enp8s0")
stdout, stderr = node1.execute(f"sudo ifconfig enp8s0 up")
stdout, stderr = node1.execute(f"sudo ifconfig enp8s0")

stdout, stderr = node2.execute(f"sudo apt install net-tools -y")
stdout, stderr = node2.execute(f"sudo ifconfig enp8s0 down")
stdout, stderr = node2.execute(f"sudo ip addr add 192.168.100.2/24 dev enp8s0")
stdout, stderr = node2.execute(f"sudo ifconfig enp8s0 up")
stdout, stderr = node2.execute(f"sudo ifconfig enp8s0")

Reading package lists...
Building dependency tree...
Reading state information...
net-tools is already the newest version (2.10-0.1ubuntu4).
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
Error: ipv4: Address already assigned.
enp8s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.100.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::8a8:67ff:fe3b:490  prefixlen 64  scopeid 0x20<link>
        ether 0a:a8:67:3b:04:90  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 16  bytes 1312 (1.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Reading package lists...
Building dependency tree...
Reading state information...
net-tools is already the newest version (2.10-0.1ubuntu4).
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
Error: ipv4: Address already assigned.
en

In [19]:
#node1.upload_file(' ./lightning_mnist_example.ipynb', 'lightning_mnist_example.ipynb')
#node2.upload_file('./lightning_mnist_example.ipynb', 'lightning_mnist_example.ipynb')

In [6]:
#get ssh commands
print(f"SSH Command: {node1.get_ssh_command()}")
print(f"SSH Command: {node2.get_ssh_command()}")

SSH Command: ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@133.69.160.187
SSH Command: ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@133.69.160.67


In [19]:
node1.upload_file('./lightning_mnist_example-2gpu.py', 'lightning_mnist_example.py')
node1.upload_file('./installer.sh', './installer.sh')
node1.upload_file('./invertArray.py', './invertArray.py')

node2.upload_file('./lightning_mnist_example-2gpu.py', 'lightning_mnist_example.py')
node2.upload_file('./installer.sh', './installer.sh')
node2.upload_file('./invertArray.py', './invertArray.py')


#node.upload_file('./lightning_mnist_example-install1.py', 'lightning_mnist_example-install1.py')
#node.upload_file('./lightning_mnist_example-install2.py', 'lightning_mnist_example-install2.py')

<SFTPAttributes: [ size=242 uid=1000 gid=1000 mode=0o100664 atime=1735140843 mtime=1735231821 ]>

In [22]:
# Deliberately call an undefined function so that "Run All" stops here with a NameError
duhprint(f'Broken print command to stop execution here')
#input("Execution paused. Hit Enter after running ray add and python ml example manually")

NameError: name 'duhprint' is not defined

In [None]:
node1.download_file('./cern-ray-microbm-node1.txt', 'cern-ray-microbm-node1.txt')
node2.download_file('./cern-ray-microbm-node2.txt', 'cern-ray-microbm-node2.txt')

In [None]:
#scp -F ~/.ssh/fabric_ssh_config -i <private *sliver* key file>  ubuntu@11.22.33.44:~/<remote file name> <local file name>
# `node` is not defined in this notebook; use node1. Local path first, then remote path.
node1.download_directory('.', '/tmp/ray/session_latest')

We now compile the `.cu` file using `nvcc`, the CUDA compiler tool installed with CUDA. In this example, we create an executable called `hello_world`.

In [7]:
# nvcc from the CUDA 12.6 toolkit installed above
nvcc = "/usr/local/cuda-12.6/bin/nvcc"
stdout, stderr = node1.execute(f"{nvcc} -o hello_world hello-world.cu")
stdout, stderr = node2.execute(f"{nvcc} -o hello_world hello-world.cu")

Finally, run the executable:

In [8]:
stdout, stderr = node1.execute("./hello_world")
print(f"stdout: {stdout}")

stdout, stderr = node2.execute("./hello_world")
print(f"stdout: {stdout}")

Hello Hello 
stdout: Hello Hello 

Hello Hello 
stdout: Hello Hello 



In [17]:
stdout, stderr = node1.execute("bash ./installer.sh")
stdout, stderr = node1.execute("python3 ./lightning_mnist_example.py")

Reading package lists...
Building dependency tree...
sudo: python3-full: command not found
Reading state information...
python3-pip is already the newest version (24.0+dfsg-1ubuntu1.1).
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
pip 24.0 from /usr/lib/python3/dist-packages/pip (python 3.12)
Collecting torchmetrics
  Downloading torchmetrics-1.6.1-py3-none-any.whl.metadata (21 kB)
Collecting pytorch_lightning
  Downloading pytorch_lightning-2.5.0.post0-py3-none-any.whl.metadata (21 kB)
Collecting torchvision
  Downloading torchvision-0.20.1-cp312-cp312-manylinux1_x86_64.whl.metadata (6.1 kB)
Collecting ray
  Downloading ray-2.40.0-cp312-cp312-manylinux2014_x86_64.whl.metadata (17 kB)
Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89.9/89.9 kB 409.4 kB/s eta 0:00:00
Collecting numpy
  Downloading numpy-2.2.1-cp312-cp312-manylinux_2_17_x86

In [16]:
# now add a 2nd ray node to the cluster and run again on head node
# ray add???
stdout, stderr = node2.execute("bash ./installer.sh")
#stdout, stderr = node2.execute("python3 ./lightning_mnist_example.py")

The virtual environment was not created successfully because ensurepip is not
available.  On Debian/Ubuntu systems, you need to install the python3-venv
package using the following command.

    apt install python3.12-venv

You may need to use sudo with that command.  After installing the python3-venv
package, recreate your virtual environment.

Failing command: /home/ubuntu/ray-venv/bin/python3

sudo: python3-full: command not found
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  javascript-common libexpat1-dev libjs-jquery libjs-sphinxdoc
  libjs-underscore libpython3-dev libpython3.12-dev python3-dev python3-wheel
  python3.12-dev zlib1g-dev
Suggested packages:
  apache2 | lighttpd | httpd
The following NEW packages will be installed:
  javascript-common libexpat1-dev libjs-jquery libjs-sphinxdoc
  libjs-underscore libpython3-dev libpython3.12-dev python3-dev python3-pip


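If the `hello_world` executable prints `Hello World!`, the CUDA program ran successfully. `World!` was computed on the GPU by adding an array of offsets to the characters of the string `Hello `, and the result was printed to stdout. If it prints `Hello Hello` instead, the string was not modified, which usually means the kernel did not run on the GPU.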
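The arithmetic the kernel performs can be reproduced on the CPU. A minimal Python sketch, assuming the offset array below (these are the offsets that turn `Hello ` into `World!`); on the GPU each element is handled by its own thread:

```python
# The same per-character arithmetic the kernel performs, done sequentially
a = "Hello "
b = [15, 10, 6, 0, -11, 1]  # offsets added to the character codes of `a`

result = "".join(chr(ord(ch) + offset) for ch, offset in zip(a, b))
print(a + result)  # prints "Hello World!"
```

For example, `'H'` has code 72, and 72 + 15 = 87, which is `'W'`.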

### Congratulations! You have now successfully run a program on a FABRIC GPU!

## Clean Up Your Experiment

In [None]:
fablib.delete_slice(slice_name)

In [29]:
#extend slice
from datetime import datetime
from datetime import timezone
from datetime import timedelta

# Set the lease end to now plus 12 days
end_date = (datetime.now(timezone.utc) + timedelta(days=12)).strftime("%Y-%m-%d %H:%M:%S %z")

try:
    slice = fablib.get_slice(name=slice_name)

    slice.renew(end_date)
except Exception as e:
    print(f"Exception: {e}")


Retry: 1, Time: 43 sec


0,1
ID,c8f5d5f8-3bd9-4786-b01d-410ca2f63717
Name,ray_A30_2nodes
Lease Expiration (UTC),2025-01-06 00:32:38 +0000
Lease Start (UTC),2024-12-25 15:01:31 +0000
Project ID,1eb0c915-27b6-4421-aab1-27ae42ded922
State,StableOK


ID,Name,Cores,RAM,Disk,Image,Image Type,Host,Site,Username,Management IP,State,Error,SSH Command,Public SSH Key File,Private SSH Key File
b8c465f6-41f2-4d90-b3bc-1aadc3b3bdc3,gpu-node1,2,8,100,default_ubuntu_24,qcow2,toky-w3.fabric-testbed.net,TOKY,ubuntu,133.69.160.187,Active,,ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@133.69.160.187,/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key
baacc4f1-bacb-4432-a928-05bc17245153,gpu-node2,2,8,100,default_ubuntu_24,qcow2,toky-w1.fabric-testbed.net,TOKY,ubuntu,133.69.160.67,Active,,ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@133.69.160.67,/home/fabric/work/fabric_config/slice_key.pub,/home/fabric/work/fabric_config/slice_key


ID,Name,Layer,Type,Site,Subnet,Gateway,State,Error
d8d8b333-d0cc-4142-a8b7-3c09b2f1c1e9,net1,L2,L2Bridge,TOKY,192.168.100.0/24,,Active,


Name,Short Name,Node,Network,Bandwidth,Mode,VLAN,MAC,Physical Device,Device,IP Address,Numa Node,Switch Port
gpu-node1-nic1-p1,p1,gpu-node1,net1,100,auto,,0A:A8:67:3B:04:90,enp8s0,enp8s0,192.168.100.2,4,HundredGigE0/0/0/9
gpu-node2-nic1-p1,p1,gpu-node2,net1,100,auto,,1E:50:53:F8:13:3C,enp8s0,enp8s0,192.168.100.1,6,HundredGigE0/0/0/5



Time to print interfaces 43 seconds
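The renewal cell builds its end date as a formatted UTC string. A small offline sketch of that computation, with a round-trip check of the format; `lease_end` is a hypothetical helper, and the fixed start time is taken from the lease start shown above so that the result is reproducible:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FORMAT = "%Y-%m-%d %H:%M:%S %z"

def lease_end(days: int, now: Optional[datetime] = None) -> str:
    """Format a UTC timestamp `days` from `now` in the layout used by the renewal cell."""
    now = now or datetime.now(timezone.utc)
    return (now + timedelta(days=days)).strftime(FORMAT)

start = datetime(2024, 12, 25, 15, 1, 31, tzinfo=timezone.utc)
end = lease_end(12, now=start)
print(end)  # prints "2025-01-06 15:01:31 +0000"

# Round trip: the string parses back to the same instant
assert datetime.strptime(end, FORMAT) == start + timedelta(days=12)
```

Calling `lease_end(12)` without `now` gives a string relative to the current time, which is what the renewal cell passes to `slice.renew()`.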
