## Launch and set up NVIDIA A100 40GB server - with python-chi

When the lease begins, we will bring up our GPU server, using the `python-chi` Python API for Chameleon to provision it.

> **Note**: if you don’t have access to the Chameleon Jupyter environment, or if you prefer to set up your NVIDIA A100 server by hand, skip to the next section, which provides alternative instructions.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected:

In [1]:
import datetime
import os
import time

import chi
from chi import context, lease, server

context.version = "1.0" 
context.choose_project()
context.choose_site(default="CHI@TACC")

Change the string in the following cell to reflect the name of *your* lease (**with your own net ID**), then run it to get your lease:

In [2]:
l = lease.get_lease("project20")
l.show()

Lease Details:
Name: project20
ID: 7a061785-0668-4a96-9294-de48072878c8
Status: ACTIVE
Start Date: 2025-05-08 14:10:00
End Date: 2025-05-08 15:50:00
User ID: 7dba80de0714e3446b69ea0af5fddfa8a8c3dbf80afd33092c28e9a090c236df
Project ID: d3c6e101843a4ba79e665ebf59b521a2

Node Reservations:
ID: ea77ba06-a030-48fc-b366-00dbc8eef31c, Status: active, Min: 1, Max: 1

Floating IP Reservations:

Network Reservations:

Events:


The status should show as “ACTIVE” now that we are past the lease start time.
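
As a rough illustration of what “past the lease start time” means, the sketch below checks whether a given time falls inside the lease window, using the start and end timestamps printed above. The helper names are hypothetical (they are not part of `python-chi`), and the example time is an assumption chosen for illustration.

```python
from datetime import datetime

# Timestamp format used in the "Lease Details" output above.
LEASE_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def lease_window_contains(start: str, end: str, now: datetime) -> bool:
    """Return True if `now` falls inside the half-open window [start, end)."""
    start_dt = datetime.strptime(start, LEASE_TIME_FORMAT)
    end_dt = datetime.strptime(end, LEASE_TIME_FORMAT)
    return start_dt <= now < end_dt


def minutes_remaining(end: str, now: datetime) -> float:
    """Return the minutes left until `end`, or 0.0 if the lease has ended."""
    end_dt = datetime.strptime(end, LEASE_TIME_FORMAT)
    return max(0.0, (end_dt - now).total_seconds() / 60)


# Start and end values copied from the lease output above.
# `now` is a made-up example time, assumed to be in the same time zone.
now = datetime(2025, 5, 8, 14, 33, 17)
print(lease_window_contains("2025-05-08 14:10:00", "2025-05-08 15:50:00", now))
print(minutes_remaining("2025-05-08 15:50:00", now))
```

A check like this can help you judge whether enough lease time remains before starting a long-running step.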

The rest of this notebook can be executed without any interactions from you, so at this point, you can save time by clicking on this cell, then selecting “Run” \> “Run Selected Cell and All Below” from the Jupyter menu.

As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!

We will use the lease to bring up a server with the `CC-Ubuntu24.04-CUDA` disk image.

> **Note**: the following cell brings up a server only if you don’t already have one with the same name, regardless of that server’s state. If you already have a server in ERROR state, delete it first in the Horizon GUI before you run this cell.

In [3]:
username = os.getenv('USER') # all exp resources will have this prefix
s = server.Server(
    "mlops-project20",
    reservation_id=l.node_reservations[0]["id"],
    image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)

Waiting for server mlops-project20's status to become ACTIVE. This typically takes 10 minutes, but can take up to 20 minutes.


Server has moved to status ACTIVE


Attribute,mlops-project20
Id,7076f705-46a9-403e-a670-b52bb2b082af
Status,ACTIVE
Image Name,CC-Ubuntu24.04-CUDA
Flavor Name,baremetal
Addresses,sharednet1:  IP: 10.52.0.96 (v4)  Type: fixed  MAC: 34:80:0d:ed:52:26
Network Name,sharednet1
Created At,2025-05-08T14:33:17Z
Keypair,bm3788_nyu_edu-jupyter
Reservation Id,ea77ba06-a030-48fc-b366-00dbc8eef31c
Host Id,9acf860df16fe3cd915f9522cd52cf171577a815ef5c486f67a143e3


Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.

In [19]:
s.associate_floating_ip()

In the output below, make a note of the floating IP that has been assigned to your instance (in the “Addresses” row).

In [20]:
s.refresh()
s.show(type="widget")

Attribute,mlops-project20
Id,7076f705-46a9-403e-a670-b52bb2b082af
Status,ACTIVE
Image Name,CC-Ubuntu24.04-CUDA
Flavor Name,baremetal
Addresses,sharednet1:  IP: 10.52.0.96 (v4)  Type: fixed  MAC: 34:80:0d:ed:52:26  IP: 129.114.108.188 (v4)  Type: floating  MAC: 34:80:0d:ed:52:26
Network Name,sharednet1
Created At,2025-05-08T14:33:17Z
Keypair,bm3788_nyu_edu-jupyter
Reservation Id,ea77ba06-a030-48fc-b366-00dbc8eef31c
Host Id,9acf860df16fe3cd915f9522cd52cf171577a815ef5c486f67a143e3
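
If you would rather not copy the address by eye, a small helper along these lines can pull the floating IP out of the “Addresses” text. It is a sketch that assumes the row is formatted exactly as in the output above; `floating_ips` is a hypothetical name, not a `python-chi` function.

```python
import re


def floating_ips(addresses: str) -> list[str]:
    """Return the IPv4 addresses whose 'Type:' field is 'floating'."""
    # Each entry looks like: "IP: 10.52.0.96 (v4)  Type: fixed  MAC: ..."
    pattern = re.compile(
        r"IP:\s*(\d{1,3}(?:\.\d{1,3}){3})\s*\(v4\)\s*Type:\s*(\w+)"
    )
    return [ip for ip, kind in pattern.findall(addresses) if kind == "floating"]


# The "Addresses" row copied from the output above.
addresses = (
    "sharednet1:  IP: 10.52.0.96 (v4)  Type: fixed  MAC: 34:80:0d:ed:52:26  "
    "IP: 129.114.108.188 (v4)  Type: floating  MAC: 34:80:0d:ed:52:26"
)
print(floating_ips(addresses))  # ['129.114.108.188']
```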


In [13]:
security_groups = [
  {'name': "allow-ssh", 'port': 22, 'description': "Enable SSH traffic on TCP port 22"},
  {'name': "allow-8888", 'port': 8888, 'description': "Enable TCP port 8888 (used by Jupyter)"},
  {'name': "allow-8000", 'port': 8000, 'description': "Enable TCP port 8000 (used by MLFlow)"},
  {'name': "allow-9000", 'port': 9000, 'description': "Enable TCP port 9000 (used by MinIO API)"},
  {'name': "allow-9001", 'port': 9001, 'description': "Enable TCP port 9001 (used by MinIO Web UI)"}
]

In [14]:
os_conn = chi.clients.connection()
nova_server = chi.nova().servers.get(s.id)

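# Groups already attached to the server; attaching one a second time is rejected as a duplicate
attached = {group.name for group in nova_server.list_security_group()}

for sg in security_groups:
    # Create the group and its ingress rule only if the group does not exist yet
    if not os_conn.get_security_group(sg['name']):
        os_conn.create_security_group(sg['name'], sg['description'])
        os_conn.create_security_group_rule(sg['name'], port_range_min=sg['port'], port_range_max=sg['port'], protocol='tcp', remote_ip_prefix='0.0.0.0/0')

    if sg['name'] not in attached:
        nova_server.add_security_group(sg['name'])

print(f"updated security groups: {[group.name for group in nova_server.list_security_group()]}")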

BadRequest: Invalid input for security_groups. Reason: Duplicate items in the list: '981163bd-1d7d-4687-86c6-160783061624'.
Neutron server returns request_ids: ['req-91361ba9-db86-4794-aad4-c3c168e1165f'] (HTTP 400) (Request-ID: req-08b1853b-208c-4c15-8c63-3e69e4c85b36)

## Retrieve code and notebooks on the instance

Now, we can use `python-chi` to execute commands on the instance, to set it up. We’ll start by retrieving the code and other materials on the instance.

In [12]:
s.execute("git clone --recurse-submodules https://github.com/kathangabani-nyu/MLOps-Project-Group-20")

Cloning into 'MLOps-Project-Group-20'...


<Result cmd='git clone --recurse-submodules https://github.com/kathangabani-nyu/MLOps-Project-Group-20' exited=0>

## Set up Docker

To use common deep learning frameworks like TensorFlow or PyTorch, and ML training platforms like MLFlow and Ray, we can run containers that bundle all the libraries these frameworks need. Here, we will install Docker, the container runtime we will use.

In [16]:
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")

# Executing docker install script, commit: 53a22f61c0628e58e1d6680b49e82993d304b449


+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install ca-certificates curl >/dev/null
+ sh -c install -m 0755 -d /etc/apt/keyrings
+ sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" -o /etc/apt/keyrings/docker.asc
+ sh -c chmod a+r /etc/apt/keyrings/docker.asc
+ sh -c echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu noble stable" > /etc/apt/sources.list.d/docker.list
+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-ce-rootless-extras docker-buildx-plugin >/dev/null

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.
+ sh -c doc

Client: Docker Engine - Community
 Version:           28.1.1
 API version:       1.49
 Go version:        go1.23.8
 Git commit:        4eba377
 Built:             Fri Apr 18 09:52:14 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.1.1
  API version:      1.49 (minimum version 1.24)
  Go version:       go1.23.8
  Git commit:       01f442b
  Built:            Fri Apr 18 09:52:14 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


To run Docker as a non-privileged user, consider setting up the
Docker daemon in rootless mode for your user:

    dockerd-rootless-setuptool.sh install

Visit https://docs.docker.com/go/rootless/ to learn about rootless mode.


T

<Result cmd='sudo groupadd -f docker; sudo usermod -aG docker $USER' exited=0>

## Set up the NVIDIA container toolkit

We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.

In [21]:
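# --batch --yes keeps gpg non-interactive: without a TTY, gpg fails if it needs to prompt
# (for example, to confirm overwriting an existing keyring file)
s.execute("curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --batch --yes --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \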
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
s.execute("sudo apt update")
s.execute("sudo apt-get install -y nvidia-container-toolkit")
s.execute("sudo nvidia-ctk runtime configure --runtime=docker")
# for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
s.execute("sudo jq 'if has(\"exec-opts\") then . else . + {\"exec-opts\": [\"native.cgroupdriver=cgroupfs\"]} end' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp > /dev/null && sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json")
s.execute("sudo systemctl restart docker")

gpg: cannot open '/dev/tty': No such device or address


UnexpectedExit: Encountered a bad command exit code!

Command: "curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg   && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list"

Exit code: 2

Stdout: already printed

Stderr: already printed
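
The `jq` filter in the cell above adds an `exec-opts` entry to `/etc/docker/daemon.json` only when that key is missing. The same logic can be sketched in Python as follows; `ensure_cgroupfs_driver` is a hypothetical helper, and the sample configuration is an assumed example rather than the actual file contents on the server.

```python
import json


def ensure_cgroupfs_driver(daemon_config: dict) -> dict:
    """Mirror the jq filter: add "exec-opts" only when the key is absent."""
    if "exec-opts" in daemon_config:
        # Leave an existing setting untouched, as the jq `has(...)` branch does.
        return dict(daemon_config)
    return {**daemon_config, "exec-opts": ["native.cgroupdriver=cgroupfs"]}


# Assumed example of a daemon.json after the NVIDIA runtime has been configured.
example = {"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}
print(json.dumps(ensure_cgroupfs_driver(example), indent=2))
```

Because the function leaves an existing `exec-opts` value alone, running it twice gives the same result as running it once, which matches the behavior of the shell command.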



We can also install `nvtop` to monitor GPU usage:

In [22]:
s.execute("sudo apt update")
s.execute("sudo apt -y install nvtop")





Hit:1 https://download.docker.com/linux/ubuntu noble InRelease
Hit:2 https://nvidia.github.io/libnvidia-container/stable/deb/amd64  InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease
Hit:4 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:5 http://nova.clouds.archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Get:6 http://nova.clouds.archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:7 http://nova.clouds.archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Fetched 508 kB in 1s (428 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
4 packages can be upgraded. Run 'apt list --upgradable' to see them.






Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  nvtop
0 upgraded, 1 newly installed, 0 to remove and 4 not upgraded.
Need to get 62.8 kB of archives.
After this operation, 180 kB of additional disk space will be used.
Get:1 http://nova.clouds.archive.ubuntu.com/ubuntu noble/multiverse amd64 nvtop amd64 3.0.2-1 [62.8 kB]


debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


Fetched 62.8 kB in 0s (129 kB/s)
Selecting previously unselected package nvtop.
(Reading database ... 113837 files and directories currently installed.)
Preparing to unpack .../nvtop_3.0.2-1_amd64.deb ...
Unpacking nvtop (3.0.2-1) ...
Setting up nvtop (3.0.2-1) ...
Processing triggers for man-db (2.12.0-4build2) ...


debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.


<Result cmd='sudo apt -y install nvtop' exited=0>

Socket exception: No route to host (113)


In [17]:
s.refresh()
s.check_connectivity()

Checking connectivity to 129.114.108.211 port 22.


Connection successful
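
Conceptually, a check like this amounts to attempting a TCP connection to port 22. The sketch below shows one way to do that with the standard library; it is not necessarily how `python-chi` implements `check_connectivity`, and `port_is_open` is a hypothetical helper. To stay self-contained, the demonstration connects to a temporary listener on localhost rather than to the server.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and "No route to host".
        return False


# Demonstration against a throwaway listener bound to a free local port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print(port_is_open("127.0.0.1", demo_port))
listener.close()
```

Against the real server, you would pass the floating IP and port 22 instead of the local listener.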


In [23]:

#curl https://rclone.org/install.sh | sudo bash
#sudo sed -i '/^#user_allow_other/s/^#//' /etc/fuse.conf
#mkdir -p ~/.config/rclone
#nano  ~/.config/rclone/rclone.conf
# sudo mkdir -p /mnt/object
# sudo chown -R cc /mnt/object
# sudo chgrp -R cc /mnt/object
# rclone mount chi_tacc:<object store name> /mnt/object  --allow-other --daemon
# docker build -t mlops20/ray-worker:nvidia -f Dockerfile.jupyter-ray-nvidia .
# docker build -t mlops20/ray-worker:nvidia -f Dockerfile.ray-nvidia .
# docker compose -f docker-compose-mlflow.yaml up -d

#export HOST_IP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
#docker compose -f docker-compose-mlflow.yaml up -d

# to bring up Jupyter notebook 
# docker run -d --rm \
#   -p 8888:8888 \
#   --shm-size 20G \
#   -v ~/MLOps-Project-Group-20/workspace:/home/project20/work/ \
#   --mount type=bind,source=/mnt/object,target=/mnt/model1-artifacts \
#   --name jupyter \
#   quay.io/jupyter/pytorch-notebook:latest





### Build a container image - for MLFlow section

Finally, we will build a container image to work in during the MLFlow section. It includes:

-   a Jupyter notebook server
-   PyTorch and PyTorch Lightning
-   CUDA, which allows deep learning frameworks like PyTorch to use the NVIDIA GPU accelerator
-   MLFlow

You can see our Dockerfile for this image at: [Dockerfile.jupyter-torch-mlflow-cuda](https://github.com/teaching-on-testbeds/mltrain-chi/tree/main/docker/Dockerfile.jupyter-torch-mlflow-cuda)

Building this container may take a bit of time, but that’s OK: we can get it started and then continue to the next section while it builds in the background, since we don’t need this container immediately.

Leave that cell running, and in the meantime, open an SSH session on your server. From your local terminal, run:

    ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

where

-   in place of `~/.ssh/id_rsa_chameleon`, substitute the path to your own key that you uploaded to CHI@TACC
-   in place of `A.B.C.D`, use the floating IP address you just associated with your instance.
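
If you keep the key path and floating IP in variables, the command can be assembled programmatically. The following is a minimal sketch, assuming the `cc` login user shown in the command above; `ssh_command` is a hypothetical helper, and the IP used in the example is the one from the earlier output.

```python
import ipaddress
import shlex
from pathlib import Path


def ssh_command(key_path: str, floating_ip: str, user: str = "cc") -> str:
    """Build the ssh command line for the given key and floating IP."""
    # Raises ValueError if the address is malformed (e.g. the literal "A.B.C.D").
    ipaddress.ip_address(floating_ip)
    # Expand "~" ourselves, because shell quoting would otherwise prevent it.
    key = Path(key_path).expanduser()
    return shlex.join(["ssh", "-i", str(key), f"{user}@{floating_ip}"])


print(ssh_command("~/.ssh/id_rsa_chameleon", "129.114.108.188"))
```

Validating the address first catches the common mistake of leaving the `A.B.C.D` placeholder in place.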