Before you begin, open this experiment on Trovi:

-   Open [Large-scale model training on Chameleon](https://chameleoncloud.org/experiment/share/39a536c6-6070-4ccf-9e91-bc47be9a94af) on Trovi.
-   Then, click “Launch on Chameleon”. This will start a new Jupyter server for you, with the experiment materials already in it.

You will see several notebooks inside the `llm-chi` directory. Find the one titled `1_create_server.ipynb`, open it, and continue there.

## Bring up a GPU server

At the beginning of the lease time, we will bring up our GPU server. We will use `python-chi`, the Python API for Chameleon, to provision it.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected:

In [1]:
from chi import server, context, lease
import os

context.version = "1.0" 
context.choose_project()
context.choose_site(default="CHI@TACC")

VBox(children=(Dropdown(description='Select Project', options=('CHI-251409',), value='CHI-251409'), Output()))

VBox(children=(Dropdown(description='Select Site', options=('CHI@TACC', 'CHI@UC', 'CHI@EVL', 'CHI@NCAR', 'CHI@…

Change the string in the following cell to reflect the name of *your* lease (**with your own net ID**), then run it to get your lease:

In [2]:
l = lease.get_lease("mlops-project20_bm3788") # use your own lease name, e.g. llm_single_netID or llm_multi_netID
l.show()

Lease Details:
Name: mlops-project20_bm3788
ID: 0f9a7684-dc12-43df-8614-9ba6baf68c77
Status: ACTIVE
Start Date: 2025-04-30 16:10:00
End Date: 2025-05-01 16:10:00
User ID: 7dba80de0714e3446b69ea0af5fddfa8a8c3dbf80afd33092c28e9a090c236df
Project ID: d3c6e101843a4ba79e665ebf59b521a2

Node Reservations:
ID: e20ab9a5-d722-4da3-b728-1f65a511d099, Status: active, Min: 1, Max: 1

Floating IP Reservations:

Network Reservations:

Events:


The status should show as “ACTIVE” now that we are past the lease start time.
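
If you run this notebook slightly before the lease start time, the status will not be “ACTIVE” yet. The following is a minimal sketch of a polling helper for that case. The name `wait_for_status` is hypothetical and not part of `python-chi`, and the usage comment assumes that the lease object exposes its state as a `status` attribute.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    wanted: str = "ACTIVE",
    timeout: float = 600.0,
    interval: float = 15.0,
) -> str:
    """Poll get_status() until it returns `wanted` (case-insensitive).

    Raises TimeoutError if the wanted status is not seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status.upper() == wanted.upper():
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout} s")
        time.sleep(interval)


# Possible usage in this notebook (assumption: the lease object has a `.status` attribute):
# wait_for_status(lambda: lease.get_lease("mlops-project20_bm3788").status)
```

Passing a callable instead of a lease object keeps the helper independent of any particular `python-chi` version.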

The rest of this notebook can run without further interaction from you. To save time, click on this cell, then select Run \> Run Selected Cell and All Below from the Jupyter menu.

As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!

We will use the lease to bring up a server with the `CC-Ubuntu24.04-CUDA` disk image. (Note that the reservation information is passed when we create the instance!) This will take up to 10 minutes.

In [3]:
username = os.getenv('USER') # appended to the server name so that your resources are identifiable
s = server.Server(
    f"mlops-project20{username}", 
    reservation_id=l.node_reservations[0]["id"],
    image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)

HTTPBadRequest: HTTP 400 Bad Request: Unable to filter by unknown operator 'CC-Ubuntu24.04-CUDA_2024-11-01 23'.

Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.

In [4]:
s.associate_floating_ip()

In [5]:
s.refresh()
s.check_connectivity()

Checking connectivity to 129.114.108.94 port 22.


HBox(children=(Label(value=''), IntProgress(value=0, bar_style='success')))

Connection successful
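
The connectivity check above amounts to waiting until TCP port 22 on the floating IP accepts connections. The following is a minimal standard-library sketch of that idea. It is not the implementation behind `check_connectivity()`, and the name `wait_for_port` is hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int = 22, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A completed TCP handshake is enough; the connection is closed right away.
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Possible usage with the floating IP shown in the output above:
# wait_for_port("129.114.108.94", 22)
```

A successful handshake only shows that the SSH port is reachable. It does not verify that your key is accepted.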


In [4]:
s.refresh()
s.show(type="widget")

Attribute,mlops-project20bm3788_nyu_edu
Id,48e1d909-8395-4d22-a07c-c2768508d293
Status,ACTIVE
Image Name,CC-Ubuntu24.04-CUDA
Flavor Name,baremetal
Addresses,sharednet1:  IP: 10.52.1.64 (v4)  Type: fixed  MAC: 34:80:0d:de:55:44  IP: 129.114.108.94 (v4)  Type: floating  MAC: 34:80:0d:de:55:44
Network Name,sharednet1
Created At,2025-04-30T16:18:56Z
Keypair,trovi-160e9e8
Reservation Id,e20ab9a5-d722-4da3-b728-1f65a511d099
Host Id,9acf860df16fe3cd915f9522cd52cf171577a815ef5c486f67a143e3


In [7]:
security_groups = [
  {'name': "allow-ssh", 'port': 22, 'description': "Enable SSH traffic on TCP port 22"},
  {'name': "allow-8001", 'port': 8001, 'description': "Enable TCP port 8001 (used by QA API)"},
  {'name': "allow-8000", 'port': 8000, 'description': "Enable TCP port 8000 (used by Summarizer API)"},
  {'name': "allow-3000", 'port': 3000, 'description': "Enable TCP port 3000 (used by Grafana)"},
  {'name': "allow-5000", 'port': 5000, 'description': "Enable TCP port 5000 (used by MLflow)"},
  {'name': "allow-9090", 'port': 9090, 'description': "Enable TCP port 9090 (used by Prometheus)"}
]

In [8]:
import chi
from openstack import exceptions

os_conn = chi.clients.connection()
nova_server = chi.nova().servers.get(s.id)

for sg in security_groups:
    # try to find an existing SG
    existing_sg = os_conn.get_security_group(sg['name'])

    if not existing_sg:
        # create SG + rule, but guard against quota errors
        try:
            existing_sg = os_conn.create_security_group(
                sg['name'],
                sg['description']
            )
            os_conn.create_security_group_rule(
                existing_sg['id'],
                port_range_min=sg['port'],
                port_range_max=sg['port'],
                protocol='tcp',
                remote_ip_prefix='0.0.0.0/0'
            )
            print(f"Created security group '{sg['name']}' with port {sg['port']}")
        except exceptions.ConflictException as e:
            print(f"Skipping creation of '{sg['name']}' (quota or conflict): {e}")
            # re-attempt to fetch it in case it actually got created
            existing_sg = os_conn.get_security_group(sg['name'])

    if existing_sg:
        # attach to the server if not already attached
        attached_names = [g.name for g in nova_server.list_security_group()]
        if sg['name'] not in attached_names:
            nova_server.add_security_group(existing_sg['name'])
            print(f"Added '{sg['name']}' to server")

print(
    "Updated security groups on server:",
    [g.name for g in nova_server.list_security_group()]
)

Added 'allow-ssh' to server
Added 'allow-8001' to server
Added 'allow-8000' to server
Skipping creation of 'allow-3000' (quota or conflict): ConflictException: 409: Client Error for url: https://chi.tacc.chameleoncloud.org:9696/v2.0/security-groups, Quota exceeded for resources: ['security_group'].
Added 'allow-5000' to server
Skipping creation of 'allow-9090' (quota or conflict): ConflictException: 409: Client Error for url: https://chi.tacc.chameleoncloud.org:9696/v2.0/security-groups, Quota exceeded for resources: ['security_group'].
Updated security groups on server: ['allow-5000', 'allow-8000', 'allow-8001', 'allow-ssh', 'default']
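
In the output above, `allow-3000` and `allow-9090` were skipped because the project's security group quota was exhausted. One possible workaround is to create a single group that carries one rule per port, as in the sketch below. It uses the same connection methods as the cell above (`get_security_group`, `create_security_group`, `create_security_group_rule`). The function name `ensure_ports_group` and the group name in the usage comment are hypothetical, and the sketch assumes that the separate rule quota is not the limiting factor.

```python
def ensure_ports_group(conn, name, ports, description="Project service ports"):
    """Create one security group with a TCP ingress rule per port, or reuse it if it exists.

    `conn` is expected to behave like the object returned by chi.clients.connection().
    """
    group = conn.get_security_group(name)
    if group:
        # Assumption: an existing group already received its rules when it was created.
        return group

    group = conn.create_security_group(name, description)
    for port in ports:
        conn.create_security_group_rule(
            group["id"],
            port_range_min=port,
            port_range_max=port,
            protocol="tcp",
            remote_ip_prefix="0.0.0.0/0",  # open to any source address, as in the cell above
        )
    return group


# Possible usage in this notebook (hypothetical group name):
# group = ensure_ports_group(os_conn, "mlops-project20-ports", [22, 3000, 5000, 8000, 8001, 9090])
# nova_server.add_security_group(group["name"])
```

This consumes one group from the quota instead of six, at the cost of no longer being able to attach or detach individual ports.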


In [5]:
s.execute("git clone https://github.com/BMG2001nyu/MLOPS_Project")

fatal: destination path 'MLOPS_Project' already exists and is not an empty directory.


UnexpectedExit: Encountered a bad command exit code!

Command: 'git clone https://github.com/BMG2001nyu/MLOPS_Project'

Exit code: 128

Stdout: already printed

Stderr: already printed
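
The clone above fails with exit code 128 because `MLOPS_Project` already exists on the server from an earlier run. A minimal sketch of a re-runnable variant follows: it builds a shell command that clones the repository only if it is missing and otherwise fast-forwards the existing checkout. The helper name `clone_or_update_cmd` is hypothetical.

```python
import shlex


def clone_or_update_cmd(url: str, dest: str) -> str:
    """Build a shell command that clones `url` into `dest`, or updates `dest` if it is already a clone."""
    u, d = shlex.quote(url), shlex.quote(dest)
    return f"if [ -d {d}/.git ]; then git -C {d} pull --ff-only; else git clone {u} {d}; fi"


# Possible usage in this notebook:
# s.execute(clone_or_update_cmd("https://github.com/BMG2001nyu/MLOPS_Project", "MLOPS_Project"))
```

Using `--ff-only` means the update stops with an error, rather than creating a merge commit, if the checkout on the server has diverged.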



In [10]:
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")

# Executing docker install script, commit: 53a22f61c0628e58e1d6680b49e82993d304b449


+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install ca-certificates curl >/dev/null

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

Restarting services...
 systemctl restart packagekit.service

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.
+ sh -c install -m 0755 -d /etc/apt/keyrings
+ sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" -o /etc/apt/keyrings/docker.asc
+ sh -c chmod a+r /etc/apt/keyrings/docker.asc
+ sh -c echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu noble stable" > /etc/apt/sources.list.d/docker.list
+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-ce-rootless-extras docker-buildx-pl

Client: Docker Engine - Community
 Version:           28.1.1
 API version:       1.49
 Go version:        go1.23.8
 Git commit:        4eba377
 Built:             Fri Apr 18 09:52:14 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.1.1
  API version:      1.49 (minimum version 1.24)
  Go version:       go1.23.8
  Git commit:       01f442b
  Built:            Fri Apr 18 09:52:14 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


To run Docker as a non-privileged user, consider setting up the
Docker daemon in rootless mode for your user:

    dockerd-rootless-setuptool.sh install

Visit https://docs.docker.com/go/rootless/ to learn about rootless mode.


<Result cmd='sudo groupadd -f docker; sudo usermod -aG docker $USER' exited=0>
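
The `usermod -aG docker $USER` command updates the group database, but sessions that are already logged in keep their old group list. A small sketch for confirming that the membership was recorded follows. The helper name `in_group` is hypothetical, and the usage comment assumes that `s.execute()` returns a result object with a `stdout` attribute, as the `<Result ...>` output above suggests.

```python
def in_group(id_output: str, group: str = "docker") -> bool:
    """Return True if `group` appears in the whitespace-separated output of `id -nG <user>`."""
    return group in id_output.split()


# Possible usage in this notebook (assumption: the result object has `.stdout`):
# result = s.execute('id -nG "$USER"')
# print(in_group(result.stdout))
```

Passing the user name to `id` makes it read the group database instead of reporting the groups of the current session.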