# Build an MLOps Pipeline

In [Cloud Computing on Chameleon](https://teaching-on-testbeds.github.io/cloud-chi/), following the premise:

> You are working as a machine learning engineer at a small startup company called GourmetGram. They are developing an online photo sharing community focused on food. You are testing a new model you have developed that automatically classifies photos of food into one of a set of categories: Bread, Dairy product, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, and Vegetable/Fruit. You have built a simple web application with which to test your model and get feedback from others.

we deployed a basic machine learning service to an OpenStack cloud. However, that deployment involved many manual steps (“ClickOps”), and any update to it would likewise require manual effort and be difficult to track.

In this tutorial, we will learn how to automate both the initial deployment and the updates made during the lifecycle of the application. We will:

-   practice deploying systems following infrastructure-as-code and configuration-as-code principles, using automated deployment tools
-   create an automated pipeline to manage a machine learning model through its lifecycle

Our experiment will use the following automated deployment and lifecycle management tools:

-   Terraform: A declarative Infrastructure as Code (IaC) tool used to provision and manage cloud infrastructure (servers, networks, etc.) by defining the desired end state in configuration files. Here, we use it to provision our infrastructure.
-   Ansible: An imperative Configuration as Code (CaC) tool that automates system configuration, software installation, and application deployment through task-based YAML playbooks describing the steps to achieve a desired setup. Here, we use it to install Kubernetes and the Argo tools on our infrastructure after it is provisioned.
-   Argo CD: A declarative GitOps continuous delivery tool for Kubernetes that automatically syncs and deploys applications based on the desired state stored in Git repositories.
-   Argo Workflows: A Kubernetes-native workflow engine where you define workflows, which execute tasks inside containers to run pipelines, jobs, or automation processes.
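
The difference between the declarative and imperative styles mentioned above can be sketched in a few lines of Python. This is only a toy illustration, not how Terraform, Ansible, or Argo CD are implemented: the names `plan` and `apply_plan` and the resource dictionaries are hypothetical. A declarative tool is handed a desired end state, compares it with the actual state, and derives the actions itself. An imperative tool would instead be given the list of actions directly.

```python
def plan(desired: dict, actual: dict) -> list:
    """Compare desired and actual state; return the actions that close the gap."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_plan(desired: dict, actual: dict) -> dict:
    """Carry out the planned actions on a copy of the actual state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:  # "create" and "update" both set the desired spec
            new_state[name] = desired[name]
    return new_state


# Hypothetical resources, for illustration only
desired = {"node1": {"flavor": "baremetal"}, "node2": {"flavor": "baremetal"}}
actual = {"node1": {"flavor": "m1.small"}, "old": {"flavor": "m1.small"}}

print(plan(desired, actual))
print(plan(desired, apply_plan(desired, actual)))  # nothing left to do
```

Running the plan a second time yields no actions. That idempotence is what makes it safe to re-apply a declarative configuration after every change.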

**Note**: We use Argo CD and Argo Workflows, which are tightly integrated with Kubernetes, because we are working in the context of a Kubernetes deployment. If our service were not deployed in Kubernetes (for example, if it were deployed using “plain” Docker containers without a container orchestration framework), we would use other tools for managing the application and model lifecycle.

To run this experiment, you should have already created an account on Chameleon, and become part of a project. You should also have added your SSH key to the KVM@TACC site.

```python
# runs in Chameleon Jupyter environment
from chi import context
import os

context.version = "1.0"
context.choose_project()
context.choose_site(default="CHI@UC")
```


```python
from chi import lease

# Name of the Chameleon lease that holds each node
lease_names = {
    "node1": "compute_gigaio_project46_1",
    "node2": "compute_gigaio_project46_2",
    "node3": "compute_gigaio_project46_3"
}

# Look up the reservation ID of each lease
reservation_map = {
    name: lease.get_lease(lease_name).node_reservations[0]["id"]
    for name, lease_name in lease_names.items()
}
# node_hostnames = {
#     name: lease.get_lease(lease_name).nodes[0]["name"] + ".chameleoncloud.org"
#     for name, lease_name in lease_names.items()
# }
for name, reservation_id in reservation_map.items():
    print(reservation_id)
```

```
a09e00c8-af62-4b92-929b-1e39fb4020cd
ad5247ff-9e60-49ff-a862-3d52a6bf09e6
eca3a12c-8edb-455e-b873-b18fa1ee39dd
```


```python
%cd ~/Fine-Tuning-Taiwanese-Hokkien-LLM-for-Medical-Advising/iac/tf/chi
```

```python
import json

with open("outputs.json") as f:
    out = json.load(f)
    sharednet1_ports = {
        "node1": out["sharednet1_ports"]["value"]["node1"],
        "node2": out["sharednet1_ports"]["value"]["node2"],
        "node3": out["sharednet1_ports"]["value"]["node3"],
    }
    private_net_ports = {
        "node1": out["private_net_ports"]["value"]["node1"],
        "node2": out["private_net_ports"]["value"]["node2"],
        "node3": out["private_net_ports"]["value"]["node3"],
    }
```
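
If `outputs.json` is missing an output or a node, the lookups above fail with a bare `KeyError` that does not say which entry is absent. As a minimal sketch, a small helper can check every node before any server is created. The helper name `ports_from_outputs` and the port IDs in the example are made up for illustration. The only assumption about the file is the `{"<output>": {"value": {...}}}` layout that the cell above already relies on.

```python
import json

NODES = ("node1", "node2", "node3")


def ports_from_outputs(outputs: dict, key: str, nodes=NODES) -> dict:
    """Return {node: port_id} for one output, checking that every node is present."""
    if key not in outputs or "value" not in outputs[key]:
        raise KeyError(f"output {key!r} not found")
    values = outputs[key]["value"]
    missing = [node for node in nodes if node not in values]
    if missing:
        raise KeyError(f"{key!r} has no port for: {', '.join(missing)}")
    return {node: values[node] for node in nodes}


# Made-up port IDs; in the notebook, `example` would be the result of json.load(f)
example = json.loads(
    '{"sharednet1_ports": {"value": {"node1": "p1", "node2": "p2", "node3": "p3"}}}'
)
sharednet1_example = ports_from_outputs(example, "sharednet1_ports")
print(sharednet1_example)
```

Calling the helper once per output (`"sharednet1_ports"` and `"private_net_ports"`) would replace the two hand-written dictionaries.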

```python
# path = os.path.expanduser("~/Fine-Tuning-Taiwanese-Hokkien-LLM-for-Medical-Advising/iac/tf/chi/terraform.tfvars")
# with open(path, "w") as f:
#     f.write('suffix = "yc7690"\n')
#     f.write('key    = "id_rsa_chameleon"\n')
#     f.write('node_reservations = {\n')
#     for node, res_id in reservation_map.items():
#         f.write(f'  {node} = "{res_id}"\n')
#     f.write('}\n')
# with open(path, "w") as f:
#     f.write("node_hostnames = {\n")
#     for name, host in node_hostnames.items():
#         f.write(f'  {name} = "{host}"\n')
#     f.write("}\n")
```

```python
# from openstack import connection
# import subprocess, json

# terraform_dir = os.path.expanduser("~/Fine-Tuning-Taiwanese-Hokkien-LLM-for-Medical-Advising/iac/tf/chi")
# outputs = subprocess.check_output(["terraform", "output", "-json"], cwd=terraform_dir)

# ports = json.loads(outputs)
# port_map = {
#     "node1": ports["sharednet1_port_id_node1"]["value"],
#     "node2": ports["private_net_port_id_node2"]["value"],
#     "node3": ports["private_net_port_id_node3"]["value"]
# }
# print(ports.keys())
```

```python
# Create servers
# from openstack import connection
# conn = connection.from_config(cloud="openstack")
# for node in ["node1", "node2", "node3"]:
#     print(f"Creating {node}...")
#     server = conn.create_server(
#         name=f"{node}-mlops",
#         image='CC-Ubuntu24.04-CUDA',
#         flavor='baremetal',
#         key_name='id_rsa_chameleon',
#         nics=[
#             {"port-id": sharednet1_ports[node]},
#             {"port-id": private_net_ports[node]}
#         ],
#         scheduler_hints={"reservation": reservation_map[node]},
#         wait=False,
#         auto_ip=False,
#         user_data = f"""#!/bin/bash
# echo '127.0.1.1 {node}-mlops' >> /etc/hosts
# su cc -c /usr/local/bin/cc-load-public-keys
# """
#     )
#     print(f"{node} server created: {server.id}")
```


```python
import base64

from openstack import connection

# Startup script run on each node at first boot; {node} is filled in per server
user_data_script = """#!/bin/bash
echo '127.0.1.1 {node}-mlops' >> /etc/hosts
su cc -c /usr/local/bin/cc-load-public-keys
"""

conn = connection.from_config(cloud="openstack")

for node in ["node1", "node2", "node3"]:
    print(f"Creating {node}...")

    # The compute API expects user_data to be base64-encoded
    encoded_user_data = base64.b64encode(
        user_data_script.format(node=node).encode("utf-8")
    ).decode("utf-8")

    server = conn.compute.create_server(
        name=f"{node}-mlops",
        image_id=conn.compute.find_image("CC-Ubuntu24.04-CUDA").id,
        flavor_id=conn.compute.find_flavor("baremetal").id,
        # Attach the ports read from outputs.json
        networks=[
            {"port": sharednet1_ports[node]},
            {"port": private_net_ports[node]}
        ],
        key_name="id_rsa_chameleon",
        # Place the server on the node reserved by this lease
        scheduler_hints={"reservation": reservation_map[node]},
        user_data=encoded_user_data
    )
    print(f"{node} server created: {server.id}")
```

```
Creating node1...
node1 server created: 0e34eabd-fffd-450b-af60-a244b6a13614
Creating node2...
node2 server created: 5f228da3-7d3c-4f27-af04-510cae12d3c7
Creating node3...
node3 server created: 90278510-f28e-418b-8b74-98b7824777b9
```
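
Because the startup script is sent in base64 form, it can be worth decoding it again before launching anything, to confirm that the node name was substituted correctly. The sketch below reuses the script text from the cell above. The helper names `encode_user_data` and `decode_user_data` are hypothetical, and the block runs without any cloud connection.

```python
import base64

USER_DATA_TEMPLATE = """#!/bin/bash
echo '127.0.1.1 {node}-mlops' >> /etc/hosts
su cc -c /usr/local/bin/cc-load-public-keys
"""


def encode_user_data(node: str) -> str:
    """Fill in the node name and return the script as base64 text."""
    script = USER_DATA_TEMPLATE.format(node=node)
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


def decode_user_data(encoded: str) -> str:
    """Reverse encode_user_data, to inspect what the node will actually run."""
    return base64.b64decode(encoded).decode("utf-8")


decoded = decode_user_data(encode_user_data("node1"))
print(decoded)
```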


```python
server = conn.get_server("90278510-f28e-418b-8b74-98b7824777b9")
print(server.status)
```

```
ACTIVE
```
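
The check above looks at one server at one moment. A minimal polling sketch that waits for all three servers might look like the following. The function name `wait_for_status`, the `ERROR` failure status, and the timeout and interval defaults are assumptions chosen for illustration, not values from this experiment. The status lookup is passed in as a function so that the example can run against a stand-in instead of a real cloud.

```python
import time


def wait_for_status(get_status, server_ids, target="ACTIVE", failure="ERROR",
                    timeout=1800, interval=30, sleep=time.sleep):
    """Poll until every server reports `target`; return IDs in the order they got there."""
    pending = set(server_ids)
    done = []
    deadline = time.monotonic() + timeout
    while pending:
        for server_id in sorted(pending):  # sorted() copies, so `pending` can shrink
            status = get_status(server_id)
            if status == failure:
                raise RuntimeError(f"server {server_id} entered {failure}")
            if status == target:
                pending.discard(server_id)
                done.append(server_id)
        if pending:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"still waiting for: {sorted(pending)}")
            sleep(interval)
    return done


# Stand-in for the cloud: "a" needs two polls, "b" is ready immediately
fake = {"a": iter(["BUILD", "ACTIVE"]), "b": iter(["ACTIVE"])}
order = wait_for_status(lambda sid: next(fake[sid]), ["a", "b"],
                        interval=0, sleep=lambda seconds: None)
print(order)
```

In the notebook, `get_status` could be `lambda sid: conn.get_server(sid).status`, reusing the call from the cell above.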


## Experiment topology

In this experiment, we will deploy a 3-node Kubernetes cluster on Chameleon instances. The Kubernetes cluster will be self-managed, which means that the infrastructure provider is not responsible for setting up and maintaining our cluster; *we* are.

However, the cloud infrastructure provider will provide the compute resources and network resources that we need. We will provision the following resources for this experiment:

<figure>
<img src="images/lab-topology.svg" alt="Experiment topology." />
<figcaption aria-hidden="true">Experiment topology.</figcaption>
</figure>

## Provision a key

Before you begin, open this experiment on Trovi:

-   Use this link: [MLOps Pipeline](https://chameleoncloud.org/experiment/share/1eb302de-4707-4ae9-ae2d-391b9b8e5261) on Trovi
-   Then, click “Launch on Chameleon”. This will start a new Jupyter server for you, with the experiment materials already in it.

You will see several notebooks inside the `mlops-chi` directory - look for the one titled `0_intro.ipynb`. Open this notebook and execute the following cell (and make sure the correct project is selected):

Then, you may continue following along at [Build an MLOps Pipeline](https://teaching-on-testbeds.github.io/mlops-chi/).