# Level 1: GPU concurrency

This section introduces GPU scaling and workload management in an OpenShift cluster. Many workloads benefit from sharing GPU resources to improve utilisation:

- Low-batch inference serving that processes one input at a time  
- High-performance computing (HPC) applications that split work between the CPU (for input handling) and the GPU (for computation)  
- Interactive development in Jupyter Notebooks (the focus of this lab)  
- Spark- or Ray-based data analytics, where small tasks run concurrently and gain from higher GPU utilisation  
- Visualisation or offline rendering workloads that generate burst traffic  
- Continuous integration and continuous delivery (CI/CD) pipelines that use any available GPU for testing  

## Cluster overview

The lab uses a Single Node OpenShift (SNO) deployment. In the web console, navigate to **Compute → Nodes** to view node details.

<img src="images/ocp-node-sno.png"
     alt="OpenShift node details for the SNO deployment"
     style="width:100%;">

| Instance type | vCPUs | Memory (GiB) | NVIDIA L4 GPUs | GPU memory (GiB) | Network bandwidth (Gbps) | EBS bandwidth (Gbps) |
| ------------- | ----- | ------------ | -------------- | ---------------- | ------------------------ | -------------------- |
| g6.8xlarge    | 32    | 128          | 1              | 24               | 25                       | 16                   |

The **g6.8xlarge** instance includes a single **NVIDIA L4** GPU with 24 GiB of VRAM.

> **Rule of thumb**: A large language model (LLM) with 1 billion parameters stored in FP32 format requires roughly 4 GiB of GPU memory. Each parameter occupies 4 bytes.

For example, the **Llama 405B** model needs eleven 80 GiB GPUs when loaded in FP16. (Source: Nir Shavit, MIT and Neural Magic.)

<img src="images/llama-405b.png"
     alt="Llama 405B GPU requirements"
     style="width:75%;">

The size of a model loaded into VRAM can be estimated with the following formula:

<img src="images/model-sizing-formula.png"
     alt="Model Sizing Formula"
     style="width:50%;">


| Symbol | Description                                                  |
|--------|--------------------------------------------------------------|
| M      | GPU memory required                                          |
| P      | The number of parameters in the model                        |
| 4b     | 4 bytes, the size of each parameter at 32-bit precision      |
| 32     | The number of bits in 4 bytes                                |
| Q      | The number of bits used to load the model: 16, 8, or 4       |
| 1.2    | A 20% overhead for additional items loaded into GPU memory   |
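
Written out as text, and assuming the image matches the symbol table and the worked example in this section, the formula reads:

$$
M = \frac{P \times 4\ \text{bytes}}{32 / Q} \times 1.2
$$

The result is in bytes; divide by $1024^3$ to express it in GiB.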

For example, [Granite 3.3 8B Instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct/tree/main) is an 8-billion-parameter model loaded in FP16, so the calculation is:

```
(((8*10^9*4)/(32/16))*1.2) / 1024^3 = 17.9 GiB
```
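
The same calculation can be sketched in Python to try other model sizes and precisions. The function name `estimate_gpu_memory_gib` and its parameter names are illustrative choices, not part of any library, and the 20% overhead is the rule-of-thumb value from the table rather than a measured figure.

```python
def estimate_gpu_memory_gib(params: float, load_bits: int = 16, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (in GiB) needed to load a model.

    Implements M = (P * 4 bytes) / (32 / Q) * overhead from this section.
    """
    bytes_at_fp32 = params * 4        # 4 bytes per parameter at 32-bit precision
    shrink_factor = 32 / load_bits    # how much smaller the weights are at Q bits
    total_bytes = bytes_at_fp32 / shrink_factor * overhead
    return total_bytes / 1024**3      # convert bytes to GiB


# Compare the 8B-parameter example at the three precisions named in the table.
for bits in (16, 8, 4):
    print(f"8B parameters at {bits}-bit: {estimate_gpu_memory_gib(8e9, bits):.1f} GiB")
```

At 16 bits this reproduces the 17.9 GiB figure above, which fits within the 24 GiB of a single L4.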

Large LLMs can span multiple GPUs on a single instance or multiple instances across the cluster. For details on AWS accelerated instances, see the [EC2 instance types](https://aws.amazon.com/ec2/instance-types/).

The [NVIDIA L4](https://www.nvidia.com/en-us/data-center/l4/) is an entry-level, cost-effective GPU based on the [Ada Lovelace architecture](https://www.nvidia.com/en-us/technologies/ada-architecture/).

## Scaling options

You can add GPU capacity in several ways:

1. **Vertical scaling**: Reinstall OpenShift on a larger instance that contains multiple GPUs, for example a g6.12xlarge (4× L4) or g6.48xlarge (8× L4).  
2. **Larger accelerators**: Choose instances that provide higher-end GPUs, such as the P4, P5, or P6 families (Ampere, Hopper, and Blackwell architectures, respectively).  
3. **Horizontal scaling** (lab focus): Add worker nodes that each include one or more GPUs.  
4. **Alternative vendors**: Use non-NVIDIA GPUs. OpenShift supports multiple accelerator types, but this lab uses NVIDIA hardware exclusively.

Before you decide how to scale, review workload requirements, especially GPU memory needs, to ensure that you select the most appropriate instance type and accelerator.
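
As a rough aid for that review, the sketch below lists which of the G6 instance types mentioned in this section have enough combined GPU memory for a given requirement. The GPU counts and the 24 GiB per L4 come from this section; the helper name `instances_that_fit` is hypothetical, and the check assumes the model can be split evenly across every GPU on the instance, which ignores per-GPU overhead.

```python
# GPU counts and per-GPU memory as quoted in this section (each L4 has 24 GiB).
INSTANCES = {
    "g6.8xlarge": {"gpus": 1, "gib_per_gpu": 24},
    "g6.12xlarge": {"gpus": 4, "gib_per_gpu": 24},
    "g6.48xlarge": {"gpus": 8, "gib_per_gpu": 24},
}


def instances_that_fit(required_gib: float) -> list[str]:
    """Return the instance types whose combined GPU memory covers the requirement.

    Assumption: the model can be sharded across all GPUs on the instance.
    """
    return [
        name
        for name, spec in INSTANCES.items()
        if spec["gpus"] * spec["gib_per_gpu"] >= required_gib
    ]


print(instances_that_fit(17.9))  # an 8B-parameter model in FP16
print(instances_that_fit(40))    # a requirement larger than one L4
```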

For this configuration to take effect, we need to label our node with the device plugin config for the A10.

## GPU Sharing

GPU sharing allows multiple workloads to share a single GPU. This is useful for small workloads that do not need the full power of the GPU at the same time. There are three main techniques:

- Time-slicing
- Multi-Instance GPUs (MIG)
- CUDA Multi-Process Service (MPS)

Different applications place different computational demands on GPUs. Training very large AI models, where the GPUs batch-process hundreds of data samples in parallel, keeps the GPUs fully utilized throughout the training run. Many other application types need only a fraction of the GPU's compute, which leaves much of its computational power underutilized.

<img src="images/gpu-concurrency.png"
     alt="GPU concurrency options"
     style="width:50%;">

[Read this blog post](https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/) to get a good understanding of these different types of sharing in depth.

If our OpenShift cluster were deployed on VMs, we could consider using vGPU (for example, with OpenShift Virtualization or VMware). Since we are on AWS, we can consider **time-slicing** or **MIG**, depending on the type of accelerator cards available.

Here are some pros and cons of the GPU sharing methods:

<img src="images/gpu-accel-capability.png"
     alt="GPU accelerator capability"
     style="width:50%;">

**MIG** gives the highest level of physical workload partitioning but is only available on the (more expensive) Blackwell, Hopper, and Ampere architectures. See [here for a full list](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html) of cards supporting MIG.

**Time-slicing** is available with all GPU cards, but does not offer the same level of isolation. Think of it like CPU-based process sharing, where workloads take turns on the same device.

**The Multi-Process Service (MPS)** is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). MPS enables multiple CUDA applications to run concurrently on the same GPU, improving overall GPU utilization and performance. MPS is not currently supported on OpenShift.

We are going to use **time-slicing** in this exercise. 

This approach is particularly valuable for:

* **AI inference workloads** that don't fully utilize GPU resources
* **Development and testing environments** requiring flexible GPU access
* **Multi-tenant scenarios** where isolation requirements are less stringent
* **Edge deployments** with limited GPU resources
* **Cost optimization** by maximizing GPU utilization

Time-slicing lets organizations improve GPU utilization while reducing costs, making AI workloads more accessible and economical.
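
To make the CPU scheduling analogy concrete, here is a toy round-robin simulation in Python. It is only an illustration of workloads taking turns: real GPU time-slicing is handled by the NVIDIA driver and device plugin, and the workload names, the time units, and the `round_robin` helper are all made up for this sketch.

```python
from collections import deque


def round_robin(workloads: dict[str, int], quantum: int = 2) -> list[tuple[str, int]]:
    """Return the order in which workloads occupy the (simulated) GPU.

    `workloads` maps a name to the units of GPU time it still needs.
    Each turn, a workload runs for at most `quantum` units, then yields.
    """
    queue = deque(workloads.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(quantum, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Not finished yet: go to the back of the queue and wait for another turn.
            queue.append((name, remaining - ran))
    return timeline


print(round_robin({"notebook": 3, "inference": 5}))
```

Neither workload waits for the other to finish completely, but they also do not run at the same instant and nothing isolates their memory, which is the trade-off against MIG described above.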

vLLM (our inference serving engine in RHOAI and RHAIIS) provides some useful benefits when used with time-slicing:

* **Multiple Model Serving**: Different vLLM instances can serve different models on the same GPU
* **Improved Throughput**: Better utilization through temporal multiplexing
* **Cost Reduction**: Serve more models with fewer GPUs
* **Flexibility**: Dynamic allocation of GPU resources based on demand

In this lab we prove out the single-GPU-per-node, multi-node example. It lays the groundwork for more advanced scenarios, such as:

* Deploying multiple vLLM instances serving different models
* Implementing cost optimization strategies using time-slicing
* Configuring mixed workloads (training and inference)
* Setting up auto-scaling based on GPU utilization metrics

OK - so let's choose a different AWS [instance type](https://aws.amazon.com/ec2/instance-types/) to scale our cluster out with. The **g5.xlarge** looks OK.

| Instance type | vCPUs | Memory (GiB) | NVIDIA A10G GPUs | GPU memory (GiB) | Network bandwidth (Gbps) | EBS bandwidth (Gbps) |
| ------------- | ----- | ------------ | ---------------- | ---------------- | ------------------------ | -------------------- |
| g5.xlarge     | 4     | 16           | 1                | 24               | 10                       | 3.5                  |

The **g5.xlarge** instance comes with a single **NVIDIA A10G** [GPU](https://www.nvidia.com/en-au/data-center/products/a10-gpu/) that has 24 GiB of VRAM.

## Scaling OpenShift - Adding a GPU worker node to OpenShift in AWS

We could redeploy our cluster on an instance that supports more GPUs, but this would take some time. A quicker option is to use OpenShift's built-in cluster scaling ability. If you browse to **Compute > MachineSets** in the OpenShift console, you will see that three MachineSets are already defined.

<img src="images/machine-sets-ootb.png"
     alt="Available MachineSets"
     style="width:75%;">

These were set up when we deployed our Single Node OpenShift instance. They use instance types without GPUs.

Let's install a couple of Python dependencies that we are going to use in this notebook.

In [None]:
!pip install uv

In [None]:
!uv pip install rich jq

Collecting rich
  Downloading rich-14.0.0-py3-none-any.whl.metadata (18 kB)
Collecting jq
  Downloading jq-1.9.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting markdown-it-py>=2.2.0 (from rich)
  Downloading markdown_it_py-3.0.0-py3-none-any.whl.metadata (6.9 kB)
Collecting mdurl~=0.1 (from markdown-it-py>=2.2.0->rich)
  Downloading mdurl-0.1.2-py3-none-any.whl.metadata (1.6 kB)
Downloading rich-14.0.0-py3-none-any.whl (243 kB)
Downloading jq-1.9.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (754 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 754.3/754.3 kB 58.2 MB/s eta 0:00:00
Downloading markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Installing collected packages: mdurl, jq, markdown-it-py, rich
Successfully installed jq-1.9.1 markdown-it-py-3.0.0 mdurl-0.1.2 rich-14.0.0


## Log on to the OpenShift cluster so we can add a new node

Now log in to OpenShift using the command line. Make sure the `ADMIN_PASSWORD` and `BASE_DOMAIN` variables are set in your environment.
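
A quick check like the following can confirm that both variables are present before you run the login cell. It is a minimal sketch: the variable names are taken from the `oc login` command in this section, and the helper name `missing_vars` is made up for illustration.

```python
import os

# The two variables referenced by the oc login command in this section.
REQUIRED_VARS = ("ADMIN_PASSWORD", "BASE_DOMAIN")


def missing_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_vars()
if missing:
    print("Set these variables before logging in:", ", ".join(missing))
else:
    print("Environment looks ready for oc login.")
```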

In [None]:
!oc login -u admin -p ${ADMIN_PASSWORD} --server=https://api.${BASE_DOMAIN}:6443 --insecure-skip-tls-verify


Login successful.

You have access to 106 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "ai-roadshow".
Welcome! See 'oc help' to get started.


## Review the existing Machine in the cluster
Now we can take a look at the machine that forms our Single Node OpenShift cluster.

Observe the master node's TYPE, REGION, and ZONE.

In [3]:
!oc get machines.machine.openshift.io -A

NAMESPACE               NAME                 PHASE     TYPE         REGION      ZONE         AGE
openshift-machine-api   sno-5dqmr-master-0   Running   g6.8xlarge   us-east-2   us-east-2a   4d19h


### Review the results
We can see our master node, which is a g6.8xlarge instance, as well as the AWS Region and Zone it is running in.

The Region **us-east-2** has three Availability Zones (AZs): us-east-2a, us-east-2b, and us-east-2c.

The Region **ap-southeast-2** has three Availability Zones (AZs): ap-southeast-2a, ap-southeast-2b, and ap-southeast-2c.
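
If you prefer to read these values programmatically, the sketch below pulls the same columns out of the JSON form of the Machine list. It assumes the instance type, region, and zone are exposed as `machine.openshift.io/...` labels on each Machine; the function names are illustrative, and `fetch_machines` needs an active `oc` login, so the example runs against a small sample shaped like the output above.

```python
import json
import subprocess


def summarize_machines(machine_list: dict) -> list[dict]:
    """Extract name, phase, type, region, and zone from a Machine list."""
    rows = []
    for item in machine_list.get("items", []):
        labels = item["metadata"].get("labels", {})
        rows.append({
            "name": item["metadata"]["name"],
            "phase": item.get("status", {}).get("phase"),
            # Assumed label keys; check them against your own cluster.
            "type": labels.get("machine.openshift.io/instance-type"),
            "region": labels.get("machine.openshift.io/region"),
            "zone": labels.get("machine.openshift.io/zone"),
        })
    return rows


def fetch_machines() -> dict:
    """Run oc and return the parsed Machine list (requires an active login)."""
    result = subprocess.run(
        ["oc", "get", "machines.machine.openshift.io",
         "-n", "openshift-machine-api", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Sample data shaped like the cluster shown above.
sample = {
    "items": [{
        "metadata": {
            "name": "sno-5dqmr-master-0",
            "labels": {
                "machine.openshift.io/instance-type": "g6.8xlarge",
                "machine.openshift.io/region": "us-east-2",
                "machine.openshift.io/zone": "us-east-2a",
            },
        },
        "status": {"phase": "Running"},
    }]
}
print(summarize_machines(sample))
```

Against a live cluster, call `summarize_machines(fetch_machines())` instead of using the sample.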

## Review the MachineSets

MachineSets define the types of Machines that can be added to the OpenShift cluster - think of these as a template for creating new Machines. 

These are part of the Machine API, which automates the management of machines within an OpenShift cluster. Essentially, it acts as a controller, automatically creating, updating, or deleting machines to match the defined configuration.

In [4]:
!oc get machineset.machine.openshift.io -A

NAMESPACE               NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   sno-5dqmr-worker-us-east-2a   0         0                             4d19h
openshift-machine-api   sno-5dqmr-worker-us-east-2b   0         0                             4d19h
openshift-machine-api   sno-5dqmr-worker-us-east-2c   0         0                             4d19h


## Create a new MachineSet specification from the existing MachineSets
Each ZONE within a REGION has its own MachineSet.

We are going to create a new GPU-enabled MachineSet. We can choose any of the available ZONEs in our REGION.

Let's use the same ZONE that our current node is running in and dump the configuration to a file. Choose the MachineSet `NAME` from above that matches your ZONE and replace it in the following code. For example:

`%env SOURCE_MACHINESET=<paste a machine set here from the NAME column in the previous step>`

In [5]:
%env SOURCE_MACHINESET=sno-5dqmr-worker-us-east-2a
!oc -n openshift-machine-api get -o json machineset.machine.openshift.io $SOURCE_MACHINESET > source-machineset.json

env: SOURCE_MACHINESET=sno-5dqmr-worker-us-east-2a
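
Instead of copying the name by hand, you could look it up from the MachineSet list by zone. This sketch assumes the AWS provider layout, where the zone lives under `spec.template.spec.providerSpec.value.placement.availabilityZone`; the helper name `machineset_for_zone` is hypothetical.

```python
def machineset_for_zone(machineset_list: dict, zone: str):
    """Return the name of the first MachineSet placed in `zone`, or None."""
    for item in machineset_list.get("items", []):
        provider = item["spec"]["template"]["spec"]["providerSpec"]["value"]
        # Assumed AWS field layout; other providers store the zone differently.
        if provider.get("placement", {}).get("availabilityZone") == zone:
            return item["metadata"]["name"]
    return None


def _machineset(name: str, zone: str) -> dict:
    """Build a minimal MachineSet-shaped dict for the example below."""
    return {
        "metadata": {"name": name},
        "spec": {"template": {"spec": {"providerSpec": {"value": {
            "placement": {"availabilityZone": zone, "region": "us-east-2"},
        }}}}},
    }


# Sample list shaped like the three MachineSets shown earlier.
sample_list = {"items": [
    _machineset("sno-5dqmr-worker-us-east-2a", "us-east-2a"),
    _machineset("sno-5dqmr-worker-us-east-2b", "us-east-2b"),
    _machineset("sno-5dqmr-worker-us-east-2c", "us-east-2c"),
]}
print(machineset_for_zone(sample_list, "us-east-2a"))
```

On a live cluster, the input would be the parsed output of `oc get machineset.machine.openshift.io -n openshift-machine-api -o json`.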


## Edit the MachineSet to create the new node with an NVIDIA A10G GPU
The MachineSet definition is long and complex. To save you from editing it by hand, the cell below takes an existing MachineSet and repurposes it to create our new node with an A10G GPU.

We are going to change a few settings using the **jq** library so that they match your environment:
- the new GPU instance type, **g5.xlarge**, which has an NVIDIA A10G GPU
- a new MachineSet **name**, `sno-5dqmr-worker-us-east-2a-gpu` (**edit** the cell below to match the MachineSet name you want)
- the machine **replicas**, set to 1

In [6]:
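import json

import jq
from rich import print

# Load the MachineSet definition exported in the previous step.
with open('source-machineset.json') as f:
  machineset_data = json.load(f)

# jq program: switch to the GPU instance type, rename the MachineSet, request one
# replica, and drop server-generated fields so the object can be created as new.
# EDIT the .metadata.name value below to match your environment.
transform = """.spec.template.spec.providerSpec.value.instanceType = "g5.xlarge"
  | .metadata.name = "sno-5dqmr-worker-us-east-2a-gpu"
  | .spec.replicas = 1
  | del(.metadata.selfLink)
  | del(.metadata.uid)
  | del(.metadata.creationTimestamp)
  | del(.metadata.resourceVersion)
  | del(.status)"""

# jq.all returns a list of results; this program yields a single object.
transformed_result = jq.all(transform, machineset_data)
print(transformed_result)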

We can see all of the MachineSet details including the AMI, root disk size as well as Security Groups and Networking that we will keep. These were all configured at OpenShift install time.

Let's write out our new GPU-enabled MachineSet to a file.

In [7]:
with open("gpu-machineset.json", "w") as file:
    file.write(json.dumps(transformed_result[0]))
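
For comparison, the same edits can be expressed in plain Python without jq. This is an alternative sketch rather than the notebook's method: the function name `make_gpu_machineset` is made up, and it mirrors only the fields changed by the jq program above.

```python
import copy


def make_gpu_machineset(source: dict, name: str,
                        instance_type: str = "g5.xlarge", replicas: int = 1) -> dict:
    """Return a modified copy of `source` ready to be created as a new MachineSet."""
    ms = copy.deepcopy(source)  # leave the caller's dict untouched
    ms["metadata"]["name"] = name
    ms["spec"]["replicas"] = replicas
    ms["spec"]["template"]["spec"]["providerSpec"]["value"]["instanceType"] = instance_type
    # Drop server-generated fields, as the jq program does.
    for key in ("selfLink", "uid", "creationTimestamp", "resourceVersion"):
        ms["metadata"].pop(key, None)
    ms.pop("status", None)
    return ms


# Minimal stand-in for the exported MachineSet; the instance type here is a placeholder.
example_source = {
    "metadata": {"name": "sno-5dqmr-worker-us-east-2a", "uid": "abc", "resourceVersion": "1"},
    "spec": {
        "replicas": 0,
        "template": {"spec": {"providerSpec": {"value": {"instanceType": "placeholder"}}}},
    },
    "status": {"replicas": 0},
}
print(make_gpu_machineset(example_source, "sno-5dqmr-worker-us-east-2a-gpu"))
```

Working on a deep copy keeps the original export available in case you want to generate a second variant, for example in a different zone.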

## Create the new machine
Using the new MachineSet specification, we will add the new node to the cluster.

In [8]:
!oc create -f gpu-machineset.json

machineset.machine.openshift.io/sno-5dqmr-worker-us-east-2a-gpu created


### Observe the results

Provisioning a new machine takes about **5–10 minutes**. Track the progress in the OpenShift console:

1. In the navigation pane, choose **Compute > MachineSets**.  
   Confirm that the new MachineSet appears.
2. Select **Compute > Machines**.  
   Watch the machine progress through its lifecycle. When provisioning completes, the status changes to **Provisioned as Node**.
   You can also see the node appear under **Compute > Nodes**.
3. Click the new Machine.  
   Explore the details of the new Machine.

<img src="images/gpu-machine-set.png"
     alt="Machine creation progress in the Machines list"
     style="width:50%;">
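
If you would rather wait from the notebook than watch the console, a simple polling loop works. The sketch below is one way to do it under stated assumptions: `wait_for_phase` and `machine_phase` are hypothetical helpers, the 15-minute timeout is an arbitrary margin over the 5 to 10 minutes quoted above, and the demonstration uses a canned sequence of phases so that it runs without a cluster.

```python
import subprocess
import time


def wait_for_phase(get_phase, target="Running", timeout_s=900, interval_s=15, sleep=time.sleep):
    """Poll get_phase() until it returns `target`; return False on timeout."""
    waited = 0
    while waited <= timeout_s:
        if get_phase() == target:
            return True
        sleep(interval_s)
        waited += interval_s
    return False


def machine_phase(name: str) -> str:
    """Read .status.phase for one Machine using oc (requires an active login)."""
    result = subprocess.run(
        ["oc", "get", "machines.machine.openshift.io", name,
         "-n", "openshift-machine-api", "-o", "jsonpath={.status.phase}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


# Demonstration with canned phases and no real sleeping.
phases = iter(["Provisioning", "Provisioned", "Running"])
print(wait_for_phase(lambda: next(phases), sleep=lambda seconds: None))
```

Against a live cluster you would pass something like `lambda: machine_phase("<your machine name>")` as `get_phase`, using the Machine name shown in the console.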

## AWS Quota issues

<div class="alert alert-block alert-warning">
<b>WARNING:</b> Only follow the steps in this section if you see a warning about insufficient quota.
</div>

Occasionally the AWS account does not have sufficient quota to provision the node, or the selected Availability Zone does not have enough capacity for the requested EC2 instance type. The following steps help address these problems.

To review why a node failed to provision, open the *Compute* menu:
1. Click *Machines*.  
   OpenShift displays the Machines in the cluster.
2. Click the Machine that failed to provision.  
   The failure reason is described in the *Conditions* section of the Machine's details.

<img src="images/worker-insufficient-resources.png"
     alt="Insufficient AWS quota available"
     style="width:50%;">
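
The same information can be read from the Machine's JSON. The sketch below assumes the Machine status carries `phase`, `errorReason`, `errorMessage`, and a `conditions` list; the helper name `failure_details` is hypothetical, and the sample values are invented for illustration rather than captured from a real failure.

```python
def failure_details(machine: dict) -> dict:
    """Collect the fields that usually explain why a Machine failed to provision."""
    status = machine.get("status", {})
    return {
        "phase": status.get("phase"),
        "errorReason": status.get("errorReason"),
        "errorMessage": status.get("errorMessage"),
        # Each condition becomes a (type, status, reason, message) tuple.
        "conditions": [
            (c.get("type"), c.get("status"), c.get("reason"), c.get("message"))
            for c in status.get("conditions", [])
        ],
    }


# Invented sample; real messages from AWS will differ.
failed_machine = {
    "metadata": {"name": "example-gpu-machine"},
    "status": {
        "phase": "Failed",
        "errorReason": "InvalidConfiguration",
        "errorMessage": "example: insufficient capacity in the selected zone",
        "conditions": [
            {"type": "InstanceExists", "status": "False",
             "reason": "InstanceMissing", "message": "example condition message"},
        ],
    },
}
print(failure_details(failed_machine))
```

The input would come from `oc get machines.machine.openshift.io <name> -n openshift-machine-api -o json` on a live cluster.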

To delete the failed Machine and create a machine in a different Availability Zone, open the notebook <a href="https://github.com/odh-labs/rhoai-roadshow/blob/main/site/docs/6-gpuaas/notebooks/Level1_quota_issue.ipynb" target="_blank">Level1_quota_issue.ipynb</a>

<div class="alert alert-block alert-warning">
<b>WARNING:</b> End of warning.
</div>

After a few minutes, let's check that our new node has been added to the cluster correctly. You should see it as the node with the `worker` ROLE. It may report `NotReady` for a short while after it first appears.

In [9]:
!oc get nodes

NAME                                        STATUS     ROLES                         AGE     VERSION
ip-10-0-15-75.us-east-2.compute.internal    NotReady   worker                        35s     v1.32.5
ip-10-0-29-181.us-east-2.compute.internal   Ready      control-plane,master,worker   4d19h   v1.32.5
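
The STATUS column is derived from each node's `Ready` condition. The sketch below shows how to read that condition from the JSON form of the node list; `node_readiness` is an illustrative helper, and the sample mirrors the two nodes shown above rather than live data.

```python
def node_readiness(node_list: dict) -> dict:
    """Map each node name to True when its Ready condition has status "True"."""
    readiness = {}
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        readiness[node["metadata"]["name"]] = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
    return readiness


# Sample shaped like the output above: the new worker has not become Ready yet.
sample_nodes = {"items": [
    {"metadata": {"name": "ip-10-0-15-75.us-east-2.compute.internal"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"metadata": {"name": "ip-10-0-29-181.us-east-2.compute.internal"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
]}
print(node_readiness(sample_nodes))
```

On a live cluster, feed it the parsed output of `oc get nodes -o json` and rerun until every value is `True`.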


We can also check the machine types again.

In [10]:
!oc get machines.machine.openshift.io -A

NAMESPACE               NAME                                    PHASE     TYPE         REGION      ZONE         AGE
openshift-machine-api   sno-5dqmr-master-0                      Running   g6.8xlarge   us-east-2   us-east-2a   4d19h
openshift-machine-api   sno-5dqmr-worker-us-east-2a-gpu-kcx4p   Running   g5.xlarge    us-east-2   us-east-2a   4m16s


<div class="alert alert-block alert-success">
<b>Success:</b> We have successfully added a new GPU worker node to our cluster.
</div>

Continue to the [next notebook](./Level2_gpu_operator.ipynb) to learn how to configure the GPU Operator.