# Lab 1: Serving a Model with vLLM

This section walks through how to serve an open source model using vLLM (the upstream project of Red Hat AI Inference Server).

### What is an Inference Server?
An inference server is the piece of software that allows artificial intelligence (AI) applications to communicate with large language models (LLMs) and generate a response based on data. This process is called inference. It’s where the business value happens and the end result is delivered.

### How does Red Hat AI Inference Server work?
Red Hat AI Inference Server provides fast and cost-effective inference at scale. Its open source nature allows it to support any generative AI (gen AI) model, on any AI accelerator, in any cloud environment. 

Powered by [vLLM](https://docs.vllm.ai/en/latest/), the inference server maximizes GPU utilization and enables faster response times. Combined with LLM Compressor capabilities, inference efficiency increases without sacrificing performance. With cross-platform adaptability and a growing community of contributors, vLLM is emerging as the Linux® of gen AI inference.

### Red Hat AI Repository
The Red Hat AI repository on Hugging Face is an open-source initiative backed by deep collaboration between IBM and Red Hat’s research, engineering, and business units. We’re committed to making AI more accessible, efficient, and community-driven from research to production.

We believe the future of AI is open. That’s why we’re sharing our latest models and research on Hugging Face, which are freely available to help researchers, developers, and organizations deploy high-performance AI at scale.

Link to the Red Hat AI repository: https://huggingface.co/RedHatAI

### Now, let's start with serving an open source model using the upstream vLLM library.

---
# Start the vLLM server

## Quick commands:
* To serve a model, run the command `vllm serve` from a terminal.  
* To view all of the `vllm serve` options, type the following into a terminal: `vllm serve -h`.

For this lab we will serve the validated model `Llama-3.2-1B-Instruct-FP8` from the Red Hat AI repository on Hugging Face. To view the model details, open a new browser tab and paste the following URL into the address bar: `https://huggingface.co/RedHatAI/Llama-3.2-1B-Instruct-FP8`.

## Serve the model
This part of the lab requires you to work in a terminal. To open a new terminal window:  
1. Click **File > New > Terminal** in the JupyterLab toolbar.
   JupyterLab opens a new tab with a bash terminal session.

2. Click the terminal's tab and drag it to the right side of the screen.
   JupyterLab docks the terminal to the right of this notebook.

3. Highlight the following text: `vllm serve RedHatAI/Llama-3.2-1B-Instruct-FP8 --port 8000 --tensor-parallel-size 1`  
4. With the text highlighted, press **Ctrl+C** to copy the command to the clipboard.
5. In the terminal window, right-click.
   JupyterLab displays the copy/paste popup menu.

6. Click **Paste**.
   The command is pasted into the terminal window. Verify that it is correct and make any necessary corrections.

![image.png](attachment:4ff271c3-2798-4a5c-bdb9-2ae538fdc657.png)

7. Press **Enter**.  
   Wait for the model server to start.  


Once the server log shows `Application startup complete`, the model is being served and we can start sending inference requests to it.  

![image.png](attachment:919109e9-1825-43a9-b2a7-1f144347e7d7.png)
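Instead of watching the terminal for the startup message, readiness can also be checked from code. The following is a minimal sketch, assuming the server listens on `http://localhost:8000` as in the command above and answers `GET /v1/models` once it is up. The helper names `models_endpoint_ok` and `wait_until_ready` are hypothetical and are not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str = "http://localhost:8000") -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not ready yet".
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In a notebook cell, `wait_until_ready(models_endpoint_ok)` would block until the server responds or five minutes pass. The probe is passed in as a function so the waiting logic can be exercised without a running server.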

---
We can use curl to test the endpoint at http://localhost:8000.
##### vLLM provides an HTTP server that implements ***OpenAI's Completions API, Chat API, and more!*** This functionality lets you serve models and interact with them using an HTTP client.

In [2]:
!curl -X POST -H "Content-Type: application/json" -d '{ \
    "prompt": "What is the capital of France?", \
    "max_tokens": 50 \
}' http://localhost:8000/v1/completions

{"id":"cmpl-de0f500d61144871b699c307f6db7da4","object":"text_completion","created":1752538248,"model":"RedHatAI/Llama-3.2-1B-Instruct-FP8","choices":[{"index":0,"text":" Paris.\nThe capital of France is indeed Paris. It is the most famous city in France and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.\n\nParis is a city that has","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":8,"total_tokens":58,"completion_tokens":50,"prompt_tokens_details":null},"kv_transfer_params":null}

As the output above shows, the model server returned a completion for the prompt.
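The same request can be sent from Python. The sketch below mirrors the curl call using only the standard library. It assumes the server is still running on `http://localhost:8000`, and the function names (`build_completion_request`, `extract_text`, `complete`) are hypothetical helpers written for this lab, not a vLLM API.

```python
import json
import urllib.request


def build_completion_request(prompt: str, max_tokens: int = 50) -> bytes:
    """Encode the same JSON body that the curl command sends."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")


def extract_text(response_body: str) -> str:
    """Return the generated text of the first choice in a completions response."""
    return json.loads(response_body)["choices"][0]["text"]


def complete(prompt: str, max_tokens: int = 50,
             base_url: str = "http://localhost:8000") -> str:
    """POST a prompt to /v1/completions and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=build_completion_request(prompt, max_tokens),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return extract_text(resp.read().decode("utf-8"))
```

Calling `complete("What is the capital of France?")` while the server is up should return text similar to the `"text"` field in the curl output, although the exact wording may differ between runs.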

#### Model Serving Key Metrics

Use the following metrics to evaluate the performance of the LLM being served with AI Inference Server:

- ***Time to first token (TTFT)***: How long does it take for the model to provide the first token of its response?
- ***Time per output token (TPOT)***: After the first token, how long does it take on average for the model to provide each further output token to a user who has sent a request?
- ***Latency***: How long does it take for the model to generate a complete response?
- ***Throughput***: How many output tokens can the model produce per second, across all users and requests?
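To make these definitions concrete, the sketch below computes all four metrics from per-request timestamps. The `RequestTiming` record and the function names are hypothetical and exist only for illustration. TPOT is computed here as the time after the first token divided by the number of remaining tokens, which is one common convention and may differ from how a specific benchmark tool defines it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Hypothetical per-request record; all times in seconds on a shared clock."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def ttft(r: RequestTiming) -> float:
    """Time to first token: first token arrival minus send time."""
    return r.token_times[0] - r.sent_at


def latency(r: RequestTiming) -> float:
    """End-to-end latency: last token arrival minus send time."""
    return r.token_times[-1] - r.sent_at


def tpot(r: RequestTiming) -> float:
    """Time per output token, averaged over the tokens after the first one."""
    remaining = len(r.token_times) - 1
    if remaining < 1:
        return 0.0
    return (latency(r) - ttft(r)) / remaining


def throughput(requests: List[RequestTiming]) -> float:
    """Output tokens per second across all requests in the measured window."""
    start = min(r.sent_at for r in requests)
    end = max(r.token_times[-1] for r in requests)
    total_tokens = sum(len(r.token_times) for r in requests)
    return total_tokens / (end - start)
```

Note that throughput is measured over the whole run rather than per request, which is why it can rise with concurrency even while each user's TTFT and TPOT get worse.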

Now, let's run a benchmark test against vLLM.  

To do that, we need to clone the vLLM Git repository and use its `benchmark_serving.py` script.

In [4]:
!git clone https://github.com/vllm-project/vllm.git

Cloning into 'vllm'...
remote: Enumerating objects: 90360, done.
remote: Counting objects: 100% (336/336), done.
remote: Compressing objects: 100% (229/229), done.
remote: Total 90360 (delta 229), reused 108 (delta 107), pack-reused 90024 (from 3)
Receiving objects: 100% (90360/90360), 63.22 MiB | 38.42 MiB/s, done.
Resolving deltas: 100% (71124/71124), done.


Now we run the following command to benchmark the model server with 100 random prompts of 200 input tokens and 200 output tokens each.

In [5]:
!python vllm/benchmarks/benchmark_serving.py \
--backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--num-prompts 100 --dataset-name random --random-input-len 200 --random-output-len 200 --port 8000

INFO 07-15 00:11:44 [__init__.py:244] Automatically detected platform cuda.
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, no_stream=False, max_concurrency=None, model='RedHatAI/Llama-3.2-1B-Instruct-FP8', tokenizer=None, use_beam_search=False, num_prompts=100, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, custom_output_len=256, custom_skip_chat_template=False, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=200, random_output_len=200, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p

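You may need a Hugging Face token to pull the model for the test above. Because `!export` runs in a temporary subshell and does not persist, set the token in the notebook with: \
`%env HF_TOKEN=hf_xxxx` \
Or, use `huggingface-cli login`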

---
It is interesting to compare the benchmark results with published results, such as the NVIDIA NIM serving benchmark at https://docs.nvidia.com/nim/benchmarking/llm/latest/performance.html#llama-3-1-8b-instruct-results.
![image.png](attachment:8436a1da-6029-4620-b84c-ec67642c1a9a.png)

In [6]:
from IPython.display import IFrame

IFrame("https://docs.nvidia.com/nim/benchmarking/llm/latest/performance.html#llama-3-1-8b-instruct-results", width=600, height=600)

---
#### Interpreting the performance benchmark

First, let's check what GPU we're using in the lab. We will need to log into the pod to check the GPU info.

Copy the login command from the web console.

![image.png](attachment:91851d10-48ba-48cd-b239-d7d898c637a7.png)

Run the command as shown below.

In [7]:
!oc login -u admin -p ${ADMIN_PASSWORD} --server=https://api.sno.${BASE_DOMAIN}:6443

Login successful.

You have access to 108 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "llama-serving".


List all the pods and find the name of the pod that hosts our workbench.

In [13]:
!oc project vllm-demo
!oc get pods

Already on project "vllm-demo" on server "https://api.sno.sandbox1745.opentlc.com:6443".
NAME     READY   STATUS    RESTARTS   AGE
vllm-0   2/2     Running   0          12m


Exec into the pod `vllm-0` and run the command `nvidia-smi`.

In [14]:
!oc exec vllm-0 -- nvidia-smi

Defaulted container "vllm" out of: vllm, oauth-proxy
Tue Jul 15 00:14:09 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      On  |   00000000:36:00.0 Off |                    0 |
| N/A   50C    P0             33W /   72W |   20888MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+-------------------

The output shows that we are using a **shared L4 GPU** with only **24 GB** of memory.
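The memory column of the `nvidia-smi` output also shows how much of that memory is already in use. As a small sketch, the hypothetical helpers below pull the used and total values out of the text with a regular expression, assuming the `<used>MiB / <total>MiB` layout seen in the output above.

```python
import re
from typing import Optional, Tuple

# Matches the "20888MiB /  23034MiB" style memory column of nvidia-smi.
_MEMORY_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def parse_gpu_memory(nvidia_smi_output: str) -> Optional[Tuple[int, int]]:
    """Return (used_mib, total_mib) for the first GPU found, or None."""
    match = _MEMORY_RE.search(nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def memory_used_fraction(used_mib: int, total_mib: int) -> float:
    """Fraction of GPU memory currently allocated."""
    return used_mib / total_mib
```

Applied to the output above, this gives 20888 MiB used out of 23034 MiB, or roughly 91% of the GPU memory allocated while the model server is running.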

Now, let's compare the results with the NVIDIA NIM benchmark test.
![image.png](attachment:200e2ccb-2bfe-4257-8786-a7912d75f54c.png)

Compared with the NVIDIA NIM performance benchmark on an H100, here are some highlights of vLLM's performance data.

| Metric                 | vLLM with L4 GPU | NIM with H100 GPU |
|------------------------|------------------|-------------------|
| GPU memory             | 24 GiB           | 80 GiB            |
| FP16                   | ~120 TFLOPS      | ~500-700 TFLOPS   |
| INT8                   | ~240 TOPS        | ~2000 TOPS        |
| Memory bandwidth       | 300 GB/s         | Up to 3.35 TB/s   |
| Mean TTFT (s)          | 0.29             | 0.12              |
| Mean ITL (s)           | 0.22             | 0.07              |
| Throughput (tokens/s)  | 8,032.07         | 12,214.97         |
| Approx. unit price ($) | 2,000            | 25,000            |
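We are using a GPU that is roughly ***12x*** cheaper (about $2,000 versus $25,000 per unit) with vLLM serving a Llama 3.2 model, and it delivered about **66%** of the H100 throughput (8,032 versus 12,215 tokens/s). Mean TTFT and mean inter-token latency (ITL) were roughly 2.4x and 3.1x higher, respectively.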
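The ratios behind this comparison can be recomputed directly from the table. The sketch below hard-codes the table values and uses hypothetical names throughout. The prices are the approximate unit prices listed above, so the cost-related figures are rough estimates only.

```python
from typing import Dict

# Values copied from the comparison table above.
L4 = {"ttft_s": 0.29, "itl_s": 0.22, "throughput_tok_s": 8032.07, "price_usd": 2000}
H100 = {"ttft_s": 0.12, "itl_s": 0.07, "throughput_tok_s": 12214.97, "price_usd": 25000}


def compare(candidate: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Express the candidate GPU's results relative to the baseline GPU."""
    cand_per_usd = candidate["throughput_tok_s"] / candidate["price_usd"]
    base_per_usd = baseline["throughput_tok_s"] / baseline["price_usd"]
    return {
        # How many times cheaper the candidate is.
        "price_ratio": baseline["price_usd"] / candidate["price_usd"],
        # Fraction of the baseline throughput the candidate reaches.
        "throughput_fraction": candidate["throughput_tok_s"] / baseline["throughput_tok_s"],
        # How many times slower the candidate is on each latency metric.
        "ttft_slowdown": candidate["ttft_s"] / baseline["ttft_s"],
        "itl_slowdown": candidate["itl_s"] / baseline["itl_s"],
        # Throughput per dollar of hardware, candidate relative to baseline.
        "throughput_per_usd_ratio": cand_per_usd / base_per_usd,
    }


summary = compare(L4, H100)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Looking at throughput per dollar rather than raw throughput is one way to frame the trade-off: the L4 is slower on every latency metric but serves noticeably more tokens per second for each dollar of hardware cost.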

---
#### 📌  Once you have finished this lab, please remember to stop the vLLM server by pressing **Ctrl+C** in its terminal.📌 

![image.png](attachment:21669c5d-0c72-42b3-8185-dc37ccf1d8e7.png)

---
This is the end of Lab 1 - Serving a model with vLLM.