## Load a model with vLLM and serve an OpenAI-compatible API

This notebook demonstrates how to load a model with vLLM and serve it for inference. We use the bitsandbytes-quantized merged model from the previous notebook.

In [1]:
%%writefile run_vllm.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 * number of GPUs CPU cores
## Leonardo Booster: 512 GB in total => request approx. 120 GB * number of GPUs requested
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:10:00

# Construct command to run container:
export CONTAINER="singularity run --nv --home=$HOME /leonardo/pub/userexternal/mpfister/vllm01.sif"

# Run AI scripts:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # Match the CPU cores requested via --cpus-per-task
export PORT=$((RANDOM % (60000 - 8000) + 8000))  # Choose a random port; bash's $RANDOM is 0..32767, so this yields 8000..40767
echo "VLLM will serve an OpenAI compatible AI at http://$HOSTNAME:$PORT/v1"
$CONTAINER python3 -m vllm.entrypoints.openai.api_server \
    --model output/llama-3.2-1b-instruct-guanaco-fsdp_merged_bnb \
    --served-model-name finetuned-phi-3.5-mini \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --api-key our-secret-api-key \
    --port $PORT

Writing run_vllm.slurm
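
The comments in the job script describe a per-GPU rule of thumb for Leonardo Booster: 8 CPU cores and about 120 GB of memory for each GPU requested. The following is a minimal sketch of that scaling; the function name `leonardo_booster_request` is hypothetical and only illustrates the arithmetic.

```python
def leonardo_booster_request(gpus: int) -> dict:
    """Scale the per-GPU rule of thumb from the job script comments.

    Hypothetical helper: 8 CPU cores and 120 GB of memory per GPU,
    with at most 4 GPUs on one Leonardo Booster node.
    """
    if not 1 <= gpus <= 4:
        raise ValueError("a Leonardo Booster node has 4 GPUs")
    return {
        "gpus-per-task": gpus,
        "cpus-per-task": 8 * gpus,
        "mem": f"{120 * gpus}GB",
    }


# Print the matching #SBATCH lines for a two-GPU request.
for option, value in leonardo_booster_request(2).items():
    print(f"#SBATCH --{option}={value}")
```

For a single GPU this reproduces the values used in `run_vllm.slurm` above.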


Now submit the SLURM job:

In [2]:
!sbatch --job-name=$TRAINEE_USERNAME run_vllm.slurm

Submitted batch job 19816220
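
The job ID printed by `sbatch` is needed again later, both for the name of the output file and for cancelling the job. As a small sketch, it can be extracted from the message with a regular expression; the helper name `parse_job_id` is hypothetical, and the file name assumes SLURM's default `slurm-<jobid>.out` pattern, which is what this notebook uses.

```python
import re


def parse_job_id(sbatch_output: str) -> str:
    """Return the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return match.group(1)


job_id = parse_job_id("Submitted batch job 19816220")
log_file = f"slurm-{job_id}.out"  # default SLURM output file name
print(job_id, log_file)
```

In a notebook, the text passed to `parse_job_id` could come from capturing the shell command, for example `out = !sbatch ...` followed by `"\n".join(out)`.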


Run `squeue` to check whether your job is already running:

In [3]:
!squeue --name=$TRAINEE_USERNAME

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19816220 boost_usr   martin mpfister  R       0:04      1 lrdn3394
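
The `ST` column holds the job state: `R` means running and `PD` means pending. The sketch below reads that column from the table; `job_states` is a hypothetical helper, and it assumes that no field left of `ST` (for example the job name) contains spaces.

```python
def job_states(squeue_output: str) -> dict:
    """Map each JOBID in squeue's default table to its ST (state) code."""
    lines = [line for line in squeue_output.splitlines() if line.strip()]
    header = lines[0].split()
    id_col = header.index("JOBID")
    state_col = header.index("ST")
    states = {}
    for line in lines[1:]:
        fields = line.split()  # assumes no spaces inside the leading fields
        states[fields[id_col]] = fields[state_col]
    return states


sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19816220 boost_usr   martin mpfister  R       0:04      1 lrdn3394
"""
states = job_states(sample)
print(states)
print("running" if states.get("19816220") == "R" else "not running yet")
```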


Once your job is running, look at the output of the job using the following command (replace the number with the JOBID from above):

In [5]:
!cat slurm-19816220.out

VLLM will serve an OpenAI compatible AI at http://lrdn3394:30278/v1

== CUDA ==

CUDA Version 12.6.3

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

INFO 09-10 19:06:27 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=3008577) INFO 09-10 19:06:29 [api_server.py:1805] vLLM API server version 0.10.1.1
(APIServer pid=3008577) INFO 09-10 19:06:29 [utils.py:326] non-default args: {'port': 30278, 'api_key': ['our-secret-api-key'], 'model': 'output/llama-3.2-1b-instruct-guanaco-fsdp_merged_bnb', 'dtype': 'bfloat16', 'max_model_len': 
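
The first line of the job output contains the URL that clients need. Instead of copying it by hand, it could be pulled out of the log text with a regular expression. This is only a sketch; `find_api_base` is a hypothetical helper, and the pattern assumes the URL has the `http://host:port/v1` shape printed by the job script.

```python
import re


def find_api_base(job_output: str):
    """Return the first http(s)://host:port/v1 URL in the text, or None."""
    match = re.search(r"https?://[\w.-]+:\d+/v1", job_output)
    return match.group(0) if match else None


sample = "VLLM will serve an OpenAI compatible AI at http://lrdn3394:30278/v1\n"
print(find_api_base(sample))
```

Because the pattern requires an explicit port, the license URL in the container banner is not matched by mistake.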

Now any OpenAI-compatible client can access the API. For example, we can use the `openai` Python library:

In [19]:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "our-secret-api-key"
# TODO: Edit the following line to use the URL from the output of your own vLLM job
openai_api_base = "http://lrdn3394:30278/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="finetuned-phi-3.5-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print('Chat response:\n', chat_response)
print('')
print('Answer from the chatbot:', chat_response.choices[0].message.content)

Chat response:
 ChatCompletion(id='chatcmpl-2bac2702524744de86c8e1612c8c66ff', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Why don't scientists trust atoms? Because they make up everything!", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content=None), stop_reason=None)], created=1757523921, model='finetuned-phi-3.5-mini', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=14, prompt_tokens=46, total_tokens=60, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, kv_transfer_params=None)

Answer from the chatbot: Why don't scientists trust atoms? Because they make up everything!
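
Since the server speaks the OpenAI HTTP protocol, the same call can also be made without the `openai` package. The sketch below builds the equivalent request with Python's standard library only: a JSON body sent by `POST` to `/chat/completions` under the base URL, with the API key in a `Bearer` authorization header. The helper name `build_chat_request` is hypothetical, and the host and port are the ones from the example output above, so they need to be replaced with your own.

```python
import json
import urllib.request


def build_chat_request(api_base, api_key, model, messages):
    """Build (but do not send) a chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_base}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "http://lrdn3394:30278/v1",  # replace with the URL of your own job
    "our-secret-api-key",
    "finetuned-phi-3.5-mini",
    [{"role": "user", "content": "Tell me a joke."}],
)
print(request.get_method(), request.full_url)

# Sending the request only works while the vLLM job is running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

The commented-out lines read the answer from the same `choices[0].message.content` field that the `openai` client exposes as attributes.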


Finally, cancel your vLLM job to free up the resources:

In [20]:
!scancel 19816220

If you want to, you can also delete the files that we created above:

In [9]:
!rm run_vllm.slurm slurm-*.out