## Load a model with vLLM and serve an OpenAI-compatible API

This notebook demonstrates how to load a model and serve it for inference with vLLM. We use the bitsandbytes-quantized merged model from the previous notebook.

In [1]:
%%writefile run_vllm.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 * number of GPUs CPU cores
## Leonardo Booster: 512 GB in total => request approx. 120 GB * number of GPUs requested
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:10:00

# Load conda:
module purge
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin29

# Run AI scripts:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-8}  # Match the CPU cores requested above
export PORT=$((RANDOM % (60000 - 8000) + 8000))  # Random port; $RANDOM is at most 32767, so the result lies in 8000-40767
echo "VLLM will serve an OpenAI compatible AI at http://$HOSTNAME:$PORT/v1"
python3 -m vllm.entrypoints.openai.api_server \
    --model output/phi-3.5-mini-instruct-guanaco-fsdp_merged_bnb \
    --served-model-name finetuned-phi-3.5-mini \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --api-key our-secret-api-key \
    --port $PORT

Overwriting run_vllm.slurm
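
The resource comments in the job script follow a simple per-GPU rule: 8 CPU cores and roughly 120 GB of memory for each GPU requested, with at most 4 GPUs per node. The following is a minimal sketch of that arithmetic, not part of the original workflow; the names `NodeRequest` and `leonardo_request` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass

# Per-GPU share of a Leonardo Booster node, taken from the comments in the
# job script (4 GPUs per node, 8 CPU cores and about 120 GB per GPU).
CPUS_PER_GPU = 8
MEM_GB_PER_GPU = 120
MAX_GPUS_PER_NODE = 4


@dataclass
class NodeRequest:
    """Hypothetical container for the values of the #SBATCH directives."""
    gpus: int
    cpus: int
    mem_gb: int

    def sbatch_lines(self) -> list[str]:
        """Render the request as #SBATCH directives."""
        return [
            f"#SBATCH --gpus-per-task={self.gpus}",
            f"#SBATCH --cpus-per-task={self.cpus}",
            f"#SBATCH --mem={self.mem_gb}GB",
        ]


def leonardo_request(gpus: int) -> NodeRequest:
    """Scale CPU cores and memory linearly with the number of GPUs."""
    if not 1 <= gpus <= MAX_GPUS_PER_NODE:
        raise ValueError(f"gpus must be between 1 and {MAX_GPUS_PER_NODE}")
    return NodeRequest(gpus, CPUS_PER_GPU * gpus, MEM_GB_PER_GPU * gpus)


# Directives for a two-GPU job
print("\n".join(leonardo_request(2).sbatch_lines()))
```

If you request more than one GPU, vLLM would presumably also need to be told to use them, for example with its `--tensor-parallel-size` option; the job script in this notebook uses a single GPU and does not need it.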


Now submit the SLURM job:

In [2]:
!sbatch run_vllm.slurm

Submitted batch job 16603457
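
The job ID printed by `sbatch` is needed again for the log file name and for cancelling the job. As a small sketch, assuming the output has the `Submitted batch job <id>` form shown above, it can be extracted in Python; `parse_job_id` is a hypothetical helper name.

```python
import re


def parse_job_id(sbatch_output: str) -> int:
    """Extract the numeric job ID from the text printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


# In a notebook, the text could come from IPython's capture syntax,
# e.g. `out = !sbatch run_vllm.slurm` followed by "\n".join(out).
job_id = parse_job_id("Submitted batch job 16603457")
print(job_id)                 # 16603457
print(f"slurm-{job_id}.out")  # slurm-16603457.out
```

Alternatively, `sbatch --parsable` prints the job ID without the surrounding text, which avoids the regular expression.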


Run `squeue` to check whether your job is already running:

In [4]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16603457 boost_usr run_vllm mpfister  R       0:13      1 lrdn1935
          16599181 boost_usr jupyterl mpfister  R    2:24:23      1 lrdn2012
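
In the `ST` column, `R` means the job is running and `PD` means it is still pending. The sketch below parses `squeue` output of the default layout shown above and looks up the state and node of one job. It assumes this exact column layout; `parse_squeue` is a hypothetical helper, and the sample text is copied from the output above.

```python
SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16603457 boost_usr run_vllm mpfister  R       0:13      1 lrdn1935
          16599181 boost_usr jupyterl mpfister  R    2:24:23      1 lrdn2012
"""


def parse_squeue(text: str) -> dict[str, dict[str, str]]:
    """Map each JOBID to a dict of column name -> value."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    jobs = {}
    for line in lines[1:]:
        # Limit the number of splits so that a NODELIST(REASON) value
        # containing spaces stays in one field.
        fields = line.split(None, len(header) - 1)
        row = dict(zip(header, fields))
        jobs[row["JOBID"]] = row
    return jobs


jobs = parse_squeue(SAMPLE)
job = jobs["16603457"]
print(job["ST"], job["NODELIST(REASON)"])  # R lrdn1935
```

For a live check, the same function could be fed the output of `subprocess.run(["squeue", "--me"], capture_output=True, text=True).stdout`.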


Once your job is running, look at the output of the job using the following command (replace the number with the JOBID from above):

In [6]:
!cat slurm-16603457.out

VLLM will serve an OpenAI compatible AI at http://lrdn1935:12520/v1
INFO 06-10 19:31:13 [__init__.py:243] Automatically detected platform cuda.
INFO 06-10 19:31:16 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-10 19:31:16 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-10 19:31:16 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-10 19:31:17 [api_server.py:1289] vLLM API server version 0.9.0.1
INFO 06-10 19:31:18 [cli_args.py:300] non-default args: {'port': 12520, 'api_key': 'our-secret-api-key', 'model': 'output/phi-3.5-mini-instruct-guanaco-fsdp_merged_bnb', 'dtype': 'bfloat16', 'max_model_len': 4096, 'served_model_name': ['finetuned-phi-3.5-mini']}
INFO 06-10 19:31:18 [config.py:213] Replacing legacy 'type' key with 'rope_type'
INFO 06-10 19:31:31 [config.py:793] This model supports multiple tasks: {'sc

Now any OpenAI-compatible client can access the API. For example, we can use the `openai` Python library:

In [7]:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "our-secret-api-key"
# TODO: Edit the following line to use the URL from the output of your own vLLM job
openai_api_base = "http://lrdn1935:12520/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="finetuned-phi-3.5-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print('Chat response:\n', chat_response)
print('')
print('Answer from the chatbot:', chat_response.choices[0].message.content)

Chat response:
 ChatCompletion(id='chatcmpl-e91290ff6aee49a988f0b4a2c51585c4', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=" Sure! Here's a joke for you:\n\nWhy don't skeletons fight each other?\n\nBecause they don't have the guts to.", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content=None), stop_reason=32007)], created=1749576851, model='finetuned-phi-3.5-mini', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=17, total_tokens=56, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, kv_transfer_params=None)

Answer from the chatbot:  Sure! Here's a joke for you:

Why don't skeletons fight each other?

Because they don't have the guts to.
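
For longer answers it can be nicer to show the text while it is being generated. The `openai` library supports this through `stream=True`, which returns an iterator of chunks whose text sits in `chunk.choices[0].delta.content`. The following is a minimal sketch; `collect_stream` is a hypothetical helper, and the fake chunks exist only so that the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Print streamed text as it arrives and return the full answer."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content can be None or empty, e.g. in the final chunk
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute layout as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


answer = collect_stream([fake_chunk("Hello"), fake_chunk(", "),
                         fake_chunk("world"), fake_chunk(None)])

# With the real client from the cell above:
# stream = client.chat.completions.create(
#     model="finetuned-phi-3.5-mini",
#     messages=[{"role": "user", "content": "Tell me a joke."}],
#     stream=True,
# )
# answer = collect_stream(stream)
```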


Finally, cancel your vLLM job to free up the resources (replace the number with your own JOBID):

In [8]:
!scancel 16603457

If you want to, you can also delete the files that we created above:

In [9]:
!rm run_vllm.slurm slurm-*.out