# First steps with vLLM

## Goal

Learn to use vLLM and find the best way to use it for the challenge.

## Imports

## Quickstart

In [1]:
from vllm import LLM, SamplingParams

llm = LLM(model="/home/gbarbadillo/data/deepseekmath")

INFO 06-07 11:46:45 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/gbarbadillo/data/deepseekmath', speculative_config=None, tokenizer='/home/gbarbadillo/data/deepseekmath', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/gbarbadillo/data/deepseekmath)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-07 11:46:57 model_runner.py:146] Loading model weights took 12.8725 GB
INFO 06-07 11:46:58 gpu_executor.py:83] # GPU blocks: 985, # CPU blocks: 546
INFO 06-07 11:47:00 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-07 11:47:00 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-07 11:47:05 model_runner.py:924] Graph capturing finished in 5 secs.
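
The `# GPU blocks` figure in the log gives a rough idea of how many tokens the KV cache can hold, which in turn limits how many sequences can be generated in parallel. Below is a minimal sketch of that arithmetic. It assumes vLLM's default block size of 16 tokens (the log does not print it), and the helper names are made up for illustration.

```python
# Rough KV-cache capacity estimate from the "# GPU blocks" log line.
# ASSUMPTION: block_size=16 tokens per block (vLLM's default); the log does not show it.

def kv_cache_capacity(num_gpu_blocks: int, block_size: int = 16) -> int:
    """Total number of tokens that fit in the GPU KV cache."""
    return num_gpu_blocks * block_size


def max_full_length_sequences(num_gpu_blocks: int, tokens_per_sequence: int,
                              block_size: int = 16) -> int:
    """How many sequences of a given total length fit in the cache at once."""
    blocks_per_sequence = -(-tokens_per_sequence // block_size)  # ceiling division
    return num_gpu_blocks // blocks_per_sequence


capacity = kv_cache_capacity(985)  # 985 blocks reported in the log above
# Illustrative sequence length: a 5-token prompt plus 640 generated tokens.
concurrent = max_full_length_sequences(985, 645)
print(f"KV cache capacity: {capacity} tokens")
print(f"Sequences of 645 tokens that fit at once: {concurrent}")
```

Under that assumption the cache holds only a couple of dozen full-length sequences, so sending many more requests at once would probably not help on a single GPU.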


In [7]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.9, top_p=0.9, max_tokens=640)

outputs = llm.generate(prompts*10, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Processed prompts: 100%|██████████| 40/40 [00:17<00:00,  2.26it/s, Generation Speed: 798.50 toks/s]

Prompt: 'Hello, my name is', Generated text: " Shashank and I am a student at The University of Texas at Dallas.\nWelcome to the world of mathematics! Here, we will explore the fascinating world of numbers, shapes, and patterns.\nWe will learn about the basic concepts and principles of mathematics, and how they can be applied to solve problems and understand the world around us.\nLet's start by learning about the number system, basic operations, and how to solve equations.\nWe will also learn about geometry, trigonometry, and calculus, which are essential for understanding more advanced topics in mathematics.\nWe will also explore the many applications of mathematics in the real world, such as in science, engineering, and finance.\nSo, let's dive into the world of mathematics and explore its wonders!"
Prompt: 'The president of the United States is', Generated text: " currently weighing a proposal to provide up to $1.6 trillion in aid to the world's poorest countries. In response, the h




The progress bar reports about 800 tokens/s. If this is true, it is close to 30 times faster than my script, which seems hard to believe.
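
One way to check whether the reported speed is plausible is to recompute it from the other numbers in the progress bar. The sketch below is only a sanity check. It assumes the `it/s` and `toks/s` figures are averaged over the same wall-clock interval, and the helper name is made up for illustration.

```python
# Sanity check of the progress-bar numbers: 40 prompts, 2.26 it/s, 798.50 toks/s.
# ASSUMPTION: both rates are averaged over the same wall-clock interval.

def mean_tokens_per_prompt(prompts_per_second: float, tokens_per_second: float) -> float:
    """Average number of generated tokens per prompt implied by the two rates."""
    return tokens_per_second / prompts_per_second


n_prompts = 40
elapsed_s = n_prompts / 2.26  # should be close to the 00:17 shown by the bar
avg_tokens = mean_tokens_per_prompt(2.26, 798.50)
total_tokens = avg_tokens * n_prompts

print(f"elapsed: {elapsed_s:.1f} s")
print(f"average tokens per prompt: {avg_tokens:.0f}")
print(f"total generated tokens: {total_tokens:.0f}")
```

The implied average of roughly 350 tokens per prompt is below the `max_tokens=640` limit, so the reported throughput is at least self-consistent.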

In [10]:
prompt = 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python'
outputs = llm.generate([prompt]*10, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Processed prompts:   0%|          | 0/10 [00:00<?, ?it/s, Generation Speed: 0.00 toks/s]

Processed prompts: 100%|██████████| 10/10 [00:04<00:00,  2.29it/s, Generation Speed: 335.93 toks/s]


from itertools import combinations

def sum_of_subsets():
    numbers = [1, 2, 3, 4, 5, 6]
    sums = []

    # Generate all two-element subsets
    subsets = list(combinations(numbers, 2))

    # Calculate the sum of each subset and add it to the list
    for subset in subsets:
        sums.append(sum(subset))

    # Calculate the sum of all the sums
    total_sum = sum(sums)

    return total_sum

result = sum_of_subsets()
print(result)
```
```output
210
```
The sum of these 15 sums is $\boxed{210}$. The answer is a non negative integer.
The answer is $\boxed{210}$.

from itertools import combinations

def sum_of_subsets():
    numbers = list(range(1, 7))
    sums = []
    for subset in combinations(numbers, 2):
        sums.append(sum(subset))

    return sum(sums)

result = sum_of_subsets()
print(result)
```
```output
210
```
The sum of these 15 sums is 210. The answer is: $210$.

from itertools import combinations

def sum_of_subsets():
    numbers = [1, 2, 3, 4, 5, 6]
    sums =




Now the speed is about 335 tokens/s, still roughly 10x the speed of my previous script.
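
Note that the `output` blocks in the generations above were written by the model, not produced by running the code. Executing the generated program gives 105, not 210: each of the 6 numbers appears in 5 of the 15 pairs, so the total is 5 * 21 = 105. The check below reruns the same computation. The closed-form comparison is an addition for illustration.

```python
# Re-run the computation from the model's generated program to verify its claimed output.
from itertools import combinations


def sum_of_subsets() -> int:
    numbers = [1, 2, 3, 4, 5, 6]
    # Sum every two-element subset, then add up all those sums.
    return sum(sum(subset) for subset in combinations(numbers, 2))


result = sum_of_subsets()
# Closed form: every element belongs to 5 of the 15 pairs.
closed_form = 5 * sum(range(1, 7))
print(result, closed_form)  # prints "105 105"
```

One possible way to avoid trusting such text is to stop generation before the output block, for example with the `stop` argument of `SamplingParams`, and run the code in a real interpreter. Whether that fits the challenge pipeline is left open here.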

## AsyncLLMEngine

- https://docs.vllm.ai/en/stable/dev/engine/llm_engine.html#vllm.LLMEngine
- https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py
- https://github.com/vllm-project/vllm/issues/1317

In [6]:
from vllm import AsyncLLMEngine, SamplingParams
from vllm.config import ModelConfig

model_config = ModelConfig(model="/home/gbarbadillo/data/deepseekmath")

llm = AsyncLLMEngine(worker_use_ray=False, engine_use_ray=False, model_config=model_config)

TypeError: ModelConfig.__init__() missing 5 required positional arguments: 'tokenizer', 'tokenizer_mode', 'trust_remote_code', 'dtype', and 'seed'
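
The `TypeError` comes from building `ModelConfig` by hand. The API server linked above creates the engine from `AsyncEngineArgs` instead, which fills in the remaining configuration. Below is a minimal sketch along those lines, not run in this notebook. The helper names `build_engine` and `generate_once` are hypothetical, and vLLM is imported inside the function only so that the snippet can be loaded on a machine without vLLM installed.

```python
import uuid


def build_engine(model_path: str):
    """Create an AsyncLLMEngine from engine arguments instead of a hand-built ModelConfig."""
    # Imported lazily: requires vLLM (and a GPU) at call time.
    from vllm import AsyncLLMEngine
    from vllm.engine.arg_utils import AsyncEngineArgs

    engine_args = AsyncEngineArgs(model=model_path)
    return AsyncLLMEngine.from_engine_args(engine_args)


async def generate_once(engine, prompt: str, sampling_params) -> str:
    """Send one prompt to the engine and return the final generated text."""
    request_id = str(uuid.uuid4())  # each request needs a unique id
    final_output = None
    # generate() is an async generator that yields results as tokens arrive;
    # the last item holds the complete text.
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return final_output.outputs[0].text


# Usage in a notebook cell (needs a GPU and the model on disk):
# from vllm import SamplingParams
# engine = build_engine("/home/gbarbadillo/data/deepseekmath")
# text = await generate_once(engine, "The capital of France is", SamplingParams(max_tokens=64))
```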

## API

### 1 GPU

In [None]:
import threading
import subprocess

def run_vllm_server():
    subprocess.run(['python', '-m', 'vllm.entrypoints.openai.api_server', '--model', '/home/gbarbadillo/data/deepseekmath'])

server_thread = threading.Thread(target=run_vllm_server)
server_thread.start()

INFO 06-07 12:32:43 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/gbarbadillo/data/deepseekmath', speculative_config=None, tokenizer='/home/gbarbadillo/data/deepseekmath', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/gbarbadillo/data/deepseekmath)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-07 12:32:46 model_runner.py:146] Loading model weights took 12.8725 GB
INFO 06-07 12:32:47 gpu_executor.py:83] # GPU blocks: 985, # CPU blocks: 546
INFO 06-07 12:32:49 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-07 12:32:49 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-07 12:32:55 model_runner.py:924] Graph capturing finished in 6 secs.
INFO 06-07 12:32:55 serving_chat.py:84] Using default chat template:
INFO 06-07 12:32:55 serving_chat.py:84] {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{{ bos_token }}{% for message in messages %}{%

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [21611]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO 06-07 12:33:05 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:33:15 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:33:25 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:33:35 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:33:45 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, S
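
Because the server is started in a background thread, the next cell can run before the model has finished loading. Below is a small stdlib-only sketch of a readiness check, assuming the default port 8000 shown in the log. `wait_for_port` is a hypothetical helper, not part of vLLM, and an open port only indicates that Uvicorn is listening.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 300.0, poll_interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            # Nothing is listening yet; wait and retry.
            time.sleep(poll_interval)
    return False


# Usage: block until the server accepts connections before sending requests.
# if not wait_for_port(port=8000):
#     raise RuntimeError("vLLM server did not start in time")
```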

In [4]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                      prompt="San Francisco is a")
print("Completion result:", completion)

INFO 06-07 12:35:04 async_llm_engine.py:553] Received request cmpl-74536789027e4f9daa2b0838c8dcd6a6-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [100000, 23676, 12628, 317, 245], lora_request: None.
INFO 06-07 12:35:04 metrics.py:341] Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 06-07 12:35:05 async_llm_engine.py:124] Finished request cmpl-74536789027e4f9daa2b0838c8dcd6a6-0.
INFO

INFO 06-07 12:35:15 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:35:25 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.


In [6]:
for _ in range(10):
    completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                           prompt="San Francisco is a", max_tokens=640)
    print("Completion result:", completion)

INFO 06-07 12:36:32 async_llm_engine.py:553] Received request cmpl-06b0c02ffbe0420ba6bdf4c393e6bd50-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=640, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [100000, 23676, 12628, 317, 245], lora_request: None.
INFO 06-07 12:36:32 metrics.py:341] Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 06-07 12:36:37 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.7 t

INFO 06-07 12:37:55 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:38:05 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:38:15 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:38:25 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 12:38:35 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, 

In [8]:
from tqdm.auto import tqdm
import time
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def monitor_progress(submits):
    progress = 0
    with tqdm(total=len(submits), smoothing=0) as progress_bar:
        while 1:
            time.sleep(1)
            current_progress = np.sum([submit.done() for submit in submits])
            if current_progress > progress:
                progress_bar.update(current_progress - progress)
                progress = current_progress
            if progress == len(submits):
                break

def make_request():
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                           prompt="San Francisco is a", max_tokens=640)
    return completion

max_workers = 10
with ProcessPoolExecutor(max_workers=max_workers) as pool:
    submits = []
    for i in range(10):
        submits.append(pool.submit(make_request))
    monitor_progress(submits)
    results = [submit.result() for submit in submits]

  0%|          | 0/10 [00:00<?, ?it/s]

INFO 06-07 12:40:11 async_llm_engine.py:553] Received request cmpl-16782d52b3fd4e6e9085dff3e46b1818-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=640, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [100000, 23676, 12628, 317, 245], lora_request: None.
INFO 06-07 12:40:11 async_llm_engine.py:553] Received request cmpl-b77045b0e58247dcab7f554048dd4870-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False,

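The greatest speedup comes from batching the requests: when they are sent concurrently the server can process them together, whereas the sequential loop above keeps only one request running at a time.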
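
Since the requests are I/O-bound on the client side, threads should be enough to keep many of them in flight so that the server can batch them. Below is a minimal sketch, assuming a `request_fn` callable such as the `make_request` function above. `run_concurrently` is a hypothetical helper, and the demo uses a stand-in function that only sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(request_fn, n_jobs: int, max_workers: int) -> list:
    """Call request_fn n_jobs times with up to max_workers calls in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(request_fn) for _ in range(n_jobs)]
        # result() re-raises any exception raised in the worker thread.
        return [future.result() for future in futures]


def fake_request() -> str:
    """Stand-in for an HTTP request; it only waits and does not call any server."""
    time.sleep(0.2)
    return "done"


start = time.perf_counter()
results = run_concurrently(fake_request, n_jobs=10, max_workers=10)
elapsed = time.perf_counter() - start
# A sequential loop would need about 10 * 0.2 = 2 s for the same work.
print(f"{len(results)} requests in {elapsed:.2f} s")
```

With the real server, `run_concurrently(make_request, n_jobs=10, max_workers=10)` would replace the stand-in call.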

### 2 GPUs

- https://docs.vllm.ai/en/latest/serving/distributed_serving.html

In [None]:
import threading
import subprocess

def run_vllm_server():
    subprocess.run(['python', '-m', 'vllm.entrypoints.openai.api_server',
                    '--model', '/home/gbarbadillo/data/deepseekmath', 
                    #'--pipeline-parallel-size', '2', #NotImplementedError: Pipeline parallelism is not supported yet.
                    '--tensor-parallel-size', '2',
                    ])

server_thread = threading.Thread(target=run_vllm_server)
server_thread.start()

2024-06-07 12:59:58,966	INFO worker.py:1753 -- Started a local Ray instance.


INFO 06-07 12:59:59 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/gbarbadillo/data/deepseekmath', speculative_config=None, tokenizer='/home/gbarbadillo/data/deepseekmath', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/gbarbadillo/data/deepseekmath)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-07 13:00:03 utils.py:618] Found nccl from library libnccl.so.2
INFO 06-07 13:00:03 pynccl.py:65] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=30328) INFO 06-07 13:00:03 utils.py:618] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=30328) INFO 06-07 13:00:03 pynccl.py:65] vLLM is using nccl==2.20.5
INFO 06-07 13:00:03 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /home/gbarbadillo/.config/vllm/gpu_p2p_access_cache_for_0,1.json
(RayWorkerWrapper pid=30328) INFO 06-07 13:00:03 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /home/gbarbadillo/.config/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-07 13:00:05 model_runner.py:146] Loading model weights took 6.4663 GB
(RayWorkerWrapper pid=30328) INFO 06-07 13:00:05 model_runner.py:146] Loading model weights took 6.4663 GB
INFO 06-07 13:00:07 distributed_gpu_executor.py:56] # GPU blocks: 3729, # CPU blocks: 1092
INFO 06-07 13:00:09 model_runner.py

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [28998]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO 06-07 13:00:26 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.


In [2]:
from tqdm.auto import tqdm
import time
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from openai import OpenAI

def monitor_progress(submits):
    progress = 0
    with tqdm(total=len(submits), smoothing=0) as progress_bar:
        while 1:
            time.sleep(1)
            current_progress = np.sum([submit.done() for submit in submits])
            if current_progress > progress:
                progress_bar.update(current_progress - progress)
                progress = current_progress
            if progress == len(submits):
                break

def make_request():
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                           prompt="San Francisco is a", max_tokens=640)
    return completion

In [4]:
max_workers = 40
n_jobs = 25
with ProcessPoolExecutor(max_workers=max_workers) as pool:
    submits = []
    for i in range(n_jobs):
        submits.append(pool.submit(make_request))
    monitor_progress(submits)
    results = [submit.result() for submit in submits]

  0%|          | 0/25 [00:00<?, ?it/s]

INFO 06-07 13:03:26 async_llm_engine.py:553] Received request cmpl-baac52c8e64145899d58083211619ef2-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=640, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [100000, 23676, 12628, 317, 245], lora_request: None.
INFO 06-07 13:03:26 async_llm_engine.py:553] Received request cmpl-5ba7530ded824fc7ba009c4edc765006-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False,

INFO 06-07 13:03:56 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:04:06 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:04:16 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:04:26 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:04:36 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, 

- 1 GPU, 20 workers, 1:23
- 2 GPUs, 10 workers, 2:00
- 2 GPUs, 20 workers, 1:11
- 2 GPUs, 40 workers, 0:56

This is very promising: I might speed up inference 10x on my PC.
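
To compare the runs listed above, the wall-clock times can be converted to seconds and expressed relative to the single-GPU run. This is a small sketch that assumes every run processed the same number of requests, which these notes do not state.

```python
# Wall-clock times copied from the list above, in "m:ss" format.
runs = {
    "1 GPU, 20 workers": "1:23",
    "2 GPUs, 10 workers": "2:00",
    "2 GPUs, 20 workers": "1:11",
    "2 GPUs, 40 workers": "0:56",
}


def to_seconds(m_ss: str) -> int:
    """Convert an "m:ss" string to a number of seconds."""
    minutes, seconds = m_ss.split(":")
    return int(minutes) * 60 + int(seconds)


baseline = to_seconds(runs["1 GPU, 20 workers"])
for name, duration in runs.items():
    seconds = to_seconds(duration)
    # Ratio above 1 means faster than the single-GPU run.
    print(f"{name}: {seconds:3d} s, {baseline / seconds:.2f}x vs. 1 GPU with 20 workers")
```

With these numbers, the worker count seems to matter at least as much as the second GPU: 2 GPUs with 10 workers are slower than 1 GPU with 20 workers.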

In [12]:
results[1].choices[0].text

' busy city with thousands of people living, working, and traveling through its streets every day. Navigation in the city can be complicated due to the winding roads and dense traffic. However, there are several apps and websites that can help you find your way around San Francisco. In this article, we will explore some of the best navigation apps and websites for San Francisco.\n\n### 1. MapQuest\nMapQuest is a popular navigation app and website that offers turn-by-turn directions for driving, walking, and public transportation in over 300,000 locations in the world. It also provides driving time estimates, real-time traffic updates, and alternate routes if needed. Sign up for a free account to save your routes and access turn-by-turn navigation.\n\n### 2. Google Maps\nGoogle Maps is the other popular navigation app that is widely available on both iOS and Android devices. It provides turn-by-turn directions for driving, walking, and public transportation. It also includes interactive


```bash
!python -m vllm.entrypoints.openai.api_server --model /home/gbarbadillo/data/deepseekmath &
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[1], line 1
----> 1 get_ipython().system('python -m vllm.entrypoints.openai.api_server --model /home/gbarbadillo/data/deepseekmath &')

File ~/miniconda3/envs/aimo/lib/python3.10/site-packages/ipykernel/zmqshell.py:641, in ZMQInteractiveShell.system_piped(self, cmd)
    634 if cmd.rstrip().endswith("&"):
    635     # this is *far* from a rigorous test
    636     # We do not support backgrounding processes because we either use
    637     # pexpect or pipes to read from.  Users can always just call
    638     # os.system() or use ip.system=ip.system_raw
    639     # if they really want a background process.
    640     msg = "Background processes not supported."
--> 641     raise OSError(msg)
    643 # we explicitly do NOT return the subprocess status code, because
    644 # a non-None value would trigger :func:`sys.displayhook` calls.
    645 # Instead, we store the exit_code in user_ns.
    646 # Also, protect system call from UNC paths on Windows here too
    647 # as is done in InteractiveShell.system_raw
    648 if sys.platform == "win32":

OSError: Background processes not supported.
```
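
An alternative to both `!cmd &` and the thread wrapper is `subprocess.Popen`, which returns immediately and leaves a handle for stopping the server later. Below is a sketch under that assumption. `start_background` is a hypothetical helper, and the demo launches a harmless sleeping Python process instead of vLLM.

```python
import subprocess
import sys


def start_background(cmd: list) -> subprocess.Popen:
    """Start cmd without blocking the notebook and return the process handle."""
    return subprocess.Popen(cmd)


# Stand-in command for the demo. For the server it would be something like
# [sys.executable, "-m", "vllm.entrypoints.openai.api_server", "--model", "<model path>"]
process = start_background([sys.executable, "-c", "import time; time.sleep(30)"])
still_running = process.poll() is None  # poll() returns None while the process is alive
print("running:", still_running)

process.terminate()  # stop the process when it is no longer needed
process.wait(timeout=10)
print("exit code:", process.returncode)
```

Keeping the handle also makes it possible to free the GPU without restarting the kernel.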

### 2 GPUs on different servers

In [None]:
import threading
import subprocess
import os

def run_vllm_server(device=0, port=8000):
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = str(device)
    subprocess.run(['python', '-m', 'vllm.entrypoints.openai.api_server',
                    '--model', '/home/gbarbadillo/data/deepseekmath',
                    '--port', str(port),
                    ],
                    env=env)

server_thread = threading.Thread(target=run_vllm_server)
server_thread.start()

INFO 06-07 13:50:18 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/gbarbadillo/data/deepseekmath', speculative_config=None, tokenizer='/home/gbarbadillo/data/deepseekmath', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/gbarbadillo/data/deepseekmath)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-07 13:50:21 model_runner.py:146] Loading model weights took 12.8725 GB
INFO 06-07 13:50:22 gpu_executor.py:83] # GPU blocks: 985, # CPU blocks: 546
INFO 06-07 13:50:25 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-07 13:50:25 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-07 13:50:30 model_runner.py:924] Graph capturing finished in 6 secs.
INFO 06-07 13:50:31 serving_chat.py:84] Using default chat template:
INFO 06-07 13:50:31 serving_chat.py:84] {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{{ bos_token }}{% for message in messages %}{%

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [35184]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)




In [None]:
threading.Thread(target=run_vllm_server, args=(1, 8001)).start()

INFO 06-07 13:50:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:50:42 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/gbarbadillo/data/deepseekmath', speculative_config=None, tokenizer='/home/gbarbadillo/data/deepseekmath', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/gbarbadillo/data/deepseekmath)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-07 13:50:44 model_runner.py:146] Loading model weights took 12.8725 GB
INFO 06-07 13:50:46 gpu_executor.py:83] # GPU blocks: 982, # CPU blocks: 546
INFO 06-07 13:50:48 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-07 13:50:48 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-07 13:50:51 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:50:54 model_runner.py:924] Graph capturing finished in 6 secs.
INFO 06-07 13:50:54 serving_chat.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [35312]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)


INFO 06-07 13:51:01 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:51:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:51:11 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:51:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:51:21 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, S

In [3]:
from tqdm.auto import tqdm
import time
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from openai import OpenAI

def monitor_progress(submits):
    """Show a progress bar that advances as the submitted futures complete."""
    progress = 0
    with tqdm(total=len(submits), smoothing=0) as progress_bar:
        while True:
            time.sleep(1)  # poll the futures once per second
            current_progress = np.sum([submit.done() for submit in submits])
            if current_progress > progress:
                progress_bar.update(current_progress - progress)
                progress = current_progress
            if progress == len(submits):
                break

def make_request(port=8000):
    """Send one fixed completion request to the vLLM server listening on `port`."""
    openai_api_key = "EMPTY"  # vLLM does not check the key unless one is configured
    openai_api_base = f"http://localhost:{port}/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                           prompt="San Francisco is a", max_tokens=640)
    return completion

INFO 06-07 13:51:51 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.


In [4]:
max_workers = 40
n_jobs = 100
with ProcessPoolExecutor(max_workers=max_workers) as pool:
    submits = []
    for i in range(n_jobs):
        # Alternate requests between the two servers (ports 8000 and 8001).
        port = 8000 + i % 2
        submits.append(pool.submit(make_request, port=port))
    monitor_progress(submits)
    results = [submit.result() for submit in submits]

  0%|          | 0/100 [00:00<?, ?it/s]

INFO 06-07 13:52:28 async_llm_engine.py:553] Received request cmpl-62f4c288349e4052b9b61d09c7f3a231-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=640, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [100000, 23676, 12628, 317, 245], lora_request: None.
INFO 06-07 13:52:28 async_llm_engine.py:553] Received request cmpl-546d5ef0d5164e4191530361122bc6c1-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False,

INFO 06-07 13:53:21 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:53:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.8 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:53:31 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:53:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 13:53:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs,

Running two servers took 48s, vs 56s with a single server (both with 2 GPUs and 40 workers). GPU utilization was clearly higher with two servers.

- 1 GPU, 20 workers: 1:23
- 2 GPUs, 10 workers: 2:00
- 2 GPUs, 20 workers: 1:11
- 2 GPUs, 40 workers: 0:56
- 2 GPUs, 2 servers, 40 workers: 0:48
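
To compare these configurations at a glance, the timings can be converted to seconds and expressed as a speedup over the slowest run. This is only a sketch: the `timings` dict restates the list above, and `to_seconds` is a hypothetical helper, not part of any library.

```python
def to_seconds(duration: str) -> int:
    """Convert a 'minutes:seconds' string such as '1:23' into seconds."""
    minutes, seconds = duration.split(":")
    return int(minutes) * 60 + int(seconds)


# Timings copied from the list above (min:sec).
timings = {
    "1 GPU, 20 workers": "1:23",
    "2 GPUs, 10 workers": "2:00",
    "2 GPUs, 20 workers": "1:11",
    "2 GPUs, 40 workers": "0:56",
    "2 GPUs, 2 servers, 40 workers": "0:48",
}

seconds = {name: to_seconds(value) for name, value in timings.items()}
slowest = max(seconds.values())
# Speedup of each configuration relative to the slowest one.
speedups = {name: slowest / value for name, value in seconds.items()}

for name, value in seconds.items():
    print(f"{name}: {value}s, {speedups[name]:.2f}x vs slowest run")
```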

### Stop words

In [None]:
import threading
import subprocess

def run_vllm_server():
    subprocess.run(['python', '-m', 'vllm.entrypoints.openai.api_server', '--model', '/home/gbarbadillo/data/deepseekmath'])

server_thread = threading.Thread(target=run_vllm_server)
server_thread.start()

INFO 06-07 14:02:53 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/gbarbadillo/data/deepseekmath', speculative_config=None, tokenizer='/home/gbarbadillo/data/deepseekmath', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/gbarbadillo/data/deepseekmath)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-07 14:02:56 model_runner.py:146] Loading model weights took 12.8725 GB
INFO 06-07 14:02:57 gpu_executor.py:83] # GPU blocks: 985, # CPU blocks: 546
INFO 06-07 14:02:59 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-07 14:02:59 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-07 14:03:05 model_runner.py:924] Graph capturing finished in 6 secs.
INFO 06-07 14:03:05 serving_chat.py:84] Using default chat template:
INFO 06-07 14:03:05 serving_chat.py:84] {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{{ bos_token }}{% for message in messages %}{%

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [36719]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO 06-07 14:03:16 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
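
Because the server is started in a background thread, the next cell can run before Uvicorn is listening. One possible way to avoid that race is to poll the server until it answers. The sketch below assumes the server exposes the OpenAI-style `/v1/models` route; `probe_models_endpoint` and `wait_for_server` are hypothetical helper names, and the timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request


def probe_models_endpoint(base_url: str) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error: the server is not ready yet.
        return False


def wait_for_server(base_url, timeout=120.0, poll_interval=1.0,
                    probe=probe_models_endpoint):
    """Poll `probe(base_url)` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        time.sleep(poll_interval)
    return False
```

Calling `wait_for_server("http://localhost:8000/v1")` right after `server_thread.start()` would block until the model is loaded, or return `False` after the timeout.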


In [3]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

In [27]:
prompt = 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python'
completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                       prompt=prompt, max_tokens=640)
print("Completion result:", completion.choices[0].text)
print(f'Stop reason: {completion.choices[0].stop_reason}, Finish reason: {completion.choices[0].finish_reason}, output tokens: {completion.usage.completion_tokens}')

INFO 06-07 14:14:15 async_llm_engine.py:553] Received request cmpl-d58e7cfa9ce642038e8bb3cb31485d7e-0: prompt: 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[],

INFO 06-07 14:14:26 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.6 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:14:36 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:14:46 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:14:56 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:15:06 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, S

In [26]:
completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                       prompt=prompt, max_tokens=640, echo=True,
                                       temperature=0.9, top_p=0.9, stop=["```output"])
print("Completion result:", completion.choices[0].text)
print(f'Stop reason: {completion.choices[0].stop_reason}, Finish reason: {completion.choices[0].finish_reason}, output tokens: {completion.usage.completion_tokens}')

INFO 06-07 14:14:05 async_llm_engine.py:553] Received request cmpl-ae10c56471af45e9ab018f17e426fbb5-0: prompt: 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.9, top_p=0.9, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['`

The completion text does not include the stop string, but we can recover it from the stop reason.
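
If the stop string is needed in the text (for example to keep the prompt format intact when continuing the generation), it could be appended back from `stop_reason`. The sketch below assumes `stop_reason` holds the matched stop string when one was hit and something else (`None` or a token id) otherwise; `text_with_stop` is a hypothetical helper, and `SimpleNamespace` only stands in for `completion.choices[0]`.

```python
from types import SimpleNamespace


def text_with_stop(choice) -> str:
    """Return the completion text, re-appending the stop string if one was hit."""
    stop_reason = getattr(choice, "stop_reason", None)
    if isinstance(stop_reason, str):
        return choice.text + stop_reason
    return choice.text


# Stand-in for `completion.choices[0]`; real objects come from the OpenAI client.
fake_choice = SimpleNamespace(text="print(1 + 1)\n", stop_reason="```output",
                              finish_reason="stop")
print(repr(text_with_stop(fake_choice)))
```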

In [25]:
completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                       prompt=prompt, max_tokens=50, echo=True,
                                       temperature=0.9, top_p=0.9, stop=["```output"])
print("Completion result:", completion.choices[0].text)
print(f'Stop reason: {completion.choices[0].stop_reason}, Finish reason: {completion.choices[0].finish_reason}, output tokens: {completion.usage.completion_tokens}')

INFO 06-07 14:13:56 async_llm_engine.py:553] Received request cmpl-d6998b75f1c540acb332a713b832e02e-0: prompt: 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.9, top_p=0.9, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['`

We have all the information we need on the completion object.
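
As an illustration of how these fields could drive a code-interpreter loop, the sketch below maps a choice to a next step. The action names (`run_code`, `truncated`, `finished`) and the `next_action` helper are hypothetical; the only assumptions about the API are that `stop_reason` carries the matched stop string and that `finish_reason` is `"length"` when `max_tokens` is exhausted.

```python
from types import SimpleNamespace

CODE_OUTPUT_STOP = "```output"  # stop string used in the cells above


def next_action(choice) -> str:
    """Decide what to do after a completion, based on why it stopped."""
    if getattr(choice, "stop_reason", None) == CODE_OUTPUT_STOP:
        # The model wrote a code block and now expects its output.
        return "run_code"
    if choice.finish_reason == "length":
        # max_tokens was reached before the model finished.
        return "truncated"
    return "finished"


# Stand-ins for `completion.choices[0]` in the three situations.
examples = [
    SimpleNamespace(stop_reason="```output", finish_reason="stop"),
    SimpleNamespace(stop_reason=None, finish_reason="length"),
    SimpleNamespace(stop_reason=None, finish_reason="stop"),
]
print([next_action(choice) for choice in examples])
```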

### Chat template

Is the server applying some chat template on top of our request?

#### API server baseline

In [1]:
import threading
import subprocess

def run_vllm_server():
    subprocess.run(['python', '-m', 'vllm.entrypoints.openai.api_server', '--model', '/home/gbarbadillo/data/deepseekmath'])

server_thread = threading.Thread(target=run_vllm_server)
server_thread.start()

INFO 06-07 14:19:44 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/gbarbadillo/data/deepseekmath', speculative_config=None, tokenizer='/home/gbarbadillo/data/deepseekmath', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/gbarbadillo/data/deepseekmath)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-07 14:19:46 model_runner.py:146] Loading model weights took 12.8725 GB
INFO 06-07 14:19:48 gpu_executor.py:83] # GPU blocks: 985, # CPU blocks: 546


In [3]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

In [6]:
prompt = 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python'

for _ in range(3):
    completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                           prompt=prompt, max_tokens=640,
                                           temperature=0)
    print("Completion result:", completion.choices[0].text)
    print(f'Stop reason: {completion.choices[0].stop_reason}, Finish reason: {completion.choices[0].finish_reason}, output tokens: {completion.usage.completion_tokens}')

INFO 06-07 14:20:48 async_llm_engine.py:553] Received request cmpl-f5c85d3d6fd748d0b5371333a2e90d56-0: prompt: 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[],

INFO 06-07 14:21:06 metrics.py:341] Avg prompt throughput: 10.3 tokens/s, Avg generation throughput: 10.4 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:21:16 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:21:26 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:21:36 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:21:46 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs,

#### Huggingface baseline

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "/home/gbarbadillo/data/deepseekmath"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map='sequential')
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [2]:
prompt = 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=640)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

User: John computes the sum of the elements of each of the 15 two-element subsets of $\{1,2,3,4,5,6\}$. What is the sum of these 15 sums?
Please reason step by step, and put your final answer within \boxed{}. The answer is a non negative integer.
Use all the available information in the problem description, and be very careful with the assumptions and simplifications you make.
You might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.
Use code even for the simpler calculations to avoid mistakes.

Assistant: Sure, we can solve the problem by writing a Python program.

```python
from itertools import combinations

def sum_of_subsets():
    numbers = list(range(1, 7))
    sums = []

    for subset in combinations(numbers, 2):
        sums.append(sum(subset))

    return sum(sums)

result = sum_of_subsets()
print(result)
```
```output
210
```
The sum of these 15 sums is 210. The answer is: $210$.


We get exactly the same answer! Thus it seems that the server is not applying any chat template to completion requests.

````
from itertools import combinations

def sum_of_subsets():
    numbers = list(range(1, 7))
    sums = []

    for subset in combinations(numbers, 2):
        sums.append(sum(subset))

    return sum(sums)

result = sum_of_subsets()
print(result)
```
```output
210
```
The sum of these 15 sums is 210. The answer is: $210$.
````
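
Rather than comparing the two outputs by eye, the check could be automated. The Hugging Face result is decoded from the full sequence, so it contains the prompt followed by the continuation, while the server returns only the continuation (unless `echo=True`). The helpers below are a hypothetical sketch and assume the decoded text starts with the exact prompt string, which may not hold if tokenization does not round-trip.

```python
def continuation(full_text: str, prompt: str) -> str:
    """Strip the prompt from a decoded Hugging Face generation."""
    if not full_text.startswith(prompt):
        raise ValueError("decoded text does not start with the prompt")
    return full_text[len(prompt):]


def same_generation(hf_full_text: str, prompt: str, server_text: str) -> bool:
    """Compare the Hugging Face continuation with the server completion text."""
    return continuation(hf_full_text, prompt).strip() == server_text.strip()


# Toy strings standing in for `result` and `completion.choices[0].text`.
toy_prompt = "Question: 1 + 1?\nAnswer:"
toy_hf_result = toy_prompt + " 2\n"
toy_server_text = " 2"
print(same_generation(toy_hf_result, toy_prompt, toy_server_text))
```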

### Model type

https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html

By default, `--dtype auto` resolves to `torch.bfloat16` for this model.
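
As far as I understand the vLLM documentation, `auto` keeps BF16 for BF16 checkpoints and uses FP16 for FP16 and FP32 checkpoints. The sketch below paraphrases that rule; it is not vLLM's actual implementation. `resolve_auto_dtype` and `checkpoint_dtype` are hypothetical helpers, and the example config is a made-up fragment chosen to match the `dtype=torch.bfloat16` reported in the engine logs.

```python
import json


def checkpoint_dtype(config_text: str) -> str:
    """Read `torch_dtype` from the text of a model's config.json."""
    # Assumption: a missing field is treated as float32.
    return json.loads(config_text).get("torch_dtype", "float32")


def resolve_auto_dtype(config_dtype: str) -> str:
    """Simplified paraphrase of how `--dtype auto` picks the precision."""
    if config_dtype == "bfloat16":
        return "bfloat16"
    if config_dtype in ("float16", "float32"):
        return "float16"
    raise ValueError(f"unexpected dtype: {config_dtype}")


# Made-up config fragment; in practice pass the contents of config.json.
example_config = '{"torch_dtype": "bfloat16"}'
print(resolve_auto_dtype(checkpoint_dtype(example_config)))
```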

In [None]:
import threading
import subprocess

def run_vllm_server():
    subprocess.run(['python', '-m', 'vllm.entrypoints.openai.api_server',
                    '--model', '/home/gbarbadillo/data/deepseekmath',
                    '--dtype', 'auto',
                    '--gpu-memory-utilization', '0.7'
                    ])

server_thread = threading.Thread(target=run_vllm_server)
server_thread.start()

INFO 06-07 14:39:19 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/gbarbadillo/data/deepseekmath', speculative_config=None, tokenizer='/home/gbarbadillo/data/deepseekmath', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/gbarbadillo/data/deepseekmath)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-07 14:39:22 model_runner.py:146] Loading model weights took 12.8725 GB
INFO 06-07 14:39:24 gpu_executor.py:83] # GPU blocks: 338, # CPU blocks: 546
INFO 06-07 14:39:26 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-07 14:39:26 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-07 14:39:32 model_runner.py:924] Graph capturing finished in 6 secs.
INFO 06-07 14:39:32 serving_chat.py:84] Using default chat template:
INFO 06-07 14:39:32 serving_chat.py:84] {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{{ bos_token }}{% for message in messages %}{%

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [40897]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
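
With `--gpu-memory-utilization 0.7` the log reports 338 GPU blocks, compared with 985 in the earlier runs that used the default setting. A rough way to read these numbers is to convert blocks into KV cache tokens. The block size of 16 tokens below is an assumption (I believe it is the vLLM default, but it is not shown in these logs), and `kv_cache_tokens` is a hypothetical helper.

```python
BLOCK_SIZE_TOKENS = 16  # assumed block size; not reported in the logs above
MAX_SEQ_LEN = 4096      # max_seq_len from the engine config log


def kv_cache_tokens(num_gpu_blocks: int, block_size: int = BLOCK_SIZE_TOKENS) -> int:
    """Approximate number of tokens that fit in the GPU KV cache."""
    return num_gpu_blocks * block_size


# Block counts taken from the `# GPU blocks` log lines.
runs = {"default gpu_memory_utilization": 985, "gpu_memory_utilization=0.7": 338}
for label, blocks in runs.items():
    tokens = kv_cache_tokens(blocks)
    print(f"{label}: {blocks} blocks ~ {tokens} tokens "
          f"~ {tokens / MAX_SEQ_LEN:.1f} full-length sequences")
```

Under that assumption, lowering the memory fraction leaves room for far fewer concurrent long sequences, which is worth keeping in mind when choosing the number of workers.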




In [2]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

In [3]:
prompt = 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python'

for _ in range(3):
    completion = client.completions.create(model="/home/gbarbadillo/data/deepseekmath",
                                           prompt=prompt, max_tokens=640,
                                           temperature=0)
    print("Completion result:", completion.choices[0].text)
    print(f'Stop reason: {completion.choices[0].stop_reason}, Finish reason: {completion.choices[0].finish_reason}, output tokens: {completion.usage.completion_tokens}')

INFO 06-07 14:39:33 async_llm_engine.py:553] Received request cmpl-67073e14bec543f8b9c6a9d1cbea3c3a-0: prompt: 'User: John computes the sum of the elements of each of the 15 two-element subsets of $\\{1,2,3,4,5,6\\}$. What is the sum of these 15 sums?\nPlease reason step by step, and put your final answer within \\boxed{}. The answer is a non negative integer.\nUse all the available information in the problem description, and be very careful with the assumptions and simplifications you make.\nYou might use python libraries such as sympy, math, scipy or numpy to solve the problem, use the right tool.\nUse code even for the simpler calculations to avoid mistakes.\n\nAssistant: Sure, we can solve the problem by writing a Python program.\n\n```python', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[],

INFO 06-07 14:39:42 metrics.py:341] Avg prompt throughput: 28.2 tokens/s, Avg generation throughput: 31.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:39:52 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:40:02 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:40:12 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-07 14:40:22 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs,

## TODO

- [x] Stop words. Very easy: just pass `stop` through the OpenAI API.
- [x] Chat template. Nothing to do: the predictions are the same as with Hugging Face.
- [ ] dtype