# g5/g6 serving performance comparison

## [Key metrics for LLM serving](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices)

1. **Time To First Token (TTFT)**: How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
2. **Time Per Output Token (TPOT)**: Time to generate an output token for *each* user that is querying our system. This metric corresponds with how each user will perceive the "speed" of the model. For example, a TPOT of 100 milliseconds/tok would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
3. **Latency**: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = *(TTFT)* + *(TPOT)* * (the number of tokens to be generated).
4. **Throughput**: The number of output tokens per second an inference server can generate across all users and requests.

For ease of experimentation, we measure only metrics 1, 2, and 4 (TTFT, TPOT, and throughput).
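The arithmetic behind the TPOT example and the latency formula can be sketched as follows. The helper names are hypothetical, and the 0.75 words-per-token factor is an assumption inferred from the "10 tokens per second, ~450 words per minute" figure quoted above; the real ratio depends on the tokenizer and the text.

```python
# Minimal sketch of the metric arithmetic; helper names are illustrative only.

WORDS_PER_TOKEN = 0.75  # assumption: implied by 10 tok/s ~ 450 words/min


def tokens_per_second(tpot_ms: float) -> float:
    """Per-user generation speed implied by a TPOT given in milliseconds."""
    return 1000.0 / tpot_ms


def words_per_minute(tpot_ms: float) -> float:
    """Approximate reading-speed equivalent of a TPOT value."""
    return tokens_per_second(tpot_ms) * 60.0 * WORDS_PER_TOKEN


def total_latency_s(ttft_s: float, tpot_ms: float, num_output_tokens: int) -> float:
    """latency = TTFT + TPOT * (number of tokens to be generated)."""
    return ttft_s + (tpot_ms / 1000.0) * num_output_tokens


if __name__ == "__main__":
    print(tokens_per_second(100.0))          # about 10 tokens per second
    print(words_per_minute(100.0))           # about 450 words per minute
    print(total_latency_s(0.5, 100.0, 200))  # about 20.5 seconds
```

With a TTFT of 0.5 s and a TPOT of 100 ms, a 200-token response takes roughly 20.5 s end to end, so for long outputs TPOT dominates the latency a user sees.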

## Set up the environment and start the vLLM server

* AMI: [ami-020f2b388c86c9684](https://us-west-2.console.aws.amazon.com/ec2/home?region=us-west-2#Images:visibility=public-images;imageId=ami-020f2b388c86c9684)
* Monitor GPU usage: `watch -n 0.5 -d nvidia-smi`
* Install vLLM: https://docs.vllm.ai/en/latest/getting_started/installation.html
* OpenAI-compatible API server: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html

```
# (Recommended) Create a new conda environment.
$ conda create -n myenv python=3.9 -y
# Open a new terminal and activate the environment.
$ conda activate myenv
# Install vLLM with CUDA 12.1.
$ pip install vllm==0.3.3
```

Clone the vLLM repository:

```
git clone https://github.com/vllm-project/vllm.git
```

Install the dependencies:

```
cd vllm
pip install -r requirements-cuda.txt
# Install vLLM with CUDA 12.1.
pip install vllm==0.3.3
# Run the benchmark from the benchmarks directory to avoid
# ModuleNotFoundError: No module named 'vllm._C'
cd benchmarks
```
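```
# Log in to Hugging Face (the Llama 2 weights are gated).
huggingface-cli login
# Start the OpenAI-compatible server; it listens on port 8000 by default.
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --dtype float16

# From another terminal, send a test request.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "where is San Francisco?",
        "max_tokens": 10,
        "temperature": 0
    }'
```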
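The same test request can also be sent from Python. This is a minimal sketch using only the standard library; `build_completion_request` and `send` are hypothetical helper names, and it assumes the server started above is reachable at `http://localhost:8000`.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "meta-llama/Llama-2-7b-chat-hf",
    max_tokens: int = 10,
    temperature: float = 0.0,
    base_url: str = "http://localhost:8000",  # assumption: default server address
) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the server to be running):
#   result = send(build_completion_request("where is San Francisco?"))
#   print(result["choices"][0]["text"])
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.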

Notes on the `benchmark_serving.py` options:

```
# --backend: backend name -> request function
#   "tgi": async_request_tgi,
#   "vllm": async_request_openai_completions,
#   "lmdeploy": async_request_openai_completions,
#   "deepspeed-mii": async_request_deepspeed_mii,
#   "openai": async_request_openai_completions,
#   "openai-chat": async_request_openai_chat_completions,
#   "tensorrt-llm": async_request_trt_llm,

# --request-rate (default: inf)
#   Number of requests per second. If this is inf,
#   then all the requests are sent at time 0.
#   Otherwise, we use Poisson process to synthesize
#   the request arrival times.
# --num-prompts (default: 1000)
#   Number of prompts to process.
# --save-result (action="store_true")
#   Save the benchmark results to a JSON file.
```
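The request-rate note above says that arrival times are synthesized with a Poisson process when the rate is finite. A minimal sketch of that idea follows: in a Poisson process the gaps between arrivals are exponentially distributed with mean `1 / request_rate`. The function `arrival_times` is a hypothetical illustration, not the code used by `benchmark_serving.py`.

```python
import math
import random
from typing import List


def arrival_times(num_requests: int, request_rate: float, seed: int = 0) -> List[float]:
    """Return the send time (in seconds) of each request.

    request_rate is in requests per second. With an infinite rate every
    request is sent at time 0; otherwise the gaps between consecutive
    requests are drawn from an exponential distribution.
    """
    if math.isinf(request_rate):
        return [0.0] * num_requests
    rng = random.Random(seed)
    times = []
    now = 0.0
    for _ in range(num_requests):
        times.append(now)
        now += rng.expovariate(request_rate)  # mean gap is 1 / request_rate
    return times


if __name__ == "__main__":
    for rate in (float("inf"), 10.0, 100.0):
        last = arrival_times(100, rate)[-1]
        print(f"request_rate={rate}: last request sent at {last:.2f}s")
```

Under this model, 100 prompts at 10 requests per second all arrive within roughly 10 seconds, so a server that needs much longer than that to drain the queue is likely saturated at every rate tested.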

```
# download ShareGPT for the benchmark
!wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
!pip install aiohttp
```

## g5.2xlarge serving performance (meta-llama/Llama-2-7b-chat-hf, float16)

```
# run benchmark_serving request-rate inf, num-prompts 3
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate inf \
        --num-prompts 3 \
        --save-result
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=3, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True, metadata=None, result_dir=None)
Traffic request rate: inf
100%|█████████████████████████████████████████████| 3/3 [00:28<00:00,  9.43s/it]
Successful requests:                     3
Benchmark duration (s):                  28.28
Total input tokens:                      584
Total generated tokens:                  1432
Request throughput (req/s):              0.11
Input token throughput (tok/s):          20.65
Output token throughput (tok/s):         50.64
---------------Time to First Token-------
```

```
# run benchmark_serving request-rate inf, num-prompts 100
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate inf \
        --num-prompts 100
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=100, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None)
Traffic request rate: inf
100%|█████████████████████████████████████████| 100/100 [00:48<00:00,  2.08it/s]
Successful requests:                     100
Benchmark duration (s):                  48.04
Total input tokens:                      25900
Total generated tokens:                  18393
Request throughput (req/s):              2.08
Input token throughput (tok/s):          539.11
Output token throughput (tok/s):         382.85
---------------Time to First Token----
```

```
# run benchmark_serving request-rate 10, num-prompts 100
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate 10 \
        --num-prompts 100
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=100, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=10.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None)
Traffic request rate: 10.0
100%|█████████████████████████████████████████| 100/100 [00:48<00:00,  2.06it/s]
Successful requests:                     100
Benchmark duration (s):                  48.43
Total input tokens:                      25900
Total generated tokens:                  18432
Request throughput (req/s):              2.06
Input token throughput (tok/s):          534.74
Output token throughput (tok/s):         380.55
---------------Time to First Token--
```

```
# run benchmark_serving request-rate 100, num-prompts 100
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate 100 \
        --num-prompts 100
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=100, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=100.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None)
Traffic request rate: 100.0
100%|█████████████████████████████████████████| 100/100 [00:47<00:00,  2.12it/s]
Successful requests:                     100
Benchmark duration (s):                  47.13
Total input tokens:                      25900
Total generated tokens:                  18184
Request throughput (req/s):              2.12
Input token throughput (tok/s):          549.60
Output token throughput (tok/s):         385.87
---------------Time to First Token
```

```
# run benchmark_serving request-rate 100, num-prompts 1000
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate 100 \
        --num-prompts 1000
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=100.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None)
Traffic request rate: 100.0
100%|███████████████████████████████████████| 1000/1000 [06:40<00:00,  2.49it/s]
Successful requests:                     1000
Benchmark duration (s):                  400.92
Total input tokens:                      248339
Total generated tokens:                  195161
Request throughput (req/s):              2.49
Input token throughput (tok/s):          619.42
Output token throughput (tok/s):         486.78
---------------Time to First Token
```

## g6.2xlarge serving performance (meta-llama/Llama-2-7b-chat-hf, float16)

```
# run benchmark_serving request-rate inf, num-prompts 3
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate inf \
        --num-prompts 3 \
        --save-result
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=3, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True, metadata=None, result_dir=None)
Traffic request rate: inf
100%|█████████████████████████████████████████████| 3/3 [00:50<00:00, 16.73s/it]
Successful requests:                     3
Benchmark duration (s):                  50.18
Total input tokens:                      584
Total generated tokens:                  1432
Request throughput (req/s):              0.06
Input token throughput (tok/s):          11.64
Output token throughput (tok/s):         28.54
---------------Time to First Token-------
```

```
# run benchmark_serving request-rate inf, num-prompts 100
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate inf \
        --num-prompts 100
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=100, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None)
Traffic request rate: inf
100%|█████████████████████████████████████████| 100/100 [01:18<00:00,  1.28it/s]
Successful requests:                     100
Benchmark duration (s):                  78.32
Total input tokens:                      25900
Total generated tokens:                  18268
Request throughput (req/s):              1.28
Input token throughput (tok/s):          330.69
Output token throughput (tok/s):         233.25
---------------Time to First Token----
```

```
# run benchmark_serving request-rate 10, num-prompts 100
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate 10 \
        --num-prompts 100
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=100, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=10.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None)
Traffic request rate: 10.0
100%|█████████████████████████████████████████| 100/100 [01:18<00:00,  1.27it/s]
Successful requests:                     100
Benchmark duration (s):                  78.86
Total input tokens:                      25900
Total generated tokens:                  18510
Request throughput (req/s):              1.27
Input token throughput (tok/s):          328.43
Output token throughput (tok/s):         234.72
---------------Time to First Token--
```

```
# run benchmark_serving request-rate 100, num-prompts 100
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate 100 \
        --num-prompts 100
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=100, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=100.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None)
Traffic request rate: 100.0
100%|█████████████████████████████████████████| 100/100 [01:17<00:00,  1.29it/s]
Successful requests:                     100
Benchmark duration (s):                  77.24
Total input tokens:                      25900
Total generated tokens:                  18365
Request throughput (req/s):              1.29
Input token throughput (tok/s):          335.33
Output token throughput (tok/s):         237.77
---------------Time to First Token
```

```
# run benchmark_serving request-rate 100, num-prompts 1000
!python benchmark_serving.py \
        --backend vllm \
        --model "meta-llama/Llama-2-7b-chat-hf" \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate 100 \
        --num-prompts 1000
```

```
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, request_rate=100.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None)
Traffic request rate: 100.0
100%|███████████████████████████████████████| 1000/1000 [10:33<00:00,  1.58it/s]
Successful requests:                     1000
Benchmark duration (s):                  633.90
Total input tokens:                      248339
Total generated tokens:                  195463
Request throughput (req/s):              1.58
Input token throughput (tok/s):          391.76
Output token throughput (tok/s):         308.35
---------------Time to First Token
```
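To put the two instance types side by side, the sketch below computes the g5/g6 ratio of output token throughput for each run. The throughput values are copied from the benchmark outputs above; the dictionary layout and the `throughput_ratios` helper are illustrative assumptions, not part of the benchmark script.

```python
from typing import Dict, Tuple

# (request_rate, num_prompts) -> output token throughput in tok/s,
# copied from the "Output token throughput" lines reported above.
Run = Tuple[str, int]

G5_2XLARGE: Dict[Run, float] = {
    ("inf", 3): 50.64,
    ("inf", 100): 382.85,
    ("10", 100): 380.55,
    ("100", 100): 385.87,
    ("100", 1000): 486.78,
}

G6_2XLARGE: Dict[Run, float] = {
    ("inf", 3): 28.54,
    ("inf", 100): 233.25,
    ("10", 100): 234.72,
    ("100", 100): 237.77,
    ("100", 1000): 308.35,
}


def throughput_ratios(a: Dict[Run, float], b: Dict[Run, float]) -> Dict[Run, float]:
    """Return a[run] / b[run] for every run present in both result sets."""
    return {run: a[run] / b[run] for run in a if run in b}


if __name__ == "__main__":
    for (rate, prompts), ratio in throughput_ratios(G5_2XLARGE, G6_2XLARGE).items():
        print(f"request-rate={rate:>3} num-prompts={prompts:>4}: g5/g6 = {ratio:.2f}x")
```

On these numbers, g5.2xlarge delivers roughly 1.6x to 1.8x the output token throughput of g6.2xlarge for Llama-2-7b-chat-hf in float16. Each configuration was run once, so the ratios should be read as indicative rather than precise.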