Fork of LLMPerf optimized for open LLM usage.
git clone https://github.com/philschmid/llmperf.git
pip install -e llmperf/
This fork of LLMPerf was used to generate the following benchmarks:
We implement two tests for evaluating LLMs: a load test to measure performance and a correctness test to validate the model's outputs.
Note: This includes vLLM, TGI, or NVIDIA NIM containers.
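Before running the load test, it can help to confirm that the endpoint actually answers OpenAI-style chat completion requests. The following is a minimal sketch using the openai Python client; it reads the same environment variables that are set below, and the model name is only an example that has to match your deployment.

import os
from openai import OpenAI

# Optional sanity check that the OpenAI-compatible endpoint answers a chat
# completion before starting a load test. API key and base URL come from the
# environment variables configured below.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "secret_abcdefg"),
    base_url=os.environ.get("OPENAI_API_BASE", "http://localhost:8000/v1"),
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model, adjust to your deployment
    messages=[{"role": "user", "content": "What is 10+10?"}],
    max_tokens=32,
)
print(response.choices[0].message.content)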
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1" # or "http://localhost:8000/v1"
python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY" # only needed for Inference Endpoints and the serverless API
# local testing "http://localhost:8000"
# serverless hosted models "https://api-inference.huggingface.co"
# Inference endpoints, e.g. "https://ptrlmejh4tjmcb4t.us-east-1.aws.endpoints.huggingface.cloud"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_URL"
export MODEL_ID="meta-llama/Llama-2-7b-chat-hf"
python token_benchmark_ray.py \
--model $MODEL_ID \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api huggingface \
--additional-sampling-params '{}'
SageMaker doesn't return the total number of tokens generated by its endpoints, so tokens are counted using the Llama tokenizer.
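To illustrate the idea, the sketch below counts the tokens of a generated string client-side with a Llama tokenizer. The checkpoint hf-internal-testing/llama-tokenizer is an assumption made here because it is an ungated copy of the Llama tokenizer; the fork may pin a different one internally.

from transformers import AutoTokenizer

# Count generated tokens client-side, since the endpoint response does not
# include a token count. The tokenizer checkpoint is an assumption.
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

generated_text = "Paris is the capital of France."
num_output_tokens = len(tokenizer.encode(generated_text, add_special_tokens=False))
print(f"generated tokens: {num_output_tokens}")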
MESSAGES_API=true python llmperf/token_benchmark_ray.py \
--model {endpoint_name} \
--llm-api "sagemaker" \
--max-num-completed-requests 500 \
--timeout 600 \
--num-concurrent-requests 25 \
--results-dir "results"
NOTE: WIP, not yet tested.
Here, --model is used for logging, not for selecting the model. The model is specified in the Vertex AI Endpoint ID.
The GCLOUD_ACCESS_TOKEN needs to be refreshed fairly regularly, as the token generated by gcloud auth print-access-token expires after 15 minutes or so.
Vertex AI doesn't return the total number of tokens generated by its endpoints, so tokens are counted using the Llama tokenizer.
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
export GCLOUD_PROJECT_ID=YOUR_PROJECT_ID
export GCLOUD_REGION=YOUR_REGION
export VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID
python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "vertexai" \
--additional-sampling-params '{}'
See python token_benchmark_ray.py --help for more details on the arguments.
First we need to start TGI:
model=meta-llama/Meta-Llama-3-8B-Instruct
token=$(cat ~/.cache/huggingface/token)
num_shard=1
max_input_length=5000
max_total_tokens=6000
max_batch_prefill_tokens=6144
docker run --gpus $num_shard -ti -p 8080:80 \
-e MODEL_ID=$model \
-e HF_TOKEN=$token \
-e NUM_SHARD=$num_shard \
-e MAX_INPUT_LENGTH=$max_input_length \
-e MAX_TOTAL_TOKENS=$max_total_tokens \
-e MAX_BATCH_PREFILL_TOKENS=$max_batch_prefill_tokens \
ghcr.io/huggingface/text-generation-inference:2.0.3
Test the TGI endpoint:
curl http://localhost:8080 \
-X POST \
-d '{"inputs":"What is 10+10?","parameters":{"temperature":0.2, "top_p": 0.95, "max_new_tokens": 256}}' \
-H 'Content-Type: application/json'
Then we can run the benchmark:
export HUGGINGFACE_API_BASE="http://localhost:8080"
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
python token_benchmark_ray.py \
--model $MODEL_ID \
--max-num-completed-requests 100 \
--num-concurrent-requests 10 \
--results-dir "result_outputs" \
--llm-api huggingface
Parse results
python parse_results.py --results-dir "result_outputs"
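If you want to inspect the raw output programmatically instead, the benchmark writes JSON files into the results directory. The sketch below assumes a file whose name ends in summary.json; the exact naming is an assumption about this fork's output.

import glob
import json
import os

# Load whatever summary files the benchmark wrote into the results directory.
# The "*summary.json" pattern is an assumption about the file naming.
results_dir = "result_outputs"
for path in glob.glob(os.path.join(results_dir, "*summary.json")):
    with open(path) as f:
        summary = json.load(f)
    print(os.path.basename(path))
    for key, value in summary.items():
        print(f"  {key}: {value}")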
Results on a 1x A10G GPU:
Avg. Input token length: 550
Avg. Output token length: 150
Avg. Time To First Token: 375.99 ms
Avg. Throughput: 163.23 tokens/sec
Avg. Latency: 38.22 ms/token
Results on a 1x H100 GPU (with max_batch_prefill_tokens=16182):
Note: WIP
In this fork, we added support for using datasets from Hugging Face to generate the input for the LLM. The dataset should either have a prompt column or use the messages format from OpenAI, in which case the first user message is used as the input.
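As a rough illustration of the two accepted formats, the helper below shows how an input could be derived from a dataset row; it is not the fork's actual loading code, and the dataset id is only an example.

from datasets import load_dataset

def extract_input(example: dict):
    # Option 1: a plain "prompt" column is used directly.
    if example.get("prompt"):
        return example["prompt"]
    # Option 2: OpenAI-style "messages"; the first user message becomes the input.
    for message in example.get("messages", []):
        if message["role"] == "user":
            return message["content"]
    return None

# Example dataset that ships a "prompt" column and OpenAI-style "messages".
ds = load_dataset("HuggingFaceH4/no_robots", split="train")
print(extract_input(ds[0]))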
Note: WIP.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"stream": true
}'
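The same streaming request can also be issued from Python; the sketch below mirrors the curl example above (same base URL, dummy API key, and model name) and simply prints the streamed tokens as they arrive.

from openai import OpenAI

# Mirrors the curl example above: stream a chat completion and print the
# generated tokens as they arrive.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()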