
LLMPerf

Fork of LLMPerf optimized for open LLM usage.

Installation

git clone https://github.com/philschmid/llmperf.git 
pip install -e llmperf/

Benchmarks

This fork of LLMPerf was used to generate the following benchmarks:

Basic Usage

We implement two tests for evaluating LLMs: a load test to measure performance and a correctness test to validate outputs.

OpenAI Compatible APIs

Note: This includes vLLM, TGI, or NVIDIA NIM containers.

export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1" # or "http://localhost:8000/v1"

python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
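
Before kicking off a longer benchmark, it can help to sanity-check that the endpoint actually answers OpenAI-style chat completion requests. A minimal sketch using the openai Python client (>= 1.0), reusing the environment variables exported above; the model name is just an example:

# Quick sanity check of an OpenAI-compatible endpoint (vLLM, TGI, NIM, ...).
# Assumes the `openai` Python package (>= 1.0) is installed.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["OPENAI_API_BASE"],  # e.g. "http://localhost:8000/v1"
    api_key=os.environ["OPENAI_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is 10+10?"}],
    max_tokens=32,
)
print(response.choices[0].message.content)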

Hugging Face (TGI)

export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY" # only needed for Inference Endpoints and the serverless API
# local testing "http://localhost:8000"
# serverless hosted models "https://api-inference.huggingface.co"
# Inference endpoints, e.g. "https://ptrlmejh4tjmcb4t.us-east-1.aws.endpoints.huggingface.cloud"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_URL"
export MODEL_ID="meta-llama/Llama-2-7b-chat-hf"

python token_benchmark_ray.py \
--model $MODEL_ID \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api huggingface \
--additional-sampling-params '{}'

SageMaker (TGI)

SageMaker doesn't return the total number of tokens generated by its endpoints, so tokens are counted locally using the Llama tokenizer.

MESSAGES_API=true python llmperf/token_benchmark_ray.py \
--model {endpoint_name} \
--llm-api "sagemaker" \
--max-num-completed-requests 500 \
--timeout 600 \
--num-concurrent-requests 25 \
--results-dir "results"
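
For reference, the token counting mentioned above can be reproduced locally with a Hugging Face tokenizer. A minimal sketch, assuming the transformers package is installed; the tokenizer repo below is a public copy of the Llama tokenizer and is an assumption, not necessarily the exact one the script loads:

# Count prompt/completion tokens locally when the endpoint doesn't report them.
# The tokenizer repo is an assumption (a public copy of the Llama tokenizer).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

prompt = "What is 10+10?"
completion = "10 + 10 equals 20."

num_prompt_tokens = len(tokenizer.encode(prompt))
num_completion_tokens = len(tokenizer.encode(completion, add_special_tokens=False))
print(num_prompt_tokens, num_completion_tokens)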

Vertex AI

Note: WIP, not yet tested.

Here, --model is used only for logging, not for selecting the model; the model is determined by the Vertex AI endpoint (VERTEXAI_ENDPOINT_ID).

GCLOUD_ACCESS_TOKEN needs to be refreshed regularly, as the token generated by gcloud auth print-access-token expires after roughly 15 minutes.

Vertex AI doesn't return the total number of tokens generated by its endpoints, so tokens are counted using the Llama tokenizer.

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
export GCLOUD_PROJECT_ID=YOUR_PROJECT_ID
export GCLOUD_REGION=YOUR_REGION
export VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID

python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "vertexai" \
--additional-sampling-params '{}'

See python token_benchmark_ray.py --help for more details on the arguments.

Examples and other use cases

End-to-end test for Llama 3 8B Instruct

First we need to start TGI:

model=meta-llama/Meta-Llama-3-8B-Instruct
token=$(cat ~/.cache/huggingface/token)
num_shard=1
max_input_length=5000
max_total_tokens=6000
max_batch_prefill_tokens=6144
docker run --gpus $num_shard -ti -p 8080:80 \
  -e MODEL_ID=$model \
  -e HF_TOKEN=$token \
  -e NUM_SHARD=$num_shard \
  -e MAX_INPUT_LENGTH=$max_input_length \
  -e MAX_TOTAL_TOKENS=$max_total_tokens \
  -e MAX_BATCH_PREFILL_TOKENS=$max_batch_prefill_tokens \
  ghcr.io/huggingface/text-generation-inference:2.0.3

Test the TGI endpoint:

curl http://localhost:8080 \
    -X POST \
    -d '{"inputs":"What is 10+10?","parameters":{"temperature":0.2, "top_p": 0.95, "max_new_tokens": 256}}' \
    -H 'Content-Type: application/json'

Then we can run the benchmark:

export HUGGINGFACE_API_BASE="http://localhost:8080"
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
python token_benchmark_ray.py \
--model $MODEL_ID \
--max-num-completed-requests 100 \
--num-concurrent-requests 10 \
--results-dir "result_outputs" \
--llm-api huggingface 

Parse results

python parse_results.py --results-dir "result_outputs"
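
If you prefer to post-process the raw output yourself, the benchmark writes JSON files into the results directory. A minimal sketch, assuming the upstream LLMPerf naming of *_summary.json files; adjust the glob to whatever files you actually find:

# Load the benchmark summary JSON(s) and print the recorded metrics.
# Assumes the results directory contains *_summary.json files (upstream LLMPerf naming).
import glob
import json

for path in glob.glob("result_outputs/*_summary.json"):
    with open(path) as f:
        summary = json.load(f)
    print(path)
    for key, value in summary.items():
        print(f"  {key}: {value}")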

Results on a 1x A10G GPU:

Avg. Input token length: 550
Avg. Output token length: 150
Avg. Time-To-First-Token: 375.99 ms
Avg. Throughput: 163.23 tokens/sec
Avg. Latency: 38.22 ms/token

Results on a 1x H100 GPU (max_batch_prefill_tokens=16182):

Speculative Decoding

Note: WIP

Use Hugging Face Dataset

In this fork we added support for using datasets from Hugging Face to generate the input for the LLM. The dataset should either have a prompt column or use the OpenAI messages format, in which case the first user message is used as input.

Note: WIP.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk" \
    -d '{ "model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "Hello!" } ], "stream": true }'
