⚠️ Code in this repo is written for testing purposes and should not be used in production
The Azure OpenAI Benchmarking tool is designed to help customers benchmark their provisioned-throughput deployments. Provisioned throughput deployments provide a set amount of model compute, but the exact performance for your application depends on several variables, such as prompt size, generation size and call rate. This tool supports both Azure OpenAI and OpenAI.com model endpoints.
The benchmarking tool provides a simple way to run test traffic on your deployment and validate the throughput for your traffic workloads. The script will output key performance statistics, including the average and 95th percentile latencies and utilization of the deployment.
You can use this tool to experiment with total throughput at 100% utilization across different traffic patterns for a Provisioned-Managed deployment type. These tests allow you to better optimize your solution design by adjusting the prompt size, generation size and PTUs deployed.
- An Azure OpenAI Service resource with a model deployed using a provisioned deployment type (either Provisioned or Provisioned-Managed). For more information, see the resource deployment guide.
- Your resource endpoint and access key. The script assumes the key is stored in the `OPENAI_API_KEY` environment variable. For more information on finding your endpoint and key, see the Azure OpenAI Quickstart.
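Since the script reads the key from the environment, it can be useful to fail fast when it is missing. A minimal sketch (the `get_api_key` helper is illustrative, not part of the tool):

```python
import os

def get_api_key() -> str:
    """Fail fast if the access key the benchmark expects is missing."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "Set the OPENAI_API_KEY environment variable before running the benchmark."
        )
    return key
```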
In an existing Python environment:
$ pip install -r requirements.txt
$ python -m benchmark.bench load --help
Build a docker container:
$ docker build -t azure-openai-benchmarking .
$ docker run azure-openai-benchmarking load --help
Consider the following guidelines when creating your benchmark tests:
- Read the CLI argument descriptions by running `benchmark.bench load -h`. Start by reading about each of the arguments and how they work; this will help you design your test with the right parameters.
- Ensure call characteristics match your production expectations. The number of calls per minute and total tokens you are able to process vary depending on the prompt size, generation size and call rate.
- Run your test long enough to reach a stable state. Throttling is based on the total compute you have deployed and are utilizing. The utilization includes active calls. As a result, you will see a higher call rate when ramping up on an unloaded deployment because there are no existing active calls being processed. Once your deployment is fully loaded with a utilization near 100%, throttling will increase as calls can only be processed as earlier ones are completed. To ensure an accurate measure, set the duration long enough for the throughput to stabilize, especially when running at or close to 100% utilization. Also note that once the test ends (either by termination, or reaching the maximum duration or number of requests), any pending requests will continue to drain, which can result in lower throughput values as the load on the endpoint gradually decreases to 0.
- Consider whether to use a retry strategy, and the effect of throttling on the resulting stats. Select a retry strategy carefully, as the resulting latency statistics will be affected if the resource is pushed beyond its capacity and to the point of throttling.
- When running a test with `retry=none`, any throttled request will be treated as throttled and a new request will be made to replace it, with the start time of the replacement request reset to a newer time. If the resource being tested starts returning 429s, the latency metrics from this tool will only represent the values of the final successful request, without including the time spent retrying until a successful response was received (which may not be representative of the real-world user experience). Use this setting when the workload being tested is within the resource's capacity and no throttling occurs, or when you want to understand what percentage of requests to a PTU instance might need to be diverted to a backup resource, such as during periods of peak load which require more throughput than the PTU resource can handle.
- When running a test with `retry=exponential`, any failed or throttled request will be retried with exponential backoff, up to a maximum of 60 seconds. While it is always recommended to deploy backup AOAI resources for use cases that will experience periods of high load, this setting may be useful for simulating a scenario where no backup resource is available and throttled or failed requests must still be fulfilled by the resource. In this case, the TTFT and e2e latency metrics represent the time from the first throttled request to the time the final request succeeded, and may be more reflective of the total time an end user could spend waiting for a response, e.g. in a chat application. Use this option when you want to understand the latency of requests which are throttled and need to be retried on the same resource, and how the total latency of a request is impacted by multiple retries.
- As a practical example, if a PTU resource is tested beyond 100% capacity and starts returning 429s:
  - With `retry=none`, the TTFT and e2e latency statistics will remain stable (and very low), since only successful requests are included in the metrics. The number of throttled requests will be relatively high.
  - With `retry=exponential`, the TTFT/e2e latency metrics will increase (potentially up to the max of 60 seconds), while the number of throttled requests will remain lower (since a request is only treated as throttled after 60 seconds, regardless of how many attempts were made within the retry period).
  - Total throughput values (RPM, TPM) may be lower with `retry=none` if rate limiting is applied.
- As a best practice, any PTU resource should be deployed with a backup PayGO resource for times of peak load. Accordingly, any testing should be conducted with the values suggested in the AOAI capacity calculator (in the Azure AI Portal) to ensure that throttling does not occur during testing.
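The exponential retry strategy described above backs off between attempts up to the 60-second cap. A minimal sketch of such a delay schedule (illustrative only; the base delay and attempt count here are assumptions, not the tool's internal parameters):

```python
def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 8):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]

print(backoff_delays())  # delays double until they hit the 60 s cap
```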
The table below provides an example prompt & generation size we have seen with some customers. Actual sizes will vary significantly based on your overall architecture. For example, the amount of data grounding you pull into the prompt as part of a chat session can increase the prompt size significantly.
Scenario | Prompt Size | Completion Size | Calls per minute | Provisioned throughput units (PTU) required |
---|---|---|---|---|
Chat | 1000 | 200 | 45 | 200 |
Summarization | 7000 | 150 | 7 | 100 |
Classification | 7000 | 1 | 24 | 300 |
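The token load these shapes imply can be sanity-checked with simple arithmetic; for example, the Chat row corresponds to 1000 × 45 = 45,000 context tokens per minute. A quick sketch using the table values:

```python
# Implied tokens-per-minute for the example scenarios in the table above.
scenarios = {
    # name: (prompt_tokens, completion_tokens, calls_per_minute)
    "Chat": (1000, 200, 45),
    "Summarization": (7000, 150, 7),
    "Classification": (7000, 1, 24),
}

def implied_tpm(prompt_tokens, completion_tokens, calls_per_minute):
    """Context and generation tokens per minute implied by a traffic shape."""
    return prompt_tokens * calls_per_minute, completion_tokens * calls_per_minute

for name, shape in scenarios.items():
    ctx_tpm, gen_tpm = implied_tpm(*shape)
    print(f"{name}: ctx tpm={ctx_tpm}, gen tpm={gen_tpm}")
```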
Or see the pre-configured shape-profiles below.
During a run, statistics are output every second to `stdout`, while logs are output to `stderr`. Some metrics may not show up immediately due to lack of data.
Run load test at 60 RPM with exponential retry back-off
$ python -m benchmark.bench load \
--deployment gpt-4 \
--rate 60 \
--retry exponential \
https://myaccount.openai.azure.com
2023-10-19 18:21:06 INFO using shape profile balanced: context tokens: 500, max tokens: 500
2023-10-19 18:21:06 INFO warming up prompt cache
2023-10-19 18:21:06 INFO starting load...
2023-10-19 18:21:06 rpm: 1.0 requests: 1 failures: 0 throttled: 0 ctx tpm: 501.0 gen tpm: 103.0 ttft avg: 0.736 ttft 95th: n/a tbt avg: 0.088 tbt 95th: n/a e2e avg: 1.845 e2e 95th: n/a util avg: 0.0% util 95th: n/a
2023-10-19 18:21:07 rpm: 5.0 requests: 5 failures: 0 throttled: 0 ctx tpm: 2505.0 gen tpm: 515.0 ttft avg: 0.937 ttft 95th: 1.321 tbt avg: 0.042 tbt 95th: 0.043 e2e avg: 1.223 e2e 95th: 1.658 util avg: 0.8% util 95th: 1.6%
2023-10-19 18:21:08 rpm: 8.0 requests: 8 failures: 0 throttled: 0 ctx tpm: 4008.0 gen tpm: 824.0 ttft avg: 0.913 ttft 95th: 1.304 tbt avg: 0.042 tbt 95th: 0.043 e2e avg: 1.241 e2e 95th: 1.663 util avg: 1.3% util 95th: 2.6%
Load test with custom messages loaded from a file and used in all requests
$ python -m benchmark.bench load \
--deployment gpt-4 \
--rate 1 \
    --context-generation-method replay \
    --replay-path replay_messages.json \
--max-tokens 500 \
https://myaccount.openai.azure.com
Load test with custom request shape, and automatically save output to file
$ python -m benchmark.bench load \
--deployment gpt-4 \
--rate 1 \
--shape custom \
--context-tokens 1000 \
--max-tokens 500 \
--log-save-dir logs/ \
https://myaccount.openai.azure.com
As above, but also record the timestamps, call status and input & output content of every individual request
$ python -m benchmark.bench load \
--deployment gpt-4 \
--rate 1 \
--shape custom \
--context-tokens 1000 \
--max-tokens 500 \
--log-save-dir logs/ \
--log-request-content true \
https://myaccount.openai.azure.com
Obtain the number of tokens for an input context
The `tokenize` subcommand can be used to count the number of tokens for a given input. It supports both text and JSON chat messages as input.
$ python -m benchmark.bench tokenize \
--model gpt-4 \
"this is my context"
tokens: 4
Alternatively you can send your text via stdin:
$ cat mychatcontext.json | python -m benchmark.bench tokenize \
--model gpt-4
tokens: 65
Extract and Combine Statistics from JSON logs to CSV
The `combine_logs` CLI can be used to load and combine the logs from multiple runs into a single CSV, ready for comparison and analysis. This tool extracts:
- The arguments that were used to initiate the benchmarking run
- The aggregate statistics of all requests in the run
- With `--include-raw-request-info true`, the timestamps, call status and all input/output content of every individual request will be extracted and saved into the combined CSV. This can be used to plot distributions of values and the start/finish of each individual request.
Additionally, the `--load-recursive` arg will search not only the provided directory, but all subdirectories as well.
Note: The core benchmarking tool waits for any incomplete requests to 'drain' when the end of the run is reached, without replacing these requests with new ones. This can mean that overall TPM and RPM begin to drop after the draining point as the remaining requests slowly finish, dragging the average TPM and RPM statistics down. For this reason, it is recommended to use `--stat-extraction-point draining` to extract the aggregate statistics that were logged when draining began (prior to any reduction in throughput). If, however, you are more interested in latency values and do not care about the RPM and TPM values, use `--stat-extraction-point final`, which will extract the very last line of logged statistics (which should include all completed requests that are still within the aggregation window).
# Extract stats that were logged when the duration/requests limit was reached
$ python -m benchmark.contrib.combine_logs logs/ combined_logs.csv --load-recursive \
--stat-extraction-point draining
# Extract aggregate AND individual call stats that were logged when the duration/requests limit was reached
$ python -m benchmark.contrib.combine_logs logs/ combined_logs.csv --load-recursive \
--stat-extraction-point draining --include-raw-request-info
# Extract the very last line of logs, after the very last request has finished
$ python -m benchmark.contrib.combine_logs logs/ combined_logs.csv --load-recursive \
--stat-extraction-point final
Extract Raw Call Data from a Combined Logs CSV
Once the `combine_logs` CLI has been run, the `extract_raw_samples` CLI can be used to extract all individual call data from each separate run. This is useful for digging deeper into the data for each individual benchmark run, enabling you to include or exclude individual calls prior to analysis, create custom aggregations, or inspect the call history or request & response content of individual requests.
Additionally, the `--exclude-failed-requests` arg will drop any call records that were unsuccessful (where the request code != 200, or where no tokens were generated).
# Extract individual call samples from a combined logs CSV
$ python -m benchmark.contrib.extract_raw_samples logs/combined_logs.csv \
logs/raw_request_samples.csv
# Extract individual call samples, excluding unsuccessful requests from the result
$ python -m benchmark.contrib.extract_raw_samples logs/combined_logs.csv \
logs/raw_request_samples.csv --exclude-failed-requests
Run Batches of Multiple Configurations
The `batch_runner` CLI can be used to run batches of benchmark runs back-to-back. Currently, this CLI only works for runs where `context-generation-method = generate`. The CLI also includes a `--start-ptum-runs-at-full-utilization` argument (default `true`), which warms up any PTU-M model endpoints to 100% utilization prior to testing; this is critical for ensuring that test results reflect accurate real-world performance. To see the full list of args which can be used for all runs in each batch, run `python -m benchmark.contrib.batch_runner -h`.
To use the CLI, create a list of token profile and rate combinations to be used, and then select the number of batches and interval to be used between each batch. When using the batch runner with the commands below, make sure to execute the command from the root directory of the repo.
Example - Run a single batch with `context-generation-method=generate` and the following two configurations for 120 seconds each, making sure to automatically warm up the endpoint prior to each run (if it is a PTU-M endpoint), and also saving all request input and output content from each run:
- context_tokens=500, max_tokens=100, rate=20
- context_tokens=3500, max_tokens=300, rate=7.5
$ python -m benchmark.contrib.batch_runner https://myaccount.openai.azure.com/ \
--deployment gpt-4-1106-ptu --context-generation-method generate \
--token-rate-workload-list 500-100-20,3500-300-7.5 --duration 130 \
--aggregation-window 120 --log-save-dir logs/ \
--start-ptum-runs-at-full-utilization true --log-request-content true
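Each entry in `--token-rate-workload-list` encodes a `context_tokens-max_tokens-rate` triple. A sketch of how such a string decomposes (a hypothetical parser for illustration; `batch_runner` has its own argument handling):

```python
def parse_workloads(spec: str):
    """Split '<context_tokens>-<max_tokens>-<rate>' triples from a comma-separated
    list, e.g. '500-100-20,3500-300-7.5'. Covers generate-style numeric triples only."""
    workloads = []
    for item in spec.split(","):
        # rsplit keeps fractional rates like '7.5' intact
        context_tokens, max_tokens, rate = item.rsplit("-", 2)
        workloads.append((int(context_tokens), int(max_tokens), float(rate)))
    return workloads

print(parse_workloads("500-100-20,3500-300-7.5"))
```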
Example - Run the same batch as above, but 5x times and with a 1 hour delay between the start of each batch:
$ python -m benchmark.contrib.batch_runner https://myaccount.openai.azure.com/ \
--deployment gpt-4-1106-ptu --context-generation-method generate \
--token-rate-workload-list 500-100-20,3500-300-7.5 --duration 130 \
--aggregation-window 120 --log-save-dir logs/ \
--start-ptum-runs-at-full-utilization true --log-request-content true \
--num-batches 5 --batch-start-interval 3600
Example - Run a batch using `context-generation-method=replay`. In this example, the first item in the token-rate-workload-list is the path to the replay messages dataset (see the next section for more info on how this works). Make sure that the replay messages filename does not contain dashes, and that the path is relative to the directory from which you are running the command:
$ python -m benchmark.contrib.batch_runner https://myaccount.openai.azure.com/ \
--deployment gpt-4-1106-ptu --context-generation-method replay \
--token-rate-workload-list tests/test_replay_messages.json-100-20,tests/test_replay_messages.json-300-7.5 \
--duration 130 --aggregation-window 120 --log-save-dir logs/ \
    --start-ptum-runs-at-full-utilization true --log-request-content true
Using the `--context-generation-method` argument, this tool gives two options for how the source content of each request is generated:
1: `generate` [default]: Context information is generated automatically from a list of all English words, and the endpoint is instructed to generate a long story of `max_tokens` words. This is useful when existing data is not yet available, and should result in similar performance to real-world workloads with the same number of context & completion tokens.
In this mode, there are four different shape profiles, selectable via the command-line option `--shape-profile`:
profile | description | context tokens | max tokens |
---|---|---|---|
`balanced` | [default] Balanced count of context and generation tokens. Should be representative of typical workloads. | 500 | 500 |
`context` | Represents workloads with larger context sizes compared to generation. For example, chat assistants. | 2000 | 200 |
`generation` | Represents workloads with larger generation and smaller contexts. For example, question answering. | 500 | 1000 |
`custom` | Allows specifying custom values for context size (`--context-tokens`) and max generation tokens (`--max-tokens`). | - | - |
Note: With the default prompting strategy, OpenAI models will typically return completions of at most 700-1200 tokens. If setting `max_tokens` above 750, be aware that the results for `rpm` may be higher, and `e2e` latency lower, than if the model were returning completions of size `max_tokens` in every response. Refer to the `gen_tpr` stats at the end of each run to see how many tokens were generated across responses.
2: `replay`: Messages are loaded from a JSON file and replayed back to the endpoint. This is useful for scenarios where testing with real-world data is important, and that data has already been generated or collected from an existing LLM application.
In this mode, all messages in the file are sampled randomly when making requests to the endpoint. This means the same message may be used multiple times in a benchmarking run, plus any anti-caching prefix if `prevent-server-caching=true`. The format of the JSON file should be a single array containing separate lists of messages which conform to the OpenAI chat completions API schema. Two examples are available in the `tests/` folder, with the text-only example as follows:
[
[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Can you explain how photosynthesis works?"}
],
[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "Please tell me about the history of Paris."}
]
]
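A replay dataset in this shape can be generated or sanity-checked with a few lines of Python (the `validate_replay` helper is illustrative, not the tool's own loader):

```python
import json

# Two example conversations in the replay format shown above.
conversations = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Can you explain how photosynthesis works?"},
    ],
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
]

def validate_replay(payload: str) -> int:
    """Check a replay dataset is a JSON array of message lists, each message
    having at least 'role' and 'content' keys; return the conversation count."""
    data = json.loads(payload)
    if not isinstance(data, list):
        raise ValueError("replay file must be a JSON array of message lists")
    for messages in data:
        for msg in messages:
            if not {"role", "content"} <= set(msg):
                raise ValueError("each message needs 'role' and 'content'")
    return len(data)

print(validate_replay(json.dumps(conversations)))  # 2
```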
When `--prevent-server-caching=true`, every message in each request payload is prefixed with a random string to force the inference endpoint to process each request without any optimization/caching that might occur if workloads are identical. This ensures the results observed while running the tool reflect the worst-case scenario for a given traffic shape. For example:
initial request | request with random prefixes |
---|---|
`{"role": "user", "content": "Can you explain how photosynthesis works?"}` | `{"role": "user", "content": "1704441942.868042 Can you explain how photosynthesis works?"}` |
 | `{"role": "user", "content": "1704441963.715898 Can you explain how photosynthesis works?"}` |
Setting `--prevent-server-caching=false` is only recommended when a sufficiently large replay dataset is available (e.g. at least double the number of messages compared to the total number of requests to be made across all test runs in a session). If the cache needs to be cleared/reset for additional runs, it is recommended to delete and recreate the PTU model deployment in order to reload the model with an empty cache.
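The prefixing shown in the table can be sketched in a few lines (illustrative; the tool applies its own prefixing internally):

```python
import time

def with_anticache_prefix(message: dict) -> dict:
    """Prefix the message content with a unique string (a Unix timestamp here,
    matching the example table) so the server cannot reuse cached computation."""
    return {**message, "content": f"{time.time()} {message['content']}"}

print(with_anticache_prefix(
    {"role": "user", "content": "Can you explain how photosynthesis works?"}
))
```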
The `--adjust-for-network-latency` argument will adjust all aggregate statistics based on the network delay (measured with a ping test) between the testing machine and the model endpoint. This makes it easy to test models across different regions from a single machine without having the results influenced by the time it takes for requests to traverse the globe. Note that this will only adjust the results of aggregate statistics (e.g. those listed in the Output Fields section below); all individual call results will keep their original timestamps and will need to be adjusted separately.
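Conceptually, the adjustment subtracts the measured network delay from each aggregate latency statistic. A minimal sketch (illustrative; the tool's exact adjustment may differ):

```python
def adjust_for_network_latency(stat_seconds: float, ping_seconds: float) -> float:
    """Remove the measured network delay from a latency statistic, flooring at 0."""
    return max(stat_seconds - ping_seconds, 0.0)

# e.g. a 0.500 s TTFT measured over a 0.050 s ping becomes ~0.450 s
```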
At the end of each benchmark run, the raw call statistics (such as request start time, time of first token, request end time, and number of context and generation tokens) will be logged for every request that occurred within the test (both successes and failures). If the `--log-request-content` argument is set to `true`, this dump will also include the raw input messages and output completion for each request. This is useful in cases where you want to compare the generated content between different endpoints.
field | description | sliding window | example |
---|---|---|---|
`time` | Time offset in seconds since the start of the test. | no | 120 |
`rpm` | Successful Requests Per Minute. Note that it may be less than `--rate` as it counts completed requests. | yes | 12 |
`processing` | Total number of requests currently being processed by the endpoint. | no | 100 |
`completed` | Total number of completed requests. | no | 100 |
`failures` | Total number of failed requests out of `requests`. | no | 100 |
`throttled` | Total number of throttled requests out of `requests`. | no | 100 |
`requests` | Deprecated in favor of the `completed` field (output values of both fields are the same). | no | 1233 |
`ctx_tpm` | Number of context Tokens Per Minute. | yes | 1200 |
`gen_tpm` | Number of generated Tokens Per Minute. | yes | 156 |
`ttft_avg` | Average time in seconds from the beginning of the request until the first token was received. | yes | 0.122 |
`ttft_95th` | 95th percentile of time in seconds from the beginning of the request until the first token was received. | yes | 0.130 |
`tbt_avg` | Average time in seconds between two consecutive generated tokens. | yes | 0.018 |
`tbt_95th` | 95th percentile of time in seconds between two consecutive generated tokens. | yes | 0.021 |
`gen_tpr_10th` | 10th percentile of number of generated tokens per model response. | yes | 389 |
`gen_tpr_avg` | Average number of generated tokens per model response. | yes | 509 |
`gen_tpr_90th` | 90th percentile of number of generated tokens per model response. | yes | 626 |
`e2e_avg` | Average end-to-end request time. | yes | 1.2 |
`e2e_95th` | 95th percentile of end-to-end request time. | yes | 1.5 |
`util_avg` | Average deployment utilization percentage as reported by the service. | yes | 89.3% |
`util_95th` | 95th percentile of deployment utilization percentage as reported by the service. | yes | 91.2% |
Note: Prior to the benchmarking run reaching `aggregation-window` in elapsed time, all sliding-window stats are calculated over a dynamic window equal to the time elapsed since starting the test. This ensures RPM/TPM stats are relatively accurate prior to the test reaching completion, including when a test ends early due to reaching the request limit.
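The dynamic-window rule can be stated in one line (illustrative of the behavior described above):

```python
def effective_window(elapsed_seconds: float, aggregation_window_seconds: float) -> float:
    """Sliding-window stats use the configured window once enough time has
    elapsed, and the elapsed time before that."""
    return min(elapsed_seconds, aggregation_window_seconds)
```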
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.