# Benchmarking notebook
This notebook contains the code to set up and run the benchmarking experiments for a variety of LLM endpoints.
It depends on the code within the `benchmark` module, and serves only to:
* Queue and run experiments with consistent parameters across models
* Process the outputs of the models into a consistent format

# Experiment setup
To get a representative sample, we will run tests with a variety of setups:
* Different context and completion workloads, to simulate a variety of customer use cases (e.g. summarization, RAG, chat, generation)
    * Chat: 1000 prompt tokens, 200 completion tokens
    * Summarization: 7000 prompt tokens, 150 completion tokens
    * Time-to-first-token simulation: 1000 prompt tokens, 1 completion token (see the notes below for the reason)

Requirements/Notes:
* An Azure OpenAI resource, with endpoints configured to use all of the resource's available TPM with no dynamic quota. In general, all other deployments should be deleted so that only the model being tested is deployed in the resource.
* The deployment name will be used for identifying each model in the logs, so when deploying any model to an endpoint, name the deployment based on the model (e.g. "gpt-4-8k", "llama-2-7b-chat").
* AML Managed Online Endpoints cannot stream responses, so time-to-first-token cannot be measured directly from a regular request. To simulate it, we run an additional test with the same parameters as the chat workload but with a completion size of only 1 token. Though not perfect, this approximates the time to receive the request and pre-fill the context into memory.
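One way the 1-token run could be used, sketched here under assumptions: treat its end-to-end latency as an approximation of time-to-first-token, then subtract it from the latency of a full chat request to estimate the average time per subsequent output token. The function name and the latency figures are hypothetical and are not produced by the benchmarking module.

```python
def estimate_time_per_output_token(
    full_latency_s: float,
    one_token_latency_s: float,
    completion_tokens: int,
) -> float:
    """Average seconds per generated token after the first one."""
    if completion_tokens < 2:
        raise ValueError("need at least 2 completion tokens to measure decode time")
    return (full_latency_s - one_token_latency_s) / (completion_tokens - 1)


# Illustrative (made-up) latencies for a 1000-prompt-token chat request.
ttft_estimate_s = 1.02  # latency of the 1-completion-token run
full_latency_s = 5.0    # latency of the 200-completion-token run
per_token_s = estimate_time_per_output_token(full_latency_s, ttft_estimate_s, 200)
print(f"~{per_token_s * 1000:.1f} ms per output token")
```

The estimate is only as good as the assumption that both runs see the same queueing and pre-fill time, so it is best computed from runs at the same concurrency.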

# Config setup
We'll set up a series of tests for each model type, and run them one-by-one in order to collect the data.

We expect to run a test for every combination of concurrency x workload profile x model deployment. With the 7 concurrency values and 3 workload profiles used in this notebook (21 runs plus an initial warmup run per deployment), each full test should take approximately 45 minutes.
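The run count and a lower bound on the test duration can be checked with a few lines of arithmetic. This is a sketch that copies the concurrency values, per-run warmup, aggregation window, and initial warmup from the cells later in this notebook; it counts only the time spent generating load, so process start-up and shutdown come on top.

```python
# Values copied from the configuration cells later in this notebook.
CONCURRENCY_VALUES = [1, 2, 4, 8, 12, 16, 20]
NUM_WORKLOAD_PROFILES = 3
WARMUP_TIME_PER_RUN_SECS = 30
AGGREGATION_WINDOW_SECS = 60
INITIAL_WARMUP_SECS = 300

benchmark_runs = len(CONCURRENCY_VALUES) * NUM_WORKLOAD_PROFILES
total_runs = benchmark_runs + 1  # plus the initial warmup run
run_secs = WARMUP_TIME_PER_RUN_SECS + AGGREGATION_WINDOW_SECS
min_load_secs = benchmark_runs * run_secs + INITIAL_WARMUP_SECS

print(f"{total_runs} runs, at least {min_load_secs / 60:.1f} minutes of load")
```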

We'll use some helper dataclasses to hold values for each run.

In [44]:
from dataclasses import dataclass
import itertools
import shlex
import subprocess
from typing import Iterable, Optional, Type, Union

from dotenv import load_dotenv


# Create deployment config dataclasses that align to the different python entrypoints
@dataclass
class OpenAIDeploymentConfig:
    # Dynamic config
    api_key_env: str
    deployment: str
    api_base_endpoint: str
    # Static config
    python_entrypoint: str = "benchmark.bench load"
    api_version: str = "2023-05-15"
    completions: int = 1

# Functions below receive config instances (not classes), so alias the class itself
DeploymentConfigType = OpenAIDeploymentConfig

# @dataclass
# class AMLDeploymentConfig:
#     # Dynamic config
#     api_key_env: str
#     deployment: str
#     api_base_endpoint: str
#     model: str
#     # Static config
#     python_entrypoint: str = "benchmark.bench_aml load"

# DeploymentConfigType = Union[AMLDeploymentConfig, OpenAIDeploymentConfig]

@dataclass
class BenchmarkConfig:
    # Dynamic config
    context_tokens: int
    max_tokens: int
    duration: int
    # Static config
    shape_profile: str = "custom"
    aggregation_window: int = 60
    

### Generate config for each deployment and workload profile
We'll create a set of Benchmark params for each workload. The workloads are based on [this document](https://microsoftapc-my.sharepoint.com/:w:/g/personal/mtremeer_microsoft_com/EfxJ8FsJcm5NmHRZuhlcRc0BC03ScNXZqKydqj3AjbIcBg?e=nmBgcF).

In [45]:
CONCURRENCY_VALUES = [1, 2, 4, 8, 12, 16, 20]

# Run duration & rate
WARMUP_TIME_PER_RUN_SECS = 30
AGGREGATION_WINDOW_SECS = 60

# 1. Text Generation (Output-token heavy)
TEXT_GEN_BENCH_CONFIG = BenchmarkConfig(
    context_tokens=200,
    max_tokens=2000,
    duration=WARMUP_TIME_PER_RUN_SECS+AGGREGATION_WINDOW_SECS
)

# 2. Chat, with RAG (Balanced)
CHAT_RAG_BENCH_CONFIG = BenchmarkConfig(
    context_tokens=3100,
    max_tokens=300,
    duration=WARMUP_TIME_PER_RUN_SECS+AGGREGATION_WINDOW_SECS
)

# 3. Classification (Context-heavy)
CLASSIFICATION_BENCH_CONFIG = BenchmarkConfig(
    context_tokens=5000,
    max_tokens=1,
    duration=WARMUP_TIME_PER_RUN_SECS+AGGREGATION_WINDOW_SECS
)

ALL_BENCH_CONFIGS = [TEXT_GEN_BENCH_CONFIG, CHAT_RAG_BENCH_CONFIG, CLASSIFICATION_BENCH_CONFIG]
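To make the "output-heavy", "balanced" and "context-heavy" labels concrete, the sketch below computes the total tokens per request and the completion share for each profile. `WorkloadShape` is a hypothetical helper written for this illustration, not part of the benchmark module, and `max_tokens` is an upper bound, so real completions may be shorter than assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadShape:  # hypothetical helper for illustration only
    name: str
    context_tokens: int
    max_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.context_tokens + self.max_tokens

    @property
    def completion_share(self) -> float:
        # Fraction of the request's tokens that are generated output
        return self.max_tokens / self.total_tokens


# Token counts copied from the three benchmark configs above.
SHAPES = [
    WorkloadShape("text_generation", 200, 2000),
    WorkloadShape("chat_rag", 3100, 300),
    WorkloadShape("classification", 5000, 1),
]

for shape in SHAPES:
    print(
        f"{shape.name}: {shape.total_tokens} tokens/request, "
        f"{shape.completion_share:.1%} completion"
    )
```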

In [46]:
def generate_config_set(
        deployment_config: DeploymentConfigType,
        benchmark_configs: Iterable[BenchmarkConfig],
        initial_warmup_secs: int,
        concurrency_values: Iterable[int] = CONCURRENCY_VALUES,
    ) -> list[tuple[DeploymentConfigType, BenchmarkConfig, int]]:
    """
    Combines the deployment and benchmark configs into all possible combinations,
    plus an initial warmup config to be run prior to all benchmarks.

    Returns:
        List of (deployment config, benchmark config, clients) tuples, in the
        order they should be run. The warmup run is always first.
    """
    # Create warmup run with 10 clients and 60 RPM per client @ 2000 total tokens
    # (1000 context + 1000 completion) - equates to max of 1.2M TPM
    warmup_bench_config = BenchmarkConfig(
        context_tokens=1000,
        max_tokens=1000,
        duration=initial_warmup_secs
    )
    warmup_config = (deployment_config, warmup_bench_config, 10)

    # Combine every benchmark config with every concurrency value
    combinations = itertools.product(benchmark_configs, concurrency_values)
    runs = [(deployment_config, bench_config, clients) for bench_config, clients in combinations]

    return [warmup_config, *runs]

def config_to_execution_str(
    deployment_config: DeploymentConfigType, 
    benchmark_config: BenchmarkConfig,
    clients: int,
) -> str:
    """
    Combines configs into a single execution string, ready for execution by the
    benchmarking CLI.
    """
    cmd = f"python3 -m {deployment_config.python_entrypoint}"
    # Deployment config
    for param, value in deployment_config.__dict__.items():
        if param == "python_entrypoint":
            continue
        elif param == "api_base_endpoint":
            cmd += f" {value}"
        else:
            cmd += f" --{param.replace('_', '-')} {value}"
    # Benchmark config
    for param, value in benchmark_config.__dict__.items():
        cmd += f" --{param.replace('_', '-')} {value}"
    # Logs save dir
    subdir = "openai" if deployment_config.__class__.__name__ == "OpenAIDeploymentConfig" else "aml"
    cmd += f" --output-format jsonl --log-save-dir logs/{subdir}/{deployment_config.deployment}"
    # Add clients 
    assert clients > 0
    cmd += f" --clients {clients}"
    
    return cmd
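`config_to_execution_str` builds one command string that is later split again with `shlex.split`. A possible alternative, sketched below, builds the argument list directly so that values containing spaces need no quoting. `DemoDeploymentConfig` is a hypothetical stand-in that mirrors the fields of `OpenAIDeploymentConfig`, and the endpoint URL is a placeholder.

```python
import shlex
from dataclasses import dataclass, fields


@dataclass
class DemoDeploymentConfig:  # hypothetical stand-in for OpenAIDeploymentConfig
    api_key_env: str
    deployment: str
    api_base_endpoint: str
    python_entrypoint: str = "benchmark.bench load"
    api_version: str = "2023-05-15"
    completions: int = 1


def deployment_args(config) -> list[str]:
    """Builds the argv list for the deployment part of the command."""
    args = ["python3", "-m", *config.python_entrypoint.split()]
    for field in fields(config):
        value = getattr(config, field.name)
        if field.name == "python_entrypoint":
            continue  # already used to build the module invocation
        if field.name == "api_base_endpoint":
            args.append(str(value))  # positional argument
        else:
            args += [f"--{field.name.replace('_', '-')}", str(value)]
    return args


demo_args = deployment_args(
    DemoDeploymentConfig(
        api_key_env="OPENAI_API_KEY",
        deployment="demo-deployment",
        api_base_endpoint="https://example.invalid/",
    )
)
# shlex.join gives a copy-pasteable string for logging
print(shlex.join(demo_args))
```

The resulting list can be passed straight to `subprocess.Popen` or `subprocess.run` without `shlex.split`.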

# OpenAI GPT-4 PayGO - Default Content Filter

In [47]:
# Ensure API key is loaded from .env file
load_dotenv()

demo_oai_deployment_config = OpenAIDeploymentConfig(
    api_key_env="OPENAI_API_KEY",
    deployment="gpt-4-1106-defaultct",
    api_base_endpoint="https://aoai-sweden-mt.openai.azure.com/",
)
initial_warmup_secs = 300 # Allow 5 mins for endpoint warmup and scaling of load
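A missing key only surfaces once the benchmark process starts, as the run log in this notebook shows. A small pre-flight check, sketched here, can catch it before any trial is queued. `missing_env_vars` is a hypothetical helper, not part of the benchmark module.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_env_vars(
    names: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Returns the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` makes it clear whether the `.env` file was found and contained the expected key.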


In [27]:
# Generate config sets
config_set = generate_config_set(
    deployment_config=demo_oai_deployment_config,
    benchmark_configs=ALL_BENCH_CONFIGS,
    initial_warmup_secs=initial_warmup_secs,
)
print(f"Generated {len(config_set)} configs.")

# Convert to execution strings
exec_strings = [config_to_execution_str(*config) for config in config_set]

Generated 22 configs.


### Run benchmark

In [22]:
for trial, exec_str in enumerate(exec_strings):
    print(f"Starting trial {trial+1} of {len(exec_strings)}")
    process = subprocess.Popen(shlex.split(exec_str), shell=False)
    return_code = process.wait()

Starting trial 1 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 2 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 3 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 4 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 5 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 6 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 7 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 8 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 9 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 10 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 11 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 12 of 22
invalid argument(s): api-key-env OPENAI_API_KEY not set
Starting trial 13 of 22
invalid argument(s): api-

In [13]:
process.kill()
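The loop above ignores `return_code`, so a configuration error repeats for every trial until the process is killed by hand. A minimal sketch of a loop that stops at the first failing trial is shown below. `run_until_failure` is a hypothetical helper, and the demo commands are trivial Python one-liners standing in for the benchmark commands.

```python
import subprocess
import sys


def run_until_failure(commands: list[list[str]]) -> list[int]:
    """Runs each command in order and stops after the first non-zero exit code."""
    return_codes = []
    for trial, argv in enumerate(commands, start=1):
        print(f"Starting trial {trial} of {len(commands)}")
        completed = subprocess.run(argv, check=False)
        return_codes.append(completed.returncode)
        if completed.returncode != 0:
            print(f"Trial {trial} exited with code {completed.returncode}; stopping.")
            break
    return return_codes


# Stand-in commands: the second one fails, so the third never runs.
ok = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
codes = run_until_failure([ok, failing, ok])
print(codes)
```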

# OpenAI GPT-3.5 PayGO

In [None]:
# Ensure API key is loaded from .env file
load_dotenv()

demo_oai_deployment_config = OpenAIDeploymentConfig(
    api_key_env="OPENAI_API_KEY",
    deployment="gpt-4-1106-defaultct",  # same value as the GPT-4 section; set to the GPT-3.5 deployment name before running
    api_base_endpoint="https://aoai-sweden-mt.openai.azure.com/",
)
initial_warmup_secs = 300 # Allow 5 mins for endpoint warmup and scaling of load


In [None]:
# Generate config sets
config_set = generate_config_set(
    deployment_config=demo_oai_deployment_config,
    benchmark_configs=ALL_BENCH_CONFIGS,
    initial_warmup_secs=initial_warmup_secs,
)
print(f"Generated {len(config_set)} configs.")

# Convert to execution strings
exec_strings = [config_to_execution_str(*config) for config in config_set]

Generated 22 configs.


In [None]:
for trial, exec_str in enumerate(exec_strings):
    print(f"Starting trial {trial+1} of {len(exec_strings)}")
    process = subprocess.Popen(shlex.split(exec_str), shell=False)
    return_code = process.wait()


In [None]:
process.kill()