Disable Early Stopping for LLM benchmarks #2188

@attafosu

Description

Early stopping (ES), as implemented for the inference benchmarks, provides a robust way to determine whether test systems meet the stipulated latency constraints, particularly in the Server scenario. While this purpose is well served for the general benchmarks, the same cannot be said of the LLM benchmarks. For instance, submitters often have a difficult time during performance tuning even when the SUT meets the stipulated token latency constraints.
For the LLM benchmarks, there are a few things that make ES difficult to use:

  1. Equal-issue mode is enabled by default: Unlike other benchmarks, where the performance samples (the subset of the dataset used for performance runs) are a smaller subset of the validation set, LLM benchmarks use the full validation set for the run (see the conf excerpt after this list). This ensures that every sample in the dataset is seen at least once during the benchmark run, and that latency results are not skewed.
    The dataset also tends to be a carefully curated representation of the samples typically found in the benchmark’s task, as well as of its SLAs. As such, at the end of a benchmark run, the latency profile of the SUT is a reliable performance indicator for the given task/dataset.

  2. As stated previously, tuning the server QPS has become challenging: Given the two latency constraints for LLM benchmarks, TTFT and TPOT, users have to spend time tuning their QPS, which may meet the latency constraints yet barely pass ES (see the sketch after this list). Combined with point 1 above, if a system passes the two latency constraints, it should be reasonable to draw conclusions about its performance.

  3. Moreover, benchmark duration for LLMs typically runs well beyond the minimum duration requirement of 10 min. ES was initially proposed to replace the minimum query count, so that systems that comfortably meet the latency constraints can terminate early. The original proposal reads:
    “We propose replacing the minimum query count with a function that dynamically determines when a run has reached statistical significance. Systems that consistently deliver low-latency queries will have shorter run times; systems that struggle to hit the latency bound may have longer run times.
    Specifically, implementations with underlatency percentiles greater than the target tail latency will have much lower required query counts and therefore runtime. Conversely, implementations with underlatency percentiles that are very close to the target tail latency percentile will have somewhat higher minimum query counts. The net effect will be shorter runtimes for small systems in the SingleStream, MultiStream, and Server scenarios.”
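
For reference, equal-issue mode is the LoadGen feature behind point 1. Assuming the current layout of mlperf.conf in the inference repo, the relevant entries look roughly like the excerpt below; treat the exact keys as an approximation:

```
# Approximate mlperf.conf excerpt: LLM benchmarks enable equal-issue
# mode so that every sample in the dataset is issued at least once.
gptj.*.sample_concatenate_permutation = 1
llama2-70b.*.sample_concatenate_permutation = 1
```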
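
To make the interplay of points 2 and 3 concrete, here is a minimal Python sketch, not LoadGen’s implementation, of the two checks involved: the TTFT/TPOT percentile constraints a submitter tunes QPS against, and a one-sided binomial test of the kind ES uses to decide when a tail-latency claim is statistically significant. The constraint values, confidence level, and function names are assumptions for illustration:

```python
import numpy as np
from scipy.stats import binom

# Assumed constraint values for illustration; the real ones live in mlperf.conf.
TTFT_LIMIT_NS = 2_000_000_000    # time to first token: 2 s
TPOT_LIMIT_NS = 200_000_000      # time per output token: 200 ms
TARGET_PERCENTILE = 0.99
CONFIDENCE = 0.99

def meets_token_latency(ttfts_ns, tpots_ns):
    """Check both LLM latency constraints at the target percentile."""
    q = TARGET_PERCENTILE * 100
    return (np.percentile(ttfts_ns, q) <= TTFT_LIMIT_NS
            and np.percentile(tpots_ns, q) <= TPOT_LIMIT_NS)

def reaches_significance(n_queries, n_overlatency):
    """One-sided binomial test: with n_overlatency queries over the
    latency bound out of n_queries, can we claim at CONFIDENCE that
    the true underlatency percentile exceeds TARGET_PERCENTILE?"""
    # If the true overlatency rate were exactly (1 - TARGET_PERCENTILE),
    # observing this few overlatency queries would be very unlikely.
    p_fail = 1.0 - TARGET_PERCENTILE
    return binom.cdf(n_overlatency, n_queries, p_fail) <= 1.0 - CONFIDENCE

# A system comfortably under the bound reaches significance quickly...
print(reaches_significance(n_queries=600, n_overlatency=0))  # True
# ...while one hovering at the 99th-percentile bound needs far more queries.
print(reaches_significance(n_queries=600, n_overlatency=6))  # False
```

The asymmetry between the two calls above is exactly the behavior point 3 describes: systems well under the latency bound need few queries to reach significance, while systems near the bound need many more.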

Given that LLM benchmarks cannot terminate early (equal-issue mode requires issuing the full dataset) even when the SUT comfortably meets the latency constraints during the run, enforcing ES for these benchmarks serves no purpose.
For these reasons, we propose making the ES criteria optional for LLM benchmarks.
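
As a purely hypothetical illustration of what opting out could look like for submitters, the knob might sit alongside the existing per-benchmark settings in user.conf. The key name below is invented for this sketch; no such setting exists in LoadGen today, and the actual mechanism would be settled in review:

```
# Hypothetical user.conf entry; the key name is illustrative only.
llama2-70b.Server.use_early_stopping = 0
```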
