Perf: Analyze the roofline of the inference endpoints #9

@nvzhihanj

Description

We need to understand the roofline of:

  • In offline mode, the maximum number of queries/responses we can handle per second
  • In online (concurrency) mode, the maximum concurrency we can measure against the endpoints
  • In online mode, the maximum number of SSE chunks we can stream per second (which bounds our TPS roofline)

We can use SemiAnalysis data as a reference: https://inferencemax.ai/

This will prepare us for the future, when we need to scale horizontally to measure endpoints served on a larger cluster.
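As a rough sketch of the SSE measurement above, assuming an already-decoded SSE response body (the helper names here are hypothetical, not part of any existing benchmark code):

```python
def count_sse_data_events(lines):
    """Count SSE 'data:' events; each event is one streamed chunk."""
    return sum(1 for line in lines if line.startswith("data:"))

def chunk_rate(lines, elapsed_s):
    """Chunks per second, given the wall-clock duration of the stream."""
    return count_sse_data_events(lines) / elapsed_s

# Synthetic stand-in for a decoded SSE response body.
sample = [
    'data: {"token": "Hello"}',
    "",
    'data: {"token": " world"}',
    "",
    "data: [DONE]",
]
print(count_sse_data_events(sample))      # 3
print(chunk_rate(sample, elapsed_s=0.5))  # 6.0 chunks/sec
```

In a real roofline run, `elapsed_s` would come from timing the live stream, and the per-endpoint chunk rate would be aggregated across all concurrent connections to find the saturation point.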

Metadata

Assignees

No one assigned

    Labels

    area: core-engine (Load generator, scheduler, async utils)
    priority: P1 (High: must address this cycle)
    type: performance (Performance regression or improvement)

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
