Perf: Analyze the roofline of the inference endpoints #9
Status: Open
Labels
- area: core-engine — Load generator, scheduler, async utils
- priority: P1 — High, must address this cycle
- type: performance — Performance regression or improvement
Description
We need to understand the roofline of:
- Offline: the maximum number of queries/responses we can handle each second
- Online (concurrency): the maximum concurrency we can measure for the endpoints
- Online: the maximum number of SSE chunks we can stream each second (which will bound our TPS roofline)
We can use SemiAnalysis data as a reference: https://inferencemax.ai/

This will prepare us for the future when we need to horizontally scale to measure endpoints served on a larger cluster.
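To make the SSE-chunk roofline concrete, here is a minimal sketch of how chunks/sec could be measured. Everything here is hypothetical: `fake_sse_stream` stands in for a real streaming response, and `measure_chunks_per_second` is an illustrative helper, not an existing function in the codebase.

```python
import asyncio
import time


async def fake_sse_stream(n_chunks: int, delay_s: float):
    # Stand-in for a real SSE response body: yields one event per chunk.
    # In a real benchmark this would iterate over the HTTP response stream.
    for i in range(n_chunks):
        await asyncio.sleep(delay_s)
        yield f"data: chunk {i}\n\n"


async def measure_chunks_per_second(stream):
    # Count chunks against wall-clock time to estimate a chunks/sec ceiling.
    count = 0
    start = time.perf_counter()
    async for _ in stream:
        count += 1
    elapsed = time.perf_counter() - start
    return count, count / elapsed


async def main():
    count, rate = await measure_chunks_per_second(fake_sse_stream(100, 0.001))
    print(f"{count} chunks, {rate:.0f} chunks/sec")
    return count, rate


if __name__ == "__main__":
    asyncio.run(main())
```

Running many such measurement tasks concurrently (one per connection) and sweeping the concurrency level would then expose where aggregate chunks/sec stops scaling, which is the roofline this issue asks for.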