Describe the solution you'd like
For autoscaling LLM inference services, Knative's request-level metrics may not be the best scaling signal: LLM inference is performed at the token level, so we need to understand the number of input and output tokens in order to scale. To autoscale LLM inference services effectively, we'd like to scale on metrics such as token throughput and power consumption.
To achieve this, the autoscaler needs to be able to query metrics from Prometheus. KEDA supports scaling based on Prometheus metrics, and we would like to implement a native integration with KEDA for both serverless and raw deployment modes to support autoscaling LLM inference out of the box.
A few important metrics:
Time To First Token (TTFT): How long a user waits before the first token of a response. Low waiting times are essential in real-time interactions, but less important in offline workloads.
Time Per Output Token (TPOT): Time to generate each output token for a user querying the system. This metric corresponds to how each user perceives the "speed" of the model.
Latency: The overall time it takes the model to generate the full response for a user. Overall response latency can be calculated from the previous two metrics: latency = TTFT + TPOT * (number of output tokens generated).
Throughput: The number of output tokens per second an inference server can generate across all users and requests.
Power consumption: Metrics produced by Kepler, e.g. kepler_container_joules_total.
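The latency relationship above can be sketched numerically; the figures below are illustrative examples, not measured benchmarks:

```python
# Illustrative calculation of the latency formula above:
#   latency = TTFT + TPOT * (number of output tokens)
# All numbers are hypothetical, chosen only to show the arithmetic.

def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end response latency as perceived by one user, in seconds."""
    return ttft_s + tpot_s * output_tokens

# e.g. 200 ms to first token, 50 ms per subsequent token, 100 output tokens
latency = total_latency(0.2, 0.05, 100)
print(f"{latency:.1f} s")  # 5.2 s
```

This also shows why request-level metrics alone are insufficient: two requests with identical request counts but different output lengths produce very different load.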
Anything else you would like to add:
Knative has a PoC to integrate with KEDA, and @skonto is creating a new serving-keda repo under the knative-extensions org.
For raw deployment mode, we can provision the ScaledObject in the kserve controller and allow users to specify the metrics query on the InferenceService spec.
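For illustration, the ScaledObject provisioned by the controller in raw deployment mode might look like the sketch below. The target name, metric name, query, and threshold are assumptions for the sake of example, not the final design:

```yaml
# Hypothetical sketch: scale an InferenceService's raw-mode Deployment on
# token throughput from Prometheus via KEDA. Names and values are examples.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-token-throughput
spec:
  scaleTargetRef:
    name: llm-predictor          # Deployment created by the kserve controller
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(vllm:generation_tokens_total[1m]))  # assumed metric name
        threshold: "1000"        # output tokens/sec per replica, illustrative
```

A user-specified query on the InferenceService spec could be templated into the `query` field, while the controller fills in the scale target.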
Links to the design documents:
[Optional, start with the short-form RFC template to outline your ideas and get early feedback.]
[Required, use the longer-form design doc template to specify and discuss your design in more detail]
@yuzisun Thank you for starting this thread. Serverless LLM inference has many use cases. When it comes to power efficiency and power capping, autoscaling capabilities from KEDA combined with power consumption metrics from Kepler will pave the way for a more sustainable LLM inference service.
+1 This would be great to have. We should discuss how the KEDA extension can be incorporated in the two deployment modes, especially given that raw deployment does not include Knative.
Since vLLM supports continuous batching to handle multiple requests, a large batch results in a long TTFT. Could we also consider increasing or decreasing the batch size?
/kind feature