Native integration with KEDA for LLM inference autoscaling #3561

Open · yuzisun opened this issue Mar 31, 2024 · 4 comments

yuzisun commented Mar 31, 2024

/kind feature

Describe the solution you'd like
For autoscaling LLM inference services, Knative's request-level metrics may not be the best scaling signal: LLM inference is performed at the token level, so the autoscaler needs to understand the number of input/output tokens. To autoscale LLM inference services effectively, we'd like to scale based on metrics such as token throughput and power consumption.

To achieve this, the autoscaler needs to be able to query metrics from Prometheus. KEDA supports scaling based on Prometheus metrics, and we would like to implement a native integration with KEDA for both Serverless and Raw Deployment modes to support autoscaling LLM inference out of the box.

A few important metrics:

  • Time To First Token (TTFT): Low waiting times for a response are essential in real-time interactions, but less important in offline workloads.
  • Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds with how each user will perceive the "speed" of the model.
  • Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated from the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated); see the worked example after this list.
  • Throughput: The number of output tokens per second an inference server can generate across all users and requests.
  • Power consumption: Metrics produced by Kepler, e.g. kepler_container_joules_total.
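
As an illustrative example of the latency formula above (the numbers are made up, not measurements): with TTFT = 0.2 s, TPOT = 0.05 s per output token, and a 100-token response, latency = 0.2 + 0.05 * 100 = 5.2 s.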

Anything else you would like to add:

  • Knative has a PoC to integrate with KEDA, and @skonto is creating a new serving-keda repo under the knative-extensions org.
  • For Raw Deployment mode, we can provision the ScaledObject in the KServe controller and allow the user to specify the metrics query on the InferenceService spec, as in the example below.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "llama2-vllm"
spec:
  predictor:
    scaleQuery: "average_token_throughput_per_second[1m]"
    scaleMetric: custom
    maxReplicas: 10
    minReplicas: 1
    model:
      modelFormat:
        name: huggingface
      storageUri: "gs://kfserving-examples/models/huggingface/llama2-70b"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama2-vllm-scaleobject
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama2-vllm-deployment
  pollingInterval: 15
  cooldownPeriod: 30
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server
        # query must resolve to a single value, so wrap the range selector in avg_over_time
        metricName: average_token_throughput_per_second
        query: avg_over_time(average_token_throughput_per_second[1m])
        threshold: "500"
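
For the power-consumption signal, an additional Prometheus trigger could be added under the same triggers list. The sketch below is only illustrative; the Kepler label names and the 200 W threshold are assumptions rather than settled values.

    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server
        # rate() over Kepler's kepler_container_joules_total counter gives average power draw in watts
        metricName: container_power_watts
        query: sum(rate(kepler_container_joules_total{container_namespace="default"}[1m]))
        threshold: "200"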



rootfs commented Mar 31, 2024

@yuzisun Thank you for starting this thread. Serverless LLM inference has many use cases. When it comes to power efficiency and power capping, autoscaling from KEDA combined with power consumption metrics from Kepler will pave the way for a more sustainable LLM inference service.


terrytangyuan commented Apr 12, 2024

+1 This would be great to have. We should discuss how the KEDA extension can be incorporated into the two deployment modes, especially given that Raw Deployment does not include Knative.

kenplusplus commented

Since vLLM supports continuous batching to handle multiple requests, a large batch will result in a long TTFT. Could we also consider increasing or decreasing the batch size?


Jeffwan commented Apr 22, 2024

@kenplusplus Dynamic batch size won't be controlled by the external autoscaling logic; those are two different levels.
