Native integration with KEDA for LLM inference autoscaling #3561

Open · yuzisun opened this issue Mar 31, 2024 · 4 comments

yuzisun commented Mar 31, 2024

/kind feature

Describe the solution you'd like
For autoscaling LLM inference services, Knative's request-level metrics may not be the best scaling signal: LLM inference is performed at the token level, so the autoscaler needs to understand the number of input/output tokens. To autoscale LLM inference services effectively, we'd like to scale based on metrics such as token throughput and power consumption.

To achieve this, the autoscaler needs to be able to query metrics from Prometheus. KEDA supports scaling based on Prometheus metrics, and we would like to implement a native integration with KEDA for both Serverless and Raw Deployment modes to support autoscaling LLM inference out of the box.

A few important metrics:

  • Time To First Token (TTFT): Low waiting times for a response are essential in real-time interactions, but less important in offline workloads.
  • Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds with how each user will perceive the "speed" of the model.
  • Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated from the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated); see the worked example after this list.
  • Throughput: The number of output tokens per second an inference server can generate across all users and requests.
  • Power consumption: Metrics produced by Kepler, e.g. kepler_container_joules_total.
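
As an illustrative example of the latency formula above (the numbers are made up, not measurements): with TTFT = 0.2 s, TPOT = 0.05 s per output token, and a 100-token response, latency = 0.2 + 0.05 * 100 = 5.2 s.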

Anything else you would like to add:

  • Knative has a PoC to integrate with KEDA, and @skonto is creating a new serving-keda repo under the knative-extensions org.
  • For Raw Deployment mode, we can provision the ScaledObject in the KServe controller and allow the user to specify the metrics query on the InferenceService spec, as in the example below.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "llama2-vllm"
spec:
  predictor:
    scaleQuery: "average_token_throughput_per_second[1m]"
    scaleMetric: custom
    maxReplicas: 10
    minReplicas: 1
    model:
      modelFormat:
        name: huggingface
      storageUri: "gs://kfserving-examples/models/huggingface/llama2-70b"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama2-vllm-scaleobject
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama2-vllm-deployment
  pollingInterval: 15
  cooldownPeriod: 30
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server
        # query must resolve to a single value, so wrap the range selector in avg_over_time
        metricName: average_token_throughput_per_second
        query: avg_over_time(average_token_throughput_per_second[1m])
        threshold: "500"
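
For the power-consumption signal, an additional Prometheus trigger could be added under the same triggers list. The sketch below is only illustrative; the Kepler label names and the 200 W threshold are assumptions rather than settled values.

    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server
        # rate() over Kepler's kepler_container_joules_total counter gives average power draw in watts
        metricName: container_power_watts
        query: sum(rate(kepler_container_joules_total{container_namespace="default"}[1m]))
        threshold: "200"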



rootfs commented Mar 31, 2024

@yuzisun Thank you for starting this thread. Serverless LLM inference has many use cases. When it comes to power efficiency and power capping, autoscaling from KEDA combined with power consumption metrics from Kepler will pave the way for a more sustainable LLM inference service.


terrytangyuan commented Apr 12, 2024

+1 This would be great to have. We should discuss how the KEDA extension can be incorporated into the two deployment modes, especially given that Raw Deployment does not include Knative.

kenplusplus commented

Since vLLM supports continuous batching to handle multiple requests, a large batch will result in a long TTFT. Could we also consider increasing or decreasing the batch size?


Jeffwan commented Apr 22, 2024

@kenplusplus Dynamic batch size won't be controlled by the external autoscaling logic; those are two different levels.
