Conversation

@liu-cong (Contributor) commented Nov 20, 2025

Previously, the prefix cache scorer implicitly overrode BlockSize and LRUCapacityPerServer based on model server metrics. This is the default behavior we want, but it can be confusing when users manually configure these parameters.

This PR adds an autoTune flag to the prefix cache scorer config so users can explicitly choose whether auto-tuning is applied; it defaults to true.

This also allows us to manually set the CPU cache capacity, since the corresponding metric isn't available in vLLM yet.
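
To illustrate the intended semantics, here is a minimal, self-contained Go sketch of how an autoTune flag like this could gate the metric-based overrides. This is not the actual plugin code: the Config struct, serverMetrics type, and applyAutoTune function are hypothetical names used only for illustration; only the parameter names (autoTune, blockSize, lruCapacityPerServer) and the example values come from the config and log in the Test section below.

package main

import "fmt"

// Config mirrors the prefix-cache-scorer parameters discussed above.
// The struct and field names are hypothetical; only the parameter names
// (autoTune, blockSize, lruCapacityPerServer) appear in this PR.
type Config struct {
	AutoTune             bool
	BlockSize            int
	LRUCapacityPerServer int
}

// serverMetrics stands in for values scraped from the model server.
type serverMetrics struct {
	blockSize   int
	lruCapacity int
}

// applyAutoTune returns the effective config: metric-derived values are
// applied only when autoTune is enabled; otherwise the user's manual
// settings are kept untouched.
func applyAutoTune(cfg Config, m serverMetrics) Config {
	if !cfg.AutoTune {
		// autoTune: false -> e.g. a manually set CPU cache capacity wins.
		return cfg
	}
	if m.blockSize > 0 {
		cfg.BlockSize = m.blockSize
	}
	if m.lruCapacity > 0 {
		cfg.LRUCapacityPerServer = m.lruCapacity
	}
	return cfg
}

func main() {
	metrics := serverMetrics{blockSize: 64, lruCapacity: 31250}

	gpu := Config{AutoTune: true, BlockSize: 64, LRUCapacityPerServer: 10000}
	cpu := Config{AutoTune: false, BlockSize: 64, LRUCapacityPerServer: 41000}

	fmt.Printf("gpu scorer: %+v\n", applyAutoTune(gpu, metrics)) // metric-derived capacity (31250) wins
	fmt.Printf("cpu scorer: %+v\n", applyAutoTune(cpu, metrics)) // manual capacity (41000) kept
}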

Test

Tried the following config:

  pluginsConfigFile: "default-plugins.yaml"
  # This is the plugins configuration file.
  pluginsCustomConfig:
    default-plugins.yaml: |
      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: EndpointPickerConfig
      plugins:
      - type: queue-scorer
      - type: kv-cache-utilization-scorer
      - type: prefix-cache-scorer
        name: gpu-prefix-cache-scorer
        parameters:
          autoTune: true
      - type: prefix-cache-scorer
        name: cpu-prefix-cache-scorer
        parameters:
          autoTune: false
          lruCapacityPerServer: 41000
      schedulingProfiles:
      - name: default
        plugins:
        - pluginRef: queue-scorer
          weight: 2
        - pluginRef: kv-cache-utilization-scorer
          weight: 2
        - pluginRef: gpu-prefix-cache-scorer
          weight: 2
        - pluginRef: cpu-prefix-cache-scorer
          weight: 1

And got the following log. Note that the gpu-prefix-cache-scorer (autoTune: true) picked up the metric-derived lruCapacityPerServer of 31250, while the cpu-prefix-cache-scorer (autoTune: false) kept the manually configured 41000:

{"level":"Level(-2)","ts":"2025-11-20T22:58:56Z","caller":"prefix/plugin.go:190","msg":"PrefixCachePlugin initialized","config":{"autoTune":true,"blockSize":64,"maxPrefixBlocksToMatch":256,"lruCapacityPerServer":31250}}                       │
│ {"level":"Level(-2)","ts":"2025-11-20T22:58:56Z","caller":"prefix/plugin.go:190","msg":"PrefixCachePlugin initialized","config":{"autoTune":false,"blockSize":64,"maxPrefixBlocksToMatch":256,"lruCapacityPerServer":41000}}

@k8s-ci-robot (Contributor) commented:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 20, 2025
netlify bot commented Nov 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 23d4d96
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/691f99ad09f8320008d73c90
😎 Deploy Preview: https://deploy-preview-1888--gateway-api-inference-extension.netlify.app

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 20, 2025
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 20, 2025
@zetxqx (Contributor) commented Nov 20, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 20, 2025
@liu-cong liu-cong force-pushed the prefix-plugin-manual-config branch from 68f5cf0 to 7144e95 Compare November 20, 2025 22:13
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 20, 2025
@liu-cong (Contributor, Author) commented:

/hold

Running some tests.

@liu-cong liu-cong marked this pull request as ready for review November 20, 2025 22:16
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 20, 2025
@liu-cong liu-cong force-pushed the prefix-plugin-manual-config branch from 7144e95 to 23d4d96 Compare November 20, 2025 22:43
@liu-cong liu-cong mentioned this pull request Nov 20, 2025
@zetxqx (Contributor) commented Nov 20, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 20, 2025
@liu-cong (Contributor, Author) commented:

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 20, 2025
@k8s-ci-robot k8s-ci-robot merged commit f357ece into kubernetes-sigs:main Nov 20, 2025
17 checks passed