Initial Gateway API Inference Extension Blog Post #49898
Conversation
@robscott PTAL and let me know if you would like any modifications.
TODO [danehans]: Add benchmarks and ref to: kubernetes-sigs/gateway-api-inference-extension#480 (when merged).
> ## Enter Gateway API Inference Extension
>
> [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) was created to address this gap by building on the existing [Gateway API](https://gateway-api.sigs.k8s.io/),
On line 19 above, "... focused on HTTP path routing or ..."?
@smarterclayton I resolved all your feedback in the latest commit other than this comment. Feel free to rereview and/or elaborate. Thanks again for your review.
> standardize routing to inference workloads across the ecosystem. Key objectives include enabling model-aware
> routing, supporting per-request criticalities, facilitating safe model roll-outs, and optimizing load balancing
> based on real-time model metrics. By achieving these, the project aims to reduce latency and improve accelerator
> (GPU) utilization for AI workloads.
I'd love it if you could work in
"Adding the inference extension to your existing gateway makes it an Inference Gateway - enabling you to self-host large language models with a model as a service mindset"
or similar. Roughly hitting the two points "inference extends gateway = inference gateway", and "inference gateway = self-host genai/large models as model as a service"
+1, I really like this framing, and we should use it as much as we can throughout this post and our docs.
I started to go a bit farther with this theme and realized that we could write a very compelling blog post with this theme after KubeCon when we have more Gateway implementations ready. That post could be titled "Introducing Kubernetes Inference Gateways", and have a section describing that an Inference Gateway is an "existing gateway + inference extension". To really sell that though, I think we need to have a variety of "Inference Gateways" ready to play with.
So if we think we'll end up with two separate blog posts here, maybe this initial one is focused on the project goal of extending any existing Gateway with specialized Inference routing capabilities, and then in a follow up blog post we can focus more on the "Inference Gateway" term when we have more examples to work with.
Or maybe we should just hold off on this post until we have more Inference Gateway examples. I'm not sure, open to ideas here.
I like planting the "gateway + inference extension = inference gateway" seed here and using a follow-up post to drive the messaging.
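For readers following this thread, here is a minimal sketch of how those objectives surface in the API: per-model criticality and a weighted roll-out across fine-tuned versions. All resource names are hypothetical and the alpha field names may differ from the published guides.

```yaml
# Illustrative only: hypothetical names; fields follow the project's alpha
# InferenceModel API and may differ from the published docs.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-assistant
spec:
  modelName: chat-assistant        # the model name clients request
  criticality: Critical            # per-request criticality class
  poolRef:
    name: vllm-llama-pool          # the InferencePool that serves this model
  targetModels:                    # safe roll-out: weighted split across fine-tuned versions
  - name: chat-assistant-lora-v1
    weight: 90
  - name: chat-assistant-lora-v2
    weight: 10
```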
> 2. **Endpoint Selection**
>    Instead of simply forwarding to any pod, the Gateway consults an inference-specific routing extension, e.g. endpoint selection extension. This
maybe instead of 'e.g.'
Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension - an endpoint selection extension - to pick the best of the available pods.
?
Making the ^ change with one minor difference s/an endpoint selection extension/the Endpoint Selection Extension/
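As an aside, here is a minimal sketch of how that wiring looks from the routing side, assuming hypothetical resource names: a plain HTTPRoute forwards matched traffic to an InferencePool backend rather than a Service, and the pool's Endpoint Selection Extension then picks the best available pod.

```yaml
# Illustrative only: hypothetical names; the Gateway and HTTPRoute are standard
# Gateway API resources, with the backend swapped for an InferencePool.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway          # any Gateway whose implementation supports the extension
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama-pool          # endpoint selection happens among this pool's pods
```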
> more, it helps ops teams deliver the right LLM services to the right users—smoothly and efficiently.
>
> **Ready to learn more?** Visit the [project docs](https://gateway-api-inference-extension.sigs.k8s.io/) to dive deeper,
> give Inference Extension a try with a few [simple steps](https://gateway-api-inference-extension.sigs.k8s.io/guides/),
I would suggest saying "... give the Inference Gateway extension a try with a few ...".
We probably want to hold off on publishing this until we've updated our guides to use proper "Inference Gateways" instead of Envoy patches. Maybe that's actually an argument for saving this until after KubeCon?
The initial inference extension support landed in kgateway and I plan on adding an inference extension docs PR in the next few days.
robscott left a comment
Thanks for the work on this @danehans!
> ---
> layout: blog
> title: "Introducing Gateway API Inference Extension"
> date: 2025-02-21
@danehans can we aim for a day that hasn't been claimed yet next week?
> This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to
> the client.
|
|
Somewhere in this section I think it would be useful to mention the extensible nature of this model, and that new extensions can be developed that will be compatible with any Inference Gateway.
I updated this section based on ^ feedback, PTAL.
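To make the extensibility point concrete, here is a minimal sketch of the extension hook, assuming hypothetical names and the alpha InferencePool fields: the pool's extensionRef names the endpoint-selection service the Gateway consults, so a custom extension can be plugged in the same way by any Inference Gateway.

```yaml
# Illustrative only: hypothetical names; alpha field names may differ from the docs.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool
spec:
  selector:
    app: vllm-llama                    # label selector for the model server pods
  targetPortNumber: 8000               # port the model servers listen on
  extensionRef:
    name: vllm-llama-endpoint-picker   # the Endpoint Selection Extension (or a custom one)
```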
robscott left a comment
Thanks @danehans!
/lgtm
> ---
> layout: blog
> title: "Introducing Gateway API Inference Extension"
> date: 2025-03-31
@sftim I think this blog post is pretty close to ready, what date can we target for publishing this?
> date: 2025-03-31
> slug: introducing-gateway-api-inference-extension
> draft: true
> author: >
@shaneutt Unfortunately our previous attempts at greater inclusion actually backfired. I know we'd tried to link to a list of authors, but the end result is just an unlinked "Gateway API Contributors" (https://kubernetes.io/blog/2024/11/21/gateway-api-v1-2/).
Given that, I'd recommend keeping the authors as listed, though it may make sense to order alphabetically by first name instead of last name.
LGTM label has been added. Git tree hash: 6e4b047f7508a0eb1e19ef5ec62fca40e7775c8f
@sftim PTAL when you have a moment.
@askorupka, you're the author of #50673, so I'd like you to be a buddy for @danehans on this PR. The idea of article writing buddies (and the new guidelines) merged after this PR was opened; please bear that in mind. Please:
|
@askorupka please let me know if anything is needed to merge this PR.
hey @danehans thanks for your patience - just saw it, looks like I missed the previous notification. Apologies!
askorupka left a comment
/lgtm
That was an interesting read. Left a few (very) minor comments. Thanks @danehans for your patience!
> long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server
> may keep multiple inference sessions active and maintain in-memory token caches.
>
> Traditional load balancers focused on HTTP path or round-robin lack the specialized capabilities needed
Suggested change:
> Traditional load balancers focused on HTTP path or round-robin, lack the specialized capabilities needed
> In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform
> operators manage where and how it’s served.

Suggested change:
> In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets the platform
> operators manage where and how it’s served.
I think the original is more idiomatic.
> baseline, particularly as traffic increased beyond 400–500 QPS.
>
> These results suggest that this extension's model‐aware routing significantly reduced latency for GPU‐backed LLM
> workloads. By dynamically selecting the least‐loaded or best‐performing model server, it avoids hotspots that can
Suggested change:
> workloads. By dynamically selecting the least‐loaded or best‐performing model server it avoids hotspots that can
|
|
> As the Gateway API Inference Extension heads toward GA, planned features include:
>
> 1. **Prefix-cache aware load balancing** for remote caches
When formatting a numbered list, it's a good idea to use this format:
1. text
1. text2
1. text3
It adds automatic ordering and avoids potential renumbering in the future (might not be the case here).
@askorupka: changing LGTM is restricted to collaborators.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
hey @lmktfy, according to the buddy review guidelines I've tried to add the lgtm label, but it looks like it's restricted to collaborators only.
lmktfy left a comment
Ok, let's merge this as draft. Once release comms are done, it can go live.
@graz-dev would you be willing to propose a publication date for this?
/LGTM
/approve
> The design introduces two new Custom Resources (CRDs) with distinct responsibilities, each aligning with a
> specific user persona in the AI/ML serving workflow:
>
> {{< figure src="inference-extension-resource-model.png" alt="Resource Model" class="diagram-large" clicktozoom="true" >}}
If you have the source images, please share them (say Hi in #sig-docs-blog in Slack, we'll see what we can do).
Switching to SVG can happen post publication.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lmktfy. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Hi @danehans, sorry I missed the previous mentions. I'm going to open the publication PR proposing a date later this evening! Thanks for the ping!
Description
Adds a blog post introducing the Gateway API inference extension project.
cc: @robscott @kfswain