Initial Gateway API Inference Extension Blog Post #49898
Conversation
@robscott PTAL and let me know if you would like any modifications.
TODO [danehans]: Add benchmarks and ref to: kubernetes-sigs/gateway-api-inference-extension#480 (when merged).
> ## Enter Gateway API Inference Extension
>
> [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) was created to address this gap by building on the existing [Gateway API](https://gateway-api.sigs.k8s.io/),
On line 19 above, "... focused on HTTP path routing or ..."?
@smarterclayton I resolved all your feedback in the latest commit other than this comment. Feel free to rereview and/or elaborate. Thanks again for your review.
> standardize routing to inference workloads across the ecosystem. Key objectives include enabling model-aware
> routing, supporting per-request criticalities, facilitating safe model roll-outs, and optimizing load balancing
> based on real-time model metrics. By achieving these, the project aims to reduce latency and improve accelerator
> (GPU) utilization for AI workloads.
I'd love it if you could work in
"Adding the inference extension to your existing gateway makes it an Inference Gateway - enabling you to self-host large language models with a model as a service mindset"
or similar. Roughly hitting the two points "inference extends gateway = inference gateway", and "inference gateway = self-host genai/large models as model as a service"
+1, I really like this framing, and we should use it as much as we can throughout this post and our docs.
I started to go a bit farther with this theme and realized that we could write a very compelling blog post with this theme after KubeCon when we have more Gateway implementations ready. That post could be titled "Introducing Kubernetes Inference Gateways", and have a section describing that an Inference Gateway is an "existing gateway + inference extension". To really sell that though, I think we need to have a variety of "Inference Gateways" ready to play with.
So if we think we'll end up with two separate blog posts here, maybe this initial one is focused on the project goal of extending any existing Gateway with specialized Inference routing capabilities, and then in a follow up blog post we can focus more on the "Inference Gateway" term when we have more examples to work with.
Or maybe we should just hold off on this post until we have more Inference Gateway examples. I'm not sure, open to ideas here.
I like planting the "gateway + inference extension = inference gateway" seed here and using a follow-up post to drive the messaging.
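For readers following this thread, here is a minimal sketch of how those objectives surface in the API: per-model criticality and a weighted roll-out across fine-tuned versions. All resource names are hypothetical and the alpha field names may differ from the published guides.

```yaml
# Illustrative only: hypothetical names; fields follow the project's alpha
# InferenceModel API and may differ from the published docs.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-assistant
spec:
  modelName: chat-assistant        # the model name clients request
  criticality: Critical            # per-request criticality class
  poolRef:
    name: vllm-llama-pool          # the InferencePool that serves this model
  targetModels:                    # safe roll-out: weighted split across fine-tuned versions
  - name: chat-assistant-lora-v1
    weight: 90
  - name: chat-assistant-lora-v2
    weight: 10
```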
> 2. **Endpoint Selection**
>    Instead of simply forwarding to any pod, the Gateway consults an inference-specific routing extension, e.g. endpoint selection extension. This
maybe instead of 'e.g.'
Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension - an endpoint selection extension - to pick the best of the available pods.
?
Making the ^ change with one minor difference s/an endpoint selection extension/the Endpoint Selection Extension/
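As an aside, here is a minimal sketch of how that wiring looks from the routing side, assuming hypothetical resource names: a plain HTTPRoute forwards matched traffic to an InferencePool backend rather than a Service, and the pool's Endpoint Selection Extension then picks the best available pod.

```yaml
# Illustrative only: hypothetical names; the Gateway and HTTPRoute are standard
# Gateway API resources, with the backend swapped for an InferencePool.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway          # any Gateway whose implementation supports the extension
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama-pool          # endpoint selection happens among this pool's pods
```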
> more, it helps ops teams deliver the right LLM services to the right users—smoothly and efficiently.
>
> **Ready to learn more?** Visit the [project docs](https://gateway-api-inference-extension.sigs.k8s.io/) to dive deeper,
> give Inference Extension a try with a few [simple steps](https://gateway-api-inference-extension.sigs.k8s.io/guides/),
I would suggest saying "... give the Inference Gateway extension a try with a few ...".
We probably want to hold off on publishing this until we've updated our guides to use proper "Inference Gateways" instead of Envoy patches. Maybe that's actually an argument for saving this until after KubeCon?
The initial inference extension support landed in kgateway and I plan on adding an inference extension docs PR in the next few days.
robscott left a comment
Thanks for the work on this @danehans!
> ---
> layout: blog
> title: "Introducing Gateway API Inference Extension"
> date: 2025-02-21
@danehans can we aim for a day that hasn't been claimed yet next week?
> This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to
> the client.
|
|
Somewhere in this section I think it would be useful to mention the extensible nature of this model, and that new extensions can be developed that will be compatible with any Inference Gateway.
I updated this section based on ^ feedback, PTAL.
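To make the extensibility point concrete, here is a minimal sketch of the extension hook, assuming hypothetical names and the alpha InferencePool fields: the pool's extensionRef names the endpoint-selection service the Gateway consults, so a custom extension can be plugged in the same way by any Inference Gateway.

```yaml
# Illustrative only: hypothetical names; alpha field names may differ from the docs.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool
spec:
  selector:
    app: vllm-llama                    # label selector for the model server pods
  targetPortNumber: 8000               # port the model servers listen on
  extensionRef:
    name: vllm-llama-endpoint-picker   # the Endpoint Selection Extension (or a custom one)
```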
robscott left a comment
Thanks @danehans!
/lgtm
> ---
> layout: blog
> title: "Introducing Gateway API Inference Extension"
> date: 2025-03-31
@sftim I think this blog post is pretty close to ready, what date can we target for publishing this?
> date: 2025-03-31
> slug: introducing-gateway-api-inference-extension
> draft: true
> author: >
@shaneutt Unfortunately our previous attempts at greater inclusion actually backfired. I know we'd tried to link to a list of authors, but the end result is just an unlinked "Gateway API Contributors" (https://kubernetes.io/blog/2024/11/21/gateway-api-v1-2/).
Given that, I'd recommend keeping the authors as listed, though it may make sense to order alphabetically by first name instead of last name.
LGTM label has been added. Git tree hash: 6e4b047f7508a0eb1e19ef5ec62fca40e7775c8f
@sftim PTAL when you have a moment.
@askorupka, you're the author of #50673, so I'd like you to be a buddy for @danehans on this PR. The idea of article writing buddies (and the new guidelines) merged after this PR was opened; please bear that in mind. Please:
|
@askorupka please let me know if anything is needed to merge this PR.
hey @danehans thanks for your patience - just saw it, looks like I missed the previous notification. Apologies!
askorupka left a comment
/lgtm
That was an interesting read. Left a few (very) minor comments. Thanks @danehans for your patience!
> long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server
> may keep multiple inference sessions active and maintain in-memory token caches.
>
> Traditional load balancers focused on HTTP path or round-robin lack the specialized capabilities needed
Suggested change:
> Traditional load balancers focused on HTTP path or round-robin, lack the specialized capabilities needed
> In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform
> operators manage where and how it’s served.

Suggested change:
> In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets the platform
> operators manage where and how it’s served.
I think the original is more idiomatic.
> baseline, particularly as traffic increased beyond 400–500 QPS.
>
> These results suggest that this extension's model‐aware routing significantly reduced latency for GPU‐backed LLM
> workloads. By dynamically selecting the least‐loaded or best‐performing model server, it avoids hotspots that can
Suggested change:
> workloads. By dynamically selecting the least‐loaded or best‐performing model server it avoids hotspots that can
|
|
> As the Gateway API Inference Extension heads toward GA, planned features include:
>
> 1. **Prefix-cache aware load balancing** for remote caches
When formatting a numbered list, it's a good idea to use this format:
1. text
1. text2
1. text3
It adds automatic ordering and avoids potential renumbering in the future (might not be the case here).
@askorupka: changing LGTM is restricted to collaborators.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
hey @lmktfy, according to the buddy review guidelines I've tried to add the lgtm label, but it looks like it's restricted to collaborators only.
lmktfy left a comment
Ok, let's merge this as draft. Once release comms are done, it can go live.
@graz-dev would you be willing to propose a publication date for this?
/LGTM
/approve
> The design introduces two new Custom Resources (CRDs) with distinct responsibilities, each aligning with a
> specific user persona in the AI/ML serving workflow:
>
> {{< figure src="inference-extension-resource-model.png" alt="Resource Model" class="diagram-large" clicktozoom="true" >}}
If you have the source images, please share them (say Hi in #sig-docs-blog in Slack, we'll see what we can do).
Switching to SVG can happen post publication.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lmktfy. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Hi @danehans, sorry I missed the previous mentions. I'm going to open the publication PR proposing a date later this evening! Thanks for the ping!
Description
Adds a blog post introducing the Gateway API inference extension project.
cc: @robscott @kfswain