
Conversation

@danehans
Contributor

Description

Adds a blog post introducing the Gateway API inference extension project.

cc: @robscott @kfswain

@k8s-ci-robot k8s-ci-robot added the area/blog Issues or PRs related to the Kubernetes Blog subproject label Feb 25, 2025
@k8s-ci-robot k8s-ci-robot added language/en Issues or PRs related to English language cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 25, 2025
@netlify

netlify bot commented Feb 25, 2025

Pull request preview available for checking

Built without sensitive environment variables

| Name | Link |
|------|------|
| 🔨 Latest commit | d444df9 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/67f5436addfaff0008a0ba17 |
| 😎 Deploy Preview | https://deploy-preview-49898--kubernetes-io-main-staging.netlify.app |

@danehans danehans changed the title Initial Gateway API Inference Extension Blog Post [WIP] Initial Gateway API Inference Extension Blog Post Feb 25, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 25, 2025
@danehans
Contributor Author

danehans commented Mar 7, 2025

@robscott PTAL and let me know if you would like any modifications.

@danehans
Contributor Author

TODO [danehans]: Add benchmarks and ref to: kubernetes-sigs/gateway-api-inference-extension#480 (when merged).

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 20, 2025

## Enter Gateway API Inference Extension

[Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) was created to address this gap by building on the existing [Gateway API](https://gateway-api.sigs.k8s.io/),
Contributor

On line 19 above, "... focused on HTTP path routing or ..."?

Contributor Author

@smarterclayton I resolved all your feedback in the latest commit other than this comment. Feel free to re-review and/or elaborate. Thanks again for your review.

standardize routing to inference workloads across the ecosystem. Key objectives include enabling model-aware
routing, supporting per-request criticalities, facilitating safe model roll-outs, and optimizing load balancing
based on real-time model metrics​. By achieving these, the project aims to reduce latency and improve accelerator
(GPU) utilization for AI workloads.
Contributor

I'd love if you could work in

"Adding the inference extension to your existing gateway makes it an Inference Gateway - enabling you to self-host large language models with a model as a service mindset"

or similar. Roughly hitting the two points "inference extends gateway = inference gateway", and "inference gateway = self-host genai/large models as model as a service"

Member

+1, I really like this framing, and we should use it as much as we can throughout this post and our docs.

I started to go a bit farther with this theme and realized that we could write a very compelling blog post after KubeCon when we have more Gateway implementations ready. That post could be titled "Introducing Kubernetes Inference Gateways", and have a section describing that an Inference Gateway is an "existing gateway + inference extension". To really sell that though, I think we need to have a variety of "Inference Gateways" ready to play with.

So if we think we'll end up with two separate blog posts here, maybe this initial one is focused on the project goal of extending any existing Gateway with specialized Inference routing capabilities, and then in a follow up blog post we can focus more on the "Inference Gateway" term when we have more examples to work with.

Or maybe we should just hold off on this post until we have more Inference Gateway examples. I'm not sure, open to ideas here.

Contributor Author

I like planting the "gateway + inference extension = inference gateway" seed here and using a follow-up post to drive the messaging.


2. **Endpoint Selection**
Instead of simply forwarding to any pod, the Gateway consults an inference-specific routing extension. This
Instead of simply forwarding to any pod, the Gateway consults an inference-specific routing extension, e.g. endpoint selection extension. This
Contributor

maybe instead of 'e.g.'

Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension - an endpoint selection extension - to pick the best of the available pods.

?

Contributor Author

Making the ^ change with one minor difference s/an endpoint selection extension/the Endpoint Selection Extension/
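To make that endpoint-selection hop concrete, a minimal sketch of the routing side follows. The Gateway name `inference-gateway` and pool name `vllm-llama2-7b-pool` are hypothetical, and the `inference.networking.x-k8s.io` group reflects the project's alpha API, so treat the exact fields as illustrative rather than authoritative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                     # hypothetical route name
spec:
  parentRefs:
  - name: inference-gateway           # hypothetical Gateway with the inference extension enabled
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io  # InferencePool backend instead of a core Service
      kind: InferencePool
      name: vllm-llama2-7b-pool             # hypothetical pool of model server Pods
```

Because the backend is an InferencePool rather than a Service, the Gateway asks the Endpoint Selection Extension which pod in the pool should receive each request instead of round-robining across endpoints.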

more, it helps ops teams deliver the right LLM services to the right users—smoothly and efficiently.

**Ready to learn more?** Visit the [project docs](https://gateway-api-inference-extension.sigs.k8s.io/) to dive deeper,
give Inference Extension a try with a few [simple steps](https://gateway-api-inference-extension.sigs.k8s.io/guides/),
Contributor

I would suggest saying "... give the Inference Gateway extension a try with a few ...".

Member

We probably want to hold off on publishing this until we've updated our guides to use proper "Inference Gateways" instead of Envoy patches. Maybe that's actually an argument for saving this until after KubeCon?

Contributor Author

The initial inference extension support landed in kgateway and I plan on adding an inference extension docs PR in the next few days.

@danehans danehans force-pushed the gie_kcon_blog branch 2 times, most recently from 942afc7 to 6bdf890 on March 21, 2025 19:31
@danehans danehans changed the title [WIP] Initial Gateway API Inference Extension Blog Post Initial Gateway API Inference Extension Blog Post Mar 21, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2025
Member

@robscott robscott left a comment

Thanks for the work on this @danehans!

---
layout: blog
title: "Introducing Gateway API Inference Extension"
date: 2025-02-21
Member

@danehans can we aim for a day that hasn't been claimed yet next week?


This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to
the client.

Member

Somewhere in this section I think it would be useful to mention the extensible nature of this model, and that new extensions can be developed that will be compatible with any Inference Gateway.

Contributor Author

I updated this section based on ^ feedback, PTAL.
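As a sketch of that extensibility point, the InferencePool itself can reference the extension that performs endpoint selection, so a different or custom extension can be plugged in without touching the route. The field names assume the project's alpha API, and the picker Service name `vllm-llama2-7b-epp` is made up for the example:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # alpha API; names and fields may change before GA
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool           # hypothetical pool name
spec:
  selector:
    app: vllm-llama2-7b               # label selecting the model server Pods
  targetPortNumber: 8000              # port the model servers listen on
  extensionRef:
    name: vllm-llama2-7b-epp          # hypothetical Service running the Endpoint Selection Extension
```

Pointing `extensionRef` at a different extension is, in principle, how a custom scheduler compatible with any Inference Gateway would be wired in.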

danehans added 8 commits April 8, 2025 08:30
@danehans danehans requested a review from sftim April 8, 2025 15:42
Member

@robscott robscott left a comment

Thanks @danehans!

/lgtm

---
layout: blog
title: "Introducing Gateway API Inference Extension"
date: 2025-03-31
Member

@sftim I think this blog post is pretty close to ready, what date can we target for publishing this?

date: 2025-03-31
slug: introducing-gateway-api-inference-extension
draft: true
author: >
Member

@shaneutt Unfortunately our previous attempts at greater inclusion actually backfired. I know we'd tried to link to a list of authors, but the end result is just an unlinked "Gateway API Contributors" (https://kubernetes.io/blog/2024/11/21/gateway-api-v1-2/).

Given that, I'd recommend keeping the authors as listed, though it may make sense to order alphabetically by first name instead of last name.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 11, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 6e4b047f7508a0eb1e19ef5ec62fca40e7775c8f

@danehans
Contributor Author

@sftim PTAL when you have a moment.

@lmktfy
Member

lmktfy commented Apr 27, 2025

@askorupka you're the author of #50673

I'd like you to be a buddy for @danehans on this PR. The idea of article writing buddies (and the new guidelines) merged after this PR was opened; please bear that in mind.

Please:

  • review this PR, paying attention to the guidelines and review hints
  • update your own PR based on any good practice that you realize you ought to be following
  • be compassionate to your fellow article author

@danehans
Contributor Author

danehans commented May 5, 2025

@askorupka please let me know if anything is needed to merge this PR.

@askorupka
Contributor

hey @danehans thanks for your patience - just saw it, looks like I missed the previous notification. Apologies!
I'll provide you with a review tomorrow if that's ok!

Contributor

@askorupka askorupka left a comment

/lgtm

That was an interesting read. Left a few (very) minor comments. Thanks @danehans for your patience!

long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server
may keep multiple inference sessions active and maintain in-memory token caches.

Traditional load balancers focused on HTTP path or round-robin lack the specialized capabilities needed
Contributor

Suggested change
Traditional load balancers focused on HTTP path or round-robin lack the specialized capabilities needed
Traditional load balancers focused on HTTP path or round-robin, lack the specialized capabilities needed

Comment on lines +55 to +56
In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform
operators manage where and how it’s served.
Contributor

Suggested change
In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform
operators manage where and how it’s served.
In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets the platform
operators manage where and how it’s served.

Member

I think the original is more idiomatic.
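To show that persona split in manifest form, a minimal InferenceModel might look like the following; the model names, weights, and criticality value are hypothetical, and the fields follow the project's alpha API as best understood, so treat it as a sketch rather than a reference:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # alpha API; subject to change
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot                  # the model name clients reference in requests
  criticality: Critical               # per-request criticality class (illustrative value)
  poolRef:
    name: vllm-llama2-7b-pool         # the platform-operator-owned InferencePool serving this model
  targetModels:                       # optional weighted split, e.g. for a safe roll-out
  - name: chatbot-v1
    weight: 90
  - name: chatbot-v2
    weight: 10
```

The AI/ML owner edits resources like this to control what is served and how roll-outs progress, while the platform operator manages the pool that determines where and how it runs.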

baseline, particularly as traffic increased beyond 400–500 QPS.

These results suggest that this extension's model‐aware routing significantly reduced latency for GPU‐backed LLM
workloads. By dynamically selecting the least‐loaded or best‐performing model server, it avoids hotspots that can
Contributor

Suggested change
workloads. By dynamically selecting the least‐loaded or best‐performing model server, it avoids hotspots that can
workloads. By dynamically selecting the least‐loaded or best‐performing model server it avoids hotspots that can


As the Gateway API Inference Extension heads toward GA, planned features include:

1. **Prefix-cache aware load balancing** for remote caches
Contributor

When formatting a numbered list, it's a good idea to use this format:

1. text
1. text2
1. text3

It adds automatic ordering and avoids potential renumbering in the future (might not be the case here).

@k8s-ci-robot
Contributor

@askorupka: changing LGTM is restricted to collaborators


In response to this:

/lgtm

That was an interesting read. Left a few (very) minor comments. Thanks @danehans for patience!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@askorupka
Contributor

hey @lmktfy, according to the buddy review guidelines I've tried to add the lgtm label, but it looks like it's restricted to collaborators only

Member

@lmktfy lmktfy left a comment

Ok, let's merge this as draft. Once release comms are done, it can go live.

@graz-dev would you be willing to propose a publication date for this?
/LGTM
/approve

The design introduces two new Custom Resources (CRDs) with distinct responsibilities, each aligning with a
specific user persona in the AI/ML serving workflow​:

{{< figure src="inference-extension-resource-model.png" alt="Resource Model" class="diagram-large" clicktozoom="true" >}}
Member

If you have the source images, please share them (say Hi in #sig-docs-blog in Slack, we'll see what we can do).

Switching to SVG can happen post publication.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmktfy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2025
@k8s-ci-robot k8s-ci-robot merged commit 452d17a into kubernetes:main May 7, 2025
6 checks passed
@danehans
Contributor Author

@lmktfy @graz-dev checking to see if anything is needed to publish this blog post.

cc: @robscott

@graz-dev
Contributor

Hi @danehans, sorry I missed the previous mentions. I'm going to open the publication PR proposing a date later this evening!
I'll share the link to the new PR here :)

Thanks for the ping!


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/blog Issues or PRs related to the Kubernetes Blog subproject
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
language/en Issues or PRs related to English language
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
