AI inference: demonstrate in-cluster storage of models #575
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: justinsb. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 287ae85 to 97a98d7.
This example demonstrates how we can serve models from inside the cluster, without needing to bake them into the container images or rely on pulling them from services like huggingface. We may also want to support storing models in GCS or S3 in the future, but this example focuses on storing models without cloud dependencies. We may also want to investigate serving models from container images, particularly given the upcoming support for mounting container images as volumes, but this approach works today and allows for more dynamic model loading (e.g. loading new models without restarting pods). Moreover, a container image server is backed by a blob server, as introduced here.
Force-pushed from 97a98d7 to e7a7cac.
Heavily inspired by @seans3's work in the vllm-deployment example! And now with a README (with similar inspiration). Looks like we aren't checking copyright headers, so I will look into adding that in a separate PR.
/assign
```
on AI-conformant kubernetes clusters.

We (aspirationally) aim to demonstrate the capabilities of the AI-conformance
profile. Where we cannot achieve production-grade inference, we hope to
```
nit: remove "profile" everywhere. We were advised not to use the term "profile" for Kubernetes AI conformance, given that there was historically an effort to define subsets (not supersets) of Kubernetes Conformance with this term.
```python
def get_image_prefix():
    """Constructs the image prefix for a container image."""
    project_id = get_gcp_project()
    return f"gcr.io/{project_id}/"
```
nit: adopt the same change as in kubernetes-sigs/agent-sandbox#13, e.g. supporting an IMAGE_PREFIX env var.
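A minimal sketch of what that could look like; the `get_gcp_project()` stub stands in for the existing helper in this script, and the exact `IMAGE_PREFIX` semantics (including the trailing-slash normalization) are illustrative, not the agreed-on interface:

```python
import os

def get_gcp_project():
    # Stand-in for the existing helper in the build script.
    return os.environ.get("GCP_PROJECT", "my-project")

def get_image_prefix():
    """Constructs the image prefix for a container image.

    If an IMAGE_PREFIX environment variable is set, it takes precedence
    over the GCR default derived from the GCP project, so the scripts
    can target registries other than gcr.io.
    """
    prefix = os.environ.get("IMAGE_PREFIX")
    if prefix:
        # Normalize so callers can concatenate image names directly.
        return prefix if prefix.endswith("/") else prefix + "/"
    project_id = get_gcp_project()
    return f"gcr.io/{project_id}/"
```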
```
# gemma3-6cf4765df9-c4nmt gemma3 DEBUG 09-08 14:57:56 [__init__.py:99] CUDA platform is not available because: NVML Shared Library Not Found

# FROM vllm/vllm-openai:v0.10.0
```
nit: remove the commented-out part if it's not needed.
```
1. `blob-server`, a statefulset with a persistent volume to hold the model blobs (files)

1. `gemma3`, a deployment running vLLM, with a frontend go process that will download the model from `blob-server`.
```
Do we need to merge this with https://github.com/kubernetes/examples/tree/master/AI/vllm-deployment? The other one doesn't provide persistent model storage.
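For context, a hypothetical smoke test of the two-component layout the list above describes, fetching a blob from the in-cluster server; the `blob-server` service name comes from the manifests, but the port, path, and file name are placeholders, not taken from this PR:

```bash
# Run a throwaway pod and request a (hypothetical) model file from the
# blob server over plain in-cluster HTTP. Adjust port/path to the manifests.
kubectl run blob-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -fsS -o /dev/null -w '%{http_code}\n' \
  http://blob-server/models/gemma3/config.json
```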
```bash
kubectl delete deployment gemma3
kubectl delete statefulset blob-server
```
Need to delete the PVC as well for full cleanup
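Something along these lines, assuming the PVC follows the usual `<claim-template>-<statefulset>-<ordinal>` naming for claims created from `volumeClaimTemplates`; the `data` template name is a guess, not taken from the manifest:

```bash
kubectl delete deployment gemma3
kubectl delete statefulset blob-server
# Deleting a StatefulSet does not remove PVCs created from its
# volumeClaimTemplates, so delete the claim explicitly
# (name assumed to be data-blob-server-0).
kubectl delete pvc data-blob-server-0
```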
```yaml
  selector:
    matchLabels:
      app: blob-server
  #serviceName: blob-server
```
nit: remove this, given that it's not needed.