AI inference: demonstrate in-cluster storage of models #575
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: justinsb. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 287ae85 to 97a98d7.
This example demonstrates how we can serve models from inside the cluster, without needing to bake them into the container images or rely on pulling them from services like huggingface. We may also want to support storing models in GCS or S3 in the future, but this example focuses on storing models without cloud dependencies. We may also want to investigate serving models from container images, particularly given the upcoming support for mounting container images as volumes, but this approach works today and allows for more dynamic model loading (e.g. loading new models without restarting pods). Moreover, a container image server is backed by a blob server, as introduced here.
Force-pushed from 97a98d7 to e7a7cac.
Heavily inspired by @seans3's work in the vllm-deployment example! And now with a README (with similar inspiration). Looks like we aren't checking copyright headers, so I will look into adding that in a separate PR.
/assign
```
on AI-conformant kubernetes clusters.

We (aspirationally) aim to demonstrate the capabilities of the AI-conformance
profile. Where we cannot achieve production-grade inference, we hope to
```
nit: remove "profile" everywhere. We were advised not to use the term "profile" for Kubernetes AI conformance, given that there was historically an effort to define subsets (not supersets) of Kubernetes Conformance with this term.
```python
def get_image_prefix():
    """Constructs the image prefix for a container image."""
    project_id = get_gcp_project()
    return f"gcr.io/{project_id}/"
```
nit: adopt the same change as in kubernetes-sigs/agent-sandbox#13, e.g. supporting an IMAGE_PREFIX env var.
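A minimal sketch of what that could look like; the `get_gcp_project()` stub stands in for the existing helper in this script, and the exact `IMAGE_PREFIX` semantics (including the trailing-slash normalization) are illustrative, not the agreed-on interface:

```python
import os

def get_gcp_project():
    # Stand-in for the existing helper in the build script.
    return os.environ.get("GCP_PROJECT", "my-project")

def get_image_prefix():
    """Constructs the image prefix for a container image.

    If an IMAGE_PREFIX environment variable is set, it takes precedence
    over the GCR default derived from the GCP project, so the scripts
    can target registries other than gcr.io.
    """
    prefix = os.environ.get("IMAGE_PREFIX")
    if prefix:
        # Normalize so callers can concatenate image names directly.
        return prefix if prefix.endswith("/") else prefix + "/"
    project_id = get_gcp_project()
    return f"gcr.io/{project_id}/"
```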
```
# gemma3-6cf4765df9-c4nmt gemma3 DEBUG 09-08 14:57:56 [__init__.py:99] CUDA platform is not available because: NVML Shared Library Not Found

# FROM vllm/vllm-openai:v0.10.0
```
nit: remove the commented-out part if it's not needed.
```
1. `blob-server`, a statefulset with a persistent volume to hold the model blobs (files)

1. `gemma3`, a deployment running vLLM, with a frontend go process that will download the model from `blob-server`.
```
Do we need to merge this with https://github.com/kubernetes/examples/tree/master/AI/vllm-deployment? The other one doesn't provide persistent model storage.
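For context, a hypothetical smoke test of the two-component layout the list above describes, fetching a blob from the in-cluster server; the `blob-server` service name comes from the manifests, but the port, path, and file name are placeholders, not taken from this PR:

```bash
# Run a throwaway pod and request a (hypothetical) model file from the
# blob server over plain in-cluster HTTP. Adjust port/path to the manifests.
kubectl run blob-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -fsS -o /dev/null -w '%{http_code}\n' \
  http://blob-server/models/gemma3/config.json
```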
```bash
kubectl delete deployment gemma3
kubectl delete statefulset blob-server
```
Need to delete the PVC as well for full cleanup
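Something along these lines, assuming the PVC follows the usual `<claim-template>-<statefulset>-<ordinal>` naming for claims created from `volumeClaimTemplates`; the `data` template name is a guess, not taken from the manifest:

```bash
kubectl delete deployment gemma3
kubectl delete statefulset blob-server
# Deleting a StatefulSet does not remove PVCs created from its
# volumeClaimTemplates, so delete the claim explicitly
# (name assumed to be data-blob-server-0).
kubectl delete pvc data-blob-server-0
```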
```yaml
  selector:
    matchLabels:
      app: blob-server
  #serviceName: blob-server
```
nit: remove this, given that it's not needed.