In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma deployment to GKE using vLLM on GPU

<a target="_blank" href="https://colab.research.google.com/github/moficodes/multi-llm/blob/main/notebooks/serve_gemma_on_gke_using_vllm.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Overview

This notebook demonstrates how to download and deploy Gemma, a family of open models from Google DeepMind, using vLLM, an inference server designed to improve serving throughput. In this notebook we deploy and serve Gemma with vLLM on GPUs. The guide uses L4 GPUs, but it should also work with A100 (40 GB), A100 (80 GB), and H100 (80 GB) GPUs.


### Objective

Deploy Gemma with vLLM on GPUs and run inference against it.

### GPUs

GPUs let you accelerate specific workloads running on your nodes such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.

Before you use GPUs in GKE, we recommend that you complete the following learning path:

- Learn about [current GPU version availability](https://cloud.google.com/compute/docs/gpus)
- Learn about [GPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/gpus)


### vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

## Before you begin

### Configure Environment

Set the following variables for the experiment environment.

In [None]:
# The Hugging Face token used to download models.
HF_TOKEN = ""  # @param {type:"string"}

# The size of the model to launch
MODEL_SIZE = "1b"  # @param ["1b", "4b", "12b"]
# Cloud project id.
PROJECT_ID = "mofilabs"  # @param {type:"string"}
# Region for launching clusters.
REGION = "us-central1"  # @param {type:"string"}
# The cluster name to create
CLUSTER_NAME = "gke-llm-inference"  # @param {type:"string"}

# The number of GPUs to run: 1 for 1b or 4b, 2 for 12b
GPU_COUNT = 1
if MODEL_SIZE == "12b":
    GPU_COUNT = 2
# Ephemeral storage
EPHEMERAL_STORAGE_SIZE = "10Gi"
if MODEL_SIZE == "4b":
    EPHEMERAL_STORAGE_SIZE = "20Gi"
if MODEL_SIZE == "12b":
    EPHEMERAL_STORAGE_SIZE = "40Gi"

# Memory size
MEMORY_SIZE = "10Gi"
if MODEL_SIZE == "4b":
    MEMORY_SIZE = "20Gi"
if MODEL_SIZE == "12b":
    MEMORY_SIZE = "40Gi"
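# Number of tensor-parallel shards (one per GPU)
GPU_SHARD = 1
if MODEL_SIZE == "12b":
    GPU_SHARD = 2
# CPU request and limit for the inference container
CPU_LIMITS = 4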
if MODEL_SIZE == "12b":
    CPU_LIMITS = 10
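
The chained `if` statements above can also be written as a single lookup table, which keeps the per-size settings in one place and rejects an unsupported size early. The following is a minimal sketch: the names `RESOURCE_PROFILES` and `resolve_profile` are hypothetical, and the values simply mirror the ones in the cell above.

```python
# Hypothetical alternative to the chained if-statements above.
# Values are copied from the configuration cell; they are not tuned recommendations.
RESOURCE_PROFILES = {
    "1b": {"gpu_count": 1, "ephemeral_storage": "10Gi", "memory": "10Gi", "cpu": 4},
    "4b": {"gpu_count": 1, "ephemeral_storage": "20Gi", "memory": "20Gi", "cpu": 4},
    "12b": {"gpu_count": 2, "ephemeral_storage": "40Gi", "memory": "40Gi", "cpu": 10},
}


def resolve_profile(model_size: str) -> dict:
    """Return the resource settings for a model size, or raise on an unknown size."""
    try:
        profile = dict(RESOURCE_PROFILES[model_size])
    except KeyError:
        raise ValueError(
            f"Unsupported MODEL_SIZE {model_size!r}; "
            f"expected one of {sorted(RESOURCE_PROFILES)}"
        ) from None
    # The notebook uses one tensor-parallel shard per requested GPU.
    profile["gpu_shard"] = profile["gpu_count"]
    return profile
```

With this shape, `resolve_profile(MODEL_SIZE)["memory"]` would replace `MEMORY_SIZE`, and a typo in `MODEL_SIZE` fails immediately instead of silently falling back to the 1b defaults.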

In [None]:
! gcloud config set project "$PROJECT_ID"
! gcloud services enable container.googleapis.com

# Add kubectl to the set of available tools.
! mkdir -p /tools/google-cloud-sdk/.install
! gcloud components install kubectl --quiet

### Create a GKE cluster and a node pool

The command below creates the following resource, using the `CLUSTER_NAME`, `PROJECT_ID`, and `REGION` variables set above.

- Autopilot cluster

If you already have a cluster, you can skip to `Use an existing GKE cluster` instead.

In [None]:
! gcloud container clusters create-auto {CLUSTER_NAME} \
  --project={PROJECT_ID} \
  --region={REGION} \
  --release-channel=rapid

### Use an existing GKE cluster

In [None]:
! gcloud container clusters get-credentials {CLUSTER_NAME} --location {REGION}

Fetching cluster endpoint and auth data.
kubeconfig entry generated for gke-llm-inference.


### Create Kubernetes secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token.

In [None]:
! kubectl create secret generic hf-secret \
--from-literal=hf_api_token={HF_TOKEN} \
--dry-run=client -o yaml | kubectl apply -f -

secret/hf-secret created
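
For reference, the `kubectl create secret generic ... --dry-run=client -o yaml` command above renders a Secret manifest whose `data` values are base64-encoded. The sketch below builds a manifest of roughly the same shape with the standard library only. The helper name `build_hf_secret` is hypothetical, the explicit `type: Opaque` is an assumption about the default for generic Secrets, and the placeholder token is not a real credential.

```python
import base64
import json


def build_hf_secret(token: str, name: str = "hf-secret") -> dict:
    """Build a Secret manifest with the token stored under the key `hf_api_token`."""
    # Secret `data` values must be base64-encoded; this is encoding, not encryption.
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {"hf_api_token": encoded},
    }


# Placeholder value for illustration only.
manifest = build_hf_secret("hf_placeholder_token")
print(json.dumps(manifest, indent=2))
```

The key name `hf_api_token` matters because the Deployment in the next step reads the token through a `secretKeyRef` with that key.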


### Deploy vLLM

Use the following YAML manifest to deploy Gemma on vLLM.

In [None]:
K8S_YAML = f"""
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-gemma-server
  template:
    metadata:
      labels:
        app: vllm-gemma-server
        ai.gke.io/model: gemma-3-{MODEL_SIZE}-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01
        resources:
          requests:
            cpu: {CPU_LIMITS}
            memory: {MEMORY_SIZE}
            ephemeral-storage: {EPHEMERAL_STORAGE_SIZE}
            nvidia.com/gpu: {GPU_COUNT}
          limits:
            cpu: {CPU_LIMITS}
            memory: {MEMORY_SIZE}
            ephemeral-storage: {EPHEMERAL_STORAGE_SIZE}
            nvidia.com/gpu: {GPU_COUNT}
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size={GPU_SHARD}
        - --host=0.0.0.0
        - --port=8000
        env:
        - name: MODEL_ID
          value: google/gemma-3-{MODEL_SIZE}-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llm-service
spec:
  selector:
    app: vllm-gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
"""

with open("vllm.yaml", "w") as f:
    f.write(K8S_YAML)

! cat vllm.yaml



apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-gemma-server
  template:
    metadata:
      labels:
        app: vllm-gemma-server
        ai.gke.io/model: gemma-3-1b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01
        resources:
          requests:
            cpu: 4
            memory: 10Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 4
            memory: 10Gi
            ephemeral-storage: 10Gi
            nvidia.com/gpu: 1
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8
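
The `--model=$(MODEL_ID)` argument in the manifest relies on Kubernetes expanding `$(VAR_NAME)` references in `command` and `args` from the container's environment before the process starts. Because the reference uses parentheses rather than braces, the Python f-string leaves it untouched. The following sketch imitates that substitution in plain Python to show what the container receives. It is an illustration only, not the kubelet's implementation, and the function name `expand_k8s_args` is hypothetical.

```python
import re

# Matches $(NAME) and the escaped form $$(NAME).
_REF = re.compile(r"(\$\$|\$)\(([^)]+)\)")


def expand_k8s_args(args: list, env: dict) -> list:
    """Expand $(NAME) references in args using env, mimicking Kubernetes behavior."""

    def substitute(match):
        prefix, name = match.group(1), match.group(2)
        if prefix == "$$":
            # $$(NAME) is an escape and yields the literal text $(NAME).
            return f"$({name})"
        # Undefined variables are left unchanged.
        return env.get(name, match.group(0))

    return [_REF.sub(substitute, arg) for arg in args]


print(expand_k8s_args(
    ["--model=$(MODEL_ID)", "--tensor-parallel-size=1"],
    {"MODEL_ID": "google/gemma-3-1b-it"},
))
```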

In [None]:
! kubectl apply -f vllm.yaml

deployment.apps/vllm-gemma-deployment created
service/vllm-llm-service created


#### Wait for the container to be created

Use the command below to check on the status of the container.

In [None]:
! kubectl get pod -w
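
`kubectl get pod -w` streams updates until it is interrupted, which blocks the notebook cell. As an alternative, the sketch below polls `kubectl get pods -o json` and returns once every matching pod reports the `Ready` condition. It assumes `kubectl` is on the `PATH` and already points at the cluster. The function names and the timeout values are hypothetical.

```python
import json
import subprocess
import time


def all_pods_ready(pod_list: dict) -> bool:
    """Return True if the list is non-empty and every pod has condition Ready=True."""
    items = pod_list.get("items", [])
    if not items:
        return False
    for pod in items:
        conditions = pod.get("status", {}).get("conditions", [])
        if not any(
            c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
        ):
            return False
    return True


def wait_for_pods(
    selector: str = "app=vllm-gemma-server",
    timeout_s: int = 1200,  # assumed upper bound, adjust as needed
    interval_s: int = 15,
) -> bool:
    """Poll kubectl until all pods matching the selector are Ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["kubectl", "get", "pods", "-l", selector, "-o", "json"],
            capture_output=True,
            text=True,
            check=True,
        )
        if all_pods_ready(json.loads(result.stdout)):
            return True
        time.sleep(interval_s)
    return False
```

Because the Deployment above defines no readiness probe, `Ready` only means the container has started. The log check in the next step is still needed to confirm that the model has loaded.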

#### View the logs from the running deployment

The container first downloads the needed artifacts. This process takes close to 5 minutes, depending on the runtime you use for your Colab environment. The server is up and ready to take inference requests once you see log messages like these:

```
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

In [None]:
! kubectl logs -f -l app=vllm-gemma-server

#### Send a test request

In [None]:
# The "model" value must match the model served by the Deployment (google/gemma-3-<MODEL_SIZE>-it).
! kubectl exec -t $( kubectl get pod -l app=vllm-gemma-server -o jsonpath="{.items[0].metadata.name}" ) \
   -c inference-server -- curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "google/gemma-3-4b-it","messages": [{"role": "user", "content": "Why is the sky blue?"}]}' \
    2> /dev/null

{"id":"chatcmpl-2aa7c4dc097442c391b0cd2dc5e66d6c","object":"chat.completion","created":1742696545,"model":"google/gemma-3-4b-it","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, let's break down why the sky is blue! It's a surprisingly complex phenomenon, but here's the simplified explanation:\n\n**1. Sunlight and Colors:**\n\n* Sunlight is actually made up of all the colors of the rainbow – red, orange, yellow, green, blue, indigo, and violet.  Think of a prism splitting sunlight into those colors.\n\n**2. The Role of the Atmosphere (specifically, Rayleigh Scattering):**\n\n* **Air Molecules:** Earth’s atmosphere isn't empty space. It's filled with tiny air molecules – mostly nitrogen and oxygen.\n* **Scattering:** When sunlight enters the atmosphere, it collides with these air molecules. This causes the light to *scatter* in different directions.\n* **Rayleigh Scattering:**  Here's the key: **Rayleigh scattering** is the phenomenon where s
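
The same request can also be sent from Python. The sketch below assumes the Service has been forwarded to the local machine, for example with `kubectl port-forward service/vllm-llm-service 8000:8000` running in a separate terminal. The helper names are hypothetical. Building the payload from `MODEL_SIZE` keeps the `model` field in step with the model the Deployment serves.

```python
import json
import urllib.request


def build_chat_payload(model_size: str, prompt: str) -> dict:
    """Build a chat completions request body for the Gemma model of the given size."""
    return {
        "model": f"google/gemma-3-{model_size}-it",
        "messages": [{"role": "user", "content": prompt}],
    }


def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat completions response."""
    return response["choices"][0]["message"]["content"]


def chat(model_size: str, prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST a prompt to the forwarded vLLM endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model_size, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

Once the port-forward is active, `print(chat(MODEL_SIZE, "Why is the sky blue?"))` would print only the generated text instead of the full JSON response.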

## Clean up resources

In [None]:
! kubectl delete deployments vllm-gemma-deployment
! kubectl delete services vllm-llm-service
! kubectl delete secrets hf-secret

In [None]:
! gcloud container clusters delete {CLUSTER_NAME} \
  --region={REGION} \
  --quiet