Warning
This is a platform-specific (Apple Silicon), local-first implementation of the proposed approach. Test it locally and validate the solution; if you then want to bring it to managed Kubernetes services or to different GPU architectures (NVIDIA), read the "Migrating to the cloud" chapter of this guide.
This tutorial demonstrates a golden path for deploying GenAI Models (LLMs / SLMs) locally on Apple Silicon Macs, leveraging GPU acceleration and following MLOps best practices. The goal is to create a reliable, repeatable, and automated workflow. How it works:
- Minikube (`krunkit` driver): Provides a lightweight Kubernetes cluster running in a minimal VM, directly leveraging macOS's Virtualization framework. The `krunkit` driver is key as it enables GPU passthrough.
- GPU Passthrough (`krunkit` + Generic Device Plugin): `krunkit` exposes the Mac's GPU (via Vulkan) to the VM. The `squat/generic-device-plugin` then advertises this GPU device (`/dev/dri`) to the Kubernetes scheduler under the resource name `squat.ai/dri`, making it requestable by pods.
- Model Serving (`ramalama`/llama.cpp): We use the `ramalama` container image, which bundles the llama.cpp inference server. This server is highly optimized for running GGUF-formatted models efficiently on various hardware, including CPUs and GPUs via Vulkan.
- Model Format (GGUF): GGUF is a versatile format specifically designed for llama.cpp, allowing models to be loaded efficiently. We pre-download the model and mount it into the cluster.
- Packaging (Helm): We package the ramalama server and the Open WebUI interface as Helm charts. This standardizes the application definition and configuration.
- GitOps (ArgoCD): Argo CD monitors a Git repository containing the Helm charts and application definitions. It automatically synchronizes the cluster state to match the desired state defined in Git, enabling automated, auditable deployments.
- Service Mesh (Istio): Istio manages network traffic. We use an Istio Gateway and VirtualService to securely expose the model API (`/v1`) and the Web UI (`/`) through a single entry point.
While vLLM is a popular high-performance inference server, the standard vllm/vllm-openai image is built for NVIDIA GPUs and relies heavily on CUDA libraries. These are not available in the krunkit VM or on Apple Silicon, causing the container to crash even in CPU mode, as it still expects CUDA libraries to be present.
Therefore, the ramalama (llama.cpp) + GGUF approach is the most reliable and performant method to achieve GPU-accelerated LLM inference within the krunkit environment on macOS. It directly leverages the Vulkan passthrough provided by krunkit.
This phase installs all the necessary command-line tools for managing Kubernetes, packaging applications, interacting with Git, and running the specific Minikube driver.
# Install core Kubernetes and development tools via Homebrew
brew install minikube kubectl helm git
# Install the krunkit VMM driver and its tap
brew tap slp/krunkit && brew install krunkit
# Install the vmnet-helper for krunkit networking
# This requires root permissions to manage network interfaces
curl -fsSL https://github.com/minikube-machine/vmnet-helper/releases/latest/download/install.sh | bash
/opt/vmnet-helper/bin/vmnet-helper --version # Verify installation
# Install istioctl
curl -L https://istio.io/downloadIstioctl | sh -
export PATH=$HOME/.istioctl/bin:$PATH # Add istioctl to PATH for this session
istioctl version # Verify installation
If you cloned this repository and are working in it, you already have a `models` folder, so move into it.
cd models
# Download the model from HuggingFace, in this case TinyLlama 1.1B
curl -L -o tinyllama-1.1b-chat-v1.0.Q8_0.gguf 'https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf?download=true' # -o keeps the query string out of the filename
This phase creates the Kubernetes cluster using Minikube with the krunkit driver, configures GPU access within Kubernetes, and installs the core MLOps platform components (Istio and Argo CD).
# Delete previous cluster if needed
minikube delete --all
# Start Minikube using the krunkit driver
# Allocate sufficient resources (adjust if needed)
# Mount the host's ~/models directory into the VM at /mnt/models
minikube start --driver krunkit \
--memory=16g --cpus=4 \
--mount \
--mount-string="<absolute-path-to-the-models-folder>/:/mnt/models"
# --- Install GPU Device Plugin ---
# This DaemonSet runs on the node and detects the /dev/dri device (GPU)
# It advertises the GPU to Kubernetes as the resource "squat.ai/dri"
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io: generic-device-plugin
  template:
    metadata:
      labels:
        app.kubernetes.io: generic-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: dri
          groups:
            - count: 4
              paths:
                - path: /dev/dri
        name: generic-device-plugin
        resources:
          limits:
            cpu: 50m
            memory: 20Mi
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
EOF
# Verify that Kubernetes sees the GPU resource
kubectl get nodes -o jsonpath='{.items[0].status.allocatable}' | grep 'squat.ai/dri'
# You should get something like:
# {"cpu":"2","ephemeral-storage":"17734596Ki","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"3920780Ki","pods":"110","squat.ai/dri":"4"}
# Here it's important to have "squat.ai/dri":"4"
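# --- (Optional) Smoke-test the GPU resource from a pod ---
# Illustrative check, not part of the deployment: a container claims the GPU by
# requesting "squat.ai/dri" in its resources. The pod name and busybox image
# are arbitrary choices for this test; delete the pod afterwards.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dri-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: check
    image: busybox
    command: ["ls", "-l", "/dev/dri"]
    resources:
      limits:
        squat.ai/dri: 1
EOF
# Give it a few seconds to run; the log should list the DRI device nodes
kubectl logs pod/dri-smoke-test
kubectl delete pod dri-smoke-test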
# --- Install Istio Service Mesh ---
# Use istioctl for reliable installation
istioctl install --set profile=default -y
# Enable automatic Istio sidecar injection for the 'default' namespace
kubectl label namespace default istio-injection=enabled --overwrite
# --- Install Argo CD GitOps Controller ---
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
Note
In this repo you can find both `tinyllama-gpu-chart` (to deploy the model) and `webui-chart` (to interact with the model). Keep them in your repository, or move them to a separate repository to decouple the charts from the applications repo and reuse them for multiple applications.
You should configure:
- `app-llama.yaml`: Defines the Argo CD application for the TinyLlama server, pointing to the `tinyllama-gpu-chart`. Edit `spec.source.repoURL` with the URL of the repository where the chart is stored.
- `app-webui.yaml`: Defines the Argo CD application for the Open WebUI, pointing to the `webui-chart`. Edit `spec.source.repoURL` with the URL of the repository where the chart is stored.
- `app-gateway.yaml`: Defines the Istio `Gateway` (entry point) and `VirtualService` (routing rules: `/v1` to TinyLlama, `/` to the WebUI). An illustrative sketch follows.
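For reference, the routing in `app-gateway.yaml` typically looks like the sketch below. This is only an illustration: the service names and ports are assumptions and must match what your charts actually create, so treat the manifest shipped in the repo as the source of truth.
# Example (illustrative sketch): app-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: llm-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-routes
  namespace: default
spec:
  hosts:
  - "*"
  gateways:
  - llm-gateway
  http:
  - match:
    - uri:
        prefix: /v1
    route:
    - destination:
        host: tinyllama-gpu # assumed Service name created by tinyllama-gpu-chart
        port:
          number: 8080 # assumed llama.cpp server port
  - route: # default route: everything else goes to the WebUI
    - destination:
        host: open-webui # assumed Service name created by webui-chart
        port:
          number: 8080 # assumed Open WebUI port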
Then commit and push everything to GitHub and apply the ArgoCD Application and Istio Gateway manifests:
kubectl apply -f app-llama.yaml
kubectl apply -f app-webui.yaml
kubectl apply -f app-gateway.yaml
This final phase verifies that the applications are running and accessible.
# --- Start Minikube Tunnel ---
# Run this in a NEW, separate terminal window and keep it running.
# It exposes the Istio Ingress Gateway service IP (usually 127.0.0.1) on your host.
minikube tunnel
# --- Verify Pods and Access ---
# In your original terminal:
# Wait for pods to be ready
kubectl wait --for=condition=Ready pods --all --timeout=5m
# Get the Ingress IP (should be 127.0.0.1 due to the tunnel)
export INGRESS_IP=$(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "✅ Access the Web UI at: http://${INGRESS_IP}"
echo " (API endpoint is available at http://${INGRESS_IP}/v1)"
# --- Interact ---
# Open the Ingress IP in your web browser.
# You should see the Open WebUI interface.
# 1. Create a local account when prompted.
# 2. The TinyLlama model should be automatically detected (via the OPENAI_API_BASE_URLS env var).
# 3. Select "TinyLlama" and start chatting!
# (Optional) Test the API directly with curl:
# curl http://${INGRESS_IP}/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{ "model": "tinyllama", "messages": [{"role": "user", "content": "Explain Kubernetes simply"}], "max_tokens": 50 }'
This setup, using ramalama (llama.cpp), can run any model available in GGUF format. You can find many on Hugging Face. Examples include:
- Mistral 7B GGUF: Larger and more capable than TinyLlama.
- Llama 3 GGUF: Various sizes (8B, 70B - check resource requirements!).
- Phi-3 GGUF: Small, powerful models from Microsoft.
- Gemma GGUF: Google's open models.
To use a different model:
- Download the desired `.gguf` file to your `models` directory.
- Update the `--model` argument in `tinyllama-gpu-chart/values.yaml` to point to the new filename (see the sketch below).
- Commit and push the change to Git. Argo CD will automatically update the deployment. (You might need to adjust memory/CPU requests in `values.yaml` for larger models.)
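As an illustration, assuming `tinyllama-gpu-chart` exposes the server arguments through a `modelArgs` list (the field names here mirror the generic chart shown later in this guide and are an assumption; check the chart's actual `values.yaml`), switching to a Mistral GGUF could look like:
# Example (illustrative excerpt): tinyllama-gpu-chart/values.yaml
modelArgs:
  - "--model"
  - "/mnt/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf" # new .gguf filename
  - "--alias"
  - "mistral-7b"
resources:
  limits:
    squat.ai/dri: 1
    memory: "8Gi" # larger models need more RAM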
What if we want to standardize this approach and make it available to our developers in a self-service way? This setup enables developers to easily deploy new model instances using a pre-defined, standardized Helm chart.
Note
This approach uses a standardized Helm chart available in a separate repo / Helm registry. For each model we want to deploy, we create a dedicated project repo (GitOps repo) to store the chart values (specific to the model being deployed) and the ArgoCD Application manifest for the deployment.
- Prepare the Model:
  - Local `ramalama`: Download the required `.gguf` model file to the shared location accessible by Minikube (e.g., `~/models`).
  - Cloud `vLLM`: Ensure the model exists on Hugging Face or another accessible model registry.
- Define Project Configuration (`<project>-values.yaml`):
  - In the GitOps repository (where Argo CD `Application` manifests reside), create a new YAML file specific to this deployment (e.g., `my-app/mistral-7b-values.yaml`).
  - Inside this file, specify the values that differ from the generic chart's defaults, such as:
    - `image.repository` / `image.tag` (if using a different server version)
    - `modelArgs` (e.g., `--model` path/name, `--alias`)
    - `resources` (CPU, memory, GPU type `squat.ai/dri` or `nvidia.com/gpu`, and quantity)
    - `persistence.size` (if needed)
    - Any other custom parameters exposed by the generic chart.
# Example: my-app/mistral-7b-gguf-values.yaml (for local ramalama)
modelArgs:
  - "--model"
  - "/mnt/models/mistral-7b-instruct-v0.2.Q5_K_M.gguf" # Path to the specific model
  - "--alias"
  - "mistral-7b-chat"
  - "-ngl"
  - "999"
resources:
  limits:
    squat.ai/dri: 1
    memory: "8Gi" # More RAM for Mistral
  requests:
    squat.ai/dri: 1
    memory: "8Gi"
- Create Argo CD Application Manifest (`<project>-app.yaml`):
  - Create a new `Application` manifest (e.g., `my-app/mistral-app.yaml`) in the GitOps repository.
  - Point `source.repoURL`, `source.chart`, and `source.targetRevision` to the central Helm chart repository and the generic LLM chart.
  - Use `source.helm.valueFiles` to point to the project-specific values file created in step 2 (relative path within the GitOps repo).
# Example: my-app/mistral-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-mistral-deployment
  namespace: argocd
spec:
  project: default
  source:
    # --- Source Chart from Helm Repo ---
    repoURL: 'http://my-central-helm-repo.example.com' # Central Helm repo URL
    chart: generic-ramalama-chart # Name of the generic chart
    targetRevision: 1.0.0 # Version of the chart
    helm:
      # --- Override with project values from GitOps Repo ---
      valueFiles:
        - my-app/mistral-7b-gguf-values.yaml # Path to values in THIS repo
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: my-app-namespace # Target namespace for deployment
  syncPolicy:
    automated: { prune: true, selfHeal: true }
- Update Istio Gateway (Optional):
  - If a unique URL path is needed, edit the central `app-gateway.yaml` (or a project-specific one) in the GitOps repository.
  - Add a new `match` block to the `VirtualService` to route a specific prefix (e.g., `/my-mistral/v1`) to the new Kubernetes Service created by Helm (e.g., `my-mistral-deployment-generic-ramalama-chart.my-app-namespace.svc.cluster.local`); see the sketch after this list.
- Commit & Push: Commit the new `values.yaml`, `Application` manifest, and any Gateway changes to the GitOps repository.
- Bootstrap: Apply the new `Application` manifest: `kubectl apply -f my-app/mistral-app.yaml`.
- Result: Argo CD will automatically pull the generic chart, apply the project-specific values, and deploy the new LLM instance according to the configuration in Git.
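A hedged sketch of such a route addition follows: this is the new entry to add under the VirtualService's `http:` list. The service hostname, port, and the `rewrite` are assumptions based on the example names above; adapt them to the Service your chart actually creates.
# Example (illustrative): extra route entry in the VirtualService
  - match:
    - uri:
        prefix: /my-mistral/v1
    rewrite:
      uri: /v1 # strip the project prefix before forwarding to the server
    route:
    - destination:
        host: my-mistral-deployment-generic-ramalama-chart.my-app-namespace.svc.cluster.local
        port:
          number: 8080 # assumed server port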
Migrating the MLOps workflow to EKS with NVIDIA GPUs using vLLM follows the same GitOps principles but requires changing the infrastructure targets and application configuration via a different set of values.
- Cluster: Provision an Amazon EKS cluster.
- Nodes: Create EKS Managed Node Groups using EC2 instances with NVIDIA GPUs (e.g., `p5`, `g5`).
- GPU Drivers/Plugin: Ensure NVIDIA drivers and the NVIDIA Kubernetes Device Plugin are installed on the GPU nodes (often handled by EKS AMIs or addons). This advertises `nvidia.com/gpu`.
- Storage: Define a Kubernetes `StorageClass` backed by AWS EBS (a sketch follows this list).
- Ingress: Configure the `istio-ingressgateway` service (type `LoadBalancer`) to integrate with an AWS Load Balancer (ALB/NLB) and configure DNS.
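A minimal sketch of such a `StorageClass` is shown below. It assumes the AWS EBS CSI driver addon is installed on the cluster; the name matches the `ebs-gp3` value referenced in the EKS values file further down.
# Example (illustrative): ebs-gp3 StorageClass for EKS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer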
- Assume a Generic vLLM Chart: You would likely have a separate generic Helm chart optimized for vLLM deployments (`generic-vllm-chart`) stored in your central Helm repository. This chart would be designed to accept parameters for model name, resources (including `nvidia.com/gpu`), PVCs, etc.
- Create Cloud-Specific Values (`<project>-eks-values.yaml`):
  - In your GitOps repository, create a `values.yaml` file specifically for the EKS deployment (e.g., `my-app/mistral-eks-values.yaml`).
  - This file overrides the generic vLLM chart's defaults:
# Example: my-app/mistral-eks-values.yaml
image:
  repository: vllm/vllm-openai # Official vLLM image
  tag: latest # Or a specific version
modelArgs:
  - "--model"
  - "mistralai/Mistral-7B-Instruct-v0.1" # Model name from Hugging Face
  - "--host"
  - "0.0.0.0"
  - "--port"
  - "8000"
  - "--served-model-name"
  - "mistral-7b-instruct-gpu"
  # '--device cuda' is often implicit but can be added
resources:
  limits:
    nvidia.com/gpu: 1 # <-- Request NVIDIA GPU
    memory: "32Gi" # <-- Adjust RAM based on instance type/model
  requests:
    nvidia.com/gpu: 1
    memory: "32Gi"
    cpu: "8000m" # <-- Adjust CPU based on instance type
persistence:
  enabled: true
  storageClass: ebs-gp3 # <-- Use the EBS StorageClass defined in EKS
  size: 50Gi # Cache size for the downloaded model
  mountPath: /root/.cache/huggingface
- Create Cloud Argo CD Application (`<project>-eks-app.yaml`):
  - Create a new `Application` manifest targeting the EKS cluster.
  - Point `source` to the generic vLLM chart in your Helm repository.
  - Point `source.helm.valueFiles` to the EKS-specific values file (e.g., `my-app/mistral-eks-values.yaml`).
  - Set `destination.namespace` and potentially `destination.server` if targeting a specific EKS cluster API endpoint managed by Argo CD.
# Example: my-app/mistral-eks-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-mistral-eks-deployment
  namespace: argocd
spec:
  project: default
  source:
    # --- Source Generic vLLM Chart ---
    repoURL: 'http://my-central-helm-repo.example.com' # Central Helm repo URL
    chart: generic-vllm-chart # Generic chart for vLLM
    targetRevision: 1.1.0 # Chart version
    helm:
      # --- Override with EKS-specific values ---
      valueFiles:
        - my-app/mistral-eks-values.yaml # Values for EKS deployment
  destination:
    server: 'https://<eks-cluster-api-server>' # Target EKS cluster
    namespace: production-llm # Target namespace
  # ... syncPolicy ...
- Configure Istio Gateway: The `Gateway` and `VirtualService` definitions in the GitOps repo remain conceptually similar, but the `host` in the `Gateway` might be set to a specific domain name (e.g., `mistral.mycompany.com`), and the underlying Load Balancer handles the external traffic (see the sketch below).
- Commit, Push, Apply: Commit the EKS-specific `values.yaml` and `Application` manifest to the GitOps repo, then apply the `Application`: `kubectl apply -f my-app/mistral-eks-app.yaml`.
- Result: Argo CD deploys the generic vLLM chart to EKS, configured with the EKS-specific values, requesting NVIDIA GPUs and using EBS for storage. The core MLOps workflow (Git commit -> Argo sync -> Deployment) remains unchanged.
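For instance, a minimal sketch of the `Gateway` with a dedicated host might look like the following (names are illustrative and TLS configuration is omitted; the DNS record for the host points at the AWS load balancer fronting `istio-ingressgateway`):
# Example (illustrative): Gateway host entry for the EKS setup
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: llm-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "mistral.mycompany.com" # dedicated domain instead of "*"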