From c7cbf1addc400f5465a3d536f1ba78ca2bf7d79b Mon Sep 17 00:00:00 2001
From: Olivier Tardieu
Date: Fri, 21 Mar 2025 16:01:44 -0400
Subject: [PATCH] Include command to stream pod logs in vllm example

---
 setup.KubeConEU25/README.md | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/setup.KubeConEU25/README.md b/setup.KubeConEU25/README.md
index 3c97a53..b39340b 100644
--- a/setup.KubeConEU25/README.md
+++ b/setup.KubeConEU25/README.md
@@ -561,8 +561,8 @@ model.
 First, `alice` creates a persistent volume claim to cache the model weights on
-first invocation so that subsequent instantiation of the model will reuse the
-cached data.
+first invocation so that subsequent instantiations of the model will reuse the
+cached model weights.
 ```yaml
 kubectl apply --as alice -n blue -f- << EOF
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
   name: granite-3.2-8b-instruct
 spec:
 EOF
 ```
 The workload wraps a Kubernetes Job in an AppWrapper. The Job consists of one
-Pod with two containers using an upstream `vllm-openai` image. The `vllm`
-container runs the inference runtime. The `load-generator` container submits a
-random series of requests to the inference runtime and reports a number of
-metrics such as _Time to First Token_ (TTFT) and _Time per Output Token_ (TPOT).
+Pod with two containers. The `vllm` container runs the inference runtime using
+an upstream `vllm-openai` image. The `load-generator` container submits a random
+series of requests to the inference runtime and reports a number of metrics such
+as _Time to First Token_ (TTFT) and _Time per Output Token_ (TPOT).
 ```yaml
 kubectl apply --as alice -n blue -f- << EOF
 apiVersion: workload.codeflare.dev/v1beta2
@@ -599,16 +599,13 @@ spec:
     spec:
       template:
         metadata:
-          annotations:
-            kubectl.kubernetes.io/default-container: load-generator
          labels:
            app: batch-inference
        spec:
-          terminationGracePeriodSeconds: 0
          restartPolicy: Never
          containers:
          - name: vllm
-            image: quay.io/tardieu/vllm-openai:v0.7.3 # mirror of vllm/vllm-openai:v0.7.3
+            image: quay.io/tardieu/vllm-openai:v0.7.3 # vllm/vllm-openai:v0.7.3
            command:
            # serve model and wait for halt signal
            - sh
@@ -635,7 +632,7 @@ spec:
          - name: load-generator
            image: quay.io/tardieu/vllm-benchmarks:v0.7.3
            command:
-            # wait for vllm, submit batch of inference requests, send halt signal
+            # wait for vllm, submit batch of requests, send halt signal
            - sh
            - -c
            - |
@@ -666,6 +663,15 @@ The two containers are synchronized as follows:
 `load-generator` waits for `vllm` to be ready to accept requests and, upon
 completion of the batch, signals `vllm` to make it quit.
+Stream the logs of the `vllm` container with:
+```sh
+kubectl logs --as alice -n blue -l app=batch-inference -c vllm -f
+```
+Stream the logs of the `load-generator` container with:
+```sh
+kubectl logs --as alice -n blue -l app=batch-inference -c load-generator -f
+```
+
 ### Pre-Training with PyTorch
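
Beyond streaming logs, a quick way to confirm that `vllm` is actually ready to
accept requests is to query its OpenAI-compatible API from outside the Pod. The
following is only a sketch, not part of the patch: it assumes the server listens
on vLLM's default port 8000 inside the Pod and that `alice` is allowed to
port-forward; the local port choice is arbitrary.

```sh
# Sketch only: check that the vllm container is serving (assumes default port 8000).
POD=$(kubectl get pod --as alice -n blue -l app=batch-inference -o name | head -n 1)
kubectl port-forward --as alice -n blue "$POD" 8000:8000 &
PF_PID=$!
sleep 2
curl -s localhost:8000/v1/models   # lists the served model once vllm is up
kill "$PF_PID"
```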
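
The `# serve model and wait for halt signal` and `# wait for vllm, submit batch
of requests, send halt signal` comments describe how the two containers are
synchronized, but the hunks above elide the scripts themselves. Purely as an
illustration of the pattern, with the file path, port, model name, and benchmark
command all being assumptions rather than the example's actual code, the
containers could coordinate through a sentinel file on a volume they share:

```sh
# Illustrative sketch of the halt-signal pattern; /shared/halt, $MODEL, and the
# benchmark command are assumptions, not taken from the patch.

# vllm container: serve the model, then block until the halt file appears.
vllm serve "$MODEL" --port 8000 &
SERVER_PID=$!
while [ ! -f /shared/halt ]; do sleep 5; done
kill "$SERVER_PID"

# load-generator container: wait until the server answers, run the batch, signal halt.
until curl -s localhost:8000/v1/models > /dev/null; do sleep 5; done
python benchmark.py            # hypothetical load-generation step
touch /shared/halt
```

Since both containers run in the same Pod, the load generator can reach the
server on `localhost`, and a shared `emptyDir` volume is enough to carry the
halt signal.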