diff --git a/deploy/helm/AGENTS.md b/deploy/helm/AGENTS.md index e0c386fc..53992bbe 100644 --- a/deploy/helm/AGENTS.md +++ b/deploy/helm/AGENTS.md @@ -118,6 +118,62 @@ Rules specific to the `deploy/helm/` directory. General contribution guidelines - **Do not use YAML anchors at the root level of `values.yaml`** (e.g., `_defaults: &defaults`). Helm treats unknown root-level keys as invalid and may emit warnings or errors. Instead, duplicate shared configuration explicitly for each component. +## Odin Presets (`moai-inference-preset`) + +An Odin preset is a pair of Odin `InferenceServiceTemplate` resources — a **base template** (runtime base) and a **preset-specific template** — that together define how to deploy a Moreh vLLM pod. The base template defines how vLLM servers are launched and is shared across presets. The preset-specific template adds model-specific arguments, environment variables, resource requests, and disaggregation settings. + +### Preset naming convention + +Preset names follow the pattern: +`{image_tag}-{org_name}-{model_name}[-mtp][-prefill][-decode]-{accelerator_vendor}-{accelerator_model}-{parallelism}[-moe-{moe_parallelism}]` + +- `{org_name}` and `{model_name}` follow Hugging Face Hub names in kebab-case (e.g., `meta-llama/Llama-3.3-70B-Instruct` → `meta-llama-llama-3.3-70b-instruct`). +- `-mtp` is appended after `{model_name}` if multi-token prediction is used. +- `-prefill` or `-decode` is appended for disaggregation modes, placed after `{model_name}` (or `-mtp`) and before `{accelerator_vendor}`. +- `{parallelism}` examples: `1`, `tp4`, `tp8`, `dp8`. Canonical order for combined strategies: `dp` → `pp` → `tp` → `cp`. +- For MoE models, `-moe-{moe_parallelism}` is appended (e.g., `-moe-ep8`, `-moe-tp8`). + +### Reserved labels + +Odin presets use `mif.moreh.io/*` labels: + +| Label key | Description | Example values | +| :-------------------------------- | :--------------------------- | :-------------------------------------- | +| `mif.moreh.io/template.type` | Template type | `runtime-base`, `preset` | +| `mif.moreh.io/model.org` | HF org name (kebab-case) | `meta-llama`, `deepseek-ai` | +| `mif.moreh.io/model.name` | HF model name (kebab-case) | `llama-3.3-70b-instruct`, `deepseek-r1` | +| `mif.moreh.io/model.mtp` | Multi-token prediction | `"true"` or unset | +| `mif.moreh.io/role` | Disaggregation mode | `e2e`, `prefill`, `decode` | +| `mif.moreh.io/accelerator.vendor` | GPU vendor | `amd` | +| `mif.moreh.io/accelerator.model` | GPU model | `mi250`, `mi300x`, `mi308x` | +| `mif.moreh.io/parallelism` | Parallelism mode | `tp4`, `dp8-moe-ep8` | + +### Responsibility boundaries + +**Presets define** (model/GPU-specific, not user-configurable): +- vLLM arguments for parallelism within a single rank (`--tensor-parallel-size`, `--enable-expert-parallel`, etc.) +- Model-specific vLLM arguments (`--trust-remote-code`, `--max-model-len`, `--max-num-seqs`, `--kv-cache-type`, `--quantization`, `--gpu-memory-utilization`, etc.) +- Model-specific environment variables (`VLLM_ROCM_USE_AITER`, `VLLM_MOE_DP_CHUNK_SIZE`, `UCX_*`, `NCCL_*`, etc.) 
+- Resources (GPU count, RDMA NICs), tolerations, and nodeSelector + +**Runtime bases define** (shared across presets): +- Execution command(s) and launch logic (for-loop for DP, cleanup traps) +- Cross-rank parallelism arguments (`--data-parallel-rank`, `--data-parallel-address`, `--data-parallel-rpc-port`) +- Disaggregation-specific environment variables (`VLLM_NIXL_SIDE_CHANNEL_HOST`, `VLLM_IS_DECODE_WORKER`) +- Shared memory settings, readiness probes +- Proxy sidecar configuration (for PD disaggregation) + +**Users configure** (not defined by presets or runtime bases): +- Image repository and tag (with default provided) +- Volume mounts and model loading method (HF download vs. PV) +- Hugging Face token +- Number of replicas +- Logging arguments (`--no-enable-log-requests`, `--disable-uvicorn-access-log`, etc.) +- `--no-enable-prefix-caching` + +**Product team templates configure** (must NOT be set in presets): +- `PYTHONHASHSEED`, `--prefix-caching-hash-algo`, `--kv-events-config`, `--block-size` + ### MIF Pod Label Keys When filtering or labeling logs, metrics, or other signals by MIF-specific pod attributes, use these label keys: diff --git a/deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-decode-mi300x-dp8ep.helm.yaml b/deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-decode-mi300x-dp8-moe-ep8.helm.yaml similarity index 96% rename from deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-decode-mi300x-dp8ep.helm.yaml rename to deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-decode-mi300x-dp8-moe-ep8.helm.yaml index 8310cb6c..1009e982 100644 --- a/deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-decode-mi300x-dp8ep.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-decode-mi300x-dp8-moe-ep8.helm.yaml @@ -1,7 +1,7 @@ apiVersion: odin.moreh.io/v1alpha1 kind: InferenceServiceTemplate metadata: - name: vllm-deepseek-r1-decode-mi300x-dp8ep + name: vllm-deepseek-r1-decode-mi300x-dp8-moe-ep8 namespace: {{ include "common.names.namespace" . }} labels: {{- include "mif.preset.labels" . | nindent 4 }} @@ -10,7 +10,7 @@ metadata: mif.moreh.io/role: decode mif.moreh.io/accelerator.vendor: amd mif.moreh.io/accelerator.model: mi300x - mif.moreh.io/parallelism: dp8ep8 + mif.moreh.io/parallelism: dp8-moe-ep8 spec: parallelism: data: 8 diff --git a/deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-prefill-mi300x-dp8ep.helm.yaml b/deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-prefill-mi300x-dp8-moe-ep8.helm.yaml similarity index 96% rename from deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-prefill-mi300x-dp8ep.helm.yaml rename to deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-prefill-mi300x-dp8-moe-ep8.helm.yaml index 2ba8d3db..b6c032e6 100644 --- a/deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-prefill-mi300x-dp8ep.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/presets/deepseek-r1/vllm-deepseek-r1-prefill-mi300x-dp8-moe-ep8.helm.yaml @@ -1,7 +1,7 @@ apiVersion: odin.moreh.io/v1alpha1 kind: InferenceServiceTemplate metadata: - name: vllm-deepseek-r1-prefill-mi300x-dp8ep + name: vllm-deepseek-r1-prefill-mi300x-dp8-moe-ep8 namespace: {{ include "common.names.namespace" . }} labels: {{- include "mif.preset.labels" . 
| nindent 4 }} @@ -10,7 +10,7 @@ metadata: mif.moreh.io/role: prefill mif.moreh.io/accelerator.vendor: amd mif.moreh.io/accelerator.model: mi300x - mif.moreh.io/parallelism: dp8ep8 + mif.moreh.io/parallelism: dp8-moe-ep8 spec: parallelism: data: 8 diff --git a/deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8ep8.helm.yaml b/deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8-moe-ep8.helm.yaml similarity index 98% rename from deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8ep8.helm.yaml rename to deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8-moe-ep8.helm.yaml index ddb5d9e6..8582ea29 100644 --- a/deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8ep8.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8-moe-ep8.helm.yaml @@ -1,7 +1,7 @@ apiVersion: odin.moreh.io/v1alpha1 kind: InferenceServiceTemplate metadata: - name: quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8ep8 + name: quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8-moe-ep8 namespace: {{ include "common.names.namespace" . }} labels: {{- include "mif.preset.labels" . | nindent 4 }} @@ -10,7 +10,7 @@ metadata: mif.moreh.io/role: decode mif.moreh.io/accelerator.vendor: amd mif.moreh.io/accelerator.model: mi300x - mif.moreh.io/parallelism: dp8ep8 + mif.moreh.io/parallelism: dp8-moe-ep8 spec: parallelism: data: 8 diff --git a/deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8ep8.helm.yaml b/deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8-moe-ep8.helm.yaml similarity index 98% rename from deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8ep8.helm.yaml rename to deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8-moe-ep8.helm.yaml index db174338..a78069ab 100644 --- a/deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8ep8.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8-moe-ep8.helm.yaml @@ -1,7 +1,7 @@ apiVersion: odin.moreh.io/v1alpha1 kind: InferenceServiceTemplate metadata: - name: quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8ep8 + name: quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8-moe-ep8 namespace: {{ include "common.names.namespace" . }} labels: {{- include "mif.preset.labels" . 
| nindent 4 }} @@ -10,7 +10,7 @@ metadata: mif.moreh.io/role: prefill mif.moreh.io/accelerator.vendor: amd mif.moreh.io/accelerator.model: mi300x - mif.moreh.io/parallelism: dp8ep8 + mif.moreh.io/parallelism: dp8-moe-ep8 spec: parallelism: data: 8 diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode-dp.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode-dp.helm.yaml index e286799c..255a3e1e 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode-dp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode-dp.helm.yaml @@ -34,7 +34,7 @@ spec: } trap cleanup SIGTERM SIGINT - _dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }} + _dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }} for ((_index=0; _index<_dp_local_size; _index++)); do /app/proxy \ @@ -101,8 +101,8 @@ spec: eval "$ISVC_PRE_PROCESS_SCRIPT" fi - _dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }} - _dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }} + _dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }} + _dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }} _start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size )) for ((_index=0; _index<_dp_local_size; _index++)); do @@ -134,12 +134,12 @@ spec: "${ISVC_MODEL_PATH}" \ --served-model-name "${ISVC_MODEL_NAME}" \ --port ${_dp_rank_port} \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ --data-parallel-size ${_dp_size} \ --data-parallel-rank $((_start_rank + _index)) \ --data-parallel-address $(LWS_LEADER_ADDRESS) \ - --data-parallel-rpc-port {{ "{{" }} or .Spec.Parallelism.DataRPCPort 13345 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --data-parallel-rpc-port {{ "{{" }} deref .Spec "Parallelism" "DataRPCPort" | default 13345 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) \ diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode-pp.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode-pp.helm.yaml index a08a9b8e..f0e6cd56 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode-pp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode-pp.helm.yaml @@ -106,7 +106,7 @@ spec: }") fi - _pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }} + _pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }} VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \ exec vllm serve \ @@ -117,8 +117,8 @@ spec: --master-addr $(LWS_LEADER_ADDRESS) \ --nnodes ${_pp_size} \ --node-rank $(LWS_WORKER_INDEX) \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ 
"${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) @@ -226,7 +226,7 @@ spec: }") fi - _pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }} + _pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }} VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \ exec vllm serve \ @@ -237,8 +237,8 @@ spec: --master-addr $(LWS_LEADER_ADDRESS) \ --nnodes ${_pp_size} \ --node-rank $(LWS_WORKER_INDEX) \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode.helm.yaml index 553e4d0d..38a2e093 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-decode.helm.yaml @@ -110,8 +110,8 @@ spec: "${ISVC_MODEL_PATH}" \ --served-model-name "${ISVC_MODEL_NAME}" \ --port 8200 \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-dp.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-dp.helm.yaml index e34bb12c..86ef1d6b 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-dp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-dp.helm.yaml @@ -50,8 +50,8 @@ spec: eval "$ISVC_PRE_PROCESS_SCRIPT" fi - _dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }} - _dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }} + _dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }} + _dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }} _start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size )) for ((_index=0; _index<_dp_local_size; _index++)); do @@ -82,12 +82,12 @@ spec: "${ISVC_MODEL_PATH}" \ --served-model-name "${ISVC_MODEL_NAME}" \ --port ${_dp_rank_port} \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ --data-parallel-size ${_dp_size} \ --data-parallel-rank $((_start_rank + _index)) \ --data-parallel-address $(LWS_LEADER_ADDRESS) \ - --data-parallel-rpc-port {{ "{{" }} or .Spec.Parallelism.DataRPCPort 13345 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --data-parallel-rpc-port {{ "{{" }} deref .Spec "Parallelism" "DataRPCPort" | default 13345 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ 
$(ISVC_EXTRA_ARGS) \ diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-pp.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-pp.helm.yaml index 92193420..c9696160 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-pp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-pp.helm.yaml @@ -69,7 +69,7 @@ spec: }") fi - _pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }} + _pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }} VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \ exec vllm serve \ @@ -80,8 +80,8 @@ spec: --master-addr $(LWS_LEADER_ADDRESS) \ --nnodes ${_pp_size} \ --node-rank $(LWS_WORKER_INDEX) \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) @@ -189,7 +189,7 @@ spec: }") fi - _pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }} + _pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }} VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \ exec vllm serve \ @@ -200,8 +200,8 @@ spec: --master-addr $(LWS_LEADER_ADDRESS) \ --nnodes ${_pp_size} \ --node-rank $(LWS_WORKER_INDEX) \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill-dp.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill-dp.helm.yaml index 82e19355..dd6db0bb 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill-dp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill-dp.helm.yaml @@ -51,8 +51,8 @@ spec: eval "$ISVC_PRE_PROCESS_SCRIPT" fi - _dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }} - _dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }} + _dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }} + _dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }} _start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size )) for ((_index=0; _index<_dp_local_size; _index++)); do @@ -84,12 +84,12 @@ spec: "${ISVC_MODEL_PATH}" \ --served-model-name "${ISVC_MODEL_NAME}" \ --port ${_dp_rank_port} \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ --data-parallel-size ${_dp_size} \ --data-parallel-rank $((_start_rank + _index)) \ --data-parallel-address $(LWS_LEADER_ADDRESS) \ - --data-parallel-rpc-port {{ "{{" }} or .Spec.Parallelism.DataRPCPort 13345 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --data-parallel-rpc-port 
{{ "{{" }} deref .Spec "Parallelism" "DataRPCPort" | default 13345 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) \ diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill-pp.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill-pp.helm.yaml index 7c36d424..8b87f1f9 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill-pp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill-pp.helm.yaml @@ -70,7 +70,7 @@ spec: }") fi - _pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }} + _pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }} VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \ exec vllm serve \ @@ -81,8 +81,8 @@ spec: --master-addr $(LWS_LEADER_ADDRESS) \ --nnodes ${_pp_size} \ --node-rank $(LWS_WORKER_INDEX) \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) @@ -190,7 +190,7 @@ spec: }") fi - _pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }} + _pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }} VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \ exec vllm serve \ @@ -201,8 +201,8 @@ spec: --master-addr $(LWS_LEADER_ADDRESS) \ --nnodes ${_pp_size} \ --node-rank $(LWS_WORKER_INDEX) \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill.helm.yaml index ed0feb36..55bcd892 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm-prefill.helm.yaml @@ -74,8 +74,8 @@ spec: "${ISVC_MODEL_PATH}" \ --served-model-name "${ISVC_MODEL_NAME}" \ --port 8000 \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) diff --git a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm.helm.yaml b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm.helm.yaml index 5be6118f..6b0f0a3e 100644 --- a/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/runtime-bases/vllm.helm.yaml 
@@ -69,8 +69,8 @@ spec: "${ISVC_MODEL_PATH}" \ --served-model-name "${ISVC_MODEL_NAME}" \ --port 8000 \ - --tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \ - {{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ + --tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \ + {{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \ ${_config_arg} \ "${_kv_events_args[@]}" \ $(ISVC_EXTRA_ARGS) diff --git a/deploy/helm/moai-inference-preset/templates/utils/sim-decode-dp.helm.yaml b/deploy/helm/moai-inference-preset/templates/utils/sim-decode-dp.helm.yaml index 2b669617..4d31438a 100644 --- a/deploy/helm/moai-inference-preset/templates/utils/sim-decode-dp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/utils/sim-decode-dp.helm.yaml @@ -27,12 +27,25 @@ spec: - | set -ex - exec /app/proxy \ - --port 8000 \ - --decoder-ip $(POD_IP) \ - --decoder-port 8200 \ - --data-parallel-size {{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }} \ - $(ISVC_EXTRA_ARGS) + cleanup() { + echo "Received SIGTERM, killing child processes..." + pkill -P $$ + wait + } + trap cleanup SIGTERM SIGINT + + _dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }} + + for ((_index=0; _index<_dp_local_size; _index++)); do + /app/proxy \ + --port $((8000 + _index)) \ + --decoder-ip $(POD_IP) \ + --decoder-port $((8200 + _index)) \ + $(ISVC_EXTRA_ARGS) \ + & + done + + wait env: - name: ISVC_EXTRA_ARGS value: >- @@ -75,8 +88,8 @@ spec: eval "$ISVC_PRE_PROCESS_SCRIPT" fi - _dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }} - _dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }} + _dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }} + _dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }} _start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size )) for ((_index=0; _index<_dp_local_size; _index++)); do diff --git a/deploy/helm/moai-inference-preset/templates/utils/sim-dp.helm.yaml b/deploy/helm/moai-inference-preset/templates/utils/sim-dp.helm.yaml index ab6fb2a6..4477d4c1 100644 --- a/deploy/helm/moai-inference-preset/templates/utils/sim-dp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/utils/sim-dp.helm.yaml @@ -35,8 +35,8 @@ spec: eval "$ISVC_PRE_PROCESS_SCRIPT" fi - _dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }} - _dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }} + _dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }} + _dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }} _start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size )) for ((_index=0; _index<_dp_local_size; _index++)); do diff --git a/deploy/helm/moai-inference-preset/templates/utils/sim-prefill-dp.helm.yaml b/deploy/helm/moai-inference-preset/templates/utils/sim-prefill-dp.helm.yaml index 7b3f1af0..3e483c53 100644 --- a/deploy/helm/moai-inference-preset/templates/utils/sim-prefill-dp.helm.yaml +++ b/deploy/helm/moai-inference-preset/templates/utils/sim-prefill-dp.helm.yaml @@ -36,8 +36,8 @@ spec: eval "$ISVC_PRE_PROCESS_SCRIPT" fi - _dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }} - _dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ 
"}}" }} + _dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }} + _dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }} _start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size )) for ((_index=0; _index<_dp_local_size; _index++)); do diff --git a/skills/guide-odin/SKILL.md b/skills/guide-odin/SKILL.md index f3ff7a0f..0c5ab0c8 100644 --- a/skills/guide-odin/SKILL.md +++ b/skills/guide-odin/SKILL.md @@ -20,6 +20,7 @@ Odin is the Kubernetes operator at the core of the MoAI Inference Framework (MIF Odin introduces a **template composition system** (`InferenceServiceTemplate`) that allows reusable configurations — runtime-bases and model-specific presets — to be layered and merged using Kubernetes strategic merge patch semantics. This enables a separation of concerns: platform teams maintain runtime-bases, model teams maintain presets, and end users compose them with minimal configuration. **This skill covers:** + - InferenceService and InferenceServiceTemplate CRDs - Template composition (templateRefs, merging, variable substitution) - Parallelism configuration (tensor, pipeline, data, expert) @@ -33,6 +34,7 @@ Odin introduces a **template composition system** (`InferenceServiceTemplate`) t **Out of scope:** Heimdall plugin configuration (see `guide-heimdall`), vLLM engine internals, Gateway controller setup, cluster-level infrastructure. **Key codebase paths:** + - `website/docs/reference/odin/api-reference.mdx` — API field reference - `website/docs/features/preset.mdx` — template composition guide - `website/docs/getting-started/quickstart.mdx` — end-to-end deployment @@ -44,13 +46,13 @@ Odin introduces a **template composition system** (`InferenceServiceTemplate`) t When you need field-level details beyond this guide (e.g., exact CRD validation rules, all supported env variables, template variable list), consult the reference docs below. Prefer the local file path when filesystem access is available (faster, complete). Use the URL as a fallback when filesystem access is unavailable. 
-| Topic | Local path | URL (fallback) | -| --- | --- | --- | -| API field reference (CRD spec) | `website/docs/reference/odin/api-reference.mdx` | https://test-docs.moreh.io/dev/reference/odin/api-reference/ | -| Template composition & presets | `website/docs/features/preset.mdx` | https://test-docs.moreh.io/dev/features/preset/ | -| End-to-end quickstart | `website/docs/getting-started/quickstart.mdx` | https://test-docs.moreh.io/dev/getting-started/quickstart/ | -| PV-based model management | `website/docs/operations/hf-model-management-with-pv.mdx` | https://test-docs.moreh.io/dev/operations/hf-model-management-with-pv/ | -| Monitoring & metrics | `website/docs/operations/monitoring/metrics/index.mdx` | https://test-docs.moreh.io/dev/operations/monitoring/metrics/ | +| Topic | Local path | URL (fallback) | +| ------------------------------ | --------------------------------------------------------- | ---------------------------------------------------------------------- | +| API field reference (CRD spec) | `website/docs/reference/odin/api-reference.mdx` | https://test-docs.moreh.io/dev/reference/odin/api-reference/ | +| Template composition & presets | `website/docs/features/preset.mdx` | https://test-docs.moreh.io/dev/features/preset/ | +| End-to-end quickstart | `website/docs/getting-started/quickstart.mdx` | https://test-docs.moreh.io/dev/getting-started/quickstart/ | +| PV-based model management | `website/docs/operations/hf-model-management-with-pv.mdx` | https://test-docs.moreh.io/dev/operations/hf-model-management-with-pv/ | +| Monitoring & metrics | `website/docs/operations/monitoring/metrics/index.mdx` | https://test-docs.moreh.io/dev/operations/monitoring/metrics/ | --- @@ -73,10 +75,10 @@ flowchart TD ### CRDs -| CRD | API Group | Short Names | Purpose | -| --- | --- | --- | --- | -| `InferenceService` | `odin.moreh.io/v1alpha1` | `is`, `isvc` | User-facing resource for deploying inference workloads | -| `InferenceServiceTemplate` | `odin.moreh.io/v1alpha1` | `ist`, `isvctmpl` | Reusable template for composable configurations | +| CRD | API Group | Short Names | Purpose | +| -------------------------- | ------------------------ | ----------------- | ------------------------------------------------------ | +| `InferenceService` | `odin.moreh.io/v1alpha1` | `is`, `isvc` | User-facing resource for deploying inference workloads | +| `InferenceServiceTemplate` | `odin.moreh.io/v1alpha1` | `ist`, `isvctmpl` | Reusable template for composable configurations | --- @@ -88,26 +90,26 @@ kind: InferenceService metadata: name: spec: - replicas: # default: 1 - inferencePoolRefs: # max 1 entry + replicas: # default: 1 + inferencePoolRefs: # max 1 entry - name: - templateRefs: # merged in order, later overrides earlier + templateRefs: # merged in order, later overrides earlier - name: - rolloutStrategy: # optional + rolloutStrategy: # optional type: rollingUpdate: maxUnavailable: maxSurge: - partition: # LeaderWorkerSet only - parallelism: # optional - tensor: # min: 1 - pipeline: # min: 1; mutually exclusive with data - data: # min: 1; mutually exclusive with pipeline - dataLocal: # min: 1; must be set with data - dataRPCPort: # 1-65535 - expert: # enable expert parallelism (MoE models) - template: # for Deployment or LWS leader - workerTemplate: # for LWS workers; triggers LWS mode + partition: # LeaderWorkerSet only + parallelism: # optional + tensor: # min: 1 + pipeline: # min: 1; mutually exclusive with data + data: # min: 1; mutually exclusive with pipeline + dataLocal: # min: 1; 
must be set with data
+      dataRPCPort: # 1-65535
+      expert: # enable expert parallelism (MoE models)
+  template: # for Deployment or LWS leader
+  workerTemplate: # for LWS workers; triggers LWS mode
 ```

 ### Key validation rules (enforced by webhook)

@@ -129,14 +131,15 @@ spec:
 kubectl get inferenceservice -n <namespace>
 ```

-| Column | Source | Description |
-| --- | --- | --- |
-| READY | `.status.conditions[?(@.type=='Ready')].status` | `True`, `False`, or `Unknown` |
-| DESIRED | `.spec.replicas` | Desired replica count |
-| UP-TO-DATE | `.status.updatedReplicas` | Replicas with current spec |
-| AGE | `.metadata.creationTimestamp` | Time since creation |
+| Column     | Source                                          | Description                   |
+| ---------- | ----------------------------------------------- | ----------------------------- |
+| READY      | `.status.conditions[?(@.type=='Ready')].status` | `True`, `False`, or `Unknown` |
+| DESIRED    | `.spec.replicas`                                | Desired replica count         |
+| UP-TO-DATE | `.status.updatedReplicas`                       | Replicas with current spec    |
+| AGE        | `.metadata.creationTimestamp`                   | Time since creation           |

 Wait for readiness:
+
 ```shell
 kubectl wait inferenceservice <name> -n <namespace> --for=condition=Ready --timeout=15m
 ```

@@ -153,9 +156,9 @@ kind: InferenceServiceTemplate
 metadata:
   name:
 spec:
-  parallelism: # optional
-  template: # optional
-  workerTemplate: # optional
+  parallelism:    # optional
+  template:       # optional
+  workerTemplate: # optional
 ```

 ### Merging rules

@@ -171,35 +174,56 @@ flowchart LR
 ```

 **Merge behavior:**
+
 - Lists with strategic merge keys (e.g., containers by `name`, env vars by `name`) merge by key, not replace
 - Scalar fields in later templates override earlier ones
 - Unset fields in overlays do not erase base values

 ### Variable substitution

-Templates support Go template syntax with [Sprig functions](http://masterminds.github.io/sprig/). Variables are resolved at reconciliation time.
+Templates support Go template syntax with [Sprig functions](http://masterminds.github.io/sprig/) and Odin-provided functions. Variables are resolved at reconciliation time.

 Available variables:

-| Variable | Type | Description |
-| --- | --- | --- |
-| `.Name` | string | InferenceService name |
-| `.Namespace` | string | InferenceService namespace |
-| `.Labels` | map | InferenceService labels |
-| `.Spec.Parallelism.Tensor` | int | Tensor parallelism value |
-| `.Spec.Parallelism.Pipeline` | int | Pipeline parallelism value |
-| `.Spec.Parallelism.Data` | int | Data parallelism value |
-| `.Spec.Parallelism.DataLocal` | int | Data local parallelism value |
-| `.Spec.Parallelism.DataRPCPort` | int | Data RPC port value |
-| `.Spec.Parallelism.Expert` | bool | Expert parallelism flag |
-
-**Example:** A runtime-base uses `{{ .Spec.Parallelism.Tensor }}` in container args:
+| Variable                        | Type   | Description                  |
+| ------------------------------- | ------ | ---------------------------- |
+| `.Name`                         | string | InferenceService name        |
+| `.Namespace`                    | string | InferenceService namespace   |
+| `.Labels`                       | map    | InferenceService labels      |
+| `.Spec.Parallelism.Tensor`      | int    | Tensor parallelism value     |
+| `.Spec.Parallelism.Pipeline`    | int    | Pipeline parallelism value   |
+| `.Spec.Parallelism.Data`        | int    | Data parallelism value       |
+| `.Spec.Parallelism.DataLocal`   | int    | Data local parallelism value |
+| `.Spec.Parallelism.DataRPCPort` | int    | Data RPC port value          |
+| `.Spec.Parallelism.Expert`      | bool   | Expert parallelism flag      |
+
+#### Nil-safe field access with `deref`
+
+`.Spec.Parallelism` is a pointer and may be nil.
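+
+A hypothetical stand-alone sketch (plain Go `text/template` with reduced stand-in types, not Odin's actual code) shows what happens when a template reaches through that nil pointer:
+
+```go
+package main
+
+import (
+    "fmt"
+    "os"
+    "text/template"
+)
+
+// Reduced stand-ins for the InferenceService spec fields (assumed shapes).
+type Parallelism struct{ Tensor int }
+type Spec struct{ Parallelism *Parallelism }
+
+func main() {
+    tmpl := template.Must(template.New("args").
+        Parse(`--tensor-parallel-size {{ or .Spec.Parallelism.Tensor 1 }}`))
+
+    // Parallelism is nil: the chain .Spec.Parallelism.Tensor is evaluated
+    // before `or` ever runs, so Execute returns an error along the lines of
+    // "nil pointer evaluating *main.Parallelism.Tensor".
+    err := tmpl.Execute(os.Stdout, map[string]any{"Spec": Spec{}})
+    fmt.Println(err)
+}
+```
+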
+Accessing `.Spec.Parallelism.Tensor` directly fails when Parallelism is nil because Go templates evaluate the full field chain before passing the result to functions like `or`.
+
+Use the `deref` function for nil-safe traversal:
+
+```
+{{ deref .Spec "Parallelism" "Tensor" | default 1 }}
+```
+
+`deref` takes a root object followed by field names as strings. It walks the struct chain and returns nil if any intermediate pointer is nil, allowing `default` to provide the fallback value.
+
+**Example:** A runtime-base uses `deref` with `default` in container args:
+
 ```yaml
 args:
   - --tensor-parallel-size
-  - "{{ .Spec.Parallelism.Tensor }}"
+  - '{{ deref .Spec "Parallelism" "Tensor" | default 1 }}'
+```
+
+When the InferenceService sets `parallelism.tensor: 4`, this renders as `--tensor-parallel-size 4`. When `parallelism` is not set, it defaults to `1`.
+
+For boolean flags, use `deref` directly in conditionals:
+
+```yaml
+{{ if deref .Spec "Parallelism" "Expert" }}--enable-expert-parallel{{ end }}
 ```
-When the InferenceService sets `parallelism.tensor: 4`, this renders as `--tensor-parallel-size 4`.

 ### Template lookup order

@@ -220,11 +244,11 @@ The presence and configuration of parallelism determines the workload type and p

 ### Decision matrix

-| `workerTemplate` | Parallelism | Workload type | Pod topology |
-| --- | --- | --- | --- |
-| Not set | None or tensor only | **Deployment** | `replicas` independent pods |
-| Set | `data` | **LeaderWorkerSet** | `replicas` groups, each with `data/dataLocal` workers |
-| Set | `pipeline` | **LeaderWorkerSet** | `replicas` groups, each with `pipeline` workers |
+| `workerTemplate` | Parallelism         | Workload type       | Pod topology                                          |
+| ---------------- | ------------------- | ------------------- | ----------------------------------------------------- |
+| Not set          | None or tensor only | **Deployment**      | `replicas` independent pods                           |
+| Set              | `data`              | **LeaderWorkerSet** | `replicas` groups, each with `data/dataLocal` workers |
+| Set              | `pipeline`          | **LeaderWorkerSet** | `replicas` groups, each with `pipeline` workers       |

 ### Tensor parallelism

@@ -232,7 +256,7 @@ Tensor parallelism shards the model across GPUs **within a single pod**. It does

 ```yaml
 parallelism:
-  tensor: 4 # Each pod uses 4 GPUs
+  tensor: 4  # Each pod uses 4 GPUs
 ```

 Tensor parallelism is configured in the runtime-base via the `--tensor-parallel-size` vLLM argument (using template variable substitution).

@@ -240,6 +264,7 @@

 ### Data parallelism (LeaderWorkerSet)

 Data parallelism distributes batches across multiple pods (workers). Odin creates a **LeaderWorkerSet** where:
+
 - **Size** = `data / dataLocal` (number of worker pods per group)
 - **Leader** uses `template` pod spec
 - **Workers** use `workerTemplate` pod spec

@@ -247,19 +272,21 @@

 ```yaml
 parallelism:
-  data: 8 # Total data parallel size
-  dataLocal: 4 # Local parallelism per worker
-  # → 8/4 = 2 workers per group
+  data: 8  # Total data parallel size
+  dataLocal: 4  # Local parallelism per worker
+  # → 8/4 = 2 workers per group
 ```

 ### Pipeline parallelism (LeaderWorkerSet)

 Pipeline parallelism splits the model across pipeline stages, each on a separate pod.
+ - **Size** = `pipeline` (number of worker pods per group) ```yaml parallelism: - pipeline: 4 # 4 pipeline stages = 4 workers per group + pipeline: 4 # 4 pipeline stages = 4 workers per group ``` ### Expert parallelism @@ -276,6 +303,7 @@ parallelism: ### Mutual exclusivity **Pipeline and data parallelism cannot be used simultaneously.** This is enforced by the validating webhook: + - Set `pipeline` for pipeline-parallel deployments - Set `data` (+ `dataLocal`) for data-parallel deployments - Never set both @@ -288,25 +316,25 @@ parallelism: Runtime-bases define the container startup logic, parallelism wiring, and pod structure. They are installed in the `mif` namespace by the `moai-inference-preset` Helm chart. -| Runtime-base | Workload type | `template` / `workerTemplate` | Use case | -| --- | --- | --- | --- | -| `vllm` | Deployment | `template` | Simple aggregate (no PD disaggregation) | -| `vllm-dp` | LeaderWorkerSet | `workerTemplate` | Data-parallel aggregate | -| `vllm-pp` | LeaderWorkerSet | `workerTemplate` | Pipeline-parallel aggregate | -| `vllm-decode` | Deployment | `template` | Decode-only (PD disaggregation) | -| `vllm-decode-dp` | LeaderWorkerSet | `workerTemplate` | Decode-only with data parallelism | -| `vllm-decode-pp` | LeaderWorkerSet | `workerTemplate` | Decode-only with pipeline parallelism | -| `vllm-prefill` | Deployment | `template` | Prefill-only (PD disaggregation) | -| `vllm-prefill-dp` | LeaderWorkerSet | `workerTemplate` | Prefill-only with data parallelism | -| `vllm-prefill-pp` | LeaderWorkerSet | `workerTemplate` | Prefill-only with pipeline parallelism | +| Runtime-base | Workload type | `template` / `workerTemplate` | Use case | +| ----------------- | --------------- | ----------------------------- | --------------------------------------- | +| `vllm` | Deployment | `template` | Simple aggregate (no PD disaggregation) | +| `vllm-dp` | LeaderWorkerSet | `workerTemplate` | Data-parallel aggregate | +| `vllm-pp` | LeaderWorkerSet | `workerTemplate` | Pipeline-parallel aggregate | +| `vllm-decode` | Deployment | `template` | Decode-only (PD disaggregation) | +| `vllm-decode-dp` | LeaderWorkerSet | `workerTemplate` | Decode-only with data parallelism | +| `vllm-decode-pp` | LeaderWorkerSet | `workerTemplate` | Decode-only with pipeline parallelism | +| `vllm-prefill` | Deployment | `template` | Prefill-only (PD disaggregation) | +| `vllm-prefill-dp` | LeaderWorkerSet | `workerTemplate` | Prefill-only with data parallelism | +| `vllm-prefill-pp` | LeaderWorkerSet | `workerTemplate` | Prefill-only with pipeline parallelism | ### Choosing `template` vs. `workerTemplate` This is the most common source of misconfiguration. The rule is simple: **override the same field the runtime-base uses**. If the runtime-base puts its pod spec in `workerTemplate`, your overrides (env vars, resources, volumes) must also go in `spec.workerTemplate`. Putting them in `spec.template` instead creates a separate, unused field — the merge has no effect and overrides are silently ignored. 
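+
+For example, with a `-dp` runtime-base, user overrides belong in `spec.workerTemplate`. A hedged sketch (the preset name and token value are placeholders):
+
+```yaml
+apiVersion: odin.moreh.io/v1alpha1
+kind: InferenceService
+metadata:
+  name: my-model # hypothetical name
+spec:
+  templateRefs:
+    - name: vllm-dp # runtime-base that uses workerTemplate
+    - name: <model-preset>
+  parallelism:
+    data: 8
+    dataLocal: 4
+  workerTemplate: # same field the runtime-base uses
+    spec:
+      containers:
+        - name: main
+          env:
+            - name: HF_TOKEN
+              value: <your-hf-token>
+```
+
+With the Deployment-style bases (`vllm`, `vllm-decode`, `vllm-prefill`), the same overrides go in `spec.template` instead, as the table below summarizes.
+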
-| Runtime-base suffix | Override in | Workload type | -| --- | --- | --- | -| (none): `vllm`, `vllm-decode`, `vllm-prefill` | `spec.template` | Deployment | +| Runtime-base suffix | Override in | Workload type | +| ----------------------------------------------------- | --------------------- | --------------- | +| (none): `vllm`, `vllm-decode`, `vllm-prefill` | `spec.template` | Deployment | | `-dp`: `vllm-dp`, `vllm-decode-dp`, `vllm-prefill-dp` | `spec.workerTemplate` | LeaderWorkerSet | | `-pp`: `vllm-pp`, `vllm-decode-pp`, `vllm-prefill-pp` | `spec.workerTemplate` | LeaderWorkerSet | @@ -324,17 +352,17 @@ kubectl get inferenceservicetemplate -n mif -l mif.moreh.io/template.type=preset ### Commonly overridden environment variables -| Variable | Purpose | Default | -| --- | --- | --- | -| `ISVC_MODEL_NAME` | HuggingFace model ID or name | (set by preset) | -| `ISVC_MODEL_PATH` | Local model path or HF ID | defaults to `$ISVC_MODEL_NAME` | -| `ISVC_EXTRA_ARGS` | Additional vLLM engine arguments | (set by preset) | -| `ISVC_PRE_PROCESS_SCRIPT` | Script to run before engine starts | (none) | -| `ISVC_USE_KV_EVENTS` | Publish KV cache events to Heimdall via ZMQ (for `precise-prefix-cache-scorer`) | `false` | -| `ISVC_PRESET_PATH` | Path to preset configuration file sourced at startup | (empty) | -| `HF_TOKEN` | HuggingFace API token | (user must provide) | -| `HF_HOME` | HuggingFace cache directory | `/mnt/models` (for PV usage) | -| `HF_HUB_OFFLINE` | Disable HF Hub network access | `1` (for PV usage) | +| Variable | Purpose | Default | +| ------------------------- | ------------------------------------------------------------------------------- | ------------------------------ | +| `ISVC_MODEL_NAME` | HuggingFace model ID or name | (set by preset) | +| `ISVC_MODEL_PATH` | Local model path or HF ID | defaults to `$ISVC_MODEL_NAME` | +| `ISVC_EXTRA_ARGS` | Additional vLLM engine arguments | (set by preset) | +| `ISVC_PRE_PROCESS_SCRIPT` | Script to run before engine starts | (none) | +| `ISVC_USE_KV_EVENTS` | Publish KV cache events to Heimdall via ZMQ (for `precise-prefix-cache-scorer`) | `false` | +| `ISVC_PRESET_PATH` | Path to preset configuration file sourced at startup | (empty) | +| `HF_TOKEN` | HuggingFace API token | (user must provide) | +| `HF_HOME` | HuggingFace cache directory | `/mnt/models` (for PV usage) | +| `HF_HUB_OFFLINE` | Disable HF Hub network access | `1` (for PV usage) | --- @@ -389,7 +417,7 @@ spec: parallelism: data: 2 tensor: 1 - workerTemplate: # workerTemplate for *-dp runtime-base + workerTemplate: # workerTemplate for *-dp runtime-base spec: containers: - name: main @@ -440,7 +468,7 @@ The Odin operator is deployed as part of MIF via the `moai-inference-framework` # In moai-inference-framework values.yaml odin: lws: - enabled: false # Set true if LWS not already installed + enabled: false # Set true if LWS not already installed replicas: 1 extraArgs: - --zap-encoder=json @@ -496,12 +524,12 @@ For data-parallel decode (`vllm-decode-dp`), the proxy receives `--data-parallel ### Port mapping summary -| Runtime-base | User-facing port | Backend port | Proxy | -| --- | --- | --- | --- | -| `vllm-decode`, `vllm-decode-pp` | 8000 (proxy) | 8200 (vLLM) | Yes | -| `vllm-decode-dp` | 8000-8007 (proxy) | 8200-8207 (vLLM) | Yes | -| `vllm-prefill`, `vllm-prefill-pp` | 8000 (vLLM direct) | — | No | -| `vllm-prefill-dp` | 8000-8007 (vLLM direct) | — | No | +| Runtime-base | User-facing port | Backend port | Proxy | +| --------------------------------- | 
----------------------- | ---------------- | ----- |
+| `vllm-decode`, `vllm-decode-pp`   | 8000 (proxy)            | 8200 (vLLM)      | Yes   |
+| `vllm-decode-dp`                  | 8000-8007 (proxy)       | 8200-8207 (vLLM) | Yes   |
+| `vllm-prefill`, `vllm-prefill-pp` | 8000 (vLLM direct)      | —                | No    |
+| `vllm-prefill-dp`                 | 8000-8007 (vLLM direct) | —                | No    |

 ### KV cache events

@@ -519,9 +547,9 @@ When enabled, each pod publishes KV cache events via ZMQ to `tcp://…`

diff --git a/website/docs/getting-started/quickstart.mdx b/website/docs/getting-started/quickstart.mdx

 Replace **`<your-hf-token>` with your Hugging Face token that has accepted the model license**.
-```yaml title="vllm-llama3-1b-instruct-tp2.yaml" {20}
+```yaml title="vllm-llama3-1b-instruct-tp2.yaml" {18}
 apiVersion: odin.moreh.io/v1alpha1
 kind: InferenceService
 metadata:
@@ -331,9 +331,7 @@ spec:
     - name: heimdall
   templateRefs:
     - name: vllm
-    - name: vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2
-  parallelism:
-    tensor: 2
+    - name: quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2
   template:
     spec:
       containers:
@@ -345,7 +343,7 @@

 - The `replicas` field specifies the number of vLLM pods.
 - The `inferencePoolRefs` field specifies the Heimdall's InferencePool where this vLLM pod will register to.
-- The `templateRefs` field specifies the Odin Template resources; `vllm` is a runtime base, and `vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2` is a model-specific template.
+- The `templateRefs` field specifies the Odin Template resources; `vllm` is a runtime base, and `quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2` is a model-specific template.

 After that, you can deploy the Odin InferenceService by running the following command: