Merged
56 changes: 56 additions & 0 deletions deploy/helm/AGENTS.md
@@ -118,6 +118,62 @@ Rules specific to the `deploy/helm/` directory. General contribution guidelines

- **Do not use YAML anchors at the root level of `values.yaml`** (e.g., `_defaults: &defaults`). Helm treats unknown root-level keys as invalid and may emit warnings or errors. Instead, duplicate shared configuration explicitly for each component.

## Odin Presets (`moai-inference-preset`)

An Odin preset is a pair of Odin `InferenceServiceTemplate` resources — a **base template** (runtime base) and a **preset-specific template** — that together define how to deploy a Moreh vLLM pod. The base template defines how vLLM servers are launched and is shared across presets. The preset-specific template adds model-specific arguments, environment variables, resource requests, and disaggregation settings.

### Preset naming convention

Preset names follow the pattern:
`{image_tag}-{org_name}-{model_name}[-mtp][-prefill][-decode]-{accelerator_vendor}-{accelerator_model}-{parallelism}[-moe-{moe_parallelism}]`

- `{org_name}` and `{model_name}` follow Hugging Face Hub names in kebab-case (e.g., `meta-llama/Llama-3.3-70B-Instruct` → `meta-llama-llama-3.3-70b-instruct`).
- `-mtp` is appended after `{model_name}` if multi-token prediction is used.
- `-prefill` or `-decode` is appended for disaggregation modes, placed after `{model_name}` (or `-mtp`) and before `{accelerator_vendor}`.
- `{parallelism}` examples: `1`, `tp4`, `tp8`, `dp8`. Canonical order for combined strategies: `dp` → `pp` → `tp` → `cp`.
- For MoE models, `-moe-{moe_parallelism}` is appended (e.g., `-moe-ep8`, `-moe-tp8`).
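The composition rule above can be sketched as a small helper (illustrative only — the chart does not ship such a function, and the parameter names are ours):

```python
def preset_name(image_tag, org, model, accel_vendor, accel_model,
                parallelism, mtp=False, role=None, moe_parallelism=None):
    """Compose a preset name following the convention above (illustrative)."""
    parts = [image_tag, org, model]
    if mtp:
        parts.append("mtp")          # multi-token prediction suffix
    if role in ("prefill", "decode"):
        parts.append(role)           # disaggregation role, before the vendor
    parts += [accel_vendor, accel_model, parallelism]
    if moe_parallelism:
        parts += ["moe", moe_parallelism]
    return "-".join(parts)
```

For example, the decode preset renamed in this PR comes out as `preset_name("quickstart-vllm", "deepseek-ai", "deepseek-r1", "amd", "mi300x", "dp8", role="decode", moe_parallelism="ep8")` → `quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8-moe-ep8`.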

### Reserved labels

Odin presets use `mif.moreh.io/*` labels:

| Label key | Description | Example values |
| :-------------------------------- | :--------------------------- | :-------------------------------------- |
| `mif.moreh.io/template.type` | Template type | `runtime-base`, `preset` |
| `mif.moreh.io/model.org` | HF org name (kebab-case) | `meta-llama`, `deepseek-ai` |
| `mif.moreh.io/model.name` | HF model name (kebab-case) | `llama-3.3-70b-instruct`, `deepseek-r1` |
| `mif.moreh.io/model.mtp` | Multi-token prediction | `"true"` or unset |
| `mif.moreh.io/role` | Disaggregation mode | `e2e`, `prefill`, `decode` |
| `mif.moreh.io/accelerator.vendor` | GPU vendor | `amd` |
| `mif.moreh.io/accelerator.model` | GPU model | `mi250`, `mi300x`, `mi308x` |
| `mif.moreh.io/parallelism` | Parallelism mode | `tp4`, `dp8-moe-ep8` |
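Taken together, the labels on one preset template look roughly like this (a sketch assembled from the table above and the decode preset in this PR; real templates also merge shared labels via the `mif.preset.labels` helper):

```yaml
metadata:
  labels:
    mif.moreh.io/template.type: preset
    mif.moreh.io/model.org: deepseek-ai
    mif.moreh.io/model.name: deepseek-r1
    mif.moreh.io/role: decode
    mif.moreh.io/accelerator.vendor: amd
    mif.moreh.io/accelerator.model: mi300x
    mif.moreh.io/parallelism: dp8-moe-ep8
```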

### Responsibility boundaries

**Presets define** (model/GPU-specific, not user-configurable):
- vLLM arguments for parallelism within a single rank (`--tensor-parallel-size`, `--enable-expert-parallel`, etc.)
- Model-specific vLLM arguments (`--trust-remote-code`, `--max-model-len`, `--max-num-seqs`, `--kv-cache-type`, `--quantization`, `--gpu-memory-utilization`, etc.)
- Model-specific environment variables (`VLLM_ROCM_USE_AITER`, `VLLM_MOE_DP_CHUNK_SIZE`, `UCX_*`, `NCCL_*`, etc.)
- Resources (GPU count, RDMA NICs), tolerations, and nodeSelector

**Runtime bases define** (shared across presets):
- Execution command(s) and launch logic (for-loop for DP, cleanup traps)
- Cross-rank parallelism arguments (`--data-parallel-rank`, `--data-parallel-address`, `--data-parallel-rpc-port`)
- Disaggregation-specific environment variables (`VLLM_NIXL_SIDE_CHANNEL_HOST`, `VLLM_IS_DECODE_WORKER`)
- Shared memory settings, readiness probes
- Proxy sidecar configuration (for PD disaggregation)

**Users configure** (not defined by presets or runtime bases):
- Image repository and tag (with default provided)
- Volume mounts and model loading method (HF download vs. PV)
- Hugging Face token
- Number of replicas
- Logging arguments (`--no-enable-log-requests`, `--disable-uvicorn-access-log`, etc.)
- `--no-enable-prefix-caching`

**Product team templates configure** (must NOT be set in presets):
- `PYTHONHASHSEED`, `--prefix-caching-hash-algo`, `--kv-events-config`, `--block-size`
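As a purely hypothetical sketch of the user-facing slice of these boundaries (key names are illustrative — the chart's actual values schema is not shown in this PR):

```yaml
# Hypothetical user values — key names are illustrative, not the chart's schema.
image:
  repository: my-registry.example.com/moreh-vllm  # default provided by the chart
  tag: quickstart-vllm
replicas: 2
huggingface:
  tokenSecret: hf-token            # Hugging Face token for model download
extraArgs:
  - --no-enable-log-requests       # logging flags are user-configurable
  - --no-enable-prefix-caching
```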

### MIF Pod Label Keys

When filtering or labeling logs, metrics, or other signals by MIF-specific pod attributes, use these label keys:
@@ -1,7 +1,7 @@
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceServiceTemplate
metadata:
-name: vllm-deepseek-r1-decode-mi300x-dp8ep
+name: vllm-deepseek-r1-decode-mi300x-dp8-moe-ep8
namespace: {{ include "common.names.namespace" . }}
labels:
{{- include "mif.preset.labels" . | nindent 4 }}
@@ -10,7 +10,7 @@ metadata:
mif.moreh.io/role: decode
mif.moreh.io/accelerator.vendor: amd
mif.moreh.io/accelerator.model: mi300x
-mif.moreh.io/parallelism: dp8ep8
+mif.moreh.io/parallelism: dp8-moe-ep8
spec:
parallelism:
data: 8
@@ -1,7 +1,7 @@
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceServiceTemplate
metadata:
-name: vllm-deepseek-r1-prefill-mi300x-dp8ep
+name: vllm-deepseek-r1-prefill-mi300x-dp8-moe-ep8
namespace: {{ include "common.names.namespace" . }}
labels:
{{- include "mif.preset.labels" . | nindent 4 }}
@@ -10,7 +10,7 @@ metadata:
mif.moreh.io/role: prefill
mif.moreh.io/accelerator.vendor: amd
mif.moreh.io/accelerator.model: mi300x
-mif.moreh.io/parallelism: dp8ep8
+mif.moreh.io/parallelism: dp8-moe-ep8
spec:
parallelism:
data: 8
@@ -1,7 +1,7 @@
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceServiceTemplate
metadata:
-name: quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8ep8
+name: quickstart-vllm-deepseek-ai-deepseek-r1-decode-amd-mi300x-dp8-moe-ep8
namespace: {{ include "common.names.namespace" . }}
labels:
{{- include "mif.preset.labels" . | nindent 4 }}
@@ -10,7 +10,7 @@ metadata:
mif.moreh.io/role: decode
mif.moreh.io/accelerator.vendor: amd
mif.moreh.io/accelerator.model: mi300x
-mif.moreh.io/parallelism: dp8ep8
+mif.moreh.io/parallelism: dp8-moe-ep8
spec:
parallelism:
data: 8
@@ -1,7 +1,7 @@
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceServiceTemplate
metadata:
-name: quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8ep8
+name: quickstart-vllm-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8-moe-ep8
namespace: {{ include "common.names.namespace" . }}
labels:
{{- include "mif.preset.labels" . | nindent 4 }}
@@ -10,7 +10,7 @@ metadata:
mif.moreh.io/role: prefill
mif.moreh.io/accelerator.vendor: amd
mif.moreh.io/accelerator.model: mi300x
-mif.moreh.io/parallelism: dp8ep8
+mif.moreh.io/parallelism: dp8-moe-ep8
spec:
parallelism:
data: 8
@@ -34,7 +34,7 @@ spec:
}
trap cleanup SIGTERM SIGINT

-_dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }}
+_dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }}

for ((_index=0; _index<_dp_local_size; _index++)); do
/app/proxy \
@@ -101,8 +101,8 @@ spec:
eval "$ISVC_PRE_PROCESS_SCRIPT"
fi

-_dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }}
-_dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }}
+_dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }}
+_dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }}
_start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size ))

for ((_index=0; _index<_dp_local_size; _index++)); do
@@ -134,12 +134,12 @@ spec:
"${ISVC_MODEL_PATH}" \
--served-model-name "${ISVC_MODEL_NAME}" \
--port ${_dp_rank_port} \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
--data-parallel-size ${_dp_size} \
--data-parallel-rank $((_start_rank + _index)) \
--data-parallel-address $(LWS_LEADER_ADDRESS) \
---data-parallel-rpc-port {{ "{{" }} or .Spec.Parallelism.DataRPCPort 13345 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--data-parallel-rpc-port {{ "{{" }} deref .Spec "Parallelism" "DataRPCPort" | default 13345 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS) \
@@ -106,7 +106,7 @@ spec:
}")
fi

-_pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }}
+_pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }}

VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \
exec vllm serve \
Expand All @@ -117,8 +117,8 @@ spec:
--master-addr $(LWS_LEADER_ADDRESS) \
--nnodes ${_pp_size} \
--node-rank $(LWS_WORKER_INDEX) \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS)
@@ -226,7 +226,7 @@ spec:
}")
fi

-_pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }}
+_pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }}

VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \
exec vllm serve \
Expand All @@ -237,8 +237,8 @@ spec:
--master-addr $(LWS_LEADER_ADDRESS) \
--nnodes ${_pp_size} \
--node-rank $(LWS_WORKER_INDEX) \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS)
@@ -110,8 +110,8 @@ spec:
"${ISVC_MODEL_PATH}" \
--served-model-name "${ISVC_MODEL_NAME}" \
--port 8200 \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS)
@@ -50,8 +50,8 @@ spec:
eval "$ISVC_PRE_PROCESS_SCRIPT"
fi

-_dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }}
-_dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }}
+_dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }}
+_dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }}
_start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size ))

for ((_index=0; _index<_dp_local_size; _index++)); do
@@ -82,12 +82,12 @@ spec:
"${ISVC_MODEL_PATH}" \
--served-model-name "${ISVC_MODEL_NAME}" \
--port ${_dp_rank_port} \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
--data-parallel-size ${_dp_size} \
--data-parallel-rank $((_start_rank + _index)) \
--data-parallel-address $(LWS_LEADER_ADDRESS) \
---data-parallel-rpc-port {{ "{{" }} or .Spec.Parallelism.DataRPCPort 13345 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--data-parallel-rpc-port {{ "{{" }} deref .Spec "Parallelism" "DataRPCPort" | default 13345 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS) \
@@ -69,7 +69,7 @@ spec:
}")
fi

-_pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }}
+_pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }}

VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \
exec vllm serve \
Expand All @@ -80,8 +80,8 @@ spec:
--master-addr $(LWS_LEADER_ADDRESS) \
--nnodes ${_pp_size} \
--node-rank $(LWS_WORKER_INDEX) \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS)
@@ -189,7 +189,7 @@ spec:
}")
fi

-_pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }}
+_pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }}

VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \
exec vllm serve \
Expand All @@ -200,8 +200,8 @@ spec:
--master-addr $(LWS_LEADER_ADDRESS) \
--nnodes ${_pp_size} \
--node-rank $(LWS_WORKER_INDEX) \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS)
@@ -51,8 +51,8 @@ spec:
eval "$ISVC_PRE_PROCESS_SCRIPT"
fi

-_dp_size={{ "{{" }} or .Spec.Parallelism.Data 1 {{ "}}" }}
-_dp_local_size={{ "{{" }} or .Spec.Parallelism.DataLocal 1 {{ "}}" }}
+_dp_size={{ "{{" }} deref .Spec "Parallelism" "Data" | default 1 {{ "}}" }}
+_dp_local_size={{ "{{" }} deref .Spec "Parallelism" "DataLocal" | default (deref .Spec "Parallelism" "Data" | default 1) {{ "}}" }}
_start_rank=$(( ${LWS_WORKER_INDEX:-0} * _dp_local_size ))

for ((_index=0; _index<_dp_local_size; _index++)); do
@@ -84,12 +84,12 @@ spec:
"${ISVC_MODEL_PATH}" \
--served-model-name "${ISVC_MODEL_NAME}" \
--port ${_dp_rank_port} \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
--data-parallel-size ${_dp_size} \
--data-parallel-rank $((_start_rank + _index)) \
--data-parallel-address $(LWS_LEADER_ADDRESS) \
---data-parallel-rpc-port {{ "{{" }} or .Spec.Parallelism.DataRPCPort 13345 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--data-parallel-rpc-port {{ "{{" }} deref .Spec "Parallelism" "DataRPCPort" | default 13345 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS) \
@@ -70,7 +70,7 @@ spec:
}")
fi

-_pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }}
+_pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }}

VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \
exec vllm serve \
Expand All @@ -81,8 +81,8 @@ spec:
--master-addr $(LWS_LEADER_ADDRESS) \
--nnodes ${_pp_size} \
--node-rank $(LWS_WORKER_INDEX) \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS)
@@ -190,7 +190,7 @@ spec:
}")
fi

-_pp_size={{ "{{" }} or .Spec.Parallelism.Pipeline 1 {{ "}}" }}
+_pp_size={{ "{{" }} deref .Spec "Parallelism" "Pipeline" | default 1 {{ "}}" }}

VLLM_NIXL_SIDE_CHANNEL_PORT=30020 \
exec vllm serve \
Expand All @@ -201,8 +201,8 @@ spec:
--master-addr $(LWS_LEADER_ADDRESS) \
--nnodes ${_pp_size} \
--node-rank $(LWS_WORKER_INDEX) \
---tensor-parallel-size {{ "{{" }} or .Spec.Parallelism.Tensor 1 {{ "}}" }} \
-{{ "{{" }} if .Spec.Parallelism.Expert {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
+--tensor-parallel-size {{ "{{" }} deref .Spec "Parallelism" "Tensor" | default 1 {{ "}}" }} \
+{{ "{{" }} if deref .Spec "Parallelism" "Expert" {{ "}}" }}--enable-expert-parallel{{ "{{" }} end {{ "}}" }} \
${_config_arg} \
"${_kv_events_args[@]}" \
$(ISVC_EXTRA_ARGS)