MAF-19231: feat(preset): add new InferenceServiceTemplates #45
Merged
Conversation
…eta-llama-3.2-1b-instruct across multiple AMD configurations
- Introduced templates for vllm-meta-llama-3.2-1b-instruct with support for AMD MI250 and MI300x GPUs.
- Configured environment variables and resource requests/limits for optimal performance.
- Added support for different roles (consumer, producer) in the extra arguments for each template.
…rt-' prefix for consistency across vllm-meta-llama-3.2-1b-instruct templates for AMD MI250 and MI300x configurations.
Contributor
Pull request overview
This PR adds new InferenceServiceTemplate configurations for the Llama-3.2-1B-Instruct model to support disaggregated prefill/decode architectures and removes unnecessary vLLM configuration options to rely on defaults. According to the description, removing the --max-model-len option allows the model to use its default value of 131072, and removing --max-num-batched-tokens uses the default of max(max_model_len, 2048).
Changes:
- Added 5 new InferenceServiceTemplate files for different configurations (prefill/decode/combined variants for mi300x and mi250 GPUs)
- Removed explicit vLLM configuration options (--quantization, --max-model-len, --max-num-batched-tokens, --no-enable-prefix-caching) from the existing mi250 template
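The prefill/decode split described above hinges on the kv role each template passes to vLLM. A minimal sketch of what one of the new prefill templates might look like, assuming a generic InferenceServiceTemplate layout (the apiVersion, resource keys, and extraArgs field name are assumptions, not copied from the PR; the --kv-transfer-config flag with kv_role values kv_producer/kv_consumer/kv_both is the standard vLLM disaggregated-prefill mechanism):

```yaml
# Hypothetical shape of vllm-meta-llama-llama-3.2-1b-instruct-prefill-amd-mi300x-tp2.helm.yaml
# kv_producer marks this instance as the prefill side of the disaggregated
# pair; the decode template would use kv_consumer, the combined one kv_both.
apiVersion: serving.example/v1          # assumed API group
kind: InferenceServiceTemplate
metadata:
  name: vllm-meta-llama-llama-3.2-1b-instruct-prefill-amd-mi300x-tp2
spec:
  model: meta-llama/Llama-3.2-1B-Instruct
  resources:
    limits:
      amd.com/gpu: "2"                  # tp2 → two MI300x GPUs
  extraArgs:
    - --tensor-parallel-size=2
    - --kv-transfer-config={"kv_connector":"...","kv_role":"kv_producer"}
```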
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| vllm-meta-llama-llama-3.2-1b-instruct-prefill-amd-mi300x-tp2.helm.yaml | New prefill template for mi300x with kv_producer role |
| vllm-meta-llama-llama-3.2-1b-instruct-prefill-amd-mi250-tp2.helm.yaml | New prefill template for mi250 with kv_producer role |
| vllm-meta-llama-llama-3.2-1b-instruct-decode-amd-mi300x-tp2.helm.yaml | New decode template for mi300x with kv_consumer role |
| vllm-meta-llama-llama-3.2-1b-instruct-decode-amd-mi250-tp2.helm.yaml | New decode template for mi250 with kv_consumer role |
| vllm-meta-llama-llama-3.2-1b-instruct-amd-mi300x-tp2.helm.yaml | New combined template for mi300x with kv_both role |
| vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2.yaml | Removed explicit vLLM configuration options to use defaults |
…meta-llama-llama-3.2-1b-instruct presets across AMD MI250 and MI300x configurations.
hhk7734
requested changes
Feb 2, 2026
Member
hhk7734
left a comment
--max-model-len 16384
--max-num-batched-tokens 8192
Let's go with these settings.
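Applied to the templates, the reviewer's suggestion pins both options explicitly instead of relying on vLLM defaults (the extraArgs field name is an assumption about the template layout):

```yaml
extraArgs:
  - --max-model-len=16384           # cap context well below the model's 131072 default
  - --max-num-batched-tokens=8192   # default would otherwise be max(max_model_len, 2048)
```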
…s to include new arguments for maximum model length and batched tokens across AMD MI250 and MI300x configurations.
…ama-3.2-1b-instruct Helm templates for AMD MI250 and MI300x configurations.
Author
hhk7734
previously approved these changes
Feb 3, 2026
…lama-llama-3.2-1b-instruct
hhk7734
approved these changes
Feb 3, 2026
Removing the --max-model-len option appears to leave the default at 131072 for the meta-llama/Llama-3.2-1B-Instruct model, and removing the --max-num-batched-tokens option appears to leave the default at max(max_model_len, 2048).