NPU doc update #9245
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
Restructures the Ascend NPU support guide to make the supported scope, recommended setup paths, and end-to-end usage (train → merge → infer → deploy → troubleshoot) easier to follow, and keeps the English/Chinese docs aligned.
Changes:
- Reorganizes the guide with “support scope”, “usage path”, and clearer environment preparation (container + local) plus an NPU availability check.
- Adds an end-to-end ModelScope quick start and expands training/LoRA merge-resume/inference/deployment/evaluation/release sections.
- Expands the NPU FAQ with practical troubleshooting guidance and syncs EN/ZH structure.
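The overview mentions an NPU availability check, but its exact commands are not shown on this page. The block below is a minimal sketch of such a check. It assumes `torch` and the Ascend adapter `torch_npu` are installed in the active environment; importing `torch_npu` is what registers `torch.npu.is_available()` and `torch.npu.device_count()`. The helper name `npu_device_count` and the `PYTHON` override are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of visible Ascend NPUs.
# Exits non-zero if torch_npu cannot be imported or no NPU is visible.
# Set PYTHON to choose the interpreter (defaults to python3).
npu_device_count() {
  "${PYTHON:-python3}" - <<'EOF'
import sys

try:
    import torch
    import torch_npu  # noqa: F401  (importing it registers torch.npu)
except ImportError as exc:
    sys.exit(f"torch_npu is not importable: {exc}")

if not torch.npu.is_available():
    sys.exit("torch_npu imported, but no NPU is visible to this process")

print(torch.npu.device_count())
EOF
}
```

A typical use is to run it once inside the container before training, for example `npu_device_count || echo "check driver, CANN and torch_npu before training"`. Other import failures, such as missing CANN libraries, still surface as a non-zero exit through Python's default exception handling.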
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| docs/source_en/BestPractices/NPU-support.md | Major restructure + quick start + expanded sections/FAQ for NPU workflows (EN). |
| docs/source/BestPractices/NPU-support.md | Mirrors the same restructure/content improvements to keep parity with EN (ZH). |

    | **DPO** | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
    | **PPO** | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
    | **PPO** | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
    If you need to publish NPU-trained checkpoints, merged models, or quantized models to ModelScope/HuggingFace, use the push capability in `swift export`. For complete arguments and examples, see [Export and Push](../Instruction/Export-and-push.md#push-models).
The link target ../Instruction/Export-and-push.md#push-models appears to be a broken anchor. The referenced doc uses the heading "Push Model" (anchor is likely #push-model), so this link won’t jump to the intended section. Please update the fragment to match the actual heading/anchor (or drop the fragment and link to the document without an anchor).
Suggested change: update the link target from `../Instruction/Export-and-push.md#push-models` to `../Instruction/Export-and-push.md#push-model`.
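A broken fragment like this can be caught before pushing by deriving the anchor from the target heading. The sketch below approximates GitHub-style heading slugs (lowercase, drop punctuation, spaces become hyphens). That the docs build generates the same slugs is an assumption, the rule is only applied to ASCII headings, and `heading_slug` / `anchor_exists` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Approximate a GitHub-style anchor for an ASCII Markdown heading text.
heading_slug() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9 _-]//g' -e 's/ /-/g'
}

# Succeed if some ATX heading ("#", "##", ...) in FILE slugs to ANCHOR.
# Note: "# comment" lines inside fenced code blocks are also scanned.
anchor_exists() {
  local file="$1" anchor="$2" line hashes text
  while IFS= read -r line; do
    case "$line" in
      '#'*' '*)
        hashes=${line%%[!#]*}        # the leading run of '#' characters
        text=${line#"$hashes"}       # heading text after the hashes
        text=${text# }               # drop the separating space
        if [ "$(heading_slug "$text")" = "$anchor" ]; then return 0; fi
        ;;
    esac
  done < "$file"
  return 1
}
```

For this link, the check would be `anchor_exists docs/source_en/Instruction/Export-and-push.md push-model`, run from the repository root.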
Code Review
This pull request significantly enhances the NPU support documentation in both Chinese and English. Key additions include a comprehensive support scope matrix, detailed instructions for container-based environment setup, a quick-start guide for a full LoRA training and deployment workflow, and an extensive FAQ section for troubleshooting common NPU issues. The review feedback identifies several critical improvements: correcting the arguments for the image building script, verifying potentially incorrect version numbers for vLLM packages, and ensuring SOC versions and environment variables are configured for maximum compatibility and safety.

    ```shell
    git clone https://github.com/modelscope/modelscope.git
    cd modelscope
    DOCKER_REGISTRY=ms-swift python docker/build_image.py \
The build_image.py script in the ModelScope repository uses the --repository argument to set the image name, rather than the DOCKER_REGISTRY environment variable. The environment variable will likely be ignored, so the default repository name will be used, which will cause subsequent steps (like docker run) to fail because the expected image name won't exist.
Suggested change: replace `DOCKER_REGISTRY=ms-swift python docker/build_image.py \` with `python docker/build_image.py --repository ms-swift \`.
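Because a wrong image name only surfaces later at `docker run`, it can help to fail fast right after the build. The sketch below is a minimal guard built on `docker image inspect`; `require_image` is a hypothetical helper. The repository name `ms-swift` comes from the review comment, but the tag that build_image.py assigns is not shown on this page, so pass the full `repository:tag` (without a tag, Docker looks up `:latest`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail fast if a local Docker image is missing.
# Usage: require_image <repository[:tag]>
require_image() {
  local image="$1"
  if ! docker image inspect "$image" >/dev/null 2>&1; then
    echo "error: image '$image' not found locally;" \
         "check the --repository value passed to build_image.py" >&2
    return 1
  fi
}
```

Chaining it as `require_image ms-swift:<tag> && docker run ...` (with whatever tag the build produced) turns a confusing "image not found" pull error into an explicit message at the step that caused it.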

    pip install vllm==0.14.0
    pip install vllm-ascend==0.14.0rc1
The version numbers 0.14.0 for vllm and 0.14.0rc1 for vllm-ascend appear to be incorrect or refer to non-public packages. Upstream vllm is currently at version 0.6.x, and vllm-ascend is typically at 0.1.x on PyPI. If these versions are hosted on a specific index, please provide the --index-url in the command; otherwise, users will encounter "No matching distribution found" errors.
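Whether these pins exist on the index users will actually hit can be settled by asking pip to resolve them without installing anything. The sketch below assumes pip 22.2 or newer (which added `pip install --dry-run`) and network access. `check_pins` is a hypothetical helper, and a custom index can be supplied through pip's standard `PIP_INDEX_URL` / `PIP_EXTRA_INDEX_URL` environment variables instead of editing the command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which requirement pins pip cannot resolve.
# --dry-run resolves without installing; --no-deps keeps the check to the pin itself.
check_pins() {
  local pin missing=0
  for pin in "$@"; do
    if python3 -m pip install --dry-run --no-deps --quiet "$pin" >/dev/null 2>&1; then
      echo "ok: $pin"
    else
      echo "unresolvable: $pin" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Applied to the lines in question, this would be `check_pins "vllm==0.14.0" "vllm-ascend==0.14.0rc1"`. Note that for source-only distributions pip may still need to build metadata during the dry run.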

    ```shell
    git clone https://github.com/modelscope/modelscope.git
    cd modelscope
    DOCKER_REGISTRY=ms-swift python docker/build_image.py \
The build_image.py script in the ModelScope repository uses the --repository argument to set the image name, rather than the DOCKER_REGISTRY environment variable. The environment variable will likely be ignored, so the default repository name will be used, which will cause subsequent steps (like docker run) to fail because the expected image name won't exist.
Suggested change: replace `DOCKER_REGISTRY=ms-swift python docker/build_image.py \` with `python docker/build_image.py --repository ms-swift \`.

    pip install vllm==0.14.0
    pip install vllm-ascend==0.14.0rc1
The version numbers 0.14.0 for vllm and 0.14.0rc1 for vllm-ascend appear to be incorrect or refer to non-public packages. Upstream vllm is currently at version 0.6.x, and vllm-ascend is typically at 0.1.x on PyPI. If these versions are hosted on a specific index, please provide the --index-url in the command; otherwise, users will encounter "No matching distribution found" errors.

    DOCKER_REGISTRY=ms-swift python docker/build_image.py \
        --image_type ascend \
        --python_version 3.11.11 \
        --soc_version ascend910b1 \
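
    Before running Megatron-SWIFT, `PYTHONPATH` and `MEGATRON_LM_PATH` must point to the same Megatron-LM source tree. Otherwise Python may still start, but it will actually import a different Megatron/MindSpeed combination, and later errors will look very much like model or argument problems.

    ```shell
    export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path>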

    DOCKER_REGISTRY=ms-swift python docker/build_image.py \
        --image_type ascend \
        --python_version 3.11.11 \
        --soc_version ascend910b1 \

    Before running Megatron-SWIFT, `PYTHONPATH` and `MEGATRON_LM_PATH` must point to the same Megatron-LM source tree. Otherwise Python may start successfully while importing a different Megatron/MindSpeed combination, which can make later errors look like model or argument issues.

    ```shell
    export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path>
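The rule that both variables must name the same Megatron-LM tree can be enforced mechanically by deriving both from one path and checking them before launch. The sketch below makes that assumption explicit; `set_megatron_paths` and `megatron_paths_consistent` are hypothetical helpers. The `${PYTHONPATH:+...}` expansion avoids producing a leading `:` when `PYTHONPATH` is unset, since an empty entry may be interpreted by Python as the current working directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point MEGATRON_LM_PATH and PYTHONPATH at one checkout.
set_megatron_paths() {
  local src
  src=$(cd "$1" && pwd -P) || return 1   # absolute, symlink-free path
  export MEGATRON_LM_PATH="$src"
  # Append without leaving an empty entry (leading ':') when PYTHONPATH is unset.
  export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$src"
}

# Succeed only if the first PYTHONPATH entry that contains a megatron/
# directory is the same directory as MEGATRON_LM_PATH.
megatron_paths_consistent() {
  local target
  [ -n "${MEGATRON_LM_PATH:-}" ] && [ -n "${PYTHONPATH:-}" ] || return 1
  target=$(cd "$MEGATRON_LM_PATH" 2>/dev/null && pwd -P) || return 1
  printf '%s\n' "$PYTHONPATH" | tr ':' '\n' | (
    while IFS= read -r entry; do
      [ -n "$entry" ] && [ -d "$entry/megatron" ] || continue
      resolved=$(cd "$entry" && pwd -P) || exit 1
      [ "$resolved" = "$target" ] && exit 0
      exit 1                               # the first megatron on the path decides
    done
    exit 1
  )
}
```

This only inspects `PYTHONPATH`. As a final check in the training environment, and assuming the checkout provides the `megatron.core` package, `python -c "import megatron.core as m; print(m.__file__)"` should print a path under `$MEGATRON_LM_PATH`.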