1 change: 1 addition & 0 deletions docs/source/Instruction/命令行参数.md
@@ -105,6 +105,7 @@
- 🔥output_dir: Defaults to None, in which case it is set to `output/<model_name>`.
- 🔥gradient_checkpointing: Whether to use gradient checkpointing; defaults to True.
- 🔥deepspeed: Defaults to None. Can be set to 'zero0', 'zero1', 'zero2', 'zero3', 'zero2_offload', or 'zero3_offload' to use one of the DeepSpeed configuration files built into ms-swift.
- zero_hpz_partition_size: Defaults to None. This parameter is a ZeRO++ feature (hpZ): model parameters are sharded within each node while data is sharded across nodes. If you encounter grad_norm NaN, try `--torch_dtype float16`.
- 🔥per_device_train_batch_size: Default is 1.
- 🔥per_device_eval_batch_size: Default is 1.
- weight_decay: Weight decay coefficient; default is 0.1.
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -109,6 +109,7 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with
- 🔥output_dir: Defaults to None, in which case it is set to `output/<model_name>`.
- 🔥gradient_checkpointing: Whether to use gradient checkpointing, default is True.
- 🔥deepspeed: Defaults to None. It can be set to 'zero0', 'zero1', 'zero2', 'zero3', 'zero2_offload', 'zero3_offload' to use the built-in deepspeed configuration file of ms-swift.
- zero_hpz_partition_size: Defaults to `None`. This parameter is a `ZeRO++` feature (hpZ): model parameters are sharded within each node while data is sharded across nodes (see the sketch after this diff). If you encounter grad_norm `NaN` issues, try `--torch_dtype float16`.
- 🔥per_device_train_batch_size: Default is 1.
- 🔥per_device_eval_batch_size: Default is 1.
- weight_decay: Weight decay coefficient, default value is 0.1.
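For intuition, here is a minimal sketch of what the new flag does, mirroring the `train_args.py` change further down in this PR. The starting dict and the value 8 are hypothetical examples (8 would typically be the number of GPUs per node), not ms-swift's actual parsed config:

```python
# Sketch: how --zero_hpz_partition_size is merged into the chosen
# DeepSpeed config (hypothetical values; mirrors this PR's logic).
deepspeed = {"zero_optimization": {"stage": 3}}  # assumed parsed built-in config
zero_hpz_partition_size = 8                      # hypothetical: GPUs per node

if zero_hpz_partition_size is not None:
    deepspeed["zero_optimization"]["zero_hpz_partition_size"] = zero_hpz_partition_size

print(deepspeed)
# {'zero_optimization': {'stage': 3, 'zero_hpz_partition_size': 8}}
```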
8 changes: 8 additions & 0 deletions swift/llm/argument/train_args.py
@@ -162,6 +162,9 @@ class TrainArguments(SwanlabArguments, TorchAccArguments, TunerArguments, Seq2Se
    temperature: float = 0.
    load_args: bool = False

    # ZeRO++
    zero_hpz_partition_size: Optional[int] = None

    def __post_init__(self) -> None:
        if self.resume_from_checkpoint:
            self.resume_from_checkpoint = to_abspath(self.resume_from_checkpoint, True)
@@ -237,6 +240,11 @@ def _init_deepspeed(self):
                break

        self.deepspeed = self.parse_to_dict(self.deepspeed)
        if self.zero_hpz_partition_size is not None:
            # Inject the ZeRO++ hpZ partition size into the parsed DeepSpeed config.
            assert 'zero_optimization' in self.deepspeed
            self.deepspeed['zero_optimization']['zero_hpz_partition_size'] = self.zero_hpz_partition_size
            logger.warning('If `zero_hpz_partition_size` (ZeRO++) causes grad_norm NaN, please'
                           ' try `--torch_dtype float16`')
        logger.info(f'Using deepspeed: {self.deepspeed}')

    def _init_liger(self):
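For background on what the injected key controls: ZeRO++ hpZ keeps a secondary copy of the parameters sharded only within each node, so the all-gathers needed for forward/backward stay on fast intra-node links. The sketch below is illustrative only (it assumes node-major rank ordering and is not DeepSpeed's actual implementation):

```python
# Illustrative only: how ranks would group into hpZ secondary partitions
# under a given partition size (not DeepSpeed's actual code).
world_size = 16          # hypothetical: 2 nodes x 8 GPUs
hpz_partition_size = 8   # typically set to the number of GPUs per node

hpz_groups = [list(range(start, start + hpz_partition_size))
              for start in range(0, world_size, hpz_partition_size)]
print(hpz_groups)  # [[0, 1, ..., 7], [8, 9, ..., 15]]
```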
2 changes: 2 additions & 0 deletions swift/llm/ds_config/zero3.json
@@ -26,6 +26,8 @@
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"zero_quantized_weights": false,
Collaborator (on `"zero_quantized_weights": false`): Why is this here?
"zero_quantized_gradients": false,
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
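Regarding the reviewer's question: the two new keys are the ZeRO++ quantization switches (qwZ quantizes weights for all-gather, qgZ quantizes gradients for reduce). Setting them to `false` keeps the stock zero3 behavior unchanged while making the keys available alongside the injected `zero_hpz_partition_size`. A hedged sketch of a fully enabled ZeRO++ section, written as a Python dict for illustration (values are hypothetical, not a recommendation from this PR):

```python
# Hedged sketch of a fully enabled ZeRO++ zero_optimization section
# (illustrative values; this PR deliberately keeps qwZ/qgZ disabled).
zero_optimization = {
    "stage": 3,
    "zero_quantized_weights": True,    # qwZ: quantized weight all-gather
    "zero_quantized_gradients": True,  # qgZ: quantized gradient reduce
    "zero_hpz_partition_size": 8,      # hpZ: secondary shard within a node
}
```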