FP16 with ZeRO and Gradient Accumulation in the Configuration File #1190
-
Hi, is there a specific way to define fp16 with ZeRO and gradient accumulation in the configuration file? Does defining ZeRO in the configuration file automatically handle mixed precision and use fp16 for training, so that you would not even need to define an fp16 dictionary? I receive an error when I define both in the configuration file. For example, this is my current configuration:

```python
from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy

fp16 = dict(
    mode = AMP_TYPE.TORCH,
    init_scale = 2.**16,
    growth_factor = 2.0,
    backoff_factor = 0.5,
    growth_interval = 2000,
    enabled = True
)

zero = dict(
    model_config = dict(
        shard_strategy = TensorShardStrategy(),
        tensor_placement_policy = 'cpu',
        reuse_fp16_shard = False
    )
)

gradient_accumulation = 4
clip_grad_norm = 1.0
```

Also, as a side note, how should we cite ColossalAI? Is there a preferred method?

Thank you,
Enrico
-
Hi Enrico,

ZeRO converts the master weights to fp16 during training and converts them back to fp32 for the optimizer step, so you do not need to set an fp16 configuration when ZeRO is used in your training.
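In that case, a minimal sketch of the configuration, assuming the same config-file layout as in the question with the fp16 dictionary simply dropped, could look like this:

```python
from colossalai.zero.shard_utils import TensorShardStrategy

# ZeRO handles the fp16 casting itself, so no separate fp16 dict is defined.
zero = dict(
    model_config = dict(
        shard_strategy = TensorShardStrategy(),
        tensor_placement_policy = 'cpu',   # keep sharded tensors on CPU
        reuse_fp16_shard = False
    )
)

# Gradient accumulation and clipping can stay as before.
gradient_accumulation = 4
clip_grad_norm = 1.0
```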