[BUG] RuntimeError: still have inflight params #5648
Comments
Same issue here. I used ZeRO-3 with ZeRO Init and deepspeed 0.14.2. Here is the list of my pip versions: accelerate 0.31.0. I encountered the same error.
@sxhysj I got the same issue, have you solved it?
Still not solved yet.
@sxhysj I solved this issue using version 0.14.4.
Closing as this appears resolved. Please reopen if needed.
Describe the bug
Hello, can someone help? I am using v0.14.3, installed from the source tar.gz: https://github.com/melMass/DeepSpeed/releases
I use DeepSpeed ZeRO-3 to train a LLaMA Factory KTO task, and I hit this problem during the evaluation stage of training.
Launcher context
deepspeed --num_gpus 1 --master_port=9901 src/train.py .....
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
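For context, the "auto" placeholders in this config are only meaningful when the JSON is routed through the HuggingFace Trainer integration (which LLaMA Factory builds on); the Trainer fills them in from its own arguments before handing the config to DeepSpeed. A minimal sketch, assuming transformers is installed and the JSON above is saved under the hypothetical name ds_z3_config.json:

from transformers import TrainingArguments

# Hypothetical minimal setup: the Trainer's DeepSpeed integration replaces
# every "auto" entry (batch sizes, lr, betas, bucket sizes, ...) from these
# arguments before deepspeed.initialize() is called under the hood.
args = TrainingArguments(
    output_dir="out",               # hypothetical output directory
    deepspeed="ds_z3_config.json",  # assumed path to the JSON above
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=5e-5,
    bf16=True,
)

Passing this config to deepspeed.initialize() directly, without the Trainer, would be expected to fail on the unresolved "auto" values.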
Docker context
Are you using a specific docker image that you can share?
Additional context
RuntimeError: still have inflight params [
  {'id': 35, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])},
  {'id': 37, 'status': 'AVAILABLE', 'numel': 16384, 'ds_numel': 16384, 'shape': (1024, 16), 'ds_shape': (1024, 16), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([16384])},
  {'id': 39, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])},
  {'id': 41, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])},
  {'id': 45, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])},
  {'id': 43, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])},
  {'id': 47, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])},
  {'id': 51, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}
]
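This error comes from ZeRO-3's parameter coordinator: by the end of a step, every parameter it has gathered should have been released back to its partitioned state, and here several small requires_grad=False tensors (shapes consistent with LoRA adapter weights) are still marked AVAILABLE. Upgrading to 0.14.4 is what resolved this thread; a workaround sometimes suggested in related reports, sketched below assuming the same hypothetical ds_z3_config.json, is to disable prefetching and eager parameter reuse so no speculative gathers are left in flight when the module trace changes between training and evaluation:

import json

# Hedged mitigation sketch, not the confirmed fix for this thread
# (upgrading DeepSpeed was): turn off speculative parameter gathering.
with open("ds_z3_config.json") as f:  # assumed filename
    cfg = json.load(f)

zero = cfg["zero_optimization"]
zero["stage3_prefetch_bucket_size"] = 0  # do not prefetch upcoming params
zero["stage3_max_reuse_distance"] = 0    # release gathered params eagerly

with open("ds_z3_config.json", "w") as f:
    json.dump(cfg, f, indent=2)

Both keys already appear in the config above; setting them to 0 trades throughput for stricter release behavior.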