
When I disable fp16, I get the error: 'lr_this_step' referenced before assignment #426

Closed
LIZHICHAOUNICORN opened this issue Sep 21, 2020 · 3 comments



LIZHICHAOUNICORN commented Sep 21, 2020

ENV:
two GPU: Tesla P40
GPU Memory: 22919MiB
os: ubuntu 16.04
Driver Version: 440.33.01
CUDA Version: 10.2
Python 3.7.0

I ran DeepSpeedExamples/bing_bert: ds_train_bert_bsz32k_seq512.sh and modified some of the configuration:

--cf ${base_dir}/bert_base.json \
--deepspeed_config ${base_dir}/deepspeed_bsz32k_lamb_config_seq512.json \
#--load_training_checkpoint ${CHECKPOINT_BASE_PATH} \
#--load_checkpoint_id ${CHECKPOINT_EPOCH150_NAME} \
#&> ${JOB_NAME}.log

Note that in bert_base.json I changed fp16 from true to false:

{
  "train_batch_size": 32768,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Lamb",
    "params": {
      "lr": 2e-3,
      "weight_decay": 0.01,
      "bias_correction": false,
      "max_coeff": 0.3,
      "min_coeff": 0.01
    }
  },
  "gradient_clipping": 1.0,

  "wall_clock_breakdown": false,

  "fp16": {
    "enabled": false,
    "loss_scale": 0
  }
}

Then I got the error: 'lr_this_step' referenced before assignment.

Second: when I enable fp16, I see the warning OVERFLOW in the log.

So does DeepSpeed support disabling fp16, or did I do something wrong?

@tjruwase
Contributor

@LIZHICHAOUNICORN Thanks for using DeepSpeed. Please see responses to your questions below

  1. 'lr_this_step' referenced before assignment: This is a bug in fp32 mode. We will fix it asap (see the sketch after this list for the likely pattern).

  2. The log warning OVERFLOW: Overflow is normal at the beginning of training in fp16 mode, i.e. mixed-precision training. Training will automatically adjust the loss scale to stop the overflows.
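
For readers who hit the same error with fp16 disabled: the message is Python's classic unbound-local pattern, where a variable is assigned on only one branch but read unconditionally afterwards. The sketch below illustrates that pattern and the obvious fix; the names (warmup_lr, apply_lr) are hypothetical stand-ins, not the actual bing_bert training-loop code.

# A minimal, hypothetical sketch of the failure mode, not the actual bing_bert code.

def warmup_lr(step, base_lr=2e-3, warmup_steps=1000):
    # Toy linear-warmup schedule, used only for illustration.
    return base_lr * min(1.0, step / warmup_steps)

def apply_lr(param_groups, step, fp16):
    if fp16:
        # lr_this_step is only bound on the fp16 branch ...
        lr_this_step = warmup_lr(step)
        for group in param_groups:
            group["lr"] = lr_this_step
    # ... but it is read unconditionally afterwards (e.g. for logging), so with
    # fp16 disabled Python raises:
    #   UnboundLocalError: local variable 'lr_this_step' referenced before assignment
    return lr_this_step

def apply_lr_fixed(param_groups, step, fp16):
    # Fix: bind the value before (and regardless of) the fp16 branch.
    lr_this_step = warmup_lr(step)
    for group in param_groups:
        group["lr"] = lr_this_step
    return lr_this_step

if __name__ == "__main__":
    groups = [{"lr": 0.0}]
    print(apply_lr_fixed(groups, step=10, fp16=False))  # works in fp32 mode too

On the OVERFLOW warning: with "loss_scale": 0 in the fp16 section, DeepSpeed uses dynamic loss scaling, so an overflowing step is skipped and the scale is lowered, then grows back as training stabilizes; the warnings typically taper off early in training.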

@LIZHICHAOUNICORN
Author

Thanks for your responses.

@tjruwase
Contributor

Closing here after reopening this in DeepSpeedExamples, which is the appropriate repo.
