
When I disable fp16, I get the error: 'lr_this_step' referenced before assignment #426

Closed
LIZHICHAOUNICORN opened this issue Sep 21, 2020 · 3 comments



LIZHICHAOUNICORN commented Sep 21, 2020

ENV:
two GPU: Tesla P40
GPU Memory: 22919MiB
os: ubuntu 16.04
Driver Version: 440.33.01
CUDA Version: 10.2
Python 3.7.0

I ran DeepSpeedExamples/bing_bert: ds_train_bert_bsz32k_seq512.sh and modified some of the configuration:

--cf ${base_dir}/bert_base.json \
--deepspeed_config ${base_dir}/deepspeed_bsz32k_lamb_config_seq512.json \
#--load_training_checkpoint ${CHECKPOINT_BASE_PATH} \
#--load_checkpoint_id ${CHECKPOINT_EPOCH150_NAME} \
#&> ${JOB_NAME}.log

Note that in bert_base.json I changed fp16 from true to false:

{
  "train_batch_size": 32768,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Lamb",
    "params": {
      "lr": 2e-3,
      "weight_decay": 0.01,
      "bias_correction": false,
      "max_coeff": 0.3,
      "min_coeff": 0.01
    }
  },
  "gradient_clipping": 1.0,

  "wall_clock_breakdown": false,

  "fp16": {
    "enabled": false,
    "loss_scale": 0
  }
}

Then I got the error: 'lr_this_step' referenced before assignment.

Second: when I enable fp16, I see the warning OVERFLOW in the log.

So does DeepSpeed support disabling fp16, or did I do something wrong?

@tjruwase
Contributor

@LIZHICHAOUNICORN Thanks for using DeepSpeed. Please see responses to your questions below

  1. 'lr_this_step' referenced before assignment: This is a bug in fp32 mode. We will fix it asap (see the sketch after this list for the likely pattern).

  2. The log warning OVERFLOW: Overflow is normal at the beginning of training in fp16 mode, i.e. mixed-precision training. Training will automatically adjust the loss scale to stop the overflows.
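
For readers who hit the same error with fp16 disabled: the message is Python's classic unbound-local pattern, where a variable is assigned on only one branch but read unconditionally afterwards. The sketch below illustrates that pattern and the obvious fix; the names (warmup_lr, apply_lr) are hypothetical stand-ins, not the actual bing_bert training-loop code.

# A minimal, hypothetical sketch of the failure mode, not the actual bing_bert code.

def warmup_lr(step, base_lr=2e-3, warmup_steps=1000):
    # Toy linear-warmup schedule, used only for illustration.
    return base_lr * min(1.0, step / warmup_steps)

def apply_lr(param_groups, step, fp16):
    if fp16:
        # lr_this_step is only bound on the fp16 branch ...
        lr_this_step = warmup_lr(step)
        for group in param_groups:
            group["lr"] = lr_this_step
    # ... but it is read unconditionally afterwards (e.g. for logging), so with
    # fp16 disabled Python raises:
    #   UnboundLocalError: local variable 'lr_this_step' referenced before assignment
    return lr_this_step

def apply_lr_fixed(param_groups, step, fp16):
    # Fix: bind the value before (and regardless of) the fp16 branch.
    lr_this_step = warmup_lr(step)
    for group in param_groups:
        group["lr"] = lr_this_step
    return lr_this_step

if __name__ == "__main__":
    groups = [{"lr": 0.0}]
    print(apply_lr_fixed(groups, step=10, fp16=False))  # works in fp32 mode too

On the OVERFLOW warning: with "loss_scale": 0 in the fp16 section, DeepSpeed uses dynamic loss scaling, so an overflowing step is skipped and the scale is lowered, then grows back as training stabilizes; the warnings typically taper off early in training.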

@LIZHICHAOUNICORN
Author

Thanks for your responses.

@tjruwase
Contributor

Closing here after reopening this in DeepSpeedExamples, which is the appropriate repo.
