Title: Conflict between precision settings and MegatronHalfPrecisionPlugin in MegatronGPT training
Describe the bug
When attempting to continue training the MegatronGPT model, I encountered a conflict between precision=bf16-mixed and the MegatronHalfPrecisionPlugin. This results in a ValueError indicating that both precision=bf16-mixed and the MegatronHalfPrecisionPlugin were received and only one should be chosen.
Steps/Code to reproduce bug
Set up the environment as described below.
Use the following configuration and code snippet to initiate training.
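The original configuration is not reproduced here, so the snippet below is only a minimal illustrative sketch of how the conflict arises: an explicit `precision` value and NeMo's `MegatronHalfPrecisionPlugin` are both handed to the Lightning `Trainer`. The plugin's constructor arguments (`precision`, `device`, `scaler`) are assumed to match how NeMo's trainer builder instantiates it; adjust for your NeMo version if they differ.

```python
# Minimal sketch, NOT the reporter's original script: reproduces the conflict
# between an explicit `precision` flag and a precision plugin instance.
import pytorch_lightning as pl
from nemo.collections.nlp.parts.nlp_overrides import (
    MegatronHalfPrecisionPlugin,
    NLPDDPStrategy,
)

# Assumed constructor arguments, mirroring NeMo's trainer builder usage.
plugins = [MegatronHalfPrecisionPlugin(precision="bf16-mixed", device="cuda", scaler=None)]

# Passing `precision=` here while a precision plugin is already in `plugins`
# is what Lightning's accelerator connector rejects with the ValueError below.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16-mixed",  # conflicts with MegatronHalfPrecisionPlugin
    plugins=plugins,
    strategy=NLPDDPStrategy(),
)
```

Error message: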
[NeMo W 2024-06-09 00:24:29 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
[NeMo W 2024-06-09 00:24:29 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:563: `precision=bf16` is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
Error executing job with overrides: []
Traceback (most recent call last):
File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py", line 167, in main
trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
return fn(self, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 401, in __init__
self._accelerator_connector = _AcceleratorConnector(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 134, in __init__
self._check_config_and_set_final_flags(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 271, in _check_config_and_set_final_flags
raise ValueError(
ValueError: Received both `precision=bf16-mixed` and `plugins=<nemo.collections.nlp.parts.nlp_overrides.MegatronHalfPrecisionPlugin object at 0x7fed4a8568c0>`. Choose one.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Expected behavior
The training should proceed without conflicts between precision settings and plugins.
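One way to avoid the conflict is to let the precision plugin own the precision setting and drop `precision` from the keyword arguments passed to the `Trainer`. The sketch below is a possible workaround, not official NeMo code; `cfg` is assumed to be the Hydra/OmegaConf config object used by the training script.

```python
# Sketch of a possible workaround (assumption, not official NeMo code):
# remove `precision` from the trainer kwargs when a precision plugin is
# supplied, so Lightning receives exactly one source of precision config.
from omegaconf import OmegaConf
from pytorch_lightning import Trainer


def build_trainer(cfg, plugins, strategy, callbacks):
    trainer_kwargs = OmegaConf.to_container(cfg.trainer, resolve=True)
    if plugins:
        # A precision plugin (e.g. MegatronHalfPrecisionPlugin) is assumed to
        # be in `plugins`, so it takes over precision handling entirely.
        trainer_kwargs.pop("precision", None)
    return Trainer(plugins=plugins, strategy=strategy, callbacks=callbacks, **trainer_kwargs)
```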
Environment overview (please complete the following information)
Environment location: Docker
Environment details
OS version: Ubuntu 20.04
PyTorch version: 1.13.1
Python version: 3.10
Additional context
GPU model: NVIDIA A100
NVIDIA Driver version: 470.57.02
CUDA version: 11.4
Hi, the continue-pretraining script was consolidated into the pretraining script megatron_gpt_pretraining.py, so you may want to try that instead. You should no longer hit this issue.
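For reference, a typical launch of the consolidated script looks like the following. The overrides shown are illustrative assumptions, not the reporter's exact command; how you point it at an existing run or checkpoint depends on your NeMo version and config.

```bash
# Illustrative launch (assumed overrides, not the original command).
python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    trainer.devices=1 \
    trainer.precision=bf16-mixed \
    exp_manager.resume_if_exists=True
```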