
Conflict between bf16-mixed Precision Setting and MegatronHalfPrecisionPlugin in MegatronGPT Training #9429

Closed
moutasemalakkad opened this issue Jun 10, 2024 · 3 comments

Describe the bug

When attempting to continue training a MegatronGPT model with megatron_gpt_continue_training.py, I hit a conflict between precision=bf16-mixed and the MegatronHalfPrecisionPlugin: constructing the Trainer raises a ValueError stating that both were received and only one should be chosen.
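
For context, a minimal sketch of the same check outside NeMo (assuming PyTorch Lightning 2.x; MixedPrecision stands in here for NeMo's MegatronHalfPrecisionPlugin, which is also a precision plugin):

import pytorch_lightning as pl
from pytorch_lightning.plugins.precision import MixedPrecision

# Passing a precision plugin *and* an explicit precision flag triggers the
# accelerator-connector check seen in the traceback below:
# "Received both `precision=...` and `plugins=...`. Choose one."
plugin = MixedPrecision(precision="bf16-mixed", device="cuda")
trainer = pl.Trainer(devices=1, precision="bf16-mixed", plugins=[plugin])

In the NeMo script, the plugin list already contains the Megatron precision plugin, so any precision value left in cfg.trainer hits this same check.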

Steps/Code to reproduce bug

  1. Set up the environment as described below.
  2. Use the following configuration and code snippet to initiate training.

Configuration:

DATA='{train:[1.0,training_data_indexed/train_text_document], validation:[training_data_indexed/val_text_document], test:[training_data_indexed/test_text_document]}'

!torchrun --nproc_per_node=1 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py \
    model.data.data_prefix="$DATA" \
    name=megatron_gpt_ \
    exp_manager.name=megatron_gpt_1 \
    restore_from_path='/workspace/new_nemo_out/new_megatron_gpt_model.nemo' \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=16-mixed \
    trainer.val_check_interval=300 \
    trainer.max_steps=1200 \
    model.megatron_amp_O2=False \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.micro_batch_size=1 \
    model.global_batch_size=1 \
    ++model.use_flash_attention=False \
    ++model.seq_len_interpolation_factor=null

Error Message:

[NeMo W 2024-06-09 00:24:29 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
[NeMo W 2024-06-09 00:24:29 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:563: `precision=bf16` is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
    
Error executing job with overrides: []
Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py", line 167, in main
    trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 401, in __init__
    self._accelerator_connector = _AcceleratorConnector(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 134, in __init__
    self._check_config_and_set_final_flags(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 271, in _check_config_and_set_final_flags
    raise ValueError(
ValueError: Received both `precision=bf16-mixed` and `plugins=<nemo.collections.nlp.parts.nlp_overrides.MegatronHalfPrecisionPlugin object at 0x7fed4a8568c0>`. Choose one.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Expected behavior

The training should proceed without conflicts between precision settings and plugins.

Environment overview (please complete the following information)

  • Environment location: Docker

Environment details

  • OS version: Ubuntu 20.04
  • PyTorch version: 1.13.1
  • Python version: 3.10

Additional context

  • GPU model: NVIDIA A100
  • NVIDIA Driver version: 470.57.02
  • CUDA version: 11.4
moutasemalakkad added the "bug" label on Jun 10, 2024

analogtechnica commented Jun 11, 2024

@moutasemalakkad

Hi, I've just faced the same issue.

  • Working on Docker image: nvcr.io/nvidia/nemo:24.03.framework

Here is my solution: insert cfg.trainer.precision = None above this line:

trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)

This solution is inspired by PR #8908.

It should solve the conflict.
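
In context, the suggested edit would look roughly like this (a sketch against megatron_gpt_continue_training.py; plugins, strategy, callbacks, and cfg are assumed to be built earlier by the script as in the stock file):

from omegaconf import open_dict

# Clear the precision flag so only the MegatronHalfPrecisionPlugin passed via
# `plugins` controls mixed precision; with recent Lightning versions,
# Trainer(precision=None, ...) defers to the precision plugin.
with open_dict(cfg.trainer):
    cfg.trainer.precision = None

trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)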


moutasemalakkad commented Jun 13, 2024

Thanks! That also did not work; the workaround was to set the plugins to an empty list:

trainer = Trainer(plugins=[], strategy=strategy, **cfg.trainer, callbacks=callbacks)


akoumpa commented Jul 18, 2024

Hi, the continue-training script was consolidated with the pretraining one, megatron_gpt_pretraining.py, so you may want to try that instead. I think you should not face this issue any more.

Thanks.
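
For reference, a hypothetical adaptation of the original command to the consolidated script could look like the sketch below; the ++model.restore_from_path override is an assumption, so check megatron_gpt_config.yaml in your NeMo version for the exact option used to continue from a .nemo checkpoint (everything else mirrors the original command):

!torchrun --nproc_per_node=1 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.data.data_prefix="$DATA" \
    ++model.restore_from_path='/workspace/new_nemo_out/new_megatron_gpt_model.nemo' \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=16-mixed \
    trainer.val_check_interval=300 \
    trainer.max_steps=1200 \
    model.megatron_amp_O2=False \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.micro_batch_size=1 \
    model.global_batch_size=1 \
    ++model.use_flash_attention=False \
    ++model.seq_len_interpolation_factor=null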

akoumpa closed this as completed on Jul 25, 2024.