
Parameter at index 195 has been marked as ready twice. #23018

Closed
2 of 4 tasks
skye95git opened this issue Apr 27, 2023 · 7 comments

Comments

@skye95git

System Info

  • transformers version: 4.28.0
  • Platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.31
  • Python version: 3.9.12
  • Huggingface_hub version: 0.13.4
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I retrained RoBERTa on my own corpus with the MLM task. I call model.gradient_checkpointing_enable() to save memory.

from transformers import RobertaModel

model = RobertaModel.from_pretrained(model_name_or_path, config=config)
model.gradient_checkpointing_enable()  # Activate gradient checkpointing
model = Model(model, config, tokenizer, args)

My model:

class Model(nn.Module):
    def __init__(self, model, config, tokenizer, args):
        super(Model, self).__init__()
        self.encoder = model
        self.config = config
        self.tokenizer = tokenizer
        self.args = args
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
        # Tie the LM head to the encoder's input embeddings
        self.lm_head.weight = self.encoder.embeddings.word_embeddings.weight
        self.register_buffer(
            "bias",
            torch.tril(torch.ones((args.block_size, args.block_size), dtype=torch.uint8)).view(
                1, args.block_size, args.block_size
            ),
        )

    def forward(self, mlm_ids):
...

There is an error:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 195 with name encoder.encoder.layer.11.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

If I remove this line of code, model.gradient_checkpointing_enable(), everything works. Why?

Expected behavior

I want to pre-train with gradient_checkpointing.

@sgugger
Collaborator

sgugger commented Apr 27, 2023

There is little we can do to help without seeing a full reproducer.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 5, 2023
@CrissBrian

Got the exact same bug when calling gradient_checkpointing_enable().

@mirix

mirix commented Oct 13, 2023

Are you using DDP?

I am using DDP on two GPUs:

python -m torch.distributed.run --nproc_per_node 2 run_audio_classification.py

(using run because launch fails)

All else being equal, facebook/wav2vec2-base works with gradient_checkpointing set to True; however, the large model crashes unless the option is either set to False or removed.

gradient_checkpointing works for both models if using a single GPU, so the issue seems to be DDP-related.

This seems to come from:

https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/reducer.cpp

@mirix

mirix commented Oct 13, 2023

The problem may be that, when the Trainer is invoked from torchrun, it sets find_unused_parameters to True for all devices when, apparently, it should only do so for the first one:

https://discuss.pytorch.org/t/finding-the-cause-of-runtimeerror-expected-to-mark-a-variable-ready-only-once/124428/3

And the reason the base model works is that the option can be set to False for it; for the large model, however, it has to be True.

The solution would be to change the way that argument is parsed.
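
For anyone hitting this outside of the Trainer, here is a minimal sketch (not from this thread) of how those flags look when wrapping a model in DDP by hand. The function name wrap_for_ddp and the device_id argument are placeholders, the process group is assumed to already be initialized, and static_graph as a constructor argument needs a reasonably recent PyTorch:

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(model: nn.Module, device_id: int) -> DDP:
    # With find_unused_parameters=True, the reducer expects each parameter to be
    # marked ready at most once per iteration; reentrant gradient checkpointing
    # replays the forward pass and can mark it twice, which is exactly the
    # "marked as ready twice" error above.
    return DDP(
        model.to(device_id),
        device_ids=[device_id],
        find_unused_parameters=False,  # avoid the duplicate "ready" bookkeeping
        static_graph=True,             # the workaround the error message itself suggests
    )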

@infinitylogesh

infinitylogesh commented Nov 3, 2023

Thank you @mirix. Setting ddp_find_unused_parameters=False in Trainer solved this issue for me.
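
For reference, a minimal sketch of how that looks at the Trainer level; output_dir, model and train_dataset here are placeholders from my own setup, not from this issue:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    gradient_checkpointing=True,       # keep checkpointing for the memory savings
    ddp_find_unused_parameters=False,  # avoids the "marked as ready twice" error under DDP
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()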

@younesbelkada
Contributor

If you use gradient_checkpointing_enable(), you can now overcome this issue by passing gradient_checkpointing_kwargs={"use_reentrant": False}:

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
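
Applied to the original reproduction above, that would look roughly like this (assuming a transformers release recent enough to accept gradient_checkpointing_kwargs; model_name_or_path, config, tokenizer, args and Model come from the original snippet):

from transformers import RobertaModel

model = RobertaModel.from_pretrained(model_name_or_path, config=config)
# Non-reentrant checkpointing avoids the duplicate autograd hooks under DDP
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
model = Model(model, config, tokenizer, args)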
