
Fix nn.DataParallel compatibility in PyTorch 1.5 #4300

Merged: 8 commits merged into master on May 19, 2020
Conversation

julien-c (Member) commented May 12, 2020

As reported in #3936

PyTorch 1.5 changed how nn.Parameter attributes are handled in DataParallel replicas (pytorch/pytorch#33907): the replicas no longer expose parameters().

This PR updates our self.device and self.dtype helpers to mimic nn.Module's parameters() helper, but for attributes, i.e. to recursively look for an attribute of type Tensor.

Reformer and TransfoXL do fancy things based on the module's Parameters, so I didn't attempt to fix them.

Finally, I'm introducing a multigpu CI flag. CI does not currently run on multiple GPUs, so remember to run those tests locally.


Also pinging @ngimel, the author of the change in PyTorch, to check whether I'm doing something stupid here.
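In sketch form, the fallback described above looks roughly like this (illustrative names, not the exact PR diff; it leans on nn.Module's private `_named_members` helper, which mirrors how `parameters()` recurses over submodules):

```python
import torch
from torch import nn

def find_tensor_attributes(module: nn.Module):
    # Plain tensor attributes live in __dict__; nn.Parameter objects do
    # not (nn.Module stores those in _parameters), so this only picks up
    # the raw tensors that DataParallel replicas carry in PyTorch 1.5.
    return [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]

def module_device(root: nn.Module) -> torch.device:
    try:
        # The pre-1.5 path: any real parameter tells us the device.
        return next(root.parameters()).device
    except StopIteration:
        # Replica path: recurse over submodules collecting tensor
        # attributes instead of parameters.
        gen = root._named_members(get_members_fn=find_tensor_attributes)
        _, first_tensor = next(gen)
        return first_tensor.device
```

A `dtype` helper would follow the same shape, returning `first_tensor.dtype` instead of `first_tensor.device`.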

@julien-c julien-c changed the title Test case for #3936 Fix nn.DataParallel compatibility in PyTorch 1.5 May 13, 2020
@julien-c julien-c marked this pull request as ready for review May 13, 2020 23:36
codecov-io commented May 13, 2020

Codecov Report

Merging #4300 into master will decrease coverage by 0.00%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4300      +/-   ##
==========================================
- Coverage   78.21%   78.21%   -0.01%     
==========================================
  Files         120      120              
  Lines       20038    20040       +2     
==========================================
+ Hits        15673    15674       +1     
- Misses       4365     4366       +1     
Impacted Files Coverage Δ
src/transformers/modeling_xlnet.py 75.73% <75.00%> (ø)
src/transformers/modeling_albert.py 77.20% <100.00%> (ø)
src/transformers/modeling_bert.py 88.82% <100.00%> (ø)
src/transformers/modeling_t5.py 83.66% <100.00%> (ø)
src/transformers/modeling_utils.py 90.98% <100.00%> (+0.02%) ⬆️
src/transformers/file_utils.py 73.44% <0.00%> (-0.42%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7cb203f...7eef4f5.

julien-c (Member, Author):

Update: I also asked about this on the shared PyTorch Slack channel.

ngimel commented May 14, 2020

I don't see anything obviously wrong, but I'm not very familiar with your codebase. You are looking for tensor attributes: are you sure you don't have any other tensor attributes that don't correspond to former parameters?
Finally, we also have _former_parameters in PyTorch, introduced in pytorch/pytorch#36523. You may find it useful, but we can't guarantee it will remain a stable API.

@@ -110,11 +110,31 @@ def reset_memory_hooks_state(self):

     @property
     def device(self) -> device:
-        return next(self.parameters()).device
+        try:
A reviewer (Contributor) commented on this diff:
I think it's a great idea to have this as a property here! :-)

@@ -41,7 +41,7 @@ class CTRLModelTester(object):
     def __init__(
         self,
         parent,
-        batch_size=13,
+        batch_size=14,
A reviewer (Contributor) commented on this diff:
We need an even batch_size? Or is it because it's an unlucky number? :D

julien-c (Member, Author) replied:
Yes, for some of the models (but not all, for some reason), the batch size seems to need to be a multiple of the number of DataParallel replicas.

I didn't investigate too much as to why.
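The divisibility requirement is easy to see in isolation: DataParallel scatters inputs along dim 0 with chunk-style splitting, so an odd batch over two replicas produces unequal shards, which trips models that bake the per-replica batch size into tensor shapes (a minimal illustration, not the test suite's code):

```python
import torch

# batch_size=13 split across 2 replicas: shards of 7 and 6 rows.
odd_batch = torch.zeros(13, 8)
print([s.shape[0] for s in torch.chunk(odd_batch, 2, dim=0)])   # [7, 6]

# batch_size=14 splits evenly: every replica sees the same shape.
even_batch = torch.zeros(14, 8)
print([s.shape[0] for s in torch.chunk(even_batch, 2, dim=0)])  # [7, 7]
```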

sshleifer (Contributor) left a review:

LGTM!

tests/utils.py: review thread (outdated, resolved)
LysandreJik (Member) left a review:

LGTM

tests/utils.py: review thread (outdated, resolved)
thomwolf (Member) left a review:

LGTM

@julien-c julien-c merged commit 4c06893 into master May 19, 2020
@julien-c julien-c deleted the tests_multigpu branch May 19, 2020 00:34
blacksph3re commented:
I think you forgot GPT-2:
transformers/modeling_gpt2.py", line 464, in forward
    attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration

julien-c (Member, Author) replied:

Yes, if I remember correctly I didn't try to remove all the calls to next(self.parameters()) in this PR. Do you want to open a PR to fix this?
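The failure blacksph3re reports can be reproduced without any GPUs, since it only depends on parameters() being empty (ReplicaLike is an illustrative stand-in for a PyTorch 1.5 DataParallel replica):

```python
import torch
from torch import nn

class ReplicaLike(nn.Module):
    # Mimics a PyTorch 1.5 DataParallel replica: its weights are plain
    # tensor attributes, so parameters() yields nothing.
    def __init__(self):
        super().__init__()
        self.weight = torch.zeros(3)  # plain tensor, not nn.Parameter

m = ReplicaLike()
# The pattern still used in modeling_gpt2.py raises here:
#   next(m.parameters()).dtype  ->  StopIteration
print(next(m.parameters(), None))  # None: no parameters are visible
# Routing through the dtype/device fallbacks added in this PR avoids it.
```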

guhur added a commit to guhur/transformers that referenced this pull request Oct 9, 2020
LysandreJik pushed a commit that referenced this pull request Oct 9, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
RuntimeRacer added a commit to RuntimeRacer/Real-Time-Voice-Cloning that referenced this pull request Dec 24, 2021

8 participants