
DDP validation: All gather for flattened 1D tensors taking long time to complete #55

Open
krishansubudhi opened this issue Aug 10, 2021 · 4 comments · May be fixed by #68
Labels: bug (Something isn't working)

@krishansubudhi (Contributor)

Task = POS tagging

    def val_step(self, global_step: int, batch, device="cpu", encoder=None, encoder_kwargs={}):
        """
        Can return multiple outputs. The first output need not be the loss.
        """
        ...
        print(rels_predicted.shape)
        return label_loss, pointer_loss, rels_predicted, rels_labels

validation ptb_dep 3:: 0%| | 0/7 [00:00<?, ?it/s]torch.Size([1541])
torch.Size([1547])
torch.Size([1500])
torch.Size([1514])
torch.Size([1570])
torch.Size([1506])
torch.Size([1477])
torch.Size([1626])
validation ptb_dep 2:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 30.67it/s]
gathering
validation ptb_dep 3:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 29.47it/s]
gathering
validation ptb_dep 1:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 28.46it/s]
gathering
validation ptb_dep 0:: 29%|█████████████████████████████████████████████████████████▏ | 2/7 [00:00<00:00, 27.57it/s]
gathering

@krishansubudhi (Contributor, Author)

From more debugging, it looks like a bug in gather_tensors_on_cpu:

    def gather_tensors_on_cpu(self, x: torch.Tensor):
        n_samples = len(x)
        self._set_gather_frequency(n_samples)
        gathered = []
        n_chunks = n_samples // self.gather_frequency + 1
        print(n_chunks, n_samples, self.gather_frequency)  # debug code introduced

Output

1510 3018 2

The bug comes from self._set_gather_frequency(n_samples). With multiple outputs, if dimension 0 of the first output is 2, gather_frequency gets stuck at 2 for all of the remaining outputs; here n_samples = 3018 with gather_frequency = 2 gives n_chunks = 3018 // 2 + 1 = 1510 chunks. The class-variable assignment needs to be avoided here.
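A minimal sketch of that fix, keeping the frequency local to each call instead of assigning it to self (_compute_gather_frequency is a hypothetical pure variant of _set_gather_frequency):

    def gather_tensors_on_cpu(self, x: torch.Tensor):
        n_samples = len(x)
        # Hypothetical helper that returns the frequency rather than
        # storing it on self, so the first output's size cannot leak
        # into the chunking of the subsequent outputs.
        gather_frequency = self._compute_gather_frequency(n_samples)
        gathered = []
        n_chunks = n_samples // gather_frequency + 1
        ...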

@krishansubudhi (Contributor, Author)

Also, if n_chunks differs between two processes, the all-gather gets stuck: the process with the higher number of chunks keeps waiting on a collective that the other ranks never enter. To fix this, either force num_chunks = 1, or gather the num_chunks values first and have every rank use the maximum.
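A sketch of the second option, assuming torch.distributed is already initialized (with the gloo backend, CPU tensors are fine; NCCL would need CUDA tensors):

    import torch
    import torch.distributed as dist

    def agree_on_n_chunks(local_n_chunks: int, device="cpu") -> int:
        # All-gather every rank's chunk count and return the maximum, so
        # that each rank performs the same number of collective calls and
        # no rank is left waiting on a gather the others never enter.
        local = torch.tensor([local_n_chunks], device=device)
        counts = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
        dist.all_gather(counts, local)
        return int(max(c.item() for c in counts))

Ranks whose real chunk count is below the maximum would then contribute empty tensors for the extra rounds so the collective schedules stay aligned.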

@jsleep (Contributor) commented Aug 10, 2021

Thanks for debugging this @krishansubudhi - any chance you'd be able to put these changes in or should we try to resource for next sprint?

@jsleep added the bug (Something isn't working) label Aug 10, 2021
@krishansubudhi (Contributor, Author)

I am working on the fix and will raise a PR soon.

@jsleep linked a pull request Oct 11, 2021 that will close this issue