
Best practice with distributed training? #894

Closed
songtianhui opened this issue Jun 12, 2024 · 1 comment

Comments

@songtianhui

Hello.
For distributed training of CLIP, I think the key engineering details are how features are sharded and gathered in ClipLoss. There are several arguments that control this behavior, specifically local_loss and gather_with_grad.
Following the nice discussion in #616, I understand that local_loss must be used together with gather_with_grad=True. So I want to check two points:

  1. If local_loss=False (i.e. the global loss), should gather_with_grad be True or False?
  2. Which works better in practice, local_loss or the global loss?

Thanks!
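
For context, here is a minimal sketch of how the two flags interact in a distributed contrastive loss. It is an illustration of the idea, not the actual open_clip ClipLoss code; the function and variable names are assumptions, and it assumes torch.distributed is already initialized.

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # autograd-aware collectives
import torch.nn.functional as F


def gather_features(features, gather_with_grad=False):
    """All-gather a feature shard from every rank into one global batch."""
    if gather_with_grad:
        # autograd-aware gather: gradients flow back to the features on every rank
        gathered = torch.distributed.nn.all_gather(features)
    else:
        # plain all_gather returns tensors with no gradient through the gather;
        # re-insert the live local tensor so the local shard still gets gradients
        gathered = [torch.zeros_like(features) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, features)
        gathered[dist.get_rank()] = features
    return torch.cat(list(gathered), dim=0)


def clip_loss(image_features, text_features, logit_scale,
              local_loss=False, gather_with_grad=False):
    rank = dist.get_rank()
    all_image = gather_features(image_features, gather_with_grad)
    all_text = gather_features(text_features, gather_with_grad)

    if local_loss:
        # local shard vs. the global batch: (B_local x B_global) logits per rank
        logits_per_image = logit_scale * image_features @ all_text.T
        logits_per_text = logit_scale * text_features @ all_image.T
        labels = torch.arange(image_features.shape[0], device=image_features.device)
        labels = labels + rank * image_features.shape[0]  # offset into the global batch
    else:
        # global loss: full (B_global x B_global) logits computed on every rank
        logits_per_image = logit_scale * all_image @ all_text.T
        logits_per_text = logits_per_image.T
        labels = torch.arange(all_image.shape[0], device=image_features.device)

    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2
```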

@rwightman
Collaborator

@songtianhui Pretty much all of the models featured here that were trained with OpenCLIP use --local-loss --gather-with-grad; it's the only option that scales. Back when we first implemented it, we verified that, with the gradient flowing through the gather, the local-loss results were equivalent to computing the global loss.
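
These CLI flags map to the corresponding ClipLoss constructor arguments. A minimal sketch of building the loss by hand with that configuration, assuming torch.distributed is initialized (the import path and any arguments other than local_loss / gather_with_grad may differ between open_clip versions):

```python
import torch.distributed as dist
from open_clip.loss import ClipLoss

# Illustrative only: the configuration recommended above.
loss_fn = ClipLoss(
    local_loss=True,        # each rank computes loss on its local batch vs. the global batch
    gather_with_grad=True,  # backprop through the all_gather so gradients reach all ranks
    rank=dist.get_rank(),
    world_size=dist.get_world_size(),
)
```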

@mlfoundations mlfoundations locked and limited conversation to collaborators Jun 12, 2024
@rwightman rwightman converted this issue into discussion #895 Jun 12, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
