
Best practice with distributed training? #894

Closed
songtianhui opened this issue Jun 12, 2024 · 1 comment

Comments

@songtianhui

Hello.
For distributed training of CLIP, I think the key engineering details are how features are sharded and gathered in ClipLoss. There are several arguments that control this behavior, specifically local_loss and gather_with_grad.
Following the nice discussion in #616, I understand that local_loss must be used together with gather_with_grad=True. So I want to check two points:

  1. If local_loss=False (i.e. the global loss), should gather_with_grad be True or False?
  2. Which works better in practice, local_loss or the global loss?

Thanks!
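
For context, here is a minimal sketch of how the two flags interact in a distributed contrastive loss. It is an illustration of the idea, not the actual open_clip ClipLoss code; the function and variable names are assumptions, and it assumes torch.distributed is already initialized.

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # autograd-aware collectives
import torch.nn.functional as F


def gather_features(features, gather_with_grad=False):
    """All-gather a feature shard from every rank into one global batch."""
    if gather_with_grad:
        # autograd-aware gather: gradients flow back to the features on every rank
        gathered = torch.distributed.nn.all_gather(features)
    else:
        # plain all_gather returns tensors with no gradient through the gather;
        # re-insert the live local tensor so the local shard still gets gradients
        gathered = [torch.zeros_like(features) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, features)
        gathered[dist.get_rank()] = features
    return torch.cat(list(gathered), dim=0)


def clip_loss(image_features, text_features, logit_scale,
              local_loss=False, gather_with_grad=False):
    rank = dist.get_rank()
    all_image = gather_features(image_features, gather_with_grad)
    all_text = gather_features(text_features, gather_with_grad)

    if local_loss:
        # local shard vs. the global batch: (B_local x B_global) logits per rank
        logits_per_image = logit_scale * image_features @ all_text.T
        logits_per_text = logit_scale * text_features @ all_image.T
        labels = torch.arange(image_features.shape[0], device=image_features.device)
        labels = labels + rank * image_features.shape[0]  # offset into the global batch
    else:
        # global loss: full (B_global x B_global) logits computed on every rank
        logits_per_image = logit_scale * all_image @ all_text.T
        logits_per_text = logits_per_image.T
        labels = torch.arange(all_image.shape[0], device=image_features.device)

    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2
```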

@rwightman
Collaborator

@songtianhui Pretty much all of the models featured here that were trained with OpenCLIP use --local-loss --gather-with-grad; it's the only option that scales. Back when we first implemented it, we verified that, with the gradient flowing through the gather, the local-loss results were equivalent to computing the global loss.
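
These CLI flags map to the corresponding ClipLoss constructor arguments. A minimal sketch of building the loss by hand with that configuration, assuming torch.distributed is initialized (the import path and any arguments other than local_loss / gather_with_grad may differ between open_clip versions):

```python
import torch.distributed as dist
from open_clip.loss import ClipLoss

# Illustrative only: the configuration recommended above.
loss_fn = ClipLoss(
    local_loss=True,        # each rank computes loss on its local batch vs. the global batch
    gather_with_grad=True,  # backprop through the all_gather so gradients reach all ranks
    rank=dist.get_rank(),
    world_size=dist.get_world_size(),
)
```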

@mlfoundations mlfoundations locked and limited conversation to collaborators Jun 12, 2024
@rwightman rwightman converted this issue into discussion #895 Jun 12, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
