Hello.
In the case of distributed training of CLIP, I think the key engineering details are the shard-and-gather steps in ClipLoss. I can see several arguments that control the running mode, specifically local_loss and gather_with_grad.
Following the nice discussion in #616, I gathered that local_loss must be used with gather_with_grad=True. So I want to check two points (see the sketch after the questions):
1. If local_loss=False, i.e. a global loss, should gather_with_grad be True or False?
2. Which is better in practice, local loss or global loss?
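For context, here is roughly how I understand the gather path and what the two flags do (a simplified sketch, not the exact ClipLoss code):

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # differentiable collectives

def gather_features(features, rank, world_size, local_loss=False, gather_with_grad=False):
    # `features` is this rank's shard of the batch.
    if gather_with_grad:
        # Differentiable all_gather: gradients flow back through the
        # gathered copies to every rank's local features.
        all_features = torch.cat(torch.distributed.nn.all_gather(features), dim=0)
    else:
        # Plain all_gather returns detached tensors with no grad history.
        gathered = [torch.zeros_like(features) for _ in range(world_size)]
        dist.all_gather(gathered, features)
        if not local_loss:
            # Re-insert this rank's live tensor so the local slice of the
            # global loss still has a gradient path.
            gathered[rank] = features
        all_features = torch.cat(gathered, dim=0)
    return all_features
```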
Thanks!
@songtianhui pretty much all models featured here that were trained with OpenCLIP use --local-loss --gather-with-grad; it's the only option that scales. Back when we first implemented it, we verified that with the gradient flowing through the gather, the local-loss results were equivalent to computing the global loss.
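Concretely, in the local-loss path each rank computes only its own rows of the full similarity matrix against the gathered features, and with gradients flowing through the gather the backward pass matches the global computation. A rough sketch, reusing the hypothetical gather_features above (again, not the exact ClipLoss code):

```python
import torch
import torch.nn.functional as F

def local_clip_loss(image_features, text_features, logit_scale, rank, world_size):
    # Gather with gradients so the backward pass matches the global loss.
    all_text = gather_features(text_features, rank, world_size, gather_with_grad=True)
    all_image = gather_features(image_features, rank, world_size, gather_with_grad=True)

    # Each rank computes only its local rows of the full logit matrix:
    # local batch (B rows) vs. the gathered features (world_size * B columns).
    logits_per_image = logit_scale * image_features @ all_text.T
    logits_per_text = logit_scale * text_features @ all_image.T

    # The positive pair for local sample i sits at global column rank * B + i.
    b = image_features.shape[0]
    labels = torch.arange(b, device=image_features.device) + rank * b

    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2
```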