(If you need context on what z-loss is, please check out the WandB Report.)
Description
Let's see if we should turn z-loss on by default.
Will run fast-ish 1.4B and 8B models and look at standard metrics.
Hypothesis or Goal
Is z-loss obviously better in general, or is it only useful for cooldown?
Links
Results
Ran two different configs @ 1.4B/42B, each with and without z-loss. We used 1e-4 for the z-loss penalty weight, which is the same as OLMo.
The two configs were:
- cosine, wd=0.1
- wsd, wd=0.05
Adding z-loss has no discernible impact on either config, in terms of token loss or eval score.
However, surprisingly, z-loss increases the norm of the lm_head by quite a lot. I didn’t think to turn on other metrics here. (Update: see below for more analysis.)
Conclusion: defaulting z-loss on is probably fine, but its behavior is pretty surprising.
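
For reference, here is a minimal sketch of what the z-loss penalty computes for a single token, assuming the usual formulation (penalty weight times the squared log of the softmax normalizer, as in PaLM). The function names and the example logits are made up for illustration; this is not the training code used in these runs.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(x))) over one token's logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(logits, target, z_loss_weight=1e-4):
    """Return (total, ce, z_penalty) for a single token.

    Assumed formulation: z_penalty = z_loss_weight * log(Z)^2,
    where Z = sum(exp(logits)) is the softmax normalizer.
    """
    log_z = logsumexp(logits)
    ce = log_z - logits[target]  # standard softmax cross-entropy
    z_penalty = z_loss_weight * log_z ** 2
    return ce + z_penalty, ce, z_penalty


# Hypothetical logits: adding a constant to every logit leaves the
# cross-entropy unchanged but changes log(Z), so only the z-loss term moves.
base_logits = [2.0, 1.0, 0.1]
shifted_logits = [x + 10.0 for x in base_logits]

total_base, ce_base, z_base = cross_entropy_with_z_loss(base_logits, target=0)
total_shift, ce_shift, z_shift = cross_entropy_with_z_loss(shifted_logits, target=0)
```

The point of the example is that cross-entropy is invariant to a uniform shift of the logits, while the z-loss term is not: it pulls log(Z) toward zero. At a weight of 1e-4 the penalty is small next to the token loss, which fits with the lack of any visible change in loss or eval score above.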