Experiment: Default z-loss? #935

@dlwh

Description

(If you need context on what z-loss is, please check out the WandB Report.)

Let's see if we should enable z-loss by default.

Will run fast-ish 1.4b and 8b and look at standard metrics.

Hypothesis or Goal

Is z-loss obviously better, or is it only useful for cooldown?

Links

Results

We ran two different configs @ 1.4B/42B, each with and without z-loss. We used 1e-4 for the z-loss penalty, the same value OLMo uses.

The two configs were:

  • cosine, wd=0.1
  • wsd, wd=0.05
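For readers without the report handy, here is a minimal sketch of what a z-loss penalty at 1e-4 looks like for a single token. It assumes the commonly used formulation, where the penalty is the coefficient times the squared log-partition function, log(Z)^2, added to the cross-entropy. The issue does not spell out the exact form used in these runs, so treat this as illustrative only; the names `log_sum_exp`, `cross_entropy_with_z_loss`, and `z_loss_weight` are hypothetical and not taken from the training code.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(logits, target, z_loss_weight=1e-4):
    """Return (total, cross_entropy, z_penalty) for one token.

    Assumed formulation: z_penalty = z_loss_weight * log(Z)^2,
    where Z = sum(exp(logits)).
    """
    log_z = log_sum_exp(logits)
    cross_entropy = log_z - logits[target]
    z_penalty = z_loss_weight * log_z ** 2
    return cross_entropy + z_penalty, cross_entropy, z_penalty


# Shifting every logit by a constant leaves the cross-entropy unchanged
# but moves log(Z), so only the z-loss term responds to the shift.
logits = [2.0, 0.5, -1.0]
shifted = [x + 10.0 for x in logits]

total, ce, z = cross_entropy_with_z_loss(logits, target=0)
total_s, ce_s, z_s = cross_entropy_with_z_loss(shifted, target=0)
print(f"ce={ce:.4f} z={z:.6f} | shifted: ce={ce_s:.4f} z={z_s:.6f}")
```

The point of the shifted example is that the softmax is invariant to a uniform offset in the logits, so the penalty is the only term that pins down their overall scale.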

Adding z-loss has no discernible impact on either config, in terms of token loss or eval score.

However, surprisingly, z-loss increases the norm of the lm_head by quite a lot?!? I didn’t think to turn on other metrics here. (Update: see below for more analysis.)

Conclusion: defaulting z-loss on is probably fine, but its behavior is pretty surprising.
