
[BIT-601] Scaling law on EMA loss #1022

Merged
merged 11 commits into nobunaga from feature/BIT-601/scaling-law-on-ema-loss on Dec 9, 2022

Conversation

@opentaco (Contributor) commented Dec 1, 2022

BIT-601 Scaling law on EMA loss

The neural language model scaling law [1] is typically meant to be computed on a loss averaged over the entire training data. Currently it is computed within-batch only, which frequently sees losses below 1.69 (the natural entropy of text).

Here we now compute the scaling law and the resultant effective number of model parameters on the exponentially moving average loss for a server, which should greatly improve the definition of the result.

[1] (OpenAI scaling laws) Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv:2001.08361 (2020)
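For reference, a minimal sketch of the loss-to-parameters conversion this relies on, assuming the published Kaplan et al. fit constants (N_c ≈ 8.8e13 non-embedding parameters, α_N ≈ 0.076) and the 1.69 nats natural-entropy floor; the exact constants and clamping used in the validator may differ:

```python
import torch

def scaling_law_loss_to_params(loss: torch.Tensor) -> torch.Tensor:
    """Invert the Kaplan et al. scaling law L(N) = (N_c / N) ** alpha_N to get an
    effective (non-embedding) parameter count from a language-modelling loss in nats.
    The clamp at 1.69 nats (natural entropy of text) caps the estimate when a
    measured loss happens to fall below the entropy floor."""
    N_c, alpha_N = 8.8e13, 0.076  # published fit constants (assumed here)
    clamped_loss = torch.clamp(loss, min=1.69)
    return N_c / clamped_loss ** (1.0 / alpha_N)
```

Under this inversion a lower loss maps to a larger effective parameter count, which is why where the loss is measured (per batch vs. an average over many batches) matters.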

@Eugene-hu (Contributor) left a comment


LGTM

@opentaco (Contributor, Author) commented Dec 8, 2022

Comparative analysis of BIT-601

We compare the neuron stats of two types of nakamoto validators: one on the current master branch, the other on the BIT-601 branch. The change from master is that BIT-601 applies the scaling law to the loss averaged across multiple batches, instead of to each batch_loss separately as master does.

master: `avg(scaling_law(batch_loss))`

```python
# estimate the effective number of model parameters from the batch_loss
_num_params = scaling_law_loss_to_params(_loss)
```

BIT-601: `scaling_law(avg(batch_losses))`

```python
# estimate the effective number of model parameters from the EMA loss
_num_params = scaling_law_loss_to_params(torch.tensor(stats['loss_nxt']))
```
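For context, a minimal sketch of how the EMA loss behind stats['loss_nxt'] could be maintained; the smoothing factor and update rule here are illustrative assumptions rather than the validator's exact implementation:

```python
alpha = 0.05  # illustrative EMA smoothing factor (assumption)

def update_ema_loss(stats: dict, batch_loss: float) -> None:
    """Exponentially moving average of per-batch losses, stored as stats['loss_nxt']."""
    if 'loss_nxt' not in stats:
        stats['loss_nxt'] = batch_loss
    else:
        stats['loss_nxt'] = alpha * batch_loss + (1 - alpha) * stats['loss_nxt']

# BIT-601 then applies the scaling law once, to the smoothed loss:
# _num_params = scaling_law_loss_to_params(torch.tensor(stats['loss_nxt']))
```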

We expect base_params_nxt and shapley_values_nxt to change from master to BIT-601. Measuring the shift in their probability distributions, we see that BIT-601 produces smaller values overall.

  1. Generally avg(batch_losses) > 1.69: over many batches the loss approaches the natural entropy of text, so BIT-601 rarely hits the clamp in the scaling law.
  2. Frequently batch_loss < 1.69: small individual batches on fine-tuned data can often get the next-token prediction correct, so master often clamps in the scaling law.
  3. Generally batch_loss < avg(batch_losses), and even with master's clamping this inequality carries through to the result: base_params_nxt comes out larger in master than in BIT-601 (see the toy example after this list).
  4. However, since BIT-601 clamps far less often, we expect more accurate scoring and better separation of model sizes: the noise introduced by clamping is removed, so larger models can separate themselves more clearly.
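A toy numerical illustration of points 1–3, using the illustrative scaling_law_loss_to_params sketch from the PR description and made-up batch losses (not measured data):

```python
import torch

def scaling_law_loss_to_params(loss: torch.Tensor) -> torch.Tensor:
    # same illustrative inversion as sketched in the PR description
    return 8.8e13 / torch.clamp(loss, min=1.69) ** (1.0 / 0.076)

# made-up batch losses: two fall below the 1.69 nats entropy floor
batch_losses = torch.tensor([1.4, 1.6, 2.2, 2.6])

# master: convert each batch loss (clamping the two low ones), then average the estimates
master_params = scaling_law_loss_to_params(batch_losses).mean()

# BIT-601: average the losses first (stand-in for the EMA), then convert once;
# avg(batch_losses) = 1.95 > 1.69, so no clamping occurs on this path
bit601_params = scaling_law_loss_to_params(batch_losses.mean())

print(f"master  avg(scaling_law(batch_loss))   ≈ {master_params.item():.3e}")
print(f"BIT-601 scaling_law(avg(batch_losses)) ≈ {bit601_params.item():.3e}")
# master's estimate comes out larger here, consistent with point 3 above
```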

[Screenshot 2022-12-08 at 19:36:47]

  1. With BIT-601 we observe improved penetration into higher base parameter counts, and less noise in the relationship between loss_nxt and base_params_nxt, compared to master.
  2. A small minority of BIT-601 parameter counts are still clamped internally at a loss of 1.69, which is acceptable: it prevents outliers from occasionally receiving extremely large parameter counts when solving tasks they have been fine-tuned on.
  3. The EMA window is not really wide enough, and the current task base not novel enough, to reach the 1.69 limit naturally, so it has to be enforced.

[Screenshot 2022-12-09 at 11:54:14]

@opentaco opentaco merged commit 332ba29 into nobunaga Dec 9, 2022
@opentaco opentaco deleted the feature/BIT-601/scaling-law-on-ema-loss branch December 9, 2022 18:48