
[BIT-601] Scaling law on EMA loss #1022

Merged
merged 11 commits into nobunaga from feature/BIT-601/scaling-law-on-ema-loss on Dec 9, 2022

Conversation

@opentaco (Contributor) commented Dec 1, 2022

BIT-601 Scaling law on EMA loss

The neural language model scaling law [1] is typically meant to be computed on a loss averaged over the entire training data. Currently it is computed within-batch only, which frequently sees losses below 1.69 (the natural entropy of text).

Here we now compute the scaling law and the resultant effective number of model parameters on the exponentially moving average loss for a server, which should greatly improve the definition of the result.

[1] (OpenAI scaling laws) Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv:2001.08361 (2020)
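For reference, a minimal sketch of the loss-to-parameters conversion this relies on, assuming the published Kaplan et al. fit constants (N_c ≈ 8.8e13 non-embedding parameters, α_N ≈ 0.076) and the 1.69 nats natural-entropy floor; the exact constants and clamping used in the validator may differ:

```python
import torch

def scaling_law_loss_to_params(loss: torch.Tensor) -> torch.Tensor:
    """Invert the Kaplan et al. scaling law L(N) = (N_c / N) ** alpha_N to get an
    effective (non-embedding) parameter count from a language-modelling loss in nats.
    The clamp at 1.69 nats (natural entropy of text) caps the estimate when a
    measured loss happens to fall below the entropy floor."""
    N_c, alpha_N = 8.8e13, 0.076  # published fit constants (assumed here)
    clamped_loss = torch.clamp(loss, min=1.69)
    return N_c / clamped_loss ** (1.0 / alpha_N)
```

Under this inversion a lower loss maps to a larger effective parameter count, which is why where the loss is measured (per batch vs. an average over many batches) matters.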

@Eugene-hu (Contributor) left a comment


LGTM

@opentaco (Contributor, Author) commented Dec 8, 2022

Comparative analysis of BIT-601

We compare the neuron stats of two types of nakamoto validators: one on the current master branch, the other on the BIT-601 branch. The change from master is that BIT-601 applies the scaling law to the loss averaged across multiple batches, instead of to each batch_loss separately as master does.

master: `avg(scaling_law(batch_loss))`

```python
# estimate the effective number of model parameters from the batch_loss
_num_params = scaling_law_loss_to_params(_loss)
```

BIT-601: `scaling_law(avg(batch_losses))`

```python
# estimate the effective number of model parameters from the EMA loss
_num_params = scaling_law_loss_to_params(torch.tensor(stats['loss_nxt']))
```
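For context, a minimal sketch of how the EMA loss behind stats['loss_nxt'] could be maintained; the smoothing factor and update rule here are illustrative assumptions rather than the validator's exact implementation:

```python
alpha = 0.05  # illustrative EMA smoothing factor (assumption)

def update_ema_loss(stats: dict, batch_loss: float) -> None:
    """Exponentially moving average of per-batch losses, stored as stats['loss_nxt']."""
    if 'loss_nxt' not in stats:
        stats['loss_nxt'] = batch_loss
    else:
        stats['loss_nxt'] = alpha * batch_loss + (1 - alpha) * stats['loss_nxt']

# BIT-601 then applies the scaling law once, to the smoothed loss:
# _num_params = scaling_law_loss_to_params(torch.tensor(stats['loss_nxt']))
```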

We expect base_params_nxt and shapley_values_nxt to change from master to BIT-601. Measuring the shift in their probability distributions, we see that BIT-601 produces smaller values overall.

  1. Generally avg(batch_losses) > 1.69: over many batches the loss approaches the natural entropy of text, so BIT-601 rarely hits the clamp in the scaling law.
  2. Frequently batch_loss < 1.69: small individual batches on fine-tuned data can often get the next-token prediction correct, so master often clamps in the scaling law.
  3. Generally batch_loss < avg(batch_losses), and even with master's clamping this inequality carries through to the result: base_params_nxt comes out larger in master than in BIT-601 (see the toy example after this list).
  4. However, since BIT-601 clamps far less often, we expect more accurate scoring and better separation of model sizes: the noise introduced by clamping is removed, so larger models can separate themselves more clearly.
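A toy numerical illustration of points 1–3, using the illustrative scaling_law_loss_to_params sketch from the PR description and made-up batch losses (not measured data):

```python
import torch

def scaling_law_loss_to_params(loss: torch.Tensor) -> torch.Tensor:
    # same illustrative inversion as sketched in the PR description
    return 8.8e13 / torch.clamp(loss, min=1.69) ** (1.0 / 0.076)

# made-up batch losses: two fall below the 1.69 nats entropy floor
batch_losses = torch.tensor([1.4, 1.6, 2.2, 2.6])

# master: convert each batch loss (clamping the two low ones), then average the estimates
master_params = scaling_law_loss_to_params(batch_losses).mean()

# BIT-601: average the losses first (stand-in for the EMA), then convert once;
# avg(batch_losses) = 1.95 > 1.69, so no clamping occurs on this path
bit601_params = scaling_law_loss_to_params(batch_losses.mean())

print(f"master  avg(scaling_law(batch_loss))   ≈ {master_params.item():.3e}")
print(f"BIT-601 scaling_law(avg(batch_losses)) ≈ {bit601_params.item():.3e}")
# master's estimate comes out larger here, consistent with point 3 above
```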

[Screenshot 2022-12-08 at 19:36:47]

  1. With BIT-601 we observe improved penetration into higher base parameter counts, and less noise in the relationship between loss_nxt and base_params_nxt, compared to master.
  2. A small minority of BIT-601 parameter counts are still clamped internally at a loss of 1.69, which is acceptable: it prevents outliers from occasionally receiving extremely large parameter counts when solving tasks they have been fine-tuned on.
  3. The EMA window is not really wide enough, and the current task base not novel enough, to reach the 1.69 limit naturally, so it has to be enforced.

[Screenshot 2022-12-09 at 11:54:14]

@opentaco opentaco merged commit 332ba29 into nobunaga Dec 9, 2022
@opentaco opentaco deleted the feature/BIT-601/scaling-law-on-ema-loss branch December 9, 2022 18:48