
Update default net to nn-335a9b2d8a80.nnue #4295

Closed

Conversation

linrock (Contributor) commented Dec 21, 2022

Created by retraining the master net with a combination of:

  • the previous best dataset (Leela-dfrc_n5000.binpack), with about half of it filtered using depth6 multipv2 search to throw away positions where either of the two best moves is a capture (see the sketch after this list)
  • Leela T80 Oct and Nov training data rescored with best moves, adding ~9.5 billion positions
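
For illustration, here is a minimal sketch of that filtering criterion written with python-chess against a UCI engine. The actual filter is a C++ modification of transform.cpp on the tools branch (linked further down), so the structure and the engine path here are placeholders, not the real implementation:

# Minimal sketch of the depth6 multipv2 filtering criterion using
# python-chess. The real filter is a C++ change to transform.cpp on the
# tools branch; this is only an illustration of the idea.
import chess
import chess.engine

def keep_position(board: chess.Board, engine: chess.engine.SimpleEngine) -> bool:
    """Keep a position only if neither of the 2 best moves is a capture."""
    infos = engine.analyse(board, chess.engine.Limit(depth=6), multipv=2)
    for info in infos:
        pv = info.get("pv")
        if pv and board.is_capture(pv[0]):
            return False  # a capture is among the 2 best moves: filter out
    return True

engine = chess.engine.SimpleEngine.popen_uci("./stockfish")  # placeholder path
print(keep_position(chess.Board(), engine))
engine.quit()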

Trained effectively the same way as the previous master net:

python3 easy_train.py \
  --experiment-name=leela-dfrc-filtered-T80-oct-nov \
  --training-dataset=/data/leela-dfrc-filtered-T80-oct-nov.binpack \
  --start-from-engine-test-net True \
  --gpus="0," \
  --start-lambda=1.0 \
  --end-lambda=0.75 \
  --gamma=0.995 \
  --lr=4.375e-4 \
  --tui=False \
  --seed=$RANDOM \
  --max_epoch=800 \
  --auto-exit-timeout-on-training-finished=900 \
  --network-testing-threads 20 \
  --num-workers 6
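
For readers unfamiliar with these flags, here is a rough sketch of what the schedule-related hyperparameters do, as I understand the nnue-pytorch trainer (illustrative only, not the trainer's actual code): lambda blends the search score with the game outcome in the loss target, interpolating from start-lambda to end-lambda over the run, while gamma decays the learning rate each epoch.

# Rough sketch of the schedule implied by the flags above (illustrative,
# not the actual nnue-pytorch code).
max_epoch = 800
lr, gamma = 4.375e-4, 0.995
start_lambda, end_lambda = 1.0, 0.75

for epoch in range(max_epoch):
    t = epoch / max_epoch
    # lambda interpolates the training target between the search score
    # (lambda = 1.0) and the game result (lambda = 0.0) over the run:
    #   target = lambda_ * search_score_wdl + (1 - lambda_) * game_result
    lambda_ = start_lambda + (end_lambda - start_lambda) * t
    # gamma decays the learning rate multiplicatively each epoch.
    epoch_lr = lr * gamma ** epoch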

Local testing at a fixed 25k nodes:
experiments/experiment_leela-dfrc-filtered-T80-oct-nov/training/run_0/nn-epoch779.nnue
localElo: run_0/nn-epoch779.nnue : 4.7 +/- 3.1

The new Leela T80 part of the dataset was prepared by:

  • downloading test80 training data from all of Oct 2022 and Nov 2022
  • rescoring with syzygy 6-piece tablebases and ~600 GB of 7-piece tablebases
  • saving best moves to exported .plain files
  • removing all positions with castling flags
  • converting to binpacks and merging them with interleave_binpacks.py

Scripts used in this data conversion process are available at:
https://github.com/linrock/lc0-data-converter
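
As a rough illustration of the final merging step, here is a toy version of interleaving several converted files into one output stream. The real interleave_binpacks.py works on binpack chunk boundaries rather than raw bytes (byte slicing like this would corrupt binpacks), and the file names are placeholders; treat this only as a sketch of the random-interleave idea:

# Toy sketch of interleaving several training-data files into one stream.
# NOT the actual interleave_binpacks.py logic: the real script respects
# binpack chunk boundaries, while this treats files as opaque bytes.
import random

def interleave(paths, out_path, chunk_size=64 * 1024):
    sources = [open(p, "rb") for p in paths]
    with open(out_path, "wb") as out:
        while sources:
            src = random.choice(sources)  # pick a source at random
            chunk = src.read(chunk_size)
            if chunk:
                out.write(chunk)
            else:                         # source exhausted: drop it
                src.close()
                sources.remove(src)

interleave(["t80-oct.binpack", "t80-nov.binpack"], "merged.binpack")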

Filtering binpack data using depth6 multipv2 search was done by modifying transform.cpp in the tools branch:
https://github.com/linrock/Stockfish/tree/tools-filter-multipv2-no-rescore

Links for downloading the training data (total size: 338 GB) are available at:
https://robotmoon.com/nnue-training-data/

Passed STC:
LLR: 2.94 (-2.94,2.94) <0.00,2.00>
Total: 30544 W: 8244 L: 7947 D: 14353
Ptnml(0-2): 93, 3243, 8302, 3542, 92
https://tests.stockfishchess.org/tests/view/63a0d377264a0cf18f86f82b

Passed LTC:
LLR: 2.95 (-2.94,2.94) <0.50,2.50>
Total: 32464 W: 8866 L: 8573 D: 15025
Ptnml(0-2): 19, 3054, 9794, 3345, 20
https://tests.stockfishchess.org/tests/view/63a10bc9fb452d3c44b1e016

Bench 3554904

vondele (Member) commented Dec 21, 2022

Nice, great to see progress in net training again; this is a lot of work! Thanks also for making the data available.

vondele closed this in c620886 on Dec 21, 2022
ppigazzini (Contributor) commented:

@linrock the data page seems to have the </html> tag in the wrong position...
https://validator.w3.org/nu/?showsource=yes&doc=https%3A%2F%2Frobotmoon.com%2Fnnue-training-data%2F

<!DOCTYPE html><html></html><head><title>Stockfish NNUE training data</title>

Sopel97 (Member) commented Dec 21, 2022

There's a lot of stuff happening, and we need to resolve it reasonably, so...

The tools branch will need a PR with the changes. Now, I think the additional filtering code should not be part of rescoring. I'm wondering whether a cryptic transform name (like the hash of this commit) would work, as the logic is very specific to this one network.

I also suggest, in parallel to the above, that instead of removing positions we just set their score to VALUE_NONE and filter these out during training. This should result in much smaller datasets without negatively impacting training speed, provided not too many positions are filtered away (a sketch of the idea follows).
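
A minimal sketch of that idea, assuming VALUE_NONE is the usual Stockfish sentinel (32002) and a hypothetical entry object with a score field; the real check would live in the trainer's native data loader:

# Minimal sketch, assuming VALUE_NONE is Stockfish's sentinel (32002)
# and a hypothetical entry object with a .score field.
VALUE_NONE = 32002

def iter_trainable(entries):
    """Yield only entries whose score was not blanked out by filtering."""
    for entry in entries:
        if entry.score == VALUE_NONE:
            continue  # position was marked as filtered; skip it
        yield entry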

Thirdly, we need to establish a complete training procedure. Right now this would mean 3 training sessions, but it might be possible to omit training on Leela-dfrc_n5000.binpack and train directly on the filtered data.

First, I'd suggest you make a WIP PR for the tools branch with the changes, where we could discuss it and the way it should be implemented.

linrock (Contributor, Author) commented Dec 21, 2022

> @linrock the data page seems to have the </html> tag in the wrong position...

thanks, fixed now

> The tools branch will need a PR with the changes.

Sounds good, I'll open up a WIP PR with some of the junk in my branch cleaned up.

> I also suggest, in parallel to the above, that instead of removing positions we just set their score to VALUE_NONE and filter these out during training. This should result in much smaller datasets without negatively impacting training speed, provided not too many positions are filtered away.

👍 I'm estimating the depth6 multipv2 filtering throws away about 25% more positions than smart-fen-skipping alone. Using VALUE_NONE will help maintain the original dataset size while filtering.

However, with this approach the lc0 rescorer tool would need to be modified to maintain binpack compression when converting Leela training data with best moves. Using best moves in exported Leela data breaks binpack compression, since the format chains consecutive positions together via the stored move, but it has higher Elo than the default of exporting played moves, likely because best moves are better targets for smart-fen-skipping.

> Thirdly, we need to establish a complete training procedure. Right now this would mean 3 training sessions, but it might be possible to omit training on Leela-dfrc_n5000.binpack and train directly on the filtered data.

It'd be convenient to only need two training sessions, and worth verifying that this is possible. From early testing, I found that adding too much Leela data when retraining the previous master did not result in much stronger nets. I was surprised that only T80 Oct+Nov performed as well as it did vs. other tests that had more Leela data, though I could've messed something up.

Sopel97 (Member) commented Dec 21, 2022

> However, with this approach the lc0 rescorer tool would need to be modified to maintain binpack compression when converting Leela training data with best moves. Using best moves in exported Leela data breaks binpack compression but has higher Elo than the default of exporting played moves, likely because best moves are better targets for smart-fen-skipping.

The played move could be preserved, and the best move used to set the "skip flag", since this kind of filtering is not probabilistic. The VALUE_NONE score could be output by the rescorer (under some opt-in switch) without issues (sketched below).
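
A minimal sketch of that record layout, with hypothetical field names (this is not the binpack spec): the played move is kept so the format's sequential compression still works, while the best move only decides the skip flag.

# Minimal sketch of the discussed layout, with hypothetical field names
# (not the binpack spec). The played move is kept so binpack's sequential
# compression, which chains each position to the next via the stored move,
# still works; the best move only decides whether the entry is skipped.
from dataclasses import dataclass

import chess

VALUE_NONE = 32002

@dataclass
class TrainingEntry:
    fen: str
    played_move: str  # preserved for compression / position chaining
    score: int        # VALUE_NONE marks a filtered (skipped) position
    result: int       # game outcome from the side to move's perspective

def make_entry(board, played_move, best_move, score, result):
    skip = board.is_capture(best_move)  # deterministic, not probabilistic
    return TrainingEntry(board.fen(), played_move.uci(),
                         VALUE_NONE if skip else score, result)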

linrock (Contributor, Author) commented Dec 21, 2022

Yeah, that works. I'll still need to modify the lc0 rescorer to set the "skip flag" when the best move is a capture, but that's probably not difficult.

MichaelB7 (Contributor) commented Dec 21, 2022 via email

vondele pushed a commit to vondele/Stockfish that referenced this pull request Jan 1, 2023
Created by retraining the master net on the previous best dataset with additional filtering. No new data was added.

More of the Leela-dfrc_n5000.binpack part of the dataset was pre-filtered with depth6 multipv2 search to remove bestmove captures. About 93% of the previous Leela/SF data and 99% of the SF dfrc data was filtered. Unfiltered parts of the dataset were left out. The new Leela T80 oct+nov data is the same as before. All early game positions with ply count <= 28 were skipped during training by modifying the training data loader in nnue-pytorch.
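
A minimal sketch of that early-ply skip as it might look in the data loader (illustrative only; the actual change lives on the misc-fixes-skip-ply-lteq-28 branch of nnue-pytorch, in the native data loader rather than Python):

# Minimal sketch of the early-ply skip (illustrative; not the actual
# misc-fixes-skip-ply-lteq-28 branch code).
MAX_SKIPPED_PLY = 28

def skip_entry(entry) -> bool:
    """Skip all early-game positions with ply count <= 28."""
    return entry.ply <= MAX_SKIPPED_PLY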

Trained in a similar way as recent master nets, with a different nnue-pytorch branch for early ply skipping:

python3 easy_train.py \
  --experiment-name=leela93-dfrc99-filt-only-T80-oct-nov-skip28 \
  --training-dataset=/data/leela93-dfrc99-filt-only-T80-oct-nov.binpack \
  --start-from-engine-test-net True \
  --nnue-pytorch-branch=linrock/nnue-pytorch/misc-fixes-skip-ply-lteq-28 \
  --gpus="0," \
  --start-lambda=1.0 \
  --end-lambda=0.75 \
  --gamma=0.995 \
  --lr=4.375e-4 \
  --tui=False \
  --seed=$RANDOM \
  --max_epoch=800 \
  --network-testing-threads 20 \
  --num-workers 6

For the exact training data used: https://robotmoon.com/nnue-training-data/
Details about the previous best dataset: official-stockfish#4295

Local testing at a fixed 25k nodes:
experiment_leela93-dfrc99-filt-only-T80-oct-nov-skip28
Local Elo: run_0/nn-epoch779.nnue : 5.1 +/- 1.5

Passed STC
https://tests.stockfishchess.org/tests/view/63adb3acae97a464904fd4e8
LLR: 2.94 (-2.94,2.94) <0.00,2.00>
Total: 36504 W: 9847 L: 9538 D: 17119
Ptnml(0-2): 108, 3981, 9784, 4252, 127

Passed LTC
https://tests.stockfishchess.org/tests/view/63ae0ae25bd1e5f27f13d884
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 36592 W: 10017 L: 9717 D: 16858
Ptnml(0-2): 17, 3461, 11037, 3767, 14

closes official-stockfish#4314

bench 4015511
vondele pushed a commit to vondele/Stockfish that referenced this pull request Jan 2, 2023
This is a later epoch (epoch 859) from the same experiment run that trained yesterday's master net nn-60fa44e376d9.nnue (epoch 779). The experiment was manually paused around epoch 790 and unpaused with max epoch increased to 900 mainly to get more local elo data without letting the GPU idle.

nn-60fa44e376d9.nnue is from official-stockfish#4314
nn-335a9b2d8a80.nnue is from official-stockfish#4295

Local elo vs. nn-335a9b2d8a80.nnue at 25k nodes per move:
experiment_leela93-dfrc99-filt-only-T80-oct-nov-skip28
run_0/nn-epoch779.nnue (nn-60fa44e376d9.nnue) : 5.0 +/- 1.2
run_0/nn-epoch859.nnue (nn-a3dc078bafc7.nnue) : 5.6 +/- 1.6

Passed STC vs. nn-335a9b2d8a80.nnue
https://tests.stockfishchess.org/tests/view/63ae10495bd1e5f27f13d94f
LLR: 2.95 (-2.94,2.94) <0.00,2.00>
Total: 37536 W: 10088 L: 9781 D: 17667
Ptnml(0-2): 110, 4006, 10223, 4325, 104

An LTC test vs. nn-335a9b2d8a80.nnue was paused due to nn-60fa44e376d9.nnue passing LTC first:
https://tests.stockfishchess.org/tests/view/63ae5d34331d5fca5113703b

Passed LTC vs. nn-60fa44e376d9.nnue
https://tests.stockfishchess.org/tests/view/63af1e41465d2b022dbce4e7
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 148704 W: 39672 L: 39155 D: 69877
Ptnml(0-2): 59, 14443, 44843, 14936, 71

closes official-stockfish#4319

bench 3984365