[WIP] Automatic script to fetch current datasets. #257

Draft: wants to merge 1 commit into master
Conversation

@Sopel97 (Member) commented Jul 6, 2023

The intent is for this script to always document and fetch the datasets required to replicate the training of the current Stockfish master network.

Right now it is mostly a skeleton with a DSL that defines how the datasets are combined. Downloading from Kaggle and concatenation should work but are untested. Interleaving is not yet implemented. For now the script is only meant to be used in dry-run form.
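To make the idea concrete, a definition in such a DSL might look roughly like the sketch below; every name and field here is hypothetical and not the actual spec format from this PR:

```python
# Hypothetical sketch of a declarative dataset spec: each entry names the
# Kaggle datasets to fetch and the destination file their parts are
# concatenated into. None of these field names come from the PR itself.
STAGE1 = {
    "kaggle_datasets": [
        # the parts of each dataset are concatenated in alphabetical order
        "joostvandevondele/t60t70wisrightfarseert60t74t75t76",
    ],
    "destination": "stage1.binpack",
    "dry_run": True,  # only print what would be downloaded/combined
}
```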

A single Kaggle dataset is always combined into a single destination file by concatenating its parts in alphabetical sort order. If this requirement turns out to be too rigid we can relax it, but I think it works for all the current datasets.
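As a sketch of that rule (assuming binpack parts can be combined by plain byte-wise concatenation; paths and the function name are hypothetical):

```python
import shutil
from pathlib import Path

def concat_sorted(src_dir: str, dest: str) -> None:
    """Concatenate all .binpack files in src_dir into dest,
    in alphabetical sort order of their file names."""
    parts = sorted(Path(src_dir).glob("*.binpack"))
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # stream bytes so large files never load fully into memory
                shutil.copyfileobj(src, out)
```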

@linrock Could you please add a full specification for the currently used datasets? I included an example for the dataset used in the first stage of the training. If any needed functionality is missing, let me know.

Even the first stage alone requires downloading 200GB of data, so I'm unable to verify correctness right now. We'll revisit that once the full process is documented.
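For the interleaving mentioned above as not yet implemented, one plausible shape is a round-robin over already-decoded record streams; this is a generic sketch only (parsing the binpack format itself is out of scope here):

```python
def interleave(*streams):
    """Round-robin over several iterables until all are exhausted.
    Yields one record from each live stream per pass."""
    iterators = [iter(s) for s in streams]
    while iterators:
        # iterate over a snapshot so exhausted streams can be removed safely
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)
```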

@vondele (Member) commented Jul 12, 2023

I tried this a couple of days ago; IMO it looks good. I would not delete intermediate files by default: for most people it is more difficult to download 200GB than to store it, even though you might have different constraints on your server.

Let's try to complete the list of datasets for the current master nets. I can definitely upload more data as needed.

@linrock (Contributor) commented Jul 12, 2023

the current master stage2 dataset is composed of:
https://www.kaggle.com/datasets/joostvandevondele/t60t70wisrightfarseert60t74t75t76
https://www.kaggle.com/datasets/linrock/t78juntoaugt79mart80dec-16tb7p

   LeelaFarseer-T78juntoaugT79marT80dec.binpack (141G)
     T60T70wIsRightFarseerT60T74T75T76.binpack
     test78-junjulaug2022-16tb7p.no-db.min.binpack
     test79-mar2022-16tb7p.no-db.min.binpack
     test80-dec2022-16tb7p.no-db.min.binpack

i'll take a closer look at stage3 later. the current L1-2048 master final stage dataset is an unshuffled 800GB+ binpack that i'm no longer using since it's too inconvenient. i'm working on replacing it with a fully minimized ~330GB dataset which i'll document later as well.

@linrock (Contributor) commented Jul 22, 2023

the current master stage3 dataset is:
https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
https://www.kaggle.com/datasets/linrock/sfnnv7-s3

   leela96-dfrc99-v2-T80dectofeb-sk20-mar-v6-T77decT78janfebT79apr.binpack (223G)
     leela96-filt-v2.min.binpack
     dfrc99-16tb7p-eval-filt-v2.min.binpack
     test80-dec2022-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-jan2023-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-feb2023-16tb7p-filter-v6-sk20.min-mar2023.binpack
     test80-mar2023-2tb7p-filter-v6.min.binpack
     test77-dec2021-16tb7p.no-db.min.binpack
     test78-janfeb2022-16tb7p.no-db.min.binpack
     test79-apr2022-16tb7p.no-db.min.binpack

@linrock (Contributor) commented Sep 10, 2023

the current master stage4/5 dataset is composed of:
https://www.kaggle.com/datasets/linrock/leela96-filt-v2-min
https://www.kaggle.com/datasets/linrock/dfrc99-16tb7p-filt-v2-min
https://www.kaggle.com/datasets/linrock/0dd1cebea57-test80-v6-dd
https://www.kaggle.com/datasets/linrock/0dd1cebea57-misc-v6-dd
https://www.kaggle.com/datasets/linrock/test80-apr2023-2tb7p-no-db

the uploaded dataset components are all minimized. parts of the dataset were unminimized to increase randomness during training. however, it's unclear how much of an elo benefit this brings. see official-stockfish/Stockfish#4606 for more details on this particular dataset.

as of now, all datasets for training the current master net (nn-c38c3d8d3920.nnue) are documented in this PR.

@linrock (Contributor) commented Sep 14, 2023
