
I trained a 20b 256f network (93229e) #1113

Closed
bjiyxo opened this Issue Mar 27, 2018 · 710 comments

bjiyxo commented Mar 27, 2018

I am currently training a network (20 blocks, 256 filters). It was initialized from b8adb7da (78) via net2net ("n2n") to 20b 256f, and it is now being trained up to b3a80524 (99). According to unofficial tests by others, it is much stronger than the current best weights (85c6f2ad) on a 1080 Ti. Enjoy it!
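For readers unfamiliar with the "n2n" step: Net2Net (Chen et al., 2015) grows a trained network while approximately preserving the function it computes. Below is a minimal numpy sketch of the widening half (e.g. 128 to 256 filters); it is illustrative only, not leela-zero's actual code, and a real run would also duplicate batch-norm parameters the same way and insert extra identity-initialized residual blocks to deepen the net.

```python
import numpy as np

def widen_conv(w1, w2, new_width, rng=None):
    """Net2WiderNet for a pair of conv layers.

    w1: (k, k, c_in, c_out) weights of the layer being widened.
    w2: (k, k, c_out, c_next) weights of the layer consuming its output.
    """
    rng = rng or np.random.default_rng(0)
    c_out = w1.shape[3]
    # Mapping g: existing filters map to themselves; each new filter copies
    # a randomly chosen existing one.
    g = np.concatenate([np.arange(c_out),
                        rng.integers(0, c_out, new_width - c_out)])
    counts = np.bincount(g, minlength=c_out)   # copies made of each original filter
    w1_new = w1[:, :, :, g]                    # duplicate output filters
    # Rescale the next layer's incoming weights by the replication count so the
    # summed contribution, and hence the network's output, stays the same.
    w2_new = w2[:, :, g, :] / counts[g][None, None, :, None]
    return w1_new, w2_new

w1, w2 = np.random.randn(3, 3, 128, 128), np.random.randn(3, 3, 128, 128)
w1_new, w2_new = widen_conv(w1, w2, 256)
print(w1_new.shape, w2_new.shape)   # (3, 3, 128, 256) (3, 3, 256, 128)
```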
V2 (93229e)
training up to b3a80524(99)
https://drive.google.com/file/d/1m4rK068Kiky1sKbMrwCcR49GBI0krhSu/view?usp=sharing

Edit: 3/29
V3 (8d9c5e)
training up to 18827fa7(101)
https://drive.google.com/file/d/1Z1GP0IjCNJUaVEFJQ__97b2I-wf8IAjX/view?usp=sharing

Edit: 4/5
V4 (5b48caa1)
training up to 193437be(104)
https://drive.google.com/file/d/1y1A-ijeE1f50EuT0haeWUEtNdtUhJuON/view?usp=sharing
V4 swa (00335835)
https://drive.google.com/file/d/11GizcSzrr3ZTx8GPOOdlSn8b5YdWdN4R/view?usp=sharing

The following versions all use SWA (stochastic weight averaging).

Edit: 4/7
V5 (1629cb)
training up to b7768081(107)
https://drive.google.com/file/d/1TikFfZkrtzfs0jBu1Y7-4IE-T6TMXRRY/view?usp=sharing

Edit: 4/13
V6 ()
training up to 1ccb7342 (108)
https://drive.google.com/file/d/1sWfBzeXJsuK6xCGnCrpfNTx6R1mHs66F/view?usp=sharing

Edit: 4/17
V7 (00045d1d)
training up to 2f4d7274 (112)
https://drive.google.com/file/d/1fx4DeqLDUuC2BlyWqY4ubgFtUDL6GNi9/view?usp=sharing
V8 (b269a6c7)
training up to 8ed44722 (115)
https://drive.google.com/file/d/1HpEQwzhDlgp0v636aSkY7ha_hDbtX7u0/view?usp=sharing
V9 (32179366)
training up to 39d46507 (116) (10b finished)
https://drive.google.com/file/d/1KayAwWUqkaz8eM9M6YPO84Nt7vHMCOw9/view?usp=sharing

Edit: 4/28
V11 (90767fc9)
training up to 8a045bce(124)
https://drive.google.com/file/d/1Ljt0anATrtfdtpH2xPkLs-4NP7pIgy8i/view?usp=sharing

V12 (bef701e8)
training up to 59bb7337 (126)
https://drive.google.com/file/d/1PdbC3hLH89q9Ysb5gaNgfI0Xvp2-dTfn/view?usp=sharing

Edit: 5/11
V13 (756542bf)
training up to ecab83bb (131) (before ELF's selfplay)
https://drive.google.com/file/d/1o1chP-ohKI1o_1kjvqCsbpo-UxgT6Vmy/view?usp=sharing

Edit: 5/28
V14 (9ee6ab54)
training up to 7c6588ce (142) (including ELF's selfplay)
https://drive.google.com/file/d/1WLVOqoEeWIg8gSyNrgEMJs7PuLBuNmu7/view?usp=sharing

Edit: 6/12
V15 (fee6830f)
training up to 0cb74be2 (146) (including ELF's selfplay)
https://drive.google.com/file/d/1lSeBRVVhHWr2Za3VPvfuf4oPi1M622EM/view?usp=sharing

Edit: 6/25
V16 (c010f034)
training up to d0187996 (148) (including ELF's selfplay)
https://drive.google.com/file/d/1Zz-2Ktku0R86Le0o1VuGrNLCEF_4NcFj/view?usp=sharing

Edit: 7/2
V17 (0589016c)
training up to 2b80a9db (150) (including ELF's selfplay)
https://drive.google.com/file/d/19ogANudaRHdi9iF9B1GzFHMzV8NaSXr_/view?usp=sharing

Edit: 7/7
V18 (fb01adbb)
training up to e1d466aa (153 incomplete, only ~40k games) (including ELF's selfplay)
https://drive.google.com/file/d/1quBKBS68b3v8J-EnQxW2MAZSSve-RJc3/view?usp=sharing

Edit: 7/18
V19 (48437672)
training up to e1d466aa (153) (excluding ELF's selfplay)
https://drive.google.com/file/d/1PGTm2LLSW9vEtUoBXF1OKeHfWoj40PvU/view?usp=sharing

Edit: 7/22
V20 (0c77d215)
training up to 050375ce (156)
https://drive.google.com/file/d/15pwwnaDhgprRhN0F18xa-ldp5Vk9vo9N/view?usp=sharing

Edit: 7/28
V20-2 (dc011d01)
training up to 050375ce (156) (including ELF's selfplay)
https://drive.google.com/file/d/16OmFwwvdJuHgwuVVzuTUYQ7gaTCruHzj/view?usp=sharing

bood (Collaborator) commented Mar 27, 2018

Since all our bootstraps to a bigger network have failed so far, maybe it's worth putting this in the queue to see how it performs? Not sure if the server can handle it, though. @gcp

kris-computer-go commented Mar 27, 2018

How long has this network been trained? Do you know how much stronger it is than 85c6 (just approx. to have any idea)?

gcp (Member) commented Mar 27, 2018

> Not sure if the server can handle it, though.

Nope, see leela-zero/leela-zero-server#23.

> How long has this network been trained?

More important, what was it trained on?

isty2e commented Mar 27, 2018

Is it too much to introduce a binary format for large networks? Though I feel like it is some sort of unnecessary overkill...

bjiyxo (Author) commented Mar 27, 2018

> How long has this network been trained?

About 14 days on a 1080 Ti.

> Do you know how much stronger it is than 85c6 (just approx. to have any idea)?

No. Some people have only tested it in a few games (~1 or 2), and I didn't hear of any losses.

> More important, what was it trained on?

It was trained on a 1080 Ti, and the final learning rate is 0.0001. I trained on every 120k kifus for 512k steps.
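A quick back-of-the-envelope on what that schedule means per window; the moves-per-game figure is my assumption, not something stated in this thread:

```python
# 512k steps at batch size 128 over a ~120k-game window.
steps, batch_size = 512_000, 128
games, moves_per_game = 120_000, 200   # ~200 moves/game is an assumption

samples = steps * batch_size           # ~65.5M training samples drawn
positions = games * moves_per_game     # ~24M positions in the window
print(f"each position sampled ~{samples / positions:.1f} times")  # ~2.7
```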

gcp (Member) commented Mar 27, 2018

I mean the data! From what you say I assume it's the SGFs.

gcp (Member) commented Mar 27, 2018

> Is it too much to introduce a binary format for large networks? Though I feel like it is some sort of unnecessary overkill...

It'll make things a bit smaller perhaps, but I wouldn't expect a massive difference compared to 7zipped text files.

bood (Collaborator) commented Mar 27, 2018

> I mean the data! From what you say I assume it's the SGFs.

I think he trained on the same self-play games we have, but I'm not sure which self-play data he started from. @bjiyxo Could you clarify?

bjiyxo (Author) commented Mar 27, 2018

@gcp
It's from the raw data in #167, and I trained from b8adb7da (78) to b3a80524 (99), as I mentioned earlier. So I didn't need to convert SGF files to raw data myself.

lp200 commented Mar 27, 2018

Atari reading has been improved. It is a good sign.

http://www.yss-aya.com/cgos/viewer.cgi?19x19/SGF/2018/03/20/399589.sgf
Move 269, policy probability:
85c6f2ad: 1.1%
bjiyxo's weights: 31.8%

odeint commented Mar 27, 2018

Quite nice what the 256x20 was able to extract from relatively 'old' data (networks 78 to 99)! Just training on each data dump for 512k steps seems to work really nicely.

Isn't this pretty much the same as using always-promote, except the training data doesn't come from the promoted network yet? Taking the bigger network along the history of the smaller one, instead of just training on the very latest data, seems to have worked.

bjiyxo (Author) commented Mar 27, 2018

FYI, the maximum batch size for 20b 256f on a 1080 Ti is 128. So I'm using batch size 128 and learning rate 0.0001 now.

gcp (Member) commented Mar 27, 2018

> Isn't this pretty much the same as using always-promote, except the training data doesn't come from the promoted network yet?

It has nothing whatsoever to do with always-promote.

It's basically the bootstrapping technique that was used before net2net (and some indication that the limited size of the networks does force them to "forget" more things).

odeint commented Mar 27, 2018

> It's basically the bootstrapping technique that was used before net2net.

I thought bootstrapping before net2net only trained on the last 500000 games, without moving through the whole history. @bjiyxo used net2net as well, just a bit further back in time, so that the bigger network could soak up more information on the way back to the present. I would have expected a net2net of the latest network, then trained on the latest data, to do better than his approach, but it seems not.

It seems the lack of match gating (there is looking at training loss etc., of course) didn't stop the 256x20 from getting stronger when just moving through the history of training data.

gcp (Member) commented Mar 27, 2018

> I thought bootstrapping before net2net only trained on the last 500000 games, without moving through the whole history.

No, it used the whole history. The whole point of net2net was to avoid this. But a net2net enlarged net still needs to be trained.

gcp (Member) commented Mar 27, 2018

> It seems the lack of match gating (there is looking at training loss etc., of course) didn't stop the 256x20 from getting stronger when just moving through the history of training data.

This argument doesn't work at all: the presence of match gating ensured that the training data monotonically increased in quality.

odeint commented Mar 27, 2018

> This argument doesn't work at all: the presence of match gating ensured that the training data monotonically increased in quality.

Interesting; to me, the quality of the training data was never the perceived point of the match gating. I thought the gating was there to prevent the network from getting worse. I saw the training data as "protected" by MCTS, which boosts the strength of the raw network by several hundred Elo. That would keep the data from getting worse than the raw network's, even across a few regressions in a row under always-promote.

isty2e commented Mar 27, 2018

> To ensure we always generate the best quality data, we evaluate each new neural network checkpoint against the current best network...

From the AGZ paper.

gcp (Member) commented Mar 27, 2018

> Interesting; to me, the quality of the training data was never the perceived point of the match gating. I thought the gating was there to prevent the network from getting worse.

These are the same thing. If a weaker network plays training games, the quality of that data gets worse. It doesn't change anything that MCTS boosts the strength: it's still weaker than with a stronger network.

> From the AGZ paper.

Yes, but discarded in the AZ paper, which is why there's so much discussion about it.

(I think it's completely and utterly irrelevant to this topic, as already said)

hydrogenpi commented Mar 27, 2018

@bjiyxo
Can you create a guide or tutorial on how you did this? I would like to try as well, but starting from scratch from the latest 85 network (as opposed to the rather dated network that you started from two weeks or more ago).

Yes, this seems super strong. On my work computer, CPU only, I'm running a single playout (raw network) and it's winning every single game.

herazul commented Mar 27, 2018

Maybe this is just the way to go: move to 20x256 with the usual training process to keep the community training flow going, and give everyone time to try, test, and improve all these new training methods and processes that aren't giving much in the way of results for now.

gcp changed the title from "I trained a 20b 256f network." to "I trained a 20b 256f network (93229e)" on Mar 27, 2018

gcp (Member) commented Mar 27, 2018

Server issue fixed (it was a Chrome bug that got into Node.js via V8, only recently fixed). Match is queued.

bood (Collaborator) commented Mar 27, 2018

Awesome! But I wonder how we can tell whether it is strong enough, given that the same number of playouts costs significantly different amounts of time on 20x256 and 10x128. Maybe we should implement some kind of time-controlled match in the future?

hydrogenpi commented Mar 27, 2018

@bood Well, if it doesn't win, we can assume it isn't strong enough. If it does win, then more accurate analysis is needed.

Friday9i commented Mar 27, 2018

Did you do some black magic ;-)? It leads 19-2 against the current best network! Let's wait and see (but I guess that with 3200 visits, it may come close to the top of CGOS...).

gcp (Member) commented Mar 27, 2018

It's "only" about 4-5 times slower in my benchmarking. So it only has to be about 100-200 Elo stronger to make up for that, IIRC. I think someone measured the Elo gain from doubling thinking time somewhere in an issue...
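For reference, the standard logistic model makes these conversions easy to sanity-check; a small sketch (the 19-2 score is from the comment above and carries huge error bars at only 21 games):

```python
import math

def elo_diff(win_rate: float) -> float:
    """Elo difference implied by a win rate under the logistic model."""
    return 400 * math.log10(win_rate / (1 - win_rate))

print(round(elo_diff(19 / 21)))   # ~391: the 19-2 result above, taken at face value
print(round(elo_diff(0.64)))      # ~100: a 64% win rate corresponds to ~100 Elo
```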

jkiliani commented Mar 27, 2018

Maybe, apart from actually promoting this net, there is valuable information here about how it got so strong. Could it be that the reduced variety of the games before the root-FPU fix hurt our last bootstraps?

Maybe diversifying the dataset could actually produce a stronger 192x15 net.

gcp (Member) commented Mar 27, 2018

I think there are also indications that net2net followed by training only on the most recent data isn't "good enough", in the sense that too much "good" historical data is lost. That might have been due to the lack of variety, but it might be other things.

It's very notable that this net doesn't even include most of the recent data...

marcocalignano (Member) commented Mar 27, 2018

Couldn't it just be that the network structure accounts for all the gain?

Dorus commented Jul 29, 2018

If you set one to 3200, you should probably set all to 3200.

3200 for promotion also makes the most sense, as that's the value used for self-play.

roy7 (Collaborator) commented Jul 29, 2018

New self-play is 1600 visits, but the original V20-2 vs. 15-block match was at 3200, so I changed the 2140aa3f match to 3200. The 96k one had already started, but if it does well we can always re-run the test at 3200. That might be an interesting result to compare anyway.

roy7 (Collaborator) commented Jul 29, 2018

I don't know if @bjiyxo is training his V* networks with 3200 or 1600 visits; I assume 3200. But server-based self-play tasks are now 1600 visits for 20-block networks.

roy7 (Collaborator) commented Jul 29, 2018

For science, I'm running another 5a5e7393 vs. d351f06e match at 3200.

wonderingabout (Contributor) commented Jul 29, 2018

@roy7 I'm also interested to see the influence of visits on win rate. If it's really different, it may be worth considering using the visit count that maximizes 20b performance and potential.

TFiFiE (Contributor) commented Jul 29, 2018

As the seemingly strongest 256x20, will 5a5e7393 be forcibly promoted now?

Dorus commented Jul 29, 2018

It's nonsense to think you need high visits in self-play to get good results from it. The only measurement you should use for the quality of self-play is how fast the network gains Elo relative to the computing power you invest.

Indeed, more visits increase the quality of the self-play games, but they also increase the cost. There is a sweet spot where you gain the most.

I also see the false assumption here that networks that perform better with more visits should be trained with more visits. This is not true. Larger networks always perform better with more visits; we've seen this in all our previous tests with them. Larger networks simply have a lower effective branching factor, so they benefit more from extra visits. This can be so extreme that they are weaker than a smaller network at low visits, but stronger at high visits.

More visits obviously increase self-play quality. With too few visits you won't even get very interesting games, as the noise won't have enough effect and the temperature might pick very random moves.

But there are also many (less intuitive) reasons why more visits are bad. More visits means fewer games, so less diversity. It means you either wait twice as long for the same number of games, or you use more positions from the same game, which leads to overfitting (where the network memorizes the game result instead of generalizing), and that is notoriously bad.

If you halve the visits, the same position will appear twice as often, but with a differently noised root. The value net will also be trained twice as fast, since it has two results to learn from as well as two follow-ups. This should compensate for the shallower search.

By the way, AGZ used 1600 visits, and for AZ, 800 visits was the sweet spot.
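A toy compute-budget model of the trade-off described above, with purely illustrative numbers: under a fixed number of network evaluations, halving the visits per move doubles the number of distinct self-play games.

```python
budget = 1_000_000_000        # total network evaluations available (illustrative)
moves_per_game = 250          # rough self-play game length (assumption)

for visits in (3200, 1600, 800):
    games = budget // (visits * moves_per_game)
    print(f"{visits:>4} visits/move -> ~{games:,} games")
# 3200 visits/move -> ~1,250 games
# 1600 visits/move -> ~2,500 games
#  800 visits/move -> ~5,000 games
```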

roy7 (Collaborator) commented Jul 29, 2018

@TFiFiE I do plan to, but when these matches wrap up I think I'll have it play a normal promotion match first to see how it does, and then force the promotion if it doesn't hit the 55% mark.

roy7 (Collaborator) commented Jul 30, 2018

Still early, but 43:21 (67.19%) is looking good for the V21-96k vs. V20-2 promotion match.
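A rough significance check on that score; this is just a normal approximation to the binomial, not the server's actual gating logic:

```python
import math

wins, games, threshold = 43, 64, 0.55
p = wins / games
half_width = 1.96 * math.sqrt(p * (1 - p) / games)     # 95% confidence interval
print(f"win rate {p:.2%} +/- {half_width:.2%}")        # 67.19% +/- 11.50%
print("clears 55% threshold:", p - half_width > threshold)   # True, just barely
```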

roy7 (Collaborator) commented Jul 30, 2018

Maybe I should adjust the training game counts for the manually trained networks, since they didn't really train on that total. Unsure if it would break something, though.

roy7 (Collaborator) commented Jul 30, 2018

5a5e7393 is very similar in strength to d351f06e in head-to-head play, but apparently knows the weaknesses of V20-2 much better. Its win rate vs. V20-2 seems (unless it drops a lot by 400 games) to be much higher than d351f06e's.

l1t1 commented Jul 30, 2018

LZ 159 passed; I hope it will produce new best weights without manual intervention.


l1t1 commented Jul 30, 2018

@bjiyxo Did your V21 weights use ELF training data? The official LZ weights are all trained without it.
I worry that the newly generated weights (such as f3d85364) are still weaker than LZ 159.

bjiyxo (Author) commented Jul 30, 2018

I will keep training until 256k steps. Since the 20x256 on the official website looks pretty weak (~20% win rate), I'm going to list some of the parameters that I used:

Batch size = 256
Learning rate = 0.0001 (1e-4)
The training window contains ~630k games: 350k LZ games and 280k ELF games.
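The same back-of-the-envelope as earlier in the thread (again assuming ~200 moves per game, which is my assumption): at these settings each position in the window is sampled about half a time on average.

```python
steps, batch_size = 256_000, 256
games, moves_per_game = 630_000, 200   # moves/game is an assumption

samples = steps * batch_size           # ~65.5M samples
positions = games * moves_per_game     # ~126M positions in the window
print(f"each position sampled ~{samples / positions:.2f} times")  # ~0.52
```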

roy7 (Collaborator) commented Jul 30, 2018

I"ll queue 192k now before bed. :)

Mardak (Collaborator) commented Jul 30, 2018

Here's the S16 move from the 3-3 knight joseki for the new 256x20 networks (the number after the arrow is visits; V: search win rate; N: policy prior):

regular 9.040M networks
 +32k   S16 ->      47 (V: 52.59%) (N: 46.39%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 C12 C3 C4
 +64k   S16 ->      48 (V: 52.80%) (N: 46.58%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 C12 C3 C4 D3 E4
 +96k   S16 ->      50 (V: 52.80%) (N: 48.07%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 C12 C3 C4 D3 E4
+128k   S16 ->      51 (V: 52.95%) (N: 48.52%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 C12 C3 C4 D3 E4 B4
+192k   S16 ->      49 (V: 52.86%) (N: 46.97%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 C12 C3 C4 D3 E4 B4
+256k   S16 ->      49 (V: 52.89%) (N: 47.31%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 C12 C3 C4 D3 E4 B4

bjiyxo
20-2    S16 ->      63 (V: 52.85%) (N: 61.35%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 O15 N15 P15 N16 C12 Q13 C5 D6 F4
21+ 96k S16 ->      66 (V: 52.66%) (N: 64.45%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 O15 N15 P15 N16 C12 Q13 C5 D6
21+128k S16 ->      66 (V: 52.73%) (N: 64.34%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 O15 N15 P15 N16 C12 Q13 C5 D6
21+192k S16 ->      66 (V: 52.72%) (N: 64.10%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 O15 N15 P15 N16 C12 Q13 C5 D6
21+256k S16 ->      67 (V: 52.54%) (N: 65.63%) PV: S16 R18 T17 Q18 S17 N18 O16 N17 R14 O14 Q12 C6 O15 N15 P15 N16 C12 Q13 C5 D6
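
In case anyone wants to diff these lines across networks automatically, here is a small parser for the move-stat format shown above (a sketch that assumes the "move -> visits (V: ...) (N: ...) PV: ..." layout):

```python
import re

STAT = re.compile(
    r"(?P<move>[A-T]\d+)\s*->\s*(?P<visits>\d+)\s*"
    r"\(V:\s*(?P<value>[\d.]+)%\)\s*\(N:\s*(?P<policy>[\d.]+)%\)\s*PV:\s*(?P<pv>.+)"
)

def parse_stat(line: str) -> dict:
    """Extract move, visits, search win rate, policy prior, and PV from one line."""
    m = STAT.search(line)
    return m.groupdict() if m else {}

print(parse_stat("20-2    S16 ->      63 (V: 52.85%) (N: 61.35%) PV: S16 R18 T17"))
# {'move': 'S16', 'visits': '63', 'value': '52.85', 'policy': '61.35', 'pv': 'S16 R18 T17'}
```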
wonderingabout (Contributor) commented Jul 30, 2018

Interesting results. It seems 1600 vs. 3200 visits doesn't significantly affect the match win rate, as long as both networks play under the same conditions.

However, LZ 157 has a 72.5% loss rate vs. ELF at 3200 visits, while LZ 159 has a 79.5% loss rate vs. ELF at 1600 visits.

LZ 157 and LZ 159 were measured to be of more or less even strength, and yet by decreasing the visits to 1600, LZ 159 loses 7% win rate against ELF, which is natural since ELF plays with one visit.

Conclusion: lowering visits has a significant impact on strength (but we already knew this).

Looking forward: overall, I think it will be interesting to see how lowering the visit count affects game generation, quality of games (ladders, etc.), diversity (probably more diversity with lower visits), and learning speed.

One last question, @roy7: since LZ 158 was weaker, is it possible to remove all the training data that was generated by it?

jkiliani commented Jul 30, 2018

Removing LZ 158 training data would be a big mistake. It's evident from the self-play games that the bootstrap has specific weaknesses, with poor moves regularly receiving a significant visit count, although they usually aren't the most-visited move. This can be seen in the much higher level of blunders in self-play compared to matches, for both 158 and 159.

Only training on those self-play games, including their blunders, will fix these weaknesses and propel the bootstrap nets forward. Throwing any of the 256x20 games away would waste a lot of potential progress.

wonderingabout (Contributor) commented Jul 30, 2018

@jkiliani I understand, but LZ 158 only trained for one day; that's not a big amount to sacrifice. LZ 159 has a stronger basis for learning the same things LZ 158 would learn, so isn't it better to start directly from LZ 159? One day passes fast.

Also, about visits, an interesting way of seeing it: you can think of lowering visits as training with weights (as in sport, where athletes train with added body masses, no pun intended!). At first the win rate vs. ELF will be lower because of the lower visits, but once the 20b networks pass the 50% mark, at 3200 visits it will be much higher.

So I suggest that ELF matches vs. 20b networks be played, exceptionally, at 3200 visits, to keep a comparison point with the previous 15b networks. Another reason for this suggestion is that 3200 visits may show a slightly different playstyle, which would be interesting to compare with the usual playstyle at 1600 visits.


diadorak commented Jul 30, 2018

@bjiyxo Maybe it's better to open a new issue, now that your 20x256 networks are official? This issue is too big, and things are kind of hard to find. Thanks!

bjiyxo (Author) commented Jul 30, 2018

@diadorak I opened a new issue: #1674.
