
40B v 15B #1810

Closed
qzq1 opened this issue Sep 5, 2018 · 28 comments

qzq1 commented Sep 5, 2018

I matched 174 (40B) and 157 (15B) with 1 visit each for 100 games, CPU only. The 40B weights won 86.0% of the games - almost exactly the expectation based on the Elo difference between the two.*

What I find odd is that every game of the 100 was identical through 43 moves, even though the colors were alternating. It takes my computer an hour to play games at 500 visits, but the couple completed so far are identical to the 100 through 24 moves (all star-point openings, a double approach by B in one corner, etc.). Is this an artifact of the training, or a result of my sad machine being overwhelmed? I can't make sense of it.

*Actually, based on a modification of the win-likelihood formula, adjusted from my experience matching other versions: P(A) = 1/(1 + 10^m), where m = (B_r - A_r)/800.
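In code, for anyone who wants to check the expectation (a minimal sketch; the ~650-point gap is the approximate rating difference between these two nets, and note the standard Elo formula divides by 400 rather than 800):

```python
def expected_score(rating_a, rating_b, divisor=800):
    """Win probability of A: P(A) = 1/(1 + 10^m), m = (B_r - A_r)/divisor."""
    m = (rating_b - rating_a) / divisor
    return 1.0 / (1.0 + 10.0 ** m)

# With a ~650-point gap, the 800 divisor predicts ~86.7% for the stronger net,
# very close to the observed 86.0%; the standard 400 divisor would predict ~97.7%.
print(expected_score(650, 0))       # ~0.867
print(expected_score(650, 0, 400))  # ~0.977
```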

roy7 (Collaborator) commented Sep 5, 2018

With 1 visit, the only randomness comes from the random rotation of the board that is fed to the network. Beyond that, the highest-prior move will always be the same, and thus the games will repeat, since that first visit is all the search can do.
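Schematically, something like this (an illustrative sketch, not the engine's actual code; `net.evaluate` is a hypothetical stand-in for the network evaluation):

```python
import random

def one_visit_move(net, board):
    # The only source of nondeterminism: one of the 8 rotations/reflections
    # of the board is chosen at random before the network evaluation.
    symmetry = random.randrange(8)
    priors, _value = net.evaluate(board, symmetry)  # dict: move -> prior probability
    # With a single visit there are no playouts to reweight anything, so the
    # move with the highest policy prior gets played every time.
    return max(priors, key=priors.get)
```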

qzq1 (Author) commented Sep 5, 2018

Thanks for the response. I anticipated that every second game might start nearly the same. I was more surprised that when the colors switched, the games were still so similar for so long. I thought the difference in the structure of the weights, and the more than 650 Elo difference, might have shown up in opening tendencies, which is what I was curious about. But I suppose in the grand scheme of things - i.e., the number of training games - there isn't that much difference yet.

jkiliani commented Sep 6, 2018

Can the 40b win against 15b at equal time? I presume it should, but it's likely not that decisive... did you try this as well?

qzq1 (Author) commented Sep 6, 2018

@jkiliani, I don't have the equipment to look at that question properly (CPU-only, it takes very long to run games at a decent number of visits), but based on a cursory look with very low visits, I find 15B to be stronger, even with less time. I'm running 100 games now, allowing 2 visits for 40B and 20 visits for 15B. Even so, 40B uses twice as much time, and 15B has won 11 of 16 games so far.

For me, it's no contest for everyday use: I would always use 15B for live analysis or game review. I have no idea whether a GPU would narrow the performance gap, though it would obviously make 40B more practical for my purposes.

qzq1 (Author) commented Sep 6, 2018

Following up on my original observation about repeated openings: spot-checking recent 40B-40B match games, I have yet to find a single one that does not begin with Black star point, White diagonal star point, Black star point (followed more often than not by another White star point). That unchanging three-move opening strikes me as odd; there's clearly plenty of variation in the training games.

Marcin1960 commented:

My dream would be to let 40b play self-play games, accumulating them and advancing (hopefully, clients will not desert), and then to train a 20b (I hope it will not be dumped as 15b was) and a 15b on those games.

Vargooo commented Sep 6, 2018

jkiliani said:

> Can the 40b win against 15b at equal time?

No, see HERE for a 60-game match at equal time between networks 157 and 174.

Strappa71 commented:

@qzq1 Is that really that surprising? Training games have randomization built in to create variation. When you let LZ try to find the "best move" with very few visits, less variation is exactly what happens.

qzq1 (Author) commented Sep 6, 2018

@Strappa71 Regarding my first post and follow-up: what I found a bit surprising was that both LZ 40B and 15B played the same fixed openings through at least 43 moves, whether as Black or White. I thought there would be enough difference between the two that that would not be the case. Upon reflection, I guess they are still quite closely related, genealogically, so perhaps it's not a great surprise.

Regarding my last post, on the 40B - 40B matches: those are games with 1600 visits, and I do find it odd that the first three moves never vary. Even if it's not a surprise given the underlying methodology, it's fair to ask - speaking as a go player - whether that indicates a rigidity that makes LZ less useful and interesting. As gcp has noted elsewhere, LZ generates a lot of variety in 'second best' moves, so it's not a great concern for practical purposes. But I do think it's something to note. Granting that the available AlphaGo games were curated, and not necessarily representative, they display more variety in openings.

Splee99 commented Sep 6, 2018

To my understanding, the net2net conversion doesn't introduce any new knowledge to the neural network. That is what happened when we initially bootstrapped the 15b to 20b and then to 40b. To realize the full potential of the 40b network, we have to do A LOT of self-play training.
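For anyone unfamiliar with it: net2net grows a trained network in a function-preserving way, so the widened net initially computes exactly what the smaller one did, and any added strength has to come from subsequent training. A rough sketch of the Net2WiderNet idea for a single fully connected layer pair (illustrative only, not the actual conversion script; my assumption is that the real bootstrap applied the analogous transform to the convolutional residual blocks):

```python
import numpy as np

def widen_layer(W_in, W_out, new_width):
    """Widen a layer from W_in.shape[1] units to new_width units,
    preserving the network's output exactly.

    W_in:  (fan_in, old_width)  incoming weights of the layer being widened
    W_out: (old_width, fan_out) outgoing weights to the next layer
    (Biases of duplicated units would be copied the same way as W_in columns.)
    """
    old_width = W_in.shape[1]
    # Each new unit copies a randomly chosen existing unit.
    mapping = np.concatenate([np.arange(old_width),
                              np.random.randint(0, old_width, new_width - old_width)])
    counts = np.bincount(mapping, minlength=old_width)
    W_in_new = W_in[:, mapping]                            # duplicate incoming weights
    W_out_new = W_out[mapping, :] / counts[mapping, None]  # split outgoing weights among copies
    return W_in_new, W_out_new
```

Because each duplicated unit's outgoing weights are divided by its replica count, the sum feeding the next layer is unchanged - which is exactly why the conversion adds capacity but no knowledge.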

maxinjapan commented Sep 6, 2018

@qzq1 I too find too many identical openings (up to move 37 or so), even in current match games.
I think it kind of makes sense, actually. The other moves are second (or third, or fourth, etc.) best, and as such there is no reason for LZ to play them in a match. If you already knew one move was better than the others, even if only marginally, which one would you play?

qzq1 (Author) commented Sep 6, 2018

@maxinjapan I understand that in match play LZ always chooses the move it evaluates as best, and that for a given set of weights that move is likely to be unchanging in identical circumstances. Yet, if there is to be any advance in strength, we expect to see changes in play from one set of weights to another. One question is whether there is any "true" superiority in, for example, Black playing move 3 on the star point rather than on one of the 3-4 points. I doubt that there is as a matter of principle, so either LZ is stuck at a local minimum and missing something (until lots of accumulated training games get it out of the trough), or, if there is no meaningful difference in true value between the alternative moves, there is simply no way to train LZ off its initial choice. Possibly we won't ever know for sure.

gjm11 commented Sep 6, 2018

In training self-play, the moves are randomized somewhat, and quite a lot of games will have (say) Black playing 3-4 on move 3. If those games turn out better on average than games where Black plays 4-4, the network will learn this and start playing 3-4 more often.
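Schematically (a sketch with assumed numbers; I believe this is what the -m/--randomcnt self-play option controls, randomizing roughly the first 30 moves):

```python
import random

def select_move(visit_counts, move_number, random_moves=30):
    """Early moves: sample in proportion to MCTS visit counts.
    Later moves: play the most-visited move greedily."""
    moves, visits = zip(*visit_counts.items())
    if move_number < random_moves:
        return random.choices(moves, weights=visits)[0]
    return moves[visits.index(max(visits))]

# e.g. on move 3, a 3-4 point with a decent visit count still gets played
# a fair fraction of the time in self-play:
print(select_move({"Q16": 900, "R16": 650, "C4": 50}, move_number=2))
```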

maxinjapan commented:

@qzq1 @gjm11 Or maybe LZ has already learned that starting at 4-4 is better.
I'd leave the answer to the 9-dans around here - this is way above my level.

gjm11 commented Sep 6, 2018

@maxinjapan If the network strongly prefers playing 4-4 to playing 3-4 (which I think it does), then that's good evidence that playing 4-4 is better than playing 3-4, at least given the network's other strengths and weaknesses. I was arguing (against qzq1, and agreeing with you, unless I'm confused) that we shouldn't be too worried that LZ might just have randomly latched onto 4-4 in preference to 3-4 with no way of discovering if it's wrong. I think it will discover if it's wrong, because self-play explores a wide variety of openings; I think it would have discovered by now if it were systematically playing bad openings; and I think we should probably trust that whatever openings it plays all the time really do work better (at least for LZ playing against LZ) than other options.

maxinjapan commented:

@gjm11 yes, that was my point too :-)

kuba97531 (Contributor) commented Sep 7, 2018

I have run a local test of 256 games at time parity between 15B (157) and 40B (174).
15B won 134 to 122 (52.3%), which suggests that neither net is considerably stronger (see the quick check after the list below).

  • Run on a relatively recent next branch.
  • Both 157 and 174 had identical GPUs (1080 Ti), with 1 second per move and pondering enabled.
  • In 1 second, 174 was getting roughly 330 playouts/s, and 157 about 3-4 times as many.
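
A quick back-of-the-envelope check of how conclusive 134 wins out of 256 is (normal approximation to the binomial plus the standard 400-divisor Elo conversion; a sketch, not a rigorous test):

```python
import math

wins, games = 134, 256
p = wins / games
se = math.sqrt(p * (1 - p) / games)            # standard error of the win rate
low, high = p - 1.96 * se, p + 1.96 * se       # ~95% confidence interval

def elo(q):
    return 400 * math.log10(q / (1 - q))       # win rate -> Elo difference

print(f"win rate {p:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
print(f"Elo {elo(p):+.0f}, CI [{elo(low):+.0f}, {elo(high):+.0f}]")
# -> about +16 Elo for 15B, with a CI spanning roughly -26 to +59:
#    entirely consistent with the two nets being equal in strength.
```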

I have uploaded the logs, and SGFs generated from the logs, here:
https://kklzhost.com/leela-zero/157_vs_154_time_parity.zip

The openings are very repetitive, but completely different depending on which net plays Black and which plays White.

hred6 commented Sep 9, 2018

In the *.log file I see:

Half precision compute support: NO

Why is that?

l1t1 commented Sep 16, 2018

Requesting a test of the 177 weights.

qzq1 (Author) commented Sep 17, 2018

I ran 40B (177) with 100 visits versus 15B (157) with 600 visits. No GPU, no pondering; each averaged about 13.7 seconds per move (I wish I had better hardware).
After 156 games, 40B (177) had won 20 games (17%): 19 as White, 1 as Black. That's not statistically different from what I got with LZ176, which isn't too surprising given the low number of games; in any case, no striking difference for the newer 40B weights.

l1t1 commented Sep 20, 2018

qzq1 (Author) commented Sep 20, 2018

A great effort by xela posted to lifein19x19, but it's unfortunate that the 157 weights weren't part of the test. Also, I would prefer more games for a smaller set of engines.
I ran my even-time comparison using 178 vs 157 overnight. Only 45 games, with a win rate of about 18% for 178 - not distinguishable from the results for 177, though the opening patterns were somewhat different.

Splee99 commented Sep 21, 2018

My impression is that 40b is too optimistic about winning kos. Very often, when a ko is lost, the winrate drops considerably. More training in that area will be needed for an overall strength increase.

Marcin1960 commented Oct 1, 2018

Now testing ELF 240x20 against the new 128x10 (LeelaDan vs LeelaZeroT on KGS). Same time, about 30 sec per move.

#1889

gjm11 commented Oct 1, 2018

@Splee99 It seems like correct evaluation of a position with a big ko is likely to depend on detailed tactical reading -- does the 40b network do better when given plenty of time to think?

Splee99 commented Oct 1, 2018 via email

l1t1 commented Oct 3, 2018

gcp closed this as completed Oct 23, 2018