
Scalability: 40b is now the best LZ net for game analysis and long games #1914

Friday9i opened this issue Oct 5, 2018 · 50 comments

@Friday9i commented Oct 5, 2018

Update 2018/11/12:

  • Addition of LZ188 vs ELFv1 (light purple curve) + comments at the end
  • Addition of LZ184 vs LZ174, showing reasonable improvement of 40b over 10 new nets :-)
  • Addition of the 10K-visit ratio for LZ126 vs LZ118, to check high-visit trends (it takes time...!)

And a special thanks to @herazul for useful contributions to draw the curves for LZ184 and 188! And thanks also to @AncalagonX who helped a little bit too.

Previous update (2018/10/15): LZ181 vs ELFv1 added (purple curve) + a comment at the end.

Here is an update of the scalability of nets:
[graph: net scalability curves]

Legend:

  • x-axis: number of visits for the first net (e.g. LZ181 for the LZ181/LZ157 curve)
  • y-axis: ratio of visits the second net needs to reach a 50% winrate (e.g. for LZ181/LZ157, with 100 visits for LZ181 the curve is at 3 because LZ157 needs ~3 times more visits to be as strong as LZ181, i.e. to reach ~50% winrate); see the sketch below for how this ratio is estimated.
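
For reference, a minimal sketch (in Python, with made-up numbers) of how such a ratio can be estimated: measure the winrate of the second net at several visit counts against the first net at a fixed visit count, then interpolate to find where the winrate crosses 50%. The visit counts and winrates below are purely illustrative:

```python
import numpy as np

# Hypothetical match results: LZ157 (second net) vs LZ181 fixed at 100 visits.
# Each entry: (visits given to LZ157, measured winrate of LZ157).
results = [(100, 0.30), (200, 0.42), (400, 0.55), (800, 0.68)]

visits = np.array([v for v, _ in results], dtype=float)
winrates = np.array([w for _, w in results])

# Interpolate log(visits) as a function of winrate and find the 50% crossing.
visits_at_50 = np.exp(np.interp(0.5, winrates, np.log(visits)))

print(f"LZ157 needs ~{visits_at_50:.0f} visits for a 50% winrate")
print(f"ratio plotted on the y-axis: {visits_at_50 / 100:.2f}")
```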

Interpretation:

  • With 1 visit, larger nets are much better than smaller nets. E.g. LZ157 needs quite a lot of visits to compete with LZ181 at 1 visit: about 13 visits are needed for LZ157!
  • With more visits, from about 30 to about 300 visits, the ratio needed to compete goes down to a minimum, and that minimum is quite flat (from ~30 to ~300 visits). In the case of LZ181 vs LZ157, LZ157 needs only ~3 times more visits to compete with the larger net (but as it is about 4x faster, this confirms LZ157 is still a bit better at time parity with relatively low visit counts).
  • With even more visits, larger nets scale systematically better than smaller nets. For 40b vs 15b this curve goes up too, but still relatively slowly: the ratio rises to ~4 with 1000 visits and ~5 with 3000 visits (i.e. ~16K visits needed for LZ157 when LZ181 has 3000 visits). Hence, LZ181 is better than LZ157 at time parity once it has more than ~1000 visits.

Updated conclusion: LZ181 & LZ188 are now the best LZ nets for game analysis and long games, as they scale better than smaller nets (i.e. a U-curve, going up above ~100 visits). 40b nets are now better than LZ157, the best 15b net :-).
Update note for LZ vs ELF: unfortunately, after testing, LZ doesn't scale well against ELF, as the curve decreases above ~100 visits! So ELF should still play better than LZ on strong hardware (at time parity) ;-(. That's surprising, as recent LZ nets are significantly better than older ones and scale well against each other (e.g. LZ184 vs LZ174). From issue #2011, I guess this poor performance against ELF is a consequence of still training LZ nets with ELF games: hopefully the situation should normalise, with better scaling for LZ vs ELF, soon after ELF games are removed from training. We'll see in a few weeks whether that's the case, and of course I'll test again around LZ195 (?) vs ELFv1 to see if scalability improves.

And to conclude, the same graph on a log-log scale:
[graph: log-log scale]

Old comments (still valid but not up-to-date as of 2018/11/12):
An interesting observation: the ratio always increases quite linearly above ~1000 visits (I may try to test one of the curves with 10K vs more visits, possibly LZ126/LZ118 as it would be the fastest to compute). Update: done, and it seems to confirm the hypothesis!
Additional comment regarding LZ181 vs ELFv1: unfortunately, the new 40b does not scale well against ELF yet, as ELF only needs ~1.5x more visits to compete against LZ181@3K visits. TBH, I'm quite disappointed...
But on the positive side, it means our 40b baby has a lot of potential to grow: after proper training, the curve should be much better. Judging from previous net-size upgrades, ELFv1 would probably need at least 15K visits, if not 100K, to be on par with a well-trained 40b net @3K visits, i.e. 5x to 30x better than LZ181: fingers crossed, the future should be bright for 40b!
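
If the log-log curve really is a straight line above ~1000 visits, the ratio follows a power law, ratio ≈ a · visits^b, which can then be extrapolated. A minimal sketch, assuming that linearity holds and using the ~4 at 1000 visits / ~5 at 3000 visits values read off the LZ181/LZ157 curve above:

```python
import numpy as np

# Two points read off the LZ181/LZ157 curve: ratio ~4 at 1000 visits, ~5 at 3000.
v = np.array([1000.0, 3000.0])
r = np.array([4.0, 5.0])

# A straight line in log-log space is a power law: ratio = a * visits**b.
b, log_a = np.polyfit(np.log(v), np.log(r), 1)
a = np.exp(log_a)

print(f"ratio(visits) ≈ {a:.2f} * visits^{b:.2f}")
print(f"extrapolated ratio at 10K visits: {a * 10000 ** b:.1f}")  # to be confirmed by real matches
```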

@MaurizioDeLeo commented Oct 6, 2018

Very nice work, and a lot of data. Kudos !

In terms of presentation (if this were a scientific paper) I would remove the curves with same-size nets, where the ratio is always above 1 (e.g. 126/118). They simply mean that one net is stronger than the other (assuming that speed changes only with net size).

Then focus on only one net per size (let's say LZ91, LZ116, LZ157, ELF, LZ181) and plot the ratios in the same way, maybe going down two sizes (LZ116/LZ91, LZ157/LZ91, LZ157/LZ116, ELF/LZ116, ELF/LZ157, LZ181/LZ157, LZ181/ELF).

Then plot some horizontal lines in the same colors, representing the ratio of visits at time parity (e.g. 4 for the pair LZ181/LZ157). This would make it easy to see in which range each size is stronger.
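
A minimal matplotlib sketch of that presentation (the curves and the time-parity ratios below are placeholder values, just to show the layout, not measurements):

```python
import matplotlib.pyplot as plt

# Placeholder data: visits for the larger net vs the ratio of visits the
# smaller net needs to reach a 50% winrate.
curves = {
    "LZ181/LZ157": ([100, 300, 1000, 3000], [3.0, 3.2, 4.0, 5.0]),
    "ELF/LZ157":   ([100, 300, 1000, 3000], [2.0, 2.5, 3.5, 4.5]),
}
# Assumed ratio of visits at time parity for each pair (e.g. ~4 for LZ181/LZ157).
time_parity = {"LZ181/LZ157": 4.0, "ELF/LZ157": 2.0}

for i, (name, (visits, ratio)) in enumerate(curves.items()):
    color = f"C{i}"
    plt.plot(visits, ratio, marker="o", color=color, label=name)
    plt.axhline(time_parity[name], color=color, linestyle="--")  # time-parity line

plt.xscale("log")
plt.xlabel("visits of the larger net")
plt.ylabel("visit ratio needed by the smaller net for 50% winrate")
plt.legend()
plt.show()
```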

@MaurizioDeLeo commented Oct 6, 2018

Also, your data seems to suggest that when changing the size of the network, and testing against a smaller network, it would be useful to test:

  • At time parity
  • At around 50 visits for the larger network.

In case the larger network wins, there are good chances that it is stronger at every time control.

@wonderingabout (Contributor) commented Oct 6, 2018

Interesting work, well documented, but I have a few questions:

  1. You say "LZ181 is now the best"; do you mean that was not the case with LZ180 and earlier networks? If so, could you add another 40b network to your graph to see how different they are?

  2. To get so many points on your curves, did you play the games in full at different visit settings, or did you compare the winrate estimations given by both networks at a precise move? (Because we know these evaluations are often optimistic.)

  3. For example, for LZ181 vs LZ157, if we play 200 games at 100 visits for LZ181 and 300 visits for LZ157, will the win/loss be around 100/100? Did you play that many games to verify?

@Friday9i (Author) commented Oct 6, 2018

Thanks for your feedback!
@MaurizioDeLeo I kept the curves as they are but added the net sizes in the legend (e.g. "LZ126(mid192x15)", meaning it's an intermediate-strength net of size 192x15). That should clarify things. Regarding the horizontal lines, true, but the ratio depends on the net, so it's not so easy (I may use colors for that, will give it a try).
@wonderingabout I was not comparing LZ181 vs LZ180 as they are quite close; it was a more general comment on the current situation of 40b vs the best 15b. Adding another curve takes a lot of time: 500 games per point on the graph (except on the right, with >1000 visits, where I generally play "only" around 100 games because it is soooo slow).
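
As a side note on those game counts, a rough sketch of the 95% confidence interval on a measured winrate (plain normal approximation of the binomial), comparing 500 games per point with 100 games per point:

```python
import math

def winrate_ci(winrate, games, z=1.96):
    """Approximate 95% confidence interval of a measured winrate."""
    se = math.sqrt(winrate * (1 - winrate) / games)
    return winrate - z * se, winrate + z * se

# Around a 50% result, the interval shrinks with the square root of the game count.
print(winrate_ci(0.5, 500))  # ~(0.456, 0.544): about +/- 4.4 points with 500 games
print(winrate_ci(0.5, 100))  # ~(0.402, 0.598): about +/- 9.8 points with 100 games
```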

@wonderingabout (Contributor) commented Oct 6, 2018

@Friday9i
I understand, thank you.
You may want to add LZ174 (the first official 40b network) to make comparisons with LZ181, but it takes time.

Also, I would call LZ181 a weak 40b, as a top 40b would be AlphaGo, and a mid 40b something more like FineArt.

@Friday9i (Author) commented Oct 6, 2018

I'm currently adding LZ181 vs ELFv1, and I will also add LZ157 (best 192x15) vs LZ116 (best 128x10). Possibly I'll do LZ174 vs LZ157.
Regarding LZ181 being weak or mid strength for a 40b, I hesitated: it's the 8th 40b net... Not a big deal anyway.

@qzq1 commented Oct 6, 2018

@Friday9i terrific work, thank you! Can you tell us the hardware you're using? Also the relative time used for LZ181/157, say at the 100/300 visits point on your graph? In my own match tests, on a basic laptop with no GPU, I find LZ181 has about a 25% winrate vs LZ157 at equal time, which I approximate using 200 visits for LZ181 and 1200 visits for LZ157 (based on more than 100 games). Hence, I still default to LZ157 for quick game analysis (though really it makes little difference for my purposes).

@Friday9i (Author) commented Oct 6, 2018

I'm using a GTX1080 (not Ti) and a quad core. For LZ181@100 visits vs LZ157@350 visits, it's about equivalent time (71min for LZ181 vs 79min for LZ157, for a total of 291 games).

@petgo3 commented Oct 6, 2018

Great work, but watch out for hardware differences. Time parity means quite different proportions with different GPUs. So the extremely smart figure above is only true for a plain GTX 1080 (not Ti).
I think this is reasonable because the part of the time consumed by the CPU is far more stable than the GPU part.
E.g. with a GTX 1060 we will not have time parity with LZ181@100 vs LZ157@350 visits.
LZ181@100 would be significantly slower, I believe.

@zhanzhenzhen (Contributor) commented Oct 12, 2018

Please test ELF against 40b. This will be the most important test.

@petgo3 commented Oct 13, 2018

@Friday9i "the ratio increase always quite linearly above ~1000 visits".
If this is provable, we could predict relative strength of more visits. Did you try some matches above 3000 visits to prove the linear behavior?

@Friday9i (Author) commented Oct 13, 2018

@zhanzhenzhen Currently doing ELFv1 vs LZ181, results in a few days :-). I've got the 1K point, but testing the 3K point will take some more time...
@petgo3 The next step above 3K visits is 10K visits... If the ratio is ~5, that means games at 10K vs 50K visits: it takes a huge amount of time to play 100 games with these settings, especially with 40b... So no, I didn't test it ;-(.

@petgo3 commented Oct 13, 2018

@Friday9i: I would suggest just one combination of smaller weights to check linearity. (I apologise, I only have quite mediocre hardware, so I can't help at all.)

@ALL: Perhaps someone else could help?
We would then be able to extrapolate strength for the current weights and high numbers of playouts too. :-)

@Friday9i (Author) commented Oct 13, 2018

Yeah, I may try on smaller nets with high visits.
But I'd also like to make the graphs for LZ157 (best 192x15) vs LZ116 (best 128x10), ELFv1 vs LZ157, and also one or two 20b nets (even if they weren't well trained when we switched to 40b). So, many things to try...
Regarding tests of log-log linearity above 3K visits, the fastest to try would be LZ126 with 10K visits vs LZ118 with ~27K visits, but those are two 128x10 nets, not ideal. Otherwise LZ181 with 10K vs LZ157 with ~66K would be nice to test, but very slow... If anyone wants to test, that would be nice!

@Friday9i (Author) commented Oct 15, 2018

Post updated with LZ181 vs ELFv1

@herazul commented Oct 16, 2018

Weird results against ELF for sure! Maybe the scaling looks weird because 40b is only marginally better at equal visits.
It will be interesting to see, in a few promotions, if playout strength starts to take effect and we see the same cumulative effect of playouts kicking in against ELF between 500 and 1000 visits.

@Friday9i (Author) commented Oct 16, 2018

@herazul Of course I'll test again in a few promotions, probably around LZ190, against ELFv1. And hopefully, it will begin to show an increasing visit ratio above ~1000 visits! If that is the case, our 40b will become the undisputed champion.
On the other hand, I'm wondering if our current training process isn't partially hindered by too much randomness (i.e. the "-m 999")... That could explain these disappointing results! For sure, it's useful and necessary to get random moves, in order to discover new moves, which are by definition unexpected until they are tested (and show good results)! However, too much noise degrades the quality of the eval function... To "correct" that and also make better match games, I've got 3 ideas (easy to implement!) which could possibly help the training and net-selection processes a lot:

  1. Why not play a % of the selfplay games with -m 50, for example (and keep the other games with -m 999)? These -m 50 games should help the eval function to be more precise, as they will not be polluted by mistakes later in the game...
  2. Just using "-m 999 --randomvisits 1" avoids the very obvious mistakes that get no visits at all. However, it may be a pity never to play obvious unvisited mistakes (e.g. to re-learn that self-ataris are generally stupid ;-). But on the other hand, with these parameters most of the random moves are really bad and not interesting: I think the exploration should focus most of the time on "somewhat interesting alternative moves". For that, "-m 999 --randomvisits 50" should help a lot: it is still random AND it focuses most of the energy on exploring somewhat interesting moves (since they received at least 50 visits): isn't that very desirable? Hence, we could for example split the selfplay games into 10 groups with different "--randomvisits" parameters: 10% with just "-m 999" (i.e. --randomvisits 0: very random games, testing "probably stupid moves"), 10% with "-m 999 --randomvisits 1", 10% with 2, then 4, 8, 16, 32, 64, 128, and finally 10% of the games with "--randomvisits 256" (i.e. excellent-level games focusing all the random exploration on near-best moves that received at least 256 visits!).
     More generally, 1) and 2) could/should be combined: selfplay games played with either -m 50 or -m 999, and with 10 groups of "--randomvisits" parameters between 0 and 256 (see the sketch after this list). Using -n is useful too.
  3. For match games, I tried "-m 50 --randomvisits 50" (and also -n) and it works well: the games are all different, and they are not plagued by very crude mistakes (just by some "not optimal choices"). I don't know exactly which parameters are used for match games (just -n?), but there seems to be little variability, which can possibly lead to bad selection of new nets. Hence, why not use "-m 50 --randomvisits 50 -n" for match games?
     What do you think of these ideas?
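
A minimal sketch (in Python) of how such a per-game split could look, assuming the server can hand out per-game options; the helper below is purely illustrative and not part of autogtp:

```python
import random

# The ten equally likely --randomvisits buckets proposed in idea 2.
RANDOMVISITS_BUCKETS = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]

def selfplay_options():
    """Pick the leelaz randomness options for one selfplay game (illustrative only)."""
    opts = ["-n"]                                  # keep noise enabled
    opts += ["-m", random.choice(["50", "999"])]   # idea 1: mix -m 50 and -m 999 games
    rv = random.choice(RANDOMVISITS_BUCKETS)       # idea 2: spread --randomvisits values
    if rv > 0:
        opts += ["--randomvisits", str(rv)]
    return opts

print(selfplay_options())  # e.g. ['-n', '-m', '999', '--randomvisits', '64']
```
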
@herazul commented Oct 16, 2018

You need GCP to respond, I have no clue ^^
The problem is, these ideas make sense, but I have learned from GCP and this project that just because an idea "makes sense" doesn't mean it will be successful.
I was baffled when I saw the self-play games produced after the switch to full randomness, and I thought the self-play games were so random and of such bad quality that they would crush the training process. But the result was that the net continued to improve like nothing happened. So what do I know!

Ultimately, the only thing I think is certain for self-play games is that the level of the net generating the games is important, as this has been tested several times (even with older nets and newer games, for example). I think we have no consensus about any of the other parameters (number of visits, randomness, diversity, games from mixed nets with ELF for example...), as the training process seems to be able to learn from pretty much everything GCP has thrown at it.

Problem is, we would need to test these ideas, and the only way to test would be to have hundreds of computers contributing to test pipelines to see the results, but we don't have that.
I think we will have possibilities to test things when 40b stalls, or maybe with Minigo runs when Andrew tries gating one (the next Minigo run will be gated, according to Andrew).

@Friday9i (Author) commented Oct 16, 2018

@herazul: thanks for answering, I share your view and your amazement (especially at the results from full randomness: as unbelievable as I find it, it works!?! Makes me more humble too: it always hurts, but I have to admit my intuition totally failed on that one :-).
Regarding the ability to test the suggested ideas, I'm optimistic: it should be quite easy (if parameters can be sent from the server to clients without a new version of autogtp; otherwise it is a pain)! If GCP agrees, he could just turn on these parameters for selfplay games: if the rhythm of network promotions increases significantly, it is useful and works efficiently. If not (or worse, if it stalls), then it's not a good idea... Easy, isn't it?

@herazul commented Oct 16, 2018

It's maybe easy to try, but it can be difficult to evaluate properly. Our first full window of 40b games is only just complete; we've had a few promotions, and lately 2 lucky ones that were hard to beat. We could also have a few unlucky runs. (Simply put: the sample size is small.)
It can be difficult to evaluate whether a solution we try is better, because we would have to wait for a lot of self-play games to be generated, and for promotions, before we could start to evaluate the rate of progress, and even then we would have a hard time figuring out how the usual training process would have gone. (Simply put: we have no control group.)
It's easier when the run is stalling, because then, if we go from no progress to any kind of progress, we can conclude that the solution helped.

Another possible solution when we stall, besides tuning randomness, is increasing the number of visits back to 3200 or more.

@Friday9i (Author) commented Oct 16, 2018

Always hard to evaluate without a control group, agreed, and a control group is not an option because it would be too demanding.
However, the rhythm of new nets should be a clear indicator within about a month: if we get more than 10 new nets, it's very promising. Around 7 new nets, no clear impact. Below 4 new nets, it is very probably inefficient (and should probably be reversed).
But basically, when testing this approach, selfplay games should be generated as fast as they are now; they would just be a bit less noisy and would explore interesting alternative moves in more detail (as well as purely random ones), so whatever happens, they should still be useful for training new nets. Efficient or not, the rhythm of progression would tell us. JMHO.

@Ishinoshita commented Oct 17, 2018

@Friday9i, @herazul As @gcp only surrenders to data, which is certainly a good quality for driving such a project ;-), and as we don't have such things as the test 20xx, 30xx that LC0 has to test so-called 'improvements' live, the only way to go would be to train a 40b on training data where positions with a winrate below some threshold have been filtered out, to cut the long tails of no-resign games, and compare progress vs the official network.

We don't have the value inside the training data to do that directly. However, based on statistical indicators on the policy (a moving average of p_best10, e.g.), it is possible to build a heuristic for when the value drops below some threshold. I have some (as yet unpublished) data that show a good correlation between the value drop and policy flattening under MCTS.

I would be willing to write a script that heuristically cuts the game tails, to produce 'tail-free' training data. But only if I can team up with someone who has the training capability and is willing to run that experiment.
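
A rough sketch of the kind of heuristic described above; the window size, threshold, and data layout are assumptions, since the real training chunks would need their own parser:

```python
import numpy as np

def cut_game_tail(policies, window=8, flat_threshold=0.35):
    """Return how many positions of one game to keep (illustrative heuristic only).

    policies: one array of move probabilities per position of the game.
    Idea: track the total probability mass of the 10 best moves; once its
    moving average stays below a threshold (policy has flattened, which the
    comment above correlates with a value drop), treat the rest as the
    no-resign tail and drop it. The window and threshold values are made up.
    """
    p_best10 = [np.sort(p)[-10:].sum() for p in policies]
    for i in range(window, len(p_best10)):
        if np.mean(p_best10[i - window:i]) < flat_threshold:
            return i              # keep positions [0, i), drop the tail
    return len(p_best10)          # no flattening detected: keep the whole game
```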

If a network trained this way is stronger than the regular one, or of equal strength despite being trained with, say, 15-20% fewer training samples, then we would have a 'result'... and could corner @gcp in some negotiations ;-). Or we may end up walking away pitifully, cursing our damned intuition, proved false once again...

@Friday9i (Author) commented Oct 17, 2018

@Ishinoshita Nice! You're right, without data we probably won't be heard...
I'd be willing to give some days/weeks of GTX1080 to train nets with these data
However, unfortunately, I'm afraid it won't be enough: in the proposal above, I was suggesting training the nets with modified selfplay games that focus (most of) the random exploration on the most interesting alternative moves (rather than the quasi-random exploration of moves with more than 1 visit, as done today), through the use of "-n -m 999 --randomvisits xyz" with various values for xyz. Hence, for that, we need to generate additional selfplay games with different parameters: that is not feasible with just a few computers; we would probably need at least 20 GTX 1080s for weeks..., and then train the nets on these selfplay games.
So just using a subset of the existing training games is probably not enough: what do you think?

@Ishinoshita commented Oct 17, 2018

@Friday9i My proposal was in fact a much more modest, lighter one, leveraging existing data and networks: rolling back to an older but already well-bootstrapped 40b, then feeding it a few million already-available training samples, filtered to remove post-resignation positions (or to remove all games suspected to be no-resign games).

If you want to test a fork of the whole training pipeline, then I think it's great and I'm all with you, but only as an observer or a very modest contributor (I can lend my GTX 960 for a few weeks to generate selfplay games, but note that I'm a complete newbie at programming).

@Friday9i (Author) commented Oct 17, 2018

I would like to try it, but I'm also quite a newbie at programming: I did some programming years ago, but with LZ, after a few failed attempts, I only managed to compile it on my computer for the first time last week... Hence, I'm very far from being in a position to initiate a fork of the training ;-(.

@wonderingabout (Contributor) commented Oct 19, 2018

@Friday9i
For your tests, you can use a Tesla V100 free trial on Google Cloud, see #1905.

If, let's say, 10 people use their free trials simultaneously, it should be fairly fast to gather the data you need,
especially now that I've found optimal settings with -g 2 (will add a comment later today).

If you want Windows, I have also run it on Windows Server 2016 (with GUI) in the past.

@gcp (Member) commented Oct 23, 2018

> Problem is, we would need to test these ideas, and the only way to test would be to have hundreds of computers contributing to test pipelines to see the results, but we don't have that.

It may be possible to run experiments on 9x9 or 13x13 with a small team to get some idea.

I'm aware that it's a problem that it's so hard to test proposed process improvements. lc0 has a test server, but they also stalled out the "main" network very quickly, whereas ours is still improving, be it at a glacial pace. (Also, the lc0 experiments were generally not very successful!)

I am honestly not sure what the best thing for the project would be at this point.

@herazul commented Oct 23, 2018

Yep, unless we suddenly get hundreds of new contributors, I don't see how we can test these things (as far as I know, the 9x9 experiments got somewhat different results from 20b or 40b nets on 19x19. Even AZ-style worked on 9x9, but no one managed to get it right on 19x19 as far as I know. Even Minigo is going AGZ-style for the next run, after 12 unsuccessful AZ runs.)

The obvious way is indeed waiting for the 40b to stall, and then trying things to see if they make it go further.

I'll use this thread to ask about the ELFv1 proportion: you said you reduced the proportion of ELF games, but is it possible to know from what % to what % it was reduced?
And if I understand correctly, your plan is to reduce it gradually?

@Friday9i (Author) commented Oct 23, 2018

Thanks for this comment @gcp, it's indeed not easy to know where to go...
I was suggesting above to play a % of the selfplay games with different parameters, such as "-m 50 --randomvisits 50" (or a mix of different parameters for -m and --randomvisits, cf. details above), in order to focus the exploration a bit more around "quite interesting moves": it may either help training (and that would be nice), have little impact (not a big deal), or degrade the training (I don't see why, but who knows?). And it's measurable within a few weeks through the rhythm of promotions (up/flat/down vs today). In any case, it should be easy to set up (in my understanding) and easy to revert if it doesn't work. So why not try?
Any comment on that @gcp?

@gcp (Member) commented Oct 23, 2018

> I'll use this thread to ask about the ELFv1 proportion: you said you reduced the proportion of ELF games, but is it possible to know from what % to what % it was reduced?

I tried randomly sampling 75%. Yes, the idea is to regularly try to lower this. If the main network keeps getting stronger, this would have to work at some point.

> Any comment on that @gcp?

I think there are too many things moving around to draw conclusions about this. That's also why I think the client fiddling someone did to go back to m=30 isn't going to be very useful: the training is then a mixture of so many things that it working once doesn't really mean anything on a larger scale.

> but no one managed to get it right on 19x19 as far as I know

Presumably DeepMind did, and I think the Leela Chess people did too...though they are having problems improving the result.

@gcp (Member) commented Oct 23, 2018

If you don't trust 13x13 to scale (we have no networks for it, could be nice :-), you might try just redoing 19x19. There are so many optimizations now that you would catch up with the main network with far fewer resources, especially if you actually find an improvement.

I guess the real issue is whether you stall out earlier, which you won't necessarily see in advance until you catch up. But then you could start switching things around.

@herazul commented Oct 23, 2018

> Presumably DeepMind did, and I think the Leela Chess people did too...though they are having problems improving the result.

Yeah, obviously DeepMind did, but with all the problems and failures of the AZ approach, I think a lot of people are waiting for the full paper with frustration :D (and even they didn't produce a better net than AGZ; we still don't know why).

Leela Chess kinda did, but with a lot of problems, and it still isn't on par at all with AZ chess: they are still very far from beating Stockfish.

These things, and the fact that the paper is still in peer review, tell me that there is a problem with AZ.

@Ishinoshita commented Oct 23, 2018

Keeping resources focused on what has proven to work so far vs spreading resources over more supposedly promising branches. This RL project is kind of fractal: you're faced with the same exploitation vs exploration dilemma at every scale ;-)

@Ishinoshita commented Oct 23, 2018

@gcp Sorry, I didn't get it. When you say

> I tried randomly sampling 75%

what do you mean exactly by "randomly sampling" and by "75%"? That from time to time you use 75% of the ELFv1 games for a training iteration?...

@herazul commented Oct 23, 2018

I understood it as sampling randomly from 75% 40b-net selfplay games and 25% ELFv1 selfplay games.
Otherwise I think it kinda wouldn't make sense? I don't know, I'm in doubt since you asked the question @Ishinoshita x)

@Ishinoshita commented Oct 23, 2018

@herazul That fully makes sense, indeed. Since the question was formulated as 'the proportion of ELF games', I was expecting a reference to 25% rather than 75%. Hence my question.

By the way, how are these 25% of ELFv1 selfplay games generated? Through ordinary clients or dedicated ones? (I didn't notice any ELFv1 network hash when my autogtp is running, but I didn't pay too much attention either.)

@roy7 (Collaborator) commented Oct 23, 2018

@Ishinoshita New ELF games aren't being generated any more. We have plenty of them already, and since the network is static, new games won't really be any different from the 250K+ we already have. So we just re-use them as needed to supplement LZ's own newly generated games.

@Ishinoshita commented Oct 23, 2018

@roy7 Crystal clear, thanks!

@herazul commented Oct 26, 2018

@Friday9i I was thinking about something: do you use validation.exe to test the nets' scalability?
Maybe we can help you test nets; at least I know I can. You could for example give me the command you want to run for a test, I run it on my 1080 and give you back the results. It could help, especially for the 1000+ visit tests that take a while!

@gcp (Member) commented Oct 26, 2018

I meant that normally it's 50% selfplay 50% ELF, but I randomly dropped 25% of the ELF games. Maybe I should have increased the selfplay to compensate, dunno.

@herazul commented Oct 26, 2018

@gcp That just became more confusing ;)
At the moment, you sample 50% from selfplay and 25% from ELF games (down from 50%)? So what does that mean for the last 25%? If you didn't raise selfplay to 75%, does that mean it's in reality approximately 66.66% selfplay and 33.33% ELF games?
Or maybe I didn't understand what you meant at all :)

Edit: after thinking about it, I think what you meant is that you dropped 25% of the ELF games from the window, so the window now contains ~437K games: 250K selfplay games and ~187K ELF games?

@gcp (Member) commented Oct 26, 2018

Normally I'd write out 250k ELFv1 games. Instead, I made the loop that writes them out randomly drop 25%.
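
In other words, something along these lines in the loop that dumps the ELF games; this is only a sketch of the idea, not the actual script:

```python
import random

KEEP_FRACTION = 0.75  # randomly drop 25% of the ELFv1 games

def write_elf_games(games, out):
    for game in games:
        if random.random() < KEEP_FRACTION:
            out.write(game)  # the kept ELF games still enter the training window
```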

@herazul commented Oct 26, 2018

OK, so my edit seems correct?

@Friday9i (Author) commented Oct 26, 2018

@herazul With pleasure, Ancalagon is also already helping to test LZ184 vs LZ174!
However, I just left for a week, so I will not be in a position to test things during that time. But we could connect in a group on Discord and coordinate, the three of us.

@wonderingabout (Contributor) commented Oct 26, 2018

@Friday9i @herazul
Again, you can use Google Cloud free trials (maybe on Windows Server 2016 Datacenter if you prefer a GUI; I successfully ran autogtp on Windows Server 2016 with a GUI on Google Cloud in the past, with a V100).
With 390 hours of Tesla V100 for free, you'll probably find these helpful.

@TFiFiE (Contributor) commented Nov 3, 2018

It would be mathematically contradictory for the curves to always eventually become linear if they indicated overall strength, which means that measuring strength by playing against the same opponent gives distorted results, just as self-play does (if only to a lesser extent).

When it comes to scalability, I find the following plots from the AlphaZero paper illuminating:

[scaling plots from the AlphaZero paper, arXiv:1712.01815]

@Friday9i (Author) commented Nov 7, 2018

FYI, I'll do an update of the post in a few days, with help from herazul and Ancalagon: thanks ;-)!
Results that will be released:

  • The 10K-visit ratio for LZ126 vs LZ118 (it took a lot of time...)
  • The curve for LZ184 vs LZ174 (up to 1K visits)
  • The curve for the latest net, LZ188, vs ELFv1, up to 3K visits, showing the improvement since LZ181 ;-)
@zhanzhenzhen (Contributor) commented Dec 17, 2018

@Friday9i Any plan to test LZ195?

@herazul commented Dec 19, 2018

@gcp Little question: we see from the scaling tests that strength differences tend to add up (a sort of compounding effect) with more visits.
So I was thinking: if a net A gets, for example, a 60% winrate vs net B at 1600 visits, it may have a 63% winrate vs net B at 3200 visits (that's what the scaling results tend to show).
So wouldn't that negatively impact a net's ability to pass the 55% gating, now that we have gone down from 3200 to 1600 visits for test matches?
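
To put rough numbers on that, here is the standard winrate-to-Elo conversion; the 60%/63% figures are just the hypothetical ones from the comment above:

```python
import math

def winrate_to_elo(p):
    """Convert an expected winrate into an Elo difference."""
    return -400 * math.log10(1 / p - 1)

print(round(winrate_to_elo(0.60)))  # ~70 Elo (hypothetical result at 1600 visits)
print(round(winrate_to_elo(0.63)))  # ~92 Elo (hypothetical result at 3200 visits)
print(round(winrate_to_elo(0.55)))  # ~35 Elo (the 55% gating threshold)
```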

@zhanzhenzhen (Contributor) commented Mar 22, 2019

@Friday9i 214 is out. It's been a long time. Could you add it?
