Go tournament manager #870

Closed
Cabu opened this issue Feb 13, 2018 · 35 comments

Cabu commented Feb 13, 2018

Is there a program that can manage multiple Go engines, run a series of matches between them, and display the results, like Arena (http://www.playwitharena.com/) or LittleBlitzer (http://www.kimiensoftware.com/software/chess/littleblitzer) do for chess?

odeint commented Feb 13, 2018

Gomill has been recommended here before for testing two engines, but it seems it can also run tournaments:
https://mjw.woodcraft.me.uk/gomill/doc/0.8.2/allplayalls.html
https://mjw.woodcraft.me.uk/gomill/doc/0.8.2/tournament_results.html
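
For example, a minimal all-play-all control file might look roughly like this (a sketch based on those docs; the engine commands are placeholders):

competition_type = 'allplayall'
board_size = 19
komi = 7.5
players = {
    'engine-a': Player('./engine_a --gtp'),
    'engine-b': Player('./engine_b --gtp'),
    'engine-c': Player('./engine_c --gtp'),
}
competitors = ['engine-a', 'engine-b', 'engine-c']
rounds = 20  # games per pairing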

Alternatively, you could get in contact with the owner of this thread and ask how he does it ( https://www.lifein19x19.com/viewtopic.php?f=18&t=13322 ). In the first post he uses twogtp and GoGui, but it might be more automated now.

ghost commented Feb 13, 2018

Here's a config for running multiple matches with gomill's ringmaster, assuming you have the engine binaries reachable from the directory you are running in (in the example below, the directory containing ringmaster).

Keep the config in a subdirectory so that the extra files ringmaster generates stay in that subdirectory.

./ringmaster folder/config run -j2

competition_type = 'playoff'

# Player codes are '<binary>_<weights>'; the two parts fill the two {}
# slots in LZ_ARGS (binary under /projects/lztest/, then weights file).
LZ_ARGS = '/projects/lztest/{} -p1600 --noponder -r1 -g -t1 -q -d -w {}'
def LZ(vals):
  return Player(LZ_ARGS.format(*vals))
players = dict()
p1 = 'leelaz-parent-eval_5a'
p2 = 'leelaz-next-e06f_5a'
p3 = 'leelaz-parent-eval_1a'
p4 = 'leelaz-next-e06f_1a'
players[p1] = LZ(p1.split('_'))
players[p2] = LZ(p2.split('_'))
players[p3] = LZ(p3.split('_'))
players[p4] = LZ(p4.split('_'))
board_size = 19
komi = 7.5
# Two independent 400-game matches, with colours alternating.
matchups = [
  Matchup(p1, p2, alternating=True, number_of_games=400),
  Matchup(p3, p4, alternating=True, number_of_games=400),
]
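
While (or after) it runs, the same control file can be used to view standings and write a results report, using ringmaster's standard show and report actions:

./ringmaster folder/config show
./ringmaster folder/config report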

as0770 commented Feb 13, 2018

I am the owner of the Engine Tournament thread... I still use twogtp and GoGui. It is not as comfortable as Arena, but it is more reliable ;-) Once you have figured out how to configure engine matches, you can create some standard text files, add the engine commands with a text editor, and quickly run the matches for a tournament.

For Windows I can recommend SmartGo. I like the concept of the GUI, even more than Arena's for chess. You can't run tournaments, but you can run engine matches, collect them in a folder, and create the tables. I just don't use it for running the tournaments because it can't handle all engines under Linux.

One hint for twogtp: don't use the alternate option. When the colours are switched, the result gets messed up.
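
For reference, a typical gogui-twogtp invocation along those lines might look like this (the engine command lines and the net1/net2 names are placeholders; per the hint above, -alternate is deliberately left out):

gogui-twogtp -black "leelaz -g -w net1" -white "leelaz -g -w net2" -games 100 -size 19 -komi 7.5 -auto -sgffile match1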

Cabu commented Feb 13, 2018

Thanks, I will take a look.
I saw that gomill can do that too, but it needs an older version of Python.

@marcocalignano (Member)

@Cabu if you are willing to write down some specification, I can see if I can modify Validation to run tournaments.

Cabu commented Feb 14, 2018

@marcocalignano

For a very basic tournament program, we should be able to define multiple engines (ideally more than 2) by their command lines (e.g. "leelaz.exe -g -q -w my_network -v 2000 --noponder -t 1"). We don't really need to separate the executable name, its network and its parameters into separate command-line options as is currently done, since some engines don't have such parameters (e.g. AQ doesn't have a network parameter).

We should also be able to set:

  • the board size
  • the komi (/!\ jigo is possible with a komi of 0)
  • the time allocated for a game (e.g. 60:00 for a 1 h sudden-death game)
  • the number of games each engine should play against each other engine

Then the tournament should start by matching each engine against every other engine (but not itself).
For 3 engines we would have the matches: 1vs2, 1vs3, 2vs1, 2vs3, 3vs1, 3vs2. The first engine of each pairing gets white and the second black.

Then it could display a nice table like:

.      eng1   eng2   eng3   total
eng1    -     8\2    6\4    25\15
eng2   3\7     -     1\9     7\33
eng3   6\4    9\1     -     28\12

(Each cell holds the row engine's wins\losses as white against the column engine; the total column also counts its games as black.)

In this table we can see:

  • eng3 is the strongest overall.
  • eng1 is stronger than eng3 when playing white (OK, the number of games in this example is not really enough to draw such a conclusion, but you get the idea).
  • eng2 is a total loser.
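
A minimal sketch of this pairing and cross-table logic (hypothetical Python, not an existing tool; run_game is a stub standing in for actually playing the two engines against each other over GTP):

import random
from itertools import permutations

engines = ["eng1", "eng2", "eng3"]
games_per_pairing = 10

# One entry per ordered pairing; the first engine plays White.
# wins[(white, black)] == [white_wins, black_wins]
wins = {pair: [0, 0] for pair in permutations(engines, 2)}

def run_game(white, black):
    # Stub: a real implementation would drive both engines over GTP.
    return random.choice("WB")

for (white, black), tally in wins.items():
    for _ in range(games_per_pairing):
        tally[0 if run_game(white, black) == "W" else 1] += 1

# Cross-table: each cell is the row engine's wins\losses as White;
# the total column also counts its games as Black.
print(".".ljust(6) + "".join(e.ljust(7) for e in engines) + "total")
for row in engines:
    cells = ""
    for col in engines:
        if row == col:
            cells += "-".ljust(7)
        else:
            w, l = wins[(row, col)]
            cells += f"{w}\\{l}".ljust(7)
    tw = sum(wins[(row, c)][0] + wins[(c, row)][1] for c in engines if c != row)
    tl = sum(wins[(row, c)][1] + wins[(c, row)][0] for c in engines if c != row)
    print(row.ljust(6) + cells + f"{tw}\\{tl}")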

For an advanced tournament system we can add:

  • setting the strength of each engine to compute the handicap (in case of handicap, in the matches x vs y and y vs x the weaker engine will always play black); this is 'easy' for 19x19, but for other board sizes see https://senseis.xmp.net/?HandicapForSmallerBoardSizes
  • byoyomi (e.g. 20:00+5x0:20 for a Japanese 20 min game plus 5 periods of 20 s, or 20:00+10 in 5:00 for a Canadian 20 min game plus 10 moves in 5 minutes)
  • managing leagues (the basic tournament system above could be considered a single league); supporting multiple leagues means that for each league we can define the engines, board size, komi, game time, number of games, ...
  • support for clusters of computers
  • any crazy ideas that could come up :-)

@marcocalignano (Member)

@Cabu
Validation already does a lot of what you require, just for only 2 engines, and I think adding multiple engines shouldn't be difficult.
The only thing I need to do is to set a fixed number of games and not let SPRT stop the matches.
The other three points are GTP parameters that Validation passes on the command line, so no need to implement them.

The first engine of each pairing gets white and the second black.

Validation, by default, takes the first engine as Black and the second as White, and alternates them over the different games. Is this a problem? The results in the table can then be printed however we want.

Validation is already multitasking and multi-GPU, but I think it is really difficult to support a computer cluster.

A byoyomi setting should also be a GTP command, so it is also given on the command line.
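
For reference, the time controls from the earlier list map onto GTP roughly like this (kgs-time_settings is a widely implemented extension rather than core GTP, so support depends on the engine):

time_settings 3600 0 0                # 60:00 absolute (sudden death)
time_settings 1200 300 10             # Canadian: 20:00 + 10 stones in 5:00
kgs-time_settings byoyomi 1200 20 5   # Japanese: 20:00 + 5 periods of 0:20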

To manage handicaps I really have no idea what I need (free handicap or fixed, which GTP command to use, will the engine choose the free handicap stones or will we, etc.).

To manage leagues I will first have to implement the tournament; then I can tell you how easy it will be.

Cabu commented Feb 14, 2018

The other three points are GTP parameters that Validation passes on the command line, so no need to implement them.

How do I then set them to the values I would like? For me, they should be passed as command-line parameters with default values, so that Validation stays backward compatible.

Validation, by default, takes the first engine as Black and the second as White

Not a problem, it's just a convention.

and alternates them over the different games.

That is not a big problem, but we then lose the ability to evaluate whether an engine is better with one color than with the other. If there is no way to change that, engine n should only fight engine m for m > n, as there is then no need for both 1vs2 and 2vs1. In that case numgames should ideally be an even number.

To manage handicaps I really have no idea what I need (free handicap or fixed, which GTP command to use, will the engine choose the free handicap stones or will we, etc.).

That should also be a parameter giving the type of handicap to use (free/fixed).

To manage leagues I will first have to implement the tournament; then I can tell you how easy it will be.

Leagues could be seen as multiple tournaments. Passing that kind of information through the command line alone is practically impossible; you will need config files for that.

Side question: could you add the compilation of Validation to the VS2017 solution? I have tried, but without success (I don't know how to tell the compiler to add the Qt dependencies) :(

Cabu closed this as completed Feb 14, 2018
@marcocalignano (Member)

@Cabu I do not have VS2017, so I cannot do it. But did you try just running qmake in the validation directory?

Cabu reopened this Feb 14, 2018

Cabu commented Feb 14, 2018

Nope, I didn't try that, as I can barely compile Leela in VS. And it doesn't seem to work:

D:\Sources\C++\leela-zero\validation>qmake
WARNING: D:/Sources/C++/leela-zero/autogtp/main.cpp conflicts with D:/Sources/C++/leela-zero/validation/main.cpp
WARNING: Automatically turning off nmake's inference rules. (CONFIG += no_batch)
WARNING: D:/Sources/C++/leela-zero/autogtp/main.cpp conflicts with D:/Sources/C++/leela-zero/validation/main.cpp
WARNING: Automatically turning off nmake's inference rules. (CONFIG += no_batch)

No executable is generated.

wctgit commented Feb 14, 2018

@marcocalignano, @Cabu: Running general tournaments could become a big project in itself. Just a caution that it might become a bigger project than you initially imagined and could eat up more of your time than initially planned. That's not to say don't go for it, just be aware of what you might be getting into. (To be clear: I think this is an awesome idea, regardless of how big or small you end up going, and I might even be able to help out, though no guarantees.)

But with that in mind, I've read a little about actual Go tournaments recently, and one of the most common general tournament styles for multiple competitors is the Swiss system, specifically its McMahon variant: https://en.wikipedia.org/wiki/McMahon_system_tournament. This allows many entrants but requires only a few rounds (as opposed to a round-robin). I'm not sure if that's the kind of tournament you're intending to run; I just thought I'd mention it so you have some context for how big and potentially complicated a generalized tournament project could get.

A simpler system might be based on the GA/evolutionary system I described earlier in #814 (comment) and subsequent comments. This would be a kind of 'ongoing tournament' of a population of networks. (You needn't do the evolutionary stuff like mutation and crossover, of course!) The main idea is simply to pair networks against each other at random. You could then compile statistics based on who beats whom. (In the GA/evolutionary setting, the 'statistics' are maintained simply by keeping a pool/population of 'survivors' and letting the rest of the entrants 'go extinct'.)

Similar to that, you could try something like what is done on CGOS, which is more along the lines of the ongoing random match-up style, or the KGS Computer Go tournaments, which are more of the fixed-length 'official tournament' style.

zediir commented Feb 14, 2018

@Cabu You need to run nmake after qmake to build.
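
For example, from a command prompt where the Qt tools are set up (assuming the validation subdirectory has its own .pro file, so qmake doesn't pick up the top-level project):

cd validation
qmake validation.pro
nmake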

wctgit commented Feb 14, 2018

Also, there's info on Sensei's Library about Go tournaments. As a starting point, here's the page on McMahon pairing: https://senseis.xmp.net/?McMahonPairing

wctgit commented Feb 14, 2018

Oh, and I guess the 'ongoing random match-up style' I'm describing is perhaps more properly known as a ladder system rather than a tournament system. See https://senseis.xmp.net/?ClubLadder for examples.

wctgit commented Feb 14, 2018

Also, this looks useful as a sort of alternative to Elo: https://senseis.xmp.net/?EGFRatingSystem. (It doesn't seem as general as BayesElo, though.)

wctgit commented Feb 14, 2018

One thing which might be really cool and helpful for running experiments would be for AutoGTP to be forked/adapted so that tournaments, ladders, and validation could be run across multiple computers, with people opting in to volunteer their computing power to a particular tournament (or set of tournaments). A server could then dole out games as needed. Perhaps each client could register which kinds of engines it is able to support, and tournament organizers could provide a list of prerequisites that clients need in order to support their tournament (e.g. a custom/forked version of Leela Zero as a downloadable binary (or perhaps requiring compilation is safer?), or an alternative Go engine installation like AQ).

@marcocalignano (Member)

But I wonder if all of this is pertinent to the Leela Zero project. Why do we need tournaments?

@jkiliani

@wctgit I suggested this to @gcp before, at that point for the purpose of search parameter tuning. The response was not positive because of the problems with distributing multiple binaries across clients. For this reason, I'm not optimistic anything of the sort is going to happen, even though it might be highly useful for the current attempts to find good solutions for FPU reduction that work for multiple nets.

Cabu commented Feb 14, 2018

@marcocalignano

But I wonder if all of this is pertinent to the Leela Zero project. Why do we need tournaments?

To evaluate the Leela engine and Leela networks against themselves and/or other engines.

Cabu commented Feb 14, 2018

@wctgit
By tournament, we are not thinking of a real-life tournament/competition, but just of running engines for hundreds or thousands of games to check which engine is better than the other. Here we are just talking about the possibility of running multiple engines against each other enough times that the results are statistically sound.
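
As a rough illustration of why the number of games matters (a back-of-the-envelope normal approximation, not code from any tool discussed here):

import math

def winrate_ci(wins, games, z=1.96):
    # Approximate 95% confidence interval for an observed winrate.
    p = wins / games
    half = z * math.sqrt(p * (1 - p) / games)
    return p - half, p + half

print(winrate_ci(11, 20))     # 20 games:   ~(0.33, 0.77) -- almost no information
print(winrate_ci(110, 200))   # 200 games:  ~(0.48, 0.62)
print(winrate_ci(550, 1000))  # 1000 games: ~(0.52, 0.58)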

@marcocalignano (Member)

But I would still like to know the opinion of @gcp on this matter.

gcp commented Feb 14, 2018

for the purpose of search parameter tuning. The response was not positive because of the problems with distributing multiple binaries across clients.

Search parameter tuning does not need separate binaries. Trying other search algorithms does.

We're not in chess territory, where a 2 Elo improvement is something that gets people to pop the champagne, so I see little need for this.

It's handy if people want to try search changes and don't have a GPU and can't leave their machine on overnight and don't want to use AWS or GCP. But it requires a significant investment: making packages that can do local git repo fetches + builds of the source code (so including a compiler), a server side that can dole out the parameters, and an account system with approval so random people can't upload arbitrary C code to the network participants.

even though it might be highly useful for the current attempts to find good solutions for FPU reduction that work for multiple nets.

I really don't understand why that would require anything of the sort, least of all tournaments, and especially not using other engines, which, combined with the non-homogeneous testing environment, can fuck up such a tuning really hard.

It's going to be hard to tune something if the strength of your opponent depends on which system the current match got scheduled on.

If you think the problem of those FPU reduction threads is that the results vary too much from net to net, I think you're mistaken: the problem is that much of the testing was initially only done for a few games before turning knobs, so the results look very random. But people are starting to learn, I think.

jkiliani commented Feb 14, 2018

If you think the problem of those FPU reduction threads is that the results vary too much from net to net, I think you're mistaken: the problem is that much of the testing was initially only done for a few games before turning knobs, so the results look very random. But people are starting to learn, I think.

Fair point, and I am also sceptical of results with extremely few games. I do think there's a high possibility that FPU reduction (or any search parameter change, for that matter) will also change the balance of strength between nets to some extent. I got this impression from one post whose author retested a net that had lost 0:50 (or maybe 1:50) to the then-current best, but managed to score 4:1 against that same net with FPU reduction. Of course a 4:1 does not mean that the first net is stronger, but it should mean that there is a statistically significant discrepancy with another test ending 1:50.

Another result that puzzled me is the blowout that @killerducky experienced when implementing dynamic parent evaluation for Minigo (tensorflow/minigo#87), when for Leela Zero this turned out considerably worse than FPU reduction.

I'm certainly not saying that I can prove my suspicions in this regard; you may well be correct after all. I just saw a number of results suggesting that there is a lot we don't completely understand about the effects of search code changes, so testing such changes with multiple nets instead of a single one seems prudent to me.

gcp commented Feb 14, 2018

But Minigo didn't have any kind of FPU reduction, right? Leela Zero's next branch already has the original proposal.

@killerducky (Contributor)

Minigo doesn't have any FPU reduction, so my results just show that dynamic parent evaluation is better than the original init-to-parent eval. I also have another pull request for Minigo that shows FPU reduction works well on Minigo. I didn't test FPU reduction vs dynamic parent eval on Minigo, which is what we are testing here.

jkiliani commented Feb 14, 2018

FPU reduction tested at around a 75% winrate vs static parent eval (the baseline) for LZ, and dynamic parent eval tested at around 40% vs FPU reduction. If we assume these results are transitive to at least a reasonable degree, there is a discrepancy with the 92% winrate of dynamic parent eval vs static parent eval in Minigo, right?
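
Putting rough numbers on that discrepancy with the standard logistic Elo model (back-of-the-envelope; the winrates are the ones quoted above):

import math

def elo_diff(p):
    # Elo difference implied by a winrate p under the logistic model.
    return 400 * math.log10(p / (1 - p))

print(elo_diff(0.75))  # ~ +191: FPU reduction vs static parent eval (LZ)
print(elo_diff(0.40))  # ~  -70: dynamic parent eval vs FPU reduction (LZ)
print(elo_diff(0.92))  # ~ +424: dynamic vs static parent eval (Minigo)
# Transitivity would predict dynamic vs static at ~ +120 Elo (~67%) for LZ,
# far below the ~ +424 Elo implied by Minigo's 92%.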

remdu commented Feb 14, 2018

It's 92% over 65 games; not sure how that would change with more games. Also, is it on 9x9? It might be a lot better for that size.

wctgit commented Feb 14, 2018

But I wonder if all of this is pertinent to the leela zero project. Why do we need tournaments?

One reason is: It would help people to run side experiments based on Leela Zero.

wctgit commented Feb 14, 2018

By tournament, we are not thinking of a real-life tournament/competition, but just of running engines for hundreds or thousands of games to check which engine is better than the other.

Yes, I understand that, but you also mentioned "any crazy ideas that could come :-)". Hence, I dumped a bunch of crazy ideas. :-D

wctgit commented Feb 14, 2018

It's handy if people want to try search changes and don't have a GPU and can't leave their machine on overnight and don't want to use AWS or GCP.

So, most of us, then. ;-)

If you think the problem of those FPU reduction threads is that the results vary too much from net to net, I think you're mistaken: the problem is that much of the testing was initially only done for a few games before turning knobs, so the results look very random.

So you're saying that it would be better if people were able to run large numbers of games in, like, maybe, a tournament or something, so they can get more statistically valid results than are possible on just their one local machine. ;-)

ghost commented Feb 14, 2018

I've found it only takes a day or two with a GPU to get a hundred or more games.

marcocalignano commented Feb 14, 2018

One reason is: It would help people to run side experiments based on Leela Zero.

You said it: "side experiments based on" are not part of this project.

wctgit commented Feb 14, 2018

You said it: "side experiments based on" are not part of this project.

They could be, if they are used to discover some useful bug fix or innovation, e.g. FPU calculations, winrate estimation tweaks (opportunity/risk), etc.

d7urban commented Mar 6, 2018

Has anyone used gomill/ringmaster under Ubuntu on Windows?
I can get the check to pass, but when I try to run it, it just creates some empty log and history files and then does nothing.

My config.ctl:

competition_type = 'playoff'

board_size = 19
komi = 7.5

players = {
    'leelaz': Player(
        '../../leela-zero-0.12-win64/leelaz.exe --gtp --noponder -w ../leela-zero-0.12-win64/5b90bd32ccc835d8e08d41970d39753e5732413d75e8f4035bebb5f1da69fb87',
        startup_gtp_commands=["time_settings 0 30 1"]),
    'leela': Player(
        '../../Leela0110GTP/Leela0110GTP_OpenCL.exe --gtp --noponder',
        startup_gtp_commands=["time_settings 0 30 1"]),
}

matchups = [
    Matchup('leelaz', 'leela',
            alternating=True,
            number_of_games=4),
]

@sethtroisi (Member)

Closing: no active discussion for ~1 year, with subsequent issues and PRs (ringmaster is described in the README).
