FPU reduction proportional to root eval #1565
Conversation
Very nice analysis!
@Ttl I'm working on determining minigo's optimal c_puct and running similar large batches of games. I found the results change dramatically when I vary readouts/playouts/visits from a low first-pass number (200 or 400) to the self-play number (800 for MG). I have a 1080 that isn't highly utilized; I can easily pull a change and run a ringmaster config, or potentially even give you SSH access to that machine if that would help.
I also expect the number of playouts to affect the results, which is why I was hoping to also test with larger playouts. A ringmaster tournament with 400 games using the same command line but 3200 visits would be interesting. If that is too much effort, 1600 visits would probably also be fine.
Seth, want to try and get this running on a cluster this week?
ringmaster config
Results
You are using a wrong branch. That one expands every child node and is meant for generating the data to fit the FPU reduction. Use the branch from this pull request instead.
Glad I saw this :) I fixed it now and restarted.
I completed the set of games. If you want me to run some more I certainly can (ping me, or ping AMJ and have him ping me on Discord).
Am I understanding the third graph right that at 0% winrate the FPU reduction seems like it should be 0.0, while at 100% winrate it seems like it should be around 0.5?
Thanks for running the tests. I continued my test too and it also seems to converge around 50%:
It seems mostly equal to the constant FPU reduction. Combining all the tests with different playouts gives 452/888 wins = 50.9% winrate. It does use about 1.5% less CPU, which I guess means the results would be a little better with equal time, but I don't think it would really matter.

@eddh You are reading the plot correctly. Near 50% it should give very similar results to the current method, and it's only significantly different at extreme winrates. In practice the difference is mostly that it expands more nodes for the losing side and fewer for the winning side.

I'll close this for now since it's not better by a statistically significant margin.
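For reference, the significance of a match score like 452/888 can be checked with a normal approximation to the binomial test. This is a generic sketch, not code from this repository:

```python
import math

def match_significance(wins, games):
    """Two-sided z-test under H0: true winrate is 50%.
    Returns the observed winrate and the z-score; |z| < 1.96
    means not significant at the 95% level."""
    winrate = wins / games
    se = math.sqrt(0.25 / games)  # standard error under p = 0.5
    return winrate, (winrate - 0.5) / se

rate, z = match_significance(452, 888)
# rate is about 0.509 and z is about 0.54, far below 1.96
```

So a 50.9% result over 888 games is well within noise for this sample size.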
Before LZ got too extreme on win rate it would resign, wouldn't it? I'm not sure better extreme-winrate handling would do much in a head-to-head comparison. We might need to test it in a handicap situation against an older, weaker LZ, and see if fpu_est has a higher win rate than next against the same opponent at the same handicap.
There's actually quite a large difference in the number of expanded nodes compared to next when the winrate is low. For example, in https://online-go.com/game/13035133 at move 70 using 1000 visits:

Next
Proportional FPU
I can't say, though, whether expanding more low-prior nodes is advantageous or not. At least in this position it doesn't really make any difference. Old LZ weights have some obvious weak spots, with missed ataris on large groups and no understanding of how ladders work. In my opinion the converted minigo weights play much more sanely while still being beatable with handicap (#1538 (comment)).
I did some statistics on the optimal FPU reduction. I first changed FPU to 1 to expand all of the first-level nodes, added code to save the child nodes' net evaluations, and then played some self-play games to gather enough data. In the analysis script I calculated the difference between the child node's net evaluation and the root's net evaluation for every node; this would be the optimal FPU reduction constant to use for that node. Code is available at: https://github.com/Ttl/leela-zero/tree/fpu_estimation
Here is a histogram of the evaluation difference between child and root:
As expected, most of the moves are worse than the root, which is why root evaluation minus a constant is a more accurate initialization than just the root evaluation.
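The per-node quantity described above (child net evaluation minus root net evaluation) can be sketched as follows; the data layout is hypothetical, only the arithmetic matches the description:

```python
def optimal_reductions(root_eval, child_evals):
    """For each child, the difference between the net's evaluation of
    the child position and the root evaluation. This is the per-node
    'optimal' FPU reduction: initializing the child at root_eval + diff
    would match its true net evaluation exactly."""
    return [child - root_eval for child in child_evals]

# Toy example: root evaluated at 0.55 for the side to move
diffs = optimal_reductions(0.55, [0.50, 0.30, 0.58])
# Most moves are worse than the root, so most differences are negative
```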
Here is the L2 loss plotted against different FPU reduction constants:
The minimum is very close to what was previously determined to be strongest using CLOP. The estimated optimum was -0.3 in the last tuning run (#1211), and in this analysis -0.29 has the lowest L2 loss.
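The L2-loss sweep over candidate constants amounts to the following; the data here is synthetic, while the real analysis used the gathered self-play evaluations:

```python
def l2_loss(reductions, constant):
    """Mean squared error of modelling every node's optimal
    reduction with a single constant."""
    return sum((r - constant) ** 2 for r in reductions) / len(reductions)

def best_constant(reductions, candidates):
    """Grid-search for the constant with the lowest L2 loss."""
    return min(candidates, key=lambda c: l2_loss(reductions, c))

# Synthetic reductions; the L2-optimal constant is their mean, -0.22
data = [-0.5, -0.3, -0.2, -0.1, 0.0]
grid = [c / 100 for c in range(-60, 1)]
```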
Plotting FPU reduction vs. root evaluation gives the following plot:
As the winrate can't go lower than 0, it makes sense to use an FPU reduction that is proportional to the root evaluation instead of just subtracting a constant. In the plot, the green line is the least-squares fit of FPU reduction vs. winrate (this pull request).
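The least-squares fit of reduction against root evaluation is ordinary simple linear regression. A self-contained sketch (the real fit was done over the gathered data, not this toy input):

```python
def fit_proportional(root_evals, reductions):
    """Closed-form least squares for reduction = a * root_eval + b."""
    n = len(root_evals)
    mx = sum(root_evals) / n
    my = sum(reductions) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(root_evals, reductions))
    sxx = sum((x - mx) ** 2 for x in root_evals)
    a = sxy / sxx
    return a, my - a * mx

# On exactly linear toy data the fit recovers the line
# reduction = 0.5 * root_eval: near-zero reduction for a losing
# root, around 0.5 for a winning one, as in the plot
a, b = fit_proportional([0.1, 0.5, 0.9], [0.05, 0.25, 0.45])
```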
Plotting the FPU reduction vs. policy head output gives:
As would be expected, moves with high policy output usually have a value-network evaluation close to the root, and low-policy moves are often much worse, although there are exceptions. This plot does give some explanation for why
sqrt(total_visited_policy)
works: it lowers the reduction for moves with high policy that are expanded first and that are usually close to the root evaluation. The green lines show the FPU reductions with 0.1, 0.5 and 0.9 root evaluations using the equation in this pull request. There are many different functions that could be used to initialize nodes from the policy and root evaluation, but I decided to keep it simple and avoid adding any more tunable variables. The equation in this pull request gives lower L2 loss than simple subtraction. I didn't try to find any better formula for taking the policy output into account and used the current
sqrt(total_visited_policy)
for that. The L2 loss of this equation in predicting the child evaluation is 0.033 using the latest network, compared to 0.051 using the current equation. -0.6 is the best reduction using this formula with this network. As was observed earlier when FPU reduction was first introduced, the optimum does depend on the network: the minigo-303 optimum is at -0.29 and ELF's is at -0.7. The new formula is more accurate for them too, by about the same amount as for the LZ net.
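Putting the pieces together, here is a sketch of the two initialization schemes side by side. The constants are illustrative rather than the tuned values, and the exact proportional formula is the one in this pull request's diff, not this simplified form:

```python
import math

def fpu_init(root_eval, total_visited_policy, c=0.25,
             proportional=False, k=0.5):
    """First-play-urgency value for an unexpanded child.
    Constant scheme ('next'): subtract c * sqrt(visited policy mass).
    Proportional scheme (sketch): replace the constant with
    k * root_eval, so a losing root gets a smaller reduction and
    more low-prior children end up being expanded."""
    reduction = k * root_eval if proportional else c
    return root_eval - reduction * math.sqrt(total_visited_policy)

# With a losing root eval of 0.1, the proportional scheme reduces less,
# so unvisited children start closer to the root evaluation
```

This makes the behaviour described above concrete: near a 50% root evaluation the two schemes give similar values, while at extreme winrates the proportional one expands more nodes for the losing side.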
I have been running a match between this pull request and next for one week now using a very weak laptop. The command line parameters are:
-g -d -t 1 --noponder -v 800 --timemanage off -r 10 -w d01879964d578b676714251164f7289da023f14c0063b1721e5e3cd2e8d51ae0.gz
Current results:
It's not statistically significant yet, and it looks like completing the 400 games will take about two more weeks. I don't think there is currently any hope of me testing this at higher playouts. I hope that someone will help with testing or invent a more accurate initialization scheme using the analysis results.