
Great job!! #1

Open
Zeta36 opened this issue Oct 31, 2017 · 31 comments

@Zeta36

Zeta36 commented Oct 31, 2017

Wonderful job, friend.

Can you please tell us what performance you got with this approach? Do you have any statistics?

Regards!

@yhyu13

yhyu13 commented Oct 31, 2017

Hi, mokemokechicken

It's a wonderful project, keep it going!

@mokemokechicken
Owner

Hi, thank you for your compliment!

> Can you please tell us what performance you got with this approach? Do you have any statistics?

Currently it is not very strong.
I have not done an objective evaluation because I am not sure how to measure it, but it is still at a level I can beat.

The 7th-generation model beats the beginner player (the left icon, 初心者) at http://kuina.ch/board_game/a01.

The progress of model evolution is as follows.

| model generation | winning percentage vs. best model | time spent (hours) |
|---|---|---|
| 1 | - | - |
| 2 | 94.1% | - |
| 3 | 63.4% | 13 |
| 4 | 62.0% | 3 |
| 5 | 56.7% | 8 |
| 6 | 67.3% | 7 |
| 7 | 59.0% | 3 |
| 8 | 59.7% | 6 |
| 9 | 59.4% | 3 |
| 10 | 55.7% | 5 |
| 11 | 57.9% | 9 |
| 12 | 55.6% | 5 |
| 13 | 56.5% | 7 |
| 14 | 58.4% | 20 |
| 15 | 62.4% | 3 |
| 16 | 56.0% | 11 |
| 17 | 64.9% | 17 |
| 18 | 55.2% | 19 |
| 19 | 57.2% | 33 |
| 20 | 55.7% | 12 |

@yhyu13

yhyu13 commented Nov 1, 2017

Hi, @mokemokechicken

Forgive my ignorance, but what game is your agent playing? It seems to be a black-and-white stone-flipping game played on an 8x8 grid. Could you provide some background on this game? How is it similar to, and different from, the game of Go?

I am also quite stuck on my implementation of APV-MCTS in Python. I am not new to Python, but I have not done coroutine programming before. In particular, could you explain the idea behind player.py lines 172-174? It would be great if you could take a look at my code (I have a feeling that I have to combine the MCTS tree-node class and the tree-search control class).

@Zeta36
Author

Zeta36 commented Nov 1, 2017

About Reversi game: https://en.wikipedia.org/wiki/Reversi

@mokemokechicken
Owner

Hi, @yhyu13

> the idea behind player.py lines 172-174

Although I may not have grasped the intent of the question exactly, those lines implement the "Future" pattern combined with the "Producer/Consumer" pattern.

As you know, each TensorFlow prediction call has high overhead, so it is better to predict in batches.
The actual prediction is done in batches by the Consumer (lines 190-209), and the requests are queued by the Producer (lines 172, 213).
Since the prediction results must be returned to the Producer, the Future pattern is useful.
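For illustration, here is a minimal sketch of the two patterns together with asyncio (the names, and the batched `model.predict` interface, are assumptions for the sketch, not the repository's actual code):

```python
import asyncio
import numpy as np

async def prediction_worker(queue, model):
    """Consumer: drain all queued positions and predict them in one batch."""
    while True:
        if queue.empty():
            await asyncio.sleep(0.001)
            continue
        items = [queue.get_nowait() for _ in range(queue.qsize())]
        states, futures = zip(*items)
        policies, values = model.predict(np.stack(states))  # one batched call
        for future, p, v in zip(futures, policies, values):
            future.set_result((p, v))  # wakes the waiting producer

async def predict(queue, state):
    """Producer: enqueue a request, then await its Future for the result."""
    future = asyncio.get_event_loop().create_future()
    await queue.put((state, future))
    return await future
```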

> APV-MCTS

I looked at your code. It is difficult to make useful comments, but with your process-pool architecture, I think it is better for the main controller to just queue each search iteration (1600) and let them run in the process pool.
Synchronize the results (N, W, and Q) while paying attention to race conditions.
A prediction worker (the Producer/Consumer pattern) is also useful for improving performance (as in the paper).
If you want to implement a timeout, the main controller decides the timeout, collects the results (the visit counts N) at that moment, and decides the next move.

@yhyu13

yhyu13 commented Nov 3, 2017

@Zeta36 Thanks, that is helpful. I just learned that Reversi also sparked A.I. interest in the late 90's; Michael Buro, a professor at the University of Alberta, was a pioneer in tackling this challenge.

@mokemokechicken I have put TreeNode, NN, and MCTS in the same class; it is taking some time to get it correct. I will let you review it when I am ready.

@gooooloo
Contributor

gooooloo commented Nov 8, 2017

> | model generation | winning percentage vs. best model | time spent (hours) |
> |---|---|---|
> | 1 | - | - |
> | 2 | 94.1% | - |
> | 3 | 63.4% | 13 |
> | ... |  |  |

@mokemokechicken In your table, what does "generation" mean: the generation of the best model, or of the newly trained model? Since every generation beats the best model over 55% of the time, I guess it is the former?

@mokemokechicken
Owner

@gooooloo Yes, it is the generation of the best model.

@gooooloo
Contributor

gooooloo commented Nov 8, 2017

@mokemokechicken got it. Thank you!

@yhyu13

yhyu13 commented Nov 10, 2017

@mokemokechicken

Hi, I've learned a lot from your coroutine implementation of MCTS in Python. Please take a look: APV-MCTS in python.

Also, I was advised to compile it with Cython. The C compiler optimizes my tree search a little.

Through method profiling, I found two bottlenecks:

  1. Expand leaf node
  2. Evaluate leaf node

(1) has to do with the data structure, and (2) has to do with whether or not I am using a GPU. Supervised training on a Volta instance can process 1450 samples per second (including I/O) with a 20-block ResNet, so a small batch ready to be evaluated (size 8~16) should take about 0.01 seconds. However, 1600 iterations involve about 200 evaluations in total, and 200*0.01 = 2s is far from DeepMind's 0.4 seconds per move. On my 4-CPU MacBook, 1600 iterations take about 5.6 seconds (excluding evaluation), so I will try machines with more and better CPUs.

My questions at this moment are:

  1. What does the margin in the prediction worker method do? Why is it important to avoid finishing before the other searches start?
  2. What is your tip for finding the best combination of the number of semaphores and the size of the queue?

Regards,
Hang Yu

@yhyu13

yhyu13 commented Nov 10, 2017

@mokemokechicken

Could you please tell me how you made download_the_best_model.sh? Where are the files stored? I am trying to upload/download training datasets without using Git LFS.

@mokemokechicken
Owner

Hi, @yhyu13

> APV-MCTS in python

I think it is a good idea to separate NetworkAPI from MCTS.

FYI: according to https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf, they set the virtual loss to 3.

> So 200*0.01 = 2s is far from DeepMind's 0.4 seconds per move.

Google's TPU machines probably perform better than other GPUs.
If you have 4 TPUs or GPUs, there may be a more efficient implementation using coroutines.

> 1. What does the margin in the prediction worker method do? Why is it important to avoid finishing before the other searches start?

RUNNING_SIMULATION_NUM is incremented inside the search coroutines.
I thought the execution order of coroutines was not controllable (perhaps it depends on the implementation?).
If the prediction-worker coroutine started before all the other coroutines, the worker would finish immediately.
It is probably a very rare case, but I was a little worried, so I added the margin.
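A minimal sketch of that idea (the context attributes here are illustrative, not necessarily the repository's actual code):

```python
import asyncio

async def prediction_worker(ctx, margin=10):
    """Without the margin, this coroutine could be scheduled first, see
    running_simulation_num == 0, and exit before any search has started."""
    while ctx.running_simulation_num > 0 or margin > 0:
        if ctx.queue.empty():
            if margin > 0:
                margin -= 1  # burn margin only while idle
            await asyncio.sleep(0.001)
            continue
        items = [ctx.queue.get_nowait() for _ in range(ctx.queue.qsize())]
        ...  # batch-predict the states and resolve their futures, as above
```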

> 2. What is your tip for finding the best combination of the number of semaphores and the size of the queue?

That is a very difficult question...
I think there are two viewpoints: search performance and search accuracy.
For accuracy, I think a single thread (sem=1) is best.
A larger sem gives better performance, but it should be capped anyway, and accuracy may drop slightly.
Performance also depends on the execution environment, so I think it can only be tuned by experiment.

> about download_the_best_model.sh

I just push the models to https://github.com/mokemokechicken/reversi-alpha-zero-models.
I do not use Git LFS (I store them as normal files...).
Because the files are large, I feel like GitHub may get angry at me...
I would like to change this if there is a better way.

@yhyu13

yhyu13 commented Nov 11, 2017

@mokemokechicken

Here are two of my profiles. I found that the time cost per move really depends on the number of illegal moves encountered. The search spends the most time when every leaf node is expandable (which calls the neural network and creates all child nodes; I tried to put the legality check in the expansion phase so that fewer children are created, but I could not find an efficient way to do it, so it was even slower).

In the first profile, the selected child node is generated by a random process, which invokes the most evaluation & expansion. (BTW, the number of elements in the queue is upper-bounded by the number of semaphores, i.e. it's better to have queue_size >= num_sem so the queue is never full, am I right?)

2017-11-11 00:37:50,026 [45186] DEBUG    model.APV_MCTS: Searched for 9.05447 seconds

1747521 function calls in 3.358 seconds

   Ordered by: cumulative time, internal time, call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1593    0.040    0.000    3.358    0.002 APV_MCTS.py:123(expand)
     1593    0.712    0.000    3.284    0.002 APV_MCTS.py:127(<dictcomp>)
   576666    1.674    0.000    1.674    0.000 APV_MCTS.py:84(__init__)
   576666    0.744    0.000    0.903    0.000 index_tricks.py:516(__next__)
   576666    0.158    0.000    0.158    0.000 {built-in method builtins.next}
     1593    0.002    0.000    0.018    0.000 fromnumeric.py:138(reshape)
     1593    0.011    0.000    0.016    0.000 fromnumeric.py:55(_wrapfunc)
     1593    0.005    0.000    0.012    0.000 index_tricks.py:513(__init__)
     1593    0.005    0.000    0.007    0.000 numeric.py:463(asarray)
     1593    0.004    0.000    0.004    0.000 {method 'reshape' of 'numpy.ndarray' objects}
     1593    0.002    0.000    0.002    0.000 {built-in method numpy.core.multiarray.array}
     1593    0.001    0.000    0.001    0.000 index_tricks.py:530(__iter__)
     1593    0.001    0.000    0.001    0.000 {built-in method builtins.getattr}
     1593    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        0    0.000             0.000          profile:0(profiler)

The second profile is from a run with a lot of illegal moves, so there are 15x fewer expansions & 4x fewer evaluations:

2017-11-11 00:33:19,549 [45135] DEBUG    model.APV_MCTS: Searched for 5.97132 seconds

116282 function calls in 0.136 seconds

   Ordered by: cumulative time, internal time, call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      106    0.002    0.000    0.136    0.001 APV_MCTS.py:121(expand)
      106    0.039    0.000    0.132    0.001 APV_MCTS.py:125(<dictcomp>)
    38372    0.059    0.000    0.059    0.000 APV_MCTS.py:84(__init__)
    38372    0.025    0.000    0.033    0.000 index_tricks.py:516(__next__)
    38372    0.008    0.000    0.008    0.000 {built-in method builtins.next}
      106    0.000    0.000    0.001    0.000 fromnumeric.py:138(reshape)
      106    0.001    0.000    0.001    0.000 fromnumeric.py:55(_wrapfunc)
      106    0.000    0.000    0.001    0.000 index_tricks.py:513(__init__)
      106    0.000    0.000    0.000    0.000 numeric.py:463(asarray)
      106    0.000    0.000    0.000    0.000 {method 'reshape' of 'numpy.ndarray' objects}
      106    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
      106    0.000    0.000    0.000    0.000 index_tricks.py:530(__iter__)

DeepMind claims its average search time per move is 0.4s, so I take the average of the upper bound and the lower bound; in this case, full expansion and root-only expansion.

Third, a profile generated by root-only expansion (with 8 sems on a 4-core machine):

2017-11-11 00:58:00,998 [45373] DEBUG    model.APV_MCTS: Searched for 2.57048 seconds

         2194 function calls in 0.002 seconds

   Ordered by: cumulative time, internal time, call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.002    0.001 APV_MCTS.py:123(expand)
        2    0.001    0.000    0.002    0.001 APV_MCTS.py:127(<dictcomp>)
      724    0.001    0.000    0.001    0.000 APV_MCTS.py:84(__init__)
      724    0.000    0.000    0.001    0.000 index_tricks.py:516(__next__)
      724    0.000    0.000    0.000    0.000 {built-in method builtins.next}
        2    0.000    0.000    0.000    0.000 fromnumeric.py:138(reshape)
        2    0.000    0.000    0.000    0.000 fromnumeric.py:55(_wrapfunc)
        2    0.000    0.000    0.000    0.000 index_tricks.py:513(__init__)
        2    0.000    0.000    0.000    0.000 numeric.py:463(asarray)
        2    0.000    0.000    0.000    0.000 {method 'reshape' of 'numpy.ndarray' objects}

Interestingly enough, the total time with 16, 8, 4, 2, and 1 semaphores is the same (about 2.5s) with root-only expansion. This is understandable because it doesn't need extra coroutines.

A naive conversion to Cython speeds up the lower bound to 2.1s.

BTW, I didn't use an actual TensorFlow session, but merely returned a random value during evaluation. Otherwise it takes too long (50s+).

However, if I get realistic and set the per-move iteration count to 200 instead of 1600, the upper bound becomes 0.75s and the lower bound 0.3s. I will try to tinker with a better Cython version of APV_MCTS; in your experience, would that be helpful?

@mokemokechicken
Owner

@yhyu13

> illegal moves

I think it is a good idea to check illegal moves in the expansion phase.
As the model learns, it will stop proposing illegal moves.

> it's better to have queue_size >= num_sem so the queue is never full, am I right?

Yes, you are right.

About the timing of add/remove virtual_loss:

In leaf nodes, I think virtual_loss_undo() should be done around backup_value_single(); your current timing is too early, so it stops blocking the other coroutines before the backup happens.

virtual_loss_do() is done at the beginning of start_tree_search(), so the virtual loss of the node self.children[select_move] (L240) is added later.
Other coroutines can therefore select that node before its virtual loss is added.
The virtual loss should be added before the await (after an await, another coroutine can resume).
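For illustration, a minimal self-contained sketch of that timing (the names are illustrative, not the actual repository code):

```python
import asyncio

VIRTUAL_LOSS = 3  # the value used in the AlphaGo paper

class Node:
    def __init__(self):
        self.N = 0    # visit count
        self.W = 0.0  # total value

    def virtual_loss_do(self):
        # Make this node look worse so concurrent searches avoid it.
        self.N += VIRTUAL_LOSS
        self.W -= VIRTUAL_LOSS

    def virtual_loss_undo(self):
        self.N -= VIRTUAL_LOSS
        self.W += VIRTUAL_LOSS

async def visit(node, evaluate):
    node.virtual_loss_do()        # BEFORE the await, so other coroutines
                                  # see the loss when they select a node
    value = await evaluate(node)  # other coroutines run during this await
    node.virtual_loss_undo()      # undone together with the backup
    node.N += 1
    node.W += value
```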

@yhyu13

yhyu13 commented Nov 12, 2017

@mokemokechicken

> I think it is a good idea to check illegal moves in the expansion phase.
> As the model learns, it will stop proposing illegal moves.

I've tried that; it makes the code run slower because it doesn't reduce the number of iterations in each expansion. It is still 362 because I need to check validity.

Also, do you have any idea about resigning games with a certain threshold? It sounds like a very convenient way to stop a game early and save time. I am not sure I understand the false-positive rate:

[screenshot: excerpt from the AlphaGo Zero paper describing the resignation threshold and false-positive rate]

If I wanted to measure false positives, I would replay all resigned games from the point of resignation and see whether the win ratio is better than 5%. I am not sure I understand how it works.

> About the timing of add/remove virtual_loss

Thanks for spotting these mistakes; I've changed my code. As for your earlier suggestion about moving NetworkAPI elsewhere, I couldn't find a way to do so. What was your idea?

@mokemokechicken
Owner

mokemokechicken commented Nov 12, 2017

@yhyu13

> do you have any idea about resigning games with a certain threshold?

I think it is not difficult.
As in the paper:

  1. Disable resignation in 10% of self-play games and play them until termination.
  2. Check the games that met the resign condition against their final results. A game that is won after meeting the resign condition is a false positive.
  3. If the false-positive rate is larger than 5%, relax the resign condition.
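For illustration, a minimal sketch of steps 2 and 3, assuming resignation triggers when the root value drops below a threshold v_resign (all names here are hypothetical):

```python
def update_resign_threshold(monitored_games, v_resign, step=0.05, target_fp=0.05):
    """monitored_games: outcomes from the 10% of self-play games played to
    termination; each entry is (hit_resign_condition, resigning_side_won)."""
    hits = [won for hit, won in monitored_games if hit]
    if not hits:
        return v_resign
    false_positive_rate = sum(hits) / len(hits)
    if false_positive_rate > target_fp:
        v_resign -= step  # relax: demand a more hopeless value before resigning
    return v_resign
```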

@yhyu13

yhyu13 commented Nov 12, 2017

@mokemokechicken

Thanks! So of those 10% of self-play games, the algorithm would flag N games as resignable, and among those N games, I want to make sure that the win ratio is less than 5%.

@yhyu13

yhyu13 commented Nov 13, 2017

@mokemokechicken

I managed to refactor APV_MCTS_2.py using @classmethod. I believe it now not only looks better but also runs faster.

But as I said earlier, the main speedup comes from encountering more illegal moves, so that the expensive expansion is not invoked.

When comparing APV_MCTS.py and APV_MCTS_2.py, I found that even though their hyperparameters are the same, the search results are consistently quite different; see this line in the second version and this line in the first version. The second version consistently expands fewer leaf nodes than the first. My best guess is that some coroutines bypass virtual_loss_do() and manage to select the same location for both players, thus creating illegal moves in the early game.

I also found that adding Dirichlet noise is expensive because the data structure of my MCTS node offers no Pythonic way to do it; see this line.

@Zeta36
Author

Zeta36 commented Nov 13, 2017

@mokemokechicken, I can see in the README that your model is still improving!! I think you've done a really great job, and you only need more computational power (or more time) to get a superhuman player.

It seems you really were able to reproduce DeepMind's AlphaGo Zero approach :).

As soon as I get time, I'll try to adapt your code to a chess environment. I've been looking at your code and I think it'll be "easy" to do.

Regards!!

@mokemokechicken
Owner

@yhyu13

> I managed to refactor APV_MCTS_2.py using @classmethod. I believe it now not only looks better but also runs faster.

I think using a class feature to share a context is not a very good approach, because every function/method that uses the context must then be implemented in that one class, which leads to a big class.
Instead, create a 'context' object that holds the shared parameters and objects, and pass it explicitly among the class instances.
I think that is a better way (though I don't know the best way...).

> But as I said earlier, the main speedup comes from encountering more illegal moves, so that the expensive expansion is not invoked.
>
> When comparing APV_MCTS.py and APV_MCTS_2.py, I found that even though their hyperparameters are the same, the search results are consistently quite different; see this line in the second version and this line in the first version. The second version consistently expands fewer leaf nodes than the first. My best guess is that some coroutines bypass virtual_loss_do() and manage to select the same location for both players, thus creating illegal moves in the early game.

I think virtual_loss_do() is still called too late.
It should be called before the await, not at the beginning of start_tree_search().

> I also found that adding Dirichlet noise is expensive because the data structure of my MCTS node offers no Pythonic way to do it; see this line.

Yes, I think so too.
I think it is better for the parent node to hold all of its children's N/W/U and move probabilities as np.ndarrays.
Then it becomes natural for the parent to add the virtual loss to a child node.
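For illustration, a minimal sketch of such a node with per-child statistics stored as NumPy arrays (all names and constants here are illustrative):

```python
import numpy as np

class Node:
    """The parent holds per-child P/N/W as arrays, so PUCT selection is one
    vectorized expression, and Dirichlet noise or virtual loss is one array op."""
    def __init__(self, prior, c_puct=1.0):
        self.P = np.asarray(prior, dtype=np.float64)  # move probabilities
        self.N = np.zeros_like(self.P)                # per-child visit counts
        self.W = np.zeros_like(self.P)                # per-child total values
        self.c_puct = c_puct

    def select(self):
        Q = self.W / np.maximum(self.N, 1)
        U = self.c_puct * self.P * np.sqrt(self.N.sum() + 1) / (1 + self.N)
        return int(np.argmax(Q + U))

    def add_dirichlet_noise(self, eps=0.25, alpha=0.5):
        self.P = (1 - eps) * self.P + eps * np.random.dirichlet([alpha] * len(self.P))

    def virtual_loss(self, move, sign=+1):
        # The parent applies/undoes a child's virtual loss directly.
        self.N[move] += sign * 3
        self.W[move] -= sign * 3
```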

@yhyu13

yhyu13 commented Nov 13, 2017

@mokemokechicken

> Yes, I think so too.
> I think it is better for the parent node to hold all of its children's N/W/U and move probabilities as np.ndarrays.
> Then it becomes natural for the parent to add the virtual loss to a child node.

I managed to find a hacky way to solve this problem: I add the Dirichlet noise to the prior probabilities in the expansion phase. The profiling results show it explores more (1468/1600 vs. 1000/1600), and because of that, time per move increases from 3.6s to 4.2s, without having to touch the child dictionary (which would increase time per move to 7s).

         5076128 function calls (5069017 primitive calls) in 4.207 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 295 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    4.208    4.208 APV_MCTS_2.py:266(suggest_move_prob)
        1    0.014    0.014    4.097    4.097 {method 'run_until_complete' of 'uvloop.loop.Loop' objects}
     4603    0.012    0.000    4.055    0.001 APV_MCTS_2.py:286(tree_search)
8567/3067    0.052    0.000    4.024    0.001 APV_MCTS_2.py:150(start_tree_search)
     1468    0.037    0.000    2.028    0.001 APV_MCTS_2.py:123(expand)
     1468    0.551    0.000    1.904    0.001 APV_MCTS_2.py:128(<dictcomp>)
     2928    0.318    0.000    1.268    0.000 {built-in method builtins.max}
   531416    0.984    0.000    0.984    0.000 APV_MCTS_2.py:85(__init__)
  1059936    0.284    0.000    0.950    0.000 APV_MCTS_2.py:229(<lambda

> I think virtual_loss_do() is still called too late.
> It should be called before the await, not at the beginning of start_tree_search().

I've fixed it, thanks!

> Instead, create a 'context' object that holds the shared parameters and objects, and pass it explicitly among the class instances.

Forgive my ignorance, but I am not sure I understand what a context is. Could you be more explicit about your idea?

Talk to you soon.

@mokemokechicken
Owner

Hi, @Zeta36

Yes, I am training my model now. I already cannot win against it... (^^

A chess version would be very interesting. I would like you to try implementing it by all means!

@mokemokechicken
Owner

@yhyu13

> I add the Dirichlet noise to the prior probabilities in the expansion phase.

That's good!

> context

It means something like this.
A context object holds the parameters and objects that are shared among other objects.
Be careful that the context does not grow very big.
In your case, the context might hold running_simulation_num, ROOT, api, and so on.

A class field can hold only one object; class fields are effectively global variables.
So, for example, it is difficult to have two api objects if black and white want to use different apis.
(Especially in threaded/asynchronous programming, this is an easy way to make mistakes.)
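A minimal sketch of such a context object (the attribute names follow this discussion; they are assumptions, not the repository's actual code):

```python
import asyncio

class SearchContext:
    """Holds the objects shared by the search coroutines; passed around
    explicitly instead of living in class fields (i.e. globals)."""
    def __init__(self, api, num_sem, queue_size):
        self.api = api                        # e.g. a NetworkAPI instance
        self.sem = asyncio.Semaphore(num_sem)
        self.queue = asyncio.Queue(queue_size)
        self.running_simulation_num = 0
        self.root = None

# Because each context is an ordinary instance, black and white can use
# different apis simply by holding different contexts:
# black_ctx = SearchContext(black_api, num_sem=8, queue_size=16)
# white_ctx = SearchContext(white_api, num_sem=8, queue_size=16)
```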

@yhyu13

yhyu13 commented Nov 14, 2017

@mokemokechicken

Thanks for your explanation! So I guess a context would be similar to the NetworkAPI class I made before, but with only the __init__() method, am I right?

But since predict_worker() and push_queue() only use attributes initialized in NetworkAPI, I moved these two methods into that class.

> So, for example, it is difficult to have two api objects if black and white want to use different apis.
> (Especially in threaded/asynchronous programming, this is an easy way to make mistakes.)

You're perfectly right about that; I will use APV_MCTS with the NetworkAPI class for self-play evaluation. However, I guess it would be more efficient to let the NN play out the game by itself instead of aiding MCTS during model evaluation. According to DeepMind, each model plays against the best model for 400 games and the win ratio is calculated afterwards, with each move played after 1,600 searches. That seems redundant. I believe that if a model can beat another model just by the outputs of the policy network, then that model can be called "stronger". What do you think?

EDIT: DeepMind mentions in the paper that the best model generates 25,000 games of self-play training data. Compared with 25,000, 400 games is a small number, so I guess DeepMind doesn't care.

@mokemokechicken
Owner

@yhyu13

> Thanks for your explanation! So I guess a context would be similar to the NetworkAPI class I made before, but with only the __init__() method, am I right?

Yes, you are right.

> But since predict_worker() and push_queue() only use attributes initialized in NetworkAPI, I moved these two methods into that class.

As one aspect of class design, I think it may be better for NetworkAPI to have only the queue and loop attributes
(because NetworkAPI doesn't use sem or running_simulation_num),
and for a context class to have the api, sem, and running_simulation_num attributes.
Well, but I do not mind that much (^^;

> However, I guess it would be more efficient to let the NN play out the game by itself instead of aiding MCTS during model evaluation. ... I believe that if a model can beat another model just by the outputs of the policy network, then that model can be called "stronger". What do you think?

That is an interesting hypothesis! I think it is worth experimenting with.

If I may raise an objection: that evaluation would not reflect the goodness of the value network, which matters a lot if you use MCTS in an actual game.

I think playing with only the policy network amounts to changing the MCTS count from 1,600 to 1.
It may be a trade-off between thinking time and strength.

@yhyu13

yhyu13 commented Nov 15, 2017

@mokemokechicken

> If I may raise an objection: that evaluation would not reflect the goodness of the value network, which matters a lot if you use MCTS in an actual game.

I hadn't thought about that before; it makes sense to me. My supervised training shows the MSE of predicting the game result stuck at 1.00, because DeepMind set the weight of the game-result MSE 100 times smaller in SL than in RL. I believe the value network is important, as the MSE DeepMind reached is around 0.2.

So self-play with search not only tests the policy network but also exercises the value network.

> I think playing with only the policy network amounts to changing the MCTS count from 1,600 to 1.
> It may be a trade-off between thinking time and strength.

Thanks for the insight!

Question: how did you build the self-play agent that can play against Reversi phone apps? Do they provide APIs?

@mokemokechicken
Owner

> Question: how did you build the self-play agent that can play against Reversi phone apps? Do they provide APIs?

I didn't build a phone app.
I made only a Python desktop app.

@gooooloo
Contributor

@yhyu13

> However, I guess it would be more efficient to let the NN play out the game by itself instead of aiding MCTS during model evaluation. ... I believe that if a model can beat another model just by the outputs of the policy network, then that model can be called "stronger". What do you think?

Here is another viewpoint: the NN policy (the agent) is playing a "wrapper game" rather than the original game of Go. When the wrapper-game env receives an action, it first uses MCTS to generate an "internal action" from the agent's action, then applies that internal action to the internal Go game, which is invisible to the NN policy. All the NN policy ever interacts with is the wrapper-game env.

This matches the standard RL model, although the state transition is stochastic. From this viewpoint, I believe it is better to play the same wrapper game when evaluating the NN policy; that is, it is better to use MCTS during evaluation too.

-- Just my personal opinion though :)

@yhyu13

yhyu13 commented Nov 16, 2017

@mokemokechicken

I found a desktop app, Sabaki, which supports the Go Text Protocol, and it works well for me.

@yhyu13

yhyu13 commented Nov 16, 2017

@gooooloo

Good point! Your analogy makes sense. But keep in mind that the NN is trying to learn the statistical distribution calculated by MCTS, which is guided by the NN in the first place. DeepMind also argues that MCTS usually gives stronger play.

But my concern now is the quality of the search tree. At least in the initial phase, the search result says several nodes have equivalent probability while the others are zero (never visited). Maybe 1,600 playouts is not enough? Pachi, a Go program that relies entirely on MCTS, does about 10,000 playouts per move; Leela does even more, about 50,000 per move. Keep in mind that the time it takes to feed-forward a deep neural net must be a factor in the trade-off as well.

@mationai
Contributor

@yhyu13
Off topic, but according to the DeepMind Mastering Go paper, Pachi runs 100k simulations per move (p. 6).
