Great job!! #1
Comments
Hi, mokemokechicken. It's a wonderful project, keep it going! |
Hi, thank you for your compliment!
Currently, it is not so strong. The 7th-generation model beats the beginner player at http://kuina.ch/board_game/a01 (the left icon, 初心者 meaning "beginner"). The progress of model evolution is as follows.
|
Hi, @mokemokechicken Forgive my ignorance, but I wonder what game your agent is playing? It seems to be a game of flipping black and white stones, played on an 8x8 grid. Could you please provide some background on this game? How is it similar to and different from the game of Go? Also, I am quite stuck on the implementation of APV-MCTS in Python. I am not new to Python, but I haven't practiced coroutine programming before. In particular, could you please explain the idea behind player.py lines 172-174? It would be great if you could take a look at my code (I have a feeling that I have to combine the MCTS tree-node class and the tree-search control class together). |
About the Reversi game: https://en.wikipedia.org/wiki/Reversi |
Hi, @yhyu13
Although I may not accurately grasp the intent of the question: the cost of a TensorFlow prediction call is very high, so it is better to predict collectively, i.e. to gather many leaf evaluations and run them as one batch.
I saw your code, but it is difficult for me to make useful comments on it. |
@Zeta36 Thanks, that is helpful. And I just learned that Reversi also sparked the interest of A.I. researchers in the late 90's. A professor at the University of Alberta, Michael Buro, was a pioneer in tackling this challenge. @mokemokechicken I have put TreeNode, NN, and MCTS in the same class; it does take some time to get it correct. I will let you review it when I am ready. |
@mokemokechicken In your table, what does "generation" mean? The generation of the best model, or the generation of the trained model? As we can see, every generation beats the best model over 55% of the time, so I guess it is the former? |
@gooooloo Yes, it is the generation of the best model. |
@mokemokechicken got it. Thank you! |
Hi, I've learned a lot from your coroutine implementation of MCTS in Python. Please take a look: APV-MCTS in Python. Also, I was advised to compile it with Cython; the C compiler optimizes my tree search a little bit. Through method profiling, I found two bottlenecks:
(1) has to do with the data structure, and (2) has to do with whether or not I am using a GPU. Supervised training on a Volta instance shows it can process 1,450 samples per second (including I/O) with a 20-block ResNet. A small batch ready for evaluation (size 8~16) should take about 0.01 seconds. However, 1,600 iterations involve about 200 evaluations in total, so 200 × 0.01 s = 2 s, which is far from DeepMind's 0.4 seconds per move. On my 4-CPU MacBook, 1,600 iterations take about 5.6 seconds (excluding evaluation), so I will give machines with more and better CPUs a try. My question at this moment is:
Regards, |
Could you please tell me how you managed to make download_the_best_model.sh? Where are the files stored? I am trying to upload/download training datasets without using git lfs. |
Hi, @yhyu13
I think it is a good idea to separate NetworkAPI from MCTS. FYI: according to https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf, they set
Google's TPU machine might perform better than other GPUs.
It is a very difficult question...
I just pushed the models to https://github.com/mokemokechicken/reversi-alpha-zero-models. |
Here are two of my profiles. I found that the search time per move really depends on the number of illegal moves encountered. The search spends the most time when every leaf node is expandable (which calls the neural network and creates all child nodes). I tried to put the illegal-move check in the expansion phase so that fewer child nodes are created, but I couldn't find a clean way to do it, and it was even slower. In the first profile, the selected child node is generated by a random process that invokes the most evaluation and expansion. (BTW, the number of elements in the queue is upper-bounded by the number of semaphores, i.e. it's better to have queue_size >= num_sem so that the queue is never full, am I right?)
The second profile comes from a run with a lot of illegal moves, so there are 15x fewer expansions and 4x fewer evaluations:
DeepMind claims its average search time per move is 0.4 s, so I take the average of the upper and lower bounds, which in this case are full expansion and root-only expansion. The third profile was generated with root-only expansion (8 semaphores on a 4-core machine):
Interestingly enough, the total time with 16, 8, 4, 2, and 1 semaphores is the same (about 2.5 s) with root-only expansion. This is understandable because it doesn't need extra coroutines. A naive conversion to Cython speeds up the lower bound to 2.1 s. BTW, I didn't use an actual TensorFlow session but merely returned a random value during evaluation; otherwise it takes too long (50 s+). However, if I get realistic and set the per-move iteration count to 200 instead of 1,600, the upper bound becomes 0.75 s and the lower bound 0.3 s. I will try to build a better Cython version of APV_MCTS; in your experience, would that be helpful? |
I think it is a good idea to check illegal moves in the expansion phase.
Yes, you are right. Regarding the timing of adding/removing virtual_loss in leaf nodes, I think
|
I've tried that; it makes the code run slower because it doesn't reduce the total number of iterations at each expansion. It is still 362 because I need to check validity. Also, do you have any idea about resigning games below a certain threshold? It sounds like a very convenient way to stop a game early and save time. I am not sure I understand the false-positive rate: to measure false positives, I would replay all resigned games from the point of resignation and see whether the win ratio is better than 5%. I am not sure I understand how it works.
Thanks for spotting these mistakes. I've changed my code. As for your previous suggestion on moving |
I think it is not difficult.
|
Thanks! So in those 10% of self-play games, the algorithm would have resigned N games, and among those N games, I want to make sure that the win ratio is less than 5%.
I managed to refactor APV_MCTS_2.py by using @classmethod. I believe it now not only looks better but also runs faster. But as I said earlier, the main reason for the speed-up is running into more illegal moves, so that the expensive expansion is not invoked. When comparing APV_MCTS.py and APV_MCTS_2.py, I found that even though their hyperparameters are the same, the search results are consistently quite different; see this line in the second version and this line in the first version. The second version consistently expands fewer leaf nodes than the first. My best guess is that some coroutines bypass the ... I also found that adding ... |
@mokemokechicken, I can see in the readme that your model is still improving!? I think you've done a really great job, and you only need more computation power (or more time) to get a superhuman player. It seems you really were able to reproduce DeepMind's AlphaGo Zero model :). As soon as I get time, I'll try to adapt your code to a chess environment. I've been looking at your code and I think it'll be "easy" to do. Regards!! |
I don't think using a class-level field to share a context is a very good approach.
I think it is still too late to call
Yes, I think so too. |
I managed to find a hacky way to solve this problem: I add Dirichlet noise to the prior probabilities in the expansion phase. The profile results show it explores more (1468/1600 vs. 1000/1600), and because of that, the time per move increases from 3.6 s to 4.2 s, without involving the child dictionary (which would increase the time per move to 7 s).
I've fixed it, thanks!
Forgive my ignorance, but I am not sure I understand what a context is? I believe you need to be more explicit about your idea. Talk to you soon. |
Hi, @Zeta36 Yes, I am training my model now. I can no longer win against it... (^^ The chess version is very interesting. I would very much like you to try implementing it!
That's good!
It means something like this: a class field can hold only one object, which is shared by every instance. |
Thanks for your explanation! So I guess a context would be similar to the NetworkAPI class I made before, but there is only the ... But since ... |
You're perfectly right about it; I will use APV_MCTS with the NetworkAPI class for self-play evaluation. However, I guess it would be more efficient to let the NN play out the games by itself, instead of aiding MCTS, in the model-evaluation procedure. According to DeepMind, one model plays against the best model for 400 games, and the win ratio is calculated afterwards. DeepMind made each model play each move after 1,600 searches, which seems redundant. I believe that if a model can beat another model just by the outputs of its policy network, then that model can be called "stronger", what do you think? EDIT: DeepMind mentions in the paper that the best model generates 25,000 games of self-play training data. So 400 games compared to 25,000 is a small number; I guess DeepMind doesn't care. |
Yes, you are right.
I think that, as one aspect of class design, it may be better for NetworkAPI to have only
That is an interesting hypothesis! I think it is worth experimenting with. If I may raise an objection: the goodness of the value network would not be reflected, and the value network matters a great deal when you use MCTS in an actual game. I think playing with only the policy network amounts to changing the MCTS count from 1,600 to 1. |
I hadn't thought about that before; it makes sense to me. However, my supervised training shows that the MSE of predicting the game result stays at 1.00, because DeepMind set the weight of the game-result MSE 100 times smaller in SL than in RL. I believe the value network is important, as the MSE DeepMind trained to is around 0.2. So self-play with search not only tests the policy network but also exercises the value network.
Thanks for the insight! Question: I wonder how you built the self-play agent that can play against Reversi phone apps? Do they provide APIs? |
I didn't build a phone app. |
Here is another viewpoint: the NN policy (the agent) plays a "wrapper game" rather than the original "Go" game. When the "wrapper game" env takes an action, it first uses MCTS to generate an "internal action" from the agent's action, then applies the "internal action" to the internal "Go" game, which is unknown to the NN policy. All the NN policy interacts with is the "wrapper game" env. This matches the standard RL model, in which the state transition is stochastic. From this viewpoint, I believe it is better to play the same wrapper game when evaluating the NN policy, which means it is better to use MCTS during evaluation too. -- Just my personal opinion though :)
I found a desktop app, Sabaki, which supports the Go Text Protocol, and it is working well for me. |
Good point! Your analogy makes sense. But keep in mind that the NN is trying to learn the statistical distribution calculated by MCTS, which is guided by the NN in the first place. DeepMind also argues that MCTS usually gives stronger play. My concern now is the quality of the search tree: at least in the initial phase, the search tree yields results where several nodes have equivalent probability and the rest are zero (unvisited nodes). Maybe 1,600 playouts is not enough? Pachi, a Go program that relies entirely on MCTS, does about 10,000 playouts per move. Leela does even more, about 50,000 per move. Keep in mind that the time it takes to feed-forward a deep neural net is a factor in the trade-off as well. |
@yhyu13 |
Wonderful job, friend.
Can you please tell us what performance you got with this approach? Do you have some statistics or something?
Regards!