Libtorch C++ AlphaZero #319
Conversation
Hi @christianjans, this looks really great. Just a heads-up that we still plan to look into this... we just got busy with vacations and an influx of small PRs. Planning to get to this in the coming weeks.
Hi @lanctot, no worries! There's no rush on my end.
@christianjans regarding your note about the policy loss: seems rather important to be sure to get this right. Can you point us specifically to the place you're not sure about? Tagging a few PyTorch users I know: @michalsustr @ssokota
Tagging @elkhrt because he's also taking a look with me.
From a brief look, it seems okay to me re: consistency. One thing I did notice is that it looks like the code computes the cross entropy from the policy rather than the logits, which is numerically unstable. PyTorch's cross entropy function doesn't accept soft labels, but it has a stable log softmax function that you could use, i.e., do log_softmax(logits) rather than log(softmax(logits)). Alternatively, you could use KL divergence, which has the same gradient as cross entropy.
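For concreteness, here is a minimal sketch of what that stable soft-label cross entropy could look like in LibTorch. This is not the PR's actual code; the function name, variable names, and tensor shapes are illustrative assumptions.

```cpp
#include <torch/torch.h>

// Hypothetical policy loss: cross entropy with soft labels, computed
// from the raw logits via the numerically stable log_softmax.
// `logits` and `policy_targets` are assumed to be [batch, num_actions].
torch::Tensor PolicyLoss(const torch::Tensor& logits,
                         const torch::Tensor& policy_targets) {
  // log_softmax(logits) avoids the underflow that log(softmax(logits))
  // can hit when a softmax probability rounds to zero.
  torch::Tensor log_probs = torch::log_softmax(logits, /*dim=*/1);
  // Soft-label cross entropy: -sum(targets * log_probs) per example,
  // averaged over the batch.
  return -(policy_targets * log_probs).sum(/*dim=*/1).mean();
}
```

Since the targets here are full MCTS visit-count distributions rather than one-hot labels, the KL-divergence alternative mentioned above would yield the same gradient, differing only by the (constant) entropy of the targets.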
Hi @christianjans, can you see the comments about the policy loss? Thanks.
Sorry for the late reply, but thank you for the advice! I have been implementing the log softmax solution and am working on testing it (I have encountered a build error with Clang, "unsupported option '-fopenmp'", which I will try to investigate soon). Unfortunately, I finish up midterms this upcoming week and am working on completing a research project in the next few weeks. I will try to dedicate as much time as I can to this in the coming weeks, although it may be sparse. However, if it would be helpful, I can commit the untested implementation for review/revision. With regards to the softmax_cross_entropy_with_logits_v2 function confusion, I have confirmed its implementation with a quick example, and this Torch version should follow the same implementation.
@christianjans there is no rush; let's wait until you have time to properly test it, at least such that it passes the same tests remarked in the original PR (i.e., it passes vpnet_test and solves Tic-Tac-Toe).
@christianjans, just checking in to see if you still plan to do the final bits necessary to pull this in? My understanding is that people are already using it, so it'd be great to merge it.
Hi @lanctot, sorry, yes, I was actually planning to resume work on it this weekend. I apologize for the lack of updates, but I should have more time to dedicate to this now. Glad to hear people are using it!
Okay great, it looks like it's still passing the VPNet test:

```
(venv) open_spiel $ ./build/algorithms/alpha_zero_torch/torch_vpnet_test
TestModelCreation: resnet
TestModelLearnsSimple: resnet
states: 7
0: Losses(total: 3.334, policy: 2.062, value: 1.249, l2: 0.023)
1: Losses(total: 2.062, policy: 1.556, value: 0.482, l2: 0.023)
...
33: Losses(total: 0.102, policy: 0.054, value: 0.000, l2: 0.048)
34: Losses(total: 0.096, policy: 0.048, value: 0.000, l2: 0.048)
TestModelLearnsOptimal: resnet
states: 4520
0: Losses(total: 1.983, policy: 1.226, value: 0.735, l2: 0.023)
1: Losses(total: 1.771, policy: 1.076, value: 0.672, l2: 0.023)
...
65: Losses(total: 0.203, policy: 0.122, value: 0.024, l2: 0.057)
66: Losses(total: 0.171, policy: 0.093, value: 0.021, l2: 0.057)
```

and it is still learning in games like Tic-Tac-Toe:

```
./build/examples/alpha_zero_torch_example --nn_width=64 --nn_depth=2 --game=tic_tac_toe --replay_buffer_size=16384 --replay_buffer_reuse=4 --checkpoint_freq=25 --max_simulations=50 --actors=2 --evaluators=1 --max_steps=50
```

The error I was getting before regarding the '-fopenmp' flag and Clang had something to do with the Libtorch library that was downloaded. I believe this is because in...

Thanks again for your patience with this PR, let me know what else can be changed/added!
Thanks @christianjans, this is looking great!
Can you add another URL in the list of alternative URLs below in install.sh, something like "For C++ PyTorch AlphaZero on MacOS we recommend this URL: https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.8.0.zip"?
Good idea! It has been added.
@christianjans Hello Christian, ...
Hi @selfsim! Sorry I didn't see your comment earlier. But yes, absolutely! There is a script (https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/algorithms/alpha_zero/analysis.py) that you can run that will do all this analysis for you. Essentially, when you run the Python AlphaZero or the Libtorch AlphaZero, an experiment directory will be created that contains the neural network checkpoints, logs, configs, and analysis data. You can specify this directory with the --path flag.

Once you finish training, you can view the results by running the analysis script:

```
$ python3 open_spiel/python/algorithms/alpha_zero/analysis.py --path=<path-to-the-az-directory>
```

Let me know if anything was unclear or if you run into any problems. More information about this can be found here: https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md#analysis
@christianjans Thanks for the reply; if everything goes well, I will be looking at some cool graphs tonight. I am currently training an agent on a game implementation I created. I have a few questions; any help is greatly appreciated. How does one load checkpoints to continue training at a later point? How could I set up a 1v1 game between a player (human or otherwise) and the trained agent? Thank you, and great work!
Hi @selfsim, thanks, and great questions! There is, unfortunately, no way to continue from checkpoints with the C++ AlphaZero version without some sort of weird behaviour. The checkpoints do save the model and optimizer, which could be used to start from again; however, there is currently no functionality (such as a command-line flag) to provide a checkpoint to resume training from. Additionally, the data that is recorded for analysis will be overwritten (or at least corrupted) when training resumes from a checkpoint. Resuming training from a checkpoint is definitely a good feature to add, though, and I think I should have some time to look into it. I had tried doing this for the Python TensorFlow AlphaZero available in OpenSpiel: I was able to add the functionality to resume training from a checkpoint, and the analysis data continued to record from that checkpoint. However, after continuing to train the model, there were two issues I could identify.
The first issue, I think, has to do with the Python TensorFlow AlphaZero perhaps not saving the state of the optimizer at each checkpoint. This is fixed in the C++ Libtorch AlphaZero, as both the model and the state of the optimizer are saved for each checkpoint. However, I was unable to identify a potential cause for the second issue (perhaps the two issues are related?). Anyway, here is some of my work on restarting from a checkpoint in the Python version: https://github.com/christianjans/open_spiel_resume_files. And, like resuming training, there isn't currently a way to set up a 1v1 game with other players, but this is definitely another good thing to add.
Hi @christianjans, thanks for the insights! I will look into setting up a 1v1 game and extracting a bot from the checkpoints.
@christianjans Could you also maybe comment on the output of the analysis script? In particular, what do the multiple lines in the MCTS solver graphs mean? Is it the number of simulations? Also, for the outcome graph, I haven't been able to figure out exactly who player 1 and player 2 are. Thank you!
@michalsustr I noticed in a different PR you mentioned that you have been working only with the LibTorch AZ implementation as opposed to TF. Have you played against a bot that you have trained?
Hi @selfsim,
No worries, I have been looking into the C++ side of things, and I have a basic implementation in the works where you can play your Libtorch AlphaZero against a random player or MCTS player. See this branch on my personal OpenSpiel repository. After installing and building (with Libtorch on, of course), you can run it using the command:

```
$ ./build/examples/alpha_zero_torch_game_example --game=<game_to_play> --player1=random --player2=az_torch --az_path=<your_libtorch_az_experiment_directory>
```

I will try to continue updating this branch, but will also look into restarting training from checkpoints.
Yes, definitely. The MCTS solver graph is certainly confusing, but I can try to explain it here. During training, your AlphaZero player is constantly being evaluated by playing against MCTS players of varying "levels". The level of an opponent MCTS player is determined by the number of simulations it is allowed to run for each move it makes: a higher-level MCTS opponent runs more simulations than a lower-level one. There are 7 such levels in the example graph, and the simulation budget of the MCTS opponent grows with its level. So what we should expect to see is that we are able to beat lower-level MCTS opponents faster (earlier on in training) than higher-level MCTS opponents, and that is what the graph example shows.
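Purely as an illustration of that idea (the geometric growth below is an assumption for the sketch, not the exact schedule the evaluators use), the mapping from level to simulation budget could look like this:

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Assumed values for the sketch: the base budget mirrors the
  // --max_simulations=50 flag from the training command above.
  const int base_simulations = 50;
  const int num_levels = 7;
  for (int level = 0; level < num_levels; ++level) {
    // Each level multiplies the budget, so a level-6 opponent searches
    // far deeper per move than a level-0 opponent.
    const int sims = static_cast<int>(
        base_simulations * std::pow(10.0, level / 2.0));
    std::printf("MCTS opponent level %d: %d simulations per move\n",
                level, sims);
  }
  return 0;
}
```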
The outcome graph shows the results of your AlphaZero player playing against itself in self-play, so player 1 and player 2 are both your AlphaZero algorithm, just from different perspectives. Feel free to let me know if anything is still confusing or if there are any other questions.
@christianjans Thanks a bunch for the clarification and for your patience with this barrage of questions. I looked at your branch and it looks like a great start. Are you working on a human-vs-bot option currently? If not, I could see what I can do; I'd like to test out some of my training runs. Really appreciate the work!
No worries, @selfsim, happy to help. And yeah, that would be a great addition! I was not planning to add a human bot, but I think I could dedicate some time to it too. If you get around to it first, feel free to submit a pull request!
There is now an initial implementation of a human bot on the branch.
@christianjans, this is awesome! Can you submit a PR so we can add it to the master branch? Seems like a wonderful thing to have for people using your code. BTW, regarding the previous discussion about checkpoints: I think for AlphaZero to properly restore its full state, you would need to store/retain all of the data in the current replay buffer in addition to everything else.
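To make that concrete, here is a rough sketch of the state a full-restore checkpoint would have to capture. Every name below is hypothetical; the only point taken from the comment above is that the replay buffer contents belong in it alongside the model and optimizer.

```cpp
#include <string>
#include <vector>

// Placeholder for whatever sample type the replay buffer stores
// (observations, policy targets, and game outcomes).
struct TrainInput {};

// Hypothetical full checkpoint: restoring only the first two fields
// (what the current checkpoints hold) loses the in-flight training data.
struct FullCheckpoint {
  std::string model_path;       // network weights (already saved today)
  std::string optimizer_path;   // optimizer state (already saved today)
  std::vector<TrainInput> replay_buffer;  // the missing piece
  int learner_step;             // so logs and analysis can continue
};
```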
And of course I would absolutely love to see any checkpointing code merged into master too if you manage to get it to work! :) @selfsim, if you see any opportunity to improve the docs regarding the formats of the visualization tools, please flag it and/or submit PRs. This code has really been a wonderful addition to OpenSpiel and it'd be great if we can maintain it well!
@lanctot, yes for sure! There's still some cleaning up to do, but I should be able to submit a pull request soon. Were you thinking just the C++ human bot implementation? Or also the ability to play a trained Libtorch AlphaZero in a game?
And oh right, good call 👍🏻. Will keep this in mind as well.
Oh, I didn't realize you had both; I was only referring to the ability to play a trained Libtorch agent in a game, but both would be great. No rush, of course!
A Libtorch version of C++ AlphaZero.

Notes:

- There is an explicit learning option (the --explicit_learning flag) that can be used when multiple devices are available. It ensures that the GPU holding the model responsible for learning does not also take inference requests when it is supposed to be learning from the replay buffer. This seems to speed up the learning process when there are quite a few actors and evaluators.

Results:

Design decisions:

- The policy loss is not computed with TensorFlow's cross entropy function (softmax_cross_entropy_with_logits_v2), but the policy loss version implemented here should be similar and produces loss around the same order of magnitude.

Let me know if you have thoughts or suggestions on this!