
Libtorch C++ AlphaZero #319

Merged
merged 7 commits · Mar 15, 2021

Conversation

christianjans
Contributor

A Libtorch version of C++ AlphaZero.

Notes:

  • Unlike the Python TensorFlow AlphaZero, only a residual-network model is available so far.
  • Multiple devices are supported; there is a noticeable speedup with more GPUs.
  • An extra device-management option (the --explicit_learning flag) was added for setups with multiple devices. It ensures that the GPU holding the model responsible for learning does not also serve inference requests while it is supposed to be learning from the replay buffer. This seems to speed up training when there are quite a few actors and evaluators.

Results:

  • Passes vpnet_test.cc (which requires the network to fit batches of different sizes from Tic-Tac-Toe).
  • Learns to play Tic-Tac-Toe optimally.

Design decisions:

  • Unlike the TensorFlow C++ AlphaZero, this does not write an initial graph definition to a file ('vpnet.pb'); instead, it simply writes a struct describing the model to the file, and the Libtorch model can be initialized from that file (see the sketch after this list). This was done to preserve the way alpha_zero.cc creates the model (it calls a function to create the graph-definition file, then uses that file to build the model).
  • Model layers: Compared the TensorFlow and Torch documentation to ensure the layers' hyperparameters were consistent.
  • Model loss functions: Tried to recreate the loss functions of the TensorFlow version. There was not as much documentation on the TensorFlow policy loss (softmax_cross_entropy_with_logits_v2), but the policy loss implemented here should be similar and produces losses of around the same order of magnitude.
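For what it's worth, here is a minimal sketch of the struct-to-file idea from the first design decision above (field names and file format are illustrative only, not the PR's actual definitions):

```cpp
// Illustrative only: write a small description of the model to a file and
// rebuild the Libtorch model from it, in place of a TensorFlow GraphDef.
#include <fstream>
#include <string>

struct ModelConfig {
  std::string nn_model;       // e.g. "resnet"
  int nn_width = 0;
  int nn_depth = 0;
  int number_of_actions = 0;
};

void WriteConfig(const ModelConfig& config, const std::string& path) {
  std::ofstream out(path);
  out << config.nn_model << "\n" << config.nn_width << "\n"
      << config.nn_depth << "\n" << config.number_of_actions << "\n";
}

ModelConfig ReadConfig(const std::string& path) {
  ModelConfig config;
  std::ifstream in(path);
  in >> config.nn_model >> config.nn_width >> config.nn_depth >>
      config.number_of_actions;
  return config;
}
```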

Let me know if you have thoughts or suggestions on this!

@lanctot
Collaborator

lanctot commented Sep 17, 2020

Hi @christianjans, this looks really great. Just a heads-up that we still plan to look into this; we just got busy with vacations and an influx of small PRs. Planning to get to it in the coming weeks.

@christianjans
Contributor Author

Hi @lanctot, no worries! There's no rush on my end.

@lanctot lanctot self-assigned this Oct 15, 2020
@lanctot lanctot requested a review from tewalds October 15, 2020 20:14
@lanctot lanctot requested a review from elkhrt October 26, 2020 14:59
@lanctot
Collaborator

lanctot commented Nov 9, 2020

@christianjans regarding:

Model loss functions: Tried to recreate the loss functions of the TensorFlow version. There was not as much documentation on the TensorFlow policy loss (softmax_cross_entropy_with_logits_v2), but the policy loss version implemented here should be similar and produces loss around the same order of magnitude.

Seems rather important to be sure to get this right. Can you point us specifically to the place you're not sure about? Tagging a few PyTorch users I know: @michalsustr @ssokota

@lanctot
Collaborator

lanctot commented Nov 9, 2020

Tagging @elkhrt because he's also taking a look with me.

@ssokota
Contributor

ssokota commented Nov 9, 2020

From a brief look, it seems okay to me re: consistency. One thing I did notice is that the code computes the cross entropy from the policy rather than from the logits, which is numerically unstable. PyTorch's cross-entropy function doesn't accept soft labels, but it has a stable log-softmax function that you could use, i.e., do log_softmax(logits) rather than log(softmax(logits)). Alternatively, you could use KL divergence, which has the same gradient as cross entropy.
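As a concrete sketch of that suggestion (the function and variable names are illustrative, not the PR's actual code), a soft-label policy loss built on log_softmax could look like:

```cpp
// Illustrative sketch: cross entropy with soft targets computed from
// log_softmax(logits), which is numerically stable, rather than
// log(softmax(logits)).
#include <torch/torch.h>

torch::Tensor PolicyLoss(const torch::Tensor& policy_logits,
                         const torch::Tensor& policy_targets) {
  // Both tensors have shape [batch_size, num_actions]; the targets are a
  // probability distribution over actions (soft labels).
  torch::Tensor log_probs = torch::log_softmax(policy_logits, /*dim=*/1);
  // Cross entropy with soft labels: -sum_a target(a) * log p(a),
  // averaged over the batch.
  return -(policy_targets * log_probs).sum(1).mean();
}
```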

@lanctot lanctot self-requested a review November 13, 2020 20:08
Collaborator

@lanctot lanctot left a comment


Hi @christianjans, can you see the comments about the policy loss? Thanks.

@christianjans
Contributor Author

Hi @lanctot and @ssokota !

Sorry for the late reply, but thank you for the advice! I have been implementing the log-softmax solution and am working on testing it (I have encountered a build error with Clang, "unsupported option '-fopenmp'", which I will try to investigate soon). Unfortunately, I finish up midterms this upcoming week and am working on completing a research project over the next few weeks. I will try to dedicate as much time as I can to this, although my availability may be sparse. However, if it would be helpful, I can commit the untested implementation for review/revision.

Regarding the confusion around the softmax_cross_entropy_with_logits_v2 function, I have confirmed its behaviour with a quick example, and the Torch version here should follow the same implementation.

@lanctot
Collaborator

lanctot commented Nov 15, 2020

@christianjans there is no rush; let's wait until you have time to properly test it, at least such that it passes the same tests noted in the original PR description (i.e., passes vpnet_test and solves Tic-Tac-Toe).

@lanctot
Collaborator

lanctot commented Mar 2, 2021

@christianjans , just checking in to see if you still plan to do the final bits necessary to pull this in? My understanding is that people are already using it, so it'd be great to merge it.

@christianjans
Contributor Author

Hi @lanctot, sorry, yes, I was actually planning to resume work on it this weekend. I apologize for the lack of updates, but I should have more time to dedicate to this now. Glad to hear people are using it!

@christianjans
Contributor Author

christianjans commented Mar 7, 2021

Okay great, it looks like it's still passing the VPNet test:

(venv) open_spiel $ ./build/algorithms/alpha_zero_torch/torch_vpnet_test 
TestModelCreation: resnet
TestModelLearnsSimple: resnet
states: 7
0: Losses(total: 3.334, policy: 2.062, value: 1.249, l2: 0.023)
1: Losses(total: 2.062, policy: 1.556, value: 0.482, l2: 0.023)
...
33: Losses(total: 0.102, policy: 0.054, value: 0.000, l2: 0.048)
34: Losses(total: 0.096, policy: 0.048, value: 0.000, l2: 0.048)
TestModelLearnsOptimal: resnet
states: 4520
0: Losses(total: 1.983, policy: 1.226, value: 0.735, l2: 0.023)
1: Losses(total: 1.771, policy: 1.076, value: 0.672, l2: 0.023)
...
65: Losses(total: 0.203, policy: 0.122, value: 0.024, l2: 0.057)
66: Losses(total: 0.171, policy: 0.093, value: 0.021, l2: 0.057)

and still learning in games like Tic-Tac-Toe:

./build/examples/alpha_zero_torch_example --nn_width=64 --nn_depth=2 --game=tic_tac_toe --replay_buffer_size=16384 --replay_buffer_reuse=4 --checkpoint_freq=25 --max_simulations=50 --actors=2 --evaluators=1 --max_steps=50

[Screenshot: evaluation graphs from the Tic-Tac-Toe training run]

The error I was getting before regarding the '-fopenmp' flag and Clang had something to do with the Libtorch library that was downloaded. I believe this is because in open_spiel/scripts/install.sh, we download a version that just doesn't work with Apple's Clang:
https://github.com/deepmind/open_spiel/blob/a961d89273b6ae93e47755cfe3adb54ecb880966/open_spiel/scripts/install.sh#L132-L150
I had to download the macOS Libtorch from https://pytorch.org/get-started/locally/ in order for Libtorch to work on macOS (specifically https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.8.0.zip). I don't know how necessary it would be to change the Libtorch download based on the OS, but just wanted to make you aware.

Thanks again for your patience with this PR, let me know what else can be changed/added!

@lanctot
Collaborator

lanctot commented Mar 8, 2021

Thanks @christianjans, this is looking great!

The error I was getting before regarding the '-fopenmp' flag and Clang had something to do with the Libtorch library that was downloaded. I believe this is because in open_spiel/scripts/install.sh, we download a version that just doesn't work with Apple's Clang:

https://github.com/deepmind/open_spiel/blob/a961d89273b6ae93e47755cfe3adb54ecb880966/open_spiel/scripts/install.sh#L132-L150

I had to download the macOS Libtorch from https://pytorch.org/get-started/locally/ in order for Libtorch to work on macOS (specifically https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.8.0.zip). I don't know how necessary it would be to change the Libtorch download based on the OS, but just wanted to make you aware.

Can you add another URL in the list of alternative URLs below in install.sh, something like "For C++ PyTorch AlphaZero on MacOS we recommend this URL: https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.8.0.zip"?

@christianjans
Contributor Author

Can you add another URL in the list of alternative URLs below in install.sh, something like "For C++ PyTorch AlphaZero on MacOS we recommend this URL: https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.8.0.zip"?

Good idea! It has been added.

@lanctot lanctot added the imported This PR has been imported and awaiting internal review. Please avoid any more local changes, thanks! label Mar 11, 2021
@OpenSpiel OpenSpiel merged commit 8cc5e69 into google-deepmind:master Mar 15, 2021
@selfsim
Contributor

selfsim commented Mar 19, 2021

@christianjans Hello Christian,
Could you provide some sample code/tips on how you were able to visualize the results of training with 'alpha_zero_torch_example'?
It would be much appreciated.
-selfsim

@christianjans
Contributor Author

christianjans commented Mar 23, 2021

Hi @selfsim! Sorry I didn't see your comment earlier. But yes absolutely! There is a script (https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/algorithms/alpha_zero/analysis.py) that you can run that will do all this analysis for you.

Essentially, when you run the Python AlphaZero or the Libtorch AlphaZero, an experiment directory will be created that contains the neural network checkpoints, logs, configs, and analysis data. You can specify this directory with the path flag in both implementations of AlphaZero: https://github.com/deepmind/open_spiel/blob/b07e3ba2838c32be8f598abc705b886e64d75101/open_spiel/python/examples/alpha_zero.py#L45 https://github.com/deepmind/open_spiel/blob/b07e3ba2838c32be8f598abc705b886e64d75101/open_spiel/examples/alpha_zero_torch_example.cc#L23

Once you finish training, you can view the results by running the analysis.py script, ensuring to pass in the experiment directory:

$ python3 open_spiel/python/algorithms/alpha_zero/analysis.py --path=<path-to-the-az-directory>

Let me know if anything was unclear or if you run into any problems. More information about this can be found here: https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md#analysis

@selfsim
Contributor

selfsim commented Apr 1, 2021

@christianjans Thanks for the reply. If everything goes well, I will be looking at some cool graphs tonight; I am currently training an agent on a game implementation I created.

I have a few questions, any help is greatly appreciated. How does one load checkpoints to continue training at a later point? How could I set up a 1v1 game between a player (human or otherwise) vs the trained agent?

Thank you, and great work!

@christianjans
Contributor Author

christianjans commented Apr 4, 2021

Hi @selfsim, thanks, and great questions!

There is, unfortunately, no way to continue from checkpoints with the C++ AlphaZero version without some sort of weird behaviour. The checkpoints do save the model and optimizer, which could be used to resume; however, there is currently no functionality (such as a command-line flag) to provide a checkpoint to resume training from. Additionally, the data recorded for analysis will be overwritten (or at least corrupted) when training resumes from a checkpoint. Resuming training from a checkpoint is definitely a good feature to add, though; I think I should have some time to look into this.

I had tried doing this for the Python TensorFlow AlphaZero available in OpenSpiel. I was able to add the functionality to resume training from a checkpoint, and the analysis data continued to record from that checkpoint. However, after continuing to train the model, there were two issues I could identify:

  1. Some of the analysis graphs would show slightly erratic behaviour at the training step it continued from, and
  2. The models from checkpoints after the starting checkpoint would often perform worse than the starting checkpoint model.

I think the first issue has to do with the Python TensorFlow AlphaZero perhaps not saving the state of the optimizer at each checkpoint. This is not an issue in the C++ Libtorch AlphaZero, as both the model and the optimizer state are saved at each checkpoint. However, I was unable to identify a likely cause for the second issue (perhaps the two issues are related?). Anyway, here is some of my work on restarting from a checkpoint in the Python version: https://github.com/christianjans/open_spiel_resume_files.
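For illustration, a minimal Libtorch sketch of saving and restoring both pieces (not the actual OpenSpiel checkpoint code; the model and optimizer types here are just examples):

```cpp
// Illustrative only: persist both the model parameters and the optimizer state
// so that resumed training does not lose the optimizer's accumulated statistics.
#include <torch/torch.h>
#include <string>

void SaveCheckpoint(torch::nn::Sequential& model, torch::optim::Adam& optimizer,
                    const std::string& prefix) {
  torch::save(model, prefix + "-model.pt");
  torch::save(optimizer, prefix + "-optimizer.pt");
}

void LoadCheckpoint(torch::nn::Sequential& model, torch::optim::Adam& optimizer,
                    const std::string& prefix) {
  torch::load(model, prefix + "-model.pt");
  torch::load(optimizer, prefix + "-optimizer.pt");
}
```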

And, as with resuming training, there isn't a way to set up a 1v1 game against other players, but this is definitely another good thing to add. The mcts.py example shows how to set up a TensorFlow AlphaZero checkpoint as a Python AlphaZero bot; I think this can be followed on the C++ side. I will try to look into this too.

@selfsim
Contributor

selfsim commented Apr 7, 2021

Hi @christianjans, thanks for the insights. I will look into setting up a 1v1 game and extracting a bot from the checkpoints, using mcts.py as a guide. I will get back to you if I make any breakthroughs. I am quite busy currently but am hoping to be able to throw some more hours into this later in the summer.

@selfsim
Contributor

selfsim commented Apr 8, 2021

@christianjans could you also maybe comment on the output of the analysis script? In particular, what do the multiple lines in the MCTS solver graphs mean? Is it the number of simulations? Also, for the outcome graph, I haven't been able to figure out exactly who player 1 and player 2 are. Thank you!

@selfsim
Contributor

selfsim commented Apr 9, 2021

@michalsustr I noticed in a different PR you mentioned that you have been working only with the LibTorch AZ implementation as opposed to TF. Have you played against a bot that you have trained?

@christianjans
Contributor Author

christianjans commented Apr 9, 2021

Hi @selfsim,

I am quite busy currently but am hoping to be able to throw some more hours into this later on in the summer.

No worries, I have been looking into the C++ side of things and have a basic implementation in the works where you can play your Libtorch AlphaZero against a random player or an MCTS player. See this branch on my personal OpenSpiel repository. After installing and building (with Libtorch enabled, of course), you can run it using the command:

$ ./build/examples/alpha_zero_torch_game_example --game=<game_to_play> --player1=random --player2=az_torch --az_path=<your_libtorch_az_experiment_directory>

I will try to continue updating this branch, but will also look into restarting training from checkpoints.

@christianjans could you also maybe comment on the output of the analysis script? In particular, what do the multiple lines in the MCTS solver graphs mean? Is it the number of simulations?

Yes, definitely. The MCTS solver graph is certainly confusing, but I can try to explain it here. During training, your AlphaZero player is constantly being evaluated by playing against MCTS players of varying "levels". The level of the opponent MCTS player is determined by the number of simulations it is allowed to run for each move it makes. A higher-level MCTS opponent runs more simulations than a lower-level MCTS opponent. Notice that there are 7 levels in:

[Screenshot: the evaluation graphs from the Tic-Tac-Toe run above, showing 7 MCTS levels]

The number of simulations the MCTS opponent at level n is allowed to run for each move is c * 10^(n / 2) where c is the number of simulations the AZ player is allowed to do per move. So for example, say you run an AlphaZero experiment and give the argument --max_simulations=1000 and --eval_levels=7. Then, during evaluation, your AlphaZero player will play one MCTS player that performs 1000 * 10^(0 / 2) = 1000 simulations per move (level 0), another that performs 1000 * 10^(1 / 2) ~ 3162 simulations per move (level 1), and so on, with the highest level opponent (level 6) performing 1000 * 10^(6 / 2) = 1000000 simulations per move. All the while, your AlphaZero player is only performing 1000 simulations per move, no matter which level it is playing against.
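To make that mapping concrete, here is a small illustrative snippet (not the actual evaluator code) that prints the per-level simulation budgets for this example:

```cpp
// Illustrative only: simulations allowed per evaluation level, following the
// c * 10^(n / 2) formula described above.
#include <cmath>
#include <cstdio>

int main() {
  const int max_simulations = 1000;  // the AZ player's own per-move budget (c)
  const int eval_levels = 7;
  for (int n = 0; n < eval_levels; ++n) {
    const long opponent_sims =
        static_cast<long>(max_simulations * std::pow(10.0, n / 2.0));
    std::printf("level %d: %ld simulations per move\n", n, opponent_sims);
  }
  // Prints roughly 1000, 3162, 10000, 31622, 100000, 316227, 1000000.
  return 0;
}
```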

So what we should expect to see is that we are able to beat lower-level MCTS opponents faster (earlier on in training) than higher-level MCTS opponents. This is what we see in the above graph example.

Also, for the outcome graph, I haven't been able to figure out exactly who player 1 and player 2 are.

The outcome graph shows the results of your AlphaZero player playing against itself in self-play. So player 1 and player 2 are both your AlphaZero algorithm, just from different perspectives.

Feel free to let me know if anything is still confusing or if there are any other questions.

@selfsim
Contributor

selfsim commented Apr 9, 2021

@christianjans Thanks a bunch for the clarification and patience with this barrage of questions.

I looked at your branch and it looks like a great start. Are you working on a human-vs-bot option currently? If not, I could see what I can do. I'd like to test out some of my training runs.

Really appreciate the work!

@christianjans
Contributor Author

christianjans commented Apr 10, 2021

No worries, @selfsim, happy to help. And yeah that would be a great addition! I was not planning to add a human bot, but I think I could dedicate some time to it too. If you get around to it, feel free to submit a pull request if you like!

@christianjans
Contributor Author

There is now an initial implementation of a human bot on the branch.

@lanctot
Collaborator

lanctot commented Apr 10, 2021

There is now an initial implementation of a human bot on the branch.

@christianjans, this is awesome! Can you submit a PR so we can add it to the master branch? Seems like a wonderful thing to have for people using your code.

BTW, for the previous discussion about checkpoints: I think for AlphaZero to properly restore its full state, you would need to store/retain all of the data in the current replay buffer in addition to everything else.

@lanctot
Collaborator

lanctot commented Apr 10, 2021

And of course I would absolutely love to see any checkpointing code merged into master too if you manage to get it to work! :)

@selfsim if you see any opportunities to improve the docs regarding the formats of the visualization tools, please flag them and/or submit PRs. This code has really been a wonderful addition to OpenSpiel, and it'd be great if we can maintain it well!

@christianjans
Contributor Author

There is now an initial implementation of a human bot on the branch.

@christianjans, this is awesome! Can you submit a PR so we can add it to the master branch? Seems like a wonderful thing to have for people using your code.

@lanctot, yes for sure! There's still some cleaning up to do, but I should be able to submit a pull request soon. Were you thinking just the C++ human bot implementation? Or also the ability to play a trained Libtorch AlphaZero in a game?

BTW for the previous discussion about checkpoints: I think for AlphaZero to properly restore its full state you would need to store/retain all of the data in the current buffer in addition to everything else.

And oh right, good call 👍🏻. Will keep this in mind as well.

@lanctot
Collaborator

lanctot commented Apr 11, 2021

Oh, I didn't realize you had both; I was only referring to the ability to play a trained Libtorch agent in a game, but both would be great. No rush, of course!

@christianjans christianjans deleted the alpha_zero_torch branch May 19, 2021 03:11