Cannot resume Alphazero training with torchlib #1136

Closed
robinpdev opened this issue Nov 9, 2023 · 6 comments

@robinpdev
Contributor

robinpdev commented Nov 9, 2023

I'm trying to resume training an AlphaZero model as described here

But I receive this error message:
$ ./build/examples/alpha_zero_torch_example $VSC_DATA/shared/robin/os_out/rthex_5x5_2/config.json
Logging directory: /data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2
Using existing model: /data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2/vpnet.pb
Playing game: rthex
Spiel Fatal Error: /data/gent/465/vscxxxxx/shared/robin/open_spiel/open_spiel/utils/json.cc:67 error == str.substr(0, std::min(30, static_cast<int>(str.size())))
error = Empty string, str.substr(0, std::min(30, static_cast<int>(str.size()))) =

These are the files in the model path:

checkpoint-0-optimizer.pt  learner.jsonl    log-actor-2.txt      log-evaluator-1.txt
checkpoint-0.pt            log-actor-0.txt  log-actor-3.txt      log-learner.txt
config.json                log-actor-1.txt  log-evaluator-0.txt  vpnet.pb

And this is the config.json file:

{
  "actors": 4,
  "checkpoint_freq": 1,
  "cutoff_probability": 0.800000,
  "cutoff_value": 0.950000,
  "devices": "cuda:0,cpu:0",
  "eval_levels": 7,
  "evaluation_window": 100,
  "evaluators": 2,
  "explicit_learning": false,
  "game": "rthex",
  "graph_def": "vpnet.pb",
  "inference_batch_size": 6,
  "inference_cache": 262144,
  "inference_threads": 3,
  "learning_rate": 0.000100,
  "max_simulations": 300,
  "max_steps": 0,
  "nn_depth": 10,
  "nn_model": "resnet",
  "nn_width": 128,
  "path": "/data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2",
  "policy_alpha": 1.000000,
  "policy_epsilon": 0.250000,
  "replay_buffer_reuse": 3,
  "replay_buffer_size": 65536,
  "temperature": 1.000000,
  "temperature_drop": 10.000000,
  "train_batch_size": 1024,
  "uct_c": 2.000000,
  "weight_decay": 0.000100
}

Any idea what might cause this or how I could resolve it?

@lanctot
Collaborator

lanctot commented Nov 15, 2023

Hi @robinpdev,

Apologies for the lateness on this.

This looks like an error when parsing the JSON; maybe the file is missing a specific entry or has a syntax error.

Have you tried reading that JSON file separately using utils/json.{h,cc}?

Did you manually create that .json file or was it printed out from AlphaZero training? Have you modified it?

@lanctot
Collaborator

lanctot commented Nov 15, 2023

That error is coming from here:

std::min(30, static_cast<int>(str.size()))));

But that's a general parse-error function that is called from multiple places in the file. It would be good to have the full stack trace, or at least the token that's causing the issue.

Either way, it would be good to reproduce this in a simpler setting. Can you reproduce the problem in a much simpler main program that only tries to read that .json file, to see whether the reader itself is stumbling?

(To avoid unnecessary dependencies, we built our own simple JSON parser, but we may not have covered a case that your specific .json file requires. In that case it would be a quick fix once we isolate the problem.)
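
As a first step toward such a reduced reproduction, here is a minimal sketch that uses only the C++ standard library. It does not call the OpenSpiel JSON parser; it only checks whether a file that would be handed to a parser can be opened and is non-empty, which seems worth ruling out because the log above reports `error = Empty string`. The names `FileProbe`, `ProbeFile`, and `Describe` are hypothetical helpers made up for this illustration, not part of OpenSpiel.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Result of probing one file before handing it to a JSON parser.
// (Hypothetical helper type for illustration; not an OpenSpiel API.)
struct FileProbe {
  bool readable = false;  // true if the file could be opened
  std::size_t size = 0;   // number of bytes read
  std::string preview;    // first few bytes of the content
};

// Reads the whole file at `path` and records its size and a short preview.
FileProbe ProbeFile(const std::string& path, std::size_t preview_len = 30) {
  FileProbe probe;
  std::ifstream in(path, std::ios::binary);
  if (!in) return probe;  // missing file or insufficient permissions

  std::ostringstream buffer;
  buffer << in.rdbuf();  // copy the entire file into memory
  const std::string contents = buffer.str();

  probe.readable = true;
  probe.size = contents.size();
  probe.preview = contents.substr(0, preview_len);
  return probe;
}

// Turns a probe into a one-line, human-readable summary.
std::string Describe(const std::string& path) {
  const FileProbe probe = ProbeFile(path);
  if (!probe.readable) return path + ": cannot be opened";
  if (probe.size == 0) return path + ": empty (0 bytes)";
  return path + ": " + std::to_string(probe.size) +
         " bytes, starts with \"" + probe.preview + "\"";
}
```

A small `main` could print `Describe(...)` for config.json and for any other file in the model directory that is suspected of being read as JSON. If one of them turns out to be empty or unreadable, the problem is probably in how the file was written or located rather than in the parser; if all of them look fine, the next step would be to pass the same contents to the reader in utils/json.{h,cc}.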

@robinpdev
Contributor Author

The JSON file was generated by the AlphaZero training implementation and was not edited.

However, I got it to work for my current configuration, so this is no longer an immediate problem. I will look into it further if I encounter this problem again. Thanks for the response.

@CasparQuast

Hey robin. Did you manage to get it compiled for GPU? I'm really struggling to build the project for GPU.

@robinpdev
Contributor Author

This seems to be fixed.

@christianjans
Contributor

Hi, sorry for not responding! I have been away from GitHub for a while and didn't realize there are more people using Libtorch AZ. That's great to hear!

If you have the chance, would you be able to post your solution to this problem? Or maybe even the configuration that caused errors and the configuration you used that resolved the problem?
