Cannot resume Alphazero training with torchlib #1136

robinpdev · 2023-11-09T15:37:55Z

I'm trying to resume training an AlphaZero model as described here

But I receive this error message:
$ ./build/examples/alpha_zero_torch_example $VSC_DATA/shared/robin/os_out/rthex_5x5_2/config.json
Logging directory: /data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2
Using existing model: /data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2/vpnet.pb
Playing game: rthex
Spiel Fatal Error: /data/gent/465/vscxxxxx/shared/robin/open_spiel/open_spiel/utils/json.cc:67 error == str.substr(0, std::min(30, static_cast<int>(str.size())))
error = Empty string, str.substr(0, std::min(30, static_cast<int>(str.size()))) =

These are the files in the model path:

checkpoint-0-optimizer.pt  learner.jsonl    log-actor-2.txt      log-evaluator-1.txt
checkpoint-0.pt            log-actor-0.txt  log-actor-3.txt      log-learner.txt
config.json                log-actor-1.txt  log-evaluator-0.txt  vpnet.pb

And this is the config.json file

{
  "actors": 4,
  "checkpoint_freq": 1,
  "cutoff_probability": 0.800000,
  "cutoff_value": 0.950000,
  "devices": "cuda:0,cpu:0",
  "eval_levels": 7,
  "evaluation_window": 100,
  "evaluators": 2,
  "explicit_learning": false,
  "game": "rthex",
  "graph_def": "vpnet.pb",
  "inference_batch_size": 6,
  "inference_cache": 262144,
  "inference_threads": 3,
  "learning_rate": 0.000100,
  "max_simulations": 300,
  "max_steps": 0,
  "nn_depth": 10,
  "nn_model": "resnet",
  "nn_width": 128,
  "path": "/data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2",
  "policy_alpha": 1.000000,
  "policy_epsilon": 0.250000,
  "replay_buffer_reuse": 3,
  "replay_buffer_size": 65536,
  "temperature": 1.000000,
  "temperature_drop": 10.000000,
  "train_batch_size": 1024,
  "uct_c": 2.000000,
  "weight_decay": 0.000100
}

Any idea what might cause this or how I could resolve it?

lanctot · 2023-11-15T15:48:45Z

Hi @robinpdev ,

Apologies for the lateness on this.

This looks like an error while parsing the JSON; maybe the file is missing a specific entry or has a syntax error.

Have you tried reading that json file separately using utils/json.{h,cc} ?

Did you manually create that .json file or was it printed out from AlphaZero training? Have you modified it?

lanctot · 2023-11-15T15:52:03Z

That error is coming from here:

open_spiel/open_spiel/utils/json.cc

Line 67 in 7cbb52e

std::min(30, static_cast<int>(str.size()))));

But that's a general parse error function that is called from multiple places in the file. It would be good to have the full stack trace or at least the token that's causing the issue.

Either way, would be good to reproduce in a simpler setting. Can you reproduce the problem in a much simpler main program that only tries to read that .json file to see if it's the reader itself stumbling?

(To avoid unnecessary dependencies, we built our own simple JSON parser, but we may not have covered a case that is required by your specific .json file. In that case it'd be a quick fix once we isolate the problem.)
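As a first step before writing a C++ reproducer, a quick sanity check of the files themselves might help narrow this down. The sketch below is only an illustration: it uses Python's standard `json` module, not OpenSpiel's parser, so it can rule out an empty or malformed file but cannot reveal a gap in utils/json.cc. The helper names `check_json_file` and `check_directory` are hypothetical, and checking learner.jsonl alongside config.json is an assumption; the thread does not establish which file the resume path actually reads. Empty files and empty lines are flagged explicitly because the log reports "Empty string" with nothing after the final "=".

```python
import json
from pathlib import Path


def check_json_file(path):
    """Return a list of (line_number, problem) tuples for one file.

    Files ending in .jsonl are checked line by line; anything else is
    parsed as a single JSON document. A line_number of 0 means the
    problem applies to the whole file.
    """
    text = Path(path).read_text(encoding="utf-8")
    if not text.strip():
        return [(0, "file is empty")]
    problems = []
    if str(path).endswith(".jsonl"):
        for number, line in enumerate(text.splitlines(), start=1):
            if not line.strip():
                problems.append((number, "empty line"))
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                # Show the first 30 characters, like the OpenSpiel message does.
                problems.append((number, f"{exc.msg}: {line[:30]!r}"))
    else:
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            snippet = text[exc.pos:exc.pos + 30]
            problems.append((exc.lineno, f"{exc.msg}: {snippet!r}"))
    return problems


def check_directory(directory):
    """Check every .json and .jsonl file directly inside directory."""
    report = {}
    for path in sorted(Path(directory).iterdir()):
        if path.suffix in (".json", ".jsonl"):
            report[path.name] = check_json_file(path)
    return report
```

Calling `check_directory` on the model path returns a dict keyed by file name; an empty list means Python parsed the file without complaint. If every file comes back clean, the simpler C++ main program that only reads the .json file is still the right next step.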

robinpdev · 2023-11-15T17:47:10Z

The JSON file was generated by the AlphaZero training implementation and was not edited.

However, I got it to work for my current configuration, so this is no longer an immediate problem. I will look into it more if I encounter this problem again. Thanks for the response.

CasparQuast · 2023-11-19T20:21:03Z

Hey robin. Did you manage to get it compiled for GPU? I'm really struggling to build the project for GPU.

robinpdev · 2023-12-09T14:49:07Z

This seems to be fixed.

christianjans · 2023-12-13T05:36:10Z

Hi, sorry for not responding! I have been away from GitHub for a while and didn't realize there are more people using Libtorch AZ. That's great to hear!

If you have the chance, would you be able to post your solution to this problem? Or maybe even the configuration that caused errors and the configuration you used that resolved the problem?

robinpdev mentioned this issue Nov 15, 2023

Resume training from most-recent checkpoint in Libtorch AlphaZero #581

Merged

2 tasks

robinpdev closed this as completed Dec 9, 2023
