Cannot resume Alphazero training with torchlib #1136
Comments
Hi @robinpdev, apologies for the lateness on this. This seems like an error when parsing the JSON; maybe it's missing a specific entry or has a syntax error. Have you tried reading that JSON file separately? Did you manually create that .json file, or was it printed out from AlphaZero training? Have you modified it?
That error is coming from here: open_spiel/open_spiel/utils/json.cc Line 67 in 7cbb52e
But that's a general parse error function that is called from multiple places in the file. It would be good to have the full stack trace, or at least the token that's causing the issue. Either way, it would be good to reproduce this in a simpler setting. Can you reproduce the problem in a much simpler main program that only tries to read that .json file, to see if it's the reader itself stumbling? (To avoid unnecessary dependencies, we built our own simple JSON parser, but we may not have covered a case that is required by your specific .json file. In that case it'd be a quick fix once we isolate the problem.)
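As a first step before writing that reproduction against OpenSpiel's own reader, the JSON files in the model directory can be checked independently. The sketch below uses Python's standard `json` module, not the OpenSpiel parser, so it only tells you whether the files are empty or malformed as plain JSON. The function name `check_json_files` is made up for illustration, and checking `learner.jsonl` alongside `config.json` is an assumption: the thread does not establish which file the "Empty string" error refers to.

```python
import json
from pathlib import Path


def check_json_files(model_dir):
    """Return (file name, problem) pairs for the JSON inputs in model_dir.

    Hypothetical helper: it checks config.json and learner.jsonl only,
    on the assumption that one of them is what the resume step reads.
    """
    problems = []
    model_dir = Path(model_dir)

    # config.json is expected to hold a single JSON object.
    config_path = model_dir / "config.json"
    text = config_path.read_text() if config_path.exists() else ""
    if not text.strip():
        problems.append(("config.json", "missing or empty"))
    else:
        try:
            json.loads(text)
        except json.JSONDecodeError as err:
            problems.append(("config.json", f"parse error: {err}"))

    # learner.jsonl is expected to hold one JSON object per line.
    jsonl_path = model_dir / "learner.jsonl"
    lines = jsonl_path.read_text().splitlines() if jsonl_path.exists() else []
    if not lines:
        problems.append(("learner.jsonl", "missing or empty"))
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(("learner.jsonl", f"line {number} is empty"))
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(("learner.jsonl", f"line {number}: {err}"))
    return problems
```

Calling `check_json_files` on the directory given as `"path"` in config.json returns an empty list when both files parse. A clean result here would point back at the OpenSpiel reader, which is the case the simpler main program is meant to isolate.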
The JSON file was generated by the AlphaZero training implementation and was not edited. However, I got it to work for my current configuration, so this is not an immediate problem anymore. I will look into it more if I encounter this problem again. Thanks for the response.
Hey robin. Did you manage to get it compiled for GPU? I'm really struggling to build the project for GPU.
This seems to be fixed.
Hi, sorry for not responding! I have been away from GitHub for a while and didn't realize there are more people using Libtorch AZ. That's great to hear! If you have the chance, would you be able to post your solution to this problem? Or maybe even the configuration that caused errors and the configuration you used that resolved the problem?
I'm trying to resume training an AlphaZero model as described here
But I receive this error message:
$ ./build/examples/alpha_zero_torch_example $VSC_DATA/shared/robin/os_out/rthex_5x5_2/config.json
Logging directory: /data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2
Using existing model: /data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2/vpnet.pb
Playing game: rthex
Spiel Fatal Error: /data/gent/465/vscxxxxx/shared/robin/open_spiel/open_spiel/utils/json.cc:67 error == str.substr(0, std::min(30, static_cast<int>(str.size())))
error = Empty string, str.substr(0, std::min(30, static_cast<int>(str.size()))) =
These are the files in the model path:
checkpoint-0-optimizer.pt
checkpoint-0.pt
config.json
learner.jsonl
log-actor-0.txt
log-actor-1.txt
log-actor-2.txt
log-actor-3.txt
log-evaluator-0.txt
log-evaluator-1.txt
log-learner.txt
vpnet.pb
And this is the config.json file:
{
  "actors": 4,
  "checkpoint_freq": 1,
  "cutoff_probability": 0.800000,
  "cutoff_value": 0.950000,
  "devices": "cuda:0,cpu:0",
  "eval_levels": 7,
  "evaluation_window": 100,
  "evaluators": 2,
  "explicit_learning": false,
  "game": "rthex",
  "graph_def": "vpnet.pb",
  "inference_batch_size": 6,
  "inference_cache": 262144,
  "inference_threads": 3,
  "learning_rate": 0.000100,
  "max_simulations": 300,
  "max_steps": 0,
  "nn_depth": 10,
  "nn_model": "resnet",
  "nn_width": 128,
  "path": "/data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2",
  "policy_alpha": 1.000000,
  "policy_epsilon": 0.250000,
  "replay_buffer_reuse": 3,
  "replay_buffer_size": 65536,
  "temperature": 1.000000,
  "temperature_drop": 10.000000,
  "train_batch_size": 1024,
  "uct_c": 2.000000,
  "weight_decay": 0.000100
}
Any idea what might cause this or how I could resolve it?