Missing `/out/MuZeroOut/board\r_temp.pth.tar` after backpropagation #4

sherpal · 2021-02-25T07:48:59Z

Hello,

I'm trying to launch the training for "hex" on my machine. The command I'm using is

python Main.py train -c Configurations/ModelConfigs/MuzeroBoard.json --game hex --gpu 1

I haven't touched anything in the configuration, so the settings are the ones from master.

The 50 self-play iterations run successfully, and so do the 100 backpropagation iterations. However, once they finish, I get the following error:

Traceback (most recent call last):
  File "Main.py", line 202, in <module>
    switch[content.algorithm](game, content, run_name)
  File "Main.py", line 86, in learnM0
    c.learn()
  File "C:\Users\antoi\projects\muzero\Coach.py", line 175, in learn
    self.opponent_net.load_checkpoint(folder=self.args.checkpoint, filename='temp.pth.tar')
  File "C:\Users\antoi\projects\muzero\MuZero\MuNeuralNet.py", line 243, in load_checkpoint
    raise FileNotFoundError(f"No MuZero Representation Model in path {representation_path}")
FileNotFoundError: No MuZero Representation Model in path ./out/MuZeroOut/board\r_temp.pth.tar

Indeed, the files I have in that folder are the following:

25/02/2021  08:11             1.231 boardgames_Hex_hex_20210225-081121.json
25/02/2021  08:39                97 checkpoint
25/02/2021  08:39         1.834.891 checkpoint_0.pth.tar.examples
25/02/2021  08:39         2.437.259 decoder_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               930 decoder_temp.pth.tar.index
25/02/2021  08:39         2.378.703 d_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39             1.046 d_temp.pth.tar.index
25/02/2021  08:39         1.379.677 p_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               715 p_temp.pth.tar.index
25/02/2021  08:39         2.437.116 r_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               878 r_temp.pth.tar.index

Here are the versions of the libs I use:

python: 3.8.5
tensorflow: 2.4.1
keras: 2.4.3

I'm running on Windows 10 with CUDA 11 and, if it matters, a GTX1070 as GPU.


joeryjoery · 2021-02-25T08:49:08Z

Hi @sherpal, I've also encountered this issue multiple times. The cause is that each model checkpoint is saved as multiple files, as indicated by the .data-XXX... suffixes, so the single r_temp.pth.tar file that the loader looks for does not exist.
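
To illustrate the mismatch, here is a minimal sketch of an existence check that accepts both layouts. It is not the code in MuNeuralNet.py; `checkpoint_exists` is a hypothetical helper, and it assumes the multi-file layout always writes a `<prefix>.index` file next to the `.data-*` shards, as in the directory listing above.

```python
import os
import tempfile


def checkpoint_exists(prefix: str) -> bool:
    """Return True if a checkpoint is present at `prefix` in either layout.

    Hypothetical helper, not part of the repository.
    """
    # Single-file layout: the prefix itself is a file on disk.
    if os.path.isfile(prefix):
        return True
    # Multi-file layout: `<prefix>.index` plus `<prefix>.data-*` shards.
    return os.path.isfile(prefix + ".index")


# Small demo in a throwaway directory that mimics the listing in this issue.
with tempfile.TemporaryDirectory() as folder:
    prefix = os.path.join(folder, "r_temp.pth.tar")
    print(checkpoint_exists(prefix))  # False: nothing has been written yet
    open(prefix + ".index", "w").close()
    print(checkpoint_exists(prefix))  # True: the .index file is detected
```

A plain check on the bare prefix fails for the second case, which matches the FileNotFoundError in the traceback.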

This is new behaviour in tensorflow versions after 2.1, which we did not test. A quick fix is to downgrade to:

Python 3.7.9
tensorflow 2.1.0
keras 2.3.1
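
If it helps, a small guard like the one below could warn when a newer tensorflow is installed. This is only a sketch: `parse_version` and `is_newer_than_tested` are hypothetical names, and the comparison looks at major.minor only, on the assumption that 2.1 is the last tested release.

```python
def parse_version(text):
    """Turn a string such as '2.4.1' into (2, 4, 1).

    Parsing stops at the first non-numeric component.
    """
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_newer_than_tested(installed, tested="2.1.0"):
    """Return True if `installed` has a higher major.minor than `tested`."""
    return parse_version(installed)[:2] > parse_version(tested)[:2]


# In practice the installed string would come from `tensorflow.__version__`;
# literal strings are used here so the sketch runs without tensorflow.
print(is_newer_than_tested("2.4.1"))  # True
print(is_newer_than_tested("2.1.0"))  # False
```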

sherpal · 2021-02-26T08:55:43Z

Indeed, downgrading (almost) worked.

There was still one catch: the h5py package made a breaking change in its 3.x release, so I hit the following issue: keras-team/keras#14265
However, running

pip uninstall h5py
pip install h5py==2.10

fixed it.

sherpal closed this as completed Feb 26, 2021
