Missing /out/MuZeroOut/board\r_temp.pth.tar after backpropagation #4

Closed
sherpal opened this issue Feb 25, 2021 · 2 comments

sherpal commented Feb 25, 2021

Hello,

I'm trying to launch the training for "hex" on my machine. The command I'm using is:

python Main.py train -c Configurations/ModelConfigs/MuzeroBoard.json --game hex --gpu 1

I haven't touched anything in the configuration, so the settings are the ones from master.

The 50 self-play iterations run successfully, as do the 100 backpropagation iterations. However, once they finish, I get the following error:

Traceback (most recent call last):
  File "Main.py", line 202, in <module>
    switch[content.algorithm](game, content, run_name)
  File "Main.py", line 86, in learnM0
    c.learn()
  File "C:\Users\antoi\projects\muzero\Coach.py", line 175, in learn
    self.opponent_net.load_checkpoint(folder=self.args.checkpoint, filename='temp.pth.tar')
  File "C:\Users\antoi\projects\muzero\MuZero\MuNeuralNet.py", line 243, in load_checkpoint
    raise FileNotFoundError(f"No MuZero Representation Model in path {representation_path}")
FileNotFoundError: No MuZero Representation Model in path ./out/MuZeroOut/board\r_temp.pth.tar

Indeed, the files I have in that folder are the following:

25/02/2021  08:11             1.231 boardgames_Hex_hex_20210225-081121.json
25/02/2021  08:39                97 checkpoint
25/02/2021  08:39         1.834.891 checkpoint_0.pth.tar.examples
25/02/2021  08:39         2.437.259 decoder_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               930 decoder_temp.pth.tar.index
25/02/2021  08:39         2.378.703 d_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39             1.046 d_temp.pth.tar.index
25/02/2021  08:39         1.379.677 p_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               715 p_temp.pth.tar.index
25/02/2021  08:39         2.437.116 r_temp.pth.tar.data-00000-of-00001
25/02/2021  08:39               878 r_temp.pth.tar.index

Here are the versions of the libs I use:

  • python: 3.8.5
  • tensorflow: 2.4.1
  • keras: 2.4.3

I'm running on Windows 10 with CUDA 11 and, if it matters, a GTX 1070 GPU.

joeryjoery (Collaborator) commented

Hi @sherpal, I've also encountered this issue multiple times. The cause is that each model checkpoint is saved across multiple files, as indicated by the .data-XXX... suffix, so no single file exists at the path the loader checks.

This is new behaviour in tensorflow > 2.1, which we did not test. A quick fix would be to downgrade to:

  • Python 3.7.9
  • tensorflow 2.1.0
  • keras 2.3.1
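
For anyone who would rather not downgrade, here is a minimal sketch of an existence check that accepts both layouts: a single file at the given path, or the `.index` plus `.data-*` pair visible in the folder listing above. The helper name `checkpoint_exists` is hypothetical and is not part of this repository; how `load_checkpoint` in MuNeuralNet.py actually tests for the file is an assumption based only on the traceback.

```python
import glob
import os


def checkpoint_exists(prefix: str) -> bool:
    """Return True if a checkpoint is present under `prefix` in either layout.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Single-file layout: the prefix itself is a regular file.
    if os.path.isfile(prefix):
        return True
    # Multi-file layout: `<prefix>.index` plus one or more
    # `<prefix>.data-...` shards, as in the folder listing above.
    has_index = os.path.isfile(prefix + ".index")
    has_shards = bool(glob.glob(glob.escape(prefix) + ".data-*"))
    return has_index and has_shards
```

With the files listed in the report, `checkpoint_exists("./out/MuZeroOut/board/r_temp.pth.tar")` would be expected to succeed where a plain `os.path.isfile` check on the same path fails. Whether the subsequent weight loading then works under tensorflow 2.4.1 is untested here.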

sherpal commented Feb 26, 2021

Indeed, downgrading (almost) worked.

There was still a catch: the h5py package made a breaking change in its 3.x release, so I hit the following issue: keras-team/keras#14265
But

pip uninstall h5py
pip install h5py==2.10

fixed it.
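
To summarise the working combination from this thread, here is a small sketch that compares a set of installed versions against the pins mentioned above (tensorflow 2.1.0, keras 2.3.1, h5py 2.10). The names `PINNED`, `normalize` and `find_mismatches` are hypothetical, and the parser assumes purely numeric dotted versions; it is an illustration, not a replacement for pip's own resolver.

```python
# Pins taken from this thread; the dictionary itself is illustrative.
PINNED = {"tensorflow": "2.1.0", "keras": "2.3.1", "h5py": "2.10"}


def normalize(version: str) -> tuple:
    """Turn '2.10.0' into (2, 10) so that '2.10' and '2.10.0' compare equal."""
    parts = [int(p) for p in version.split(".")]
    # Drop trailing zeros so the comparison ignores zero padding.
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def find_mismatches(installed: dict, pinned: dict = PINNED) -> dict:
    """Map each package that is missing or off-pin to (installed, wanted)."""
    mismatches = {}
    for name, wanted in pinned.items():
        have = installed.get(name)
        if have is None or normalize(have) != normalize(wanted):
            mismatches[name] = (have, wanted)
    return mismatches
```

For example, passing the versions from the original report (`tensorflow` 2.4.1, `keras` 2.4.3) flags both packages, and an environment with no `h5py` entry is reported as `(None, "2.10")`.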

sherpal closed this as completed Feb 26, 2021