
Should I run setup again? #10

Closed
wn4github opened this issue Mar 2, 2021 · 9 comments

Comments

@wn4github

I have git cloned the repository and run ./setup.py install and ./setup-dataset.sh, but then I realized train_all.sh was not present. Later I found it in the 4.1.3 release. Do I need to set up once again in the 4.1.3 directory or just copy train_all.sh, train_others.sh and other script files? Thank you.

@guicho271828
Owner

I don't think regenerating the dataset is necessary.
Regarding setup.py, run git diff HEAD..refs/tags/4.1.3; if there is no diff, you should be good to go.

@wn4github
Author

Thank you for the suggestion. In the end, however, I couldn't get the code to run due to incompatibility issues between nvidia-tensorflow 1.15 and Keras. I have to use nvidia-tensorflow because the RTX 30 series does not support CUDA 10, while prebuilt TensorFlow 1.x does not support CUDA 11.

I'm wondering if you are currently porting the code to PyTorch?

@guicho271828
Owner

Yes, that aspect is also something I am struggling with. My lab cluster is also transitioning to CUDA 11, so I am attempting to rewrite part of the code, but development is slow.

@guicho271828
Owner

You could try building tf 1.15 for CUDA 11.

@guicho271828
Owner

I just learned that NVIDIA (not Google) provides a backward-compatible build of TensorFlow 1.15 that works on CUDA 11:
tensorflow/tensorflow#43629
https://developer.nvidia.com/blog/accelerating-tensorflow-on-a100-gpus/
Its package name appears to be nvidia-tensorflow.

@wn4github
Author

Thank you so much for your help, Asai-san.

I too discovered that nvidia-tensorflow works with CUDA 11, and I managed to get a working tf 1.15 with:

  • nvidia-tensorflow 1.15.4+nv20.12
  • Keras 2.2.5
  • keras-adabound 0.6.0
  • Keras-Applications 1.0.8
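For reference, an environment like the one above could be recreated with a pip sketch along these lines (untested here; this assumes the nvidia-pyindex package, which registers NVIDIA's package index and must be installed before nvidia-tensorflow):

```shell
# Sketch: pin the versions listed above (nvidia-pyindex first, so that pip
# can find the nvidia-tensorflow build on NVIDIA's index).
pip install nvidia-pyindex
pip install nvidia-tensorflow==1.15.4+nv20.12
pip install Keras==2.2.5 keras-adabound==0.6.0 Keras-Applications==1.0.8
```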

Now I am using the 4.1.3 release source, and running ./setup-dataset.sh succeeds. However, I noticed a difference in setup-dataset.sh between 4.1.3 and the latest commit: in 4.1.3, no .npz files are downloaded.

Anyway, the problem I now run into is an error when executing ./train_all.sh. I uncommented only the first training line, task-planning learn_plot_dump_summary. The trace is as follows:

Fancy Traceback (most recent call last):
  File ./strips.py line 294 function <module> : main()
                    mode = 'learn_plot_dump_summary'
                sae_path = 'puzzle_mnist_3_3_5000_None_None_None_False_ConcreteDetNormalizedLogitAddEffectTransitionAE_planning'
      default_parameters = {'epoch': 200, 'batch_size': 500, 'optimizer': 'radam', 'max_temperature': 5.0, 'min_temperature': 0.7, 'M': 2, 'train_gumbel': True, 'train_softmax': True, 'test_gumbel': False, 'test_softmax': False, 'locality': 0.0, 'locality_delay': 0.0, 'aeclass': 'ConcreteDetNormalizedLogitAddEffectTransitionAE'}
              parameters = {'beta': [-0.3, -0.1, 0.0, 0.1, 0.3], 'lr': [0.1, 0.01, 0.001], 'N': [100, 200, 500, 1000], 'M': [2], 'layer': [1000], 'clayer': [16], 'dropout': [0.4], 'noise': [0.4], 'dropout_z': [False], 'activation': ['relu'], 'num_actions': [100, 200, 400, 800, 1600], 'aae_width': [100, 300, 600], 'aae_depth': [0, 1, 2], 'aae_activation': ['relu', 'tanh'], 'aae_delay': [0], 'direct': [0.1, 1.0, 10.0], 'direct_delay': [0.05, 0.1, 0.2, 0.3, 0.5], 'zerosuppress': [0.1, 0.2, 0.5], 'zerosuppress_delay': [0.05, 0.1, 0.2, 0.3, 0.5], 'loss': ['BCE'], 'type': ['mnist'], 'width': [3], 'height': [3], 'num_examples': [5000], 'stop_gradient': [False], 'aeclass': ['ConcreteDetNormalizedLogitAddEffectTransitionAE'], 'comment': ['planning']}

  File ./strips.py line 290 function main : globals()[task](*map(myeval,sys.argv))
                    task = 'puzzle'

  File ./strips.py line 208 function puzzle : show_summary(ae, train, test)
                    type = 'mnist'
                   width = 3
                  height = 3
            num_examples = 5000
                       N = None
             num_actions = None
                  direct = None
           stop_gradient = False
                 aeclass = 'ConcreteDetNormalizedLogitAddEffectTransitionAE'
                 comment = 'planning'
                    name = 'comment'
                   value = 'planning'
                    path = '/home/wn/workspace/latplan/latplan-4.1.3_original/latplan/puzzles/puzzle-mnist-3-3.npz'
                    data = <numpy.ndarray float32  (5000, 2, 42, 42)>
             pre_configs = <numpy.ndarray float64  (5000, 9)>
             suc_configs = <numpy.ndarray float64  (5000, 9)>
                    pres = <numpy.ndarray float32  (5000, 42, 42)>
                    sucs = <numpy.ndarray float32  (5000, 42, 42)>
             transitions = <numpy.ndarray float32  (2, 5000, 42, 42)>
                  states = <numpy.ndarray float32  (10000, 42, 42)>
                   train = <numpy.ndarray float32  (4500, 2, 42, 42)>
                     val = <numpy.ndarray float32  (250, 2, 42, 42)>
                    test = <numpy.ndarray float32  (250, 2, 42, 42)>
                      ae = None

  File ./strips.py line 180 function show_summary : ae.summary()
                      ae = None
                   train = <numpy.ndarray float32  (4500, 2, 42, 42)>
                    test = <numpy.ndarray float32  (250, 2, 42, 42)>

AttributeError: 'NoneType' object has no attribute 'summary'

I am also considering using the trained weights directly if training cannot be done, but I need some guidance.

@guicho271828
Owner

guicho271828 commented Mar 4, 2021

Now I am using the 4.1.3 release source, and running ./setup-dataset.sh succeeds. However, I noticed a difference in setup-dataset.sh between 4.1.3 and the latest commit: in 4.1.3, no .npz files are downloaded.

setup-dataset also downloads unrelated npz files that are not used in the ijcai paper (but are used in other papers). Sorry for the confusion; this entire repository is a kind of "lab environment" that sets up everything I use across all of my papers. The failed downloads for photorealistic-blocksworld are not used, so no worries. All datasets needed to reproduce the ijcai paper are instead rendered locally using a script included in this repo.

Anyway, the problem I now run into is an error when executing ./train_all.sh.

Since you already have the trained weights, running this script is not necessary. All results, including the csv dump and the PDDL domain file, are included in the archive.

AttributeError: 'NoneType' object has no attribute 'summary'

Here is what is happening: task-planning learn_plot_dump_summary tries to run the training. However, since samples/*/grid_search.log already has more entries than the specified limit of hyperparameter configurations (300), it did not run the training; the model instance (ae) is therefore None.
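A minimal sketch of that guard logic (hypothetical, not the actual strips.py code; maybe_train and train_fn are illustrative names):

```python
# Hypothetical illustration of why ae ends up None: the grid search skips
# training once the log already holds enough hyperparameter configurations.
LIMIT = 300  # the configuration budget mentioned above

def maybe_train(log_entries, train_fn, limit=LIMIT):
    """Run train_fn only while the hyperparameter budget is not exhausted."""
    if len(log_entries) >= limit:
        return None  # budget spent: skip training, return no model
    return train_fn()
```

With 300 or more log entries, maybe_train returns None, and the later call ae.summary() then raises exactly the AttributeError shown in the trace.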

If you want to regenerate the reconstructions etc., then task-planning plot_dump_summary will load the stored weights, produce a reconstruction plot, and dump the files necessary for generating PDDL files. Make sure the archive is decompressed into the correct directory: the samples/ directory should sit in the root of the repository.

The hyperparameter search is fully parallelized at the process level, so on a machine with 8 cores and 8 GPUs you can simply run 8 processes in parallel.
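One way to sketch that process-level parallelism (illustrative only; train_all.sh is the script from this repo, while the GPU pinning via CUDA_VISIBLE_DEVICES and the log-file names are a generic pattern, not something prescribed by Latplan):

```shell
# Sketch: one grid-search process per GPU, each pinned to its own device,
# with stdout/stderr captured in a per-GPU log file.
NGPU=8
for gpu in $(seq 0 $((NGPU - 1))); do
    CUDA_VISIBLE_DEVICES=$gpu ./train_all.sh > "gpu-$gpu.log" 2>&1 &
done
wait
```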

@guicho271828
Owner

guicho271828 commented Mar 4, 2021

If you do want to train the model, you may also want to prune some hyperparameters by looking at samples/*/grid_search.log. For example, this one is the best hyperparameter set for the mandrill 15-puzzle. You can then edit the dictionary in strips.py accordingly.
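To see why pruning matters, note that each key in the parameters dictionary multiplies the grid size by the length of its list, so pinning values read off grid_search.log shrinks the search space multiplicatively. A small sketch (num_configs is an illustrative helper, and the values below are placeholders, not the actual best hyperparameters):

```python
from functools import reduce

def num_configs(grid):
    """Number of configurations in a grid-search dictionary of lists."""
    return reduce(lambda n, values: n * len(values), grid.values(), 1)

# Before pruning: every key contributes a factor equal to its list length.
grid = {'lr': [0.1, 0.01, 0.001], 'N': [100, 200, 500, 1000], 'layer': [1000]}
print(num_configs(grid))    # 3 * 4 * 1 = 12

# After pruning: pin each hyperparameter to a single-element list.
pruned = {'lr': [0.001], 'N': [500], 'layer': [1000]}
print(num_configs(pruned))  # 1
```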

@wn4github
Author

My immediate goal is to use the Cube-Space AE to encode some MNIST 8-puzzle images. Then, to better understand Latplan, I plan to train the network and get my hands dirty with the implementation. But this is off-topic for this issue; maybe I should open a new one. Thank you for all your help.
