errors upon running the project #1

Open
kirk86 opened this issue Sep 29, 2016 · 14 comments

Comments

@kirk86

kirk86 commented Sep 29, 2016

Hi, I was trying to test the project implementation, but I'm running into some errors when running the following command:

python train.py with problem=cifar10 n_z=32 n_h=64 depths=[2,2,2] margs.depth_ar=1 margs.posterior=down_iaf2_NL margs.kl_min=0.25

[graphy] floatX = float32
Traceback (most recent call last):
  File "train.py", line 1, in <module>
    import graphy as G
  File "/home/user/projects/python/theano/iaf/graphy/__init__.py", line 45, in <module>
    import misc.data
  File "/home/user/projects/python/theano/iaf/graphy/misc/data.py", line 6, in <module>
    basepath = os.environ['ML_DATA_PATH']
  File "/home/user/anaconda2/lib/python2.7/UserDict.py", line 40, in __getitem__
    raise KeyError(key)
KeyError: 'ML_DATA_PATH'

Any suggestions much appreciated!

@rfarouni

rfarouni commented Oct 7, 2016

I encountered the same error. I just set the environment variable ML_DATA_PATH to the path where I keep my data. You also need to set ML_LOG_PATH to some other location. Besides the environment variables, I needed to install two packages I didn't have on my system:

conda install pil
pip install sacred
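
A quick way to confirm that both packages are importable before launching train.py is sketched below. It assumes a Python 3 interpreter (the thread itself uses Python 2.7, where `importlib.util.find_spec` is not available), and `missing_modules` is a hypothetical helper, not something from the repository. `PIL` is the import name of the pil package.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "PIL" and "sacred" are the two packages mentioned in the comment above.
    missing = missing_modules(["PIL", "sacred"])
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```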

@kirk86
Author

kirk86 commented Oct 7, 2016

@rfarouni thanks a lot for the suggestions. For future reference for anyone else, I ended up with the following 3 environment variables:

ML_DATA_PATH=/path/to/cifar10
CIFAR10_PATH=/path/to/cifar10
ML_LOG_PATH=/path/to/logs
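
Because a missing variable only surfaces as a bare KeyError at import time, a small pre-flight check can make the failure easier to read. This is a minimal sketch: `unset_variables` and `REQUIRED_VARS` are hypothetical names, and the list of variables is taken from this comment rather than from the repository's documentation.

```python
import os

# Variables named in this thread; the project may read others as well.
REQUIRED_VARS = ("ML_DATA_PATH", "CIFAR10_PATH", "ML_LOG_PATH")


def unset_variables(names, environ=None):
    """Return the variable names that are missing or empty in environ."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = unset_variables(REQUIRED_VARS)
    if missing:
        print("Set these before running train.py: " + ", ".join(missing))
```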

then I modified graphy/nodes/conv.py by adding

# Pick the convolution/pooling implementation that matches the configured device.
if 'gpu' in theano.config.device:  # @UndefinedVariable
    from theano.sandbox.cuda.dnn import dnn_conv
    from theano.sandbox.cuda.dnn import dnn_pool
elif 'cuda' in theano.config.device:  # @UndefinedVariable
    from theano.sandbox.gpuarray.dnn import dnn_conv
    from theano.sandbox.gpuarray.dnn import dnn_pool
elif 'cpu' in theano.config.device:
    # CPU fallback for machines without a GPU.
    from theano.tensor.nnet import conv2d as dnn_conv
    from theano.tensor.signal.pool import pool_2d as dnn_pool
else:
    raise Exception("Unsupported device: " + theano.config.device)

since I don't have a GPU on my machine,

but now I get the following error about the posterior down_iaf2_NL:

[graphy] floatX = float32
INFO - Deep VAE - Running command 'train'
WARNING - Deep VAE - No observers have been added to this run
INFO - Deep VAE - Started
Logpath: /Users/user/iaf/logs//1475861678.05/
CVAE1 with  {'depths': [2, 2, 2], 'nl': u'elu', 'n_h2': 64, 'n_z': 32, 'shape_x': [3, 32, 32], 'optim': u'adamax', 'weightsharing': False, 'px': u'logistic', 'kernel_x': [5, 5], 'n_h1': 64, 'prior': u'diag', 'posterior': u'down_iaf2_NL', 'pad_x': 0, 'beta2': 0.001, 'beta1': 0.1, 'depth_ar': 1, 'alpha': 0.002, 'kl_min': 0.25, 'downsample_type': u'nn', 'kernel_h': [3, 3]}
ERROR - Deep VAE - Failed after 0:00:01!
Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 185, in train
    model = construct_model(data_init)
  File "train.py", line 128, in construct_model
    model = models.cvae1(**margs)
  File "/Users/user/iaf/models.py", line 396, in cvae1
    layers[i].append(cvae_layer(name, prior, posterior, n_h1, n_h2, n_z, depth_ar, downsample, nl, kernel_h, False, downsample_type, w))
  File "/Users/user/iaf/models.py", line 105, in cvae_layer
    raise Exception("Unknown posterior "+posterior)
Exception: Unknown posterior down_iaf2_NL

@rfarouni

rfarouni commented Oct 7, 2016

@kirk86 You need to change down_iaf2_NL to down_iaf2_nl (lowercase "nl") in this command:

python train.py with problem=cifar10 n_z=32 n_h=64 depths=[2,2,2] margs.depth_ar=1 margs.posterior=down_iaf2_NL margs.kl_min=0.25
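
The posterior name is evidently compared case-sensitively, which is why the uppercase spelling is rejected. A hedged sketch of how a case-insensitive lookup could avoid this is shown below; `resolve_posterior` and `KNOWN_POSTERIORS` are hypothetical and not part of models.py, and only the name `down_iaf2_nl` is confirmed by this thread.

```python
# Hypothetical list: only "down_iaf2_nl" is confirmed by this thread.
KNOWN_POSTERIORS = ("down_iaf2_nl",)


def resolve_posterior(name, known=KNOWN_POSTERIORS):
    """Return the known posterior name that matches `name`, ignoring case."""
    for candidate in known:
        if candidate.lower() == name.lower():
            return candidate
    raise ValueError("Unknown posterior " + name)


print(resolve_posterior("down_iaf2_NL"))  # prints down_iaf2_nl
```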

@kirk86
Author

kirk86 commented Oct 7, 2016

@rfarouni thanks. Now I'm getting the following error:

python train.py with problem=cifar10 n_z=32 n_h=64 depths=[2,2,2] margs.depth_ar=1 margs.posterior=down_iaf2_nl margs.kl_min=0.25
[graphy] floatX = float32
INFO - Deep VAE - Running command 'train'
WARNING - Deep VAE - No observers have been added to this run
INFO - Deep VAE - Started
Logpath: /Users/user/iaf/logs//1475865352.93/
CVAE1 with  {'depths': [2, 2, 2], 'nl': u'elu', 'n_h2': 64, 'n_z': 32, 'shape_x': [3, 32, 32], 'optim': u'adamax', 'weightsharing': False, 'px': u'logistic', 'kernel_x': [5, 5], 'n_h1': 64, 'prior': u'diag', 'posterior': u'down_iaf2_nl', 'pad_x': 0, 'beta2': 0.001, 'beta1': 0.1, 'depth_ar': 1, 'alpha': 0.002, 'kl_min': 0.25, 'downsample_type': u'nn', 'kernel_h': [3, 3]}
ERROR - Deep VAE - Failed after 0:00:01!
Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 185, in train
    model = construct_model(data_init)
  File "train.py", line 128, in construct_model
    model = models.cvae1(**margs)
  File "/Users/users/iaf/models.py", line 520, in cvae1
    f_encode_decode(w)
  File "/Users/users/iaf/models.py", line 416, in f_encode_decode
    h = x_enc(_x - .5, w)
  File "/Users/users/iaf/graphy/nodes/conv.py", line 196, in f
    input_shape = h.tag.test_value.shape[1:]
AttributeError: scratchpad instance has no attribute 'test_value'

@rfarouni

rfarouni commented Oct 7, 2016

@kirk86 In .theanorc, add the line compute_test_value=raise. My file looks like this:

[global]
floatX = float32
device = gpu
compute_test_value=raise

Note: I am using a GPU
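
The earlier AttributeError on `h.tag.test_value` presumably appears because Theano only attaches test values to intermediate variables when `compute_test_value` is enabled. To double-check that a .theanorc file really contains the setting, it can be parsed with the standard library, as in this sketch. `read_compute_test_value` is a hypothetical helper, the sample text mirrors the file above, and Python 3's `configparser` is assumed.

```python
import configparser

SAMPLE_THEANORC = """\
[global]
floatX = float32
device = gpu
compute_test_value=raise
"""


def read_compute_test_value(theanorc_text):
    """Return the compute_test_value setting from [global], or None if absent."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names such as floatX case-sensitive
    parser.read_string(theanorc_text)
    return parser.get("global", "compute_test_value", fallback=None)


print(read_compute_test_value(SAMPLE_THEANORC))  # prints raise
```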

@kirk86
Author

kirk86 commented Oct 7, 2016

@rfarouni thanks for your patience! I was just about to close this when I saw the printed messages, but then I ran into this NaN error:

ar.conv2d 0_0_posterior_conv1_out_1 (64, 16, 16) (32, 16, 16) [3, 3] True False True valid False True
conv2d 0_0_down_conv2_1 (96, 16, 16) (64, 16, 16) [3, 3] True valid (1, 1) 1
conv2d x_dec (64, 16, 16) (3, 32, 32) [5, 5] True valid (1, 1) 2
AdaMax_Avg alpha: -0.002 beta1: 0.1 beta2: 0.001
Compiling...  212.75 s
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan], dtype=float32)]
ERROR - Deep VAE - Failed after 0:11:00!
Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 256, in train
    result = model.train(data_train, n_batch=n_batch)
  File "/Users/user/iaf/models.py", line 538, in newf
    return f.cache(*args, **kws)
  File "/Users/user/iaf/graphy/function.py", line 110, in func
    raise Exception("NaN detected")
Exception: NaN detected
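
The thread does not show what graphy/function.py does at line 110, but a check of this kind can be sketched with the standard library alone. `has_nan` and `checked` are hypothetical names, and plain Python lists stand in for the NumPy arrays printed in the log above.

```python
import math


def has_nan(values):
    """Return True if any float in the (possibly nested) sequence is NaN."""
    for value in values:
        if isinstance(value, (list, tuple)):
            if has_nan(value):
                return True
        elif math.isnan(value):
            return True
    return False


def checked(results):
    """Return results unchanged, or raise if they contain a NaN."""
    if has_nan(results):
        raise ValueError("NaN detected")
    return results
```

Running such a check on the first training batch, rather than after a long compile and run, can help narrow down whether the NaN comes from initialization or from later updates.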

@rfarouni

rfarouni commented Oct 8, 2016

@kirk86 Although I didn't encounter this error, I got a memory error after a minute or so of run time. I have only a 4 GB GPU, and it seems I need more memory to run the code on this dataset with the parameters that were provided.

@kirk86
Author

kirk86 commented Oct 8, 2016

@rfarouni, ah, that gives me a hint to try it on a machine with more memory as well, even though I'm not confident that this NaN error comes from memory issues in my case. It looks to me more like a computation error in the actual code than a memory problem. I'll test it on a bigger machine just in case and report back.

On another note, I like your favorite quotes section 👍. I might steal the idea; I've collected lots of such quotes in my notebook but never posted them...

@kirk86
Author

kirk86 commented Oct 14, 2016

An update on this issue: I ran the script on a machine with 8 cores and 32GB of memory, and for two days now it has been like watching a black hole, in memory terms. The only thing running on that machine is this script, and so far it has swallowed 27GB of memory. I'm not closing this issue until the script finishes. But to be fair, this is kind of insane in terms of memory consumption.
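
To see whether memory grows steadily across iterations instead of watching the machine by hand, the process's peak memory can be logged from inside the training loop. This is a minimal sketch, assuming a Unix system and Python 3: `peak_rss_mib` and `run_with_memory_log` are hypothetical helpers, and `step` stands in for whatever a single training iteration does.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_with_memory_log(step, n_steps, every=100):
    """Call step(i) n_steps times, recording peak memory every `every` steps."""
    log = []
    for i in range(n_steps):
        step(i)
        if i % every == 0:
            log.append((i, peak_rss_mib()))
    return log
```

A peak that keeps climbing between log entries would point to a leak or an ever-growing cache, while a peak that settles after the first few entries would suggest the model simply needs that much memory.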

@dpkingma
Contributor

Hi Kirk86,

Thanks for bringing this up; it definitely shouldn't need that much memory,
so this is probably a bug. I'll look into it.


@rfarouni

@dpkingma @kirk86 I also ran into memory problems on the GPU the first time I ran it. The second time, for some unexplained reason, it worked fine, although very slowly. I also tried to run the TensorFlow implementation on one GPU, but I encountered this error:

python tf_train.py --logdir $ML_LOG_PATH --hpconfig depth=1,num_blocks=20,kl_min=0.1,learning_rate=0.002,batch_size=32 --num_gpus 1 --mode train
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.1.5 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
Num trainable variables: 41557927
starting training
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: GeForce GTX 960
major: 5 minor: 2 memoryClockRate (GHz) 1.304
pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.41GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0)
Starting queue runners
Initializing parameters.
Initialized!
Traceback (most recent call last):
  File "tf_train.py", line 392, in <module>
    tf.app.run()
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "tf_train.py", line 386, in main
    run(hps)
  File "tf_train.py", line 265, in run
    with sv.managed_session(config=config) as sess:
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 974, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 802, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/rick/anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 963, in managed_session
    start_standard_services=start_standard_services)
  File "/home/rick/Documents/repos/iaf/tf_utils/common.py", line 222, in prepare_or_wait_for_session
    not_ready = self._session_manager._model_not_ready(sess)
AttributeError: 'SessionManager' object has no attribute '_model_not_ready'

@kirk86
Author

kirk86 commented Oct 16, 2016

@rfarouni I know this might not be the solution, but did you install tqdm? As for the slowness, it's something I've experienced too. In my case it takes almost one day per epoch.

@rfarouni

@kirk86 sure! conda install tqdm

@Mistobaan

tqdm is not in conda, but it is in pip: pip install tqdm
