
ReLERNN train TF2 model.fit memory leak and errors #16

Closed
LZeitler opened this issue Feb 24, 2020 · 12 comments

Comments

@LZeitler

I have problems running the TF2 version of ReLERNN.
I'm using:
tensorflow 2.1
cudatk 10.1.243
cudnn 7.6.4
CUDA enabled GPU (1080Ti)

Memory leak
Memory usage keeps increasing with each training iteration, eventually exceeding 200 GB of RAM. I think it's related to these issues:
tensorflow/tensorflow#33030
tensorflow/tensorflow#35100
I also tried the nightly build, which has the same issue.

Error message
I'm also getting error and warning messages in each epoch with TF2.

2020-02-22 01:44:32.078164: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.

I don't know if these problems are related but maybe they are.
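
For what it's worth, the tf.data route that the warning recommends looks roughly like the sketch below (the generator, shapes, and batch size here are placeholders, not ReLERNN's actual pipeline):

```python
import numpy as np
import tensorflow as tf

def example_generator():
    # Placeholder generator yielding (features, target) pairs;
    # the shapes are made up and not ReLERNN's real input dimensions.
    for _ in range(1000):
        yield np.random.rand(100, 8).astype("float32"), np.float32(np.random.rand())

dataset = (
    tf.data.Dataset.from_generator(
        example_generator,
        output_types=(tf.float32, tf.float32),
        output_shapes=((100, 8), ()),
    )
    .batch(64)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# model.fit(dataset, epochs=10)  # no workers/use_multiprocessing arguments needed
```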

On a side note, I can run the example pipeline and it produces output, even though another error comes up when loading modules (Could not load dynamic library 'libnvinfer.so.6').

There was another issue with ReLERNN train in the earlier TF1 commits, where model.fit_generator was used: model fitting would not finish after all epochs had run, without any error message. Maybe you have an idea what the problem could be there? Then I could use the TF1 version of ReLERNN and run my data that way.

I'm running it on a dataset with 5 individuals and about 2M SNPs (unphased, with some missing data).

Any help would be greatly appreciated.

@andrewkern
Member

hey there-- if you're getting the library error I would guess something is wrong in your NVIDIA setup. How did you install TF / the CUDA tools?

Also I'm not sure if ReLERNN is ready for tf2.1, but I do know @jradrion has it working on tf2....

@LZeitler
Author

Hi Andrew,
I installed TF 2.1 with pip. According to the readme, ReLERNN is tested on TF 2.1, so I thought I'd give it a go and tried to match the dependencies as closely as possible.
Do you think the library error is related to memory usage?

@andrewkern
Member

so this memory leak seems to be on the TF side, but the error you report has to do with the NVIDIA tools-- how was CUDA installed on your system?

@LZeitler
Author

Not sure, it loads with Python when run on a GPU node.

@andrewkern
Member

Okay, so one question for your admins is what happened to nvinfer. It seems to be either installed somewhere TF can't find it, or not installed at all.

@LZeitler
Author

I will ask them what's going on.
So, in general, would you say to rather use ReLERNN with TF 1.x for now?

@jradrion
Contributor

@LZeitler I had not noticed a memory leak issue in my testing of ReLERNN with tf2.1. However, I tested by running only ~10 epochs for speed, and our machine has a fairly large amount of memory, so it's possible that I'll find this issue when training for more epochs. I will be testing this ASAP.

As for the warning you first describe, I also get that warning at every epoch. I had seen a comment in this thread about it being spurious, and temporarily ignored it since everything else appeared to be working. However, users are now saying it is connected to the memory leak issue.

The warning about Could not load dynamic library 'libnvinfer.so.6' appears to be connected to not having TensorRT installed. TensorRT should not be necessary, but the warning did go away once we installed it.

I'll do some more testing and report back.

@jradrion
Contributor

@LZeitler I have not forgotten about you. I'm still debugging a number of issues related to this move to tf2.

@jradrion
Contributor

@LZeitler I removed multiprocessing from model.fit, and I was able to run a full-sized dataset to >400 epochs without running into any memory issues. Could you pull these changes and reinstall? Please let me know if you are still having issues.
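
To illustrate (this is a toy sketch, not the exact ReLERNN code or diff): the change amounts to dropping the use_multiprocessing/workers arguments from the Keras fit call.

```python
import numpy as np
import tensorflow as tf

class ToySequence(tf.keras.utils.Sequence):
    """Toy stand-in for ReLERNN's batch generator; sizes are made up."""
    def __len__(self):
        return 10
    def __getitem__(self, idx):
        x = np.random.rand(32, 8).astype("float32")
        y = np.random.rand(32, 1).astype("float32")
        return x, y

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="adam", loss="mse")

# Before (roughly): model.fit(ToySequence(), use_multiprocessing=True, workers=4, ...)
# After the change the multiprocessing arguments are simply dropped:
model.fit(ToySequence(), epochs=2, verbose=0)
```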

@LZeitler
Author

LZeitler commented Mar 2, 2020

@jradrion I pulled and reinstalled.
For now I'm ignoring the warnings related to TensorRT.
However, testing with the example pipeline, I am now getting some other warnings, followed by an error:

2020-03-02 15:39:32.734313: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[IteratorGetNext/_2]]
2020-03-02 15:39:32.734285: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 200 batches). You may need to use the repeat() function when building your dataset.
WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
Traceback (most recent call last):
File "/cluster/home/zeitlerl/.local/bin/ReLERNN_TRAIN", line 117, in <module>
main()
File "/cluster/home/zeitlerl/.local/bin/ReLERNN_TRAIN", line 107, in main
gpuID=args.gpuID)
File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/ReLERNN/helpers.py", line 371, in runModels
model.load_weights(network[1])
File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 234, in load_weights
return super(Model, self).load_weights(filepath, by_name, skip_mismatch)
File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 1222, in load_weights
hdf5_format.load_weights_from_hdf5_group(f, self.layers)
File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 699, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 3323, in batch_set_value
x.assign(np.asarray(value, dtype=dtype(x)))
File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 819, in assign
self._shape.assert_is_compatible_with(value_tensor.shape)
File "/cluster/home/zeitlerl/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_shape.py", line 1110, in assert_is_compatible_with
raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (2610, 256) and (2036, 256) are incompatible

@LZeitler
Author

LZeitler commented Mar 2, 2020

@jradrion Running on the big dataset works now! Thanks for the fix! The memory issue is also resolved!

@jradrion
Contributor

jradrion commented Mar 2, 2020

Hi @LZeitler, thanks for bringing the issue with the example scripts to my attention. I had to bump up the number of training simulations to avoid this error with a fixed number of epochs. ReLERNN/examples/example_pipeline.sh should be working now. Glad to see you are no longer having memory issues with your big dataset.
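
For anyone hitting the same "ran out of data" warning, the constraint is just the arithmetic in the warning text: the data source has to supply at least steps_per_epoch * epochs batches. A rough sketch with made-up numbers (not ReLERNN's actual settings):

```python
n_examples = 1000      # e.g. number of training simulations available
batch_size = 64
epochs = 250

steps_per_epoch = n_examples // batch_size     # 15 batches per epoch here
batches_requested = steps_per_epoch * epochs   # 3750 batches over the whole run

# If the generator/dataset cannot cycle, it must be able to produce that many
# batches, so either increase the number of simulations or repeat the dataset.
print(steps_per_epoch, batches_requested)
```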

I'm going to go ahead and close this issue. Please let me know if you come across any other problems.

Best,
Jeff

jradrion closed this as completed Mar 2, 2020