
Work with TensorFlow backend #38

Closed
hammer opened this issue Jul 20, 2016 · 10 comments

@hammer
Contributor

hammer commented Jul 20, 2016

I have been able to get several Keras examples to work with the TensorFlow backend, but I have not been able to get mhcflurry to work.

@hammer
Contributor Author

hammer commented Jul 20, 2016

One clue: the preamble spew is different.

Here's what I get when running a Keras example:

# sudo KERAS_BACKEND=tensorflow PATH=$PATH:/usr/local/cuda/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 python examples/imdb_cnn_lstm.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:0a:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:09:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 2 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:06:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 3 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:05:00.0
Total memory: 12.00GiB
Free memory: 11.86GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX TITAN X, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX TITAN X, pci bus id: 0000:06:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)

Here's what I get when running mhcflurry:

# sudo KERAS_BACKEND=tensorflow PATH=$PATH:/usr/local/cuda/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 python script/mhcflurry-train-class1-allele-specific-models.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Using gpu device 0: GeForce GTX TITAN X (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 4007)

@hammer
Contributor Author

hammer commented Jul 20, 2016

Another clue: when I set breakpoints and step through mhcflurry, I die on constructors that seem to create TensorFlow variables. For example, https://github.com/hammerlab/mhcflurry/blob/master/mhcflurry/args.py#L172 runs fine if changed to read optimizer='RMSprop', but when using the object RMSprop(lr=args.learning_rate) I get:

F tensorflow/stream_executor/cuda/cuda_driver.cc:302] current context was not created by the StreamExecutor cuda_driver API: 0x2f95ff0; a CUDA runtime call was likely performed without using a StreamExecutor context
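A toy sketch of the timing difference, in plain Python: Keras resolves a string identifier lazily inside compile(), after the backend has initialized, while an optimizer object is constructed eagerly, here while the argparse defaults in args.py are evaluated, potentially before TensorFlow has set up its StreamExecutor CUDA context. FakeBackend, OPTIMIZERS, and compile_model below are made-up stand-ins for illustration, not Keras internals:

```python
events = []

class FakeBackend:
    initialized = False

    @classmethod
    def init(cls):
        cls.initialized = True
        events.append("backend initialized")

class RMSprop:
    def __init__(self, lr=0.001):
        # Constructing the object touches backend state immediately.
        events.append("optimizer constructed (backend ready: %s)"
                      % FakeBackend.initialized)
        self.lr = lr

OPTIMIZERS = {"rmsprop": RMSprop}

def compile_model(optimizer):
    FakeBackend.init()
    if isinstance(optimizer, str):
        optimizer = OPTIMIZERS[optimizer.lower()]()  # lazy: after backend init
    return optimizer

# Eager: the object is built before any backend setup has run.
compile_model(RMSprop(lr=0.01))
print(events[0])  # optimizer constructed (backend ready: False)

events.clear()
FakeBackend.initialized = False
# Lazy: the string is only resolved once the backend is up.
compile_model("RMSprop")
print(events)     # ['backend initialized', 'optimizer constructed (backend ready: True)']
```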

@hammer
Contributor Author

hammer commented Jul 20, 2016

Further investigation indicates that the "successfully opened CUDA library" messages appear when you import keras.

The "Found device" spew in the Keras example happens at the model.add(Embedding(max_features, embedding_size, input_length=maxlen)) call at https://github.com/fchollet/keras/blob/master/examples/imdb_cnn_lstm.py#L10.

The "Using gpu device 0" spew in mhcflurry happens when you run from mhcflurry.common import normalize_allele_name.

@hammer changed the title from "Work with TensorFlow as backend" to "Work with TensorFlow backend" Aug 1, 2016
@hammer
Contributor Author

hammer commented Aug 1, 2016

I put a notebook into a branch that tries to build an mhcflurry-like model of our IEDB data using Keras directly, rather than going through the mhcflurry libraries.

Currently I'm dying on model.add(Dense(input_dim=9*21, output_dim=1)), which appears to be an issue with IPython, as it loads just fine from a regular Python shell. Note that the IPython issue is fixed in TF 0.9.0 (I'm running 0.8.0).
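For reference, Dense(input_dim=9*21, output_dim=1) is just one affine map from the flattened 9-position x 21-symbol peptide encoding to a scalar. A plain-Python sketch of the shapes involved (no Keras required; W, b, and dense below are illustrative stand-ins, not Keras internals):

```python
import random

# Shapes from the snippet: 9 peptide positions x 21 amino-acid symbols,
# flattened to a 189-dimensional input, mapped to a single output.
input_dim, output_dim = 9 * 21, 1

W = [[random.gauss(0, 0.05) for _ in range(output_dim)] for _ in range(input_dim)]
b = [0.0] * output_dim

def dense(x):
    # y_j = sum_i x_i * W[i][j] + b[j]
    return [sum(x[i] * W[i][j] for i in range(input_dim)) + b[j]
            for j in range(output_dim)]

x = [random.random() for _ in range(input_dim)]
y = dense(x)
print(len(x), len(y))  # 189 1
```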

@hammer
Contributor Author

hammer commented Aug 1, 2016

Okay, to avoid the IPython issues I've pushed a minimal script that runs on the Theano backend but segfaults on the TF backend.

@hammer
Contributor Author

hammer commented Aug 1, 2016

I have no idea why https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py works and https://github.com/hammerlab/mhcflurry/blob/hammer_tf_backend/script/tf-mhcflurry.py segfaults. Getting close to moving on from this issue and using Theano.

@hammer
Contributor Author

hammer commented Aug 1, 2016

Update: the segfault goes away if I remove the mhcflurry imports. TF doesn't like something we're importing in mhcflurry.

@hammer
Contributor Author

hammer commented Aug 1, 2016

Incredibly, setting device=cpu in ~/.theanorc, as discussed at tensorflow/tensorflow#916 (comment), made the TF backend work! To avoid stomping on people using Theano, you can instead run your Keras scripts on the TensorFlow backend with THEANO_FLAGS='device=cpu' KERAS_BACKEND=tensorflow.

Even though I deleted every reference to theano from my local mhcflurry install, this fix works. No idea why!
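A quick sanity check that the per-process form actually reaches the child process (a hypothetical one-liner, not a command from this thread):

```shell
# Per-process overrides: Theano (if imported at all) stays on the CPU,
# and Keras selects the TensorFlow backend -- no ~/.theanorc edit needed.
THEANO_FLAGS='device=cpu' KERAS_BACKEND=tensorflow python3 -c \
  'import os; print(os.environ["KERAS_BACKEND"], os.environ["THEANO_FLAGS"])'
```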

@hammer closed this as completed Aug 1, 2016
@iskandr
Contributor

iskandr commented Aug 1, 2016

Wow, thanks for the perseverance.

@hammer
Contributor Author

hammer commented Aug 3, 2016

Update: I can successfully train models, though they're pretty slow. I get this warning from TensorFlow, which may indicate that Keras is generating bad TF code? Some discussion at tensorflow/tensorflow#206 (comment).

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py:89: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape."
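Rough sketch of what the warning means, in plain Python with no TensorFlow (densify below is a made-up stand-in for the sparse-to-dense conversion, not a TF API): the gradient of an Embedding/gather lookup only touches the rows that were looked up, so TF represents it as (indices, values) pairs; densifying it allocates the full vocabulary-sized array even when only a few rows are nonzero.

```python
def densify(indices, values, num_rows, dim):
    # Allocate the full dense gradient, even though only a few rows are touched.
    dense = [[0.0] * dim for _ in range(num_rows)]
    # Scatter-add each sparse row into place (repeated indices accumulate).
    for i, row in zip(indices, values):
        for j, v in enumerate(row):
            dense[i][j] += v
    return dense

# 2 distinct rows touched out of a 20,000-word vocabulary:
indices = [5, 17, 5]
values = [[1.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
dense = densify(indices, values, num_rows=20000, dim=2)
print(dense[5], dense[17], len(dense))  # [1.5, 1.5] [2.0, 2.0] 20000
```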

@maximz mentioned this issue Aug 3, 2016