
Work with TensorFlow backend #38

Closed
hammer opened this issue Jul 20, 2016 · 10 comments

@hammer
Contributor

hammer commented Jul 20, 2016

I have been able to get several Keras examples to work with the TensorFlow backend, but I have not been able to get mhcflurry to work.

@hammer
Contributor Author

hammer commented Jul 20, 2016

One clue: the preamble spew is different.

Here's what I get when running a Keras example:

# sudo KERAS_BACKEND=tensorflow PATH=$PATH:/usr/local/cuda/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 python examples/imdb_cnn_lstm.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:0a:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:09:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 2 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:06:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 3 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:05:00.0
Total memory: 12.00GiB
Free memory: 11.86GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX TITAN X, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX TITAN X, pci bus id: 0000:06:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)

Here's what I get when running mhcflurry:

# sudo KERAS_BACKEND=tensorflow PATH=$PATH:/usr/local/cuda/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 python script/mhcflurry-train-class1-allele-specific-models.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Using gpu device 0: GeForce GTX TITAN X (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 4007)

@hammer
Contributor Author

hammer commented Jul 20, 2016

Another clue: when I set breakpoints and step through mhcflurry, I die on constructors that seem to create TensorFlow variables. For example, https://github.com/hammerlab/mhcflurry/blob/master/mhcflurry/args.py#L172 runs fine if changed to read optimizer='RMSprop', but when using the object RMSprop(lr=args.learning_rate) I get:

F tensorflow/stream_executor/cuda/cuda_driver.cc:302] current context was not created by the StreamExecutor cuda_driver API: 0x2f95ff0; a CUDA runtime call was likely performed without using a StreamExecutor context
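A toy sketch of the timing difference, in plain Python: Keras resolves a string identifier lazily inside compile(), after the backend has initialized, while an optimizer object is constructed eagerly, here while the argparse defaults in args.py are evaluated, potentially before TensorFlow has set up its StreamExecutor CUDA context. FakeBackend, OPTIMIZERS, and compile_model below are made-up stand-ins for illustration, not Keras internals:

```python
events = []

class FakeBackend:
    initialized = False

    @classmethod
    def init(cls):
        cls.initialized = True
        events.append("backend initialized")

class RMSprop:
    def __init__(self, lr=0.001):
        # Constructing the object touches backend state immediately.
        events.append("optimizer constructed (backend ready: %s)"
                      % FakeBackend.initialized)
        self.lr = lr

OPTIMIZERS = {"rmsprop": RMSprop}

def compile_model(optimizer):
    FakeBackend.init()
    if isinstance(optimizer, str):
        optimizer = OPTIMIZERS[optimizer.lower()]()  # lazy: after backend init
    return optimizer

# Eager: the object is built before any backend setup has run.
compile_model(RMSprop(lr=0.01))
print(events[0])  # optimizer constructed (backend ready: False)

events.clear()
FakeBackend.initialized = False
# Lazy: the string is only resolved once the backend is up.
compile_model("RMSprop")
print(events)     # ['backend initialized', 'optimizer constructed (backend ready: True)']
```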

@hammer
Contributor Author

hammer commented Jul 20, 2016

Further investigation indicates that the "successfully opened CUDA library" messages appear when you import keras.

The "Found device" spew in the Keras example happens at the model.add(Embedding(max_features, embedding_size, input_length=maxlen)) call at https://github.com/fchollet/keras/blob/master/examples/imdb_cnn_lstm.py#L10.

The "Using gpu device 0" spew in mhcflurry happens when you run from mhcflurry.common import normalize_allele_name.

@hammer changed the title from "Work with TensorFlow as backend" to "Work with TensorFlow backend" Aug 1, 2016
@hammer
Contributor Author

hammer commented Aug 1, 2016

I put a notebook into a branch that tries to build an mhcflurry-like model of our IEDB data using Keras directly, rather than going through the mhcflurry libraries.

Currently I'm dying on model.add(Dense(input_dim=9*21, output_dim=1)), which appears to be an issue with IPython, as it loads just fine from a regular Python shell. Note that the IPython issue is fixed in TF 0.9.0 (I'm running 0.8.0).
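For reference, Dense(input_dim=9*21, output_dim=1) is just one affine map from the flattened 9-position x 21-symbol peptide encoding to a scalar. A plain-Python sketch of the shapes involved (no Keras required; W, b, and dense below are illustrative stand-ins, not Keras internals):

```python
import random

# Shapes from the snippet: 9 peptide positions x 21 amino-acid symbols,
# flattened to a 189-dimensional input, mapped to a single output.
input_dim, output_dim = 9 * 21, 1

W = [[random.gauss(0, 0.05) for _ in range(output_dim)] for _ in range(input_dim)]
b = [0.0] * output_dim

def dense(x):
    # y_j = sum_i x_i * W[i][j] + b[j]
    return [sum(x[i] * W[i][j] for i in range(input_dim)) + b[j]
            for j in range(output_dim)]

x = [random.random() for _ in range(input_dim)]
y = dense(x)
print(len(x), len(y))  # 189 1
```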

@hammer
Contributor Author

hammer commented Aug 1, 2016

Okay, to avoid the IPython issues I've pushed a minimal script that runs on the Theano backend but segfaults on the TF backend.

@hammer
Contributor Author

hammer commented Aug 1, 2016

I have no idea why https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py works and https://github.com/hammerlab/mhcflurry/blob/hammer_tf_backend/script/tf-mhcflurry.py segfaults. Getting close to moving on from this issue and using Theano.

@hammer
Contributor Author

hammer commented Aug 1, 2016

Update: the segfault goes away if I remove the mhcflurry imports. TF doesn't like something we're importing in mhcflurry.

@hammer
Contributor Author

hammer commented Aug 1, 2016

Incredibly, setting device=cpu in ~/.theanorc, as discussed at tensorflow/tensorflow#916 (comment), made the TF backend work! To avoid stomping on people using Theano, you can instead run your Keras scripts on the TensorFlow backend with THEANO_FLAGS='device=cpu' KERAS_BACKEND=tensorflow.

Even though I deleted every reference to theano from my local mhcflurry install, this fix works. No idea why!
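A quick sanity check that the per-process form actually reaches the child process (a hypothetical one-liner, not a command from this thread):

```shell
# Per-process overrides: Theano (if imported at all) stays on the CPU,
# and Keras selects the TensorFlow backend -- no ~/.theanorc edit needed.
THEANO_FLAGS='device=cpu' KERAS_BACKEND=tensorflow python3 -c \
  'import os; print(os.environ["KERAS_BACKEND"], os.environ["THEANO_FLAGS"])'
```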

@hammer closed this as completed Aug 1, 2016
@iskandr
Contributor

iskandr commented Aug 1, 2016

Wow, thanks for the perseverance.

@hammer
Contributor Author

hammer commented Aug 3, 2016

Update: I can successfully train models, though they're pretty slow. I get this warning from TensorFlow, which may indicate that Keras is generating bad TF code? Some discussion at tensorflow/tensorflow#206 (comment).

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py:89: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape."
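Rough sketch of what the warning means, in plain Python with no TensorFlow (densify below is a made-up stand-in for the sparse-to-dense conversion, not a TF API): the gradient of an Embedding/gather lookup only touches the rows that were looked up, so TF represents it as (indices, values) pairs; densifying it allocates the full vocabulary-sized array even when only a few rows are nonzero.

```python
def densify(indices, values, num_rows, dim):
    # Allocate the full dense gradient, even though only a few rows are touched.
    dense = [[0.0] * dim for _ in range(num_rows)]
    # Scatter-add each sparse row into place (repeated indices accumulate).
    for i, row in zip(indices, values):
        for j, v in enumerate(row):
            dense[i][j] += v
    return dense

# 2 distinct rows touched out of a 20,000-word vocabulary:
indices = [5, 17, 5]
values = [[1.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
dense = densify(indices, values, num_rows=20000, dim=2)
print(dense[5], dense[17], len(dense))  # [1.5, 1.5] [2.0, 2.0] 20000
```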

@maximz mentioned this issue Aug 3, 2016