
Keras fit_generator not using GPU in TF 1.3.0 #7732

Closed
atabakd opened this issue Aug 24, 2017 · 2 comments

Comments


atabakd commented Aug 24, 2017

I am trying to train the following model:

from tensorflow.contrib import keras
model = keras.models.Sequential()
model.add(keras.layers.GRU(32, input_shape=(None,float_data.shape[-1])))
model.add(keras.layers.Dense(1))
model.compile(optimizer=keras.optimizers.RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)

But this model is being trained only on the CPU, which I have confirmed with nvidia-smi. I know I am using the GPU version of TensorFlow, and if I use model.fit instead of model.fit_generator (provided I reorganise the data, etc.), Keras does indeed use the GPU.
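For reference, `train_gen` in the snippet above is assumed to be an infinite Python generator yielding `(samples, targets)` batches, which is what `fit_generator` expects. A minimal pure-Python sketch (the function name and arguments are illustrative, not from the original code):

```python
def batch_generator(data, targets, batch_size):
    """Infinite generator yielding (inputs, targets) batches, as fit_generator expects."""
    i = 0
    while True:  # Keras generators must loop forever; steps_per_epoch bounds each epoch
        yield data[i:i + batch_size], targets[i:i + batch_size]
        i = (i + batch_size) % len(data)
```

`fit_generator` draws `steps_per_epoch` batches from such a generator per epoch; the generator itself always runs on the CPU, independently of where TensorFlow places the model's ops.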

@fabriciorsf

I get the same problem with Keras+TensorFlow on fit_generator.
The same code with Keras+Theano works fine.

The following call produces the error:

model.fit_generator(self.train_inputs, steps_per_epoch=self.train_inputs.steps_per_epoch(),
                    validation_data=test_input_sequence, validation_steps=steps_test,
                    max_queue_size=self.train_inputs.workers, epochs=i+1, initial_epoch=i,
                    workers=self.train_inputs.workers, use_multiprocessing=True,
                    callbacks=callbacks)

The error:

Epoch 1/1
Traceback (most recent call last):
  File "/opt/programs/miniconda3/envs/myenv/lib/python3.6/site-packages/keras/utils/data_utils.py", line 497, in get
    inputs = self.queue.get(block=True).get()
  File "/opt/programs/miniconda3/envs/myenv/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/opt/programs/miniconda3/envs/myenv/lib/python3.6/multiprocessing/pool.py", line 385, in _handle_tasks
    put(task)
  File "/opt/programs/miniconda3/envs/myenv/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/opt/programs/miniconda3/envs/myenv/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: can't pickle _thread.lock objects

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./myfolder/mycode.py", line 473, in <module>
    main()
  File "./myfolder/mycode.py", line 459, in main
    autonem.train_autonem(args.embedding_file, args.tune_embedding)
  File "./myfolder/mycode.py", line 182, in train_autonem
    callbacks = callbacks)
  File "/opt/programs/miniconda3/envs/myenv/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/opt/programs/miniconda3/envs/myenv/lib/python3.6/site-packages/keras/engine/training.py", line 1809, in fit_generator
    generator_output = next(output_generator)
  File "/opt/programs/miniconda3/envs/myenv/lib/python3.6/site-packages/keras/utils/data_utils.py", line 502, in get
    raise StopIteration(e)
StopIteration: can't pickle _thread.lock objects
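The root of this traceback is that `use_multiprocessing=True` makes Keras dispatch generator work through `multiprocessing`, which pickles the tasks; any object that holds a thread lock, directly or via an attribute (a model, a session, a logger), cannot be pickled. A minimal stdlib reproduction of the same `TypeError`, independent of Keras (the `Holder` class is my illustration, not from the original code):

```python
import pickle
import threading

class Holder:
    """Stands in for any generator/Sequence that keeps a reference to a lock
    (or to an object, such as a TF session, that contains one)."""
    def __init__(self):
        self.lock = threading.Lock()

try:
    pickle.dumps(Holder())
except TypeError as exc:
    print(exc)  # on Python 3.6 the message mentions _thread.lock objects
```

Common workarounds at the time were to set `use_multiprocessing=False` (threads share memory, so nothing needs pickling) or to keep unpicklable state out of the object handed to `fit_generator`.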

System information:

Have I written custom code: Yes
OS Platform and Distribution: Linux GnomeUbuntu 16.04, but with new kernel
TensorFlow installed from: pip
TensorFlow version: 1.2.1
Python version: 3.6.1 (Miniconda3 4.3.11-64bit)
Bazel version (if compiling from source): I don't know.
CUDA/cuDNN version: Not used; my graphics card is an AMD Radeon
GPU model and memory: AMD Radeon R7 M260/M265
CPU model: Intel® Core™ i7-4510U CPU @ 2.00GHz × 4
RAM Memory: 16GiB (2x8Gib dual-channel)
Exact command to reproduce:

from time import time

history = CumulativeHistory()
callbacks = [history]
from keras import backend as K
if K.backend() == 'tensorflow':
  board = keras.callbacks.TensorBoard(log_dir=f"{self.prefix_folder_logs}{time()}",
                                      histogram_freq=1, write_graph=True, write_images=True)
  callbacks.append(board)
metric_to_compare = 'val_euclidean_distance'
num_worse_epochs = 0  # must be initialised before the loop
print("Begin of training model...")
for i in range(MAX_NUM_EPOCHS):
  model.fit_generator(self.train_inputs, steps_per_epoch=self.train_inputs.steps_per_epoch(),
                      validation_data=test_input_sequence, validation_steps=steps_test,
                      max_queue_size=self.train_inputs.workers, epochs=i+1, initial_epoch=i,
                      workers=self.train_inputs.workers, use_multiprocessing=True,
                      callbacks=callbacks)
  try:
    metrics_diff = history.history[metric_to_compare][i] - min(history.history[metric_to_compare][:i])
  except (KeyError, ValueError):  # metric missing, or empty slice on the first epoch
    metrics_diff = -1
  if metrics_diff < 0:
    self._save_models(i)
    self.data_processor = None  # Free memory
    best_epoch = i
    num_worse_epochs = 0
  elif metrics_diff > 0:
    num_worse_epochs += 1
    if num_worse_epochs >= PATIENCE:
      print("Ran out of patience. Stopping training.")
      break
print("End of training model.")
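The manual patience loop above can be reduced to a small helper (Keras ships `keras.callbacks.EarlyStopping`, which implements the same idea as a callback). A hedged sketch of the logic, assuming a lower metric value is better; the function name is mine:

```python
def epochs_without_improvement(metric_history):
    """Number of epochs elapsed since the best (lowest) value of the metric."""
    best_epoch = metric_history.index(min(metric_history))
    return len(metric_history) - 1 - best_epoch

# Stop once the metric has failed to improve for PATIENCE consecutive epochs.
PATIENCE = 2
val_history = [0.9, 0.7, 0.75, 0.8]
if epochs_without_improvement(val_history) >= PATIENCE:
    print("Ran out of patience. Stopping training.")
```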

Collected information:

(myenv) myuser@mymachine:~$ ./tf_env_collect.sh 
Collecting system information...
2017-07-28 21:05:00.140602: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-28 21:05:00.140632: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-28 21:05:00.140645: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-28 21:05:00.140650: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-28 21:05:00.140656: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Wrote environment to tf_env.txt. You can review the contents of that file.
and use it to populate the fields in the github issue template.

(myenv) myuser@mymachine:~$ cat tf_env.txt

== cat /etc/issue ===============================================
Linux mymachine 4.4.0-87-generic #110~14.04.1-Ubuntu SMP Tue Jul 18 14:51:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="14.04.5 LTS, Trusty Tahr"
VERSION_ID="14.04"

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux mymachine 4.4.0-87-generic #110~14.04.1-Ubuntu SMP Tue Jul 18 14:51:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy (1.13.1)
protobuf (3.3.0)
tensorflow (1.2.1)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.2.1
tf.GIT_VERSION = v1.2.0-5-g435cdfc
tf.COMPILER_VERSION = v1.2.0-5-g435cdfc
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /opt/programs/miniconda3/envs/myenv/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin::/opt/programs/acml/gfortran64/lib
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
./tf_env_collect.sh: line 105: nvidia-smi: command not found

== cuda libs  ===================================================

@stale

stale bot commented Dec 4, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the stale label Dec 4, 2017