Runtime dies when training keras on a copied notebook #156

Closed
charlesverdad opened this Issue May 11, 2018 · 3 comments


charlesverdad commented May 11, 2018

I am training a simple autoencoder using Keras. After making a copy of my notebook (via Google Drive right-click > Make a copy), running the copied version fails with this error:

Train on 60000 samples, validate on 10000 samples
Epoch 1/100
---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1326     try:
-> 1327       return fn(*args)
   1328     except errors.OpError as e:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1311       return self._call_tf_sessionrun(
-> 1312           options, feed_dict, fetch_list, target_list, run_metadata)
   1313 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1419             self._session, options, feed_dict, fetch_list, target_list,
-> 1420             status, run_metadata)
   1421 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    515             compat.as_text(c_api.TF_Message(self.status.status)),
--> 516             c_api.TF_GetCode(self.status.status))
    517     # Delete the underlying status object from memory otherwise it stays alive

InternalError: Blas GEMM launch failed : a.shape=(256, 784), b.shape=(784, 32), m=256, n=32, k=784
	 [[Node: dense_1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_input_1_0_2/_35, dense_1/kernel/read)]]

During handling of the above exception, another exception occurred:

InternalError                             Traceback (most recent call last)
<ipython-input-12-0ac457a370aa> in <module>()
      3                 batch_size=256,
      4                 shuffle=True,
----> 5                 validation_data=(x_test, x_test))

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1703                               initial_epoch=initial_epoch,
   1704                               steps_per_epoch=steps_per_epoch,
-> 1705                               validation_steps=validation_steps)
   1706 
   1707     def evaluate(self, x=None, y=None,

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in _fit_loop(self, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
   1234                         ins_batch[i] = ins_batch[i].toarray()
   1235 
-> 1236                     outs = f(ins_batch)
   1237                     if not isinstance(outs, list):
   1238                         outs = [outs]

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2480         session = get_session()
   2481         updated = session.run(fetches=fetches, feed_dict=feed_dict,
-> 2482                               **self.session_kwargs)
   2483         return updated[:len(self.outputs)]
   2484 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    903     try:
    904       result = self._run(None, fetches, feed_dict, options_ptr,
--> 905                          run_metadata_ptr)
    906       if run_metadata:
    907         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1138     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1139       results = self._do_run(handle, final_targets, final_fetches,
-> 1140                              feed_dict_tensor, options, run_metadata)
   1141     else:
   1142       results = []

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1319     if handle is None:
   1320       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1321                            run_metadata)
   1322     else:
   1323       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1338         except KeyError:
   1339           pass
-> 1340       raise type(e)(node_def, op, message)
   1341 
   1342   def _extend_graph(self):

InternalError: Blas GEMM launch failed : a.shape=(256, 784), b.shape=(784, 32), m=256, n=32, k=784
	 [[Node: dense_1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_input_1_0_2/_35, dense_1/kernel/read)]]

Caused by op 'dense_1/MatMul', defined at:
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-4fcc5490dd17>", line 5, in <module>
    activity_regularizer=regularizers.l1(10e-7))(input_img)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/layers/core.py", line 877, in call
    output = K.dot(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 1076, in dot
    out = tf.matmul(x, y)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 2108, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4209, in mat_mul
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(256, 784), b.shape=(784, 32), m=256, n=32, k=784
	 [[Node: dense_1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_input_1_0_2/_35, dense_1/kernel/read)]]
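
For reference, a minimal sketch of the kind of code that hits this. The batch size, the 784x32 shapes, and the l1(10e-7) regularizer are taken from the traceback above; the rest follows the standard Keras MNIST autoencoder example and is an assumption, not the exact notebook contents:

```python
# Minimal reproduction sketch (assumed, based on the traceback above).
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers
from keras.datasets import mnist

(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.reshape((len(x_train), 784)).astype('float32') / 255.
x_test = x_test.reshape((len(x_test), 784)).astype('float32') / 255.

input_img = Input(shape=(784,))
encoded = Dense(32, activation='relu',
                activity_regularizer=regularizers.l1(10e-7))(input_img)
decoded = Dense(784, activation='sigmoid')(encoded)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

# This fit() call is where the "Blas GEMM launch failed" error surfaces when
# another kernel is already holding the GPU.
autoencoder.fit(x_train, x_train,
                epochs=100,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))
```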
charlesverdad commented May 11, 2018

As a workaround, I had to kill the other process using !kill -9 <pid>, where <pid> is the process ID of the original Python kernel, taken from !ps aux.

I think the "Restart runtime" option in the original notebook might work as well.
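
For reference, the workaround amounts to two notebook cells like the ones below; the <pid> placeholder is left as-is and has to be filled in from the ps aux output:

```python
# Run these in notebook cells. <pid> is a placeholder, not a real value;
# use the process id of the stale python3 kernel reported by ps aux.
!ps aux | grep python   # locate the leftover kernel that is holding the GPU
!kill -9 <pid>          # replace <pid> with the process id from the line above
```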

charlesverdad commented May 19, 2018

Closing this because it's not really a bug in Colab. Just remember to shut down your kernels if you're working on a new notebook.

FypTeamFast2019TumorDetection commented Sep 5, 2018

I faced this problem too. Colab provides only limited memory (up to 12 GB) in the cloud, which creates many issues when solving a problem; that's why only 300 images were used for training and testing. When the images were preprocessed at 600x600 and the batch size was set to 128, the Keras model froze during epoch 1 and no error was shown. The actual cause was the limited runtime memory, which Colab could not handle with only 12 GB available.
The problem was solved by changing the batch size to 4 and reducing the image dimensions to 300x300, because with 600x600 it still did not work.
In conclusion, the recommended solution is to make the image dimensions and batch size smaller until you get no error: run again and again, further reducing the batch size and image dimensions, until there are no runtime errors.
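
A rough sketch of that mitigation in Keras (the directory path, image size, and batch size below are illustrative assumptions, not the actual project settings):

```python
# Load images at a reduced size and train with a small batch so the data and
# activations fit in Colab's ~12 GB of runtime memory.
from keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (300, 300)   # reduced from 600x600
BATCH_SIZE = 4          # reduced from 128

datagen = ImageDataGenerator(rescale=1. / 255)
train_gen = datagen.flow_from_directory(
    'data/train',            # hypothetical directory of training images
    target_size=IMG_SIZE,    # images are resized on the fly
    batch_size=BATCH_SIZE,
    class_mode='binary')

# model.fit_generator(train_gen, steps_per_epoch=len(train_gen), epochs=10)
```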
