Help! training not starting! #urgent #124

Open
danialvi opened this issue Jan 30, 2022 · 6 comments

Comments

@danialvi

[Attached screenshot: Capture]

My training is not starting. I am using Python 3.6 with tensorflow-gpu 1.8.0 and Keras 2.1.2, and the computer has a GeForce GTX 3060, so the GPU shouldn't be the problem. I also installed Norton antivirus on this new computer. My older computer, which has a weak GPU, had Panda Dome installed, and training did run there, but after more than an hour it had only reached 1%. That's why I bought a new computer with a good GPU and CPU. Some of this work will be presented in my master's thesis, so I would appreciate any help soon.

@danialvi
Author

I got this error now:


InternalError Traceback (most recent call last)
C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1321 try:
-> 1322 return fn(*args)
1323 except errors.OpError as e:

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1306 return self._call_tf_sessionrun(
-> 1307 options, feed_dict, fetch_list, target_list, run_metadata)
1308

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1408 self._session, options, feed_dict, fetch_list, target_list,
-> 1409 run_metadata)
1410 else:

InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

InternalError Traceback (most recent call last)
in
1 history = model.fit_generator(train_generator, steps_per_epoch=num_train_examples//batch_size, epochs=500, callbacks=callbacks,
----> 2 validation_data=eval_generator, validation_steps=num_eval_examples//batch_size, verbose=2)

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)
85 warnings.warn('Update your ' + object_name +
86 ' call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87 return func(*args, **kwargs)
88 wrapper._original_function = func
89 return wrapper

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\engine\training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
2145 outs = self.train_on_batch(x, y,
2146 sample_weight=sample_weight,
-> 2147 class_weight=class_weight)
2148
2149 if not isinstance(outs, list):

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight)
1837 ins = x + y + sample_weights
1838 self._make_train_function()
-> 1839 outputs = self.train_function(ins)
1840 if len(outputs) == 1:
1841 return outputs[0]

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
2355 session = get_session()
2356 updated = session.run(fetches=fetches, feed_dict=feed_dict,
-> 2357 **self.session_kwargs)
2358 return updated[:len(self.outputs)]
2359

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
898 try:
899 result = self._run(None, fetches, feed_dict, options_ptr,
--> 900 run_metadata_ptr)
901 if run_metadata:
902 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1133 if final_fetches or final_targets or (handle and feed_dict_tensor):
1134 results = self._do_run(handle, final_targets, final_fetches,
-> 1135 feed_dict_tensor, options, run_metadata)
1136 else:
1137 results = []

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1314 if handle is None:
1315 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316 run_metadata)
1317 else:
1318 return self._do_call(_prun_fn, handle, feeds, fetches)

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1333 except KeyError:
1334 pass
-> 1335 raise type(e)(node_def, op, message)
1336
1337 def _extend_graph(self):

InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'dense2/MatMul', defined at:
File "C:\ProgramData\anaconda3\envs\airsim\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\ProgramData\anaconda3\envs\airsim\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel_launcher.py", line 16, in
app.launch_new_instance()
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\traitlets\config\application.py", line 664, in launch_instance
app.start()
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\kernelapp.py", line 612, in start
self.io_loop.start()
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\platform\asyncio.py", line 199, in start
self.asyncio_loop.run_forever()
File "C:\ProgramData\anaconda3\envs\airsim\lib\asyncio\base_events.py", line 442, in run_forever
self._run_once()
File "C:\ProgramData\anaconda3\envs\airsim\lib\asyncio\base_events.py", line 1462, in _run_once
handle._run()
File "C:\ProgramData\anaconda3\envs\airsim\lib\asyncio\events.py", line 145, in _run
self._callback(*self._args)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\ioloop.py", line 688, in
lambda f: self._run_callback(functools.partial(callback, future))
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
ret = callback()
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 814, in inner
self.ctx_run(self.run)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 775, in run
yielded = self.gen.send(value)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\kernelbase.py", line 365, in process_one
yield gen.maybe_future(dispatch(*args))
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\kernelbase.py", line 268, in dispatch_shell
yield gen.maybe_future(handler(stream, idents, msg))
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\kernelbase.py", line 545, in execute_request
user_expressions, allow_stdin,
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 234, in wrapper
yielded = ctx_run(next, result)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tornado\gen.py", line 162, in _fake_ctx_run
return f(*args, **kw)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\ipkernel.py", line 306, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel\zmqshell.py", line 536, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 2867, in run_cell
raw_cell, store_history, silent, shell_futures)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 2895, in _run_cell
return runner(coro)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\async_helpers.py", line 68, in pseudo_sync_runner
coro.send(None)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 3072, in run_cell_async
interactivity=interactivity, compiler=compiler, result=result)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 3263, in run_ast_nodes
if (await self.run_code(code, result, async_=asy)):
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 24, in
merged = Dense(10, activation=activation, name='dense2')(merged)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\engine\topology.py", line 603, in __call__
output = self.call(inputs, **kwargs)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\layers\core.py", line 843, in call
output = K.dot(inputs, self.kernel)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\backend\tensorflow_backend.py", line 1057, in dot
out = tf.matmul(x, y)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2122, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4278, in mat_mul
name=name)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
op_def=op_def)
File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

@danialvi
Author

@mitchellspryn please help

@danialvi
Author

@adshar

@danialvi
Author

depencies.txt
Here is the list of dependencies I have in my anaconda env:

@mitchellspryn
Contributor

I am not at MSFT currently, so I am not actively supporting this repo any more.

That said, I took a look at your stack trace. It looks like CUDA isn't installed properly. Relevant portion:

InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]

I'd check whether you can run any Keras training operation at all, e.g. train a linear model on some random data points and see whether forward and backward propagation work. My guess is that they won't, and that will help you debug what is going on with your CUDA install.
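
The check suggested above could look roughly like the sketch below. It is only an illustration, not code from this repo: `run_sanity_check` and its parameters are hypothetical names, and the 64-in/10-out layer with a batch of 30 is chosen to mirror the shapes in the failing `dense2/MatMul`. The imports sit inside the function so that a broken install is reported instead of crashing the script.

```python
def run_sanity_check(num_samples=300, batch_size=30, epochs=2):
    """Train a single Dense layer on random data; return per-epoch losses.

    Returns None when NumPy or Keras cannot be imported.
    (Hypothetical helper, written for this thread only.)
    """
    try:
        import numpy as np
        from keras.layers import Dense
        from keras.models import Sequential
    except ImportError as exc:
        print("Import failed before any training could run:", exc)
        return None

    # Random inputs and targets; 64 -> 10 mirrors the failing dense2 MatMul.
    x = np.random.rand(num_samples, 64).astype("float32")
    y = np.random.rand(num_samples, 10).astype("float32")

    # A Dense layer with no activation is a plain linear model.
    model = Sequential([Dense(10, input_shape=(64,))])
    model.compile(optimizer="sgd", loss="mse")

    # If the GPU setup is broken, the error would be expected here,
    # when the first batch is pushed through the MatMul.
    history = model.fit(x, y, batch_size=batch_size, epochs=epochs, verbose=2)
    return history.history["loss"]


if __name__ == "__main__":
    print(run_sanity_check())
```

If this tiny model fails with the same `Blas GEMM launch failed` message, the problem is in the GPU/CUDA setup rather than in the notebook's model or data pipeline.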

@danialvi
Author

danialvi commented Mar 1, 2022

> I am not at MSFT currently, so I am not actively supporting this repo any more.
>
> That said, I took a look at your stack trace. It looks like CUDA isn't installed properly. Relevant portion:
>
> InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
> [[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
>
> I'd check to see if you can run any keras training operation - e.g. try training a linear model on some random data points and see if the forward/backpropagation works properly. My guess is no, and that'll help you debug what the situation is with your cuda install.

Thank you for answering. I reinstalled everything to check whether the problem is related to CUDA. I also tried installing cudatoolkit and cuDNN before installing TensorFlow, following these steps:
conda install cudatoolkit=9.0
conda install cudnn=7.1.4=cuda9.0_0
conda install -c anaconda tensorflow-gpu=1.8.0
conda install -c anaconda keras-gpu=2.1.2
python -m pip install --upgrade pip
conda update -n base conda
pip install msgpack-rpc-python
pip uninstall tornado
conda install -c conda-forge tornado=4.5.3
conda install jupyter
pip install matplotlib==2.1.2
pip install image
pip install keras_tqdm
conda install -c conda-forge opencv
conda install pandas
pip install --upgrade numpy==1.16.4
conda install scipy
pip install opencv-python
pip install --upgrade h5py==2.10.0
python -m ipykernel install --user

I still have the same problem. Do you have any idea how I can solve this? I have searched extensively, and many people seem to have hit the same problem, but none of the suggested solutions worked for me. Since this is part of my master's thesis, I also have limited time.
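
Before reinstalling again, it may help to confirm which versions the environment actually resolved and whether TensorFlow sees the GPU at all. The snippet below is a minimal sketch under those assumptions; `report_versions` and `list_gpu_devices` are hypothetical helper names, and the only TensorFlow call used is `device_lib.list_local_devices()`.

```python
import importlib


def report_versions(packages=("tensorflow", "keras", "numpy", "h5py")):
    """Map each package name to its __version__, or None if the import fails."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except Exception:  # broken installs can fail with more than ImportError
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


def list_gpu_devices():
    """Return the names of GPU devices TensorFlow can see (empty if none)."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(name, version)
    print("GPUs visible to TensorFlow:", list_gpu_devices())
```

One thing this output may help rule in or out: 30-series GeForce cards are Ampere GPUs, which as far as I know need CUDA 11.1 or newer, whereas tensorflow-gpu 1.8.0 is built against CUDA 9.0. If that mismatch is the cause, reinstalling the same versions would not be expected to change the result.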
