
CUDA error #355

Closed
StpMax opened this issue Jan 15, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@StpMax (Contributor) commented Jan 15, 2021

  • Python version: 3.6.9
  • Lightwood version: latest staging
  • Additional info if applicable: print(torch.__version__) => 1.7.0

I have an old GPU (GeForce GTX 660), so I assumed CUDA would not be used during predictor training, but in the log I see:

ERROR:mindsdb-logger-9c6604ca-5708-11eb-a3e2-2c56dc4ecd27---no_report:/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/mindsdb_native/libs/phases/model_interface/lightwood_backend.py:417 - Traceback (most recent call last):
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/mindsdb_native/libs/phases/model_interface/lightwood_backend.py", line 411, in train
    test_data=lightwood_test_ds
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/api/predictor.py", line 137, in learn
    self._mixer.fit(train_ds=train_ds, test_ds=test_ds)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/mixers/base_mixer.py", line 37, in fit
    self._fit(train_ds, test_ds, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/mixers/nn.py", line 270, in _fit
    for epoch, training_error in enumerate(self._iter_fit(subset_train_ds, subset_id=subset_id)):
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/mixers/nn.py", line 571, in _iter_fit
    outputs = self.net(inputs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/mixers/helpers/default_net.py", line 125, in forward
    output = self._foward_net(input)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: no kernel image is available for execution on the device


ERROR:mindsdb-logger-9c6604ca-5708-11eb-a3e2-2c56dc4ecd27---no_report:/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/mindsdb_native/libs/phases/model_interface/lightwood_backend.py:418 - Exception while running NnMixer

Training finishes fine and the predictor is queryable.

@paxcema (Member) commented Jan 15, 2021

It's possible that this is a version mismatch between CUDA and the installed PyTorch.

Can you check whether the major CUDA version reported by nvidia-smi matches the version PyTorch was built for (it should show up in the version number when doing pip show torch)?
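
For example, something like the following should show both sides of the comparison from Python (torch.version.cuda is the CUDA version the installed wheel was built against):

import torch

print(torch.__version__)          # PyTorch version, e.g. "1.7.0" or "1.7.0+cu110"
print(torch.version.cuda)         # CUDA version this build was compiled against, e.g. "10.2"
print(torch.cuda.is_available())  # whether PyTorch thinks a usable GPU is present

The CUDA version printed there should share the same major version as the "CUDA Version" field in the nvidia-smi header.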

@George3d6 (Contributor)

The main issue here is:

  • Why are we auto-detecting CUDA as available if using it then fails?

We should fix this by creating a single-valued tensor and calling .cuda() on it, instead of relying on torch's built-in availability check.

Though I thought we were already doing this...
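
For reference, something along these lines (a rough sketch of the check I mean, not the current Lightwood code; the helper name is made up):

import torch

def gpu_is_usable() -> bool:
    # torch.cuda.is_available() can return True for GPUs that the installed
    # wheel ships no kernels for, so actually run a tiny computation instead.
    if not torch.cuda.is_available():
        return False
    try:
        x = torch.ones(1).cuda()   # single-valued tensor moved to the GPU
        _ = (x * 2).cpu()          # force a kernel launch and copy back
        return True
    except RuntimeError:
        # e.g. "CUDA error: no kernel image is available for execution on the device"
        return False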

@StpMax were you specifying use_gpu=True or just letting mindsdb auto-detect when you got this error?
@paxcema any issues with the aforementioned approach to GPU detection?

@StpMax (Contributor, Author) commented Jan 18, 2021

@George3d6 auto-detect. I checked, and with use_gpu=False there is no error.
@paxcema torch does not show the CUDA version:
nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 660     Off  | 00000000:01:00.0 N/A |                  N/A |
| 30%   35C    P8    N/A /  N/A |   1210MiB /  1991MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

pip show torch

Name: torch
Version: 1.7.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages
Requires: typing-extensions, numpy, future, dataclasses

@paxcema (Member) commented Jan 18, 2021

According to this, PyTorch will report CUDA as available even if the GPU is no longer supported, as is the case for the GTX 660.

@George3d6 indeed, we already do that here, so I guess a solution is to set the minimum supported compute capability to 3.7, as stated in the PyTorch issue discussion. Thoughts?
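
A rough sketch of what that check could look like (hypothetical helper, not the final implementation; the GTX 660 reports compute capability 3.0, so it would be rejected here):

import torch

MIN_COMPUTE_CAPABILITY = (3, 7)  # minimum (major, minor) we would accept

def device_is_supported(device: int = 0) -> bool:
    # Reject GPUs whose compute capability is below the minimum, even though
    # torch.cuda.is_available() reports them as available.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability(device) >= MIN_COMPUTE_CAPABILITY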

@paxcema (Member) commented Jan 18, 2021

@StpMax can you please try the check_min_cuda_compute branch and check whether the issue is solved?

@StpMax (Contributor, Author) commented Jan 18, 2021

@paxcema same issue

@paxcema (Member) commented Jan 21, 2021

Fixed in #359, closing

paxcema closed this as completed Jan 21, 2021