
CUDA error #355

Closed
StpMax opened this issue Jan 15, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@StpMax (Contributor) commented Jan 15, 2021

  • Python version: 3.6.9
  • Lightwood version: latest staging
  • Additional info if applicable: print(torch.__version__) => 1.7.0

I have an old GPU (GeForce GTX 660), so I assumed CUDA would not be used during predictor training, but in the log I see:

ERROR:mindsdb-logger-9c6604ca-5708-11eb-a3e2-2c56dc4ecd27---no_report:/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/mindsdb_native/libs/phases/model_interface/lightwood_backend.py:417 - Traceback (most recent call last):
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/mindsdb_native/libs/phases/model_interface/lightwood_backend.py", line 411, in train
    test_data=lightwood_test_ds
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/api/predictor.py", line 137, in learn
    self._mixer.fit(train_ds=train_ds, test_ds=test_ds)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/mixers/base_mixer.py", line 37, in fit
    self._fit(train_ds, test_ds, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/mixers/nn.py", line 270, in _fit
    for epoch, training_error in enumerate(self._iter_fit(subset_train_ds, subset_id=subset_id)):
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/mixers/nn.py", line 571, in _iter_fit
    outputs = self.net(inputs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/lightwood/mixers/helpers/default_net.py", line 125, in forward
    output = self._foward_net(input)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: no kernel image is available for execution on the device


ERROR:mindsdb-logger-9c6604ca-5708-11eb-a3e2-2c56dc4ecd27---no_report:/home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages/mindsdb_native/libs/phases/model_interface/lightwood_backend.py:418 - Exception while running NnMixer

Training finishes fine and the predictor is queryable.

@paxcema (Member) commented Jan 15, 2021

It's possible that this is a version mismatch between CUDA and the installed PyTorch.

Can you check whether the major CUDA version reported by nvidia-smi matches the version PyTorch was built for (it should show up in the version number when doing pip show torch)?
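
For example, something like the following should show both sides of the comparison from Python (torch.version.cuda is the CUDA version the installed wheel was built against):

import torch

print(torch.__version__)          # PyTorch version, e.g. "1.7.0" or "1.7.0+cu110"
print(torch.version.cuda)         # CUDA version this build was compiled against, e.g. "10.2"
print(torch.cuda.is_available())  # whether PyTorch thinks a usable GPU is present

The CUDA version printed there should share the same major version as the "CUDA Version" field in the nvidia-smi header.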

@George3d6 (Contributor)

The main issue here is:

  • Why are we auto-detecting CUDA as available if using it then fails?

We should fix this by creating a single-valued tensor and calling .cuda() on it, instead of relying on torch's built-in availability check.

Though I thought we were already doing this...
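
For reference, something along these lines (a rough sketch of the check I mean, not the current Lightwood code; the helper name is made up):

import torch

def gpu_is_usable() -> bool:
    # torch.cuda.is_available() can return True for GPUs that the installed
    # wheel ships no kernels for, so actually run a tiny computation instead.
    if not torch.cuda.is_available():
        return False
    try:
        x = torch.ones(1).cuda()   # single-valued tensor moved to the GPU
        _ = (x * 2).cpu()          # force a kernel launch and copy back
        return True
    except RuntimeError:
        # e.g. "CUDA error: no kernel image is available for execution on the device"
        return False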

@StpMax were you specifying use_gpu=True or just letting mindsdb auto-detect when you got this error?
@paxcema any issues with the aforementioned approach to GPU detection?

@StpMax (Contributor, Author) commented Jan 18, 2021

@George3d6 auto-detect. I checked, and with use_gpu=False there is no error.
@paxcema torch does not show the CUDA version:
nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 660     Off  | 00000000:01:00.0 N/A |                  N/A |
| 30%   35C    P8    N/A /  N/A |   1210MiB /  1991MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

pip show torch

Name: torch
Version: 1.7.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/maxs/dev/mdb/venv_new/lib/python3.6/site-packages
Requires: typing-extensions, numpy, future, dataclasses

@paxcema (Member) commented Jan 18, 2021

According to this, PyTorch will report CUDA as available even if the GPU is no longer supported, as is the case for the GTX 660.

@George3d6 indeed, we already do that here, so I guess a solution is to set the minimum supported compute capability to 3.7, as stated in the PyTorch issue discussion. Thoughts?
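
A rough sketch of what that check could look like (hypothetical helper, not the final implementation; the GTX 660 reports compute capability 3.0, so it would be rejected here):

import torch

MIN_COMPUTE_CAPABILITY = (3, 7)  # minimum (major, minor) we would accept

def device_is_supported(device: int = 0) -> bool:
    # Reject GPUs whose compute capability is below the minimum, even though
    # torch.cuda.is_available() reports them as available.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability(device) >= MIN_COMPUTE_CAPABILITY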

@paxcema (Member) commented Jan 18, 2021

@StpMax can you please try the check_min_cuda_compute branch and check whether the issue is solved?

@StpMax (Contributor, Author) commented Jan 18, 2021

@paxcema same issue

@paxcema (Member) commented Jan 21, 2021

Fixed in #359, closing

paxcema closed this as completed Jan 21, 2021