
CUDA problems in causal linear product #58

Closed
xyltt opened this issue Dec 26, 2020 · 8 comments
Labels
bug Something isn't working

Comments

@xyltt

xyltt commented Dec 26, 2020

Hi,
My machine has 4 GPUs, but when I use GPU 1 (the default GPU being 0), I found that the CUDA code is still executed on GPU 0. Also, the code fails with an out-of-memory error when I try to use multiple GPUs at once.

@boredtylin

Same issue here. When the tensors are placed on a device other than cuda:0, the output is all zeros.

To reproduce the error:

import torch
from fast_transformers.causal_product import causal_dot_product

q = k = v = torch.randn(5, 10, 10, 10).to(0)
print(causal_dot_product(q, k, v))  # on cuda:0 this produces the correct result

q = k = v = torch.randn(5, 10, 10, 10).to(1)
print(causal_dot_product(q, k, v))  # on cuda:1 the output is all zeros
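For what it's worth, a naive NumPy reference for what causal_dot_product computes makes it easy to confirm the GPU result is genuinely wrong rather than just numerically different. This is a sketch assuming the (batch, heads, length, dim) layout used above; causal_dot_product_ref is a hypothetical helper, not part of the library:

```python
import numpy as np

def causal_dot_product_ref(q, k, v):
    # Naive O(L^2) reference: for each query position i, accumulate
    # sum over j <= i of (q_i . k_j) * v_j (un-normalized linear causal
    # attention).  q, k: (N, H, L, E); v: (N, H, L, M); returns (N, H, L, M).
    N, H, L, _ = q.shape
    out = np.zeros((N, H, L, v.shape[-1]))
    for i in range(L):
        # scores of query i against keys 0..i
        scores = np.einsum("nhe,nhje->nhj", q[:, :, i], k[:, :, : i + 1])
        out[:, :, i] = np.einsum("nhj,nhjm->nhm", scores, v[:, :, : i + 1])
    return out
```

Comparing causal_dot_product(q, k, v).cpu().numpy() against this on small random tensors should show whether the cuda:1 output really is all zeros.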

@angeloskath angeloskath added the bug Something isn't working label Jan 31, 2021
@katie-cathy-hunt

Hi @angeloskath!
When do you plan to fix the bug?

@angeloskath
Collaborator

@katie-cathy-hunt I will push a fix today. Sorry this took so long.

Cheers,
Angelos

@katie-cathy-hunt

@angeloskath
Thanks for the quick response and help!

@bbelgodere

@angeloskath I just rebuilt my environment to try your patch, but I'm running into a new issue:

>>> import torch
>>> from fast_transformers.causal_product import causal_dot_product
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/dccstor/bmbelgod1/projects/fast-transformers/fast_transformers/causal_product/__init__.py", line 9, in <module>
    from .causal_product_cpu import causal_dot_product as causal_dot_product_cpu, \
ModuleNotFoundError: No module named 'fast_transformers.causal_product.causal_product_cpu'

I can import fast_transformers, but if I try to import fast_transformers.causal_product I get the same error.

I verified I had pulled your fix

 sed -n 59,63p fast_transformers/aggregate/aggregate_cuda.cu
) {
    // Make sure that we are using the correct GPU device
    torch::DeviceGuard _guard(X.device());

    int N = X.size(0);

and it's in the environment

pip list | grep fast
pytorch-fast-transformers 0.3.0

No errors in the build/install log

@angeloskath
Collaborator

Hmm, that is weird. What did you do to rebuild? Could I bother you to do a rm -r build and then rebuild?

(Next step should be to provide prebuilt binaries for common setups to avoid all these issues)

@bbelgodere

I thought I may have induced the error myself. I am using a conda environment with CUDA installed via conda, which only installs the shared libraries, not nvcc. Looking through your setup.py, I see it doesn't produce an error or message if it doesn't find nvcc. I then loaded the module to add CUDA 11 (the same version PyTorch is compiled against) to my path.

I verified that call(["nvcc"], stdout=DEVNULL, stderr=DEVNULL) returned 1.

Then I removed the build and dist directories and ran python setup.py install.

Still no luck

python
Python 3.7.9 | packaged by conda-forge | (default, Dec  9 2020, 21:08:20)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from fast_transformers.causal_product import causal_dot_product
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/dccstor/bmbelgod1/projects/fast-transformers/fast_transformers/causal_product/__init__.py", line 9, in <module>
    from .causal_product_cpu import causal_dot_product as causal_dot_product_cpu, \
ModuleNotFoundError: No module named 'fast_transformers.causal_product.causal_product_cpu'

This is on RHEL 8.2, Python 3.7.9, Pytorch 1.7.1
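As an aside, the nvcc presence check described above can be sketched in plain Python. has_nvcc is a hypothetical helper, not code from the project's setup.py; a build script could emit a warning when it returns False instead of silently falling back to a CPU-only build:

```python
import shutil
import subprocess

def has_nvcc():
    # True if an nvcc binary is on PATH and can be executed.  nvcc exits
    # non-zero when run with no arguments, but the call completing at all
    # means the compiler is present and usable.
    path = shutil.which("nvcc")
    if path is None:
        return False
    try:
        subprocess.call([path], stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL)
        return True
    except OSError:
        return False
```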

@bbelgodere

@angeloskath I apologize, everything is working correctly. I started a Python REPL in the fast-transformers source directory after the install, so the interpreter was picking up the local fast_transformers subdirectory before the installed package. My mistake.
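For anyone who hits the same shadowing problem: importlib can show which copy of a package the interpreter would actually import. locate is a hypothetical helper, demonstrated here with a stdlib package, but the same call works for fast_transformers:

```python
import importlib.util

def locate(module_name):
    # A REPL started inside the source tree puts the current directory at
    # the front of sys.path, so a local fast_transformers/ subdirectory
    # shadows the installed package.  This shows which copy wins.
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec else None

print(locate("json"))  # path of the copy that would be imported
```

If the printed path points into your source checkout rather than site-packages, the compiled extension modules (like causal_product_cpu) will be missing, which matches the ModuleNotFoundError above.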
