module 'tutel_custom_kernel' has no attribute 'inject_source' #132

Closed
LisaWang0306 opened this issue Apr 4, 2022 · 10 comments

Comments

@LisaWang0306

LisaWang0306 commented Apr 4, 2022

My CUDA version is 11.4 and my Python version is 3.6.5.
Following the requirements, I installed torch==1.10.0+cu113 and torchvision==0.11.1+cu113.
Then I ran:
git clone https://github.com/microsoft/tutel --branch v0.1.x
python ./tutel/setup.py install --user
Next, I ran the tutorial:
python ./tutel/examples/helloworld.py --batch_size=16
but got the following error:

Traceback (most recent call last):
  File "./tutel/examples/helloworld.py", line 118, in <module>
    output = model(x)
  File "/home/fanj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "./tutel/examples/helloworld.py", line 85, in forward
    result = self._moe_layer(input)
  File "/home/fanj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fanj/tutel/tutel/impls/moe_layer.py", line 424, in forward
    result_output, l_aux = self.gates[gate_index].apply_on_expert_fn(reshaped_input, self)
  File "/home/fanj/tutel/tutel/impls/moe_layer.py", line 73, in apply_on_expert_fn
    critical_data, l_loss = extract_critical(gates, self.top_k, self.capacity_factor, self.fp32_gate, self.batch_prioritized_routing)
  File "/home/fanj/tutel/tutel/impls/fast_dispatch.py", line 163, in extract_critical
    locations1 = compute_location(masks_se[0])
  File "/home/fanj/tutel/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
    return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
  File "/home/fanj/tutel/tutel/jit_kernels/gating.py", line 68, in get_cumsum_kernel
    ''')
  File "/home/fanj/tutel/tutel/impls/jit_compiler.py", line 31, in generate_kernel
    return JitCompiler.create_raw(template)
  File "/home/fanj/tutel/tutel/impls/jit_compiler.py", line 21, in create_raw
    __ctx__ = tutel_custom_kernel.inject_source(source)
AttributeError: module 'tutel_custom_kernel' has no attribute 'inject_source'

Do you know how to solve this problem?
Thank you very much!
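
An AttributeError like this often means Python imported a different copy of the module than the one that was just built. A minimal sketch for checking that, assuming nothing about Tutel beyond the module and attribute names in the traceback (the helper name describe_module is made up for illustration):

```python
import importlib


def describe_module(name, attr):
    """Return where `name` was loaded from and whether it exposes `attr`."""
    module = importlib.import_module(name)
    return {
        # Path of the copy that Python actually imported.
        "file": getattr(module, "__file__", None),
        "has_attr": hasattr(module, attr),
    }


if __name__ == "__main__":
    try:
        print(describe_module("tutel_custom_kernel", "inject_source"))
    except ImportError as exc:
        print("tutel_custom_kernel could not be imported:", exc)
```

If the reported file lives in an unexpected directory, or has_attr is False, an older build is probably shadowing the new one.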

@ghostplant
Contributor

I'm not sure whether the python and python3 commands point to the same Python runtime in your environment.
Can you try installing tutel with python3 ./tutel/setup.py install --user instead of python ./tutel/setup.py install --user?
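
One way to see whether the two commands resolve to the same runtime is to ask each of them for its own sys.executable. This is only a sketch; the helper name resolve_interpreter is hypothetical:

```python
import shutil
import subprocess


def resolve_interpreter(command):
    """Return the interpreter path behind `command`, or None if it is not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-c", "import sys; print(sys.executable)"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for command in ("python", "python3"):
        print(command, "->", resolve_interpreter(command))
```

If the two printed paths differ, a package installed with one command will not necessarily be visible to the other.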

@LisaWang0306
Author

Sorry, I made a mistake. I still ran the tutorial with python ./tutel/examples/helloworld.py --batch_size=16.
In that case, do you have any idea why this error occurs?
Thanks very much for your reply!

@ghostplant
Contributor

First, can you run "python -m pip uninstall tutel" several times to make sure it is fully removed?

Then, can you run the installation command python ./tutel/setup.py install --user again and share its output log? Thanks!
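
Besides repeating the uninstall, it can help to look for leftover files directly on the import path. The sketch below simply lists entries whose names start with "tutel"; the helper name find_leftovers and the prefix match are illustrative assumptions, not part of Tutel or pip:

```python
import sys
from pathlib import Path


def find_leftovers(prefixes=("tutel",), search_dirs=None):
    """List entries in the import-path directories whose names start with a prefix."""
    if search_dirs is None:
        search_dirs = [p for p in sys.path if p]
    found = set()
    for directory in search_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip zip archives and paths that do not exist
        for entry in root.iterdir():
            if entry.name.startswith(tuple(prefixes)):
                found.add(entry)
    return sorted(found)


if __name__ == "__main__":
    for path in find_leftovers():
        print(path)
```

Anything still listed after the uninstall (for example an old .egg directory) is a candidate for the stale copy.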

@LisaWang0306
Author

Thanks for your suggestions!
I ran python -m pip uninstall tutel three times, and it now shows WARNING: Skipping tutel as it is not installed.
The complete output log of the installation command python ./tutel/setup.py install --user is as follows:

(en1) fanj@worker124:~$ python ./tutel/setup.py install --user
running install
running bdist_egg
running egg_info
writing tutel.egg-info/PKG-INFO
writing dependency_links to tutel.egg-info/dependency_links.txt
writing requirements to tutel.egg-info/requires.txt
writing top-level names to tutel.egg-info/top_level.txt
/home/fanj/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py:381: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'tutel.egg-info/SOURCES.txt'
writing manifest file 'tutel.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.6/tutel_custom_kernel.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.6/tutel/__init__.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/solver.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/patterns.py -> build/bdist.linux-x86_64/egg/tutel/parted
copying build/lib.linux-x86_64-3.6/tutel/parted/spmdx.py -> build/bdist.linux-x86_64/egg/tutel/parted
creating build/bdist.linux-x86_64/egg/tutel/parted/backend
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend
creating build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/__init__.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/executor.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
copying build/lib.linux-x86_64-3.6/tutel/parted/backend/torch/config.py -> build/bdist.linux-x86_64/egg/tutel/parted/backend/torch
creating build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/__init__.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/sparse.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/jit_kernels/gating.py -> build/bdist.linux-x86_64/egg/tutel/jit_kernels
copying build/lib.linux-x86_64-3.6/tutel/moe.py -> build/bdist.linux-x86_64/egg/tutel
copying build/lib.linux-x86_64-3.6/tutel/system_init.py -> build/bdist.linux-x86_64/egg/tutel
creating build/bdist.linux-x86_64/egg/tutel/custom
copying build/lib.linux-x86_64-3.6/tutel/custom/__init__.py -> build/bdist.linux-x86_64/egg/tutel/custom
creating build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/__init__.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/fast_dispatch.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/jit_compiler.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/communicate.py -> build/bdist.linux-x86_64/egg/tutel/impls
copying build/lib.linux-x86_64-3.6/tutel/impls/moe_layer.py -> build/bdist.linux-x86_64/egg/tutel/impls
creating build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/__init__.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/execl.py -> build/bdist.linux-x86_64/egg/tutel/launcher
copying build/lib.linux-x86_64-3.6/tutel/launcher/run.py -> build/bdist.linux-x86_64/egg/tutel/launcher
creating build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/__init__.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_deepspeed.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_megatron.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_ddp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_amp.py -> build/bdist.linux-x86_64/egg/tutel/examples
copying build/lib.linux-x86_64-3.6/tutel/examples/helloworld_sharded_experts.py -> build/bdist.linux-x86_64/egg/tutel/examples
byte-compiling build/bdist.linux-x86_64/egg/tutel/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/solver.py to solver.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/patterns.py to patterns.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/spmdx.py to spmdx.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/executor.py to executor.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/parted/backend/torch/config.py to config.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/sparse.py to sparse.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/jit_kernels/gating.py to gating.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/moe.py to moe.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/system_init.py to system_init.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/custom/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/fast_dispatch.py to fast_dispatch.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/jit_compiler.py to jit_compiler.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/communicate.py to communicate.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/impls/moe_layer.py to moe_layer.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/execl.py to execl.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/launcher/run.py to run.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_deepspeed.py to helloworld_deepspeed.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld.py to helloworld.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_megatron.py to helloworld_megatron.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_ddp.py to helloworld_ddp.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_amp.py to helloworld_amp.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/tutel/examples/helloworld_sharded_experts.py to helloworld_sharded_experts.cpython-36.pyc
creating stub loader for tutel_custom_kernel.cpython-36m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/tutel_custom_kernel.py to tutel_custom_kernel.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying tutel.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.tutel_custom_kernel.cpython-36: module references __file__
creating 'dist/tutel-0.1-py3.6-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing tutel-0.1-py3.6-linux-x86_64.egg
creating /home/fanj/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg
Extracting tutel-0.1-py3.6-linux-x86_64.egg to /home/fanj/.local/lib/python3.6/site-packages
Adding tutel 0.1 to easy-install.pth file

Installed /home/fanj/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg
Processing dependencies for tutel==0.1
Finished processing dependencies for tutel==0.1

Thanks!

@ghostplant
Contributor

ghostplant commented Apr 4, 2022

I see the old file "helloworld_sharded_experts.py" in the logs, which indicates that some of the code is not the latest, and I don't see the C++ code used by tutel_custom_kernel being built.
Most likely it is an environmental issue with pip or setuptools.
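
A quick way to check a saved install log for this symptom is to look for the line that announces an extension build. The sketch assumes the build prints "building '<name>' extension" when it actually compiles the extension, which is what distutils/setuptools normally log; the function name is hypothetical:

```python
def extension_was_compiled(log_text, extension="tutel_custom_kernel"):
    """Return True if the log contains the build announcement for `extension`."""
    # Assumption: the build step logs this exact phrase when it compiles the extension.
    marker = "building '{}' extension".format(extension)
    return any(marker in line for line in log_text.splitlines())
```

For the log shared above this would return False, since "running build_ext" is followed directly by the copy steps with no compile output in between.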

Can you try the following two options to check whether either of them works?

Option 1 - Do a clean install of Tutel from the repo:

# Get Rid of Environmental Issues
$ python -m pip install --upgrade pip setuptools
$ python -m pip uninstall tutel -y
$ python -m pip uninstall tutel_custom_kernel -y

# Clean Install from Repo
$ python -m pip install --user git+https://github.com/microsoft/tutel@v0.1.x

# Test
$ python -m tutel.examples.helloworld

Option 2 - Clean up the earlier build cache to avoid environmental problems:

# Get Rid of Environmental Issues
$ python -m pip install --upgrade pip setuptools
$ python -m pip uninstall tutel -y
$ python -m pip uninstall tutel_custom_kernel -y

# Clean Install from Local
$ rm -r ./tutel/dist ./tutel/build
$ python ./tutel/setup.py install --user

# Test
$ python -m tutel.examples.helloworld
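
The cleanup step in Option 2 (rm -r ./tutel/dist ./tutel/build) can also be done from Python, which avoids an error when one of the directories does not exist. A small sketch; the helper name remove_build_artifacts is made up for illustration:

```python
import shutil
from pathlib import Path


def remove_build_artifacts(repo_dir, names=("build", "dist")):
    """Delete the cached build directories under `repo_dir`; return the names removed."""
    removed = []
    for name in names:
        target = Path(repo_dir) / name
        if target.is_dir():
            shutil.rmtree(str(target))
            removed.append(name)
    return removed


# Example (assumes the clone lives in ./tutel):
# remove_build_artifacts("./tutel")
```

After this, rerunning the install command should rebuild everything from source rather than reusing cached outputs.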

@LisaWang0306
Author

Thanks very much for your help! I will try it.
I have another question. Could you please have a look?
I switched to another system and tried again. It seems that I don't have the header file nccl.h. During installation, it shows:

./tutel/custom/custom_kernel.cpp:20:10: fatal error: nccl.h: No such file or directory
   20 | #include <nccl.h>
      |          ^~~~~~~~
compilation terminated.
Try installing without NCCL extension..
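
To confirm whether nccl.h is really absent, one can search a few include directories for it. The directories below are only common guesses (they are assumptions, not paths taken from this thread), and find_header is a hypothetical helper:

```python
import os
from pathlib import Path


def find_header(name, include_dirs):
    """Return the full paths at which header `name` exists among `include_dirs`."""
    return [str(Path(d) / name) for d in include_dirs if (Path(d) / name).is_file()]


if __name__ == "__main__":
    # Assumed candidate locations; adjust for the local CUDA/NCCL installation.
    candidates = ["/usr/include", "/usr/local/include", "/usr/local/cuda/include"]
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home:
        candidates.append(os.path.join(cuda_home, "include"))
    hits = find_header("nccl.h", candidates)
    print(hits or "nccl.h not found in the searched directories")
```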

Will this error affect running Tutel afterwards?
I ask because when I run python -m tutel.examples.helloworld, the following error occurs:

Traceback (most recent call last):
  File "/home/wangp/anaconda3/envs/en/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wangp/anaconda3/envs/en/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/examples/helloworld.py", line 121, in <module>
    output = model(x)
  File "/home/wangp/anaconda3/envs/en/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/examples/helloworld.py", line 86, in forward
    result = self._moe_layer(input)
  File "/home/wangp/anaconda3/envs/en/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/moe_layer.py", line 481, in forward
    result_output, l_aux = self.gates[gate_index].apply_on_expert_fn(reshaped_input, self)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/moe_layer.py", line 111, in apply_on_expert_fn
    locations1 = self.compute_location(masks_se[0])
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/jit_kernels/gating.py", line 83, in fast_cumsum_sub_one
    return get_cumsum_kernel(int(data.size(0)), int(data.size(1)))(data)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/jit_kernels/gating.py", line 72, in optimized_cumsum
    base_kernel(mask1.to(torch.int32).contiguous(), locations1)
  File "/home/wangp/.local/lib/python3.6/site-packages/tutel-0.1-py3.6-linux-x86_64.egg/tutel/impls/jit_compiler.py", line 24, in func
    tutel_custom_kernel.invoke(inputs, extra, __ctx__)
RuntimeError: (0) == (cuModuleLoadDataEx(&hMod, image.c_str(), sizeof(options) / sizeof(*options), options, values))INTERNAL ASSERT FAILED at "./tutel/custom/custom_kernel.cpp":185, please report a bug to PyTorch. CHECK_EQ fails.

@ghostplant
Contributor

ghostplant commented Apr 6, 2022

@LisaWang0306 That missing-dependency error just causes the NCCL-related optimization to be skipped, so it shouldn't be related to the second error. Does either of these commands work?

  1. FAST_CUMSUM=0 python -m tutel.examples.helloworld
  2. USE_NVRTC=0 python -m tutel.examples.helloworld

This will help determine which option triggers your issue, since I cannot reproduce it in any of our environments. Thanks!
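
The two commands above can be tried in one go with a small script that sets one flag at a time and reports the exit code. The flag names come from the commands above; the helper name run_with_flag is hypothetical:

```python
import os
import subprocess
import sys


def run_with_flag(flag, value="0", command=None):
    """Run `command` with one extra environment variable set; return its exit code."""
    if command is None:
        command = [sys.executable, "-m", "tutel.examples.helloworld"]
    env = dict(os.environ)  # copy, so the current process is left unchanged
    env[flag] = value
    return subprocess.run(command, env=env).returncode


if __name__ == "__main__":
    for flag in ("FAST_CUMSUM", "USE_NVRTC"):
        code = run_with_flag(flag)
        print("{}=0 -> exit code {}".format(flag, code))
```

An exit code of 0 for exactly one of the flags points at the option that triggers the failure.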

@LisaWang0306
Author

Finally! USE_NVRTC=0 python -m tutel.examples.helloworld works! Thanks!!!!!!!

@ghostplant
Contributor

ghostplant commented Apr 6, 2022

Thanks for the information.
This is probably a bug in NVRTC. We may consider setting USE_NVRTC=0 by default, since applications cannot guarantee that CUDA's NVRTC is stable.
Are there any other issues?
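
Until such a default exists, a script can apply the workaround itself by setting the variable before Tutel is imported. This is a sketch that assumes Tutel reads USE_NVRTC from the environment no earlier than import time, which this thread does not confirm; default_env is a hypothetical helper:

```python
import os


def default_env(name, value):
    """Set `name` only if it is not already set; return the effective value."""
    return os.environ.setdefault(name, value)


# Run this before importing tutel (assumption: the flag is read at or after import).
default_env("USE_NVRTC", "0")
```

Using setdefault keeps any value the user has already exported, so USE_NVRTC=1 on the command line still takes effect.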

@LisaWang0306
Author

There are no more issues left. Thanks!
