fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o: No such file or directory #132

Closed
Irenehere opened this issue Sep 8, 2022 · 9 comments

Comments

@Irenehere

I tried to install FastMoE in an environment with CUDA 11.1, PyTorch 1.8, and NCCL 2.8.3 (the recommended environment for megatron-2.2). However, I get the following error when running the setup script:

root@9fdbdafc67e5:~/data/fastmoe-master# python setup.py install
running install
running bdist_egg
running egg_info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'fmoe_cuda' extension
Emitting ninja build file /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
1.10.1
g++ -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/stream_manager.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/local_exchange.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/parallel_linear.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fmoe_cuda.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fastermoe/smart_schedule.o -L/opt/conda/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lnccl -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-3.8/fmoe_cuda.cpython-38-x86_64-linux-gnu.so
g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o: No such file or directory
error: command 'g++' failed with exit status 1

There is no global_exchange.o in the fastmoe-master/build/temp.linux-x86_64-3.8/cuda/ directory. Do you know how to fix this?
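
To see at a glance which object files the linker was asked for but could not find, the `g++` line from the log can be checked against the filesystem. The following is a minimal sketch, not part of FastMoE; the helper name `missing_objects` is hypothetical, and it assumes the link command is pasted in as a single string.

```python
import os
import shlex
from typing import List


def missing_objects(link_command: str) -> List[str]:
    """Return the .o paths named in a linker command that are absent on disk.

    `link_command` is assumed to be the full g++ line copied from the build log.
    """
    tokens = shlex.split(link_command)
    # Keep only object-file arguments, then filter to those that do not exist.
    return [t for t in tokens if t.endswith(".o") and not os.path.exists(t)]


if __name__ == "__main__":
    # Example with made-up paths; replace with the g++ line from your own log.
    cmd = "g++ -shared /tmp/does_not_exist/global_exchange.o -o fmoe_cuda.so"
    for path in missing_objects(cmd):
        print("missing:", path)
```

If every object file is reported missing, as in the second log in this thread, the compile step most likely failed before linking started, so the real error should be earlier in the output.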

@laekov
Owner

laekov commented Sep 8, 2022

The error seems weird: the linker is being run even though the object file was never compiled. Maybe you should remove the build directory and then try compiling again.
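
The cleanup suggested above can be sketched as a small helper. This is an illustration only: the function name `clean_build_artifacts` is hypothetical, and removing the `*.egg-info` directory in addition to `build` is an assumption on top of the maintainer's advice, which only mentions the build directory.

```python
import glob
import os
import shutil
from typing import List


def clean_build_artifacts(project_root: str) -> List[str]:
    """Delete stale build outputs under `project_root` and return what was removed."""
    # `build` is the directory named in the advice above; `*.egg-info` is an
    # extra assumption, since setuptools regenerates it on the next install.
    candidates = [os.path.join(project_root, "build")]
    candidates += glob.glob(os.path.join(project_root, "*.egg-info"))

    removed = []
    for path in candidates:
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed


if __name__ == "__main__":
    # Run from the root of the FastMoE checkout, then rerun `python setup.py install`.
    for path in clean_build_artifacts("."):
        print("removed:", path)
```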

@Irenehere
Author

Now the error becomes:

root@9fdbdafc67e5:~/data/fastmoe-master# python setup.py install
running install
running bdist_egg
running egg_info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/fmoe
copying fmoe/functions.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/transformer.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/linear.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/balance.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/distributed.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/utils.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/layers.py -> build/lib.linux-x86_64-3.8/fmoe
copying fmoe/__init__.py -> build/lib.linux-x86_64-3.8/fmoe
creating build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/patch.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/balance.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/distributed.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/utils.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/layers.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/checkpoint.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
copying fmoe/megatron/__init__.py -> build/lib.linux-x86_64-3.8/fmoe/megatron
creating build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/naive_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/zero_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/faster_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/base_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/gshard_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/utils.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/noisy_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/swipe_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/switch_gate.py -> build/lib.linux-x86_64-3.8/fmoe/gates
copying fmoe/gates/__init__.py -> build/lib.linux-x86_64-3.8/fmoe/gates
creating build/lib.linux-x86_64-3.8/fmoe/fastermoe
copying fmoe/fastermoe/config.py -> build/lib.linux-x86_64-3.8/fmoe/fastermoe
copying fmoe/fastermoe/expert_utils.py -> build/lib.linux-x86_64-3.8/fmoe/fastermoe
copying fmoe/fastermoe/schedule.py -> build/lib.linux-x86_64-3.8/fmoe/fastermoe
copying fmoe/fastermoe/__init__.py -> build/lib.linux-x86_64-3.8/fmoe/fastermoe
copying fmoe/fastermoe/shadow_policy.py -> build/lib.linux-x86_64-3.8/fmoe/fastermoe
running build_ext
building 'fmoe_cuda' extension
creating /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8
creating /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda
creating /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fastermoe
Emitting ninja build file /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
1.10.1
g++ -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/stream_manager.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/local_exchange.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/parallel_linear.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fmoe_cuda.o /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fastermoe/smart_schedule.o -L/opt/conda/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lnccl -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-3.8/fmoe_cuda.cpython-38-x86_64-linux-gnu.so
g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/stream_manager.o: No such file or directory
g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/local_exchange.o: No such file or directory
g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o: No such file or directory
g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o: No such file or directory
g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/parallel_linear.o: No such file or directory
g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fmoe_cuda.o: No such file or directory
g++: error: /home/jovyan/data/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/fastermoe/smart_schedule.o: No such file or directory
error: command 'g++' failed with exit status 1

@laekov
Owner

laekov commented Sep 8, 2022

The situation is still weird. Ninja should manage the dependencies, but that does not seem to be working for you. Can you try updating or reinstalling ninja?
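
Before reinstalling, it can help to confirm which ninja binary is on `PATH` and what version it reports. The sketch below relies only on the standard `ninja --version` flag; the helper name `ninja_version` is hypothetical. The `1.10.1` line in the logs above looks like such a version string, though the thread does not confirm that.

```python
import shutil
import subprocess
from typing import Optional


def ninja_version() -> Optional[str]:
    """Return the version reported by the ninja on PATH, or None if unavailable."""
    exe = shutil.which("ninja")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = ninja_version()
    if version is None:
        print("ninja was not found on PATH or did not run")
    else:
        print("ninja", version, "at", shutil.which("ninja"))
```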

@Irenehere
Author

It seems there is some problem with ninja. Is it a version problem? Can FastMoE be installed and run under CUDA 11.1?

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1549, in _run_ninja_build
subprocess.run(
File "/opt/conda/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "setup.py", line 39, in <module>
setuptools.setup(
File "/opt/conda/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/opt/conda/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/opt/conda/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install.py", line 67, in run
self.do_egg_install()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install.py", line 109, in do_egg_install
self.run_command('bdist_egg')
File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 167, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 153, in call_command
self.run_command(cmdname)
File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install_lib.py", line 11, in run
self.build()
File "/opt/conda/lib/python3.8/distutils/command/install_lib.py", line 107, in build
self.run_command('build_ext')
File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 683, in build_extensions
build_ext.build_extensions(self)
File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 194, in build_extensions
self.build_extension(ext)
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
_build_ext.build_extension(self, ext)
File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
objects = self.compiler.compile(sources,
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 503, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1261, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1565, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

@laekov
Owner

laekov commented Sep 8, 2022

I have never seen an issue like this before, so I am not sure whether a specific ninja version is required, as FastMoE directly uses PyTorch's extension building module. CUDA 11.x should be fine with FastMoE.
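
The traceback above only says that `ninja -v` exited with a non-zero status; the compiler message that caused it is normally printed earlier in the build output, and ninja marks each failed step with a line starting with `FAILED:`. A minimal sketch for pulling those lines out of a saved log follows. The helper name `first_failures` is hypothetical, and the plain substring match on `error:` is a simplifying assumption that may miss or over-match some compiler output.

```python
from typing import List, Tuple


def first_failures(log_text: str, limit: int = 5) -> List[Tuple[int, str]]:
    """Return up to `limit` (line_number, line) pairs that look like build failures."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        stripped = line.strip()
        # "FAILED:" is ninja's marker for a failed step; lowercase "error:" is the
        # usual compiler and linker prefix (Python exception names such as
        # "RuntimeError:" use a capital E and are deliberately not matched).
        if stripped.startswith("FAILED:") or "error:" in stripped:
            hits.append((lineno, stripped))
            if len(hits) >= limit:
                break
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python setup.py install 2>&1 | python this_script.py
    for lineno, line in first_failures(sys.stdin.read()):
        print(lineno, line)
```

The earliest hit is usually the one worth reading; later errors, such as the missing `.o` files at link time, tend to be consequences of it.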

@Irenehere
Author

I successfully installed the megatron-patch branch.

@laekov laekov closed this as completed Dec 28, 2022
@lainanhui

May I ask how this problem was solved?

@lainanhui

I also ran into "No such file or directory" during installation.

@lainanhui

It seems there is some problem with ninja. Is it a version problem? Can fastmoe be installed and run under cuda 11.1?

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1549, in _run_ninja_build
subprocess.run(
File "/opt/conda/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "setup.py", line 39, in <module>
setuptools.setup(
File "/opt/conda/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/opt/conda/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/opt/conda/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install.py", line 67, in run
self.do_egg_install()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install.py", line 109, in do_egg_install
self.run_command('bdist_egg')
File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 167, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 153, in call_command
self.run_command(cmdname)
File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/install_lib.py", line 11, in run
self.build()
File "/opt/conda/lib/python3.8/distutils/command/install_lib.py", line 107, in build
self.run_command('build_ext')
File "/opt/conda/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 683, in build_extensions
build_ext.build_extensions(self)
File "/opt/conda/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 194, in build_extensions
self.build_extension(ext)
File "/opt/conda/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
_build_ext.build_extension(self, ext)
File "/opt/conda/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
objects = self.compiler.compile(sources,
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 503, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1261, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1565, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

Could you tell me how you solved it?
