Problem with Deepspeed integration #24438

Closed
2 of 4 tasks
karths8 opened this issue Jun 23, 2023 · 6 comments

karths8 commented Jun 23, 2023

System Info

  • transformers version: 4.29.2
  • Platform: Linux-5.4.0-137-generic-x86_64-with-glibc2.31
  • Python version: 3.11.3
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using the WizardCoder training script, with the DeepSpeed integration, to further fine-tune the model on some examples of my own. I followed their fine-tuning instructions given here and I am getting the following error:

(datachat_env) root@C.6442427:~/Custom-LLM$ sh train.sh
[2023-06-23 00:36:25,039] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-23 00:36:25,077] [INFO] [runner.py:541:main] cmd = /root/anaconda3/envs/datachat_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py --model_name_or_path /root/Custom-LLM/WizardCoder-15B-V1.0 --data_path /root/Custom-LLM/data.json --output_dir /root/Custom-LLM/WC-Checkpoint --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 50 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 30 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed /root/Custom-LLM/Llama-X/src/configs/deepspeed_config.json --fp16 True
[2023-06-23 00:36:26,992] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-23 00:36:26,993] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-23 00:36:26,993] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-23 00:36:26,993] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-23 00:36:26,993] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-06-23 00:36:29,650] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-23 00:36:55,124] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 15.82B parameters
[2023-06-23 00:37:12,845] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,968] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,969] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,970] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
FAILED: cpu_adam.o 
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:19,
                 from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
   12 | #include <curand_kernel.h>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
[2/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
FAILED: custom_cuda_kernel.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
   12 | #include <curand_kernel.h>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
    train()
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
    trainer.train()
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
                                                ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Loading extension module cpu_adam...
Traceback (most recent call last):
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
    train()
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
    trainer.train()
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
                                                ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 573, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1233, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fcaec4a89a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf4e6409a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f9ce61b09a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f6c2bf109a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
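
Because several ranks write to the same terminal, the first compiler error is easy to miss among the later Python tracebacks. The following is a minimal sketch, not part of the original report, that pulls the unique `fatal error:` diagnostics out of a captured log. The helper name `find_fatal_errors` and the regular expression are illustrative assumptions; the sample text is copied from the output above.

```python
import re

# Matches compiler diagnostics shaped like "<file>:<line>:<col>: fatal error: <message>".
FATAL_RE = re.compile(
    r"^(?P<file>\S+):(?P<line>\d+):(?P<col>\d+): fatal error: (?P<msg>.+)$"
)


def find_fatal_errors(log_text):
    """Return unique (file, line, message) tuples in order of first appearance."""
    seen = []
    for raw in log_text.splitlines():
        match = FATAL_RE.match(raw.strip())
        if match:
            item = (match.group("file"), int(match.group("line")), match.group("msg"))
            if item not in seen:
                seen.append(item)
    return seen


# Sample lines taken from the log in this issue (both compile steps report the same header).
sample = """\
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
compilation terminated.
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
"""

for path, line_no, message in find_fatal_errors(sample):
    print(f"{path}:{line_no}: {message}")
```

On this log the only distinct diagnostic is the missing `curand_kernel.h`; the later `ImportError` for `cpu_adam.so` follows from the build never producing that file.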

Expected behavior

I expect the model to use the DeepSpeed config file and run training.

Collaborator

sgugger commented Jun 23, 2023

cc @pacman100

Contributor

pacman100 commented Jun 23, 2023

Hello, this isn't an issue with DeepSpeed integration. The issue is this:

ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
...

RuntimeError: Error building extension 'cpu_adam'
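
One way to check that the failure sits in the extension build rather than in the Trainer is to run the same call that appears in the traceback, `CPUAdamBuilder().load()`, on its own. The sketch below is an illustration only: `try_load_op` is a hypothetical helper, and it takes the builder class as an argument so that it can be exercised without DeepSpeed installed.

```python
def try_load_op(builder_factory):
    """Call builder_factory().load() and report (ok, detail) instead of raising."""
    try:
        builder_factory().load()
    except Exception as exc:  # build failures and import failures both land here
        return False, f"{type(exc).__name__}: {exc}"
    return True, "loaded"


# Usage in the training environment (import path as seen in the traceback above):
#   from deepspeed.ops.op_builder import CPUAdamBuilder
#   print(try_load_op(CPUAdamBuilder))
```

If this reports the same `Error building extension 'cpu_adam'` outside of any training script, the problem is in the local build toolchain and not in the transformers integration.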

Collaborator

ydshieh commented Jun 27, 2023

Hi, @karths8

You can try rm -rf ~/.cache/torch_extensions/ first.

Related discussion: #14520
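
If wiping the whole cache is too broad, a narrower variant removes only the stale build directory of a single extension. This is a rough sketch under assumptions: `clear_extension` and `extension_cache_root` are hypothetical helpers, the default root is the `~/.cache/torch_extensions` path shown in the log, and the `TORCH_EXTENSIONS_DIR` override is assumed to be honoured by PyTorch's JIT extension loader.

```python
import os
import shutil
from pathlib import Path


def extension_cache_root():
    """Cache root: TORCH_EXTENSIONS_DIR if set (assumed override), else ~/.cache/torch_extensions."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    return Path(override) if override else Path.home() / ".cache" / "torch_extensions"


def clear_extension(name, root=None):
    """Delete cached build directories called `name` under `root`; return the removed paths."""
    root = Path(root) if root is not None else extension_cache_root()
    removed = []
    if not root.is_dir():
        return removed
    # Builds sit either directly under the root or one level down (e.g. py311_cu118/cpu_adam).
    for candidate in [root / name, *root.glob(f"*/{name}")]:
        if candidate.is_dir():
            shutil.rmtree(candidate)
            removed.append(candidate)
    return removed


# Usage: print(clear_extension("cpu_adam"))
```

Clearing the cache only helps when a previous build left a broken or partial directory behind; it does not fix a compile error that recurs on every rebuild.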

Author

karths8 commented Jun 29, 2023

rm -rf ~/.cache/torch_extensions/

This does not seem to work for me. The root of the problem lies in fatal error: curand_kernel.h: No such file or directory. If there are any insights on how to solve this issue please let me know. Any help is greatly appreciated!
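
A quick way to confirm that the header is really absent from the CUDA toolkit the compiler is pointed at is to look for it directly. The sketch below is illustrative: `find_header` and `default_cuda_roots` are hypothetical helpers, `/usr/local/cuda` is the include root seen in the compile command above, and the `CUDA_HOME`/`CUDA_PATH` variables and the `targets/*/include` layout are assumptions about how the toolkit is installed.

```python
import os
from pathlib import Path


def find_header(header, roots):
    """Return every file named `header` under <root>/include or <root>/targets/*/include."""
    hits = []
    for root in map(Path, roots):
        for include_dir in [root / "include", *root.glob("targets/*/include")]:
            candidate = include_dir / header
            if candidate.is_file():
                hits.append(candidate)
    return hits


def default_cuda_roots():
    """Candidate toolkit roots, de-duplicated, in priority order (assumed locations)."""
    roots = [os.environ.get("CUDA_HOME"), os.environ.get("CUDA_PATH"), "/usr/local/cuda"]
    return list(dict.fromkeys(r for r in roots if r))


# Usage: print(find_header("curand_kernel.h", default_cuda_roots()))
```

An empty result would suggest that the installed toolkit lacks the cuRAND development headers, which matches the compiler's complaint.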

orangetin commented Jul 2, 2023

As pacman100 said, this isn't an integration issue. See this: microsoft/DeepSpeed#1846
It looks like an issue with the DeepSpeed pip package; I recommend installing it via conda.

Author

karths8 commented Jul 2, 2023

As pacman100 said, this isn't an integration issue. See this: microsoft/DeepSpeed#1846 It looks like an issue with the DeepSpeed pip package; I recommend installing it via conda.

Thanks! I fixed it using this.

karths8 closed this as completed Jul 2, 2023