[BUG]Error after changing the model from opt to gpt #3373

lljjgg · 2023-04-25T07:33:55Z

I am training the PPO model with GPT-2. I changed the model_name_or_path option from opt to gpt2. Step 1 and step 2 completed, but step 3 fails with the following error:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_fin │
│ etuning/main.py:522 in <module> │
│ │
│ 519 │
│ 520 │
│ 521 if __name__ == "__main__": │
│ ❱ 522 │ main() │
│ 523 │
│ │
│ /data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_fin │
│ etuning/main.py:431 in main │
│ │
│ 428 │ │ │ │ prompts = prompts[:, length - args.max_prompt_seq_len:] │
│ 429 │ │ │ │ raise ValueError("Prompt length is too long") │
│ 430 │ │ │ │
│ ❱ 431 │ │ │ out = trainer.generate_experience(prompts) │
│ 432 │ │ │ exp_dataset = exp_mini_dataset.add(out) │
│ 433 │ │ │ │
│ 434 │ │ │ if exp_dataset is not None: │
│ │
│ /data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_fin │
│ etuning/ppo_trainer.py:97 in generate_experience │
│ │
│ 94 │ │
│ 95 │ def generate_experience(self, prompts): │
│ 96 │ │ self.eval() │
│ ❱ 97 │ │ seq = self._generate_sequence(prompts) │
│ 98 │ │ self.train() │
│ 99 │ │ │
│ 100 │ │ pad_token_id = self.tokenizer.pad_token_id │
│ │
│ /data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_fin │
│ etuning/ppo_trainer.py:91 in _generate_sequence │
│ │
│ 88 │ │ │ │ continue │
│ 89 │ │ │ else: │
│ 90 │ │ │ │ out_seq.append(seq[i:i + 1]) │
│ ❱ 91 │ │ out_seq = torch.cat(out_seq, dim=0) # concate output in the batch dim │
│ 92 │ │ │
│ 93 │ │ return out_seq │
│ 94 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: torch.cat(): expected a non-empty list of Tensors
torch.Size([4, 50264])
torch.Size([4, 50264])
!!!! kernel execution error. (m: 2048, n: 4, k: 2048, error: 14)
!!!! kernel execution error. (m: 8192, n: 4, k: 2048, error: 13)
!!!! kernel execution error. (m: 2048, n: 4, k: 2048, error: 13)

Do you know what causes this? Could you provide the training steps for GPT-2? Looking forward to your reply.
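For readers hitting the same RuntimeError: the traceback above shows a loop that skips some sequences (continue) and appends the rest to out_seq before calling torch.cat. If every sequence in the batch is skipped, out_seq is empty and torch.cat raises exactly this error. The sketch below models that situation with plain Python lists so it runs without a GPU. The skip condition (an answer with at most one non-pad token), the pad id, and the helper names are assumptions for illustration; the actual condition is not visible in the traceback.

```python
from typing import List, Optional

PAD_ID = 0  # hypothetical pad token id, for illustration only


def keep_non_empty_answers(seqs: List[List[int]], prompt_len: int,
                           pad_id: int = PAD_ID) -> List[List[int]]:
    """Keep sequences whose generated part has more than one non-pad token."""
    kept = []
    for seq in seqs:
        answer = seq[prompt_len:]
        valid_len = sum(1 for tok in answer if tok != pad_id)
        if valid_len <= 1:  # assumed skip condition, not shown in the traceback
            continue
        kept.append(seq)
    return kept


def build_batch(seqs: List[List[int]], prompt_len: int,
                pad_id: int = PAD_ID) -> Optional[List[List[int]]]:
    """Return the filtered batch, or None when nothing survives the filter.

    With tensors, the unguarded equivalent is torch.cat([]), which raises
    "expected a non-empty list of Tensors".
    """
    kept = keep_non_empty_answers(seqs, prompt_len, pad_id)
    if not kept:
        return None  # the caller can skip this batch instead of crashing
    return kept


# Every answer is padding only, so the filter drops the whole batch.
all_pad = [[5, 6, 0, 0], [7, 8, 0, 0]]
# The second sequence has a real two-token answer and survives.
mixed = [[5, 6, 0, 0], [7, 8, 9, 3]]

print(build_batch(all_pad, prompt_len=2))
print(build_batch(mixed, prompt_len=2))
```

A guard like this only avoids the crash. If the whole batch is dropped on every step, the more useful question is why generation produces no valid answer tokens after the model switch.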

JINGTING92 · 2023-04-25T10:09:17Z

Same problem. Any solutions?

frame #56: __libc_start_main + 0xe7 (0x7f5d250ddc87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #57: _start + 0x2a (0x55a1f2d1f4ca in /usr/bin/python)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f09f2bd94d7 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f09f2ba336b in /usr/local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f09f2c7dfa8 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
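The log above suggests passing CUDA_LAUNCH_BLOCKING=1 so that the stack trace points at the kernel that actually failed. A minimal sketch of doing this from Python follows. It assumes the variable should be set before CUDA is initialized, so it is set before torch is imported; the helper name is hypothetical. TORCH_USE_CUDA_DSA is described in the log as a compile-time option, so it is not set here.

```python
import os


def enable_cuda_debug_env() -> dict:
    """Set the debug variable named in the log unless the shell already set it."""
    os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")
    return {"CUDA_LAUNCH_BLOCKING": os.environ["CUDA_LAUNCH_BLOCKING"]}


settings = enable_cuda_debug_env()
print(settings)

# Import torch only after this point so the setting is in place
# before any CUDA work starts:
# import torch
```

Exporting the variable in the shell before launching the training script has the same effect and also covers worker processes started by a launcher.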

rSemology · 2023-05-10T14:05:40Z

I have the same issue.

Edit:
My issue was a bad GPU.

molly-smith · 2023-05-12T18:39:26Z

@JINGTING92 @lljjgg @rSemology can you please provide more information about your setups?

ds_report output
Please run ds_report to give us details about your setup.

System info (please complete the following information):

OS: [e.g. Ubuntu 18.04]
GPU count and types [e.g. two machines with x8 A100s each]
(if applicable) Hugging Face Transformers/Accelerate/etc. versions
Python version
Any other relevant info about your setup
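For the version fields requested above, a small standard-library sketch can collect most of the information in one go. It does not replace ds_report and does not report GPU count or type; the package list and the function name are assumptions chosen for illustration.

```python
import platform
import sys
from importlib import metadata

# Packages to look up; adjust to whatever the setup actually uses.
PACKAGES = ("torch", "deepspeed", "transformers", "accelerate")


def collect_setup_info(packages=PACKAGES) -> dict:
    """Return OS, Python version, and installed versions of the given packages."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


for key, value in collect_setup_info().items():
    print(f"{key}: {value}")
```

The printed lines can be pasted into the issue next to the ds_report output.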

LeeChongKeat · 2023-05-26T08:26:43Z

@JINGTING92 @lljjgg @rSemology can you please provide more information about your setups?

ds_report output
Please run ds_report to give us details about your setup.

System info (please complete the following information):

OS: [e.g. Ubuntu 18.04]
GPU count and types [e.g. two machines with x8 A100s each]
(if applicable) Hugging Face Transformers/Accelerate/etc. versions
Python version
Any other relevant info about your setup

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/chongkeat/anaconda3/envs/deep/lib/python3.11/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/home/chongkeat/anaconda3/envs/deep/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

ensors
return torch._C._nn.flatten_dense_tensors(tensors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

rSemology · 2023-05-27T12:17:47Z

@JINGTING92 @lljjgg @rSemology can you please provide more information about your setups?

After further diagnosing my issue, it turns out I had a faulty GPU. Thanks.

johnchienbronci · 2023-07-04T08:22:31Z

@molly-smith Do you know what causes this?

I encountered some errors when running run_speech_recognition_ctc_streaming.sh with DeepSpeed ( torchrun --nproc_per_node 1 ... ). This issue occurs consistently with my custom corpora. (I can fine-tune successfully using the Common Voice corpus.)

environment:

gpu number: 1 (v100)
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1
transformer: 4.31.0

DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.9.5, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

terminate called after throwing an instance of 'c10::Error'                                                                                                                           
  what():  CUDA error: an illegal memory access was encountered                                                                                                                        
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                    
                                                                                                                                                                                       
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):                                                                      
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b400ef097 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)                                   
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7b400aaa33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)                                                                                                                                                                                      
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7b4019d5a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)                                                                                                                                                                                     
frame #3: <unknown function> + 0x1f3de (0x7f7b401663de in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)                                                            
frame #4: <unknown function> + 0x22650 (0x7f7b40169650 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)                                                            
frame #5: <unknown function> + 0x22a35 (0x7f7b40169a35 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)                                                            
frame #6: <unknown function> + 0x4ef710 (0x7f7af1667710 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)                                                       
frame #7: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7f7b400cc393 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)                                                       
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7b400cc529 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)                                                         
frame #9: <unknown function> + 0x7761b8 (0x7f7af18ee1b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)                                                       
frame #10: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f7af18ee506 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)                                     
frame #11: <unknown function> + 0x1388e1 (0x5580685a58e1 in /usr/bin/python3)                                                                                                          
frame #12: <unknown function> + 0x1386dc (0x5580685a56dc in /usr/bin/python3)                                                                                                          
frame #13: <unknown function> + 0x138787 (0x5580685a5787 in /usr/bin/python3)                                                                                                          
frame #14: <unknown function> + 0x174ac1 (0x5580685e1ac1 in /usr/bin/python3)                                                                                                          
frame #15: <unknown function> + 0x153090 (0x5580685c0090 in /usr/bin/python3)                                                                                                          
frame #16: <unknown function> + 0x166918 (0x5580685d3918 in /usr/bin/python3)                                                                                                          
frame #17: <unknown function> + 0x2593a7 (0x5580686c63a7 in /usr/bin/python3)                                                                                                          
frame #18: <unknown function> + 0x17a7b0 (0x5580685e77b0 in /usr/bin/python3)                                                                                                          
frame #19: <unknown function> + 0x25f5c1 (0x5580686cc5c1 in /usr/bin/python3)                                                                                                          
frame #20: _PyEval_EvalFrameDefault + 0x7a99 (0x5580685b9b49 in /usr/bin/python3)                                                                                                      
frame #21: <unknown function> + 0x16ac31 (0x5580685d7c31 in /usr/bin/python3)                                                                                                          
frame #22: PyObject_Call + 0x122 (0x5580685d88e2 in /usr/bin/python3)                                                                                                                  
frame #23: <unknown function> + 0x27c30c (0x5580686e930c in /usr/bin/python3)                                                                                                          
frame #24: _PyObject_MakeTpCall + 0x25b (0x5580685c04ab in /usr/bin/python3)                                                                                                           
frame #25: _PyEval_EvalFrameDefault + 0x1a2f (0x5580685b3adf in /usr/bin/python3)                                                                                                      
frame #26: <unknown function> + 0x16ac31 (0x5580685d7c31 in /usr/bin/python3)                                                                                                          
frame #27: _PyEval_EvalFrameDefault + 0x1a2f (0x5580685b3adf in /usr/bin/python3)                                                                                                      
frame #28: _PyFunction_Vectorcall + 0x7c (0x5580685ca1ec in /usr/bin/python3)                                                                                                          
frame #29: _PyEval_EvalFrameDefault + 0x6d5 (0x5580685b2785 in /usr/bin/python3)                                                                                                       
frame #30: <unknown function> + 0x141ed6 (0x5580685aeed6 in /usr/bin/python3)                                                                                                          
frame #31: PyEval_EvalCode + 0x86 (0x5580686a5366 in /usr/bin/python3)                                                                                                                 
frame #32: <unknown function> + 0x265108 (0x5580686d2108 in /usr/bin/python3)                                                                                                          
frame #33: <unknown function> + 0x25df5b (0x5580686caf5b in /usr/bin/python3)                                                                                                          
frame #34: <unknown function> + 0x264e55 (0x5580686d1e55 in /usr/bin/python3)                                                                                                          
frame #35: _PyRun_SimpleFileObject + 0x1a8 (0x5580686d1338 in /usr/bin/python3)                                                                                                        
frame #36: _PyRun_AnyFileObject + 0x43 (0x5580686d1033 in /usr/bin/python3)                                                                                                            
frame #37: Py_RunMain + 0x2be (0x5580686c22de in /usr/bin/python3)                                                                                                                     
frame #38: Py_BytesMain + 0x2d (0x55806869832d in /usr/bin/python3)                                                                                                                    
frame #39: <unknown function> + 0x29d90 (0x7f7b5c24ad90 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                            
frame #40: __libc_start_main + 0x80 (0x7f7b5c24ae40 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                
frame #41: _start + 0x25 (0x558068698225 in /usr/bin/python3)                                                                                                                          

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 24134) of binary: /usr/bin/python3

Hprairie · 2024-06-19T18:52:56Z

@johnchienbronci, did you ever fix this issue? I am running into it too.

lljjgg added the bug and deepspeed-chat labels Apr 25, 2023

molly-smith self-assigned this May 5, 2023

molly-smith added the cuda and step 3 labels May 5, 2023

molly-smith assigned conglongli and unassigned molly-smith May 19, 2023

ydshieh mentioned this issue Jul 3, 2023

CUDA error: an illegal memory access was encountered huggingface/transformers#24608

Closed
