[BUG]Error after changing the model from opt to gpt #3373
Comments
Same problem, any solutions?
frame #56: __libc_start_main + 0xe7 (0x7f5d250ddc87 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::Error'
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
I have the same issue.
@JINGTING92 @lljjgg @rSemology can you please provide more information about your setups?
- ds_report output
- System info (please complete the following information): OS: [e.g. Ubuntu 18.04]
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed.
After further diagnosing my issue, it turns out I had a faulty GPU. Thanks.
@molly-smith Do you know what causes this? I encountered some errors when running run_speech_recognition_ctc_streaming.sh with deepspeed (torchrun --nproc_per_node 1 ...), and this issue consistently occurs with my custom corpora. (I can fine-tune successfully using the Common Voice corpus.) Environment:
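Since the failure above appears only with custom corpora and not with Common Voice, one plausible culprit is degenerate samples (empty transcripts or missing audio) in the custom data. Below is a minimal pre-flight check; the field names `sentence` and `audio` are assumptions modeled on Common Voice-style datasets, and `filter_bad_examples` is a hypothetical helper, not part of the training script:

```python
def filter_bad_examples(examples):
    """Drop samples with an empty transcript or missing audio.

    Hypothetical pre-flight check for a custom corpus: degenerate
    samples are a common cause of training-time errors that show up
    only with custom data, since curated corpora like Common Voice
    are already clean. Field names are assumptions.
    """
    return [
        ex for ex in examples
        if ex.get("sentence", "").strip() and ex.get("audio")
    ]
```

Running this over the dataset before training and comparing the input and output lengths quickly shows whether any samples were silently malformed.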
@johnchienbronci, did you ever fix this issue? I am running into it too.
I trained the PPO model using GPT. I changed the model_name_or_path option from opt to gpt2. Steps 1 and 2 passed, but an error occurred in step 3. The error is as follows:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_fin │
│ etuning/main.py:522 in <module>                                                                  │
│ │
│ 519 │
│ 520 │
│   521 if __name__ == "__main__":                                                                 │
│ ❱ 522 │ main() │
│ 523 │
│ │
│ /data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_fin │
│ etuning/main.py:431 in main │
│ │
│ 428 │ │ │ │ prompts = prompts[:, length - args.max_prompt_seq_len:] │
│ 429 │ │ │ │ raise ValueError("Prompt length is too long") │
│ 430 │ │ │ │
│ ❱ 431 │ │ │ out = trainer.generate_experience(prompts) │
│ 432 │ │ │ exp_dataset = exp_mini_dataset.add(out) │
│ 433 │ │ │ │
│ 434 │ │ │ if exp_dataset is not None: │
│ │
│ /data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_fin │
│ etuning/ppo_trainer.py:97 in generate_experience │
│ │
│ 94 │ │
│ 95 │ def generate_experience(self, prompts): │
│ 96 │ │ self.eval() │
│ ❱ 97 │ │ seq = self._generate_sequence(prompts) │
│ 98 │ │ self.train() │
│ 99 │ │ │
│ 100 │ │ pad_token_id = self.tokenizer.pad_token_id │
│ │
│ /data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_fin │
│ etuning/ppo_trainer.py:91 in _generate_sequence │
│ │
│ 88 │ │ │ │ continue │
│ 89 │ │ │ else: │
│ 90 │ │ │ │ out_seq.append(seq[i:i + 1]) │
│ ❱  91 │ │ out_seq = torch.cat(out_seq, dim=0)  # concate output in the batch dim                │
│    92 │ │                                                                                        │
│    93 │ │ return out_seq                                                                         │
│    94 │                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: torch.cat(): expected a non-empty list of Tensors
torch.Size([4, 50264])
torch.Size([4, 50264])
!!!! kernel execution error. (m: 2048, n: 4, k: 2048, error: 14)
!!!! kernel execution error. (m: 8192, n: 4, k: 2048, error: 13)
!!!! kernel execution error. (m: 2048, n: 4, k: 2048, error: 13)
Do you know what causes this? Can you provide the training steps for gpt2? Looking forward to your reply.
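The RuntimeError means `out_seq` was empty when `torch.cat` ran: every sequence in the batch was skipped by the filtering loop in `_generate_sequence` (lines 88-90 of the traceback). A defensive check makes the failure mode explicit; the `safe_cat` helper and the skip-the-batch handling below are a hypothetical sketch, not the repo's actual code:

```python
import torch

def safe_cat(out_seq):
    """Concatenate generated sequences along the batch dimension.

    Hypothetical sketch: in _generate_sequence, sequences that fail
    the validity check are skipped, so out_seq can end up empty and
    torch.cat raises "expected a non-empty list of Tensors". Returning
    None lets the caller skip this batch of prompts instead of crashing.
    """
    if len(out_seq) == 0:
        return None  # caller should skip this batch and log a warning
    return torch.cat(out_seq, dim=0)
```

One frequently suggested reason every sequence gets dropped when switching model_name_or_path from opt to gpt2 is that the GPT-2 tokenizer ships without a pad token, so pad_token_id-based filtering misbehaves; setting `tokenizer.pad_token = tokenizer.eos_token` before training is a common workaround (an assumption to verify against your setup, not a confirmed fix for this issue).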