[BUG]Killing subprocess exits with return code = 1 after saving one checkpoint #5302

selina-feng · 2024-03-21T03:59:13Z

Killing subprocess exits with return code = 1 after saving one checkpoint

[2024-03-20 20:31:29,210] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3353707
[2024-03-20 20:31:29,240] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3353708
[2024-03-20 20:31:29,269] [ERROR] [launch.py:321:sigkill_handler] ['/data/anaconda3/envs/llama/bin/python', '-u', 'src/train_bash.py', '--local_rank=4', '--stage', 'sft', '--deepspeed', 'configs/deepspeed_zero3_config.json', '--do_train', '--model_name_or_path', '/for_llm_model/', '--dataset', 'fin_', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data_c/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '5e-5', '--num_train_epochs', '2.0', '--plot_loss', '--fp16', '--cutoff_len', '4096', '--flash_attn'] exits with return code = 1

loadams · 2024-03-22T15:24:54Z

Hi @selina-feng - can you please share your ds_report, the number of GPUs in your system, the command you are running, and a full error log?

selina-feng added the bug and training labels · Mar 21, 2024
