LLVM ERROR: Failed to compute parent layout for slice layout. #4280

@sxm7078

Description

{'loss': -5e-08, 'grad_norm': 0.0001303, 'learning_rate': 1.6e-07, 'memory(GiB)': 17.56, 'train_speed(iter/s)': 0.01798, 'completions/mean_length': 685.1625, 'completions/min_length': 240.2, 'completions/max_length': 1395.6, 'completions/clipped_ratio': 0.0, 'rewards/MathFormat/mean': 0.0875, 'rewards/MathFormat/std': 0.1700653, 'reward': 0.0875, 'reward_std': 0.025, 'kl': -1.26e-06, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.01, 'global_step/max_steps': '10/1237', 'percentage': '0.81%', 'elapsed_time': '8m 3s', 'remaining_time': '16h 29m 30s'}
Train: 1%|▊ | 10/1237 [08:04<15:42:53, 46.11s/it]
INFO 05-20 15:01:54 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 05-20 15:01:54 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 05-20 15:01:54 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 05-20 15:01:54 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 05-20 15:01:54 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 05-20 15:01:54 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 05-20 15:01:54 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 05-20 15:01:54 [prefix_caching_block.py:479] Successfully reset prefix cache
LLVM ERROR: Failed to compute parent layout for slice layout.
LLVM ERROR: Failed to compute parent layout for slice layout.
LLVM ERROR: Failed to compute parent layout for slice layout.
LLVM ERROR: Failed to compute parent layout for slice layout.
W0520 15:01:55.755000 732590 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 732933 closing signal SIGTERM
W0520 15:01:55.756000 732590 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 732935 closing signal SIGTERM
E0520 15:01:55.870000 732590 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 732934) of binary: /home/ma-user/anaconda3/envs/test/bin/python3.1
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/test/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ma-user/anaconda3/envs/test/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ma-user/anaconda3/envs/test/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
    main()
  File "/home/ma-user/anaconda3/envs/test/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/test/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/ma-user/anaconda3/envs/test/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/ma-user/anaconda3/envs/test/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ma-user/anaconda3/envs/test/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/data/chatglm/retrieval_agent_new/ms_swift_train/ms-swift/swift/cli/rlhf.py FAILED

Failures:
[1]:
time : 2025-05-20_15:01:55
host : notebook-a9960b63-59e7-451b-bc74-5da2fe862355.notebook-a9960b63-59e7-451b-bc74-5da2fe862355-distributed.default.svc.cluster.local
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 732936)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 732936

Root Cause (first observed failure):
[0]:
time : 2025-05-20_15:01:55
host : notebook-a9960b63-59e7-451b-bc74-5da2fe862355.notebook-a9960b63-59e7-451b-bc74-5da2fe862355-distributed.default.svc.cluster.local
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 732934)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 732934

/home/ma-user/anaconda3/envs/test/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
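
For anyone reading the failure summary above: the negative exit code follows the usual Python convention in which a child killed by a signal is reported as the negated signal number, so -6 corresponds to signal 6, which the log itself names as SIGABRT. A minimal sketch of that decoding, assuming a POSIX system; the helper name describe_exit_code is hypothetical and is not part of torch or ms-swift:

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Describe a child exit code (hypothetical helper, for illustration only).

    Assumes the convention used by Python's subprocess/multiprocessing:
    a negative value -N means the child was terminated by signal N.
    """
    if exitcode < 0:
        signum = -exitcode
        try:
            # Look up the symbolic name; signal numbers are platform-specific.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


# Exit code taken from the failure summary in the log above.
print(describe_exit_code(-6))
```

This only explains how the process died (an abort, consistent with the LLVM ERROR lines printed just before), not why the layout computation failed.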
