Description
I tried to kick off SFT of SmolLM with the following command:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_sft.py recipes/smollm/sft/config.yaml
Something appears to break while the training set is being generated. The tail end of the debug output is below:
2025-04-03 04:17:10 - INFO - datasets.builder - Generating dataset generator (/root/.cache/huggingface/datasets/generator/default-529120a51edc719d/0.0.0)
Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-529120a51edc719d/0.0.0...
2025-04-03 04:17:10 - INFO - datasets.builder - Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-529120a51edc719d/0.0.0...
Generating train split
2025-04-03 04:17:10 - INFO - datasets.builder - Generating train split
Generating train split: 0 examples [00:19, ? examples/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1635, in _prepare_split_single
[rank0]: num_examples, num_bytes = writer.finalize()
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/arrow_writer.py", line 649, in finalize
[rank0]: raise SchemaInferenceError("Please pass features or at least one example when writing data")
[rank0]: datasets.arrow_writer.SchemaInferenceError: Please pass features or at least one example when writing data
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 589, in _prepare_packed_dataloader
[rank0]: packed_dataset = Dataset.from_generator(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1114, in from_generator
[rank0]: ).read()
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/io/generator.py", line 49, in read
[rank0]: self.builder.download_and_prepare(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 925, in download_and_prepare
[rank0]: self._download_and_prepare(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
[rank0]: super()._download_and_prepare(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1001, in _download_and_prepare
[rank0]: self._prepare_split(split_generator, **prepare_split_kwargs)
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1487, in _prepare_split
[rank0]: for job_id, done, content in self._prepare_split_single(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
[rank0]: raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank0]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/alignment-handbook/scripts/run_sft.py", line 234, in <module>
[rank0]: main()
[rank0]: File "/workspace/alignment-handbook/scripts/run_sft.py", line 166, in main
[rank0]: trainer = SFTTrainer(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
[rank0]: return f(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 368, in __init__
[rank0]: train_dataset = self._prepare_dataset(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 488, in _prepare_dataset
[rank0]: return self._prepare_packed_dataloader(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 593, in _prepare_packed_dataloader
[rank0]: raise ValueError(
[rank0]: ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
[rank0]:[W403 04:17:30.155929180 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0403 04:17:31.575000 5966 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 6119) of binary: /root/anaconda3/envs/handbook/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/handbook/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1196, in launch_command
deepspeed_launcher(args)
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 878, in deepspeed_launcher
distrib_run.run(args)
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/run_sft.py FAILED
Failures:
<NO_OTHER_FAILURES>
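
For context on the final ValueError ("enough samples to at least yield one packed sequence"), here is a rough sketch of how fixed-length packing can end up producing zero examples. This is not TRL's actual implementation; the function name `pack_fixed_length` and the `eos_id` parameter are hypothetical and only illustrate the general idea. It assumes the packer concatenates tokenized examples and emits only full chunks of `seq_length` tokens, so a dataset whose total token count is below `seq_length` (or an empty dataset) yields nothing, which would match the "0 examples" line in the log.

```python
from typing import Iterable, Iterator, List


def pack_fixed_length(
    token_lists: Iterable[List[int]],
    seq_length: int,
    eos_id: int = 0,  # hypothetical separator token id, for illustration only
) -> Iterator[List[int]]:
    """Concatenate tokenized examples and yield only full seq_length chunks."""
    buffer: List[int] = []
    for tokens in token_lists:
        # Append the example followed by a separator token.
        buffer.extend(tokens + [eos_id])
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any leftover tokens (fewer than seq_length) are dropped here.


tiny_dataset = [[1, 2, 3], [4, 5]]  # 7 tokens in total once separators are added

# Too few tokens for a single 2048-token sequence: nothing is yielded.
print(len(list(pack_fixed_length(tiny_dataset, seq_length=2048))))

# With a short sequence length the same data does yield a packed sequence.
print(list(pack_fixed_length(tiny_dataset, seq_length=4)))
```

If this is what is happening here, it may be worth checking how many rows the training split actually has after any filtering or mixing steps, and how that compares to the configured maximum sequence length.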