Unable to train SmolLM model #214

@adeobootpin

I tried to kick off SFT of SmolLM with the following command:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_sft.py recipes/smollm/sft/config.yaml

Something appears to break while the packed training set is generated: the train split is written with 0 examples. The tail end of the debug output follows:

2025-04-03 04:17:10 - INFO - datasets.builder - Generating dataset generator (/root/.cache/huggingface/datasets/generator/default-529120a51edc719d/0.0.0)
Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-529120a51edc719d/0.0.0...
2025-04-03 04:17:10 - INFO - datasets.builder - Downloading and preparing dataset generator/default to /root/.cache/huggingface/datasets/generator/default-529120a51edc719d/0.0.0...
Generating train split
2025-04-03 04:17:10 - INFO - datasets.builder - Generating train split
Generating train split: 0 examples [00:19, ? examples/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1635, in _prepare_split_single
[rank0]: num_examples, num_bytes = writer.finalize()
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/arrow_writer.py", line 649, in finalize
[rank0]: raise SchemaInferenceError("Please pass features or at least one example when writing data")
[rank0]: datasets.arrow_writer.SchemaInferenceError: Please pass features or at least one example when writing data
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 589, in _prepare_packed_dataloader
[rank0]: packed_dataset = Dataset.from_generator(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1114, in from_generator
[rank0]: ).read()
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/io/generator.py", line 49, in read
[rank0]: self.builder.download_and_prepare(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 925, in download_and_prepare
[rank0]: self._download_and_prepare(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
[rank0]: super()._download_and_prepare(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1001, in _download_and_prepare
[rank0]: self._prepare_split(split_generator, **prepare_split_kwargs)
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1487, in _prepare_split
[rank0]: for job_id, done, content in self._prepare_split_single(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
[rank0]: raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank0]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/alignment-handbook/scripts/run_sft.py", line 234, in <module>
[rank0]: main()
[rank0]: File "/workspace/alignment-handbook/scripts/run_sft.py", line 166, in main
[rank0]: trainer = SFTTrainer(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
[rank0]: return f(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 368, in __init__
[rank0]: train_dataset = self._prepare_dataset(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 488, in _prepare_dataset
[rank0]: return self._prepare_packed_dataloader(
[rank0]: File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 593, in _prepare_packed_dataloader
[rank0]: raise ValueError(
[rank0]: ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
[rank0]:[W403 04:17:30.155929180 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0403 04:17:31.575000 5966 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 6119) of binary: /root/anaconda3/envs/handbook/bin/python

Traceback (most recent call last):
File "/root/anaconda3/envs/handbook/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1196, in launch_command
deepspeed_launcher(args)
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 878, in deepspeed_launcher
distrib_run.run(args)
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/run_sft.py FAILED
Failures:
<NO_OTHER_FAILURES>
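
The final ValueError says the dataset did not yield even one packed sequence, which matches the "0 examples" line in the log. As a rough illustration of how that can happen, here is a minimal pure-Python sketch of packing. It is not TRL's actual implementation: `pack_sequences`, `eos_id`, and the toy token lists are hypothetical names and data chosen for the example. It only shows the general condition: if the concatenated tokens never reach the sequence length, the generator yields nothing. Whether that is the cause in this run is not confirmed; an empty or filtered-out training split would produce the same symptom.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    token_lists: Iterable[List[int]], seq_length: int, eos_id: int = 0
) -> Iterator[List[int]]:
    """Concatenate tokenized samples (EOS-separated) and yield full blocks.

    Simplified illustration only; not the TRL packing code.
    """
    buffer: List[int] = []
    for tokens in token_lists:
        buffer.extend(tokens)
        buffer.append(eos_id)  # separator between samples
        # Emit every complete block currently available in the buffer.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any trailing partial block is dropped, so a small dataset
    # can end up producing no packed sequences at all.


# Hypothetical toy data: 3 + 1 + 2 + 1 = 7 tokens including separators.
tiny = [[1, 2, 3], [4, 5]]

# 7 tokens cannot fill one block of 8, so nothing is yielded.
print(len(list(pack_sequences(tiny, seq_length=8))))

# With a block size of 4, the first sample plus its separator fills one block.
print(len(list(pack_sequences(tiny, seq_length=4))))
```

A quick check along these lines is to compare the total token count of the training split against the configured sequence length before enabling packing.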
