training 'Num examples' is not equal to the size of the dataset #5362

Closed
ShayKaiserman opened this issue Sep 4, 2024 · 3 comments
Labels: solved (This problem has been already solved)

@ShayKaiserman commented Sep 4, 2024
Hi,

I’m trying to understand the reported 'Num examples' during the training process.

I am fine-tuning an LLM on my custom dataset, which contains 348,979 chunks of text. However, during training (executing llamafactory-cli train model_pt.yaml), the log shows the following:

09/04/2024 15:03:43 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 348979/348979 [00:01<00:00, 188087.31 examples/s]
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
Running tokenizer on dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 348979/348979 [00:08<00:00, 39427.21 examples/s]
...
...
[INFO|trainer.py:2134] 2024-09-04 15:04:18,470 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-09-04 15:04:18,470 >> Num examples = 137,591
[INFO|trainer.py:2136] 2024-09-04 15:04:18,470 >> Num Epochs = 3
[INFO|trainer.py:2137] 2024-09-04 15:04:18,470 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2140] 2024-09-04 15:04:18,470 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:2141] 2024-09-04 15:04:18,470 >> Gradient Accumulation steps = 4
[INFO|trainer.py:2142] 2024-09-04 15:04:18,470 >> Total optimization steps = 6,450
[INFO|trainer.py:2143] 2024-09-04 15:04:18,476 >> Number of trainable parameters = 167,772,160
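For reference, the logged step count is consistent with the reported example count, total batch size, and epoch count (a rough check; the trainer's exact bookkeeping may differ slightly):

import math

num_examples = 137_591        # packed examples reported by the trainer
total_train_batch_size = 64   # 4 per device x 4 grad accum x (presumably) 4 devices
num_epochs = 3

steps_per_epoch = math.ceil(num_examples / total_train_batch_size)   # 2150
total_optimization_steps = steps_per_epoch * num_epochs              # 6450, matching the log
print(steps_per_epoch, total_optimization_steps)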

I expected 348,979 examples to be reported during training, but only 137,591 are shown. Could you help me understand why this discrepancy occurs?

Additionally, since my dataset is relatively small, I want to reduce the number of examples used for evaluation. However, changing the 'val_size' in the configuration doesn't seem to have a significant effect.

Here is the configuration file I’m using:

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B
quantization_method: bitsandbytes
quantization_bit: 4

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 64
lora_alpha: 16
lora_dropout: 0.1
use_adam_mini: true
#flash_attn: fa2
enable_liger_kernel: true

### dataset
dataset_dir: /CODE/FT/LLaMA-Factory/data
dataset: my_data
cutoff_len: 512
#max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /MODELS/base_models/local_models/FT/PT
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 2.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
#bf16: true
fp16: true
ddp_timeout: 180000000

### eval
#val_size: 0.0001
#per_device_eval_batch_size: 1
#eval_strategy: steps
#eval_steps: 500
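
Regarding the evaluation question: in the config as pasted, the ### eval options are commented out. Assuming val_size behaves like an ordinary train/validation split fraction when given as a float, a very small value keeps only a handful of examples for evaluation; a rough estimate (whether the split happens before or after packing is an assumption here):

# Rough estimate of how many examples a fractional val_size would hold out.
# Assumes the split is taken over the packed dataset (137,591 examples);
# if it is applied to the raw chunks instead, substitute 348,979.
packed_examples = 137_591

for val_size in (0.0001, 0.001, 0.01):
    eval_examples = int(packed_examples * val_size)
    print(f"val_size={val_size}: ~{eval_examples} eval examples")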
github-actions bot added the pending (This problem is yet to be addressed) label on Sep 4, 2024
@hiyouga (Owner) commented Sep 4, 2024

We adopt sequence packing at the pre-training stage, so multiple sequences are packed together into a single training example.
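
In other words, at the pre-training stage the tokenized chunks are effectively concatenated and cut into fixed-length blocks of cutoff_len tokens, so the number of training examples is roughly total_tokens / cutoff_len rather than the number of raw chunks. A minimal sketch of this style of packing (an illustration of the idea, not LLaMA-Factory's actual preprocessing code):

from itertools import chain

def pack_for_pretraining(tokenized_texts, block_size=512):
    """Concatenate tokenized chunks and split them into fixed-length blocks.

    Illustrative only: the real preprocessing also handles labels,
    attention masks, and special tokens.
    """
    all_ids = list(chain.from_iterable(tokenized_texts))
    usable = (len(all_ids) // block_size) * block_size   # drop the trailing remainder
    return [all_ids[i : i + block_size] for i in range(0, usable, block_size)]

# With cutoff_len = 512, the 137,591 packed examples reported above imply
# roughly 137,591 * 512 ≈ 70.4M tokens, i.e. about 200 tokens per raw chunk
# on average across the 348,979 chunks.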

hiyouga added the solved label, removed the pending label, and closed this as completed on Sep 4, 2024
@chanchimin commented:
> We adopt sequence packing at the pre-training stage, so multiple sequences are packed together into a single training example.

Hi, I have another question about using sequence packing at the SFT stage. When I specify packing: true in the config (even with stage: sft), sequence packing is applied without any warning. Is it practical to use sequence packing at the SFT stage? During evaluation I find that some benchmark scores are normal while others are not, so perhaps packing should not be used at the SFT stage?

@hiyouga (Owner) commented Sep 13, 2024

@chanchimin Packed training should use different hyperparameters from the non-packed setting. You can refer to this thread for further discussion: #5426
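
A related point for the SFT case: with a plain causal mask, packed samples can attend to earlier samples in the same block, which is one reason packed and non-packed runs can score differently on benchmarks. A minimal sketch of a block-diagonal causal mask that keeps packed samples isolated (an illustration of the concept, not LLaMA-Factory's implementation):

import numpy as np

def packed_causal_mask(sample_lengths):
    """Causal attention mask that is block-diagonal over packed samples,
    so each token only attends to earlier tokens of its own sample."""
    total = sum(sample_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in sample_lengths:
        mask[start:start + length, start:start + length] = np.tril(
            np.ones((length, length), dtype=bool)
        )
        start += length
    return mask

# Three samples of lengths 3, 2 and 4 packed into one 9-token sequence:
print(packed_causal_mask([3, 2, 4]).astype(int))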
