training 'Num examples' is not equal to the size of the dataset #5362

Closed
ShayKaiserman opened this issue Sep 4, 2024 · 3 comments
Labels: solved (This problem has been already solved)

@ShayKaiserman commented Sep 4, 2024
Hi,

I’m trying to understand the reported 'Num examples' during the training process.

I am fine-tuning an LLM on my custom dataset, which contains 348,979 chunks of text. However, during training (executing llamafactory-cli train model_pt.yaml), the log shows the following:

09/04/2024 15:03:43 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 348979/348979 [00:01<00:00, 188087.31 examples/s]
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
Running tokenizer on dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 348979/348979 [00:08<00:00, 39427.21 examples/s]
...
...
[INFO|trainer.py:2134] 2024-09-04 15:04:18,470 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-09-04 15:04:18,470 >> Num examples = 137,591
[INFO|trainer.py:2136] 2024-09-04 15:04:18,470 >> Num Epochs = 3
[INFO|trainer.py:2137] 2024-09-04 15:04:18,470 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2140] 2024-09-04 15:04:18,470 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:2141] 2024-09-04 15:04:18,470 >> Gradient Accumulation steps = 4
[INFO|trainer.py:2142] 2024-09-04 15:04:18,470 >> Total optimization steps = 6,450
[INFO|trainer.py:2143] 2024-09-04 15:04:18,476 >> Number of trainable parameters = 167,772,160
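For reference, the logged step count is consistent with the reported example count, total batch size, and epoch count (a rough check; the trainer's exact bookkeeping may differ slightly):

import math

num_examples = 137_591        # packed examples reported by the trainer
total_train_batch_size = 64   # 4 per device x 4 grad accum x (presumably) 4 devices
num_epochs = 3

steps_per_epoch = math.ceil(num_examples / total_train_batch_size)   # 2150
total_optimization_steps = steps_per_epoch * num_epochs              # 6450, matching the log
print(steps_per_epoch, total_optimization_steps)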

I expected 348,979 examples to be reported during training, but only 137,591 are shown. Could you help me understand why this discrepancy occurs?

Additionally, since my dataset is relatively small, I want to reduce the number of examples used for evaluation. However, changing the 'val_size' in the configuration doesn't seem to have a significant effect.

Here is the configuration file I’m using:

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B
quantization_method: bitsandbytes
quantization_bit: 4

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 64
lora_alpha: 16
lora_dropout: 0.1
use_adam_mini: true
#flash_attn: fa2
enable_liger_kernel: true

### dataset
dataset_dir: /CODE/FT/LLaMA-Factory/data
dataset: my_data
cutoff_len: 512
#max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /MODELS/base_models/local_models/FT/PT
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 2.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
#bf16: true
fp16: true
ddp_timeout: 180000000

### eval
#val_size: 0.0001
#per_device_eval_batch_size: 1
#eval_strategy: steps
#eval_steps: 500
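
Regarding the evaluation question: in the config as pasted, the ### eval options are commented out. Assuming val_size behaves like an ordinary train/validation split fraction when given as a float, a very small value keeps only a handful of examples for evaluation; a rough estimate (whether the split happens before or after packing is an assumption here):

# Rough estimate of how many examples a fractional val_size would hold out.
# Assumes the split is taken over the packed dataset (137,591 examples);
# if it is applied to the raw chunks instead, substitute 348,979.
packed_examples = 137_591

for val_size in (0.0001, 0.001, 0.01):
    eval_examples = int(packed_examples * val_size)
    print(f"val_size={val_size}: ~{eval_examples} eval examples")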
github-actions bot added the pending (This problem is yet to be addressed) label on Sep 4, 2024
@hiyouga (Owner) commented Sep 4, 2024

We adopt sequence packing at the pre-training stage, so multiple sequences are packed together into a single training example.
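
In other words, at the pre-training stage the tokenized chunks are effectively concatenated and cut into fixed-length blocks of cutoff_len tokens, so the number of training examples is roughly total_tokens / cutoff_len rather than the number of raw chunks. A minimal sketch of this style of packing (an illustration of the idea, not LLaMA-Factory's actual preprocessing code):

from itertools import chain

def pack_for_pretraining(tokenized_texts, block_size=512):
    """Concatenate tokenized chunks and split them into fixed-length blocks.

    Illustrative only: the real preprocessing also handles labels,
    attention masks, and special tokens.
    """
    all_ids = list(chain.from_iterable(tokenized_texts))
    usable = (len(all_ids) // block_size) * block_size   # drop the trailing remainder
    return [all_ids[i : i + block_size] for i in range(0, usable, block_size)]

# With cutoff_len = 512, the 137,591 packed examples reported above imply
# roughly 137,591 * 512 ≈ 70.4M tokens, i.e. about 200 tokens per raw chunk
# on average across the 348,979 chunks.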

hiyouga added the solved label, removed the pending label, and closed this as completed on Sep 4, 2024
@chanchimin commented:
> We adopt sequence packing at the pre-training stage, so multiple sequences are packed together into a single training example.

Hi, I have another question about using sequence packing at the SFT stage. When I specify packing: true in the config (even with stage: sft), sequence packing is applied without any warning. Is it practical to use sequence packing at the SFT stage? During evaluation I find that some benchmark scores are normal while others are not, so perhaps packing should not be used at the SFT stage?

@hiyouga (Owner) commented Sep 13, 2024

@chanchimin Packed training should use different hyperparameters from the non-packed setting. You can refer to this thread for further discussion: #5426
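
A related point for the SFT case: with a plain causal mask, packed samples can attend to earlier samples in the same block, which is one reason packed and non-packed runs can score differently on benchmarks. A minimal sketch of a block-diagonal causal mask that keeps packed samples isolated (an illustration of the concept, not LLaMA-Factory's implementation):

import numpy as np

def packed_causal_mask(sample_lengths):
    """Causal attention mask that is block-diagonal over packed samples,
    so each token only attends to earlier tokens of its own sample."""
    total = sum(sample_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in sample_lengths:
        mask[start:start + length, start:start + length] = np.tril(
            np.ones((length, length), dtype=bool)
        )
        start += length
    return mask

# Three samples of lengths 3, 2 and 4 packed into one 9-token sequence:
print(packed_causal_mask([3, 2, 4]).astype(int))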
