I’m trying to understand the reported 'Num examples' during the training process.
I am fine-tuning an LLM on my custom dataset, which contains 348,979 chunks of text. However, during training (executing llamafactory-cli train model_pt.yaml), the log shows the following:
09/04/2024 15:03:43 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 348979/348979 [00:01<00:00, 188087.31 examples/s]
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
09/04/2024 15:03:46 - INFO - llamafactory.data.loader - Loading dataset /DATA/train/raw/data_raw.json...
Running tokenizer on dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 348979/348979 [00:08<00:00, 39427.21 examples/s]
...
...
[INFO|trainer.py:2134] 2024-09-04 15:04:18,470 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-09-04 15:04:18,470 >> Num examples = 137,591
[INFO|trainer.py:2136] 2024-09-04 15:04:18,470 >> Num Epochs = 3
[INFO|trainer.py:2137] 2024-09-04 15:04:18,470 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2140] 2024-09-04 15:04:18,470 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:2141] 2024-09-04 15:04:18,470 >> Gradient Accumulation steps = 4
[INFO|trainer.py:2142] 2024-09-04 15:04:18,470 >> Total optimization steps = 6,450
[INFO|trainer.py:2143] 2024-09-04 15:04:18,476 >> Number of trainable parameters = 167,772,160
I expected to see 348,979 examples reported during training, but only 137,591 are indicated. Could you help me understand why this discrepancy occurs?
Additionally, since my dataset is relatively small, I want to reduce the number of examples used for evaluation. However, changing the 'val_size' in the configuration doesn't seem to have a significant effect.
We adopted sequence packing for pretraining, so multiple sequences are packed together into a single training example.
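To illustrate why packing shrinks the reported example count: roughly speaking, the tokenized chunks are concatenated and re-cut into fixed-length blocks, so the number of training examples is driven by the total token count rather than by the number of original chunks. The sketch below only illustrates that idea; the pack_examples helper, the toy data, and the cutoff_len of 1024 are assumptions for the example, not LLaMA-Factory's actual code.

```python
# Illustrative sketch of sequence packing (NOT LLaMA-Factory's actual implementation).
# Assumption: token ids from all chunks are concatenated, then split into blocks of
# `cutoff_len` tokens; the trailing remainder that does not fill a whole block is dropped.
from itertools import chain

def pack_examples(tokenized_chunks, cutoff_len=1024):
    """Concatenate token-id lists and cut them into blocks of cutoff_len tokens."""
    all_ids = list(chain.from_iterable(tokenized_chunks))
    usable = (len(all_ids) // cutoff_len) * cutoff_len
    return [all_ids[i:i + cutoff_len] for i in range(0, usable, cutoff_len)]

# Toy data: 1,000 chunks of ~400 tokens each collapse into far fewer packed blocks,
# the same effect that turns 348,979 raw chunks into 137,591 training examples.
chunks = [[1] * 400 for _ in range(1_000)]
packed = pack_examples(chunks, cutoff_len=1024)
print(len(chunks), "chunks ->", len(packed), "packed examples")  # 1000 chunks -> 390 packed examples
```

As a sanity check, the reported step count is consistent with the packed example count: ceil(137,591 examples × 3 epochs / 64 total batch size) = 6,450, which matches the "Total optimization steps = 6,450" line in the log above.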
Hi, I have a follow-up question about using sequence packing in the SFT stage: when I set packing: true in the config (even with stage: sft), packing is applied without any warning. Is it practical to use sequence packing in the SFT stage? During evaluation I find that some benchmark scores are normal while others are not; maybe packing should not be used in the SFT stage?