Description
I was looking at the logs of your training (from this json file) and realized that the scheduling is messed up.
It's related to the ConstantLengthDataset not computing its actual length. When I train this model, the progress bar and the total number of iterations are calculated from the underlying H4 dataset (around 208k samples) instead of the packed version, which has around 139k packed sequences of length 2048.
This affects the scheduler, which does not perform any warmup. I have an 8xA100 node, so I am running 2x gradient accumulation to get an adequate effective batch size of 512.
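A rough way to get the packed length up front (a hedged sketch; `dataset`, `tokenizer`, and the `"text"` column are placeholders for whatever the SFT script actually uses):

```python
def estimate_packed_length(dataset, tokenizer, seq_length=2048, text_field="text"):
    """Estimate how many packed sequences a ConstantLengthDataset will yield:
    total token count divided by the packed sequence length."""
    total_tokens = sum(
        len(tokenizer(example[text_field]).input_ids) for example in dataset
    )
    return total_tokens // seq_length

# e.g. estimate_packed_length(h4_dataset, tokenizer) -> ~139k for this run
```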
- I am sure you are missing a `warmup_ratio: 0.1` on the SFT configs (see the sketch below).
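A minimal sketch of what I mean, using plain `transformers` (every value other than `warmup_ratio` is illustrative, not the repo's actual setting):

```python
from transformers import TrainingArguments

# Note: warmup_ratio is resolved against the *total* number of training
# steps, so the miscounted dataset length described above would still
# skew the warmup even with this set.
training_args = TrainingArguments(
    output_dir="zephyr-7b-sft",   # hypothetical path
    warmup_ratio=0.1,             # the setting I believe is missing
    lr_scheduler_type="cosine",   # illustrative
    learning_rate=2e-5,           # illustrative
)
```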
It would be beneficial to have access to the training logs; I found them on TensorBoard :(
You can follow my training here: https://wandb.ai/capecape/zephyr/runs/zhfrhnr5
P.S.: When using trl, I manually compute the total number of training steps beforehand so I can pass the right number of warmup steps to the scheduler. I know the ConstantLengthDataset is a generator that yields packed sequences without knowing in advance how many samples it will produce.
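For illustration, that workaround looks roughly like this (a sketch with hypothetical numbers; `get_scheduler` is the standard `transformers` helper, and the optimizer here is a dummy so the snippet is self-contained):

```python
import torch
from transformers import get_scheduler

# Hypothetical numbers for illustration.
num_packed_sequences = 139_000   # packed sequences of length 2048
effective_batch_size = 512       # 8 GPUs x 2 grad accum x per-device batch
num_epochs = 1

total_train_steps = num_epochs * (num_packed_sequences // effective_batch_size)
warmup_steps = int(0.1 * total_train_steps)   # i.e. warmup_ratio = 0.1

# Dummy parameter so the sketch runs; in practice this is the model's optimizer.
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=2e-5)

lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_train_steps,
)
```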