Possible bug in TinyLlama's dataloading #67
Comments
Hi Larry, yes, you are right. We added a shuffle operation there because starcoderdata is not shuffled; if we do not shuffle the filenames, data from one coding language may appear together during training. We thought the seeds for different workers were the same when we added that shuffle, but as you pointed out, they are not. This bug results in around 35% of the data not being loaded into the dataloader, and some data may be seen multiple times in a single epoch. I have updated the code to fix this bug. Unfortunately, we cannot afford a completely fresh run from scratch. What we opted to do instead is to manually kill the process, fix the code, and resume training from around the 1.5T-token checkpoint. We will discuss this issue in detail in our upcoming technical report. Thanks a million for pointing that out!
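For anyone following along, here is a minimal sketch of the fix direction described above (shuffle once with a seed shared by every rank, then take disjoint slices). The function and variable names are hypothetical; the actual change in pretrain/tinyllama.py may look different.

```python
import random

def partition_filenames(filenames, process_rank, num_processes, seed=42):
    # Sort first so every rank starts from the same deterministic order,
    # then shuffle with a seed shared by all ranks so the orderings agree.
    files = sorted(filenames)
    random.Random(seed).shuffle(files)
    # Disjoint strided slices now cover every file exactly once across ranks.
    return files[process_rank::num_processes]
```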
Hi @jzhang38, thanks for confirming! Given that TinyLlama-1b is performing quite well, I'm hopeful that the results will improve further after this fix :)
Hmm, but the line below ensures that the filenames are shuffled, right? (TinyLlama/pretrain/tinyllama.py, Line 353 in 5120753)
Hmm, can I check why you did not sort the glob (TinyLlama/pretrain/tinyllama.py, Line 306 in 5120753)?
Separately, there's one other issue with the lit_gpt code which I surfaced here. Any ideas if I'm correct?
TinyLlama/lit_gpt/packed_dataset.py Line 188 in 5120753
The shuffle toggle there only shuffles seqs within a single chunk.
glob.glob returns entries in the order they appear in the filesystem. I think that order is arbitrary but fixed.
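As a side note, the defensive version is a one-liner: sorting the glob result makes the starting order deterministic on every node, after which a shared-seed shuffle is safe. The pattern below is just a placeholder, not the repo's actual data layout.

```python
import glob

# Hypothetical pattern; sort so every node sees the same starting order,
# regardless of how the filesystem happens to enumerate the entries.
filenames = sorted(glob.glob("data/slimpajama/*/train_*.bin"))
```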
If you check the source code of fabric.setup_dataloaders:
It will not add a distributed sampler to the dataloader if the dataloader uses an iterable dataset, which is the case for lit_gpt/TinyLlama.
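In other words, with an iterable dataset each rank has to restrict itself to its own shard. A rough sketch of what that looks like in plain PyTorch; the class and attribute names are illustrative, not lit_gpt's PackedDataset.

```python
from torch.utils.data import IterableDataset

class ShardedFiles(IterableDataset):
    """Illustrative iterable dataset that splits a shared file list by rank."""

    def __init__(self, filenames, process_rank, num_processes):
        self.filenames = filenames
        self.process_rank = process_rank
        self.num_processes = num_processes

    def __iter__(self):
        # No DistributedSampler is involved, so the dataset itself walks only
        # its rank's slice of the (identically ordered) file list.
        for name in self.filenames[self.process_rank::self.num_processes]:
            yield name  # the real dataset would yield token blocks from the file
```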
Thanks for your replies!
At least in my case, the order was the same for processes on the same node, but different across nodes. It may be because I'm running from a Docker container.
I see. Maybe I should sort it to play it safe.
Sorry @jzhang38, can you walk me through how you calculated 35%? I calculated 62% for 16 processes. My intention is not to nitpick (I'm grateful for the codebase!) but to understand the implications. Here are my calculations:
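(The detailed calculations from the original comment are not reproduced here. A rough simulation of the setup as I understand it, where each of 16 ranks shuffles the full file list with its own seed and keeps a strided 1/16 slice, gives about 36% of files never read, roughly (1 - 1/16)^16, with the complement of about 64% being the share read at least once. The sketch below is an assumption about the partitioning scheme, not the thread's actual math.)

```python
import random

# A rough simulation, assuming each of 16 ranks shuffles the whole file list
# with its own seed and keeps a strided 1/16 slice -- an assumption about the
# partitioning, not the exact logic in the repo.
def missed_fraction(num_files=100_000, world_size=16, seed=0):
    filenames = list(range(num_files))
    seen = set()
    for rank in range(world_size):
        files = list(filenames)
        random.Random(seed + rank).shuffle(files)  # a different seed per rank, as in the bug
        seen.update(files[rank::world_size])
    return 1 - len(seen) / num_files

print(missed_fraction())  # ~0.356, i.e. roughly (1 - 1/16) ** 16
```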
Good catch @larrylawl -- thanks for surfacing this! @jzhang38 would you be open to releasing the upcoming 1.5T checkpoint even in spite of this bug (perhaps with a different name to indicate the duplicate/unused partitions), before restarting training from the 1T checkpoint (which it seems you've decided on as your course of action, according to the Notion update)? It would be very interesting to compare the two versions of the same model, trained to the same point, with and without this difference in the data used, to observe how the performance differs. (Also, in spite of this bug, I imagine the with-bug 1.5T checkpoint will still improve performance over the previous 1T point, which is useful for those of us experimenting with the model today.)
@gabrielgrant Be sure to pass the correct
I actually did a small evaluation myself on HellaSwag, and there is about a 1% improvement from the 1T to the 1.5T checkpoint:
Thanks for the info, @MFajcik!
Hi,
Can I check why you shuffled the filenames here? (https://github.com/jzhang38/TinyLlama/blob/main/pretrain/tinyllama.py#L308)
To my understanding, shuffling results in different nodes taking partitions of different lists of filenames. Hence, a filename might be processed twice by different nodes (i.e. `*0.bin` in the diagram), which implies another filename is left out. To my understanding, what we want is for different nodes to take partitions of the same list of filenames, so that every filename is included.
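As a toy illustration with made-up shard names: suppose two nodes end up with differently shuffled lists because their seeds differ, and node k keeps every 2nd file starting at position k.

```python
node0_order = ["0.bin", "2.bin", "1.bin", "3.bin"]
node1_order = ["3.bin", "0.bin", "1.bin", "2.bin"]

node0_reads = set(node0_order[0::2])  # {"0.bin", "1.bin"}
node1_reads = set(node1_order[1::2])  # {"0.bin", "2.bin"}
# "0.bin" is read by both nodes while "3.bin" is read by neither -- exactly the
# duplicate/missing-file situation described above.
```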
I've verified the above by checking whether a particular filename (`train_slimpajama_214_0000000275.bin` in my case) exists in each node's list of filenames. In my run, this filename is read by multiple nodes. After my fix (see below), this filename is only read by one node.
Is my understanding correct? Many thanks!