
Possible bug in TinyLlama's dataloading #67

Closed
larrylawl opened this issue Oct 17, 2023 · 9 comments

Comments

@larrylawl

Hi,

Can I check why you shuffled the filenames here: https://github.com/jzhang38/TinyLlama/blob/main/pretrain/tinyllama.py#L308?


To my understanding, shuffling results in different nodes taking partitions of different lists of filenames. Hence, a filename might be processed twice by different nodes (e.g. *0.bin in the diagram below), which implies another filename is left out.

[diagram omitted]

To my understanding, what we want is for different nodes to take partitions of the same list of filenames, so that every filename is included.

[diagram omitted]

I've verified the above by checking whether a particular filename (train_slimpajama_214_0000000275.bin in my case) exists in each node's list of filenames. In my run, this filename is read by different nodes.

[screenshot omitted]

After my fix (see below), this file name is only read by one node.

filenames = sorted(glob.glob(str(data_dir / f"{prefix}*")))
# random.seed(seed)
# random.shuffle(filenames)
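
To make this concrete, here is a minimal standalone simulation of the issue (the filenames, rank count, and seeds are made up for illustration; this is not the actual TinyLlama code). Each rank shuffles its own copy of the list with a different seed and then takes its 1/world_size slice:

import random

filenames = [f"train_{i:02d}.bin" for i in range(8)]  # hypothetical shard names
world_size = 4                                         # hypothetical number of ranks

def partition_for_rank(rank, seed):
    files = list(filenames)
    random.seed(seed)        # per-rank seed -> per-rank ordering
    random.shuffle(files)
    per_rank = len(files) // world_size
    return files[rank * per_rank:(rank + 1) * per_rank]

# Buggy behaviour: every rank shuffles with a different seed before slicing.
seen = [f for rank in range(world_size) for f in partition_for_rank(rank, seed=rank)]

missed = set(filenames) - set(seen)
duplicated = {f for f in seen if seen.count(f) > 1}
print("missed:", sorted(missed))          # typically non-empty: some shards are never read
print("duplicated:", sorted(duplicated))  # typically non-empty: some shards are read twice
# With one shared ordering (sorted filenames, or one seed shared by all ranks),
# the slices are disjoint and every shard is read exactly once.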

Is my understanding correct? Many thanks!

@jzhang38
Owner

Hi Larry,

Yes, you are right.

We added a shuffle operation there because starcoderdata is not shuffled. If we do not shuffle the filenames, data from one coding language may appear together during training. We thought the seeds for the different workers were the same when we added that shuffle, but as you pointed out, they are not.

This bug results in around 35% of data not being loaded into the dataloader and some data may be seen multiple times in a single epoch. I have updated the code to fix this bug. Unfortunately, we cannot afford a completely fresh run from scratch. What we opted for instead is to manually kill the process, fix the code, and resume training from around the 1.5T-token checkpoint.

We will discuss this issue in detail in our upcoming technical report. Thanks a million for pointing that out!

@larrylawl
Author

larrylawl commented Oct 18, 2023

Hi @jzhang38, thanks for confirming! Given that TinyLlama-1b is performing quite well, I'm hopeful that the results will improve further after this fix :)

We added a shuffle operation there because starcoderdata is not shuffled. If we do not shuffle the filenames, data from one coding language may appear together during training.

Hmm, but the line below ensures that the filenames are shuffled right?

shuffle=True,

I have updated the code to fix this bug.

Hmm, can I check why you did not sort the glob (i.e. filenames = sorted(glob.glob(str(data_dir / f"{prefix}*"))))? glob.glob's return order is arbitrary, so the list of filenames will still differ across processes. I think we need to sort it to ensure the list of filenames is the same across processes.

filenames = glob.glob(str(data_dir / f"{prefix}*"))
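
For reference, here's a rough sketch of what I have in mind (data_dir and prefix as in the snippet above; the seed value is just an example): sort first so every rank starts from the same list, then shuffle with a seed shared by all ranks, so starcoderdata is still interleaved but the per-rank partitions stay disjoint:

import glob
import random

# identical, deterministic list on every rank
filenames = sorted(glob.glob(str(data_dir / f"{prefix}*")))
# shuffle with a seed that is the same on every rank (illustrative value),
# so all ranks see the same permutation and the 1/world_size slices are disjoint
rng = random.Random(1234)
rng.shuffle(filenames)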

Separately, there's one other issue with the lit_gpt code which I surfaced here. Any ideas if I'm correct?

@jzhang38
Owner

jzhang38 commented Oct 18, 2023

Hmm, but the line below ensures that the filenames are shuffled right?

self._block_idxs = self._rng.permutation(n_all_blocks) if self._shuffle else range(n_all_blocks)

The shuffle toggle there only shuffles the sequences within a single chunk.

can I check why you did not sort the glob (i.e. filenames = sorted(glob.glob(str(data_dir / f"{prefix}*"))))

The order returned by glob.glob uses the order in which entries appear in the filesystem. I think the order is arbitrary but fixed.

Separately, there's one other issue with the lit_gpt code which I surfaced Lightning-AI/litgpt#649. Any ideas if I'm correct?

If you check the source code of fabric.setup_dataloaders:

    def _requires_distributed_sampler(self, dataloader: DataLoader) -> bool:
        return (
            getattr(self.strategy, "distributed_sampler_kwargs", None) is not None
            and not isinstance(dataloader.sampler, DistributedSampler)
            and not has_iterable_dataset(dataloader)
        )

It will not add a distributed sampler to the dataloader if the dataloader wraps an iterable dataset, which is the case for lit-gpt/tinyllama.
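
So with an iterable dataset the sharding has to happen inside the dataset itself rather than via a sampler. A rough sketch of that pattern (illustrative only, not the actual lit-gpt code; the class and its arguments are made up):

from torch.utils.data import IterableDataset, get_worker_info

class ShardedIterable(IterableDataset):
    # Illustrative only: each (rank, worker) pair yields a disjoint subset of items.
    def __init__(self, items, rank, world_size):
        self.items = items
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        info = get_worker_info()
        num_workers = info.num_workers if info is not None else 1
        worker_id = info.id if info is not None else 0
        # global shard index across both ranks and dataloader workers
        shard = self.rank * num_workers + worker_id
        num_shards = self.world_size * num_workers
        for i, item in enumerate(self.items):
            if i % num_shards == shard:
                yield item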

@larrylawl
Author

Thanks for your replies!

The order returned by glob.glob uses the order in which entries appear in the filesystem. I think the order is arbitrary but fixed.

At least in my case, the order was the same for processes on the same node, but different across nodes. It may be because I'm running from a Docker container.

@jzhang38
Owner

At least in my case, the order was the same for processes on the same node

I see. Maybe I should sort it to play it safe.

@larrylawl
Author

This bug results in around 35% of data not being loaded into the dataloader and some data may be seen multiple times in a single epoch.

Sorry @jzhang38, can you walk me through how you calculated 35%? I calculated 62% for 16 processes. My intention is not to nitpick (I'm grateful for the codebase!) but to understand the implications.

Here are my calculations:
Assume we have 16 processes (2 nodes * 8 cards).
Let X be the event that a particular data file is seen by only one process.
Let Y be the event that a particular data file is not seen by another given process.
Let Z be the event that a particular data file is seen by more than one process (i.e. duplicated).
P(X) = P(Y) ** 15 = (15/16) ** 15 ≈ 0.38 (15/16 because each process looks at a 1/16 partition of its own shuffled list, which has a 15/16 probability of not containing that particular file)
P(Z) = 1 - P(X) ≈ 0.62
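
For reference, computing both quantities under the (approximate) assumption that each process independently takes a uniform 1/16 slice:

# Probability that a particular file lands in none of the 16 partitions
p_never_loaded = (15 / 16) ** 16   # ~0.356
# Probability that, given one process has the file, no other process also has it
p_unique = (15 / 16) ** 15         # ~0.380
p_duplicated = 1 - p_unique        # ~0.620
print(p_never_loaded, p_unique, p_duplicated)

If the 35% was computed as the chance that a file is never loaded at all, that would explain the gap between our numbers, since my 62% is a duplication probability rather than a loss rate.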

@gabrielgrant

good catch @larrylawl -- thanks for surfacing this!

@jzhang38 would you be open to releasing the upcoming 1.5T checkpoint even in spite of this bug (perhaps under a different name to indicate the duplicated/unused partitions), before restarting training from the 1T checkpoint (which it seems you've decided on as your course of action, according to the Notion update)?

It would be very interesting to be able to compare the two versions of the same model, trained to the same point, with and without this difference in the data used, to observe how performance differs.

(Also, in spite of this bug, I imagine the with-bug 1.5T checkpoint will still improve performance over the previous 1T checkpoint, which would be useful for those of us experimenting with the model today.)

@MFajcik

MFajcik commented Oct 23, 2023


@gabrielgrant
Actually, the "wrong" 1.5T checkpoint is here...
https://huggingface.co/TinyLlama/tinyLlama-intermediate-checkpoints/tree/step-720k-token-1510B

Be sure to pass the correct revision parameter to your class-method from_pretrained.
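
For example, with transformers (the revision name is taken from the branch in the URL above; I'm assuming the checkpoint repo loads with the standard auto classes and ships tokenizer files, otherwise grab the tokenizer from the main TinyLlama repo):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TinyLlama/tinyLlama-intermediate-checkpoints"
revision = "step-720k-token-1510B"  # the ~1.5T-token checkpoint linked above

tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo, revision=revision)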

I actually did a tiny evaluation myself on HellaSwag, and there is about a 1% improvement.

1T vs 1.5T ckpt:

  TinyLLAMA 1.1B - 1T- HellaSwag - Eval metrics/hellaswag/10-shot/InContextLearningMultipleChoiceAccuracy: 0.5245
  TinyLLAMA 1.1B - steps-1510B (1.5T)- HellaSwag - Eval metrics/hellaswag/10-shot/InContextLearningMultipleChoiceAccuracy: 0.5357

@gabrielgrant

thanks for the info @MFajcik !
