
Possible bug in TinyLlama's dataloading #67

Closed
larrylawl opened this issue Oct 17, 2023 · 9 comments

Comments

@larrylawl

Hi,

Can I check why you shuffled the filenames here: https://github.com/jzhang38/TinyLlama/blob/main/pretrain/tinyllama.py#L308?


To my understanding, shuffling results in different nodes taking partitions of different lists of filenames. Hence, a filename might be processed twice by different nodes (e.g. *0.bin in the diagram below), which implies another filename is left out.

[diagram omitted]

To my understanding, what we want is for different nodes to take partitions of the same list of filenames, so that every filename is included.

[diagram omitted]

I've verified the above by checking whether a particular filename (train_slimpajama_214_0000000275.bin in my case) exists in each node's list of filenames. In my run, this filename is read by different nodes.

[screenshot omitted]

After my fix (see below), this file name is only read by one node.

filenames = sorted(glob.glob(str(data_dir / f"{prefix}*")))
# random.seed(seed)
# random.shuffle(filenames)
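
To make this concrete, here is a minimal standalone simulation of the issue (the filenames, rank count, and seeds are made up for illustration; this is not the actual TinyLlama code). Each rank shuffles its own copy of the list with a different seed and then takes its 1/world_size slice:

import random

filenames = [f"train_{i:02d}.bin" for i in range(8)]  # hypothetical shard names
world_size = 4                                         # hypothetical number of ranks

def partition_for_rank(rank, seed):
    files = list(filenames)
    random.seed(seed)        # per-rank seed -> per-rank ordering
    random.shuffle(files)
    per_rank = len(files) // world_size
    return files[rank * per_rank:(rank + 1) * per_rank]

# Buggy behaviour: every rank shuffles with a different seed before slicing.
seen = [f for rank in range(world_size) for f in partition_for_rank(rank, seed=rank)]

missed = set(filenames) - set(seen)
duplicated = {f for f in seen if seen.count(f) > 1}
print("missed:", sorted(missed))          # typically non-empty: some shards are never read
print("duplicated:", sorted(duplicated))  # typically non-empty: some shards are read twice
# With one shared ordering (sorted filenames, or one seed shared by all ranks),
# the slices are disjoint and every shard is read exactly once.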

Is my understanding correct? Many thanks!

@jzhang38
Owner

Hi Larry,

Yes, you are right.

We added a shuffle operation there because starcoderdata is not shuffled. If we do not shuffle the filenames, data from one coding language may appear together during training. We thought the seeds for the different workers were the same when we added that shuffle, but as you pointed out, they are not.

This bug results in around 35% of data not being loaded into the dataloader and some data may be seen multiple times in a single epoch. I have updated the code to fix this bug. Unfortunately, we cannot afford a completely fresh run from scratch. What we opted for instead is to manually kill the process, fix the code, and resume training from around the 1.5T-token checkpoint.

We will discuss this issue in detail in our upcoming technical report. Thanks a million for pointing that out!

@larrylawl
Author

larrylawl commented Oct 18, 2023

Hi @jzhang38, thanks for confirming! Given that TinyLlama-1b is performing quite well, I'm hopeful that the results will improve further after this fix :)

We added a shuffle operation there because starcoderdata is not shuffled. If we do not shuffle the filenames, data from one coding language may appear together during training.

Hmm, but the line below ensures that the filenames are shuffled right?

shuffle=True,

I have updated the code to fix this bug.

Hmm, can I check why you did not sort the glob (i.e. filenames = sorted(glob.glob(str(data_dir / f"{prefix}*"))))? glob.glob's return order is arbitrary, so the list of filenames will still differ across processes. I think we need to sort it to ensure the list of filenames is the same across processes.

filenames = glob.glob(str(data_dir / f"{prefix}*"))
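
For reference, here's a rough sketch of what I have in mind (data_dir and prefix as in the snippet above; the seed value is just an example): sort first so every rank starts from the same list, then shuffle with a seed shared by all ranks, so starcoderdata is still interleaved but the per-rank partitions stay disjoint:

import glob
import random

# identical, deterministic list on every rank
filenames = sorted(glob.glob(str(data_dir / f"{prefix}*")))
# shuffle with a seed that is the same on every rank (illustrative value),
# so all ranks see the same permutation and the 1/world_size slices are disjoint
rng = random.Random(1234)
rng.shuffle(filenames)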

Separately, there's one other issue with the lit_gpt code which I surfaced here. Any ideas if I'm correct?

@jzhang38
Owner

jzhang38 commented Oct 18, 2023

Hmm, but the line below ensures that the filenames are shuffled right?

self._block_idxs = self._rng.permutation(n_all_blocks) if self._shuffle else range(n_all_blocks)

The shuffle toggle there only shuffles the sequences within a single chunk.

can I check why you did not sort the glob (i.e. filenames = sorted(glob.glob(str(data_dir / f"{prefix}*"))))

The order returned by glob.glob uses the order in which entries appear in the filesystem. I think the order is arbitrary but fixed.

Separately, there's one other issue with the lit_gpt code which I surfaced Lightning-AI/litgpt#649. Any ideas if I'm correct?

If you check the source code of fabric.setup_dataloaders:

    def _requires_distributed_sampler(self, dataloader: DataLoader) -> bool:
        return (
            getattr(self.strategy, "distributed_sampler_kwargs", None) is not None
            and not isinstance(dataloader.sampler, DistributedSampler)
            and not has_iterable_dataset(dataloader)
        )

It will not add a distributed sampler to the dataloader if the dataloader wraps an iterable dataset, which is the case for lit-gpt/tinyllama.
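
So with an iterable dataset the sharding has to happen inside the dataset itself rather than via a sampler. A rough sketch of that pattern (illustrative only, not the actual lit-gpt code; the class and its arguments are made up):

from torch.utils.data import IterableDataset, get_worker_info

class ShardedIterable(IterableDataset):
    # Illustrative only: each (rank, worker) pair yields a disjoint subset of items.
    def __init__(self, items, rank, world_size):
        self.items = items
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        info = get_worker_info()
        num_workers = info.num_workers if info is not None else 1
        worker_id = info.id if info is not None else 0
        # global shard index across both ranks and dataloader workers
        shard = self.rank * num_workers + worker_id
        num_shards = self.world_size * num_workers
        for i, item in enumerate(self.items):
            if i % num_shards == shard:
                yield item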

@larrylawl
Author

Thanks for your replies!

The order returned by glob.glob uses the order in which entries appear in the filesystem. I think the order is arbitrary but fixed.

At least in my case, the order was the same for processes on the same node, but different across nodes. It may be because I'm running from a Docker container.

@jzhang38
Owner

At least in my case, the order was the same for processes on the same node

I see. Maybe I should sort it to play it safe.

@larrylawl
Author

This bug results in around 35% of data not being loaded into the dataloader and some data may be seen multiple times in a single epoch.

Sorry @jzhang38, can you walk me through how you calculated 35%? I calculated 62% for 16 processes. My intention is not to nitpick (I'm grateful for the codebase!) but to understand the implications.

Here are my calculations:
Assume we have 16 processes (2 nodes * 8 cards).
Let X be the event that a particular data file is seen by only one process.
Let Y be the event that a particular data file is not seen by another given process.
Let Z be the event that a particular data file is seen by more than one process (i.e. duplicated).
P(X) = P(Y) ** 15 = (15/16) ** 15 ≈ 0.38 (15/16 because each process looks at a 1/16 partition of its own shuffled list, which has a 15/16 probability of not containing that particular file)
P(Z) = 1 - P(X) ≈ 0.62
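
For reference, computing both quantities under the (approximate) assumption that each process independently takes a uniform 1/16 slice:

# Probability that a particular file lands in none of the 16 partitions
p_never_loaded = (15 / 16) ** 16   # ~0.356
# Probability that, given one process has the file, no other process also has it
p_unique = (15 / 16) ** 15         # ~0.380
p_duplicated = 1 - p_unique        # ~0.620
print(p_never_loaded, p_unique, p_duplicated)

If the 35% was computed as the chance that a file is never loaded at all, that would explain the gap between our numbers, since my 62% is a duplication probability rather than a loss rate.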

@gabrielgrant

good catch @larrylawl -- thanks for surfacing this!

@jzhang38 would you be open to releasing the upcoming 1.5T checkpoint even in spite of this bug (perhaps under a different name to indicate the duplicated/unused partitions), before restarting training from the 1T checkpoint (which it seems you've decided on as your course of action, according to the Notion update)?

It would be very interesting to be able to compare the two versions of the same model, trained to the same point, with and without this difference in the data used, to observe how performance differs.

(Also, in spite of this bug, I imagine the with-bug 1.5T checkpoint will still improve performance over the previous 1T checkpoint, which would be useful for those of us experimenting with the model today.)

@MFajcik

MFajcik commented Oct 23, 2023


@gabrielgrant
Actually, the "wrong" 1.5T checkpoint is here...
https://huggingface.co/TinyLlama/tinyLlama-intermediate-checkpoints/tree/step-720k-token-1510B

Be sure to pass the correct revision parameter to your class-method from_pretrained.
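
For example, with transformers (the revision name is taken from the branch in the URL above; I'm assuming the checkpoint repo loads with the standard auto classes and ships tokenizer files, otherwise grab the tokenizer from the main TinyLlama repo):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TinyLlama/tinyLlama-intermediate-checkpoints"
revision = "step-720k-token-1510B"  # the ~1.5T-token checkpoint linked above

tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo, revision=revision)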

I actually did a tiny evaluation myself on HellaSwag, and there is about a 1% improvement.

1T vs 1.5T ckpt:

  TinyLLAMA 1.1B - 1T- HellaSwag - Eval metrics/hellaswag/10-shot/InContextLearningMultipleChoiceAccuracy: 0.5245
  TinyLLAMA 1.1B - steps-1510B (1.5T)- HellaSwag - Eval metrics/hellaswag/10-shot/InContextLearningMultipleChoiceAccuracy: 0.5357

@gabrielgrant

thanks for the info @MFajcik !
