Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why do we not set the ignore_index of FusedCrossEntropy to bos_id? #83

Closed
larrylawl opened this issue Nov 3, 2023 · 3 comments
Closed

Comments

@larrylawl
Copy link

larrylawl commented Nov 3, 2023

Hi,

Can I check why did you not ignore the loss for the bos token (<s>)?

loss_func = FusedCrossEntropyLoss()

I noticed that the preprocessing causes the remainder of the binary file to be the bos token (<s>).

builder.write_reminder()

Consequently, my model checkpoint (not TinyLlama's) outputs poor qualitative results:

Prompt: "(very long text of 1027 tokens). How many yards longer was the longest passing touchdown than the shortest?"
Output: "<s>ending<s>end<s>ent<s>ended"

Interestingly, my model of previous checkpoint (100b tokens before) performed okay.

I'm trying to fix this by specifying the loss function to ignore the <s> idx (i.e. 1). I think this is a correct fix, but i'm not sure if it fixes the underlying issue (the issue should have plagued our model from the start, why did it only happen at this iter step?).

@jzhang38
Copy link
Owner

jzhang38 commented Nov 3, 2023

I noticed that the preprocessing causes the remainder of the binary file to be the bos token (<s>).

Yes I think you are right here.

num_processes = cpu_count()

If you have 64 CPU cores. prepare_slimpajama.py will initiate 64 processes, each with a PackedDatasetBuilder and call

def prepare_full(

builder.write_reminder()

That means each process will leave a chunk file that is not fully filled with text tokens but rather with some sep tokens.
self._arr.fill(self._sep_token)

f.write(self._arr.tobytes(order="C"))

So we may have 64 files with some sep tokens remaining.

but i'm not sure if it fixes the underlying issue (the issue should have plagued our model from the start, why did it only happen at this iter step?).

I think it is because 64 files with some sep tokens remaining is a relatively small portion compared with the entire pretraining corpus(450k small bin files after processing), especially when you consider the small size of each chunk. So I do not know why your second checkpoint became really bad. Maybe it is just this specific prompt? Does the benchmark performance degrade significantly?

These are some of my preliminary thoughts. Haven't looked very deeply into it yet. Thanks for spotting this out. We will fix it soon. For example, we can opt to not call builder.write_reminder() at all.

@larrylawl
Copy link
Author

larrylawl commented Nov 5, 2023

Thanks for your reply Peiyuan!

I think it is because 64 files with some sep tokens remaining is a relatively small portion compared with the entire pretraining corpus(450k small bin files after processing), especially when you consider the small size of each chunk.

It's true that the % of files with some bos token remaining is relatively small. The chunk size is actually quite big (i.e. 2049 * 1028). This means that once a process loads the "problematic" binary chunk, it'll use this file for the next 1024 iterations.

chunk_size: int = 2049 * 1024,

But you are right that as the % of files is relatively small, it shouldn't affect. I'll let you know if I managed to fix the bug. Thanks for your help anyway!

Maybe it is just this specific prompt? Does the benchmark performance degrade significantly?

It degraded significantly across the instruct-eval benchmark.

@jzhang38
Copy link
Owner

jzhang38 commented Nov 5, 2023

#85

@jzhang38 jzhang38 closed this as completed Nov 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants