Error when reproducing the pre-training #16

I tried to reproduce the pre-training experiment using the command:

But after the first epoch, an error appeared saying:

Can you look into this issue? Thanks!

Comments
The same issue!
I haven't seen this error before. It seems you are randomly hitting an empty batch during the first epoch. Can you do:

and re-run to get a more complete stack trace, then post it here?
For me, the error occurred halfway through pre-training.
Not sure why you are encountering an empty batch. I would recommend adding a check, some print statements, or breakpoints in the HG38 dataset; a sketch of such a check is below.
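For instance, a minimal debugging sketch along those lines, assuming the attribute names and call signature quoted in a later comment in this thread (not verified against the repository source):

```python
# Inside the HG38 dataset's __getitem__ (illustrative; names are
# assumptions taken from this thread, not the actual source file).
seq = self.fasta(chr_name, start, end, max_length=self.max_length,
                 i_shift=shift_idx, return_augs=self.return_augs)
if len(seq) == 0:
    # Log the offending FASTA query so an empty batch can be traced
    # back to a specific interval.
    print(f"Empty sequence for {chr_name}:{start}-{end} "
          f"(shift_idx={shift_idx}, max_length={self.max_length})")
    breakpoint()  # drop into the debugger to inspect the dataset state
```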
Did you solve this problem?
Not yet... For now, I have disabled validation during pre-training.
I just want to report that I have also seen this problem in some runs but not all. I am using two different clusters to test, and it is only happening on the NCSA DELTA system. I have not yet figured out what is different in the runs where I see this problem.
@zhan8855 does it only happen in the validation step?
@leannmlindsey For me, yes, and it seems to happen most often when the training loader and validation loader are running simultaneously.
The same issue! I am on aarch64.
I am facing the same issue unfortunately. It happens during eval. Investigating now...
I have solved this issue. The file is '02_caduceus/src/dataloaders/datasets/hg38_dataset.py'. In this file, within the __getitem__ function, the line seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs) may sometimes return an empty value, even though a result actually exists. A simple retry loop can be used to resolve this issue, for example (see the sketch below):
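A minimal sketch of that retry loop, assuming the call signature quoted above; the retry cap and the error raised on exhaustion are assumptions, not part of the reporter's original fix:

```python
# Retry loop around the flaky FASTA fetch in __getitem__
# (src/dataloaders/datasets/hg38_dataset.py). The call signature is
# taken from the comment above; MAX_RETRIES and the final error are
# assumptions added for illustration.
MAX_RETRIES = 10

seq = self.fasta(chr_name, start, end, max_length=self.max_length,
                 i_shift=shift_idx, return_augs=self.return_augs)
retries = 0
while len(seq) == 0 and retries < MAX_RETRIES:
    # The fetch occasionally comes back empty even though the interval
    # is valid, so simply query again until a non-empty result arrives.
    seq = self.fasta(chr_name, start, end, max_length=self.max_length,
                     i_shift=shift_idx, return_augs=self.return_augs)
    retries += 1

if len(seq) == 0:
    # Give up with a descriptive error instead of emitting an empty batch.
    raise RuntimeError(
        f"Empty sequence for {chr_name}:{start}-{end} "
        f"after {MAX_RETRIES} retries"
    )
```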