AssertionError: Currently only supports dynamic loading from each domain for once. #15
Hi! How large is your dataset? We currently only support using each data point once, and exceeding one epoch of data will cause errors. Supporting multiple epochs would require modifying the StreamingDataset logic.
@xiamengzhou I processed the entire RedPajama-1T dataset according to your README, including tokenizing and sampling. This error occurred at batch=[7/3200]. It seems the loader has already reached epoch 1, and that is what triggers the error.
Hiiii, I am not sure why it is happening here -- I will need to take a closer look at it and will get back to you later. Could you share the configuration you are using, and the number of data points in each domain?
Hi, thanks for your help. I think you are right; there may be something wrong with the data. Although I cannot directly read the .mds files to check the number of data points, I found that the sampled data files are smaller than expected. I'll check carefully what's wrong with the sampled files.
You can use the TextStreamingDataset to load the data and count the number of data points by simply calling `len()` on it.
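A minimal sketch of that count, assuming TextStreamingDataset can be pointed at each domain's local MDS directory (the paths, domain list, and constructor arguments below are illustrative -- check the actual signature in llmshearing/datasets/streaming_dataset.py):

```python
# Count the number of samples in each domain's MDS directory.
# Constructor arguments are illustrative, not the confirmed signature.
from llmshearing.datasets.streaming_dataset import TextStreamingDataset

for domain in ["cc", "github", "book", "wiki", "c4-rp"]:  # illustrative
    dataset = TextStreamingDataset(
        local=f"data/for_prune/{domain}",  # illustrative local MDS path
        max_seq_len=4096,
    )
    print(domain, len(dataset))  # StreamingDataset implements __len__
```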
Thank you for your patient reply. Previously I used the default script you provided without setting the number of sampled tokens, which led to these problems. I'm resampling the data now and waiting to see if it helps; it may take a while.
In addition, the yaml file also configures the data path. When will that be used, or will it be overwritten?
There are two ways to use a fixed data loading proportion! You can either set a static proportion in the data configuration, or keep the dynamic loading callback but stop it from updating the proportion (a sketch of the latter follows). You can refer to the callback function of dynamic loading here: https://github.com/princeton-nlp/LLM-Shearing/blob/main/llmshearing/callbacks/dynamic_loading_callback.py#L32
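A minimal sketch of the second option, assuming the dataset exposes an `update_proportion` method mirroring whatever the dynamic callback calls; the method name and the `eval_end` hook are assumptions, not the repo's confirmed API (see the linked file for the real hook):

```python
from composer.core import Callback, State
from composer.loggers import Logger

class FixedLoadingCallback(Callback):
    """Reapplies a constant per-domain proportion where the dynamic
    callback would recompute one from eval losses. A sketch modeled
    loosely on llmshearing/callbacks/dynamic_loading_callback.py."""

    def __init__(self, proportion):
        self.proportion = proportion  # one weight per domain, summing to 1

    def eval_end(self, state: State, logger: Logger) -> None:
        # `update_proportion` is assumed here; check the linked callback
        # for the method the repo actually calls on the dataset.
        state.train_dataloader.dataset.update_proportion(self.proportion)
```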
Thanks for your help, the code runs smoothly now. But sometimes the loss is NaN. Is this normal?
When the batch does not contain data from a specific domain, the loss for that domain becomes NaN.
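A quick illustration of why an absent domain produces NaN, assuming the per-domain loss is a masked mean over that domain's tokens (a sketch, not the repo's exact code):

```python
import torch

losses = torch.tensor([2.1, 1.8, 2.4])   # per-token losses in a batch
domain_ids = torch.tensor([0, 0, 1])     # domain of each token; domain 2 absent

for d in range(3):
    mask = domain_ids == d
    # For domain 2 the mask selects nothing, so the mean is 0/0 -> nan
    print(d, losses[mask].mean().item())  # 1.95, 2.4, nan
```

Guarding the logging with a `mask.sum() > 0` check avoids propagating the NaN.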
@Longyichen Would you like to share your pruning script?
@Longyichen Have you noticed that c4-rp_weight is 0.9370, which is inconsistent with the numbers reported in the paper?
@lippman1125 Yes, I have a similar problem, but I don't know what causes it. The performance of the model does not seem to suffer much compared to the paper, though. For details, we can ask @xiamengzhou for help.
@Longyichen Because the eval CE loss determines the proportion, while the new proportion only affects the train CE loss. My guess is that if there is a gap between the training samples and the eval samples, it could lead to this problem.
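For context, the update in dynamic_loading_callback.py is, roughly, an exponential reweighting of each domain by how far its eval loss sits above a reference loss, so a domain whose eval loss stays high keeps gaining weight. A sketch of that rule with illustrative numbers (not the repo's exact code):

```python
import numpy as np

def update_proportion(prev, eval_loss, ref_loss):
    # Roughly the dynamic batch loading rule: upweight domains whose
    # eval loss exceeds the reference, then renormalize.
    diff = np.maximum(np.asarray(eval_loss) - np.asarray(ref_loss), 0)
    new = np.asarray(prev) * np.exp(diff)
    return new / new.sum()

# Illustrative numbers: the domain with a persistent 1.5-nat gap
# (think c4-rp here) absorbs most of the weight after one update.
print(update_proportion([0.25, 0.25, 0.25, 0.25],
                        [2.0, 2.1, 1.9, 3.5],
                        [2.0, 2.0, 2.0, 2.0]))
# -> approximately [0.13, 0.15, 0.13, 0.59]
```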
@lippman1125 Have you tried continued pretraining? You could try using the pretraining dataset and evaluation set to see whether the same problem occurs.
@Longyichen Could you share your scripts? I ran into the same problem and have not solved it.
I tried all of the methods above and it still doesn't work. The dataset I'm using is the sample dataset. Could you help explain where the problem might be?
@coderchem Have you been using the data shared on the Google Drive?
When I run on a single node with 8×A100 80G, the AssertionError from the title occurs.
If I delete the line `assert epoch == 0, "Currently only supports dynamic loading from each domain for once."`, it causes another error elsewhere.
If I instead add `if world.is_local_leader and epoch == 0:`, then the SharedMemory in `_attach_work` raises an error.
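The usual trigger for that assert is one domain running out of samples before training finishes, so the loader wraps into a second epoch. A rough sanity check, assuming you know each domain's sample count (e.g. from the `len()` snippet earlier in the thread); all numbers and domain names below are illustrative, and since dynamic loading shifts the proportions over time, treat the estimate as approximate:

```python
# Rough check that no domain is exhausted before training ends.
max_duration_batches = 3200       # e.g. from max_duration in the yaml
global_batch_size = 32            # sequences per optimizer step
proportion = {"cc": 0.67, "github": 0.045, "c4-rp": 0.285}      # illustrative
domain_samples = {"cc": 500_000, "github": 40_000, "c4-rp": 20_000}

total_sequences = max_duration_batches * global_batch_size
for domain, p in proportion.items():
    needed = total_sequences * p
    have = domain_samples[domain]
    status = "OK" if have >= needed else "will wrap -> AssertionError"
    print(f"{domain}: need ~{needed:.0f}, have {have} ({status})")
```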