Is it possible/is there a plan to enable continued pretraining? #1547
Comments
Hi @oligiles0, you can actually use run_lm_finetuning.py for this.
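For anyone looking for a concrete starting point, below is a minimal sketch of what continued (domain) pretraining can look like with the library's `Trainer` API rather than the example script. It assumes a reasonably recent `transformers` release; the checkpoint name, file path, block size, and hyperparameters are placeholders, and the masked-LM objective is just one choice (a causal-LM model plus the matching collator would work the same way).

```python
# Minimal sketch: continue pretraining a general-domain checkpoint on a domain corpus
# with the masked-LM objective. Names/paths are illustrative, not prescriptive.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # any general-domain checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One plain-text file holding the domain corpus (path is a placeholder).
dataset = TextDataset(tokenizer=tokenizer, file_path="domain_corpus.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-pretrained",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```

The resulting checkpoint in `output_dir` can then be loaded like any other pretrained model for task-specific fine-tuning.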
Thanks very much @enzoampil. Is there a reason this uses a single text file as opposed to taking a folder of text files? I wouldn't want to combine multiple documents because some chunks will then cross documents and interfere with training, but I also wouldn't want to rerun the script for individual documents.
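As a rough illustration of the per-file workaround being discussed, one option (assuming `transformers`' `TextDataset` and PyTorch's `ConcatDataset`; the folder path and block size are placeholders) is to tokenize each file separately, so no training block ever spans two documents, and then concatenate the resulting datasets:

```python
# Sketch: build one dataset per document so blocks never cross document boundaries,
# then concatenate them into a single training dataset.
from pathlib import Path

from torch.utils.data import ConcatDataset
from transformers import AutoTokenizer, TextDataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

per_file_datasets = [
    TextDataset(tokenizer=tokenizer, file_path=str(path), block_size=128)
    for path in sorted(Path("domain_corpus/").glob("*.txt"))
]
train_dataset = ConcatDataset(per_file_datasets)
```

The concatenated dataset can be passed as `train_dataset` to the same `Trainer` setup sketched above.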
Please check #1896 (comment)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🚀 Feature
A standardised interface for further pretraining the various Transformer models, with standardised expectations for how the training data should be formatted.
Motivation
To achieve state of the art within a given domain, it is not sufficient to take models pretrained on non-specific literature (Wikipedia, books, etc.). The ideal situation would be to leverage all the compute already put into that pretraining and then train further on domain literature before fine-tuning on a specific task. The great strength of this library is its standard interface for using new SOTA models, and it would be very helpful if this were extended to include further pretraining, to help rapidly push domain SOTAs.