
Is it possible/is there a plan to enable continued pretraining? #1547

Closed
oligiles0 opened this issue Oct 17, 2019 · 4 comments

@oligiles0

🚀 Feature

A standardised interface for continuing the pretraining of the various Transformer models, with standardised expectations for how the training data should be formatted.

Motivation

To achieve state of the art within a given domain, it is not sufficient to take models pretrained on non-specific literature (Wikipedia, books, etc.). The ideal situation would be to leverage all the compute already put into that pretraining and then train further on domain literature before fine-tuning on a specific task. The great strength of this library is its standard interface to new SOTA models, and it would be very helpful if this were extended to cover further pretraining, to help rapidly push domain SOTAs.

@enzoampil
Contributor

enzoampil commented Oct 18, 2019

Hi @oligiles0, you can actually use run_lm_finetuning.py for this. You can find more details in the "RoBERTa/BERT and masked language modeling" section of the README.
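
For readers landing here later, here is a minimal sketch of continued masked-LM pretraining with the library's Python API. It uses the Trainer utilities from later releases rather than the run_lm_finetuning.py script itself, and the checkpoint name, the domain_corpus.txt path, and the hyperparameters are illustrative assumptions, not values taken from the script:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Start from a published checkpoint so its general-domain pretraining is reused.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# domain_corpus.txt is a placeholder: raw in-domain text, one passage per line.
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="domain_corpus.txt", block_size=128
)

# Dynamic masking with the same MLM objective used for the original pretraining.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="roberta-domain",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()

# The saved checkpoint can then be loaded with from_pretrained()
# for fine-tuning on the downstream task.
trainer.save_model("roberta-domain")
```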

@oligiles0
Author

oligiles0 commented Oct 21, 2019

> Hi @oligiles0, you can actually use run_lm_finetuning.py for this. You can find more details in the "RoBERTa/BERT and masked language modeling" section of the README.

Thanks very much, @enzoampil. Is there a reason this uses a single text file rather than a folder of text files? I wouldn't want to combine multiple documents, because some chunks would then cross document boundaries and interfere with training, but I also wouldn't want to rerun the script for each individual document.
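
For anyone with the same concern, one possible workaround is to build the training examples per document yourself and feed them to whatever training loop you use, so no block ever spans two files. The sketch below assumes each document lives in its own .txt file; the class name, folder layout, and checkpoint are hypothetical and not part of the script:

```python
import glob

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class PerDocumentBlockDataset(Dataset):
    """Fixed-length training blocks from a folder of .txt files, never crossing a file boundary."""

    def __init__(self, folder, tokenizer, block_size=128):
        self.examples = []
        for path in sorted(glob.glob(f"{folder}/*.txt")):
            with open(path, encoding="utf-8") as f:
                token_ids = tokenizer.encode(f.read(), add_special_tokens=True)
            # Chunk each document on its own; a leftover tail shorter than
            # block_size is dropped instead of being merged with the next file.
            for start in range(0, len(token_ids) - block_size + 1, block_size):
                self.examples.append(torch.tensor(token_ids[start : start + block_size]))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


# Usage (folder name and checkpoint are placeholders); the resulting dataset
# can be passed to a masked-LM training loop together with an MLM data collator.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
dataset = PerDocumentBlockDataset("domain_docs", tokenizer, block_size=128)
```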

@iedmrc
Contributor

iedmrc commented Dec 4, 2019

> Thanks very much, @enzoampil. Is there a reason this uses a single text file rather than a folder of text files? I wouldn't want to combine multiple documents, because some chunks would then cross document boundaries and interfere with training, but I also wouldn't want to rerun the script for each individual document.

Please check #1896 (comment)

@stale

stale bot commented Feb 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
