Support for pre-training the language model #56

Closed
elyase opened this issue Jul 20, 2018 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

elyase commented Jul 20, 2018

Is your feature request related to a problem? Please describe.
In order to use the classifier on different languages or specific domains, it would be useful to be able to pretrain the language model.

Describe the solution you'd like
Calling .fit on a corpus (i.e. with no labels) should train the language model.

model.fit(corpus)

Describe alternatives you've considered
Using the original repo, which doesn't have a simple-to-use interface.

madisonmay self-assigned this Jul 20, 2018
madisonmay added the enhancement label Jul 20, 2018
madisonmay (Contributor)

Thanks for the ticket @elyase!

I have support for language model pretraining up in a branch now: https://github.com/IndicoDataSolutions/finetune/tree/madison/lm-pretraining.

It wasn't too hard to add; we actually already had the interface support for this, but unintentionally stopped supporting it during a past refactor.

I still need to put together some documentation, but it works as you've requested, so for now you can work directly off of that branch if you'd like.

madisonmay (Contributor)

@elyase

Thought about this a bit more: the PR we have in will not allow you to train on different languages, but it will give you some benefit in the scenario where you have a lot of unlabeled data for a specific English domain and a limited amount of labeled training data. Just wanted to clarify which half of your issue we would be able to resolve.
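
For that English-domain scenario, a rough sketch of the workflow (assuming the .fit-with-no-labels interface requested above; the data variables are placeholders):

from finetune import Classifier

model = Classifier()

# Finetune the language model on plenty of unlabeled in-domain text (no labels passed).
model.fit(unlabeled_domain_texts)

# Then train the classifier on the small labeled set.
model.fit(labeled_texts, labels)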

elyase (Author) commented Jul 23, 2018

@madisonmay, thanks a lot, great work. Is this line:

self.nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])

the only missing piece needed to train on, let's say, German, or is there something else missing?

madisonmay (Contributor)

@elyase there are a few pieces that would need to be modified in order to support a new language.

One of those changes would be swapping out the English tokenizer for a German tokenizer. You can find a German tokenizer as part of spaCy, so that portion should be relatively straightforward.
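
For the tokenizer swap, a minimal sketch of the kind of change involved (assuming spaCy's German model has been installed via python -m spacy download de; this is illustrative, not the library's actual code):

import spacy

# Same call as the English version above, but loading the German pipeline instead.
nlp = spacy.load('de', disable=['parser', 'tagger', 'ner', 'textcat'])
tokens = [token.text for token in nlp("Das ist ein Beispielsatz.")]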

Secondly, the byte-pair encoder (which is used to decide how to split words into subword pieces so there's a useful fallback for out-of-vocabulary words) was "fit" on English text. This means that the current pretrained model's vocabulary primarily contains English text and uses word pieces that are based on English frequencies.
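
To make the byte-pair piece concrete, here's a toy sketch of how a BPE vocabulary is "fit" (repeatedly merging the most frequent adjacent symbol pair); it's illustrative only, not the code behind the pretrained model's vocabulary:

from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Start with each word as a sequence of characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it appears.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

# Merges learned from English words reflect English subword frequencies,
# which is why the encoder would need to be re-fit for German text.
print(learn_bpe_merges(["lower", "lowest", "newer", "newest"], 5))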

Finally, there would be some minimal changes to make in order to ensure that you're starting from randomly initialized weights rather than the weights learned by the English model.

There might be some other required changes that I'm overlooking but that's what comes to mind right now.

Know that training a language model from scratch on a new language will be a pretty big computational investment -- think along the lines of 4-8 GPUs + a week of training time.

xuy2 commented Aug 2, 2018

Hi, I cannot open the branch for pretraining the language model anymore. Can you kindly tell me how I can find it?

benleetownsend (Contributor)

@xuy2 This code is merged into master now.

madisonmay (Contributor) commented Aug 6, 2018

Closing this issue, as finetuning only the language model is now fully supported on the master branch as of #58. Thanks again for the feature request / bug report @xuy2! Feel free to open another issue if there's something else we can help out with.

xuy2 commented Aug 6, 2018

@madisonmay @benleetownsend I have a question about the pre-trained model. The paper says it uses "randomly sampled, contiguous sequences of 512 tokens" for pre-training. Does that mean padding a sentence out to 512 tokens, or randomly choosing 512 contiguous tokens from an article?

madisonmay (Contributor) commented Aug 7, 2018

It means the latter -- randomly choosing 512 contiguous tokens from an article. A random slice of text.
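
Roughly, something like this (a sketch, assuming the article has already been tokenized into a list):

import random

def random_contiguous_slice(tokens, length=512):
    # Pick a random start offset and take the next 512 tokens.
    if len(tokens) <= length:
        return tokens  # shorter articles are the < 512 token case discussed below
    start = random.randint(0, len(tokens) - length)
    return tokens[start:start + length]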

madisonmay (Contributor)

@xuy2 that's actually a valid point -- I'm not sure if the model ever received inputs of < 512 tokens at train time.

xuy2 commented Aug 7, 2018

@madisonmay Thank you! With the latter method, it seems that we can only train the model by batches rather than by epochs. How can we evaluate the model's performance on test data? If we use random slices, it's hard to guarantee that the whole test set is evaluated.
