Support for pre-training the language model #56

Closed
elyase opened this issue Jul 20, 2018 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

elyase commented Jul 20, 2018

Is your feature request related to a problem? Please describe.
In order to use the classifier on different languages or specific domains, it would be useful to be able to pretrain the language model.

Describe the solution you'd like
Calling .fit on a corpus (i.e. with no labels) should train the language model.

model.fit(corpus)

Describe alternatives you've considered
Using the original repo, which doesn't have a simple-to-use interface.

madisonmay self-assigned this Jul 20, 2018
madisonmay added the enhancement label Jul 20, 2018
madisonmay (Contributor)

Thanks for the ticket @elyase!

I have support for language model pretraining up in a branch now: https://github.com/IndicoDataSolutions/finetune/tree/madison/lm-pretraining.

It wasn't too hard to add; we actually already had the interface support for this, but unintentionally stopped supporting it during a past refactor.

I still need to put together some documentation, but it works as you've requested, so for now you can work directly off of that branch if you'd like.

madisonmay (Contributor)

@elyase

Thought about this a bit more: the PR we have in will not allow you to train on different languages, but it will give you some benefit in the scenario where you have a lot of unlabeled data for a specific English domain and a limited amount of labeled training data. Just wanted to clarify which half of your issue we would be able to resolve.
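
For that English-domain scenario, a rough sketch of the workflow (assuming the .fit-with-no-labels interface requested above; the data variables are placeholders):

from finetune import Classifier

model = Classifier()

# Finetune the language model on plenty of unlabeled in-domain text (no labels passed).
model.fit(unlabeled_domain_texts)

# Then train the classifier on the small labeled set.
model.fit(labeled_texts, labels)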

elyase (Author) commented Jul 23, 2018

@madisonmay, thanks a lot, great work. Is this line:

self.nlp = spacy.load('en', disable=['parser', 'tagger', 'ner', 'textcat'])

the only missing piece needed to train on, let's say, German, or is there something else missing?

madisonmay (Contributor)

@elyase there are a few pieces that would need to be modified in order to support a new language.

One of those changes would be swapping out the English tokenizer for a German tokenizer. You can find a German tokenizer as part of spaCy, so that portion should be relatively straightforward.
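
For the tokenizer swap, a minimal sketch of the kind of change involved (assuming spaCy's German model has been installed via python -m spacy download de; this is illustrative, not the library's actual code):

import spacy

# Same call as the English version above, but loading the German pipeline instead.
nlp = spacy.load('de', disable=['parser', 'tagger', 'ner', 'textcat'])
tokens = [token.text for token in nlp("Das ist ein Beispielsatz.")]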

Secondly, the byte-pair encoder (which is used to decide how to split words into subword pieces so there's a useful fallback for out-of-vocabulary words) was "fit" on English text. This means that the current pretrained model's vocabulary primarily contains English text and uses word pieces that are based on English frequencies.
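
To make the byte-pair piece concrete, here's a toy sketch of how a BPE vocabulary is "fit" (repeatedly merging the most frequent adjacent symbol pair); it's illustrative only, not the code behind the pretrained model's vocabulary:

from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Start with each word as a sequence of characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it appears.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

# Merges learned from English words reflect English subword frequencies,
# which is why the encoder would need to be re-fit for German text.
print(learn_bpe_merges(["lower", "lowest", "newer", "newest"], 5))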

Finally, there would be some minimal changes to make in order to ensure that you're starting from randomly initialized weights rather than the weights learned by the English model.

There might be some other required changes that I'm overlooking but that's what comes to mind right now.

Know that training a language model from scratch on a new language will be a pretty big computational investment -- think along the lines of 4-8 GPUs + a week of training time.

xuy2 commented Aug 2, 2018

Hi, I cannot open the branch for pretraining the language model anymore. Can you kindly tell me how I can find it?

benleetownsend (Contributor)

@xuy2 This code is merged into master now.

madisonmay (Contributor) commented Aug 6, 2018

Closing this issue, as finetuning only the language model is now fully supported on the master branch as of #58. Thanks again for the feature request / bug report @xuy2! Feel free to open another issue if there's something else we can help out with.

xuy2 commented Aug 6, 2018

@madisonmay @benleetownsend I have a question about the pre-trained model. The paper says it uses "randomly sampled, contiguous sequences of 512 tokens" for pre-training. Does that mean padding a sentence out to 512 tokens, or randomly choosing 512 contiguous tokens from an article?

madisonmay (Contributor) commented Aug 7, 2018

It means the latter -- randomly choosing 512 contiguous tokens from an article. A random slice of text.
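
Roughly, something like this (a sketch, assuming the article has already been tokenized into a list):

import random

def random_contiguous_slice(tokens, length=512):
    # Pick a random start offset and take the next 512 tokens.
    if len(tokens) <= length:
        return tokens  # shorter articles are the < 512 token case discussed below
    start = random.randint(0, len(tokens) - length)
    return tokens[start:start + length]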

madisonmay (Contributor)

@xuy2 that's actually a valid point -- I'm not sure if the model ever received inputs of < 512 tokens at train time.

xuy2 commented Aug 7, 2018

@madisonmay Thank you! With the latter method, it seems that we can only train the model by batches rather than by epochs. How can we evaluate the model's performance on test data? If we use random slices, it's hard to guarantee that the whole test set is evaluated.
