
Tweak hyper params so that lstm on IMDB trains well #10

Closed
PiotrCzapla opened this issue Nov 15, 2018 · 40 comments

@PiotrCzapla (Member)

The hyperparameters selected in train_clas don't work that well :/
The pretrained LM was a toy example trained on wikitext-2, but I guess we will have similar issues with regular models.

python -m ulmfit.train_clas --data_dir data --model_dir data/en/wikitext-2/models  --pretrain_name=wt-2-q
Dataset: imdb. Language: en.
Loading the pickled data...
Train size: 22500. Valid size: 2500. Test size: 25000.
Fine-tuning the language model...
epoch  train_loss  valid_loss  accuracy
1      4.716092    4.573514    0.258413
2      4.632831    4.475942    0.267385
Starting classifier training
epoch  train_loss  valid_loss  accuracy
1      0.553358    0.460404    0.789200
epoch  train_loss  valid_loss  accuracy
1      0.450415    0.323716    0.862000
epoch  train_loss  valid_loss  accuracy
1      0.353984    0.256489    0.895200
2      0.498414    0.364469    0.855200  <--------- either too large an LR or some kind of exploding gradients?
3      0.692660    0.687355    0.514800  <----------

@NirantK, have these hyperparameters worked in your experiments?

@sebastianruder (Collaborator)

Good point. Those were just for testing that the script works. I didn't manage to get around to tuning the one-cycle policy for classification. I'd imagine hyperparameters similar to those used for training the language model might work.
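
For reference, a minimal sketch of what that could look like with the fastai v1 text API: gradual unfreezing plus one-cycle for the classifier, mirroring the LM fine-tuning schedule. The learning rates below are placeholders to be found with lr_find, not tuned values.

lr = 1e-2  # placeholder; pick via learn.lr_find()

learn.freeze()                 # train only the classifier head first
learn.fit_one_cycle(1, lr, moms=(0.8, 0.7))

learn.freeze_to(-2)            # also unfreeze the last RNN layer
learn.fit_one_cycle(1, slice(lr / (2.6 ** 4), lr), moms=(0.8, 0.7))

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(lr / 2 / (2.6 ** 4), lr / 2), moms=(0.8, 0.7))

learn.unfreeze()               # finally fine-tune the whole model
learn.fit_one_cycle(2, slice(lr / 10 / (2.6 ** 4), lr / 10), moms=(0.8, 0.7))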

@NirantK (Contributor)

NirantK commented Nov 16, 2018

No, they don't work; mostly this shows up as a lot of variance. I don't see exploding gradients like you do above, though.
I use a much smaller batch size (bs=50 or 70).

If I have some spare compute, I'll try the hyperparams from LM training as suggested by @sebastianruder in the previous comment.

@PiotrCzapla added the bug label on Nov 16, 2018
@PiotrCzapla (Member, Author)

PiotrCzapla commented Nov 16, 2018

I've changed train_clas to use the hyperparameters from the lesson 3 notebook. We are getting 0.89 instead of 0.94.
Here is how it trains now:

python -m ulmfit.train_clas --data_dir data --model_dir data/wiki/wikitext-103/models --pretrain_name=wt-103 --qrnn=True --name 'lm-ft'
Dataset: imdb. Language: en.
Using QRNNs...
Loading the pickled data...
Train size: 22500. Valid size: 2500. Test size: 25000.
Fine-tuning the language model...
epoch  train_loss  valid_loss  accuracy
1      5.866822    5.248559    0.201249
epoch  train_loss  valid_loss  accuracy
1      5.111968    4.840785    0.228520
2      4.809690    4.604379    0.252575
3      4.701472    4.467322    0.265213
4      4.589730    4.377279    0.274301
5      4.508160    4.316603    0.280446
6      4.426814    4.273273    0.284935
7      4.373527    4.241864    0.288155
8      4.352810    4.222559    0.290361
9      4.322117    4.212354    0.291765
10     4.344737    4.210043    0.292047
Starting classifier training
epoch  train_loss  valid_loss  accuracy
1      0.520813    0.364728    0.862000
epoch  train_loss  valid_loss  accuracy
1      0.464576    0.318856    0.872800
epoch  train_loss  valid_loss  accuracy
1      0.431815    0.290399    0.886000
epoch  train_loss  valid_loss  accuracy
1      0.390448    0.272815    0.892000
2      0.406103    0.296245    0.890000

@PiotrCzapla (Member, Author)

The tokenization is adapted automatically :/, but the dropout values are different, and the LM fetched from fastai seems to be better (trained for longer, perhaps).

@sebastianruder (Collaborator)

I'll run some experiments on this. One other factor: As I mentioned here, the model in the paper was trained on the full training set of 25k reviews.

@PiotrCzapla (Member, Author)

PiotrCzapla commented Nov 17, 2018

1      4.706873    4.410843    0.260319      # 1      4.591534    4.429290    0.251909  (12:42)
epoch  train_loss  valid_loss  accuracy
1      4.455386    4.312750    0.270096
2      4.321627    4.221711    0.279955
3      4.251776    4.153905    0.286282
4      4.142951    4.091608    0.292828
5      4.017664    4.047887    0.296991
6      3.981570    4.018512    0.300651
7      3.895004    3.998859    0.302972     # 7      3.947437    3.991286    0.300850  (14:14)
8      3.827237    3.987072    0.304241     # 8      3.897383    3.977569    0.302463  (14:15) 
9      3.792438    3.985511    0.304870     # 9      3.866736    3.972447    0.303147  (14:14) 
10     3.775744    3.986526    0.304800     # 10     3.847952    3.972852    0.303105  (14:15)
Starting classifier training
epoch  train_loss  valid_loss  accuracy
1      0.497570    0.365657    0.836800
epoch  train_loss  valid_loss  accuracy
1      0.401359    0.291795    0.876400
epoch  train_loss  valid_loss  accuracy
1      0.294043    0.241850    0.910800
epoch  train_loss  valid_loss  accuracy
and then OOM, but we are on the right track

We had a large dropout multiplier set (the default value). I've just committed the fixes to the BILM branch.

@PiotrCzapla (Member, Author)

@sebastianruder, do you know why we set learn.true_wd = False in pretrain_lm? If I'm not mistaken, this turns on L2 regularization, which does not work well with Adam.

The default weight decay will be wd, which will be handled using the method from Fixing Weight Decay Regularization in Adam if true_wd is set (otherwise it's L2 regularization).
https://docs.fast.ai/basic_train
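
For intuition, an illustrative sketch (not fastai's internal code) of the difference between the two modes:

import torch

param = torch.randn(10)
grad = torch.randn(10)
lr, wd = 1e-3, 0.01

# true_wd=False: classic L2 regularization. The decay is folded into the
# gradient, so Adam's adaptive scaling also rescales the decay term.
l2_grad = grad + wd * param
param_l2 = param - lr * l2_grad           # (Adam's adaptive terms omitted)

# true_wd=True: decoupled weight decay ("Fixing Weight Decay Regularization
# in Adam"). The decay is applied to the weights directly, independently of
# the gradient-based step.
param_decoupled = param - lr * wd * param - lr * grad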

@PiotrCzapla (Member, Author)

PiotrCzapla commented Nov 17, 2018

After I resumed training, we got:

Starting classifier training
epoch  train_loss  valid_loss  accuracy
1      0.409820    0.316805    0.868400            # 0.294225    0.210385    0.918960 
epoch  train_loss  valid_loss  accuracy
1      0.381355    0.275650    0.892800            # 0.268781    0.180993    0.930760  (03:03)
epoch  train_loss  valid_loss  accuracy
1      0.309496    0.229822    0.910800            # 0.211133    0.161494    0.941280  (04:06)
epoch  train_loss  valid_loss  accuracy
1      0.269082    0.221403    0.917200             #  0.188145    0.155038    0.942480  (05:00)
2      0.272700    0.217910    0.920400             #  0.159475    0.153531    0.944040  (05:01)
Saving models at data/wiki/wikitext-103/models
accuracy: tensor(0.9204)

It is still not the 0.944 accuracy that Jeremy got in lesson 3, nor the 0.954 accuracy that you got when the paper was published, but we are getting closer.

Some guesses as to why we might have lower performance:

  • No unk preprocessing; my LM was trained directly on wikitext-103
  • Training on only part of the training set
  • Undertrained LM (it was only trained for 10 epochs on wikitext)
  • No discriminative learning rates
  • The Moses tokenizer might not perform as well as the fastai tokenizer
  • Differences between QRNN and LSTM
  • Not using slices during training of the classifier

@sebastianruder, why don't we split a validation set out of the test set? I understand that the idea is not to overfit to the test set, but that happens anyway by comparing with SOTA.

@sebastianruder (Collaborator)

Re the weight decay: @sgugger had set this option in a script for training with the one cycle policy, but I haven't experimented with its effect yet.
Yes, people are indirectly overfitting to the test set by comparing against SOTA, but we should still never use part of the test set for validation.

@PiotrCzapla (Member, Author)

It seems that true_wd=False improves the accuracy of the language model, from 0.304871 to 0.331256, at least for QRNN on wikitext-103-unk.

@PiotrCzapla (Member, Author)

Training a BiLM helps a lot with the accuracy of the language models, but this does not translate to classifier accuracy when the classifier is a simple concatenation of the outputs of the two RNNs (including max and avg pooling).

@sgugger

sgugger commented Nov 19, 2018

Be careful to adjust the weight decay when turning true_wd off and on. A good value in the case of WT-103 is wd=1e-7 with true_wd=False, and 0.01 with true_wd=True.
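
If it helps, the two pairings as a snippet (assuming a fastai v1 Learner called learn; wd can also be passed to the fit call instead):

# Option A: classic L2 regularization with a tiny decay
learn.true_wd = False
learn.wd = 1e-7

# Option B: decoupled (AdamW-style) weight decay with a larger decay
learn.true_wd = True
learn.wd = 0.01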

@PiotrCzapla (Member, Author)

I had a look through different learning rates and weight decays (I've tested 50 different scenarios). true_wd=False with wd=1e-7 seems to get slightly better results more often. I was training a bidirectional classifier, and the best result I've got so far is 92.5% accuracy, well below the original ULMFiT (the architecture might be flawed). My current best guess is that Moses tokenization is not performing as well as the fastai tokenization; I will run some experiments and keep you posted.
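
A rough sketch of how such a sweep could be scripted; train_classifier is a hypothetical helper that builds the learner, trains it, and returns validation accuracy for one configuration.

import itertools

lrs = [1e-3, 3e-3, 1e-2]                 # candidate learning rates (placeholders)
wds = [(False, 1e-7), (True, 0.01)]      # (true_wd, wd) pairings from above

results = []
for lr, (true_wd, wd) in itertools.product(lrs, wds):
    acc = train_classifier(lr=lr, true_wd=true_wd, wd=wd)  # hypothetical helper
    results.append(((lr, true_wd, wd), acc))

# report the best configurations
for cfg, acc in sorted(results, key=lambda x: -x[1])[:5]:
    print(cfg, acc)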

@sgugger

sgugger commented Nov 21, 2018

Yes, we tried to remove the spacy tokenizer in fastai, but I didn't get any good results with a different tokenizer (92.5% too for our alternative), so that part is important as well.

@PiotrCzapla (Member, Author)

Good to know and thank you for watching this thread closely!

@sebastianruder (Collaborator)

Thanks for commenting, Sylvain. XNLI is already pre-tokenized with Moses, so I don't think using spacy will help much here.

@NirantK (Contributor)

NirantK commented Nov 22, 2018

@sebastianruder XNLI-1.0, the dataset with all the languages, has a header with free text which is not tokenized. We can use the fastai tokenizer (spaCy) there and see if the results improve.

@NirantK (Contributor)

NirantK commented Nov 22, 2018

@PiotrCzapla, could you share your top 5/10 hyperparameter configs? I'd like to try them for Hindi language models as well. Please note whether you were using QRNN or not.

@PiotrCzapla (Member, Author)

I haven't tweaked the hyperparameters yet, just the learning rate; the best configuration that was working is in the train_clas script on the bilm branch. It gets me to 0.923 accuracy on the validation set, using QRNN and the full text for the LM.
I'm cleaning up and will create a pull request so that we can merge this. I have some issues with the new fastai changes, though: the datasets started returning np.arrays of object dtype, and pytorch complains that it can't put them on the GPU :)

@PiotrCzapla (Member, Author)

I've spent the whole day trying to fix small issues with the datasets / dataloader on small datasets. It takes ages, as debugging pytest doesn't work well with PyCharm. I need a break today; I will come back to these issues later.

@aayux (Contributor)

aayux commented Nov 25, 2018

@PiotrCzapla just looked at the classification results you are getting with the new model.

I've had this conversation with some of my colleagues at work who have used the original ULMFiT, and their common complaint has been that while it clearly excels at language modeling, getting results on the final downstream classification requires a lot of careful (and tedious!) tuning of the hyperparameters. This is somewhat alleviated by the fact that training is a lot faster than with most of the other options out there.

Even so, I think there is potential for improvement at the classification stage. Here are some things I have considered but haven't had the reason to experiment with until now:

  1. The "pooling" step in the pooling linear classifier gives an inaccurate representation of the learned language model. Moreover, now that we have a bidirectional LM, a simple concatenation may not be the best-suited way to pass the features (the language model's representation) to the classifier; see the sketch after this list for the current baseline. What a better way to do this would look like is an open question.

  2. The actual classifier itself is also an arbitrary number of fully connected layers decided by a hyper-parameter. To me that seems a little unintuitive (and I could be wrong here), but what is the need for 3 or 4 consecutive linear transformations (spaced with non-linearities, of course)? What I am trying to say is: if we can get the feature representation (the "pooling") right, we should be able to classify with reasonable accuracy with just 1 or 2 fully connected layers, as is the case with most popular convolution-net based architectures.
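
For concreteness, a hedged sketch of the concat-pooling head discussed in point 1: the final hidden state is concatenated with max- and mean-pooled RNN outputs over time and fed through a small MLP. Layer sizes and dropout values are illustrative, not the repo's actual configuration.

import torch
import torch.nn as nn

class ConcatPoolClassifier(nn.Module):
    def __init__(self, hidden_dim: int, n_classes: int, p: float = 0.4):
        super().__init__()
        self.head = nn.Sequential(
            nn.BatchNorm1d(3 * hidden_dim),
            nn.Dropout(p),
            nn.Linear(3 * hidden_dim, 50),
            nn.ReLU(),
            nn.BatchNorm1d(50),
            nn.Dropout(p / 2),
            nn.Linear(50, n_classes),
        )

    def forward(self, rnn_out: torch.Tensor) -> torch.Tensor:
        # rnn_out: (batch, seq_len, hidden_dim) outputs of the encoder
        last = rnn_out[:, -1, :]              # final time step
        max_pool = rnn_out.max(dim=1)[0]      # max over time
        avg_pool = rnn_out.mean(dim=1)        # mean over time
        return self.head(torch.cat([last, max_pool, avg_pool], dim=1))

A biLM variant would produce two such rnn_out tensors (forward and backward); the simple approach discussed above just concatenates their pooled features before the MLP.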

@PiotrCzapla (Member, Author)

PiotrCzapla commented Nov 25, 2018

Re 1: I want to add a dropout after the RNN so that we have a hyperparameter to tune, and I will run experiments. This is why I want to do a bit of refactoring, so that I can easily add new heads and train a lot of classifiers to see which ones work well.
Re 2: I'm not sure why we need so many FC layers; I will make it easy to play with that hyperparameter. Then we can run some experiments together.

@Dust0x do you want to help out here? I can think of the following things to do:

  • fix the way we save and store models and the way we select hyperparameters (that one is on me, as I have it thought out; I'm going to describe the idea here: Improve the way we run experiments - saving, hyper params selection #17)
  • figure out a way to use a single RNN from the biLM to test its classification accuracy; I think it might be the same quality as the larger biRNN classifier
  • go through the papers for classification and XNLI, implement and test multiple heads, and report how they are doing
  • figure out a way to test different tokenisation strategies without waiting 18h for the LM to train

@aayux (Contributor)

aayux commented Nov 25, 2018

fix the way we save and store models and the way we select hyper params

Agreed.

figure out a way to use a single RNN from the biLM to test its classification accuracy; I think it might be the same quality as the larger biRNN classifier

Are you saying you want to test the classification accuracy with only a forward or a backward QRNN? Could you please elaborate more on this?

go through the papers for classification and xnli and implement and test multiple heads and report how they are doing.

This will be important if we want to work with XNLI. What are the ideas you have in mind for the multiple heads?

figure out a way to test different tokenisation strategies, without waiting for 18h for the LM to train

This can just be one of our experiments as we had originally planned. I can't think of a better way here than to just experiment and find out what works and what doesn't, but there might be.

@PiotrCzapla (Member, Author)

Are you saying you want to test the classification accuracy with only a forward or a backward QRNN? Could you please elaborate more on this?

The biLM is two separate LMs, one forward and one backward. When they are trained they reach higher accuracy than a single LM, thanks to the shared embedding weights. So it makes sense to try to use one of them (let's say the forward one) for classification and see if it scores any higher than a classifier built on a normally trained LM.

What are the ideas you have in mind for the multiple heads?

I haven't thought much about it yet. I think @sebastianruder will have the most ideas.

@sebastianruder (Collaborator)

Regarding performance on IMDb, I think the best we can do is leave the architecture as it is and tune the unfreezing and learning rate schedule. Jeremy and I tried a lot of different variants for the pooling and the classifier, so I think it'll be hard to find a combination that performs significantly better.
Regarding XNLI, if simply concatenating the premise and hypothesis and directly feeding them into a classifier doesn't work, we should add a self-attention layer on top of the concatenation. See here for other attention-based approaches for SNLI.
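
As a starting point, a hedged sketch of such a self-attention layer: score each token of the encoded [premise; hypothesis] sequence and pool with the resulting weights before the classifier. The scoring function and sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPool(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)   # one attention score per token

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        # enc: (batch, seq_len, hidden_dim) encoder outputs over the
        # concatenated premise and hypothesis tokens
        attn = F.softmax(self.score(enc).squeeze(-1), dim=1)   # (batch, seq_len)
        return torch.einsum('bs,bsh->bh', attn, enc)           # weighted sum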

@aayux (Contributor)

aayux commented Dec 3, 2018

@sebastianruder a self-attention (or something like a premise-to-hypothesis attention, such as the co-attention module in DRCN) should help. Where would you propose introducing this module? It would probably make sense to do this after the forward and backward LM, before we join the outputs and send them forward to the pooling step. If you agree, then I can take a crack at it.

@PiotrCzapla I finally had some free time today so I took a look at the SNLI approaches from the link that Sebastian shared. Can we start a thread somewhere (maybe in teams) to discuss what everyone is working on, the problems that are open for contribution and the ideas we want to try (or reject)?

@PiotrCzapla (Member, Author)

@Dust0x, that's a perfect place. I wasn't aware that there is something like that, and it seems to be private. We are only missing @sebastianruder; I've sent him an invite already.

@sebastianruder (Collaborator)

Cool. I don't think I've received an invite so far...

@PiotrCzapla (Member, Author)

@sebastianruder, it is an invitation to a team. I've canceled it and added you again.

@sebastianruder (Collaborator)

Ah, thanks! Didn't realize you meant an invite for the org. Accepted it now.

@PiotrCzapla (Member, Author)

I think I've spotted the issue: we don't use the unlabelled IMDB reviews, so our language model is trained on a dataset of 50k reviews while lesson 3 uses 100k. So our accuracy of 92% isn't that bad after all :)

@PiotrCzapla (Member, Author)

The issue is caused by our language models, not the tokenization. I've tested the pretrained model wt103_1 on 3 tokenizations (Moses, Moses + fastai pre/post-processing, fastai (spacy)). On all of them it gets an excellent accuracy of 94+%.
However, our language models don't train properly on imdb, even though they start at a lower cross-entropy loss than the wt103_1 model. In fact, a worse language model trained for only 2 epochs has better downstream results than the one trained for 20 epochs (I haven't finished training, though).

I've compared our code with Sylvain's and noticed a few possible causes:

  • we train on examples that end at a newline, so quite a few of them contain zero to just a few words. So the models possibly have issues remembering longer contexts and/or resetting the context.
  • some of the tokenization mechanisms do not use EOS (fastai does not have that implemented)
  • we are using larger dropout values when we train on wikipedia than the example posted by Sylvain

I think the first cause is the most probable. If so, then we have an issue with our wikipedia articles for other languages, as we don't denote new articles in any meaningful way :/. I will report once I've tested the hypothesis above.

@sebastianruder, in your implementation of the Moses tokenization you replaced BOS with EOS. Can you share your reasoning? Can we settle on BOS + article + EOS?

@sebastianruder (Collaborator)

@PiotrCzapla, thanks for looking into this.

Did I? As far as I'm aware, I only replaced newlines with the EOS symbol. I did not add a BOS symbol at all but intended to do that. BOS + article + EOS should be the standard.
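
A minimal sketch of that convention, wrapping each full article rather than each line; the marker strings and the tokenize callable are assumptions, not the repo's actual names.

BOS, EOS = 'xxbos', 'xxeos'   # assumed marker strings

def article_to_tokens(article: str, tokenize) -> list:
    # Join the article's lines so the LM sees full-article context,
    # then add a single BOS/EOS pair around the whole article.
    text = ' '.join(line.strip() for line in article.splitlines() if line.strip())
    return [BOS] + tokenize(text) + [EOS]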

@PiotrCzapla (Member, Author)

Perfect! Thanks, I will add EOS to the fastai code and BOS to the plain Moses implementation.

Once I started training on full articles and turned off dropout, I got the same accuracy on imdb using LSTM as reported by Sylvain (94%) and slightly worse (93.5%) using QRNN (3 and 4 layers). Most likely the issue was caused by training the LM on short sentences. The tokenization (Moses, Moses+fastai, fastai) is not that important: the fast.ai model wt103_1 can be moved to any tokenization and it still performs well: Moses (94.5%), Moses + fastai preprocessing (94.9%), fastai (94.8%).

All of this is working in the refactor branch, which should be used with the latest fastai (not the ulmfit_multlingual branch).
I haven't updated the Moses and sentencepiece training, so they still use the old split of wikipedia text into lines instead of articles.

@PiotrCzapla self-assigned this on Dec 11, 2018
@sebastianruder (Collaborator)

Awesome! That's great news! :)

@NirantK (Contributor)

NirantK commented Dec 12, 2018

This is fantastic news @PiotrCzapla! Does this mean that we move completely to fastai v1.0 on conda?

@PiotrCzapla (Member, Author)

Not yet, I think fastai is still under active development.

@PiotrCzapla (Member, Author)

I've updated sentencepiece and I'm getting 94.6% on imdb, so we are almost ready to merge the refactoring branch. The only missing tokenization is Moses; I'm going to run some tests next year :)

@sebastianruder (Collaborator)

sebastianruder commented Jan 1, 2019

Awesome! That's fantastic! Happy new year by the way! :)

@PiotrCzapla (Member, Author)

Happy new year! :) I've updated and simplified the Moses tokenization to work on longer articles, and I'm testing it on imdb; I should have results tomorrow. I'm pretty sure it will work fine, so let's close this issue.

btw. I've merged the refactor branch into master.
