Tweak hyper params so that lstm on IMDB trains well #10
Comments
Good point. Those were just for testing that the script works; I didn't get around to tuning the one-cycle policy for classification. I'd imagine hyperparameters similar to those used for training the language model might work.
No, they don't work; it mostly shows up as a lot of variance. I don't see exploding gradients as you do above, though. If I have some spare compute, I'll try the hyperparams from LM training as suggested by @sebastianruder in the previous comment.
I've changed train_clas to use the hyperparameters from the lesson 3 notebook. We are getting 0.89 instead of 0.94.
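For reference, a minimal sketch of the lesson-3-style classifier fine-tuning schedule being referred to, assuming a recent fastai v1 API; `data_clas` and the encoder file name are placeholders, and the learning rates and momenta are the ones from the public lesson 3 IMDB notebook rather than anything verified against `train_clas`:

```python
from fastai.text import *  # provides text_classifier_learner, AWD_LSTM

# Placeholders: data_clas is an existing classification DataBunch,
# 'fine_tuned_enc' a previously saved fine-tuned LM encoder.
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

# Gradual unfreezing with discriminative learning rates (lesson 3 values).
learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))                            # head only
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2), moms=(0.8, 0.7))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3), moms=(0.8, 0.7))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3), moms=(0.8, 0.7))
```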
The tokenization is adopted automatically :/, but the dropout values are different and the LM that is fetched from fastai seems to be better (trained for longer, perhaps).
I'll run some experiments on this. One other factor: As I mentioned here, the model in the paper was trained on the full training set of 25k reviews.
We had a large dropout multiplier set (the default value). I've just committed the fixes to the bilm branch.
@sebastianruder do you know why we set
After resuming training we got:
It is still not the 0.944 accuracy that Jeremy got during lesson 3, and it isn't the 0.954 accuracy that you got when the paper was published, but we are getting closer. Some guesses why we might have lower performance:
@sebastianruder why don't we split a validation set out from the test set? I understand that the idea is not to overfit to the test set, but that happens anyway when we compare against SOTA.
Re the weight decay: @sgugger had set this option in a script for training with the one cycle policy, but I haven't experimented with its effect yet.
It seems that true_wd=False improves the accuracy of a language model.
Training a BiLM helps a lot with the accuracy of the language models, but this does not translate to the accuracy of the classifier when the classifier is a simple concatenation of the outputs from the two RNNs (including max and avg pooling).
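A rough sketch (plain PyTorch, not the repo's actual code) of the "simple concatenation" head described above: apply the usual ULMFiT concat pooling (last step, max pool, mean pool) to the forward and backward encoder outputs separately, then concatenate before the linear layer. Shapes and names are illustrative:

```python
import torch
import torch.nn as nn

def concat_pool(hidden):
    """hidden: (batch, seq_len, emb) -> (batch, 3*emb) via last/max/mean pooling."""
    last = hidden[:, -1]
    mx, _ = hidden.max(dim=1)
    avg = hidden.mean(dim=1)
    return torch.cat([last, mx, avg], dim=1)

class BiLMConcatHead(nn.Module):
    """Concatenate pooled forward and backward encoder outputs, then classify."""
    def __init__(self, emb_sz, n_classes, p=0.1):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.lin = nn.Linear(2 * 3 * emb_sz, n_classes)

    def forward(self, fwd_hidden, bwd_hidden):
        x = torch.cat([concat_pool(fwd_hidden), concat_pool(bwd_hidden)], dim=1)
        return self.lin(self.drop(x))
```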
Be careful to adjust the weight decay when turning it off and on.
I had a look through different learning rates and weight decays (I've tested 50 different scenarios). true_wd=False and wd=1e-7 seem to get slightly better results more often. I was training a bidirectional classifier and the best result I've got so far is 92.5% accuracy, way below the original ULMFiT (the architecture might be flawed). My current best guess is that the Moses tokenization is not performing as well as the fastai tokenization; I will run some experiments and keep you posted.
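For context, this is roughly where those two knobs live in fastai v1 (a sketch with placeholder names, not the train_clas script itself): `true_wd=True` applies decoupled, AdamW-style weight decay, while `true_wd=False` folds it into the loss as classic L2 regularisation, which is why `wd` has to be re-tuned when you flip it.

```python
from fastai.text import *  # provides text_classifier_learner, AWD_LSTM

# Placeholder DataBunch; true_wd/wd are forwarded to the underlying Learner.
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5,
                                true_wd=False, wd=1e-7)
learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))
# wd can also be overridden per call, e.g. learn.fit_one_cycle(1, 2e-2, wd=1e-7)
```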
Yes, we tried to remove the spacy tokenizer in fastai but I didn't get any good results by using a different tokenizer (92.5% too for our alternative), so that part is important too.
Good to know and thank you for watching this thread closely!
Thanks for commenting, Sylvain. XNLI is already pre-tokenized with Moses, so I don't think using spacy will help much here.
@sebastianruder XNLI-1.0, the dataset with all the languages, has a header with free text which is not tokenized. We can use the fastai tokenizer (spaCy) there and see if the results improve.
@PiotrCzapla could you share your top 5/10 hyper-parameter configs? I'd like to try this for Hindi language models as well. Please mark if you were using
I haven't tweaked the hyper-params yet, just the learning rate; the best setup I found is in the train_clas script on the bilm branch. It gets me to 0.923 accuracy on the validation set, using QRNN and the full text for the LM.
I've spent the whole day trying to fix small issues with
@PiotrCzapla just looked at the classification results you are getting with the new model. I've had this conversation with some of my colleagues at work who have used the original ULMFiT, and their common complaint has been that while it clearly excels at language modeling tasks, getting results on the final downstream classification requires a lot of careful (and tedious!) tuning of the hyper-parameters. This is somewhat alleviated by the fact that the training is a lot faster than most of the other options out there. Even so, I think there is potential for improvement at the classification stage. Here are some things I have considered but haven't had a reason to experiment with until now:
Re 1.: I want to add a dropout after the RNN so that we have a hyper-parameter to tune, and I will run experiments. This is why I want to do a bit of refactoring, so that I can easily add new heads and train a lot of classifiers to see which ones work well. @Dust0x do you want to help out here? I can think of the following things to do:
Agreed.
Are you saying you want to test the classification accuracy with only a forward or a backward QRNN? Could you please elaborate more on this?
This will be important if we want to work with XNLI. What are the ideas you have in mind for the multiple heads?
This can just be one of our experiments as we had originally planned. I can't think of a better way here than to just experiment and find out what works and what doesn't, but there might be.
Regarding performance on IMDb, I think the best we can do is leave the architecture as it is and tune the unfreezing and learning rate schedule. Jeremy and I tried a lot of different variants for the pooling and the classifier, so I think it'll be hard to find a combination that performs significantly better.
@sebastianruder a self-attention module (or something like a premise-to-hypothesis attention, such as the co-attention module in DRCN) should help. Where would you propose introducing this module? It would probably make sense to do this after the forward and backward LM, before we join the outputs and then send them forward to the pooling step. If you agree, then I can take a crack at it. @PiotrCzapla I finally had some free time today so I took a look at the SNLI approaches from the link that Sebastian shared. Can we start a thread somewhere (maybe in teams) to discuss what everyone is working on, the problems that are open for contribution and the ideas we want to try (or reject)?
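A generic sketch of the kind of self-attention pooling being proposed here: a single learned additive-attention query scores each encoder time step, and the weighted sum could replace or be concatenated with the max/avg pools. This is not the DRCN co-attention module nor code from the repo, just an illustration:

```python
import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    """Score each time step with a learned vector and return the weighted sum."""
    def __init__(self, emb_sz):
        super().__init__()
        self.score = nn.Linear(emb_sz, 1)

    def forward(self, hidden):                            # hidden: (batch, seq_len, emb)
        weights = torch.softmax(self.score(hidden).squeeze(-1), dim=1)
        return torch.einsum('bs,bse->be', weights, hidden)  # (batch, emb)
```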
@Dust0x perfect place! I wasn't aware that there is something like that, and it seems to be private. We are only missing @sebastianruder; I've sent him an invite already.
Cool. I don't think I've received an invite so far.
@sebastianruder it is an invitation to a team; I've canceled it and added you again.
Ah, thanks! Didn't realize you meant an invite for the org. Accepted it now.
I think I spotted the issue: we don't use the unlabelled IMDB reviews, so our language model is trained on a dataset of 50k reviews while lesson 3 uses 100k. So our accuracy of 92% isn't that bad after all :)
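For comparison, the lesson 3 notebook builds the LM data from all three folders, including `unsup/`; a short sketch assuming the standard fastai v1 IMDB layout and a recent fastai v1 data-block API:

```python
from fastai.text import *  # provides untar_data, URLs, TextList

path = untar_data(URLs.IMDB)
# Include train/, test/ and the 50k unlabelled reviews in unsup/ (~100k total).
data_lm = (TextList.from_folder(path)
           .filter_by_folder(include=['train', 'test', 'unsup'])
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=64))
```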
The issue is caused by our language models, not the tokenization. I've tested the pretrained model wt103_1 on 3 tokenizations (Moses, Moses + fastai pre/post processing, fastai (spaCy)). On all of them it gets an excellent accuracy of 94+%. I've compared our code with Sylvain's and I noticed a few possible causes:
I think the first cause is most probable. If so, then we have an issue with our Wikipedia articles for other languages, as we don't denote new articles in any meaningful way :/. I will report once I test the hypothesis above. @sebastianruder in the way you have implemented the Moses tokenization you replaced BOS with EOS. Can you share your reasoning? Can we settle on
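A small sketch of one way to denote article boundaries during preprocessing: prepend an explicit BOS marker to each article before (Moses) tokenization so the LM sees where a new article starts. The token string and the helper are assumptions for illustration, not the repo's actual convention; which marker to standardise on is exactly what the question above is trying to settle.

```python
# Assumed marker: fastai v1 uses 'xxbos'; the older scripts used 'xbos'.
BOS = 'xxbos'

def mark_articles(articles):
    """articles: iterable of raw article strings -> strings prefixed with a BOS marker."""
    return [f'{BOS} {a.strip()}' for a in articles]
```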
@PiotrCzapla, thanks for looking into this. Did I? As far as I'm aware, I only replaced newlines with the
Perfect! Thx, will add

Once I started training on full articles and turned off dropout, I got the same accuracy on IMDB using the LSTM as reported by Sylvain (94%) and slightly worse (93.5%) using QRNN (3 and 4 layers). Most likely the issue was caused by training the LM on short sentences. The tokenization (Moses, Moses+fastai, fastai) is not that important; the fast.ai model wt103_1 can be moved to any tokenization and it still performs well: Moses (94.5%), Moses+fastai preproc (94.9%), fastai (94.8%). All of this is working in the refactor branch, which should be used with the latest fastai (not the ulmfit_multlingual branch).
Awesome! That's great news! :)
This is fantastic news @PiotrCzapla! Does this mean that we move completely to FastAIv1.0 on
Not yet, I think fastai is still under active development.
I've updated sentencepiece and I'm getting 94.6% on IMDB, so we are almost ready to merge the refactoring branch. The only missing tokenization is Moses; I'm going to run some tests next year :)
Awesome! That's fantastic! Happy new year by the way! :)
Happy new year! :) I've updated and simplified the Moses tokenization to work on longer articles, and I'm testing it on IMDB; I should have results tomorrow. I'm pretty sure it will work fine, so let's close this issue. Btw, I've merged the
The hyperparameters selected in train_clas don't work that well :/
The pretrained LM was a toy example trained on wikitext-2, but I guess we will have similar issues with regular models.
@NirantK, have these hyperparams worked in your experiments?