Model not training beyond 1st epoch #10146
Comments
Could you please post this on the forum rather than here? The HuggingFace authors like to keep this place for bugs or feature requests, and they're more than happy to help you on the forum. Looking at your code, this seems more like an issue with preparing the data correctly for the model. Take a look at this example in the docs on how to perform text classification with the Trainer.
@NielsRogge I'm not very pleased with your reply; please ask someone a question if you are unclear about something, rather than just trying to close an issue. As regards the data, I can assure you it is in the format specified by your guide: it is in NumPy arrays converted to lists and then made into a TFDataset object, and it has all the correct parts. The conversion to lists was made because an error clearly specified that lists are to be passed. This is a bug because the model does appear to be training, just with extremely low accuracy (which may be because of the activation function, but I am not sure), and it won't train any further than the 1st epoch; subsequent epochs don't pick up where the previous epoch left off.
I've created a Google Colab that will hopefully resolve your issue: https://colab.research.google.com/drive/1azTvNc0AZeN5JMyzPnOGic53jddIS-QK?usp=sharing What I did was create some dummy data based on the format of your data, and then check whether the model is able to overfit it (as this is one of the most common first steps when debugging a neural network). As you can see in the notebook, it appears to be able to, so everything seems to be working fine. Let me know if this helps. UPDATE: looking at your code, it appears that the learning rate is way too low in your case. A typical value for Transformers is 5e-5.
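The overfit-on-dummy-data check described above can be sketched in plain Keras. This is a minimal stand-in, not the Colab's actual code: the dummy shapes and the tiny two-layer network are assumptions; only the 5e-5 learning rate comes from the thread.

```python
import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)

# Dummy "dataset": 8 examples, 16 features, 2 classes (shapes are made up).
x = np.random.rand(8, 16).astype("float32")
y = np.random.randint(0, 2, size=(8,))

# Tiny stand-in model; the real case would use a Transformers TF model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2),  # raw logits: no softmax here
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),  # typical Transformer LR
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

loss_before = model.evaluate(x, y, verbose=0)
model.fit(x, y, epochs=300, verbose=0)
loss_after = model.evaluate(x, y, verbose=0)
print(loss_before, loss_after)  # a healthy setup drives the loss down on 8 samples
```

If the loss does not go down even on a handful of memorizable samples, the problem is almost always in the data pipeline, the loss, or the learning rate rather than in the library.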
@NielsRogge Thanks a lot for the advice; I will surely update you regarding any solution. I have been trying to apply this to my own code, but I am still reproducing the bug: the warnings are there (unlike yours). I am using the latest version of
Even after 35 epochs, the model does not overfit; the same accuracy/loss is maintained irrespective of the loss function.
UPDATE: You might have missed this line, @NielsRogge, about using the Keras loss function rather than the default one.
I want to jump in here and let you know that this kind of behavior is inappropriate. @NielsRogge is doing his best to help you here, and he is doing this in his own free time. "My model is not training" is very vague and doesn't seem like a bug, so suggesting to take this to the forums is very appropriate: more people will be able to help you there. Please respect that this is an open-source project. No one has to help you solve your bug, so staying open-minded and kind will go a long way toward getting the help you need.
@sgugger With all due respect, my model was training; it just lost all the progress it had made in an epoch when the next one started, beginning and ending with the exact same numbers. That is very much a bug. As for the open-source project: I do understand that this is voluntary, but someday, if you need help, and someone tells you without reading your question that whatever you have done is wrong (without any proof) and suggests you ask your question somewhere that I know for a fact is not very active, I would like to see your response. We have many projects that are not backed by a company. If you don't want to spend time solving my problem, that's fine; I have no issue with that. But if you dismiss my problem just to shorten the list of open issues, it feels pretty bad. I do know that I don't understand ML very deeply, and certainly not enough to make a project of my own, but I do know the difference between someone actually trying to help me and someone just trying to reduce the number of open GitHub issues.
I do think there's a bit of a misunderstanding with what we mean by a bug. Of course, since your model isn't training properly, there's a bug in your code. But in this case, it's a bug probably caused by the user (these bugs include setting hyperparameters like learning rate too low, not setting your model in training mode, improper use of the Trainer, etc.). These things are bugs, but they are caused by the user. And for such cases, the forum is the ideal place to seek help. Github issues are mostly for bugs caused by the Transformers library itself, i.e. caused by the authors (these bugs include implementations of models which are incorrect, a bug in the implementation of the Trainer, etc.). So the issue you're posting here is a perfect use case for the forum! It's not that we want to close issues as soon as possible, and it's also not the case that we don't want to help you. It's just a difference between bugs due to the user/bugs due to the library itself, and there are 2 different places for this. |
What @NielsRogge said is correct: your way of training your model is not correct (and your data might also be malformed). As far as I can see, if your data really looks like:
I guess that if you have label ids up to at least 14, it certainly means that you have more than one label, and then that line needs to change. Nevertheless, if you really have only one label, your loss must be adjusted accordingly. So, as far as I can say, I second what has been said before: this post should be on the forum, not here.
@jplu Hmm, I had thought that num_labels was the number of labels to be predicted by the model (as in multi-label classification). About the data: I am importing it into NumPy arrays after preprocessing, so I don't see why the structure of the data frame would be a problem. @NielsRogge You may be right that the bug is a hyperparameter issue (I tried all sorts of learning rates, but it didn't work), but the reason why I think it is a bug in
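The point about num_labels can be made concrete with a quick sketch. This is a generic illustration, not the poster's actual code; the example label ids and the Dense head are assumptions:

```python
import numpy as np
import tensorflow as tf

# Hypothetical label column: the id 14 appears, so there are at least 15 classes.
labels = np.array([0, 3, 14, 7, 2])
num_labels = int(labels.max()) + 1  # 15

# The classification head's output width must match num_labels: a sparse
# categorical cross-entropy is undefined for ids >= the logits width.
head = tf.keras.layers.Dense(num_labels)
logits = head(tf.random.normal((len(labels), 32)))  # shape (5, 15)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(labels, logits)
print(num_labels, logits.shape)
```

In other words, num_labels is the number of distinct classes, not the number of predictions per example; if any target id is outside `[0, num_labels)`, the loss (and therefore training) silently misbehaves or errors out.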
Another reason was that trying to train the model using
UPDATE: After quite some fixing, the model is now training and seems to be learning (I am still confused about what exactly
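For anyone hitting the same "resets after each epoch" symptom: Keras itself does not reset weights between epochs, or even between separate fit() calls; a second call continues from the weights the first call left behind. A minimal illustration (the toy data and linear model are assumptions):

```python
import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)

# Learnable toy task: label depends on the sum of the features.
x = np.random.rand(64, 8).astype("float32")
y = (x.sum(axis=1) > 4.0).astype("int64")

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-2),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

h1 = model.fit(x, y, epochs=10, batch_size=8, verbose=0)
h2 = model.fit(x, y, epochs=10, batch_size=8, verbose=0)

# The second call picks up the weights the first call produced, so its
# losses start below where the first call started.
print(h1.history["loss"][0], h2.history["loss"][0])
```

If loss really does restart at the same value every epoch, the usual suspects are rebuilding or re-compiling the model inside the training loop, or feeding an exhausted/re-created dataset, rather than anything inside Keras or Transformers.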
As mentioned before
This example is using PyTorch, not TensorFlow. There is no hyper-parameter tuning implemented in Transformers in TensorFlow, which is why I was recommending Keras Tuner. |
Alright, thanks a ton!
Do you plan to add this support for TFTrainer? |
@liaocs2008 the
@neel04 I am facing the same issue, the model seems to be resetting after each epoch. Could you please share what fixes you implemented? |
Environment info

`transformers` version: 4.4.0.dev0

Who can help

Models:

Information

Model I am using (Bert, XLNet ...): RoBERTa

The problem arises when using:

The tasks I am working on is:
To reproduce
First off, this issue is basically a continuation of #10055, but since that error was mostly resolved, I have opened another issue. I am using a private dataset, so I am not at liberty to share it. However, I can provide a clue as to what the csv looks like. This is the code:
The problems: the cell with the `Trainer()` method executes successfully, but it does nothing; training does not start at all. This is not much of a major issue in itself, but it may be a factor in this problem. Can anyone tell me how exactly to change the activation function, or share other thoughts on the potential problem? I have tried changing the learning rate with no effect.