-inf in log_prior and nan in loss breaks training #43
Comments
Hello Rafal, and thank you for using BLiTZ, for giving me this feedback, and for reporting this issue. Can you provide more details about your network (the parameters passed to the constructor of each Bayesian layer and maybe the whole network class), so I can try it here and see how I can help you? At first sight, this might be related to the prior distribution parameters you set, but I would like to be sure of that before drawing a conclusion. Thank you.
Thanks a lot for responding! Here is the code with my model (it is based on your example with the Boston dataset). Normally I train it on my own dataset, but I get the same error on Boston, so I used it here:
And here is the code for training:
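Roughly, a minimal sketch of such a training loop, assuming the BayesianRegressor class sketched in the original post and the old scikit-learn Boston loader; model_copy and optimizer_copy are the copies mentioned below:

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_boston          # available in older scikit-learn releases
from sklearn.preprocessing import StandardScaler

X, y = load_boston(return_X_y=True)
X = torch.tensor(StandardScaler().fit_transform(X), dtype=torch.float)
y = torch.tensor(y, dtype=torch.float).unsqueeze(-1)

model = BayesianRegressor(X.shape[1], 1)           # class sketched in the original post
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

for epoch in range(200):
    # keep copies of the state from just before a potentially NaN-producing step
    model_copy = copy.deepcopy(model)
    optimizer_copy = copy.deepcopy(optimizer)

    optimizer.zero_grad()
    loss = model.sample_elbo(inputs=X, labels=y,
                             criterion=criterion, sample_nbr=3)
    if torch.isnan(loss):
        print(f"nan loss at epoch {epoch}")
        break
    loss.backward()
    optimizer.step()
```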
The code for model_copy and optimizer_copy is there to inspect the state of the model from just before the NaNs appear. Training usually breaks because of NaNs around epoch 40-80. If you run the code, it sometimes returns:
PS: Sometimes it's a layer other than blinear1 that has -inf in its log_prior attribute; it can also be blinear2 or blinear3.
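A small helper along these lines can locate the offending layer (assuming, as above, that BayesianLinear refreshes its log_prior attribute on each forward pass):

```python
import math
from blitz.modules import BayesianLinear

def find_inf_log_prior(model):
    """Print every Bayesian layer whose most recent log_prior is -inf."""
    for name, module in model.named_modules():
        if isinstance(module, BayesianLinear) and math.isinf(float(module.log_prior)):
            print(f"{name}: log_prior = {float(module.log_prior)}")

# call right after a forward pass or sample_elbo, e.g.
# _ = model(X)
# find_inf_log_prior(model)
```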
Hello, a little update:
Hello, and sorry for the late reply. In your NaN case, you should check whether the NaN is coming from the log-likelihood of the weights relative to the prior distribution. If that's the case, then you should tune the parameters of that prior distribution; note that you can use a Gaussian mixture model too. As for the case of torch.exp returning zero, tuning those prior parameters should address it as well. If you are getting NaNs on the fitting cost, then either the problem is in the lib (in that case I will try to fix it) or in the data, the loss function, etc. Hope this is useful.
Yeah, I managed to solve this issue by increasing the variance of the prior mixture distribution. Thank you for your answer!
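For reference, a sketch of that kind of change: in BLiTZ the scale-mixture prior is configured per layer through the BayesianLinear constructor, and the concrete sigma values below are purely illustrative:

```python
from blitz.modules import BayesianLinear

# widen both Gaussian components of the scale-mixture prior so that sampled
# weights no longer land in regions of (numerically) zero prior probability
blinear1 = BayesianLinear(13, 64,
                          prior_sigma_1=1.0,
                          prior_sigma_2=0.5,
                          prior_pi=0.5)
```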
Hello, first of all amazing work, and thank you for this project!
I'm trying to train a simple 3-layer NN and I've encountered some problems I wanted to ask about. Here is my model:
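Roughly, a minimal sketch of such a 3-layer model (the class name, hidden size and activations are assumed), modelled on the library's Boston example:

```python
import torch
import torch.nn as nn
from blitz.modules import BayesianLinear
from blitz.utils import variational_estimator

@variational_estimator            # adds sample_elbo() to the module
class BayesianRegressor(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # the three Bayesian layers referred to as blinear1/2/3 in the thread
        self.blinear1 = BayesianLinear(input_dim, 64)
        self.blinear2 = BayesianLinear(64, 64)
        self.blinear3 = BayesianLinear(64, output_dim)

    def forward(self, x):
        x = torch.relu(self.blinear1(x))
        x = torch.relu(self.blinear2(x))
        return self.blinear3(x)
```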
I'm training it on a dataset with prices of flats/houses I recently scraped, and I've encountered a problem I cannot seem to fully understand: after a few epochs, the loss returned by model.sample_elbo is sometimes equal to nan, which, when backpropagated, breaks the whole training, as some of the weights are 'optimized' to nans.
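A quick check along these lines confirms it (assuming the variational parameters are exposed as weight_mu / weight_rho, as in BLiTZ's BayesianLinear):

```python
import torch

# after the step where the loss was nan, the variational means themselves are
# nan, so every subsequent forward pass stays corrupted
print(torch.isnan(model.blinear1.weight_mu).any())   # tensor(True) in the failing run
print(torch.isnan(model.blinear1.weight_rho).any())  # tensor(True) in the failing run
```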
I managed to track down where the incorrect values appear first, before these nans are backpropagated, and it turned out that the value of log_prior in the first Bayesian layer is sometimes equal to -inf.
Going further, I checked that the problem is in weight_prior_dist, which sometimes, maybe one in five times, returns -inf.
Going deeper, I realised that the problem is in prior_pdf of the first prior distribution in weight_prior_dist of the first layer. Some of the logarithms of probabilities for the sampled weight values (prior_dist.dist1.log_prob(w)) are very small, around -100, and when passed through torch.exp such small values are rounded to 0. When these zero probabilities then go through torch.log in prior_dist.log_prior(w), they become -inf, the whole mean then approaches -inf, and that corrupts the further calculation of the loss. If I understand correctly, this means that the probabilities of such sampled weights under the prior distribution are very, very small, approaching zero. Could you suggest a way of tackling this so that they stay very small but not zero? Or maybe the problem is something different?
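A quick numeric illustration of that underflow with float32 tensors (single precision cannot represent exp of log-probabilities much below roughly -103, so the exp/log round trip produces -inf):

```python
import torch

log_p = torch.tensor([-50.0, -110.0])  # plausible log_prob values for sampled weights
p = torch.exp(log_p)                   # tensor([1.9287e-22, 0.0000e+00]); the second underflows
back = torch.log(p)                    # tensor([-50., -inf])
print(back.mean())                     # tensor(-inf): one underflow poisons the mean
```

Computing the mixture in log-space (for example with torch.logaddexp over the two components) would avoid the exp/log round trip entirely; widening the prior, as discussed earlier in the thread, instead keeps the log-probabilities in a representable range.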
I'm still learning the details of Bayesian DL, so I hope there aren't too many silly mistakes here. Thank you for any kind of help!
best regards
Rafał