hmm_train failed to converge #496

Closed
davudadiguezel opened this Issue Dec 15, 2015 · 16 comments


@davudadiguezel

Hi,
I have trouble running hmm_train on some of my files. I get this error message:


[INFO ] EMFit::Estimate(): iteration 251, log-likelihood -163.419.
[INFO ] EMFit::Estimate(): iteration 252, log-likelihood -163.419.
[INFO ] GMM::Estimate(): log-likelihood of trained GMM is -163.419.
[INFO ] Cluster 6 is empty.
[DEBUG] Point 0 assigned to empty cluster 6.
[INFO ] Cluster 7 is empty.
[DEBUG] Point 1 assigned to empty cluster 7.
[INFO ] KMeans::Cluster(): iteration 1, residual inf.
[INFO ] Cluster 7 is empty.
[DEBUG] Point 32 assigned to empty cluster 7.
[INFO ] KMeans::Cluster(): iteration 2, residual inf.
[INFO ] KMeans::Cluster(): iteration 3, residual 0.269213.
[INFO ] KMeans::Cluster(): iteration 4, residual 0.
[INFO ] KMeans::Cluster(): converged after 4 iterations.
[INFO ] 2240 distance calculations.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.

error: chol(): failed to converge

terminate called after throwing an instance of 'std::runtime_error'
what(): chol(): failed to converge

Program received signal SIGABRT, Aborted.
0x00007ffff64e4cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff64e4cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff64e80d8 in __GI_abort () at abort.c:89
#2 0x00007ffff6def535 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff6ded6d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff6ded703 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff6ded922 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff799e144 in arma::arma_bad<char [27]> (x=..., hurl=true) at /usr/include/armadillo_bits/debug.hpp:177
#7 0x00007ffff79a9387 in arma::op_chol::apply<arma::Mat<double> > (out=..., X=...) at /usr/include/armadillo_bits/op_chol_meat.hpp:26
#8 0x00007ffff79a6809 in arma::Mat<double>::Mat<arma::Mat<double>, arma::op_chol> (this=0x7fffffffb300, X=...) at /usr/include/armadillo_bits/Mat_meat.hpp:3932
#9 0x00007ffff79a59ad in arma::Proxy<arma::Op<arma::Mat<double>, arma::op_chol> >::Proxy (this=0x7fffffffb300, A=...) at /usr/include/armadillo_bits/Proxy.hpp:309
#10 0x00007ffff79a2ba3 in arma::op_strans::apply_proxy<arma::Op<arma::Mat<double>, arma::op_chol> > (out=..., X=...) at /usr/include/armadillo_bits/op_strans_meat.hpp:220
#11 0x00007ffff799f88e in arma::op_htrans::apply<arma::Op<arma::Mat<double>, arma::op_chol> > (out=..., in=..., junk=0x0) at /usr/include/armadillo_bits/op_htrans_meat.hpp:265
#12 0x00007ffff799e9e2 in arma::Mat<double>::operator=<arma::Op<arma::Mat<double>, arma::op_chol>, arma::op_htrans> (this=0x8fbc00, X=...) at /usr/include/armadillo_bits/Mat_meat.hpp:3948
#13 0x00007ffff799b7ab in mlpack::distribution::GaussianDistribution::FactorCovariance (this=0x8fbac0) at /org/share/home/adigueze/mlpack-master/src/mlpack/core/dists/gaussian_distribution.cpp:38
#14 0x00007ffff799b732 in mlpack::distribution::GaussianDistribution::Covariance(arma::Mat<double>&&) (this=0x8fbac0, covariance=<unknown type in /org/share/home/adigueze/mlpack-master/bin/lib/libmlpack.so.1, CU 0x21ab0, DIE 0x52374>) at /org/share/home/adigueze/mlpack-master/src/mlpack/core/dists/gaussian_distribution.cpp:30
#15 0x00000000005fe597 in mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat<double> >, mlpack::gmm::PositiveDefiniteConstraint>::InitialClustering (this=0x8fc280, observations=..., dists=..., weights=...) at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/gmm/em_fit_impl.hpp:264
#16 0x00000000005f4472 in mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat<double> >, mlpack::gmm::PositiveDefiniteConstraint>::Estimate (this=0x8fc280, observations=..., dists=..., weights=..., useInitialModel=false) at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/gmm/em_fit_impl.hpp:39
#17 0x00000000005e9027 in mlpack::gmm::GMM<mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat<double> >, mlpack::gmm::PositiveDefiniteConstraint> >::Estimate (this=0x8f68a0, observations=..., trials=1, useExistingModel=false) at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/gmm/gmm_impl.hpp:190
#18 0x00000000005dd382 in mlpack::hmm::HMM<mlpack::gmm::GMM<mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat<double> >, mlpack::gmm::PositiveDefiniteConstraint> > >::Train (this=0x7fffffffd770, dataSeq=..., stateSeq=...) at /org/share/home/adigueze/mlpack-master/src/mlpack/methods/hmm/hmm_impl.hpp:276
#19 0x00000000005d1543 in Train::Apply<mlpack::hmm::HMM<mlpack::gmm::GMM<mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat<double> >, mlpack::gmm::PositiveDefiniteConstraint> > > > (hmm=..., trainSeqPtr=0x7fffffffd400) at /org/share/home/adigueze/mlpack-master/src/mlpack/methods/hmm/hmm_train_main.cpp:170
#20 0x00000000005c3415 in main (argc=14, argv=0x7fffffffe128) at /org/share/home/adigueze/mlpack-master/src/mlpack/methods/hmm/hmm_train_main.cpp:344


Here is the input file and the labels I use:

obsShoulderLeft.csv.txt
labels.csv.txt

I run hmm_train like this:
hmm_train -v -i labels.csv -o runNormalizedTest.xml -t gmm -g 10 -n 13

I guess the error means that the EM algorithm does not converge. I normalized my data, but that didn't change anything. Is there anything else I can do? Or is there something wrong with my data? Everything works fine on other files. Any help is very welcome.
Greetings
davud

@davudadiguezel

I noticed that when I train on each column separately, everything runs fine. Some column combinations run fine, too. So I am really confused.

@rcurtin
Member
rcurtin commented Dec 18, 2015

Hi there,

Sorry for the slow response. It took me a while to dig into this one. I believe that what was happening here was that some particular Gaussian from one of the GMMs that make up your HMM emission distribution had a covariance matrix which wasn't positive definite (or at least not positive definite enough for chol()). Previously, this issue had been handled by the file src/mlpack/methods/gmm/positive_definite_constraint.hpp [1], which, at the end of each EM iteration during GMM training, would add small amounts to the diagonal until the determinant was greater than 0 (and thus the matrix would be positive definite).

However, what seemed to actually be occurring was that the determinant was indeed greater than 0, but the call to arma::chol() would fail anyway because of floating-point errors (I theorize). So, in e08a8ff I committed a better strategy for checking positive definiteness: instead of using arma::det() and forcing the determinant to be greater than some small value like 1e-50, we just call arma::chol() to make sure it converges, and if it doesn't, add to the diagonal. This guarantees that once the PositiveDefiniteConstraint is applied, the covariance matrix can be decomposed with arma::chol(), so the problem you're having should no longer occur (assuming you update to git master).
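In sketch form, the new check looks something like this (a minimal approximation using Armadillo, not the exact code from the commit; the starting perturbation value is my own assumption for illustration):

```cpp
#include <armadillo>

// Sketch of the strategy described above: rather than checking
// det(covariance) > 0, attempt the Cholesky factorization directly and
// add to the diagonal until it succeeds.
void ApplyPositiveDefiniteConstraint(arma::mat& covariance)
{
  arma::mat cholFactor;
  double perturbation = 1e-30; // Assumed starting value, for illustration.
  while (!arma::chol(cholFactor, covariance))
  {
    covariance.diag() += perturbation;
    perturbation *= 10; // Escalate if the last perturbation wasn't enough.
  }
}
```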

As a side note, per #229 I am about to rename all of the executables to be prefixed with mlpack_, so hmm_train will become mlpack_hmm_train in the next couple of days. If you update and rebuild mlpack, you may notice that the names have changed, but there is no compatibility breakage, so there shouldn't be any issues -- just giving you a heads up. :)

If my big long explanation turns out to have been wrong or my fix didn't actually fix your issue, feel free to reopen... (I really hope it is a fix, though, it took me a while to dig to the bottom of that one.) :)

[1] https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/gmm/positive_definite_constraint.hpp

@rcurtin rcurtin closed this Dec 18, 2015
@rcurtin rcurtin added this to the mlpack 2.0.0 milestone Dec 18, 2015
@davudadiguezel

Hey Ryan,
just finished a successful training! Thanks for the great work!
Greetings
Davud

@davudadiguezel

Hey Ryan,
I ran some more tests and hit problems with the fix. It looks like the program just runs forever:


[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[INFO ] EMFit::Estimate(): iteration 7, log-likelihood 34547.2.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[INFO ] EMFit::Estimate(): iteration 8, log-likelihood inf.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.


It stayed like this all night. Do you have any idea?
Greetings and Merry Christmas,
Davud

@rcurtin
Member
rcurtin commented Dec 22, 2015

Hm, I think what's happening here is that a covariance matrix is ending up full of NaNs. I'm not sure why, but I'll look into it and see if I can reproduce it. I'd suggest that maybe your GMM has too many Gaussians in it... maybe fewer Gaussians will cause this problem less often?
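For example, reusing the flags from your original command but with fewer Gaussians per state (the value 3 is just an illustration; everything else is unchanged):

```
hmm_train -v -i labels.csv -o runNormalizedTest.xml -t gmm -g 3 -n 13
```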

@rcurtin rcurtin reopened this Dec 22, 2015
@rcurtin rcurtin removed the R: fixed label Dec 22, 2015
@rcurtin rcurtin modified the milestone: mlpack 2.0.1, mlpack 2.0.0 Dec 24, 2015
@davudadiguezel

Hi Ryan,
Happy New Year! Sorry for my late response; I haven't been at the office lately.

I just tried with 10, 5, and 3 Gaussians; the problem occurs less often, but it still happens.

greetings,
davud

@rcurtin
Member
rcurtin commented Jan 13, 2016

Hey Davud,

Sorry for the slow response; I was out of town for a while too. :)

I did some digging, and I think I've solved the issue in c505098. Here is what I think was happening:

During training, some Gaussians end up having particularly small covariances. This happens in part because many of the points in your dataset are identical. A very small covariance for a Gaussian means that its density at the mean will be extremely large; for sufficiently small covariances, this density is represented as inf on the machine. Then, later in the EM fitting process, the normalization of probabilities turns this inf into a NaN, which propagates into the covariance estimate, leaving it full of NaNs. When the covariance is full of NaNs, the "Adding perturbation" step will continue forever, because no matter how large a number you add to a matrix full of NaNs, it still won't be positive definite.
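To see the mechanism in isolation, here is a toy illustration (the dimensionality and variances are invented, but the arithmetic is exactly the failure mode described above):

```cpp
#include <cmath>
#include <cstdio>

int main()
{
  // A 13-dimensional Gaussian whose per-dimension variances are 1e-30:
  // det(Sigma) = (1e-30)^13 = 1e-390, which underflows to 0 in a double.
  const double det = std::pow(1e-30, 13);

  // The Gaussian normalizing constant divides by sqrt(det), so the
  // density at the mean becomes 1 / 0 = inf.
  const double density = 1.0 / std::sqrt(det);

  // When EM normalizes per-point responsibilities, inf / inf = NaN, and
  // the NaN propagates into the next covariance estimate.
  const double responsibility = density / (density + density);

  // Prints: det = 0, density = inf, responsibility = nan
  std::printf("det = %g, density = %g, responsibility = %g\n",
              det, density, responsibility);
  return 0;
}
```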

So, what I've done is modified the positive definiteness constraint to also enforce a minimum diagonal covariance element of 1e-50 (very small), thus restricting how peaked a single Gaussian's PDF can become. This means that the density of a point at the mean is bounded within the range of numbers representable by the machine, so there is no inf, and there are no NaNs later.
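The added constraint amounts to something like this (again a sketch of the idea, not a copy of the committed code):

```cpp
#include <armadillo>

// Enforce a minimum diagonal covariance element of 1e-50, as described
// above, so no Gaussian's density can reach inf at its mean.
void EnforceMinimumDiagonal(arma::mat& covariance)
{
  for (arma::uword i = 0; i < covariance.n_rows; ++i)
    if (covariance(i, i) < 1e-50)
      covariance(i, i) = 1e-50;
}
```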

But, it's possible that I did not properly solve the issue, so please reopen if you still have issues. I am more than happy to help work out issues like this. :)

@rcurtin rcurtin closed this Jan 13, 2016
@rcurtin rcurtin added the R: fixed label Jan 13, 2016
@davudadiguezel

Hey Ryan,
thanks for the response.

Unfortunately the program keeps hanging (but not as often as before):


[DEBUG] Point 2 assigned to empty cluster 17.
[INFO ] Cluster 18 is empty.
[DEBUG] Point 3 assigned to empty cluster 18.
[INFO ] KMeans::Cluster(): iteration 1, residual inf.
[INFO ] KMeans::Cluster(): iteration 2, residual 0.197714.
[INFO ] KMeans::Cluster(): iteration 3, residual 0.151819.
[INFO ] KMeans::Cluster(): iteration 4, residual 0.0707789.
[INFO ] KMeans::Cluster(): iteration 5, residual 0.0748019.
[INFO ] KMeans::Cluster(): iteration 6, residual 0.0826282.
[INFO ] KMeans::Cluster(): iteration 7, residual 0.0555508.
[INFO ] KMeans::Cluster(): iteration 8, residual 0.0666275.
[INFO ] KMeans::Cluster(): iteration 9, residual 0.0591785.
[INFO ] KMeans::Cluster(): iteration 10, residual 0.0380067.
[INFO ] KMeans::Cluster(): iteration 11, residual 0.00518312.
[INFO ] KMeans::Cluster(): iteration 12, residual 0.00368832.
[INFO ] KMeans::Cluster(): iteration 13, residual 0.00231247.
[INFO ] KMeans::Cluster(): iteration 14, residual 0.00123067.
[INFO ] KMeans::Cluster(): iteration 15, residual 0.0035152.
[INFO ] KMeans::Cluster(): iteration 16, residual 0.00228494.
[INFO ] KMeans::Cluster(): iteration 17, residual 0.00522757.
[INFO ] KMeans::Cluster(): iteration 18, residual 0.0076554.
[INFO ] KMeans::Cluster(): iteration 19, residual 0.0119473.
[INFO ] KMeans::Cluster(): iteration 20, residual 0.0112805.
[INFO ] KMeans::Cluster(): iteration 21, residual 0.0219611.
[INFO ] KMeans::Cluster(): iteration 22, residual 0.0143273.
[INFO ] KMeans::Cluster(): iteration 23, residual 0.0118793.
[INFO ] KMeans::Cluster(): iteration 24, residual 0.0191215.
[INFO ] KMeans::Cluster(): iteration 25, residual 0.0221008.
[INFO ] KMeans::Cluster(): iteration 26, residual 0.0538075.
[INFO ] KMeans::Cluster(): iteration 27, residual 0.0694079.
[INFO ] KMeans::Cluster(): iteration 28, residual 0.0346496.
[INFO ] KMeans::Cluster(): iteration 29, residual 0.00939849.
[INFO ] KMeans::Cluster(): iteration 30, residual 0.00749252.
[INFO ] KMeans::Cluster(): iteration 31, residual 0.00635273.
[INFO ] KMeans::Cluster(): iteration 32, residual 0.00476996.
[INFO ] KMeans::Cluster(): iteration 33, residual 0.00516157.
[INFO ] KMeans::Cluster(): iteration 34, residual 0.00444851.
[INFO ] KMeans::Cluster(): iteration 35, residual 0.00304399.
[INFO ] KMeans::Cluster(): iteration 36, residual 0.00174719.
[INFO ] KMeans::Cluster(): iteration 37, residual 0.000555867.
[INFO ] KMeans::Cluster(): iteration 38, residual 0.000810098.
[INFO ] KMeans::Cluster(): iteration 39, residual 0.000738585.
[INFO ] KMeans::Cluster(): iteration 40, residual 0.00020829.
[INFO ] KMeans::Cluster(): iteration 41, residual 0.
[INFO ] KMeans::Cluster(): converged after 41 iterations.
[INFO ] 3280000 distance calculations.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] EMFit::Estimate(): initial clustering log-likelihood: 10098.8
[INFO ] EMFit::Estimate(): iteration 1, log-likelihood 10098.8.
[INFO ] EMFit::Estimate(): iteration 2, log-likelihood inf.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.


Greetings,
Davud

@davudadiguezel

I just ran into the problem again. This time I had split up my data by dimension, but it still got stuck in an endless loop like before:

[INFO ] EMFit::Estimate(): iteration 23, log-likelihood 13842.8.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[INFO ] EMFit::Estimate(): iteration 24, log-likelihood inf.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.

@rcurtin
Member
rcurtin commented Jan 19, 2016

Just a quick update: I've been looking into this, and I found a random seed that allows me to reproduce the issue, but when I use a debugger to step through the calculations to find out where the NaN is coming from, the results don't make any sense. If I print a value with cout, I get something different from what gdb shows when I inspect the same value without printing it. I'm digging deeper, but I don't have anything yet...

It does appear that this occurs because the covariance matrices for individual Gaussians are getting very small. I think with your dataset this happens in part because there are so many points with the exact same values. I know this isn't a solution, but I suspect that if you were to add noise to each point (maybe Gaussian noise with standard deviation 1e-15 or something else quite small), you might encounter fewer or no crashes like this.
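If you want to try that workaround, it's a one-liner with Armadillo (AddJitter is a hypothetical helper name; 1e-15 is just the standard deviation suggested above):

```cpp
#include <armadillo>

// Add tiny Gaussian noise so that no two observations are exactly
// identical; randn() draws from a standard normal, so scaling by 1e-15
// gives noise with standard deviation 1e-15.
void AddJitter(arma::mat& observations)
{
  observations += 1e-15 * arma::randn(observations.n_rows,
                                      observations.n_cols);
}
```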

I'll keep looking and let you know when I get to the bottom of this. :)

@rcurtin rcurtin reopened this Jan 19, 2016
@rcurtin rcurtin removed the R: fixed label Jan 19, 2016
@davudadiguezel

Thanks, I will try the noise. And thanks for all the work you're putting into supporting this. :)

@rcurtin
Member
rcurtin commented Jan 25, 2016

An update: I was using the debugger wrong; the results did make sense, and I was just doing stupid things and reading them incorrectly. I was correct that this is occurring because the covariance matrices of individual Gaussians are very small, and as in each previous round of debugging this particular problem, the solution lies in positive_definite_constraint.hpp, the bit of code that forces the covariance to be positive definite.

Originally, I added to the diagonal until det(covariance) was greater than some small number (I arbitrarily chose 1e-50). But this didn't always work, so I replaced it with a check that the Cholesky decomposition was successful. But, it turns out that there are situations where the Cholesky decomposition appears to succeed but gives wildly inaccurate results, due to (I think) machine precision issues.

I think that the solution here is going to be to enforce a check based on the condition number of the matrix and the machine precision, but I need to do some reading to familiarize myself with numerical linear algebra precision issues first. When I implement a fix, I'll then test with your dataset and different random seeds for a day or two to make sure that the issue doesn't recur.
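The check I have in mind would look roughly like this (a sketch under my stated assumption -- clamping small eigenvalues relative to the largest one and machine epsilon -- not necessarily what the final fix will do):

```cpp
#include <armadillo>
#include <limits>

// Sketch: if the ratio of the largest to the smallest eigenvalue (the
// condition number) is too large relative to machine precision, the
// Cholesky decomposition becomes unreliable, so clamp small eigenvalues
// and rebuild the covariance from its eigendecomposition.
void EnforceConditionNumber(arma::mat& covariance)
{
  arma::vec eigval;
  arma::mat eigvec;
  arma::eig_sym(eigval, eigvec, covariance);

  // Smallest eigenvalue we are willing to accept, relative to the largest.
  const double minEigval =
      eigval.max() * std::numeric_limits<double>::epsilon();

  bool modified = false;
  for (arma::uword i = 0; i < eigval.n_elem; ++i)
  {
    if (eigval[i] < minEigval)
    {
      eigval[i] = minEigval;
      modified = true;
    }
  }

  if (modified)
    covariance = eigvec * arma::diagmat(eigval) * eigvec.t();
}
```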

@rcurtin
Member
rcurtin commented Jan 29, 2016

Okay, with 54b2906 I think I have finally solved the issue! I wrote a check that uses the condition number of the Gaussian's covariance to prevent the Cholesky decomposition from failing, and I tested it for 10000 trials with different random seeds on your dataset, with no failures. So I really hope the issue is solved now. I will let you close the issue when you decide that the problem is fixed, because it seems like every time I close this issue, it turns out not to be really solved. :)

@davudadiguezel

Hey Ryan,
have been running your solution on my data and it looks good!
Many thanks for all your efforts!
Greetings
Davud :)

@rcurtin
Member
rcurtin commented Feb 1, 2016

Great, I hope you don't have more issues! :) But if you do, please let me know and I can keep digging. I'll release 2.0.1 with the GMM training fixes, as well as some other fixes, shortly. As far as I can tell my changes didn't significantly affect runtime.

@davudadiguezel

I can confirm that. Runtime is unaffected :)
