
hmm_train number of gaussians #479

Closed
davudadiguezel opened this issue Nov 20, 2015 · 4 comments
@davudadiguezel
Hi,
I am just playing around with hmm_train and am running into some problems with the number of Gaussians. I don't yet have a strategy for choosing the number of Gaussians, so I just tried a few values.
With values around five everything seems to be fine, but with 10 and above I get
error: Mat::col(): index out of bounds
terminate called after throwing an instance of 'std::logic_error'
I run it like this:
./hmm_train -i observationPKM.csv -n 13 -t gmm -g 10 -o hmm.xml -l labels.csv
The debug output also looks like this:
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Covariance matrix is not positive definite. Adding perturbation.
[DEBUG] Point 4 assigned to empty cluster 0.
[DEBUG] Point 5 assigned to empty cluster 2.
[DEBUG] Point 6 assigned to empty cluster 3.
...
Is this normal? And is there an upper limit to the number of Gaussians?
Greetings
Davud

@rcurtin

rcurtin commented Nov 20, 2015

Hi Davud,

Consider running with '-v' (--verbose) for more output. I think that one issue might be that you don't have labeled points from every state. If that was the problem, it should be fixed in 2eface7. Another possible issue would have been that invalid labels were specified; if so, it should be fixed in fa89192.

If neither of those fix your issue, if you can get me a copy of observationPKM.csv and labels.csv, I can reproduce it and dig deeper.

There is no restriction on the number of Gaussians, but note that as you add more and more Gaussians to the GMM, if you don't have many samples, you may have stability issues with the empirical covariance matrices (the debug messages that mention that the covariance matrix is not positive definite could be an indicator of this).
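The stability issue Ryan describes can be illustrated outside mlpack. The following is a hypothetical NumPy sketch (not mlpack code): when a Gaussian component ends up with fewer samples than dimensions, its empirical covariance matrix has deficient rank and therefore cannot be positive definite, which is exactly the situation the `[DEBUG]` messages warn about.

```python
import numpy as np

# Hypothetical illustration: 5 samples in 10 dimensions.
# The empirical covariance has rank at most (samples - 1) = 4,
# so it is singular and not positive definite.
rng = np.random.default_rng(0)
dim, n_samples = 10, 5
samples = rng.normal(size=(dim, n_samples))  # rows = variables, cols = observations

cov = np.cov(samples)                  # 10x10 empirical covariance
rank = np.linalg.matrix_rank(cov)      # at most n_samples - 1
eigvals = np.linalg.eigvalsh(cov)      # smallest eigenvalue is (numerically) zero

print(rank)                            # 4 (deficient: less than dim = 10)
print(eigvals.min() < 1e-10)           # True: not positive definite
```

mlpack's `PositiveDefiniteConstraint` perturbs such matrices to keep EM running, which is why the messages are debug output rather than hard errors.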

@rcurtin rcurtin added this to the mlpack 1.1.0 milestone Nov 20, 2015
@davudadiguezel

Hi Ryan,
I got some more output with -v and with gdb:


[INFO ] GMM::Estimate(): log-likelihood of trained GMM is 9248.34.
[INFO ] Cluster 1 is empty.
[DEBUG] Point 63 assigned to empty cluster 1.
[INFO ] Cluster 2 is empty.
[DEBUG] Point 42 assigned to empty cluster 2.
[INFO ] Cluster 4 is empty.
[DEBUG] Point 43 assigned to empty cluster 4.
[INFO ] Cluster 5 is empty.

error: Mat::col(): index out of bounds

terminate called after throwing an instance of 'std::logic_error'
what(): Mat::col(): index out of bounds

Program received signal SIGABRT, Aborted.
0x00007ffff651fcc9 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff651fcc9 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff65230d8 in __GI_abort () at abort.c:89
#2 0x00007ffff6e2a535 in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff6e286d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff6e28703 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff6e28922 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00000000005c39f7 in arma::arma_stop<char const*> (
x=@0x7fffffffabf8: 0x677e20 "Mat::col(): index out of bounds")
at /usr/include/armadillo_bits/debug.hpp:113
#7 0x00000000005e5e52 in arma::arma_check<char [32]> (state=true, x=...)
at /usr/include/armadillo_bits/debug.hpp:358
#8 0x0000000000617ddd in col (col_num=77, this=0x7fffffffc8d0)
at /usr/include/armadillo_bits/Mat_meat.hpp:2588
#9 mlpack::kmeans::MaxVarianceNewCluster::EmptyCluster<mlpack::metric::LMetric<2, true>, arma::Mat > (this=0xb72460, data=..., emptyCluster=5, oldCentroids=..., newCentroids=...,
clusterCounts=..., metric=..., iteration=0)
at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/kmeans/max_variance_new_cluster_impl.hpp:58
#10 0x0000000000610b56 in mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat >::Cluster (this=0xb72450, data=..., clusters=10, centroids=...,
initialGuess=false)
at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/kmeans/kmeans_impl.hpp:160
#11 0x0000000000609bc0 in mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat >::Cluster (this=0xb72450, data=..., clusters=10, assignments=..., centroids=...,
initialAssignmentGuess=false, initialCentroidGuess=false)
at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/kmeans/kmeans_impl.hpp:241
#12 0x0000000000602135 in mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat >::Cluster (this=0xb72450, data=..., clusters=10, assignments=...,
initialGuess=false)
at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/kmeans/kmeans_impl.hpp:64
#13 0x00000000005fcb75 in mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat >, mlpack::gmm::PositiveDefiniteConstraint>::InitialClustering
(this=0xb72440, observations=..., dists=..., weights=...)
at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/gmm/em_fit_impl.hpp:214
#14 0x00000000005f3524 in mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat >, mlpack::gmm::PositiveDefiniteConstraint>::Estimate (
this=0xb72440, observations=..., dists=..., weights=..., useInitialModel=false)
at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/gmm/em_fit_impl.hpp:39
#15 0x00000000005e81d5 in mlpack::gmm::GMM<mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat >, mlpack::gmm::PositiveDefiniteConstraint> >::Estimate (this=0x9a3ed0, observations=..., trials=1, useExistingModel=false)
at /org/share/home/adigueze/mlpack-master/src/mlpack/../mlpack/methods/gmm/gmm_impl.hpp:190
#16 0x00000000005dbdbf in mlpack::hmm::HMM<mlpack::gmm::GMM<mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat >, mlpack::gmm::PositiveDefiniteConstraint> > >::Train (this=0x7fffffffd790, dataSeq=..., stateSeq=...)
at /org/share/home/adigueze/mlpack-master/src/mlpack/methods/hmm/hmm_impl.hpp:274
#17 0x00000000005d0b53 in Train::Apply<mlpack::hmm::HMM<mlpack::gmm::GMM<mlpack::gmm::EMFit<mlpack::kmeans::KMeans<mlpack::metric::LMetric<2, true>, mlpack::kmeans::RandomPartition, mlpack::kmeans::MaxVarianceNewCluster, mlpack::kmeans::NaiveKMeans, arma::Mat >, mlpack::gmm::PositiveDefiniteConstraint> > > > (hmm=..., trainSeqPtr=0x7fffffffd420)
at /org/share/home/adigueze/mlpack-master/src/mlpack/methods/hmm/hmm_train_main.cpp:146
#18 0x00000000005c3335 in main (argc=15, argv=0x7fffffffe148)
at /org/share/home/adigueze/mlpack-master/src/mlpack/methods/hmm/hmm_train_main.cpp:320


There is a "no such file" error in there, but I don't think it comes from me. I am pretty sure that I have labels for every state, and I don't get any invalid-label errors.
Here are my files:
observationPKM.csv.txt
labels.csv.txt

(you will have to delete the .txt extension)

How many samples count as "enough"? I currently run with about 4000. I will get some more, but I don't think I will have more than 50k. Will that be enough to use more Gaussians?
Greetings
Davud

@rcurtin

rcurtin commented Nov 24, 2015

Hey Davud,

Thanks for the backtrace; judging by its output, it looks like this bug is exactly what was fixed in #481 a few days ago. I used the dataset that you linked to, and tested with the current git master and had no problem, then tested with git master before #481 was merged (specifically, a7d8231), and had the exact same issue that you had here. So I believe the issue is solved if you update to the newest git master.

If you do try to keep increasing the number of Gaussians, though, note that you can't increase it past the number of samples in your smallest class. For instance, in your data, here is the class breakdown:

(( ryan @ adam )) ~/src/mlpack/build $ cat labels.csv | sort | uniq -c
    158 0
    101 1
     34 10
     67 11
     68 12
     45 2
     27 3
     54 4
     56 5
    194 6
     26 7
     42 8
    162 9

So you can't increase the number of Gaussians above 26, because class 7 only has 26 observations. If you, for instance, specify -g 45, but only have 26 observations, k-means (which is used before GMM training to initialize the model) will end up with empty clusters no matter what is done, and this will probably cause the program to fail (note that you'll get a warning: [WARN ] KMeans::Cluster(): more clusters requested than points given.). It would be possible to add a check for this to hmm_train but at the moment I don't have the time, unfortunately...
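The per-class check Ryan does with `sort | uniq -c` can be sketched as a small Python helper (hypothetical, not part of mlpack): the largest usable number of Gaussians per state is bounded by the smallest per-class observation count, since k-means cannot fill more clusters than it has points.

```python
from collections import Counter

def max_gaussians(labels):
    """Largest -g value that leaves no class with fewer points than clusters."""
    counts = Counter(labels)
    return min(counts.values())

# Per-class counts reported in the thread above (class 7 is the smallest).
example_counts = {0: 158, 1: 101, 2: 45, 3: 27, 4: 54, 5: 56,
                  6: 194, 7: 26, 8: 42, 9: 162, 10: 34, 11: 67, 12: 68}
labels = [c for c, n in example_counts.items() for _ in range(n)]

print(max_gaussians(labels))  # 26
```

Running hmm_train with `-g` above this bound would leave k-means with unavoidable empty clusters, which is the failure mode described above.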

@davudadiguezel

Thanks for the help and the good explanation. Now everything works fine.
Greetings
