Cholesky decomposition failed when training plda #3328

Open
czy97 opened this issue May 16, 2019 · 14 comments
Labels
bug, stale

Comments

@czy97

czy97 commented May 16, 2019

Hello,
When I train PLDA on some features I extracted, the following error occurs. The tail of the log is shown below:

LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 146608
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 209565
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 5 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 140.852
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 197.8
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 6 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 105.141
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 157.448
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 7 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 1117.95
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 5858.14
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 8 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 140.276
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 395.243
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 9 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 12926.4
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 12056.3
LOG (ivector-compute-plda[5.5.0~3-327d]:GetOutput():plda.cc:540) Norm of mean of iVector distribution is 0.745405
WARNING (ivector-compute-plda[5.5.0~3-327d]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite. Throwing error
Cholesky decomposition failed.# Accounting: begin_time=1558008300



When I add a very small amount of random noise to my features, as you suggested for the same problem in the LDA computation, the error is still there:
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 7 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 101755
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 141957
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 8 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 3136.84
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 17124.3
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 9 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 1.78053e+08
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 1.22304e+08
LOG (ivector-compute-plda[5.5.0~3-327d]:GetOutput():plda.cc:540) Norm of mean of iVector distribution is 0.745405
WARNING (ivector-compute-plda[5.5.0~3-327d]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite. Throwing error
Cholesky decomposition failed.# Accounting: begin_time=1558007075
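
The traces in the two logs above are easier to compare once they are pulled out of the log text. The following is a minimal sketch, not part of Kaldi: the function names `parse_traces` and `flag_jumps` and the jump factor of 5 are assumptions chosen for illustration, and the sample lines are copied from the first log above.

```python
import re

# Matches the two "Trace of ... variance" lines seen in the logs above.
TRACE_RE = re.compile(r"Trace of (within|between)-class variance is ([0-9.eE+-]+)")


def parse_traces(log_text):
    """Return (within, between) trace pairs in the order they appear."""
    pairs, current = [], {}
    for line in log_text.splitlines():
        match = TRACE_RE.search(line)
        if match is None:
            continue
        current[match.group(1)] = float(match.group(2))
        if len(current) == 2:  # saw both a within and a between line
            pairs.append((current["within"], current["between"]))
            current = {}
    return pairs


def flag_jumps(pairs, factor=5.0):
    """Indices whose within-class trace exceeds `factor` times the previous one."""
    return [i for i in range(1, len(pairs)) if pairs[i][0] > factor * pairs[i - 1][0]]


# Three consecutive updates copied verbatim from the first log above.
SAMPLE = """\
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 1117.95
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 5858.14
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 140.276
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 395.243
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 12926.4
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 12056.3
"""

pairs = parse_traces(SAMPLE)
for index, (within, between) in enumerate(pairs):
    print(f"update {index}: within={within:g} between={between:g}")
print("updates with a large jump:", flag_jumps(pairs))
```

The indices count updates in log order only; they are not tied to the "iteration N of 10" labels.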
@czy97 czy97 added the bug label May 16, 2019
@danpovey
Contributor

Are you sure your features aren't limited to a subspace of the space they live in? Or maybe you have fewer features than the PLDA dimension?

@czy97
Author

czy97 commented May 17, 2019

Are you sure your features aren't limited to a subspace of the space they live in? Or maybe you have fewer features than the PLDA dimension?

Thanks for your reply. My feature dimension is 256, and more than 1 million feature vectors are used to train the PLDA. Also, what do you mean by features being limited to a subspace of the space they live in?

@danpovey
Contributor

danpovey commented May 17, 2019 via email

I mean something like the feature values sum to one, so the covariance would not be full rank. Or one feature is always zero, something like that.

@czy97
Author

czy97 commented May 17, 2019

OK, thanks. You can see that the traces of the within-class and between-class variance become very large at iteration 9 of 10. If I set --num-em-iters to 9 instead of 10, the error disappears (the within/between-class traces are small after 9 iterations). When I check logs from normal PLDA training runs, the within/between-class variance always stays at a relatively low value (around one hundred). Does this mean that the EM algorithm is not converging well? Is it a problem with my data, or is there some other cause?
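
The rank-deficiency scenario raised in this exchange (feature values that sum to one, or a dimension that is always zero) can be reproduced on synthetic data. The sketch below is plain Python and independent of Kaldi: the helper names, the 4-dimensional random data, and the relative pivot tolerance `rel_tol` are all assumptions made for illustration, and the tolerance is not the criterion Kaldi applies. It only shows that an exact linear constraint on the features makes the covariance singular, so a Cholesky factorization hits a non-positive pivot. It does not establish what went wrong in the run reported here.

```python
import math
import random


def covariance(rows):
    """Sample covariance (divided by N) of a list of equal-length vectors."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for row in rows:
        centered = [row[j] - mean[j] for j in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += centered[i] * centered[j] / n
    return cov


def cholesky(matrix, rel_tol=1e-10):
    """Return lower-triangular L with matrix = L L^T.

    Raises ValueError at the first pivot that is not clearly positive.
    rel_tol is an illustrative threshold, not the one Kaldi uses.
    """
    dim = len(matrix)
    scale = max(matrix[i][i] for i in range(dim))
    lower = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i + 1):
            partial = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i != j:
                lower[i][j] = partial / lower[j][j]
            elif partial <= rel_tol * scale:
                raise ValueError(f"pivot {i} is {partial:.3e}: not positive definite")
            else:
                lower[i][i] = math.sqrt(partial)
    return lower


def is_positive_definite(matrix):
    """True if the Cholesky factorization above succeeds."""
    try:
        cholesky(matrix)
    except ValueError:
        return False
    return True


rng = random.Random(0)
# Unconstrained features: 200 vectors of dimension 4.
full = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
# Constrained features: each vector is normalized so its entries sum to one.
sum_to_one = [[math.exp(x) / sum(math.exp(y) for y in row) for x in row] for row in full]
# Constrained features: dimension 2 is always zero.
zero_dim = [row[:2] + [0.0] + row[3:] for row in full]

for name, data in (("full", full), ("sum_to_one", sum_to_one), ("zero_dim", zero_dim)):
    print(name, "positive definite:", is_positive_definite(covariance(data)))
```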

@danpovey
Contributor

I'll try to look into it at some point.

@stale

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 19, 2020
@shayxurui

When I run the aishell2 recipe script (run.sh) from Kaldi's egs, it fails at steps/online/nnet2/train_ivector_extractor.sh, and the log file says "Cholesky decomposition failed. Maybe matrix is not positive definite." I did not modify any data or code.

@stale stale bot removed the stale label Sep 22, 2020
@danpovey
Contributor

danpovey commented Sep 22, 2020 via email

Sometimes that error is harmless; in any case it is quite generic, so I would need to see more info (e.g. more of the log).

@shayxurui

shayxurui commented Sep 23, 2020

I am very sorry; I re-ran the code and the log file has been overwritten. I then hit a new error, still in the train_ivector_extractor.sh script, and the log says: expected token "", got instead "BLAS"

@danpovey
Contributor

danpovey commented Sep 23, 2020 via email

You should learn to paste as text. My guess is that you probably ran out of memory; that part can use up a great deal of memory. You could reduce --num-processes to 1 for train_ivector_extractor.sh, which should help, and/or reduce --num-jobs too if you are using run.pl.

On Wed, Sep 23, 2020 at 9:56 AM 徐锐 @.> wrote: I am very sorry, I re-executed the code, the log file has been overwritten. A new error was encountered, [image: image] https://user-images.githubusercontent.com/30276311/93955591-abe97100-fd82-11ea-96f3-89634406df86.png and log [image: 微信图片_20200923095518] https://user-images.githubusercontent.com/30276311/93955706-f66aed80-fd82-11ea-85dd-30ffe91ff9e8.png

@shayxurui

Thank you for your reply. I would like to post the full log, but I cannot copy text out of the virtual machine because my company restricts it.

@danpovey
Contributor

Likely something in your system is printing 'BLAS' to stdout every time a shell is created, e.g. in one of your .xxxrc files. Either that or (somehow) when a certain BLAS library gets loaded it prints BLAS.
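
One way to test the first of these guesses is to start a process that should print nothing and look at what arrives on stdout. The sketch below is a generic check, not part of Kaldi. The function name `unexpected_stdout` is hypothetical, and the two commands it runs are stand-ins built from the current Python interpreter so that the example behaves the same on any machine.

```python
import subprocess
import sys


def unexpected_stdout(command, timeout=30):
    """Run `command` and return whatever it wrote to stdout, stripped."""
    result = subprocess.run(
        command, capture_output=True, text=True, check=False, timeout=timeout
    )
    return result.stdout.strip()


# Stand-in for a shell whose startup files print something.
noisy = [sys.executable, "-c", "print('BLAS')"]
# Stand-in for a clean shell that prints nothing.
quiet = [sys.executable, "-c", "pass"]

for label, command in (("noisy", noisy), ("quiet", quiet)):
    text = unexpected_stdout(command)
    status = f"printed {text!r}" if text else "printed nothing"
    print(f"{label}: {status}")
```

On the affected machine one would pass something like `["bash", "-lc", "true"]` (login shell) or `["bash", "-ic", "true"]` (interactive shell) instead. Any non-empty result points at a startup file that writes to stdout.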

@shayxurui

Likely something in your system is printing 'BLAS' to stdout every time a shell is created, e.g. in one of your .xxxrc files. Either that or (somehow) when a certain BLAS library gets loaded it prints BLAS.

When I delete the i-vector-related parts and train again, there is no error. It seems that training aishell2 does not strictly require i-vectors.

@stale

stale bot commented Nov 24, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

@stale stale bot added the stale label Nov 24, 2020

3 participants