-
Notifications
You must be signed in to change notification settings - Fork 11
Add Dlantr #39
Conversation
|
The cgo for Dlange seems to be fine, but Dlantr with cgo seems to be giving nonsense answers. Do you see where the error may be @kortschak ? |
cgo/lapack.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/lapcak/lapack/ and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Done.
|
The cgo code that should reproduce this is: It does not fail. The errors I see with cgo are either "free(): invalid next size (fast)" or "double free or corruption (out)". So presumably the C code is freeing something that does not belong to it. I'm looking further. |
|
If I remove the 2 frees in the LAPACKE_dlantr path (work and a_t) we stop this failure (instead failing during the dlange call - if I remove these, the panic goes away, but the tests fail). The failure is the same for go1.4.2, so it's not the new GC. This is obviously insane; the allocations are made by C and then freed by C. If DLANTR is told to RETURN immediately, the panic goes away, so this looks like an unpleasant interaction between Go, C and fortran. |
|
This change to OpenBLAS is the minimal change necessary and sufficient to stop the panic - it still fails due to bad answers: diff --git a/lapack-netlib/SRC/dlantr.f b/lapack-netlib/SRC/dlantr.f
index 6088e8c..97393ec 100644
--- a/lapack-netlib/SRC/dlantr.f
+++ b/lapack-netlib/SRC/dlantr.f
@@ -281,7 +281,7 @@
END IF
ELSE
IF( LSAME( DIAG, 'U' ) ) THEN
- DO 210 I = 1, N
+ DO 210 I = 1, M
WORK( I ) = ONE
210 CONTINUE
DO 220 I = N + 1, MThis matches the documentation for WORK which says it should be >= M (via LWORK). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be m?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naively, no. This is the row vs. column major thing. In column major, one computes the MaxRowSum looping over the columns and incrementing, so len(work) = m. In row major, one computes the column sums with incrementing, so len(work) = n. Maybe this is a problem with the RowMajor implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're getting the right answers for at least one case where OpenBLAS disagrees and work is required (the one that was small enough for me to check), so I don't think this is a problem, I just wanted to check.
|
I'm confident that we are correct and either or both of OpenBLAS and NETLIB are wrong. |
|
What's the link for the file of your code change? In the netlib reference code it only goes to m |
|
Do you have a good reproducer to report to OpenBLAS? In the meantime, should I log the failure instead of using t.Errorf? I assume that's better than disabling the cgo test. |
|
By which I mean, would you like to report your reproducer to OpenBLAS? |
|
The reproducer for C doesn't fail. I think just point to this. The error on my machine is a double free, so we have already lost by the time we call; the test needs to be disabled for cgo. I just blatted the whole of NETLIB LAPACK on top of OpenBLAS and it still fails - the change in the link you provided is not in the 3.5.0 tarball, nor is it in the svn - looking now, it's not in the link above either (look at L284). I think this is an issue against NETLIB. |
|
Disabled the test |
|
@xianyi we're seeing a bunch of cases where the results from Dlantr do not match the results for the equivalent matrix in Dlange. We're also seeing some possible memory corruption issues. Do you know what the issue is? |
|
PTAL @kortschak |
|
Yay! I don't know what I did, but the code above (edited now) does cause a memory corruption failure: Filing with OpenBLAS. |
testlapack/dlantr.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use got and want here - when debugging and the world is uncertain, it makes life easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
I've tried the change at the OpenBLAS issue and when the tests are reinstated we get failures like this (added output): Want is wrong. |
testlapack/dlantr.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split the parameters onto a new line. Also, change Norm = %s -> ...=%c and drop the string conversion, make Diag->unitdiag and Uplo->upper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
That's really weird. For the last case I get the same answer as you do. However, if you change the order of computation (of got and want), the want answer changes (though still wrong). Even weirder is that when I change DlangeTest to contain effectively the same matrix (though the zeros are populated), the answer is correctly computed to be 505. |
|
Yes, I'm not very happy with the situation right now - there is too much uncertainty in a wide variety of parameters to be able to know what is going on. |
|
Well, on the bright side, I don't think mat64 should have to call this directly. It will instead call Dtrcon (etc.) directly which are hopefully working in cgo. |
|
Are you confident (via hand crafted cases) that Dlange and Dlantr in native are OK? |
|
Dlantr is compared with Dlange. Over many cases (I ran with trials = 10,000) Dlantr = Dlange. Thus, I think if Dlange is right, Dlantr is right. The Dlange answer is compared with an alternate coding of the same value, and it matches. Additionally, I have a version of Dgecon implemented and tested. Dgecon uses Dlange, and in a few hand-coded trials the answer is the same. |
|
OK. |
|
Also, the plan is to test Dgecon against cgo using a fuzzer given the difficulty of testing some of the other subroutines. That should find any errors. |
|
I'll have a look at the code proper later today. |
|
Here are current failing cases (I have a fix for the LAPACKE_dlantr issue I think) and they appear to be due to Dlange: calculating Dlange second: and now calculating Dlange first: Independently allocating a zeroed work for each call gives the second case. LAPACK_dlange is wrong, and in a way that depends on work, but not entirely. I'll let you file this one with NETLIB, @btracey. |
|
Filed a bug at http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=13&t=4789 |
|
The Go Dlange results seem accurate? |
|
It never spits out any bad things according to your tests which seem OK to me. I'll make a comparison against cgo.
Correction, they agree entirely (the previous was with the LAPACKE_dlantr fixes and some changes in LAPACKE_dlange that I was playing with - the dlantr changes don't alter this). Now I'm even more confused. |
|
Excellent! |
|
Mmmm. Ish. I'm glad our things are good. For people coming to look, the diff I made to LAPACKE to get dlantr to work is: diff --git a/lapack-netlib/lapacke/src/lapacke_dlantr.c b/lapack-netlib/lapacke/src/lapacke_dlantr.c
index 522122c..2cde1eb 100644
--- a/lapack-netlib/lapacke/src/lapacke_dlantr.c
+++ b/lapack-netlib/lapacke/src/lapacke_dlantr.c
@@ -53,7 +53,7 @@ double LAPACKE_dlantr( int matrix_order, char norm, char uplo, char diag,
/* Allocate memory for working array(s) */
if( LAPACKE_lsame( norm, 'i' ) || LAPACKE_lsame( norm, '1' ) ||
LAPACKE_lsame( norm, '0' ) ) {
- work = (double*)LAPACKE_malloc( sizeof(double) * MAX(1,m) );
+ work = (double*)LAPACKE_malloc( sizeof(double) * MAX(1,MAX(m,n)) );
if( work == NULL ) {
info = LAPACK_WORK_MEMORY_ERROR;
goto exit_level_0;
diff --git a/lapack-netlib/lapacke/src/lapacke_dlantr_work.c b/lapack-netlib/lapacke/src/lapacke_dlantr_work.c
index 0a937bd..169f85e 100644
--- a/lapack-netlib/lapacke/src/lapacke_dlantr_work.c
+++ b/lapack-netlib/lapacke/src/lapacke_dlantr_work.c
@@ -46,7 +46,7 @@ double LAPACKE_dlantr_work( int matrix_order, char norm, char uplo,
info = info - 1;
}
} else if( matrix_order == LAPACK_ROW_MAJOR ) {
- lapack_int lda_t = MAX(1,n);
+ lapack_int lda_t = MAX(1,m);
double* a_t = NULL;
/* Check leading dimension(s) */
if( lda < n ) {
@@ -55,13 +55,13 @@ double LAPACKE_dlantr_work( int matrix_order, char norm, char uplo,
return info;
}
/* Allocate memory for temporary array(s) */
- a_t = (double*)LAPACKE_malloc( sizeof(double) * lda_t * MAX(1,n) );
+ a_t = (double*)LAPACKE_malloc( sizeof(double) * lda_t * MAX(1,MAX(m,n)) );
if( a_t == NULL ) {
info = LAPACK_TRANSPOSE_MEMORY_ERROR;
goto exit_level_0;
}
/* Transpose input matrices */
- LAPACKE_dtr_trans( matrix_order, uplo, diag, n, a, lda, a_t, lda_t );
+ LAPACKE_dtr_trans( matrix_order, uplo, diag, m, a, lda, a_t, lda_t );
/* Call LAPACK function and adjust info */
res = LAPACK_dlantr( &norm, &uplo, &diag, &m, &n, a_t, &lda_t, work );
info = 0; /* LAPACK call is ok! */This accords with some of the code in LAPACKE_dlange (lda_t = MAX(1,m) for example), but I can't get dlange to function correctly by mimicking what I've done here. It seems that the malloc size should not need to be lda_t * MAX(1,m,n), MAX(1,n) should be enough, but it do; memory corruption otherwise. |
|
If the call to LAPACKE_dtr_trans is made with MAX(m,n) instead of m or n. Success. |
|
Demonstration of success: https://github.com/gonum/lapack/tree/lapacke-dlantr-fixed Interestingly the cgo tests can take much longer than the native tests (sometime failing to finish within the time-out). |
|
Does it look good? |
|
Sorry, caught up in $JOB. Meetings until this afternoon. |
|
Mo $$ mo problems
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to have this before the other switch on norm? It seems you can roll a whole heap of logic into one switch if not. I guess the only problem is you want to do these checks before the min(m, n) == 0 test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is to make the comparison with cgo easier. The panic checks should be the same in both, and this way the code "preamble" can be the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
|
LGTM with minor comments and trepidation because of the weirdness relating to col->row major translation of the fortran. The tests passing and all the stuff I looked at during the LAPACKE_dlantr debugging also helps with confidence. |
No description provided.