
error in the s/dger case (mx1 vs 1xn) in mtimes #8

Closed · ShigekiKarita opened this issue Feb 20, 2018 · 6 comments

Comments

@ShigekiKarita (Contributor)

I ran into the following errors in this special case. I think calling ger inside mtimes is the best fix.

unittest
{
    import std.stdio : writeln;
    import mir.ndslice;          // sliced, slice, transposed, uninitSlice
    import lubeck : mtimes;

    auto a = [1.0, 2.0].sliced(2, 1);
    auto b = [1.0, 2.0].sliced(2, 1);
    auto bt = b.transposed;

    // without slice:
    // (linked with cblas)  ** On entry to cblas_dgemm parameter number 11 had an illegal value
    // (linked with mkl_rt) Intel MKL ERROR: Parameter 11 was incorrect on entry to cblas_dgemm.
    writeln(mtimes(a, bt)); // garbage, e.g. [[6.911e-310, 9.85365e-317], [0, 0]]

    // but OK when bt is first copied to a contiguous slice
    assert(mtimes(a, bt.slice) == [[1, 2], [2, 4]]);

    // mir.blas.ger also works when a.length!1 == 1 && b.length!0 == 1,
    // since an mx1-times-1xn product is a rank-1 update C += alpha * x * y^T
    import mir.blas : ger;
    auto c = uninitSlice!double(a.length!0, bt.length!1);
    c[] = 0;
    ger(1.0, a[0 .. $, 0], b[0 .. $, 0], c);
    assert(c == [[1, 2], [2, 4]]);
}

I think this is related to the specification of LDB (the 11th argument): http://www.netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaeda3cbd99c8fb834a60a6412878226e1.html#gaeda3cbd99c8fb834a60a6412878226e1

       On entry, LDB specifies the first dimension of B as declared
       in the calling (sub) program. When  TRANSB = 'N' or 'n' then
       LDB must be at least  max( 1, k ), otherwise  LDB must be at
       least  max( 1, n ).

where B.shape == [k, n]
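To make the shapes concrete, here is a NumPy sketch (an illustration only, not mir/lubeck code) of the same m×1 times 1×n case. The product is exactly the rank-1 outer product that BLAS `?ger` computes, which is why ger is a natural replacement for gemm when k = 1:

```python
import numpy as np

# Same shapes as in the D snippet above: a is m x 1, bt is 1 x n.
a = np.array([[1.0], [2.0]])   # shape (2, 1)
bt = np.array([[1.0, 2.0]])    # shape (1, 2)

# An (m x 1) @ (1 x n) product is exactly a rank-1 outer product,
# which is what BLAS ?ger computes: C += alpha * x * y^T.
c = a @ bt
assert np.array_equal(c, np.outer(a[:, 0], bt[0, :]))
assert np.array_equal(c, [[1.0, 2.0], [2.0, 4.0]])
```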

P.S.

The following transposed B (1 < k < n) seems to be OK. I think only k=1 has the problem.

unittest
{
    import mir.ndslice;
    import lubeck : mtimes;

    auto a = [1.0, 2.0,
              3.0, 4.0,
              5.0, 6.0].sliced(3, 2);
    auto b = [1.0, 2.0,
              3.0, 4.0,
              5.0, 6.0].sliced(3, 2);
    auto bt = b.transposed;
    auto result = [[5, 11, 17], [11, 25, 39], [17, 39, 61]];
    // OK: k = 2 here, so the transposed case works
    assert(mtimes(a, bt) == result);
}
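The expected result in this k = 2 case can be cross-checked with NumPy (just an arithmetic verification; NumPy is not part of the project):

```python
import numpy as np

# Same 3x2 matrix as in the D snippet; a @ a.T is its 3x3 Gram matrix.
a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
result = a @ a.T
assert np.array_equal(result, [[5, 11, 17], [11, 25, 39], [17, 39, 61]])
```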
9il closed this as completed Feb 20, 2018
@ShigekiKarita (Contributor, Author) commented Feb 20, 2018

@9il Thank you for the fast merge! But I still have a question about my ger example for the recurrent neural network backpropagation algorithm.

git clone https://github.com/ShigekiKarita/numir-char-rnn.git
git checkout -b  <branch-name> origin/<branch-name>
git submodule update --init --recursive # lubeck is submoduled
dub clean --all-packages
dub run --compiler=ldc2 -b=release-nobounds

using lubeck.mtimes (ger disabled)

https://github.com/ShigekiKarita/numir-char-rnn/blob/mir-blas-gemm/source/app.d#L63-L72
dub run --compiler=ldc2 -b=release-nobounds 59.09s user 0.94s system 395% cpu 15.180 total

using lubeck.mtimes (ger enabled)

https://github.com/ShigekiKarita/numir-char-rnn/blob/mir-blas-ger/source/app.d#L63-L72
dub run --compiler=ldc2 -b=release-nobounds 46.70s user 0.59s system 395% cpu 11.968 total

using direct call of mir.blas.ger

https://github.com/ShigekiKarita/numir-char-rnn/blob/master/source/app.d#L84-L94
dub run --compiler=ldc2 -b=release-nobounds 21.89s user 0.20s system 390% cpu 5.661 total

Do you have any comment on why ger inside mtimes is so much slower than calling ger directly?

@9il (Contributor) commented Feb 20, 2018

@ShigekiKarita

  1. gemm vs ger: BLAS operations are not optimized for cases like k = 1, so gemm performs a full gemm even when it could call ger. Another possible optimization is to call gemv when only one of the matrices is a vector.
  2. Lubeck vs mir-blas/mir-lapack: Lubeck allocates memory for the result, and this memory is probably not in the CPU cache. So if you preallocate all memory and use only mir-blas/mir-lapack (and avoid D's hash maps), your algorithm will be faster. Both Python and Lubeck use an internal BLAS implementation, so if you want to beat Python you need to preallocate memory.

  3. Take a look at this line:

auto dh = mtimes(params["Why"].transposed, dy).slice; 

There are two memory allocations here: the first is mtimes, the second is slice.
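The gemm-vs-ger distinction in point 1 can be illustrated with SciPy's low-level BLAS wrappers. This is a hedged sketch assuming `scipy.linalg.blas` is available; it is not Lubeck's actual dispatch code:

```python
import numpy as np
from scipy.linalg.blas import dgemm, dger

x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0])

# gemm view: treat x as (2 x 1) and y as (1 x 2); the inner dimension k is 1,
# a degenerate case that general gemm kernels are not tuned for.
c_gemm = dgemm(1.0, x.reshape(2, 1), y.reshape(1, 2))

# ger view: the same result expressed as a rank-1 update C = alpha * x * y^T.
c_ger = dger(1.0, x, y)

assert np.allclose(c_gemm, c_ger)
assert np.allclose(c_ger, [[1.0, 2.0], [2.0, 4.0]])
```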

@ShigekiKarita (Contributor, Author) commented Feb 20, 2018
Thanks, I see. Do you plan to add an optional output argument to store the result in lubeck? As in numpy and torch, it would support the preallocation strategy.
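For reference, NumPy exposes this kind of preallocation hook through the `out=` parameter of `matmul`; the following is an analogy for the requested lubeck argument, not lubeck's API:

```python
import numpy as np

W = np.array([[1.0, 2.0], [3.0, 4.0]])
dy = np.array([[1.0], [2.0]])

# Without an output argument: each call allocates a fresh result array.
dh = W.T @ dy

# With a preallocated buffer: matmul writes the product in place,
# the way an optional output argument to mtimes could.
buf = np.empty((2, 1))
np.matmul(W.T, dy, out=buf)
assert np.array_equal(buf, dh)
assert np.array_equal(buf, [[7.0], [10.0]])
```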

@jmh530 (Contributor) commented Feb 20, 2018

@9il The mtimes documentation could be adjusted to say "General matrix-matrix multiplication. Allocates result to an uninitialized slice using GC."

I don't know how many of the other functions in lubeck will need to consider these sorts of issues. It probably makes sense to open an issue for further discussion of preallocation or the use of alternate allocators.

@9il (Contributor) commented Feb 20, 2018

> Thanks, I see. Do you have a plan to add an optional argument to store the result in lubeck? As we can see it in numpy and torch, it is good for the preallocation strategy.

The gemm wrapper for ndslice in mir-blas does the same.

> @9il The mtimes documentation could be adjusted to say "General matrix-matrix multiplication. Allocates result to an uninitialized slice using GC."

A PR is welcome. At the same time, this is a general Lubeck concept. Lubeck was originally created to port a commercial Matlab library to D. Similarity, readability, and simplicity were key features. Speed was too, but it had already improved more than a hundredfold compared with the original Matlab code.

> I don't know how many of the other functions that will be in lubeck will need to think about these sorts of issues. It probably makes sense to create an Issue for further discussion on pre-allocation or the use of alternate allocators.

Currently @EmTee70 is working on matrix classes that can hold different payloads (symmetric, diagonal, and other matrices) and have an overloaded "*" operator.

I have thought a lot about an RC-based separate Matrix type system with clever expressions that would feel like Julia or Matlab. Something like this:

Assume it can hold different types of matrices:

  1. dense
  2. symmetric
  3. diagonal
  4. sparse
  5. tridiagonal

and is clever enough, for example, to fuse at run time an expression like

Mat C = alpha * J * J.t - beta * B * R;

into two BLAS (openblas) calls:

syrk (symmetric rank-k update) for

C = alpha * J * J.t

and then

gemm (general matrix-matrix multiplication) for

C -= beta * B * R

Plus it would be able to solve linear systems using lapack:

B /= A;
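The proposed fusion can be imitated with SciPy's BLAS wrappers and a LAPACK-backed solve. This is a hedged sketch under the assumption that `scipy.linalg.blas` is available; note that syrk fills only one triangle, so it is symmetrized explicitly here:

```python
import numpy as np
from scipy.linalg.blas import dsyrk, dgemm

rng = np.random.default_rng(0)
J = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 3))
R = rng.standard_normal((3, 3))
alpha, beta = 2.0, 0.5

# syrk computes C = alpha * J @ J.T as a symmetric rank-k update;
# only the upper triangle is written, so mirror it to get the full matrix.
C = dsyrk(alpha, J)
C = np.triu(C) + np.triu(C, 1).T

# gemm then folds in C = (-beta) * B @ R + 1.0 * C, i.e. C -= beta * B @ R.
C = dgemm(-beta, B, R, beta=1.0, c=C)
assert np.allclose(C, alpha * J @ J.T - beta * B @ R)

# The `B /= A;` step would map to a LAPACK-backed linear solve.
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
X = np.linalg.solve(A, B)
assert np.allclose(A @ X, B)
```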

@Laeeth, @firoozye, possibly it may be a good concept for math programming in D.

@jmh530 (Contributor) commented Feb 20, 2018

@9il Submitted PR.

I'm glad progress is being made on general matrix classes. Would Jean-Louis Leroy's open multi-methods library be useful for these run-time features? He has Matrix examples, but I don't see any that implement operator overloading.
