
How to implement Caffe LSTM layer by MKL-DNN #669

Closed
xiaoweiChen opened this issue Mar 14, 2020 · 15 comments

@xiaoweiChen

Caffe's LSTM implementation is different from other frameworks'.
I want to use MKL-DNN to implement a Caffe LSTM.

Could you give me some suggestions? Thanks!!

@vpirogov vpirogov self-assigned this Mar 16, 2020
@vpirogov
Member

Hi @xiaoweiChen,

It looks like the IntelCaffe LSTM implementation rewrites the LSTM cell in terms of inner product and elementwise operations, which should call the corresponding DNNL layers. If you want a more efficient implementation, you will need to implement a new DNNL-based version of the layer using DNNL's RNN primitive. Do you see any differences in the cell definition or parameters?

An important thing to keep in mind is that IntelCaffe relies on Intel MKL-DNN v0.21, which is not supported anymore. Transition to v1.2 will require significant changes to the codebase.

If you have a Caffe model and are looking for an efficient inference deployment tool, you may want to try OpenVINO, an Intel-supported engine for deep learning model deployment. OpenVINO can run Caffe models with similar or better performance than IntelCaffe and is thread safe.

OpenVINO landing page: https://github.com/opencv/dldt
GitHub: https://github.com/opencv/dldt

@xiaoweiChen
Author

xiaoweiChen commented Mar 17, 2020

Thanks for your reply @vpirogov

In fact, I am an OpenVINO user. I have a few Caffe models that contain LSTM layer(s), but OpenVINO does not support the Caffe LSTM layer. So I added the Caffe LSTM layer as a custom layer into OpenVINO (based on the OpenVINO 2019 R3.1 open source release), modified the model-optimizer and inference-engine accordingly, and got this custom layer working well.

I implemented the Scale, Eltwise(SUM), and FullyConnected layers with DNNL 1.2.1 in the OpenVINO source code. The performance is good for my team.
But I also want to implement the LSTM cell with DNNL 1.2.1, which may improve the character recognition performance (that part is not fast enough, I think).

I have also read about the RNN primitive, but I cannot understand how to pass the parameters to lstm_forward::desc (maybe because the Caffe implementation differs from other frameworks).

So I opened this issue hoping to get an LSTM sample for Caffe, or to learn how to pass the parameters to lstm_forward::desc for the Caffe LSTM layer.

Thanks again!

@emfomenk

At first glance, the LSTM in Caffe matches the definition we have. @xiaoweiChen, could you please clarify what exactly you mean by saying "let me know how to pass the parameters to lstm_forward::desc for the Caffe LSTM layer"?

We recently added simple examples for each primitive, and LSTM is one of them. Did you have a chance to look at it?

@xiaoweiChen
Author

xiaoweiChen commented Mar 18, 2020

Thanks for your work @emfomenk. I just looked at your example, but I am still confused about a few things.

Some preparatory notes first...

  1. The classical LSTM data flow is like the one in https://arxiv.org/pdf/1906.06440.pdf (Figure 3).

  2. The Caffe implementation looks like the graph below (without static input; the graph is just missing a line from cont_x to LSTMUnit):

  3. The Caffe LSTM unit implementation (summarized as equations after these preparatory notes):

  const int num = bottom[0]->shape(1);
  const int x_dim = hidden_dim_ * 4;
  const Dtype* C_prev = bottom[0]->cpu_data();
  const Dtype* X = bottom[1]->cpu_data();
  const Dtype* cont = bottom[2]->cpu_data();
  Dtype* C = top[0]->mutable_cpu_data();
  Dtype* H = top[1]->mutable_cpu_data();
  for (int n = 0; n < num; ++n) {
    for (int d = 0; d < hidden_dim_; ++d) {
      const Dtype i = sigmoid(X[d]);
      const Dtype f = (*cont == 0) ? 0 :
          (*cont * sigmoid(X[1 * hidden_dim_ + d]));
      const Dtype o = sigmoid(X[2 * hidden_dim_ + d]);
      const Dtype g = tanh(X[3 * hidden_dim_ + d]);
      const Dtype c_prev = C_prev[d];
      const Dtype c = f * c_prev + i * g;
      C[d] = c;
      const Dtype tanh_c = tanh(c);
      H[d] = o * tanh_c;
    }
    C_prev += hidden_dim_;
    X += x_dim;
    C += hidden_dim_;
    H += hidden_dim_;
    ++cont;
  }
  4. The lstm_forward::desc constructor looks like this:
lstm_forward::desc(
    aprop,
    direction,
    src_layer_desc, 
    src_iter_h_desc,
    src_iter_c_desc,
    weights_layer_desc, 
    weights_iter_desc,
    bias_desc, 
    dst_layer_desc,
    dst_iter_h_desc,
    dst_iter_c_desc);

OK, preparatory work completed.
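
To summarize the LSTMUnit code in item 3 (my own reading of it, so it may not be exactly right), the per-element math for one time stamp is:

$$ i = \sigma(X_i), \quad f = cont \cdot \sigma(X_f), \quad o = \sigma(X_o), \quad g = \tanh(X_g) $$
$$ c = f \cdot c_{prev} + i \cdot g, \quad h = o \cdot \tanh(c) $$

where X_i, X_f, X_o, X_g are the four hidden_dim-sized slices of the pre-computed gate input X, and cont is 0 at the start of a sequence (forcing f to 0) and 1 otherwise.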

I will give my opinion below; it may not be right.

Confusion 1:
Can lstm_forward::desc implement the Caffe LSTMUnit?

My opinion:
I think it cannot, because the Caffe LSTMUnit has no weights; the weights are processed in the InnerProduct layer. This differs from the classical LSTM implementation, so I cannot replace LSTMUnit with DNNL lstm_forward.

Confusion 2:
Based on Confusion 1, the LSTM needs to be re-implemented with DNNL lstm_forward, without the Scale, InnerProduct, Eltwise, etc. layers, using lstm_forward::desc to build the whole LSTM layer.
Then I run into the confusion:
How do I handle the cont input (which you can see in the Caffe implementation image; its shape is (16, 3))?
Does cont map to src_iter_c_desc in lstm_forward::desc?
cont is used in the Caffe implementation, so I cannot ignore it.

In the primitive example for LSTM and the rnn-inference-fp32 example, src_iter_c_desc is memory::desc().
So I don't know how to handle cont for lstm_forward::desc.

My opinion:
This is my main question for this issue.

Hopefully I have explained my confusions clearly.

Looking forward to your reply! Thanks.

@emfomenk

Hi @xiaoweiChen,

Sorry for the delayed response, and thanks for the nice explanation and pictures!

To align on terminology, let me try to map the Caffe variables to DNNL names.

Caffe                  DNNL                           Comment
x    [T, N, SLC]       src_layer  [T, N, SLC]
c_0  [1, N, DIC]       src_iter_c [L, D, N, DIC]
c_$i, i = 1 .. 15      --                             invisible intermediate cell states
c_16 [1, N, DIC]       dst_iter_c [L, D, N, DIC]
h_0  [1, N, SIC]       src_iter   [L, D, N, SIC]
h_$i, i = 1 .. 15      --                             invisible intermediate hidden states
h_16 [1, N, DIC]       dst_iter   [L, D, N, DIC]
h    [T, N, DIC]       dst_layer  [T, N, DIC]         concatenation of h_1 ... h_16

Parameters table:

L = 1 # number of LSTM layers
D = 1 # number of directions
T = 16 # time stamps
N = 3 # batch
SLC = 4096
DIC = 256

Now, coming back to your questions:

I think it cannot, because the Caffe LSTMUnit has no weights; the weights are processed in the InnerProduct layer. This differs from the classical LSTM implementation, so I cannot replace LSTMUnit with DNNL lstm_forward.

The weights that are processed in one of the InnerProduct layers are what you showed in the first picture, x_t * {W_o, W_f, W_c, W_i}, and will be processed by the LSTM primitive.

The other InnerProduct corresponds to h_t-1 * {R_o, R_f, R_c, R_i} and will be handled by the LSTM primitive as well.

Based on Confusion 1, the LSTM needs to be re-implemented with DNNL lstm_forward, without the Scale, InnerProduct, Eltwise, etc. layers, using lstm_forward::desc to build the whole LSTM layer.
Then I run into the confusion:
How do I handle the cont input (which you can see in the Caffe implementation image; its shape is (16, 3))?
Does cont map to src_iter_c_desc in lstm_forward::desc?
cont is used in the Caffe implementation, so I cannot ignore it.

cont is an implementation detail of Caffe. It equals 0 for the first time stamp and 1 for all the other time stamps (see here). The LSTM primitive will take care of this on its own; you should not worry about the cont tensor from Caffe.

In the primitive example for LSTM and the rnn-inference-fp32 example, src_iter_c_desc is memory::desc().
So I don't know how to handle cont for lstm_forward::desc.

As I mentioned above, cont should not appear when you use the DNNL LSTM primitive. The tensors you need to use are the following:

// x
memory::desc src_layer_md({T, N, SLC}, dt::f32, format_tag::tnc);
memory src_layer(src_layer_md, engine, x);

// c_0
memory::desc src_iter_c_md({L, D, N, DIC}, dt::f32, format_tag::ldnc);
memory src_iter_c(src_iter_c_md, engine, c_0);

// c_T
memory::desc dst_iter_c_md({L, D, N, DIC}, dt::f32, format_tag::ldnc);
memory dst_iter_c(dst_iter_c_md, engine, c_T);

// h_0
memory::desc src_iter_md({L, D, N, SIC}, dt::f32, format_tag::ldnc);
memory src_iter(src_iter_md, engine, h_0);

// h_16
memory::desc dst_iter_md({L, D, N, DIC}, dt::f32, format_tag::ldnc);
memory dst_iter(dst_iter_md, engine, h_16);

// h
memory::desc dst_layer_md({T, N, DIC}, dt::f32, format_tag::tnc);
memory dst_layer(dst_layer_md, engine, h);

// W_xc
memory::desc weights_layer_md({L, D, SLC, 4, DIC}, dt::f32, format_tag::ldigo);
memory weights_layer(weights_layer_md, engine, W_xc);

// b_c
memory::desc bias_md({L, D, 4, DIC}, dt::f32, format_tag::ldgo);
memory bias(bias_md, engine, b_c);

// W_hc
memory::desc weights_iter_md({L, D, SIC, 4, DIC}, dt::f32, format_tag::ldigo);
memory weights_iter(weights_iter_md, engine, W_hc);

lstm_forward::desc(forward_inference, unidirectional_left2right,
    src_layer_md, src_iter_md, src_iter_c_md,
    weights_layer_md, weights_iter_md, bias_md,
    dst_layer_md, dst_iter_md, dst_iter_c_md);

@emfomenk emfomenk assigned emfomenk and unassigned vpirogov Apr 17, 2020
@xiaoweiChen
Author

xiaoweiChen commented Apr 18, 2020

Thanks for your explanation @emfomenk

Now I know how to map the Caffe LSTM to the DNNL LSTM interface.

However, based on your explanation, I have a question.

As I understand it:

L = 1 # number of LSTM layers
D = 1 # number of directions
T = 16 # time stamps
N = 3 # batch
SLC = 4096
DIC = 256

but...

// h_0
memory_desc src_iter_md({L, D, N, SIC}, dt::f32, format_tag::ldnc);
and
// W_hc
memory_desc weights_iter_md({L, D, SIC, 4, DIC}, dt::f32, format_tag::ldigo);

I don't know the SIC value.
Is this just a mistake?
Should it be SIC->DIC or SIC->SLC?

I tried to resolve this myself:
h_0: SIC->DIC
W_hc: SIC->SLC

Then I implemented this LSTM layer in C++ and got the code to compile.

Unfortunately, I hit a runtime error:

terminate called after throwing an instance of 'dnnl::error'
what(): could not create a descriptor for an LSTM forward propagation primitive

My code is pasted below; maybe some argument for lstm_forward::primitive_desc is wrong...

 // DNNL implementation
  using tag = dnnl::memory::format_tag;
  using dt = dnnl::memory::data_type;

  dnnl::engine eng(dnnl::engine::kind::cpu, 0);
  dnnl::stream s(eng);

  std::size_t L = 1; // number of LSTM layers
  std::size_t D = 1; // number of directions
  std::size_t T = x_dims.at(0); // time stamps
  std::size_t N = x_dims.at(1); // batch

  std::size_t SLC = x_dims.at(2);
  std::size_t DIC = h_dims.at(2);

   // avoid compile-time warning-as-error
  long int L_i = L;
  long int D_i = D;
  long int T_i = T;
  long int N_i = N;
  long int SLC_i = SLC;
  long int DIC_i = DIC;

  // x 
  dnnl::memory::desc src_layer_md(
  { T_i, N_i, SLC_i }, dt::f32, tag::tnc);
  dnnl::memory src_layer(src_layer_md, eng, x->buffer());

  SizeVector inner_c_h_shape{ L, D, N, DIC };
  auto inner_c_h_desc = TensorDesc(Precision::FP32, inner_c_h_shape, ANY);

  // c_0
  auto c_0 = std::make_shared<TBlob<float>>(inner_c_h_desc);
  c_0->allocate();
  dnnl::memory::desc src_iter_c_md(
  { L_i, D_i, N_i, DIC_i }, dt::f32, tag::ldnc);
  dnnl::memory src_iter_c(src_iter_c_md, eng, c_0->buffer());
  memset(
      static_cast<void*>(c_0->buffer()),
      0,
      L*D*N* DIC * sizeof(float));

  // c_T
  auto c_T = std::make_shared<TBlob<float>>(inner_c_h_desc);
  c_T->allocate();
  dnnl::memory::desc dst_iter_c_md(
  { L_i, D_i, N_i, DIC_i }, dt::f32, tag::ldnc);
  dnnl::memory dst_iter_c(dst_iter_c_md, eng, c_T->buffer());
  memset(
      static_cast<void*>(c_T->buffer()),
      0,
      L*D*N* DIC * sizeof(float));

  // h_0
  auto h_0 = std::make_shared<TBlob<float>>(inner_c_h_desc);
  h_0->allocate();
  dnnl::memory::desc  src_iter_md(
  { L_i, D_i, N_i, DIC_i }, dt::f32, tag::ldnc);
  dnnl::memory src_iter(src_iter_md, eng, h_0->buffer());
  memset(
      static_cast<void*>(h_0->buffer()),
      0,
      L*D*N* DIC * sizeof(float));

  // h_16
  auto h_16 = std::make_shared<TBlob<float>>(inner_c_h_desc);
  h_16->allocate();
  dnnl::memory::desc  dst_iter_md(
  { L_i, D_i, N_i, DIC_i }, dt::f32, tag::ldnc);
  dnnl::memory dst_iter(dst_iter_md, eng, h_16->buffer());
  memset(
      static_cast<void*>(h_16->buffer()),
      0,
      L*D*N* DIC * sizeof(float));

  // h
  dnnl::memory::desc dst_layer_md(
  { T_i, N_i, DIC_i }, dt::f32, tag::tnc);
  dnnl::memory dst_layer(dst_layer_md, eng, h->buffer());

  // W_xc
  dnnl::memory::desc weights_layer_md(
  { L_i, D_i, SLC_i, 4, DIC_i }, dt::f32, tag::ldigo);
  dnnl::memory weights_layer(weights_layer_md, eng, this->W_xc->buffer());

  // b_c
  dnnl::memory::desc bias_md({ L_i, D_i, 4, DIC_i }, dt::f32, tag::ldgo);
  dnnl::memory bias(bias_md, eng, this->b_c->buffer());

  // W_hc
  dnnl::memory::desc weights_iter_md({ L_i, D_i, SLC_i, 4, DIC_i }, dt::f32, tag::ldigo);
  dnnl::memory weights_iter(weights_iter_md, eng, this->W_hc->buffer());

  auto lstm_desc =
      dnnl::lstm_forward::desc(
          dnnl::prop_kind::forward_inference,
          dnnl::rnn_direction::unidirectional_left2right,
          src_layer_md,
          src_iter_md,
          src_iter_c_md,
          weights_layer_md,
          weights_iter_md,
          bias_md,
          dst_layer_md,
          dst_iter_md,
          dst_iter_c_md);

  auto lstm_prim_desc =
      dnnl::lstm_forward::primitive_desc(
          lstm_desc,
          eng);

  dnnl::lstm_forward(lstm_prim_desc).execute(
      s,
      {
          { DNNL_ARG_SRC_LAYER, src_layer },
          { DNNL_ARG_WEIGHTS_LAYER, weights_layer },
          { DNNL_ARG_WEIGHTS_ITER, weights_iter },
          { DNNL_ARG_BIAS, bias },
          { DNNL_ARG_DST_LAYER, dst_layer },
          { DNNL_ARG_SRC_ITER_C, src_iter_c },
          { DNNL_ARG_DST_ITER_C, dst_iter_c },
          { DNNL_ARG_SRC_ITER, src_iter },
          { DNNL_ARG_DST_ITER, dst_iter },
      });

  s.wait();

@emfomenk

In your case SIC == DIC and should be used for both h_0 and W_hc.
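
Concretely, a sketch of the two descriptors with SIC replaced by DIC (reusing the variable names from your code; everything else stays the same):

// h_0
dnnl::memory::desc src_iter_md({L_i, D_i, N_i, DIC_i}, dt::f32, tag::ldnc);

// W_hc
dnnl::memory::desc weights_iter_md({L_i, D_i, DIC_i, 4, DIC_i}, dt::f32, tag::ldigo);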

@xiaoweiChen
Author

xiaoweiChen commented Apr 19, 2020

Thanks @emfomenk, using SIC == DIC resolves the above exception.

Now I have made the layer into a standalone test program without the OpenVINO IE framework. The program crashes at the LSTM execute call.

I tried to debug this against the DNNL 1.2.1 release source code (on Windows):

<root>\src\cpu\gemm\f32\jit_avx_gemm_f32.cpp:2586

sgemm_nocopy_driver(transa, transb, myM, myN, myK, p_alpha, myA,
                        lda, myB, ldb, &myBeta, myC, ld, myBias, ws);

The second time this line is hit, the program crashes.

The crash location is
<root>\src\cpu\gemm\f32\jit_avx_gemm_f32.cpp:2313

ker_(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc, bias, ws);

The local values are shown in a debugger screenshot (not reproduced here).

I found that the bias pointer is null, and I cannot step into this line any further...

Here is my program code:

#include <vector>

#include "dnnl.hpp"

using SizeVector = std::vector<std::size_t>;

int main() {

    std::vector<std::size_t> x_dims {16, 3, 4096};
    auto x_size = 1;
    for (auto num : x_dims) {
        x_size *= num;
    }
    auto x = new float[x_size];

    std::vector<std::size_t> h_dims {16, 3, 256};
    auto h_size = 1;
    for (auto num : h_dims) {
        h_size *= num;
    }
    auto h = new float[h_size];

    // DNNL implementation
    using tag = dnnl::memory::format_tag;
    using dt = dnnl::memory::data_type;

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    dnnl::stream s(eng);

    std::size_t L = 1; // number of LSTM layers
    std::size_t D = 1; // number of directions
    std::size_t T = x_dims.at(0); // time stamps
    std::size_t N = x_dims.at(1); // batch

    std::size_t SLC = x_dims.at(2);
    std::size_t DIC = h_dims.at(2);
    auto SIC = DIC;

    // avoid compile-time warning-as-error
    long int L_i = L;
    long int D_i = D;
    long int T_i = T;
    long int N_i = N;
    long int SLC_i = SLC;
    long int DIC_i = DIC;
    long int SIC_i = SIC;

    // x
    dnnl::memory::desc src_layer_md({T_i, N_i, SLC_i}, dt::f32, tag::tnc);
    //dnnl::memory src_layer(src_layer_md, eng, x->buffer());
    dnnl::memory src_layer(src_layer_md, eng, x);

    SizeVector inner_c_h_shape {L, D, N, DIC};
    //auto inner_c_h_desc = TensorDesc(Precision::FP32, inner_c_h_shape, ANY);
    auto inner_c_h_size = 1;
    for (auto num : inner_c_h_shape) {
        inner_c_h_size *= num;
    }

    // c_0
    //auto c_0 = std::make_shared<TBlob<float>>(inner_c_h_desc);
    //c_0->allocate();
    auto c_0 = new float[inner_c_h_size];
    dnnl::memory::desc src_iter_c_md(
            {L_i, D_i, N_i, DIC_i}, dt::f32, tag::ldnc);
    dnnl::memory src_iter_c(src_iter_c_md, eng, c_0);
    memset(c_0, 0, inner_c_h_size * sizeof(float));

    // c_T
    //auto c_T = std::make_shared<TBlob<float>>(inner_c_h_desc);
    //c_T->allocate();
    auto c_T = new float[inner_c_h_size];
    dnnl::memory::desc dst_iter_c_md(
            {L_i, D_i, N_i, DIC_i}, dt::f32, tag::ldnc);
    dnnl::memory dst_iter_c(dst_iter_c_md, eng, c_T);
    memset(c_T, 0, inner_c_h_size * sizeof(float));

    // h_0
    //auto h_0 = std::make_shared<TBlob<float>>(inner_c_h_desc);
    //h_0->allocate();
    auto h_0 = new float[inner_c_h_size];
    dnnl::memory::desc src_iter_md({L_i, D_i, N_i, DIC_i}, dt::f32, tag::ldnc);
    dnnl::memory src_iter(src_iter_md, eng, h_0);
    memset(h_0, 0, inner_c_h_size * sizeof(float));

    // h_16
    //auto h_16 = std::make_shared<TBlob<float>>(inner_c_h_desc);
    //h_16->allocate();
    auto h_16 = new float[inner_c_h_size];
    dnnl::memory::desc dst_iter_md({L_i, D_i, N_i, DIC_i}, dt::f32, tag::ldnc);
    dnnl::memory dst_iter(dst_iter_md, eng, h_16);
    memset(h_16, 0, inner_c_h_size * sizeof(float));

    // h
    dnnl::memory::desc dst_layer_md({T_i, N_i, DIC_i}, dt::f32, tag::tnc);
    dnnl::memory dst_layer(dst_layer_md, eng, h);

    // W_xc
    SizeVector W_xc_shape {L, D, SLC, 4, DIC};
    auto W_xc_size = 1;
    for (auto num : W_xc_shape) {
        W_xc_size *= num;
    }
    auto W_xc = new float[W_xc_size];
    dnnl::memory::desc weights_layer_md(
            {L_i, D_i, SLC_i, 4, DIC_i}, dt::f32, tag::ldigo);
    dnnl::memory weights_layer(weights_layer_md, eng, W_xc);

    // b_c
    SizeVector bias_shape {L, D, 4, DIC};
    auto bias_size = 1;
    for (auto num : bias_shape) {
        bias_size *= num;
    }
    auto b_c = new float[bias_size];
    dnnl::memory::desc bias_md({L_i, D_i, 4, DIC_i}, dt::f32, tag::ldgo);
    dnnl::memory bias(bias_md, eng, b_c);

    // W_hc
    SizeVector W_hc_shape {L, D, SIC, 4, DIC};
    auto W_hc_size = 1;
    for (auto num : W_hc_shape) {
        W_hc_size *= num;
    }
    auto W_hc = new float[bias_size];
    dnnl::memory::desc weights_iter_md(
            {L_i, D_i, SIC_i, 4, DIC_i}, dt::f32, tag::ldigo);
    dnnl::memory weights_iter(weights_iter_md, eng, W_hc);

    auto lstm_desc = dnnl::lstm_forward::desc(
            dnnl::prop_kind::forward_inference,
            dnnl::rnn_direction::unidirectional_left2right, src_layer_md,
            src_iter_md, src_iter_c_md, weights_layer_md, weights_iter_md,
            bias_md, dst_layer_md, dst_iter_md, dst_iter_c_md);

    auto lstm_prim_desc = dnnl::lstm_forward::primitive_desc(lstm_desc, eng);

    dnnl::lstm_forward(lstm_prim_desc)
            .execute(s,
                    {
                            {DNNL_ARG_SRC_LAYER, src_layer},
                            {DNNL_ARG_WEIGHTS_LAYER, weights_layer},
                            {DNNL_ARG_WEIGHTS_ITER, weights_iter},
                            {DNNL_ARG_BIAS, bias},
                            {DNNL_ARG_DST_LAYER, dst_layer},
                            {DNNL_ARG_SRC_ITER_C, src_iter_c},
                            {DNNL_ARG_DST_ITER_C, dst_iter_c},
                            {DNNL_ARG_SRC_ITER, src_iter},
                            {DNNL_ARG_DST_ITER, dst_iter},
                    });

    s.wait();

    return 0;
}

Because this is just a small test program, I did not use delete to release the memory allocated with new.

@xiaoweiChen
Author

xiaoweiChen commented Apr 19, 2020

Changing auto W_hc = new float[bias_size]; to auto W_hc = new float[W_hc_size]; resolves the crash.

Now the test program runs well. Tomorrow I will test this under the OpenVINO IE framework.

If everything is OK, I will show the layer benchmark data here ;)

@emfomenk

Glad to hear you resolved the issues :)

@xiaoweiChen
Author

xiaoweiChen commented Apr 24, 2020

hi @emfomenk ,

I integrated the test program code into the framework, but I get wrong results.

I have also extracted my Caffe LSTM implementation from the framework (it is implemented with DNNL as well).

I have shared my test code and implementation code here:
xiaowei.tar.gz

Two questions here:

  1. When I use std::fill to fill the input data and the weight/bias data (you can see this in the test code file), the test passes, but I don't think the result is right.
    My output looks like this:
[PASSED] caffe result is little difference as native result!
caffe_h[0] : 0.761594 vs. native_h[0] : 0.761594
caffe_h[1] : 0.761594 vs. native_h[1] : 0.761594
caffe_h[2] : 0.761594 vs. native_h[2] : 0.761594
caffe_h[3] : 0.761594 vs. native_h[3] : 0.761594
caffe_h[4] : 0.761594 vs. native_h[4] : 0.761594
caffe_h[5] : 0.761594 vs. native_h[5] : 0.761594
caffe_h[6] : 0.761594 vs. native_h[6] : 0.761594
caffe_h[7] : 0.761594 vs. native_h[7] : 0.761594
caffe_h[8] : 0.761594 vs. native_h[8] : 0.761594
caffe_h[9] : 0.761594 vs. native_h[9] : 0.761594

caffe cost : 317.369 ms
native cost : 20.844 ms

E:\openSourceProjecrts\mirrors-MKL-DNN-v1.3\MKL-DNN\build-windows\xiaowei\Debug\LSTM-test.exe (process 20092) exited with code 0.
Press any key to close this window . . .
  2. When I use random numbers to fill the input data and the weight/bias data, the test fails.
    My output looks like this:
caffe_h[0] : -0.761594 vs. native_h[0] : 0

E:\openSourceProjecrts\mirrors-MKL-DNN-v1.3\MKL-DNN\build-windows\xiaowei\Debug\LSTM-test.exe (process 2036) exited with code -1.
Press any key to close this window . . .

At LSTM-test.cpp:58 there is a #if 1; changing it to #if 0 switches to random-number mode.

If you have time, please help me. I don't know where it goes wrong...

Thanks!!!

@xiaoweiChen
Author

I will close this issue. The last question may just be a precision issue.

I found a way to resolve it and to make the result match my Caffe LSTM implementation.

I made some modifications in jit_uni_rnn_common_postgemm.hpp. In this part DNNL uses 'kernel_' to compute the LSTM cell, which is generated as assembly with xbyak. I am not familiar with xbyak and cannot dump the generated code, so I have no idea which part differs in the assembly.

I paste the modified code here.

jit_uni_rnn_common_postgemm.hpp

...
template <typename src_data_t, typename acc_data_t, typename scratch_data_t>
    rnn_postgemm_sig(execute_fwd) {
        using namespace rnn_utils;
        rnn_utils::ws_gates_aoc<src_data_t> ws_gates(rnn, ws_gates_);
        rnn_utils::ws_gates_aoc<scratch_data_t> scratch_gates(
                rnn, scratch_gates_);
        rnn_utils::weights_peephole_aoc_t<const float> weights_peephole(
                rnn, weights_peephole_);
        rnn_utils::bias_aoc_t bias(rnn, bias_);
        auto src_iter_ld = rnn.src_iter_ld(cell_position);
        auto src_iter_c_ld = rnn.src_iter_c_ld(cell_position);
        auto dst_iter_c_ld = rnn.dst_iter_c_ld(cell_position);
        auto dst_ld = rnn.dst_ld(cell_position);
        auto dst_copy_ld = rnn.dst_copy_ld(cell_position);
        rnn_utils::ws_states_aoc<src_data_t> states_t_l(
                rnn, states_t_l_, dst_ld);
        rnn_utils::ws_states_aoc<src_data_t> states_t_l_copy(
                rnn, states_t_l_copy_, dst_copy_ld);
        rnn_utils::ws_states_aoc<const src_data_t> states_tm1_l(
                rnn, states_tm1_l_, src_iter_ld);
        rnn_utils::ws_states_aoc<float> c_states_t_l(
                rnn, c_states_t_l_, dst_iter_c_ld);
        rnn_utils::ws_states_aoc<const float> c_states_tm1_l(
                rnn, c_states_tm1_l_, src_iter_c_ld);
        rnn_utils::ws_gates_aoc<scratch_data_t> scratch_cell(
                rnn, scratch_cell_);
        utils::array_offset_calculator<src_data_t, 2> ws_Wh_b(
                ws_grid_, rnn.mb, rnn.dic);

        static std::atomic_int cont = {0};

        auto caffe_lstm_cell = [](
         int dic,
         int cont,
         void *param2_,
         const void *param3_,
         void *param4_,
         const void *param6_, 
         void *param7_) {

         float *X = static_cast<float *>(param2_);
         const float *biases = static_cast<const float *>(param3_);
         float *h = static_cast<float *>(param4_);
         const float *pre_c = static_cast<const float *>(param6_);
         float *c = static_cast<float *>(param7_);

         auto sigmoid = [](float x) { 
           return 1.f / (1.f + expf(-x));
         };

         auto i_index_offset = 0 * dic;
         auto f_index_offset = 1 * dic;
         auto o_index_offset = 2 * dic;
         auto g_index_offset = 3 * dic;

         for (auto index = 0; index < dic; ++index) {
             auto i_index = i_index_offset + index;
             auto f_index = f_index_offset + index;
             auto o_index = o_index_offset + index;
             auto g_index = g_index_offset + index;

             auto i_gate_num = X[i_index] + biases[i_index];
             auto f_gate_num = X[f_index] + biases[f_index];
             auto o_gate_num = X[o_index] + biases[o_index];
             auto g_gate_num = X[g_index] + biases[g_index];

             auto i_gate = sigmoid(i_gate_num);
             auto f_gate = cont == 0 ? 0.f : sigmoid(f_gate_num);
             auto o_gate = sigmoid(o_gate_num);
             auto g = tanhf(g_gate_num);

             auto preC = pre_c[index];
             auto c_curr = f_gate * preC + i_gate * g;
             auto h_curr = o_gate * tanhf(c_curr);

             c[index] = c_curr;
             h[index] = h_curr;
         }
        };

        // Todo: add parallelization on dic for the batch 1 case
        // Assumption: the kernel runs a loop on dic elements
        parallel_nd(rnn.mb, [&](int i) {
            void *param1_ = &ws_gates(i, 0, 0); // RNN, LSTM, GRU
            void *param2_ = &scratch_gates(i, 0, 0); // RNN, LSTM, GRU
            const void *param3_ = &bias(0, 0); // RNN, LSTM, GRU
            void *param4_ = &states_t_l(i, 0); // RNN, LSTM, GRU
            void *param5_ = states_t_l_copy_
                    ? &states_t_l_copy(i, 0)
                    : states_t_l_copy_; // RNN, LSTM, GRU
            const void *param6_;
            void *param7_, *param8_;
            void *param9_ = nullptr;
            switch (pd_->cell_kind()) {
                case alg_kind::vanilla_lstm:
                    param6_ = &c_states_tm1_l(i, 0);
                    param7_ = &c_states_t_l(i, 0);
                    param8_ = (void *)&weights_peephole(0, 0);
                    break;
                case alg_kind::lbr_gru:
                    param6_ = &states_tm1_l(i, 0);
                    param7_ = &scratch_cell(i, 0, 0);
                    param8_ = &ws_Wh_b(i, 0);
                    break;
                case alg_kind::vanilla_gru:
                    param6_ = &states_tm1_l(i, 0);
                    param7_ = nullptr;
                    param8_ = nullptr;
                    break;
                default:
                    param6_ = nullptr;
                    param7_ = nullptr;
                    param8_ = nullptr;
                    break;
            }
            //kernel_(param1_, param2_, param3_, param4_, param5_, param6_,
            //        param7_, param8_, param9_);

            caffe_lstm_cell(
              rnn.dic,
              cont.load(), 
              param2_, 
              param3_, 
              param4_,
              param6_, 
              param7_);
        });

        cont.fetch_add(1, std::memory_order_relaxed);
    }
...

@emfomenk

Thanks for the analysis! The only difference I see is:

auto f_gate = cont == 0 ? 0.f : sigmoid(f_gate_num);

while in oneDNN we have:

auto f_gate = sigmoid(f_gate_num);

According to oneDNN's RNN definition (link), the f_gate is always computed as a sigmoid:

[image: LSTM cell formulas from the oneDNN RNN documentation]
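
Transcribed from the oneDNN documentation (in place of the image above), the forget gate is

$$ f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + B_f) $$

so it never depends on a cont-like flag.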

Maybe you could initialize the f-gate part of the first iteration of the input_layer data with -infinity, to make the corresponding sigmoid return 0.

@xiaoweiChen
Author

xiaoweiChen commented May 26, 2020

Happy to see your reply, and thanks for pointing out the difference, @emfomenk.

In fact, I don't want to modify the code in DNNL.

Following your words "initialize the f-gate part of the first iteration of the input_layer data with -infinity", and given the f_gate equation auto f_gate_num = X[f_index] + biases[f_index];, I think I can modify the biases to achieve this. (I have not done it yet.)

This would not require modifying anything in DNNL.

This may be the best way for me right now.

In your opinion, is this a good idea? Or can you give some suggestions for doing this?

@emfomenk

You cannot modify bias, because it is the same for all timestamps: if you do so, you will get incorrect results for t=1 .. T.

I suggested initializing h0, but now, looking at the formula, I realize that it will be multiplied by the U matrix, which could lead to NaNs (as the weights can be both positive and negative).

So it seems this is an incompatibility that cannot be easily fixed. The only way to work around this issue is to split the LSTM into two steps: t = 0 and t = 1 .. T. For the first LSTM (with t = 0) you can do the trick with the bias, as you mentioned. For t = 1 .. T you just use the parameters from the model. A minimal sketch of this split is below.
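
A rough sketch of this split (a suggestion only, not tested; it reuses the variables from your standalone test program above and assumes the i, f, c~, o gate order of the oneDNN bias layout, with a large negative value standing in for -infinity):

// Bias for t = 0: copy b_c and push the f-gate slice to a large negative
// value so that sigmoid(f) ~= 0, emulating Caffe's cont == 0 behaviour.
std::vector<float> b_c0(b_c, b_c + bias_size);
for (long int d = 0; d < DIC_i; ++d)
    b_c0[1 * DIC_i + d] = -1e30f;
dnnl::memory bias0(bias_md, eng, b_c0.data());

// Step 1: a one-time-stamp LSTM for t = 0 with the modified bias.
dnnl::memory::desc src0_md({1, N_i, SLC_i}, dt::f32, tag::tnc);
dnnl::memory::desc dst0_md({1, N_i, DIC_i}, dt::f32, tag::tnc);
dnnl::memory src0(src0_md, eng, x); // first time stamp of x
dnnl::memory dst0(dst0_md, eng, h); // first time stamp of h

auto lstm0_pd = dnnl::lstm_forward::primitive_desc(
        dnnl::lstm_forward::desc(dnnl::prop_kind::forward_inference,
                dnnl::rnn_direction::unidirectional_left2right, src0_md,
                src_iter_md, src_iter_c_md, weights_layer_md, weights_iter_md,
                bias_md, dst0_md, dst_iter_md, dst_iter_c_md),
        eng);
dnnl::lstm_forward(lstm0_pd).execute(s,
        {{DNNL_ARG_SRC_LAYER, src0}, {DNNL_ARG_SRC_ITER, src_iter},
                {DNNL_ARG_SRC_ITER_C, src_iter_c},
                {DNNL_ARG_WEIGHTS_LAYER, weights_layer},
                {DNNL_ARG_WEIGHTS_ITER, weights_iter}, {DNNL_ARG_BIAS, bias0},
                {DNNL_ARG_DST_LAYER, dst0}, {DNNL_ARG_DST_ITER, dst_iter},
                {DNNL_ARG_DST_ITER_C, dst_iter_c}});

// Step 2: the remaining time stamps t = 1 .. T-1 with the original bias.
// The h/c produced by step 1 (dst_iter / dst_iter_c) become the initial
// state; the h_0 / c_0 buffers are reused as outputs to avoid aliasing.
dnnl::memory::desc src1_md({T_i - 1, N_i, SLC_i}, dt::f32, tag::tnc);
dnnl::memory::desc dst1_md({T_i - 1, N_i, DIC_i}, dt::f32, tag::tnc);
dnnl::memory src1(src1_md, eng, x + N_i * SLC_i); // time stamps 1 .. T-1 of x
dnnl::memory dst1(dst1_md, eng, h + N_i * DIC_i);

auto lstm1_pd = dnnl::lstm_forward::primitive_desc(
        dnnl::lstm_forward::desc(dnnl::prop_kind::forward_inference,
                dnnl::rnn_direction::unidirectional_left2right, src1_md,
                src_iter_md, src_iter_c_md, weights_layer_md, weights_iter_md,
                bias_md, dst1_md, dst_iter_md, dst_iter_c_md),
        eng);
dnnl::lstm_forward(lstm1_pd).execute(s,
        {{DNNL_ARG_SRC_LAYER, src1}, {DNNL_ARG_SRC_ITER, dst_iter},
                {DNNL_ARG_SRC_ITER_C, dst_iter_c},
                {DNNL_ARG_WEIGHTS_LAYER, weights_layer},
                {DNNL_ARG_WEIGHTS_ITER, weights_iter}, {DNNL_ARG_BIAS, bias},
                {DNNL_ARG_DST_LAYER, dst1}, {DNNL_ARG_DST_ITER, src_iter},
                {DNNL_ARG_DST_ITER_C, src_iter_c}});

s.wait();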
