
RFC: API changes for the upcoming v1.0 #384

Closed
wants to merge 4 commits

Conversation

@emfomenk commented Jan 15, 2019

The Intel(R) MKL-DNN team is planning to release v1.0 of the library in mid-2019. With this major version change, we would like to clean up the API to make Intel MKL-DNN easier to use.

This RFC describes user-visible and some important internal changes to help developers understand what to expect in the future. The Intel MKL-DNN team would appreciate your feedback and suggestions on the given proposal! Feel free to post them in this PR.

UPDATE 03/08/19: v1.0 Preview Candidate is released.
UPDATE 03/12/19: Update the RFC. See changes here.

@emfomenk self-assigned this Jan 15, 2019
@emfomenk (Author)

Cast @4pao, @kruus, and @tensor-tang.

@tensor-tang (Contributor)

Great to know.
Congratulations!

@kruus (Contributor) commented Jan 31, 2019

With the current framework, pretty much everything is ugly wrt. memory, and I welcome the efforts to fix things. The current memory descriptor zoo certainly could use some nice, logically organized cages.

On the other hand, in the current disorganized situation, I could (with much work) define a new named format with whatever bizarre behaviors I wanted. So let me ask about a concrete example I've been thinking of recently, and at least one extension that I'd like to see supportable in the new framework.

Ex. (VE chip) default alignment for float data is 4, but there's a nice packed vector load that reads 512 contiguous floats into a vector register... but requires the load address to be 8-byte aligned. (Otherwise I need a slow'n'ugly vec-read-upper, vec-read-lower, vec-merge.)

  1. perhaps add a 'mkldnn_memory_desc_t::extra::align' field, with 0 ~ "I'm OK with compiler default"
  • I would like any nonzero extra::align to be a requirement rather than a hint.
  • even for x86, you would have the option to force alignment with cache line/page boundary
    • then my JIT code would never need to look at address alignment [and the innermost tensor dimension] at runtime (ouch) to decide whether to issue the 'nice' read or the ugly upper-read, lower-read, merge sequence (or to punt to a nice implementation or an ugly one) based on some possibly random default allocator alignment.

For my example, if the base tensor alignment were 8 and the innermost dimension were even, then I could JIT for only the nice reads, because all internal pointers would maintain the extra alignment of 8.

  2. So how would I describe a tensor whose logical innermost dimension can be odd, but is actually stored as a dense tensor with right/left zero padding (so fast primitives for dense tensors can "just work")?
  • Is it possible to specify a tensor whose innermost dimension has been left [or right] zero padded with one zero, so that we can execute primitives that assume dense data?
  • Would this be described as abcD2d in the new framework? Or is there maybe an easier way to describe this type of padding?
  • I was not sure about the intent of the padding fields of mkldnn_memory_desc_t. Are they intended for tensor views (subtensors), with no dense storage implication? Could you document those fields? (clarify logical vs. physical padding)
  • Comment: I don't see a real use for x86 of this physical innermost-padding. It seems exactly the opposite of the proposed blocking descriptor, which can nicely pad the outermost dims to some blocking factor.

Apart from the above minor questions, I really like what I see 👍

@tensor-tang (Contributor)

Firstly, sorry for the late feedback. I saw that some API changes have already been applied.
So here I just give some of my opinions for your reference. Hope they are helpful.

  1. 2.3. Towards stateless primitives: explicit scratchpad management

I like the explicit way ! 👍
And maybe this change should be more thorough: don't give two choices, since the other one is only a performance trick, unless you plan to provide more documentation for it.

  2. 3.2. Operation primitives cannot be used as inputs (use memory instead)

With the new API, users will be allowed to pass only memory type as inputs and outputs for primitives.

This should be more clear. 👍

  3. 3.5 Short summary
    I did not get the benefits. From the code lines, there seems to be no reduction.

  4. As for the memory desc & block, that's a headache of a problem.
    So my suggestion is: why not just go WYSIWYG and make it as simple as possible?

  • As a user, I may not care about the block dims you would use inside, so I hope I could just pass nchw as I know it: I would give user dims {1,16,2,3} and format nchw or abcd.
  • As for the developer, the format should be reordered. So just save it as physical dims {1,1,2,3,16} and format nChw16c or aBcd16. The dims check can refer to the format type, so only the last 16 should be fixed. I think using a vector to save the actual dims is simple; the rest of the design should focus on how to make it as usable as possible.
    And the only time a user needs to know whether a reorder is needed should be when creating the primitive desc. The user does not even need to be aware of the block format: just give an md to the user and reorder it. The only guarantee is that the PDesc should always give the recommended MD.

@emfomenk (Author)

Hi @kruus,

Thanks a lot for the comments!
And I really apologize for the delay in responding to you!

  1. Alignment is always a pain. In general, I like your idea -- this might be useful for CPU as well. On the other hand, what slightly bothers me is the convenience from the user perspective (yes, I know, our API is not about convenience at all 🤢). If there is no requirement on the alignment, it is relatively easy to make mkldnn_memory play nicely with a custom memory manager (e.g. the one that TF has). All a user should do is query the size of the memory_desc, request that amount of memory from the memory manager, and pass the pointer to mkldnn_memory_create. However, with alignment, a user would also have to take that into account.

Let me think on that more... This indeed might be a good thing to have.

  2. Well, the padding is somewhat obscure. Let me try to clarify it here, and I will try to clarify it in the docs as well.

The padding field in mkldnn_memory_desc_t is physical. That means the area padded_dims \ dims is filled with zeros. This is a requirement. The idea is pretty simple -- to make primitives that have some SIMD restrictions just work (as you said).

Example. 1D tensor, with dim 30. A primitive needs the size to be a multiple of 16. The corresponding memory would be described as:

    md {
        ndims = 1;
        dims = {30};
        padded_dims = {32}; // 16 * div_up(30, 16)
        padded_offsets = {0}; // padding zeros are on the right only, no zeros from the left
        // padded_offsets = {1} if we want one zero from the left and one from the right
        // but implementation-wise that would be super tough to support
        ...
    }
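The padded-size arithmetic above can be sketched in a couple of lines (the helper names here are made up for illustration, not library API):

```python
def div_up(a, b):
    """Ceiling division: the smallest integer >= a / b."""
    return (a + b - 1) // b

def padded_dim(dim, multiple):
    """Round a logical dimension up to the next multiple of `multiple`."""
    return multiple * div_up(dim, multiple)

# The example above: a 1D tensor of 30 elements,
# and a primitive that needs a multiple of 16.
print(padded_dim(30, 16))  # 32
```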

Once you have defined logical dims and padded dims, you should also define a physical layout of the memory. The physical layout mostly relates to the padded dims (and hence logical dims, since they are folded into the physical dims).
The physical layout description can be different:

  • blocked (something that can be represented as a multi-dimensional array)
  • wino (special format that contains necessary information to transform the weights to Winograd domain)
  • RNN (special format that is used for RNN weights)

The most typical layout is blocked. All plain layouts like nchw are blocked. Intel MKL-DNN v0.x supports one level of blocking, i.e. you can split the n dimension into blocks of size Bn, hence splitting n into N x Bn. Here N = div_up(n, Bn).
Intel MKL-DNN v1.x will also support arbitrary levels of blocking. E.g. n = 128 can be split into 2 * 4 * 16. The only restriction is that the innermost blocks must be dense in memory (i.e. the structure doesn't contain strides for the blocks).

Hence, the logical tensor abcd can be something like: ABCD_inner-most-blocks-of-a-b-c-and-d.
The blocks A, B, C, and D can have arbitrary strides.
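To make one level of blocking concrete, here is a toy illustration (not library code; the names are invented) of the offset arithmetic for an nChw16c-like layout:

```python
def div_up(a, b):
    return (a + b - 1) // b

def offset_nChw16c(n, c, h, w, dims, block=16):
    """Offset of logical element (n, c, h, w) in an nChw16c-like layout.

    The channel dimension is split into div_up(C, block) outer blocks;
    the inner block of `block` channels is dense (stride 1), matching the
    requirement that innermost blocks be dense in memory.
    """
    N, C, H, W = dims
    c_outer, c_inner = divmod(c, block)
    return (((n * div_up(C, block) + c_outer) * H + h) * W + w) * block + c_inner

# For dims (1, 32, 2, 2): channel 17 lives in outer block 1, inner slot 1.
print(offset_nChw16c(0, 17, 0, 0, (1, 32, 2, 2)))  # 65
```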

Coming back to your question.

So how would I describe a tensor whose logical innermost dimension can be odd, but is actually stored as a dense tensor with right/left zero padding (so fast primitives for dense tensors can "just work")?

Yes. Say, you work with logical 4D tensor abcd. And dimension d is odd (e.g. 31). But the implementation needs this dim to be a multiple of 16. If you don't need blocking, what you can do is just set padded_dims[3] = 32. And that would be enough. The library would guarantee that stride[c] = 32, and mem[32*k + 31] == 0, for every k.

Is it possible to specify a tensor whose innermost dimension has been left [or right] zero padded with one zero, so that we can execute primitives that assume dense data?

Yes. See above.

Would this be described as abcD2d in the new framework? Or is there maybe an easier way to describe this type of padding?

Well, the library doesn't expose any functions that allow specifying the padded_dims. The assumption is that it would be the primitives that define such peculiar formats. There are multiple ways to create such memory. I would most likely go with the following:

    mkldnn_memory_desc_init_by_strides(&md,
        ndims=4, dims={a, b, c, d}, data_type=f32,
        strides = {b*c*32, c*32, 32, 1});
    md.padded_dims[3] = 32;

Another way of doing the same would be to introduce the format tag abcD16d. This makes sense if you have a lot of primitives working on this format and you don't want to always do the trick shown above. But to me this is overkill.

I was not sure about the intent of the padding fields of mkldnn_memory_desc_t. Are they intended for tensor views (subtensors), with no dense storage implication? Could you document those fields? (clarify logical vs. physical padding)

No. Views (which, by the way, are replaced with regular memory descriptors now) mostly rely on strides. Padding is something more physical, since the library expects zeros in the padded area. There is no such thing as logical padding for memory. Logical padding might only exist as a property of an operation, like convolution, where it means that even though the input has the size (H, W), please treat it as (H + 2, W + 2), where the border elements are assumed to be zero.
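That convolution-style logical padding can be illustrated with a toy sketch (a hypothetical helper, not library code): the stored input stays (H, W), and reads into the implied border simply yield zero:

```python
def read_padded(src, H, W, h, w, pad=1):
    """Read element (h, w) of an input logically padded to (H + 2*pad, W + 2*pad).

    `src` stores only the H x W interior, row-major; the zero border is implied,
    not stored -- exactly the opposite of the physical padding described above.
    """
    hi, wi = h - pad, w - pad
    if 0 <= hi < H and 0 <= wi < W:
        return src[hi * W + wi]
    return 0.0  # border elements are assumed to be zero

src = [1.0, 2.0, 3.0, 4.0]  # a 2x2 interior
print(read_padded(src, 2, 2, 0, 0))  # 0.0 (top-left corner of the border)
print(read_padded(src, 2, 2, 1, 1))  # 1.0 (interior element (0, 0))
```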

Comment: I don't see a real use for x86 of this physical innermost-padding. It seems exactly the opposite of the proposed blocking descriptor, which can nicely pad the outermost dims to some blocking factor.

Padding is the property of the tensor with logical dimensions. It is not a property of some particular blocking strategy. For instance:

    md1 = { dims = {1, 31, 2, 2}, padded_dims = {1, 32, 2, 2}, format_tag = `nchw` };
    md2 = { dims = {1, 31, 2, 2}, padded_dims = {1, 32, 2, 2}, format_tag = `nChw16c` };

Here both md1 and md2 have a padded area (every 32nd channel would be zero). But they have different physical layouts. So I am not sure what you mean by innermost-padding.

@emfomenk (Author)

@tensor-tang,

Thanks for the comments! Appreciate your review!

I like the explicit way ! 👍

Well, it turned out that enforcing explicit scratchpads might be a real pain for users who don't care about memory. So we decided to roll that back a little. The final decision is to have two modes:

  • implicit scratchpads (default) -- same as in v0.x
  • explicit scratchpads -- a user has to pass a scratchpad to a primitive. The user should also set a special property on the primitive descriptor to indicate that they want to provide the scratchpad. Alas, this is a bit verbose.

The reason is that users might face performance regressions if they provide scratchpads on their own (e.g. if the provided scratchpad is NUMA-unaware). If users are fine with that, they can go with explicit scratchpads. Those who don't want to invest effort into clever memory management can simply rely on our implicit scratchpads.

I will update the RFC to highlight this change.

3.5 Short summary

I did not get the benefits. From the code lines, there seems to be no reduction.

Well, the example is indeed not shorter. But there was no intent to reduce the number of code lines. The intent was to make it simpler to understand and to program in frameworks. Hope that holds :)

As for the memory desc&block, that's a headache problem.

So my suggestion is: why not just go WYSIWYG and make it as simple as possible?

I understand this point. But having explicit layouts like nchw really makes the code vague and error-prone. The reasons are covered in the Introduction of the memory descriptor RFC.

From a FWK side, I would expect it to know whether the data is in the FWK's native format or MKL-DNN's one. Once this is known, there shouldn't be any need to know the exact format Intel MKL-DNN is using.

As a user, I may not care about the block dims you would use inside, so I hope I could just pass nchw as I know it: I would give user dims {1,16,2,3} and format nchw or abcd.

Yeah, this works perfectly fine with the new API. You have two ways to do that:

    mkldnn_memory_desc_init_by_tag(&md, dims={1, 16, 2, 3}, format_tag = nchw);
    // or
    mkldnn_memory_desc_init_by_strides(&md, dims={1, 16, 2, 3}, strides = {16*2*3, 2*3, 3, 1});

As for the developer, the format should be reordered. So just save it as physical dims {1,1,2,3,16} and format nChw16c or aBcd16.

This would not work without additional information. Consider the following:
Say you have weights, o = 32, i = 64, h = 1, w = 1. Now we might potentially have two reorders:

  • oihw (32, 64, 1, 1) --> OIhw16i16o (2, 4, 1, 1, 16, 16)
  • oihw (32, 64, 1, 1) --> OIhw16o16i (2, 4, 1, 1, 16, 16)

As you can see, without knowing the format you cannot distinguish between OIhw16i16o and OIhw16o16i.

But if you know the format, it is pretty straightforward to recover the sizes. So I don't see why we would prefer having (2, 4, 1, 1, 16, 16) instead of (32, 64, 1, 1).
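This point can be sketched as follows (an illustrative helper, not library API): given the logical dims plus a format's inner-block description, the physical dims are derivable, while the reverse mapping is ambiguous, since both OIhw16i16o and OIhw16o16i land on the same physical shape:

```python
def physical_dims(logical_dims, inner_idxs, inner_blks):
    """Physical dims for a blocked format.

    inner_idxs / inner_blks describe the innermost blocks, e.g.
    OIhw16i16o -> inner_idxs=(1, 0), inner_blks=(16, 16).
    """
    dims = list(logical_dims)
    for idx, blk in zip(inner_idxs, inner_blks):
        dims[idx] = (dims[idx] + blk - 1) // blk  # split into outer blocks
    return tuple(dims) + tuple(inner_blks)

oihw = (32, 64, 1, 1)
print(physical_dims(oihw, (1, 0), (16, 16)))  # OIhw16i16o -> (2, 4, 1, 1, 16, 16)
print(physical_dims(oihw, (0, 1), (16, 16)))  # OIhw16o16i -> (2, 4, 1, 1, 16, 16)
```

Both calls print the same tuple, which is exactly why storing only the physical dims loses information.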

@emfomenk (Author)

RFC updated. The changes:

  • Remove s16 data type, rounding mode;
  • Implicit and explicit scratchpad modes;
  • Dropping memory primitive descriptor;
  • Minor clean ups;

@kruus (Contributor) commented Mar 14, 2019

Thanks for the explanations --- you've convinced me that pretty much all useful padding/blocking cases can be dealt with. It helped clarify that the new formats cover pretty much everything and that padding is always physical. Your comment

Well, the library doesn't expose any functions that allow specifying the padded_dims. The assumption is that it would be the primitives that define such peculiar formats.

was particularly useful, since I think I had been thinking of everything as client-visible when I wrote my questions. It's good that many things described by the RFC are actually "internal detail". Thanks.

For alignment, I really need to download the RFC source tarball and look at the new headers.
But because I'm here... If alignment is an internal requirement, it's easiest if the user gives us a null pointer. Otherwise, if mkl-dnn considers wrong alignment to be an error, the external memory manager would need a way to query (and respect) the alignment requirement.

OTOH, maybe there are more clever ways to hide the alignment requirement by lying to the memory manager about the size. Then the client could not assume tensor data really starts at the pointer he gave to mkl-dnn. If the client already, in general, needs to coerce the memory format before making any assumptions about where any tensor element is stored, then maybe it's OK to provide a rounded-up size and internally use the next correctly aligned pointer, "alignment = bigger size + implicit initial padding"?
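The rounded-up-size idea amounts to plain address arithmetic (a toy model, not a proposal for the actual API): allocate `size + align - 1` bytes and bump the handle to the next aligned address:

```python
def align_up(addr, align):
    """Smallest address >= addr that is a multiple of `align` (a power of two)."""
    return (addr + align - 1) & ~(align - 1)

# A client buffer starting at 0x1004, realigned inside an over-allocated region:
base = 0x1004
aligned = align_up(base, 64)
print(hex(aligned), aligned - base)  # 0x1040, with 60 bytes of implicit initial padding
```

The waste is bounded by `align - 1` bytes per allocation, which is the price of hiding the requirement from the memory manager.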

@emfomenk (Author)

Hi @kruus,

since I think I had been thinking of everything as client-visible when I wrote my questions

Well, strictly speaking the padding stuff is user-visible (since the memory descriptor is a transparent structure). But even so, I wouldn't expect many users to look directly into the memory descriptor structure itself. The assumption is that users would typically use things like mkldnn_memory_desc_equal() to check whether a reorder is required, and mkldnn_memory_desc_get_size() to get the size required for a memory.

For alignment, ...

The biggest issue with the alignment is that sometimes the frameworks pass their memory to MKL-DNN just as is (of course wrapping it into an MKL-DNN memory object). If the alignment becomes a requirement, the frameworks would sometimes have to copy the data even if the only thing that differs is the alignment. Also, the comparison function would become a little bit trickier:

    md1 = { ..., .extra.alignment = 1; };
    md2 = md1;
    md2.extra.alignment = 64;
    mkldnn_memory_desc_is_equal(&md1, &md2) == ?

If the function returns true, you might end up with a run-time error from running a kernel on unaligned data. If the function returns false, then a reorder (simply a copy) is required even if the data was actually aligned.

So it seems that alignment as a property of the memory descriptor is too tough a requirement from the user POV. It probably just makes sense to handle it on the library side, by being ready to call a fallback if a pointer is bad (unaligned). In this case you have functional correctness, and you get performance if the data is aligned (which is typically the case most of the time).
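That fallback scheme is essentially a runtime dispatch on the pointer value, roughly like this (a sketch only; in reality the dispatch would live inside the primitive implementation, and the kernel names are invented):

```python
def pick_kernel(addr, required_align=64):
    """Choose the fast kernel when the pointer happens to be aligned,
    otherwise fall back to a slower kernel that tolerates any alignment."""
    if addr % required_align == 0:
        return "fast_aligned_kernel"
    return "generic_fallback_kernel"

print(pick_kernel(0x2000))  # fast_aligned_kernel
print(pick_kernel(0x2004))  # generic_fallback_kernel
```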

@tensor-tang (Contributor)

Thanks @emfomenk,

This would not work with-out additional information. Consider the following:
Say you have weights, o = 32, i = 64, h = 1, w = 1. Now we might potentially have two reorders:
oihw (32, 64, 1, 1) --> OIhw16i16o (2, 4, 1, 1, 16, 16)
oihw (32, 64, 1, 1) --> OIhw16o16i (2, 4, 1, 1, 16, 16)
As you can see, without knowing the format you cannot distinguish between OIhw16i16o and OIhw16o16i.

I may not have been clear about my point. I thought the saved dims should exactly match the format, because that is simple to understand.

Take reorder as an example,

src mem: (dims:{1, 2, 3, 4}, format: abcd),
dst mem: (dims:{4, 2, 3, 1}, format: dbca),
reorder(src, dst)

or

src mem: (dims:{32, 64, 1, 1}, format: abcd),
dst1 mem: (dims:{2, 4, 1, 1, 16, 16}, format: ABcd16b16a),
dst2 mem: (dims:{2, 4, 1, 1, 16, 16}, format: ABcd16a16b),
r1 = reorder(src, dst1)
r2 = reorder(src, dst2)

The reorder is triggered by the format difference, so r1 and r2 are not equal.

And as for a conv example.

convpd = conv_pd(origin_src_md(dims:{8, 32, 5, 6}, format:abcd),
                 origin_wgt_md(dims:{64, 32, 5, 6}, format:abcd),
                 ... /*other attrs*/))

expected_src_mem = create_mem_from_pd(convpd.expected_src_pd) // maybe its dims: {8, 2, 5, 6, 16}, format: aBcd16b
expected_wgt_mem = create_mem_from_pd(convpd.expected_wgt_pd) // dims: {4, 2, 5, 6, 16, 16}, format: ABcd16b16a

reorder(origin_src_mem, expected_src_mem)
reorder(origin_wgt_mem, expected_wgt_mem)

....// some other works, maybe other op preparation

// when execute
conv.execute(s, {
        {MKLDNN_ARG_SRC, expected_src_mem},
        {MKLDNN_ARG_WEIGHTS, expected_wgt_mem},
        ...});

And maybe there's no need to compute the strides of the saved dims, because the dims are already straightforward; still, it's fine to have them for the developer's convenience.

Then if we want to view or debug some expected_wgt_mem, we can just index from its returned dims.

Just some thoughts, hope it would be helpful.

@kruus (Contributor) commented Mar 15, 2019

If the function returns true you might end up having a run-time error if running a kernel on an unaligned data. If the function returns false then a reorder (simply copy) is required even if the data was actually aligned.

Sigh, yes, I suppose reorder cannot be converted into an opportunistic no-op without a lot of trickery.

OK, so I looked at the new headers in the tarball a bit more today.
For alignment you've pointed out 2 main routines whose behavior needs better specification. An optional extra.alignment also has 2 choices: requirement or suggestion. I list various options, and at the end there seem to be only 3 options {2A, 2B, 3A} that work reasonably. The only new facet that has not been mentioned is whether we can require, instead of merely encourage, clients to use mkldnn_memory_desc_get_size. My guess is that there may exist clients that don't do this, and they might break in horrible ways when impls requiring alignment creep into existence!

mkldnn_memory_desc_equal behavior mirrors alignment being a suggestion or a requirement

  1. alignment is a suggestion : mkldnn_memory_desc_equal does not care about alignment.

    • If misaligned native_handles are to be allowed and mkl-dnn cannot internally align them
      then mkl-dnn will need impls that can handle pointers with 'O/S default alignment'.
      • (mkl-dnn cannot internally align misaligned pointers [see below] because it's impossible to detect clients that fail to call mkldnn_memory_desc_get_size)
  2. alignment is a requirement : mkldnn_memory_desc_equal checks alignment equality

    • mkl-dnn has no fallback routines and client must reorder (case 2B below)
    • (good) reorder is opportunistically a no-op if native_handle is well-enough aligned.
    • (easier) reorder is a simple copy even if native_handle is well-enough aligned, and we document that clients may achieve non-optimal performance in case some chipset impls ask for additional alignment.
    • (best?) ... or avoid reorder by other means (2A below).
    • In any case, clients must never decrease a non-zero alignment requirement (might lead to 'illegal instruction' if lucky, or wrong results if unlucky).

Should alignment checks occur within primitives? Must primitives supply fallback codes for "O/S default" alignment? Answers really depend on how 1.0 specifies mkldnn_memory_create.

mkldnn_memory_create behavior

  1. mkldnn_memory_create(...MKLDNN_NATIVE_HANDLE_ALLOCATE)
    • Never a problem!
    • mkl-dnn provides aligned memory as per mkldnn_memory_desc_t
  2. mkldnn_memory_create(...void* native_handle) alignment is required in mkl-dnn (no fallback impls)
  • A) mkldnn_memory_create(..native_handle) fails with mkldnn_invalid_arguments (perhaps provide a more descriptive error), forcing clients to update their allocator to respect extra.alignment.
    • Simple. Possibly acceptable, if most clients are already going to modify memory code for 1.0 anyway.
    • because no engines require alignment yet, clients can get away with ignoring the new behavior for quite some time during their first pass of adopting RFC-1.0. The only impact is that we document the possibility of future failures.
    • Are there any clients for which it's absolutely impossible to adjust the allocator code (eventually)?
  • B) misaligned native handle accepted (extra.alignment ignored) and client not required to use mkldnn_memory_desc_get_size
    • *equal must be called to force a reorder to required alignment.
    • document that clients that do not check the extra.alignment field may require additional reorders.
    • aligned impls are currently rare, this might be OK for RFC-1.0.
  • C) client must use mkldnn_memory_desc_get_size to allocate native_handle
    • if extra.alignment, then ask the user for bytes + extra.alignment - 1 and internally adjust any native_handle to the correct alignment.
    • might waste a little memory. mkl-dnn can limit the maximum alignment; ex. mkl-dnn might support 128 or 256, but alignment=4096 seems unreasonable (not related to execution speed).
    • RFC must then require that mkldnn_memory_desc_get_size is used. Currently "users are encouraged to use this function for better code portability", which is not strong enough.
    • But no way to detect client compliance, so 2A/2B are better
  3. mkldnn_memory_create(...void* native_handle) alignment is suggested in mkl-dnn (and mkl-dnn provides fallback impls)
  • use of fallbacks can still be minimized:
  • A). mkldnn_memory_create(..native_handle) never fails if native_handle has the O/S default alignment appropriate for the data type.
    • fallback routines required, in general, but
    • if the client handles extra.alignment, we guarantee the fastest impls.
  • (hacking mkldnn_memory_desc_get_size so mkl-dnn can internally align is 3C, below; it also avoids fallback routines, but has other issues.)
  • B) client does not use mkldnn_memory_desc_get_size and ignores extra.alignment.
    • Here we are allowing client to assume, say, N*C*H*W*sizeof(T) bytes for native_handle.
    • mkl-dnn must deal with misaligned pointers and use fallback routines.
    • additional reorders are avoided.
  • C) client must use mkldnn_memory_desc_get_size to allocate native_handle
    • would avoid mkl-dnn need for fallback routines, but
    • same as 2C), we have no way to detect non-compliant clients (that I guess might reasonably already exist).
    • so this is not so good.

2C and 3C don't seem right, because there is no way to force a client to call mkldnn_memory_desc_get_size, so I worry that existing clients might, months down the line, break horribly.

This leaves:
[ ] 2A (introduce a new 'misaligned' error code that is guaranteed to force clients to modify their code to support alignment-capable impls in post-1.0 mkl-dnn)
[ ] 2B (allow clients to obliviously continue, and document that they may be creating additional reorders in post-1.0 mkl-dnn)
[ ] 3A (mkl-dnn provides fallback routines, and clients optionally respect the alignment of native_handle)

and of course
[ ] @emfomenk's much cleverer solution (TBD)
[ ] no alignment support

@emfomenk (Author)

Hi @tensor-tang,

src mem: (dims:{1, 2, 3, 4}, format: abcd),
dst mem: (dims:{4, 2, 3, 1}, format: dbca),

No, this is not how Intel MKL-DNN handles memory.
The dims array doesn't depend on the format. The dims array reflects the logical dimensions that come in some order (for data typically in order N, C, [D], [H], [W]).
The format tag (e.g. abcd, or dbca) defines how those logical dimensions are laid out in memory.

Let's consider 2D example: [dim0 x dim1].
If the data is kept in the RowMajor format, the offset of the element (i, j) is i * dim1 + j.
If the data is kept in the ColumnMajor format, the offset of the element (i, j) is (i + j * dim0).

Notice that no matter what the format is, I don't change the notation for accessing an element -- it is always (i, j). The same goes for the data itself: no matter whether it is RowMajor or ColumnMajor, I describe the data as [dim0 x dim1].

With MKL-DNN the situation is the same. The logical description doesn't depend on the physical one.

    // RowMajor
    memory_desc_init_by_tag({dim0, dim1}, format_tag = ab); // ab -- natural order, b is the innermost
    // ColumnMajor
    memory_desc_init_by_tag({dim0, dim1}, format_tag = ba); // ba -- a becomes the innermost in memory

    //
    // the same, but using strides instead of the format tags
    //

    // RowMajor
    memory_desc_init_by_strides({dim0, dim1}, strides = {dim1, 1}); // dim1 has stride 1
    // ColumnMajor
    memory_desc_init_by_strides({dim0, dim1}, strides = {1, dim0}); // dim0 has stride 1

The format tags now imply the strides only and they don't have any meaning other than that.

This approach allows having simple checks for shape matching. It also allows deriving the physical layout pretty easily by looking at the strides and (if relevant) to the blocking structure.
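The stride-only meaning of the format tags can be shown in a few lines of Python (illustrative, not library code): the logical index is the same either way, only the strides differ:

```python
def offset(index, strides):
    """Offset of a logical index, given per-dimension strides."""
    return sum(i * s for i, s in zip(index, strides))

dim0, dim1 = 3, 4
row_major = (dim1, 1)   # tag 'ab': dimension b is innermost
col_major = (1, dim0)   # tag 'ba': dimension a is innermost

# Element (i, j) = (1, 2) under both layouts of the same [dim0 x dim1] tensor:
print(offset((1, 2), row_major))  # 6  == 1 * dim1 + 2
print(offset((1, 2), col_major))  # 7  == 1 + 2 * dim0
```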

@tensor-tang (Contributor)

No, this is not how Intel MKL-DNN handles memory.

Yes, thanks for your kind explanations @emfomenk. I know it's not how MKL-DNN actually handles memory; that was just my suggestion. I think combining the format and dims should be clear, so maybe there is no need to distinguish the logical and physical descriptions.

This approach allows having simple checks for shape matching. It also allows deriving the physical layout pretty easily by looking at the strides and (if relevant) to the blocking structure.

I am not getting this point.

@nhasabni commented May 7, 2019

Hi @tensor-tang,

src mem: (dims:{1, 2, 3, 4}, format: abcd),
dst mem: (dims:{4, 2, 3, 1}, format: dbca),

No, this is not how Intel MKL-DNN handles memory.
The dims array doesn't depend on the format. The dims array reflects the logical dimensions that come in some order (for data typically in order N, C, [D], [H], [W]).
The format tag (e.g. abcd, or dbca) defines how those logical dimensions are laid out in memory.

@emfomenk Any thoughts about using labels while specifying dims?

src mem: (dims:{a:1, b:2, c:3, d:4}, format: abcd),

instead of implicitly assuming the order of dimensions?

With that logic, the memory description above is the same as:

src mem: (dims: {d:4, c:3, b:2, a:1}, format: abcd)

@emfomenk (Author) commented May 8, 2019

Seems over-complicated: it requires a map, and it is also unclear how to deal with aliases like nchw (it is not safe to assume that n always corresponds to a).
This would also make querying the sizes back more complicated and confusing: right now it is as simple as saying md.dims[dim_idx]...

@blueberry

The initial message says that the intended release date for v1.0 is somewhere mid-2019. I can see that there was a recent release, v1.0-rc, 10 days ago. Is there a more precise estimate now of how close we are to the v1.0 release?

@emfomenk (Author) commented Jul 5, 2019

Hi @blueberry,

I would expect the final release to appear somewhere next week.

@blueberry

That's great news indeed! Thanks!

@vpirogov (Member)

Closing as v1.0 is released.

@vpirogov closed this Jul 15, 2019
@vpirogov added the RFC (A design document) label Mar 10, 2024