Access violation in JIT-compiled code #78

Open

ccaughie opened this issue Sep 28, 2021 · 1 comment

Comments

@ccaughie

Summary

Disclaimer: I'm not very familiar with this library, or with neural network processing in general. I'm working with a model developed by a sister company (created using PyTorch, converted to OpenVINO IR from ONNX), and we are getting a persistent crash when running on Windows with a large number of concurrent streams. With some effort I narrowed down the cause and identified a fix that worked for us, but there are probably much more elegant ways to fix this, so this is an issue report rather than a pull request.

The cause of the crash seems to be this instruction:

00000179D623120F C4 E3 6D 4A 18 00    vblendvps   ymm3,ymm2,ymmword ptr [rax],ymm0  

which is part of a JIT routine called by

template <cpu_isa_t isa, data_type_t acc_type, data_type_t dst_type>
void jit_pp_kernel_t<isa, acc_type, dst_type>::operator()(dst_data_t *dst,
        const acc_data_t *acc, const char *bias, const float *scales,
        size_t start, size_t end, size_t runtime_oc,
        const float *dst_zero_points) const

in jit_gemm_inner_product_utils.cpp. At the time of the crash, RAX points less than 32 bytes before the end of an allocated memory page, and the addresses following that page are invalid. The YMM form of this instruction reads 32 bytes (256 bits) at a time, so the read runs past the page boundary and causes an access violation.
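
To make the failure mode concrete, here is a rough C++ intrinsics sketch of the access pattern (my own illustration, not the actual JIT code): vblendvps with a memory operand always performs the full 32-byte read first and applies the mask afterwards, so a tail read near the end of a mapped page can fault even though the masked-out lanes are never used.

#include <immintrin.h>

// 'tail' points at the last few valid floats of a buffer; fewer than 8
// floats remain before the end of the mapped page. The blend mask keeps
// only the valid lanes, but the 32-byte load happens unconditionally, so
// it can touch unmapped memory and raise an access violation.
static __m256 blend_tail(const float *tail, __m256 fallback, __m256 mask) {
    __m256 loaded = _mm256_loadu_ps(tail);            // unconditional 32-byte read
    return _mm256_blendv_ps(fallback, loaded, mask);  // mask applied only after the load
}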

For example, in one of my crash dumps RAX has the value 00000179EC8A6FE4, and the memory there looks like this:

0x00000179EC8A6FE4  06 6f 8f 3d fd fd fd fd  .o.=ýýýý
0x00000179EC8A6FEC  dd dd dd dd dd dd dd dd  ÝÝÝÝÝÝÝÝ
0x00000179EC8A6FF4  dd dd dd dd dd dd dd dd  ÝÝÝÝÝÝÝÝ
0x00000179EC8A6FFC  dd dd dd dd ?? ?? ?? ??  ÝÝÝÝ....
0x00000179EC8A7004  ?? ?? ?? ?? ?? ?? ?? ??  ........
...

I seem to have fixed this by adding 32 bytes of padding to all node/edge memory allocations, using the following patch:

diff --git a/src/common/memory_desc_wrapper.hpp b/src/common/memory_desc_wrapper.hpp
index 4017db3dc..158120ad5 100644
--- a/src/common/memory_desc_wrapper.hpp
+++ b/src/common/memory_desc_wrapper.hpp
@@ -162,7 +162,7 @@ struct memory_desc_wrapper : public c_compatible {
                 max_size = utils::array_product(bd.inner_blks, bd.inner_nblks);
             }

-            return max_size * data_type_size() + additional_buffer_size();
+            return max_size * data_type_size() + additional_buffer_size() + 32;
         }
     }

This is enough to allow us to move forward, but I'm sure someone with a better understanding of the code can do better.
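
For reference, the same workaround expressed as a standalone sketch (my own illustration, not the oneDNN allocation path): over-allocating each buffer by the YMM register width means a full 32-byte read starting anywhere inside the logical buffer stays within the allocation.

#include <cstdlib>
#include <cstddef>

// Hypothetical helper, not oneDNN code: pad every allocation by the YMM
// width (32 bytes) so a vectorized tail read cannot run past the end of
// the allocated block.
constexpr std::size_t simd_pad_bytes = 32;

void *alloc_padded(std::size_t logical_size) {
    return std::malloc(logical_size + simd_pad_bytes);
}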

Version

Git hash is e0381c3. This is the version referenced by OpenVINO 2021.4.

Environment

CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
OS version: Windows 10 Enterprise LTSC (10.0.17763)
Compiler version: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30133 for x64
CMake version: 3.16.2

Steps to reproduce

I wish I could give you something better to go on here, but we were only able to reproduce this in a system that was live streaming from 64 network cameras and running them all through the model in real time. I was not able to reproduce using a test harness that read the data from disk instead of over a network.

Observed behavior

The program crashes intermittently (see summary). This can happen after a couple of minutes or a couple of hours.

Expected behavior

The program does not crash.

@ccaughie (Author)

I just stumbled on this PR: #68. @dmitry-gorokhov do you think the crash I was seeing is the same one you fixed with that commit? We were using OpenVINO 2021.4 which did not have that fix. We're going to try 2021.4.2 when we get a chance.
