This repository has been archived by the owner on May 9, 2024. It is now read-only.

Sanitizers do not work with JNI #34

Closed
alexbaden opened this issue Jun 7, 2022 · 40 comments

@alexbaden
Contributor

Needs investigation.

@leshikus
Contributor

leshikus commented Jun 24, 2022

I started by investigating how ASAN worked for omniscidb.

@Garra1980
Contributor

@alexbaden do you have something in mind already?

@leshikus
Contributor

Linking ArrowBasedExecuteTest with ASAN fails with link errors, namely _start is absent in Scrt.o. I am comparing the failing link command line to one that works.

@alexbaden
Contributor Author

@ienkovich and I suspect JNI will be the biggest problem getting ASAN to work with HDK. So, our idea was to drop Calcite out of the unit tests when compiling with ASAN or TSAN on. We will intercept the payload from Calcite (optimized RA tree), write it to disk, then run the ASAN binary using that payload (bypassing SQL parse and Calcite optimization). A first PR to support this was already merged: https://github.com/intel-ai/omniscidb/pull/430

@ienkovich
Contributor

Right. So far it is supported only in ArrowBasedExecuteTest. You can run it with --build-rel-alg-cache=<file> to collect Calcite responses and then run again with --use-rel-alg-cache=<file> to avoid Calcite construction and any JNI calls. I didn't try to use it with sanitizers yet.

@leshikus
Contributor

There are some old reports that ASAN works with JNI with instrumentation; I wonder if we can get something out of them. See google/sanitizers#271

@leshikus
Contributor

leshikus commented Jul 1, 2022

With the ArrowBasedExecuteTest-related tests excluded and ASAN_OPTIONS="detect_leaks=0" set, most other tests pass. I can commit this to CI if no one objects to the exclusion.
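
For reference, the same default can be compiled into a test binary instead of being set in the environment. This is a minimal sketch using the sanitizer runtime's __asan_default_options hook; applying it to the HDK tests is an assumption and was not done in this thread. Values passed through the ASAN_OPTIONS environment variable still take precedence over this default.

```cpp
#include <cassert>
#include <cstring>

// Hook that the AddressSanitizer runtime calls at startup to obtain default
// options. In a build without ASAN this is just an ordinary, unused function.
extern "C" const char* __asan_default_options() {
  // Same setting as ASAN_OPTIONS="detect_leaks=0" in the comment above.
  return "detect_leaks=0";
}
```

Defining the hook in one translation unit of the test executable is enough; it does not need to be called from the test code.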

@leshikus
Contributor

leshikus commented Jul 1, 2022

asan.log

@alexbaden
Contributor Author

The ResultSetTest large-buffer tests should be skipped under TSAN and ASAN because the memory usage is too large; the overhead of TSAN/ASAN tagging makes the buffers too big.

What errors are you getting with detect_leaks on? Are they limited to JNI? We might try a suppression in that case.
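
One way to skip the large-buffer tests only in sanitizer builds is to detect the sanitizer at compile time. The sketch below is an illustration, not code from the repository: the macro EXAMPLE_UNDER_SANITIZER, the 1 GiB threshold and the function name are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Clang reports sanitizers through __has_feature; GCC defines
// __SANITIZE_ADDRESS__ / __SANITIZE_THREAD__ instead.
#if defined(__has_feature)
#if __has_feature(address_sanitizer) || __has_feature(thread_sanitizer)
#define EXAMPLE_UNDER_SANITIZER 1
#endif
#endif
#if !defined(EXAMPLE_UNDER_SANITIZER) && \
    (defined(__SANITIZE_ADDRESS__) || defined(__SANITIZE_THREAD__))
#define EXAMPLE_UNDER_SANITIZER 1
#endif

#ifdef EXAMPLE_UNDER_SANITIZER
constexpr bool kUnderSanitizer = true;
#else
constexpr bool kUnderSanitizer = false;
#endif

// Hypothetical threshold: larger buffers are skipped in sanitizer builds.
constexpr std::size_t kMaxSanitizedBufferBytes = std::size_t{1} << 30;  // 1 GiB

// True when the build is instrumented with ASAN or TSAN and the requested
// buffer exceeds the threshold.
constexpr bool should_skip_large_buffer_test(std::size_t buffer_bytes) {
  return kUnderSanitizer && buffer_bytes > kMaxSanitizedBufferBytes;
}
```

In a GoogleTest body the result could guard a GTEST_SKIP() call, so the test is reported as skipped rather than silently passing.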

@leshikus
Contributor

leshikus commented Jul 1, 2022

Most of the leaks are reported in libjvm.so and originate from standard JNI memory allocation routines; there is a report that the JVM can be rebuilt in a way that the leaks are shown correctly.

@alexbaden
Contributor Author

@leshikus
Contributor

leshikus commented Jul 1, 2022

thanks for the link, will try

@leshikus
Contributor

leshikus commented Jul 5, 2022

I tried the suppressions feature, but it has not worked for me yet. I also tried LSAN_OPTIONS.

Looking further.

@ienkovich
Contributor

> Most of leaks are reported in libjvm.so and originate from standard jni memory allocation routines; there is a report that jvm can be rebuilt in a way that the leaks are shown correctly.

We can avoid JNI by using cached Calcite responses. So far it is supported only in ArrowBasedExecuteTest. You can run it with --build-rel-alg-cache=<file> to collect Calcite responses and then run it again with --use-rel-alg-cache=<file> to avoid Calcite construction and any JNI calls. I tried it with ASAN some time ago and it worked fine. If it works for you, I can support this feature for all other test suites.

@leshikus
Contributor

I have tried both options: total Calcite parsing time decreased from 28263 ms to 3753 ms, so it seems to be working.

@leshikus
Contributor

ASAN aborted after the first test case of ArrowBasedExecuteTest. As expected, the error below is not related to Calcite. I believe ASAN enabling can continue once this is fixed.

=================================================================
==377441==ERROR: AddressSanitizer: new-delete-type-mismatch on 0x614000428440 in thread T0:
  object passed to delete has wrong type:
  size of the allocated type:   408 bytes;
  size of the deallocated type: 304 bytes.
    #0 0x7ff4c4b19c65 in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:177
    #1 0x562c06d4e764 in query_template(llvm::Module*, unsigned long, bool, bool, QueryMemoryDescriptor const&, ExecutorDeviceType, bool, GpuSharedMemoryContext const&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x2034764)
    #2 0x562c06cef531 in Executor::compileWorkUnit(std::vector<InputTableInfo, std::allocator<InputTableInfo> > const&, RelAlgExecutionUnit const&, CompilationOptions const&, ExecutionOptions const&, GpuMgr const*, bool, std::shared_ptr<RowSetMemoryOwner>, unsigned long, signed char, bool, DataProvider*, std::unordered_map<int, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > > > > >&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x1fd5531)
    #3 0x562c075d105d in QueryCompilationDescriptor::compile(unsigned long, signed char, bool, RelAlgExecutionUnit const&, std::vector<InputTableInfo, std::allocator<InputTableInfo> > const&, ColumnFetcher const&, CompilationOptions const&, ExecutionOptions const&, Executor*) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x28b705d)
    #4 0x562c069aa4e3 in Executor::executeWorkUnitImpl(unsigned long&, bool, bool, std::vector<InputTableInfo, std::allocator<InputTableInfo> > const&, RelAlgExecutionUnit const&, CompilationOptions const&, ExecutionOptions const&, std::shared_ptr<RowSetMemoryOwner>, bool, DataProvider*, std::unordered_map<int, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > > > > >&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x1c904e3)
    #5 0x562c069afca2 in Executor::executeWorkUnit(unsigned long&, bool, std::vector<InputTableInfo, std::allocator<InputTableInfo> > const&, RelAlgExecutionUnit const&, CompilationOptions const&, ExecutionOptions const&, bool, DataProvider*, std::unordered_map<int, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > > > > >&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x1c95ca2)
    #6 0x562c06f75ce8 in ExecutionResult RelAlgExecutor::executeWorkUnit(RelAlgExecutor::WorkUnit const&, std::vector<TargetMetaInfo, std::allocator<TargetMetaInfo> > const&, bool, CompilationOptions const&, ExecutionOptions const&, long, std::optional<unsigned long>)::{lambda(auto:1, bool, bool)#1}::operator()<unsigned long>(unsigned long, bool, bool) const (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x225bce8)
    #7 0x562c06fdb9dd in RelAlgExecutor::executeWorkUnit(RelAlgExecutor::WorkUnit const&, std::vector<TargetMetaInfo, std::allocator<TargetMetaInfo> > const&, bool, CompilationOptions const&, ExecutionOptions const&, long, std::optional<unsigned long>) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22c19dd)
    #8 0x562c06fddf5f in RelAlgExecutor::executeCompound(RelCompound const*, CompilationOptions const&, ExecutionOptions const&, long) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22c3f5f)
    #9 0x562c06fe2f5d in RelAlgExecutor::executeRelAlgStep(RaExecutionSequence const&, unsigned long, CompilationOptions const&, ExecutionOptions const&, long) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22c8f5d)
    #10 0x562c06fe5998 in RelAlgExecutor::executeRelAlgSeq(RaExecutionSequence const&, CompilationOptions const&, ExecutionOptions const&, long, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22cb998)
    #11 0x562c06fe875e in RelAlgExecutor::executeRelAlgQueryNoRetry(CompilationOptions const&, ExecutionOptions const&, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22ce75e)
    #12 0x562c06feb15d in RelAlgExecutor::executeRelAlgQuery(CompilationOptions const&, ExecutionOptions const&, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22d115d)
    #13 0x562c0660f6b0 in TestHelpers::ArrowSQLRunner::(anonymous namespace)::ArrowSQLRunnerImpl::runSqlQuery(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, CompilationOptions const&, ExecutionOptions const&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18f56b0)
    #14 0x562c06610a69 in TestHelpers::ArrowSQLRunner::(anonymous namespace)::ArrowSQLRunnerImpl::run_multiple_agg(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ExecutorDeviceType, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18f6a69)
    #15 0x562c0662154e in TestHelpers::ArrowSQLRunner::run_simple_agg(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ExecutorDeviceType, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x190754e)
    #16 0x562c05e7db6a in Distributed50_FailOver_Test::TestBody() (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x1163b6a)
    #17 0x562c06608217 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18ee217)
    #18 0x562c065cfab4 in testing::Test::Run() [clone .part.0] (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18b5ab4)
    #19 0x562c065d071c in testing::TestInfo::Run() [clone .part.0] (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18b671c)
    #20 0x562c065d0e99 in testing::TestSuite::Run() [clone .part.0] (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18b6e99)
    #21 0x562c065d46d1 in testing::internal::UnitTestImpl::RunAllTests() (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18ba6d1)
    #22 0x562c065d5342 in testing::UnitTest::Run() (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18bb342)
    #23 0x562c05c9aa87 in main (/omniscidb/build/Tests/ArrowBasedExecuteTest+0xf80a87)
    #24 0x7ff4c07df082 in __libc_start_main ../csu/libc-start.c:308
    #25 0x562c05db7b2d in _start (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x109db2d)

0x614000428440 is located 0 bytes inside of 408-byte region [0x614000428440,0x6140004285d8)
allocated by thread T0 here:
    #0 0x7ff4c4b18587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x562c06d4e993 in query_template(llvm::Module*, unsigned long, bool, bool, QueryMemoryDescriptor const&, ExecutorDeviceType, bool, GpuSharedMemoryContext const&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x2034993)
    #2 0x562c06cef531 in Executor::compileWorkUnit(std::vector<InputTableInfo, std::allocator<InputTableInfo> > const&, RelAlgExecutionUnit const&, CompilationOptions const&, ExecutionOptions const&, GpuMgr const*, bool, std::shared_ptr<RowSetMemoryOwner>, unsigned long, signed char, bool, DataProvider*, std::unordered_map<int, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > > > > >&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x1fd5531)
    #3 0x562c075d105d in QueryCompilationDescriptor::compile(unsigned long, signed char, bool, RelAlgExecutionUnit const&, std::vector<InputTableInfo, std::allocator<InputTableInfo> > const&, ColumnFetcher const&, CompilationOptions const&, ExecutionOptions const&, Executor*) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x28b705d)
    #4 0x562c069aa4e3 in Executor::executeWorkUnitImpl(unsigned long&, bool, bool, std::vector<InputTableInfo, std::allocator<InputTableInfo> > const&, RelAlgExecutionUnit const&, CompilationOptions const&, ExecutionOptions const&, std::shared_ptr<RowSetMemoryOwner>, bool, DataProvider*, std::unordered_map<int, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > > > > >&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x1c904e3)
    #5 0x562c069afca2 in Executor::executeWorkUnit(unsigned long&, bool, std::vector<InputTableInfo, std::allocator<InputTableInfo> > const&, RelAlgExecutionUnit const&, CompilationOptions const&, ExecutionOptions const&, bool, DataProvider*, std::unordered_map<int, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::unordered_map<int, std::shared_ptr<ColumnarResults const>, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<ColumnarResults const> > > > > > >&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x1c95ca2)
    #6 0x562c06f75ce8 in ExecutionResult RelAlgExecutor::executeWorkUnit(RelAlgExecutor::WorkUnit const&, std::vector<TargetMetaInfo, std::allocator<TargetMetaInfo> > const&, bool, CompilationOptions const&, ExecutionOptions const&, long, std::optional<unsigned long>)::{lambda(auto:1, bool, bool)#1}::operator()<unsigned long>(unsigned long, bool, bool) const (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x225bce8)
    #7 0x562c06fdb9dd in RelAlgExecutor::executeWorkUnit(RelAlgExecutor::WorkUnit const&, std::vector<TargetMetaInfo, std::allocator<TargetMetaInfo> > const&, bool, CompilationOptions const&, ExecutionOptions const&, long, std::optional<unsigned long>) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22c19dd)
    #8 0x562c06fddf5f in RelAlgExecutor::executeCompound(RelCompound const*, CompilationOptions const&, ExecutionOptions const&, long) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22c3f5f)
    #9 0x562c06fe2f5d in RelAlgExecutor::executeRelAlgStep(RaExecutionSequence const&, unsigned long, CompilationOptions const&, ExecutionOptions const&, long) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22c8f5d)
    #10 0x562c06fe5998 in RelAlgExecutor::executeRelAlgSeq(RaExecutionSequence const&, CompilationOptions const&, ExecutionOptions const&, long, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22cb998)
    #11 0x562c06fe875e in RelAlgExecutor::executeRelAlgQueryNoRetry(CompilationOptions const&, ExecutionOptions const&, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22ce75e)
    #12 0x562c06feb15d in RelAlgExecutor::executeRelAlgQuery(CompilationOptions const&, ExecutionOptions const&, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x22d115d)
    #13 0x562c0660f6b0 in TestHelpers::ArrowSQLRunner::(anonymous namespace)::ArrowSQLRunnerImpl::runSqlQuery(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, CompilationOptions const&, ExecutionOptions const&) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18f56b0)
    #14 0x562c06610a69 in TestHelpers::ArrowSQLRunner::(anonymous namespace)::ArrowSQLRunnerImpl::run_multiple_agg(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ExecutorDeviceType, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18f6a69)
    #15 0x562c0662154e in TestHelpers::ArrowSQLRunner::run_simple_agg(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ExecutorDeviceType, bool) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x190754e)
    #16 0x562c05e7db6a in Distributed50_FailOver_Test::TestBody() (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x1163b6a)
    #17 0x562c06608217 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18ee217)
    #18 0x562c065cfab4 in testing::Test::Run() [clone .part.0] (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18b5ab4)
    #19 0x562c065d071c in testing::TestInfo::Run() [clone .part.0] (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18b671c)
    #20 0x562c065d0e99 in testing::TestSuite::Run() [clone .part.0] (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18b6e99)
    #21 0x562c065d46d1 in testing::internal::UnitTestImpl::RunAllTests() (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18ba6d1)
    #22 0x562c065d5342 in testing::UnitTest::Run() (/omniscidb/build/Tests/ArrowBasedExecuteTest+0x18bb342)
    #23 0x562c05c9aa87 in main (/omniscidb/build/Tests/ArrowBasedExecuteTest+0xf80a87)
    #24 0x7ff4c07df082 in __libc_start_main ../csu/libc-start.c:308

SUMMARY: AddressSanitizer: new-delete-type-mismatch ../../../../src/libsanitizer/asan/asan_new_delete.cc:177 in operator delete(void*, unsigned long)
==377441==HINT: if you don't care about these errors you may set ASAN_OPTIONS=new_delete_type_mismatch=0
==377441==ABORTING
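
ASAN raises new-delete-type-mismatch when the size passed to the sized operator delete differs from the size that was allocated (408 vs 304 bytes above). The thread does not state the root cause inside query_template, so the following is only a generic illustration of one common trigger and its fix: deleting a larger derived object through a pointer to a smaller base. All type names here are hypothetical.

```cpp
#include <cassert>
#include <memory>

// Base with a virtual destructor: deleting through Base* dispatches to the
// most-derived destructor, so the sized operator delete receives
// sizeof(Derived). Without `virtual` the delete below would be undefined
// behavior and is a typical source of this ASAN report.
struct Base {
  virtual ~Base() = default;
  int id = 0;
};

// Hypothetical derived type that is larger than Base.
struct Derived : Base {
  explicit Derived(int* destroyed_counter) : destroyed(destroyed_counter) {}
  ~Derived() override { ++*destroyed; }
  char payload[104] = {};
  int* destroyed;
};

static_assert(sizeof(Derived) > sizeof(Base), "Derived must be the larger type");

// Returns how many Derived destructors ran when the object was released
// through a pointer to Base.
int destroy_through_base_pointer() {
  int destroyed = 0;
  std::unique_ptr<Base> owner = std::make_unique<Derived>(&destroyed);
  owner.reset();  // delete via Base*; safe only because ~Base() is virtual
  return destroyed;
}
```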

@leshikus
Contributor

leshikus commented Oct 4, 2022

How to reproduce:

git clone https://github.com/intel-ai/omniscidb.git
cd omniscidb
python3 scripts/conda/make-m2-proxy.py # create proxy if needed

# build without ASAN
cmake -B ./build -S . -DENABLE_TESTS=on -DENABLE_CUDA=off -DENABLE_ASAN=off
cmake --build ./build/ --parallel 4

# build the Calcite response cache
cd build/Tests
./ArrowBasedExecuteTest --build-rel-alg-cache=cache.txt

# rebuild with ASAN
cd ../..
cmake -B ./build -S . -DENABLE_TESTS=on -DENABLE_CUDA=off -DENABLE_ASAN=on
cmake --build ./build/ --parallel 16

# run the test with the cache under ASAN; ASAN reports go to stderr
cd build/Tests
./ArrowBasedExecuteTest --use-rel-alg-cache=cache.txt >asan.log 2>&1

@leshikus
Contributor

leshikus commented Oct 5, 2022

I've rerun the tests with ASAN_OPTIONS="detect_leaks=0" and ArrowBasedExecuteTest excluded. A new memory problem was found in UdfTest.

$ ./UdfTest
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
=================================================================
==2209597==ERROR: AddressSanitizer: requested allocation size 0x7fe144de3169 (0x7fe144de4170 after adjustments for alignment, red zones etc.) exceeds maximum supported size of 0x10000000000 (thread T0)
    #0 0x7fe14e3bc587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x55c9ac652694 in (anonymous namespace)::get_clang_path(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (/omniscidb/build/Tests/UdfTest+0xce1694)
    #2 0x55c9ac6533e6 in UdfCompiler::UdfCompiler(CudaMgr_Namespace::NvidiaDeviceArch, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) (/omniscidb/build/Tests/UdfTest+0xce23e6)
    #3 0x55c9ac543792 in (anonymous namespace)::SQLTestEnv::SetUp() (/omniscidb/build/Tests/UdfTest+0xbd2792)
    #4 0x55c9ac6117ea in testing::internal::UnitTestImpl::RunAllTests() (/omniscidb/build/Tests/UdfTest+0xca07ea)
    #5 0x55c9ac614132 in testing::UnitTest::Run() (/omniscidb/build/Tests/UdfTest+0xca3132)
    #6 0x55c9ac48ae7c in main (/omniscidb/build/Tests/UdfTest+0xb19e7c)
    #7 0x7fe1448cd082 in __libc_start_main ../csu/libc-start.c:308

==2209597==HINT: if you don't care about these errors you may set allocator_may_return_null=1
SUMMARY: AddressSanitizer: allocation-size-too-big ../../../../src/libsanitizer/asan/asan_new_delete.cc:104 in operator new(unsigned long)
==2209597==ABORTING

The leak detector does not work for most tests that invoke Calcite via JNI, and it creates an unreadable log.

@leshikus
Contributor

leshikus commented Oct 5, 2022

The error in UdfTest can be mitigated by adding /usr/lib/llvm-12/bin to PATH; the problem does not happen on the test's main execution branch. The change in behavior is due to the transition to LLVM 12.

@leshikus
Contributor

leshikus commented Oct 5, 2022

Here is a generic test run log
asan.log

@alexbaden
Contributor Author

Ok, thanks! I have an idea of where the first issue is coming from, but fixing it opened up some additional issues. Will look at the UDF test issue afterwards.

@leshikus
Contributor

leshikus commented Oct 5, 2022

TSAN run results: when a test fails due to a race condition, TSAN hangs; see StringDictionaryTest and ParallelSortTest.
Here is the log:
tsan.log

@alexbaden
Contributor Author

@leshikus
Contributor

leshikus commented Oct 8, 2022

@alexbaden thanks, now all test cases of ArrowBasedExecuteTest are able to run; there is still a problem with TBB:

=================================================================
==485705==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 24600 byte(s) in 3 object(s) allocated from:
    #0 0x7f50ff502787 in operator new[](unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:107
    #1 0x7f50fc9c966f  (/lib/x86_64-linux-gnu/libtbb.so.2+0x2766f)

SUMMARY: AddressSanitizer: 24600 byte(s) leaked in 3 allocation(s).

@leshikus
Contributor

leshikus commented Oct 8, 2022

@alexbaden
Contributor Author

@aregm said TBB had improved, so maybe we need to update. We could also add a suppression: https://github.com/intel-ai/omniscidb/blob/jit-engine/config/asan.suppressions
For TSAN I would prefer not to do that, but for ASAN it's probably OK.
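
As an alternative to a suppressions file, a leak suppression can be compiled into the test binary through the LeakSanitizer hook __lsan_default_suppressions. This is a sketch only: the contents of the linked asan.suppressions file are not shown in this thread, and the libtbb.so pattern is an assumption taken from the module name in the leak report above. Whether it silences the report in this project is untested.

```cpp
#include <cassert>
#include <cstring>

// Hook that the LeakSanitizer runtime queries for built-in suppressions.
// Each rule has the form "leak:<pattern>"; the pattern is substring-matched
// against function, source file and module names in the leak's stack trace.
extern "C" const char* __lsan_default_suppressions() {
  // Assumed pattern: matches frames located in libtbb.so.2.
  return "leak:libtbb.so\n";
}
```

Keeping the rule limited to one module avoids hiding leaks that originate in the project's own code.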

@leshikus
Contributor

With the exception of TBB and JNI, there seem to be no ASAN issues in most tests. The following tests have JNI issues; @ienkovich, maybe Ilya's option is needed here:

CorrelatedSubqueryTest
DataRecyclerTest
ArrayTest
TableFunctionsTest
NoCatalogSqlTest
SQLHintTest
TopKTest
ResultSetArrowConversion
ArrowStorageSqlTest
StringFunctionsTest
UdfTest
ResultSetTest
GroupByTest
JoinHashTableTest
BumpAllocatorTest

@alexbaden
Contributor Author

Let's split ASAN into two remaining tasks:

  1. TBB -- can the issues be rectified by either building TBB from source with ASAN enabled, or by adding a suppression?
  2. Let's ignore the remaining tests in favor of getting ArrowBasedExecuteTest working, and go from there. Likely the remaining tests have JNI issues.

@leshikus
Contributor

  1. I'm ok with both solutions;
  2. I will try to set up CI for ArrowBasedExecuteTest separately.

@leshikus
Contributor

I've heard there are plans to merge the hdk and omniscidb workspaces; it would be better to do item 2 after the workspaces are merged.

@alexbaden
Contributor Author

There are plans to do that, but I don't know that it will happen in the next few weeks.

@leshikus
Contributor

Here is a PR that is required to add tests to the HDK build: https://github.com/intel-ai/omniscidb/pull/581

@leshikus
Contributor

When running tests in HDK, I get multiple problems; see the log below. Some of them are related to the function get_root_abs_path.

LastTest.log

@leshikus
Contributor

leshikus commented Nov 22, 2022

The implementation of get_root_abs_path differs because BUILD_SHARED_LIBS and ENABLE_SHARED_LIBS have different default values for omniscidb and hdk. I wonder if this is intentional. Should get_root_abs_path point to omniscidb dir or hdk dir?

@leshikus
Contributor

The following patch seems to fix the issue

diff --git a/OSDependent/Unix/omnisci_path.cpp b/OSDependent/Unix/omnisci_path.cpp
index 0a113fe53..40f98729e 100644
--- a/OSDependent/Unix/omnisci_path.cpp
+++ b/OSDependent/Unix/omnisci_path.cpp
@@ -52,9 +52,10 @@ std::string get_root_abs_path() {
     /* Despite the dlinfo man page claim that l_name is absolute path,
        it is so only when the location path to the library is absolute,
        say, as specified in LD_LIBRARY_PATH. */
+
     std::filesystem::path abs_exe_dir(std::filesystem::absolute(
         std::filesystem::canonical(std::string(link_map->l_name))));
-    const auto mapd_root = abs_exe_dir.parent_path().parent_path();
+    const auto mapd_root = abs_exe_dir.parent_path().parent_path().parent_path();
     return mapd_root.string();
   }
 #endif
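
The patch changes the number of chained parent_path() calls from two to three. A small sketch of what each extra call does, using a hypothetical directory layout (the real HDK layout is not shown in this thread) and a helper name, ancestor_dir, that does not exist in the repository:

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Walks `levels` directories up from `start`, mirroring a chain of
// parent_path() calls of that length.
std::string ancestor_dir(const std::filesystem::path& start, int levels) {
  std::filesystem::path current = start;
  for (int i = 0; i < levels; ++i) {
    current = current.parent_path();
  }
  return current.generic_string();
}
```

For a hypothetical library at /work/hdk/omniscidb/lib/libexample.so, two levels up is /work/hdk/omniscidb and three levels up is /work/hdk, which is the kind of shift the patch introduces when the library sits one directory deeper.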

@leshikus
Contributor

leshikus commented Nov 22, 2022

The next problem is at

  void createCalciteServerHandler(JNIEnv* env, const std::string& udf_filename) {
    jclass handler_cls = findClass(env, "com/mapd/parser/server/CalciteServerHandler");

I need to correct the class path somehow

Update: this patch fixes the class path:

diff --git a/Calcite/CalciteJNI.cpp b/Calcite/CalciteJNI.cpp
index a98c3787c..c99f5c704 100644
--- a/Calcite/CalciteJNI.cpp
+++ b/Calcite/CalciteJNI.cpp
@@ -119,7 +119,7 @@ class JVM {
   static std::shared_ptr<JVM> createJVM(size_t max_mem_mb) {
     auto root_abs_path = omnisci::get_root_abs_path();
     std::string class_path_arg = "-Djava.class.path=" + root_abs_path +
-                                 "/bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar";
+                                 "/../bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar";
     std::string max_mem_arg = "-Xmx" + std::to_string(max_mem_mb) + "m";
     JavaVMInitArgs vm_args;
     auto options = std::make_unique<JavaVMOption[]>(2);

@leshikus
Contributor

The problem with paths is moved to #120

@leshikus
Contributor

It was good that @Garra1980 suggested trying to launch ASAN with omniscidb. Another problem manifested itself: GitHub virtual machines do not have enough virtual memory for ASAN to start, due to their virtualization settings.

==12699==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v
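
The failed mapping of 0xdfff0001000 bytes is roughly 14 TiB of address space. The ASAN message points at ulimit -v as one possible cause, while this thread attributes the failure to the VM settings, so the sketch below only helps rule out the ulimit candidate. It assumes a POSIX system; the function names are hypothetical.

```cpp
#include <cassert>
#include <sys/resource.h>

// Reads the soft limit on the process address space (the value that
// `ulimit -v` controls). Falls back to RLIM_INFINITY if the query fails,
// i.e. an unknown limit is treated as "no limit" in this sketch.
rlim_t address_space_soft_limit() {
  struct rlimit limit {};
  if (getrlimit(RLIMIT_AS, &limit) != 0) {
    return RLIM_INFINITY;
  }
  return limit.rlim_cur;
}

// True if a reservation of `needed_bytes` fits under `soft_limit`.
bool reservation_fits(rlim_t needed_bytes, rlim_t soft_limit) {
  return soft_limit == RLIM_INFINITY || needed_bytes <= soft_limit;
}
```

Calling reservation_fits(0xdfff0001000ULL, address_space_soft_limit()) on the CI machine would show whether the address-space limit alone explains the failure.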

@leshikus
Contributor

I have tried a small test and it worked; is it possible to scale the data size via an option for ArrowBasedExecuteTest?

https://github.com/leshikus/ghtest/actions/runs/3585577524/jobs/6033734506

@leshikus
Contributor

ASAN works now
