Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AbstractTestNativeGeneralQueries.testSystemTables is flaky #22129

Closed
elharo opened this issue Mar 8, 2024 · 7 comments
Closed

AbstractTestNativeGeneralQueries.testSystemTables is flaky #22129

elharo opened this issue Mar 8, 2024 · 7 comments
Assignees

Comments

@elharo
Copy link
Contributor

elharo commented Mar 8, 2024

TestPrestoNativeGeneralQueriesThrift>AbstractTestNativeGeneralQueries.testSystemTables:1413->AbstractTestQueryFramework.assertQuery:154 Execution of 'actual' query failed: SELECT * FROM (SELECT DISTINCT regionkey FROM tmp_presto_30a13f6f35074dde851d6931c829c9fc) t INNER JOIN (SELECT regionkey FROM "tmp_presto_30a13f6f35074dde851d6931c829c9fc$partitions") p ON t.regionkey = p.regionkey

We've seen a lot of failures in TestPrestoNativeGeneralQueriesThrift. It's possible the whole class (or the superclass?) is flaky rather than individual test methods.

@elharo
Copy link
Contributor Author

elharo commented Mar 13, 2024

A Velox problem?

[ERROR] com.facebook.presto.nativeworker.TestPrestoNativeGeneralQueriesThrift.testSystemTables  Time elapsed: 0.877 s  <<< FAILURE!
java.lang.AssertionError: Execution of 'actual' query failed: SELECT * FROM    (SELECT DISTINCT regionkey FROM tmp_presto_8d2fe12547ca472093a6cdf415f47510) t INNER JOIN    (SELECT regionkey FROM "tmp_presto_8d2fe12547ca472093a6cdf415f47510$partitions") p ON t.regionkey = p.regionkey
	at org.testng.Assert.fail(Assert.java:98)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:178)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:106)
	at com.facebook.presto.tests.AbstractTestQueryFramework.assertQuery(AbstractTestQueryFramework.java:154)
	at com.facebook.presto.nativeworker.AbstractTestNativeGeneralQueries.testSystemTables(AbstractTestNativeGeneralQueries.java:1448)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: buffer != nullptr getData received after its buffer is deleted. Destination: 3, sequence: 0
	at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:733)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:701)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
	... 18 more
Caused by: VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: 3, sequence: 0
	at Unknown.# 0  _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
	at Unknown.# 1 

@arhimondr
Copy link
Member

arhimondr commented Mar 21, 2024

It looks like a real problem in Velox / Prestissimo

I managed to capture a following sequence of events:

Full stack trace:

VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: 15, sequence: 0
	at Unknown.# 0  facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)(Unknown Source)
	at Unknown.# 1  void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxRuntimeError, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(facebook::velox::detail::VeloxCheckFailArgs const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)(Unknown Source)
	at Unknown.# 2  facebook::velox::exec::OutputBuffer::getData(int, unsigned long, long, std::function<void (std::vector<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::allocator<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > > >, long, std::vector<long, std::allocator<long> >)>, std::function<bool ()>)(Unknown Source)
	at Unknown.# 3  facebook::velox::exec::OutputBufferManager::getData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, unsigned long, long, std::function<void (std::vector<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::allocator<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > > >, long, std::vector<long, std::allocator<long> >)>, std::function<bool ()>)(Unknown Source)
	at Unknown.# 4  facebook::presto::(anonymous namespace)::getData(std::shared_ptr<facebook::presto::PromiseHolder<std::unique_ptr<facebook::presto::Result, std::default_delete<facebook::presto::Result> > > >, std::weak_ptr<facebook::presto::http::CallbackRequestHandlerState>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, long, facebook::presto::protocol::DataSize, facebook::velox::exec::OutputBufferManager&)(Unknown Source)
	at Unknown.# 5  facebook::presto::TaskManager::getDataForResultRequests(std::unordered_map<long, std::shared_ptr<facebook::presto::ResultRequest>, std::hash<long>, std::equal_to<long>, std::allocator<std::pair<long const, std::shared_ptr<facebook::presto::ResultRequest> > > > const&)(Unknown Source)
	at Unknown.# 6  facebook::presto::TaskManager::createOrUpdateTask(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::velox::core::PlanFragment const&, std::vector<facebook::presto::protocol::TaskSource, std::allocator<facebook::presto::protocol::TaskSource> > const&, facebook::presto::protocol::OutputBuffers const&, std::shared_ptr<facebook::velox::core::QueryCtx>, long)(Unknown Source)
	at Unknown.# 7  facebook::presto::TaskManager::createOrUpdateTask(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::presto::protocol::TaskUpdateRequest const&, facebook::velox::core::PlanFragment const&, std::shared_ptr<facebook::velox::core::QueryCtx>, long)(Unknown Source)
	at Unknown.# 8  std::_Function_handler<std::unique_ptr<facebook::presto::protocol::TaskInfo, std::default_delete<facebook::presto::protocol::TaskInfo> > (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long), facebook::presto::TaskResource::createOrUpdateTask(proxygen::HTTPMessage*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)::$_18>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long&&)(Unknown Source)
	at Unknown.# 9  _ZN5folly6detail8function14FunctionTraitsIFvRNS_7futures6detail8CoreBaseEONS_8Executor9KeepAliveIS7_EEPNS_17exception_wrapperEEE7callBigIZNS4_4CoreINS_4UnitEE11setCallbackIZNS4_10FutureBaseISH_E18thenImplementationIZNOS_6FutureISH_E9thenValueIZNS_3viaIZZN8facebook6presto12TaskResource22createOrUpdateTaskImplEPN8proxygen11HTTPMessageERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS13_EERKSt8functionIFSt10unique_ptrINSS_8protocol8TaskInfoESt14default_deleteIS1B_EERKS13_S1G_lEEENK3$_2clESW_RKSX_IS19_INS_5IOBufES1C_IS1M_EESaIS1O_EEPNSU_15ResponseHandlerESt10shared_ptrINSS_4http27CallbackRequestHandlerStateEEEUlvE_EENSN_INS_20isFutureOrSemiFutureIDTclclsr3stdE7declvalIT_EEEEE5InnerEEES9_OS21_EUlS26_E_EENSN_INS4_19valueCallableResultISH_S21_E10value_typeEEES26_EUlSA_ONS_3TryISH_EEE_NS4_25tryExecutorCallableResultISH_S2F_vEEEENSt9enable_ifIXntsrNT0_13ReturnsFutureE5valueENS2J_6ReturnEE4typeES26_S2J_NS4_18InlineContinuationEEUlSA_S2E_E_EEvS26_OS1V_INS_14RequestContextEES2N_EUlS6_SA_SC_E_EEvS6_SA_SC_RNS1_4DataE(Unknown Source)
	at Unknown.# 10 void folly::detail::function::FunctionTraits<void ()>::callSmall<folly::Executor::KeepAlive<folly::Executor>::add<folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<folly::Executor>&&, folly::futures::detail::State)::$_0>(folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<folly::Executor>&&, folly::futures::detail::State)::$_0&&) &&::{lambda()#1}>(folly::detail::function::Data&)(Unknown Source)
	at Unknown.# 11 folly::ThreadPoolExecutor::runTask(std::shared_ptr<folly::ThreadPoolExecutor::Thread> const&, folly::ThreadPoolExecutor::Task&&)(Unknown Source)
	at Unknown.# 12 folly::CPUThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)(Unknown Source)
	at Unknown.# 13 void folly::detail::function::FunctionTraits<void ()>::callSmall<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&)(Unknown Source)
	at Unknown.# 14 execute_native_thread_routine(Unknown Source)
	at Unknown.# 15 start_thread(Unknown Source)
	at Unknown.# 16 __clone3(Unknown Source)

@arhimondr
Copy link
Member

In Presto the buffer is not deleted on abort and getData requests for an aborted buffer are allowed. Same can be done in Velox (facebookincubator/velox#9207). It looks like some of the other operations are already allowed to come out of order (for example an acknowledge after a delete https://github.com/facebookincubator/velox/blob/main/velox/exec/OutputBuffer.cpp#L636).

An alternative would be to try to abort pending requests, but it might be more difficult due to inherent "raciness" of these operations. Also it is unclear if there are other sequences of requests possible that can lead to a query failure.

@mbasmanova @xiaoxmeng Thoughts?

@mbasmanova
Copy link
Contributor

In Presto the buffer is not deleted on abort and getData requests for an aborted buffer are allowed.

I see. It would be nice to update documentation in https://prestodb.io/docs/current/develop/worker-protocol.html to clarify that.

Let me check facebookincubator/velox#9207

@mbasmanova
Copy link
Contributor

I'm seeing this issue in production from time to time as well: 20240324_170232_00988_y57ss

@steveburnett
Copy link
Contributor

In Presto the buffer is not deleted on abort and getData requests for an aborted buffer are allowed.

I see. It would be nice to update documentation in https://prestodb.io/docs/current/develop/worker-protocol.html to clarify that.

Noted, thanks!

@amitkdutta
Copy link
Contributor

This is a known issue that happens ~10 times a day with failure message VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: <N>, sequence: 0 where N is a integer. Thanks @arhimondr for debugging this complex issue.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this issue Apr 4, 2024
Summary:
Sometimes requests can arrive out of order. When a `getData` arrives after a buffer is closed it may cause an unwanted failure.

Fixes prestodb/presto#22129

Pull Request resolved: facebookincubator#9207

Reviewed By: mbasmanova

Differential Revision: D55591528

Pulled By: arhimondr

fbshipit-source-id: db493477e236b6c25ea39887a3bab5273558455b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants