[native] Add support for ORC reader #23037

Open
wants to merge 1 commit into base: master

Conversation

wypb
Contributor

@wypb wypb commented Jun 20, 2024

Description

We recently merged the PR on the Velox side for reading ORC statistics and implementing OrcReader based on DwrfReader. Now it is time to add ORC reader support in Prestissimo.

NOTE: Presto writes ORC files with RLEv2 encoding, and the Velox ORC readers for some types do not yet implement fast-path readers, so Velox throws exceptions when it reads those files. For that reason, end-to-end tests for ORC are not added here. Once Velox implements fast-path readers for ORC RLEv2 encoding, we need to add ORC tests.
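
For readers unfamiliar with RLEv2: in the ORC specification, each integer run starts with a header byte whose top two bits select one of four sub-encodings (Short Repeat, Direct, Patched Base, Delta). The sketch below decodes only the simplest one, Short Repeat, for unsigned values, using the example bytes from the ORC specification. It is an illustration of the format, not Velox's RleDecoderV2; the class and method names are made up for this example.

```java
import java.util.Arrays;

public class RleV2ShortRepeatSketch {
    // Sub-encoding names, indexed by the top two bits of the first header byte.
    static final String[] ENCODINGS = {"SHORT_REPEAT", "DIRECT", "PATCHED_BASE", "DELTA"};

    // Returns the RLEv2 sub-encoding selected by a run's first header byte.
    static String encodingOf(byte header) {
        return ENCODINGS[(header & 0xFF) >>> 6];
    }

    // Decodes one unsigned Short Repeat run: a 1-byte header followed by the
    // repeated value stored big-endian in 1 to 8 bytes.
    static long[] decodeShortRepeat(byte[] run) {
        int header = run[0] & 0xFF;
        if ((header >>> 6) != 0) {
            throw new IllegalArgumentException("not a Short Repeat run: " + encodingOf(run[0]));
        }
        int width = ((header >>> 3) & 0x07) + 1; // value width in bytes (1..8)
        int count = (header & 0x07) + 3;         // run length (3..10)
        if (run.length < 1 + width) {
            throw new IllegalArgumentException("truncated run");
        }
        long value = 0;
        for (int i = 0; i < width; i++) {
            value = (value << 8) | (run[1 + i] & 0xFF);
        }
        long[] out = new long[count];
        Arrays.fill(out, value);
        return out;
    }

    public static void main(String[] args) {
        // Example from the ORC specification: five repetitions of 10000.
        byte[] run = {0x0a, 0x27, 0x10};
        System.out.println(encodingOf(run[0]) + " " + Arrays.toString(decodeShortRepeat(run)));
    }
}
```

A reader that supports RLEv2 has to handle all four sub-encodings for every column type it reads, which is where the missing fast paths mentioned above come in.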

@wypb wypb requested a review from a team as a code owner June 20, 2024 08:26
@wypb wypb force-pushed the orc_reader branch 3 times, most recently from 55a8d5b to 7325337 Compare June 21, 2024 01:52
@tdcmeehan tdcmeehan self-assigned this Jun 23, 2024
@wypb wypb changed the title [native] Add support for ORC reader and add orc native tests [native] Add support for ORC reader Jun 25, 2024
@wypb
Contributor Author

wypb commented Jun 25, 2024

Hi @majetideepak @aditi-pandit, could you please help review this PR? Thanks!

@majetideepak
Collaborator

@wypb can you add some end-to-end tests? Thanks!

@aditi-pandit
Contributor

@wypb: It would be great to use ORC with the QueryRunners (https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/PrestoNativeQueryRunnerUtils.java) in an e2e test. The test should highlight the differences between ORC and Parquet and demonstrate filter pushdown as well. Covering ORC both with Hive and as a file format with Iceberg would be perfect.
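
As a rough illustration of what such a test might issue, the sketch below builds the same CTAS and filtered query for ORC and for Parquet. The class, method, table names, and predicate are hypothetical; only the `format` table property is taken from the Hive and Iceberg connectors. Wiring the statements into a query runner is not shown, and whether the predicate is actually pushed down to the reader depends on connector and session configuration.

```java
import java.util.Locale;

public class OrcE2eQuerySketch {
    // Builds a CTAS statement that materializes sourceTable in the given file format.
    static String createTableAs(String table, String format, String sourceTable) {
        return String.format(
                "CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM %s",
                table, format, sourceTable);
    }

    // Builds a selective query whose predicate is a candidate for filter pushdown.
    static String filteredCount(String table, String predicate) {
        return String.format("SELECT count(*) FROM %s WHERE %s", table, predicate);
    }

    public static void main(String[] args) {
        // Hypothetical source table and predicate, chosen only for illustration.
        for (String format : new String[] {"ORC", "PARQUET"}) {
            String table = "lineitem_" + format.toLowerCase(Locale.ROOT);
            System.out.println(createTableAs(table, format, "lineitem"));
            System.out.println(filteredCount(table, "quantity < 10"));
        }
    }
}
```

Running the same statements against both formats and comparing the results is one way to surface reader differences between ORC and Parquet.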

@wypb
Contributor Author

wypb commented Jun 26, 2024

Hi @majetideepak @aditi-pandit, I added TPCH tests for ORC, including the Iceberg data source. I did not add the TPCDS tests for ORC because the Velox ORC reader does not yet implement the fast path for some types, which causes exceptions when reading the data:

Caused by: java.lang.RuntimeException: rawResultNulls_ && rawValues_  Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
	at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:777)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:745)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
	... 30 more
Caused by: VeloxRuntimeError: rawResultNulls_ && rawValues_  Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
	at Unknown.# 0  _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
	at Unknown.# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_(Unknown Source)
	at Unknown.# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_(Unknown Source)
	at Unknown.# 3  _ZN8facebook5velox17VeloxRuntimeErrorC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_(Unknown Source)
	at Unknown.# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_(Unknown Source)
	at Unknown.# 5  _ZN8facebook5velox4dwio6common21SelectiveColumnReader7addNullIiEEvv(Unknown Source)
	at Unknown.# 6  _ZN8facebook5velox4dwio6common15ExtractToReader7addNullIiEEvi(Unknown Source)
	at Unknown.# 7  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE7addNullEv(Unknown Source)
	at Unknown.# 8  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE19filterPassedForNullEv(Unknown Source)
	at Unknown.# 9  _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE11processNullERb(Unknown Source)
	at Unknown.# 10 _ZN8facebook5velox4dwrf12RleDecoderV2ILb0EE15readWithVisitorILb1ENS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS6_15ExtractToReaderELb1EEEEEvPKmT0_(Unknown Source)
	at Unknown.# 11 _ZN8facebook5velox4dwio6common21SelectiveColumnReader17decodeWithVisitorINS0_4dwrf12RleDecoderV2ILb0EEENS2_29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EEEEEvPNS2_10IntDecoderIXsrT_9kIsSignedEEERT0_(Unknown Source)
	at Unknown.# 12 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader15readWithVisitorINS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS5_15ExtractToReaderELb1EEEEEvN5folly5RangeIPKiEET_(Unknown Source)
	at Unknown.# 13 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader10readHelperINS0_6common10AlwaysTrueELb1ENS0_4dwio6common15ExtractToReaderEEEvPNS4_6FilterEN5folly5RangeIPKiEET1_(Unknown Source)
	at Unknown.# 14 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader13processFilterILb1ENS0_4dwio6common15ExtractToReaderEEEvPNS0_6common6FilterEN5folly5RangeIPKiEET0_(Unknown Source)
	at Unknown.# 15 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader4readEiN5folly5RangeIPKiEEPKm(Unknown Source)
	at Unknown.# 16 _ZN8facebook5velox4dwio6common12ColumnLoader12loadInternalEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 17 _ZN8facebook5velox12VectorLoader4loadEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 18 _ZN8facebook5velox12VectorLoader12loadInternalERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 19 _ZN8facebook5velox12VectorLoader4loadERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
	at Unknown.# 20 _ZNK8facebook5velox10LazyVector18loadVectorInternalEv(Unknown Source)
	at Unknown.# 21 _ZNK8facebook5velox10LazyVector18loadedVectorSharedEv(Unknown Source)
	at Unknown.# 22 _ZNK8facebook5velox10LazyVector12loadedVectorEv(Unknown Source)
	at Unknown.# 23 _ZN8facebook5velox10serializer6presto17PrestoVectorSerde22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
	at Unknown.# 24 _ZN8facebook5velox17VectorStreamGroup22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
	at Unknown.# 25 _ZN8facebook5velox4exec17PartitionedOutput16estimateRowSizesEv(Unknown Source)
	at Unknown.# 26 _ZN8facebook5velox4exec17PartitionedOutput8addInputESt10shared_ptrINS0_9RowVectorEE(Unknown Source)
	at Unknown.# 27 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE(Unknown Source)
	at Unknown.# 28 _ZN8facebook5velox4exec6Driver3runESt10shared_ptrIS2_E(Unknown Source)
	at Unknown.# 29 _ZZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS2_EENKUlvE_clEv(Unknown Source)
	at Unknown.# 30 _ZN5folly6detail8function5call_IZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS6_EEUlvE_Lb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 31 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 32 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE(Unknown Source)
	at Unknown.# 33 _ZN5folly21CPUThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE(Unknown Source)
	at Unknown.# 34 _ZSt13__invoke_implIvRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEERPS1_JRS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_(Unknown Source)
	at Unknown.# 35 _ZSt8__invokeIRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEJRPS1_RS4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSC_DpOSD_(Unknown Source)
	at Unknown.# 36 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE(Unknown Source)
	at Unknown.# 37 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EEclIJEvEET0_DpOT_(Unknown Source)
	at Unknown.# 38 _ZN5folly6detail8function5call_ISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS4_6ThreadEEEPS4_S7_EELb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
	at Unknown.# 39 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
	at Unknown.# 40 _ZZN5folly18NamedThreadFactory9newThreadEONS_8FunctionIFvvEEEENUlvE_clEv(Unknown Source)
	at Unknown.# 41 _ZSt13__invoke_implIvZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEET_St14__invoke_otherOT0_DpOT1_(Unknown Source)
	at Unknown.# 42 _ZSt8__invokeIZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS8_DpOS9_(Unknown Source)
	at Unknown.# 43 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE(Unknown Source)
	at Unknown.# 44 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEEclEv(Unknown Source)
	at Unknown.# 45 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS3_8FunctionIFvvEEEEUlvE_EEEEE6_M_runEv(Unknown Source)
	at Unknown.# 46 0x00000000000c2b23(Unknown Source)
	at Unknown.# 47 start_thread(Unknown Source)
	at Unknown.# 48 clone(Unknown Source)

@aditi-pandit
Contributor

@wypb: Your code looks fine. When I search for ORC in the presto-native-execution directory, I also see the following usage:

https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestWriter.java#L71 needs a fix as well.

Could you please check it?

@wypb wypb force-pushed the orc_reader branch 2 times, most recently from 0d3570c to 9615017 Compare June 28, 2024 09:32
@wypb
Contributor Author

wypb commented Jun 28, 2024

Good catch, thank you @aditi-pandit. I've fixed it.

@wypb
Contributor Author

wypb commented Jun 28, 2024

@aditi-pandit I looked at the code again and found that this should not be removed. testCreateTableWithUnsupportedFormats tests writing ORC through Velox, and Velox currently does not support ORC writing.

4 participants