Add ORC support for iceberg connector #16391
Cherry-pick of trinodb/trino@ecce4a2 Co-authored-by: Xingyuan Lin <linxingyuan1102@gmail.com>
@ChunxuTang @zhenxiao @beinan Could you help take a look? Thanks.
looks good, @junyi1313 one minor thing
@beinan @ChunxuTang could you please take a look?
@@ -184,5 +184,9 @@
            <directory>${project.build.directory}/dependency/presto-bigquery-${project.version}</directory>
            <outputDirectory>plugin/presto-bigquery</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>${project.build.directory}/dependency/presto-iceberg-${project.version}</directory>
            <outputDirectory>plugin/iceberg</outputDirectory>
presto-iceberg or iceberg? either is fine, just a note
I think it's fine. Just like mysql connector: assembly/presto.xml#presto-mysql
@junyi1313 @zhenxiao
@junyi1313 Thanks for your work! This is a very nice feature we want for the iceberg connector!!
From my initial review, the implementation on the iceberg connector part generally looks good to me.
Just one reminder: I noticed some new features/improvements that are not in the PRs you cherry-picked. I think these may deserve a bit more documentation or clarification, as they are unique contributions.
@zhenxiao @beinan There are some changes in the presto-orc package, including the upgrade of ORC. I'm not familiar with that package. Could you take a closer look at those presto-orc changes? Or does anyone else know more?
    @Singleton
    @Provides
    public OrcFileTailSource createOrcFileTailSource(OrcCacheConfig orcCacheConfig, MBeanExporter exporter)
    {
        OrcFileTailSource orcFileTailSource = new StorageOrcFileTailSource();
        if (orcCacheConfig.isFileTailCacheEnabled()) {
            Cache<OrcDataSourceId, OrcFileTail> cache = CacheBuilder.newBuilder()
                    .maximumWeight(orcCacheConfig.getFileTailCacheSize().toBytes())
                    .weigher((id, tail) -> ((OrcFileTail) tail).getFooterSize() + ((OrcFileTail) tail).getMetadataSize())
                    .expireAfterAccess(orcCacheConfig.getFileTailCacheTtlSinceLastAccess().toMillis(), MILLISECONDS)
                    .recordStats()
                    .build();
            CacheStatsMBean cacheStatsMBean = new CacheStatsMBean(cache);
            orcFileTailSource = new CachingOrcFileTailSource(orcFileTailSource, cache);
            exporter.export(generatedNameOf(CacheStatsMBean.class, connectorId + "_OrcFileTail"), cacheStatsMBean);
        }
        return orcFileTailSource;
    }

    @Singleton
    @Provides
    public StripeMetadataSource createStripeMetadataSource(OrcCacheConfig orcCacheConfig, MBeanExporter exporter)
    {
        StripeMetadataSource stripeMetadataSource = new StorageStripeMetadataSource();
        if (orcCacheConfig.isStripeMetadataCacheEnabled()) {
            Cache<StripeReader.StripeId, Slice> footerCache = CacheBuilder.newBuilder()
                    .maximumWeight(orcCacheConfig.getStripeFooterCacheSize().toBytes())
                    .weigher((id, footer) -> toIntExact(((Slice) footer).getRetainedSize()))
                    .expireAfterAccess(orcCacheConfig.getStripeFooterCacheTtlSinceLastAccess().toMillis(), MILLISECONDS)
                    .recordStats()
                    .build();
            Cache<StripeReader.StripeStreamId, Slice> streamCache = CacheBuilder.newBuilder()
                    .maximumWeight(orcCacheConfig.getStripeStreamCacheSize().toBytes())
                    .weigher((id, stream) -> toIntExact(((Slice) stream).getRetainedSize()))
                    .expireAfterAccess(orcCacheConfig.getStripeStreamCacheTtlSinceLastAccess().toMillis(), MILLISECONDS)
                    .recordStats()
                    .build();
            CacheStatsMBean footerCacheStatsMBean = new CacheStatsMBean(footerCache);
            CacheStatsMBean streamCacheStatsMBean = new CacheStatsMBean(streamCache);
            stripeMetadataSource = new CachingStripeMetadataSource(stripeMetadataSource, footerCache, streamCache);
            exporter.export(generatedNameOf(CacheStatsMBean.class, connectorId + "_StripeFooter"), footerCacheStatsMBean);
            exporter.export(generatedNameOf(CacheStatsMBean.class, connectorId + "_StripeStream"), streamCacheStatsMBean);
        }
        return stripeMetadataSource;
    }
I checked the PRs you cherry-picked, but it seems that this snippet of code is not in those PRs. Is this a new feature?
Some context on these two functions: they cache the metadata of ORC files. More details are in #13501.
But it looks like they duplicate the functions in presto-hive/src/main/java/com/facebook/presto/hive/HiveClientModule.java.
Have we included HiveClientModule in the iceberg connector? If not, can we reuse the functions in HiveClientModule or extract them to a separate module?
@beinan @ChunxuTang These two functions are copied from HiveClientModule because we haven't included HiveClientModule in the iceberg connector. Besides, the presto-raptor StorageModule also has these two functions. I think extracting them to a separate module is better. Should we do it in this PR or open a new PR?
@junyi1313 I'm ok with either. @ChunxuTang what do you think?
@junyi1313 @beinan
Thanks for your clarification! Yeah, I'm also ok with either way. @junyi1313 your call.
Got it. I will find time to send a new PR about this work after this PR has been merged.
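For readers following the thread above, here is a rough sketch of what a shared helper for the duplicated provider methods might look like. It is not the code from this PR or from any follow-up: `MetadataCacheSketch`, `MetadataSource`, `CachingMetadataSource`, and `createSource` are hypothetical names, and a JDK `LinkedHashMap` with an entry-count limit stands in for the Guava `Cache` (so there is no weigher, no TTL, and no JMX export). It only illustrates the "wrap the storage source in a caching decorator when the cache is enabled" shape that the Hive, Raptor, and Iceberg modules share.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MetadataCacheSketch
{
    // Hypothetical stand-in for OrcFileTailSource / StripeMetadataSource.
    interface MetadataSource<K, V>
    {
        V load(K key);
    }

    // Decorator that caches results of the wrapped source and evicts the
    // least recently accessed entry once maxEntries is exceeded.
    static final class CachingMetadataSource<K, V>
            implements MetadataSource<K, V>
    {
        private final MetadataSource<K, V> delegate;
        private final Map<K, V> cache;
        private long missCount;

        CachingMetadataSource(MetadataSource<K, V> delegate, final int maxEntries)
        {
            this.delegate = delegate;
            // accessOrder = true keeps the least recently accessed entry first
            this.cache = new LinkedHashMap<K, V>(16, 0.75f, true)
            {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest)
                {
                    return size() > maxEntries;
                }
            };
        }

        @Override
        public synchronized V load(K key)
        {
            V value = cache.get(key);
            if (value == null) {
                missCount++;
                value = delegate.load(key);
                cache.put(key, value);
            }
            return value;
        }

        synchronized long getMissCount()
        {
            return missCount;
        }
    }

    // Single shared factory: each connector module would call this instead
    // of carrying its own copy of the wrapping logic.
    static <K, V> MetadataSource<K, V> createSource(
            MetadataSource<K, V> storageSource,
            boolean cacheEnabled,
            int maxEntries)
    {
        if (!cacheEnabled) {
            return storageSource;
        }
        return new CachingMetadataSource<>(storageSource, maxEntries);
    }
}
```

In this sketch a connector module would keep only a thin `@Provides` method that reads its cache config and delegates to the shared factory; where the real helper should live is left open, as in the discussion.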
@@ -0,0 +1,164 @@
/*
Looks like this is a new file that is not in the PRs cherry-picked. Any specific reasons to create this class?
nit: Some setter functions (e.g. setOrcType, setAttributes, etc.) are unused.
IcebergOrcColumn.java is similar to OrcColumn. I have updated the cherry-pick info (added trinodb/trino#1629 and trinodb/trino#3483) and removed the unused functions. Please help review again. Thanks.
Gotcha. Thanks for your work!
min = min.setScale(((Types.DecimalType) icebergType).scale());
max = max.setScale(((Types.DecimalType) icebergType).scale());
May we directly import DecimalType?
Fixed
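As an illustration of the suggestion, the rescaling could read roughly like the sketch below once the nested type is imported directly (in the real code that would presumably be `import org.apache.iceberg.types.Types.DecimalType;`). The `DecimalType` class here is a hypothetical local stand-in so the example compiles without Iceberg on the classpath, and `rescale` is an illustrative helper name, not a method from the PR. One detail worth knowing: `BigDecimal.setScale(int)` never rounds and throws `ArithmeticException` if the new scale would drop non-zero digits.

```java
import java.math.BigDecimal;

final class DecimalScaleSketch
{
    // Hypothetical stand-in for Iceberg's Types.DecimalType; only the
    // accessors used below are modeled.
    static final class DecimalType
    {
        private final int precision;
        private final int scale;

        DecimalType(int precision, int scale)
        {
            this.precision = precision;
            this.scale = scale;
        }

        int precision()
        {
            return precision;
        }

        int scale()
        {
            return scale;
        }
    }

    // With DecimalType imported directly, the cast no longer needs the
    // Types. qualifier and can be done once for both min and max.
    static BigDecimal rescale(BigDecimal value, Object icebergType)
    {
        DecimalType decimalType = (DecimalType) icebergType;
        // setScale(int) is exact: it throws ArithmeticException rather
        // than rounding when digits would be lost.
        return value.setScale(decimalType.scale());
    }
}
```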
Looks good to me! Great contribution, many thanks!
@junyi1313
Thanks for your nice work! It looks like there are some errors in the CI tests. Could you fix them and update the PR so the tests pass?
Cherry-pick of trinodb/trino#1067, trinodb/trino#2042, trinodb/trino#4055, trinodb/trino#1629, trinodb/trino#3483 Co-authored-by: Parth Brahmbhatt <pbrahmbhatt@netflix.com> Co-authored-by: David Phillips <david@acz.org> Co-authored-by: Xingyuan Lin <linxingyuan1102@gmail.com> Co-authored-by: Dain Sundstrom <dain@iq80.com>
@ChunxuTang I have updated the PR and fixed the CI errors.
Cherry-pick of trinodb/trino#1067, trinodb/trino#2042, trinodb/trino#4055, trinodb/trino#1629, trinodb/trino#3483, trinodb/trino@ecce4a2
Co-authored-by: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Co-authored-by: David Phillips <david@acz.org>
Co-authored-by: Xingyuan Lin <linxingyuan1102@gmail.com>
Co-authored-by: Dain Sundstrom <dain@iq80.com>
This PR implements the issue: #16305
Test plan - Unit Tests