Refactor TableStatsCollectorUtil #417
… and clarity in commit event publishing
…rom milliseconds to seconds
…ment during data collection
The refactor introduces a subtle issue with the caching logic. In the original code, … Also, there are now two … For a cleaner refactor, I'd suggest having the helper just build and return the DataFrame without managing caching or counting. Let the caller handle the lifecycle:

    static Dataset<Row> buildEnrichedPartitionDataFrame(Table table, SparkSession spark) {
      if (table.spec().isUnpartitioned()) {
        return null;
      }
      // just build and return - no caching, no counting
      return partitionsPerCommitDF.join(snapshotsDF, "snapshot_id").select(...);
    }

Then the caller can cache, collect, and unpersist as needed. This keeps the helper simple and reusable without baking in behavior that future callers might not want.
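To make the caller-side lifecycle concrete, here is a minimal sketch of the pattern suggested above. It is not the PR's code. So that it runs without a Spark dependency, it uses a hypothetical `Frame` interface and an in-memory `FakeFrame` in place of `Dataset<Row>`; their three methods are named after Spark's `Dataset.cache()`, `collectAsList()`, and `unpersist()`. The class name `CallerManagedCaching` and the method `collectPartitionRows` are also assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CallerManagedCaching {

    /** Hypothetical stand-in for the Dataset<Row> methods the caller needs. */
    interface Frame {
        Frame cache();
        List<String> collectAsList();
        Frame unpersist();
    }

    /** In-memory fake that records lifecycle calls, so the sketch runs without Spark. */
    static final class FakeFrame implements Frame {
        final List<String> calls = new ArrayList<>();
        private final List<String> rows;

        FakeFrame(List<String> rows) {
            this.rows = rows;
        }

        @Override
        public Frame cache() {
            calls.add("cache");
            return this;
        }

        @Override
        public List<String> collectAsList() {
            calls.add("collect");
            return new ArrayList<>(rows);
        }

        @Override
        public Frame unpersist() {
            calls.add("unpersist");
            return this;
        }
    }

    /**
     * The caller owns the lifecycle: cache, collect, and always unpersist.
     * A null frame models the helper's result for an unpartitioned table.
     */
    static List<String> collectPartitionRows(Frame enriched) {
        if (enriched == null) {
            return List.of();
        }
        enriched.cache();
        try {
            return enriched.collectAsList();
        } finally {
            // Runs even if collection throws, so the cache is never leaked.
            enriched.unpersist();
        }
    }

    public static void main(String[] args) {
        FakeFrame frame = new FakeFrame(List.of("p=1", "p=2"));
        System.out.println(collectPartitionRows(frame));
        System.out.println(frame.calls);
    }
}
```

Putting `unpersist()` in a `finally` block is one way to make sure the cached data is released on both the success and failure paths. A caller that needs the frame only once could skip `cache()` entirely.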
…ement and clarify commit event handling
Looks like Christian's comment is already addressed. Thanks.
abhisheknath2011
left a comment
Thanks for extracting the refactored code in this PR.
Refactors TableStatsCollectorUtil by extracting reusable helper methods from the populateCommitEventTablePartitions implementation. This improves code organization, testability, and enables future code reuse without changing any functionality.
Summary
This is a pure refactoring PR that extracts well-designed, reusable helper methods from inline code in populateCommitEventTablePartitions. The goal is to:
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.