
Refactor TableStatsCollectorUtil #417

Merged
abhisheknath2011 merged 5 commits into linkedin:main from srawat98-dev:srawat/refactoringTableStatsCollectorUtil on Jan 6, 2026

Conversation


@srawat98-dev srawat98-dev commented Dec 23, 2025

Refactors TableStatsCollectorUtil by extracting reusable helper methods from the populateCommitEventTablePartitions implementation. This improves code organization, testability, and enables future code reuse without changing any functionality.

Summary

This is a pure refactoring PR that extracts well-designed, reusable helper methods from inline code in
populateCommitEventTablePartitions. The goal is to:

  • Improve code organization and readability
  • Create reusable building blocks for future features
  • Reduce code duplication
  • No functional changes - behavior remains identical.

Changes

  - [ ] Client-facing API Changes
  - [ ] Internal API Changes
  - [ ] Bug Fixes
  - [ ] New Features
  - [ ] Performance Improvements
  - [ ] Code Style
  - [x] Refactoring
  - [ ] Documentation
  - [ ] Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@srawat98-dev srawat98-dev marked this pull request as ready for review December 23, 2025 06:35
@cbb330 (Collaborator) commented Jan 3, 2026

The refactor introduces a subtle issue with the caching logic.

In the original code, partitionsPerCommitDF stays cached until after collectAsList() executes the join. In the new code, buildEnrichedPartitionDataFrame calls unpersist() on partitionsPerCommitDF before returning enrichedDF - but since Spark is lazy, the join hasn't executed yet at that point. When enrichedDF.count() later triggers the join, the cached data may already be gone and Spark has to recompute it.

Also, there are now two count() calls where there used to be one: the helper counts partitionsPerCommitDF for the early-return check, then the caller counts enrichedDF again for logging/cache materialization. That's an extra Spark action that wasn't there before, which goes against the "no functional changes" goal.
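The unpersist-too-early pitfall doesn't need Spark to demonstrate. Below is a minimal pure-Java analogy (all names are illustrative, not from the PR): a memoizing holder plays the role of .cache()/.unpersist(), and get() stands in for a terminal action such as count() or collectAsList(). Releasing the cache before the first action means it never materializes, so every action recomputes:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Pure-Java stand-in for a lazily evaluated, cacheable DataFrame (not Spark).
class LazyValue<T> {
    private final Supplier<T> compute;
    private T cached;           // populated only when an action runs while persisted
    private boolean persist;

    LazyValue(Supplier<T> compute) { this.compute = compute; }

    void cache() { persist = true; }
    void unpersist() { persist = false; cached = null; }

    // The "action": this is the only point where work actually happens.
    T get() {
        if (cached != null) return cached;
        T value = compute.get();
        if (persist) cached = value;
        return value;
    }
}

public class LazyCacheDemo {
    // Runs two actions against a cached value; returns how many times the
    // underlying computation executed.
    static int runScenario(boolean unpersistBeforeAction) {
        AtomicInteger computations = new AtomicInteger();
        LazyValue<Integer> df = new LazyValue<>(() -> {
            computations.incrementAndGet();   // simulates the expensive join
            return 42;
        });
        df.cache();
        if (unpersistBeforeAction) {
            df.unpersist();   // released before any action ran: cache never fills
        }
        df.get();             // first action
        df.get();             // second action
        if (!unpersistBeforeAction) {
            df.unpersist();   // correct lifecycle: release only after the actions
        }
        return computations.get();
    }

    public static void main(String[] args) {
        System.out.println("unpersist too early: " + runScenario(true) + " computations");
        System.out.println("unpersist after actions: " + runScenario(false) + " computations");
    }
}
```

Unpersisting too early yields 2 computations; unpersisting after the actions yields 1, which is exactly the recompute cost described above.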

For a cleaner refactor, I'd suggest having the helper just build and return the DataFrame without managing caching or counting. Let the caller handle the lifecycle:

static Dataset<Row> buildEnrichedPartitionDataFrame(Table table, SparkSession spark) {
    if (table.spec().isUnpartitioned()) {
        return null;
    }
    // partitionsPerCommitDF and snapshotsDF are derived from table/spark (construction elided)
    // just build and return - no caching, no counting
    return partitionsPerCommitDF.join(snapshotsDF, "snapshot_id").select(...);
}

Then the caller can cache, collect, and unpersist as needed. This keeps the helper simple and reusable without baking in behavior that future callers might not want.
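The caller-owned lifecycle can also be sketched without Spark (a sketch with invented names, not the PR's actual code): the helper only builds a lazy pipeline, and the caller decides when to materialize and reuse it. Here Java's single-use Stream stands in for a lazy DataFrame, and collecting once to a List plays the role of cache() followed by a single action:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CallerLifecycleDemo {
    static final AtomicInteger evaluations = new AtomicInteger();

    // Stand-in for the helper: builds a lazy pipeline, no caching, no counting.
    static Stream<Integer> buildPipeline() {
        return Stream.of(1, 2, 3).map(x -> {
            evaluations.incrementAndGet();   // simulates the expensive per-row join work
            return x * 10;
        });
    }

    // Two separate actions, as in the refactor under review: the work runs twice.
    static int twoActions() {
        evaluations.set(0);
        int sum = buildPipeline().reduce(0, Integer::sum);                  // action #1
        List<Integer> rows = buildPipeline().collect(Collectors.toList()); // action #2 recomputes
        return evaluations.get();
    }

    // Caller-managed lifecycle: materialize once ("cache"), then reuse freely.
    static int oneAction() {
        evaluations.set(0);
        List<Integer> cached = buildPipeline().collect(Collectors.toList()); // single action
        int count = cached.size();                          // served from the materialized result
        int sum = cached.stream().reduce(0, Integer::sum);  // also no recompute
        return evaluations.get();
    }

    public static void main(String[] args) {
        System.out.println("two actions: " + twoActions() + " row evaluations");
        System.out.println("one action:  " + oneAction() + " row evaluations");
    }
}
```

With three rows, two actions cost 6 row evaluations while the materialize-once pattern costs 3, mirroring why the extra count() in the helper is an avoidable second Spark action.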

@abhisheknath2011 (Member) commented:

> The refactor introduces a subtle issue with the caching logic. […]

Looks like Christian's comment is already addressed. Thanks.

@abhisheknath2011 (Member) left a comment:

Thanks for extracting the refactored code in this PR.

@abhisheknath2011 abhisheknath2011 merged commit 8207ff4 into linkedin:main Jan 6, 2026
1 check passed

3 participants