[controller] change `LogCompactionService` scope from controller to cluster level + check creation time of ongoing push store version #1636

WhitneyDeng · 2025-03-20T00:57:53Z

Problem Statement

Fixing bugs found when end-to-end testing adhoc repush controller endpoint.

Bugs found:

Bug: LogCompactionService should read log.compaction.scheduling.enabled config from cluster-level rather than "commonConfig" (aka the first cluster entry in the multiClusterProperties map). Reading the config at for the corresponding cluster ensures scheduled compaction is isolated to clusters where this feature is enabled.
Fix: Read from getControllerConfig(clusterName).isLogCompactionSchedulingEnabled() when iterating through clusters in LogCompactionService::LogCompactionTask::compactStoresInClusters()

Solution

move LogCompactionService instance creation from VeniceController to when a parent controller transitions from standby to leader (1f51114)

functionality addition:

check creation time of largest store version rather than the current version to measure store's compaction staleness.
- Reason: The largest version may be larger than the current version if there is an ongoing push. The purpose of this function is to check if the last compaction time is older than the threshold. An ongoing push is regarded as the most recent compaction.

Code changes

Added new code behind a config. If so list the config names and their default values in the PR description.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
- TestLogCompactionService::testThrottlingRepushQuotaPerCompactionCycle 2294f5c
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.

… to per cluster on its leader parent controller

… than VeniceControllerMultiClusterConfig

…ore rather than a random cluster

…from config

…ycle for LogCompactionService::compactStoresInClusters

mynameborat · 2025-05-20T22:41:21Z

The bug fix looks good. Can we get the bug fix separated out so that we can quickly merge that change?

Here are some concerns around the current throttling handling

We enforce throttling limit at the candidate phase. This makes it rigid and doesn't allow the orchestrator to evolve the throttling logic beyond what it can work with. e.g., Every improvement is limited to number of candidates it receives.
LogCompactionService currently doesn't have a sense of the status of ongoing pushes and thus can end queuing more jobs every hour. E.g., assume the pushes take 4 hours, the throttling limit is 3, at the end of 3rd hour you have 9 pushes which is different from the expectation as throttling is expected to be closely tied to orchestrator to depend on factors like hardware capacity, pool size etc.
The above can get amplified if we reduce the interval and thus requires additional guardrails on the orchestrator layer on top of this.

I'd suggest introducing the throttling limit on the RepushOrchestrator layer due to following reasons

It can choose to apply this throttling limit by potentially checking w/ pool capacity and ongoing jobs in the pool
It can react to potential completion of the push jobs thereby being more reactive to update available rate limit tokens to process or drop the incoming request
It also abstracts the throttling to orchestrator layer thereby manual request, or scheduled request or any other future request work on one central throttling layer as opposed to doing their own.

Based on the above, it might be worth to split the PR and merge the bug fixes to expedite.

huangminchn

Thanks Whitney!

services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceController.java

...troller/src/main/java/com/linkedin/venice/controller/VeniceControllerMultiClusterConfig.java

...nice-controller/src/main/java/com/linkedin/venice/controller/VeniceControllerStateModel.java

services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java

...troller/src/main/java/com/linkedin/venice/controller/logcompaction/LogCompactionService.java

…h limit from config" This reverts commit b12bf56.

…pactionCycle for LogCompactionService::compactStoresInClusters" This reverts commit 2294f5c.

… creation time of largest version rather than current version reason: a version larger than the current version being present is indication of an ongoing push. If a push has been triggered in the last 24 hours, this store is not stale and thus not eligible for compaction.

… to check the largest version rather than the current version largest version might not be current version if it is an ongoing push, which is regarded as the most recent compaction

…nce of starting services

…tionService::compactStoresInClusters since LogCompactionService will only be created if isLogCompactionSchedulingEnabled = true in HelixVeniceClusterResources

...controller/src/main/java/com/linkedin/venice/controller/logcompaction/CompactionManager.java

huangminchn

Thanks for the hard work Whitney!

WhitneyDeng force-pushed the log_compaction/e2e-testing-fixes branch from a201c52 to 0f192e5 Compare May 16, 2025 23:31

move LogCompactionService creation from VeniceController instance…

1f51114

… to per cluster on its leader parent controller

WhitneyDeng force-pushed the log_compaction/e2e-testing-fixes branch from 0f192e5 to 1f51114 Compare May 20, 2025 01:34

Whitney Deng added 4 commits May 19, 2025 18:38

LogCompactionService read from VeniceControllerClusterConfig rather…

1780fba

… than VeniceControllerMultiClusterConfig

add config for log.compaction.max.repush.per.compaction.cycle

be9727e

check isLogCompactionEnabled config of the specific cluster of the st…

613c172

…ore rather than a random cluster

throttle number of repushes per compaction cycle by max repush limit …

b12bf56

…from config

WhitneyDeng marked this pull request as ready for review May 20, 2025 02:17

WhitneyDeng changed the title ~~[controller] adhoc repush controller endpoint E2E testing fixes~~ [controller] throttling scheduled log compaction repushes May 20, 2025

WhitneyDeng changed the title ~~[controller] throttling scheduled log compaction repushes~~ [controller] bug fix & throttling scheduled log compaction repushes May 20, 2025

Whitney Deng added 3 commits May 20, 2025 10:16

use one variable instead of two to count repushes

94ac8ce

start and stop LogCompactionService based on Helix state transition

61e7951

add TestLogCompactionService::testThrottlingRepushQuotaPerCompactionC…

2294f5c

…ycle for LogCompactionService::compactStoresInClusters

huangminchn reviewed May 20, 2025

View reviewed changes

Whitney Deng added 2 commits May 20, 2025 18:55

Revert throttling changes. Isolating this PR to bug fixes only

a3971d7

Revert "throttle number of repushes per compaction cycle by max repus…

52103ff

…h limit from config" This reverts commit b12bf56.

WhitneyDeng changed the title ~~[controller] bug fix & throttling scheduled log compaction repushes~~ [controller] link LogCompactionService instance creation to state transition to leader parent controller May 21, 2025

WhitneyDeng mentioned this pull request May 21, 2025

[controller] throttling number of repushes for log compaction #1815

Closed

14 tasks

Whitney Deng added 5 commits May 20, 2025 19:04

Revert "add TestLogCompactionService::testThrottlingRepushQuotaPerCom…

561ed34

…pactionCycle for LogCompactionService::compactStoresInClusters" This reverts commit 2294f5c.

add TestLogCompactionService

47f9a0b

nit

d4c2335

Modify CompactionManager::isLastCompactionTimeOlderThanThresholdHours…

808a66e

… to check the largest version rather than the current version largest version might not be current version if it is an ongoing push, which is regarded as the most recent compaction

WhitneyDeng changed the title ~~[controller] link LogCompactionService instance creation to state transition to leader parent controller~~ [controller] change LogCompactionService scope from controller to cluster level + check creation time of ongoing push store version May 21, 2025

Whitney Deng added 2 commits May 21, 2025 14:08

[pr] sequence of stopping services should be the reverse of the seque…

2d1a648

…nce of starting services

[pr] nix isLogCompactionSchedulingEnabled config check in LogCompac…

7fe2c42

…tionService::compactStoresInClusters since LogCompactionService will only be created if isLogCompactionSchedulingEnabled = true in HelixVeniceClusterResources

WhitneyDeng enabled auto-merge (squash) May 22, 2025 17:06

Merge branch 'linkedin:main' into log_compaction/e2e-testing-fixes

be25ac1

huangminchn requested changes May 22, 2025

View reviewed changes

...controller/src/main/java/com/linkedin/venice/controller/logcompaction/CompactionManager.java Outdated Show resolved Hide resolved

[pr] get largest non failed version

93ff6c2

WhitneyDeng requested a review from huangminchn May 22, 2025 21:58

[pr] get largest non failed version by iterating through Versions

d136993

huangminchn reviewed May 23, 2025

View reviewed changes

...controller/src/main/java/com/linkedin/venice/controller/logcompaction/CompactionManager.java Show resolved Hide resolved

[pr] catch exception when store has never had an active version & log it

737d322

WhitneyDeng requested a review from huangminchn May 23, 2025 00:51

[pr] if statement to check null instead of exception

ea62600

huangminchn approved these changes May 23, 2025

View reviewed changes

WhitneyDeng merged commit 184c6ac into linkedin:main May 23, 2025
58 checks passed

WhitneyDeng deleted the log_compaction/e2e-testing-fixes branch May 23, 2025 01:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[controller] change `LogCompactionService` scope from controller to cluster level + check creation time of ongoing push store version #1636

[controller] change `LogCompactionService` scope from controller to cluster level + check creation time of ongoing push store version #1636

Uh oh!

WhitneyDeng commented Mar 20, 2025 •

edited

Loading

Uh oh!

mynameborat commented May 20, 2025

Uh oh!

huangminchn left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

huangminchn left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[controller] change LogCompactionService scope from controller to cluster level + check creation time of ongoing push store version #1636

[controller] change LogCompactionService scope from controller to cluster level + check creation time of ongoing push store version #1636

Uh oh!

Conversation

WhitneyDeng commented Mar 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem Statement

Solution

Code changes

Concurrency-Specific Checks

How was this PR tested?

Does this PR introduce any user-facing or breaking changes?

Uh oh!

mynameborat commented May 20, 2025

Uh oh!

huangminchn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

huangminchn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[controller] change `LogCompactionService` scope from controller to cluster level + check creation time of ongoing push store version #1636

[controller] change `LogCompactionService` scope from controller to cluster level + check creation time of ongoing push store version #1636

WhitneyDeng commented Mar 20, 2025 •

edited

Loading