Skip to content

Conversation

@WhitneyDeng
Copy link
Contributor

@WhitneyDeng WhitneyDeng commented Mar 20, 2025

Problem Statement

Fixing bugs found when end-to-end testing adhoc repush controller endpoint.

Bugs found:

  • Bug: LogCompactionService should read log.compaction.scheduling.enabled config from cluster-level rather than "commonConfig" (aka the first cluster entry in the multiClusterProperties map). Reading the config at for the corresponding cluster ensures scheduled compaction is isolated to clusters where this feature is enabled.
  • Fix: Read from getControllerConfig(clusterName).isLogCompactionSchedulingEnabled() when iterating through clusters in LogCompactionService::LogCompactionTask::compactStoresInClusters()

Solution

  • move LogCompactionService instance creation from VeniceController to when a parent controller transitions from standby to leader (1f51114)

functionality addition:

  • check creation time of largest store version rather than the current version to measure store's compaction staleness.
    • Reason: The largest version may be larger than the current version if there is an ongoing push. The purpose of this function is to check if the last compaction time is older than the threshold. An ongoing push is regarded as the most recent compaction.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
    • TestLogCompactionService::testThrottlingRepushQuotaPerCompactionCycle 2294f5c
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@WhitneyDeng WhitneyDeng force-pushed the log_compaction/e2e-testing-fixes branch from a201c52 to 0f192e5 Compare May 16, 2025 23:31
… to per cluster on its leader parent controller
@WhitneyDeng WhitneyDeng force-pushed the log_compaction/e2e-testing-fixes branch from 0f192e5 to 1f51114 Compare May 20, 2025 01:34
@WhitneyDeng WhitneyDeng marked this pull request as ready for review May 20, 2025 02:17
@WhitneyDeng WhitneyDeng changed the title [controller] adhoc repush controller endpoint E2E testing fixes [controller] throttling scheduled log compaction repushes May 20, 2025
@WhitneyDeng WhitneyDeng changed the title [controller] throttling scheduled log compaction repushes [controller] bug fix & throttling scheduled log compaction repushes May 20, 2025
@mynameborat
Copy link
Contributor

The bug fix looks good. Can we get the bug fix separated out so that we can quickly merge that change?

Here are some concerns around the current throttling handling

  • We enforce throttling limit at the candidate phase. This makes it rigid and doesn't allow the orchestrator to evolve the throttling logic beyond what it can work with. e.g., Every improvement is limited to number of candidates it receives.
  • LogCompactionService currently doesn't have a sense of the status of ongoing pushes and thus can end queuing more jobs every hour. E.g., assume the pushes take 4 hours, the throttling limit is 3, at the end of 3rd hour you have 9 pushes which is different from the expectation as throttling is expected to be closely tied to orchestrator to depend on factors like hardware capacity, pool size etc.
  • The above can get amplified if we reduce the interval and thus requires additional guardrails on the orchestrator layer on top of this.

I'd suggest introducing the throttling limit on the RepushOrchestrator layer due to following reasons

  • It can choose to apply this throttling limit by potentially checking w/ pool capacity and ongoing jobs in the pool
  • It can react to potential completion of the push jobs thereby being more reactive to update available rate limit tokens to process or drop the incoming request
  • It also abstracts the throttling to orchestrator layer thereby manual request, or scheduled request or any other future request work on one central throttling layer as opposed to doing their own.

Based on the above, it might be worth to split the PR and merge the bug fixes to expedite.

Copy link
Contributor

@huangminchn huangminchn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Whitney!

@WhitneyDeng WhitneyDeng changed the title [controller] bug fix & throttling scheduled log compaction repushes [controller] link LogCompactionService instance creation to state transition to leader parent controller May 21, 2025
Whitney Deng added 5 commits May 20, 2025 19:04
…pactionCycle for LogCompactionService::compactStoresInClusters"

This reverts commit 2294f5c.
… creation time of largest version rather than current version

reason: a version larger than the current version being present is indication of an ongoing push. If a push has been triggered in the last 24 hours, this store is not stale and thus not eligible for compaction.
… to check the largest version rather than the current version

largest version might not be current version if it is an ongoing push, which is regarded as the most recent compaction
@WhitneyDeng WhitneyDeng changed the title [controller] link LogCompactionService instance creation to state transition to leader parent controller [controller] change LogCompactionService scope from controller to cluster level + check creation time of ongoing push store version May 21, 2025
Whitney Deng added 2 commits May 21, 2025 14:08
…tionService::compactStoresInClusters

since LogCompactionService will only be created if isLogCompactionSchedulingEnabled = true in HelixVeniceClusterResources
@WhitneyDeng WhitneyDeng enabled auto-merge (squash) May 22, 2025 17:06
@WhitneyDeng WhitneyDeng requested a review from huangminchn May 22, 2025 21:58
@WhitneyDeng WhitneyDeng requested a review from huangminchn May 23, 2025 00:51
Copy link
Contributor

@huangminchn huangminchn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hard work Whitney!

@WhitneyDeng WhitneyDeng merged commit 184c6ac into linkedin:main May 23, 2025
58 checks passed
@WhitneyDeng WhitneyDeng deleted the log_compaction/e2e-testing-fixes branch May 23, 2025 01:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants