feat(maintenance): Distributed Maintenance Coordination via Raft (v2.0.0) #4630

Merged
makr-code merged 3 commits into develop from copilot/distributed-maintenance-coordination-raft
Apr 13, 2026

Conversation

Contributor

Copilot AI commented Apr 13, 2026

In a multi-node cluster, every node independently fires the same maintenance schedules, causing compaction storms and duplicate work. This PR introduces a pluggable distributed lock that the orchestrator acquires before each scheduled job, ensuring only one node executes per schedule per cron tick.

Description

IDistributedLock interface + InProcessDistributedLock (include/maintenance/i_distributed_lock.h)

  • New pluggable interface: tryAcquire(key, ttl_ms), release(key), getHolderNodeId(key), nodeId()
  • InProcessDistributedLock: thread-safe in-process implementation for single-node and tests; TTL expiry checked per acquire
  • Production wiring: inject a Raft-backed implementation via setDistributedLock()
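A minimal sketch of what the interface and the in-process implementation might look like. The four method names come from the description above; the parameter and return types, the class name InProcessLockSketch, and the injectable clock (added here only so TTL expiry can be exercised without sleeping) are assumptions, not the PR's actual code.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Assumed shape of the pluggable interface (types are guesses).
class IDistributedLock {
public:
    virtual ~IDistributedLock() = default;
    virtual bool tryAcquire(const std::string& key, std::int64_t ttl_ms) = 0;
    virtual void release(const std::string& key) = 0;
    virtual std::optional<std::string> getHolderNodeId(const std::string& key) = 0;
    virtual std::string nodeId() const = 0;
};

// Hypothetical in-process lock: one mutex-protected table of expiry times.
class InProcessLockSketch final : public IDistributedLock {
public:
    using NowMs = std::function<std::int64_t()>;

    static std::int64_t steadyNowMs() {
        using namespace std::chrono;
        return duration_cast<milliseconds>(
                   steady_clock::now().time_since_epoch()).count();
    }

    explicit InProcessLockSketch(std::string node_id,
                                 NowMs now_ms = &InProcessLockSketch::steadyNowMs)
        : node_id_(std::move(node_id)), now_ms_(std::move(now_ms)) {}

    bool tryAcquire(const std::string& key, std::int64_t ttl_ms) override {
        assert(ttl_ms > 0);
        std::lock_guard<std::mutex> g(mu_);
        const std::int64_t now = now_ms_();
        auto it = expiry_.find(key);
        // Expiry is only checked here and in getHolderNodeId(); no timer thread.
        if (it != expiry_.end() && it->second > now) return false;
        expiry_[key] = now + ttl_ms;
        return true;
    }

    void release(const std::string& key) override {
        std::lock_guard<std::mutex> g(mu_);
        expiry_.erase(key);
    }

    std::optional<std::string> getHolderNodeId(const std::string& key) override {
        std::lock_guard<std::mutex> g(mu_);
        auto it = expiry_.find(key);
        if (it == expiry_.end() || it->second <= now_ms_()) return std::nullopt;
        return node_id_;  // single process: the only possible holder is this node
    }

    std::string nodeId() const override { return node_id_; }

private:
    std::string node_id_;
    NowMs now_ms_;
    std::mutex mu_;
    std::unordered_map<std::string, std::int64_t> expiry_;  // key -> expiry (ms)
};
```

A Raft-backed implementation would presumably keep the same interface and replace the local table with a replicated one; that part is outside this sketch.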

Orchestrator lock integration (database_maintenance_orchestrator.*)

  • setDistributedLock(shared_ptr<IDistributedLock>) — DI setter, thread-safe, nullable (disables feature)
  • In executeSchedule(), lock is acquired before the window check; on failure: job → SKIPPED, DEBUG log: "schedule {id} skipped — lock held by peer {node_id}"
  • RAII DistLockGuard ensures release on every exit path (success, window skip, DAG error, cancellation)
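The control flow described above can be sketched roughly as follows. Only the names IDistributedLock, DistLockGuard, tryAcquire, and release come from the PR; runWithLock, JobStatus, CountingLock, and all signatures are hypothetical simplifications of executeSchedule() for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Minimal stand-in for the PR's interface (signatures assumed).
struct IDistributedLock {
    virtual ~IDistributedLock() = default;
    virtual bool tryAcquire(const std::string& key, std::int64_t ttl_ms) = 0;
    virtual void release(const std::string& key) = 0;
};

// RAII guard: releases the key when the scope exits, on any path.
class DistLockGuard {
public:
    DistLockGuard(std::shared_ptr<IDistributedLock> lock, std::string key)
        : lock_(std::move(lock)), key_(std::move(key)) {}
    ~DistLockGuard() { if (lock_) lock_->release(key_); }
    DistLockGuard(const DistLockGuard&) = delete;
    DistLockGuard& operator=(const DistLockGuard&) = delete;
private:
    std::shared_ptr<IDistributedLock> lock_;
    std::string key_;
};

enum class JobStatus { COMPLETED, SKIPPED };

// Hypothetical reduction of executeSchedule(): lock first, then window check.
template <typename Job>
JobStatus runWithLock(const std::shared_ptr<IDistributedLock>& lock,
                      const std::string& schedule_id, std::int64_t ttl_ms,
                      bool in_window, Job&& job) {
    if (lock) {
        assert(ttl_ms > 0);
        if (!lock->tryAcquire(schedule_id, ttl_ms)) {
            return JobStatus::SKIPPED;  // a peer holds the lock; nothing to release
        }
    }
    DistLockGuard guard(lock, schedule_id);     // no-op when lock is null
    if (!in_window) return JobStatus::SKIPPED;  // guard releases here
    std::forward<Job>(job)();                   // guard releases even if this throws
    return JobStatus::COMPLETED;
}

// Test double that counts calls, in the spirit of the DL-2/DL-3 tests.
struct CountingLock : IDistributedLock {
    bool held_by_peer = false;
    int acquires = 0;
    int releases = 0;
    bool tryAcquire(const std::string&, std::int64_t) override {
        if (held_by_peer) return false;
        ++acquires;
        return true;
    }
    void release(const std::string&) override { ++releases; }
};
```

The guard is constructed only after a successful acquire (or with a null lock), so a skipped-by-peer run never releases a lock it does not own.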

TTL computation

// lock_ttl_ms == 0 → auto-derive from window + 30 s safety margin
int64_t ttl_ms = entry.lock_ttl_ms > 0
    ? entry.lock_ttl_ms
    : window_hours * 3600'000LL + 30'000LL;

MaintenanceScheduleEntry::lock_ttl_ms (default 0) makes TTL configurable per schedule.
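As a worked example of the rule above, the helper below wraps the same expression in a function. The name computeLockTtlMs and the assumption that window_hours is an integer are illustrative only.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the TTL expression from this PR.
// lock_ttl_ms > 0  -> use the explicit per-schedule value
// lock_ttl_ms == 0 -> window length in ms plus a 30 s safety margin
constexpr std::int64_t computeLockTtlMs(std::int64_t lock_ttl_ms,
                                        std::int64_t window_hours) {
    return lock_ttl_ms > 0
        ? lock_ttl_ms
        : window_hours * 3'600'000LL + 30'000LL;
}

// A 2-hour window: 2 * 3,600,000 ms + 30,000 ms = 7,230,000 ms.
static_assert(computeLockTtlMs(0, 2) == 7'230'000LL, "auto-derived TTL");
// An explicit value wins over the window-derived one.
static_assert(computeLockTtlMs(5'000, 2) == 5'000LL, "explicit TTL");
```

The margin means the lock outlives the maintenance window slightly, so a peer cannot acquire it while a job that started late in the window is still finishing.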

New fields

  • MaintenanceScheduleEntry::lock_ttl_ms — serialized in toJson()/fromJson()/applyPatch()

Linked Issues

Closes #252

Type of Change

  • Bug fix (non-breaking)
  • New feature (non-breaking)
  • Refactoring (non-breaking)
  • Documentation
  • Breaking change (requires MAJOR version bump — see VERSIONING.md)
  • Security fix
  • Other:

Breaking Change Checklist

  • MAJOR version bump planned in VERSION and CMakeLists.txt
  • Migration guide added in docs/migration/
  • Announcement prepared for GitHub Discussions (≥ 2 weeks before release)
  • CHANGELOG ### Removed / ### Changed section updated

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • Benchmarks run (if performance-sensitive change)

14 new tests in test_database_maintenance_orchestrator.cpp:

  • DL-1: no lock configured → backwards-compatible, job runs normally
  • DL-2: lock acquired → job executes, acquire + release called once
  • DL-3: lock held by peer → SKIPPED, error message contains peer node ID
  • DL-4/5: TTL auto-compute from window vs. explicit lock_ttl_ms
  • DL-6/7: lock_ttl_ms JSON round-trip and applyPatch
  • DL-8: setDistributedLock(nullptr) clears lock, no acquire/release calls
  • InProcessDistributedLock: first acquires, second blocked; release; TTL expiry; getHolderNodeId after expiry

📚 Research & Knowledge (if applicable)

  • Is this PR based on scientific paper(s) or best practices?
    • If YES: research files created in /docs/research/?
    • If YES: linked in the module README under "Wissenschaftliche Grundlagen" (Scientific Foundations)?
    • If YES: recorded in /docs/research/implementation_influence/?

Relevant sources:

  • Paper:
  • Best Practice:
  • Architecture Decision:

Checklist

  • Code follows project style guidelines (clang-format / clang-tidy)
  • Self-review completed
  • Documentation updated (if needed)
  • CHANGELOG.md updated under [Unreleased]
  • No new warnings introduced
  • Security-sensitive paths reviewed by security maintainer (if applicable)

Copilot AI linked an issue Apr 13, 2026 that may be closed by this pull request
4 tasks
- Add IDistributedLock interface + InProcessDistributedLock implementation
- Add lock_ttl_ms field to MaintenanceScheduleEntry (configurable per schedule)
- Add setDistributedLock() to DatabaseMaintenanceOrchestrator
- Integrate distributed lock in executeSchedule(): tryAcquire before job, RAII
  release guard on every code path, SKIPPED + DEBUG log when peer holds lock
- Auto-compute TTL from window_duration + 30s safety margin when lock_ttl_ms==0
- Add 14 new tests (DL-1..9 + InProcessDistributedLock unit tests)
- Update FUTURE_ENHANCEMENTS.md, ROADMAP.md, artifacts roadmap body

Agent-Logs-Url: https://github.com/makr-code/ThemisDB/sessions/5249e380-12f0-4bb6-8f83-7f577da13c19

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement distributed maintenance coordination via Raft" to "feat(maintenance): Distributed Maintenance Coordination via Raft (v2.0.0)" Apr 13, 2026
Copilot AI requested a review from makr-code April 13, 2026 16:25
@makr-code makr-code marked this pull request as ready for review April 13, 2026 16:27
@makr-code makr-code merged commit 53b0c36 into develop Apr 13, 2026
16 of 26 checks passed
makr-code added a commit that referenced this pull request Apr 14, 2026
…0.0) (#4630)

* Initial plan

* feat(maintenance): Distributed Maintenance Coordination via Raft (#252)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Distributed Maintenance Coordination via Raft
