Fix #23765: Duplicate Notifications in Multi-Replica Deployment #24866
Conversation
🔍 CI failure analysis for 8d4bd72: Playwright shard (6, 6) has 10 test failures with mixed timeout patterns (loader timeouts, hook timeouts, browser crashes, 5-min lineage timeouts). Combined with shard (5, 6), there are 16 flaky failures in total. All are CI infrastructure issues (resource exhaustion, browser instability), NOT related to the backend scheduler changes. Multiple tests passed on retry. Non-blocking; should be retried or ignored.

Combined Playwright CI Failure Analysis - Shards (5, 6) and (6, 6)

Shard (5, 6) - Previously Analyzed

6 test failures, all with the identical UI loader timeout pattern.
Shard (6, 6) - New Analysis

10 test failures with mixed timeout patterns.

Failed Tests
Error Patterns (Shard 6, 6)

- Pattern 1: UI loader timeout (same as shard 5, 6)
- Pattern 2: beforeEach hook timeout
- Pattern 3: beforeAll hook timeout
- Pattern 4: Strict mode violation (Glossary test)
- Pattern 5: Long test timeout

Root Cause Analysis

Primary cause: infrastructure/resource issues. All failures indicate CI environment problems, not code bugs.
Why This Is NOT Related to PR Changes

PR Modifications:
Failing Tests:
Evidence of Environmental Flakiness
Historical Context

This PR has experienced flaky Playwright failures across multiple shards:
Pattern: different tests failing in different shards, but with the same infrastructure symptoms.

Impact Assessment

- ✅ Non-blocking: environmental test infrastructure issues
- ❌ NOT related to PR code changes: backend scheduler changes cannot cause frontend UI test timeouts
- 🔄 Recommendation: retry the CI jobs or ignore these failures

Technical Separation
Conclusion

Both shard (5, 6) and shard (6, 6) failures are flaky infrastructure issues:

- ✅ Environmental resource exhaustion/timing issues

Verdict: ❌ NOT BLOCKING - these failures should be retried or ignored.



Describe your changes:
Fixes #23765
Preventing Duplicate Notifications in Multi-Server OpenMetadata
Problem Statement
In a multi-server OpenMetadata deployment, each server runs its own `EventSubscriptionScheduler` that polls the `change_events` table. Without coordination, each server processes the same events independently, resulting in duplicate notifications being sent.

Root Cause
The original `EventSubscriptionScheduler` used a RAM-based Quartz scheduler (`RAMJobStore`) without clustering support. Each server maintained its own independent job store with no awareness of the other servers.
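For contrast, a default non-clustered Quartz setup looks roughly like the sketch below (illustrative class and property values, not the original code): `RAMJobStore` keeps all triggers in the local JVM, so every replica independently fires the same subscription jobs.

```java
import java.util.Properties;

class RamSchedulerConfigSketch {
  // Sketch of the pre-fix situation: an in-memory job store with no shared state,
  // so each OpenMetadata server schedules and fires every subscription job itself.
  static Properties ramSchedulerConfig() {
    Properties props = new Properties();
    props.put("org.quartz.scheduler.instanceName", "EventSubscriptionScheduler"); // assumed name
    props.put("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore");       // per-JVM, no coordination
    props.put("org.quartz.threadPool.threadCount", "5");                          // illustrative
    return props;
  }
}
```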
Solution: Quartz JDBC Clustering

Architecture Overview
```mermaid
flowchart TB
    subgraph DB["Shared Database"]
        CE[("change_events<br/>Event data to process")]
        QL[("QRTZ_LOCKS<br/>Row-level locking")]
        QFT[("QRTZ_FIRED_TRIGGERS<br/>Job ownership tracking")]
    end
    subgraph S1["OM Server 1"]
        SCH1["Scheduler<br/>instanceId: AUTO-generated"]
        JS1["JDBC JobStore"]
    end
    subgraph S2["OM Server 2"]
        SCH2["Scheduler<br/>instanceId: AUTO-generated"]
        JS2["JDBC JobStore"]
    end
    subgraph S3["OM Server 3"]
        SCH3["Scheduler<br/>instanceId: AUTO-generated"]
        JS3["JDBC JobStore"]
    end
    JS1 <--> QL
    JS2 <--> QL
    JS3 <--> QL
    JS1 <--> QFT
    JS2 <--> QFT
    JS3 <--> QFT
    SCH1 --> CE
    SCH2 --> CE
    SCH3 --> CE
```
Concurrency Prevention Mechanism

```mermaid
flowchart TD
    A[Job Trigger Fires] --> B{Server attempts to<br/>acquire row lock in<br/>QRTZ_LOCKS table}
    B -->|Lock Acquired| C[Execute Job]
    B -->|Lock Denied| D[Skip Execution<br/>Another server has it]
    C --> E["@DisallowConcurrentExecution<br/>prevents same job from<br/>running again until complete"]
    E --> F[Process Events from<br/>change_events table]
    F --> G[Send Notification]
    G --> H[Release Lock]
    D --> I[Wait for Next<br/>Scheduled Trigger]
```
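To make the concurrency guard concrete, here is a minimal, hypothetical Quartz job (the class name and job-data key are illustrative; this is not the actual `AbstractEventConsumer`) showing how `@DisallowConcurrentExecution` keeps a given job from running twice at once. With the clustered JDBC JobStore, this constraint is enforced across all servers, not just within one JVM.

```java
import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.PersistJobDataAfterExecution;

@DisallowConcurrentExecution  // one running instance per JobKey; cluster-wide with a JDBC JobStore
@PersistJobDataAfterExecution // keep per-job state (e.g. last processed offset) in the job store
public class NotificationPublishJob implements Job {

  @Override
  public void execute(JobExecutionContext context) throws JobExecutionException {
    // Hypothetical processing step: read the subscription id from the job data map,
    // poll change_events for new rows, and send the corresponding notification.
    String subscriptionId = context.getMergedJobDataMap().getString("subscriptionId");
    // ... poll change_events, send notification, persist the new offset ...
  }
}
```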
Job Distribution Flow

```mermaid
sequenceDiagram
    participant S1 as Server 1
    participant DB as Database (QRTZ_LOCKS)
    participant S2 as Server 2
    participant S3 as Server 3
    Note over S1,S3: Job "subscription-123" triggers at 10:00:00
    S1->>DB: SELECT FOR UPDATE (acquire lock)
    S2->>DB: SELECT FOR UPDATE (acquire lock)
    S3->>DB: SELECT FOR UPDATE (acquire lock)
    DB-->>S1: Lock granted ✓
    DB-->>S2: Blocked (waiting)
    DB-->>S3: Blocked (waiting)
    Note over S1: Executes job
    S1->>S1: Process change_events
    S1->>S1: Send notification
    S1->>DB: Release lock
    DB-->>S2: Lock denied (job complete)
    DB-->>S3: Lock denied (job complete)
    Note over S2,S3: Skip execution - job already ran
```
Failover Handling

```mermaid
flowchart LR
    subgraph Normal["Normal Operation"]
        A1[Server 1<br/>Running Job X]
    end
    subgraph Failure["Server 1 Crashes"]
        A2[Server 1<br/>💀 Dead]
    end
    subgraph Recovery["After clusterCheckinInterval"]
        A3[Server 2<br/>Detects failure]
        A4[Server 2<br/>Takes over Job X]
    end
    Normal --> Failure
    Failure --> Recovery
    A3 --> A4
```
Key Configuration

| Setting | Value |
| --- | --- |
| `jobStore.class` | `JobStoreTX` |
| `jobStore.isClustered` | `true` |
| `scheduler.instanceId` | `AUTO` |
| `clusterCheckinInterval` | `20000` ms |
| `tablePrefix` | `QRTZ_` |

Configuration Code
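The PR's actual `CLUSTERED_SCHEDULER_CONFIG` is not reproduced here; the following is a minimal sketch of a clustered Quartz configuration built from the settings in the table above, using standard Quartz property keys. The scheduler/instance names, data-source name, and thread count are assumptions for illustration.

```java
import java.util.Properties;

import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.StdSchedulerFactory;

public final class ClusteredSchedulerConfig {

  static Properties clusteredSchedulerConfig() {
    Properties props = new Properties();
    props.put("org.quartz.scheduler.instanceName", "EventSubscriptionScheduler");              // assumed
    props.put("org.quartz.scheduler.instanceId", "AUTO");                                      // unique id per server
    props.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
    props.put("org.quartz.jobStore.driverDelegateClass", "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
    props.put("org.quartz.jobStore.tablePrefix", "QRTZ_");                                     // reuse existing QRTZ_* tables
    props.put("org.quartz.jobStore.isClustered", "true");                                      // enable row-lock based clustering
    props.put("org.quartz.jobStore.clusterCheckinInterval", "20000");                          // 20s failover detection window
    props.put("org.quartz.jobStore.dataSource", "openmetadata");                               // assumed data-source name
    props.put("org.quartz.threadPool.threadCount", "5");                                       // illustrative
    return props;
  }

  static Scheduler newClusteredScheduler() throws SchedulerException {
    // StdSchedulerFactory reads the properties above and returns a scheduler whose
    // triggers, locks, and fired-trigger records live in the shared database.
    return new StdSchedulerFactory(clusteredSchedulerConfig()).getScheduler();
  }

  private ClusteredSchedulerConfig() {}
}
```

With `isClustered=true` and a shared `QRTZ_`-prefixed schema, every server that starts with this configuration joins the same cluster and competes for the same triggers via row locks in `QRTZ_LOCKS`.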
Files Modified
- `EventSubscriptionScheduler.java` - adds `CLUSTERED_SCHEDULER_CONFIG` with JDBC JobStore settings
- `EventSubscriptionSchedulerTest.java`
- `EventSubscriptionSchedulerClusteringTest.java`

Distribution with Many Notifications
With 100+ notification subscriptions:
```mermaid
flowchart TB
    subgraph Jobs["100 Notification Jobs"]
        J1[Job 1]
        J2[Job 2]
        J3[Job 3]
        JN[Job N...]
    end
    subgraph Distribution["Distribution via Lock Contention"]
        LC["First server to acquire<br/>database lock wins"]
    end
    subgraph Servers["3 Server Cluster"]
        S1["Server 1<br/>Runs: Jobs 1, 4, 7..."]
        S2["Server 2<br/>Runs: Jobs 2, 5, 8..."]
        S3["Server 3<br/>Runs: Jobs 3, 6, 9..."]
    end
    Jobs --> LC
    LC --> Servers
```

Jobs are distributed via lock contention (not load balancing): the first server to acquire the database row lock for a given firing runs that job, so assignment is opportunistic rather than evenly balanced.
Guarantees
- Failover: a crashed server's jobs are taken over within `clusterCheckinInterval` (20s)
- Reuses the existing `QRTZ_*` tables, no schema migration needed

Existing Infrastructure Leveraged
- The Quartz tables (`QRTZ_*`) already exist in the OpenMetadata database
- `@DisallowConcurrentExecution` is already present on `AbstractEventConsumer`
- Same approach as `AppScheduler`, which already uses clustering

A usage sketch tying these pieces together follows this list.
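As a closing illustration, the sketch below schedules one notification job on a scheduler built from the clustered configuration sketched under Configuration Code, reusing the hypothetical `NotificationPublishJob` from earlier. The job/trigger identifiers and the polling interval are made up for the example.

```java
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;

public final class ClusteredSchedulingSketch {

  public static void main(String[] args) throws SchedulerException {
    // Build the scheduler from the clustered properties (see the Configuration Code sketch).
    Scheduler scheduler = ClusteredSchedulerConfig.newClusteredScheduler();

    // One JobDetail per event subscription; the key must be identical on every server
    // so the cluster treats it as the same job rather than N independent copies.
    JobDetail job = JobBuilder.newJob(NotificationPublishJob.class)
        .withIdentity("subscription-123", "event-subscriptions")   // hypothetical ids
        .usingJobData("subscriptionId", "subscription-123")
        .build();

    // A repeating trigger; whichever server wins the QRTZ_LOCKS row lock for a firing
    // executes it, and the other servers skip that firing.
    Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("subscription-123-trigger", "event-subscriptions")
        .withSchedule(SimpleScheduleBuilder.repeatSecondlyForever(60)) // illustrative poll interval
        .build();

    scheduler.scheduleJob(job, trigger);
    scheduler.start();
  }
}
```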