Skip to content

Conversation

@sylvesterdamgaard
Copy link
Contributor

Summary

This PR eliminates race conditions in Redis operations and implements queue-level metrics aggregation for real-time autoscaling decisions.

Changes

Queue Metrics Aggregation (Feature)

  • New Action: CalculateQueueMetricsAction - Aggregates job-level metrics into queue-level metrics
  • New Command: queue-metrics:calculate - Artisan command for manual/scheduled execution
  • Scheduling: Automatic calculation every minute via service provider
  • Metrics Calculated:
    • throughput_per_minute - Jobs completed in last 60 seconds
    • avg_duration_ms - Weighted average duration across all job classes
    • failure_rate - Percentage of failed jobs (0-100)

Race Condition Fixes (Critical)

  • Atomic Discovery: Job and queue discovery now happens atomically with recordStart()
  • Transaction Consolidation: Changed hostname metrics from pipeline to transaction
  • Atomic Failure Recording: All failure operations consolidated into single transaction
  • Atomic Exception Recording: Exception tracking wrapped in transaction
  • Atomic Retry/Timeout: Retry and timeout recording made atomic
  • Lua Script: Throughput calculation uses Lua script to prevent timestamp race conditions

Defensive Programming

  • Added division by zero check in trend analysis throughput calculation
  • Comprehensive PHPDoc documentation for sample methods
  • Clarified timestamp storage and ordering behavior

Test Updates

  • Updated RecordJobStartActionTest to reflect atomic queue discovery
  • Added comprehensive test suite for queue metrics calculation
  • All 118 tests passing

Test Plan

✅ Completed Tests

  • Unit tests pass (118 passed, 0 failed)
  • RecordJobStartActionTest - All 5 tests passing
  • CalculateQueueMetricsTest - All 6 scenarios covered (requires Redis)
  • Code formatting verified with Pint
  • Atomic operations tested via unit tests with mocked repositories

Test Suites Covered

  • tests/Unit/Actions/RecordJobStartActionTest.php
  • tests/Feature/CalculateQueueMetricsTest.php
  • tests/Unit/Actions/RecordJobCompletionActionTest.php
  • tests/Unit/Actions/RecordJobFailureActionTest.php

Integration Testing (Requires Redis)

To run Redis-dependent tests:

REDIS_AVAILABLE=1 vendor/bin/pest --group redis

Breaking Changes

Failure Rate Format: The failure_rate metric now returns a percentage (0-100) instead of a ratio (0-1). Applications expecting the ratio format should divide by 100.

Performance Impact

  • Positive: Lua script reduces round trips for throughput calculation
  • Positive: Consolidated operations reduce total Redis commands
  • Neutral: Transaction overhead minimal for small operation sets
  • Overall: Maintained performance while ensuring correctness

Migration Notes

  • No database migrations required
  • No configuration changes needed
  • Redis schema unchanged
  • Backward compatible except for failure_rate format

Implement queue-level metrics by aggregating job-level data to provide
real-time queue performance insights for autoscaling decisions.

- Add CalculateQueueMetricsAction for weighted average calculations
- Add queue-metrics:calculate artisan command
- Schedule automatic calculation every minute
- Calculate throughput_per_minute and avg_duration_ms
- Calculate failure_rate as percentage (0-100)
- Support both all-queues and single-queue modes
- Add comprehensive test coverage

This enables the scaling package to accurately calculate worker
requirements based on near real-time queue metrics.
Eliminate race conditions in Redis operations by wrapping critical
sections in atomic transactions and using Lua scripts.

Critical fixes:
- Discovery set registration now atomic with recordStart()
- Queue discovery consolidated into recordStart() transaction
- Hostname metrics changed from pipeline to transaction
- Failure recording fully atomic including hostname updates
- Exception recording wrapped in transaction
- Retry recording made atomic
- Timeout recording made atomic
- Throughput calculation uses Lua script for atomicity

Medium priority fixes:
- Add comprehensive documentation for sample methods
- Clarify timestamp storage and ordering behavior

This ensures data consistency under concurrent job processing without
race conditions between separate Redis operations.
Remove markQueueDiscovered() mock expectations as queue discovery now
happens atomically within recordStart() transaction.
Add proper type guards and remove unused dependencies to pass PHPStan
level 9 analysis.

- Add type guards for mixed array values in CalculateQueueMetricsAction
- Add type assertions for command option values
- Remove unused QueueMetricsRepository dependency from RecordJobStartAction
- Handle mixed types from getMetrics() return values safely

All 118 tests still passing with strict type safety.
… args

Fixed critical bug in PipelineWrapper where addToSet() and removeFromSet()
were passing arrays directly to PhpRedis sadd()/srem() commands instead of
using spread operator. This caused job discovery to fail completely.

Changes:
- PipelineWrapper::addToSet(): Changed `sadd($key, $members)` to `sadd($key, ...$members)`
- PipelineWrapper::removeFromSet(): Changed `srem($key, $members)` to `srem($key, ...$members)`
- CalculateQueueMetricsTest: Added recordStart() calls before recordCompletion()
- CalculateQueueMetricsTest: Added Redis flush in beforeEach() for clean state
- CalculateQueueMetricsTest: Changed float assertions from toBe() to toEqualWithDelta()

Root Cause:
PhpRedis expects variadic arguments for set operations:
- sadd(key, member1, member2, ...) ✅
- sadd(key, [member1, member2]) ❌

Impact:
This bug prevented all job discovery tracking, causing listJobs() to return
empty arrays even though job metrics were being recorded correctly. This
made queue-level metric aggregation fail with 0 values.
@sylvesterdamgaard sylvesterdamgaard merged commit a0d976b into main Nov 20, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants