@sylvesterdamgaard
Contributor
Summary

This PR addresses critical architectural issues discovered during testing where metrics from different abstraction layers were being mixed, causing confusion and incorrect data representation. The most critical issue was server CPU metrics showing individual worker process CPU (6.39%) instead of actual server CPU (40%).

Changes

🔴 Critical Bug Fixes

Fixed jobs/minute calculation (5x multiplication error)

  • Previously used cumulative worker uptime (sum of all workers)
  • Now uses actual wall-clock elapsed time (oldest worker uptime)
  • Example: 5 workers running 10 min each = 50 min cumulative → now correctly 10 min elapsed
  • Files: WorkerMetricsQueryService.php:270-284
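The fix can be illustrated with a short sketch (Python here for brevity; function names are illustrative, not the actual `WorkerMetricsQueryService` API):

```python
# Illustrative sketch of the jobs/minute fix, not the actual PHP service code.
# Each worker reports its own uptime; summing them overstates elapsed time.

def jobs_per_minute_buggy(total_jobs: int, worker_uptimes_sec: list[float]) -> float:
    # BUG: cumulative uptime of all workers, not wall-clock time
    cumulative_min = sum(worker_uptimes_sec) / 60
    return total_jobs / cumulative_min

def jobs_per_minute_fixed(total_jobs: int, worker_uptimes_sec: list[float]) -> float:
    # The oldest worker's uptime approximates wall-clock elapsed time
    elapsed_min = max(worker_uptimes_sec) / 60
    return total_jobs / elapsed_min

uptimes = [600.0] * 5                       # 5 workers, 10 minutes each
print(jobs_per_minute_buggy(500, uptimes))  # 10.0 -- 5x too low
print(jobs_per_minute_fixed(500, uptimes))  # 50.0 -- correct
```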

Fixed server CPU metrics confusion

  • Worker process CPU (6%) was shown as server CPU when actual was 40%
  • Now clearly separated into worker_processes (process-level) and server_resources (system-level)

🏗️ Server Metrics Restructure

Separated server metrics into 4 clear abstraction tiers (5 top-level sections, since the application tier has two):

  1. Application tier: queue_workers

    • count: total, active, idle workers
    • utilization.current_busy_percent: % workers busy RIGHT NOW
    • utilization.lifetime_busy_percent: % TIME workers have been busy
  2. Application tier: job_processing

    • lifetime: total_processed, total_failed, failure_rate_percent
    • current: jobs_per_minute (based on elapsed time), avg_duration_ms
  3. Process tier: worker_processes

    • Per-worker CPU/memory averages
    • Peak memory across all workers
  4. System tier: server_resources

    • Actual server CPU/memory from SystemMetrics
    • This is the REAL server usage
  5. Capacity tier: capacity

    • Scaling recommendations based on utilization
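The two utilization figures in the application tier measure different things. A minimal sketch of the distinction, assuming the computations below match the field semantics described above (the formulas are an assumption, not taken from the PR's source):

```python
# Hedged sketch: field names come from the PR; the exact computation
# is assumed from their descriptions.

def current_busy_percent(active_workers: int, total_workers: int) -> float:
    # % of workers busy RIGHT NOW (instantaneous snapshot)
    return 100.0 * active_workers / total_workers

def lifetime_busy_percent(busy_seconds: float, uptime_seconds: float) -> float:
    # % of TIME workers have spent busy since they started
    return 100.0 * busy_seconds / uptime_seconds

print(current_busy_percent(4, 5))                       # 80.0
print(round(lifetime_busy_percent(2_707, 3_600), 1))    # 75.2
```

A fleet can be 80% busy right now while having been only 75% busy over its lifetime, which is why the PR exposes both numbers instead of a single `utilization_rate`.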

Breaking changes:

// Before
'workers' => ['total' => 5],
'utilization' => ['server_utilization' => 0.8],
'system_limits' => ['cpu' => [...]]

// After
'queue_workers' => [
    'count' => ['total' => 5, 'active' => 4, 'idle' => 1],
    'utilization' => [
        'current_busy_percent' => 80.0,
        'lifetime_busy_percent' => 75.2,
    ],
],
'server_resources' => ['cpu' => [...]]

📊 Queue Metrics Separation

Separated queue metrics by time scope with explicit windows:

  • depth: Instantaneous queue state (current snapshot)

    • total, pending, scheduled, reserved
    • oldest_job_age_seconds, oldest_job_age_status
  • performance_60s: Windowed performance metrics

    • throughput_per_minute, avg_duration_ms
    • window_seconds: 60 (explicit time window)
  • lifetime: Lifetime metrics since first job

    • failure_rate_percent
  • workers: Worker state and efficiency

    • active_count
    • current_busy_percent (% workers busy now)
    • lifetime_busy_percent (% time spent busy)
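The `performance_60s` windowing can be sketched as follows. The PR computes this atomically in Redis via `ZRANGEBYSCORE`; this in-memory Python version only illustrates the windowing logic, with illustrative names:

```python
# Sketch of a 60-second windowed throughput over completion timestamps.
# The real implementation uses a Redis sorted set; this list stands in for it.

WINDOW_SECONDS = 60

def performance_60s(completions: list[float], now: float) -> dict:
    window_start = now - WINDOW_SECONDS
    in_window = [t for t in completions if t >= window_start]
    return {
        "throughput_per_minute": len(in_window) * (60 / WINDOW_SECONDS),
        "window_seconds": WINDOW_SECONDS,  # explicit, as in the new response
    }

now = 1_000_000.0
completions = [now - 30, now - 45, now - 90]  # two inside the window, one outside
print(performance_60s(completions, now))
# {'throughput_per_minute': 2.0, 'window_seconds': 60}
```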

Breaking changes:

// Before
'depth' => 42,
'pending' => 10,
'throughput_per_minute' => 100,
'utilization_rate' => 75.5

// After
'depth' => [
    'total' => 42,
    'pending' => 10,
    // ...
],
'performance_60s' => [
    'throughput_per_minute' => 100,
    'window_seconds' => 60,
],
'workers' => [
    'current_busy_percent' => 80.0,
    'lifetime_busy_percent' => 75.5,
]

⏱️ Trend Analysis Enhancement

Added comprehensive time window context to all trend methods:

New time_window object in all trend responses:

'time_window' => [
    'window_seconds' => 3600,
    'window_start' => '2025-01-20T10:00:00+00:00',
    'window_end' => '2025-01-20T11:00:00+00:00',
    'analyzed_at' => '2025-01-20T11:00:00+00:00',
    'sample_count' => 60,
    'sample_interval_seconds' => 60, // queue depth only
]
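Assembling such an object from a sample series might look like this (a sketch, assuming samples are ordered timestamps; only the field names are taken from the PR):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical builder for the time_window object; field names match the PR,
# the construction itself is an assumption.

def build_time_window(samples: list[datetime], window_seconds: int) -> dict:
    end = samples[-1]  # newest sample closes the window
    return {
        "window_seconds": window_seconds,
        "window_start": (end - timedelta(seconds=window_seconds)).isoformat(),
        "window_end": end.isoformat(),
        "analyzed_at": datetime.now(timezone.utc).isoformat(),
        "sample_count": len(samples),
    }

end = datetime(2025, 1, 20, 11, 0, tzinfo=timezone.utc)
samples = [end - timedelta(minutes=m) for m in reversed(range(60))]
tw = build_time_window(samples, 3600)
print(tw["window_start"], tw["sample_count"])  # 2025-01-20T10:00:00+00:00 60
```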

Breaking changes:

  • Added time_window wrapper object
  • Moved period_seconds into time_window.window_seconds
  • Added ISO8601 timestamps for start/end/analyzed_at
  • Added sample_count for transparency

🎛️ Dashboard Filtering Updates

Updated all dashboard filter methods to:

  • Map new hierarchical structures correctly
  • Extract nested values properly (depth.total, performance_60s.throughput_per_minute)
  • Maintain separation of current vs lifetime metrics
  • Preserve both utilization percentages

Breaking Changes

⚠️ All changes involve response structure modifications

API Response Changes

  1. Server Metrics (/api/metrics/workers or getOverview())

    • system_limits → server_resources
    • Flat structure → 4-tier hierarchy
    • utilization_rate → current_busy_percent + lifetime_busy_percent
  2. Queue Metrics (/api/metrics/queues or getAllQueuesWithMetrics())

    • Flat depth values → nested depth object
    • throughput_per_minute → performance_60s.throughput_per_minute
    • utilization_rate → workers.current_busy_percent + lifetime_busy_percent
  3. Trend Metrics (/api/metrics/trends or trend analysis methods)

    • Added time_window wrapper object
    • period_seconds → time_window.window_seconds

Test Plan

  • PHPStan analysis passes (no type errors)
  • Laravel Pint formatting applied
  • All 130 tests pass (3 skipped - require queue worker)
  • Unit tests for WorkerMetricsQueryService pass
  • Unit tests for QueueMetricsQueryService pass
  • Feature tests for CalculateQueueMetrics pass
  • Performance benchmarks pass
  • Manual testing with live worker metrics
  • Verify server CPU shows actual system usage (not worker process CPU)
  • Verify jobs/minute calculation matches actual elapsed time
  • Verify current vs lifetime utilization percentages are distinct

Test Results:

Tests:    3 skipped, 130 passed (363 assertions)
Duration: 3.62s

Files Modified

  • src/Services/WorkerMetricsQueryService.php - Server metrics restructure + jobs/min fix
  • src/Services/QueueMetricsQueryService.php - Queue metrics separation + worker utilization
  • src/Services/OverviewQueryService.php - Dashboard filtering updates
  • src/Services/TrendAnalysisService.php - Time window context additions

Migration Notes

Frontend consumers will need to update:

  • Access nested structures: data.depth.total instead of data.depth
  • Use performance_60s.throughput_per_minute instead of data.throughput_per_minute
  • Use server_resources instead of system_limits
  • Handle both current_busy_percent and lifetime_busy_percent for different use cases
  • Extract trend timestamps from time_window object
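For consumers that previously read flat keys, a small defensive accessor can ease the transition. This helper is hypothetical, not part of the package; shown in Python to mirror the response shapes above:

```python
# Hypothetical migration helper: read the new nested payload via a dotted path.

def get_path(payload: dict, dotted: str, default=None):
    node = payload
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

new_payload = {
    "depth": {"total": 42, "pending": 10},
    "performance_60s": {"throughput_per_minute": 100, "window_seconds": 60},
}
print(get_path(new_payload, "depth.total"))                            # 42 (was payload["depth"])
print(get_path(new_payload, "performance_60s.throughput_per_minute"))  # 100
```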

sylvesterdamgaard and others added 4 commits November 20, 2025 17:30
- Update gophpeek/system-metrics to v1.4.0 for 21x faster CPU metrics
- Add system resource limits to server metrics (CPU cores, memory, load avg)
- Implement 5-second cache for CPU metrics to avoid macOS performance issues
- Restructure server response to clearly separate worker vs system metrics
- Add load average metrics (1min, 5min, 15min) via new SystemMetrics API
- Update API usage to handle v1.4.0 breaking changes

System limits now include:
- CPU: cores, usage_percent, load_average
- Memory: total_mb, used_mb, available_mb, usage_percent

Performance improvements:
- CPU metrics: 2300ms → 105ms on macOS (21x faster via FFI)
- Load average: 12x faster via native FFI calls
- Static caching prevents repeated expensive syscalls
Fix race condition where throughput_per_minute and avg_duration_ms
were calculated from different time windows, causing mismatched metrics.

Changes:
- Add getAverageDurationInWindow() method to calculate duration from same window as throughput
- Use 60-second window for both throughput and avg_duration calculations
- Store duration/memory/CPU samples with unique "jobId:value" format to prevent overwrites
- Calculate weighted average duration across all jobs in the queue
- Separate lifetime metrics (failure_rate) from windowed metrics (throughput, avg_duration)

Technical details:
- Atomic Lua script calculates average duration from windowed samples
- Prevents race condition where jobs with identical values overwrote each other
- Ensures metric consistency by using the same Redis ZRANGEBYSCORE window
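The "jobId:value" encoding matters because sorted-set members must be unique: two jobs with the same duration would otherwise overwrite each other. A minimal in-memory sketch of the idea (a plain dict stands in for the Redis ZSET, and a simple mean stands in for the commit's weighted average):

```python
# Sketch of the "jobId:value" sample encoding described above.
# Members are unique per job, so identical durations no longer collide.

samples: dict[str, float] = {}  # member -> score (timestamp), stand-in for a ZSET

def record_duration(job_id: str, duration_ms: float, ts: float) -> None:
    samples[f"{job_id}:{duration_ms}"] = ts  # unique member per job

def average_duration_in_window(window_start: float) -> float:
    # Analogue of ZRANGEBYSCORE window_start +inf, parsing values back out
    values = [float(m.split(":", 1)[1])
              for m, ts in samples.items() if ts >= window_start]
    return sum(values) / len(values) if values else 0.0

record_duration("job-1", 120.0, ts=100.0)
record_duration("job-2", 120.0, ts=101.0)  # same value, distinct member
record_duration("job-3", 60.0, ts=102.0)
print(average_duration_in_window(window_start=99.0))  # 100.0
```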
BREAKING CHANGE: Major restructuring of metrics API responses to separate
abstraction layers, time windows, and current vs lifetime metrics.

## Server Metrics Restructure

Separated server metrics into 4 clear tiers:

1. Application tier: queue_workers (count, current/lifetime utilization)
2. Application tier: job_processing (lifetime totals, current performance)
3. Process tier: worker_processes (per-worker resource averages)
4. System tier: server_resources (actual server CPU/memory)

Breaking changes:
- Renamed system_limits → server_resources for clarity
- Split flat utilization_rate into current_busy_percent and lifetime_busy_percent
- Restructured workers/performance/utilization into nested hierarchy

## Queue Metrics Separation

Separated queue metrics by time scope:

- depth: Instantaneous queue state (current snapshot)
- performance_60s: Windowed metrics with explicit window_seconds
- lifetime: Lifetime metrics (failure_rate_percent since first job)
- workers: Current vs lifetime busy percentages

Breaking changes:
- Flat depth/pending/etc moved into depth object
- throughput_per_minute moved to performance_60s.throughput_per_minute
- utilization_rate split into workers.current_busy_percent and lifetime_busy_percent

## Trend Analysis Enhancement

Added comprehensive time_window context to all trend methods:

- window_seconds: Duration of analysis window
- window_start/window_end: ISO8601 timestamps
- analyzed_at: When analysis was performed
- sample_count: Number of data points analyzed

Breaking changes:
- Added time_window object wrapper to all trend responses
- Moved period_seconds into time_window structure

## Critical Bug Fix

Fixed jobs/minute calculation error (5x multiplication):
- Changed from cumulative worker uptime to actual elapsed time
- Used oldest worker uptime as proxy for wall-clock elapsed time
- Example: 5 workers × 10min = 50min cumulative → now 10min elapsed

## Dashboard Filtering Updates

Updated all dashboard filter methods to:
- Map new hierarchical structures correctly
- Maintain separation of current vs lifetime metrics
- Preserve time window context

Files modified:
- src/Services/WorkerMetricsQueryService.php
- src/Services/QueueMetricsQueryService.php
- src/Services/OverviewQueryService.php
- src/Services/TrendAnalysisService.php
@sylvesterdamgaard sylvesterdamgaard merged commit 4119d72 into main Nov 20, 2025
1 check passed