@sylvesterdamgaard
Contributor
Summary

This PR addresses critical architectural issues discovered during testing where metrics from different abstraction layers were being mixed, causing confusion and incorrect data representation. The most critical issue was server CPU metrics showing individual worker process CPU (6.39%) instead of actual server CPU (40%).

Changes

🔴 Critical Bug Fixes

Fixed jobs/minute calculation (5x multiplication error)

  • Previously used cumulative worker uptime (sum of all workers)
  • Now uses actual wall-clock elapsed time (oldest worker uptime)
  • Example: 5 workers running 10 min each = 50 min cumulative → now correctly 10 min elapsed
  • Files: WorkerMetricsQueryService.php:270-284
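The fix can be illustrated with a short sketch (Python here for brevity; function names are illustrative, not the actual `WorkerMetricsQueryService` API):

```python
# Illustrative sketch of the jobs/minute fix, not the actual PHP service code.
# Each worker reports its own uptime; summing them overstates elapsed time.

def jobs_per_minute_buggy(total_jobs: int, worker_uptimes_sec: list[float]) -> float:
    # BUG: cumulative uptime of all workers, not wall-clock time
    cumulative_min = sum(worker_uptimes_sec) / 60
    return total_jobs / cumulative_min

def jobs_per_minute_fixed(total_jobs: int, worker_uptimes_sec: list[float]) -> float:
    # The oldest worker's uptime approximates wall-clock elapsed time
    elapsed_min = max(worker_uptimes_sec) / 60
    return total_jobs / elapsed_min

uptimes = [600.0] * 5                       # 5 workers, 10 minutes each
print(jobs_per_minute_buggy(500, uptimes))  # 10.0 -- 5x too low
print(jobs_per_minute_fixed(500, uptimes))  # 50.0 -- correct
```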

Fixed server CPU metrics confusion

  • Worker process CPU (6%) was shown as server CPU when actual was 40%
  • Now clearly separated into worker_processes (process-level) and server_resources (system-level)

🏗️ Server Metrics Restructure

Separated server metrics into 4 clear abstraction tiers (5 top-level sections, since the application tier has two):

  1. Application tier: queue_workers

    • count: total, active, idle workers
    • utilization.current_busy_percent: % workers busy RIGHT NOW
    • utilization.lifetime_busy_percent: % TIME workers have been busy
  2. Application tier: job_processing

    • lifetime: total_processed, total_failed, failure_rate_percent
    • current: jobs_per_minute (based on elapsed time), avg_duration_ms
  3. Process tier: worker_processes

    • Per-worker CPU/memory averages
    • Peak memory across all workers
  4. System tier: server_resources

    • Actual server CPU/memory from SystemMetrics
    • This is the REAL server usage
  5. Capacity tier: capacity

    • Scaling recommendations based on utilization
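The two utilization figures in the application tier measure different things. A minimal sketch of the distinction, assuming the computations below match the field semantics described above (the formulas are an assumption, not taken from the PR's source):

```python
# Hedged sketch: field names come from the PR; the exact computation
# is assumed from their descriptions.

def current_busy_percent(active_workers: int, total_workers: int) -> float:
    # % of workers busy RIGHT NOW (instantaneous snapshot)
    return 100.0 * active_workers / total_workers

def lifetime_busy_percent(busy_seconds: float, uptime_seconds: float) -> float:
    # % of TIME workers have spent busy since they started
    return 100.0 * busy_seconds / uptime_seconds

print(current_busy_percent(4, 5))                       # 80.0
print(round(lifetime_busy_percent(2_707, 3_600), 1))    # 75.2
```

A fleet can be 80% busy right now while having been only 75% busy over its lifetime, which is why the PR exposes both numbers instead of a single `utilization_rate`.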

Breaking changes:

// Before
'workers' => ['total' => 5],
'utilization' => ['server_utilization' => 0.8],
'system_limits' => ['cpu' => [...]]

// After
'queue_workers' => [
    'count' => ['total' => 5, 'active' => 4, 'idle' => 1],
    'utilization' => [
        'current_busy_percent' => 80.0,
        'lifetime_busy_percent' => 75.2,
    ],
],
'server_resources' => ['cpu' => [...]]

📊 Queue Metrics Separation

Separated queue metrics by time scope with explicit windows:

  • depth: Instantaneous queue state (current snapshot)

    • total, pending, scheduled, reserved
    • oldest_job_age_seconds, oldest_job_age_status
  • performance_60s: Windowed performance metrics

    • throughput_per_minute, avg_duration_ms
    • window_seconds: 60 (explicit time window)
  • lifetime: Lifetime metrics since first job

    • failure_rate_percent
  • workers: Worker state and efficiency

    • active_count
    • current_busy_percent (% workers busy now)
    • lifetime_busy_percent (% time spent busy)
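The `performance_60s` windowing can be sketched as follows. The PR computes this atomically in Redis via `ZRANGEBYSCORE`; this in-memory Python version only illustrates the windowing logic, with illustrative names:

```python
# Sketch of a 60-second windowed throughput over completion timestamps.
# The real implementation uses a Redis sorted set; this list stands in for it.

WINDOW_SECONDS = 60

def performance_60s(completions: list[float], now: float) -> dict:
    window_start = now - WINDOW_SECONDS
    in_window = [t for t in completions if t >= window_start]
    return {
        "throughput_per_minute": len(in_window) * (60 / WINDOW_SECONDS),
        "window_seconds": WINDOW_SECONDS,  # explicit, as in the new response
    }

now = 1_000_000.0
completions = [now - 30, now - 45, now - 90]  # two inside the window, one outside
print(performance_60s(completions, now))
# {'throughput_per_minute': 2.0, 'window_seconds': 60}
```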

Breaking changes:

// Before
'depth' => 42,
'pending' => 10,
'throughput_per_minute' => 100,
'utilization_rate' => 75.5

// After
'depth' => [
    'total' => 42,
    'pending' => 10,
    // ...
],
'performance_60s' => [
    'throughput_per_minute' => 100,
    'window_seconds' => 60,
],
'workers' => [
    'current_busy_percent' => 80.0,
    'lifetime_busy_percent' => 75.5,
]

⏱️ Trend Analysis Enhancement

Added comprehensive time window context to all trend methods:

New time_window object in all trend responses:

'time_window' => [
    'window_seconds' => 3600,
    'window_start' => '2025-01-20T10:00:00+00:00',
    'window_end' => '2025-01-20T11:00:00+00:00',
    'analyzed_at' => '2025-01-20T11:00:00+00:00',
    'sample_count' => 60,
    'sample_interval_seconds' => 60, // queue depth only
]
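Assembling such an object from a sample series might look like this (a sketch, assuming samples are ordered timestamps; only the field names are taken from the PR):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical builder for the time_window object; field names match the PR,
# the construction itself is an assumption.

def build_time_window(samples: list[datetime], window_seconds: int) -> dict:
    end = samples[-1]  # newest sample closes the window
    return {
        "window_seconds": window_seconds,
        "window_start": (end - timedelta(seconds=window_seconds)).isoformat(),
        "window_end": end.isoformat(),
        "analyzed_at": datetime.now(timezone.utc).isoformat(),
        "sample_count": len(samples),
    }

end = datetime(2025, 1, 20, 11, 0, tzinfo=timezone.utc)
samples = [end - timedelta(minutes=m) for m in reversed(range(60))]
tw = build_time_window(samples, 3600)
print(tw["window_start"], tw["sample_count"])  # 2025-01-20T10:00:00+00:00 60
```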

Breaking changes:

  • Added time_window wrapper object
  • Moved period_seconds into time_window.window_seconds
  • Added ISO8601 timestamps for start/end/analyzed_at
  • Added sample_count for transparency

🎛️ Dashboard Filtering Updates

Updated all dashboard filter methods to:

  • Map new hierarchical structures correctly
  • Extract nested values properly (depth.total, performance_60s.throughput_per_minute)
  • Maintain separation of current vs lifetime metrics
  • Preserve both utilization percentages

Breaking Changes

⚠️ All changes involve response structure modifications

API Response Changes

  1. Server Metrics (/api/metrics/workers or getOverview())

    • system_limits → server_resources
    • Flat structure → 4-tier hierarchy
    • utilization_rate → current_busy_percent + lifetime_busy_percent
  2. Queue Metrics (/api/metrics/queues or getAllQueuesWithMetrics())

    • Flat depth values → nested depth object
    • throughput_per_minute → performance_60s.throughput_per_minute
    • utilization_rate → workers.current_busy_percent + lifetime_busy_percent
  3. Trend Metrics (/api/metrics/trends or trend analysis methods)

    • Added time_window wrapper object
    • period_seconds → time_window.window_seconds

Test Plan

  • PHPStan analysis passes (no type errors)
  • Laravel Pint formatting applied
  • All 130 tests pass (3 skipped - require queue worker)
  • Unit tests for WorkerMetricsQueryService pass
  • Unit tests for QueueMetricsQueryService pass
  • Feature tests for CalculateQueueMetrics pass
  • Performance benchmarks pass
  • Manual testing with live worker metrics
  • Verify server CPU shows actual system usage (not worker process CPU)
  • Verify jobs/minute calculation matches actual elapsed time
  • Verify current vs lifetime utilization percentages are distinct

Test Results:

Tests:    3 skipped, 130 passed (363 assertions)
Duration: 3.62s

Files Modified

  • src/Services/WorkerMetricsQueryService.php - Server metrics restructure + jobs/min fix
  • src/Services/QueueMetricsQueryService.php - Queue metrics separation + worker utilization
  • src/Services/OverviewQueryService.php - Dashboard filtering updates
  • src/Services/TrendAnalysisService.php - Time window context additions

Migration Notes

Frontend consumers will need to update:

  • Access nested structures: data.depth.total instead of data.depth
  • Use performance_60s.throughput_per_minute instead of data.throughput_per_minute
  • Use server_resources instead of system_limits
  • Handle both current_busy_percent and lifetime_busy_percent for different use cases
  • Extract trend timestamps from time_window object
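For consumers that previously read flat keys, a small defensive accessor can ease the transition. This helper is hypothetical, not part of the package; shown in Python to mirror the response shapes above:

```python
# Hypothetical migration helper: read the new nested payload via a dotted path.

def get_path(payload: dict, dotted: str, default=None):
    node = payload
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

new_payload = {
    "depth": {"total": 42, "pending": 10},
    "performance_60s": {"throughput_per_minute": 100, "window_seconds": 60},
}
print(get_path(new_payload, "depth.total"))                            # 42 (was payload["depth"])
print(get_path(new_payload, "performance_60s.throughput_per_minute"))  # 100
```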

sylvesterdamgaard and others added 4 commits November 20, 2025 17:30
- Update gophpeek/system-metrics to v1.4.0 for 21x faster CPU metrics
- Add system resource limits to server metrics (CPU cores, memory, load avg)
- Implement 5-second cache for CPU metrics to avoid macOS performance issues
- Restructure server response to clearly separate worker vs system metrics
- Add load average metrics (1min, 5min, 15min) via new SystemMetrics API
- Update API usage to handle v1.4.0 breaking changes

System limits now include:
- CPU: cores, usage_percent, load_average
- Memory: total_mb, used_mb, available_mb, usage_percent

Performance improvements:
- CPU metrics: 2300ms → 105ms on macOS (21x faster via FFI)
- Load average: 12x faster via native FFI calls
- Static caching prevents repeated expensive syscalls
Fix race condition where throughput_per_minute and avg_duration_ms
were calculated from different time windows, causing mismatched metrics.

Changes:
- Add getAverageDurationInWindow() method to calculate duration from same window as throughput
- Use 60-second window for both throughput and avg_duration calculations
- Store duration/memory/CPU samples with unique "jobId:value" format to prevent overwrites
- Calculate weighted average duration across all jobs in the queue
- Separate lifetime metrics (failure_rate) from windowed metrics (throughput, avg_duration)

Technical details:
- Atomic Lua script calculates average duration from windowed samples
- Prevents race condition where jobs with identical values overwrote each other
- Ensures metric consistency by using the same Redis ZRANGEBYSCORE window
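The "jobId:value" encoding matters because sorted-set members must be unique: two jobs with the same duration would otherwise overwrite each other. A minimal in-memory sketch of the idea (a plain dict stands in for the Redis ZSET, and a simple mean stands in for the commit's weighted average):

```python
# Sketch of the "jobId:value" sample encoding described above.
# Members are unique per job, so identical durations no longer collide.

samples: dict[str, float] = {}  # member -> score (timestamp), stand-in for a ZSET

def record_duration(job_id: str, duration_ms: float, ts: float) -> None:
    samples[f"{job_id}:{duration_ms}"] = ts  # unique member per job

def average_duration_in_window(window_start: float) -> float:
    # Analogue of ZRANGEBYSCORE window_start +inf, parsing values back out
    values = [float(m.split(":", 1)[1])
              for m, ts in samples.items() if ts >= window_start]
    return sum(values) / len(values) if values else 0.0

record_duration("job-1", 120.0, ts=100.0)
record_duration("job-2", 120.0, ts=101.0)  # same value, distinct member
record_duration("job-3", 60.0, ts=102.0)
print(average_duration_in_window(window_start=99.0))  # 100.0
```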
BREAKING CHANGE: Major restructuring of metrics API responses to separate
abstraction layers, time windows, and current vs lifetime metrics.

## Server Metrics Restructure

Separated server metrics into 4 clear tiers:

1. Application tier: queue_workers (count, current/lifetime utilization)
2. Application tier: job_processing (lifetime totals, current performance)
3. Process tier: worker_processes (per-worker resource averages)
4. System tier: server_resources (actual server CPU/memory)

Breaking changes:
- Renamed system_limits → server_resources for clarity
- Split flat utilization_rate into current_busy_percent and lifetime_busy_percent
- Restructured workers/performance/utilization into nested hierarchy

## Queue Metrics Separation

Separated queue metrics by time scope:

- depth: Instantaneous queue state (current snapshot)
- performance_60s: Windowed metrics with explicit window_seconds
- lifetime: Lifetime metrics (failure_rate_percent since first job)
- workers: Current vs lifetime busy percentages

Breaking changes:
- Flat depth/pending/etc moved into depth object
- throughput_per_minute moved to performance_60s.throughput_per_minute
- utilization_rate split into workers.current_busy_percent and lifetime_busy_percent

## Trend Analysis Enhancement

Added comprehensive time_window context to all trend methods:

- window_seconds: Duration of analysis window
- window_start/window_end: ISO8601 timestamps
- analyzed_at: When analysis was performed
- sample_count: Number of data points analyzed

Breaking changes:
- Added time_window object wrapper to all trend responses
- Moved period_seconds into time_window structure

## Critical Bug Fix

Fixed jobs/minute calculation error (5x multiplication):
- Changed from cumulative worker uptime to actual elapsed time
- Used oldest worker uptime as proxy for wall-clock elapsed time
- Example: 5 workers × 10min = 50min cumulative → now 10min elapsed

## Dashboard Filtering Updates

Updated all dashboard filter methods to:
- Map new hierarchical structures correctly
- Maintain separation of current vs lifetime metrics
- Preserve time window context

Files modified:
- src/Services/WorkerMetricsQueryService.php
- src/Services/QueueMetricsQueryService.php
- src/Services/OverviewQueryService.php
- src/Services/TrendAnalysisService.php
@sylvesterdamgaard sylvesterdamgaard merged commit 4119d72 into main Nov 20, 2025
1 check passed