-
Notifications
You must be signed in to change notification settings - Fork 383
Treeherder RW Postgres — Load Increase Investigation
Instance: treeherder-prod-prod-v1-postgres (Cloud SQL, Postgres 15, db-custom-8-53248 = 8 vCPU / 52 GB), project moz-fx-treeherder-prod-c739, region us-west1. A read replica (treeherder-prod-prod-v1-postgres-replica-0) exists.
Investigated: June 2026, read-only, via the Cloud Monitoring API and Cloud SQL Query Insights per-query metrics. All raw numbers and the exact API calls used are in the appendices so this can be independently re-created.
Sustained CPU on the RW instance roughly tripled between summer 2025 and May 2026 (monthly-mean CPU ~0.13 → ~0.42; daily peaks pegged at 100% for months), then dropped back to ~0.23 in June 2026.
Root cause of the increase: one query — the Log Parser Intermittent Detector (LPID) classification query in treeherder/log_parser/intermittents.py (check_and_mark_intermittent), added and expanded across a series of 2025 PRs (Joel Maher). It grew steadily more expensive as the tables it scans accumulated rows.
Cause of the June decrease: that query was optimized in #9553 (Cameron Dawson, merged 2026-05-27) — confirmed not a summer-traffic artifact, since call volume was flat-to-rising over the same window (§3).
Quantified at the query level (Query Insights, RW instance):
| LPID query |
/jobs list query |
|
|---|---|---|
| total time, May 17–24 (pre-fix) | 112 h/wk — #1 | 61 h — #2 |
| total time, Jun 8–15 (post-fix) | 2.8 h/wk — #6 | 22 h — #1 |
| calls/wk pre → post | 678k → 685k (flat) | 1.87M → 2.31M (+23%) |
| mean per call pre → post | 597 ms → 15 ms (40× faster) | 117 ms → 35 ms |
Same call volume, ~40× cheaper per call → the win is the optimization, not reduced load. The current #1 consumer is /jobs, the target of the pending reference-data caching and read-replica work (§4, §5).
This conclusion is supported by three independent signals corroborating one another — the CPU curve, the git feature timeline, and per-query execution metrics — though exact deploy timestamps and the data-growth mechanism are inferred, not directly confirmed (see Caveats).
Daily-mean CPU utilization of the RW instance (cloudsql.googleapis.com/database/cpu/utilization, ALIGN_MEAN, daily), by month (full raw series in Appendix B.1):
| Month | Mean CPU | Note |
|---|---|---|
| 2025-06 | 0.147 | baseline |
| 2025-07 | 0.168 | LPID introduced (#8696, Jul 9, Joel Maher) |
| 2025-08 | 0.133 | |
| 2025-09 | 0.121 | LPID expanded (#8881/#8905/#8923/#8940/#8968) |
| 2025-10 | 0.204 | step up (LPID rollout takes effect) |
| 2025-11 | 0.194 | |
| 2025-12 | 0.234 | |
| 2026-01 | 0.222 | |
| 2026-02 | 0.274 | steady ramp |
| 2026-03 | 0.335 | |
| 2026-04 | 0.399 | |
| 2026-05 | 0.418 | peak (weekday peaks 0.50–0.56) |
| 2026-06 | 0.231 | drop after #9553 (15 days) |
Two inflection points:
- October 2025 — step up (~0.13 → ~0.20), following the LPID feature rollout/expansion of Jul–Sep 2025.
-
~May 24, 2026 — sharp drop (weekday peaks ~0.50 → ~0.30). The daily series shows a clean regime change between May 23 (0.505) and May 24 (0.226), holding through June (raw daily series in Appendix B.2). Aligns with the #9553 optimization (local commit
454c660b4, May 25; mergedf14ec66aa#9553, May 27).
Daily peak CPU has been saturated (= 1.0) intermittently since ≥ December 2025, so peak CPU is not discriminating; mean (sustained) load is the signal.
check_and_mark_intermittent (with its _check_and_mark_infra companion) runs once per parsed job log to decide whether a failure is intermittent. For non-try repos it builds a list of pushes in the last 36 hours for the repository, joins group → job_log → job → group_status (plus job_type, failure_classification) across that window, and aggregates pass/fail rates per group.
Why it produced the load curve:
- Feature rollout (the Oct step). Introduced Jul 2025 (#8696); expanded through Aug–Sep 2025 to cover infra failures, tasks with no groups, multi-push history, and "classify all tasks in a single update" (#8881, #8905, #8923, #8940, #8968). All authored by Joel Maher. More jobs ran the query and each did more work.
-
Data growth (the Oct→May ramp). The 36-hour scan covers a fixed time range but an ever-growing row count as
group/job_log/group_statusaccumulate. The query's own weekly total nearly doubled within May (87 h in May 1–8 → 112–151 h in May 17–24), i.e. cost-per-call was rising, not just call count. -
The fix (the May drop). #9553 (Cameron Dawson) replaced the join through
push.repository_idwith a direct filter onjob.repository_id, letting Postgres use the(repository, job_type, push)index onjoband drop the push join. Per-call mean fell 597 ms → 15 ms; weekly total 112 h → 2.8 h; sustained CPU ~0.42 → ~0.23.
The timing of all three phases (rollout → ramp → fix) lines up with both the CPU curve and the per-query execution-time numbers.
A reasonable alternative is that the late-May/June drop reflects fewer commits/test runs during summer (less ingestion → fewer query invocations). The per-query call counts rule this out for the dominant queries (Query Insights perquery/latencies distribution count; Appendix B.4):
- LPID query: 678k calls/wk (pre) → 685k calls/wk (post) — flat/slightly up. It is invoked just as often; each call is 40× cheaper. The drop is entirely the optimization.
-
/jobsquery: 1.87M → 2.31M calls/wk — up 23%. Polling traffic did not fall in this window.
So in the windows measured, query volume was flat-to-rising; the CPU decrease is the query work getting cheaper, not the workload shrinking. (A genuine vacation lull, if it comes later in summer, would show up as falling call counts — worth re-checking in July/August, but it is not what drove the May/June change.)
Secondary effect worth noting: the /jobs query also got cheaper per call (117 ms → 35 ms) despite more calls. This is most consistent with contention relief — once LPID stopped consuming ~112 h/wk of CPU, the whole instance had more headroom (less CPU/IO competition), so other queries sped up too.
Total execution time per week (Query Insights; Appendix B.3):
| Rank | h/wk | Query | Related work | Status |
|---|---|---|---|---|
| 1 | 32 | /jobs list |
camd/jobs-refdata-cache (ref-data caching) + camd/read-replica-queries (polling reads) |
not merged |
| 2 | 20 |
bugscache search |
#9554 + #9555 (already merged, see §5) | merged ~May 27 |
| 3–4 | 19 + 18 |
performance_datum SELECT (Perfherder) |
camd/read-replica-queries; camd/query-optimization-performance-datum
|
not merged |
| 5 | 16 |
performance_datum UPDATE (job_id backfill) |
write — not replica-eligible | — |
| 6 | 2.8 | LPID intermittent query | #9553 (fixed) | merged May 27 |
| 7 | 6 |
group_status DELETE (data cycling) |
— | — |
Note on merged-vs-deployed: a PR merged to
mastermay not yet be on production. Where the expected effect is visible in the metrics, it has almost certainly been deployed.
Caused the increase (LPID rollout) — all merged & deployed, author Joel Maher:
| PR | Date | Title |
|---|---|---|
| #8696 | 2025-07-09 | Add post log parser to look for repeated test runs and annotate as intermittent |
| #8881 | 2025-08-07 | support infra issues and tasks with no groups in log parser intermittent detector |
| #8905 | 2025-08-14 | LPID checks — ensure 'infra' query for extra_jobs has a completed task and parsed job log |
| #8923 | 2025-08-25 | support reclassifying jobs to failures when new additional failures come in |
| #8940 | 2025-08-28 | Tweak queries for LPID for fine tuning |
| #8968 | 2025-09-15 | Fix LPID — proper list of pushes, classify all tasks in single .update where possible |
Reduced the load — merged & deployed (effect visible in metrics), author Cameron Dawson:
| PR | Date | Title | Evidence of deploy |
|---|---|---|---|
| #9553 | 2026-05-27 | Optimize intermittent-failure classification query | LPID mean 597 ms → 15 ms |
| #9554 | 2026-05-27 | Reduce over-fetching in Bugscache queries | bugscache total ~27 h → ~20 h |
| #9555 | 2026-05-27 | Add GIN trigram index on bugscache.summary (migration 0052) |
(paired with #9554) |
Pending — not yet merged to master (author Cameron Dawson):
| Branch | Targets | Notes (from team) |
|---|---|---|
camd/jobs-refdata-cache |
/jobs query — caches job_type/job_group/machine_platform to drop joins |
May be rejected. (It caches those three small reference tables, not reference_data_signatures, and handles newly-ingested ids via refresh-on-miss — but flagged because reference data changes often.) |
camd/read-replica-queries |
Routes Perfherder reads (and could route /jobs polling) to the replica |
Likely to be merged soon to test |
camd/query-optimization-performance-datum |
performance_datum SELECTs (Perfherder, #3–4 above) |
Generally approved; may merge soon |
camd/bugscache-summary-gin-trgm-index |
bugscache search | The underlying improvement appears already landed as #9554/#9555 |
- Exact deploy timestamps not independently confirmed. The CPU drop begins May 24; #9553's commit is dated May 25 and merged May 27. The 1–3 day spread is within deploy + weekend-traffic noise. Effects are clearly visible in the metrics, so the changes are deployed, but the precise production release times were not pulled from the deploy log.
- The October step is noisy. The Sep→Oct shift is real in the monthly means but has high day-to-day variance; it reflects the center of mass moving up, not a single clean step like May.
-
Data-growth mechanism is inferred from the rising per-call cost (87 h → 112–151 h within May), not from historical row counts of
group/job_log/group_status(not pulled). Confirming would mean trending those table sizes over the period. - Query Insights retention (~6 weeks) limits the per-query before/after to May–June windows; the 6-month "which query" story relies on (CPU curve) + (git feature timeline) + (recent per-query dominance) corroborating one another.
- Seasonality not yet excluded for later summer. Call counts were flat-to-up in the windows measured (May/June); a July/August lull would appear as falling call counts and should be re-checked then if comparing raw CPU across the summer.
-
Instance overview / system insights (CPU, memory, connections):
https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1-postgres/system-insights?project=moz-fx-treeherder-prod-c739 -
Query Insights (top queries by load, per-query trends):
https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1-postgres/insights?project=moz-fx-treeherder-prod-c739 -
Metrics Explorer (build the CPU chart):
https://console.cloud.google.com/monitoring/metrics-explorer?project=moz-fx-treeherder-prod-c739Then select metricCloud SQL Database → CPU utilization, filterdatabase_id = moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres, alignermean, period1 day.
(Cloud Monitoring deep-links encode the full query in an opaque pageState blob and are not reliably hand-constructable; the API calls in A.2 are the reproducible source of every number in this doc.)
Auth (interactive): gcloud auth login cdawson@mozilla.com, then TOKEN=$(gcloud auth print-access-token).
Two metric families use different resource types:
- Instance metrics → resource type
cloudsql_database, labeldatabase_id. - Query Insights per-query metrics → resource type
cloudsql_instance_database, labelresource_id.
For this instance both labels equal moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres.
TOKEN=$(gcloud auth print-access-token)
PROJ=moz-fx-treeherder-prod-c739
ID="moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres"
# (1) Sustained CPU — daily mean
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
-H "Authorization: Bearer $TOKEN" \
--data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/cpu/utilization" AND resource.labels.database_id="'"$ID"'"' \
--data-urlencode 'interval.startTime=2025-06-01T00:00:00Z' \
--data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
--data-urlencode 'aggregation.alignmentPeriod=86400s' \
--data-urlencode 'aggregation.perSeriesAligner=ALIGN_MEAN'
# Use ALIGN_MAX for daily peaks.
# (2) Top queries by total execution time (per week), grouped by query text
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
-H "Authorization: Bearer $TOKEN" \
--data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/postgresql/insights/perquery/execution_time" AND resource.labels.resource_id="'"$ID"'"' \
--data-urlencode 'interval.startTime=2026-06-08T00:00:00Z' \
--data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
--data-urlencode 'aggregation.alignmentPeriod=604800s' \
--data-urlencode 'aggregation.perSeriesAligner=ALIGN_DELTA' \
--data-urlencode 'aggregation.crossSeriesReducer=REDUCE_SUM' \
--data-urlencode 'aggregation.groupByFields=metric.label.querystring'
# (3) Per-query CALL COUNT + mean latency — metric perquery/latencies is a DISTRIBUTION;
# sum distributionValue.count over the window = number of executions,
# distributionValue.mean = mean latency (µs). ALIGN_DELTA, group by querystring.
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
-H "Authorization: Bearer $TOKEN" \
--data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/postgresql/insights/perquery/latencies" AND resource.labels.resource_id="'"$ID"'"' \
--data-urlencode 'interval.startTime=2026-06-08T00:00:00Z' \
--data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
--data-urlencode 'aggregation.alignmentPeriod=604800s' \
--data-urlencode 'aggregation.perSeriesAligner=ALIGN_DELTA'execution_time values are cumulative microseconds (divide by 1e6 for seconds). The querystring label holds the normalized SQL; the LPID query starts with SELECT "group"."name" and the /jobs query with SELECT "job"."submit_time".
2025-06 0.147 2025-09 0.121 2025-12 0.234 2026-03 0.335 2026-06 0.231 (15d)
2025-07 0.168 2025-10 0.204 2026-01 0.222 2026-04 0.399
2025-08 0.133 2025-11 0.194 2026-02 0.274 2026-05 0.418
2026-05-21 0.561 05-26 0.271 05-31 0.194 06-05 0.225 06-10 0.257
2026-05-22 0.549 05-27 0.322 06-01 0.189 06-06 0.215 06-11 0.322
2026-05-23 0.505 05-28 0.411 06-02 0.247 06-07 0.159 06-12 0.287
2026-05-24 0.226 05-29 0.309 06-03 0.265 06-08 0.144 06-13 0.279
2026-05-25 0.195 05-30 0.303 06-04 0.280 06-09 0.216 06-14 0.180 / 06-15 0.198
Clean regime change between 05-23 (0.505) and 05-24 (0.226).
May 1–8 May 17–24 Jun 8–15
LPID intermittent (#1→#6) 87 112/151* 2.8 ← #9553
/jobs list 36 61/79* 32 (now #1)
bugscache search 24 27 20 ← #9554/#9555
performance_datum SELECT (x2) 27+15 21+16 19+18
performance_datum UPDATE — 19 16
group_status DELETE — — 6
*Two figures where execution_time (REDUCE_SUM) and latencies (count×mean) windows differ slightly; both are reported for transparency.
PRE-fix (May 17–24) POST-fix (Jun 8–15)
LPID query calls=677,881 mean=597.1ms calls=685,256 mean= 14.9ms (total 112.4h → 2.8h)
/jobs query calls=1,874,587 mean=116.7ms calls=2,309,509 mean= 35.1ms (total 60.8h → 22.5h)
LPID call volume is flat (+1%) while per-call cost fell 40× → the drop is the optimization, not reduced traffic.