Treeherder RW Postgres — Load Increase Investigation

Instance: treeherder-prod-prod-v1-postgres (Cloud SQL, Postgres 15, db-custom-8-53248 = 8 vCPU / 52 GB), project moz-fx-treeherder-prod-c739, region us-west1. A read replica (treeherder-prod-prod-v1-postgres-replica-0) exists.

Investigated: June 2026, read-only, via the Cloud Monitoring API and Cloud SQL Query Insights per-query metrics. All raw numbers and the exact API calls used are in the appendices so this can be independently re-created.

Executive summary

Sustained CPU on the RW instance roughly tripled between summer 2025 and May 2026 (monthly-mean CPU ~0.13 → ~0.42; daily peaks pegged at 100% for months), then dropped back to ~0.23 in June 2026.

Root cause of the increase: one query — the Log Parser Intermittent Detector (LPID) classification query in treeherder/log_parser/intermittents.py (check_and_mark_intermittent), added and expanded across a series of 2025 PRs (Joel Maher). It grew steadily more expensive as the tables it scans accumulated rows.

Cause of the June decrease: that query was optimized in #9553 (Cameron Dawson, merged 2026-05-27) — confirmed not a summer-traffic artifact, since call volume was flat-to-rising over the same window (§3).

Quantified at the query level (Query Insights, RW instance):

	LPID query	`/jobs` list query
total time, May 17–24 (pre-fix)	112 h/wk — #1	61 h — #2
total time, Jun 8–15 (post-fix)	2.8 h/wk — #6	22 h — #1
calls/wk pre → post	678k → 685k (flat)	1.87M → 2.31M (+23%)
mean per call pre → post	597 ms → 15 ms (40× faster)	117 ms → 35 ms

Same call volume, ~40× cheaper per call → the win is the optimization, not reduced load. The current #1 consumer is /jobs, the target of the pending reference-data caching and read-replica work (§4, §5).

This conclusion is supported by three independent signals corroborating one another — the CPU curve, the git feature timeline, and per-query execution metrics — though exact deploy timestamps and the data-growth mechanism are inferred, not directly confirmed (see Caveats).

1. When and where — the load timeline

Daily-mean CPU utilization of the RW instance (cloudsql.googleapis.com/database/cpu/utilization, ALIGN_MEAN, daily), by month (full raw series in Appendix B.1):

Month	Mean CPU	Note
2025-06	0.147	baseline
2025-07	0.168	LPID introduced (#8696, Jul 9, Joel Maher)
2025-08	0.133
2025-09	0.121	LPID expanded (#8881/#8905/#8923/#8940/#8968)
2025-10	0.204	step up (LPID rollout takes effect)
2025-11	0.194
2025-12	0.234
2026-01	0.222
2026-02	0.274	steady ramp
2026-03	0.335
2026-04	0.399
2026-05	0.418	peak (weekday peaks 0.50–0.56)
2026-06	0.231	drop after #9553 (15 days)

Two inflection points:

October 2025 — step up (~0.13 → ~0.20), following the LPID feature rollout/expansion of Jul–Sep 2025.
~May 24, 2026 — sharp drop (weekday peaks ~0.50 → ~0.30). The daily series shows a clean regime change between May 23 (0.505) and May 24 (0.226), holding through June (raw daily series in Appendix B.2). Aligns with the #9553 optimization (local commit 454c660b4, May 25; merged f14ec66aa #9553, May 27).

Daily peak CPU has been saturated (= 1.0) intermittently since ≥ December 2025, so peak CPU is not discriminating; mean (sustained) load is the signal.

2. Root cause — the LPID intermittent-classification query

check_and_mark_intermittent (with its _check_and_mark_infra companion) runs once per parsed job log to decide whether a failure is intermittent. For non-try repos it builds a list of pushes in the last 36 hours for the repository, joins group → job_log → job → group_status (plus job_type, failure_classification) across that window, and aggregates pass/fail rates per group.

Why it produced the load curve:

Feature rollout (the Oct step). Introduced Jul 2025 (#8696); expanded through Aug–Sep 2025 to cover infra failures, tasks with no groups, multi-push history, and "classify all tasks in a single update" (#8881, #8905, #8923, #8940, #8968). All authored by Joel Maher. More jobs ran the query and each did more work.
Data growth (the Oct→May ramp). The 36-hour scan covers a fixed time range but an ever-growing row count as group/job_log/group_status accumulate. The query's own weekly total nearly doubled within May (87 h in May 1–8 → 112–151 h in May 17–24), i.e. cost-per-call was rising, not just call count.
The fix (the May drop). #9553 (Cameron Dawson) replaced the join through push.repository_id with a direct filter on job.repository_id, letting Postgres use the (repository, job_type, push) index on job and drop the push join. Per-call mean fell 597 ms → 15 ms; weekly total 112 h → 2.8 h; sustained CPU ~0.42 → ~0.23.

The timing of all three phases (rollout → ramp → fix) lines up with both the CPU curve and the per-query execution-time numbers.

3. Ruling out the summer-vacation hypothesis

A reasonable alternative is that the late-May/June drop reflects fewer commits/test runs during summer (less ingestion → fewer query invocations). The per-query call counts rule this out for the dominant queries (Query Insights perquery/latencies distribution count; Appendix B.4):

LPID query: 678k calls/wk (pre) → 685k calls/wk (post) — flat/slightly up. It is invoked just as often; each call is 40× cheaper. The drop is entirely the optimization.
/jobs query: 1.87M → 2.31M calls/wk — up 23%. Polling traffic did not fall in this window.

So in the windows measured, query volume was flat-to-rising; the CPU decrease is the query work getting cheaper, not the workload shrinking. (A genuine vacation lull, if it comes later in summer, would show up as falling call counts — worth re-checking in July/August, but it is not what drove the May/June change.)

Secondary effect worth noting: the /jobs query also got cheaper per call (117 ms → 35 ms) despite more calls. This is most consistent with contention relief — once LPID stopped consuming ~112 h/wk of CPU, the whole instance had more headroom (less CPU/IO competition), so other queries sped up too.

4. Current top consumers (post-fix, Jun 8–15) and how our work maps to them

Total execution time per week (Query Insights; Appendix B.3):

Rank	h/wk	Query	Related work	Status
1	32	`/jobs` list	`camd/jobs-refdata-cache` (ref-data caching) + `camd/read-replica-queries` (polling reads)	not merged
2	20	`bugscache` search	#9554 + #9555 (already merged, see §5)	merged ~May 27
3–4	19 + 18	`performance_datum` SELECT (Perfherder)	`camd/read-replica-queries`; `camd/query-optimization-performance-datum`	not merged
5	16	`performance_datum` UPDATE (job_id backfill)	write — not replica-eligible	—
6	2.8	LPID intermittent query	#9553 (fixed)	merged May 27
7	6	`group_status` DELETE (data cycling)	—	—

5. Related changes (merged vs pending)

Note on merged-vs-deployed: a PR merged to master may not yet be on production. Where the expected effect is visible in the metrics, it has almost certainly been deployed.

Caused the increase (LPID rollout) — all merged & deployed, author Joel Maher:

PR	Date	Title
#8696	2025-07-09	Add post log parser to look for repeated test runs and annotate as intermittent
#8881	2025-08-07	support infra issues and tasks with no groups in log parser intermittent detector
#8905	2025-08-14	LPID checks — ensure 'infra' query for extra_jobs has a completed task and parsed job log
#8923	2025-08-25	support reclassifying jobs to failures when new additional failures come in
#8940	2025-08-28	Tweak queries for LPID for fine tuning
#8968	2025-09-15	Fix LPID — proper list of pushes, classify all tasks in single .update where possible

Reduced the load — merged & deployed (effect visible in metrics), author Cameron Dawson:

PR	Date	Title	Evidence of deploy
#9553	2026-05-27	Optimize intermittent-failure classification query	LPID mean 597 ms → 15 ms
#9554	2026-05-27	Reduce over-fetching in Bugscache queries	bugscache total ~27 h → ~20 h
#9555	2026-05-27	Add GIN trigram index on `bugscache.summary` (migration `0052`)	(paired with #9554)

Pending — not yet merged to master (author Cameron Dawson):

Branch	Targets	Notes (from team)
`camd/jobs-refdata-cache`	`/jobs` query — caches `job_type`/`job_group`/`machine_platform` to drop joins	May be rejected. (It caches those three small reference tables, not `reference_data_signatures`, and handles newly-ingested ids via refresh-on-miss — but flagged because reference data changes often.)
`camd/read-replica-queries`	Routes Perfherder reads (and could route `/jobs` polling) to the replica	Likely to be merged soon to test
`camd/query-optimization-performance-datum`	`performance_datum` SELECTs (Perfherder, #3–4 above)	Generally approved; may merge soon
`camd/bugscache-summary-gin-trgm-index`	bugscache search	The underlying improvement appears already landed as #9554/#9555

6. Caveats / open questions

Exact deploy timestamps not independently confirmed. The CPU drop begins May 24; #9553's commit is dated May 25 and merged May 27. The 1–3 day spread is within deploy + weekend-traffic noise. Effects are clearly visible in the metrics, so the changes are deployed, but the precise production release times were not pulled from the deploy log.
The October step is noisy. The Sep→Oct shift is real in the monthly means but has high day-to-day variance; it reflects the center of mass moving up, not a single clean step like May.
Data-growth mechanism is inferred from the rising per-call cost (87 h → 112–151 h within May), not from historical row counts of group/job_log/group_status (not pulled). Confirming would mean trending those table sizes over the period.
Query Insights retention (~6 weeks) limits the per-query before/after to May–June windows; the 6-month "which query" story relies on (CPU curve) + (git feature timeline) + (recent per-query dominance) corroborating one another.
Seasonality not yet excluded for later summer. Call counts were flat-to-up in the windows measured (May/June); a July/August lull would appear as falling call counts and should be re-checked then if comparing raw CPU across the summer.

Appendix A — Sources and how to reproduce

A.1 Shareable Cloud Console links (require project access)

Instance overview / system insights (CPU, memory, connections): https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1-postgres/system-insights?project=moz-fx-treeherder-prod-c739
Query Insights (top queries by load, per-query trends): https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1-postgres/insights?project=moz-fx-treeherder-prod-c739
Metrics Explorer (build the CPU chart): https://console.cloud.google.com/monitoring/metrics-explorer?project=moz-fx-treeherder-prod-c739 Then select metric Cloud SQL Database → CPU utilization, filter database_id = moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres, aligner mean, period 1 day.

(Cloud Monitoring deep-links encode the full query in an opaque pageState blob and are not reliably hand-constructable; the API calls in A.2 are the reproducible source of every number in this doc.)

A.2 Exact Monitoring API calls

Auth (interactive): gcloud auth login cdawson@mozilla.com, then TOKEN=$(gcloud auth print-access-token).

Two metric families use different resource types:

Instance metrics → resource type cloudsql_database, label database_id.
Query Insights per-query metrics → resource type cloudsql_instance_database, label resource_id.

For this instance both labels equal moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres.

TOKEN=$(gcloud auth print-access-token)
PROJ=moz-fx-treeherder-prod-c739
ID="moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres"

# (1) Sustained CPU — daily mean
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/cpu/utilization" AND resource.labels.database_id="'"$ID"'"' \
  --data-urlencode 'interval.startTime=2025-06-01T00:00:00Z' \
  --data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
  --data-urlencode 'aggregation.alignmentPeriod=86400s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_MEAN'
# Use ALIGN_MAX for daily peaks.

# (2) Top queries by total execution time (per week), grouped by query text
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/postgresql/insights/perquery/execution_time" AND resource.labels.resource_id="'"$ID"'"' \
  --data-urlencode 'interval.startTime=2026-06-08T00:00:00Z' \
  --data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
  --data-urlencode 'aggregation.alignmentPeriod=604800s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_DELTA' \
  --data-urlencode 'aggregation.crossSeriesReducer=REDUCE_SUM' \
  --data-urlencode 'aggregation.groupByFields=metric.label.querystring'

# (3) Per-query CALL COUNT + mean latency — metric perquery/latencies is a DISTRIBUTION;
#     sum distributionValue.count over the window = number of executions,
#     distributionValue.mean = mean latency (µs). ALIGN_DELTA, group by querystring.
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/postgresql/insights/perquery/latencies" AND resource.labels.resource_id="'"$ID"'"' \
  --data-urlencode 'interval.startTime=2026-06-08T00:00:00Z' \
  --data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
  --data-urlencode 'aggregation.alignmentPeriod=604800s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_DELTA'

execution_time values are cumulative microseconds (divide by 1e6 for seconds). The querystring label holds the normalized SQL; the LPID query starts with SELECT "group"."name" and the /jobs query with SELECT "job"."submit_time".

Appendix B — Raw data

B.1 Monthly mean CPU (daily ALIGN_MEAN, year)

2025-06 0.147   2025-09 0.121   2025-12 0.234   2026-03 0.335   2026-06 0.231 (15d)
2025-07 0.168   2025-10 0.204   2026-01 0.222   2026-04 0.399
2025-08 0.133   2025-11 0.194   2026-02 0.274   2026-05 0.418

B.2 Daily mean CPU around the May/June drop

2026-05-21 0.561   05-26 0.271   05-31 0.194   06-05 0.225   06-10 0.257
2026-05-22 0.549   05-27 0.322   06-01 0.189   06-06 0.215   06-11 0.322
2026-05-23 0.505   05-28 0.411   06-02 0.247   06-07 0.159   06-12 0.287
2026-05-24 0.226   05-29 0.309   06-03 0.265   06-08 0.144   06-13 0.279
2026-05-25 0.195   05-30 0.303   06-04 0.280   06-09 0.216   06-14 0.180 / 06-15 0.198

Clean regime change between 05-23 (0.505) and 05-24 (0.226).

B.3 Top queries by total execution time (h/wk)

                                 May 1–8   May 17–24   Jun 8–15
LPID intermittent   (#1→#6)        87        112/151*    2.8     ← #9553
/jobs list                         36         61/79*     32      (now #1)
bugscache search                   24         27         20      ← #9554/#9555
performance_datum SELECT (x2)    27+15      21+16      19+18
performance_datum UPDATE            —         19         16
group_status DELETE                 —          —          6

*Two figures where execution_time (REDUCE_SUM) and latencies (count×mean) windows differ slightly; both are reported for transparency.

B.4 Call counts and per-call cost (perquery/latencies distribution)

                 PRE-fix  (May 17–24)            POST-fix (Jun 8–15)
LPID query   calls=677,881  mean=597.1ms     calls=685,256  mean= 14.9ms   (total 112.4h → 2.8h)
/jobs query  calls=1,874,587 mean=116.7ms    calls=2,309,509 mean= 35.1ms  (total  60.8h → 22.5h)

LPID call volume is flat (+1%) while per-call cost fell 40× → the drop is the optimization, not reduced traffic.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!