Skip to content

Treeherder RW Postgres — Load Increase Investigation

Cameron Dawson edited this page Jun 27, 2026 · 1 revision

Instance: treeherder-prod-prod-v1-postgres (Cloud SQL, Postgres 15, db-custom-8-53248 = 8 vCPU / 52 GB), project moz-fx-treeherder-prod-c739, region us-west1. A read replica (treeherder-prod-prod-v1-postgres-replica-0) exists.

Investigated: June 2026, read-only, via the Cloud Monitoring API and Cloud SQL Query Insights per-query metrics. All raw numbers and the exact API calls used are in the appendices so this can be independently re-created.


Executive summary

Sustained CPU on the RW instance roughly tripled between summer 2025 and May 2026 (monthly-mean CPU ~0.13 → ~0.42; daily peaks pegged at 100% for months), then dropped back to ~0.23 in June 2026.

Root cause of the increase: one query — the Log Parser Intermittent Detector (LPID) classification query in treeherder/log_parser/intermittents.py (check_and_mark_intermittent), added and expanded across a series of 2025 PRs (Joel Maher). It grew steadily more expensive as the tables it scans accumulated rows.

Cause of the June decrease: that query was optimized in #9553 (Cameron Dawson, merged 2026-05-27) — confirmed not a summer-traffic artifact, since call volume was flat-to-rising over the same window (§3).

Quantified at the query level (Query Insights, RW instance):

LPID query /jobs list query
total time, May 17–24 (pre-fix) 112 h/wk — #1 61 h — #2
total time, Jun 8–15 (post-fix) 2.8 h/wk — #6 22 h — #1
calls/wk pre → post 678k → 685k (flat) 1.87M → 2.31M (+23%)
mean per call pre → post 597 ms → 15 ms (40× faster) 117 ms → 35 ms

Same call volume, ~40× cheaper per call → the win is the optimization, not reduced load. The current #1 consumer is /jobs, the target of the pending reference-data caching and read-replica work (§4, §5).

This conclusion is supported by three independent signals corroborating one another — the CPU curve, the git feature timeline, and per-query execution metrics — though exact deploy timestamps and the data-growth mechanism are inferred, not directly confirmed (see Caveats).


1. When and where — the load timeline

Daily-mean CPU utilization of the RW instance (cloudsql.googleapis.com/database/cpu/utilization, ALIGN_MEAN, daily), by month (full raw series in Appendix B.1):

Month Mean CPU Note
2025-06 0.147 baseline
2025-07 0.168 LPID introduced (#8696, Jul 9, Joel Maher)
2025-08 0.133
2025-09 0.121 LPID expanded (#8881/#8905/#8923/#8940/#8968)
2025-10 0.204 step up (LPID rollout takes effect)
2025-11 0.194
2025-12 0.234
2026-01 0.222
2026-02 0.274 steady ramp
2026-03 0.335
2026-04 0.399
2026-05 0.418 peak (weekday peaks 0.50–0.56)
2026-06 0.231 drop after #9553 (15 days)

Two inflection points:

  1. October 2025 — step up (~0.13 → ~0.20), following the LPID feature rollout/expansion of Jul–Sep 2025.
  2. ~May 24, 2026 — sharp drop (weekday peaks ~0.50 → ~0.30). The daily series shows a clean regime change between May 23 (0.505) and May 24 (0.226), holding through June (raw daily series in Appendix B.2). Aligns with the #9553 optimization (local commit 454c660b4, May 25; merged f14ec66aa #9553, May 27).

Daily peak CPU has been saturated (= 1.0) intermittently since ≥ December 2025, so peak CPU is not discriminating; mean (sustained) load is the signal.


2. Root cause — the LPID intermittent-classification query

check_and_mark_intermittent (with its _check_and_mark_infra companion) runs once per parsed job log to decide whether a failure is intermittent. For non-try repos it builds a list of pushes in the last 36 hours for the repository, joins group → job_log → job → group_status (plus job_type, failure_classification) across that window, and aggregates pass/fail rates per group.

Why it produced the load curve:

  • Feature rollout (the Oct step). Introduced Jul 2025 (#8696); expanded through Aug–Sep 2025 to cover infra failures, tasks with no groups, multi-push history, and "classify all tasks in a single update" (#8881, #8905, #8923, #8940, #8968). All authored by Joel Maher. More jobs ran the query and each did more work.
  • Data growth (the Oct→May ramp). The 36-hour scan covers a fixed time range but an ever-growing row count as group/job_log/group_status accumulate. The query's own weekly total nearly doubled within May (87 h in May 1–8 → 112–151 h in May 17–24), i.e. cost-per-call was rising, not just call count.
  • The fix (the May drop). #9553 (Cameron Dawson) replaced the join through push.repository_id with a direct filter on job.repository_id, letting Postgres use the (repository, job_type, push) index on job and drop the push join. Per-call mean fell 597 ms → 15 ms; weekly total 112 h → 2.8 h; sustained CPU ~0.42 → ~0.23.

The timing of all three phases (rollout → ramp → fix) lines up with both the CPU curve and the per-query execution-time numbers.


3. Ruling out the summer-vacation hypothesis

A reasonable alternative is that the late-May/June drop reflects fewer commits/test runs during summer (less ingestion → fewer query invocations). The per-query call counts rule this out for the dominant queries (Query Insights perquery/latencies distribution count; Appendix B.4):

  • LPID query: 678k calls/wk (pre) → 685k calls/wk (post) — flat/slightly up. It is invoked just as often; each call is 40× cheaper. The drop is entirely the optimization.
  • /jobs query: 1.87M → 2.31M calls/wk — up 23%. Polling traffic did not fall in this window.

So in the windows measured, query volume was flat-to-rising; the CPU decrease is the query work getting cheaper, not the workload shrinking. (A genuine vacation lull, if it comes later in summer, would show up as falling call counts — worth re-checking in July/August, but it is not what drove the May/June change.)

Secondary effect worth noting: the /jobs query also got cheaper per call (117 ms → 35 ms) despite more calls. This is most consistent with contention relief — once LPID stopped consuming ~112 h/wk of CPU, the whole instance had more headroom (less CPU/IO competition), so other queries sped up too.


4. Current top consumers (post-fix, Jun 8–15) and how our work maps to them

Total execution time per week (Query Insights; Appendix B.3):

Rank h/wk Query Related work Status
1 32 /jobs list camd/jobs-refdata-cache (ref-data caching) + camd/read-replica-queries (polling reads) not merged
2 20 bugscache search #9554 + #9555 (already merged, see §5) merged ~May 27
3–4 19 + 18 performance_datum SELECT (Perfherder) camd/read-replica-queries; camd/query-optimization-performance-datum not merged
5 16 performance_datum UPDATE (job_id backfill) write — not replica-eligible
6 2.8 LPID intermittent query #9553 (fixed) merged May 27
7 6 group_status DELETE (data cycling)

5. Related changes (merged vs pending)

Note on merged-vs-deployed: a PR merged to master may not yet be on production. Where the expected effect is visible in the metrics, it has almost certainly been deployed.

Caused the increase (LPID rollout) — all merged & deployed, author Joel Maher:

PR Date Title
#8696 2025-07-09 Add post log parser to look for repeated test runs and annotate as intermittent
#8881 2025-08-07 support infra issues and tasks with no groups in log parser intermittent detector
#8905 2025-08-14 LPID checks — ensure 'infra' query for extra_jobs has a completed task and parsed job log
#8923 2025-08-25 support reclassifying jobs to failures when new additional failures come in
#8940 2025-08-28 Tweak queries for LPID for fine tuning
#8968 2025-09-15 Fix LPID — proper list of pushes, classify all tasks in single .update where possible

Reduced the load — merged & deployed (effect visible in metrics), author Cameron Dawson:

PR Date Title Evidence of deploy
#9553 2026-05-27 Optimize intermittent-failure classification query LPID mean 597 ms → 15 ms
#9554 2026-05-27 Reduce over-fetching in Bugscache queries bugscache total ~27 h → ~20 h
#9555 2026-05-27 Add GIN trigram index on bugscache.summary (migration 0052) (paired with #9554)

Pending — not yet merged to master (author Cameron Dawson):

Branch Targets Notes (from team)
camd/jobs-refdata-cache /jobs query — caches job_type/job_group/machine_platform to drop joins May be rejected. (It caches those three small reference tables, not reference_data_signatures, and handles newly-ingested ids via refresh-on-miss — but flagged because reference data changes often.)
camd/read-replica-queries Routes Perfherder reads (and could route /jobs polling) to the replica Likely to be merged soon to test
camd/query-optimization-performance-datum performance_datum SELECTs (Perfherder, #3–4 above) Generally approved; may merge soon
camd/bugscache-summary-gin-trgm-index bugscache search The underlying improvement appears already landed as #9554/#9555

6. Caveats / open questions

  • Exact deploy timestamps not independently confirmed. The CPU drop begins May 24; #9553's commit is dated May 25 and merged May 27. The 1–3 day spread is within deploy + weekend-traffic noise. Effects are clearly visible in the metrics, so the changes are deployed, but the precise production release times were not pulled from the deploy log.
  • The October step is noisy. The Sep→Oct shift is real in the monthly means but has high day-to-day variance; it reflects the center of mass moving up, not a single clean step like May.
  • Data-growth mechanism is inferred from the rising per-call cost (87 h → 112–151 h within May), not from historical row counts of group/job_log/group_status (not pulled). Confirming would mean trending those table sizes over the period.
  • Query Insights retention (~6 weeks) limits the per-query before/after to May–June windows; the 6-month "which query" story relies on (CPU curve) + (git feature timeline) + (recent per-query dominance) corroborating one another.
  • Seasonality not yet excluded for later summer. Call counts were flat-to-up in the windows measured (May/June); a July/August lull would appear as falling call counts and should be re-checked then if comparing raw CPU across the summer.

Appendix A — Sources and how to reproduce

A.1 Shareable Cloud Console links (require project access)

  • Instance overview / system insights (CPU, memory, connections): https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1-postgres/system-insights?project=moz-fx-treeherder-prod-c739
  • Query Insights (top queries by load, per-query trends): https://console.cloud.google.com/sql/instances/treeherder-prod-prod-v1-postgres/insights?project=moz-fx-treeherder-prod-c739
  • Metrics Explorer (build the CPU chart): https://console.cloud.google.com/monitoring/metrics-explorer?project=moz-fx-treeherder-prod-c739 Then select metric Cloud SQL Database → CPU utilization, filter database_id = moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres, aligner mean, period 1 day.

(Cloud Monitoring deep-links encode the full query in an opaque pageState blob and are not reliably hand-constructable; the API calls in A.2 are the reproducible source of every number in this doc.)

A.2 Exact Monitoring API calls

Auth (interactive): gcloud auth login cdawson@mozilla.com, then TOKEN=$(gcloud auth print-access-token).

Two metric families use different resource types:

  • Instance metrics → resource type cloudsql_database, label database_id.
  • Query Insights per-query metrics → resource type cloudsql_instance_database, label resource_id.

For this instance both labels equal moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres.

TOKEN=$(gcloud auth print-access-token)
PROJ=moz-fx-treeherder-prod-c739
ID="moz-fx-treeherder-prod-c739:treeherder-prod-prod-v1-postgres"

# (1) Sustained CPU — daily mean
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/cpu/utilization" AND resource.labels.database_id="'"$ID"'"' \
  --data-urlencode 'interval.startTime=2025-06-01T00:00:00Z' \
  --data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
  --data-urlencode 'aggregation.alignmentPeriod=86400s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_MEAN'
# Use ALIGN_MAX for daily peaks.

# (2) Top queries by total execution time (per week), grouped by query text
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/postgresql/insights/perquery/execution_time" AND resource.labels.resource_id="'"$ID"'"' \
  --data-urlencode 'interval.startTime=2026-06-08T00:00:00Z' \
  --data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
  --data-urlencode 'aggregation.alignmentPeriod=604800s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_DELTA' \
  --data-urlencode 'aggregation.crossSeriesReducer=REDUCE_SUM' \
  --data-urlencode 'aggregation.groupByFields=metric.label.querystring'

# (3) Per-query CALL COUNT + mean latency — metric perquery/latencies is a DISTRIBUTION;
#     sum distributionValue.count over the window = number of executions,
#     distributionValue.mean = mean latency (µs). ALIGN_DELTA, group by querystring.
curl -s -G "https://monitoring.googleapis.com/v3/projects/$PROJ/timeSeries" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/postgresql/insights/perquery/latencies" AND resource.labels.resource_id="'"$ID"'"' \
  --data-urlencode 'interval.startTime=2026-06-08T00:00:00Z' \
  --data-urlencode 'interval.endTime=2026-06-15T00:00:00Z' \
  --data-urlencode 'aggregation.alignmentPeriod=604800s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_DELTA'

execution_time values are cumulative microseconds (divide by 1e6 for seconds). The querystring label holds the normalized SQL; the LPID query starts with SELECT "group"."name" and the /jobs query with SELECT "job"."submit_time".


Appendix B — Raw data

B.1 Monthly mean CPU (daily ALIGN_MEAN, year)

2025-06 0.147   2025-09 0.121   2025-12 0.234   2026-03 0.335   2026-06 0.231 (15d)
2025-07 0.168   2025-10 0.204   2026-01 0.222   2026-04 0.399
2025-08 0.133   2025-11 0.194   2026-02 0.274   2026-05 0.418

B.2 Daily mean CPU around the May/June drop

2026-05-21 0.561   05-26 0.271   05-31 0.194   06-05 0.225   06-10 0.257
2026-05-22 0.549   05-27 0.322   06-01 0.189   06-06 0.215   06-11 0.322
2026-05-23 0.505   05-28 0.411   06-02 0.247   06-07 0.159   06-12 0.287
2026-05-24 0.226   05-29 0.309   06-03 0.265   06-08 0.144   06-13 0.279
2026-05-25 0.195   05-30 0.303   06-04 0.280   06-09 0.216   06-14 0.180 / 06-15 0.198

Clean regime change between 05-23 (0.505) and 05-24 (0.226).

B.3 Top queries by total execution time (h/wk)

                                 May 1–8   May 17–24   Jun 8–15
LPID intermittent   (#1→#6)        87        112/151*    2.8     ← #9553
/jobs list                         36         61/79*     32      (now #1)
bugscache search                   24         27         20      ← #9554/#9555
performance_datum SELECT (x2)    27+15      21+16      19+18
performance_datum UPDATE            —         19         16
group_status DELETE                 —          —          6

*Two figures where execution_time (REDUCE_SUM) and latencies (count×mean) windows differ slightly; both are reported for transparency.

B.4 Call counts and per-call cost (perquery/latencies distribution)

                 PRE-fix  (May 17–24)            POST-fix (Jun 8–15)
LPID query   calls=677,881  mean=597.1ms     calls=685,256  mean= 14.9ms   (total 112.4h → 2.8h)
/jobs query  calls=1,874,587 mean=116.7ms    calls=2,309,509 mean= 35.1ms  (total  60.8h → 22.5h)

LPID call volume is flat (+1%) while per-call cost fell 40× → the drop is the optimization, not reduced traffic.