perf: test report cache priming, pagination, and query optimization #3505
stbenjam wants to merge 15 commits into openshift:main from
Conversation
When collapse=false, TestsByNURPAndStandardDeviation builds a query that self-joins prow_test_report_7d_matview three times:

1. Outer query - gets the raw rows
2. pass_rates subquery - computes per-variant percentages
3. stats subquery - computes AVG/STDDEV across variants

The name/variant filters were only applied to the outermost query. Subqueries 2 and 3 scanned all rows for the release to compute aggregates for every test, even when only a single test was requested. For release 4.22 with a name filter, this meant:

|                | Before (outer only)  | After (pushed down) |
|----------------|----------------------|---------------------|
| Stats subquery | Seq Scan, 1.28M rows | Index Scan, 142     |
| Estimated cost | 802,603 - 1,137,530  | 7.53 - 1,371        |
| Speedup        | -                    | ~830x               |

TestsByNURPAndStandardDeviation now accepts optional filter functions (variadic, backward-compatible) that are applied to both the stats and pass_rates subqueries. The filter is still also applied to the outer query, so results are identical.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Variant-specific filters (e.g., NOT has entry "never-stable") must not be pushed into the stats subquery, which computes AVG/STDDEV across all variants for a test. Filtering out variants there would skew the delta_from_*_average and standard deviation calculations.

Split SubqueryFilter into a struct with a VariantOnly flag and an isVariantFilter helper. At the call site, the rawFilter is further split: name filters go to both the stats and pass_rates subqueries (safe, just narrows to the matching test), while variant filters go only to pass_rates (preserving cross-variant stats semantics).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
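The routing described above can be sketched as follows. This is a simplified illustration, not the PR's actual code: the real SubqueryFilter wraps database filter functions, whereas here filters are plain SQL fragments so the example stays self-contained.

```go
package main

import (
	"fmt"
	"strings"
)

// SubqueryFilter mirrors the struct described in the commit: a filter plus a
// VariantOnly flag. Representing the filter as a SQL fragment is a
// simplification for illustration only.
type SubqueryFilter struct {
	Where       string // SQL fragment, e.g. "name LIKE ?"
	VariantOnly bool   // variant filters must not reach the stats subquery
}

// applyFilters splits filters between the stats and pass_rates subqueries:
// name filters go to both, variant filters only to pass_rates.
func applyFilters(statsWhere, passRatesWhere []string, filters ...SubqueryFilter) ([]string, []string) {
	for _, f := range filters {
		passRatesWhere = append(passRatesWhere, f.Where)
		if !f.VariantOnly {
			statsWhere = append(statsWhere, f.Where)
		}
	}
	return statsWhere, passRatesWhere
}

func main() {
	stats, passRates := applyFilters(nil, nil,
		SubqueryFilter{Where: "name LIKE '%etcd%'"},
		SubqueryFilter{Where: "variants @> ARRAY['never-stable']", VariantOnly: true},
	)
	fmt.Println("stats:     ", strings.Join(stats, " AND "))
	fmt.Println("pass_rates:", strings.Join(passRates, " AND "))
}
```

Keeping variant filters out of the stats subquery is the point: AVG/STDDEV must still be computed across all variants of the matching test.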
Walkthrough

Adds server-side pagination, sorting, and filtering to the Tests API (Postgres and BigQuery), exports SubqueryFilter for per-variant vs global DB filters, introduces cache priming utilities and a test-report-cache loader, validates pagination input (with tests), and wires the frontend TestTable for server-driven pagination and cache-aware results.

Changes: Tests API, DB Query Filters, Pagination, and Cache Priming
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Frontend as TestTable
    participant Server
    participant DB
    participant Cache
    Client->>Frontend: change page/filter/sort
    Frontend->>Server: GET /api/tests?release=X&page=Y&perPage=Z&...
    Server->>Server: parse & validate pagination/spec
    Server->>Cache: check cached paginated/unfiltered results
    alt cache miss
        Server->>DB: call TestsByNURPAndStandardDeviation(..., subqueryFilters...)
        DB-->>Server: filtered rows (+ total_rows if counted)
        Server->>Cache: store results (use TestResultsCacheDuration)
    end
    Server-->>Frontend: JSON { rows, total_rows, page_size, page }
    Frontend-->>Client: render rows and pagination UI
```
🧹 Nitpick comments (1)

pkg/db/query/test_queries.go (1)

247-260: ⚡ Quick win: Split the godoc so it documents the right exported symbols.

Inserting `SubqueryFilter` here makes the existing `TestsByNURPAndStandardDeviation` doc block attach to `SubqueryFilter`, so the type now starts with the function description and the exported function no longer has its own godoc. Please give `SubqueryFilter` a short type comment and move the analytics-query description back above `TestsByNURPAndStandardDeviation`.

As per coding guidelines, "Name each function succinctly but accurately indicating its purpose relative to its package or receiver. When adding new functions, types, or fields, include a brief godoc if the name alone would not make the purpose obvious to someone unfamiliar with the feature."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 24a16f3d-8b6b-4be1-9e37-fab115d7b00a
📒 Files selected for processing (2)
pkg/api/tests.go
pkg/db/query/test_queries.go
Scheduling required tests:
The /api/tests endpoint previously returned all matching rows (up to 50k+ for uncollapsed views), causing the frontend to barely load. This adds server-side pagination following the existing pattern used by the job runs endpoint.

When perPage/page query parameters are present, the backend now:

- Applies ORDER BY, COUNT, LIMIT, and OFFSET at the SQL level
- Returns a PaginationResult envelope with rows, total_rows, page_size, page
- Bypasses the cache (paginated queries are fast with LIMIT/OFFSET)

When pagination params are absent, existing behavior is preserved for backward compatibility.

Frontend changes switch the DataGrid to paginationMode="server" and send perPage/page params, following the JobRunsTable pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@stbenjam: This pull request references TRT-2575 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
- Validate sort direction against an allowlist (asc/desc) instead of interpolating raw user input into the ORDER BY clause
- Add bounds validation for perPage (1-1000) and page (>= 0) in getPaginationParams to prevent DoS via unbounded queries
- Check COUNT query error instead of silently ignoring it
- Fix BigQuery path to return PaginationResult envelope when pagination params are present (frontend expects this format)
- Move COUNT before ORDER BY to avoid unnecessary overhead
- Inline isVariantFilter (trivial single-field access)
- Separate SubqueryFilter doc comment from function doc block
- Update API docs in pkg/api/README.md with new pagination params and response format
- Add unit tests for getPaginationParams bounds validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review Panel Verdict

Disposition: APPROVE — all BLOCKING findings from initial review resolved

Specialist Findings

- Architecture Reviewer: Cross-file impact is clean.
- Security & Supply Chain Reviewer: Sort direction uses an allowlist (
- UX & API Reviewer: Backward compatible — when
- Codebase Consistency Reviewer:
- QA Engineer:
- Devil's Advocate: Sort direction injection — resolved, allowlist prevents arbitrary tokens. BigQuery response shape — resolved, wraps in
- Technical Writer: API docs in
- DBA Expert (300 years PostgreSQL): Filter pushdown into stats/passRates subqueries is the key performance win (~830x for filtered queries).

Panel Synthesis

All eight specialists converged on the same five BLOCKING findings in the initial review, and all five have been resolved:

No remaining BLOCKING findings. The SUGGESTION items (inline

Required Actions Before Merge

None.

Optional Follow-ups
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/api/tests.go`:
- Around line 401-409: The BigQuery branch sets TotalRows to len(testsResult)
after pagination, returning the page count rather than the full dataset count;
update the logic in the handler (the BigQuery path in pkg/api/tests.go where
RespondWithJSON is called for pagination) to compute and return the true total
row count by either running a separate COUNT(*) query before applying
limit/offset (mirror the Postgres path) or by using any available pre-limit
count variable (e.g., a totalCount/rowsBeforeLimit value if present) and set
apitype.PaginationResult.TotalRows to that value instead of len(testsResult);
ensure this runs only when pagination != nil and does not change the existing
paged Rows/Page/PageSize fields.
📒 Files selected for processing (7)
pkg/api/README.md
pkg/api/tests.go
pkg/db/query/test_queries.go
pkg/sippyserver/parameters.go
pkg/sippyserver/parameters_test.go
pkg/sippyserver/server.go
sippy-ng/src/tests/TestTable.js
The BQ path was setting TotalRows to len(testsResult) after limit, which gave the page count instead of the dataset total. Now captures the total before applying pagination slice. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
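The fix can be illustrated with a small helper that captures the dataset total before slicing. The row type and helper name are stand-ins for illustration, not the actual BigQuery path:

```go
package main

import "fmt"

// paginateRows captures the total *before* applying the pagination slice,
// which is the bug fix described above: TotalRows must reflect the full
// result set, not just the returned page.
func paginateRows(rows []string, page, perPage int) (paged []string, totalRows int) {
	totalRows = len(rows) // capture before limit/offset
	start := page * perPage
	if start > len(rows) {
		start = len(rows)
	}
	end := start + perPage
	if end > len(rows) {
		end = len(rows)
	}
	return rows[start:end], totalRows
}

func main() {
	rows := []string{"a", "b", "c", "d", "e"}
	paged, total := paginateRows(rows, 1, 2)
	fmt.Println(paged, total) // [c d] 5
}
```

Had the total been read after slicing, the frontend's pager would have reported at most perPage rows and never offered further pages.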
/hold
The paginated tests API path was bypassing the cache and running the expensive three-layer nested subquery on every page/sort change. Now both paginated and non-paginated paths share the same cached result set (1 hour TTL), with sorting and pagination applied in memory. The collapsed result set is ~5k rows, making in-memory operations trivial. This also removes the separate COUNT(*) query that was doubling the DB work per paginated request. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cache the full unfiltered test result set and apply filters in memory. This means any filter, sort, or page change is served instantly from cache without hitting the database. The cache TTL is increased to 4 hours to match the cache primer schedule. Also adds a test-report-cache data loader that can be used with the cache primer cronjob (--loader=test-report-cache) to warm the test results cache for all releases on both default and twoDay periods. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sults Only prime cache for OCP development releases (no GA date, has payloadTags capability) to avoid wasting time on GA/OKD releases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (2)

pkg/api/tests.go (1)

485-490: 💤 Low value: Consider adding a brief godoc for the `matview()` method.

While the method is simple, a short comment clarifying its purpose (e.g., `// matview returns the materialized view name based on the spec's period.`) would improve readability for unfamiliar readers. As per coding guidelines, include a brief godoc if the name alone would not make the purpose obvious.

pkg/dataloader/testreportcacheloader/testreportcacheloader.go (1)

14-27: 💤 Low value: Consider adding brief godoc for the exported `New` function.

While the implementation is straightforward and follows established loader patterns, the coding guidelines suggest including a brief godoc if the name alone would not make the purpose obvious. A short comment like `// New creates a testReportCacheLoader that primes cache for development releases.` would help unfamiliar readers.
📒 Files selected for processing (4)
cmd/sippy/load.go
pkg/api/tests.go
pkg/dataloader/testreportcacheloader/testreportcacheloader.go
pkg/dataloader/testreportcacheloader/testreportcacheloader_test.go
PrimeTestResultsCache now bypasses the cache read path entirely, always regenerating from the database and writing the fresh result. Previously it went through GetDataFromCacheOrMatview which would return stale cached data if the matview hadn't been refreshed yet. Also adds hack/bench-test-api.sh for comparing prod vs local API response times across various query patterns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
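The difference between the old read-through path and the new always-regenerate prime path can be sketched with a toy in-memory cache. Function and type names here are illustrative, not sippy's actual cache API:

```go
package main

import "fmt"

// store is a toy stand-in for the Redis cache client.
type store map[string]string

// readThrough returns a cached value if present, otherwise generates and
// caches it. A primer going through this path would return stale data
// instead of refreshing it -- the bug described above.
func readThrough(c store, key string, generate func() string) string {
	if v, ok := c[key]; ok {
		return v
	}
	v := generate()
	c[key] = v
	return v
}

// prime bypasses the read path entirely: it always regenerates from the
// source of truth and overwrites whatever is cached.
func prime(c store, key string, generate func() string) {
	c[key] = generate()
}

func main() {
	c := store{"tests:4.22": "stale"}
	fmt.Println(readThrough(c, "tests:4.22", func() string { return "fresh" })) // stale
	prime(c, "tests:4.22", func() string { return "fresh" })
	fmt.Println(c["tests:4.22"]) // fresh
}
```

The design point: a primer's entire job is to refresh, so a cache hit on the read path is a no-op for it and leaves stale data in place until TTL expiry.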
The production API is too slow to complete within curl's timeout, making the comparison benchmark impractical. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cache primer runs every 4 hours, so a 4-hour TTL risked cache expiry on the boundary before the next primer run. Extending to 5 hours ensures primed entries never expire between runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (2)

pkg/api/tests.go (2)

421-437: 💤 Low value: Consider extracting a pagination helper to reduce duplication.

The pagination logic (lines 421-437) is nearly identical to the Postgres path (lines 377-393). Per coding guidelines, check pkg/util/ for existing helpers or consider extracting a shared pagination function.

♻️ Example helper extraction

```go
// In pkg/util or locally in this file:
func paginate[T any](items []T, pagination *apitype.Pagination) ([]T, int64) {
	totalRows := int64(len(items))
	start := pagination.Page * pagination.PerPage
	end := start + pagination.PerPage
	if start > int(totalRows) {
		start = int(totalRows)
	}
	if end > int(totalRows) {
		end = int(totalRows)
	}
	return items[start:end], totalRows
}
```

230-241: 💤 Low value: Filter errors are silently ignored.

When f.Filter(t) returns an error (e.g., due to an invalid filter field), the test is silently excluded from results. Consider logging at debug level to aid troubleshooting.

♻️ Suggested improvement

```diff
 func (tests TestsAPIResult) filter(f *filter.Filter) TestsAPIResult {
 	if f == nil || len(f.Items) == 0 {
 		return tests
 	}
 	var result TestsAPIResult
 	for _, t := range tests {
-		if match, err := f.Filter(t); err == nil && match {
+		match, err := f.Filter(t)
+		if err != nil {
+			log.WithError(err).Debugf("filter error for test %s", t.Name)
+			continue
+		}
+		if match {
 			result = append(result, t)
 		}
 	}
 	return result
 }
```
📒 Files selected for processing (1)
pkg/api/tests.go
The API defaults IncludeOverall to true when collapse is false, but the primer was leaving it as false. This caused a cache key mismatch, resulting in cache misses for uncollapsed requests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
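The key-mismatch failure mode can be sketched as follows. The field and function names are assumptions for illustration; the point is that primer and server must run identical defaulting before deriving the cache key:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// spec is a stand-in for the test results request spec; fields are assumed.
type spec struct {
	Release        string
	Collapse       bool
	IncludeOverall bool
}

// applyDefaults mirrors the defaulting the API performs, so that primer and
// server end up with identical specs -- and therefore identical cache keys.
func applyDefaults(s spec) spec {
	if !s.Collapse {
		s.IncludeOverall = true
	}
	return s
}

// cacheKey derives a deterministic key from the fully-defaulted spec.
func cacheKey(s spec) string {
	b, _ := json.Marshal(applyDefaults(s))
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

func main() {
	server := spec{Release: "4.22", Collapse: false, IncludeOverall: true}
	primer := spec{Release: "4.22", Collapse: false} // primer left the flag unset
	fmt.Println(cacheKey(server) == cacheKey(primer)) // true
}
```

Without the shared defaulting step, the primer would warm a key the server never reads, turning every uncollapsed request into a cache miss.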
The old code path through GetDataFromCacheOrMatview handled nil cache gracefully, but the direct write path panicked on nil. Return an error instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benchmark: Local API with cache priming (release 5.0)
Collapsed results see a ~26x improvement from caching. Uncollapsed results are ~12x faster cached vs uncached (2+ minutes down to ~10s). The uncollapsed cached time is dominated by deserializing the ~1GB JSON blob; the uncollapsed uncached query is a full self-joining matview scan that takes over 2 minutes.
Add WithCompression() option to CacheSet that gzip-compresses the JSON before writing to Redis. Both cache read paths auto-detect gzip via magic header bytes and decompress transparently. Only the test results cache primer uses compression for now, as the uncollapsed result set is ~1GB of JSON. All other callers are unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
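The auto-detection relies on the gzip magic header (bytes 0x1f 0x8b, per RFC 1952). Below is a self-contained sketch of the write/read paths; the function names are illustrative rather than sippy's actual cache API:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// maybeDecompress inspects the first two bytes for the gzip magic header and
// decompresses transparently, passing plain entries through unchanged. This
// is what lets compressed and uncompressed writers share one cache.
func maybeDecompress(data []byte) ([]byte, error) {
	if len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b {
		zr, err := gzip.NewReader(bytes.NewReader(data))
		if err != nil {
			return nil, err
		}
		defer zr.Close()
		return io.ReadAll(zr)
	}
	return data, nil // plain entry from an uncompressed caller
}

// compress gzips a payload before it is written to the cache.
func compress(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Bytes()
}

func main() {
	payload := []byte(`{"rows": [1, 2, 3]}`)
	out, _ := maybeDecompress(compress(payload))
	fmt.Println(string(out) == string(payload)) // true
}
```

Because detection happens on read, only the ~1GB test-results blob needs the compressed write path; every other caller keeps writing plain JSON with no migration.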
This is huge, and that benchmark comment answers the main question I had. Having the option to see the uncompressed list again would be handy; I had to give up on that some time ago.
/lgtm

Just making sure I don't step on TRT's toes if they want a crack at this. Thanks for improving this, it will help daily.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, stbenjam

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@stbenjam: No Jira issue is referenced in the title of this pull request.

Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
I am not sure this is really the right approach; the caching is clunky and bad. @smg247 is going to look at it. If we can't figure out something better we could try to take this, but it is really putting a tuxedo on a toad.
@stbenjam: all tests passed! Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Cache priming for test results
The test report cache loader now primes both collapsed and non-collapsed results in Redis, eliminating cold-cache latency for the most common test report views. Previously only collapsed results were primed, leaving non-collapsed queries (the detailed per-variant NURP+ view) to hit the database on first access.
To avoid unnecessary work, the cache primer now only targets OCP development releases (identified by having the payloadTags capability and no GA date set). GA releases and non-OCP products (OKD, etc.) are skipped.

Server-side pagination
The /api/tests endpoint previously returned all matching rows (up to 50k+ for uncollapsed views), causing the tests page to barely load. This adds server-side pagination following the existing pattern used by the job runs endpoint.
When perPage/page query parameters are present, the backend:
When pagination params are absent, existing behavior is preserved for backward compatibility.
Frontend changes switch the DataGrid to paginationMode="server" and send perPage/page params, following the JobRunsTable pattern.

Filter pushdown (~830x improvement for filtered queries)
When collapse=false, TestsByNURPAndStandardDeviation builds a query that self-joins prow_test_report_7d_matview 3 times. Name/variant filters were only applied to the outermost query, causing subqueries to scan all rows for the release. Filters are now pushed into the stats and pass_rates subqueries, allowing index use on cache misses and paginated queries.
Replaces #3290 (rebased on main; original e2e failure was an unrelated sippy-load-job timeout).
Summary by CodeRabbit
Release Notes
New Features
- perPage and page parameters for efficient data retrieval

Documentation