Add automated Claude failure analysis for Prow CI #2034
base: oadp-dev
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kaovilai The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test all |
b51f15b to 5429631
|
/test all |
|
new cred |
Integrates Claude Code via Vertex AI to automatically analyze E2E test
failures in Prow CI and generate comprehensive failure reports.
Changes:
- build/ci-Dockerfile: Install Node.js 20.x and Claude CLI with multi-arch support
- tests/e2e/scripts/analyze_failures.sh: New analysis script with:
  - Vertex AI integration via Claude Code CLI headless mode (--print flag)
  - Comprehensive secret redaction (AWS keys, GCP keys, tokens, passwords)
  - Graceful degradation when credentials unavailable
  - 10-minute timeout with partial analysis on timeout
- Makefile: Set GOOGLE_APPLICATION_CREDENTIALS and ANTHROPIC_VERTEX_PROJECT_ID
  from vault files before running analysis script
- docs/design/claude-prow-failure-analysis_design.md: Complete design document
- tests/e2e/claude_test_failure_test.go: Simple failing test for verification
- tests/e2e/backup_restore_suite_test.go: Realistic failing test that triggers
  must-gather collection
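
The 10-minute timeout with partial output listed above could be sketched roughly as follows. This is an illustration under assumptions, not the code in analyze_failures.sh: the function name `run_with_timeout` and the appended notice are hypothetical, and the sketch relies on the GNU coreutils `timeout` command, which exits with status 124 when the limit is reached.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "timeout with partial analysis".
# run_with_timeout and the NOTE text are illustrative assumptions.

run_with_timeout() {
  local limit_seconds="$1" output_file="$2"
  shift 2
  local status=0

  # Output is streamed to the file, so anything written before the
  # deadline survives as a partial analysis.
  timeout "${limit_seconds}" "$@" > "${output_file}" || status=$?

  # GNU timeout reports an expired limit with exit status 124.
  if [ "${status}" -eq 124 ]; then
    echo "NOTE: analysis timed out after ${limit_seconds}s; output is partial." >> "${output_file}"
  fi

  # Always succeed so the analysis cannot change the test job's result.
  return 0
}
```

With a limit of 600 seconds and an output path under `${ARTIFACT_DIR}`, a wrapper like this would match the behavior described in the list, but the real script may handle the timeout differently.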
The analysis runs post-suite and does not affect test execution or results.
Output is saved to ${ARTIFACT_DIR}/claude-failure-analysis.md with automatic
secret redaction to prevent credential leakage.
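
A redaction pass of the kind described here might look like the sketch below. The function name `redact_secrets` and the three patterns are assumptions chosen for illustration; the actual rules in analyze_failures.sh are not shown in this PR conversation and are likely more extensive.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of secret redaction. The patterns are illustrative
# assumptions, not the rules used by analyze_failures.sh.

redact_secrets() {
  # Reads text on stdin and writes a redacted copy to stdout.
  #   1. AWS access key IDs (AKIA followed by 16 uppercase letters or digits)
  #   2. The "private_key" field of a GCP service-account JSON file
  #   3. key=value or key: value pairs named password, token, or secret
  sed -E \
    -e 's/AKIA[0-9A-Z]{16}/[REDACTED_AWS_ACCESS_KEY]/g' \
    -e 's/("private_key"[[:space:]]*:[[:space:]]*")[^"]*"/\1[REDACTED]"/g' \
    -e 's/(password|token|secret)([[:space:]]*[=:][[:space:]]*)[^[:space:]"]+/\1\2[REDACTED]/g'
}
```

The report could then be filtered on its way to disk, for example `redact_secrets < raw-analysis.md > "${ARTIFACT_DIR}/claude-failure-analysis.md"`, where `raw-analysis.md` is a placeholder name.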
Requires Vertex AI credentials in vault collection files:
- /var/run/oadp-credentials/gcp-claude-code-credentials
- /var/run/oadp-credentials/gcp-claude-code-project-id
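
Loading these two files and degrading gracefully when they are absent could be sketched as follows. The environment variable names come from the commit message; everything else (`CRED_DIR`, both function names, the messages, and the assumption that the credentials variable holds a file path) is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: export Vertex AI settings from the vault files and
# skip the analysis when they are missing. CRED_DIR and the function names
# are illustrative assumptions.

CRED_DIR="${CRED_DIR:-/var/run/oadp-credentials}"

load_vertex_credentials() {
  local cred_file="${CRED_DIR}/gcp-claude-code-credentials"
  local project_file="${CRED_DIR}/gcp-claude-code-project-id"

  # Both files must exist and be non-empty; otherwise report "unavailable".
  if [ ! -s "${cred_file}" ] || [ ! -s "${project_file}" ]; then
    return 1
  fi

  # Assumption: the credentials variable points at the file path.
  export GOOGLE_APPLICATION_CREDENTIALS="${cred_file}"
  ANTHROPIC_VERTEX_PROJECT_ID="$(tr -d '[:space:]' < "${project_file}")"
  export ANTHROPIC_VERTEX_PROJECT_ID
}

run_analysis_if_possible() {
  if ! load_vertex_credentials; then
    echo "Vertex AI credentials not found; skipping failure analysis." >&2
    return 0  # Degrade gracefully: never fail the CI job.
  fi
  echo "credentials loaded for project ${ANTHROPIC_VERTEX_PROJECT_ID}"
}
```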
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
5429631 to 22ad6b9
|
/test all |
…mentation

This commit introduces a new file, CLAUDE.md, which provides comprehensive guidance for developers working with the OADP project. It includes sections on project overview, prerequisites, essential development commands, testing commands, cloud authentication deployment, and important environment variables.

Additionally, the existing failure analysis documentation in docs/design/claude-prow-failure-analysis_design.md has been updated to reflect changes in the analysis process, emphasizing the use of JUnit reports and per-test logs instead of build logs, which are not available during analysis. The analysis script has also been modified to focus on available artifacts and improve clarity in the analysis tasks.

Changes:
- New file: CLAUDE.md with detailed developer instructions
- Updated failure analysis design document to clarify artifact usage and analysis process
- Modifications to the analyze_failures.sh script to remove references to build-log.txt and enhance artifact handling

This update aims to streamline the development workflow and improve the efficiency of failure analysis in Prow CI.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
|
/test all |
This commit updates the analyze_failures.sh script to include the 'set -o pipefail' option, so that a pipeline returns the exit status of the last command in it that exits non-zero, rather than only the status of its final command. Additionally, the documentation within the script has been revised to clarify the known flake patterns, now referencing the source file that contains detailed information about flake patterns and error ignore patterns. This aims to streamline the failure analysis process by providing clearer guidance on diagnosing test failures.

Changes:
- Added 'set -o pipefail' to improve error handling in pipelines.
- Updated documentation to reference the source file for known flake patterns and error ignore patterns.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
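
The effect of `set -o pipefail` can be seen with a two-command pipeline whose first command fails. The helper name `pipeline_status` is hypothetical; it runs the pipeline in a fresh `bash` process so that the option does not leak into the calling shell.

```shell
#!/usr/bin/env bash
# Demonstrates how pipefail changes the exit status of `false | cat`.
# pipeline_status is a hypothetical helper used only for this demo.

pipeline_status() {
  # $1 is "-o" to enable pipefail or "+o" to disable it.
  bash -c 'set "$1" pipefail; rc=0; false | cat || rc=$?; echo "${rc}"' _ "$1"
}

pipeline_status +o   # prints 0: only the status of `cat` is reported
pipeline_status -o   # prints 1: the failure of `false` is reported
```

Without the option, a failing producer in a pipeline such as `some_command | tee log.txt` would go unnoticed, which is presumably why the script enables it.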
This commit improves the analyze_failures.sh script by adding support for preprocessing large log files using a subagent pattern. It introduces functions to extract relevant error messages and context from large logs, creating a summary file for quick access during analysis. Additionally, the script now checks for the availability of the Claude CLI before execution and captures exit codes properly to ensure accurate error handling.

Changes:
- Added functions for extracting errors from large log files.
- Implemented preprocessing of large artifacts to create a summary of extracted errors.
- Enhanced documentation to clarify the analysis process and artifact usage.

These updates aim to streamline the failure analysis workflow and improve the accuracy of insights generated from log files.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
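
The preprocessing described in this commit could be approximated by the sketch below. It is not the script's implementation: the function names, the error pattern, the `*.log` glob, the context size, and the 1 MiB default threshold are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pull error lines (with context) out of large logs
# into one summary file. Names, pattern, and thresholds are assumptions.

extract_errors() {
  # Print matching lines with 2 lines of context, capped at 200 lines.
  grep -n -i -E -C 2 'error|fail|panic' "$1" | head -n 200
}

summarize_large_logs() {
  local artifact_dir="$1" summary="$2" max_bytes="${3:-1048576}"
  local log_file size

  : > "${summary}"  # Start with an empty summary file.

  # Note: file names containing newlines are not handled by this loop.
  find "${artifact_dir}" -type f -name '*.log' | while IFS= read -r log_file; do
    size="$(wc -c < "${log_file}" | tr -d '[:space:]')"
    if [ "${size}" -gt "${max_bytes}" ]; then
      {
        echo "== ${log_file} (${size} bytes) =="
        extract_errors "${log_file}" || true
        echo
      } >> "${summary}"
    fi
  done
}
```

Only logs above the size threshold are summarized; smaller files are left for the analysis to read directly.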
|
/test all |
…e analysis

This commit updates the `.claude/config.json` file to remove specific path permissions that are now handled at runtime using the `--allowedTools` flag. The documentation in `docs/design/claude-prow-failure-analysis_design.md` has been expanded to clarify the new permission model, emphasizing the use of runtime permissions to bypass sandbox restrictions. Additionally, the `analyze_failures.sh` script has been updated to utilize the `--allowedTools` flag for explicit file access during analysis.

Changes:
- Removed specific path permissions from `.claude/config.json`.
- Enhanced documentation to explain the runtime permissions model.
- Updated `analyze_failures.sh` to use `--allowedTools` for file access.

These updates aim to improve the clarity and functionality of the failure analysis process in Prow CI.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
|
/test all |
|
/retest ai-retester: Both failures are “intentional” – they’re deliberately triggered by the test author to make sure the Claude‑based failure‑analysis plumbing works.
comment for /pull/2034 |
- Add --allowedTools flag to grant Read access to artifact paths
- Fix argument order: --allowedTools must come before --print
- Add AVAILABLE TOOLS section to prompts so Claude knows its constraints
- Simplify .claude/config.json (path permissions handled at runtime)
- Update design doc with runtime permissions approach

The Claude Code CLI sandbox restricts filesystem access to the current working directory. Since artifacts are at /logs/artifacts/ (outside CWD), we use --allowedTools to explicitly grant Read permissions at runtime.
…n and scripts

- Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script.
- Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions.
- Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation.

These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
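
A call shaped like the one this commit describes might be sketched as below. The `--add-dir` and `--print` flags are the ones named in these commit messages; the wrapper name `run_claude_analysis`, the `CLAUDE_BIN` override, the skip message, and the exact argument order are assumptions, so the real invocation in analyze_failures.sh may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of granting directory access at runtime.
# run_claude_analysis and CLAUDE_BIN are illustrative assumptions.

run_claude_analysis() {
  local artifact_dir="$1" prompt="$2"
  local claude_bin="${CLAUDE_BIN:-claude}"

  # Skip quietly when the CLI is not installed in the CI image.
  if ! command -v "${claude_bin}" >/dev/null 2>&1; then
    echo "claude CLI not found; skipping analysis." >&2
    return 0
  fi

  # The directory flag is placed before --print, mirroring the ordering
  # requirement noted earlier for --allowedTools (an assumption here).
  "${claude_bin}" --add-dir "${artifact_dir}" --print "${prompt}"
}
```

Setting `CLAUDE_BIN=echo` prints the arguments instead of calling the CLI, which is a convenient way to inspect the command line without credentials.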
|
/test all |
|
|
that's pretty hot! I'll be back tomorrow if you guys want to chat :) THANK YOU! For pushing this along |
|
/test 4.20-e2e-test-aws |
Original PR: openshift#2034
Author: Tiger Kaovilai <tkaovila@redhat.com>
Date: Fri Nov 28 09:17:23 2025 -0500

refactor: update runtime permissions in failure analysis documentation and scripts

- Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script.
- Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions.
- Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation.

These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
|
/hold |
|
moving to #2038 |
|
@kaovilai: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |