@kaovilai
Member

Integrates Claude Code via Vertex AI to automatically analyze E2E test
failures in Prow CI and generate comprehensive failure reports.

Changes:

  • build/ci-Dockerfile: Install Node.js 20.x and Claude CLI with multi-arch support
  • tests/e2e/scripts/analyze_failures.sh: New analysis script with:
    • Vertex AI integration via Claude Code CLI headless mode (--print flag)
    • Comprehensive secret redaction (AWS keys, GCP keys, tokens, passwords)
    • Graceful degradation when credentials unavailable
    • 10-minute timeout with partial analysis on timeout
  • Makefile: Set GOOGLE_APPLICATION_CREDENTIALS and ANTHROPIC_VERTEX_PROJECT_ID
    from vault files before running analysis script
  • docs/design/claude-prow-failure-analysis_design.md: Complete design document
  • tests/e2e/claude_test_failure_test.go: Simple failing test for verification
  • tests/e2e/backup_restore_suite_test.go: Realistic failing test that triggers
    must-gather collection
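
The timeout-with-partial-output behavior listed above could look roughly like the sketch below. It is not taken from analyze_failures.sh: the helper name `run_with_timeout` is hypothetical, and it assumes GNU coreutils `timeout`, which exits with status 124 when the time limit is reached.

```shell
# Hypothetical helper (not from the PR): run a command under a time limit and
# write its combined output to a file. Assumes GNU coreutils `timeout`.
run_with_timeout() {
  local limit="$1" out_file="$2"
  shift 2
  local rc=0
  # Capture the exit status without aborting callers that use `set -e`.
  timeout "$limit" "$@" >"$out_file" 2>&1 || rc=$?
  if [ "$rc" -eq 124 ]; then
    # 124 means the limit was hit: keep what was written and mark it partial.
    printf '\n[analysis timed out after %s; output above is partial]\n' "$limit" >>"$out_file"
  fi
  return "$rc"
}
```

With a limit of `10m` and `${ARTIFACT_DIR}/claude-failure-analysis.md` as the output file, this would approximate the behavior described in the list; the caller can treat status 124 as "partial report available" rather than as a hard failure.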

The analysis runs post-suite and does not affect test execution or results.
Output is saved to ${ARTIFACT_DIR}/claude-failure-analysis.md with automatic
secret redaction to prevent credential leakage.
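
As an illustration of what automatic redaction can mean in practice, here is a minimal sketch of a stdin-to-stdout filter. The function name `redact_secrets` and the specific patterns are assumptions chosen for the example; the actual script may use different or additional patterns.

```shell
# Hypothetical filter (patterns are illustrative, not the PR's actual list):
# reads text on stdin and masks a few common credential shapes.
redact_secrets() {
  sed -E \
    -e 's|AKIA[0-9A-Z]{16}|[REDACTED_AWS_ACCESS_KEY_ID]|g' \
    -e 's|("private_key"[[:space:]]*:[[:space:]]*")[^"]+|\1[REDACTED]|g' \
    -e 's|(Bearer )[A-Za-z0-9._~+/=-]+|\1[REDACTED_TOKEN]|g' \
    -e 's|([Pp]assword[[:space:]]*[:=][[:space:]]*)[^[:space:]]+|\1[REDACTED]|g'
}
```

Piping the generated report through such a filter before it is written under `${ARTIFACT_DIR}` keeps matching strings out of the published artifact. Pattern-based redaction only catches shapes it knows about, so it complements rather than replaces restricting what the analysis can read.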

Requires Vertex AI credentials in vault collection files:

  • /var/run/oadp-credentials/gcp-claude-code-credentials
  • /var/run/oadp-credentials/gcp-claude-code-project-id
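
A sketch of how these two files could feed the environment variables named above while degrading gracefully when they are missing. The PR does this from the Makefile; the shell function `setup_vertex_credentials` below is a hypothetical equivalent written only for illustration.

```shell
# Hypothetical helper: export the Vertex AI variables from the vault files.
# Returns 1 (without failing the job) when either file is missing or empty,
# so the caller can skip the analysis step.
setup_vertex_credentials() {
  local cred_dir="${1:-/var/run/oadp-credentials}"
  local cred_file="$cred_dir/gcp-claude-code-credentials"
  local project_file="$cred_dir/gcp-claude-code-project-id"
  if [ ! -s "$cred_file" ] || [ ! -s "$project_file" ]; then
    echo "Vertex AI credentials not found in $cred_dir; skipping failure analysis" >&2
    return 1
  fi
  export GOOGLE_APPLICATION_CREDENTIALS="$cred_file"
  # Strip whitespace (for example a trailing newline) from the project ID.
  ANTHROPIC_VERTEX_PROJECT_ID="$(tr -d '[:space:]' <"$project_file")"
  export ANTHROPIC_VERTEX_PROJECT_ID
}
```

A caller would typically write `setup_vertex_credentials || exit 0`, so that a missing credential skips the analysis without changing the job result.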

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Why the changes were made

How to test the changes made

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 22, 2025
@openshift-ci

openshift-ci bot commented Nov 22, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai bot commented Nov 22, 2025

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci bot commented Nov 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 22, 2025
@kaovilai
Member Author

/test all

@kaovilai kaovilai force-pushed the claude-prow-failure-analysis branch from b51f15b to 5429631 Compare November 24, 2025 16:12
@kaovilai
Member Author

/test all

@kaovilai
Member Author

new cred
/test all

@kaovilai kaovilai force-pushed the claude-prow-failure-analysis branch from 5429631 to 22ad6b9 Compare November 24, 2025 22:42
@kaovilai
Member Author

/test all

…mentation

This commit introduces a new file, CLAUDE.md, which provides comprehensive guidance for developers working with the OADP project. It includes sections on project overview, prerequisites, essential development commands, testing commands, cloud authentication deployment, and important environment variables.

Additionally, the existing failure analysis documentation in docs/design/claude-prow-failure-analysis_design.md has been updated to reflect changes in the analysis process, emphasizing the use of JUnit reports and per-test logs instead of build logs, which are not available during analysis. The analysis script has also been modified to focus on available artifacts and improve clarity in the analysis tasks.

Changes:
- New file: CLAUDE.md with detailed developer instructions
- Updated failure analysis design document to clarify artifact usage and analysis process
- Modifications to the analyze_failures.sh script to remove references to build-log.txt and enhance artifact handling

This update aims to streamline the development workflow and improve the efficiency of failure analysis in Prow CI.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
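
Since the analysis now starts from JUnit reports rather than build logs, a first pass over the artifacts can be as simple as counting `<failure>` elements. The sketch below is an assumption about how that might be done with standard tools; the function name `count_junit_failures` and the `junit*.xml` file-name pattern are hypothetical (the pattern does match the `junit_report.xml` file named in the analysis output later in this thread).

```shell
# Hypothetical helper: count <failure> elements across JUnit XML reports
# found under a directory. A text match is a rough heuristic, not an XML parser.
count_junit_failures() {
  local dir="$1"
  # `|| true` keeps the count at 0 (instead of an error) when nothing matches.
  { find "$dir" -type f -name 'junit*.xml' -exec grep -o '<failure[ >]' {} + 2>/dev/null || true; } \
    | wc -l | tr -d '[:space:]'
}
```

A result of `0` lets the script skip the analysis entirely when every test passed.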
@kaovilai
Member Author

/test all

This commit updates the analyze_failures.sh script to include the 'set -o pipefail' option, so that a pipeline returns the exit status of the last (rightmost) command that fails instead of only the status of its final command. The documentation within the script has also been revised to clarify the known flake patterns: it now references the source file that contains detailed information about flake patterns and error ignore patterns. This aims to streamline the failure analysis process by providing clearer guidance on diagnosing test failures.

Changes:
- Added 'set -o pipefail' to improve error handling in pipelines.
- Updated documentation to reference the source file for known flake patterns and error ignore patterns.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
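
The effect of `set -o pipefail` can be seen in a small standalone bash demonstration. It is independent of analyze_failures.sh and uses only `false` and `cat`.

```shell
# Without pipefail, a pipeline reports only the status of its final command.
set +o pipefail
status_without=0
false | cat || status_without=$?   # stays 0: `cat` succeeded, `false` is hidden

# With pipefail, the pipeline reports the last (rightmost) command that failed.
set -o pipefail
status_with=0
false | cat || status_with=$?      # becomes 1: the failure of `false` surfaces

echo "without pipefail: $status_without, with pipefail: $status_with"
# prints "without pipefail: 0, with pipefail: 1"
```

This matters for a script that pipes the analysis output through further filters: without the option, a failed analysis command followed by a successful filter would look like success.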
This commit improves the analyze_failures.sh script by adding support for preprocessing large log files using a subagent pattern. It introduces functions to extract relevant error messages and context from large logs, creating a summary file for quick access during analysis. Additionally, the script now checks for the availability of the Claude CLI before execution and captures exit codes properly to ensure accurate error handling.

Changes:
- Added functions for extracting errors from large log files.
- Implemented preprocessing of large artifacts to create a summary of extracted errors.
- Enhanced documentation to clarify the analysis process and artifact usage.

These updates aim to streamline the failure analysis workflow and improve the accuracy of insights generated from log files.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
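
The preprocessing and CLI-availability check described in this commit might be sketched as follows. Both function names, the match pattern, and the default limits are assumptions made for illustration, not the script's actual code.

```shell
# Hypothetical helper: print lines that look like errors, with line numbers
# and a few lines of surrounding context, capped at a maximum line count.
extract_errors() {
  local log_file="$1" context="${2:-2}" max_lines="${3:-200}"
  # `|| true` tolerates "no match" and an early exit of `head` under pipefail.
  { grep -n -i -E -C "$context" 'error|fail|panic' "$log_file" || true; } \
    | head -n "$max_lines"
}

# Hypothetical guard: succeed only if the named executable is on PATH.
require_cli() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "$1 not found on PATH; skipping analysis" >&2
    return 1
  }
}
```

Running `require_cli claude` before the analysis and writing the output of `extract_errors` for each large log into a single summary file gives the model a small, focused input instead of multi-megabyte logs.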
@kaovilai
Member Author

/test all

…e analysis

This commit updates the `.claude/config.json` file to remove specific path permissions that are now handled at runtime using the `--allowedTools` flag. The documentation in `docs/design/claude-prow-failure-analysis_design.md` has been expanded to clarify the new permission model, emphasizing the use of runtime permissions to bypass sandbox restrictions. Additionally, the `analyze_failures.sh` script has been updated to utilize the `--allowedTools` flag for explicit file access during analysis.

Changes:
- Removed specific path permissions from `.claude/config.json`.
- Enhanced documentation to explain the runtime permissions model.
- Updated `analyze_failures.sh` to use `--allowedTools` for file access.

These updates aim to improve the clarity and functionality of the failure analysis process in Prow CI.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai
Member Author

/test all

@kaovilai
Member Author

/retest

ai-retester: Both failures are “intentional” – they’re deliberately triggered by the test author to make sure the Claude‑based failure‑analysis plumbing works.

  1. Claude Analysis Test Failure – The test asserts true should be false. That is exactly what the test code is checking, so it fails as expected.

  2. MySQL CSI Claude Test (INTENTIONAL FAILURE) – In this test the code raises an error (CLAUDE TEST: This is an intentional test failure…) after the backup/restore flow, which is meant to force a test failure so the Claude analysis is invoked. Hence both errors are intentional, not a bug in the deployment itself.

comment for /pull/2034

- Add --allowedTools flag to grant Read access to artifact paths
- Fix argument order: --allowedTools must come before --print
- Add AVAILABLE TOOLS section to prompts so Claude knows its constraints
- Simplify .claude/config.json (path permissions handled at runtime)
- Update design doc with runtime permissions approach

The Claude Code CLI sandbox restricts filesystem access to the current
working directory. Since artifacts are at /logs/artifacts/ (outside CWD),
we use --allowedTools to explicitly grant Read permissions at runtime.
…n and scripts

- Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script.
- Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions.
- Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation.

These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
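
For readers unfamiliar with the flags discussed in these commits, the sketch below shows one way the invocation could be assembled in bash. The helper name `build_claude_cmd`, the array name `CLAUDE_CMD`, and the prompt text are hypothetical; only the `--add-dir` and `--print` flags and the `/logs/artifacts` path come from this thread, and the flag order simply follows the earlier note about placing the permission flag before `--print`.

```shell
# Hypothetical helper: fill the global array CLAUDE_CMD with a headless
# invocation that grants access to the artifact directory outside the CWD.
build_claude_cmd() {
  local artifact_dir="$1" prompt="$2"
  # The directory flag is placed before --print, as in the commit notes.
  CLAUDE_CMD=(claude --add-dir "$artifact_dir" --print "$prompt")
}

build_claude_cmd /logs/artifacts "Analyze the E2E test failures"
# Show the command, safely quoted, without executing it.
printf '%q ' "${CLAUDE_CMD[@]}"; echo
```

Building the command as an array keeps the prompt intact as a single argument; the script would then run it with `"${CLAUDE_CMD[@]}"`.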
@kaovilai
Member Author

/test all

@kaovilai
Member Author

Current output

Perfect! Now I have all the information I need. Let me generate the comprehensive failure analysis report:

OADP E2E Test Failure Analysis

Generated by Claude via Vertex AI on 2025-11-28 16:02 UTC

Executive Summary

  • Total Tests: 51
  • Failed Tests: 2
  • Known Flakes: 0
  • Critical Issues: 0 (both failures are intentional test cases)
  • Environmental Issues: 0

Analysis Result: All test failures are intentional test cases designed to verify the Claude AI failure analysis system. The OADP backup/restore functionality is working correctly - backups completed successfully, restores completed successfully, and cluster health is good.

Failed Tests Analysis

1. MySQL CSI Claude Test (INTENTIONAL FAILURE) [FLAKE]

Root Cause: Intentional test failure injected via custom verification function to validate Claude analysis pipeline

Evidence:

junit_report.xml: "CLAUDE TEST: This is an intentional test failure to verify Claude analysis script"
Test code (/go/src/github.com/openshift/oadp-operator/tests/e2e/backup_restore_suite_test.go:421):
  return fmt.Errorf("CLAUDE TEST: This is an intentional test failure to verify Claude analysis script")
must-gather: Both backup and restore completed successfully (Phase: Completed)
  - Backup: mysql-csi-claude-test-405f0736-cc6f-11f0-a9e6-0a580a81a1fb (Completed)
  - Restore: mysql-csi-claude-test-405f073b-cc6f-11f0-a9e6-0a580a81a1fb (Completed)

Diagnosis: This test intentionally injects a failure in the PostRestoreVerify custom function at line 421 of backup_restore_suite_test.go. The actual backup and restore operations completed successfully:

  1. Backup Phase (15:32:04 - 15:33:11): Successfully created CSI snapshot (snap-0df4d9ac3e001128b) of 1GB MySQL PVC
  2. Restore Phase (15:34:00 - 15:34:03): Successfully restored from snapshot, created PVC and pods
  3. Failure Injection (15:34:11): Test deliberately returns error from custom verification function

The test demonstrates that OADP functionality is working correctly - the failure is purely synthetic.

Likely Cause: Test scaffolding for AI analysis validation (not a real failure)

Recommended Actions:

  1. Remove this test after validating the Claude analysis system is working
  2. This test should not be included in production CI runs
  3. Consider moving to a separate test suite for tooling validation

Related Issues: N/A - This is a deliberate test case, not a bug


2. Claude Analysis Test Failure [FLAKE]

Root Cause: Intentional assertion failure to verify Claude analysis script functionality

Evidence:

junit_report.xml: "Expected <bool>: true to be false"
Test code (/go/src/github.com/openshift/oadp-operator/tests/e2e/claude_test_failure_test.go:13):
  gomega.Expect(true).To(gomega.BeFalse(), "This is an intentional failure to test Claude analysis script")

Diagnosis: This is a standalone test case (claude_test_failure_test.go) that deliberately invokes an impossible assertion (expecting true to be false). The test file includes clear documentation:

// This test is intentionally designed to fail for testing Claude failure analysis
// It should be removed after verifying the Claude analysis script works correctly

This test executes instantly (0.000228615s) and has no actual OADP operations - it's purely a validation test for the analysis pipeline.

Likely Cause: Tooling validation test (not a real failure)

Recommended Actions:

  1. Delete /go/src/github.com/openshift/oadp-operator/tests/e2e/claude_test_failure_test.go after Claude analysis validation is complete
  2. Remove the "MySQL CSI Claude Test" entry from backup_restore_suite_test.go (lines 412-423)
  3. Update CI configuration to exclude these tests from production runs

Related Issues: N/A - This is a deliberate test case designed to be temporary


Known Flakes Detected

✗ VolumeSnapshotBeingCreated race condition (not detected)
✗ AWS rate limiting (not detected)

No known flake patterns were found in the logs.

Cluster Health Summary

From must-gather analysis:

OADP Components:

  • Velero deployment: Running successfully (pod: velero-bc95449f7-8mmdz)
  • OADP Controller: Running (pod: openshift-adp-controller-manager-d899f68bc-vjznh)
  • Backup Storage Location: ts-velero-test-1 - Available ✅
  • Volume Snapshot Location: None configured (expected for CSI-only backups)
  • OADP Version: 99.0.0 (development build)

Cluster Resources:

  • Platform: OpenShift 4.20.0-0.nightly-2025-11-27-212931 on AWS
  • Architecture: linux/amd64
  • CSI drivers: ebs.csi.aws.com available
  • Storage classes: Available (gp2-csi storage class in use)

Backup/Restore Operations:
Successfully completed 2 backup/restore cycles during test run:

  1. mysql-csi-e2e-9844e6ae: Backup Completed (15:27:17-15:28:12), Restore Completed (15:29:03-15:29:06)
  2. mysql-csi-claude-test: Backup Completed (15:32:04-15:33:11), Restore Completed (15:34:00-15:34:03)

All CSI snapshots created successfully:

  • snap-0bff8df5fc977aa6d (1GB)
  • snap-0df4d9ac3e001128b (1GB)

DataProtectionTest:

  • e2e-uploadtest-dpt-1764344071: Complete ✅
  • Upload speed: 283 MBps
  • Encryption: AES256 enabled

Recent Events:

  • Some expected restore warnings for pre-existing resources (ConfigMaps, Secrets, RoleBindings) - these are normal
  • Minor errors about authorization.openshift.io/v1 RoleBindings not supporting watch (expected, handled by fallback)
  • No critical errors in velero.log (preprocessed-logs.txt shows only debug-level EOF messages from plugin lifecycle)

Recommendations (Prioritized)

Immediate Actions (Critical)

None required - No critical bugs detected. Both failures are intentional test cases.

Investigation Needed

None required - All OADP functionality operating correctly.

Flake Handling

None required - No flakes detected.

Configuration Review

  1. Remove validation tests: Delete temporary Claude analysis test files after validation complete
    • File: /go/src/github.com/openshift/oadp-operator/tests/e2e/claude_test_failure_test.go
    • Lines 412-423 in /go/src/github.com/openshift/oadp-operator/tests/e2e/backup_restore_suite_test.go
  2. Test suite organization: Consider creating separate test suite for tooling/pipeline validation tests
  3. CI configuration: Ensure intentional failure tests are excluded from production test runs

Analysis Confidence

  • High Confidence: Both failures are intentional test cases (100% certainty from source code inspection)
  • Medium Confidence: N/A
  • Low Confidence: N/A

Suggested Next Steps for Developer

  1. Validate Claude analysis is working - This analysis confirms the AI system can correctly identify and analyze test failures
  2. Clean up test scaffolding: Remove the two intentional failure tests now that validation is complete
  3. Document the pattern: If periodic validation of the analysis pipeline is needed, create a dedicated test suite
  4. Re-run tests: After removing intentional failures, the test suite should have 100% pass rate (49 passed tests, 30 skipped, 0 failed)

Additional Context

The test run demonstrates excellent OADP functionality:

  • CSI snapshot backup/restore: Working perfectly
  • Backup storage: S3-compatible storage (AWS) functioning correctly
  • Restore operations: All resources restored successfully with expected warnings
  • Performance: 283 MBps upload speed, efficient snapshot operations
  • No real failures: The only failures are synthetic test cases

Passing Tests Include:

  • MySQL application CSI backup/restore (baseline test passed)
  • Multiple BSL with custom CA cert handling (3 BSLs tested)
  • DPA reconciliation tests (11 different DPA configurations tested)
  • DPA deletion test passed
  • All tests confirm OADP operator is functioning correctly

The test suite demonstrates comprehensive validation of OADP deployment, configuration, and backup/restore capabilities on AWS infrastructure.

@weshayutin
Contributor

that's pretty hot! I'll be back tomorrow if you guys want to chat :) THANK YOU! For pushing this along

@weshayutin
Contributor

/test 4.20-e2e-test-aws

mpryc added a commit to mpryc/oadp-operator that referenced this pull request Dec 3, 2025
Original PR: openshift#2034

Author: Tiger Kaovilai <tkaovila@redhat.com>
Date:   Fri Nov 28 09:17:23 2025 -0500

    refactor: update runtime permissions in failure analysis documentation and scripts

    - Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script.
    - Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions.
    - Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation.

    These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime.

    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@weshayutin
Contributor

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2025
@weshayutin
Contributor

moving to #2038

@openshift-ci

openshift-ci bot commented Dec 3, 2025

@kaovilai: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/4.20-e2e-test-kubevirt-aws | 57152d3 | link | true | `/test 4.20-e2e-test-kubevirt-aws` |
| ci/prow/4.20-e2e-test-cli-aws | 57152d3 | link | true | `/test 4.20-e2e-test-cli-aws` |
| ci/prow/4.20-e2e-test-aws | 57152d3 | link | true | `/test 4.20-e2e-test-aws` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

