[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mmcmanus1/rlhf-canary/blob/main/notebooks/07_ci_cd_integration.ipynb)

# CI/CD Integration: GitHub Actions & PR Gating

Integrate RLHF Canary into your development workflow with automated PR checks, GitHub comments, and nightly regression testing.

**What you'll learn:**
1. Using the `gh-report` command for GitHub integration
2. Understanding PR comment formatting and commit status
3. Setting up GitHub Actions workflows for PR gating
4. Configuring nightly soak tests
5. Customizing workflow files for your repository
6. Programmatic access to GitHub integration functions

**Requirements:** GPU runtime (Runtime > Change runtime type > T4 GPU)

**Runtime:** ~15 minutes (includes a canary run)

## 1. Setup

In [None]:
import os
import re
import sys

print("Starting Environment Setup...")

# --- 1. Clone the repo first ---
if not os.path.exists("/content/rlhf-canary"):
    !git clone https://github.com/mmcmanus1/rlhf-canary.git /content/rlhf-canary

%cd /content/rlhf-canary

# --- 2. Force-Install the "Safe Harbor" Stack ---
# These specific versions avoid the TRL 0.12+ dtype bug and Transformers 4.46+ generator bug
!pip install "trl==0.11.4" "transformers==4.44.2" "peft==0.12.0" "accelerate==0.34.2" "tokenizers==0.19.1" --force-reinstall --no-deps --quiet
!pip install -q datasets pydantic click PyYAML bitsandbytes
print("Libraries installed (TRL 0.11.4 / Transformers 4.44.2)")

# --- 3. Patch pyproject.toml (Prevent future drift) ---
project_file = "/content/rlhf-canary/pyproject.toml"
if os.path.exists(project_file):
    with open(project_file, "r") as f:
        content = f.read()
    
    if "trl==0.11.4" not in content:
        content = re.sub(r'trl[<>=!~]+[\d\.]+', 'trl==0.11.4', content)
        with open(project_file, "w") as f:
            f.write(content)
        print("Config file patched to lock TRL 0.11.4")

# --- 4. Patch Source Code (Compatibility Fix) ---
runner_file = "/content/rlhf-canary/canary/runner/local.py"
if os.path.exists(runner_file):
    with open(runner_file, "r") as f:
        code = f.read()
    
    if "processing_class=" in code:
        code = code.replace("processing_class=", "tokenizer=")
        with open(runner_file, "w") as f:
            f.write(code)
        print("Code patched: Reverted 'processing_class' to 'tokenizer'")
    else:
        print("Code is already compatible.")

# --- 5. Install the package ---
!pip install -e . --quiet

print("Environment Ready!")

In [None]:
# Verify GPU is available
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Understanding GitHub Integration

RLHF Canary provides built-in GitHub integration through the `gh-report` command. This command:

1. **Posts a PR comment** with detailed regression analysis
2. **Updates commit status** (success/failure) for PR gating
3. **Writes to GitHub job summary** (visible in Actions)
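
The job summary and step outputs are plain files whose paths GitHub Actions exposes through `GITHUB_STEP_SUMMARY` and `GITHUB_OUTPUT`. The sketch below shows one way a tool could append to them. It is not RLHF Canary's implementation: `append_line_to_env_file` is a hypothetical helper, and the runner is simulated with temporary files.

```python
import os
import tempfile
from pathlib import Path


def append_line_to_env_file(var_name, text, env=os.environ):
    """Append text to the file named by an Actions variable; no-op outside CI."""
    target = env.get(var_name)
    if not target:
        return False  # Variable unset: not running inside GitHub Actions
    with open(target, "a", encoding="utf-8") as fh:
        fh.write(text.rstrip("\n") + "\n")
    return True


# Simulate the runner by pointing both variables at temporary files.
with tempfile.TemporaryDirectory() as tmp:
    fake_env = {
        "GITHUB_STEP_SUMMARY": str(Path(tmp) / "summary.md"),
        "GITHUB_OUTPUT": str(Path(tmp) / "output.txt"),
    }
    append_line_to_env_file("GITHUB_STEP_SUMMARY", "## RLHF Canary: PASSED", fake_env)
    append_line_to_env_file("GITHUB_OUTPUT", "passed=true", fake_env)
    summary_text = Path(fake_env["GITHUB_STEP_SUMMARY"]).read_text(encoding="utf-8")
    output_text = Path(fake_env["GITHUB_OUTPUT"]).read_text(encoding="utf-8")

print(summary_text, output_text, sep="")
```

Returning `False` when the variable is unset keeps the same code usable on a laptop, which matches how `gh-report` is described as degrading outside CI.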

### CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `--post-comment` / `--no-comment` | Post PR comment | Yes |
| `--update-status` / `--no-status` | Update commit status | Yes |
| `--threshold-tier` | Threshold tier (smoke/default/perf/nightly) | default |
| `--threshold-file` | Custom threshold file | None |
| `-o` / `--output` | Save report to file | None |

### Environment Variables Used

In GitHub Actions, these are auto-detected:

| Variable | Description |
|----------|-------------|
| `GITHUB_REPOSITORY` | Repository (owner/repo) |
| `GITHUB_REF` | Git ref; on pull requests it has the form `refs/pull/<number>/merge`, which yields the PR number |
| `GITHUB_SHA` | Commit SHA for status |
| `GITHUB_OUTPUT` | For setting outputs |
| `GITHUB_STEP_SUMMARY` | For job summary |
| `GH_TOKEN` | Authentication token |
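
To make the auto-detection concrete, here is a minimal sketch of how the repository and PR number could be derived from these variables. `detect_github_context` is a hypothetical function, not part of the `canary` package, and the values in `example_env` are placeholders.

```python
import re


def detect_github_context(env):
    """Extract owner, repo, PR number, and SHA from Actions-style variables."""
    owner, _, repo = env.get("GITHUB_REPOSITORY", "").partition("/")
    # pull_request events use refs of the form refs/pull/<number>/merge
    match = re.fullmatch(r"refs/pull/(\d+)/merge", env.get("GITHUB_REF", ""))
    return {
        "owner": owner or None,
        "repo": repo or None,
        "pr_number": int(match.group(1)) if match else None,
        "sha": env.get("GITHUB_SHA") or None,
    }


example_env = {
    "GITHUB_REPOSITORY": "octo-org/octo-repo",
    "GITHUB_REF": "refs/pull/42/merge",
    "GITHUB_SHA": "0123abc",
}
context = detect_github_context(example_env)
print(context)
```

On a push to a branch, `GITHUB_REF` looks like `refs/heads/main`, so `pr_number` comes back as `None` and a caller would skip the PR comment.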

## 3. Running a Baseline and Comparison

Before using `gh-report`, let's create some metrics to work with.

In [None]:
# Run DPO smoke test to create baseline metrics
!python -m canary.cli run configs/dpo_smoke.yaml -o ./canary_output/baseline

In [None]:
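# Find and save the baseline
from pathlib import Path

# next(..., None) avoids a bare StopIteration if the run produced no metrics
baseline_path = next(Path('./canary_output/baseline').rglob('metrics.json'), None)
assert baseline_path is not None, "No metrics.json found; did the canary run above succeed?"
!mkdir -p baselines
!cp {baseline_path} baselines/main.json
print("Baseline saved to baselines/main.json")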

In [None]:
# Run another canary to compare
!python -m canary.cli run configs/dpo_smoke.yaml -o ./canary_output/current

In [None]:
current_path = next(Path('./canary_output/current').rglob('metrics.json'), None)
assert current_path is not None, "No metrics.json found; did the canary run above succeed?"
print(f"Current metrics: {current_path}")

### Standard Compare Output

The `compare` command outputs markdown by default:

In [None]:
# Standard comparison - markdown output
!python -m canary.cli compare {current_path} baselines/main.json --threshold-tier smoke

### JSON Output for Automation

Use `--json` for programmatic access to comparison results:

In [None]:
# JSON output - useful for dashboards and automation
!python -m canary.cli compare {current_path} baselines/main.json --threshold-tier smoke --json

In [None]:
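# Parse JSON output programmatically
import json
import subprocess
import sys

# sys.executable runs the CLI with the same Python environment as this notebook
result = subprocess.run(
    [sys.executable, "-m", "canary.cli", "compare", str(current_path), "baselines/main.json",
     "--threshold-tier", "smoke", "--json"],
    capture_output=True,
    text=True,
)

# Surface CLI errors instead of failing later with an opaque JSONDecodeError
if not result.stdout.strip():
    raise RuntimeError(f"compare produced no output:\n{result.stderr}")

report = json.loads(result.stdout)
print(f"Passed: {report['passed']}")
print(f"Checks failed: {sum(1 for c in report['checks'] if not c['passed'])}")
print(f"Checks passed: {sum(1 for c in report['checks'] if c['passed'])}")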

## 4. The `gh-report` Command

The `gh-report` command extends `compare` with GitHub integration. When run locally (outside CI), it shows what would be posted:

In [None]:
# Run gh-report locally (will show warnings about missing GitHub context)
!python -m canary.cli gh-report {current_path} baselines/main.json --threshold-tier smoke -o comparison.md

In [None]:
# View the generated report
!cat comparison.md

### PR Comment Format

When posted to a PR, the comment includes:

1. **Collapsible summary** - Shows pass/fail at a glance
2. **Detailed metrics** - Step time, throughput, memory
3. **Stability checks** - NaN/Inf detection, loss divergence
4. **Root cause analysis** - When regressions are detected

The comment uses `<details>` tags to keep PRs clean:

```html
<details>
<summary>RLHF Canary Results: PASSED (6/6 checks)</summary>

[Full report here]

</details>
```
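
The wrapper is simple enough to sketch in a few lines. The helper below is hypothetical (it is not the function `gh-report` uses) and only mirrors the summary format shown above.

```python
def build_collapsible_comment(passed, checks_passed, checks_total, report_markdown):
    """Wrap a markdown report in a <details> block with a one-line summary."""
    status = "PASSED" if passed else "FAILED"
    summary = f"RLHF Canary Results: {status} ({checks_passed}/{checks_total} checks)"
    # Blank lines around the body let GitHub render markdown inside <details>.
    return (
        f"<details>\n<summary>{summary}</summary>\n\n"
        f"{report_markdown.strip()}\n\n</details>"
    )


comment = build_collapsible_comment(
    True, 6, 6, "| Check | Status |\n|---|---|\n| step_time | ok |"
)
print(comment)
```

The blank lines before and after the report body matter: without them, GitHub tends to show tables and headings inside `<details>` as raw text.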

## 5. GitHub Actions Workflow Anatomy

RLHF Canary includes two workflow files:

### PR Canary (`workflows/pr_canary.yml`)

Runs on every pull request to main. Key steps:

1. **Checkout** - Get code
2. **Install** - Set up Python and dependencies
3. **Unit tests** - Run pytest first
4. **Download baseline** - Get baseline from artifact storage
5. **Run canary** - Execute smoke test
6. **Compare** - Check for regressions
7. **Post results** - Comment and status
8. **Upload artifacts** - Store metrics for future baselines

In [None]:
# View the PR workflow file
!cat workflows/pr_canary.yml

### Nightly Canary (`workflows/nightly_canary.yml`)

Runs daily at 2 AM UTC. Key differences from PR:

| Aspect | PR Canary | Nightly Canary |
|--------|-----------|----------------|
| Trigger | Pull request | Schedule (cron) |
| Config | `dpo_smoke.yaml` | `dpo_nightly.yaml` |
| Training steps | 100 | 2000 |
| Timeout | 15 min | 180 min |
| Thresholds | smoke (lenient) | nightly (strict) |
| Purpose | Gate PRs | Catch slow regressions |

In [None]:
# View the nightly workflow file
!cat workflows/nightly_canary.yml

### Key Workflow Components

**Baseline Management with Artifacts:**

```yaml
# Download baseline from previous successful run
- uses: dawidd6/action-download-artifact@v3
  with:
    workflow: pr_canary.yml
    branch: main
    name: canary-baseline

# Upload new baseline (only on main)
- if: github.ref == 'refs/heads/main'
  uses: actions/upload-artifact@v4
  with:
    name: canary-baseline
    retention-days: 90
```
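
Because artifacts expire and the first run has nothing to download, any script that consumes the baseline should tolerate its absence. The following is a sketch under that assumption; `load_baseline_or_none` is a hypothetical helper, and the `{"run_id": "example"}` payload is placeholder data rather than a real metrics file.

```python
import json
import tempfile
from pathlib import Path


def load_baseline_or_none(baseline_path):
    """Return parsed baseline metrics, or None when no usable baseline exists."""
    path = Path(baseline_path)
    if not path.is_file():
        return None  # First run, or the artifact expired
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None  # Treat a corrupt download like a missing baseline


with tempfile.TemporaryDirectory() as tmp:
    missing = load_baseline_or_none(Path(tmp) / "baselines" / "main.json")
    present_path = Path(tmp) / "main.json"
    present_path.write_text(json.dumps({"run_id": "example"}), encoding="utf-8")
    present = load_baseline_or_none(present_path)

print("missing baseline ->", missing)
print("present baseline ->", present)
```

A caller can then skip the comparison when the result is `None` and let the run pass.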

**Posting Results:**

```yaml
- env:
    GH_TOKEN: ${{ github.token }}
  run: canary gh-report "$CURRENT" "$BASELINE" --threshold-tier smoke
```
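
If the `gh` CLI is not available on a runner, a comment can also be posted through the GitHub REST API. The sketch below only builds the request so that it can be inspected without network access. It is an alternative to, not a description of, what `gh-report` does, and the owner, repository, PR number, and token are placeholders.

```python
import json
import urllib.request


def build_comment_request(owner, repo, pr_number, body, token):
    """Build (but do not send) a REST request that creates a PR comment."""
    # PR comments are created through the issues endpoint.
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments"
    payload = json.dumps({"body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_comment_request("octo-org", "octo-repo", 42, "Canary: PASSED", "dummy-token")
print(request.get_method(), request.full_url)
# To send it from CI: urllib.request.urlopen(request, timeout=30)
```

For the call to succeed in a workflow, the token passed as `GH_TOKEN` needs write permission on pull requests or issues.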

**Fail on Regression:**

```yaml
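# Assumes the comparison step sets `id: compare` and `continue-on-error: true`,
# so later steps (such as posting results) still run before the job is failed here.
- if: steps.compare.outcome == 'failure'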
  run: |
    echo "::error::Canary regression detected!"
    exit 1
```
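
The same gate can be expressed in Python when the comparison result is available as JSON. This sketch relies only on the `passed` and `checks` fields that the `--json` output exposes earlier in this notebook; `exit_code_for` is a hypothetical helper.

```python
import json


def exit_code_for(report_json):
    """Map a comparison report (JSON text) to a process exit code."""
    report = json.loads(report_json)
    failed = [check for check in report.get("checks", []) if not check.get("passed")]
    if report.get("passed") and not failed:
        return 0
    # The ::error:: prefix makes GitHub Actions annotate the log line.
    print(f"::error::Canary regression detected ({len(failed)} failed checks)")
    return 1


passing = json.dumps({"passed": True, "checks": [{"passed": True}]})
failing = json.dumps({"passed": False, "checks": [{"passed": True}, {"passed": False}]})
print(exit_code_for(passing), exit_code_for(failing))
```

In a workflow step, the return value would be passed to `sys.exit` so that a regression fails the job.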

## 6. Customizing for Your Repository

### Step 1: Copy Workflow Files

```bash
# In your repository
mkdir -p .github/workflows
cp rlhf-canary/workflows/pr_canary.yml .github/workflows/
cp rlhf-canary/workflows/nightly_canary.yml .github/workflows/
```

### Step 2: Configure GPU Runners

For real canary testing, you need GPU runners. Options:

1. **Self-hosted runners** (recommended for production)
2. **GitHub-hosted GPU runners** (Actions GPU beta)
3. **External CI** (e.g., Modal, Lambda Labs)

Update the workflow:

```yaml
jobs:
  canary-smoke:
    runs-on: [self-hosted, gpu]  # Your GPU runner label
```

### Step 3: Customize Thresholds

Create custom thresholds for your use case:

In [None]:
# Example: Create project-specific thresholds
custom_thresholds = """
# Custom thresholds for my-project
base_tier: smoke

# We allow more variance on our noisy GPU cluster
max_step_time_increase_pct: 20.0
max_tps_drop_pct: 15.0

# But we're strict about memory (limited VRAM)
max_mem_increase_mb: 256.0
max_mem_increase_pct: 10.0

# Zero tolerance for NaN/Inf
nan_steps_allowed: 0
inf_steps_allowed: 0
"""

with open('my_thresholds.yaml', 'w') as f:
    f.write(custom_thresholds)

print("Custom thresholds saved to my_thresholds.yaml")

In [None]:
# Use custom thresholds in comparison
!python -m canary.cli compare {current_path} baselines/main.json --threshold-file my_thresholds.yaml
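
As a worked example of what these limits mean, the sketch below applies three of them to made-up measurements. It shows one plausible reading of the threshold names (percentage increase in step time, percentage drop in throughput, absolute memory growth) and may differ from how `canary` evaluates them internally. All metric values and dictionary keys other than the threshold names are illustrative assumptions.

```python
def pct_change(current, baseline):
    """Percentage change of current relative to baseline."""
    return (current - baseline) / baseline * 100.0


def check_perf(current, baseline, thresholds):
    """Return {check_name: passed} for step time, throughput, and memory."""
    step_increase = pct_change(current["step_time_s"], baseline["step_time_s"])
    tps_drop = -pct_change(current["tokens_per_sec"], baseline["tokens_per_sec"])
    mem_increase_mb = current["peak_mem_mb"] - baseline["peak_mem_mb"]
    return {
        "step_time": step_increase <= thresholds["max_step_time_increase_pct"],
        "throughput": tps_drop <= thresholds["max_tps_drop_pct"],
        "memory": mem_increase_mb <= thresholds["max_mem_increase_mb"],
    }


example_thresholds = {
    "max_step_time_increase_pct": 20.0,
    "max_tps_drop_pct": 15.0,
    "max_mem_increase_mb": 256.0,
}
example_baseline = {"step_time_s": 0.50, "tokens_per_sec": 4000.0, "peak_mem_mb": 9000.0}
example_current = {"step_time_s": 0.56, "tokens_per_sec": 3700.0, "peak_mem_mb": 9300.0}
example_result = check_perf(example_current, example_baseline, example_thresholds)
print(example_result)
```

Here step time rises by about 12% and throughput drops by 7.5%, both within the limits, while memory grows by 300 MB and exceeds the 256 MB cap, so only the memory check fails.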

### Step 4: Workflow File Customizations

Common customizations:

```yaml
# Change config file
run: canary run configs/your_custom_config.yaml

# Use custom thresholds
run: canary gh-report "$CURRENT" "$BASELINE" --threshold-file .canary/thresholds.yaml

# Increase artifact retention
with:
  retention-days: 180  # Keep 6 months of history

# Add Slack notification on failure
- if: failure()
  uses: 8398a7/action-slack@v3
  with:
    status: failure
    text: "Canary regression detected!"
```

## 7. Programmatic GitHub Functions

For advanced integrations, you can use the Python API directly:

In [None]:
# Import GitHub integration functions
from canary.report.github import (
    post_github_comment,
    update_pr_status,
    write_github_summary,
    set_github_output,
)

# These functions are also available:
from canary.report.markdown import (
    generate_markdown_report,
    generate_short_summary,
)

print("GitHub integration functions:")
print("  - post_github_comment(report, current, baseline)")
print("  - update_pr_status(report)")
print("  - write_github_summary(markdown)")
print("  - set_github_output(key, value)")
print("")
print("Report generation functions:")
print("  - generate_markdown_report(report, current, baseline)")
print("  - generate_short_summary(report)")

In [None]:
# Example: Generate a custom integration
from canary.compare.stats import compare_to_baseline, load_metrics
from canary.compare.thresholds import SMOKE_THRESHOLDS

# Load metrics
current = load_metrics(str(current_path))
baseline = load_metrics('baselines/main.json')

# Run comparison
report = compare_to_baseline(current, baseline, SMOKE_THRESHOLDS)

# Generate outputs
summary = generate_short_summary(report)
markdown = generate_markdown_report(report, current, baseline)

print(f"Summary: {summary}")
print()
print(f"Full report length: {len(markdown)} characters")

In [None]:
# Example: Custom dashboard integration
import json


def send_to_dashboard(report, current, baseline):
    """Example: Send results to a custom dashboard."""
    data = {
        "passed": report.passed,
        # Assumes run_id has the form "<prefix>_<timestamp>"
        "timestamp": current.run_id.split("_")[1],
        "metrics": {
            # Multiplying by 1000 assumes step_time.mean is in seconds
            "step_time_ms": current.perf.step_time.mean * 1000,
            "tokens_per_sec": current.perf.approx_tokens_per_sec,
            "peak_memory_mb": current.perf.max_mem_mb,
        },
        "deltas": {
            "step_time_pct": report.perf_delta.get("step_time_increase_pct", 0),
            "tps_drop_pct": report.perf_delta.get("tps_drop_pct", 0),
        },
        "checks_passed": sum(1 for c in report.checks if c.passed),
        "checks_total": len(report.checks),
    }

    # In a real implementation, you'd POST this to your dashboard
    # requests.post("https://your-dashboard.com/api/canary", json=data)

    print("Dashboard payload:")
    print(json.dumps(data, indent=2))


send_to_dashboard(report, current, baseline)

## 8. Troubleshooting CI Issues

### Common Problems

| Issue | Cause | Solution |
|-------|-------|----------|
| "gh CLI not found" | `gh` not installed | Add `gh` to runner or use API directly |
| "Could not detect repo" | Missing env vars | Set `GITHUB_REPOSITORY` manually |
| "No baseline found" | First run or expired artifact | No action needed: the run passes without comparison |
| "Permission denied" | Token lacks PR permissions | Check `GH_TOKEN` permissions |
| Timeout | Config too large for runner | Reduce `max_steps` or use a GPU runner |

### Testing Workflows Locally

Use [act](https://github.com/nektos/act) to test GitHub Actions locally:

```bash
# Install act
brew install act  # macOS

# Run workflow locally (CPU only)
act pull_request -W .github/workflows/pr_canary.yml

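# Pass a token as a secret (quoted so the shell does not split or glob the value)
act pull_request -W .github/workflows/pr_canary.yml -s GITHUB_TOKEN="$GITHUB_TOKEN"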
```

### Debug Mode

Add debug logging to workflows:

```yaml
- name: Debug info
  run: |
    echo "GITHUB_REPOSITORY: $GITHUB_REPOSITORY"
    echo "GITHUB_REF: $GITHUB_REF"
    echo "GITHUB_SHA: $GITHUB_SHA"
    canary env
```
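
When the same check is needed from Python, for example inside a custom integration script, a small helper can report which of the variables from Section 2 are set without leaking the token. This is a sketch: `describe_github_env` is a hypothetical helper, not a `canary` function.

```python
import os

EXPECTED_VARS = [
    "GITHUB_REPOSITORY", "GITHUB_REF", "GITHUB_SHA",
    "GITHUB_OUTPUT", "GITHUB_STEP_SUMMARY", "GH_TOKEN",
]


def describe_github_env(env=os.environ):
    """Return printable lines showing which expected variables are set."""
    lines = []
    for name in EXPECTED_VARS:
        value = env.get(name)
        if not value:
            shown = "<missing>"
        elif name == "GH_TOKEN":
            shown = "<set, hidden>"  # Never echo secrets into logs
        else:
            shown = value
        lines.append(f"{name}: {shown}")
    return lines


debug_lines = describe_github_env(
    {"GITHUB_REPOSITORY": "octo-org/octo-repo", "GH_TOKEN": "secret"}
)
for line in debug_lines:
    print(line)
```

Calling `describe_github_env()` with no argument inspects the real process environment.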

## 9. Summary

### Key Takeaways

1. **`gh-report`** posts PR comments and commit status automatically
2. **Two workflows** provided: PR gating (smoke) and nightly (comprehensive)
3. **Baselines** are managed via GitHub artifacts with retention policies
4. **Custom thresholds** let you tune sensitivity for your environment
5. **JSON output** enables integration with dashboards and alerting
6. **Programmatic API** available for custom integrations

### Integration Checklist

- [ ] Copy workflow files to `.github/workflows/`
- [ ] Configure GPU runners (or use CPU for testing)
- [ ] Create initial baseline from main branch
- [ ] Customize thresholds for your hardware/model
- [ ] Test with a PR to verify integration
- [ ] Set up alerting for nightly failures

### Next Steps

- [01_quickstart.ipynb](01_quickstart.ipynb) - Core workflow basics
- [02_profiler_deep_dive.ipynb](02_profiler_deep_dive.ipynb) - Performance diagnostics
- [08_configuration_and_thresholds.ipynb](08_configuration_and_thresholds.ipynb) - Advanced configuration