# Automated Test Results Analysis

This notebook analyzes the **automated CI/CD test results** from the app.build benchmark.

**Data Source**: Pre-processed CSV files from the `analysis/results/` directory containing:
- Binary success/failure from automated tests
- Token usage and timing metrics
- Docker health checks and template failures

**Note**: This is separate from the human evaluation results. For human-evaluated analysis, see `experiments_baseline_ablation_analysis.ipynb`.


## 1. Baseline (Closed APIs) vs Open Models Comparison


In [4]:
from __future__ import annotations
import pandas as pd
import numpy as np
from pathlib import Path

# Load pre-processed results from CSV files
RESULTS_DIR = Path("results") if Path("results").exists() else Path("/Users/evgenii.kniazev/projects/app.build-neurips25/analysis/results")

# Load baseline and openmodels data
baseline_raw = pd.read_csv(RESULTS_DIR / "baseline" / "raw_results.csv")
openmodels_raw = pd.read_csv(RESULTS_DIR / "openmodels" / "raw_results.csv")

# Calculate cost based on model family
PRICES = {"anthropic": (3.0, 15.0), "gemini": (0.15, 0.60), "openai": (5.0, 15.0)}

def get_model_family(coding_model: str, universal_model: str) -> str:
    """Determine model family from coding and universal model names."""
    combined = f"{coding_model} {universal_model}".lower()
    if "claude" in combined:
        return "anthropic"
    elif "gemini" in combined:
        return "gemini"
    elif "gpt" in combined and "gpt-oss" not in combined:
        return "openai"
    return "open"

def calculate_cost(df: pd.DataFrame) -> float:
    """Calculate total cost based on token usage and model pricing."""
    if df.empty:
        return 0.0
    
    df = df.copy()
    df["family"] = df.apply(lambda r: get_model_family(r["coding_model"], r["universal_model"]), axis=1)
    
    total_cost = 0.0
    for family, group in df.groupby("family"):
        rate_in, rate_out = PRICES.get(family, (0.0, 0.0))
        input_cost = (group["total_input_tokens"].sum() / 1e6) * rate_in
        output_cost = (group["total_output_tokens"].sum() / 1e6) * rate_out
        total_cost += input_cost + output_cost
    
    return total_cost

# Create summary statistics
summary_data = []

for label, df in [("Baseline (closed APIs)", baseline_raw), ("Open models", openmodels_raw)]:
    summary_data.append({
        "cohort": label,
        "num_runs": len(df),
        "success_rate": df["success"].mean() if not df.empty else np.nan,
        "mean_duration_s": df["duration_seconds"].mean() if not df.empty else np.nan,
        "total_input_tokens": int(df["total_input_tokens"].sum()) if not df.empty else 0,
        "total_output_tokens": int(df["total_output_tokens"].sum()) if not df.empty else 0,
        "est_cost_usd": calculate_cost(df),
    })

summary_df = pd.DataFrame(summary_data)

# Format for display
display_df = summary_df.copy()
display_df["success_rate"] = display_df["success_rate"].map(lambda x: "--" if pd.isna(x) else f"{x*100:.1f}%")
display_df["mean_duration_s"] = display_df["mean_duration_s"].round(1)
display_df["est_cost_usd"] = display_df["est_cost_usd"].round(2)

# Display the results
print("\n=== AUTOMATED TEST RESULTS SUMMARY ===")
try:
    display(display_df[["cohort", "num_runs", "success_rate", "mean_duration_s", "total_input_tokens", "total_output_tokens", "est_cost_usd"]])
except Exception:
    print(display_df.to_string(index=False))

# Print summary takeaway
if len(summary_df) >= 2:
    baseline = summary_df.iloc[0]
    openmodels = summary_df.iloc[1]
    print(f"\nAutomated Tests: Closed APIs: {display_df.loc[0,'success_rate']}, ~{baseline['mean_duration_s']:.0f}s, ${baseline['est_cost_usd']:.2f}; "
          f"Open models: {display_df.loc[1,'success_rate']}, ~{openmodels['mean_duration_s']:.0f}s, ${openmodels['est_cost_usd']:.2f}.")



=== AUTOMATED TEST RESULTS SUMMARY ===


|   | cohort | num_runs | success_rate | mean_duration_s | total_input_tokens | total_output_tokens | est_cost_usd |
|---|---|---|---|---|---|---|---|
| 0 | Baseline (closed APIs) | 30 | 86.7% | 478.3 | 27691026 | 1808588 | 110.2 |
| 1 | Open models | 180 | 56.7% | 628.6 | 219116367 | 7771684 | 37.53 |



Automated Tests: Closed APIs: 86.7%, ~478s, $110.20; Open models: 56.7%, ~629s, $37.53.
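
As a check on how `est_cost_usd` is derived, the sketch below recomputes the baseline figure from the token totals in the table above. It assumes every baseline run falls in the `anthropic` family of `PRICES` ($3.00 per million input tokens, $15.00 per million output tokens); that assumption is inferred from the arithmetic reproducing the reported value, not stated in the data. Note also that `calculate_cost` prices any run classified as `open` at $0, so the open-models estimate only counts runs that match a priced family.

```python
# Per-million-token prices (USD) for the "anthropic" family, copied from PRICES
RATE_IN, RATE_OUT = 3.0, 15.0

# Baseline token totals, copied from the summary table
input_tokens = 27_691_026
output_tokens = 1_808_588

# Same formula as calculate_cost: tokens / 1e6 * price per million
input_cost = input_tokens / 1e6 * RATE_IN
output_cost = output_tokens / 1e6 * RATE_OUT
baseline_cost = input_cost + output_cost

print(f"input ${input_cost:.2f} + output ${output_cost:.2f} = ${baseline_cost:.2f}")
```

Under that assumption the total rounds to $110.20, matching the reported baseline cost, with roughly three quarters of it coming from input tokens.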


## 2. Ablation Study: Impact of Validation Layers

Comparing baseline with ablations (no_lint, no_playwright, no_tests) using automated test results.


In [5]:
# Load ablation study results from CSV files
ablation_types = ["baseline", "no_lint", "no_playwright", "no_tests"]
ablation_data = []

for ablation in ablation_types:
    # Load raw results
    raw_df = pd.read_csv(RESULTS_DIR / ablation / "raw_results.csv")
    
    # Calculate stats
    ablation_data.append({
        "cohort": ablation.replace("_", " ").title(),
        "num_runs": len(raw_df),
        "success_rate": raw_df["success"].mean() if not raw_df.empty else np.nan,
        "mean_duration_s": raw_df["duration_seconds"].mean() if not raw_df.empty else np.nan,
        "total_input_tokens": int(raw_df["total_input_tokens"].sum()) if not raw_df.empty else 0,
        "total_output_tokens": int(raw_df["total_output_tokens"].sum()) if not raw_df.empty else 0,
        "est_cost_usd": calculate_cost(raw_df),
    })

ablation_df = pd.DataFrame(ablation_data)

# Format for display
ablation_display = ablation_df.copy()
ablation_display["success_rate"] = ablation_display["success_rate"].map(lambda x: "--" if pd.isna(x) else f"{x*100:.1f}%")
ablation_display["mean_duration_s"] = ablation_display["mean_duration_s"].round(1)
ablation_display["est_cost_usd"] = ablation_display["est_cost_usd"].round(2)

# Display the ablation comparison
print("\n=== AUTOMATED TEST ABLATION STUDY ===")
try:
    display(ablation_display[["cohort", "num_runs", "success_rate", "mean_duration_s", "total_input_tokens", "total_output_tokens", "est_cost_usd"]])
except Exception:
    print(ablation_display.to_string(index=False))

# Calculate relative changes from baseline
baseline_idx = 0
print("\nRelative changes from baseline (automated tests):")
for i in range(1, len(ablation_df)):
    cohort = ablation_df.iloc[i]["cohort"]
    success_delta = (ablation_df.iloc[i]["success_rate"] - ablation_df.iloc[baseline_idx]["success_rate"]) * 100
    duration_delta = ablation_df.iloc[i]["mean_duration_s"] - ablation_df.iloc[baseline_idx]["mean_duration_s"]
    cost_delta = ablation_df.iloc[i]["est_cost_usd"] - ablation_df.iloc[baseline_idx]["est_cost_usd"]
    
    print(f"{cohort}: Success {success_delta:+.1f}%, Duration {duration_delta:+.0f}s, Cost ${cost_delta:+.2f}")



=== AUTOMATED TEST ABLATION STUDY ===


|   | cohort | num_runs | success_rate | mean_duration_s | total_input_tokens | total_output_tokens | est_cost_usd |
|---|---|---|---|---|---|---|---|
| 0 | Baseline | 30 | 86.7% | 478.3 | 27691026 | 1808588 | 110.2 |
| 1 | No Lint | 30 | 93.3% | 496.1 | 15938734 | 1511313 | 70.49 |
| 2 | No Playwright | 30 | 83.3% | 462.7 | 20822496 | 1580274 | 86.17 |
| 3 | No Tests | 30 | 93.3% | 372.7 | 15915645 | 1553654 | 71.05 |



Relative changes from baseline (automated tests):
No Lint: Success +6.7%, Duration +18s, Cost $-39.72
No Playwright: Success -3.3%, Duration -16s, Cost $-24.03
No Tests: Success +6.7%, Duration -106s, Cost $-39.15
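
The deltas printed above are absolute differences, so the success figure is in percentage points rather than a relative percentage. A minimal sketch of the distinction, assuming the run counts implied by the rounded rates (26 of 30 successes for Baseline, 28 of 30 for No Lint); these counts are inferred, not read from the CSV files, and `success_deltas` is a hypothetical helper:

```python
def success_deltas(base_ok: int, other_ok: int, n_runs: int) -> tuple[float, float]:
    """Return (percentage-point difference, relative change in %) of success rates."""
    base_rate = base_ok / n_runs
    other_rate = other_ok / n_runs
    point_diff = (other_rate - base_rate) * 100
    relative_change = (other_rate - base_rate) / base_rate * 100
    return point_diff, relative_change

# Inferred counts: Baseline 26/30, No Lint 28/30
point_diff, relative_change = success_deltas(26, 28, 30)
print(f"{point_diff:+.1f} percentage points, {relative_change:+.1f}% relative")
```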


## 3. Summary from .out Files

Parse the pre-generated analysis output files for exact statistics from the automated test runs.


In [6]:
import re

def parse_out_file(filepath: Path) -> dict:
    """Parse a .out file and extract key statistics."""
    content = filepath.read_text()
    stats = {}
    
    # Extract key metrics using regex
    patterns = {
        "total_experiments": r"Total experiments: (\d+)",
        "success_rate": r"Success rate: ([\d.]+)%",
        "healthcheck_pass_rate": r"Healthcheck pass rate: ([\d.]+)%",
        "template_failure_rate": r"Template failure rate \(trpc-agent\): ([\d.]+)%",
        "avg_duration": r"Average duration: ([\d.]+)s",
        "median_duration": r"Median duration: ([\d.]+)s",
        "total_input_tokens": r"Total input tokens: ([\d,_]+)",
        "total_output_tokens": r"Total output tokens: ([\d,_]+)",
        "total_llm_calls": r"Total LLM calls: ([\d,_]+)",
        "total_cost": r"Total cost: \$([\d.]+)",
        "avg_tokens_sec_input": r"Avg tokens/sec \(input\): (\d+)",
        "avg_tokens_sec_output": r"Avg tokens/sec \(output\): (\d+)",
    }
    
    for key, pattern in patterns.items():
        match = re.search(pattern, content)
        if match:
            value = match.group(1).replace(",", "").replace("_", "")
            # Rates, durations, and cost are floats; counts are integers
            if key in ["success_rate", "healthcheck_pass_rate", "template_failure_rate",
                       "avg_duration", "median_duration", "total_cost"]:
                stats[key] = float(value)
            elif "tokens" in key or "calls" in key or key == "total_experiments":
                stats[key] = int(value)
            else:
                stats[key] = value
    
    return stats

# Parse all .out files
out_files = {
    "baseline": "baseline.out",
    "no_lint": "ablation_no_lint.out",
    "no_playwright": "ablation_no_playwright.out",
    "no_tests": "ablation_no_tests.out"
}

analysis_dir = Path(".") if Path("baseline.out").exists() else Path("/Users/evgenii.kniazev/projects/app.build-neurips25/analysis")

out_file_stats = []
for ablation_type, filename in out_files.items():
    filepath = analysis_dir / filename
    if filepath.exists():
        stats = parse_out_file(filepath)
        stats["ablation_type"] = ablation_type
        out_file_stats.append(stats)

# Create DataFrame from parsed data
out_stats_df = pd.DataFrame(out_file_stats)

# Format for display
display_out_stats = pd.DataFrame({
    "Ablation": out_stats_df["ablation_type"].str.replace("_", " ").str.title(),
    "Experiments": out_stats_df["total_experiments"],
    "Success Rate": out_stats_df["success_rate"].map(lambda x: f"{x:.1f}%"),
    "Avg Duration": out_stats_df["avg_duration"].map(lambda x: f"{x:.1f}s"),
    "Input Tokens": out_stats_df["total_input_tokens"].map(lambda x: f"{x:,}"),
    "Output Tokens": out_stats_df["total_output_tokens"].map(lambda x: f"{x:,}"),
    "Total Cost": out_stats_df["total_cost"].map(lambda x: f"${x:.2f}")
})

print("\n=== SUMMARY FROM .OUT FILES (AUTOMATED TESTS) ===")
try:
    display(display_out_stats)
except Exception:
    print(display_out_stats.to_string(index=False))

# Show performance comparison
print("\nPerformance Impact of Ablations (Automated Tests):")
baseline_stats = out_stats_df[out_stats_df["ablation_type"] == "baseline"].iloc[0]
for _, row in out_stats_df[out_stats_df["ablation_type"] != "baseline"].iterrows():
    ablation = row["ablation_type"].replace("_", " ").title()
    success_diff = row["success_rate"] - baseline_stats["success_rate"]
    duration_diff = row["avg_duration"] - baseline_stats["avg_duration"]
    cost_diff = row["total_cost"] - baseline_stats["total_cost"]
    
    print(f"\n{ablation}:")
    print(f"  - Success rate: {success_diff:+.1f}% ({row['success_rate']:.1f}% vs {baseline_stats['success_rate']:.1f}%)")
    print(f"  - Avg duration: {duration_diff:+.1f}s ({row['avg_duration']:.1f}s vs {baseline_stats['avg_duration']:.1f}s)")
    print(f"  - Total cost: ${cost_diff:+.2f} (${row['total_cost']:.2f} vs ${baseline_stats['total_cost']:.2f})")



=== SUMMARY FROM .OUT FILES (AUTOMATED TESTS) ===


|   | Ablation | Experiments | Success Rate | Avg Duration | Input Tokens | Output Tokens | Total Cost |
|---|---|---|---|---|---|---|---|
| 0 | Baseline | 30 | 86.7% | 478.3s | 27691026 | 1808588 | $110.20 |
| 1 | No Lint | 30 | 93.3% | 496.1s | 15938734 | 1511313 | $70.49 |
| 2 | No Playwright | 30 | 83.3% | 462.7s | 20822496 | 1580274 | $86.17 |
| 3 | No Tests | 30 | 93.3% | 372.7s | 15915645 | 1553654 | $71.05 |



Performance Impact of Ablations (Automated Tests):

No Lint:
  - Success rate: +6.7% (93.3% vs 86.7%)
  - Avg duration: +17.8s (496.1s vs 478.3s)
  - Total cost: $-39.71 ($70.49 vs $110.20)

No Playwright:
  - Success rate: -3.3% (83.3% vs 86.7%)
  - Avg duration: -15.6s (462.7s vs 478.3s)
  - Total cost: $-24.03 ($86.17 vs $110.20)

No Tests:
  - Success rate: +6.7% (93.3% vs 86.7%)
  - Avg duration: -105.6s (372.7s vs 478.3s)
  - Total cost: $-39.15 ($71.05 vs $110.20)
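
The extraction step in `parse_out_file` can be illustrated on a small sample. The text below is a made-up fragment whose layout is inferred from the regex patterns, not copied from a real `.out` file, and only two of the patterns are shown:

```python
import re

# Hypothetical fragment in the layout the patterns expect
sample = "Success rate: 86.7%\nTotal input tokens: 27,691,026\n"

patterns = {
    "success_rate": r"Success rate: ([\d.]+)%",
    "total_input_tokens": r"Total input tokens: ([\d,_]+)",
}

parsed = {}
for key, pattern in patterns.items():
    match = re.search(pattern, sample)
    if match:
        # Strip thousands separators before converting
        value = match.group(1).replace(",", "").replace("_", "")
        parsed[key] = int(value) if "tokens" in key else float(value)

print(parsed)
```

A metric whose pattern does not match is simply absent from the result, which is why downstream code that indexes a column directly depends on every file containing that line.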


## 4. Key Findings from Automated Tests

### Success Rates (Automated CI/CD)
- **Baseline**: ~86.7% success rate
- **No Lint**: ~93.3% success rate (+6.7 percentage points)
- **No Playwright**: ~83.3% success rate (-3.3 percentage points)
- **No Tests**: ~93.3% success rate (+6.7 percentage points)

### Observations
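1. **Linting and unit tests appear to reduce success rates** in automated tests: removing either one raised the success rate from 86.7% to 93.3%, a difference of two runs out of 30
2. **Playwright (E2E) tests have minimal impact** on automated success: removing them lowered the success rate from 86.7% to 83.3%, a difference of one run out of 30
3. **Cost savings are significant** when validation is reduced: about $40 less without linting or unit tests, and about $24 less without Playwright, against a $110.20 baseline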
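
One way to gauge how much weight these success-rate differences can bear is a confidence interval on each rate. The sketch below uses the Wilson score interval, which is not part of the original analysis, and assumes the run counts implied by the rounded rates (26/30 for Baseline, 28/30 for No Lint); `wilson_interval` is a hypothetical helper:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Inferred counts: Baseline 26/30, No Lint 28/30
base_lo, base_hi = wilson_interval(26, 30)
lint_lo, lint_hi = wilson_interval(28, 30)
print(f"Baseline: {base_lo:.3f} to {base_hi:.3f}")
print(f"No Lint:  {lint_lo:.3f} to {lint_hi:.3f}")
print("Intervals overlap:", lint_lo <= base_hi and base_lo <= lint_hi)
```

Under these assumptions the two intervals overlap, so the gap between Baseline and No Lint is compatible with run-to-run noise at this sample size.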

### ⚠️ Important Caveat
These are automated test results that may not reflect actual app quality. Human evaluation shows different patterns:
- Human viability rates are generally lower (73.3% vs 86.7%)
- Quality assessment requires nuanced evaluation beyond binary pass/fail

For human-evaluated results, see `experiments_baseline_ablation_analysis.ipynb`.
