# Run and Download Benchmark Sessions

This notebook helps you:
1. **Query running VMs** - List active benchmark VMs and their status
2. **Download results** - Fetch benchmark results, logs, and startup scripts from running or completed sessions
3. **View progress** - Observe the progress of the benchmarking session

## Workflow
1. Run `tofu apply -var-file benchmarks/{timestamp}_batch/run.tfvars` from `infrastructure/` to start benchmarks

```
...
google_compute_instance.benchmark_instances["standard-00"]: Creating...
google_compute_instance.benchmark_instances["standard-01"]: Creating...
google_compute_instance.benchmark_instances["standard-01"]: Still creating... [10s elapsed]
google_compute_instance.benchmark_instances["standard-00"]: Still creating... [10s elapsed]
google_compute_instance.benchmark_instances["standard-01"]: Still creating... [20s elapsed]
google_compute_instance.benchmark_instances["standard-00"]: Still creating... [20s elapsed]
google_compute_instance.benchmark_instances["standard-00"]: Creation complete after 21s [id=projects/compute-app-427709/zones/europe-west4-a/instances/benchmark-instance-standard-00]
google_compute_instance.benchmark_instances["standard-01"]: Creation complete after 21s [id=projects/compute-app-427709/zones/europe-west4-a/instances/benchmark-instance-standard-01]

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

Outputs:

instance_ips = {
  "standard-00" = "34.90.225.213"
  "standard-01" = "34.90.240.138"
}
run_id = "20251106_153156_batch"
```

2. Use section *1. Configure variables for the Benchmarking Campaign* to point the notebook at your session (set `RUN_ID` to the `run_id` output shown above)
3. Use section *2. Observe the Running Benchmark Session* to download result CSVs and logs and track the progress of the Benchmarking Session
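
The `run_id` and `instance_ips` outputs from step 1 can also be read programmatically rather than copied by hand. The following is a minimal sketch, assuming the outputs are exported with `tofu output -json`, which wraps each output in an object with a `value` field. The helper name `parse_tofu_outputs` and the sample JSON are illustrative and not part of the repository.

```python
import json


def parse_tofu_outputs(json_text: str) -> dict:
    """Flatten `tofu output -json` text into {output_name: value}."""
    raw = json.loads(json_text)
    return {name: entry["value"] for name, entry in raw.items()}


# Sample mirroring the outputs shown above (assumed JSON shape).
sample = """
{
  "instance_ips": {
    "sensitive": false,
    "type": ["object", {"standard-00": "string", "standard-01": "string"}],
    "value": {"standard-00": "34.90.225.213", "standard-01": "34.90.240.138"}
  },
  "run_id": {"sensitive": false, "type": "string", "value": "20251106_153156_batch"}
}
"""

outputs = parse_tofu_outputs(sample)
print(outputs["run_id"])
print(sorted(outputs["instance_ips"]))
```

In a live session the JSON text would come from running the command in `infrastructure/` instead of from the inline sample.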

# 1. Configure variables for the Benchmarking Campaign

- Set up the Benchmarking Session configuration (created using `notebooks/allocate-benchmarks-to-vms.ipynb`)
- Set up results output directories
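
The download cells later in this notebook write into `scp/` and `gcs/` subdirectories of the session results directory. If the download tooling does not create them, they could be prepared up front. This is a hedged sketch: the helper name `ensure_session_dirs` is hypothetical, and it runs against a temporary directory so no real results are touched.

```python
from pathlib import Path
import tempfile


def ensure_session_dirs(base_results_dir: Path, session_id: str) -> dict:
    """Create <base>/<session_id>/{scp,gcs} and return the paths."""
    session_dir = Path(base_results_dir) / session_id
    dirs = {
        "session": session_dir,
        "scp": session_dir / "scp",
        "gcs": session_dir / "gcs",
    }
    for path in dirs.values():
        # parents=True builds missing ancestors; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Demonstrated against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_session_dirs(Path(tmp), "sample_run")
    created = sorted(p.name for p in dirs["session"].iterdir())
    print(created)
```

In the notebook itself the call would be `ensure_session_dirs(BASE_RESULTS_DIR, SESSION_ID)`.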

In [1]:
from pathlib import Path

# Configuration
PROJECT_ROOT = Path.cwd().parent
BASE_RESULTS_DIR = PROJECT_ROOT / "results"
INFRASTRUCTURE_DIR = PROJECT_ROOT / "infrastructure"
BASE_BENCHMARKS_DIR = INFRASTRUCTURE_DIR / "benchmarks"
SESSION_ID = "sample_run"
RUN_ID = "20251106_153156_batch"

SESSION_RESULTS_DIR = BASE_RESULTS_DIR / SESSION_ID
SESSION_BENCHMARKS_DIR = BASE_BENCHMARKS_DIR / SESSION_ID
VM_SCRIPT = PROJECT_ROOT / "vms_gcloud.py"
VM_NAME_PREFIX = "benchmark-instance"

print(f"Benchmarks Dir: {SESSION_BENCHMARKS_DIR}")
!ls {SESSION_BENCHMARKS_DIR}

print(f"\nResults Dir: {SESSION_RESULTS_DIR}")

print(f"\nSession config:\n{'=' * 60}\n")
!cat $SESSION_BENCHMARKS_DIR/run.tfvars

Benchmarks Dir: /home/madhukar/oet/solver-benchmark/infrastructure/benchmarks/sample_run
run.tfvars  standard-00.yaml  standard-01.yaml

Results Dir: /home/madhukar/oet/solver-benchmark/results/sample_run

Session config:

project_id = "compute-app-427709"
# This will be overridden if a value is specified in the input metadata file
zone = "europe-west4-a"
# Optional
enable_gcs_upload = false
auto_destroy_vm = false
benchmarks_dir = "benchmarks/sample_run"

# 2. Observe the Running Benchmark Session

All artifacts are plain-text flat files on the VMs, accessed over SSH. The cells below download the results and check them against the allocation configs.

## 2.1 Query Running VMs

In [95]:
# Query standard VMs
!python $VM_SCRIPT $VM_NAME_PREFIX --output table

Using current project: compute-app-427709
Discovering zones across 1 projects...
Querying VMs across 127 zones...
Progress: 50/127 zones checked
Progress: 100/127 zones checked

Found 2 matching VMs

Project                        Zone                 Name                           Status       Machine Type         Internal IP     External IP    
------------------------------------------------------------------------------------------------------------------------------------------------------------------
compute-app-427709             europe-west4-a       benchmark-instance-standard-00 RUNNING      c4-standard-2        10.164.0.42     34.90.225.213  
compute-app-427709             europe-west4-a       benchmark-instance-standard-01 RUNNING      c4-standard-2        10.164.0.41     34.90.240.138  


In [96]:
# Run commands over ssh for running VMs
!python $VM_SCRIPT $VM_NAME_PREFIX --ssh "uptime"

Using current project: compute-app-427709
Discovering zones across 1 projects...
Querying VMs across 127 zones...
Progress: 50/127 zones checked
Progress: 100/127 zones checked

Found 2 matching VMs
Executing command on 2 VMs: uptime

✓ benchmark-instance-standard-00: Success
STDOUT:
 17:33:12 up  2:00,  1 user,  load average: 0.00, 0.00, 0.00


✓ benchmark-instance-standard-01: Success
STDOUT:
 17:33:12 up  2:00,  2 users,  load average: 0.00, 0.00, 0.00


Completed: 2/2 successful


## 2.2 Download Results for the Session

In [97]:
# Download benchmark_results.csv
!python $VM_SCRIPT $VM_NAME_PREFIX '--scp-source' 'vm:/solver-benchmark/results/benchmark_results.csv' '--scp-dest' $SESSION_RESULTS_DIR/scp/{{vm_name}}/benchmark_results.csv

Using current project: compute-app-427709
Discovering zones across 1 projects...
Querying VMs across 127 zones...
Progress: 50/127 zones checked
Progress: 100/127 zones checked

Found 2 matching VMs
Copying from 2 VMs...
✓ benchmark-instance-standard-00: Success
✓ benchmark-instance-standard-01: Success

Completed: 2/2 successful


In [98]:
# Download startup-script.log
!python $VM_SCRIPT $VM_NAME_PREFIX '--scp-source' 'vm:/var/log/startup-script.log' '--scp-dest' $SESSION_RESULTS_DIR/scp/{{vm_name}}/startup-script.log

Using current project: compute-app-427709
Discovering zones across 1 projects...
Querying VMs across 127 zones...
Progress: 50/127 zones checked
Progress: 100/127 zones checked

Found 2 matching VMs
Copying from 2 VMs...
✓ benchmark-instance-standard-00: Success
✓ benchmark-instance-standard-01: Success

Completed: 2/2 successful


In [None]:
# Download runner logs (recursive)
!python $VM_SCRIPT $VM_NAME_PREFIX '--scp-source' 'vm:/solver-benchmark/runner/logs/' '--scp-dest' $SESSION_RESULTS_DIR/scp/{{vm_name}}/ --recursive

Using current project: compute-app-427709
Discovering zones across 1 projects...
Querying VMs across 127 zones...
Progress: 50/127 zones checked
Progress: 100/127 zones checked

Found 2 matching VMs
Copying from 2 VMs...


## Download completed results from Google Cloud Storage

In [None]:
# Download results from GCS
!gsutil -m cp -r gs://solver-benchmarks/results/$RUN_ID'*' $SESSION_RESULTS_DIR/gcs/results 2>/dev/null || echo "No GCS results found"
!gsutil -m cp -r gs://solver-benchmarks/logs/$RUN_ID'*' $SESSION_RESULTS_DIR/gcs/logs 2>/dev/null || echo "No GCS logs found"

No GCS results found
No GCS logs found


In [None]:
# List downloaded results
!tree $SESSION_RESULTS_DIR/ 2>/dev/null || find $SESSION_RESULTS_DIR -type f | head -20

/home/madhukar/oet/solver-benchmark/results/sample_run/
└── scp
    ├── benchmark-instance-standard-00
    │   ├── benchmark_results.csv
    │   ├── logs
    │   │   ├── genx-3_three_zones_w_co2_capture-no_uc-3-1h-highs-1.10.0.log
    │   │   ├── genx-3_three_zones_w_co2_capture-no_uc-3-1h-scip-9.2.2.log
    │   │   ├── Sienna_modified_RTS_GMLC_DA_sys_NetTransport_Horizon24_Day314-1-1h-highs-1.10.0.log
    │   │   └── Sienna_modified_RTS_GMLC_DA_sys_NetTransport_Horizon24_Day314-1-1h-scip-9.2.2.log
    │   └── startup-script.log
    └── benchmark-instance-standard-01
        ├── benchmark_results.csv
        ├── logs
        │   ├── genx-2_three_zones_w_electrolyzer-3-1h-highs-1.10.0.log
        │   ├── genx-2_three_zones_w_electrolyzer-3-1h-scip-9.2.2.log
        │   ├── pypsa-power+ely+battery-ucgas-1-1h-highs-1.10.0.log
   

## 2.3 Verify the Results against VM Allocations

Compare the downloaded results with the allocation configs in `infrastructure/benchmarks/<session-id>`
to see how the benchmarks were allocated and how the solvers are performing.
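
The core of this comparison is matching each expected (benchmark, size, solver) run against the observed result rows. As a minimal sketch of that idea, a left merge with `indicator=True` marks expected runs that have no result yet. The column names `Benchmark`, `Size`, `Solver`, `Status`, and `Runtime (s)` follow the result CSVs used in this notebook. The benchmark names, runtimes, the `expected_s` column, and the `TO` status value are hypothetical sample data.

```python
import pandas as pd

# Hypothetical expected runs, as they might be flattened from an allocation YAML.
expected = pd.DataFrame(
    [
        {"Benchmark": "bench-a", "Size": "1-1h", "Solver": "highs", "expected_s": 900.0},
        {"Benchmark": "bench-a", "Size": "1-1h", "Solver": "scip", "expected_s": 1200.0},
        {"Benchmark": "bench-b", "Size": "3-1h", "Solver": "highs", "expected_s": 600.0},
    ]
)

# Hypothetical observed rows using the result CSV column names.
observed = pd.DataFrame(
    [
        {"Benchmark": "bench-a", "Size": "1-1h", "Solver": "highs",
         "Status": "ok", "Runtime (s)": 850.0},
        {"Benchmark": "bench-a", "Size": "1-1h", "Solver": "scip",
         "Status": "TO", "Runtime (s)": 3600.0},
    ]
)

keys = ["Benchmark", "Size", "Solver"]
# how="left" keeps every expected run; _merge records whether a result matched.
merged = expected.merge(observed, on=keys, how="left", indicator=True)


def classify(row):
    """Label a merged row as not started, ok, or failed."""
    if row["_merge"] == "left_only":
        return "not started"
    return "ok" if str(row["Status"]).lower() == "ok" else "failed"


merged["state"] = merged.apply(classify, axis=1)
print(merged[keys + ["state"]].to_string(index=False))
```

The cells below reach the same classification with explicit loops, which also makes it easy to print per-VM detail.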

In [None]:
import pandas as pd


def load_and_combine_results(session_dir):
    """
    Load benchmark results downloaded via GCS and SCP, preferring GCS rows
    when the same run appears in both.

    Returns:
        pd.DataFrame | None: Combined, deduplicated results, or None if no
        CSV files were found
    """
    session_dir = Path(session_dir)
    all_dfs = []

    # Load GCS results first (preferred)
    gcs_dir = session_dir / "gcs"
    gcs_count = 0
    if gcs_dir.exists():
        for csv_file in sorted(gcs_dir.glob("**/benchmark_results.csv")):
            try:
                df = pd.read_csv(csv_file)
                df["_source"] = "gcs"
                all_dfs.append(df)
                gcs_count += 1
                print(f"  Loaded {len(df)} rows from {csv_file.parent.name}")
            except Exception as e:
                print(f"  Error loading {csv_file}: {e}")

    print(f"Found {gcs_count} GCS CSV files" if gcs_count else "No GCS results found")

    # Load SCP results
    scp_dir = session_dir / "scp"
    scp_count = 0
    if scp_dir.exists():
        for csv_file in sorted(scp_dir.glob("**/benchmark_results.csv")):
            try:
                df = pd.read_csv(csv_file)
                df["_source"] = "scp"
                all_dfs.append(df)
                scp_count += 1
                print(f"  Loaded {len(df)} rows from {csv_file.parent.name}")
            except Exception as e:
                print(f"  Error loading {csv_file}: {e}")

    print(f"Found {scp_count} SCP CSV files" if scp_count else "No SCP results found")

    # Combine all dataframes and deduplicate, preferring GCS
    if all_dfs:
        combined_df = pd.concat(all_dfs, ignore_index=True)

        # Deduplicate: keep GCS over SCP for same benchmark runs
        # Sort so GCS comes first, then drop duplicates keeping first occurrence
        combined_df = combined_df.sort_values(
            "_source", key=lambda x: (x != "gcs")
        ).reset_index(drop=True)

        # Identify duplicates by benchmark data (all columns except _source)
        cols_to_check = [c for c in combined_df.columns if c != "_source"]
        combined_df = combined_df.drop_duplicates(subset=cols_to_check, keep="first")

        print(
            f"\nSuccessfully combined {len(combined_df)} total rows (after deduplication)"
        )
        return combined_df
    else:
        print("\nNo CSV files found")
        return None


# Load results
results = load_and_combine_results(SESSION_RESULTS_DIR)

No GCS results found
  Loaded 5 rows from benchmark-instance-standard-00
  Loaded 5 rows from benchmark-instance-standard-01
Found 2 SCP CSV files

Successfully combined 10 total rows (after deduplication)


In [None]:
def get_timeout_by_machine(runs: pd.DataFrame) -> dict:
    """Infer timeout values by machine type from actual results."""
    timeout_map = {}

    # Find VM hostname column
    vm_col = None
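    # Candidates are compared after normalization (lowercased; spaces,
    # underscores, and hyphens removed), so "_vm" is covered by "vm"
    vm_candidates = {
        "hostname",
        "host",
        "vmhostname",
        "vm",
        "instancename",
        "instance",
    }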
    for c in runs.columns:
        normalized = c.lower().replace(" ", "").replace("_", "").replace("-", "")
        if normalized in vm_candidates:
            vm_col = c
            break

    if vm_col and "Timeout" in runs.columns:
        # Map hostname patterns to machine types
        for _, row in runs.iterrows():
            hostname = str(row.get(vm_col, ""))
            timeout = row.get("Timeout")

            if pd.notna(timeout):
                if "highmem" in hostname.lower():
                    timeout_map["c4-highmem-8"] = timeout
                elif "standard" in hostname.lower():
                    timeout_map["c4-standard-2"] = timeout

    return timeout_map


if results is not None:
    timeout_by_machine = get_timeout_by_machine(results)
    print("\nTimeout by machine type:")
    for machine_type, timeout in timeout_by_machine.items():
        print(f"  {machine_type}: {timeout}s")


Timeout by machine type:
  c4-standard-2: 3600.0s


In [None]:
import yaml

if results is not None:
    print("\n" + "=" * 100)
    print("PER-VM EXPECTED VS OBSERVED SUMMARY")
    print("=" * 100)

    # Load allocation from YAML files
    expected_by_vm = {}

    print(f"Looking for YAML files in: {SESSION_BENCHMARKS_DIR}")
    yaml_files = sorted(SESSION_BENCHMARKS_DIR.glob("*.yaml"))
    print(f"Found {len(yaml_files)} YAML files\n")

    # Debug: show what hostnames are in results
    print(f"Hostnames in results: {results['Hostname'].unique().tolist()}\n")

    # The "*.yaml" glob above already excludes run.tfvars
    for yaml_file in yaml_files:
        vm_name = yaml_file.stem  # e.g., "standard-01"

        try:
            with open(yaml_file) as f:
                config = yaml.safe_load(f)

            if not config:
                continue

            # Extract top-level fields
            solvers_list = (
                config.get("solver", "").split() if config.get("solver") else []
            )
            machine_type = config.get("machine-type", "unknown")
            years = config.get("years", [])

            expected_by_vm[vm_name] = {
                "expected_runs": [],  # List of (benchmark, size, solver, runtime_s)
                "solvers": solvers_list,
                "machine_type": machine_type,
                "years": years,
            }

            # Iterate through each benchmark
            if "benchmarks" in config:
                for bench_name, bench_data in config["benchmarks"].items():
                    if "Sizes" in bench_data:
                        for size in bench_data["Sizes"]:
                            size_name = size.get("Name", "unknown")
                            size_solvers = size.get("_solvers", [])
                            solver_runtimes = size.get("_solver_runtimes_s", {})

                            # Create a (benchmark, size, solver) run for each solver
                            for solver in size_solvers:
                                solver_runtime_s = float(solver_runtimes.get(solver, 0))
                                expected_by_vm[vm_name]["expected_runs"].append(
                                    {
                                        "benchmark": bench_name,
                                        "size": size_name,
                                        "solver": solver,
                                        "runtime_s": solver_runtime_s,
                                    }
                                )

            total_runtime_s = sum(
                r["runtime_s"] for r in expected_by_vm[vm_name]["expected_runs"]
            )
            total_h = total_runtime_s / 3600
            solvers_str = " ".join(solvers_list) if solvers_list else "default"
            print(
                f"{vm_name} ({machine_type}): {len(expected_by_vm[vm_name]['expected_runs'])} expected runs, {total_h:.1f}h, solvers=[{solvers_str}]"
            )

        except Exception as e:
            print(f"Error loading {yaml_file}: {e}")

    print()  # Blank line for readability

    if not expected_by_vm:
        print("\nNo expected allocation found. Skipping per-VM comparison.")
    else:
        # Build summary table
        vm_summary = []

        for vm_name in sorted(expected_by_vm.keys()):
            expected = expected_by_vm[vm_name]

            # Total expected runtime across all runs
            total_expected_runtime_s = sum(
                r["runtime_s"] for r in expected["expected_runs"]
            )
            expected_runtime_h = total_expected_runtime_s / 3600

            # Get actual results for this VM - match by substring in Hostname
            # e.g., vm_name="standard-00" matches Hostname="benchmark-instance-standard-00"
            vm_results = results[
                results["Hostname"].str.contains(vm_name, case=False, na=False)
            ]

            # Filter to only benchmarks that are in the expected allocation (not reference benchmarks)
            expected_benchmark_names = set(
                r["benchmark"] for r in expected["expected_runs"]
            )
            vm_results_filtered = vm_results[
                vm_results["Benchmark"].isin(expected_benchmark_names)
            ]

            print(
                f"Matching '{vm_name}' to hostnames: {vm_results_filtered['Hostname'].unique().tolist() if len(vm_results_filtered) > 0 else 'no match'}"
            )

            # Count completed runs (by benchmark-size-solver combination)
            completed_runs = 0
            completed_runtime_s = 0
            completed_failed = 0

            for run in expected["expected_runs"]:
                # Check if this specific run has a result
                result = vm_results_filtered[
                    (vm_results_filtered["Benchmark"] == run["benchmark"])
                    & (vm_results_filtered["Size"] == run["size"])
                    & (vm_results_filtered["Solver"] == run["solver"])
                ]

                if len(result) > 0:
                    completed_runs += 1
                    completed_runtime_s += result.iloc[0]["Runtime (s)"]
                    # Check status case-insensitively; str() guards against a missing (NaN) status
                    if str(result.iloc[0]["Status"]).lower() != "ok":
                        completed_failed += 1

            completed_runtime_h = completed_runtime_s / 3600
            completed_ok = completed_runs - completed_failed

            # Calculate remaining (expected runs that haven't completed)
            remaining_runtime_s = total_expected_runtime_s - completed_runtime_s
            remaining_runtime_h = remaining_runtime_s / 3600

            # Estimate timeouts
            timeout_s = (
                timeout_by_machine.get("c4-standard-2", 3600)
                if "standard" in vm_name
                else timeout_by_machine.get("c4-highmem-8", 3600)
            )
            timeout_h = timeout_s / 3600

            expected_timeout = sum(
                1
                for r in expected["expected_runs"]
                if r["runtime_s"] / 3600 >= timeout_h
            )
            expected_complete = len(expected["expected_runs"]) - expected_timeout

            vm_summary.append(
                {
                    "VM": vm_name,
                    "Machine": expected["machine_type"],
                    "Solvers": " ".join(expected["solvers"])
                    if expected["solvers"]
                    else "default",
                    "Expected Runs": len(expected["expected_runs"]),
                    "Expected Runtime (h)": expected_runtime_h,
                    "Completed Runs": completed_ok,
                    "Failed Runs": completed_failed,
                    "Completed Runtime (h)": completed_runtime_h,
                    "Remaining Runtime (h)": max(0, remaining_runtime_h),
                    "Expected Timeout": expected_timeout,
                    "Expected Complete": expected_complete,
                }
            )

        vm_df = pd.DataFrame(vm_summary)

        # Pretty print
        print("\n")
        with pd.option_context("display.max_columns", None, "display.width", None):
            with pd.option_context("display.float_format", "{:,.2f}".format):
                print(vm_df.to_string(index=False))

        # Summary
        print("\n" + "=" * 100)
        print("SUMMARY")
        print("=" * 100)
        print(f"Total runs expected: {vm_df['Expected Runs'].sum()}")
        print(f"Total expected runtime: {vm_df['Expected Runtime (h)'].sum():.1f}h")
        print(f"Total completed runs: {vm_df['Completed Runs'].sum()}")
        print(f"Total completed runtime: {vm_df['Completed Runtime (h)'].sum():.1f}h")
        print(f"Total remaining runtime: {vm_df['Remaining Runtime (h)'].sum():.1f}h")
        print(f"\nExpected to timeout: {vm_df['Expected Timeout'].sum()}")
        print(f"Expected to complete: {vm_df['Expected Complete'].sum()}")

        # Detailed solver breakdown
        print("\n" + "=" * 100)
        print("SOLVER ALLOCATION DETAIL")
        print("=" * 100)
        for vm_name in sorted(expected_by_vm.keys()):
            expected = expected_by_vm[vm_name]
            if expected["solvers"]:
                print(f"\n{vm_name}: {', '.join(expected['solvers'])}")
                print(f"  Years: {expected['years']}")
                print(f"  Expected runs: {len(expected['expected_runs'])}")

                # Group by benchmark-size to show summary
                benchmarks_by_name = {}
                for run in expected["expected_runs"]:
                    key = f"{run['benchmark']}-{run['size']}"
                    if key not in benchmarks_by_name:
                        benchmarks_by_name[key] = {"solvers": [], "total_runtime_s": 0}
                    benchmarks_by_name[key]["solvers"].append(run["solver"])
                    benchmarks_by_name[key]["total_runtime_s"] += run["runtime_s"]

                for bench_name in sorted(benchmarks_by_name.keys())[:3]:
                    b = benchmarks_by_name[bench_name]
                    print(
                        f"    - {bench_name}: {b['total_runtime_s']:.1f}s ({b['total_runtime_s'] / 3600:.2f}h) with solvers: {', '.join(b['solvers'])}"
                    )

                if len(benchmarks_by_name) > 3:
                    print(f"    ... and {len(benchmarks_by_name) - 3} more")

        # Detailed benchmark-by-benchmark-by-solver comparison
        print("\n" + "=" * 100)
        print("BENCHMARK COMPLETION DETAIL (Expected vs Actual)")
        print("=" * 100)
        for vm_name in sorted(expected_by_vm.keys()):
            expected = expected_by_vm[vm_name]
            vm_results = results[
                results["Hostname"].str.contains(vm_name, case=False, na=False)
            ]

            # Only show allocated benchmarks, exclude reference benchmarks
            expected_benchmark_names = set(
                r["benchmark"] for r in expected["expected_runs"]
            )
            vm_results_filtered = vm_results[
                vm_results["Benchmark"].isin(expected_benchmark_names)
            ]

            print(f"\n{vm_name}:")

            # Group expected runs by benchmark-size for cleaner output
            benchmarks_by_key = {}
            for run in expected["expected_runs"]:
                key = f"{run['benchmark']}-{run['size']}"
                if key not in benchmarks_by_key:
                    benchmarks_by_key[key] = []
                benchmarks_by_key[key].append(run)

            for bench_key in sorted(benchmarks_by_key.keys()):
                runs = benchmarks_by_key[bench_key]
                print(f"  {bench_key}:")

                for run in runs:
                    # Check if this run has a result
                    result = vm_results_filtered[
                        (vm_results_filtered["Benchmark"] == run["benchmark"])
                        & (vm_results_filtered["Size"] == run["size"])
                        & (vm_results_filtered["Solver"] == run["solver"])
                    ]

                    if len(result) > 0:
                        status = result.iloc[0]["Status"]
                        runtime = result.iloc[0]["Runtime (s)"]
                        status_symbol = (
                            "✓" if str(status).lower() == "ok" else f"✗({status})"
                        )
                        print(
                            f"    {run['solver']}: {runtime:.1f}s {status_symbol} (expected {run['runtime_s']:.1f}s)"
                        )
                    else:
                        print(
                            f"    {run['solver']}: NOT STARTED (expected {run['runtime_s']:.1f}s)"
                        )


PER-VM EXPECTED VS OBSERVED SUMMARY
Looking for YAML files in: /home/madhukar/oet/solver-benchmark/infrastructure/benchmarks/sample_run
Found 2 YAML files

Hostnames in results: ['benchmark-instance-standard-00', 'benchmark-instance-standard-01']

standard-00 (c4-standard-2): 4 expected runs, 1.0h, solvers=[highs scip]
standard-01 (c4-standard-2): 4 expected runs, 1.0h, solvers=[highs scip]

Matching 'standard-00' to hostnames: ['benchmark-instance-standard-00']
Matching 'standard-01' to hostnames: ['benchmark-instance-standard-01']


         VM       Machine    Solvers  Expected Runs  Expected Runtime (h)  Completed Runs  Failed Runs  Completed Runtime (h)  Remaining Runtime (h)  Expected Timeout  Expected Complete
standard-00 c4-standard-2 highs scip              4                  1.00               4            0                   0.92                   0.08                 0                  4
standard-01 c4-standard-2 highs scip              4                  0.99             