# Benchmark Log Bloat Analysis

This notebook measures where the space goes in benchmark results (specifically `results.json`) to explain why the files reach several GB and which log events contribute the most bloat.

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys

RUN_DIR = "benchmark_runs/2026-01-26_21-28-34"
RESULTS_FILE = Path(RUN_DIR) / "results.json"

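def estimate_size(obj):
    """Rough serialized size of obj in characters (close to bytes for ASCII-heavy logs)."""
    if obj is None:
        return 0
    if isinstance(obj, str):
        return len(obj)  # raw length; ignores JSON quoting and escaping
    return len(json.dumps(obj))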

## Size Breakdown by Event Type
We iterate through all trace logs and measure the size of each event type.

In [None]:
data = []
with open(RESULTS_FILE, 'r') as f:
    results = json.load(f)

def process_logs(logs, source_type):
    for log in logs or []:
        data.append({
            'source': source_type,
            'event_type': log.get('type', 'unknown'),
            'content_size': estimate_size(log.get('content')),
            'tool_output_size': estimate_size(log.get('tool_output')),
            'details_size': estimate_size(log.get('details')),
            'total_size': estimate_size(log)
        })

for res in results:
    # Top-level logs
    process_logs(res.get('trace_logs'), 'top_level')
    
    # Attempt-level logs
    for attempt in res.get('generation_attempts') or []:
        process_logs(attempt.get('trace_logs'), 'attempt')

df = pd.DataFrame(data)
stats = df.groupby(['source', 'event_type'])['total_size'].agg(['sum', 'count', 'mean']).sort_values('sum', ascending=False)
stats['sum_mb'] = stats['sum'] / (1024*1024)
stats['percent'] = (stats['sum'] / stats['sum'].sum()) * 100
stats

### Visualization of Usage
The following plot shows the total MB consumed by each log event type.

In [None]:
plt.figure(figsize=(12, 6))
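# hue='source' gives each (source, event_type) row its own bar; without it,
# seaborn averages rows that share an event_type instead of showing totals.
sns.barplot(data=stats.reset_index(), x='event_type', y='sum_mb', hue='source')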
plt.title("Total Space Usage per Event Type (MB)")
plt.ylabel("MB")
plt.xticks(rotation=45)
plt.show()

## Findings and Optimization

1. **Duplication**: `tool_result` events store the same huge output in both `tool_output` and `details`. 
2. **Raw CLI Bloat**: `CLI_STDOUT_FULL` captures the raw JSON line from the CLI, which contains the entire tool output again.
3. **Multi-Attempt Multiplier**: Since `trace_logs` is stored both in each entry of `generation_attempts` and in the top-level result, the bloat is multiplied by the number of retries.
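
The duplication in finding 1 can be checked directly. The following is a minimal sketch, assuming that `details` holds the `tool_output` string verbatim somewhere inside it; the helper name `duplicated_output_chars` and the sample records are hypothetical and not taken from the benchmark code.

```python
import json

def duplicated_output_chars(logs):
    """Sum the lengths of tool_output strings that reappear inside details."""
    total = 0
    for log in logs or []:
        if log.get("type") != "tool_result":
            continue
        output = log.get("tool_output")
        details = log.get("details")
        if not isinstance(output, str) or not output or details is None:
            continue
        if isinstance(details, str):
            haystack, needle = details, output
        else:
            # Serialize details, and JSON-escape the output (minus its quotes)
            # so that it can match inside the serialized text.
            haystack = json.dumps(details, ensure_ascii=False)
            needle = json.dumps(output, ensure_ascii=False)[1:-1]
        if needle in haystack:
            total += len(output)
    return total

# Hypothetical sample records, shaped like the trace logs analyzed above.
sample_logs = [
    {"type": "tool_result", "tool_output": "x" * 100, "details": {"output": "x" * 100}},
    {"type": "tool_result", "tool_output": "abc", "details": {"status": "ok"}},
    {"type": "message", "content": "hello"},
]
print(duplicated_output_chars(sample_logs))
```

Running the same helper over `res.get('trace_logs')` for each result would give a rough estimate of how much space dropping `details` could recover.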

**Optimization Implemented:**
- Truncate `tool_output` and `CLI_STDOUT_FULL` to 5 KB.
- Remove `details` from `tool_result` (redundant).
- Apply `optimize_trace_logs` before saving results.
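
The notebook does not show `optimize_trace_logs` itself, so the following is only a sketch of what the three steps above might look like, not the actual implementation. It assumes that a `CLI_STDOUT_FULL` event keeps its raw line in the `content` field, counts the 5 KB limit in characters rather than bytes, and uses a hypothetical `truncate` helper.

```python
MAX_CHARS = 5 * 1024  # 5 KB limit from the list above, counted in characters here

def truncate(text, limit=MAX_CHARS):
    """Cut long strings to `limit` characters and note how much was dropped."""
    if not isinstance(text, str) or len(text) <= limit:
        return text
    return text[:limit] + f"... [truncated {len(text) - limit} chars]"

def optimize_trace_logs(logs):
    """Return slimmed copies of the log events; the input list is left unchanged."""
    optimized = []
    for log in logs or []:
        log = dict(log)  # shallow copy so the caller's event is not mutated
        if log.get("type") == "tool_result":
            log.pop("details", None)  # redundant with tool_output
        if "tool_output" in log:
            log["tool_output"] = truncate(log["tool_output"])
        # Assumption: the raw CLI line lives in the `content` field.
        if log.get("type") == "CLI_STDOUT_FULL" and "content" in log:
            log["content"] = truncate(log["content"])
        optimized.append(log)
    return optimized
```

Under these assumptions, the function would be applied to each result's `trace_logs`, and to the `trace_logs` of every entry in `generation_attempts`, just before the results are written to disk.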