# BioEval Results Analysis

This notebook analyzes evaluation results from the BioEval benchmark suite.

## Components
1. **ProtoReason**: Protocol procedural reasoning
2. **CausalBio**: Causal perturbation prediction
3. **DesignCheck**: Experimental design critique
4. **Error Taxonomy**: Failure mode analysis

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Configure display
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## 1. Load Results

In [None]:
# Load evaluation results
RESULTS_DIR = Path('../results')

def load_results(filepath):
    """Load results from JSON file."""
    with open(filepath, encoding='utf-8') as f:
        return json.load(f)

# List available result files
result_files = sorted(RESULTS_DIR.glob('*.json'))  # sorted for a stable listing order
print(f"Found {len(result_files)} result files:")
for f in result_files:
    print(f"  - {f.name}")

In [None]:
# Load latest results (modify path as needed)
# results = load_results(RESULTS_DIR / 'claude-sonnet_20250108.json')

# For demo, create sample results structure
sample_results = {
    "metadata": {
        "model": "claude-sonnet-4-20250514",
        "timestamp": "2025-01-08T12:00:00"
    },
    "summary": {
        "total_tasks": 75,
        "by_component": {
            "protoreason": {"num_tasks": 30, "completed": 30},
            "causalbio": {"num_tasks": 35, "completed": 35},
            "designcheck": {"num_tasks": 10, "completed": 10}
        }
    }
}

print("Sample results created (demo data, not loaded from disk)")
print(f"Model: {sample_results['metadata']['model']}")
print(f"Total tasks: {sample_results['summary']['total_tasks']}")
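
The commented-out `load_results` call above needs a hand-edited filename. As a minimal sketch, the newest file in the results directory could be picked automatically instead. The helper names `latest_result_file` and `load_latest_results` are hypothetical, and "newest" is assumed to mean most recently modified on disk rather than anything encoded in the filename.

```python
import json
from pathlib import Path
from typing import Optional


def latest_result_file(results_dir: Path) -> Optional[Path]:
    """Return the most recently modified *.json file, or None if there is none."""
    candidates = list(Path(results_dir).glob("*.json"))
    if not candidates:
        return None
    # Assumption: modification time is a good proxy for "latest run".
    return max(candidates, key=lambda p: p.stat().st_mtime)


def load_latest_results(results_dir: Path) -> Optional[dict]:
    """Load the newest result file; return None when the directory has no results."""
    path = latest_result_file(results_dir)
    if path is None:
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Returning `None` instead of raising lets the notebook fall back to the sample structure when no evaluation has been run yet.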

## 2. ProtoReason Analysis

Analyze protocol procedural reasoning performance.

In [None]:
# Sample ProtoReason results for visualization
protoreason_data = {
    'task_type': ['step_ordering', 'step_ordering', 'missing_step', 'missing_step', 
                  'calculation', 'calculation', 'calculation', 'troubleshooting', 
                  'troubleshooting', 'safety'],
    'protocol': ['western_blot', 'qpcr', 'western_blot', 'chip_seq',
                 'dilution', 'protein', 'cell_counting', 'western_blot',
                 'qpcr', 'lentivirus'],
    'score': [0.85, 0.92, 0.75, 0.60, 0.95, 0.88, 0.92, 0.80, 0.72, 0.90],
    'response_length': [450, 380, 520, 610, 280, 350, 290, 680, 720, 420]
}

df_proto = pd.DataFrame(protoreason_data)
df_proto.head()

In [None]:
# ProtoReason: Performance by task type
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart by task type
task_scores = df_proto.groupby('task_type')['score'].mean().sort_values()
ax1 = axes[0]
task_scores.plot(kind='barh', ax=ax1, color=sns.color_palette('viridis', len(task_scores)))
ax1.set_xlabel('Average Score')
ax1.set_ylabel('Task Type')
ax1.set_title('ProtoReason: Performance by Task Type')
ax1.set_xlim(0, 1)
for i, v in enumerate(task_scores):
    ax1.text(v + 0.02, i, f'{v:.2f}', va='center')

# Box plot by task type
ax2 = axes[1]
df_proto.boxplot(column='score', by='task_type', ax=ax2)
ax2.set_xlabel('Task Type')
ax2.set_ylabel('Score')
ax2.set_title('Score Distribution by Task Type')
plt.suptitle('')

plt.tight_layout()
plt.savefig('protoreason_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
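
The sample frame carries a `response_length` column that the plots above do not use. One possible follow-up, sketched here under the assumption that a plain Pearson correlation is an adequate first look, is to check whether longer responses go with higher or lower scores. The helper `pearson_r` is hypothetical, and with only ten sample rows the result should be read as a rough indication at most.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Values copied from the sample protoreason_data dict above.
scores = [0.85, 0.92, 0.75, 0.60, 0.95, 0.88, 0.92, 0.80, 0.72, 0.90]
lengths = [450, 380, 520, 610, 280, 350, 290, 680, 720, 420]

r = pearson_r(lengths, scores)
print(f"Pearson r (response_length vs score): {r:.2f}")
```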

## 3. CausalBio Analysis

Analyze causal perturbation prediction performance.

In [None]:
# Sample CausalBio results
causalbio_data = {
    'task_type': ['knockout', 'knockout', 'knockout', 'knockout', 'knockout',
                  'pathway', 'pathway', 'pathway', 'pathway',
                  'drug_response', 'drug_response', 'drug_response',
                  'epistasis', 'epistasis', 'epistasis'],
    'reasoning_type': ['oncogene_addiction', 'tumor_suppressor', 'synthetic_lethality', 
                       'core_essential', 'context_dependency',
                       'RTK', 'MAPK', 'PI3K', 'cell_cycle',
                       'kinase_inhibitor', 'epigenetic', 'targeted',
                       'synthetic_lethal', 'suppressive', 'enhancing'],
    'effect_correct': [True, True, True, True, False,
                       True, True, False, True,
                       True, False, True,
                       True, False, True],
    'mechanism_score': [0.85, 0.70, 0.90, 0.95, 0.45,
                        0.88, 0.82, 0.55, 0.78,
                        0.92, 0.60, 0.85,
                        0.80, 0.50, 0.75],
    'gene_or_drug': ['KRAS', 'TP53', 'PARP1', 'RPL13', 'KRAS-MCF7',
                     'EGFR_inh', 'BRAF_inh', 'mTOR_inh', 'CDK4/6_inh',
                     'Imatinib', 'JQ1', 'Dexamethasone',
                     'BRCA1-PARP1', 'BRCA1-53BP1', 'KRAS-STK11']
}

df_causal = pd.DataFrame(causalbio_data)
df_causal.head()

In [None]:
# CausalBio: Accuracy by task type
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Effect prediction accuracy
ax1 = axes[0]
accuracy_by_type = df_causal.groupby('task_type')['effect_correct'].mean()
colors = ['#2ecc71' if v > 0.7 else '#e74c3c' if v < 0.5 else '#f39c12' for v in accuracy_by_type]
accuracy_by_type.plot(kind='bar', ax=ax1, color=colors)
ax1.set_ylabel('Accuracy')
ax1.set_title('Effect Prediction Accuracy')
ax1.set_ylim(0, 1)
ax1.axhline(y=0.7, color='green', linestyle='--', alpha=0.5, label='Good threshold')
ax1.legend()  # without this call the threshold label is never displayed
ax1.tick_params(axis='x', rotation=45)

# Mechanism reasoning score
ax2 = axes[1]
df_causal.boxplot(column='mechanism_score', by='task_type', ax=ax2)
ax2.set_ylabel('Mechanism Score')
ax2.set_title('Mechanism Reasoning Quality')
plt.suptitle('')
ax2.tick_params(axis='x', rotation=45)

# Scatter: effect correct vs mechanism score
ax3 = axes[2]
colors = df_causal['task_type'].map({'knockout': 'blue', 'pathway': 'green', 
                                      'drug_response': 'orange', 'epistasis': 'red'})
rng = np.random.default_rng(0)  # fixed seed so the jitter is reproducible across runs
jitter = rng.normal(0, 0.05, len(df_causal))
ax3.scatter(df_causal['effect_correct'].astype(int) + jitter,
            df_causal['mechanism_score'], c=colors, alpha=0.7, s=100)
ax3.set_xlabel('Effect Correct (0=No, 1=Yes)')
ax3.set_ylabel('Mechanism Score')
ax3.set_title('Effect Accuracy vs Mechanism Quality')

plt.tight_layout()
plt.savefig('causalbio_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
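
Each task type in the sample has only three to five tasks, so the per-type accuracies plotted above are noisy point estimates. As a hedged sketch, a Wilson score interval can put a rough range around each one. This interval is not part of the original analysis, and `wilson_interval` is a hypothetical helper that assumes independent tasks and a 95% level (`z=1.96`).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# (correct, total) counts read off the sample effect_correct column above.
sample_counts = {
    "knockout": (4, 5),
    "pathway": (3, 4),
    "drug_response": (2, 3),
    "epistasis": (2, 3),
}

for task_type, (correct, total) in sample_counts.items():
    low, high = wilson_interval(correct, total)
    print(f"{task_type}: {correct}/{total} correct, interval [{low:.2f}, {high:.2f}]")
```

With samples this small the intervals are wide, which argues against ranking task types on the point estimates alone.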

In [None]:
# CausalBio: Detailed knockout analysis by reasoning type
df_ko = df_causal[df_causal['task_type'] == 'knockout'].copy()

fig, ax = plt.subplots(figsize=(10, 6))

ko_scores = df_ko.groupby('reasoning_type').agg({
    'effect_correct': 'mean',
    'mechanism_score': 'mean'
}).sort_values('mechanism_score')

x = np.arange(len(ko_scores))
width = 0.35

bars1 = ax.bar(x - width/2, ko_scores['effect_correct'], width, label='Effect Accuracy', color='#3498db')
bars2 = ax.bar(x + width/2, ko_scores['mechanism_score'], width, label='Mechanism Score', color='#2ecc71')

ax.set_ylabel('Score')
ax.set_xlabel('Reasoning Type')
ax.set_title('Knockout Prediction: Accuracy vs Reasoning Quality')
ax.set_xticks(x)
ax.set_xticklabels(ko_scores.index, rotation=45, ha='right')
ax.legend()
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.savefig('knockout_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

## 4. DesignCheck Analysis

Analyze experimental design critique performance.

In [None]:
# Sample DesignCheck results
designcheck_data = {
    'design_id': [f'design_{i:03d}' for i in range(1, 11)],
    'total_flaws': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'critical_flaws': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'flaws_detected': [3, 2, 3, 2, 3, 2, 3, 2, 2, 3],
    'critical_detected': [2, 1, 2, 2, 2, 1, 2, 1, 2, 2],
    'false_positives': [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
    'design_type': ['drug_response', 'knockout', 'western_blot', 'gene_expression',
                    'biomarker', 'comparison', 'scrnaseq', 'mechanism', 'screen', 'mouse']
}

df_design = pd.DataFrame(designcheck_data)
df_design['flaw_recall'] = df_design['flaws_detected'] / df_design['total_flaws']
df_design['critical_recall'] = df_design['critical_detected'] / df_design['critical_flaws']
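# Precision treats flaws_detected as true positives (real flaws only) and false_positives as spurious flags
df_design['precision'] = df_design['flaws_detected'] / (df_design['flaws_detected'] + df_design['false_positives'])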
df_design.head()
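
The per-design ratios above are later averaged across designs, which is a macro average. A complementary view, sketched below, pools the counts first (a micro average) and adds an F1 score. The helper `micro_prf` is hypothetical and relies on the same assumption as the precision column: `flaws_detected` counts only real flaws.

```python
def micro_prf(flaws_detected, total_flaws, false_positives):
    """Pool counts over all designs, then compute precision, recall, and F1."""
    tp = sum(flaws_detected)       # real flaws the model found
    fn = sum(total_flaws) - tp     # real flaws the model missed
    fp = sum(false_positives)      # flagged issues that were not real flaws
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Counts copied from the sample designcheck_data dict above.
detected = [3, 2, 3, 2, 3, 2, 3, 2, 2, 3]
total = [3] * 10
spurious = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = micro_prf(detected, total, spurious)
print(f"micro precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

The function only needs summable sequences, so the columns `df_design['flaws_detected']`, `df_design['total_flaws']`, and `df_design['false_positives']` can be passed in directly.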

In [None]:
# DesignCheck: Flaw detection performance
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Overall metrics
ax1 = axes[0]
metrics = ['flaw_recall', 'critical_recall', 'precision']
metric_values = [df_design[m].mean() for m in metrics]
metric_labels = ['All Flaws\nRecall', 'Critical Flaws\nRecall', 'Precision']
colors = ['#3498db', '#e74c3c', '#2ecc71']
bars = ax1.bar(metric_labels, metric_values, color=colors)
ax1.set_ylabel('Score')
ax1.set_title('Overall Flaw Detection Performance')
ax1.set_ylim(0, 1)
for bar, val in zip(bars, metric_values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
             f'{val:.2f}', ha='center', va='bottom', fontweight='bold')

# By design type
ax2 = axes[1]
df_design_sorted = df_design.sort_values('flaw_recall')
ax2.barh(df_design_sorted['design_type'], df_design_sorted['flaw_recall'], color='#3498db')
ax2.set_xlabel('Flaw Recall')
ax2.set_title('Flaw Detection by Design Type')
ax2.set_xlim(0, 1)

# Recall vs Precision scatter
ax3 = axes[2]
ax3.scatter(df_design['flaw_recall'], df_design['precision'], s=100, alpha=0.7)
for _, row in df_design.iterrows():
    ax3.annotate(row['design_type'], (row['flaw_recall'], row['precision']),
                 fontsize=8, alpha=0.7)
ax3.set_xlabel('Recall')
ax3.set_ylabel('Precision')
ax3.set_title('Recall vs Precision Trade-off')
ax3.set_xlim(0, 1.1)
ax3.set_ylim(0, 1.1)
ax3.plot([0, 1], [0, 1], 'k--', alpha=0.3)

plt.tight_layout()
plt.savefig('designcheck_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

## 5. Error Taxonomy Analysis

Analyze failure modes across all components.

In [None]:
# Sample error annotations
error_data = {
    'error_category': ['knowledge', 'knowledge', 'reasoning', 'reasoning', 'reasoning',
                       'procedural', 'procedural', 'uncertainty', 'uncertainty', 'communication'] * 3,
    'error_type': ['factual_hallucination', 'outdated_info', 'causal_reversal', 'pathway_truncation',
                   'overgeneralization', 'step_omission', 'reagent_confusion', 'overconfidence',
                   'missing_uncertainty', 'jargon_misuse'] * 3,
    'severity': ['critical', 'minor', 'major', 'major', 'minor',
                 'critical', 'major', 'major', 'minor', 'minor'] * 3,
    'component': ['causalbio'] * 10 + ['protoreason'] * 10 + ['designcheck'] * 10
}

df_errors = pd.DataFrame(error_data)
print(f"Total annotated errors: {len(df_errors)}")
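
Before plotting, it can help to see which categories carry the severe errors. The sketch below cross-tabulates category against severity with `pd.crosstab`. It uses one repetition of the sample annotations above so that it runs on its own, and the names `sample_errors`, `severity_table`, and `critical_share` are illustrative.

```python
import pandas as pd

# One repetition of the sample error annotations above.
sample_errors = pd.DataFrame({
    "error_category": ["knowledge", "knowledge", "reasoning", "reasoning", "reasoning",
                       "procedural", "procedural", "uncertainty", "uncertainty",
                       "communication"],
    "severity": ["critical", "minor", "major", "major", "minor",
                 "critical", "major", "major", "minor", "minor"],
})

# Count errors per (category, severity), with columns in a fixed severity order.
severity_order = ["critical", "major", "minor"]
severity_table = (
    pd.crosstab(sample_errors["error_category"], sample_errors["severity"])
    .reindex(columns=severity_order, fill_value=0)
)

# Share of each category's errors that are critical.
critical_share = severity_table["critical"] / severity_table.sum(axis=1)

print(severity_table)
print(critical_share.sort_values(ascending=False))
```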

In [None]:
# Error distribution analysis
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# By category
ax1 = axes[0, 0]
category_counts = df_errors['error_category'].value_counts()
colors = sns.color_palette('Set2', len(category_counts))
ax1.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%', colors=colors)
ax1.set_title('Errors by Category')

# By severity
ax2 = axes[0, 1]
severity_colors = {'critical': '#e74c3c', 'major': '#f39c12', 'minor': '#3498db'}
severity_counts = df_errors['severity'].value_counts()
ax2.bar(severity_counts.index, severity_counts.values, 
        color=[severity_colors[s] for s in severity_counts.index])
ax2.set_ylabel('Count')
ax2.set_title('Errors by Severity')

# Heatmap: category x component
ax3 = axes[1, 0]
pivot = df_errors.groupby(['error_category', 'component']).size().unstack(fill_value=0)
sns.heatmap(pivot, annot=True, fmt='d', cmap='YlOrRd', ax=ax3)
ax3.set_title('Error Distribution: Category × Component')

# Top error types
ax4 = axes[1, 1]
type_counts = df_errors['error_type'].value_counts().head(10)
type_counts.plot(kind='barh', ax=ax4, color='#3498db')
ax4.set_xlabel('Count')
ax4.set_title('Top 10 Error Types')

plt.tight_layout()
plt.savefig('error_taxonomy_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

## 6. Summary Statistics

In [None]:
# Generate summary table (illustrative values entered by hand for the demo, not computed from the frames above)
summary_stats = {
    'Component': ['ProtoReason', 'CausalBio', 'DesignCheck', 'Overall'],
    'Tasks': [10, 15, 10, 35],
    'Avg Score': [0.82, 0.75, 0.85, 0.80],
    'Best Category': ['Calculations (0.92)', 'Knockout (0.85)', 'Drug Response (0.90)', '-'],
    'Weakest Category': ['Missing Steps (0.68)', 'Epistasis (0.68)', 'Mechanism (0.72)', '-'],
    'Critical Errors': [2, 5, 3, 10]
}

df_summary = pd.DataFrame(summary_stats)
print("\n" + "="*80)
print("BIOEVAL RESULTS SUMMARY")
print("="*80)
print(df_summary.to_string(index=False))
print("="*80)
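
The best and weakest categories in the table are typed in by hand. A minimal sketch of deriving them from a results frame follows. `best_and_weakest` is a hypothetical helper, and it assumes the per-group mean of a single score column is the ranking criterion, which may not match how the hand-entered figures were chosen.

```python
import pandas as pd


def best_and_weakest(df, group_col, score_col):
    """Return ((best_group, mean), (weakest_group, mean)) by per-group mean score."""
    means = df.groupby(group_col)[score_col].mean()
    best = (means.idxmax(), float(means.max()))
    weakest = (means.idxmin(), float(means.min()))
    return best, weakest


# Small frame using a subset of the sample ProtoReason scores above.
demo = pd.DataFrame({
    "task_type": ["calculation", "calculation", "calculation",
                  "missing_step", "missing_step"],
    "score": [0.95, 0.88, 0.92, 0.75, 0.60],
})

best, weakest = best_and_weakest(demo, "task_type", "score")
print(f"Best: {best[0]} ({best[1]:.2f})")
print(f"Weakest: {weakest[0]} ({weakest[1]:.2f})")
```

Computing these entries from the data keeps the summary table in step with the detailed frames when the sample values are replaced by real results.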

## 7. Key Findings & Recommendations

In [None]:
key_findings = """
## Key Findings

### Strengths
1. **Strong procedural knowledge**: Model performs well on calculation tasks and protocol step ordering
2. **Good pattern recognition**: High accuracy on oncogene addiction predictions
3. **Safety awareness**: Correctly identifies most safety requirements

### Areas for Improvement
1. **Causal reasoning**: Struggles with epistasis and complex genetic interactions
2. **Missing step detection**: Often misses subtle but critical protocol steps
3. **Overconfidence**: High confidence on incorrect pathway predictions

### Systematic Error Patterns
- **Pathway truncation**: Tends to describe early signaling but misses downstream effects
- **Context blindness**: Fails to account for cell line-specific genetic background
- **Temporal confusion**: Sometimes reverses cause and effect in biological processes

### Recommendations
1. Include more perturbation data in training for causal reasoning
2. Improve uncertainty calibration: the model should express doubt on complex interactions
3. Better integration of genetic context (mutations, dependencies) in reasoning
"""

print(key_findings)

## 8. Export Results

In [None]:
# Export summary to CSV
df_summary.to_csv('bioeval_summary.csv', index=False)

# Export detailed results
df_proto.to_csv('protoreason_details.csv', index=False)
df_causal.to_csv('causalbio_details.csv', index=False)
df_design.to_csv('designcheck_details.csv', index=False)
df_errors.to_csv('error_annotations.csv', index=False)

print("Results exported to CSV files.")

---

## Appendix: Running Evaluation

```bash
# Run full evaluation
python scripts/run_evaluation.py --model claude-sonnet-4-20250514 --component all

# Run specific component
python scripts/run_evaluation.py --model claude-sonnet-4-20250514 --component causalbio

# Compare models
python scripts/run_evaluation.py --model claude-sonnet-4-20250514 --output results/claude.json
python scripts/run_evaluation.py --model gpt-4 --output results/gpt4.json
```
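
Once two runs have been written to disk, they can be compared side by side. The sketch below assumes each file follows the `summary.by_component` layout of the sample structure used earlier in this notebook (`num_tasks` and `completed` per component). The real output schema of `run_evaluation.py` may differ, and the helper names are hypothetical.

```python
def completion_rates(results: dict) -> dict:
    """Map component name -> completed / num_tasks for one results dict."""
    by_component = results.get("summary", {}).get("by_component", {})
    rates = {}
    for name, stats in by_component.items():
        num_tasks = stats.get("num_tasks", 0)
        # NaN marks components that report no tasks at all.
        rates[name] = stats.get("completed", 0) / num_tasks if num_tasks else float("nan")
    return rates


def compare_runs(runs: dict) -> dict:
    """Map component -> {run_label: completion_rate} across several runs."""
    table = {}
    for label, results in runs.items():
        for component, rate in completion_rates(results).items():
            table.setdefault(component, {})[label] = rate
    return table


# Hypothetical in-memory stand-ins for two loaded result files.
run_a = {"summary": {"by_component": {"protoreason": {"num_tasks": 30, "completed": 30}}}}
run_b = {"summary": {"by_component": {"protoreason": {"num_tasks": 30, "completed": 15}}}}

for component, by_run in compare_runs({"run_a": run_a, "run_b": run_b}).items():
    print(component, by_run)
```

In practice the two dicts would come from `json.load` on the files named in the `--output` arguments above.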