# RESP AI Findings Demo
This notebook walks through the evaluation results for outcome, process, and structured response styles on a GSM8K slice.

**Highlights**
- Process-style reasoning reached 1.00 accuracy vs. 0.86 for both outcome and structured; with McNemar p=0.617, the difference is not statistically significant at this sample size.
- Structured outputs were most auditable (clarity/verification/coherence ≈ 3.86) while process traces were the most faithful to the underlying reasoning (2.0 mean faithfulness).
- Accuracy is most correlated with faithfulness (r≈0.95) and moderately with step correctness (r≈0.69).
- The notebook reloads metrics from `results/`, recreates a few plots, and embeds the saved figures for quick review.
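
To make the significance claim above easier to sanity-check, here is a minimal sketch of an exact (binomial) McNemar test computed from discordant pair counts. The function name `mcnemar_exact` and the counts are hypothetical and are not read from `results/`; the saved p-value may have been produced by a different variant (for example the chi-square approximation), so this sketch is not expected to reproduce it.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: items condition A answered correctly and condition B answered incorrectly
    c: items condition A answered incorrectly and condition B answered correctly
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
print(mcnemar_exact(0, 1))  # 1.0
print(mcnemar_exact(0, 5))  # 0.0625
print(mcnemar_exact(0, 6))  # 0.03125
```

With the exact test, even a perfectly one-sided split needs at least six discordant items to fall below 0.05, which is one reason a small slice rarely yields a significant result.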

In [None]:
from pathlib import Path
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image, display

sns.set_theme(style="whitegrid")
project_root = Path.cwd()
if project_root.name == "notebooks":
    project_root = project_root.parent
results_dir = project_root / "results"
print(f"Using results directory: {results_dir}")

In [None]:
with open(results_dir / "accuracy_metrics.json", "r", encoding="utf-8") as f:
    accuracy_metrics = json.load(f)
with open(results_dir / "interpretability_metrics.json", "r", encoding="utf-8") as f:
    interpretability_metrics = json.load(f)
with open(results_dir / "statistical_analysis.json", "r", encoding="utf-8") as f:
    statistical_analysis = json.load(f)

accuracy_df = pd.DataFrame(accuracy_metrics["accuracy_comparison"]["comparison_table"])
accuracy_df

In [None]:
_, ax = plt.subplots(figsize=(4, 3))
sns.barplot(data=accuracy_df, x="Condition", y="Accuracy", palette="Blues_d", ax=ax)
ax.set_ylim(0, 1.05)
ax.set_title("Accuracy by condition")
for p in ax.patches:
    ax.annotate(f"{p.get_height():.2f}", (p.get_x() + p.get_width() / 2, p.get_height() + 0.01), ha="center")
plt.tight_layout()
plt.show()

stat = accuracy_metrics["accuracy_comparison"]
nan = float("nan")
# The p-value may be stored at the top level or under "accuracy_comparison"; check both.
p_key = "mcnemar_outcome_vs_process_pvalue"
p_value = accuracy_metrics.get(p_key, stat.get(p_key, nan))
print("McNemar outcome vs process p-value:", round(p_value, 3))
print("Effect size (outcome vs process):", round(stat.get("effect_size_outcome_vs_process", nan), 3))

In [None]:
rows = []
for condition, metrics in interpretability_metrics["summary"].items():
    row = {"condition": condition}
    for key in ["step_correctness", "faithfulness", "clarity", "verification_effort", "coherence"]:
        row[key] = metrics[key]["mean"]
    rows.append(row)
interpret_df = pd.DataFrame(rows).set_index("condition").round(3)
interpret_df

In [None]:
plot_df = interpret_df.reset_index().melt(id_vars="condition", var_name="metric", value_name="score")
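# Plot only these conditions, in this legend order; fall back to seaborn's default if none match.
condition_order = [c for c in ["process", "structured"] if c in interpret_df.index] or None
metrics_order = ["step_correctness", "faithfulness", "clarity", "verification_effort", "coherence"]
_, ax = plt.subplots(figsize=(8, 4))
sns.barplot(data=plot_df, x="metric", y="score", hue="condition", order=metrics_order, hue_order=condition_order, palette="Set2", ax=ax)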
ax.set_title("Interpretability metrics (means)")
ax.set_ylim(0, 4.2)
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()

In [None]:
pearson_df = pd.DataFrame(statistical_analysis["correlations"]["pearson"]).astype(float)
_, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(pearson_df, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Pearson correlations across metrics")
plt.tight_layout()
plt.show()
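
The heatmap above is drawn from correlations that were already computed and saved. As a reminder of what each cell represents, here is a minimal sketch of the Pearson coefficient in plain Python. The helper name `pearson_r` and the sample values are hypothetical and are not taken from `statistical_analysis.json`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        # A constant variable has zero variance, so r is undefined.
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores, for illustration only.
accuracy = [0.0, 1.0, 1.0, 1.0]
faithfulness = [0.5, 1.5, 2.0, 2.0]
print(round(pearson_r(accuracy, faithfulness), 2))
```

The zero-variance guard matters in this setting: if every item in a condition is answered correctly, accuracy within that condition is constant and its correlation with any other metric is undefined.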

In [None]:
figure_dir = results_dir / "figures"
figure_names = [
    "01_accuracy_comparison.png",
    "03_interpretability_metrics.png",
    "05_summary_comparison.png",
    "06_correlation_heatmap.png",
    "07_faithfulness_vs_accuracy.png",
]
for name in figure_names:
    path = figure_dir / name
    if path.exists():
        display(Image(filename=str(path)))  # str() keeps this working on IPython versions that expect a string path
    else:
        print(f"Missing figure: {name}")

# Next steps
- Swap in a larger evaluation set and rerun the pipeline to firm up the significance tests.
- Add error analysis cells to inspect mispredicted items and their rationales.
- Wire this notebook into the report workflow (e.g., papermill) so it refreshes alongside new results.
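
As a starting point for the error analysis step, here is a minimal sketch that filters per-item records down to the mispredicted ones for a given condition. The record fields (`id`, `condition`, `gold`, `prediction`, `rationale`) and the helper name `mispredicted` are assumptions for illustration; the actual per-item output of the pipeline may use a different schema.

```python
def mispredicted(records, condition):
    """Return the records for `condition` whose prediction differs from the gold answer."""
    wrong = []
    for rec in records:
        if rec["condition"] != condition:
            continue
        # Compare as stripped strings so stray whitespace does not count as an error.
        if str(rec["prediction"]).strip() != str(rec["gold"]).strip():
            wrong.append(rec)
    return wrong


# Hypothetical records, for illustration only.
records = [
    {"id": "ex-1", "condition": "outcome", "gold": "18", "prediction": "18", "rationale": ""},
    {"id": "ex-2", "condition": "outcome", "gold": "5", "prediction": "7", "rationale": ""},
    {"id": "ex-2", "condition": "process", "gold": "5", "prediction": "5 ", "rationale": "step-by-step trace"},
]

for rec in mispredicted(records, "outcome"):
    print(rec["id"], "gold:", rec["gold"], "predicted:", rec["prediction"])
```

Keeping the rationale alongside each wrong answer makes it easier to see whether an error comes from the reasoning itself or from answer extraction.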