# Fine-Tuning LLMs: Results Visualization Notebook

This notebook visualizes training results, model comparisons, and evaluation metrics for the LLM fine-tuning pipeline.

## Prerequisites
- Models must be trained first (run `run_full_pipeline.ipynb` or `train_dual_lora.py`)
- Test data should be in `data/processed/test.json`
- Evaluation results should be in `results/metrics_report.json` (automated metrics)
- Human evaluation form should be in `results/human_evaluation_form.json` (optional, for human-in-the-loop evaluation)
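
A quick way to confirm these prerequisites before running the rest of the notebook is to check that each file exists. The sketch below is a minimal example: the `check_prerequisites` helper is hypothetical and not part of the pipeline, and the paths are the ones listed above.

```python
from pathlib import Path

# Paths taken from the prerequisites list; the flag marks whether a file is required.
PREREQUISITES = {
    "data/processed/test.json": True,
    "results/metrics_report.json": True,
    "results/human_evaluation_form.json": False,  # optional (human-in-the-loop)
}

def check_prerequisites(root="."):
    """Return (missing_required, missing_optional) lists of paths under root."""
    missing_required, missing_optional = [], []
    for rel_path, required in PREREQUISITES.items():
        if not (Path(root) / rel_path).exists():
            (missing_required if required else missing_optional).append(rel_path)
    return missing_required, missing_optional

missing_required, missing_optional = check_prerequisites()
for path in missing_required:
    print(f"[MISSING] {path}")
for path in missing_optional:
    print(f"[optional, not found] {path}")
```

If `missing_required` is non-empty, run the training and evaluation steps first.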

## Features
- **Automated Metrics**: BLEU, Exact Match, Symbolic Equivalence
- **Human Evaluation**: Mathematical Correctness, Completeness, Clarity, Overall Quality (1-5 scale)
- **Side-by-Side Comparison**: Compare automated vs human evaluations

---


## 1. Setup Environment


In [None]:
!pip install -q transformers accelerate peft bitsandbytes sympy wandb matplotlib seaborn plotly networkx pandas
!pip install -q datasets evaluate scikit-learn

import os
os.makedirs("results", exist_ok=True)
os.makedirs("figures", exist_ok=True)
os.makedirs("models", exist_ok=True)

print("[OK] Environment ready. Please connect GPU (Runtime → Change runtime type → GPU).")


## 2. Load Models and Configuration
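
The loading cell in this section reads `cfg["models"][i]["name"]` and `cfg["models"][i]["output_dir"]`, so `configs/training_config.yaml` needs a `models` list with at least two entries. The sketch below shows that shape as a Python dict with a small validation helper. The placeholder values and the `validate_config` name are assumptions for illustration, not the project's actual configuration.

```python
# Hypothetical stand-in for the parsed YAML; replace the placeholders with real values.
example_cfg = {
    "models": [
        {"name": "<llama-base-model-id>", "output_dir": "<path-to-llama-adapter>"},
        {"name": "<qwen-base-model-id>", "output_dir": "<path-to-qwen-adapter>"},
    ]
}

def validate_config(cfg, n_models=2):
    """Raise ValueError if cfg lacks the keys the loading cell relies on."""
    models = cfg.get("models")
    if not isinstance(models, list) or len(models) < n_models:
        raise ValueError(f"expected a 'models' list with at least {n_models} entries")
    for i, entry in enumerate(models[:n_models]):
        for key in ("name", "output_dir"):
            if key not in entry:
                raise ValueError(f"models[{i}] is missing '{key}'")
    return True

validate_config(example_cfg)
```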


In [None]:
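from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import yaml

with open("configs/training_config.yaml") as f:
    cfg = yaml.safe_load(f)

def load_model(model_name, adapter_dir):
    """Load a 4-bit quantized base model and attach its LoRA adapter."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Passing load_in_4bit directly to from_pretrained is deprecated;
    # use an explicit quantization config instead.
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()
    return tokenizer, model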

llama_tok, llama_model = load_model(cfg["models"][0]["name"], cfg["models"][0]["output_dir"])
qwen_tok, qwen_model = load_model(cfg["models"][1]["name"], cfg["models"][1]["output_dir"])

print("[OK] Models loaded successfully.")


## 3. Load Test Dataset
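
Later cells read `item["question"]` and `item["answer"]` from every test record and call string methods on both, so it can help to validate the file right after loading it. The sketch below is a minimal check; `find_invalid_records` is a hypothetical helper and the sample records are made up.

```python
import json

def find_invalid_records(records, required=("question", "answer")):
    """Return indices of records that lack a required key or hold a non-string value."""
    bad = []
    for i, rec in enumerate(records):
        if not all(isinstance(rec.get(k), str) for k in required):
            bad.append(i)
    return bad

# Made-up records for illustration only
sample = json.loads('[{"question": "2+2?", "answer": "4"}, {"question": "3+3?"}]')
print(find_invalid_records(sample))  # → [1]
```

Running `find_invalid_records(test_data)` after the load cell should return an empty list.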


In [None]:
import json

with open("data/processed/test.json") as f:
    test_data = json.load(f)

print(f"Loaded {len(test_data)} test problems.")
print(test_data[0].keys())


## 4. Generate Predictions and Evaluate
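
The generation cell wraps each question in a fixed prompt and then keeps only the text after the last `### Solution:` marker in the decoded output. The sketch below reproduces that string handling without a model, using a simulated decoder output; `build_prompt` and `extract_solution` are illustrative names, not functions defined in the pipeline.

```python
def build_prompt(question):
    """Same template as generate_answer in this section's code cell."""
    return f"### Question:\n{question}\n### Solution:"

def extract_solution(decoded_text):
    """Keep only the text after the last '### Solution:' marker."""
    return decoded_text.split("### Solution:")[-1].strip()

prompt = build_prompt("What is 2 + 2?")
# Simulated decoder output: causal LMs return the prompt followed by the completion
decoded = prompt + " 2 + 2 = 4"
print(extract_solution(decoded))  # → 2 + 2 = 4
```

Because the split uses the last marker, a model that repeats `### Solution:` in its own output would have everything before the repeat discarded.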


In [None]:
from tqdm import tqdm
import torch

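def generate_answer(model, tokenizer, question):
    """Generate a solution for one question using this notebook's prompt format."""
    input_text = f"### Question:\n{question}\n### Solution:"
    # Send inputs to whichever device the model was placed on (device_map="auto")
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=512)
    # The decoded text includes the prompt, so keep only what follows the last marker
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Solution:")[-1].strip()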

results = []
for item in tqdm(test_data):
    q = item["question"]
    gt = item["answer"]
    llama_pred = generate_answer(llama_model, llama_tok, q)
    qwen_pred = generate_answer(qwen_model, qwen_tok, q)
    results.append({
        "question": q,
        "ground_truth": gt,
        "llama_pred": llama_pred,
        "qwen_pred": qwen_pred
    })

import pandas as pd
df = pd.DataFrame(results)
df.to_csv("results/raw_predictions.csv", index=False)
print("[OK] Predictions generated and saved.")


## 5. Compute Metrics
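
Exact match is the strictest of the three metrics: it compares strings, not values. The sketch below mirrors the `compute_exact_match` logic from this section on a few made-up pairs to show which differences it tolerates. The `exact_match` name is used only for this illustration.

```python
def exact_match(ref, pred):
    """Case-insensitive comparison after trimming outer whitespace."""
    return ref.strip().lower() == pred.strip().lower()

print(exact_match("x = 4", " X = 4 "))  # True: only case and outer whitespace differ
print(exact_match("x = 4", "x=4"))      # False: inner spacing differs
print(exact_match("1/2", "0.5"))        # False: equal values, different strings
```

This is why the notebook also reports symbolic equivalence, which can accept answers that are mathematically equal but written differently.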


In [None]:
from evaluate import load as load_metric
from sympy import simplify, parse_expr
import re
import numpy as np

# Load the BLEU metric once instead of on every call
bleu_metric = load_metric("bleu")

def compute_exact_match(ref, pred):
    """Case-insensitive exact match after trimming surrounding whitespace."""
    return ref.strip().lower() == pred.strip().lower()

def compute_bleu(ref, pred):
    """Compute BLEU for one prediction against one reference."""
    try:
        # evaluate's BLEU expects raw strings and tokenizes them itself
        result = bleu_metric.compute(predictions=[pred], references=[[ref]])
        return result.get("bleu", 0.0)
    except Exception:
        # Score failures (e.g. an empty prediction) as 0 rather than aborting the loop
        return 0.0

def check_symbolic_equivalence(ref, pred):
    """Check symbolic equivalence using SymPy (simplified version)."""
    # Strip everything except characters that can appear in a simple expression
    ref_clean = re.sub(r'[^a-zA-Z0-9\s\+\-\*/\^\(\)=]', '', ref)
    pred_clean = re.sub(r'[^a-zA-Z0-9\s\+\-\*/\^\(\)=]', '', pred)
    try:
        ref_expr = parse_expr(ref_clean, transformations='all')
        pred_expr = parse_expr(pred_clean, transformations='all')
        return simplify(ref_expr - pred_expr) == 0
    except Exception:
        # Text that cannot be parsed as an expression counts as not equivalent
        return False

# Compute metrics for both models using the results DataFrame
llama_metrics = {
    "exact_match": [],
    "bleu": [],
    "symbolic_equiv": []
}

qwen_metrics = {
    "exact_match": [],
    "bleu": [],
    "symbolic_equiv": []
}

for idx, row in df.iterrows():
    reference = row["ground_truth"]
    llama_pred = row["llama_pred"]
    qwen_pred = row["qwen_pred"]
    
    # LLaMA metrics
    llama_metrics["exact_match"].append(compute_exact_match(reference, llama_pred))
    llama_metrics["bleu"].append(compute_bleu(reference, llama_pred))
    llama_metrics["symbolic_equiv"].append(check_symbolic_equivalence(reference, llama_pred))
    
    # Qwen metrics
    qwen_metrics["exact_match"].append(compute_exact_match(reference, qwen_pred))
    qwen_metrics["bleu"].append(compute_bleu(reference, qwen_pred))
    qwen_metrics["symbolic_equiv"].append(check_symbolic_equivalence(reference, qwen_pred))

# Add metrics to DataFrame
df["llama_exact_match"] = llama_metrics["exact_match"]
df["llama_bleu"] = llama_metrics["bleu"]
df["llama_symbolic_equiv"] = llama_metrics["symbolic_equiv"]
df["qwen_exact_match"] = qwen_metrics["exact_match"]
df["qwen_bleu"] = qwen_metrics["bleu"]
df["qwen_symbolic_equiv"] = qwen_metrics["symbolic_equiv"]

print("[OK] Metrics computed and added to DataFrame.")
print(f"\nSample metrics:")
print(df[["question", "llama_exact_match", "llama_bleu", "qwen_exact_match", "qwen_bleu"]].head())


## 5.1 Symbolic Equivalence Check with SymPy
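
The check in this section treats two answers as equivalent when their difference simplifies to zero, and treats anything SymPy cannot parse as not equivalent. The sketch below applies the same idea to a few made-up inputs; `symbolically_equal` is an illustrative name, and the example strings are not taken from the test set.

```python
from sympy import simplify, sympify

def symbolically_equal(a, b):
    """True if a - b simplifies to 0; False if either string cannot be parsed."""
    try:
        return simplify(sympify(a) - sympify(b)) == 0
    except Exception:
        return False

print(symbolically_equal("(x + 1)**2", "x**2 + 2*x + 1"))  # True
print(symbolically_equal("x + 1", "x + 2"))                 # False
print(symbolically_equal("The answer is 4", "4"))           # False (not parseable)
```

The last case matters in practice: a prediction that contains prose around the final expression will be scored as not equivalent unless the expression is extracted first.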


In [None]:
from sympy import simplify, sympify

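def check_symbolic_equivalence(gt, pred):
    """Parse the raw strings with sympify, without the regex cleanup used in Section 5.

    Note: this redefinition replaces the earlier check_symbolic_equivalence.
    """
    try:
        return simplify(sympify(gt) - sympify(pred)) == 0
    except Exception:
        return False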

df["llama_symbolic"] = df.apply(lambda r: check_symbolic_equivalence(r["ground_truth"], r["llama_pred"]), axis=1)
df["qwen_symbolic"] = df.apply(lambda r: check_symbolic_equivalence(r["ground_truth"], r["qwen_pred"]), axis=1)

print(df[["llama_symbolic", "qwen_symbolic"]].mean())
df.to_csv("results/symbolic_results.csv", index=False)
print("\n[OK] Symbolic equivalence checked and saved to results/symbolic_results.csv")


## 5.2 Compute Aggregate Metrics


In [None]:
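from evaluate import load as load_metric  # datasets.load_metric is deprecated and removed in newer datasets releases
bleu = load_metric("bleu")

def compute_metrics(df, model_col):
    """Aggregate exact match, corpus-level BLEU, and symbolic accuracy for one model."""
    exact = (df[model_col].str.strip() == df["ground_truth"].str.strip()).mean()
    # One prediction string per item, each paired with a list of reference strings
    bleu_score = bleu.compute(
        predictions=df[model_col].tolist(),
        references=[[r] for r in df["ground_truth"]],
    )["bleu"]
    symbolic = df[f"{model_col.split('_')[0]}_symbolic"].mean()
    return {"Exact Match": exact, "BLEU": bleu_score, "Symbolic Eq": symbolic}

# Keep the per-item lists in llama_metrics / qwen_metrics intact (later cells index
# them by "exact_match", "bleu", "symbolic_equiv") and add the aggregate keys alongside.
llama_summary = compute_metrics(df, "llama_pred")
qwen_summary = compute_metrics(df, "qwen_pred")
llama_metrics.update(llama_summary)
qwen_metrics.update(qwen_summary)

metrics_df = pd.DataFrame([llama_summary, qwen_summary], index=["LLaMA3", "Qwen3"])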
metrics_df.to_csv("results/metrics_summary.csv")
print("[OK] Metrics summary saved to results/metrics_summary.csv")
print("\nAggregate Metrics:")
metrics_df


### 6.1 Comprehensive Visualizations


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# --- Metric Comparison ---
metrics_df.plot(kind="bar", figsize=(8,5), title="Model Performance Comparison")
plt.ylabel("Score")
plt.xticks(rotation=0)
plt.legend(loc='best')
plt.tight_layout()
plt.savefig("figures/metric_comparison.png", dpi=300, bbox_inches='tight')
plt.show()

# --- Symbolic Verification Pie Charts ---
fig, ax = plt.subplots(1, 2, figsize=(10,4))
for i, model in enumerate(["llama_symbolic", "qwen_symbolic"]):
    vals = [df[model].sum(), len(df) - df[model].sum()]
    ax[i].pie(vals, labels=["Correct","Incorrect"], autopct="%1.1f%%", colors=["#4CAF50","#F44336"])
    ax[i].set_title(model.replace("_symbolic","").upper())
plt.tight_layout()
plt.savefig("figures/symbolic_pies.png", dpi=300, bbox_inches='tight')
plt.show()

# --- Qualitative Case Study Table ---
sample_df = df.sample(3)
from IPython.display import display, Markdown
display(Markdown("### Sample Comparison Table"))
display(sample_df[["question","ground_truth","llama_pred","qwen_pred"]])

print("[OK] Visualizations generated and saved in /figures.")


### 6.2 Model Size vs Symbolic Accuracy Comparison


In [None]:
# Params, VRAM, and training time are hardcoded here; update them to match your own runs
eff_df = pd.DataFrame({
    "Model": ["LLaMA3","Qwen3"],
    "Params (B)": [8, 7],
    "VRAM (GB)": [18, 16],
    "Symbolic Accuracy": [llama_metrics["Symbolic Eq"], qwen_metrics["Symbolic Eq"]],
    "Training Time (min)": [92, 84]
})

plt.figure(figsize=(7,5))
plt.scatter(eff_df["VRAM (GB)"], eff_df["Symbolic Accuracy"], s=eff_df["Training Time (min)"]*2, alpha=0.7)
for i,row in eff_df.iterrows():
    plt.text(row["VRAM (GB)"]+0.2, row["Symbolic Accuracy"], row["Model"])
plt.xlabel("VRAM Usage (GB)")
plt.ylabel("Symbolic Accuracy")
plt.title("Efficiency Tradeoff")
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("figures/efficiency_plot.png", dpi=300, bbox_inches='tight')
plt.show()

print("[OK] Efficiency plot saved to figures/efficiency_plot.png")
print("\nEfficiency Summary:")
print(eff_df)


## 6. Visualizations


## 7. Human-in-the-Loop Evaluation

This section compares automated metrics with human expert evaluations (professor ratings).

### 7.1 Load Human Evaluation Data
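
The loading cell in this section expects a JSON object with an `items` list, where each item carries a `human_evaluation` dict (and, for the correlation analysis later, an `automated_metrics` dict). The sketch below shows that shape, inferred from the keys this notebook reads, together with the same averaging rule: only items with a filled-in `overall_quality` count. The scores are made up and `average_human_scores` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical form contents; the scores are made up for illustration.
example_form = {
    "items": [
        {
            "automated_metrics": {"exact_match": False, "bleu_score": 0.42,
                                  "symbolic_equivalence": True},
            "human_evaluation": {"mathematical_correctness": 5, "completeness": 4,
                                 "clarity": 4, "overall_quality": 4},
        },
        {
            # Not yet rated: every score is still null in the JSON file
            "automated_metrics": {"exact_match": False, "bleu_score": 0.10,
                                  "symbolic_equivalence": False},
            "human_evaluation": {"mathematical_correctness": None, "completeness": None,
                                 "clarity": None, "overall_quality": None},
        },
    ]
}

CRITERIA = ("mathematical_correctness", "completeness", "clarity", "overall_quality")

def average_human_scores(form):
    """Average each criterion over items whose overall_quality has been filled in."""
    rated = [item.get("human_evaluation", {}) for item in form.get("items", [])]
    rated = [e for e in rated if e.get("overall_quality") is not None]
    averages = {}
    for criterion in CRITERIA:
        scores = [e[criterion] for e in rated if e.get(criterion) is not None]
        if scores:
            averages[criterion] = mean(scores)
    return averages

print(average_human_scores(example_form))
```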


In [None]:
# Load human evaluation form (if available)
import json
from pathlib import Path

human_eval_path = Path("results/human_evaluation_form.json")
human_data = None

if human_eval_path.exists():
    with open(human_eval_path, 'r') as f:
        human_data = json.load(f)
    
    # Extract human scores
    human_scores = {
        "mathematical_correctness": [],
        "completeness": [],
        "clarity": [],
        "overall_quality": []
    }
    
    completed = 0
    for item in human_data.get("items", []):
        eval_data = item.get("human_evaluation", {})
        if eval_data.get("overall_quality") is not None:
            completed += 1
            for criterion in human_scores.keys():
                score = eval_data.get(criterion)
                if score is not None:
                    human_scores[criterion].append(score)
    
    print(f"Human evaluation loaded: {completed}/{len(human_data.get('items', []))} items completed")
    print(f"Average scores:")
    for criterion, scores in human_scores.items():
        if scores:
            print(f"  {criterion}: {np.mean(scores):.2f} (scale: 1-5)")
else:
    print("Human evaluation form not found. Run:")
    print("  python -m 270FT.evaluation.human_evaluation --test_results results/metrics_report.json --output results/human_evaluation_form.json")
    print("\nThen fill in the scores and re-run this cell.")


### 7.2 Side-by-Side Comparison: Automated vs Human Metrics
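
The comparison cell in this section multiplies each automated rate by 5, which places automated metrics on a 0-5 scale while human ratings start at 1. The sketch below contrasts that scaling with an affine mapping that aligns both endpoints with the 1-5 rubric. The second mapping is an assumption offered for comparison, not what this notebook computes.

```python
def to_0_5(score):
    """Scaling used in this notebook: a 0-1 rate becomes 0-5."""
    return score * 5

def to_1_5(score):
    """Alternative (assumption): map 0 -> 1 and 1 -> 5 to match the rating rubric."""
    return 1 + 4 * score

for s in (0.0, 0.5, 1.0):
    print(f"rate={s:.1f}  x5={to_0_5(s):.1f}  1+4s={to_1_5(s):.1f}")
```

With the notebook's scaling, a model with a rate of 0 plots at 0, below the lowest score a human rater can give, so the two panels are best read for relative ordering rather than as directly comparable values.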


In [None]:
if human_data:
    # Prepare data for side-by-side comparison
    # Normalize automated metrics to 0-5 scale for comparison
    # BLEU: 0-1 -> 0-5 (multiply by 5)
    # Exact Match: 0-1 -> 0-5 (multiply by 5)
    # Symbolic Equivalence: 0-1 -> 0-5 (multiply by 5)
    
    # Get automated metrics from first model (assuming similar across models)
    auto_metrics_normalized = {
        "Exact Match": np.mean(llama_metrics["exact_match"]) * 5,
        "BLEU Score": np.mean(llama_metrics["bleu"]) * 5,
        "Symbolic Equivalence": np.mean(llama_metrics["symbolic_equiv"]) * 5
    }
    
    # Get human metrics (already on 1-5 scale)
    human_metrics_avg = {
        "Mathematical Correctness": np.mean(human_scores["mathematical_correctness"]) if human_scores["mathematical_correctness"] else None,
        "Completeness": np.mean(human_scores["completeness"]) if human_scores["completeness"] else None,
        "Clarity": np.mean(human_scores["clarity"]) if human_scores["clarity"] else None,
        "Overall Quality": np.mean(human_scores["overall_quality"]) if human_scores["overall_quality"] else None
    }
    
    # Create side-by-side comparison plot
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Left: Automated Metrics (normalized to 0-5)
    auto_categories = list(auto_metrics_normalized.keys())
    auto_values = list(auto_metrics_normalized.values())
    
    bars1 = axes[0].bar(auto_categories, auto_values, color='#3498db', alpha=0.7, edgecolor='black')
    axes[0].set_title('Automated Metrics (Normalized to 0-5 Scale)', fontsize=14, fontweight='bold')
    axes[0].set_ylabel('Score (0-5)', fontsize=12)
    axes[0].set_ylim(0, 5)
    axes[0].grid(axis='y', alpha=0.3)
    axes[0].tick_params(axis='x', rotation=45)
    
    # Add value labels
    for bar, val in zip(bars1, auto_values):
        height = bar.get_height()
        axes[0].text(bar.get_x() + bar.get_width()/2., height + 0.1,
                    f'{val:.2f}', ha='center', va='bottom', fontweight='bold')
    
    # Right: Human Evaluation Metrics (1-5 scale)
    human_categories = [k for k, v in human_metrics_avg.items() if v is not None]
    human_values = [v for v in human_metrics_avg.values() if v is not None]
    
    bars2 = axes[1].bar(human_categories, human_values, color='#e74c3c', alpha=0.7, edgecolor='black')
    axes[1].set_title('Human Evaluation Metrics (1-5 Scale)', fontsize=14, fontweight='bold')
    axes[1].set_ylabel('Score (1-5)', fontsize=12)
    axes[1].set_ylim(0, 5)
    axes[1].grid(axis='y', alpha=0.3)
    axes[1].tick_params(axis='x', rotation=45)
    
    # Add value labels
    for bar, val in zip(bars2, human_values):
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.1,
                    f'{val:.2f}', ha='center', va='bottom', fontweight='bold')
    
    plt.tight_layout()
    plt.savefig("figures/automated_vs_human_comparison.png", dpi=300, bbox_inches='tight')
    plt.show()
    
    print("[OK] Side-by-side comparison saved to figures/automated_vs_human_comparison.png")
    
    # Create comparison table
    comparison_df = pd.DataFrame({
        "Metric Type": ["Automated"] * len(auto_categories) + ["Human"] * len(human_categories),
        "Metric": auto_categories + human_categories,
        "Score": auto_values + human_values,
        "Scale": ["0-5 (normalized)"] * len(auto_categories) + ["1-5"] * len(human_categories)
    })
    
    print("\n" + "="*60)
    print("AUTOMATED vs HUMAN METRICS COMPARISON")
    print("="*60)
    print(comparison_df.to_string(index=False))
    print("="*60)
    
    comparison_df.to_csv("results/automated_vs_human_comparison.csv", index=False)
    print("\n[OK] Comparison table saved to results/automated_vs_human_comparison.csv")
else:
    print("Human evaluation data not available. Please complete the evaluation form first.")


### 7.3 Correlation: Automated Metrics vs Human Scores
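
`DataFrame.corr()` computes Pearson correlation by default; when one variable is a 0/1 flag such as exact match or symbolic equivalence, this is the point-biserial correlation. The sketch below computes the same quantity by hand on made-up values, which also shows why a metric that is constant across all rated items yields no defined correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns None if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # pandas reports NaN in this case
    return cov / sqrt(var_x * var_y)

# Made-up values: symbolic equivalence flags (0/1) and human correctness ratings (1-5)
symbolic = [1, 1, 0, 0, 1]
ratings = [5, 4, 2, 1, 5]
print(round(pearson(symbolic, ratings), 3))
```

With only a handful of rated items the coefficient is very noisy, so treat the heatmap as exploratory until more of the form is completed.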


In [None]:
if human_data and len(human_scores["overall_quality"]) > 0:
    # Match automated metrics with human scores per item
    items = human_data.get("items", [])
    
    # Extract per-item data
    item_data = []
    for item in items:
        auto_metrics = item.get("automated_metrics", {})
        human_eval = item.get("human_evaluation", {})
        
        if human_eval.get("overall_quality") is not None:
            item_data.append({
                "exact_match": 1 if auto_metrics.get("exact_match") else 0,
                "bleu_score": auto_metrics.get("bleu_score", 0.0),
                "symbolic_equiv": 1 if auto_metrics.get("symbolic_equivalence") else 0,
                "math_correctness": human_eval.get("mathematical_correctness"),
                "completeness": human_eval.get("completeness"),
                "clarity": human_eval.get("clarity"),
                "overall_quality": human_eval.get("overall_quality")
            })
    
    if item_data:
        corr_df = pd.DataFrame(item_data)
        
        # Compute correlation matrix
        corr_matrix = corr_df.corr()
        
        # Plot correlation heatmap
        fig, ax = plt.subplots(figsize=(10, 8))
        sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
                   vmin=-1, vmax=1, ax=ax, square=True, linewidths=0.5,
                   cbar_kws={'label': 'Correlation Coefficient'})
        ax.set_title('Correlation: Automated Metrics vs Human Evaluation', 
                    fontsize=14, fontweight='bold', pad=20)
        plt.tight_layout()
        plt.savefig("figures/automated_human_correlation.png", dpi=300, bbox_inches='tight')
        plt.show()
        
        print("[OK] Correlation heatmap saved to figures/automated_human_correlation.png")
        
        # Print key correlations
        print("\n" + "="*60)
        print("KEY CORRELATIONS")
        print("="*60)
        print(f"BLEU Score vs Overall Quality: {corr_matrix.loc['bleu_score', 'overall_quality']:.3f}")
        print(f"Exact Match vs Overall Quality: {corr_matrix.loc['exact_match', 'overall_quality']:.3f}")
        print(f"Symbolic Equiv vs Math Correctness: {corr_matrix.loc['symbolic_equiv', 'math_correctness']:.3f}")
        print(f"BLEU Score vs Completeness: {corr_matrix.loc['bleu_score', 'completeness']:.3f}")
        print("="*60)
else:
    print("Human evaluation data not available for correlation analysis.")


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Prepare data for visualization from the DataFrame
metrics_df = pd.DataFrame({
    "Model": ["LLaMA 3"] * len(df) + ["Qwen 3"] * len(df),
    "Exact Match": df["llama_exact_match"].tolist() + df["qwen_exact_match"].tolist(),
    "BLEU Score": df["llama_bleu"].tolist() + df["qwen_bleu"].tolist(),
    "Symbolic Equivalence": df["llama_symbolic_equiv"].tolist() + df["qwen_symbolic_equiv"].tolist()
})

print("DataFrame prepared for visualization:")
print(metrics_df.head())
print(f"\nTotal samples: {len(df)}")


### 6.1 Model Comparison - Aggregate Metrics


In [None]:
# Calculate aggregate metrics
agg_metrics = pd.DataFrame({
    "Model": ["LLaMA 3", "Qwen 3"],
    "Exact Match Rate": [
        np.mean(llama_metrics["exact_match"]),
        np.mean(qwen_metrics["exact_match"])
    ],
    "Avg BLEU Score": [
        np.mean(llama_metrics["bleu"]),
        np.mean(qwen_metrics["bleu"])
    ],
    "Symbolic Equivalence Rate": [
        np.mean(llama_metrics["symbolic_equiv"]),
        np.mean(qwen_metrics["symbolic_equiv"])
    ]
})

# Create comparison bar plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

metrics_to_plot = ["Exact Match Rate", "Avg BLEU Score", "Symbolic Equivalence Rate"]
colors = ["#3498db", "#e74c3c"]

for idx, metric in enumerate(metrics_to_plot):
    axes[idx].bar(agg_metrics["Model"], agg_metrics[metric], color=colors)
    axes[idx].set_title(metric, fontsize=12, fontweight='bold')
    axes[idx].set_ylabel("Score", fontsize=10)
    axes[idx].set_ylim(0, 1)
    axes[idx].grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for i, v in enumerate(agg_metrics[metric]):
        axes[idx].text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig("figures/model_comparison_metrics.png", dpi=300, bbox_inches='tight')
plt.show()

print("[OK] Comparison plot saved to figures/model_comparison_metrics.png")


### 6.2 Distribution of Scores


In [None]:
# Distribution plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Exact Match distribution
exact_match_data = pd.DataFrame({
    "LLaMA 3": llama_metrics["exact_match"],
    "Qwen 3": qwen_metrics["exact_match"]
}).astype(int)  # cast booleans to 0/1 so the histogram has numeric data
exact_match_data.plot(kind='hist', bins=2, ax=axes[0], alpha=0.7, color=colors)
axes[0].set_title("Exact Match Distribution", fontsize=12, fontweight='bold')
axes[0].set_xlabel("Match (0=No, 1=Yes)")
axes[0].set_ylabel("Frequency")
axes[0].legend()

# BLEU Score distribution
bleu_data = pd.DataFrame({
    "LLaMA 3": llama_metrics["bleu"],
    "Qwen 3": qwen_metrics["bleu"]
})
bleu_data.plot(kind='hist', bins=20, ax=axes[1], alpha=0.7, color=colors)
axes[1].set_title("BLEU Score Distribution", fontsize=12, fontweight='bold')
axes[1].set_xlabel("BLEU Score")
axes[1].set_ylabel("Frequency")
axes[1].legend()

# Symbolic Equivalence distribution
sym_data = pd.DataFrame({
    "LLaMA 3": llama_metrics["symbolic_equiv"],
    "Qwen 3": qwen_metrics["symbolic_equiv"]
}).astype(int)  # cast booleans to 0/1 so the histogram has numeric data
sym_data.plot(kind='hist', bins=2, ax=axes[2], alpha=0.7, color=colors)
axes[2].set_title("Symbolic Equivalence Distribution", fontsize=12, fontweight='bold')
axes[2].set_xlabel("Equivalence (0=No, 1=Yes)")
axes[2].set_ylabel("Frequency")
axes[2].legend()

plt.tight_layout()
plt.savefig("figures/score_distributions.png", dpi=300, bbox_inches='tight')
plt.show()

print("[OK] Distribution plots saved to figures/score_distributions.png")


### 6.3 Box Plots for Score Comparison


In [None]:
# Box plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# BLEU Score box plot
bleu_df = pd.DataFrame({
    "LLaMA 3": llama_metrics["bleu"],
    "Qwen 3": qwen_metrics["bleu"]
})
bleu_df.boxplot(ax=axes[0], color=dict(boxes=colors[0], whiskers=colors[0], medians=colors[1]))
axes[0].set_title("BLEU Score Comparison", fontsize=12, fontweight='bold')
axes[0].set_ylabel("BLEU Score")
axes[0].grid(axis='y', alpha=0.3)

# Exact Match (as percentage)
exact_df = pd.DataFrame({
    "LLaMA 3": [x * 100 for x in llama_metrics["exact_match"]],
    "Qwen 3": [x * 100 for x in qwen_metrics["exact_match"]]
})
exact_df.boxplot(ax=axes[1], color=dict(boxes=colors[0], whiskers=colors[0], medians=colors[1]))
axes[1].set_title("Exact Match Comparison", fontsize=12, fontweight='bold')
axes[1].set_ylabel("Exact Match (%)")
axes[1].grid(axis='y', alpha=0.3)

# Symbolic Equivalence (as percentage)
sym_df = pd.DataFrame({
    "LLaMA 3": [x * 100 for x in llama_metrics["symbolic_equiv"]],
    "Qwen 3": [x * 100 for x in qwen_metrics["symbolic_equiv"]]
})
sym_df.boxplot(ax=axes[2], color=dict(boxes=colors[0], whiskers=colors[0], medians=colors[1]))
axes[2].set_title("Symbolic Equivalence Comparison", fontsize=12, fontweight='bold')
axes[2].set_ylabel("Symbolic Equivalence (%)")
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig("figures/box_plots_comparison.png", dpi=300, bbox_inches='tight')
plt.show()

print("[OK] Box plots saved to figures/box_plots_comparison.png")


### 6.4 Correlation Analysis


In [None]:
# Correlation matrix for LLaMA 3
llama_corr = pd.DataFrame({
    "Exact Match": llama_metrics["exact_match"],
    "BLEU": llama_metrics["bleu"],
    "Symbolic Equiv": llama_metrics["symbolic_equiv"]
}).corr()

# Correlation matrix for Qwen 3
qwen_corr = pd.DataFrame({
    "Exact Match": qwen_metrics["exact_match"],
    "BLEU": qwen_metrics["bleu"],
    "Symbolic Equiv": qwen_metrics["symbolic_equiv"]
}).corr()

# Plot correlation matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.heatmap(llama_corr, annot=True, fmt='.3f', cmap='coolwarm', center=0, 
            vmin=-1, vmax=1, ax=axes[0], cbar_kws={'label': 'Correlation'})
axes[0].set_title("LLaMA 3 - Metric Correlations", fontsize=12, fontweight='bold')

sns.heatmap(qwen_corr, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            vmin=-1, vmax=1, ax=axes[1], cbar_kws={'label': 'Correlation'})
axes[1].set_title("Qwen 3 - Metric Correlations", fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig("figures/correlation_matrices.png", dpi=300, bbox_inches='tight')
plt.show()

print("[OK] Correlation matrices saved to figures/correlation_matrices.png")


### 6.5 Performance Summary Table


In [None]:
# Create comprehensive summary table
summary_data = {
    "Metric": ["Exact Match Rate", "Avg BLEU Score", "Symbolic Equivalence Rate",
               "Std BLEU Score", "Min BLEU", "Max BLEU"],
    "LLaMA 3": [
        f"{np.mean(llama_metrics['exact_match']):.4f}",
        f"{np.mean(llama_metrics['bleu']):.4f}",
        f"{np.mean(llama_metrics['symbolic_equiv']):.4f}",
        f"{np.std(llama_metrics['bleu']):.4f}",
        f"{np.min(llama_metrics['bleu']):.4f}",
        f"{np.max(llama_metrics['bleu']):.4f}"
    ],
    "Qwen 3": [
        f"{np.mean(qwen_metrics['exact_match']):.4f}",
        f"{np.mean(qwen_metrics['bleu']):.4f}",
        f"{np.mean(qwen_metrics['symbolic_equiv']):.4f}",
        f"{np.std(qwen_metrics['bleu']):.4f}",
        f"{np.min(qwen_metrics['bleu']):.4f}",
        f"{np.max(qwen_metrics['bleu']):.4f}"
    ]
}

summary_df = pd.DataFrame(summary_data)
print("\n" + "="*60)
print("PERFORMANCE SUMMARY")
print("="*60)
print(summary_df.to_string(index=False))
print("="*60)

# Save to CSV
summary_df.to_csv("results/performance_summary.csv", index=False)
print("\n[OK] Summary saved to results/performance_summary.csv")


### 6.6 Interactive Plotly Visualizations (Optional)


In [None]:
try:
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots
    
    # Create interactive comparison
    fig = make_subplots(
        rows=1, cols=3,
        subplot_titles=("Exact Match Rate", "BLEU Score", "Symbolic Equivalence"),
        specs=[[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}]]
    )
    
    models = ["LLaMA 3", "Qwen 3"]
    exact_rates = [np.mean(llama_metrics["exact_match"]), np.mean(qwen_metrics["exact_match"])]
    bleu_scores = [np.mean(llama_metrics["bleu"]), np.mean(qwen_metrics["bleu"])]
    sym_rates = [np.mean(llama_metrics["symbolic_equiv"]), np.mean(qwen_metrics["symbolic_equiv"])]
    
    fig.add_trace(
        go.Bar(x=models, y=exact_rates, name="Exact Match", marker_color=colors),
        row=1, col=1
    )
    
    fig.add_trace(
        go.Bar(x=models, y=bleu_scores, name="BLEU", marker_color=colors),
        row=1, col=2
    )
    
    fig.add_trace(
        go.Bar(x=models, y=sym_rates, name="Symbolic Equiv", marker_color=colors),
        row=1, col=3
    )
    
    fig.update_layout(
        title_text="Interactive Model Comparison",
        showlegend=False,
        height=400
    )
    
    fig.update_yaxes(range=[0, 1], row=1, col=1)
    fig.update_yaxes(range=[0, 1], row=1, col=2)
    fig.update_yaxes(range=[0, 1], row=1, col=3)
    
    fig.write_html("figures/interactive_comparison.html")
    fig.show()
    
    print("[OK] Interactive plot saved to figures/interactive_comparison.html")
except ImportError:
    print("Plotly not available, skipping interactive visualization")


## 8. Export Results


In [None]:
# Save detailed results to JSON
detailed_results = {
    "summary": {
        "llama3": {
            "exact_match_rate": float(np.mean(df["llama_exact_match"])),
            "avg_bleu": float(np.mean(df["llama_bleu"])),
            "symbolic_equiv_rate": float(np.mean(df["llama_symbolic_equiv"]))
        },
        "qwen3": {
            "exact_match_rate": float(np.mean(df["qwen_exact_match"])),
            "avg_bleu": float(np.mean(df["qwen_bleu"])),
            "symbolic_equiv_rate": float(np.mean(df["qwen_symbolic_equiv"]))
        }
    },
    "per_item_results": []
}

for idx, row in df.iterrows():
    detailed_results["per_item_results"].append({
        "item_id": int(idx),
        "question": row["question"],
        "ground_truth": row["ground_truth"],
        "llama_prediction": row["llama_pred"],
        "qwen_prediction": row["qwen_pred"],
        "llama_metrics": {
            "exact_match": bool(row["llama_exact_match"]),
            "bleu": float(row["llama_bleu"]),
            "symbolic_equiv": bool(row["llama_symbolic_equiv"])
        },
        "qwen_metrics": {
            "exact_match": bool(row["qwen_exact_match"]),
            "bleu": float(row["qwen_bleu"]),
            "symbolic_equiv": bool(row["qwen_symbolic_equiv"])
        }
    })

with open("results/detailed_evaluation_results.json", "w") as f:
    json.dump(detailed_results, f, indent=2)

# Also save the full DataFrame with all metrics
df.to_csv("results/predictions_with_metrics.csv", index=False)

print("[OK] Detailed results saved to results/detailed_evaluation_results.json")
print("[OK] Full DataFrame with metrics saved to results/predictions_with_metrics.csv")
print(f"\nTotal items evaluated: {len(df)}")
print(f"Figures saved to: figures/")
print(f"Results saved to: results/")


---

## Summary

This notebook provides comprehensive visualization and analysis of model performance:

1. **Model Comparison**: Side-by-side comparison of LLaMA 3 and Qwen 3
2. **Score Distributions**: Understanding how metrics vary across test items
3. **Statistical Analysis**: Box plots, correlations, and summary statistics
4. **Interactive Visualizations**: Plotly charts for exploration
5. **Export**: All results saved to `results/` and `figures/` directories

### Key Insights

- Compare exact match rates between models
- Analyze BLEU score distributions
- Check symbolic equivalence performance
- Identify correlation patterns between metrics

### Next Steps

- Analyze specific failure cases
- Compare predictions side-by-side
- Fine-tune models based on insights
- Expand test dataset
