# 🧪 Free-text Field Evaluation Runner

This notebook runs LLM-based evaluation for free-text fields on all model outputs, saving the results to Excel files (`free_text_eval_{model}_{split}.xlsx`) for downstream analysis and reporting.

- **Inputs:** All model prediction folders under `outputs/`
- **Outputs:** One Excel file per model and split with LLM-scored results
- **No performance aggregation or summary happens in this notebook.**  
  See the main evaluation notebook for metrics and tables.

In [None]:
import os
import re
import json
from pathlib import Path
import pandas as pd
from dotenv import load_dotenv
from openai import AzureOpenAI
from tqdm.auto import tqdm

## 1. Setup Azure OpenAI Client

Environment variables required:
- `ENDPOINT_URL`
- `AZURE_OPENAI_API_KEY`
- `DEPLOYMENT_NAME`
- `AZURE_OPENAI_API_VERSION` (optional; defaults to `2025-01-01-preview`)
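
`os.getenv` returns `None` for an unset variable, so a missing setting may only show up as a less obvious error further down. A minimal sketch of an up-front check, assuming the variable names listed above; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, not part of this notebook:

```python
import os

# Required settings, taken from the list above (the API version is optional).
REQUIRED_VARS = ("ENDPOINT_URL", "AZURE_OPENAI_API_KEY", "DEPLOYMENT_NAME")

def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]

# Example with an explicit mapping instead of the real environment.
example_env = {"ENDPOINT_URL": "https://example.invalid", "DEPLOYMENT_NAME": " "}
print(missing_env_vars(env=example_env))
```

Calling `missing_env_vars()` right after `load_dotenv()` and raising when the list is non-empty is one way to fail fast.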

In [None]:
# Load Azure settings
load_dotenv()
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("ENDPOINT_URL"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
)
DEPLOYMENT_NAME = os.getenv("DEPLOYMENT_NAME")

# Regex to strip code fences from LLM output
fence_re = re.compile(r"^```(?:json)?\s*|\s*```$", flags=re.MULTILINE)
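
To illustrate what the fence-stripping pattern does, here is a small standalone sketch. It rebuilds the same regex from a `FENCE` constant (only to avoid literal backtick runs in the example); `strip_fences` is a hypothetical helper name, and the sample replies are made up for illustration:

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime

# Same pattern as fence_re in the setup cell, assembled from FENCE.
fence_re = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", flags=re.MULTILINE)

def strip_fences(text):
    """Remove a leading code fence (optionally tagged json) and a trailing fence."""
    return fence_re.sub("", text.strip())

# A fenced reply and a plain reply should parse to the same object.
fenced = FENCE + 'json\n{"audience_correct": 1}\n' + FENCE
plain = '{"audience_correct": 1}'

print(json.loads(strip_fences(fenced)))
print(json.loads(strip_fences(plain)))
```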


## 2. Run LLM Evaluation for Each Model and Split

This cell will:
- Scan all model subfolders in `outputs/`
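- For each model and split (`train`, `test`), run LLM evaluation on every record where both the ground truth and the prediction are classified `Included` and at least one ground-truth free-text field is non-blank
- Save results to `outputs/{model}/{split}/free_text_eval_{model}_{split}.xlsx`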
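
The record filter described above can be sketched as a pair of small predicates. This is a simplified illustration, not the cell's exact code: `is_included`, `has_text`, and `should_score` are hypothetical helper names, and any non-`None` value with a non-blank string form counts as present.

```python
# Ground-truth keys the notebook reads, including the identity_factors_gt fallback.
TRUTH_FIELDS = ("audience", "methodology", "sample_size",
                "identity_factors", "identity_factors_gt")

def is_included(value):
    """True when value is a string equal to 'included', ignoring case and whitespace."""
    return isinstance(value, str) and value.strip().lower() == "included"

def has_text(value):
    """True when value is not None and its string form is non-blank."""
    return value is not None and str(value).strip() != ""

def should_score(ground_truth, prediction):
    """Score only when both labels are 'Included' and some truth field has content."""
    if not is_included(prediction.get("classification", "")):
        return False
    if not is_included(ground_truth.get("classification", "")):
        return False
    return any(has_text(ground_truth.get(name)) for name in TRUTH_FIELDS)

print(should_score({"classification": "Included", "audience": "nurses"},
                   {"classification": " included "}))
```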

In [None]:
outputs_root = Path.cwd() / "outputs"
ft_metrics = []

for model_dir in outputs_root.iterdir():
    if not model_dir.is_dir():
        continue
    model = model_dir.name

    for split in ("train", "test"):
        preds_dir = model_dir / split / "predictions"
        if not preds_dir.exists():
            continue
        preds = sorted(preds_dir.glob("*.json"))
        n_skipped = 0

        for jf in tqdm(preds, desc=f"eval free-text {model} {split}"):
            rec = json.loads(jf.read_text(encoding="utf-8"))
            gt  = rec["ground_truth"]
            pr  = rec["prediction"]

            # Require both truth and prediction to be 'Included'
            cls_pred = pr.get("classification", "")
            cls_truth = gt.get("classification", "")
            if not (isinstance(cls_pred, str) and cls_pred.strip().lower() == "included"):
                n_skipped += 1
                continue
            if not (isinstance(cls_truth, str) and cls_truth.strip().lower() == "included"):
                n_skipped += 1
                continue

            # Extract ground-truth free-text fields
            aud_gt = gt.get("audience", "") or ""
            meth_gt = gt.get("methodology", "") or ""
            sz_raw = gt.get("sample_size", None)
            sz_gt = str(sz_raw).strip() if sz_raw not in (None, "") else ""
            idfac_gt = gt.get("identity_factors") or gt.get("identity_factors_gt") or ""

            # Skip if all truth fields are blank
            if not any([aud_gt.strip(), meth_gt.strip(), sz_gt, idfac_gt.strip()]):
                n_skipped += 1
                continue

            # Extract predicted free-text fields
            aud_pr = pr.get("audience", "") or ""
            meth_pr = pr.get("methodology", "") or ""
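            # Mirror the ground-truth handling so a missing value is "" rather than "None"
            sz_pr_raw = pr.get("sample_size", None)
            sz_pr = str(sz_pr_raw).strip() if sz_pr_raw not in (None, "") else ""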
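            # Accept either key casing; the ground truth uses lowercase "identity_factors"
            idfac_pr = pr.get("identity_factors") or pr.get("Identity_Factors") or ""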

            prompt = f"""
You are an evaluator of free-text fields. Compare these pairs:

Ground truth Audience: {aud_gt}
Predicted Audience: {aud_pr}

Ground truth Methodology: {meth_gt}
Predicted Methodology: {meth_pr}

Ground truth Sample Size: {sz_gt}
Predicted Sample Size: {sz_pr}

Ground truth Identity Factors: {idfac_gt}
Predicted Identity Factors: {idfac_pr}

Be lenient and forgiving:
- Audience is correct if the core demographic appears in the predicted text even with extra qualifiers
- Methodology is correct if it’s semantically equivalent or more specific
- Sample Size is correct if numeric values match ignoring formatting (for example "n=20" vs "20")
- Identity Factors is correct if core identity considerations in ground-truth appear in the prediction even with extra qualifiers

Reply in JSON format like this:

{{
  "audience_correct": 0_or_1,
  "methodology_correct": 0_or_1,
  "sample_size_correct": 0_or_1,
  "identity_factors_correct": 0_or_1
}}
"""

            comp = {
                "audience_correct": None,
                "methodology_correct": None,
                "sample_size_correct": None,
                "identity_factors_correct": None
            }
            for _ in range(5):
                resp = client.chat.completions.create(
                    model=DEPLOYMENT_NAME,
                    messages=[
                        {
                            "role": "system",
                            "content": "You are a helpful assistant that evaluates the similarity of pairs of free-text fields and responds in JSON format."
                        },
                        {
                            "role": "user",
                            "content": prompt
                        }
                    ],
                    temperature=0
                )
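                # message.content may be None, so guard before stripping
                text = fence_re.sub("", (resp.choices[0].message.content or "").strip())
                try:
                    parsed = json.loads(text)
                except json.JSONDecodeError:
                    continue  # unparseable reply: retry
                # Accept only a JSON object; anything else would break comp.get() below
                if isinstance(parsed, dict):
                    comp = parsed
                    break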

            ft_metrics.append({
                "model": model,
                "split": split,
                "id": jf.stem,
                "audience_gt": aud_gt,
                "audience_pred": aud_pr,
                "audience_correct": comp.get("audience_correct", None),
                "methodology_gt": meth_gt,
                "methodology_pred": meth_pr,
                "methodology_correct": comp.get("methodology_correct", None),
                "sample_size_gt": sz_gt,
                "sample_size_pred": sz_pr,
                "sample_size_correct": comp.get("sample_size_correct", None),
                "identity_factors_gt": idfac_gt,
                "identity_factors_pred": idfac_pr,
                "identity_factors_correct": comp.get("identity_factors_correct", None),
            })

        # Save XLSX for this model and split
        df_split = pd.DataFrame([r for r in ft_metrics if r["model"] == model and r["split"] == split])
        if not df_split.empty:
            out_dir = outputs_root / model / split
            out_dir.mkdir(parents=True, exist_ok=True)
            file_path = out_dir / f"free_text_eval_{model}_{split}.xlsx"
            df_split[[
                "id",
                "audience_gt", "audience_pred", "audience_correct",
                "methodology_gt", "methodology_pred", "methodology_correct",
                "sample_size_gt", "sample_size_pred", "sample_size_correct",
                "identity_factors_gt", "identity_factors_pred", "identity_factors_correct"
            ]].to_excel(file_path, index=False)
            print(f"Saved {file_path}")
        else:
            print(f"Nothing to score for {model} {split}. Skipped {n_skipped} records.")
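
The scoring loop records whatever values the judge returns under the four `*_correct` keys. A minimal sketch of a stricter post-processing step, assuming only 0 and 1 are valid scores; `normalize_verdict` and `EXPECTED_KEYS` are hypothetical names, not part of this notebook:

```python
EXPECTED_KEYS = (
    "audience_correct",
    "methodology_correct",
    "sample_size_correct",
    "identity_factors_correct",
)

def normalize_verdict(parsed):
    """Map a parsed judge reply to {key: 0, 1, or None} for the four expected keys."""
    if not isinstance(parsed, dict):
        parsed = {}  # non-object replies yield all-None scores
    verdict = {}
    for key in EXPECTED_KEYS:
        value = parsed.get(key)
        # Numeric strings "0"/"1" are converted; other strings stay invalid.
        if isinstance(value, str) and value.strip() in ("0", "1"):
            value = int(value.strip())
        # Booleans and 0/1 numbers are kept as ints; everything else becomes None.
        verdict[key] = int(value) if value in (0, 1) else None
    return verdict

reply = {"audience_correct": 1, "methodology_correct": "0",
         "sample_size_correct": True, "identity_factors_correct": "maybe"}
print(normalize_verdict(reply))
```

Rows with `None` scores can then be filtered out or re-run downstream instead of being mistaken for valid 0/1 judgments.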

---

## ✔️ All Done!

Your free-text evaluations have been saved as Excel files in each model's `outputs/{model}/{split}/` directory.

**You can now return to your main evaluation notebook to aggregate and visualize results.**