# Generate Alternative Answers

Will generate alternative answers by finding topk or min_prob predictions for each gap.

Using the test data from human raters, will find optimal values for topk/min_prob.

Optimal alternative answers will overlap with human responses, especially answers that multiple humans suggested.

In [3]:
import json
from collections import Counter, defaultdict

from tqdm.auto import tqdm
import numpy as np
import pandas as pd
import spacy
from spacy.tokens import Doc, Span, Token
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
import seaborn as sns
from tqdm.auto import tqdm

torch.set_float32_matmul_precision("high")

In [4]:
!pip install spacy transformers tqdm seaborn pandas numpy torch --quiet
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m79.0 MB/s[0m eta [36m0:00:00[0m00:01[0m0:01[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


## Data

Combine the human annotations with the generated cloze exercises and source texts

In [5]:
annotations_df = (
    # Human responses and ratings to a selection of Cloze exercises
    pd.read_csv("../data/testResults_from_2025-05-07.csv")
    # Convert dictionaries to lists
    .assign(
        answers=lambda x: x["answers"].apply(
            lambda row: list(json.loads(row).values())
        ),
        correctAnswers=lambda x: x["correctAnswers"].apply(
            lambda row: list(json.loads(row).values())
        ),
        annotation_counts=lambda x: x["annotations"].apply(
            lambda row: Counter(json.loads(row).values())
        ),
        annotations=lambda x: x["annotations"].apply(
            lambda row: list(json.loads(row).values())
        ),
    )
    # Calculate percentage correct
    .assign(
        pct_correct=lambda x: x.apply(
            lambda row: sum(
                a == c for a, c in zip(row["answers"], row["correctAnswers"])
            )
            / len(row["answers"])
            * 100,
            axis=1,
        )
    )
)

cloze_df = pd.read_json(
    "../results/cloze_exercises_kl_divergence.jsonl", lines=True
).assign(passageId=lambda x: x.index + 1)

df = pd.merge(cloze_df, annotations_df, on="passageId").query(
    'method == "contextuality_plus"'
)
df.sample(2)

Unnamed: 0,volume,page,summary,markdown,text,contextuality,contextuality_plus,keyword,passageId,id,...,method,score,timeSpent,answers,correctAnswers,annotations,holisticScore,timestamp,annotation_counts,pct_correct
16,research-methods-in-psychology-demo,9-generating-good-research-questions-1,"When developing a research idea, transforming ...","<i-callout variant=""info"" title=""Learning Obje...",Learning Objectives\n\n1. Describe some techni...,"{'text': 'When developing a research idea, tra...","{'text': 'When developing a research idea, tra...","{'text': 'When developing a research idea, tra...",3,3tX80XR8HgqetI32rkil,...,contextuality_plus,44.444444,469.855,"[achieved, researchers, conceptualizing, study...","[achieved, researchers, conceptualizing, explo...","[sentence, source, sentence, sentence, sentenc...",3,2025-06-02T21:59:37.864Z,"{'sentence': 4, 'source': 4, 'unpredictable': 1}",44.444444
28,research-methods-in-psychology,1-methods-of-knowing,The text discusses various methods of acquirin...,"<i-callout variant=""info"" title=""Learning Obje...",Learning Objectives\n\n1. Describe the 5 metho...,{'text': 'The text discusses various methods o...,{'text': 'The text discusses various methods o...,{'text': 'The text discusses various methods o...,5,68bhaXDbcVAwvyqfLgM6,...,contextuality_plus,77.777778,307.15,"[instincts, involves, shows, incorrect, based,...","[instincts, involves, shows, incorrect, based,...","[source, sentence, sentence, source, sentence,...",2,2025-05-10T23:29:56.864Z,"{'source': 4, 'sentence': 3, 'unpredictable': 2}",77.777778


# Initialize Gapper

Just using this for a few of its utility methods...

In [6]:
class ContextualityGapper:
    def __init__(self, model_name: str = "answerdotai/ModernBERT-large"):
        # Load SpaCy for sentence splitting and preprocessing
        self.nlp = spacy.load("en_core_web_sm")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForMaskedLM.from_pretrained(model_name).to(self.device)
        self.model.eval()

        self.min_blank_distance = 7  # Minimum distance between blanks

        # Minimum log-predictability of alternatives
        self.min_predictability = np.log(0.05)

        # Part-of-Speech Blacklist (do not delete these words)
        self.blacklist = [
            "PROPN",  # Proper nouns
            "NUM",  # Numbers
            "PUNCT",  # Punctuation
            "SYM",  # Symbols
            "X",  # Other
        ]

    def _get_leading_ws_tokens(self, doc: Doc) -> list[str]:
        """The ModernBERT Tokenizer will work fine if we give it tokens with leading spaces.
        SpaCy normally handles whitespace in terms of trailing space."""
        if not len(doc):
            return [""]

        tokens = [doc[0].text]
        # For tokens after the 0th, prepend trailing whitespace from the previous token.
        tokens += [doc[i - 1].whitespace_ + doc[i].text for i in range(1, len(doc))]
        return tokens

    def get_token_mappings(self, tokens: list[str]) -> dict[int, list[int]]:
        """Get mappings between word positions and token positions"""
        # Tokenize while keeping track of word IDs
        tokenized = self.tokenizer(
            tokens, return_tensors="pt", is_split_into_words=True
        )
        word_ids = tokenized.word_ids()

        # Create mapping from word position to token positions
        word_to_tokens = defaultdict(list)

        for token_idx, word_idx in enumerate(word_ids):
            if word_idx is not None:
                word_to_tokens[word_idx].append(token_idx)

        return word_to_tokens

    def get_masked_logits(
        self, tokens: list[str], mask_idx: int
    ) -> tuple[torch.Tensor, int]:
        """Get model logits for a masked position in text"""
        # Get the word tokens and their alignment info
        word_to_tokens = self.get_token_mappings(tokens)

        # Find all token positions for the word we want to mask
        token_positions = word_to_tokens[mask_idx]

        # Create masked version of the text
        input_ids = self.tokenizer(
            tokens, is_split_into_words=True, return_tensors="pt"
        ).input_ids[0]
        masked_ids = input_ids.clone()

        # ID of the first subword token that we masked
        first_token_id = input_ids[token_positions[0]]

        # Mask all tokens corresponding to our target word
        masked_ids[token_positions] = self.tokenizer.mask_token_id

        # Get model outputs
        outputs = self.model(input_ids.unsqueeze(0).to(self.device))

        # Get logits
        logits = outputs.logits[0, token_positions, :]

        return logits, first_token_id

    def get_contextuality_score(
        self,
        page_doc: Doc,
        summary_doc: Doc,
        sent: Span,
        tok: Token,
        method: str = "kl",
    ) -> float:
        """Calculate contextuality score for a word position using full page context

        Args:
            page_doc: The full page text as a spaCy Doc
            summary_doc: The summary text as a spaCy Doc
            sent: The sentence from the summary containing the token
            tok: The token from the summary to evaluate
            method: "kl" for kl-divergence or "contextuality" for contextuality score

        Returns:
            Contextuality score
        """

        # Get logits for both full text and sentence text
        # For the full text context, we use the page + summary
        full_toks = self._get_leading_ws_tokens(page_doc) + self._get_leading_ws_tokens(
            summary_doc
        )
        full_pos = len(page_doc) + tok.i  # Position of token in full document
        full_logits, word_id = self.get_masked_logits(full_toks, full_pos)

        # For the local context, we use just the sentence from the summary
        sent_pos = tok.i - sent.start  # Position of token in the sentence
        sent_logits, _ = self.get_masked_logits([tok.text for tok in sent], sent_pos)

        # Calculate probabilities using first sub-word token
        full_probs = torch.softmax(full_logits[0], dim=0)
        sent_probs = torch.softmax(sent_logits[0], dim=0)

        p = full_probs[word_id]
        q = sent_probs[word_id]

        if method == "kl":
            # KL-divergence is p*log(p/q)
            score = float(p * torch.log2(p / q))
        elif method == "contextuality":
            # Contextuality is distance between full-text and sentence probability
            score = float(p - q)
        else:
            raise ValueError("Unknown method.")

        return score

    def choose_blank_positions(
        self, page_doc: Doc, summary_doc: Doc, num_blanks: int
    ) -> list[int]:
        """Choose positions to blank in the summary based on contextuality scores with full page"""
        scores = []
        valid_positions = []

        page_lemmas = {tok.lemma_ for tok in page_doc}

        # Calculate scores for each position in the summary
        for i, sent in enumerate(summary_doc.sents):
            if i == 0:
                continue  # Skip first sentence
            for tok in sent:
                if (
                    len(tok.text) < 3
                    or tok.pos_ in self.blacklist
                    or tok.is_stop
                    or not tok.text.isalpha()
                    or tok.lemma_ not in page_lemmas
                ):
                    scores.append(-float("inf"))
                else:
                    # Calculate contextuality using both the full page and summary
                    score = self.get_contextuality_score(
                        page_doc, summary_doc, sent, tok
                    )
                    scores.append(score)
                valid_positions.append(tok.i)

        # Convert to numpy for easier manipulation
        scores = np.array(scores)

        # Choose positions greedily while maintaining minimum distance
        positions = []
        for _ in range(num_blanks):
            if np.all(scores == -float("inf")):
                break

            # Choose highest scoring position
            idx = np.argmax(scores)
            pos = valid_positions[idx]
            positions.append(pos)

            # Zero out scores within minimum distance
            start = max(0, idx - self.min_blank_distance)
            end = min(len(scores), idx + self.min_blank_distance + 1)
            scores[start:end] = -float("inf")

        return sorted(positions)

    def get_alternates(self, tokens: list[str], topk=5) -> list[dict]:
        """Get top k predictions for the masked positions in tokens

        Returns:
            List of dictionaries, one per masked position, with candidate words and their probabilities
        """
        predictions = []

        # Find all mask positions
        mask_positions = [i for i, token in enumerate(tokens) if token == "[MASK]"]

        for mask_pos in mask_positions:
            word_candidates = {}

            # Try different mask lengths (1, 2, or 3 tokens)
            for mask_length in range(1, 4):
                # Replace the single mask with multiple if needed
                masked_tokens = (
                    tokens[:mask_pos]
                    + ["[MASK]"] * mask_length
                    + tokens[mask_pos + 1 :]
                )

                # Get initial predictions for first token
                current_candidates = []
                logits, _ = self.get_masked_logits(masked_tokens, mask_pos)
                probs = torch.softmax(logits[0], dim=0)
                top_values, top_indices = torch.topk(probs, topk)

                # Start with first token candidates
                for idx, prob in zip(top_indices.tolist(), top_values.tolist()):
                    current_candidates.append(([idx], prob))

                # Build up multi-token predictions if needed
                for token_idx in range(1, mask_length):
                    new_candidates = []
                    for token_ids, prob in current_candidates:
                        # Fill in what we've predicted so far
                        partial_filled = tokens.copy()
                        filled_text = self.tokenizer.decode(token_ids)
                        remaining_masks = mask_length - token_idx

                        partial_filled = (
                            tokens[:mask_pos]
                            + [filled_text]
                            + ["[MASK]"] * remaining_masks
                            + tokens[mask_pos + 1 :]
                        )

                        # Get prediction for next position
                        next_logits, _ = self.get_masked_logits(
                            partial_filled, mask_pos + 1
                        )
                        next_probs = torch.softmax(next_logits[0], dim=0)
                        next_values, next_indices = torch.topk(next_probs, 1)

                        # Add to candidates
                        new_token_ids = token_ids + [next_indices[0].item()]
                        new_prob = prob * next_values[0].item()
                        new_candidates.append((new_token_ids, new_prob))

                    current_candidates = new_candidates

                # Add final decoded words
                for token_ids, prob in current_candidates:
                    word = self.tokenizer.decode(token_ids).strip()
                    if " " in word:
                        # Word contains a space (is actually multiple words)
                        continue
                    if word not in word_candidates or prob > word_candidates[word]:
                        word_candidates[word] = prob

            # Sort candidates by probability
            sorted_candidates = sorted(
                word_candidates.items(), key=lambda x: x[1], reverse=True
            )
            predictions.append({word: prob for word, prob in sorted_candidates[:topk]})

        return predictions

    def generate_cloze(
        self,
        summary_text: str,
        page_text: str = "",
        num_blanks: int = 10,
    ) -> tuple[str, list[str], list[dict[str, float]]]:
        """Generate a cloze text from summary using page for context

        Args:
            page_text: The full page text
            summary_text: The summary text to create gaps in
            num_blanks: Number of blanks to create

        Returns:
            Tuple of (cloze_text, answers, alternates)
        """
        # Process both texts
        page_doc = self.nlp(page_text)
        summary_doc = self.nlp(summary_text)

        # Choose positions to blank in the summary
        masked_positions = self.choose_blank_positions(
            page_doc, summary_doc, num_blanks
        )

        # Get the answers (the original words that will be blanked)
        answers = [summary_doc[pos].text for pos in masked_positions]

        # Replace tokens with mask
        summary_tokens = np.array(self._get_leading_ws_tokens(summary_doc))
        summary_tokens[masked_positions] = "[MASK]"
        summary_tokens = summary_tokens.tolist()

        # Construct cloze token input for gap predictions
        cloze_tokens = self._get_leading_ws_tokens(page_doc) + summary_tokens

        # Collect gaps
        gaps = []
        for tok in summary_doc:
            if tok.i in masked_positions:
                gaps.append((tok.text, tok.idx, len(tok.text)))

        return gaps

    def score_cloze_answers(self, text_with_masks: str, top_k: int) -> list[dict[str, float]]:
        """
        Return, for each [MASK], a dict {token -> probability}.
        """
        # normalize top_k 
        k_norm = _normalize_topk_for_use(top_k)
        k = 5000 if k_norm is None else int(k_norm)
    
        inputs = self.tokenizer(text_with_masks, return_tensors="pt").to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        logits = outputs.logits  # [batch, seq, vocab]
    
        mask_positions = torch.where(inputs["input_ids"][0] == self.tokenizer.mask_token_id)[0]
    
        results = []
        for pos in mask_positions:
            vec = logits[0, pos.item()]            # [vocab]
            probs = torch.softmax(vec, dim=-1)
            top_probs, top_idx = torch.topk(probs, k)   # <- k is guaranteed int
            preds = {}
            for p, idx in zip(top_probs.tolist(), top_idx.tolist()):
                token = self.tokenizer.decode([idx]).strip()
                preds[token] = p
            results.append(preds)
        return results


In [7]:
def _normalize_topk_for_use(x):
    """Return an int topk or None (== unlimited). Accepts int/float/NaN/None/'unlimited'."""
    if x is None: 
        return None
    if isinstance(x, str) and x.strip().lower() == "unlimited":
        return None
    try:
        # treat NaN like unlimited
        if isinstance(x, float) and np.isnan(x):
            return None
        return int(x)
    except Exception:
        return None  # safest fallback: unlimited

def _display_topk(x):
    """Return a display value for topk: int or 'unlimited'."""
    k = _normalize_topk_for_use(x)
    return "unlimited" if k is None else int(k)

def mask_by_char_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Replace text at gapped locations with [MASK] using (start, end) tuples"""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[MASK]" + text[end:]
    return text

def _extract_from_contextuality_plus(row):
    """
    extract all relevant information from dataframe
    Return:
      summary_text : str
      gap_spans    : list[(start, end)]
      originals    : list[str]
    """
    cp = row.contextuality_plus
    summary_text = cp.get("text", row.summary)
    gap_spans, originals = [], []
    for word, start, length in cp.get("gaps", []):
        originals.append(word)
        gap_spans.append((start, start + length))
    return summary_text, gap_spans, originals

In [8]:
def generate_accepted_answers(
    gapper,
    page_text: str,
    row,
    min_probability: float = 0.05,
    topk: int | None = None,
):
    summary_text, gap_spans, originals = _extract_from_contextuality_plus(row)
    masked_summary = mask_by_char_spans(summary_text, gap_spans)

    page_str = "".join(gapper._get_leading_ws_tokens(gapper.nlp(page_text)))
    full_context = page_str + "\n\n" + masked_summary

    k_norm = _normalize_topk_for_use(topk)
    k_arg = 5000 if k_norm is None else int(k_norm)

    preds_per_gap = gapper.score_cloze_answers(full_context, top_k=k_arg)

    accepted = []
    for i, pred_dict in enumerate(preds_per_gap):
        keep = set()
        for w, p in pred_dict.items():
            w2 = w.strip()
            if w2 and " " not in w2 and w2.isalpha() and p >= min_probability:
                keep.add(w2.lower())
        if i < len(originals):
            keep.add(originals[i].strip().lower())
        accepted.append(keep)

    return accepted, originals



# Evaluate accuracy of annotations given new acceptable answers
def rescore_annotations(
    df,
    gapper,
    min_probability: float = 0.05,
    topk: int | None = None,
):
    rows = []
    for row in tqdm(df.itertuples(), total=len(df), desc="Rescoring"):
        accepted, originals = generate_accepted_answers(
            gapper=gapper,
            page_text=row.text,
            row=row,
            min_probability=min_probability,
            topk=topk,
        )

        # Use human answers
        human = [answer.strip().lower() for answer in row.answers]

        flags = []
        for i, ans in enumerate(human):
            okset = accepted[i] if i < len(accepted) else set()
            if not okset and i < len(originals):  # safety fallback
                okset = {originals[i].strip().lower()}
            flags.append(ans in okset)

        percent = 100.0 * (sum(flags) / max(1, len(flags)))
        rows.append(
            {
                "accepted_answers": accepted,
                "human_correctness": flags,
                "percent_correct_alt": percent,
                "alt_counts": [len(s) for s in accepted],
            }
        )

    extra = pd.DataFrame(rows, index=df.index)
    return pd.concat([df.copy(), extra], axis=1)


In [9]:
gapper = ContextualityGapper(model_name="answerdotai/ModernBERT-large")

In [21]:
# ============================================
#      ANALYZE HUMAN AGREEMENT PATTERNS
# ============================================
def analyze_human_agreement(df):
    """
    Find cases where multiple humans gave the same *incorrect* answer
    for the same passage and the same gap index.
    Returns: (counts_df, common_df)
    """
    recs = []
    for _, row in df.iterrows():
        pid   = row["passageId"]
        anns  = row["answers"] # human answers
        golds = row["correctAnswers"]
        for i, (ann, gold) in enumerate(zip(anns, golds)):
            ann  = ann.strip().lower()
            gold = gold.strip().lower()
            if ann != gold:
                recs.append({"passageId": pid, "gap_idx": i, "human_answer": ann})

    if not recs:
        print("No wrong answers found.")
        return pd.DataFrame(), pd.DataFrame()

    tmp = pd.DataFrame(recs)
    counts = (
        tmp.value_counts(["passageId", "gap_idx", "human_answer"])
        .reset_index(name="n")
        .sort_values("n", ascending=False)
    )
    common = counts[counts["n"] > 1]

    print(f"Found {len(common)} (passageId, gap_idx) cases with agreement on the same wrong answer.")
    if not common.empty:
        print("\nTop 10:")
        print(common.head(10).to_string(index=False))
    else:
        print("\nNo per-gap agreements detected. This usually means one annotator per passage, or all wrong answers are unique.")

    return counts, common


def _prop_repeated_alts_captured(df_rescored: pd.DataFrame, common_df: pd.DataFrame) -> float:
    """
    Percentage of (passageId, gap_idx, human_answer) 'common wrong answers'
    that are captured by the accepted sets under current parameters.
    Returns a percentage [0..100].
    """
    if common_df is None or common_df.empty:
        return 0.0

    rescored_by_pid = df_rescored.set_index("passageId", drop=False)

    matched = 0
    total = 0
    for _, rec in common_df.iterrows():
        pid = rec["passageId"]
        gap = int(rec["gap_idx"])
        ans = str(rec["human_answer"]).strip().lower()

        if pid not in rescored_by_pid.index:
            continue

        # Handle possible duplicate rows per passageId
        obj = rescored_by_pid.loc[pid]
        candidate_rows = (
            obj.to_dict(orient="records") if isinstance(obj, pd.DataFrame)
            else [obj.to_dict()]
        )

        found = False
        for r in candidate_rows:
            acc = r.get("accepted_answers", None)
            if not isinstance(acc, (list, tuple)):
                continue
            if 0 <= gap < len(acc) and ans in acc[gap]:
                found = True
                break

        total += 1
        if found:
            matched += 1

    return 100.0 * matched / max(1, total)


# ============================================
#            PARAMETER GRID SEARCH
# ============================================


def test_parameters(df, gapper, min_prob_values, topk_values=None, common_df=None):
    results = []

    if topk_values is None:
        topk_values = [10_000]  # Effectively unlimited

    total_tests = len(min_prob_values) * len(topk_values)
    pbar = tqdm(total=total_tests, desc="Testing parameters")

    for min_prob in min_prob_values:
        for topk in topk_values:
            k_use = _normalize_topk_for_use(topk)

            df_rescored = rescore_annotations(
                df.copy(),
                gapper,
                min_probability=min_prob,
                topk=k_use,
            )

            # --- ADDED: compute "proportion of gaps with 1..5 alternatives" as a dataset-wide proportion
            alt_counts_flat = [c for row in df_rescored["alt_counts"] for c in row]
            prop_alts_1_5 = 100.0 * sum(1 <= c <= 5 for c in alt_counts_flat) / max(1, len(alt_counts_flat))

            # --- ADDED: compute "proportion of repeated alt answers captured"
            prop_repeats_captured = _prop_repeated_alts_captured(df_rescored, common_df)

            metrics = {
                "min_prob": min_prob,
                "topk": _display_topk(topk),
                "mean_accuracy": df_rescored["percent_correct_alt"].mean(),
                "median_accuracy": df_rescored["percent_correct_alt"].median(),
                "std_accuracy": df_rescored["percent_correct_alt"].std(),
                "mean_alternatives": df_rescored["alt_counts"].apply(lambda x: np.mean(x)).mean(),
                "median_alternatives": df_rescored["alt_counts"].apply(lambda x: np.median(x)).median(),
                "max_alternatives": df_rescored["alt_counts"].apply(lambda x: np.max(x)).max(),

                # OLD row-level metric kept for reference, but no longer used in score:
                "gaps_with_1_5_alts_rows_all_ok": (
                    sum(df_rescored["alt_counts"].apply(lambda x: all(1 <= c <= 5 for c in x)))
                    / len(df_rescored) * 100
                ),

                # NEW METRICS:
                "prop_alts_1_5": prop_alts_1_5,
                "prop_repeated_alt_answers_captured": prop_repeats_captured,
            }

            # --- NEW SCORE: equal weight across the three criteria (all already in % units)
            metrics["score"] = (
                metrics["mean_accuracy"]
                + metrics["prop_alts_1_5"]
                + metrics["prop_repeated_alt_answers_captured"]
            )

            results.append(metrics)
            pbar.update(1)

    pbar.close()
    return pd.DataFrame(results)




# ============================================
#                 BEST PARAMS
# ============================================


def analyze_best_parameters(df, gapper, min_prob, topk, common_df):
    """
    Analysis of a specific parameter combo.
    common_df is the DataFrame returned by analyze_human_agreement(...)[1]

    FIXED: Properly handles NaN/None topk display, KeyError, and index out-of-bounds issues.
    """
    # Re-run rescoring with given parameters
    df_rescored = rescore_annotations(
        df.copy(), gapper, min_probability=min_prob, topk=topk
    )

    print(f"\n{'=' * 60}")
    print(f"DETAILED ANALYSIS: min_prob={min_prob}, topk={_display_topk(topk)}")
    print(f"{'=' * 60}")

    # === Overall Stats ===
    print(f"\nOverall Performance:")
    print(f"  Mean accuracy: {df_rescored['percent_correct_alt'].mean():.1f}%")
    print(f"  Median accuracy: {df_rescored['percent_correct_alt'].median():.1f}%")
    print(f"  Std deviation: {df_rescored['percent_correct_alt'].std():.1f}%")

    # === Alternative Answer Counts ===
    alt_counts_flat = [c for row in df_rescored["alt_counts"] for c in row]
    print(f"\nAlternative Answers per Gap:")
    print(f"  Mean: {np.mean(alt_counts_flat):.1f}")
    print(f"  Median: {np.median(alt_counts_flat):.0f}")
    print(f"  Max: {np.max(alt_counts_flat)}")
    print(f"  % gaps with 1–5 alternatives: "
          f"{sum(1 <= c <= 5 for c in alt_counts_flat) / len(alt_counts_flat) * 100:.1f}%")

    # === Check how many 'common wrong' answers are accepted ===
    accepted_mistakes = 0
    total_mistakes = len(common_df)
    errors_encountered = 0

    if total_mistakes:
        rescored_by_pid = df_rescored.set_index("passageId", drop=False)

        for _, rec in common_df.iterrows():
            pid = rec["passageId"]
            gap = int(rec["gap_idx"])
            ans = str(rec["human_answer"]).strip().lower()

            if pid not in rescored_by_pid.index:
                continue

            obj = rescored_by_pid.loc[pid]
            candidate_rows = (
                obj.to_dict(orient="records") if isinstance(obj, pd.DataFrame)
                else [obj.to_dict()]
            )

            matched = False
            for r in candidate_rows:
                acc = r.get("accepted_answers", None)
                if not isinstance(acc, (list, tuple)):
                    continue

                try:
                    if gap < len(acc) and ans in acc[gap]:
                        matched = True
                        break
                except KeyError:
                    print(f"KeyError in analyze_best_parameters. Attempting to access index {gap} of:\n", acc)
                    errors_encountered += 1
                except Exception as e:
                    # General catch
                    print(f"[warn] issue at pid={pid}, gap={gap}: {e}")
                    errors_encountered += 1

            if matched:
                accepted_mistakes += 1

    # === Summary Output ===
    print(f"\nCommon Human Agreements (per passage+gap):")
    if total_mistakes:
        print(f"  Accepting {accepted_mistakes}/{total_mistakes} agreed-on 'incorrect' answers")
        if errors_encountered > 0:
            print(f"  Note: {errors_encountered} entries had mismatched gap indices or missing data")
    else:
        print("  (None found in this dataset)")

    return df_rescored

def _sentence_with_gap(summary_text: str, gap_span: tuple[int, int], pad: int = 0) -> str:
    """
    Return the sentence containing the gap, with the gap shown as ‹…›.
    Falls back to a local window if punctuation is missing.
    """
    start, end = gap_span
    n = len(summary_text)

    # find sentence bounds (simple punctuation heuristics)
    left = max(summary_text.rfind('.', 0, start),
               summary_text.rfind('?', 0, start),
               summary_text.rfind('!', 0, start))
    right_period = summary_text.find('.', end)
    right_q = summary_text.find('?', end)
    right_ex = summary_text.find('!', end)
    rights = [x for x in [right_period, right_q, right_ex] if x != -1]
    right = min(rights) if rights else -1

    s = left + 1 if left != -1 else max(0, start - 120)
    e = right + 1 if right != -1 else min(n, end + 120)

    s = max(0, s - pad)
    e = min(n, e + pad)

    pre = summary_text[s:start]
    gap = summary_text[start:end]
    post = summary_text[end:e]
    return pre + "‹" + gap + "›" + post

import re

def review_repeated_answers_with_sentence(df_rescored: pd.DataFrame,
                                          common_df: pd.DataFrame,
                                          acc_preview: int = 6,
                                          max_sentence_len: int = 220) -> pd.DataFrame:
    """
    Returns a tidy audit table + one sentence string where the ORIGINAL gold word is highlighted:  ‹word›
    Columns:
      passageId, gap_idx, gold_original, repeated_human, accepted_now,
      accepted_set_size, accepted_set_preview, sentence
    """
    cols = ["passageId","gap_idx","gold_original","repeated_human",
            "accepted_now","accepted_set_size","accepted_set_preview","sentence"]
    if common_df is None or common_df.empty:
        return pd.DataFrame(columns=cols)

    rows_out = []
    by_pid = df_rescored.set_index("passageId", drop=False)

    for _, rec in common_df.iterrows():
        pid = rec["passageId"]
        gap = int(rec["gap_idx"])
        human_raw = str(rec["human_answer"]).strip().lower()

        if pid not in by_pid.index:
            continue

        obj = by_pid.loc[pid]

        # iterate as Series (not dict) so _extract_from_contextuality_plus works
        if isinstance(obj, pd.DataFrame):
            series_iter = (row for _, row in obj.iterrows())
        else:  # already a Series
            series_iter = [obj]

        accepted_now = False
        acc_set = set()
        gold_original = ""
        sentence = ""

        for r in series_iter:
            # accepted set for this gap
            acc = r.get("accepted_answers", [])
            if 0 <= gap < len(acc):
                acc_set = acc[gap]
                accepted_now = (human_raw in acc_set)

            # pull summary text + spans + originals from contextuality_plus
            try:
                summary_text, gap_spans, originals = _extract_from_contextuality_plus(r)
            except Exception:
                summary_text, gap_spans, originals = r.get("summary",""), [], []

            if 0 <= gap < len(originals):
                gold_original = (originals[gap] or "").strip()

            if summary_text and 0 <= gap < len(gap_spans):
                s = _sentence_with_gap(summary_text, gap_spans[gap])  # shows ‹gold›
                s = re.sub(r"\s+", " ", s).strip()                     # single-line
                if len(s) > max_sentence_len:
                    s = s[:max_sentence_len-3] + "..."
                sentence = s

            # first usable row is enough
            break

        rows_out.append({
            "passageId": pid,
            "gap_idx": gap,
            "gold_original": gold_original,
            "repeated_human": human_raw,
            "accepted_now": bool(accepted_now),
            "accepted_set_size": len(acc_set) if isinstance(acc_set, (set, list, tuple)) else 0,
            "accepted_set_preview": ", ".join(sorted(list(acc_set))[:acc_preview]) if isinstance(acc_set, (set, list, tuple)) else "",
            "sentence": sentence
        })

    df_out = pd.DataFrame(rows_out).sort_values(
        ["accepted_now","accepted_set_size","passageId","gap_idx"],
        ascending=[True, True, True, True]
    )
    return df_out


In [22]:
def run_testing(df, gapper):
    print("=" * 60)
    print("PARAMETER OPTIMIZATION FOR CLOZE SCORING")
    print("=" * 60)

    print("\nSTEP 1: Analyzing Human Agreement Patterns_____")
    counts_df, common_df = analyze_human_agreement(df)

    print("\nSTEP 2: Testing Parameter Combinations_____")
    min_prob_values = [0.002, 0.005, 0.01, 0.015, 0.02, 0.03, 0.05, 0.07, 0.10, 0.15, 0.20]
    topk_values = [5, 10, 20, 50, 100, None]


    # --- CHANGED: pass common_df
    results_df = test_parameters(df, gapper, min_prob_values, topk_values, common_df=common_df)

    print("\nSTEP 3: Finding Optimal Parameters_____")

    # --- REMOVED old score formulas entirely; score is built inside test_parameters now.

    best_params = results_df.nlargest(5, "score")
    print("\nTOP 5 PARAMETER COMBINATIONS:")
    print(
        best_params[
            [
                "min_prob",
                "topk",
                "mean_accuracy",
                "prop_alts_1_5",
                "prop_repeated_alt_answers_captured",
                "score",
            ]
        ].to_string(index=False)
    )

    best_row = best_params.iloc[0]
    best_min_prob = float(best_row["min_prob"])
    best_topk = _normalize_topk_for_use(best_row["topk"])

    print("\nSTEP 4: Best Parameters______")
    df_best = analyze_best_parameters(df, gapper, best_min_prob, best_topk, common_df)

    print("\n" + "=" * 60)
    print("FINAL RECOMMENDATIONS")
    print("=" * 60)
    print(f"\nRECOMMENDED PARAMETERS:")
    print(f"  min_probability = {best_min_prob}")
    print(f"  topk = {_display_topk(best_topk)}")
    print(f"\nExpected Performance:")
    print(f"  Mean human accuracy: {best_row['mean_accuracy']:.1f}%")
    print(f"  Proportion gaps with 1–5 alternatives: {best_row['prop_alts_1_5']:.1f}%")
    print(f"  Repeated human alternatives captured: {best_row['prop_repeated_alt_answers_captured']:.1f}%")

    return results_df, df_best, best_min_prob, best_topk



In [23]:
counts_df, common_df = analyze_human_agreement(df)
df_best = analyze_best_parameters(df, gapper, best_min_prob, best_topk, common_df)

review_sent = review_repeated_answers_with_sentence(df_best, common_df, acc_preview=6, max_sentence_len=220)
print("\nREVIEW repeated answers (with sentence):")
print(review_sent.to_string(index=False))


Found 7 (passageId, gap_idx) cases with agreement on the same wrong answer.

Top 10:
 passageId  gap_idx human_answer  n
         5        5  experiences  5
         5        2        prove  4
         1        7  advancement  3
         3        3    examining  2
         6        3     variable  2
         1        0    unfounded  2
         1        5       debunk  2


Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]


DETAILED ANALYSIS: min_prob=0.07, topk=unlimited

Overall Performance:
  Mean accuracy: 77.5%
  Median accuracy: 77.8%
  Std deviation: 15.6%

Alternative Answers per Gap:
  Mean: 1.9
  Median: 1
  Max: 5
  % gaps with 1–5 alternatives: 100.0%

Common Human Agreements (per passage+gap):
  Accepting 2/7 agreed-on 'incorrect' answers

REVIEW repeated answers (with sentence):
 passageId  gap_idx gold_original repeated_human  accepted_now  accepted_set_size                                   accepted_set_preview                                                                                                                                                                    sentence
         1        0     intuitive      unfounded         False                  1                                              intuitive                                                                      However, scientific research frequently contradicts these ‹_________› beliefs, revealing inaccuracies.
     

In [12]:
results_df, df_best, best_min_prob, best_topk = run_testing(df, gapper)

PARAMETER OPTIMIZATION FOR CLOZE SCORING

STEP 1: Analyzing Human Agreement Patterns_____
Found 7 (passageId, gap_idx) cases with agreement on the same wrong answer.

Top 10:
 passageId  gap_idx human_answer  n
         5        5  experiences  5
         5        2        prove  4
         1        7  advancement  3
         3        3    examining  2
         6        3     variable  2
         1        0    unfounded  2
         1        5       debunk  2

STEP 2: Testing Parameter Combinations_____


Testing parameters:   0%|          | 0/66 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]


STEP 3: Finding Optimal Parameters_____

TOP 5 PARAMETER COMBINATIONS:
 min_prob      topk  mean_accuracy  prop_alts_1_5  prop_repeated_alt_answers_captured      score
     0.07 unlimited      77.500000     100.000000                           28.571429 206.071429
     0.05 unlimited      77.934783      97.584541                           28.571429 204.090752
     0.07         5      77.500000      97.101449                           28.571429 203.172878
     0.07        10      77.500000      97.101449                           28.571429 203.172878
     0.07        20      77.500000      97.101449                           28.571429 203.172878

STEP 4: Best Parameters______


Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]


DETAILED ANALYSIS: min_prob=0.07, topk=unlimited

Overall Performance:
  Mean accuracy: 77.5%
  Median accuracy: 77.8%
  Std deviation: 15.6%

Alternative Answers per Gap:
  Mean: 1.9
  Median: 1
  Max: 5
  % gaps with 1–5 alternatives: 100.0%

Common Human Agreements (per passage+gap):
  Accepting 2/7 agreed-on 'incorrect' answers

FINAL RECOMMENDATIONS

RECOMMENDED PARAMETERS:
  min_probability = 0.07
  topk = unlimited

Expected Performance:
  Mean human accuracy: 77.5%
  Proportion gaps with 1–5 alternatives: 100.0%
  Repeated human alternatives captured: 28.6%


In [94]:
# Smoke test on a couple of (pid, gap) pairs from common_df
test_rows = common_df.head(3).to_dict(orient="records")
for rec in test_rows:
    pid = int(rec["passageId"])
    gap = int(rec["gap_idx"])
    ans = str(rec["human_answer"]).strip().lower()

    sub = df[df["passageId"] == pid].copy()
    resc = rescore_annotations(sub, gapper, min_probability=0.07, topk=None)
    matched = False
    for r in resc.to_dict(orient="records"):
        acc = r.get("accepted_answers", [])
        if isinstance(acc, list) and 0 <= gap < len(acc) and ans in acc[gap]:
            matched = True
            break
    print(f"(pid={pid}, gap={gap}) human_wrong='{ans}'  -> accepted? {matched}")


Rescoring:   0%|          | 0/7 [00:00<?, ?it/s]

(pid=5, gap=5) human_wrong='experiences'  -> accepted? False


Rescoring:   0%|          | 0/7 [00:00<?, ?it/s]

(pid=5, gap=2) human_wrong='prove'  -> accepted? False


Rescoring:   0%|          | 0/4 [00:00<?, ?it/s]

(pid=1, gap=7) human_wrong='advancement'  -> accepted? True


In [82]:
print(df.answers)

3     [unfounded, alleviate, confessions, calorie, s...
4     [unfounded, alleviate, confessions, calorie, s...
5     [intuitive, alleviate, confessions, calorie, s...
6     [intuitive, alleviate, confessions, calorie, s...
14    [achieved, researchers, conceptualizing, exami...
15    [achieved, researchers, conceptualizing, exami...
16    [achieved, researchers, conceptualizing, study...
24    [instincts, involves, prove, incorrect, based,...
25    [instincts, involves, prove, incorrect, based,...
26    [instincts, involves, prove, incorrect, based,...
27    [instincts, involves, prove, incorrect, based,...
28    [instincts, involves, shows, incorrect, based,...
29    [instincts, involves, shows, incorrect, based,...
31    [instincts, involves, shows, incorrect, based,...
32    [quantitative, chosen, abstract, variable, ide...
34    [quantitativ, chosen, abstract, variable, iden...
37    [confirming, theory, theories, philosophical, ...
40          [intuitive, , confessions, calorie, 

In [29]:
results_df, df_best, best_min_prob, best_topk = run_testing(df, gapper)

PARAMETER OPTIMIZATION FOR CLOZE SCORING

STEP 1: Analyzing Human Agreement Patterns_____
Found 41 (passageId, gap_idx) cases with agreement on the same wrong answer.

Top 10:
 passageId  gap_idx  human_answer  n
         5        1      sentence  7
         5        6        source  6
         5        8        source  6
         5        2      sentence  6
         5        4      sentence  6
         5        3      sentence  5
         5        5      sentence  5
         1        6        source  4
         5        7      sentence  4
         1        7 unpredictable  4

STEP 2: Testing Parameter Combinations_____


Testing parameters:   0%|          | 0/40 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]


STEP 3: Finding Optimal Parameters_____

TOP 5 PARAMETER COMBINATIONS:
 min_prob      topk  mean_accuracy  mean_alternatives  gaps_with_1_5_alts      score
     0.07 unlimited            0.0           1.898309               100.0 138.983092
     0.10 unlimited            0.0           1.721981               100.0 137.219807
     0.15         5            0.0           1.673430               100.0 136.734300
     0.15        10            0.0           1.673430               100.0 136.734300
     0.15        20            0.0           1.673430               100.0 136.734300

STEP 4: Best Parameters______


Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]


DETAILED ANALYSIS: min_prob=0.07, topk=unlimited

Overall Performance:
  Mean accuracy: 0.0%
  Median accuracy: 0.0%
  Std deviation: 0.0%

Alternative Answers per Gap:
  Mean: 1.9
  Median: 1
  Max: 5
  % gaps with 1–5 alternatives: 100.0%

Common Human Agreements (per passage+gap):
  Accepting 0/41 agreed-on 'incorrect' answers

FINAL RECOMMENDATIONS

RECOMMENDED PARAMETERS:
  min_probability = 0.07
  topk = unlimited

Expected Performance:
  Mean human accuracy: 0.0%
  Mean alternatives per gap: 1.9
  Gaps with 1-5 alternatives: 100.0%


In [10]:
results_df

NameError: name 'results_df' is not defined