# Lemma Constraint

Make sure that gaps are only generated for lemmas that appear in the page text

In [5]:
import json
from collections import Counter, defaultdict

from tqdm.auto import tqdm
import numpy as np
import pandas as pd
import spacy
from spacy.tokens import Doc, Span, Token
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
import seaborn as sns
from tqdm.auto import tqdm

torch.set_float32_matmul_precision("high")

In [None]:
# !pip install spacy==3.7.2
# !pip install transformers
# !python -m spacy download en_core_web_sm

## Data

Combine the human annotations with the generated cloze exercises and source texts

In [15]:
annotations_df = (
    # Human responses and ratings to a selection of Cloze exercises
    pd.read_csv("../data/testResults_from_2025-05-07.csv")
    # Convert dictionaries to lists
    .assign(
        answers=lambda x: x["answers"].apply(
            lambda row: list(json.loads(row).values())
        ),
        correctAnswers=lambda x: x["correctAnswers"].apply(
            lambda row: list(json.loads(row).values())
        ),
        annotation_counts=lambda x: x["annotations"].apply(
            lambda row: Counter(json.loads(row).values())
        ),
        annotations=lambda x: x["annotations"].apply(
            lambda row: list(json.loads(row).values())
        ),
    )
    # Calculate percentage correct
    .assign(
        pct_correct=lambda x: x.apply(
            lambda row: sum(
                a == c for a, c in zip(row["answers"], row["correctAnswers"])
            )
            / len(row["answers"])
            * 100,
            axis=1,
        )
    )
)

cloze_df = pd.read_json(
    "../results/cloze_exercises_kl_divergence.jsonl", lines=True
).assign(passageId=lambda x: x.index + 1)

df = pd.merge(cloze_df, annotations_df, on="passageId").query(
    'method == "contextuality_plus"'
)
df.sample(2)

Unnamed: 0,volume,page,summary,markdown,text,contextuality,contextuality_plus,keyword,passageId,id,...,method,score,timeSpent,answers,correctAnswers,annotations,holisticScore,timestamp,annotation_counts,pct_correct
37,research-methods-in-psychology-demo,13-drawing-conclusions-and-reporting-the-resul...,Scientific theories are continually evaluated ...,"<i-callout variant=""info"" title=""Learning Obje...",Learning Objectives\n\n1. Identify the conclus...,{'text': 'Scientific theories are continually ...,{'text': 'Scientific theories are continually ...,{'text': 'Scientific theories are continually ...,7,UttExqGS6dSU5PLaLe85,...,contextuality_plus,80.0,176.596,"[confirming, theory, theories, philosophical, ...","[Confirming, theory, theories, philosophical, ...","[source, source, source, source, source, sourc...",3,2025-05-19T19:27:19.842Z,"{'source': 8, 'passage': 1, 'unpredictable': 1}",70.0
16,research-methods-in-psychology-demo,9-generating-good-research-questions-1,"When developing a research idea, transforming ...","<i-callout variant=""info"" title=""Learning Obje...",Learning Objectives\n\n1. Describe some techni...,"{'text': 'When developing a research idea, tra...","{'text': 'When developing a research idea, tra...","{'text': 'When developing a research idea, tra...",3,3tX80XR8HgqetI32rkil,...,contextuality_plus,44.444444,469.855,"[achieved, researchers, conceptualizing, study...","[achieved, researchers, conceptualizing, explo...","[sentence, source, sentence, sentence, sentenc...",3,2025-06-02T21:59:37.864Z,"{'sentence': 4, 'source': 4, 'unpredictable': 1}",44.444444


In [16]:
print(df.contextuality_plus.iloc[0])

{'text': 'The necessity of a scientific approach in psychology is often questioned, with many relying on common sense or intuition—known as folk psychology—for understanding human behavior. However, scientific research frequently contradicts these _________ beliefs, revealing inaccuracies. For instance, the belief that expressing anger can _________ it has been debunked, as has the notion that false ___________ are rare. Common myths, such as using only 10% of our brain or the effectiveness of _______-reducing diets, persist due to heuristics and confirmation bias. Psychologists emphasize __________ and the pursuit of empirical evidence to _________ these misconceptions. Additionally, they embrace ___________, welcoming unanswered questions as opportunities for scientific ___________.', 'gaps': [['intuitive', 238, 9], ['alleviate', 332, 9], ['confessions', 393, 11], ['calorie', 489, 7], ['skepticism', 586, 10], ['challenge', 638, 9], ['uncertainty', 697, 11], ['exploration', 773, 11]]}

# Test Restricted Generation

Only choose gaps whose lemmatized form appears in the source text

In [28]:
class ContextualityGapper:
    def __init__(self, model_name: str = "answerdotai/ModernBERT-large"):
        # Load SpaCy for sentence splitting and preprocessing
        self.nlp = spacy.load("en_core_web_sm")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForMaskedLM.from_pretrained(model_name).to(self.device)
        self.model.eval()

        self.min_blank_distance = 7  # Minimum distance between blanks

        # Minimum log-predictability of alternatives
        self.min_predictability = np.log(0.05)

        # Part-of-Speech Blacklist (do not delete these words)
        self.blacklist = [
            "PROPN",  # Proper nouns
            "NUM",  # Numbers
            "PUNCT",  # Punctuation
            "SYM",  # Symbols
            "X",  # Other
        ]

    def _get_leading_ws_tokens(self, doc: Doc) -> list[str]:
        """The ModernBERT Tokenizer will work fine if we give it tokens with leading spaces.
        SpaCy normally handles whitespace in terms of trailing space."""
        if not len(doc):
            return [""]

        tokens = [doc[0].text]
        # For tokens after the 0th, prepend trailing whitespace from the previous token.
        tokens += [doc[i - 1].whitespace_ + doc[i].text for i in range(1, len(doc))]
        return tokens

    def get_token_mappings(self, tokens: list[str]) -> dict[int, list[int]]:
        """Get mappings between word positions and token positions"""
        # Tokenize while keeping track of word IDs
        tokenized = self.tokenizer(
            tokens, return_tensors="pt", is_split_into_words=True
        )
        word_ids = tokenized.word_ids()

        # Create mapping from word position to token positions
        word_to_tokens = defaultdict(list)

        for token_idx, word_idx in enumerate(word_ids):
            if word_idx is not None:
                word_to_tokens[word_idx].append(token_idx)

        return word_to_tokens

    def get_masked_logits(
        self, tokens: list[str], mask_idx: int
    ) -> tuple[torch.Tensor, int]:
        """Get model logits for a masked position in text"""
        # Get the word tokens and their alignment info
        word_to_tokens = self.get_token_mappings(tokens)

        # Find all token positions for the word we want to mask
        token_positions = word_to_tokens[mask_idx]

        # Create masked version of the text
        input_ids = self.tokenizer(
            tokens, is_split_into_words=True, return_tensors="pt"
        ).input_ids[0]
        masked_ids = input_ids.clone()

        # ID of the first subword token that we masked
        first_token_id = input_ids[token_positions[0]]

        # Mask all tokens corresponding to our target word
        masked_ids[token_positions] = self.tokenizer.mask_token_id

        # Get model outputs
        outputs = self.model(input_ids.unsqueeze(0).to(self.device))

        # Get logits
        logits = outputs.logits[0, token_positions, :]

        return logits, first_token_id

    def get_contextuality_score(
        self,
        page_doc: Doc,
        summary_doc: Doc,
        sent: Span,
        tok: Token,
        method: str = "kl",
    ) -> float:
        """Calculate contextuality score for a word position using full page context

        Args:
            page_doc: The full page text as a spaCy Doc
            summary_doc: The summary text as a spaCy Doc
            sent: The sentence from the summary containing the token
            tok: The token from the summary to evaluate
            method: "kl" for kl-divergence or "contextuality" for contextuality score

        Returns:
            Contextuality score
        """

        # Get logits for both full text and sentence text
        # For the full text context, we use the page + summary
        full_toks = self._get_leading_ws_tokens(page_doc) + self._get_leading_ws_tokens(
            summary_doc
        )
        full_pos = len(page_doc) + tok.i  # Position of token in full document
        full_logits, word_id = self.get_masked_logits(full_toks, full_pos)

        # For the local context, we use just the sentence from the summary
        sent_pos = tok.i - sent.start  # Position of token in the sentence
        sent_logits, _ = self.get_masked_logits([tok.text for tok in sent], sent_pos)

        # Calculate probabilities using first sub-word token
        full_probs = torch.softmax(full_logits[0], dim=0)
        sent_probs = torch.softmax(sent_logits[0], dim=0)

        p = full_probs[word_id]
        q = sent_probs[word_id]

        if method == "kl":
            # KL-divergence is p*log(p/q)
            score = float(p * torch.log2(p / q))
        elif method == "contextuality":
            # Contextuality is distance between full-text and sentence probability
            score = float(p - q)
        else:
            raise ValueError("Unknown method.")

        return score

    def choose_blank_positions(
        self, page_doc: Doc, summary_doc: Doc, num_blanks: int
    ) -> list[int]:
        """Choose positions to blank in the summary based on contextuality scores with full page"""
        scores = []
        valid_positions = []

        page_lemmas = {tok.lemma_ for tok in page_doc}

        # Calculate scores for each position in the summary
        for i, sent in enumerate(summary_doc.sents):
            if i == 0:
                continue  # Skip first sentence
            for tok in sent:
                if (
                    len(tok.text) < 3
                    or tok.pos_ in self.blacklist
                    or tok.is_stop
                    or not tok.text.isalpha()
                    or tok.lemma_ not in page_lemmas
                ):
                    scores.append(-float("inf"))
                else:
                    # Calculate contextuality using both the full page and summary
                    score = self.get_contextuality_score(
                        page_doc, summary_doc, sent, tok
                    )
                    scores.append(score)
                valid_positions.append(tok.i)

        # Convert to numpy for easier manipulation
        scores = np.array(scores)

        # Choose positions greedily while maintaining minimum distance
        positions = []
        for _ in range(num_blanks):
            if np.all(scores == -float("inf")):
                break

            # Choose highest scoring position
            idx = np.argmax(scores)
            pos = valid_positions[idx]
            positions.append(pos)

            # Zero out scores within minimum distance
            start = max(0, idx - self.min_blank_distance)
            end = min(len(scores), idx + self.min_blank_distance + 1)
            scores[start:end] = -float("inf")

        return sorted(positions)

    def get_alternates(self, tokens: list[str], topk=5) -> list[dict]:
        """Get top k predictions for the masked positions in tokens

        Returns:
            List of dictionaries, one per masked position, with candidate words and their probabilities
        """
        predictions = []

        # Find all mask positions
        mask_positions = [i for i, token in enumerate(tokens) if token == "[MASK]"]

        for mask_pos in mask_positions:
            word_candidates = {}

            # Try different mask lengths (1, 2, or 3 tokens)
            for mask_length in range(1, 4):
                # Replace the single mask with multiple if needed
                masked_tokens = (
                    tokens[:mask_pos]
                    + ["[MASK]"] * mask_length
                    + tokens[mask_pos + 1 :]
                )

                # Get initial predictions for first token
                current_candidates = []
                logits, _ = self.get_masked_logits(masked_tokens, mask_pos)
                probs = torch.softmax(logits[0], dim=0)
                top_values, top_indices = torch.topk(probs, topk)

                # Start with first token candidates
                for idx, prob in zip(top_indices.tolist(), top_values.tolist()):
                    current_candidates.append(([idx], prob))

                # Build up multi-token predictions if needed
                for token_idx in range(1, mask_length):
                    new_candidates = []
                    for token_ids, prob in current_candidates:
                        # Fill in what we've predicted so far
                        partial_filled = tokens.copy()
                        filled_text = self.tokenizer.decode(token_ids)
                        remaining_masks = mask_length - token_idx

                        partial_filled = (
                            tokens[:mask_pos]
                            + [filled_text]
                            + ["[MASK]"] * remaining_masks
                            + tokens[mask_pos + 1 :]
                        )

                        # Get prediction for next position
                        next_logits, _ = self.get_masked_logits(
                            partial_filled, mask_pos + 1
                        )
                        next_probs = torch.softmax(next_logits[0], dim=0)
                        next_values, next_indices = torch.topk(next_probs, 1)

                        # Add to candidates
                        new_token_ids = token_ids + [next_indices[0].item()]
                        new_prob = prob * next_values[0].item()
                        new_candidates.append((new_token_ids, new_prob))

                    current_candidates = new_candidates

                # Add final decoded words
                for token_ids, prob in current_candidates:
                    word = self.tokenizer.decode(token_ids).strip()
                    if " " in word:
                        # Word contains a space (is actually multiple words)
                        continue
                    if word not in word_candidates or prob > word_candidates[word]:
                        word_candidates[word] = prob

            # Sort candidates by probability
            sorted_candidates = sorted(
                word_candidates.items(), key=lambda x: x[1], reverse=True
            )
            predictions.append({word: prob for word, prob in sorted_candidates[:topk]})

        return predictions

    def generate_cloze(
        self,
        summary_text: str,
        page_text: str = "",
        num_blanks: int = 10,
    ) -> tuple[str, list[str], list[dict[str, float]]]:
        """Generate a cloze text from summary using page for context

        Args:
            page_text: The full page text
            summary_text: The summary text to create gaps in
            num_blanks: Number of blanks to create

        Returns:
            Tuple of (cloze_text, answers, alternates)
        """
        # Process both texts
        page_doc = self.nlp(page_text)
        summary_doc = self.nlp(summary_text)

        # Choose positions to blank in the summary
        masked_positions = self.choose_blank_positions(
            page_doc, summary_doc, num_blanks
        )

        # Get the answers (the original words that will be blanked)
        answers = [summary_doc[pos].text for pos in masked_positions]

        # Replace tokens with mask
        summary_tokens = np.array(self._get_leading_ws_tokens(summary_doc))
        summary_tokens[masked_positions] = "[MASK]"
        summary_tokens = summary_tokens.tolist()

        # Construct cloze token input for gap predictions
        cloze_tokens = self._get_leading_ws_tokens(page_doc) + summary_tokens

        # Collect gaps
        gaps = []
        for tok in summary_doc:
            if tok.i in masked_positions:
                gaps.append((tok.text, tok.idx, len(tok.text)))

        return gaps

    def score_cloze_answers(self, text_with_masks: str, top_k: int) -> list[dict[str, float]]:
        """
        Return, for each [MASK], a dict {token -> probability}.
        """
        # normalize top_k 
        k_norm = _normalize_topk_for_use(top_k)
        k = 5000 if k_norm is None else int(k_norm)
    
        inputs = self.tokenizer(text_with_masks, return_tensors="pt").to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        logits = outputs.logits  # [batch, seq, vocab]
    
        mask_positions = torch.where(inputs["input_ids"][0] == self.tokenizer.mask_token_id)[0]
    
        results = []
        for pos in mask_positions:
            vec = logits[0, pos.item()]            # [vocab]
            probs = torch.softmax(vec, dim=-1)
            top_probs, top_idx = torch.topk(probs, k)   # <- k is guaranteed int
            preds = {}
            for p, idx in zip(top_probs.tolist(), top_idx.tolist()):
                token = self.tokenizer.decode([idx]).strip()
                preds[token] = p
            results.append(preds)
        return results


In [29]:
# LH: Use doc strings for function descriptions
# LH: The underscore before a function name indicates that it is for internal use only, usually methods in a class that shouldn't be called from outside that class.
# LH: I wouldn't use it for functions inside the global scope of a notebook.


def _mask_by_char_spans(
    text: str, spans: list[tuple[int, int]]
) -> str:  # converting character positions from df to masks
    parts, prev = [], 0
    for start, end in spans:
        parts.append(text[prev:start])
        parts.append("[MASK]")
        prev = end
    parts.append(text[prev:])
    return "".join(parts)

def _normalize_topk_for_use(x):
    """Return an int topk or None (== unlimited). Accepts int/float/NaN/None/'unlimited'."""
    if x is None: 
        return None
    if isinstance(x, str) and x.strip().lower() == "unlimited":
        return None
    try:
        # treat NaN like unlimited
        if isinstance(x, float) and np.isnan(x):
            return None
        return int(x)
    except Exception:
        return None  # safest fallback: unlimited

def _display_topk(x):
    """Return a display value for topk: int or 'unlimited'."""
    k = _normalize_topk_for_use(x)
    return "unlimited" if k is None else int(k)



# LH: Could do this a bit more cleanly like
def mask_by_char_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Replace text at gapped locations with [MASK] using (start, end) tuples"""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[MASK]" + text[end:]
    return text


def _extract_from_contextuality_plus(
    row,
):  # extract all relevant information from dataframe
    """
    Return:
      summary_text : str
      gap_spans    : list[(start, end)]
      originals    : list[str]
    """
    cp = row.contextuality_plus
    summary_text = cp.get("text", row.summary)
    gap_spans, originals = [], []
    for word, start, length in cp.get("gaps", []):
        originals.append(word)
        gap_spans.append((start, start + length))
    return summary_text, gap_spans, originals

In [30]:
def generate_accepted_answers(
    gapper,
    page_text: str,
    row,
    min_probability: float = 0.05,
    topk: int | None = None,
):
    summary_text, gap_spans, originals = _extract_from_contextuality_plus(row)
    masked_summary = mask_by_char_spans(summary_text, gap_spans)

    page_str = "".join(gapper._get_leading_ws_tokens(gapper.nlp(page_text)))
    full_context = page_str + "\n\n" + masked_summary

    k_norm = _normalize_topk_for_use(topk)
    k_arg = 5000 if k_norm is None else int(k_norm)

    preds_per_gap = gapper.score_cloze_answers(full_context, top_k=k_arg)

    accepted = []
    for i, pred_dict in enumerate(preds_per_gap):
        keep = set()
        for w, p in pred_dict.items():
            w2 = w.strip()
            if w2 and " " not in w2 and w2.isalpha() and p >= min_probability:
                keep.add(w2.lower())
        if i < len(originals):
            keep.add(originals[i].strip().lower())
        accepted.append(keep)

    return accepted, originals



# Evaluate accuracy of annotations given new acceptable answers
def rescore_annotations(
    df,
    gapper,
    min_probability: float = 0.05,
    topk: int | None = None,
):
    # LH: Put imports at the top of the notebook.
    # These are already imported...
    import pandas as pd
    from tqdm.auto import tqdm

    rows = []
    for row in tqdm(df.itertuples(), total=len(df), desc="Rescoring"):
        accepted, originals = generate_accepted_answers(
            gapper=gapper,  # LH: You wouldn't need to pass this class instance around if this function were a class method.
            page_text=row.text,
            row=row,
            min_probability=min_probability,
            topk=topk,
        )

        human = [annotation.strip().lower() for annotation in row.annotations]
        flags = []
        for i, ans in enumerate(human):
            okset = accepted[i] if i < len(accepted) else set()
            if not okset and i < len(originals):  # safety fallback
                okset = {originals[i].strip().lower()}
            flags.append(ans in okset)

        percent = 100.0 * (sum(flags) / max(1, len(flags)))
        rows.append(
            {
                "accepted_answers": accepted,
                "human_correctness": flags,
                "percent_correct_alt": percent,
                "alt_counts": [len(s) for s in accepted],
            }
        )

    extra = pd.DataFrame(rows, index=df.index)
    return pd.concat([df.copy(), extra], axis=1)

In [31]:
# LH: see above
# !pip install -U "transformers>=4.46" "tokenizers>=0.15" safetensors

In [32]:
gapper = ContextualityGapper(model_name="answerdotai/ModernBERT-large")

In [33]:
# ============================================
#      ANALYZE HUMAN AGREEMENT PATTERNS
# ============================================
def analyze_human_agreement(df):
    """
    Find cases where multiple humans gave the same *incorrect* answer
    for the same passage and the same gap index.
    Returns: (counts_df, common_df)
    """

    recs = []

    for _, row in df.iterrows():
        pid = row["passageId"]
        anns = row["annotations"]
        gold = row["correctAnswers"]
        # LH: Please use descriptive variable names
        for g, (h, c) in enumerate(zip(anns, gold)):
            h = h.strip().lower()
            c = c.strip().lower()
            if h != c:
                recs.append({"passageId": pid, "gap_idx": g, "human_answer": h})

    if not recs:
        print("No wrong answers found.")
        return pd.DataFrame(), pd.DataFrame()

    tmp = pd.DataFrame(recs)
    counts = (
        tmp.value_counts(["passageId", "gap_idx", "human_answer"])
        .reset_index(name="n")
        .sort_values("n", ascending=False)
    )

    common = counts[counts["n"] > 1]

    print(
        f"Found {len(common)} (passageId, gap_idx) cases with agreement on the same wrong answer."
    )
    if not common.empty:
        print("\nTop 10:")
        print(common.head(10).to_string(index=False))
    else:
        print(
            "\nNo per-gap agreements detected. This usually means you have one annotator per passage, "
            "or the wrong answers are all unique per gap."
        )

    return counts, common


# ============================================
#            PARAMETER GRID SEARCH
# ============================================


def test_parameters(df, gapper, min_prob_values, topk_values=None):
    results = []

    if topk_values is None:
        topk_values = [10_000]  # Effectively unlimited

    total_tests = len(min_prob_values) * len(topk_values)
    pbar = tqdm(total=total_tests, desc="Testing parameters")

    for min_prob in min_prob_values:
        for topk in topk_values:
            # Normalize for use (None => unlimited)
            k_use = _normalize_topk_for_use(topk)

            df_rescored = rescore_annotations(
                df.copy(),
                gapper,
                min_probability=min_prob,
                topk=k_use,  # pass None or int
            )

            metrics = {
                "min_prob": min_prob,
                # -------- FIX: store a *display* topk that is not NaN --------
                "topk": _display_topk(topk),
                # ----------------------------------------------------------------
                "mean_accuracy": df_rescored["percent_correct_alt"].mean(),
                "median_accuracy": df_rescored["percent_correct_alt"].median(),
                "std_accuracy": df_rescored["percent_correct_alt"].std(),
                "mean_alternatives": df_rescored["alt_counts"].apply(lambda x: np.mean(x)).mean(),
                "median_alternatives": df_rescored["alt_counts"].apply(lambda x: np.median(x)).median(),
                "max_alternatives": df_rescored["alt_counts"].apply(lambda x: np.max(x)).max(),
                "gaps_with_1_5_alts": (
                    sum(df_rescored["alt_counts"].apply(lambda x: all(1 <= c <= 5 for c in x)))
                    / len(df_rescored) * 100
                ),
            }

            results.append(metrics)
            pbar.update(1)

    pbar.close()
    return pd.DataFrame(results)



# ============================================
#                 BEST PARAMS
# ============================================


def analyze_best_parameters(df, gapper, min_prob, topk, common_df):
    """
    Analysis of a specific parameter combo.
    common_df is the DataFrame returned by analyze_human_agreement(...)[1]
    """

    df_rescored = rescore_annotations(
        df.copy(), gapper, min_probability=min_prob, topk=topk
    )

    print(f"\n{'=' * 60}")
    print(
        f"DETAILED ANALYSIS: min_prob={min_prob}, topk={topk if topk else 'unlimited'}"
    )
    print(f"{'=' * 60}")

    # Overall stats
    print(f"\nOverall Performance:")
    print(f"  Mean accuracy: {df_rescored['percent_correct_alt'].mean():.1f}%")
    print(f"  Median accuracy: {df_rescored['percent_correct_alt'].median():.1f}%")
    print(f"  Std deviation: {df_rescored['percent_correct_alt'].std():.1f}%")

    # Alternative counts (flatten)
    alt_counts_flat = [c for row in df_rescored["alt_counts"] for c in row]
    print(f"\nAlternative Answers per Gap:")
    print(f"  Mean: {np.mean(alt_counts_flat):.1f}")
    print(f"  Median: {np.median(alt_counts_flat):.0f}")
    print(f"  Max: {np.max(alt_counts_flat)}")
    print(
        f"  % gaps with 1-5 alternatives: {sum(1 <= c <= 5 for c in alt_counts_flat) / len(alt_counts_flat) * 100:.1f}%"
    )

    # How many “common wrong” are now accepted for the same passage+gap
    accepted_mistakes = 0
    total_mistakes = len(common_df)

    if total_mistakes:
        # Build a quick lookup: passageId -> row in rescored df
        rescored_by_pid = df_rescored.set_index("passageId")

        for _, rec in common_df.iterrows():
            pid = rec["passageId"]
            gap = int(rec["gap_idx"])
            ans = str(rec["human_answer"]).strip().lower()

            if pid in rescored_by_pid.index:
                row = rescored_by_pid.loc[pid]
                if gap < len(row["accepted_answers"]):
                    if ans in row["accepted_answers"][gap]:
                        accepted_mistakes += 1

    print(f"\nCommon Human Agreements (per passage+gap):")
    if total_mistakes:
        print(
            f"  Accepting {accepted_mistakes}/{total_mistakes} agreed-on 'incorrect' answers"
        )
    else:
        print("  (None found in this dataset)")

    return df_rescored

In [35]:
def run_testing(df, gapper):
    print("=" * 60)
    print("PARAMETER OPTIMIZATION FOR CLOZE SCORING")
    print("=" * 60)

    print("\nSTEP 1: Analyzing Human Agreement Patterns_____")
    counts_df, common_df = analyze_human_agreement(df)

    print("\nSTEP 2: Testing Parameter Combinations_____")
    min_prob_values = [0.01, 0.02, 0.03, 0.05, 0.07, 0.10, 0.15, 0.20]
    topk_values = [5, 10, 20, 50, None]  # None = unlimited
    results_df = test_parameters(df, gapper, min_prob_values, topk_values)

    print("\nSTEP 3: Finding Optimal Parameters_____")
    results_df["score"] = (
        results_df["mean_accuracy"] * 2
        + results_df["gaps_with_1_5_alts"]
        + (5 - abs(results_df["mean_alternatives"] - 3)) * 10
    )
    best_params = results_df.nlargest(5, "score")
    print("\nTOP 5 PARAMETER COMBINATIONS:")
    print(
        best_params[
            ["min_prob", "topk", "mean_accuracy", "mean_alternatives", "gaps_with_1_5_alts", "score"]
        ].to_string(index=False)
    )

    best_row = best_params.iloc[0]
    best_min_prob = float(best_row["min_prob"])

    # convert display topk back to usable value --------
    best_topk = _normalize_topk_for_use(best_row["topk"])  # None or int

    print("\nSTEP 4: Best Parameters______")
    df_best = analyze_best_parameters(df, gapper, best_min_prob, best_topk, common_df)

    print("\n" + "=" * 60)
    print("FINAL RECOMMENDATIONS")
    print("=" * 60)
    print(f"\nRECOMMENDED PARAMETERS:")
    print(f"  min_probability = {best_min_prob}")
    print(f"  topk = {_display_topk(best_topk)}")
    print(f"\nExpected Performance:")
    print(f"  Mean human accuracy: {best_row['mean_accuracy']:.1f}%")
    print(f"  Mean alternatives per gap: {best_row['mean_alternatives']:.1f}")
    print(f"  Gaps with 1-5 alternatives: {best_row['gaps_with_1_5_alts']:.1f}%")

    return results_df, df_best, best_min_prob, best_topk


In [None]:
results_df, df_best, best_min_prob, best_topk = run_testing(df, gapper)


PARAMETER OPTIMIZATION FOR CLOZE SCORING

STEP 1: Analyzing Human Agreement Patterns_____
Found 41 (passageId, gap_idx) cases with agreement on the same wrong answer.

Top 10:
 passageId  gap_idx  human_answer  n
         5        1      sentence  7
         5        6        source  6
         5        8        source  6
         5        2      sentence  6
         5        4      sentence  6
         5        3      sentence  5
         5        5      sentence  5
         1        6        source  4
         5        7      sentence  4
         1        7 unpredictable  4

STEP 2: Testing Parameter Combinations_____


Testing parameters:   0%|          | 0/40 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

Rescoring:   0%|          | 0/23 [00:00<?, ?it/s]

## Compare Gaps

Without collecting additional annotations, the best we can do is look at which gaps would be retained with this method and which gaps would be removed.

If this method disproportionately removes gaps that were more difficult to answer, then we can infer that the lemma overlap restriction will make the cloze exercise easier (which is what we want).

First, we see that the lemma overlap constraint does not substantially decrease the number of gaps that are generated (8.65 gaps per exercise vs. 9 gaps per exercise).

In [12]:
display(df.restricted_answers.str.len().describe())
display(df.correctAnswers.str.len().describe())

AttributeError: 'DataFrame' object has no attribute 'restricted_answers'

## Results

The lemma overlap restriction retains 44% of gaps that are "source-predictable" (ideal), and removes 31% of gaps that were scored as "unpredictable" (bad gaps). This is a good indication that the lemma overlap constraint improves the cloze exercise.

In [60]:
unchanged = []
removed = []
# added = []


def normalize_counter(c: Counter):
    total = sum(c.values())
    for key in c:
        c[key] = round(c[key] / total, 2)
    return c


for row in df.itertuples():
    for answer, rating in zip(row.correctAnswers, row.annotations):
        if answer in row.restricted_answers:
            unchanged.append(rating)
        else:
            removed.append(rating)

print("Retained gaps")
print(sorted(normalize_counter(Counter(unchanged)).items()))
print("\nRemoved gaps")
print(sorted(normalize_counter(Counter(removed)).items()))

Retained gaps
[('passage', 0.18), ('sentence', 0.3), ('source', 0.44), ('unpredictable', 0.08)]

Removed gaps
[('passage', 0.17), ('sentence', 0.43), ('source', 0.09), ('unpredictable', 0.31)]
