# 05 — Contextual Diversity

Contextual diversity captures how *varied* the contexts are in which words and ideas appear.

Why it matters:
- Statistical learning benefits from varied contexts
- Over-constrained text often repeats the same frames and neighborhoods

This notebook computes two complementary families of metrics:

A) Lexical contextual diversity (robust, simple)
   - document-frequency-like measures across chunks
   - repeated-context penalties

B) Semantic contextual diversity (LSA-based, more "meaning bandwidth")
   - dispersion of chunk embeddings within a document
   - neighborhood variety using cosine similarity structure

Inputs:
- `data/texts_clean/chunks.jsonl`
- `data/lsa/chunk_embeddings.npy`
- `data/lsa/chunk_index.jsonl`

Outputs:
- `data/diversity/word_contextual_diversity.jsonl`
- `data/diversity/doc_contextual_diversity_summary.jsonl`
- `data/diversity/semantic_context_dispersion_per_chunk.jsonl`
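
The cells below read `chunk_id` and `text` from `chunks.jsonl`, and `chunk_id`, `doc_id`, `title`, `chunk_type`, `chunk_index` from `chunk_index.jsonl`. A minimal sketch of a schema check on already-loaded rows; the helper name `missing_keys` and the toy rows are hypothetical, not part of the pipeline:

```python
from typing import Any, Dict, Iterable, List, Set

# Fields this notebook reads from each input (taken from the cells below).
CHUNK_KEYS = {"chunk_id", "text"}
INDEX_KEYS = {"chunk_id", "doc_id", "title", "chunk_type", "chunk_index"}

def missing_keys(rows: Iterable[Dict[str, Any]], required: Set[str]) -> List[int]:
    """Return positions of rows that lack at least one required key."""
    return [i for i, r in enumerate(rows) if not required.issubset(r.keys())]

# Toy rows standing in for chunk_index.jsonl (illustrative values only).
toy_index = [
    {"chunk_id": "c0", "doc_id": "d0", "title": "T", "chunk_type": "para", "chunk_index": 0},
    {"chunk_id": "c1", "doc_id": "d0", "title": "T"},  # missing two fields
]
bad_rows = missing_keys(toy_index, INDEX_KEYS)
print(bad_rows)  # [1]
```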


## Imports + paths

In [None]:
from __future__ import annotations

import json
import re
import math
from pathlib import Path
from typing import Dict, Any, List, Tuple
from collections import Counter, defaultdict

import numpy as np

from _paths import set_repo_root
ROOT = set_repo_root()

CHUNKS_TEXT_IN = ROOT / "data" / "texts_clean" / "chunks.jsonl"

EMB_IN = ROOT / "data" / "lsa" / "chunk_embeddings.npy"
INDEX_IN = ROOT / "data" / "lsa" / "chunk_index.jsonl"

OUT_DIR = ROOT / "data" / "diversity"
OUT_DIR.mkdir(parents=True, exist_ok=True)

WORD_OUT = OUT_DIR / "word_contextual_diversity.jsonl"
DOC_SUMMARY_OUT = OUT_DIR / "doc_contextual_diversity_summary.jsonl"
SEM_CHUNK_OUT = OUT_DIR / "semantic_context_dispersion_per_chunk.jsonl"

print("Chunks:", CHUNKS_TEXT_IN.resolve())
print("Embeddings:", EMB_IN.resolve())
print("Index:", INDEX_IN.resolve())
print("Output dir:", OUT_DIR.resolve())


## Load chunks + LSA embeddings + aligned index

In [None]:
def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows

chunks = read_jsonl(CHUNKS_TEXT_IN)
index = read_jsonl(INDEX_IN)
Z = np.load(EMB_IN)

# Build lookup for text by chunk_id
text_by_chunk_id = {r["chunk_id"]: r["text"] for r in chunks}

if len(index) != Z.shape[0]:
    raise ValueError(f"Mismatch: {len(index)} index rows vs {Z.shape[0]} embeddings")

# Ensure index order matches embedding rows
chunk_ids = [r["chunk_id"] for r in index]
missing_text = sum(1 for cid in chunk_ids if cid not in text_by_chunk_id)
if missing_text:
    print("Warning: missing text for", missing_text, "chunk_ids")

print("Loaded:", len(chunk_ids), "indexed chunks; embedding dim:", Z.shape[1])
print("Example chunk_id:", chunk_ids[0], "title:", index[0]["title"])


## Group by document (using index order)

In [None]:
doc_to_rows: Dict[str, List[int]] = defaultdict(list)
doc_info: Dict[str, Dict[str, Any]] = {}

for i, r in enumerate(index):
    doc_id = r["doc_id"]
    doc_to_rows[doc_id].append(i)
    doc_info.setdefault(doc_id, {"title": r["title"], "chunk_type": r["chunk_type"]})

# Sort within doc by chunk_index (just to be safe)
for doc_id, rows in doc_to_rows.items():
    rows.sort(key=lambda i: index[i]["chunk_index"])

print("Documents:", len(doc_to_rows))


## Tokenization for lexical diversity

We’ll use a conservative tokenization and ignore very short tokens.
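
A quick illustration of what this tokenization keeps and drops. The pattern is copied from the cell below; the `_demo` names and the sample sentence are only for this sketch:

```python
import re
from typing import List

# Same pattern as the notebook cell: ASCII words with an optional
# apostrophe suffix, or runs of digits.
RE_TOKEN_DEMO = re.compile(r"[a-zA-Z]+(?:'[a-zA-Z]+)?|\d+")

def tokens_demo(text: str) -> List[str]:
    """Lowercase, extract tokens, and drop tokens shorter than 2 characters."""
    return [t for t in RE_TOKEN_DEMO.findall(text.lower()) if len(t) >= 2]

sample = "Don't add 3 x 42 apples; café!"
demo_tokens = tokens_demo(sample)
print(demo_tokens)  # ["don't", 'add', '42', 'apples', 'caf']
```

Single characters such as `3` and `x` are dropped. Because the pattern is ASCII-only, accented letters split or truncate words (`café` becomes `caf`), which may matter for non-English text.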

In [None]:
RE_TOKEN = re.compile(r"[a-zA-Z]+(?:'[a-zA-Z]+)?|\d+")

def tokens(text: str) -> List[str]:
    t = text.lower()
    toks = RE_TOKEN.findall(t)
    # Drop tiny tokens that often act like noise in curriculum text
    toks = [x for x in toks if len(x) >= 2]
    return toks


## Part A — Lexical contextual diversity

### Compute word -> set(chunks) + frequency stats

Definitions:

- `df_chunks(word)`: the number of chunks in which the word appears at least once.
- `contextual_diversity`: `df_chunks / n_chunks_in_doc` (per doc) or `df_chunks / n_global_chunks` (global).
- `dispersion_norm`: a rough burstiness signal, i.e. how spread out or clustered the word's chunks are. It uses chunk positions only, so repeats within a single chunk don’t count.

We’ll compute per-doc word diversity and also a global view.
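
A worked toy example of the per-doc definition, assuming a four-chunk document that is already tokenized (the data is made up for illustration):

```python
from collections import defaultdict
from typing import Dict, List, Set

# Toy document: four chunks, already tokenized (illustrative data only).
toy_chunks: List[List[str]] = [
    ["fractions", "are", "parts"],
    ["add", "fractions", "fractions"],
    ["numerator", "over", "denominator"],
    ["fractions", "on", "number", "line"],
]

word_to_chunks: Dict[str, Set[int]] = defaultdict(set)
for pos, toks in enumerate(toy_chunks):
    for w in set(toks):  # a word counts once per chunk, however often it repeats
        word_to_chunks[w].add(pos)

n_chunks = len(toy_chunks)
cd = {w: len(ps) / n_chunks for w, ps in word_to_chunks.items()}
print(cd["fractions"])  # 0.75 (3 of 4 chunks)
print(cd["numerator"])  # 0.25 (1 of 4 chunks)
```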

In [None]:
MIN_WORD_TOTAL_CT = 5      # ignore extremely rare words
MIN_WORD_DF_CHUNKS = 2     # require appearing in >= 2 chunks for diversity stats

# Global stats across all chunks
global_word_ct = Counter()
global_word_chunks = defaultdict(set)  # word -> {global_chunk_row_index}

# Per-doc stats
doc_word_ct = defaultdict(Counter)          # doc_id -> Counter(word)
doc_word_chunks = defaultdict(lambda: defaultdict(set))  # doc_id -> word -> {chunk_index_in_doc}

for doc_id, row_ids in doc_to_rows.items():
    for local_pos, row_id in enumerate(row_ids):
        cid = index[row_id]["chunk_id"]
        text = text_by_chunk_id.get(cid, "")
        toks = tokens(text)

        # update global
        global_word_ct.update(toks)
        for w in set(toks):
            global_word_chunks[w].add(row_id)

        # update per-doc
        doc_word_ct[doc_id].update(toks)
        for w in set(toks):
            doc_word_chunks[doc_id][w].add(local_pos)

print("Global vocab size:", len(global_word_ct))


### Burstiness / dispersion helper

We want a simple "are occurrences spread out or clustered?" measure.

For each word, we have the set of chunk positions it appears in: e.g., `{0,1,2,10,11}`.
Compute mean gap between sorted positions; larger mean gap implies more dispersion.
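
Worked through for the example positions above, assuming a document of 12 chunks (the document length is an assumption for this sketch):

```python
import numpy as np

positions = sorted({0, 1, 2, 10, 11})
gaps = np.diff(positions)             # [1, 1, 8, 1]
mean_gap_value = float(gaps.mean())   # 11 / 4 = 2.75

# The gaps telescope, so the mean gap equals (last - first) / (count - 1).
span_based = (positions[-1] - positions[0]) / (len(positions) - 1)

n_chunks = 12                         # assumed document length
dispersion = mean_gap_value / n_chunks
print(mean_gap_value, span_based, round(dispersion, 3))  # 2.75 2.75 0.229
```

Since the mean gap depends only on the first position, the last position, and the number of positions, a word that appears in every chunk gets a small value (mean gap 1). The score is therefore best read alongside `contextual_diversity` rather than on its own.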

In [None]:
def mean_gap(positions: List[int]) -> float:
    if len(positions) <= 1:
        return float("nan")
    pos = sorted(positions)
    gaps = [pos[i+1] - pos[i] for i in range(len(pos)-1)]
    return float(np.mean(gaps)) if gaps else float("nan")

def normalized_dispersion(positions: List[int], n_chunks: int) -> float:
    """
    Normalize mean gap by n_chunks, a loose upper bound on the mean gap
    (the largest possible value is n_chunks - 1, for two occurrences at the ends).
    This is a heuristic score between 0 and 1: higher = more spread out.
    """
    mg = mean_gap(positions)
    if not np.isfinite(mg) or n_chunks <= 1:
        return float("nan")
    return float(mg / n_chunks)


### Build per-word contextual diversity table (global + per doc)

This builds one JSON row per word per document, plus one global summary row per word. Words below the minimum count or minimum chunk thresholds are skipped.
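
One way such rows might be used downstream is to flag words that are frequent within a document yet confined to few chunks. A sketch on hand-written rows that follow the field names produced by the cell below; the values, the helper name `concentrated_words`, and the 0.2 threshold are illustrative assumptions:

```python
from typing import Any, Dict, List

# Hand-written rows mimicking the output schema (values are invented).
toy_rows: List[Dict[str, Any]] = [
    {"scope": "doc", "doc_id": "d0", "word": "fractions",
     "total_count": 40, "df_chunks": 18, "contextual_diversity": 0.90},
    {"scope": "doc", "doc_id": "d0", "word": "worksheet",
     "total_count": 25, "df_chunks": 2, "contextual_diversity": 0.10},
    {"scope": "global", "doc_id": None, "word": "fractions",
     "total_count": 90, "df_chunks": 30, "contextual_diversity": 0.30},
]

def concentrated_words(rows: List[Dict[str, Any]], max_diversity: float = 0.2) -> List[Dict[str, Any]]:
    """Per-doc rows at or below the diversity threshold, most frequent first."""
    hits = [r for r in rows
            if r["scope"] == "doc" and r["contextual_diversity"] <= max_diversity]
    return sorted(hits, key=lambda r: -r["total_count"])

flagged = [r["word"] for r in concentrated_words(toy_rows)]
print(flagged)  # ['worksheet']
```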

In [None]:
word_rows: List[Dict[str, Any]] = []

# Global rows
n_global_chunks = len(index)
for w, ct in global_word_ct.items():
    df = len(global_word_chunks[w])
    if ct < MIN_WORD_TOTAL_CT or df < MIN_WORD_DF_CHUNKS:
        continue
    word_rows.append({
        "scope": "global",
        "doc_id": None,
        "title": None,
        "word": w,
        "total_count": int(ct),
        "df_chunks": int(df),
        "contextual_diversity": float(df / n_global_chunks),
        "dispersion_norm": float(normalized_dispersion(list(global_word_chunks[w]), n_global_chunks)),
    })

# Per-doc rows
for doc_id, counts in doc_word_ct.items():
    n_chunks_doc = len(doc_to_rows[doc_id])
    title = doc_info[doc_id]["title"]

    for w, ct in counts.items():
        pos_set = doc_word_chunks[doc_id].get(w, set())
        df = len(pos_set)
        if ct < MIN_WORD_TOTAL_CT or df < MIN_WORD_DF_CHUNKS:
            continue

        word_rows.append({
            "scope": "doc",
            "doc_id": doc_id,
            "title": title,
            "word": w,
            "total_count": int(ct),
            "df_chunks": int(df),
            "contextual_diversity": float(df / n_chunks_doc),
            "dispersion_norm": float(normalized_dispersion(list(pos_set), n_chunks_doc)),
        })

print("Word contextual diversity rows:", len(word_rows))
print("Example:", word_rows[0])


### Save word contextual diversity JSONL

In [None]:
def write_jsonl(path: Path, rows: List[Dict[str, Any]]) -> None:
    with path.open("w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

write_jsonl(WORD_OUT, word_rows)
print("Wrote:", WORD_OUT, f"({WORD_OUT.stat().st_size} bytes)")


## Part B — Semantic contextual diversity (LSA dispersion)

Here we measure "how many distinct semantic neighborhoods does this document cover?" using chunk embeddings.
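
The functions in this part treat dot products between rows of `Z` as cosine similarities, which only holds if every row has unit L2 norm. A minimal sketch of a check and a fallback normalization; the helper names and the tolerance are assumptions of this sketch, not part of the LSA notebook:

```python
import numpy as np

def rows_are_unit_norm(Z: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row of Z has L2 norm within tol of 1."""
    norms = np.linalg.norm(Z, axis=1)
    return bool(np.all(np.abs(norms - 1.0) <= tol))

def normalize_rows(Z: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return a copy of Z with each row scaled to unit norm (zero rows stay zero)."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, eps)

Z_demo = np.array([[3.0, 4.0], [1.0, 0.0]])
print(rows_are_unit_norm(Z_demo))                  # False
print(rows_are_unit_norm(normalize_rows(Z_demo)))  # True
```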

### Semantic dispersion per document + per chunk

Metrics:
- `mean_pairwise_distance`: mean cosine distance between chunk pairs (approximated by sampling pairs, to avoid O(n^2) cost on long docs)
- `centroid_sim_mean`: average cosine similarity of chunks to the doc centroid (higher = semantically tighter, lower variety)
- `local_nn_sim_mean`: average similarity of each chunk to its k nearest neighbors within the doc (higher = more repetition)
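
To make the direction of the first two metrics concrete, here is a small sketch that computes them exactly (no sampling) on two toy documents with unit-norm rows. The function names and vectors are invented for illustration:

```python
import numpy as np

def centroid_sim_mean_demo(M: np.ndarray) -> float:
    """Mean cosine similarity of unit-norm rows to their normalized centroid."""
    c = M.mean(axis=0)
    c = c / np.linalg.norm(c)
    return float(np.mean(M @ c))

def exact_mean_pairwise_distance(M: np.ndarray) -> float:
    """Mean cosine distance over all unordered pairs of unit-norm rows."""
    S = M @ M.T
    iu = np.triu_indices(M.shape[0], k=1)  # strictly upper triangle: each pair once
    return float(np.mean(1.0 - S[iu]))

# "Tight" doc: three chunks pointing in almost the same direction.
tight_doc = np.array([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [1.0, 0.0, 0.1]])
tight_doc = tight_doc / np.linalg.norm(tight_doc, axis=1, keepdims=True)

# "Spread" doc: three mutually orthogonal chunks.
spread_doc = np.eye(3)

print(round(centroid_sim_mean_demo(spread_doc), 3))  # 0.577
print(exact_mean_pairwise_distance(spread_doc))      # 1.0
print(centroid_sim_mean_demo(tight_doc), exact_mean_pairwise_distance(tight_doc))
```

The tight document has the higher centroid similarity and the lower pairwise distance, so the two metrics move in opposite directions as variety increases.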

We'll compute:

- per-doc summary
- per-chunk "semantic neighborhood tightness" (avg similarity to k nearest chunks in same doc)
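
The per-chunk tightness can be sketched on toy data as follows. This version masks the diagonal with `-inf` rather than the `-1.0` used in the notebook's functions; that is a choice of this sketch, and the function name `knn_tightness` is hypothetical:

```python
import numpy as np

def knn_tightness(M: np.ndarray, k: int) -> np.ndarray:
    """Mean cosine similarity of each unit-norm row to its k most similar other rows."""
    S = M @ M.T
    np.fill_diagonal(S, -np.inf)                # exclude self-similarity
    k = min(k, M.shape[0] - 1)
    topk = np.partition(S, -k, axis=1)[:, -k:]  # k largest per row, in no particular order
    return topk.mean(axis=1)

# Three identical chunks plus one orthogonal outlier (toy unit vectors).
M_demo = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
tightness = knn_tightness(M_demo, k=2)
print(tightness)  # [1. 1. 1. 0.]
```

The repeated chunks score 1.0 (fully template-like), while the outlier scores 0.0.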

In [None]:
def l2_normalize(v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    n = np.linalg.norm(v)
    if n < eps:
        return v
    return v / n

def doc_semantic_metrics(doc_rows: List[int], Z: np.ndarray, k_nn: int = 5, sample_pairs: int = 2000) -> Dict[str, float]:
    M = Z[doc_rows]  # [n, d]; rows assumed L2-normalized, so dot products are cosine similarities
    n = M.shape[0]
    if n <= 1:
        return {
            "n_chunks": int(n),
            "centroid_sim_mean": float("nan"),
            "centroid_sim_median": float("nan"),
            "mean_pairwise_distance": float("nan"),
            "local_nn_sim_mean": float("nan"),
        }

    centroid = l2_normalize(np.mean(M, axis=0))
    centroid_sims = M @ centroid  # cosine similarity
    centroid_sim_mean = float(np.mean(centroid_sims))
    centroid_sim_median = float(np.median(centroid_sims))

    # Approx mean pairwise distance via random sampling of pairs
    rng = np.random.default_rng(42)
    pairs = min(sample_pairs, n * (n - 1) // 2)
    if pairs <= 0:
        mpd = float("nan")
    else:
        i = rng.integers(0, n, size=pairs)
        j = rng.integers(0, n, size=pairs)
        mask = i != j
        i, j = i[mask], j[mask]
        sims = np.sum(M[i] * M[j], axis=1)
        dists = 1.0 - sims
        mpd = float(np.mean(dists)) if dists.size else float("nan")

    # Local nearest-neighbor similarity (within doc)
    # Compute similarity matrix if doc small, else approximate by sampling
    if n <= 400:
        S = M @ M.T
        np.fill_diagonal(S, -1.0)
        k = min(k_nn, n - 1)
        # average of top-k similarities for each row
        topk = np.partition(S, -k, axis=1)[:, -k:]
        local_nn_sim_mean = float(np.mean(topk))
    else:
        # Approx: sample anchor points, measure top-k within a random subset
        anchors = rng.choice(n, size=min(80, n), replace=False)
        k = min(k_nn, n - 1)
        sims_all = []
        for a in anchors:
            candidates = rng.choice(n, size=min(300, n), replace=False)
            candidates = candidates[candidates != a]
            sims = M[candidates] @ M[a]
            if sims.size >= k:
                topk = np.partition(sims, -k)[-k:]
                sims_all.append(np.mean(topk))
        local_nn_sim_mean = float(np.mean(sims_all)) if sims_all else float("nan")

    return {
        "n_chunks": int(n),
        "centroid_sim_mean": centroid_sim_mean,
        "centroid_sim_median": centroid_sim_median,
        "mean_pairwise_distance": mpd,
        "local_nn_sim_mean": local_nn_sim_mean,
    }

def per_chunk_neighbor_tightness(doc_rows: List[int], Z: np.ndarray, k_nn: int = 5) -> List[float]:
    """
    For each chunk in doc, compute mean cosine similarity to its k nearest other chunks in same doc.
    Higher = more locally repetitive / template-y semantically.
    """
    M = Z[doc_rows]
    n = M.shape[0]
    if n <= 1:
        return [float("nan")] * n

    if n <= 500:
        S = M @ M.T
        np.fill_diagonal(S, -1.0)
        k = min(k_nn, n - 1)
        topk = np.partition(S, -k, axis=1)[:, -k:]
        return [float(np.mean(topk[i])) for i in range(n)]
    else:
        # Approx for very long docs
        rng = np.random.default_rng(42)
        k = min(k_nn, n - 1)
        out = []
        for i in range(n):
            candidates = rng.choice(n, size=min(400, n), replace=False)
            candidates = candidates[candidates != i]
            sims = M[candidates] @ M[i]
            if sims.size >= k:
                topk = np.partition(sims, -k)[-k:]
                out.append(float(np.mean(topk)))
            else:
                out.append(float("nan"))
        return out


### Compute semantic dispersion per doc + per chunk outputs

In [None]:
K_NN = 5

doc_sem_summaries: List[Dict[str, Any]] = []
chunk_sem_rows: List[Dict[str, Any]] = []

for doc_id, row_ids in doc_to_rows.items():
    title = doc_info[doc_id]["title"]
    chunk_type = doc_info[doc_id]["chunk_type"]

    # doc-level metrics
    m = doc_semantic_metrics(row_ids, Z, k_nn=K_NN, sample_pairs=2000)
    doc_sem_summaries.append({
        "doc_id": doc_id,
        "title": title,
        "chunk_type": chunk_type,
        "k_nn": K_NN,
        **m,
        # Interpretable: higher mean pairwise distance -> more semantic variety
        "semantic_variety_proxy": (None if not np.isfinite(m["mean_pairwise_distance"]) else float(m["mean_pairwise_distance"])),
    })

    # per-chunk local tightness
    local_tight = per_chunk_neighbor_tightness(row_ids, Z, k_nn=K_NN)
    for local_pos, row_id in enumerate(row_ids):
        r = index[row_id]
        chunk_sem_rows.append({
            "chunk_id": r["chunk_id"],
            "doc_id": doc_id,
            "title": title,
            "chunk_index": int(r["chunk_index"]),
            "chunk_type": chunk_type,
            "k_nn": K_NN,
            # Write None (JSON null) for non-finite values; NaN is not valid JSON
            "local_nn_sim": float(local_tight[local_pos]) if np.isfinite(local_tight[local_pos]) else None,
        })

print("Doc semantic summaries:", len(doc_sem_summaries))
print("Chunk semantic rows:", len(chunk_sem_rows))
print("Example doc semantic summary:", doc_sem_summaries[0])


### Save semantic per-doc summary + per-chunk semantic tightness

In [None]:
write_jsonl(DOC_SUMMARY_OUT, doc_sem_summaries)
write_jsonl(SEM_CHUNK_OUT, chunk_sem_rows)

print("Wrote:")
print("-", DOC_SUMMARY_OUT, f"({DOC_SUMMARY_OUT.stat().st_size} bytes)")
print("-", SEM_CHUNK_OUT, f"({SEM_CHUNK_OUT.stat().st_size} bytes)")


### Fun diagnostics: docs with lowest and highest semantic variety (by centroid similarity)

In [None]:
# Sort docs by centroid_sim_mean (higher = tighter semantic cluster = less variety)
docs_sorted = sorted(
    [d for d in doc_sem_summaries if np.isfinite(d["centroid_sim_mean"])],
    key=lambda d: -d["centroid_sim_mean"]
)

print("\nDocs with LOWEST semantic variety (highest centroid similarity):")
for d in docs_sorted[:8]:
    print(f"  centroid_sim_mean={d['centroid_sim_mean']:.3f}  "
          f"mean_pairwise_dist={d['mean_pairwise_distance']:.3f}  "
          f"local_nn_sim_mean={d['local_nn_sim_mean']:.3f}  "
          f"{d['title']}  (n_chunks={d['n_chunks']})")

print("\nDocs with HIGHEST semantic variety (lowest centroid similarity):")
for d in docs_sorted[-8:][::-1]:
    print(f"  centroid_sim_mean={d['centroid_sim_mean']:.3f}  "
          f"mean_pairwise_dist={d['mean_pairwise_distance']:.3f}  "
          f"local_nn_sim_mean={d['local_nn_sim_mean']:.3f}  "
          f"{d['title']}  (n_chunks={d['n_chunks']})")


## Where we are

You now have three complementary material-side signals:

### 03 — Semantic novelty
- new meaning introduced over time (novelty curves)

### 04 — Surface redundancy
- repetition/predictability in form (gzip, entropy, templates)

### 05 — Contextual diversity
- variety of contexts for words and ideas (lexical + semantic dispersion)

## Next: unify into a "Boredom Report"

The next step is a notebook or script that joins:
- `data/lsa/novelty_summary_per_doc_labeled.jsonl`
- `data/redundancy/redundancy_summary_per_doc.jsonl`
- `data/diversity/doc_contextual_diversity_summary.jsonl`

...into one table per document with:
- semantic bandwidth score (novelty)
- redundancy score (compression/entropy)
- context diversity score (semantic dispersion + word DF)

Then we can build the Streamlit report app.
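
A minimal sketch of that join, using only the standard library. It assumes every summary file has a `doc_id` field on each row; this is true for the diversity summary written above but is an assumption for the novelty and redundancy files. The function name `join_by_doc_id` and the field-prefix scheme are hypothetical:

```python
import json
from pathlib import Path
from typing import Any, Dict, List

def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSONL file, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def join_by_doc_id(sources: Dict[str, Path]) -> Dict[str, Dict[str, Any]]:
    """Merge rows that share a doc_id, prefixing each field with its source name."""
    table: Dict[str, Dict[str, Any]] = {}
    for prefix, path in sources.items():
        for row in read_jsonl(path):
            doc_id = row["doc_id"]  # assumed present in every summary file
            merged = table.setdefault(doc_id, {"doc_id": doc_id})
            for key, value in row.items():
                if key != "doc_id":
                    merged[f"{prefix}.{key}"] = value
    return table

# Paths from the list above, relative to the repo root.
SOURCES = {
    "novelty": Path("data/lsa/novelty_summary_per_doc_labeled.jsonl"),
    "redundancy": Path("data/redundancy/redundancy_summary_per_doc.jsonl"),
    "diversity": Path("data/diversity/doc_contextual_diversity_summary.jsonl"),
}
# report = join_by_doc_id(SOURCES)  # run once all three files exist
```

Prefixing keeps same-named fields (for example `title`) from different sources from overwriting one another; documents missing from one source simply lack that source's fields.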
