<a href="https://colab.research.google.com/github/mahb97/joyce-dubliners-similes-analysis/blob/main/04_nlp_validation_joyce.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# POS Tagging Validation on Joyce’s *Dubliners* with a BNC Comparator

This study evaluates contemporary POS taggers against expert CLAWS7 annotations for simile-bearing sentences in James Joyce’s *Dubliners*, and incorporates a 200-sentence sample from the British National Corpus (BNC) as a comparative reference. Gold labels are projected to Penn-style tags under a strict CLAWS→Penn mapping to ensure label comparability across tools.

## Research Objectives
- Quantify token-level performance of spaCy (sm, lg), Flair, and NLTK against CLAWS7 gold under **strict** CLAWS→Penn projection.
- Characterize systematic error patterns (e.g., UNK-induced losses; function-word and verb morphology confusions).
- Assess whether performance differences are **Joyce-specific** or reflect a broader ceiling induced by projection.

## Corpora
- **Dubliners (literary)**: 183 sentences with CLAWS7 gold.
- **BNC (reference)**: 200 sentences reconstructed as `Left + Node + Right` (with `Left` possibly empty) to form a single sentence string; CLAWS7 gold provided.
- **Total**: 383 sentences. All evaluations report pooled results, with per-dataset confidence intervals provided to disentangle corpus effects.

## Preprocessing and Alignment
- **Gold normalization**: CLAWS7 tags are projected to Penn under a **strict** policy; distinctions without Penn equivalents are mapped to `UNK`.  
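- **Per-tool alignment**: For each sentence and tool, comparisons are made over the **per-sentence minimum length** across (gold tokens, tool tokens, tool tags), so sequences of different lengths are compared only over their shared prefix. Truncation does not realign tokens: a tokenization difference early in a sentence shifts every later comparison and can lower measured accuracy.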
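
The alignment rule above can be illustrated with a minimal sketch. The helper names `align_by_min_length` and `token_accuracy` are hypothetical, the tag sequences are invented, and the sketch compares only two sequences (gold tags and predicted tags) rather than the three-way minimum used in the notebook.

```python
def align_by_min_length(gold_tags, pred_tags):
    """Truncate both sequences to the shorter length and pair them by position."""
    m = min(len(gold_tags), len(pred_tags))
    return list(zip(gold_tags[:m], pred_tags[:m]))

def token_accuracy(pairs):
    """Share of positions where gold and prediction agree (0.0 if there are no pairs)."""
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)

# Invented example: the tagger emits one extra token at position 2,
# so every later prediction is compared against the wrong gold tag.
gold = ["PRP", "VBD", "IN", "DT", "NN"]
pred = ["PRP", "VBD", "RB", "IN", "DT", "NN"]

pairs = align_by_min_length(gold, pred)
acc = token_accuracy(pairs)
print(f"compared positions: {len(pairs)}, accuracy: {acc:.2f}")
```

In this toy case only the first two positions agree, even though the tagger's labels for the remaining words are plausible. This is why length differences between tools matter when reading the accuracy figures.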

## Evaluation Protocol
- **Primary metrics**: Token accuracy; micro/macro/weighted precision, recall, and F1.  
- **Uncertainty & inference**: Wilson score intervals for single-proportion estimates; Newcombe (Wilson) intervals for differences in accuracy; McNemar’s tests with Holm–Bonferroni correction for paired token outcomes; sentence-cluster bootstrap CIs for accuracy differences.
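
As a rough sketch of the paired-testing step, the following shows an exact (binomial) McNemar test on paired token outcomes and a Holm–Bonferroni adjustment across several comparisons. The function names and the toy inputs are assumptions for illustration; the notebook's own implementation may differ, for example by using a chi-square approximation.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired 0/1 outcomes.

    Returns (b, c, p_value), where b counts tokens only tool A got right
    and c counts tokens only tool B got right.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)

def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Invented paired outcomes for two tools over the same six tokens.
tool_a = [1, 1, 1, 1, 0, 0]
tool_b = [1, 0, 0, 0, 0, 1]
b, c, p = mcnemar_exact(tool_a, tool_b)
print(b, c, p)
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

The test assumes both tools are scored on the same token positions, so it only applies after the sequences have been brought to a common length.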

## Role of the BNC Comparator
The BNC sample functions as a **non-Joycean reference**. If tools perform markedly better on BNC than on *Dubliners*, this supports a **Joyce-specific difficulty** beyond the information loss induced by strict projection. Conversely, comparable ceilings across both corpora indicate that the binding constraint is the **projection itself**, rather than literary idiosyncrasy alone.
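
One way to make the BNC versus *Dubliners* comparison concrete is a Newcombe hybrid-score interval for the difference between two accuracies, built from the two Wilson intervals. The sketch below is an assumption about how this could be computed, not the notebook's own code; the counts are illustrative and are not results from this study. It also treats tokens as independent, which is only approximately true within sentences.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for one proportion; returns (p, lower, upper)."""
    if trials == 0:
        return 0.0, 0.0, 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt((p * (1 - p) + z**2 / (4 * trials)) / trials) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)

def newcombe_diff_interval(x1, n1, x2, n2, confidence=0.95):
    """Newcombe hybrid-score interval for p1 - p2 (independent samples)."""
    p1, l1, u1 = wilson_interval(x1, n1, confidence)
    p2, l2, u2 = wilson_interval(x2, n2, confidence)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper

# Illustrative counts only (hypothetical, not taken from the evaluation).
bnc_correct, bnc_tokens = 800, 1000
dub_correct, dub_tokens = 780, 1000
diff, lo, hi = newcombe_diff_interval(bnc_correct, bnc_tokens, dub_correct, dub_tokens)
print(f"difference = {diff:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

An interval that excludes zero would point toward a corpus-specific effect; an interval that straddles zero is consistent with a shared ceiling imposed by the projection.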


In [1]:
# ==============================================================================
# COMPLETE SETUP AND INSTALLATION (NLTK yes, TextBlob no)
# ==============================================================================

print("Installing required packages...")
# -q to keep output tidy
!pip install -q spacy nltk flair scikit-learn plotly seaborn

print("\nDownloading spaCy models...")
# Use python -m to ensure install into the current kernel env
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg

print("\nEnsuring NLTK POS tagger resource...")
import nltk

def _ensure_nltk_resource(path, downloader_name, verbose=False):
    try:
        nltk.data.find(path)
        return True
    except LookupError:
        try:
            nltk.download(downloader_name, quiet=not verbose)
            nltk.data.find(path)  # verify
            return True
        except Exception:
            return False

# Ensure POS tagger (try *_eng then classic)
tagger_ok = _ensure_nltk_resource('taggers/averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger_eng') or \
            _ensure_nltk_resource('taggers/averaged_perceptron_tagger',     'averaged_perceptron_tagger')

print(f"NLTK averaged_perceptron_tagger available: {'✓' if tagger_ok else '✗'}")
print("\nAll installations attempted. Proceeding to imports...")

# ==============================================================================
# IMPORTS AND SETUP
# ==============================================================================

import pandas as pd
import numpy as np
import json
import time
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# NLP libraries
import spacy
from nltk.tokenize import TreebankWordTokenizer  # avoids punkt/punkt_tab
from nltk.tag import pos_tag

# Optional: Flair (enabled if import + model load succeed)
FLAIR_AVAILABLE = False
try:
    from flair.data import Sentence
    from flair.models import SequenceTagger
    FLAIR_AVAILABLE = True
    print("✓ Flair imported successfully")
except Exception as e:
    print(f"✗ Flair not available: {e}")

# Analysis libraries
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from collections import defaultdict, Counter
import scipy.stats as stats
from math import sqrt

# Visualization libraries
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# ==============================================================================
# Load NLP models
# ==============================================================================

print("\nLoading spaCy models...")
nlp_sm = None
nlp_lg = None
try:
    nlp_sm = spacy.load("en_core_web_sm")
    print("✓ en_core_web_sm loaded")
except Exception as e:
    print(f"✗ Failed to load en_core_web_sm: {e}")

try:
    nlp_lg = spacy.load("en_core_web_lg")
    print("✓ en_core_web_lg loaded")
except Exception as e:
    print(f"✗ Failed to load en_core_web_lg: {e}")

# Load Flair model if available
flair_tagger = None
if FLAIR_AVAILABLE:
    try:
        print("Loading Flair POS tagger (this may take a moment)...")
        flair_tagger = SequenceTagger.load('pos')
        print("✓ Flair model loaded successfully")
    except Exception as e:
        print(f"✗ Failed to load Flair model: {e}")
        flair_tagger = None
        FLAIR_AVAILABLE = False

# Final setup summary
print("\nSetup Summary:")
print(f"spaCy sm: {'✓ Available' if nlp_sm is not None else '✗ Not available'}")
print(f"spaCy lg: {'✓ Available' if nlp_lg is not None else '✗ Not available'}")
print(f"NLTK averaged_perceptron_tagger: {'✓' if tagger_ok else '✗'} (tokenizer: TreebankWordTokenizer)")
print(f"Flair: {'✓ Available' if FLAIR_AVAILABLE and flair_tagger is not None else '✗ Not available'}")
print("\nReady to proceed with analysis!")


Installing required packages...

2025-08-28 10:16:32,077 SequenceTagger predicts: Dictionary with 53 tags: <unk>, O, UH, ,, VBD, PRP, VB, PRP$, NN, RB, ., DT, JJ, VBP, VBG, IN, CD, NNS, NNP, WRB, VBZ, WDT, CC, TO, MD, VBN, WP, :, RP, EX, JJR, FW, XX, HYPH, POS, RBR, JJS, PDT, NNPS, RBS, AFX, WP$, -LRB-, -RRB-, ``, '', LS, $, SYM, ADD
✓ Flair model loaded successfully

Setup Summary:
spaCy sm: ✓ Available
spaCy lg: ✓ Available
NLTK averaged_perceptron_tagger: ✓ (tokenizer: TreebankWordTokenizer)
Flair: ✓ Available

Ready to proceed with analysis!


In [3]:
# ==============================================================================
# CLEAN SETUP: CLAWS7 → Penn (STRICT), AUDIT + EVAL
# ==============================================================================

import sys, re, time, warnings, json
warnings.filterwarnings('ignore')
from collections import defaultdict, Counter
from math import sqrt
from datetime import datetime

import numpy as np
import pandas as pd
import scipy.stats as stats

print("Imports loaded.")

# ------------------------------------------------------------------------------
# NLP libraries (optional ones guarded)
# ------------------------------------------------------------------------------
import spacy
from nltk.tokenize import TreebankWordTokenizer
from nltk.tag import pos_tag

# Flair is optional
try:
    from flair.data import Sentence
    from flair.models import SequenceTagger
    FLAIR_AVAILABLE = True
except Exception:
    FLAIR_AVAILABLE = False

print("Loading spaCy models...")
try:
    nlp_sm = spacy.load("en_core_web_sm")
except Exception as e:
    print("spaCy sm load failed:", e)
    nlp_sm = None

try:
    nlp_lg = spacy.load("en_core_web_lg")
except Exception as e:
    print("spaCy lg load failed:", e)
    nlp_lg = None

flair_tagger = None
if FLAIR_AVAILABLE:
    try:
        print("Loading Flair POS tagger...")
        flair_tagger = SequenceTagger.load('pos')
    except Exception as e:
        print("Flair load failed:", e)
        FLAIR_AVAILABLE = False
        flair_tagger = None

print("\nSetup Summary:")
print(f"  spaCy sm: {'✓' if nlp_sm else '✗'} | spaCy lg: {'✓' if nlp_lg else '✗'}")
print(f"  NLTK: ✓ (TreebankWordTokenizer + averaged_perceptron_tagger)")
print(f"  Flair: {'✓' if FLAIR_AVAILABLE and flair_tagger else '✗'}")
sys.stdout.flush()

# ------------------------------------------------------------------------------
# CLAWS7 parsing (from 'word_TAG' sequences)
# ------------------------------------------------------------------------------
def parse_claws_tags(claws_string):
    """Parse CLAWS7 format: 'word_TAG word_TAG ...' -> (tokens, tags)"""
    if pd.isna(claws_string) or not str(claws_string).strip():
        return [], []
    tokens, tags = [], []
    for item in str(claws_string).strip().split():
        if '_' in item:
            word, tag = item.rsplit('_', 1)
            tokens.append(word)
            tags.append(tag)
    return tokens, tags

# ------------------------------------------------------------------------------
# CLAWS7 → Penn mapping (closest base mapping) + STRICT UNTRANSLATABLE
# ------------------------------------------------------------------------------
_CLAWS_TO_PENN_BASE = {
    # Determiners / wh-dets
    'AT':'DT','AT1':'DT','DD':'DT','DDQ':'WDT','DDQGE':'WP$','DDQV':'WDT',
    'DA':'DT','DA1':'DT','DA2':'DT','DAR':'JJR','DAT':'JJS',
    'DB':'DT','DB2':'DT','DD1':'DT','DD2':'DT',

    # Coords / subords / clause markers
    'CC':'CC','CCB':'CC','CS':'IN','CSA':'IN','CSN':'IN','CST':'IN','CSW':'IN','BCL':'IN',

    # Prepositions (base)
    'II':'IN','IF':'IN','IO':'IN','IW':'IN',

    # Adjectives
    'JJ':'JJ','JJR':'JJR','JJT':'JJS','JK':'JJ',

    # Adverbs
    'RR':'RB','RRQ':'WRB','RRQV':'WRB','RGR':'RBR','RRT':'RBS','RG':'RB','RGQ':'WRB','RGQV':'WRB',
    'REX':'RB','RL':'RB','RP':'RP','RPK':'RP','RA':'RB','RT':'RB',

    # Nouns (basic)
    'NN':'NN','NN1':'NN','NN2':'NNS','NNO':'NN','NNO2':'NNS',
    'NP':'NNP','NP1':'NNP','NP2':'NNPS',

    # Proper noun semantic subclasses (months/weekdays)
    'NPM1':'NNP','NPM2':'NNPS','NPD1':'NNP','NPD2':'NNPS',

    # Semantic noun subclasses (mapped loosely; strict will UNK them)
    'ND1':'NN','NNL1':'NN','NNL2':'NNS','NNT1':'NN','NNT2':'NNS',
    'NNU':'NN','NNU1':'NN','NNU2':'NNS','NNA':'NN','NNB':'NN',

    # Numerals
    'MC':'CD','MC1':'CD','MC2':'CD','MCGE':'CD','MCMC':'CD','MF':'CD','MD':'JJ',

    # Pronouns (closest Penn)
    'APPGE':'PRP$','PPGE':'PRP$','PPH1':'PRP','PPHO1':'PRP','PPHO2':'PRP',
    'PPHS1':'PRP','PPHS2':'PRP','PPIO1':'PRP','PPIO2':'PRP',
    'PPIS1':'PRP','PPIS2':'PRP','PPX1':'PRP','PPX2':'PRP','PPY':'PRP',
    'PN':'PRP','PN1':'PRP','PNQO':'WP','PNQS':'WP','PNQV':'WP',

    # Verbs: lexical (good matches)
    'VV0':'VB','VVD':'VBD','VVG':'VBG','VVI':'VB','VVN':'VBN','VVZ':'VBZ','VVGK':'VBG','VVNK':'VBN',

    # Verbs: do/have/be (Penn loses aux identity; strict will UNK these)
    'VD0':'VB','VDD':'VBD','VDG':'VBG','VDI':'VB','VDN':'VBN','VDZ':'VBZ',
    'VH0':'VB','VHD':'VBD','VHG':'VBG','VHI':'VB','VHN':'VBN','VHZ':'VBZ',
    'VB0':'VB','VBDR':'VBD','VBDZ':'VBD','VBG':'VBG','VBI':'VB','VBM':'VBP','VBN':'VBN','VBR':'VBP','VBZ':'VBZ',

    # Modals
    'VM':'MD','VMK':'MD',

    # Other function words
    'TO':'TO','UH':'UH','EX':'EX','GE':'POS','XX':'RB',

    # Foreign/formula/unclassified/letters
    'FW':'FW','FO':'FW','FU':'FW','ZZ1':'NN','ZZ2':'NNS',

    # --- Punctuation / brackets / quotes (reduce UNK→. / , noise) ---
    '.':'.', ',':',', ':':':', ';':';', '!':'.', '?':'.',
    '(': '-LRB-', ')':'-RRB-',
    '[':'-LSB-', ']':'-RSB-',
    '{':'-LCB-', '}':'-RCB-',
    '``':'``', "''":"''", '"':"''", "'":"''",
}

# Ditto tags like II31 → II. CLAWS ditto suffixes are always two digits,
# so NN121 must strip to NN1 (not NN).
_DITTO_RE = re.compile(r'^(.+?)(\d{2})$')
def _strip_ditto(tag: str) -> str:
    m = _DITTO_RE.match(tag or "")
    return m.group(1) if m else (tag or "")

# Tags whose CLAWS distinctions Penn CANNOT encode → mark UNK in STRICT mode
# (Explicitly include APPGE to document possessive pronoun pre-nominal collapse.)
_STRICT_UNTRANSLATABLE = set([
    # Articles / number-specific determiners
    'AT','AT1','DD1','DD2','DA','DA1','DA2','DAR','DAT','DB','DB2','DDQGE',
    # Preposition subtypes
    'IF','IO','IW',
    # Semantic noun subclasses (temporal/locative/unit/direction/title/weekday/month)
    'ND1','NNL1','NNL2','NNT1','NNT2','NNU','NNU1','NNU2','NNA','NNB',
    'NPM1','NPM2','NPD1','NPD2',
    # Person/number/case-marked pronouns (Penn POS lacks these features)
    'APPGE','PPH1','PPHO1','PPHO2','PPHS1','PPHS2','PPIO1','PPIO2','PPIS1','PPIS2','PPX1','PPX2','PPY','PN1','PN',
    # Auxiliary identity / catenatives (Penn POS doesn't encode identity/catenative)
    'VBM','VBR','VBZ','VBDR','VBDZ','VBG','VBN','VBI','VB0',
    'VD0','VDD','VDG','VDI','VDN','VDZ',
    'VH0','VHD','VHG','VHI','VHN','VHZ',
    'VVGK','VVNK','VMK','RPK','JK',
    # Appositional adv marker / formula / unclassified
    'REX','FO','FU',
])

def convert_claws_to_penn(tag: str, strict: bool = True) -> str:
    """CLAWS7 → Penn. STRICT=True marks Penn-uncapturable distinctions as UNK."""
    if not tag or not isinstance(tag, str):
        return 'UNK'
    base = _strip_ditto(tag)
    if strict and base in _STRICT_UNTRANSLATABLE:
        return 'UNK'
    if base in _CLAWS_TO_PENN_BASE:
        return _CLAWS_TO_PENN_BASE[base]

    # Generic fallbacks (rare); number is read from the ditto-stripped base tag
    if base.startswith('NN'):
        return 'UNK' if strict else ('NNS' if base.endswith('2') else 'NN')
    if base.startswith('NP'):
        return 'UNK' if strict else ('NNPS' if base.endswith('2') else 'NNP')
    if base.startswith('VV'):
        if strict:
            return 'UNK'
        if base in ('VV0','VVI'): return 'VB'
        if base == 'VVD': return 'VBD'
        if base == 'VVG': return 'VBG'
        if base == 'VVN': return 'VBN'
        if base == 'VVZ': return 'VBZ'
    if base in {'CS','CSA','CSN','CST','CSW','BCL'}:
        return 'IN' if not strict else ('IN' if base == 'CS' else 'UNK')
    if base in {'CC','CCB'}:
        return 'CC'
    return 'UNK'

# ------------------------------------------------------------------------------
# Diagnostics for unmappable tags
# ------------------------------------------------------------------------------
def analyze_unmappable_tags(processed_items, strict: bool = True, top_n: int = 30):
    counts = Counter(); examples = {}
    total = 0; unk_total = 0
    for row in processed_items:
        claws_tags = row.get('claws_tags', [])
        sent = row.get('sentence', '')
        for t in claws_tags:
            total += 1
            base = _strip_ditto(t)
            penn = convert_claws_to_penn(base, strict=strict)
            if penn == 'UNK':
                unk_total += 1
                counts[base] += 1
                if base not in examples:
                    examples[base] = sent[:120] + ('...' if len(sent) > 120 else '')
    print("="*70)
    print(f"CLAWS7 → Penn STRICT={strict}  | Total tags: {total:,} | UNK: {unk_total:,} ({(unk_total/total*100 if total else 0):.1f}%)")
    print("- Top UNTRANSLATABLE CLAWS tags -")
    for tag, c in counts.most_common(top_n):
        print(f"{tag:8s} : {c:6d}   e.g. {examples[tag]}")
    print("="*70)
    return {
        'total_tags': total,
        'unk_tags': unk_total,
        'unk_rate': (unk_total/total if total else 0.0),
        'counts': counts,
        'examples': examples
    }

# ------------------------------------------------------------------------------
# Taggers returning Penn tags directly (for fair comparison against STRICT-projected GT)
# ------------------------------------------------------------------------------
def tag_with_spacy(sentence, model='sm'):
    nlp_model = nlp_sm if model == 'sm' else nlp_lg
    if nlp_model is None:
        return []
    doc = nlp_model(sentence)
    return [(t.text, t.tag_) for t in doc]  # spaCy Penn-style tags

def tag_with_flair(sentence):
    if not (FLAIR_AVAILABLE and flair_tagger):
        return []
    s = Sentence(sentence); flair_tagger.predict(s)
    return [(t.text, t.tag) for t in s]  # Penn-style tags

_tb_tok = TreebankWordTokenizer()
def tag_with_nltk(sentence):
    try:
        tokens = _tb_tok.tokenize(sentence)
    except Exception:
        tokens = sentence.split()
    pos_tags = pos_tag(tokens)  # averaged_perceptron_tagger(_eng) ensured in setup cell
    return [(w, tag) for w, tag in pos_tags]  # Penn-style tags

# ------------------------------------------------------------------------------
# Accuracy + stats (STRICT projection on GT)
# ------------------------------------------------------------------------------
def calculate_accuracy_fair(ground_truth_claws, predicted_penn, tokens, strict=True):
    """Compute token-level accuracy after STRICT projecting CLAWS gold to Penn."""
    if not ground_truth_claws or not predicted_penn or not tokens:
        return 0.0
    m = min(len(ground_truth_claws), len(predicted_penn), len(tokens))
    if m == 0:
        return 0.0
    gt_penn = [convert_claws_to_penn(t, strict=strict) for t in ground_truth_claws[:m]]
    pred = predicted_penn[:m]
    return sum(1 for i in range(m) if gt_penn[i] == pred[i]) / m

def wilson_confidence_interval(successes, trials, confidence=0.95):
    if trials == 0:
        return 0, 0, 0
    p = successes / trials
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt((p * (1 - p) + z**2 / (4 * trials)) / trials) / denom
    return p, max(0, centre - half), min(1, centre + half)

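def proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for two independent proportions; returns (z, p_value)."""
    if n1 == 0 or n2 == 0:
        return 0.0, 1.0  # no data in one group: report no evidence of a difference
    p1, p2 = x1/n1, x2/n2
    p_pool = (x1+x2)/(n1+n2)
    se = sqrt(p_pool*(1-p_pool)*(1/n1+1/n2))
    if se == 0:
        return 0.0, 1.0  # pooled proportion is 0 or 1, so both groups agree exactly
    z = (p1-p2)/se
    p_value = 2*(1 - stats.norm.cdf(abs(z)))
    return z, p_value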

def cohens_h(p1, p2):
    return 2*(np.arcsin(sqrt(p1)) - np.arcsin(sqrt(p2)))

print("Setup cell ready: mapping, taggers, and metrics defined.")


Imports loaded.
Loading spaCy models...
Loading Flair POS tagger...
2025-08-28 10:24:08,405 SequenceTagger predicts: Dictionary with 53 tags: <unk>, O, UH, ,, VBD, PRP, VB, PRP$, NN, RB, ., DT, JJ, VBP, VBG, IN, CD, NNS, NNP, WRB, VBZ, WDT, CC, TO, MD, VBN, WP, :, RP, EX, JJR, FW, XX, HYPH, POS, RBR, JJS, PDT, NNPS, RBS, AFX, WP$, -LRB-, -RRB-, ``, '', LS, $, SYM, ADD

Setup Summary:
  spaCy sm: ✓ | spaCy lg: ✓
  NLTK: ✓ (TreebankWordTokenizer + averaged_perceptron_tagger)
  Flair: ✓
Setup cell ready: mapping, taggers, and metrics defined.
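
The effect of the strict policy on attainable accuracy can be sketched with a toy example. Because the taggers emit Penn tags and never the label `UNK`, every gold token projected to `UNK` counts as an error, so accuracy is bounded by one minus the UNK rate. The miniature tables `TOY_BASE` and `TOY_STRICT_UNK`, the helper names, and the tagged sentence are all hypothetical stand-ins for the full mapping defined in the cell above.

```python
# Hypothetical miniature of the CLAWS7 -> Penn tables (illustration only).
TOY_BASE = {"AT": "DT", "NN1": "NN", "VVD": "VBD", "II": "IN", "PPHS1": "PRP"}
TOY_STRICT_UNK = {"AT", "PPHS1"}

def parse_tagged(claws_string):
    """Split 'word_TAG word_TAG ...' into parallel token and tag lists."""
    pairs = [item.rsplit("_", 1) for item in claws_string.split() if "_" in item]
    return [w for w, _ in pairs], [t for _, t in pairs]

def project(tag, strict=True):
    """Project one CLAWS tag to Penn; strict mode maps listed tags to UNK."""
    if strict and tag in TOY_STRICT_UNK:
        return "UNK"
    return TOY_BASE.get(tag, "UNK")

# Invented sentence in the 'word_TAG' format.
tokens, tags = parse_tagged("He_PPHS1 walked_VVD like_II the_AT wind_NN1")
strict_gold = [project(t, strict=True) for t in tags]
loose_gold = [project(t, strict=False) for t in tags]

unk_rate = strict_gold.count("UNK") / len(strict_gold)
ceiling = 1 - unk_rate  # best accuracy any Penn tagger can reach under strict gold
print(strict_gold, loose_gold, f"ceiling={ceiling:.2f}")
```

Comparing the UNK rate per corpus in the audit output is therefore a direct way to see how much of any accuracy gap the projection alone could explain.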


In [5]:
# ------------------------------------------------------------------------------
# Data upload + processing (handles BNC KWIC: Left/Node/Right + CLAWS)
# ------------------------------------------------------------------------------
from google.colab import files

print("\nUpload your Dubliners CSV file:")
dubliners_uploaded = files.upload()
print("\nUpload your BNC CSV file:")
bnc_uploaded = files.upload()

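def _read_csv_any_encoding(filename):
    """Read a CSV as strings, trying UTF-8 first and legacy encodings after."""
    df = None
    # Order matters: latin1 accepts any byte sequence, so it must come last;
    # trying cp1252 before UTF-8 can silently mis-decode UTF-8 files.
    for enc in ('utf-8-sig', 'cp1252', 'latin1'):
        try:
            df = pd.read_csv(filename, encoding=enc, dtype=str)
            break
        except UnicodeDecodeError:
            continue
    if df is None:
        raise ValueError(f"Failed to read {filename} with tried encodings.")
    return df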

def _find_col(df, candidates):
    """Return first matching column (case-insensitive) or None."""
    lower_map = {c.lower(): c for c in df.columns}
    for cand in candidates:
        if cand in lower_map:
            return lower_map[cand]
    # also allow substring match like 'left context'
    for c in df.columns:
        lc = c.lower()
        if any(cand in lc for cand in candidates):
            return c
    return None

def load_dataset_with_kwic(filename, dataset_name, preview=3):
    df = _read_csv_any_encoding(filename)
    print(f"{dataset_name}: Loaded {len(df)} rows with columns: {list(df.columns)}")
    try:
        print(df.head(preview).to_string(index=False))
    except Exception:
        pass

    # Detects CLAWS tag string column (any 'claws' match)
    claws_col = None
    for c in df.columns:
        if 'claws' in c.lower():
            claws_col = c
            break

    # Detects KWIC columns
    left_col = _find_col(df, ['left'])
    node_col = _find_col(df, ['node'])
    right_col = _find_col(df, ['right'])

    # Detects sentence/context text columns (for non-KWIC datasets)
    sentence_col = _find_col(df, ['sentence','context','text','sent','snippet','content','line','string'])

    processed = []

    if claws_col and left_col and node_col and right_col:
        print(f"✓ Detected KWIC columns: {left_col} | {node_col} | {right_col} and CLAWS column: {claws_col}")
        subset = df[[left_col, node_col, right_col, claws_col]].dropna(subset=[claws_col])
        for _, row in subset.iterrows():
            # Reconstructs sentence text from KWIC
            L = str(row[left_col]).strip() if pd.notna(row[left_col]) else ""
            N = str(row[node_col]).strip() if pd.notna(row[node_col]) else ""
            R = str(row[right_col]).strip() if pd.notna(row[right_col]) else ""
            sent_text = " ".join([s for s in [L, N, R] if s]).strip()
            # Parse CLAWS 'word_TAG' stream
            tokens, tags = parse_claws_tags(row[claws_col])
            if tokens and tags and len(tokens) == len(tags):
                processed.append({
                    'sentence': sent_text if sent_text else " ".join(tokens),
                    'tokens': tokens,
                    'claws_tags': tags,
                    'dataset': dataset_name
                })
        print(f"{dataset_name}: Processed {len(processed)} valid KWIC rows.")
        return processed

    # Fallback: sentence + CLAWS on the same row (non-KWIC)
    if claws_col and sentence_col:
        print(f"✓ Using sentence column: {sentence_col} and CLAWS column: {claws_col}")
        subset = df[[sentence_col, claws_col]].dropna(subset=[claws_col])
        for _, row in subset.iterrows():
            tokens, tags = parse_claws_tags(row[claws_col])
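            if tokens and tags and len(tokens) == len(tags):
                # Guard against empty text cells (NaN): fall back to the CLAWS tokens
                raw = row[sentence_col]
                sent_text = str(raw).strip() if pd.notna(raw) else ""
                processed.append({
                    'sentence': sent_text if sent_text else " ".join(tokens),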
                    'tokens': tokens,
                    'claws_tags': tags,
                    'dataset': dataset_name
                })
        print(f"{dataset_name}: Processed {len(processed)} valid sentences.")
        return processed

    # Last fallback: CLAWS only → reconstruct sentence from tokens
    if claws_col:
        print(f"⚠ No text column found; reconstructing sentence from CLAWS tokens. Using: {claws_col}")
        subset = df[[claws_col]].dropna(subset=[claws_col])
        for _, row in subset.iterrows():
            tokens, tags = parse_claws_tags(row[claws_col])
            if tokens and tags and len(tokens) == len(tags):
                processed.append({
                    'sentence': " ".join(tokens),
                    'tokens': tokens,
                    'claws_tags': tags,
                    'dataset': dataset_name
                })
        print(f"{dataset_name}: Processed {len(processed)} reconstructed sentences.")
        return processed

    print(f"✗ Could not locate CLAWS or usable sentence/KWIC columns in {dataset_name}.")
    return []

# ------------------------------------------------------------------------------
# Load both datasets now
# ------------------------------------------------------------------------------
dubliners_filename = list(dubliners_uploaded.keys())[0]
bnc_filename       = list(bnc_uploaded.keys())[0]

dubliners_data = load_dataset_with_kwic(dubliners_filename, "Dubliners")
bnc_data       = load_dataset_with_kwic(bnc_filename, "BNC")

all_processed_data = dubliners_data + bnc_data

print(f"\nTotal processed sentences: {len(all_processed_data)}")
print(f"Dubliners: {len(dubliners_data)} | BNC: {len(bnc_data)}")
sys.stdout.flush()

# ------------------------------------------------------------------------------
# STRICT unmappable audit
# ------------------------------------------------------------------------------
print("\nRunning STRICT unmappable audit (CLAWS distinctions Penn can't encode)...")
strict_report = analyze_unmappable_tags(all_processed_data, strict=True, top_n=40)

# ------------------------------------------------------------------------------
# Batch POS tagging across tools (Penn output) + STRICT accuracy against GT
# ------------------------------------------------------------------------------
def process_sentence_with_all_tools(sentence):
    tools = {
        'spacy_sm': (lambda s: tag_with_spacy(s, 'sm')),
        'spacy_lg': (lambda s: tag_with_spacy(s, 'lg')),
        'nltk':      tag_with_nltk,
    }
    if FLAIR_AVAILABLE and flair_tagger:
        tools['flair'] = tag_with_flair

    results = {}
    for name, fn in tools.items():
        try:
            t0 = time.time()
            tagged = fn(sentence)  # list of (token, PennTag)
            dt = time.time() - t0
            results[name] = {
                'tags':   [tag for _, tag in tagged],   # Penn
                'tokens': [tok for tok, _ in tagged],
                'processing_time': dt
            }
        except Exception as e:
            results[name] = {'error': str(e)}
    return results

print("\nBatch tagging sentences with available tools...")
expanded_batch_results = []
for i, data in enumerate(all_processed_data):
    if i % 10 == 0:
        print(f"  Progress: {i}/{len(all_processed_data)}")
        sys.stdout.flush()
    tool_results = process_sentence_with_all_tools(data['sentence'])
    expanded_batch_results.append({
        'sentence': data['sentence'],
        'ground_truth': data['claws_tags'],  # CLAWS7 tags
        'dataset': data['dataset'],
        'tool_results': tool_results
    })
print("Batch tagging complete.")

# ------------------------------------------------------------------------------
# Post-batch sanity summary (coverage + alignment health)
# ------------------------------------------------------------------------------
tool_names = set()
for r in expanded_batch_results:
    tool_names.update(r['tool_results'].keys())
tool_names = sorted(tool_names)

coverage = {t: 0 for t in tool_names}
mean_len = {t: [] for t in tool_names}
zero_min = {t: 0 for t in tool_names}  # count of cases where min alignment length was 0

for r in expanded_batch_results:
    gt = r['ground_truth']
    for t in tool_names:
        tri = r['tool_results'].get(t, {})
        if tri and 'error' not in tri:
            coverage[t] += 1
            pred = tri['tags']
            toks = tri['tokens']
            mean_len[t].append(len(pred))
            m = min(len(gt), len(pred), len(toks))
            if m == 0:
                zero_min[t] += 1

print("\nTagger coverage summary:")
for t in tool_names:
    n = coverage[t]
    ml = (np.mean(mean_len[t]) if mean_len[t] else 0)
    zm = zero_min[t]
    print(f"  {t:10s} : sentences={n:4d} | mean_pred_len={ml:5.1f} | zero_min_align={zm}")



Upload your Dubliners CSV file:


Saving All Similes - Dubliners cont.csv to All Similes - Dubliners cont (1).csv

Upload your BNC CSV file:


Saving concordance from BNC.csv to concordance from BNC (1).csv
Dubliners: Loaded 184 rows with columns: ['ID', 'Story', 'Page No.', 'Sentence Context', 'Comparator Type ', 'Category (Framwrok)', 'Additional Notes', 'CLAWS']

In [7]:
# ==============================================================================
# Tool coverage + alignment audit (with optional include filter and per-dataset view)
# ==============================================================================

from collections import defaultdict, Counter

def audit_tool_coverage(results, include_tools=None):
    tools_seen = set()
    cov_sentences = Counter()      # overall sentences with non-error result
    had_error = Counter()          # overall sentences with 'error'
    zero_min = Counter()           # sentences where min alignment length == 0
    pred_len_sum = Counter()
    pred_len_cnt = Counter()
    sample_errors = defaultdict(list)

    # Per-dataset breakdown
    cov_by_ds = defaultdict(Counter)   # dataset -> tool -> count
    err_by_ds = defaultdict(Counter)   # dataset -> tool -> count

    for res in results:
        gt = res.get('ground_truth', [])
        ds = res.get('dataset', 'UNKNOWN')
        tri_map = res.get('tool_results', {})

        tools_seen.update(tri_map.keys())
        for tool, tri in tri_map.items():
            if include_tools and tool not in include_tools:
                continue
            if not isinstance(tri, dict):
                continue

            if 'error' in tri:
                had_error[tool] += 1
                err_by_ds[ds][tool] += 1
                if len(sample_errors[tool]) < 3:
                    sample_errors[tool].append(tri.get('error'))
                continue

            preds = tri.get('tags') or []
            toks  = tri.get('tokens') or []
            if preds:
                cov_sentences[tool] += 1
                cov_by_ds[ds][tool] += 1
                pred_len_sum[tool] += len(preds)
                pred_len_cnt[tool] += 1

            m = min(len(gt), len(preds), len(toks))
            if m == 0:
                zero_min[tool] += 1

    print("=== Tool coverage & alignment audit ===")
    shown_tools = sorted(t for t in tools_seen if (not include_tools or t in include_tools))
    for tool in shown_tools:
        n_sent = cov_sentences[tool]
        n_err  = had_error[tool]
        zm     = zero_min[tool]
        avg_len = (pred_len_sum[tool]/pred_len_cnt[tool]) if pred_len_cnt[tool] else 0.0
        print(f"{tool:10s} | sentences_ok={n_sent:4d} | errors={n_err:3d} | zero_min={zm:3d} | mean_pred_len={avg_len:5.1f}")
        if sample_errors[tool]:
            for e in sample_errors[tool]:
                print(f"   ↪ error sample: {e}")

    # Per-dataset breakdown (always printed)
    print("\n--- Per-dataset coverage ---")
    for ds in sorted(cov_by_ds.keys() | err_by_ds.keys()):
        print(f"[{ds}]")
        for tool in shown_tools:
            ok = cov_by_ds[ds][tool]
            er = err_by_ds[ds][tool]
            print(f"  {tool:10s} ok={ok:4d} | errors={er:3d}")
    return {
        "tools_seen": shown_tools,
        "sentences_ok": dict(cov_sentences),
        "errors": dict(had_error),
        "zero_min": dict(zero_min),
        "per_dataset_ok": {ds: dict(cov_by_ds[ds]) for ds in cov_by_ds},
        "per_dataset_err": {ds: dict(err_by_ds[ds]) for ds in err_by_ds},
    }

# Example: restrict the audit to the four evaluated taggers; include_tools is an allow-list, so TextBlob is left out
_coverage = audit_tool_coverage(expanded_batch_results, include_tools={'spacy_sm','spacy_lg','nltk','flair'})


=== Tool coverage & alignment audit ===
flair      | sentences_ok= 383 | errors=  0 | zero_min=  0 | mean_pred_len= 26.3
nltk       | sentences_ok= 383 | errors=  0 | zero_min=  0 | mean_pred_len= 26.0
spacy_lg   | sentences_ok= 383 | errors=  0 | zero_min=  0 | mean_pred_len= 26.6
spacy_sm   | sentences_ok= 383 | errors=  0 | zero_min=  0 | mean_pred_len= 26.6

--- Per-dataset coverage ---
[BNC]
  flair      ok= 200 | errors=  0
  nltk       ok= 200 | errors=  0
  spacy_lg   ok= 200 | errors=  0
  spacy_sm   ok= 200 | errors=  0
[Dubliners]
  flair      ok= 183 | errors=  0
  nltk       ok= 183 | errors=  0
  spacy_lg   ok= 183 | errors=  0
  spacy_sm   ok= 183 | errors=  0


In [8]:
# ==============================================================================
# Minimal STRICT error analysis (TextBlob excluded)
# ==============================================================================

from collections import defaultdict, Counter

def run_min_error_analysis_all_tools(results, top_n=5):
    # discover all tool names first
    tools_seen = set()
    for r in results:
        tools_seen.update(r.get('tool_results', {}).keys())
    # exclude TextBlob
    tools_seen.discard('textblob')

    error_counts = {t: Counter() for t in tools_seen}  # include empty counters
    confusions_by_gold = {t: defaultdict(Counter) for t in tools_seen}
    totals = Counter()
    errors = Counter()

    for res in results:
        gt_claws = res.get('ground_truth', [])
        gt_penn_full = [convert_claws_to_penn(t, strict=True) for t in gt_claws]

        for tool in tools_seen:
            tri = res.get('tool_results', {}).get(tool, {})
            if not isinstance(tri, dict) or ('error' in tri):
                continue
            pred = tri.get('tags') or []
            toks = tri.get('tokens') or []
            m = min(len(gt_penn_full), len(pred), len(toks))
            if m <= 0:
                continue
            gt_penn = gt_penn_full[:m]
            for i in range(m):
                g, p = gt_penn[i], pred[i]
                totals[tool] += 1
                if g != p:
                    errors[tool] += 1
                    error_counts[tool][f"{g}->{p}"] += 1
                    confusions_by_gold[tool][g][p] += 1

    # Report
    print("Most common error patterns (STRICT):")
    print("="*60)
    for tool in sorted(tools_seen):
        err = errors.get(tool, 0)
        tot = totals.get(tool, 0)
        err_rate = (err / tot) if tot else 0.0
        print(f"\n{tool} (errors={err:,} / {tot:,} | error rate={err_rate:.3f}):")
        for pat, c in error_counts[tool].most_common(top_n):
            print(f"  {pat:12s} : {c}")
        if tot == 0:
            print("  (no aligned tokens; check coverage/zero_min above)")

    print("\nTop confusions by GOLD (STRICT):")
    print("="*60)
    default_order = ['UNK','NN','NNS','JJ','RB','IN','VB','VBD','VBG','VBN','VBZ','PRP','DT', ',', '.']
    for tool in sorted(tools_seen):
        cg = confusions_by_gold[tool]
        extras = [g for g in cg.keys() if g not in default_order]
        key_gold = default_order + sorted(extras)
        print(f"\n{tool}:")
        any_line = False
        for g in key_gold:
            if g in cg and cg[g]:
                top = ", ".join([f"{p}×{c}" for p, c in cg[g].most_common(3)])
                print(f"  {g:>4s} → {top}")
                any_line = True
        if not any_line:
            print("  (no confusions recorded)")

    return {
        "tools_seen": sorted(tools_seen),
        "error_counts": error_counts,
        "confusions_by_gold": confusions_by_gold,
        "totals": dict(totals),
        "errors": dict(errors)
    }

_min_err = run_min_error_analysis_all_tools(expanded_batch_results, top_n=5)


Most common error patterns (STRICT):

flair (errors=4,407 / 9,994 | error rate=0.441):
  UNK->DT      : 820
  UNK->PRP     : 777
  UNK->IN      : 355
  UNK->VBD     : 306
  UNK->PRP$    : 271

nltk (errors=4,667 / 9,941 | error rate=0.469):
  UNK->DT      : 818
  UNK->PRP     : 698
  UNK->IN      : 331
  UNK->VBD     : 310
  UNK->PRP$    : 288

spacy_lg (errors=4,695 / 9,994 | error rate=0.470):
  UNK->DT      : 785
  UNK->PRP     : 754
  UNK->IN      : 339
  UNK->VBD     : 304
  UNK->PRP$    : 264

spacy_sm (errors=4,723 / 9,994 | error rate=0.473):
  UNK->DT      : 783
  UNK->PRP     : 754
  UNK->IN      : 342
  UNK->VBD     : 308
  UNK->PRP$    : 265

Top confusions by GOLD (STRICT):

flair:
   UNK → DT×820, PRP×777, IN×355
    NN → NNP×38, JJ×38, DT×31
   NNS → NN×12, JJ×5, IN×5
    JJ → RB×22, VBG×20, NN×18
    RB → JJ×16, IN×12, PRP×11
    IN → RB×24, WRB×23, WDT×22
    VB → VBP×47, MD×15, NN×11
   VBD → PRP×11, VBN×6, NN×5
   VBG → NN×5, ,×3, JJ×2
   VBN → VBD×29, JJ×10, VB×4
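
In the listing above, every one of the five most frequent error patterns per tool has `UNK` as its gold label. Assuming the taggers only emit Penn tags and never the literal label `UNK` (an assumption, not something the cells verify), each UNK gold token is an unavoidable error, and the share of UNK tokens bounds the achievable STRICT accuracy. A minimal sketch of that bound, using an invented tag sequence and a hypothetical helper name:

```python
from collections import Counter

def strict_accuracy_ceiling(gold_penn, unk_label="UNK"):
    """Upper bound on token accuracy when no prediction can ever equal unk_label."""
    counts = Counter(gold_penn)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - counts[unk_label] / total

# Toy projected gold sequence (not real data): 3 of 8 tokens are UNK
toy_gold = ["UNK", "NN", "UNK", "VBD", "JJ", "NN", "UNK", "IN"]
ceiling = strict_accuracy_ceiling(toy_gold)
print(f"ceiling = {ceiling:.3f}")
```

Comparing each tool's observed accuracy with such a ceiling, computed per dataset, is one way to separate projection-induced losses from genuine tagging errors.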
  

In [9]:
# ==============================================================================
# Drilldown: Which CLAWS tags become UNK (STRICT) and what do tools predict?
# ==============================================================================

from collections import defaultdict, Counter

def unk_source_analysis(results, tool_names=None, top_n=12, strip_ditto=True, return_data=True):
    """
    For tokens where STRICT-projected gold == 'UNK', count:
      - aggregate: CLAWS source -> predicted Penn
      - per tool: tool -> CLAWS source -> predicted Penn
    Parameters
      results      : expanded_batch_results
      tool_names   : iterable of tool names to include (None = all discovered)
      top_n        : top CLAWS sources to print
      strip_ditto  : collapse ditto tags (e.g., II31 -> II)
      return_data  : return dicts with flat rows for export
    """
    per_tool = defaultdict(lambda: defaultdict(Counter))   # tool -> CLAWS -> Counter(pred)
    all_tools = defaultdict(Counter)                       # CLAWS -> Counter(pred)
    totals_by_source = Counter()                           # CLAWS -> total UNK obs (all tools)
    totals_by_tool = Counter()                             # tool -> total UNK obs
    discovered_tools = set()

    for res in results:
        gt_claws_full = res.get('ground_truth', [])
        gt_penn_full = [convert_claws_to_penn(t, strict=True) for t in gt_claws_full]
        tri_map = res.get('tool_results', {})
        discovered_tools.update(tri_map.keys())

        for tool, tri in tri_map.items():
            if tool_names and tool not in tool_names:
                continue
            if not isinstance(tri, dict) or ('error' in tri):
                continue
            pred = tri.get('tags') or []
            toks = tri.get('tokens') or []
            m = min(len(gt_claws_full), len(gt_penn_full), len(pred), len(toks))
            if m <= 0:
                continue

            for i in range(m):
                if gt_penn_full[i] == 'UNK':
                    claws_src = gt_claws_full[i]
                    if strip_ditto:
                        claws_src = _strip_ditto(claws_src)
                    per_tool[tool][claws_src][pred[i]] += 1
                    all_tools[claws_src][pred[i]] += 1
                    totals_by_source[claws_src] += 1
                    totals_by_tool[tool] += 1

    # Resolve default tool list
    if tool_names is None:
        tool_names = sorted(discovered_tools)

    # ---- Aggregate printout ---------------------------------------------------
    print("Top CLAWS sources of UNK (all tools combined):")
    print("="*70)
    total_unk = sum(totals_by_source.values())
    for claws_src, cnts in sorted(all_tools.items(),
                                  key=lambda x: sum(x[1].values()),
                                  reverse=True)[:top_n]:
        src_total = sum(cnts.values())
        share = (src_total / total_unk) if total_unk else 0.0
        top_preds = ", ".join(f"{p}×{c}" for p, c in cnts.most_common(3))
        print(f"{claws_src:8s}  total={src_total:4d}  ({share:5.1%})  →  {top_preds}")

    # ---- Per-tool printout ----------------------------------------------------
    for tool in tool_names:
        print(f"\n{tool}: Top CLAWS sources of UNK")
        print("-"*70)
        tool_map = per_tool.get(tool, {})
        tool_tot = sum(sum(c.values()) for c in tool_map.values())
        if tool_tot == 0:
            print("  (no UNK observations for this tool)")
            continue
        items = sorted(tool_map.items(),
                       key=lambda x: sum(x[1].values()),
                       reverse=True)[:top_n]
        for claws_src, cnts in items:
            src_total = sum(cnts.values())
            share = src_total / tool_tot if tool_tot else 0.0
            top_preds = ", ".join(f"{p}×{c}" for p, c in cnts.most_common(3))
            print(f"{claws_src:8s}  total={src_total:4d}  ({share:5.1%})  →  {top_preds}")

    if not return_data:
        return None

    # ---- Build flat rows for downstream export --------------------------------
    agg_rows = []
    for claws_src, cnts in all_tools.items():
        src_total = sum(cnts.values())
        for pred, c in cnts.items():
            agg_rows.append({
                'CLAWS_source': claws_src,
                'pred_penn': pred,
                'count': int(c),
                'source_total': int(src_total),
                'source_share_all': (src_total / total_unk) if total_unk else 0.0
            })

    per_tool_rows = []
    for tool, m in per_tool.items():
        tool_tot = sum(sum(c.values()) for c in m.values())
        for claws_src, cnts in m.items():
            src_total = sum(cnts.values())
            for pred, c in cnts.items():
                per_tool_rows.append({
                    'tool': tool,
                    'CLAWS_source': claws_src,
                    'pred_penn': pred,
                    'count': int(c),
                    'source_total_tool': int(src_total),
                    'source_share_tool': (src_total / tool_tot) if tool_tot else 0.0
                })

    return {
        'aggregate': agg_rows,
        'per_tool': per_tool_rows,
        'totals_by_source': dict(totals_by_source),
        'totals_by_tool': dict(totals_by_tool),
        'tools_seen': tool_names,
        'total_unk': int(total_unk)
    }

# Run — include nltk explicitly (or leave tool_names=None to include all seen tools)
_unk = unk_source_analysis(expanded_batch_results,
                           tool_names=['spacy_sm','spacy_lg','flair','nltk'],
                           top_n=12,
                           strip_ditto=True,
                           return_data=True)


Top CLAWS sources of UNK (all tools combined):
AT        total=1888  (15.0%)  →  DT×1658, IN×60, NN×52
AT1       total=1201  ( 9.5%)  →  DT×1066, IN×21, NN×20
PPHS1     total=1188  ( 9.4%)  →  PRP×1066, IN×18, ,×18
APPGE     total=1159  ( 9.2%)  →  PRP$×1039, NN×25, PRP×21
IO        total= 852  ( 6.8%)  →  IN×715, DT×48, NN×22
VBDZ      total= 588  ( 4.7%)  →  VBD×547, PRP×9, ,×8
PPH1      total= 484  ( 3.8%)  →  PRP×405, VB×16, IN×15
VHD       total= 428  ( 3.4%)  →  VBD×355, IN×20, MD×19
PPIS1     total= 420  ( 3.3%)  →  PRP×355, IN×10, .×9
PPHO1     total= 407  ( 3.2%)  →  PRP×329, PRP$×32, NN×8
DD1       total= 328  ( 2.6%)  →  DT×250, IN×32, PRP×8
IW        total= 240  ( 1.9%)  →  IN×221, CC×4, DT×3

spacy_sm: Top CLAWS sources of UNK
----------------------------------------------------------------------
AT        total= 472  (15.0%)  →  DT×404, IN×16, NN×15
AT1       total= 301  ( 9.6%)  →  DT×259, IN×7, JJ×4
PPHS1     total= 297  ( 9.4%)  →  PRP×269, ,×5, IN×5
APPGE     total= 2
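
The drilldown relies on a helper `_strip_ditto` that is defined elsewhere in the notebook and is not shown in this section. Going only by the docstring's description ("II31 -> II"), a helper of that kind might look like the sketch below. The name `strip_ditto_sketch` and the regular expression are assumptions; the sketch further assumes that no base CLAWS7 tag ends in two digits, so a trailing two-digit suffix always marks a ditto tag.

```python
import re

# A ditto suffix is two trailing digits: number of words in the unit, then position
_DITTO_SUFFIX = re.compile(r"\d{2}$")

def strip_ditto_sketch(tag):
    """Collapse a CLAWS ditto tag to its base tag; leave other tags unchanged."""
    return _DITTO_SUFFIX.sub("", tag)

for example in ["II31", "NN121", "AT1", "PPHS1", "APPGE"]:
    print(example, "->", strip_ditto_sketch(example))
```

Tags with a single trailing digit, such as `AT1` or `PPHS1`, pass through untouched, which keeps the CLAWS source counts in the listing above attributable to genuine base tags.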

In [10]:
# ==============================================================================
# Build performance_summary directly from expanded_batch_results
# ==============================================================================

import numpy as np  # needed for np.mean / np.std below

EXCLUDE_TOOLS = {'textblob'}  # ensure TextBlob is not considered

performance_summary = {}

for res in expanded_batch_results:
    gt_claws = res.get('ground_truth', [])
    # Strictly project CLAWS → Penn for gold
    gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt_claws]

    tri_map = res.get('tool_results', {})
    for tool, tri in tri_map.items():
        # drop excluded tools (e.g., textblob)
        if tool in EXCLUDE_TOOLS:
            continue
        # skip errors / malformed entries
        if not isinstance(tri, dict) or ('error' in tri):
            continue

        pred = tri.get('tags') or []
        toks = tri.get('tokens') or []
        # align by minimum length across gold, predictions, and tokens
        m = min(len(gt_penn), len(pred), len(toks))
        if m == 0:
            continue

        acc = sum(1 for i in range(m) if gt_penn[i] == pred[i]) / m

        # init bucket and record
        if tool not in performance_summary:
            performance_summary[tool] = {
                'accuracies': [],
                'total_sentences': 0,
                'perfect_sentences': 0
            }
        performance_summary[tool]['accuracies'].append(acc)
        performance_summary[tool]['total_sentences'] += 1
        if acc == 1.0:
            performance_summary[tool]['perfect_sentences'] += 1

# Finalize stats
for tool, d in list(performance_summary.items()):
    accs = d['accuracies']
    performance_summary[tool] = {
        'mean_accuracy': float(np.mean(accs)) if accs else 0.0,
        'std_accuracy': float(np.std(accs)) if accs else 0.0,
        'total_sentences': int(d['total_sentences']),
        'perfect_sentences': int(d['perfect_sentences'])
    }

print("Tools available for pairwise tests (excluding TextBlob):", sorted(performance_summary.keys()))


Tools available for pairwise tests (excluding TextBlob): ['flair', 'nltk', 'spacy_lg', 'spacy_sm']
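
Note that `mean_accuracy` in `performance_summary` is the mean of per-sentence accuracies, whereas the Wilson intervals in the next cell pool tokens across sentences. The two can differ noticeably when sentence lengths vary. The following sketch, with invented counts and a hypothetical helper name, illustrates the distinction:

```python
def pooled_and_mean_accuracy(per_sentence):
    """
    per_sentence: list of (correct_tokens, total_tokens) pairs, one per sentence.
    Returns (pooled token accuracy, mean of per-sentence accuracies).
    """
    usable = [(c, n) for c, n in per_sentence if n > 0]
    if not usable:
        return 0.0, 0.0
    pooled = sum(c for c, _ in usable) / sum(n for _, n in usable)
    mean_of_sentences = sum(c / n for c, n in usable) / len(usable)
    return pooled, mean_of_sentences

# Toy counts (not real data): a short, mostly correct sentence
# and a long, mostly incorrect one
toy = [(4, 5), (10, 40)]
pooled, mean_of_sentences = pooled_and_mean_accuracy(toy)
print(f"pooled = {pooled:.3f}, mean of sentences = {mean_of_sentences:.3f}")
```

The pooled figure weights every token equally, so long sentences dominate; the per-sentence mean weights every sentence equally.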


In [11]:
# ==============================================================================
# WILSON CONFIDENCE INTERVALS (STRICT) + SIGNIFICANCE (OVERALL, no TextBlob)
# ==============================================================================

import scipy.stats as stats
from math import sqrt
import numpy as np

EXCLUDE_TOOLS = {'textblob'}

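def wilson_confidence_interval(successes, trials, confidence=0.95):
    """Return (point estimate, lower, upper) of the Wilson score interval for a proportion."""
    if trials == 0:
        return 0.0, 0.0, 0.0
    p = successes / trials
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt((p * (1 - p) + z**2 / (4 * trials)) / trials) / denom
    lower = max(0, centre - half)
    upper = min(1, centre + half)
    return p, lower, upper

def proportion_z_test(x1, n1, x2, n2):
    """
    Two-sided pooled z-test for two proportions; returns (z, p_value).
    The test treats the two samples as independent. The tools here are scored
    on largely the same tokens, so read the result as descriptive; the paired
    McNemar tests are the appropriate inferential comparison.
    """
    if n1 == 0 or n2 == 0:
        return 0.0, 1.0
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    if se == 0:
        # pooled proportion is exactly 0 or 1: no variance, nothing to test
        return 0.0, 1.0
    z = (p1 - p2) / se
    # sf() avoids the precision loss of 1 - cdf() for large |z|
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * (np.arcsin(sqrt(p1)) - np.arcsin(sqrt(p2)))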

# --------------------------------------------------------------------------
# Ensure performance_summary exists (build from expanded_batch_results if not)
# --------------------------------------------------------------------------
if 'performance_summary' not in globals() or not performance_summary:
    performance_summary = {}
    for res in expanded_batch_results:
        gt_claws = res.get('ground_truth', [])
        gt_penn  = [convert_claws_to_penn(t, strict=True) for t in gt_claws]
        tri_map  = res.get('tool_results', {})
        for tool, tri in tri_map.items():
            if tool in EXCLUDE_TOOLS:
                continue
            if not isinstance(tri, dict) or ('error' in tri):
                continue
            pred = tri.get('tags') or []
            toks = tri.get('tokens') or []
            m = min(len(gt_penn), len(pred), len(toks))
            if m == 0:
                continue
            acc = sum(1 for i in range(m) if gt_penn[i] == pred[i]) / m
            if tool not in performance_summary:
                performance_summary[tool] = {
                    'accuracies': [],
                    'total_sentences': 0,
                    'perfect_sentences': 0
                }
            performance_summary[tool]['accuracies'].append(acc)
            performance_summary[tool]['total_sentences'] += 1
            if acc == 1.0:
                performance_summary[tool]['perfect_sentences'] += 1

    for tool, d in list(performance_summary.items()):
        accs = d['accuracies']
        performance_summary[tool] = {
            'mean_accuracy': float(np.mean(accs)) if accs else 0.0,
            'std_accuracy': float(np.std(accs)) if accs else 0.0,
            'total_sentences': int(d['total_sentences']),
            'perfect_sentences': int(d['perfect_sentences'])
        }

print("Tool Performance with Wilson 95% Confidence Intervals (STRICT):")
print("=" * 60)

wilson_results = {}
aligned_sentence_counts = {}  # per-tool count of sentences with m>0

for tool_name, stats_data in performance_summary.items():
    if tool_name in EXCLUDE_TOOLS:
        continue

    total_sentences = stats_data['total_sentences']
    perfect_sentences = stats_data['perfect_sentences']

    # Perfect-sentence Wilson CI
    perfect_rate, perfect_lower, perfect_upper = wilson_confidence_interval(
        perfect_sentences, total_sentences
    )

    # Token-level totals + aligned sentence count
    total_tokens = 0
    correct_tokens = 0
    aligned_sents = 0

    for res in expanded_batch_results:
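        tri = res.get('tool_results', {}).get(tool_name, {})
        if not isinstance(tri, dict) or not tri or 'error' in tri:
            continue
        gt = res.get('ground_truth', [])
        # `or []` also covers entries whose tags/tokens are stored as None
        pred = tri.get('tags') or []
        toks = tri.get('tokens') or []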
        m = min(len(gt), len(pred), len(toks))
        if m == 0:
            continue
        aligned_sents += 1
        gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt[:m]]
        correct_tokens += sum(1 for i in range(m) if gt_penn[i] == pred[i])
        total_tokens += m

    aligned_sentence_counts[tool_name] = aligned_sents

    if total_tokens == 0:
        # No usable tokens for this tool; skip it
        continue

    token_accuracy, token_lower, token_upper = wilson_confidence_interval(
        correct_tokens, total_tokens
    )

    wilson_results[tool_name] = {
        'token_accuracy': token_accuracy,
        'token_ci_lower': token_lower,
        'token_ci_upper': token_upper,
        'perfect_rate': perfect_rate,
        'perfect_ci_lower': perfect_lower,
        'perfect_ci_upper': perfect_upper,
        'total_tokens': total_tokens,
        'correct_tokens': correct_tokens,
        'total_sentences': total_sentences,
        'perfect_sentences': perfect_sentences,
        'aligned_sentences': aligned_sents
    }

    print(f"\n{tool_name.upper()}:")
    print(f"  Token Accuracy: {token_accuracy:.3f} [{token_lower:.3f}, {token_upper:.3f}]")
    print(f"  Perfect Sentences: {perfect_rate:.3f} [{perfect_lower:.3f}, {perfect_upper:.3f}]")
    print(f"  Sample size: {total_tokens:,} tokens, {total_sentences} sentences "
          f"(aligned sentences used: {aligned_sents})")

# Guard and significance test
if not wilson_results:
    print("\nNo token-level data available to compute Wilson intervals. "
          "Check that expanded_batch_results is populated and tool predictions exist.")
else:
    print("\n" + "=" * 60)
    print("STATISTICAL SIGNIFICANCE TESTS")
    print("=" * 60)

    tools_by_accuracy = sorted(wilson_results.items(),
                               key=lambda x: x[1]['token_accuracy'],
                               reverse=True)
    best_tool, best_stats = tools_by_accuracy[0]
    worst_tool, worst_stats = tools_by_accuracy[-1]

    z_stat, p_value = proportion_z_test(
        best_stats['correct_tokens'], best_stats['total_tokens'],
        worst_stats['correct_tokens'], worst_stats['total_tokens']
    )

    print(f"Comparison: {best_tool} vs {worst_tool}")
    print(f"Accuracy difference: {best_stats['token_accuracy'] - worst_stats['token_accuracy']:.3f}")
    print(f"Z-statistic: {z_stat:.3f}")
    print(f"P-value: {p_value:.3f}")
    sig = "Yes" if p_value < 0.05 else "No"
    print(f"Significant at α=0.05: {sig}")

    effect_size = cohens_h(best_stats['token_accuracy'], worst_stats['token_accuracy'])
    if abs(effect_size) < 0.2:
        magnitude = "negligible"
    elif abs(effect_size) < 0.5:
        magnitude = "small"
    elif abs(effect_size) < 0.8:
        magnitude = "medium"
    else:
        magnitude = "large"
    print(f"Effect size (Cohen's h): {effect_size:.3f} ({magnitude})")

# Optional sanity check: show aligned sentence counts per tool
print("\nAligned sentences used per tool (should be close to 383 if all align):")
for t, n in sorted(aligned_sentence_counts.items()):
    print(f"  {t:10s}: {n}")


Tool Performance with Wilson 95% Confidence Intervals (STRICT):

SPACY_SM:
  Token Accuracy: 0.527 [0.518, 0.537]
  Perfect Sentences: 0.000 [0.000, 0.010]
  Sample size: 9,994 tokens, 383 sentences (aligned sentences used: 383)

SPACY_LG:
  Token Accuracy: 0.530 [0.520, 0.540]
  Perfect Sentences: 0.000 [0.000, 0.010]
  Sample size: 9,994 tokens, 383 sentences (aligned sentences used: 383)

NLTK:
  Token Accuracy: 0.531 [0.521, 0.540]
  Perfect Sentences: 0.000 [0.000, 0.010]
  Sample size: 9,941 tokens, 383 sentences (aligned sentences used: 383)

FLAIR:
  Token Accuracy: 0.559 [0.549, 0.569]
  Perfect Sentences: 0.000 [0.000, 0.010]
  Sample size: 9,994 tokens, 383 sentences (aligned sentences used: 383)

STATISTICAL SIGNIFICANCE TESTS
Comparison: flair vs spacy_sm
Accuracy difference: 0.032
Z-statistic: 4.487
P-value: 0.000
Significant at α=0.05: Yes
Effect size (Cohen's h): 0.063 (negligible)

Aligned sentences used per tool (should be close to 383 if all align):
  flair     : 383
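
As a cross-check, the reported spaCy (sm) interval can be recomputed with the standard library alone. The counts come from the error-analysis output earlier in the notebook (4,723 errors out of 9,994 aligned tokens, hence 5,271 correct); the function name `wilson_interval` is a stand-in chosen here, and the arithmetic mirrors `wilson_confidence_interval`.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion: (estimate, lower, upper)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt((p * (1 - p) + z**2 / (4 * trials)) / trials) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)

# spacy_sm: 9,994 aligned tokens, 4,723 errors (from the STRICT error listing)
acc, lower, upper = wilson_interval(9994 - 4723, 9994)
print(f"{acc:.3f} [{lower:.3f}, {upper:.3f}]")  # → 0.527 [0.518, 0.537]
```

The interval is narrow because it treats the 9,994 tokens as independent trials; tokens within a sentence are likely correlated, which is what the sentence-cluster bootstrap in the next cell is meant to address.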

In [12]:
# ==============================================================================
# Pairwise McNemar’s Tests (with Holm–Bonferroni) + Cluster Bootstrap CIs
# ==============================================================================

import itertools
import numpy as np
from scipy.stats import binomtest

# Try to import statsmodels' McNemar; fall back to an exact/binomial implementation
try:
    from statsmodels.stats.contingency_tables import mcnemar as _sm_mcnemar
    HAVE_STATSMODELS = True
except Exception:
    HAVE_STATSMODELS = False

def _mcnemar_from_table(table, exact):
    """
    Return (statistic, pvalue) for McNemar from 2x2 table:
      [[both_correct, a_correct_b_wrong],
       [b_correct_a_wrong, both_wrong]]
    If statsmodels is present, use it; otherwise compute:
      - exact: two-sided binomial on min(b01,b10) with n=b01+b10, p=0.5
      - chi^2 with continuity correction when exact=False, matching the
        correction=True setting passed to statsmodels
    """
    b01 = table[0][1]
    b10 = table[1][0]
    discordant = b01 + b10
    if discordant == 0:
        # no information to test; define stat=0, p=1
        return 0.0, 1.0

    if HAVE_STATSMODELS:
        res = _sm_mcnemar(table, exact=exact, correction=(not exact))
        stat = float(res.statistic) if res.statistic is not None else np.nan
        return stat, float(res.pvalue)

    # Fallbacks
    if exact:
        # Two-sided exact binomial test under H0: p = 0.5
        k = min(b01, b10)
        p = binomtest(k, n=discordant, p=0.5, alternative='two-sided').pvalue
        # A chi^2-like descriptive stat (not used for decision when exact=True)
        stat = (b01 - b10) ** 2 / discordant
        return float(stat), float(p)
    else:
        # Large-sample chi-square with continuity correction, so the fallback
        # gives the same statistic as the statsmodels path above
        stat = (abs(b01 - b10) - 1) ** 2 / discordant
        # Upper-tail p-value from chi-square(1); sf() keeps precision for tiny p
        from scipy.stats import chi2
        p = chi2.sf(stat, df=1)
        return float(stat), float(p)

def build_mcnemar_table(tool_a, tool_b, results):
    """
    Build 2x2 contingency for paired token outcomes:
      [[ both_correct,  a_correct_b_wrong ],
       [ b_correct_a_wrong, both_wrong     ]]
    STRICT mapping for gold.
    """
    both_correct = both_wrong = a_correct_b_wrong = b_correct_a_wrong = 0

    for res in results:
        gt = res['ground_truth']
        tri_a = res['tool_results'].get(tool_a, {})
        tri_b = res['tool_results'].get(tool_b, {})
        if not isinstance(tri_a, dict) or not isinstance(tri_b, dict):
            continue
        if 'error' in tri_a or 'error' in tri_b:
            continue

        # `or []` also covers entries whose tags/tokens are stored as None
        pred_a = tri_a.get('tags') or []
        pred_b = tri_b.get('tags') or []
        toks_a = tri_a.get('tokens') or []
        toks_b = tri_b.get('tokens') or []

        m = min(len(gt), len(pred_a), len(pred_b), len(toks_a), len(toks_b))
        if m == 0:
            continue

        gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt[:m]]

        for i in range(m):
            a_correct = (gt_penn[i] == pred_a[i])
            b_correct = (gt_penn[i] == pred_b[i])
            if a_correct and b_correct:
                both_correct += 1
            elif (not a_correct) and (not b_correct):
                both_wrong += 1
            elif a_correct and (not b_correct):
                a_correct_b_wrong += 1
            else:  # b_correct and not a_correct
                b_correct_a_wrong += 1

    return [[both_correct, a_correct_b_wrong],
            [b_correct_a_wrong, both_wrong]]

def run_mcnemar_for_all_pairs(results, tool_names):
    """
    Runs McNemar for each unordered tool pair.
    Uses exact test when discordant count < 25, else chi^2 with continuity (if statsmodels).
    Returns list of dicts with stats and raw p-values.
    """
    outcomes = []
    for a, b in itertools.combinations(tool_names, 2):
        table = build_mcnemar_table(a, b, results)
        b01 = table[0][1]
        b10 = table[1][0]
        discordant = b01 + b10
        exact = discordant < 25  # common rule-of-thumb
        chi2, p = _mcnemar_from_table(table, exact=exact)
        outcomes.append({
            'tool_a': a, 'tool_b': b,
            'table': table,
            'discordant': discordant,
            'exact': exact,
            'chi2': chi2,
            'p_value': p
        })
    return outcomes

def holm_bonferroni_adjust(pvals):
    """
    Holm–Bonferroni step-down procedure for FWER control.
    Returns adjusted p-values in the original order.
    """
    m = len(pvals)
    if m == 0:
        return np.array([])
    order = np.argsort(pvals)
    adj = np.empty(m, dtype=float)
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):
        p = pvals[idx]
        adj_p = (m - rank + 1) * p
        running_max = max(running_max, adj_p)
        adj[idx] = min(1.0, running_max)
    return adj

def sentence_accuracy(tool_name, res):
    """
    Returns (#correct, #total) tokens for a single sentence result and tool,
    under STRICT mapping.
    """
    tri = res['tool_results'].get(tool_name, {})
    if not isinstance(tri, dict) or 'error' in tri:
        return 0, 0
    gt = res['ground_truth']
    pred = tri.get('tags') or []
    toks = tri.get('tokens') or []
    m = min(len(gt), len(pred), len(toks))
    if m == 0:
        return 0, 0
    gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt[:m]]
    correct = sum(1 for i in range(m) if gt_penn[i] == pred[i])
    return correct, m

def bootstrap_accuracy_diff_sentence_cluster(tool_a, tool_b, results, n_iter=3000, seed=123):
    """
    Cluster bootstrap by sentence: resample sentences with replacement,
    compute token-level accuracy per tool, then take difference acc_a - acc_b.
    Returns mean diff and 95% percentile CI.
    """
    rng = np.random.default_rng(seed)
    n = len(results)
    diffs = np.empty(n_iter, dtype=float)

    # Precompute per-sentence (correct, total) for speed
    per_sent = []
    for res in results:
        ca, na = sentence_accuracy(tool_a, res)
        cb, nb = sentence_accuracy(tool_b, res)
        per_sent.append((ca, na, cb, nb))

    for i in range(n_iter):
        idx = rng.integers(0, n, n)
        ca_sum = na_sum = cb_sum = nb_sum = 0
        for j in idx:
            ca, na, cb, nb = per_sent[j]
            ca_sum += ca; na_sum += na
            cb_sum += cb; nb_sum += nb
        acc_a = ca_sum / na_sum if na_sum else 0.0
        acc_b = cb_sum / nb_sum if nb_sum else 0.0
        diffs[i] = acc_a - acc_b

    mean_diff = float(np.mean(diffs))
    low, high = np.percentile(diffs, [2.5, 97.5])
    return mean_diff, float(low), float(high)

# --- Choose tools from your performance_summary; drop any with no token data ---
all_tools = sorted(performance_summary.keys())
# If you ever want to explicitly exclude a tool (e.g., 'textblob'), uncomment:
# all_tools = [t for t in all_tools if t != 'textblob']

if len(all_tools) < 2:
    print("Not enough tools for pairwise tests.")
else:
    # 1) McNemar for all pairs
    mcnemar_outcomes = run_mcnemar_for_all_pairs(expanded_batch_results, all_tools)
    pvals = np.array([o['p_value'] for o in mcnemar_outcomes])
    adj_pvals = holm_bonferroni_adjust(pvals)

    print("\nPAIRWISE McNemar’s tests (token-level, STRICT gold projection):")
    print("=" * 80)
    for o, adjp in zip(mcnemar_outcomes, adj_pvals):
        a, b = o['tool_a'], o['tool_b']
        t = o['table']
        print(f"\n{a} vs {b}")
        print(f"  Table [[both_correct, a_correct_b_wrong], [b_correct_a_wrong, both_wrong]] = {t}")
        print(f"  Discordant = {o['discordant']} | {'exact' if o['exact'] else 'chi^2'} test")
        print(f"  McNemar χ² = {o['chi2']:.3f} | p = {o['p_value']:.4g} | Holm-adjusted p = {adjp:.4g}")
        if adjp < 0.05:
            print("  → Significant asymmetry in disagreements (after Holm–Bonferroni).")
        else:
            print("  → No significant asymmetry in disagreements (after correction).")

    # 2) Cluster bootstrap CIs for accuracy differences for all pairs
    print("\nPAIRWISE Bootstrap 95% CIs for accuracy differences (acc_A − acc_B):")
    print("=" * 80)
    for a, b in itertools.combinations(all_tools, 2):
        mean_diff, lo, hi = bootstrap_accuracy_diff_sentence_cluster(a, b, expanded_batch_results,
                                                                     n_iter=3000, seed=123)
        note = " (A>B)" if (lo > 0) else (" (B>A)" if (hi < 0) else " (no clear difference)")
        print(f"{a:10s} − {b:10s}: mean = {mean_diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]{note}")



PAIRWISE McNemar’s tests (token-level, STRICT gold projection):

flair vs nltk
  Table [[both_correct, a_correct_b_wrong], [b_correct_a_wrong, both_wrong]] = [[4815, 741], [458, 3926]]
  Discordant = 1199 | chi^2 test
  McNemar χ² = 66.325 | p = 3.823e-16 | Holm-adjusted p = 1.529e-15
  → Significant asymmetry in disagreements (after Holm–Bonferroni).

flair vs spacy_lg
  Table [[both_correct, a_correct_b_wrong], [b_correct_a_wrong, both_wrong]] = [[5079, 508], [220, 4186]]
  Discordant = 728 | chi^2 test
  McNemar χ² = 113.144 | p = 2.006e-26 | Holm-adjusted p = 1.003e-25
  → Significant asymmetry in disagreements (after Holm–Bonferroni).

flair vs spacy_sm
  Table [[both_correct, a_correct_b_wrong], [b_correct_a_wrong, both_wrong]] = [[5054, 533], [217, 4189]]
  Discordant = 750 | chi^2 test
  McNemar χ² = 132.300 | p = 1.286e-30 | Holm-adjusted p = 7.718e-30
  → Significant asymmetry in disagreements (after Holm–Bonferroni).

nltk vs spacy_lg
  Table [[both_correct, a_correct_b_wro

In [13]:
# ==============================================================================
# STRICT metrics: Precision/Recall/F1 (micro/macro/weighted)
# + Newcombe (Wilson) CIs for differences in accuracy
# (filters out TextBlob; includes spaCy_sm, spaCy_lg, Flair, NLTK)
# ==============================================================================

import numpy as np
import itertools
from collections import defaultdict, Counter
from sklearn.metrics import precision_recall_fscore_support

def collect_token_outcomes_strict(results, allowed_tools=None):
    """
    Build per-tool gold/pred arrays under STRICT projection.
    Only tools in allowed_tools are kept (if provided).
    Returns:
      per_tool = {
         tool: {
            'y_true': [... Penn tags ...],
            'y_pred': [... Penn tags ...],
            'correct_tokens': int,
            'total_tokens': int
         }, ...
      }
    """
    per_tool = defaultdict(lambda: {'y_true': [], 'y_pred': [], 'correct_tokens': 0, 'total_tokens': 0})

    for res in results:
        gt_claws = res.get('ground_truth', [])
        tri_map = res.get('tool_results', {})
        for tool, tri in tri_map.items():
            if allowed_tools is not None and tool not in allowed_tools:
                continue
            if not isinstance(tri, dict) or ('error' in tri):
                continue

            pred = tri.get('tags', []) or []
            toks = tri.get('tokens', []) or []
            m = min(len(gt_claws), len(pred), len(toks))
            if m <= 0:
                continue

            gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt_claws[:m]]
            y_true = gt_penn
            y_pred = pred[:m]

            per_tool[tool]['y_true'].extend(y_true)
            per_tool[tool]['y_pred'].extend(y_pred)
            per_tool[tool]['total_tokens'] += m
            per_tool[tool]['correct_tokens'] += sum(1 for i in range(m) if y_true[i] == y_pred[i])

    return per_tool

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a single proportion."""
    if n == 0:
        return (0.0, 0.0, 0.0)
    from scipy.stats import norm
    z = norm.ppf(1 - (1 - confidence)/2)
    p = successes / n
    denom = 1 + z*z/n
    centre = (p + z*z/(2*n)) / denom
    half = z * np.sqrt((p*(1-p) + z*z/(4*n))/n) / denom
    return (p, max(0.0, centre - half), min(1.0, centre + half))

def newcombe_wilson_diff(x1, n1, x2, n2, confidence=0.95):
    """
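    Wilson-based CI for the difference of two independent proportions.
    Note: this subtracts opposite Wilson bounds (L1 - U2, U1 - L2), which is a
    conservative (wider) variant of Newcombe's method 10 (1998); the original
    method combines the Wilson half-widths by square-and-add instead.
    This complements paired analyses (McNemar / bootstrap) with a descriptive CI.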
    """
    p1, L1, U1 = wilson_interval(x1, n1, confidence)
    p2, L2, U2 = wilson_interval(x2, n2, confidence)
    diff = p1 - p2
    lower = L1 - U2
    upper = U1 - L2
    return diff, lower, upper, (p1, L1, U1), (p2, L2, U2)

# Choose tools explicitly (exclude textblob)
discovered = {t for r in expanded_batch_results for t in r.get('tool_results', {}).keys()}
tools_for_eval = [t for t in ('spacy_sm','spacy_lg','flair','nltk') if t in discovered]

per_tool = collect_token_outcomes_strict(expanded_batch_results, allowed_tools=tools_for_eval)

if not per_tool:
    print("No token-level data found. Make sure expanded_batch_results is populated.")
else:
    # Build a consistent label set across the included tools
    label_set = set()
    for d in per_tool.values():
        label_set.update(d['y_true'])
        label_set.update(d['y_pred'])
    labels = sorted(label_set)

    print("Token-level Precision / Recall / F1 under STRICT gold")
    print("="*70)
    metrics_summary = {}

    # Print tools in a stable order (by name)
    for tool in sorted(per_tool.keys()):
        d = per_tool[tool]
        y_true = np.array(d['y_true'])
        y_pred = np.array(d['y_pred'])
        if y_true.size == 0:
            continue

        # Averages
        prec_micro, rec_micro, f1_micro, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=labels, average='micro', zero_division=0
        )
        prec_macro, rec_macro, f1_macro, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=labels, average='macro', zero_division=0
        )
        prec_weighted, rec_weighted, f1_weighted, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=labels, average='weighted', zero_division=0
        )

        # Per-class (optional): show top classes by support
        _, _, f1_per_class, support = precision_recall_fscore_support(
            y_true, y_pred, labels=labels, average=None, zero_division=0
        )
        support_idx = np.argsort(support)[::-1]
        topK = 10
        top_rows = [(labels[i], int(support[i]), float(f1_per_class[i])) for i in support_idx[:topK]]

        acc = d['correct_tokens'] / d['total_tokens'] if d['total_tokens'] else 0.0
        metrics_summary[tool] = {
            'accuracy': acc,
            'micro':  (prec_micro, rec_micro, f1_micro),
            'macro':  (prec_macro, rec_macro, f1_macro),
            'weighted': (prec_weighted, rec_weighted, f1_weighted),
            'top_classes': top_rows,
            'total_tokens': d['total_tokens'],
            'correct_tokens': d['correct_tokens']
        }

        print(f"\n{tool.upper()}:")
        print(f"  Tokens: {d['total_tokens']:,} | Accuracy: {acc:.3f}")
        print(f"  Micro  P/R/F1: {prec_micro:.3f} / {rec_micro:.3f} / {f1_micro:.3f}")
        print(f"  Macro  P/R/F1: {prec_macro:.3f} / {rec_macro:.3f} / {f1_macro:.3f}")
        print(f"  Weight P/R/F1: {prec_weighted:.3f} / {rec_weighted:.3f} / {f1_weighted:.3f}")
        print(f"  Top {topK} classes by support (label, support, F1):")
        for lbl, sup, f1c in top_rows:
            print(f"    {lbl:>4s}  {sup:6d}  F1={f1c:.3f}")

    # Newcombe (Wilson) CIs for accuracy differences between every pair of tools
    if metrics_summary:
        print("\nNewcombe (Wilson) 95% CIs for differences in token accuracy (acc_A − acc_B)")
        print("="*70)
        tools = sorted(metrics_summary.keys())
        for a, b in itertools.combinations(tools, 2):
            ca, na = metrics_summary[a]['correct_tokens'], metrics_summary[a]['total_tokens']
            cb, nb = metrics_summary[b]['correct_tokens'], metrics_summary[b]['total_tokens']
            diff, lo, hi, p1_info, p2_info = newcombe_wilson_diff(ca, na, cb, nb, confidence=0.95)
            print(f"{a:10s} − {b:10s}: Δ = {diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]  "
                  f"(p1={p1_info[0]:.3f}, p2={p2_info[0]:.3f})")
    else:
        print("No metrics to compare.")


Token-level Precision / Recall / F1 under STRICT gold

FLAIR:
  Tokens: 9,994 | Accuracy: 0.559
  Micro  P/R/F1: 0.559 / 0.559 / 0.559
  Macro  P/R/F1: 0.417 / 0.527 / 0.437
  Weight P/R/F1: 0.503 / 0.559 / 0.523
  Top 10 classes by support (label, support, F1):
     UNK    3148  F1=0.000
      IN    1195  F1=0.766
      NN    1153  F1=0.807
      JJ     597  F1=0.767
      RB     525  F1=0.780
       .     407  F1=0.861
     VBD     389  F1=0.620
      VB     347  F1=0.658
      CC     333  F1=0.902
     NNS     330  F1=0.822

NLTK:
  Tokens: 9,941 | Accuracy: 0.531
  Micro  P/R/F1: 0.531 / 0.531 / 0.531
  Macro  P/R/F1: 0.365 / 0.489 / 0.396
  Weight P/R/F1: 0.473 / 0.531 / 0.494
  Top 10 classes by support (label, support, F1):
     UNK    3140  F1=0.000
      IN    1195  F1=0.709
      NN    1144  F1=0.767
      JJ     594  F1=0.697
      RB     522  F1=0.759
     VBD     388  F1=0.580
       .     383  F1=0.876
      VB     346  F1=0.590
      CC     333  F1=0.882
     NNS     329

In [14]:
# Encodable-only (exclude UNK gold) token accuracy per tool
def encodable_accuracy(results, tools):
    acc = {}
    for tool in tools:
        correct = total = 0
        for res in results:
            tri = res['tool_results'].get(tool, {})
            if 'error' in tri:
                continue
            gt = res['ground_truth']
            pred = tri.get('tags', []) or []
            toks = tri.get('tokens', []) or []
            m = min(len(gt), len(pred), len(toks))
            if m == 0:
                continue
            gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt[:m]]
            for i in range(m):
                if gt_penn[i] == 'UNK':
                    continue
                total += 1
                if gt_penn[i] == pred[i]:
                    correct += 1
        acc[tool] = (correct / total) if total else 0.0
    return acc

enc_only = encodable_accuracy(expanded_batch_results, ['flair','spacy_lg','spacy_sm','nltk'])
for k, v in sorted(enc_only.items()):
    print(f"{k:9s}: {v:.3f}")


flair    : 0.816
nltk     : 0.775
spacy_lg : 0.774
spacy_sm : 0.770


### Statistical Tests under **Strict** CLAWS→Penn Projection

#### Wilson 95% Confidence Intervals (token accuracy)
*(383 sentences; ~9,994 tokens for spaCy/Flair; ~9,941 for NLTK)*

- **Flair**: **0.559**  [0.549, 0.569]  
- **NLTK**: **0.531**  [0.521, 0.540]  
- **spaCy (lg)**: **0.530**  [0.520, 0.540]  
- **spaCy (sm)**: **0.527**  [0.518, 0.537]  

_Perfect-sentence rate ≈ 0% for all models (95% CI upper bound ≈ 1%)._

#### Two-Proportion Z-Test (best vs comparator)
- **Flair vs spaCy (sm)**: Δ = **0.032**, z = **4.487**, p < **0.001**, **significant**  
  Cohen’s *h* = **0.063** → *negligible* effect size.
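
The z statistic and Cohen's *h* above can be reproduced from correct-token counts. The following is a minimal sketch, not the notebook's own implementation (which is not shown in this section); the function names are hypothetical, and the counts are read off the Flair vs spaCy-sm contingency table in the McNemar output (Flair correct = 5054 + 533, spaCy-sm correct = 5054 + 217, out of 9,993 jointly aligned tokens). The test treats the two samples as independent even though both tools tag the same tokens, so the paired McNemar results remain the more appropriate basis for inference.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic with a two-sided normal p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # P(|Z| > |z|)
    return z, p_value

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Counts taken from the Flair vs spaCy-sm paired table in the output above.
n = 5054 + 533 + 217 + 4189
flair_correct = 5054 + 533
spacy_sm_correct = 5054 + 217

z, p_value = two_proportion_z(flair_correct, n, spacy_sm_correct, n)
h = cohens_h(flair_correct / n, spacy_sm_correct / n)
print(f"z = {z:.3f}, p = {p_value:.2g}, h = {h:.3f}")
```

By Cohen's conventional thresholds, values of *h* below 0.2 count as small, which is why the effect is described as negligible despite the very small p-value.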

#### Pairwise McNemar’s Tests (token-level; raw *p* shown, significance judged after Holm–Bonferroni correction)
- **Flair vs NLTK**: χ² = **66.325**, p = **3.823e-16** (Holm-adjusted **1.529e-15**) → **significant**  
- **Flair vs spaCy (lg)**: χ² = **113.144**, p = **2.006e-26** (Holm-adjusted **1.003e-25**) → **significant**  
- **Flair vs spaCy (sm)**: χ² = **132.300**, p = **1.286e-30** (Holm-adjusted **7.718e-30**) → **significant**  
- **NLTK vs spaCy (lg)**: χ² = **0.000**, p = **1.000** → **no significant asymmetry**  
- **NLTK vs spaCy (sm)**: χ² = **0.594**, p = **0.4407** → **no significant asymmetry**  
- **spaCy (lg) vs spaCy (sm)**: χ² = **5.879**, p = **0.0153** (Holm-adjusted ≈ **0.046**) → **significant**, but only marginally after correction

_Example contingency (Flair vs spaCy-lg): [[both correct, A-correct/B-wrong], [B-correct/A-wrong, both wrong]] = [[5079, 508], [220, 4186]] (discordant = 728)._
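
A minimal sketch of how these statistics follow from a contingency table, assuming the continuity-corrected χ² form (which reproduces the printed values, e.g. (|508 − 220| − 1)² / 728 ≈ 113.144) and a standard Holm step-down adjustment. The helper names are hypothetical, and the notebook's own implementation may differ in detail, for example in when it switches to an exact binomial test.

```python
import math

def mcnemar_chi2(table):
    """Continuity-corrected McNemar test.

    `table` is [[both_correct, a_only], [b_only, both_wrong]];
    only the two discordant cells enter the statistic.
    """
    b, c = table[0][1], table[1][0]
    if b + c == 0:
        return 0.0, 1.0
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

def holm_adjust(p_values):
    """Holm step-down adjustment, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Flair vs spaCy-lg table from the output above.
chi2, p = mcnemar_chi2([[5079, 508], [220, 4186]])
print(f"chi2 = {chi2:.3f}, p = {p:.3e}")

# Raw p-values of the six pairwise tests, in the order listed above.
raw = [3.823e-16, 2.006e-26, 1.286e-30, 1.000, 0.4407, 0.0153]
adj = holm_adjust(raw)
print([f"{a:.4g}" for a in adj])
```

With six comparisons, the smallest p-value is multiplied by 6, the next by 5, and so on, which is how the raw p = 0.0153 for spaCy-lg vs spaCy-sm becomes roughly 0.046 after adjustment.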

#### Pairwise Cluster Bootstrap 95% CIs for Accuracy Differences (accₐ − accᵦ)
- **Flair − NLTK**: mean = **+0.028**, CI = **[+0.002, +0.054]** → A>B  
- **Flair − spaCy (lg)**: mean = **+0.029**, CI = **[+0.009, +0.052]** → A>B  
- **Flair − spaCy (sm)**: mean = **+0.031**, CI = **[+0.012, +0.054]** → A>B  
- **NLTK − spaCy (lg)**: mean = **+0.001**, CI = **[−0.026, +0.026]** → no clear diff  
- **NLTK − spaCy (sm)**: mean = **+0.003**, CI = **[−0.024, +0.029]** → no clear diff  
- **spaCy (lg) − spaCy (sm)**: mean = **+0.003**, CI = **[+0.001, +0.005]** → A>B
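
The function `bootstrap_accuracy_diff_sentence_cluster` is defined elsewhere in the notebook, so the following is only a sketch of the general technique under stated assumptions: sentences (not tokens) are resampled with replacement so that within-sentence dependence is preserved, and a percentile interval is read off the resampled differences. The input format, a list of `(n_tokens, correct_a, correct_b)` tuples with a shared token count per sentence, and the toy data are hypothetical.

```python
import random

def cluster_bootstrap_diff(sentences, n_iter=3000, seed=123, alpha=0.05):
    """Percentile bootstrap CI for acc_A - acc_B, resampling whole sentences.

    `sentences` holds one (n_tokens, correct_a, correct_b) tuple per sentence.
    """
    rng = random.Random(seed)
    k = len(sentences)
    diffs = []
    for _ in range(n_iter):
        # Draw k sentences with replacement; tokens stay grouped by sentence.
        sample = [sentences[rng.randrange(k)] for _ in range(k)]
        total = sum(n for n, _, _ in sample)
        gain = sum(a - b for _, a, b in sample)
        diffs.append(gain / total)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_iter)]
    hi = diffs[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return sum(diffs) / n_iter, lo, hi

# Toy data, made up for illustration: tool A is never worse than tool B.
toy = [(20, 12, 11), (15, 9, 9), (30, 18, 16), (25, 14, 13), (10, 6, 6)]
mean_diff, lo, hi = cluster_bootstrap_diff(toy, n_iter=1000)
print(f"mean = {mean_diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling at the sentence level typically gives wider intervals than token-level formulas when errors cluster within sentences, which is consistent with the bootstrap CIs for Flair being wider than the token-level ones.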

#### Alignment/coverage (sentences used)
- **Flair**: 383  
- **NLTK**: 383  
- **spaCy (lg)**: 383  
- **spaCy (sm)**: 383  

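**Takeaway:** Flair is **consistently but modestly** better than NLTK and both spaCy variants under strict projection (about 3 percentage points). spaCy-lg edges out spaCy-sm by roughly 0.3 percentage points, a gap that the paired tests can detect but that is negligible in practice, and NLTK is indistinguishable from either spaCy model. The low absolute accuracies reflect the structural loss from CLAWS→Penn projection (`UNK`) rather than weak models: on encodable tokens alone, accuracy rises to 0.77–0.82.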


### Token-level performance under **strict** CLAWS→Penn projection

| Tool        | Accuracy | Micro-F1 | Macro-F1 | Weighted-F1 |
|-------------|:--------:|:--------:|:--------:|:-----------:|
| spaCy (sm)  | 0.527    | 0.527    | 0.402    | 0.494       |
| spaCy (lg)  | 0.530    | 0.530    | 0.397    | 0.497       |
| NLTK        | 0.531    | 0.531    | 0.396    | 0.494       |
| **Flair**   | **0.559**| **0.559**| **0.437**| **0.523**   |

*Notes (selected, high-support tags):* Under strict projection, `UNK`—which collapses CLAWS distinctions invisible to Penn—accounts for ≈31% of tokens and has F1 = 0.000 for all tools. Frequent tractable tags achieve materially better F1; e.g., with Flair: `IN` ≈ 0.77, `NN` ≈ 0.81, `JJ` ≈ 0.77, `RB` ≈ 0.78, `NNS` ≈ 0.82, `CC` ≈ 0.90.

#### Pairwise accuracy differences (Newcombe Wilson 95% CIs; Δ = Acc\_A − Acc\_B)

| Comparison                   | Δ      | 95% CI            | Interpretation             |
|------------------------------|-------:|-------------------|----------------------------|
| **Flair − NLTK**             | +0.029 | [0.009, 0.048]    | Flair significantly better |
| **Flair − spaCy (lg)**       | +0.029 | [0.009, 0.048]    | Flair significantly better |
| **Flair − spaCy (sm)**       | +0.032 | [0.012, 0.051]    | Flair significantly better |
| NLTK − spaCy (lg)            | +0.000 | [−0.019, 0.020]   | No clear difference        |
| NLTK − spaCy (sm)            | +0.003 | [−0.016, 0.023]   | No clear difference        |
| spaCy (lg) − spaCy (sm)      | +0.003 | [−0.017, 0.022]   | No clear difference        |
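
The intervals in the table above coincide with the difference of opposite Wilson bounds (for Flair − NLTK, 0.549 − 0.540 = 0.009 and 0.569 − 0.521 = 0.048). Newcombe's hybrid score interval (method 10) instead combines the two Wilson half-widths by a square-and-add rule, which gives a somewhat narrower interval. The sketch below, with hypothetical helper names and only the standard library, computes both forms side by side; the counts come from the Flair vs spaCy-sm contingency table in the McNemar output and serve purely as an illustration.

```python
from math import sqrt
from statistics import NormalDist

def wilson(x, n, confidence=0.95):
    """Wilson score interval: (point estimate, lower, upper)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt((p * (1 - p) + z * z / (4 * n)) / n) / denom
    return p, centre - half, centre + half

def diff_intervals(x1, n1, x2, n2, confidence=0.95):
    """Return (diff, square-and-add interval, bound-subtraction interval)."""
    p1, l1, u1 = wilson(x1, n1, confidence)
    p2, l2, u2 = wilson(x2, n2, confidence)
    d = p1 - p2
    newcombe = (d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
                d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))
    conservative = (l1 - u2, u1 - l2)
    return d, newcombe, conservative

n = 5054 + 533 + 217 + 4189
d, newcombe, conservative = diff_intervals(5054 + 533, n, 5054 + 217, n)
print(f"diff = {d:.3f}")
print(f"square-and-add:    [{newcombe[0]:.3f}, {newcombe[1]:.3f}]")
print(f"bound subtraction: [{conservative[0]:.3f}, {conservative[1]:.3f}]")
```

Because the bound-subtraction interval is always at least as wide, any comparison it marks as excluding zero would also exclude zero under the square-and-add form.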

#### Paired significance (McNemar) and robustness (bootstrap)

- **McNemar (Holm–Bonferroni):** Disagreements are asymmetric in favor of **Flair** vs each of NLTK, spaCy-lg, and spaCy-sm (all adjusted *p* < 0.05). NLTK vs either spaCy model shows no significant asymmetry; spaCy-lg vs spaCy-sm shows a small but significant asymmetry (adjusted *p* ≈ 0.046).
- **Cluster bootstrap CIs (sentence-level resampling):** Accuracy gaps of **Flair** over NLTK/spaCy are small but positive (≈ +0.03) with CIs excluding zero. Differences between NLTK and either spaCy model are not clearly different from zero. The spaCy-lg − spaCy-sm gap is tiny (≈ +0.003), with a CI of [+0.001, +0.005] that narrowly excludes zero.

#### Interpretation under strict projection

The strict CLAWS→Penn protocol intentionally collapses any CLAWS category that Penn cannot encode into `UNK`. The resulting `UNK` mass therefore quantifies **representational loss in Penn**, not model error per se. Within this constrained regime, **Flair** exhibits a consistent but **small** advantage (≈3 percentage points). Differences among NLTK and spaCy models are negligible. The central finding is thus **structural**: performance ceilings are dominated by the tagset mismatch—evidence that CLAWS captures fine-grained morphosyntactic distinctions that Penn-based evaluation cannot reward.
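
The ceiling argument can be checked with simple arithmetic. The sketch below assumes, consistently with the F1 = 0.000 reported for `UNK`, that no tool ever predicts `UNK`, so every `UNK` gold token counts as an error. The figures are Flair's values from the outputs above (9,994 tokens, 3,148 `UNK` gold tokens, encodable-only accuracy 0.816); the variable names are illustrative.

```python
# Flair's figures as printed in the notebook outputs.
total_tokens = 9994
unk_gold = 3148
encodable_accuracy = 0.816  # accuracy on tokens whose gold tag is not UNK

unk_share = unk_gold / total_tokens
ceiling = 1 - unk_share                      # best possible strict accuracy
implied_strict = encodable_accuracy * ceiling

print(f"UNK share of gold tokens:     {unk_share:.3f}")
print(f"Strict-accuracy ceiling:      {ceiling:.3f}")
print(f"Implied strict accuracy:      {implied_strict:.3f}")
```

The implied strict accuracy agrees with Flair's reported 0.559, which supports reading the strict scores as encodable-token accuracy scaled down by the share of gold tokens that Penn can represent.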


In [15]:
# ==============================================================================
# VISUALIZATIONS — per dataset (Dubliners, BNC) + pooled
# ==============================================================================

import numpy as np
import plotly.graph_objects as go

# --- helper: strict, sentence-level accuracy for a tool on one sentence ---
def _strict_sentence_acc(gt_claws, pred_tags, pred_tokens):
    m = min(len(gt_claws), len(pred_tags), len(pred_tokens))
    if m == 0:
        return None
    gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt_claws[:m]]
    correct = sum(1 for i in range(m) if gt_penn[i] == pred_tags[i])
    return correct / m

# --- build per-dataset stats from expanded_batch_results ---
datasets = sorted({rec.get('dataset', 'Unknown') for rec in expanded_batch_results})
tools_seen = sorted({t for rec in expanded_batch_results for t in rec.get('tool_results', {}).keys()})

stats = {ds: {tool: {'accs': [], 'total_sentences': 0, 'perfect_sentences': 0}
              for tool in tools_seen}
         for ds in datasets}

for rec in expanded_batch_results:
    ds = rec.get('dataset', 'Unknown')
    gt = rec.get('ground_truth', [])
    tri_map = rec.get('tool_results', {})
    for tool, tri in tri_map.items():
        if not isinstance(tri, dict) or ('error' in tri):
            continue
        pred = tri.get('tags') or []
        toks = tri.get('tokens') or []
        acc = _strict_sentence_acc(gt, pred, toks)
        if acc is None:
            continue
        s = stats[ds][tool]
        s['accs'].append(acc)
        s['total_sentences'] += 1
        if acc == 1.0:
            s['perfect_sentences'] += 1

# finalize (replace accs list with summary numbers) and also compute pooled ("All")
pooled = {tool: {'accs': [], 'total_sentences': 0, 'perfect_sentences': 0} for tool in tools_seen}
for ds in datasets:
    for tool in tools_seen:
        d = stats[ds][tool]
        accs = d['accs']
        stats[ds][tool] = {
            'mean_accuracy': float(np.mean(accs)) if accs else 0.0,
            'std_accuracy': float(np.std(accs)) if accs else 0.0,
            'total_sentences': int(d['total_sentences']),
            'perfect_sentences': int(d['perfect_sentences'])
        }
        pooled[tool]['accs'].extend(accs)
        pooled[tool]['total_sentences'] += d['total_sentences']
        pooled[tool]['perfect_sentences'] += d['perfect_sentences']

stats['All (pooled)'] = {}
for tool in tools_seen:
    p = pooled[tool]
    accs = p['accs']
    stats['All (pooled)'][tool] = {
        'mean_accuracy': float(np.mean(accs)) if accs else 0.0,
        'std_accuracy': float(np.std(accs)) if accs else 0.0,
        'total_sentences': int(p['total_sentences']),
        'perfect_sentences': int(p['perfect_sentences'])
    }

# --- Figure 1: grouped bar of mean sentence accuracy per tool, by dataset ---
ordered_datasets = [ds for ds in ['Dubliners', 'BNC', 'All (pooled)'] if ds in stats]
tools = tools_seen

fig_acc = go.Figure()
for ds in ordered_datasets:
    y_vals = [stats[ds].get(t, {}).get('mean_accuracy', 0.0) for t in tools]
    y_errs = [stats[ds].get(t, {}).get('std_accuracy', 0.0) for t in tools]
    fig_acc.add_trace(go.Bar(
        name=ds,
        x=tools,
        y=y_vals,
        error_y=dict(type='data', array=y_errs),
        text=[f"{v:.3f}" for v in y_vals],
        textposition='auto'
    ))

fig_acc.update_layout(
    barmode='group',
    title="Mean sentence accuracy (STRICT) by tool and dataset",
    xaxis_title="Tool",
    yaxis_title="Mean sentence accuracy",
    yaxis=dict(range=[0, 1])
)
fig_acc.show()

# --- Conditionally render perfect-sentence plot only if any rate > 0 ---
any_perfect = False
rates_by_ds = {}
for ds in ordered_datasets:
    ds_rates = []
    for t in tools:
        d = stats[ds].get(t, {})
        tot = d.get('total_sentences', 0)
        perf = d.get('perfect_sentences', 0)
        rate = (perf / tot) if tot else 0.0
        ds_rates.append(rate)
        if rate > 0:
            any_perfect = True
    rates_by_ds[ds] = ds_rates

if any_perfect:
    fig_perf = go.Figure()
    for ds in ordered_datasets:
        fig_perf.add_trace(go.Bar(
            name=ds,
            x=tools,
            y=rates_by_ds[ds],
            text=[f"{r:.1%}" for r in rates_by_ds[ds]],
            textposition='auto'
        ))
    fig_perf.update_layout(
        barmode='group',
        title="Perfect-sentence tagging rate (STRICT) by tool and dataset",
        xaxis_title="Tool",
        yaxis_title="Perfect sentences",
        yaxis=dict(range=[0, 1])
    )
    fig_perf.show()
else:
    print("Perfect-sentence tagging rate is 0% for all tools/datasets under STRICT; plot omitted.")

print("Visualizations complete.")


Perfect-sentence tagging rate is 0% for all tools/datasets under STRICT; plot omitted.
Visualizations complete.


In [16]:
# ==============================================================================
# ADDITIONAL DATA VISUALIZATIONS (STRICT, pooled Dubliners + BNC)
# ==============================================================================

import numpy as np
import plotly.graph_objects as go
from collections import defaultdict, Counter

# --- ensure wilson_results exists (fallback compute if missing) ----------------
def _wilson_interval(successes, n, confidence=0.95):
    if n == 0:
        return 0.0, 0.0, 0.0
    from scipy.stats import norm
    z = norm.ppf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z*z/n
    centre = (p + z*z/(2*n)) / denom
    half = z * np.sqrt((p*(1-p) + z*z/(4*n))/n) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)

if 'wilson_results' not in globals() or not wilson_results:
    # build from expanded_batch_results
    wilson_results = {}
    discovered_tools = sorted({t for r in expanded_batch_results for t in r.get('tool_results', {})})
    for tool_name in discovered_tools:
        total_tokens = 0
        correct_tokens = 0
        total_sentences = 0
        perfect_sentences = 0
        for res in expanded_batch_results:
            tri = res['tool_results'].get(tool_name, {})
            if not isinstance(tri, dict) or ('error' in tri):
                continue
            gt = res['ground_truth']
            pred = tri.get('tags', []) or []    # guard against None, as in the other cells
            toks = tri.get('tokens', []) or []
            m = min(len(gt), len(pred), len(toks))
            if m == 0:
                continue
            gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt[:m]]
            corr = sum(1 for i in range(m) if gt_penn[i] == pred[i])
            correct_tokens += corr
            total_tokens += m
            total_sentences += 1
            if corr == m:
                perfect_sentences += 1
        if total_tokens > 0:
            p, lo, hi = _wilson_interval(correct_tokens, total_tokens, 0.95)
            pr, pr_lo, pr_hi = _wilson_interval(perfect_sentences, total_sentences, 0.95)
            wilson_results[tool_name] = {
                'token_accuracy': p,
                'token_ci_lower': lo,
                'token_ci_upper': hi,
                'perfect_rate': pr,
                'perfect_ci_lower': pr_lo,
                'perfect_ci_upper': pr_hi,
                'total_tokens': total_tokens,
                'correct_tokens': correct_tokens,
                'total_sentences': total_sentences,
                'perfect_sentences': perfect_sentences
            }

# --- 1) Wilson CI comparison plot ---------------------------------------------
tools_order = sorted(wilson_results.keys())
ci_points = [(t, wilson_results[t]['token_accuracy'],
              wilson_results[t]['token_ci_lower'],
              wilson_results[t]['token_ci_upper']) for t in tools_order]

x_min = min(lo for _, _, lo, _ in ci_points) if ci_points else 0.0
x_max = max(hi for _, _, _, hi in ci_points) if ci_points else 1.0
pad = max(0.01, (x_max - x_min) * 0.1)
x_range = [max(0.0, x_min - pad), min(1.0, x_max + pad)]

fig_ci = go.Figure()
for tool, p, lo, hi in ci_points:
    # point
    fig_ci.add_trace(go.Scatter(
        x=[p], y=[tool], mode='markers',
        marker=dict(size=12),
        name=tool, showlegend=False,
        hovertemplate=f"{tool}<br>Accuracy={p:.3%}<br>95% CI=[{lo:.3%}, {hi:.3%}]<extra></extra>"
    ))
    # CI segment
    fig_ci.add_trace(go.Scatter(
        x=[lo, hi], y=[tool, tool], mode='lines',
        line=dict(width=4), showlegend=False
    ))

fig_ci.update_layout(
    title="Token Accuracy with 95% Wilson CIs (STRICT)<br><sub>Dubliners + BNC, pooled</sub>",
    xaxis_title="Token-level accuracy",
    yaxis_title="Tool",
    xaxis=dict(range=x_range, tickformat='.0%'),
    height=420
)
fig_ci.show()

# --- helper: strict sentence accuracy -----------------------------------------
def _strict_sentence_acc(gt_claws, pred_tags, pred_tokens):
    m = min(len(gt_claws), len(pred_tags), len(pred_tokens))
    if m <= 0:
        return None
    gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt_claws[:m]]
    return sum(1 for i in range(m) if gt_penn[i] == pred_tags[i]) / m

# --- 2) Error heatmap: most problematic CLAWS tags (by strict mismatch) -------
# We compare strict-projected gold vs predicted Penn, but attribute errors to the
# original CLAWS source tag (optionally collapsed via _strip_ditto if desired).
strip_ditto = True  # set False if you want the raw CLAWS variants

tag_errors = defaultdict(lambda: defaultdict(int))  # CLAWS -> tool -> count
tools_for_heatmap = tools_order  # same tools

for res in expanded_batch_results:
    gt_claws = res.get('ground_truth', [])
    # precompute strict-projected gold
    gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt_claws]
    for tool in tools_for_heatmap:
        tri = res.get('tool_results', {}).get(tool, {})
        if not isinstance(tri, dict) or ('error' in tri):
            continue
        pred = tri.get('tags', []) or []
        toks = tri.get('tokens', []) or []
        m = min(len(gt_penn), len(pred), len(toks), len(gt_claws))
        if m == 0:
            continue
        for i in range(m):
            if gt_penn[i] != pred[i]:
                claws_src = _strip_ditto(gt_claws[i]) if strip_ditto else gt_claws[i]
                tag_errors[claws_src][tool] += 1

# pick top 12 tags by total error across tools
tag_totals = {tag: sum(cnts.values()) for tag, cnts in tag_errors.items()}
top_tags = [t for t, _ in sorted(tag_totals.items(), key=lambda x: x[1], reverse=True)[:12]]

heatmap_data = []
for tag in top_tags:
    row = [tag_errors[tag].get(tool, 0) for tool in tools_for_heatmap]
    heatmap_data.append(row)

fig_heatmap = go.Figure(data=go.Heatmap(
    z=heatmap_data,
    x=tools_for_heatmap,
    y=top_tags,
    colorscale='Reds',
    text=heatmap_data,
    texttemplate="%{text}",
    textfont={"size": 10},
    hovertemplate="Tag=%{y}<br>Tool=%{x}<br>Errors=%{z}<extra></extra>"
))
fig_heatmap.update_layout(
    title="Most Problematic CLAWS Tags (STRICT mismatches attributed to CLAWS source)",
    xaxis_title="Tool",
    yaxis_title="CLAWS tag",
    height=520
)
fig_heatmap.show()

# --- 3) Sentence length vs accuracy (STRICT), colored by tool -----------------
points = []
for res in expanded_batch_results:
    ds = res.get('dataset', 'Unknown')
    gt = res.get('ground_truth', [])
    sent_len = len(gt)
    for tool in tools_order:
        tri = res.get('tool_results', {}).get(tool, {})
        if not isinstance(tri, dict) or ('error' in tri):
            continue
        pred = tri.get('tags', []) or []
        toks = tri.get('tokens', []) or []
        acc = _strict_sentence_acc(gt, pred, toks)
        if acc is not None:
            points.append((sent_len, acc, tool, ds))

import pandas as pd
df_scatter = pd.DataFrame(points, columns=['length', 'accuracy', 'tool', 'dataset'])

fig_scatter = go.Figure()
palette = {
    'flair': '#2ca02c',
    'spacy_lg': '#ff7f0e',
    'spacy_sm': '#1f77b4',
    'nltk': '#9467bd'
}
for tool in sorted(df_scatter['tool'].unique()):
    sub = df_scatter[df_scatter['tool'] == tool]
    fig_scatter.add_trace(go.Scatter(
        x=sub['length'], y=sub['accuracy'],
        mode='markers', name=tool,
        marker=dict(size=6, opacity=0.6, color=palette.get(tool, '#636EFA')),
        hovertemplate=("Tool="+tool+"<br>Len=%{x}<br>Acc=%{y:.1%}"
                       "<br>Dataset=%{text}<extra></extra>"),
        text=sub['dataset']
    ))

# Trend line (overall)
if not df_scatter.empty:
    z = np.polyfit(df_scatter['length'], df_scatter['accuracy'], 1)
    p = np.poly1d(z)
    x_trend = np.linspace(df_scatter['length'].min(), df_scatter['length'].max(), 100)
    fig_scatter.add_trace(go.Scatter(
        x=x_trend, y=p(x_trend), mode='lines', name='Trend',
        line=dict(color='red', dash='dash')
    ))

fig_scatter.update_layout(
    title="Sentence Length vs Token Accuracy (STRICT)<br><sub>Dubliners + BNC</sub>",
    xaxis_title="Sentence length (tokens in CLAWS gold)",
    yaxis_title="Token accuracy",
    yaxis=dict(range=[0, 1], tickformat='.0%'),
    height=520
)
fig_scatter.show()

# --- 4) Distribution of per-sentence accuracies (STRICT) ----------------------
tool_performance = defaultdict(list)
for res in expanded_batch_results:
    gt = res.get('ground_truth', [])
    tri_map = res.get('tool_results', {})
    for tool in tools_order:
        tri = tri_map.get(tool, {})
        if not isinstance(tri, dict) or ('error' in tri):
            continue
        pred = tri.get('tags', []) or []
        toks = tri.get('tokens', []) or []
        acc = _strict_sentence_acc(gt, pred, toks)
        if acc is not None:
            tool_performance[tool].append(acc)

fig_dist = go.Figure()
for tool in tools_order:
    accs = tool_performance.get(tool, [])
    if not accs:
        continue
    fig_dist.add_trace(go.Box(
        y=accs, name=tool, boxpoints='outliers',
        hovertemplate=f"{tool}<br>Q1–Q3 / median shown<extra></extra>"
    ))
fig_dist.update_layout(
    title="Distribution of Sentence-Level Accuracies (STRICT)",
    xaxis_title="Tool",
    yaxis_title="Sentence accuracy",
    yaxis=dict(range=[0, 1], tickformat='.0%'),
    height=520
)
fig_dist.show()

print("Enhanced visualizations complete!")


Enhanced visualizations complete!


In [17]:
# ==============================================================================
# POWERFUL ADDITIONAL VISUALS: per-dataset CIs, McNemar composition, per-tag F1
# ==============================================================================

import itertools
import numpy as np
import plotly.graph_objects as go
from collections import defaultdict, Counter
from sklearn.metrics import precision_recall_fscore_support

# -- Safety: Wilson CI helper (if not in scope)
if 'wilson_confidence_interval' not in globals():
    # Import `norm` directly: `import scipy.stats as stats` would overwrite the
    # global `stats` dict built in the per-dataset visualization cell.
    from scipy.stats import norm
    def wilson_confidence_interval(successes, trials, confidence=0.95):
        if trials == 0:
            return 0.0, 0.0, 0.0
        z = norm.ppf(1 - (1 - confidence) / 2)
        p = successes / trials
        denom = 1 + z**2 / trials
        centre = (p + z**2/(2*trials)) / denom
        half = z * np.sqrt((p*(1-p) + z**2/(4*trials))/trials) / denom
        return p, max(0, centre - half), min(1, centre + half)

# -------------------------------
# 1) Per-dataset Wilson CIs
# -------------------------------
def compute_wilson_by_dataset(results):
    tool_set = set()
    for r in results:
        tool_set.update(r.get('tool_results', {}).keys())
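    # Use .get with the same 'Unknown' default as the earlier cells so that
    # records without a dataset label do not raise KeyError.
    datasets = sorted({r.get('dataset', 'Unknown') for r in results})

    out = {ds:{} for ds in datasets}
    for ds in datasets:
        for tool in sorted(tool_set):
            total_tokens = correct = sentences = 0
            for r in results:
                if r.get('dataset', 'Unknown') != ds:
                    continue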
                tri = r['tool_results'].get(tool, {})
                if not isinstance(tri, dict) or 'error' in tri:
                    continue
                gt = r['ground_truth']
                pred = tri.get('tags', [])
                toks = tri.get('tokens', [])
                m = min(len(gt), len(pred), len(toks))
                if m == 0:
                    continue
                gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt[:m]]
                correct += sum(1 for i in range(m) if gt_penn[i] == pred[i])
                total_tokens += m
                sentences += 1
            if total_tokens > 0:
                p, lo, hi = wilson_confidence_interval(correct, total_tokens)
                out[ds][tool] = dict(acc=p, lo=lo, hi=hi, tokens=total_tokens, sents=sentences)
    return out

ci_by_ds = compute_wilson_by_dataset(expanded_batch_results)

# Plot: one combined horizontal CI chart (labels "Dataset • Tool")
fig_ci_ds = go.Figure()
y_labels, x_pts, x_los, x_his = [], [], [], []
for ds in ci_by_ds:
    for tool, d in sorted(ci_by_ds[ds].items(), key=lambda kv: kv[1]['acc'], reverse=True):
        y_labels.append(f"{ds} • {tool}")
        x_pts.append(d['acc']); x_los.append(d['lo']); x_his.append(d['hi'])

for i, y in enumerate(y_labels):
    fig_ci_ds.add_trace(go.Scatter(x=[x_pts[i]], y=[y], mode='markers',
                                   marker=dict(size=10), name=y, showlegend=False))
    fig_ci_ds.add_trace(go.Scatter(x=[x_los[i], x_his[i]], y=[y, y], mode='lines',
                                   line=dict(width=3), showlegend=False))

fig_ci_ds.update_layout(
    title="Token Accuracy with 95% Wilson CIs by Dataset (STRICT)",
    xaxis_title="Token Accuracy",
    yaxis_title="Dataset • Tool",
    xaxis=dict(range=[0.45, 0.65]),
    height=400 + 20*len(y_labels)
)
fig_ci_ds.show()

# ---------------------------------------------------------
# 2) McNemar composition bars (interpretable disagreements)
# ---------------------------------------------------------
def build_mcnemar_table(tool_a, tool_b, results):
    bc = bw = a_cbw = b_caw = 0
    for r in results:
        gt = r['ground_truth']
        A = r['tool_results'].get(tool_a, {})
        B = r['tool_results'].get(tool_b, {})
        if 'error' in A or 'error' in B:
            continue
        pa, pb = A.get('tags', []), B.get('tags', [])
        ta, tb = A.get('tokens', []), B.get('tokens', [])
        m = min(len(gt), len(pa), len(pb), len(ta), len(tb))
        if m == 0:
            continue
        gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt[:m]]
        for i in range(m):
            a_ok = (gt_penn[i] == pa[i]); b_ok = (gt_penn[i] == pb[i])
            if a_ok and b_ok: bc += 1
            elif (not a_ok) and (not b_ok): bw += 1
            elif a_ok and (not b_ok): a_cbw += 1
            else: b_caw += 1
    return [[bc, a_cbw],[b_caw, bw]]

# Build a stacked horizontal bar per pair
tools_for_pairs = sorted({t for r in expanded_batch_results for t in r.get('tool_results', {}).keys()})
pair_labels = []
BC = []; A_C_B_W = []; B_C_A_W = []; BW = []

for a, b in itertools.combinations(tools_for_pairs, 2):
    t = build_mcnemar_table(a, b, expanded_batch_results)
    pair_labels.append(f"{a} vs {b}")
    BC.append(t[0][0]); A_C_B_W.append(t[0][1]); B_C_A_W.append(t[1][0]); BW.append(t[1][1])

fig_mcnemar = go.Figure()
fig_mcnemar.add_trace(go.Bar(y=pair_labels, x=BC, name="Both correct", orientation='h'))
fig_mcnemar.add_trace(go.Bar(y=pair_labels, x=A_C_B_W, name="Only A correct", orientation='h'))
fig_mcnemar.add_trace(go.Bar(y=pair_labels, x=B_C_A_W, name="Only B correct", orientation='h'))
fig_mcnemar.add_trace(go.Bar(y=pair_labels, x=BW, name="Both wrong", orientation='h'))

fig_mcnemar.update_layout(
    barmode='stack',
    title="Paired Outcomes per Token (McNemar table visualisation)",
    xaxis_title="Token count",
    yaxis_title="Tool pair",
    height=300 + 30*len(pair_labels)
)
fig_mcnemar.show()

# --------------------------------------------
# 3) Per-tag F1 (top support, excluding 'UNK')
# --------------------------------------------
def per_tool_ytrue_ypred(results):
    per_tool = defaultdict(lambda: {'y_true': [], 'y_pred': []})
    for r in results:
        gt = r['ground_truth']
        for tool, tri in r.get('tool_results', {}).items():
            if 'error' in tri:
                continue
            pred = tri.get('tags', [])
            toks = tri.get('tokens', [])
            m = min(len(gt), len(pred), len(toks))
            if m == 0:
                continue
            gt_penn = [convert_claws_to_penn(t, strict=True) for t in gt[:m]]
            per_tool[tool]['y_true'].extend(gt_penn)
            per_tool[tool]['y_pred'].extend(pred[:m])
    return per_tool

per_tool_arrays = per_tool_ytrue_ypred(expanded_batch_results)

# Define label set and supports (pooled gold)
label_support = Counter()
for d in per_tool_arrays.values():
    label_support.update(d['y_true'])
# exclude UNK for readability
labels_ranked = [lbl for lbl, _ in label_support.most_common() if lbl != 'UNK']
top_labels = labels_ranked[:10]  # slicing already handles lists shorter than 10

fig_f1 = go.Figure()
for tool, d in sorted(per_tool_arrays.items()):
    y_true = np.array(d['y_true'])
    y_pred = np.array(d['y_pred'])
    if y_true.size == 0:
        continue
    # compute per-class metrics restricted to top_labels
    p, r, f1, support = precision_recall_fscore_support(
        y_true, y_pred, labels=top_labels, average=None, zero_division=0
    )
    fig_f1.add_trace(go.Bar(x=top_labels, y=f1, name=tool))

fig_f1.update_layout(
    title="Per-tag F1 by Tool (Top 10 Gold Labels, STRICT; UNK excluded)",
    xaxis_title="Penn tag",
    yaxis_title="F1",
    yaxis=dict(range=[0, 1]),
    barmode='group',
    height=450
)
fig_f1.show()

print("Additional visuals generated: per-dataset CIs (fig_ci_ds), McNemar composition (fig_mcnemar), per-tag F1 (fig_f1).")


AttributeError: 'dict' object has no attribute 'norm'
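
The 2×2 tables returned by `build_mcnemar_table` are laid out as `[[both correct, only A correct], [only B correct, both wrong]]`. As a minimal sketch of how such a table could feed McNemar's test with Holm–Bonferroni adjustment, the block below uses only the standard library. The helper names `mcnemar_from_table` and `holm_adjust` are hypothetical, and the cut-off of 25 discordant pairs mirrors the rule stated in the dashboard's methods list; this is not necessarily how the notebook's own McNemar cell is written.

```python
from math import comb, erfc, sqrt

def mcnemar_from_table(table, exact_threshold=25):
    """McNemar's test on [[both_ok, a_only], [b_only, both_wrong]] (hypothetical helper)."""
    b, c = table[0][1], table[1][0]          # discordant cells
    n = b + c
    if n == 0:                               # no disagreements: nothing to test
        return {"discordant": 0, "exact": True, "statistic": 0.0, "p_value": 1.0}
    if n < exact_threshold:
        # Exact two-sided binomial test with success probability 0.5
        tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
        return {"discordant": n, "exact": True,
                "statistic": float(min(b, c)), "p_value": min(1.0, 2 * tail)}
    # Continuity-corrected chi-square with 1 degree of freedom;
    # for df = 1 the survival function equals erfc(sqrt(x / 2)).
    stat = (abs(b - c) - 1) ** 2 / n
    return {"discordant": n, "exact": False,
            "statistic": stat, "p_value": erfc(sqrt(stat / 2))}

def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Illustrative counts only, not results from this study
result = mcnemar_from_table([[50, 8], [2, 40]])
print(result["discordant"], result["exact"], result["p_value"])
```

Only the two discordant cells enter the test, which is why the stacked bars above are most informative in their "Only A correct" and "Only B correct" segments.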

In [None]:
# ==============================================================================
# SAVE VISUALIZATIONS + RICH DASHBOARD (STRICT, pooled Dubliners + BNC)
# Includes: more results sections + expanded methods + CLAWS C7 note
# ==============================================================================

import os, zipfile
from datetime import datetime
from math import sqrt
import numpy as np
import scipy.stats as stats

def _exists(name):  # object exists and is not None
    return name in globals() and globals()[name] is not None

assert 'wilson_results' in globals() and wilson_results, "wilson_results not found. Run the CI cell first."

# ---------------- Core stats for dashboard header ----------------
_tool_stats = {k: v for k, v in wilson_results.items() if v.get('total_tokens', 0) > 0}
assert _tool_stats, "No tool in wilson_results has evaluated tokens."
tools_sorted = sorted(_tool_stats.keys(), key=lambda t: _tool_stats[t]['token_accuracy'], reverse=True)
best_tool = tools_sorted[0]
best = _tool_stats[best_tool]
sentences_evaluated = max(s['total_sentences'] for s in _tool_stats.values())

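def _proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p).

    Treats the two samples as independent. The tools here share tokens,
    so McNemar's test is the paired analogue.
    """
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    if se == 0:  # pooled proportion is 0 or 1: no variance to test against
        return 0.0, 1.0
    z = (x1/n1 - x2/n2) / se
    p = 2 * stats.norm.sf(abs(z))  # sf avoids precision loss for large |z|
    return z, p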

def _cohens_h(p1, p2):
    return 2 * (np.arcsin(sqrt(p1)) - np.arcsin(sqrt(p2)))

worst_tool = tools_sorted[-1]
worst = _tool_stats[worst_tool]
z_stat, p_val = _proportion_z_test(best['correct_tokens'], best['total_tokens'],
                                   worst['correct_tokens'], worst['total_tokens'])
h = _cohens_h(best['token_accuracy'], worst['token_accuracy'])
h_mag = ("negligible" if abs(h) < 0.2 else
         "small" if abs(h) < 0.5 else
         "medium" if abs(h) < 0.8 else "large")

best_acc = best['token_accuracy']
best_lo  = best['token_ci_lower']
best_hi  = best['token_ci_upper']
best_perfect = best['perfect_rate']
best_perfect_hi = best['perfect_ci_upper']

# ---------------- Pairwise Newcombe–Wilson diffs -----------------
if 'newcombe_wilson_diff' not in globals():
    def wilson_interval(successes, n, confidence=0.95):
        if n == 0: return (0.0, 0.0, 0.0)
        z = stats.norm.ppf(1 - (1 - confidence)/2)
        p = successes / n
        denom = 1 + z*z/n
        centre = (p + z*z/(2*n)) / denom
        half = z * np.sqrt((p*(1-p) + z*z/(4*n))/n) / denom
        return (p, max(0.0, centre - half), min(1.0, centre + half))

    def newcombe_wilson_diff(x1, n1, x2, n2, confidence=0.95):
        p1, L1, U1 = wilson_interval(x1, n1, confidence)
        p2, L2, U2 = wilson_interval(x2, n2, confidence)
        diff = p1 - p2
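        # Newcombe's Method 10 (hybrid score): combine the Wilson half-widths
        # in quadrature rather than subtracting the interval endpoints directly.
        lo = diff - np.sqrt((p1 - L1)**2 + (U2 - p2)**2)
        hi = diff + np.sqrt((U1 - p1)**2 + (p2 - L2)**2)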
        return diff, lo, hi, (p1, L1, U1), (p2, L2, U2)

pair_rows = []
for i in range(len(tools_sorted)):
    for j in range(i+1, len(tools_sorted)):
        a, b = tools_sorted[i], tools_sorted[j]
        A, B = _tool_stats[a], _tool_stats[b]
        d, lo, hi, p1info, p2info = newcombe_wilson_diff(
            A['correct_tokens'], A['total_tokens'], B['correct_tokens'], B['total_tokens']
        )
        pair_rows.append((a, b, d, lo, hi))

# -------------- Optional: per-dataset CI table data ---------------
have_ci_by_ds = _exists('ci_by_ds') and bool(ci_by_ds)
if have_ci_by_ds:
    # flatten for a small table
    per_ds_rows = []
    for ds, d in ci_by_ds.items():
        for tool, vals in d.items():
            per_ds_rows.append((ds, tool, vals['acc'], vals['lo'], vals['hi'], vals['tokens'], vals['sents']))
    # sort by dataset then descending acc
    per_ds_rows.sort(key=lambda x: (x[0], -x[2]))

# -------------- Optional: McNemar outcomes table data -------------
have_mcnemar = _exists('mcnemar_outcomes') and _exists('adj_pvals') and len(mcnemar_outcomes) == len(adj_pvals)
if have_mcnemar:
    # prepare rows: a, b, table, discordant, test, chi2, p, adj
    mcn_rows = []
    for o, adjp in zip(mcnemar_outcomes, adj_pvals):
        a = o['tool_a']; b = o['tool_b']; t = o['table']
        test_name = 'exact' if o.get('exact') else 'chi²'
        mcn_rows.append((a, b, t, o['discordant'], test_name, o['chi2'], o['p_value'], float(adjp)))

# -------------- Optional: CLAWS C7 UNK share & top unmappables ----
have_strict_report = _exists('strict_report') and isinstance(strict_report, dict) and 'unk_rate' in strict_report
unk_share_pct = f"{strict_report['unk_rate']*100:.1f}%" if have_strict_report else None
top_unks = None
if have_strict_report:
    # top 10 unmappable CLAWS sources
    from collections import Counter
    # Coerce to Counter so a plain dict of counts also supports most_common()
    cnts = Counter(strict_report.get('counts', {}))
    top_unks = cnts.most_common(10)

# ---------------- Save figures to disk --------------------------------
results_dir = "nlp_validation_results"
os.makedirs(results_dir, exist_ok=True)
print(f"Saving visualizations to {results_dir}/ directory...")

saved = []
def _save(fig_var, filename):
    if _exists(fig_var):
        globals()[fig_var].write_html(
            f"{results_dir}/{filename}",
            config={'displayModeBar': True, 'displaylogo': False}
        )
        saved.append(filename)

# Existing
_save('fig_ci',        "confidence_intervals.html")
_save('fig_heatmap',   "error_heatmap.html")
_save('fig_scatter',   "sentence_length_analysis.html")
_save('fig_dist',      "accuracy_distributions.html")
_save('fig',           "accuracy_comparison.html")

# New visuals included earlier
_save('fig_ci_ds',     "per_dataset_confidence_intervals.html")
_save('fig_mcnemar',   "mcnemar_composition.html")
_save('fig_f1',        "per_tag_f1.html")

# ---------------- Build dashboard HTML ----------------------------
now_str = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

def _link(fn, label):
    return f'<a href="{fn}" class="chart-link">{label}</a>\n' if fn in saved else ''

dashboard_html = f"""<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <title>NLP POS Tagging Validation — Dubliners + BNC (STRICT)</title>
    <style>
        body {{ font-family: Arial, sans-serif; margin: 20px; background-color: #f5f5f5; }}
        .header {{ background-color: #2c3e50; color: white; padding: 20px; margin-bottom: 20px; border-radius: 6px; }}
        .summary, .chart-container {{ background-color: white; padding: 20px; margin-bottom: 20px; border-radius: 6px; }}
        .chart-link {{ display: inline-block; background-color: #3498db; color: white; padding: 10px 16px;
                       text-decoration: none; border-radius: 5px; margin: 5px 8px 0 0; }}
        .chart-link:hover {{ background-color: #2980b9; }}
        .stats-table {{ width: 100%; border-collapse: collapse; margin-top: 10px; }}
        .stats-table th, .stats-table td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}
        .stats-table th {{ background-color: #f2f2f2; }}
        .muted {{ color: #666; }}
        .small {{ font-size: 0.95em; }}
        code {{ background: #f0f0f0; padding: 1px 4px; border-radius: 3px; }}
    </style>
</head>
<body>
    <div class="header">
        <h1>NLP POS Tagging Validation — Dubliners + BNC (STRICT)</h1>
        <p>spaCy (sm, lg), Flair, and NLTK vs. CLAWS7 gold after strict CLAWS→Penn projection</p>
        <p class="muted">Analysis Date: {now_str}</p>
    </div>

    <div class="summary">
        <h2>Executive Summary</h2>
        <p><strong>Design:</strong> Token-level evaluation across {sentences_evaluated} sentences, pooled over Dubliners + BNC.</p>
        <ul>
            <li><strong>Best observed accuracy:</strong> {best_tool} = {best_acc:.3f} (95% CI {best_lo:.3f}–{best_hi:.3f}).</li>
            <li><strong>Perfect-sentence rate (illustrative):</strong> ≈ {best_perfect:.1%} (upper 95% CI ≈ {best_perfect_hi:.1%}).</li>
            <li><strong>Best vs worst:</strong> z = {z_stat:.3f}, p = {p_val:.3g}; Cohen’s h = {h:.3f} ({h_mag}).</li>
        </ul>
        <p class="muted small">Strict projection maps CLAWS distinctions lacking Penn equivalents to <code>UNK</code>, capping achievable accuracy.</p>
    </div>

    <div class="summary">
        <h2>Token Accuracy (STRICT)</h2>
        <table class="stats-table">
            <tr>
                <th>Tool</th>
                <th>Token Accuracy</th>
                <th>95% CI Lower</th>
                <th>95% CI Upper</th>
                <th>Tokens</th>
                <th>Sentences</th>
                <th>Perfect Sentence Rate</th>
            </tr>
"""

for tool_name in tools_sorted:
    s = _tool_stats[tool_name]
    dashboard_html += f"""
            <tr>
                <td><strong>{tool_name}</strong></td>
                <td>{s['token_accuracy']:.3f}</td>
                <td>{s['token_ci_lower']:.3f}</td>
                <td>{s['token_ci_upper']:.3f}</td>
                <td>{s['total_tokens']:,}</td>
                <td>{s['total_sentences']}</td>
                <td>{s['perfect_rate']:.1%}</td>
            </tr>
"""

dashboard_html += """
        </table>
    </div>
"""

# --------- Per-dataset CI table section (if available) ----------
if have_ci_by_ds and per_ds_rows:
    dashboard_html += """
    <div class="summary">
        <h2>Per-Dataset Token Accuracy (STRICT)</h2>
        <table class="stats-table">
            <tr><th>Dataset</th><th>Tool</th><th>Accuracy</th><th>95% CI Lower</th><th>95% CI Upper</th><th>Tokens</th><th>Sentences</th></tr>
    """
    for ds, tool, acc, lo, hi, toks, sents in per_ds_rows:
        dashboard_html += f"<tr><td>{ds}</td><td>{tool}</td><td>{acc:.3f}</td><td>{lo:.3f}</td><td>{hi:.3f}</td><td>{toks:,}</td><td>{sents}</td></tr>\n"
    dashboard_html += """
        </table>
        <p class="muted small">Wilson score intervals per dataset; useful for checking consistency across Dubliners and BNC.</p>
    </div>
    """

# --------- Pairwise Newcombe–Wilson diffs table -----------------
if pair_rows:
    dashboard_html += """
    <div class="summary">
        <h2>Pairwise Accuracy Differences (Newcombe–Wilson 95% CIs)</h2>
        <table class="stats-table">
            <tr><th>Comparison</th><th>Δ (A−B)</th><th>95% CI Lower</th><th>95% CI Upper</th><th>Interpretation</th></tr>
    """
    for a, b, d, lo, hi in pair_rows:
        interp = "A>B (CI excludes 0)" if lo > 0 else ("B>A (CI excludes 0)" if hi < 0 else "No clear difference")
        dashboard_html += f"<tr><td>{a} − {b}</td><td>{d:.3f}</td><td>{lo:.3f}</td><td>{hi:.3f}</td><td>{interp}</td></tr>\n"
    dashboard_html += """
        </table>
    </div>
    """

# --------- McNemar outcomes summary (if available) --------------
if have_mcnemar and mcn_rows:
    dashboard_html += """
    <div class="summary">
        <h2>Paired Token Outcomes (McNemar)</h2>
        <table class="stats-table">
            <tr>
                <th>Pair</th>
                <th>[both correct, A correct &amp; B wrong]</th>
                <th>[B correct &amp; A wrong, both wrong]</th>
                <th>Discordant</th>
                <th>Test</th>
                <th>χ²</th>
                <th>p</th>
                <th>Holm-adj p</th>
            </tr>
    """
    for a, b, t, disc, test_name, chi2, p, adjp in mcn_rows:
        t1 = f"[{t[0][0]}, {t[0][1]}]"
        t2 = f"[{t[1][0]}, {t[1][1]}]"
        dashboard_html += f"<tr><td>{a} vs {b}</td><td>{t1}</td><td>{t2}</td><td>{disc}</td><td>{test_name}</td><td>{chi2:.3f}</td><td>{p:.3g}</td><td>{adjp:.3g}</td></tr>\n"
    dashboard_html += """
        </table>
        <p class="muted small">“Only A correct” and “Only B correct” are the discordant cells that drive McNemar’s test.</p>
    </div>
    """

# --------- CLAWS C7 note + live UNK share -----------------------
dashboard_html += """
    <div class="summary">
        <h2>About the CLAWS C7 Tagset</h2>
        <p>CLAWS C7 encodes fine-grained morphosyntactic distinctions not present in Penn Treebank tags, including article subtypes (<code>AT</code>, <code>AT1</code>), rich pronoun categories by person/number/case (e.g., <code>PPHS1</code> vs <code>PPIS1</code>), auxiliary identity (<code>VB</code>/<code>VH</code>/<code>VD</code> families), preposition subtypes (<code>IO</code>, <code>IW</code>), and semantic noun subclasses (e.g., months/days as <code>NPM*</code>/<code>NPD*</code>). Under <em>strict</em> projection, distinctions that lack Penn equivalents are mapped to <code>UNK</code> to ensure label comparability with modern taggers.</p>
"""

if have_strict_report and unk_share_pct:
    dashboard_html += f"""
        <p><strong>UNK share under strict projection:</strong> {unk_share_pct} of gold tokens.</p>
    """
    if top_unks:
        dashboard_html += """
        <table class="stats-table">
            <tr><th>Top unmappable CLAWS tags</th><th>Count</th></tr>
        """
        for tag, c in top_unks:
            dashboard_html += f"<tr><td>{tag}</td><td>{c}</td></tr>\n"
        dashboard_html += "</table>\n"

dashboard_html += """
        <p class="muted small">This preserves a fair evaluation against Penn-style outputs but imposes a ceiling on achievable accuracy.</p>
    </div>
"""

# --------- Visual links (no extra autodiscovery) -----------------
dashboard_html += """
    <div class="chart-container">
        <h2>Interactive Visualizations</h2>
        <p>Click to open:</p>
"""
dashboard_html += _link("confidence_intervals.html",            "Overall CIs")
dashboard_html += _link("per_dataset_confidence_intervals.html", "Per-dataset CIs")
dashboard_html += _link("mcnemar_composition.html",             "McNemar Composition")
dashboard_html += _link("per_tag_f1.html",                      "Per-tag F1 (Top labels)")
dashboard_html += _link("error_heatmap.html",                   "Error Heatmap")
dashboard_html += _link("sentence_length_analysis.html",        "Length vs Accuracy")
dashboard_html += _link("accuracy_distributions.html",          "Accuracy Distributions")
dashboard_html += _link("accuracy_comparison.html",             "Tool Comparison")
dashboard_html += """
    </div>
"""

# --------- Expanded Statistical Methods -------------------------
dashboard_html += """
    <div class="summary">
        <h2>Statistical Methods</h2>
        <ul class="small">
            <li><strong>Strict CLAWS→Penn projection:</strong> Gold CLAWS7 tags are mapped to Penn tags; distinctions without Penn equivalents are assigned <code>UNK</code> to maintain label comparability with modern taggers.</li>
            <li><strong>Accuracy CIs:</strong> Wilson score intervals for single proportions (token accuracy; perfect-sentence rate). Preferred over Wald intervals for their better coverage, especially with small samples or proportions near 0 or 1.</li>
            <li><strong>Pairwise accuracy differences:</strong> Newcombe’s Method 10 (1998) using Wilson intervals for each proportion; CIs reported for Δ = p<sub>A</sub> − p<sub>B</sub>.</li>
            <li><strong>Best vs worst significance:</strong> Pooled two-proportion z-test plus effect size Cohen’s h for magnitude (negligible/small/medium/large).</li>
            <li><strong>Paired disagreements:</strong> McNemar’s test on discordant token outcomes (exact test when discordant &lt; 25, otherwise continuity-corrected χ²), with Holm–Bonferroni correction for multiple comparisons.</li>
            <li><strong>Bootstrap by sentence:</strong> Cluster bootstrap (resampling sentences) for 95% CIs on accuracy differences, preserving within-sentence token dependence.</li>
            <li><strong>F1 reporting:</strong> Micro, macro, and weighted F1 computed on Penn labels; per-tag F1 reported for the top gold labels (excluding <code>UNK</code> for readability).</li>
            <li><strong>Alignment:</strong> Token comparisons are truncated to the per-sentence minimum length across gold tags, tool tokens, and tool tags, which limits spurious mismatches caused by tokenizer differences.</li>
        </ul>
    </div>
</body>
</html>
"""

# ---------------- Write dashboard & zip --------------------------
results_dir = "nlp_validation_results"
with open(f"{results_dir}/index.html", "w", encoding="utf-8") as f:
    f.write(dashboard_html)

zip_filename = f"nlp_validation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.zip"
with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:  # compress; the default is store-only
    for root, dirs, files in os.walk(results_dir):
        for file in files:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, results_dir)
            zipf.write(file_path, arcname)

print("Dashboard updated and all visualizations saved!")
print(f"Files saved to: {results_dir}/")
print(f"Zip archive created: {zip_filename}")

# If on Colab, prompt download
try:
    from google.colab import files as colab_files
    colab_files.download(zip_filename)
except Exception:
    pass


Saving visualizations to nlp_validation_results/ directory...
Dashboard updated and all visualizations saved!
Files saved to: nlp_validation_results/
Zip archive created: nlp_validation_results_20250824_191116.zip
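
The dashboard's methods list names Newcombe's Method 10 for pairwise accuracy differences. As a rough standalone sketch of that hybrid score interval, the block below combines the two Wilson intervals in quadrature. The function names `wilson` and `newcombe_diff` and the counts in the usage line are illustrative assumptions, not values from this study.

```python
from math import sqrt
from statistics import NormalDist

def wilson(successes, n, confidence=0.95):
    """Wilson score interval; returns (p_hat, lower, upper)."""
    if n == 0:
        return 0.0, 0.0, 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)

def newcombe_diff(x1, n1, x2, n2, confidence=0.95):
    """Newcombe hybrid score interval for p1 - p2; returns (diff, lower, upper)."""
    p1, l1, u1 = wilson(x1, n1, confidence)
    p2, l2, u2 = wilson(x2, n2, confidence)
    diff = p1 - p2
    # Each bound pairs the near side of one Wilson interval with the
    # far side of the other, combined as a root sum of squares.
    lower = diff - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return diff, lower, upper

# Illustrative counts only: 56/70 correct versus 48/80 correct
diff, lower, upper = newcombe_diff(56, 70, 48, 80)
print(f"diff={diff:.3f}, 95% CI=({lower:.4f}, {upper:.4f})")
```

The resulting interval is narrower than one obtained by subtracting the Wilson endpoints directly (`L1 - U2`, `U1 - L2`), because the two sampling errors are unlikely to reach their extremes in opposite directions at the same time. Like the pooled z-test, it assumes the two proportions come from independent samples.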

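The methods list also mentions a cluster bootstrap that resamples sentences rather than tokens. A minimal sketch of that idea, assuming per-sentence counts have already been collected as `(correct_a, correct_b, n_tokens)` tuples, could look like the following. The function name, the tuple layout, the percentile method, and the example counts are all assumptions made for illustration.

```python
import random

def cluster_bootstrap_diff(per_sentence, n_boot=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B), resampling whole sentences.

    per_sentence: list of (correct_a, correct_b, n_tokens), one tuple per sentence.
    Returns (observed_diff, lower, upper).
    """
    total = sum(n for _, _, n in per_sentence)
    if total == 0:
        raise ValueError("per_sentence must contain at least one token")
    observed = sum(a - b for a, b, _ in per_sentence) / total

    rng = random.Random(seed)
    k = len(per_sentence)
    diffs = []
    for _ in range(n_boot):
        # Draw k sentences with replacement; tokens stay grouped by sentence,
        # so within-sentence dependence is preserved in every replicate.
        sample = [per_sentence[rng.randrange(k)] for _ in range(k)]
        tokens = sum(n for _, _, n in sample)
        if tokens == 0:
            continue
        diffs.append(sum(a - b for a, b, _ in sample) / tokens)

    diffs.sort()
    alpha = 1 - confidence
    lower = diffs[int((alpha / 2) * (len(diffs) - 1))]
    upper = diffs[int((1 - alpha / 2) * (len(diffs) - 1))]
    return observed, lower, upper

# Illustrative per-sentence counts only, not data from this study
example = [(8, 6, 10), (5, 5, 7), (9, 7, 12), (3, 4, 6)]
print(cluster_bootstrap_diff(example))
```

Resampling at the sentence level typically yields wider intervals than token-level formulas when errors cluster within sentences, which is the motivation for reporting it alongside the Wilson-based intervals.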
