<a href="https://colab.research.google.com/github/mahb97/joyce-dubliners-similes-analysis/blob/main/02_linguistic_analysis_and_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Joyce Simile Research: Comprehensive Linguistic Analysis and Comparison Framework

# Abstract

This notebook presents a reproducible computational framework for analysing similes in James Joyce’s *Dubliners*, evaluating automatic extractors against expert annotation, and comparing category distributions with a British National Corpus (BNC) baseline. Four datasets are integrated under a harmonised taxonomy: manual close-reading annotations (ground truth), a restrictive rule-based extractor, a less-restrictive NLP extractor, and the BNC baseline. The taxonomy merges *Joycean_Quasi* with the BNC’s *Quasi_Similes* into a single label, **Quasi_Similes**, to avoid double-counting. The pipeline performs sentence-level linguistic profiling (comparator-span detection, pre/post-comparator structure, POS distributions, syntactic complexity, exploratory sentiment) and evaluates extractors via **instance-aligned F1** against the manual set, pairing sentences by exact match first and then by fuzzy match (similarity ratio ≥ 0.92).

Statistical inference combines a 4-way χ² with standardised residuals and BH-FDR control, two-proportion tests (Newcombe CIs, Cohen’s *h*), binomial checks against BNC reference proportions, and continuous-feature comparisons (Welch *t*, Mann–Whitney *U*, Hedges’ *g*, Cliff’s δ). Topic modelling (LDA) summarises thematic variation per subset.

Results show a strong categorical association across corpora (**χ² ≈ 281.88**, **df = 15**, **Cramér’s V ≈ 0.318**, *p* ≪ .001). **Quasi\_Similes** are more prevalent in **BNC** (≈ 41 %) than in **Joyce-Manual** (≈ 29 %), while the Joyce data are enriched in **Joycean\_Framed**, **Joycean\_Quasi\_Fuzzy**, and **Joycean\_Silent** (≈ 20 % combined in the manual set). The less-restrictive NLP subset is predominantly **Standard** and serves as a control, so the observed distributional contrasts are not driven by it. Instance-aligned evaluation shows only partial recovery of expert labels (Rule-Based vs Manual: micro-F1 ≈ **0.343**, macro-F1 ≈ **0.178**; NLP vs Manual: micro-F1 ≈ **0.292**, macro-F1 ≈ **0.059**), motivating targeted improvements to the extraction patterns and dependency rules.




# 6. Comprehensive Linguistic Analysis Framework

## 6.1 Multi-Dataset Integration
The pipeline ingests four sources and standardises them to a common schema with stable IDs and explicit provenance:
- **Manual_CloseReading** (expert annotations; ground truth)
- **Restrictive_Dubliners** (rule-based extractor)
- **NLP_LessRestrictive_PG** (broad pattern extractor; control—predominantly *Standard*)
- **BNC_Baseline** (standard-English reference)

Column names are normalised, missing IDs are repaired, and categories are **harmonised** so that *Joycean_Quasi* and the BNC’s *Quasi_Similes* are merged into a single label **Quasi_Similes**. This avoids double-counting and supports fair cross-corpus comparisons. All intermediate artefacts (CSV/JSON) are written with timestamps for reproducibility.
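
The label merge described above can be illustrated with a minimal, standard-library sketch. The mapping here is abbreviated and the helper name `harmonise_label` is hypothetical; the notebook's own mapping table covers more variants.

```python
from collections import Counter

# Abbreviated, illustrative mapping (an assumption, not the full table).
CANONICAL = {
    "Joycean_Quasi": "Quasi_Similes",
    "Joycean-Quasi": "Quasi_Similes",
    "Quasi_Simile": "Quasi_Similes",
}


def harmonise_label(label: str) -> str:
    """Return the canonical label; unknown labels pass through unchanged."""
    label = label.strip()
    return CANONICAL.get(label, label)


# Invented label lists standing in for two corpora.
joyce_labels = ["Joycean_Quasi", "Joycean_Framed", "Joycean-Quasi"]
bnc_labels = ["Quasi_Similes", "Standard"]

# After harmonisation both corpora share one quasi-simile label,
# so cross-corpus counts are taken over a single category.
merged = Counter(harmonise_label(x) for x in joyce_labels + bnc_labels)
print(merged)
```

Passing unknown labels through unchanged means an unmapped spelling stays visible in the counts instead of being silently recoded.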

## 6.2 Advanced Linguistic Feature Extraction
Using spaCy (with a robust fallback), the framework derives sentence-level linguistic features tailored to simile analysis:
- **Comparator span detection** for `like`, `as`, **as if/as though**, **as … as**, and lemma families (**seem\***, **resembl\***), plus punctuation-mediated comparators (**colon, semicolon, ellipsis, dash**).
- **Pre/Post comparator structure**: token counts on each side and **Pre_Post_Ratio**.
- **Syntactic complexity** via dependency depth; **POS distribution** and **lemmatised text**.
- **Figurative density** from comparator/marker hits.
- **Exploratory sentiment** (TextBlob polarity/subjectivity).

If spaCy is unavailable, the fallback computes only token counts, comparator positions, and sentiment.
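
As a rough illustration of the pre/post comparator structure, the sketch below counts word tokens on either side of the first comparator occurrence using a plain regular-expression tokeniser. This is a simplification under stated assumptions: the function name `pre_post_ratio` and the example sentence are invented for illustration, and the notebook itself works on spaCy tokens when spaCy is available.

```python
import re


def pre_post_ratio(sentence: str, comparator: str):
    """Count word tokens before and after the first comparator occurrence.

    Returns (pre, post, ratio). All three are None when the comparator is
    empty or absent; ratio alone is None when nothing follows the comparator.
    """
    tokens = re.findall(r"\w+(?:'\w+)?", sentence.lower())
    comp = comparator.lower().split()
    n = len(comp)
    if n == 0:
        return None, None, None
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == comp:
            pre = i                      # tokens before the comparator
            post = len(tokens) - (i + n)  # tokens after the whole comparator
            return pre, post, (pre / post if post else None)
    return None, None, None


# Invented sentence (not quoted from Dubliners).
print(pre_post_ratio("He felt as if he were drifting out to sea", "as if"))
```

Treating a multiword comparator such as `as if` as a single span keeps its second word out of the post-comparator count.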

## 6.3 Performance Validation
Extractor performance is evaluated **instance-by-instance** against the manual set using **sentence alignment**:
- Exact match first, then **fuzzy matching (≥ 0.92)** to pair sentences.
- Per-category **TP/FP/FN** and **micro/macro F1** are computed from the aligned pairs, which is more faithful than comparing aggregate category counts.
- Statistical testing (reported in Section 8): **4-way χ²** with standardised residuals and BH-FDR; **two-proportion tests** (Newcombe CIs, Cohen’s *h*), **binomial checks** vs BNC, and **continuous** comparisons (Welch *t*, Mann–Whitney *U*, Hedges’ *g*, Cliff’s δ).

These procedures emphasise **effect sizes and error control**, and remain informative even when the less-restrictive NLP set collapses to *Standard*.
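
The alignment-then-scoring idea can be sketched with the standard library alone. The counting convention is an assumption made for illustration: a matched pair with equal labels is a true positive, a matched pair with different labels is a false negative for the gold label and a false positive for the predicted label, and unmatched rows count as false negatives (gold) or false positives (predicted). The function names and example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def _norm(s):
    """Lowercase and collapse whitespace before comparing sentences."""
    return " ".join(s.lower().split())


def align(gold, pred, threshold=0.92):
    """Pair (sentence, label) rows: exact matches first, then best fuzzy match."""
    pairs, used = [], set()
    for i, (gs, _) in enumerate(gold):
        for j, (ps, _) in enumerate(pred):
            if j not in used and _norm(gs) == _norm(ps):
                pairs.append((i, j))
                used.add(j)
                break
    matched_gold = {i for i, _ in pairs}
    for i, (gs, _) in enumerate(gold):
        if i in matched_gold:
            continue
        best_j, best_r = None, 0.0
        for j, (ps, _) in enumerate(pred):
            if j in used:
                continue
            r = SequenceMatcher(None, _norm(gs), _norm(ps)).ratio()
            if r > best_r:
                best_j, best_r = j, r
        if best_j is not None and best_r >= threshold:
            pairs.append((i, best_j))
            used.add(best_j)
    return pairs


def f1_scores(gold, pred, pairs):
    """Return (micro_f1, macro_f1) under the convention stated above."""
    labels = sorted({l for _, l in gold} | {l for _, l in pred})
    tp = dict.fromkeys(labels, 0)
    fp = dict.fromkeys(labels, 0)
    fn = dict.fromkeys(labels, 0)
    for i, j in pairs:
        g, p = gold[i][1], pred[j][1]
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1
            fp[p] += 1
    matched_g = {i for i, _ in pairs}
    matched_p = {j for _, j in pairs}
    for i, (_, g) in enumerate(gold):
        if i not in matched_g:
            fn[g] += 1
    for j, (_, p) in enumerate(pred):
        if j not in matched_p:
            fp[p] += 1

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro


# Invented rows for illustration only.
gold = [("She moved like a shadow.", "Standard"),
        ("He seemed a ghost: pale, still.", "Joycean_Silent")]
pred = [("she moved  like a shadow.", "Standard"),
        ("The room was as cold as a tomb.", "Standard")]

pairs = align(gold, pred)
micro, macro = f1_scores(gold, pred, pairs)
print(pairs, round(micro, 3), round(macro, 3))
```

Because macro-F1 averages over categories, a rare category that is never recovered pulls it well below micro-F1, which is the pattern the reported scores show.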



In [6]:
# =============================================================================
# COMPREHENSIVE LINGUISTIC COMPARISON OF FOUR SIMILE DATASETS
# =============================================================================

import os
import re
import json
import warnings
from collections import Counter
from difflib import SequenceMatcher
from datetime import datetime
import random
import sys

import numpy as np
import pandas as pd

# Plot libs not used in this cell but kept for notebook continuity
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

# Optional NLP libs
try:
    import spacy
except Exception:
    spacy = None

try:
    from textblob import TextBlob
except Exception:
    TextBlob = None

print("COMPREHENSIVE LINGUISTIC COMPARISON OF FOUR SIMILE DATASETS (FIXED)")
print("=" * 75)
print("Dataset 1: Manual Annotations (Ground Truth - Close Reading)")
print("Dataset 2: Rule-Based Extraction (Restrictive - Domain-Informed)")
print("Dataset 3: NLP Extraction (Less-Restrictive - PG Dubliners)")
print("Dataset 4: BNC Baseline Corpus (Standard English Reference)")
print("=" * 75)

# Initialize spaCy if available
nlp = None
if spacy is not None:
    try:
        nlp = spacy.load("en_core_web_sm")
        print("spaCy pipeline loaded: en_core_web_sm")
    except OSError:
        print("spaCy model not found; attempting to download…")
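        # Use the running interpreter so the model installs into this environment
        import subprocess
        subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], check=False)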
        try:
            nlp = spacy.load("en_core_web_sm")
            print("spaCy pipeline loaded after download: en_core_web_sm")
        except Exception:
            print("spaCy unavailable; analysis will use simplified methods.")

class ComprehensiveLinguisticComparator:
    """
    Full pipeline:
      - robust loading & standardisation
      - linguistic feature extraction (spaCy/TextBlob, simplified fallback)
      - category harmonisation (MERGES Joycean_Quasi into Quasi_Similes)
      - instance-aligned F1 (exact + fuzzy sentence matching)
      - reproducibility & environment stamping
      - combined CSV export with stable ordering
    """

    def __init__(self):
        self.nlp = nlp
        self.datasets = {}
        self.linguistic_features = {}
        self.comparison_results = {}

        # Reproducibility
        self.random_seed = 42
        random.seed(self.random_seed)
        np.random.seed(self.random_seed)

        # Environment info for auditability
        tb_ver = "n/a"
        try:
            import textblob as _tb
            tb_ver = getattr(_tb, "__version__", "n/a")
        except Exception:
            pass
        self.env_info = {
            "python": sys.version,
            "pandas": pd.__version__,
            "numpy": np.__version__,
            "spacy": getattr(spacy, "__version__", "n/a") if spacy is not None else "n/a",
            "textblob": tb_ver,
        }
        print("Environment:", self.env_info)

    # ---------- ID / Loading / Standardisation ----------

    def _ensure_ids(self, df, dataset_name, prefix=None):
        """
        Ensure a unique, non-null 'Instance_ID' string column exists.
        If missing, non-unique, or contains NaNs, regenerate sequential IDs with a readable prefix.
        """
        if df is None or df.empty:
            return pd.DataFrame(columns=['Instance_ID'])

        short = (prefix or {
            'manual': 'MAN',
            'rule_based': 'RST',
            'nlp': 'NLP',
            'bnc': 'BNC'
        }.get(dataset_name, dataset_name[:3].upper()))

        candidates = ['Instance_ID', 'ID', 'id', 'sentence_id', 'Sentence_ID', 'Index', 'index']
        chosen = next((c for c in candidates if c in df.columns), None)
        if chosen and chosen != 'Instance_ID':
            df = df.rename(columns={chosen: 'Instance_ID'})
        elif not chosen:
            df['Instance_ID'] = np.nan

        # Normalize and test uniqueness
        df['Instance_ID'] = df['Instance_ID'].astype(str).replace({'nan': np.nan, '': np.nan})
        needs_regen = df['Instance_ID'].isna().any() or (not df['Instance_ID'].is_unique)
        if needs_regen:
            df['Instance_ID'] = [f"{short}_{i+1:05d}" for i in range(len(df))]

        return df

    def _load_manual_dataset_robust(self, file_content):
        """Robust loader for manual annotations with long quoted Joycean sentences."""
        import csv
        try:
            df = pd.read_csv(
                file_content, encoding='cp1252', quotechar='"',
                quoting=csv.QUOTE_MINIMAL, skipinitialspace=True, engine='python'
            )
            if 'Sentence Context' in df.columns:
                df = df[df['Sentence Context'].astype(str).str.lower() != 'sentence context'].copy()
                return df
        except Exception as e:
            print(f"  pandas (python engine) failed: {e}")

        # Fallback simpler read
        try:
            df = pd.read_csv(file_content, encoding='cp1252')
            if 'Sentence Context' in df.columns:
                df = df[df['Sentence Context'].astype(str).str.lower() != 'sentence context'].copy()
                return df
        except Exception as e:
            print(f"  pandas (default) failed: {e}")

        print("  Manual annotations not found or failed to load.")
        return pd.DataFrame()

    def load_datasets(self, manual_file=None, rule_based_file=None, nlp_file=None, bnc_file=None):
        print("\nLOADING DATASETS WITH FIXED ID HANDLING & EXPLICIT LABELS")
        print("-" * 70)

        # Manual (close reading)
        print("Loading manual annotations…")
        self.datasets['manual'] = self._load_manual_dataset_robust(manual_file) if manual_file else pd.DataFrame()
        self.datasets['manual'] = self._ensure_ids(self.datasets['manual'], 'manual', prefix='MAN')
        if not self.datasets['manual'].empty:
            self.datasets['manual']['Original_Dataset'] = 'Manual_CloseReading'

        # Rule-based (restrictive)
        print("Loading rule-based (restrictive)…")
        self.datasets['rule_based'] = pd.read_csv(rule_based_file) if rule_based_file else pd.DataFrame()
        self.datasets['rule_based'] = self._ensure_ids(self.datasets['rule_based'], 'rule_based', prefix='RST')
        if not self.datasets['rule_based'].empty:
            self.datasets['rule_based']['Original_Dataset'] = 'Restrictive_Dubliners'

        # NLP (less-restrictive PG)
        print("Loading NLP (less-restrictive PG)…")
        self.datasets['nlp'] = pd.read_csv(nlp_file) if nlp_file else pd.DataFrame()
        self.datasets['nlp'] = self._ensure_ids(self.datasets['nlp'], 'nlp', prefix='NLP')
        if not self.datasets['nlp'].empty:
            self.datasets['nlp']['Original_Dataset'] = 'NLP_LessRestrictive_PG'

        # BNC
        print("Loading BNC baseline…")
        self.datasets['bnc'] = pd.read_csv(bnc_file, encoding='utf-8') if bnc_file else pd.DataFrame()
        self.datasets['bnc'] = self._ensure_ids(self.datasets['bnc'], 'bnc', prefix='BNC')
        if not self.datasets['bnc'].empty:
            self.datasets['bnc']['Original_Dataset'] = 'BNC_Baseline'

        self._standardize_datasets()
        self._standardize_categories()

        for name, df in self.datasets.items():
            print(f"{name:>12}: rows={len(df):4d}  "
                  f"missing_IDs={df['Instance_ID'].isna().sum() if 'Instance_ID' in df else 'N/A'}  "
                  f"missing_Original_Dataset={df['Original_Dataset'].isna().sum() if 'Original_Dataset' in df else 'N/A'}")
        print(f"Total instances: {sum(len(df) for df in self.datasets.values())}")

    def _standardize_datasets(self):
        print("Standardizing column names & adding Dataset_Source…")

        # Manual
        df = self.datasets.get('manual', pd.DataFrame())
        if not df.empty:
            ren = {
                'Category (Framwrok)': 'Category_Framework',
                'Comparator Type ': 'Comparator_Type',
                'Sentence Context': 'Sentence_Context',
                'Page No.': 'Page_Number'
            }
            df = df.rename(columns={k: v for k, v in ren.items() if k in df.columns})
            df['Dataset_Source'] = 'Manual_Expert_Annotation'
            if 'Category_Framework' in df.columns:
                df['Category_Framework'] = df['Category_Framework'].astype(str)
            self.datasets['manual'] = df
        else:
            self.datasets['manual'] = pd.DataFrame(columns=[
                'Instance_ID','Category_Framework','Comparator_Type','Sentence_Context','Page_Number',
                'Dataset_Source','Original_Dataset'
            ])

        # Rule-based
        df = self.datasets.get('rule_based', pd.DataFrame())
        if not df.empty:
            df = df.rename(columns={
                'Sentence Context': 'Sentence_Context',
                'Comparator Type ': 'Comparator_Type',
                'Category (Framwrok)': 'Category_Framework'
            })
            df['Dataset_Source'] = 'Rule_Based_Domain_Informed'
            if 'Category_Framework' in df.columns:
                df['Category_Framework'] = df['Category_Framework'].astype(str)
            self.datasets['rule_based'] = df
        else:
            self.datasets['rule_based'] = pd.DataFrame(columns=[
                'Instance_ID','Category_Framework','Comparator_Type','Sentence_Context',
                'Dataset_Source','Original_Dataset'
            ])

        # NLP (less-restrictive)
        df = self.datasets.get('nlp', pd.DataFrame())
        if not df.empty:
            if 'Sentence_Context' not in df.columns:
                for c in ['Sentence Context','text','sentence','context','content']:
                    if c in df.columns:
                        df = df.rename(columns={c: 'Sentence_Context'})
                        break
            if 'Comparator Type ' in df.columns:
                df = df.rename(columns={'Comparator Type ': 'Comparator_Type'})
            if 'Category (Framwrok)' in df.columns and 'Category_Framework' not in df.columns:
                df = df.rename(columns={'Category (Framwrok)': 'Category_Framework'})
            if 'Category_Framework' not in df.columns:
                df['Category_Framework'] = 'NLP_Basic_Pattern'
            df['Dataset_Source'] = 'NLP_General_Pattern_Recognition'
            df['Category_Framework'] = df['Category_Framework'].astype(str)
            self.datasets['nlp'] = df
        else:
            self.datasets['nlp'] = pd.DataFrame(columns=[
                'Instance_ID','Category_Framework','Comparator_Type','Sentence_Context',
                'Dataset_Source','Original_Dataset'
            ])

        # BNC
        df = self.datasets.get('bnc', pd.DataFrame())
        if not df.empty:
            if 'Category (Framework)' in df.columns and 'Category_Framework' not in df.columns:
                df = df.rename(columns={'Category (Framework)':'Category_Framework'})
            if 'Comparator Type' in df.columns and 'Comparator_Type' not in df.columns:
                df = df.rename(columns={'Comparator Type':'Comparator_Type'})
            if 'Sentence Context' in df.columns and 'Sentence_Context' not in df.columns:
                df = df.rename(columns={'Sentence Context':'Sentence_Context'})
            df['Dataset_Source'] = 'BNC_Standard_English_Baseline'
            if 'Category_Framework' in df.columns:
                df['Category_Framework'] = df['Category_Framework'].astype(str)
            self.datasets['bnc'] = df
        else:
            self.datasets['bnc'] = pd.DataFrame(columns=[
                'Instance_ID','Sentence_Context','Comparator_Type','Category_Framework',
                'Dataset_Source','Original_Dataset'
            ])

        print("Standardization complete.")

    def _standardize_categories(self):
        """
        Harmonize Category_Framework labels.
        IMPORTANT: Merge Joycean_Quasi and its variants into Quasi_Similes (unified quasi-simile phenomenon).
        """
        print("Harmonizing Category_Framework labels…")
        mapping = {
            # Standard variants
            'NLP_Basic': 'Standard',
            'NLP_Basic_Pattern': 'Standard',
            'Standard_English_Usage': 'Standard',
            'Standard': 'Standard',

            # Joycean subtypes
            'Joycean_Framed': 'Joycean_Framed',
            'Joycean_Silent': 'Joycean_Silent',
            'Joycean_Quasi_Fuzzy': 'Joycean_Quasi_Fuzzy',
            'Joycean-Quasi-Fuzzy': 'Joycean_Quasi_Fuzzy',

            # >>> UNIFY all quasi-simile tags here <<<
            'Quasi_Similes': 'Quasi_Similes',
            'Quasi_Simile': 'Quasi_Similes',     # singular → plural canonical
            'Joycean_Quasi': 'Quasi_Similes',    # merge with BNC tag
            'Joycean-Quasi': 'Quasi_Similes',    # hyphen variant

            # mislabels leaking from dataset names → map to Standard
            'NLP_LessRestrictive': 'Standard',
            'NLP_General_Pattern': 'Standard',
            'Less-Restrictive': 'Standard',

            # housekeeping
            'Uncategorised': 'Uncategorized',
            'nan': 'Uncategorized', 'NaN': 'Uncategorized', '': 'Uncategorized'
        }
        for name, df in self.datasets.items():
            if df.empty or 'Category_Framework' not in df.columns:
                continue
            df['Category_Framework'] = df['Category_Framework'].astype(str).map(mapping).fillna(df['Category_Framework'])
            self.datasets[name] = df
        print("Category harmonization complete.")

    # ---------- Utilities for matching & normalization ----------

    @staticmethod
    def _normalize_sentence_for_match(s: str) -> str:
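        # Treat None/NaN cells (common in pandas columns) as empty strings
        s = "" if s is None or (isinstance(s, float) and np.isnan(s)) else str(s)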
        s = s.replace("—", "-").replace("–", "-")
        s = re.sub(r"\s+", " ", s).strip().lower()
        # strip quotes; keep colon/semicolon because they matter in Joyce
        table = str.maketrans("", "", "\"'“”‘’")
        s = s.translate(table)
        return s

    def _fuzzy_equal(self, a: str, b: str, threshold: float = 0.92) -> bool:
        if not a or not b:
            return False
        ra = self._normalize_sentence_for_match(a)
        rb = self._normalize_sentence_for_match(b)
        if ra == rb:
            return True
        return SequenceMatcher(None, ra, rb).ratio() >= threshold

    # ---------- Linguistic analysis (spaCy/TextBlob; corrected comparator handling) ----------

    def _find_comparator_span(self, doc, comparator_type):
        """
        Return (start_i, end_i) in doc-token indices for the comparator span, inclusive.
        Handles 'like', 'as if', 'as though', 'as … as', 'seem*', 'resembl*', punctuation comparators.
        Returns None if not found.
        """
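        # Missing comparators become "" so the literal strings 'nan'/'none'
        # are never matched against sentence tokens
        if comparator_type is None or (isinstance(comparator_type, float) and np.isnan(comparator_type)):
            comp = ""
        else:
            comp = str(comparator_type).strip().lower()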

        def idx_seq_match(tokens, start, seq):
            n = len(seq)
            if start + n > len(tokens):
                return False
            for k in range(n):
                if tokens[start + k].text.lower() != seq[k]:
                    return False
            return True

        tokens = list(doc)

        # Multiword comparators
        if comp in {"as if", "as-if"}:
            seq = ["as", "if"]
            for i in range(len(tokens)-1):
                if idx_seq_match(tokens, i, seq):
                    return (i, i+1)

        if comp in {"as though", "as-though"}:
            seq = ["as", "though"]
            for i in range(len(tokens)-1):
                if idx_seq_match(tokens, i, seq):
                    return (i, i+1)

        # 'as … as' construction (if comparator supplied as 'as')
        if comp == "as":
            as_positions = [i for i,t in enumerate(tokens) if t.text.lower() == "as"]
            for i in as_positions:
                for j in as_positions:
                    if j > i and (j - i) <= 8:  # window limit
                        return (i, j)
            if as_positions:
                return (as_positions[0], as_positions[0])

        # Single-word comparators
        if comp == "like":
            for i,t in enumerate(tokens):
                if t.text.lower() == "like":
                    return (i, i)

        # Lemma families
        if comp.startswith("resembl"):
            for i,t in enumerate(tokens):
                if getattr(t, "lemma_", t.text).lower().startswith("resembl"):
                    return (i, i)

        if comp.startswith("seem"):
            for i,t in enumerate(tokens):
                if getattr(t, "lemma_", t.text).lower().startswith("seem"):
                    return (i, i)

        # punctuation comparators
        if comp in {"colon", ":"}:
            for i,t in enumerate(tokens):
                if t.text == ":":
                    return (i, i)
        if comp in {"semicolon", ";"}:
            for i,t in enumerate(tokens):
                if t.text == ";":
                    return (i, i)
        if comp in {"ellipsis", "...", "…"}:
            for i,t in enumerate(tokens):
                if t.text in {"...", "…"}:
                    return (i, i)
        if comp in {"dash", "em dash", "en dash", "–", "—", "-"}:
            for i,t in enumerate(tokens):
                if t.text in {"—", "–", "-"}:
                    return (i, i)

        # last resort: exact token string match
        for i,t in enumerate(tokens):
            if t.text.lower() == comp:
                return (i, i)

        return None

    def _pre_post_counts_from_span(self, doc, span):
        """
        Compute pre/post token counts using the same token filtering used for 'total'.
        Ensures indices are comparable.
        """
        if span is None:
            nonpun = [t for t in doc if not t.is_space and not t.is_punct]
            total = len(nonpun)
            pre = total // 2
            post = total - pre
            ratio = (pre / post) if post > 0 else np.nan
            return total, pre, post, ratio

        start_i, end_i = span
        # Doc indices of the tokens that count towards 'total'
        kept = [i for i, t in enumerate(doc) if not t.is_space and not t.is_punct]
        total = len(kept)

        # Count kept tokens strictly outside the comparator span. Counting by
        # doc index stays correct when the comparator is itself punctuation
        # (e.g. ':' or a dash) and is therefore not among the kept tokens.
        pre = sum(1 for i in kept if i < start_i)
        post = sum(1 for i in kept if i > end_i)
        ratio = (pre / post) if post > 0 else np.nan
        return total, pre, post, ratio

    def _analyze_comparative_structure(self, doc, comparator_type):
        structure = {
            'has_explicit_comparator': False,
            'comparator_type': str(comparator_type).strip() or "Unknown",
            'comparative_adjectives': [],
            'superlative_adjectives': [],
            'modal_verbs': [],
            'epistemic_markers': []
        }
        for token in doc:
            if token.text.lower() in ['like','as','than']:
                structure['has_explicit_comparator'] = True
            if token.tag_ in ['JJR','RBR']:
                structure['comparative_adjectives'].append(token.text)
            elif token.tag_ in ['JJS','RBS']:
                structure['superlative_adjectives'].append(token.text)
            if token.pos_ == 'AUX' and token.text.lower() in ['might','could','would','should','may']:
                structure['modal_verbs'].append(token.text)
            if token.text.lower() in ['perhaps','maybe','possibly','apparently','seemingly']:
                structure['epistemic_markers'].append(token.text)
        return structure

    def _calculate_syntactic_complexity(self, doc):
        def depth(tok, d=0):
            if not list(tok.children):
                return d
            return max(depth(ch, d+1) for ch in tok.children)
        roots = [t for t in doc if t.head == t]
        if not roots:
            return 0
        try:
            return max(depth(r) for r in roots)
        except Exception:
            return np.nan

    def perform_comprehensive_linguistic_analysis(self):
        print("\nPERFORMING LINGUISTIC ANALYSIS")
        print("-" * 35)
        if self.nlp is None:
            print("spaCy unavailable → simplified analysis (token counts + TextBlob sentiment).")
            return self._perform_simplified_analysis()

        for name, df in list(self.datasets.items()):
            if df.empty:
                print(f"Skipping empty dataset: {name}")
                continue

            # Initialize feature containers
            n = len(df)
            feats = {
                'Total_Tokens': [None]*n,
                'Pre_Comparator_Tokens': [None]*n,
                'Post_Comparator_Tokens': [None]*n,
                'Pre_Post_Ratio': [None]*n,
                'Lemmatized_Text': [None]*n,
                'POS_Tags': [None]*n,
                'POS_Distribution': [None]*n,
                'Sentiment_Polarity': [None]*n,
                'Sentiment_Subjectivity': [None]*n,
                'Comparative_Structure': [None]*n,
                'Syntactic_Complexity': [None]*n,
                'Sentence_Length': [None]*n,
                'Adjective_Count': [None]*n,
                'Verb_Count': [None]*n,
                'Noun_Count': [None]*n,
                'Figurative_Density': [None]*n
            }

            for idx, row in df.iterrows():
                sent = str(row.get('Sentence_Context', '') or '').strip()
                comp = row.get('Comparator_Type', '')
                if not sent:
                    continue
                try:
                    doc = self.nlp(sent)

                    # Comparator span & pre/post
                    span = self._find_comparator_span(doc, comp)
                    total, pre, post, ratio = self._pre_post_counts_from_span(doc, span)

                    # Lemmas & POS (exclude spaces/punct for most features)
                    tokens_nopunct = [t for t in doc if not t.is_space and not t.is_punct]
                    lemmas = [t.lemma_.lower() for t in tokens_nopunct if not t.is_stop]
                    pos_tags = [t.pos_ for t in tokens_nopunct]  # no punctuation
                    pos_dist = Counter(pos_tags)

                    # Sentiment (exploratory only)
                    pol = subj = np.nan
                    if TextBlob is not None:
                        try:
                            blob = TextBlob(sent)
                            pol, subj = blob.sentiment.polarity, blob.sentiment.subjectivity
                        except Exception:
                            pass

                    comp_struct = self._analyze_comparative_structure(doc, comp)
                    complexity = self._calculate_syntactic_complexity(doc)
                    slen = len(tokens_nopunct)
                    adj = sum(1 for t in tokens_nopunct if t.pos_ == 'ADJ')
                    vrb = sum(1 for t in tokens_nopunct if t.pos_ == 'VERB')
                    nou = sum(1 for t in tokens_nopunct if t.pos_ == 'NOUN')

                    # Single-token markers, matched on lemma so inflected forms
                    # ('seemed', 'resembles') are counted; 'as if'/'as though'
                    # are already covered by the 'as' hit.
                    figurative_markers = {'like','as','such','seem','appear','resemble'}
                    fdens = (sum(1 for t in tokens_nopunct if t.lemma_.lower() in figurative_markers) / total) if total else 0

                    loc = df.index.get_loc(idx)
                    feats['Total_Tokens'][loc] = total
                    feats['Pre_Comparator_Tokens'][loc] = pre
                    feats['Post_Comparator_Tokens'][loc] = post
                    feats['Pre_Post_Ratio'][loc] = ratio
                    feats['Lemmatized_Text'][loc] = ' '.join(lemmas)
                    feats['POS_Tags'][loc] = '; '.join(pos_tags)
                    feats['POS_Distribution'][loc] = dict(pos_dist)
                    feats['Sentiment_Polarity'][loc] = pol
                    feats['Sentiment_Subjectivity'][loc] = subj
                    feats['Comparative_Structure'][loc] = comp_struct
                    feats['Syntactic_Complexity'][loc] = complexity
                    feats['Sentence_Length'][loc] = slen
                    feats['Adjective_Count'][loc] = adj
                    feats['Verb_Count'][loc] = vrb
                    feats['Noun_Count'][loc] = nou
                    feats['Figurative_Density'][loc] = fdens
                except Exception as e:
                    print(f"  Error in {name} row {idx}: {e}")

            # Serialize complex columns for CSV
            df['POS_Distribution'] = [json.dumps(x) if isinstance(x, dict) else None for x in feats['POS_Distribution']]
            df['Comparative_Structure'] = [json.dumps(x) if isinstance(x, dict) else None for x in feats['Comparative_Structure']]
            for k, v in feats.items():
                if k in ['POS_Distribution','Comparative_Structure']:
                    continue
                df[k] = v

            self.linguistic_features[name] = feats
            self.datasets[name] = df
            print(f"Finished linguistic analysis for {name}.")

        print("All datasets processed.")

    def _perform_simplified_analysis(self):
        for name, df in list(self.datasets.items()):
            if df.empty or 'Sentence_Context' not in df.columns:
                continue
            n = len(df)
            df['Total_Tokens'] = [None]*n
            df['Pre_Comparator_Tokens'] = [None]*n
            df['Post_Comparator_Tokens'] = [None]*n
            df['Pre_Post_Ratio'] = [np.nan]*n
            df['Sentiment_Polarity'] = [np.nan]*n
            df['Sentiment_Subjectivity'] = [np.nan]*n
            df['Sentence_Length'] = [None]*n

            for idx, row in df.iterrows():
                sent = str(row.get('Sentence_Context','') or '').strip()
                if not sent:
                    continue
                tokens = [t for t in sent.split(" ") if t]
                total = len(tokens)
                df.loc[idx, 'Total_Tokens'] = total
                df.loc[idx, 'Sentence_Length'] = total
                if TextBlob is not None:
                    try:
                        blob = TextBlob(sent)
                        df.loc[idx, 'Sentiment_Polarity'] = blob.sentiment.polarity
                        df.loc[idx, 'Sentiment_Subjectivity'] = blob.sentiment.subjectivity
                    except Exception:
                        pass
                comp = row.get('Comparator_Type', '')
                # Missing comparators (None/NaN) are skipped rather than searched as 'nan'
                comp = '' if pd.isna(comp) else str(comp).strip()
                pos = -1
                if comp:
                    try:
                        m = re.search(r'\b' + re.escape(comp) + r'\b', sent, re.IGNORECASE)
                        if m:
                            pre_text = sent[:m.start()]
                            pos = len([t for t in pre_text.split(" ") if t])
                    except Exception:
                        pass
                if total > 0 and pos != -1:
                    # A multiword comparator such as 'as if' spans several tokens
                    comp_len = max(len(comp.split()), 1)
                    pre, post = pos, max(total - pos - comp_len, 0)
                    df.loc[idx, 'Pre_Comparator_Tokens'] = pre
                    df.loc[idx, 'Post_Comparator_Tokens'] = post
                    df.loc[idx, 'Pre_Post_Ratio'] = (pre / post) if post > 0 else np.nan
            self.datasets[name] = df
        print("Simplified analysis complete.")

    # ---------- Instance-aligned F1 metrics ----------

    def _pair_rows_by_sentence(self, gold_df, pred_df, fuzzy_threshold=0.92):
        """
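        Align gold and predicted rows by sentence: exact match on the normalised
        sentence first, then fuzzy matching (SequenceMatcher ratio >= fuzzy_threshold)
        without reuse of already matched rows.
        Returns (pairs, gold_df, pred_df): pairs is a list of (gold_idx, pred_idx);
        both frames are copies carrying the helper column '_norm_sent'.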
        """
        if gold_df.empty or pred_df.empty:
            return [], gold_df, pred_df

        gold_df = gold_df.copy()
        pred_df = pred_df.copy()
        gold_df["_norm_sent"] = gold_df["Sentence_Context"].map(self._normalize_sentence_for_match)
        pred_df["_norm_sent"] = pred_df["Sentence_Context"].map(self._normalize_sentence_for_match)

        exact_pairs = []
        used_pred = set()
        pred_lookup = {}
        for j, s in pred_df["_norm_sent"].items():
            pred_lookup.setdefault(s, []).append(j)

        for i, s in gold_df["_norm_sent"].items():
            if s in pred_lookup:
                js = [jj for jj in pred_lookup[s] if jj not in used_pred]
                if js:
                    j = js[0]
                    used_pred.add(j)
                    exact_pairs.append((i, j))

        matched_gold = {gi for gi, _ in exact_pairs}  # build once, not per row
        unmatched_gold = [i for i in gold_df.index if i not in matched_gold]
        unmatched_pred = [j for j in pred_df.index if j not in used_pred]

        fuzzy_pairs = []
        for i in unmatched_gold:
            best_j = None
            best_r = 0.0
            gi = gold_df.at[i, "_norm_sent"]
            for j in unmatched_pred:
                if j in used_pred:  # already claimed by an earlier fuzzy match
                    continue
                r = SequenceMatcher(None, gi, pred_df.at[j, "_norm_sent"]).ratio()
                if r > best_r:
                    best_r = r
                    best_j = j
            if best_j is not None and best_r >= fuzzy_threshold:
                used_pred.add(best_j)
                fuzzy_pairs.append((i, best_j))

        pairs = exact_pairs + fuzzy_pairs
        return pairs, gold_df, pred_df

    def _compute_f1_from_pairs(self, gold_df, pred_df, pairs, category_col="Category_Framework"):
        """
        Build TP/FP/FN per category from aligned pairs.
        A predicted row is a TP for category c if both gold and pred label == c.
        If labels differ, count FP for pred's category and FN for gold's category.
        Unmatched gold rows are FNs; unmatched pred rows are FPs.
        """
        cats = sorted(set(gold_df[category_col].astype(str)) | set(pred_df[category_col].astype(str)))
        TP = {c:0 for c in cats}
        FP = {c:0 for c in cats}
        FN = {c:0 for c in cats}

        matched_gold = set(i for i,_ in pairs)
        matched_pred = set(j for _,j in pairs)

        for i,j in pairs:
            g = str(gold_df.at[i, category_col])
            p = str(pred_df.at[j, category_col])
            if p == g:
                TP[g] += 1
            else:
                FP[p] += 1
                FN[g] += 1

        for i in gold_df.index:
            if i not in matched_gold:
                g = str(gold_df.at[i, category_col])
                FN[g] += 1

        for j in pred_df.index:
            if j not in matched_pred:
                p = str(pred_df.at[j, category_col])
                FP[p] += 1

        metrics = {}
        micro_tp = micro_fp = micro_fn = 0

        for c in cats:
            tp, fp, fn = TP[c], FP[c], FN[c]
            prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
            rec  = tp / (tp + fn) if (tp + fn) > 0 else 0.0
            f1   = (2*prec*rec)/(prec+rec) if (prec+rec) > 0 else 0.0
            metrics[c] = {"tp": tp, "fp": fp, "fn": fn, "precision": prec, "recall": rec, "f1": f1}
            micro_tp += tp; micro_fp += fp; micro_fn += fn

        micro_prec = micro_tp / (micro_tp + micro_fp) if (micro_tp + micro_fp) > 0 else 0.0
        micro_rec  = micro_tp / (micro_tp + micro_fn) if (micro_tp + micro_fn) > 0 else 0.0
        micro_f1   = (2*micro_prec*micro_rec)/(micro_prec+micro_rec) if (micro_prec+micro_rec) > 0 else 0.0
        macro_f1   = np.mean([m["f1"] for m in metrics.values()]) if metrics else 0.0

        return metrics, {"micro_precision": micro_prec, "micro_recall": micro_rec, "micro_f1": micro_f1, "macro_f1": macro_f1}

    def calculate_corrected_f1_scores(self, fuzzy_threshold=0.92):
        print("\nCALCULATING INSTANCE-ALIGNED F1 METRICS")
        print("-" * 44)
        print(f"Assumptions: sentence-level alignment (exact, then fuzzy ≥ {fuzzy_threshold}); category = 'Category_Framework'.")

        manual_df = self.datasets.get('manual', pd.DataFrame())
        if manual_df.empty or 'Sentence_Context' not in manual_df or 'Category_Framework' not in manual_df:
            print("F1 unavailable: manual annotations missing/invalid.")
            self.comparison_results['f1_analysis'] = None
            return None, None

        out = {}

        def eval_one(pred_df, pred_name):
            if pred_df.empty or 'Sentence_Context' not in pred_df or 'Category_Framework' not in pred_df:
                print(f"{pred_name}: dataset missing required columns.")
                return None
            pairs, gdf, pdf = self._pair_rows_by_sentence(manual_df, pred_df, fuzzy_threshold=fuzzy_threshold)
            metrics, overall = self._compute_f1_from_pairs(gdf, pdf, pairs)
            print(f"{pred_name}: pairs={len(pairs)}  micro-F1={overall['micro_f1']:.3f}  macro-F1={overall['macro_f1']:.3f}")
            return {"pairs": len(pairs), "category_metrics": metrics, "overall": overall}

        rb = self.datasets.get('rule_based', pd.DataFrame())
        nl = self.datasets.get('nlp', pd.DataFrame())

        out['rule_based_vs_manual'] = eval_one(rb, "Rule-Based vs Manual")
        out['nlp_vs_manual'] = eval_one(nl, "NLP vs Manual")

        self.comparison_results['f1_analysis'] = out
        primary = out['rule_based_vs_manual']['overall']['micro_f1'] if out.get('rule_based_vs_manual') else None
        return out, primary

    # ---------- Save / Export ----------

    def save_comprehensive_results(self, output_path="comprehensive_linguistic_analysis_corrected.csv"):
        print("\nSAVING COMPREHENSIVE RESULTS …")
        frames = []
        for name, df in self.datasets.items():
            if df is None or df.empty:
                continue
            d = df.copy()
            for col, default in [
                ('Original_Dataset', name),
                ('Instance_ID', None),
                ('Sentence_Context', None),
                ('Category_Framework', None),
                ('Comparator_Type', None)
            ]:
                if col not in d.columns:
                    d[col] = default

            if d['Instance_ID'].isna().any() or (not d['Instance_ID'].astype(str).is_unique):
                d = self._ensure_ids(d, name)

            base = ['Instance_ID','Original_Dataset','Sentence_Context','Category_Framework','Comparator_Type']
            others = [c for c in d.columns if c not in base]
            d = d[base + others]
            frames.append(d)

        if not frames:
            print("No data to save.")
            return pd.DataFrame()

        combined = pd.concat(frames, ignore_index=True)

        # Stable sort: Manual → Restrictive → Less-Restrictive PG → BNC
        order = {
            'Manual_CloseReading': 1,
            'Restrictive_Dubliners': 2,
            'NLP_LessRestrictive_PG': 3,
            'BNC_Baseline': 4
        }
        combined['__order__'] = combined['Original_Dataset'].map(order).fillna(99).astype(int)

        def _id_numeric_tail(x):
            m = re.search(r'(\d+)$', str(x))
            return int(m.group(1)) if m else 0

        combined = combined.sort_values(
            by=['__order__','Original_Dataset','Instance_ID'],
            key=lambda s: s.map(_id_numeric_tail) if s.name == 'Instance_ID' else s
        ).drop(columns='__order__')

        combined.to_csv(output_path, index=False)
        print(f"Saved: {output_path}")
        print("Integrity:",
              "missing Instance_ID =", combined['Instance_ID'].isna().sum(),
              "| missing Original_Dataset =", combined['Original_Dataset'].isna().sum(),
              "| rows =", len(combined))
        print("Environment (for reproducibility):", self.env_info)
        return combined


# ========= RUN THE PIPELINE (with your filenames) =========
# manual_path = "All Similes - Dubliners cont.csv"           # close reading (manual)
# rule_based_path = "dubliners_corrected_extraction.csv"     # restrictive
# nlp_path = "dubliners_nlp_basic_extraction.csv"            # less-restrictive PG Dubliners
# bnc_processed_path = "bnc_processed_similes.csv"           # BNC baseline

# Example of how to use with uploaded files:
# from google.colab import files
# uploaded = files.upload()
# manual_file_content = uploaded['All Similes - Dubliners cont.csv'] if 'All Similes - Dubliners cont.csv' in uploaded else None
# rule_based_file_content = uploaded['dubliners_corrected_extraction.csv'] if 'dubliners_corrected_extraction.csv' in uploaded else None
# nlp_file_content = uploaded['dubliners_nlp_basic_extraction.csv'] if 'dubliners_nlp_basic_extraction.csv' in uploaded else None
# bnc_file_content = uploaded['bnc_processed_similes.csv'] if 'bnc_processed_similes.csv' in uploaded else None

# Temporarily using existing files for demonstration
manual_file_content = "/content/All Similes - Dubliners cont.csv"
rule_based_file_content = "/content/dubliners_rulebased_extraction.csv"
nlp_file_content = "/content/dubliners_nlp_less_restrictive_extraction.csv"
bnc_file_content = "/content/bnc_processed_similes.csv"

comparator = ComprehensiveLinguisticComparator()
comparator.load_datasets(manual_file_content, rule_based_file_content, nlp_file_content, bnc_file_content)
comparator.perform_comprehensive_linguistic_analysis()
f1_analysis, primary_f1 = comparator.calculate_corrected_f1_scores()  # instance-aligned F1
results_df = comparator.save_comprehensive_results("comprehensive_linguistic_analysis_corrected.csv")

print("\nPIPELINE COMPLETED.")


COMPREHENSIVE LINGUISTIC COMPARISON OF FOUR SIMILE DATASETS (FIXED)
Dataset 1: Manual Annotations (Ground Truth - Close Reading)
Dataset 2: Rule-Based Extraction (Restrictive - Domain-Informed)
Dataset 3: NLP Extraction (Less-Restrictive - PG Dubliners)
Dataset 4: BNC Baseline Corpus (Standard English Reference)
spaCy pipeline loaded: en_core_web_sm
Environment: {'python': '3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]', 'pandas': '2.2.2', 'numpy': '2.0.2', 'spacy': '3.8.7', 'textblob': 'n/a'}

LOADING DATASETS WITH FIXED ID HANDLING & EXPLICIT LABELS
----------------------------------------------------------------------
Loading manual annotations…
Loading rule-based (restrictive)…
Loading NLP (less-restrictive PG)…
Loading BNC baseline…
Standardizing column names & adding Dataset_Source…
Standardization complete.
Harmonizing Category_Framework labels…
Category harmonization complete.
      manual: rows= 184  missing_IDs=0  missing_Original_Dataset=0
  rule_based: rows= 218  missi

In [12]:
# =============================================================================
# MULTI-COMPARATOR SEGMENT METRICS (pre / between / post) + PATTERN FLAGS
# Between_Tokens_List exported as "10, 10" (or "0" if none); raw list preserved.
# =============================================================================

import os, re, json, time, ast
import numpy as np
import pandas as pd

try:
    import spacy
    # spaCy counts as available only if a loaded pipeline `nlp` exists from an earlier cell.
    _HAS_SPACY = 'nlp' in globals() and nlp is not None
except Exception:
    _HAS_SPACY = False

ts = time.strftime("%Y%m%d_%H%M%S")
out_dir = "analysis_outputs"
os.makedirs(out_dir, exist_ok=True)

def _pick_df_for_segments():
    for name in ("GLOBAL_DF","df_vis","results_df","df"):
        if name in globals() and isinstance(globals()[name], pd.DataFrame) and not globals()[name].empty:
            print(f"Using in-memory dataset: {name} ({len(globals()[name])} rows)")
            return globals()[name].copy()
    for p in [
        "comprehensive_linguistic_analysis_enriched.csv",
        "comprehensive_linguistic_analysis_corrected.csv",
        "comprehensive_linguistic_analysis.csv",
        "/content/comprehensive_linguistic_analysis_enriched.csv",
        "/content/comprehensive_linguistic_analysis_corrected.csv",
        "/content/comprehensive_linguistic_analysis.csv",
    ]:
        if os.path.exists(p):
            print(f"Loaded dataset from disk: {p}")
            return pd.read_csv(p)
    raise FileNotFoundError("No comprehensive dataset found in memory or standard paths.")

_SUBORDINATORS = {"that","who","which","when","because","though","although","if","as",
                  "while","since","unless","until","where","whereas"}

def _simple_tokens(s):
    s = re.sub(r"[“”‘’\"']", "", str(s))
    s = s.replace("—","-").replace("–","-")
    return re.findall(r"\w+|[:;,\.\-\(\)…]", s)

def _detect_comparator_spans(doc_tokens, use_spacy=False, doc_obj=None):
    spans = []
    n = len(doc_tokens)
    tt = (lambda i: (doc_tokens[i].text if use_spacy else doc_tokens[i]))
    lm = (lambda i: (getattr(doc_tokens[i], "lemma_", tt(i)).lower() if use_spacy else tt(i).lower()))
    lower = [tt(i).lower() for i in range(n)]
    lemmas = [lm(i) for i in range(n)]

    used = set()
    i = 0
    while i < n-1:
        if lower[i] == "as" and lower[i+1] == "if":
            spans.append((i, i+1, "as_if")); used.update((i,i+1)); i += 2; continue
        if lower[i] == "as" and lower[i+1] == "though":
            spans.append((i, i+1, "as_though")); used.update((i,i+1)); i += 2; continue
        i += 1

    as_positions = [i for i,w in enumerate(lower) if w == "as" and i not in used]
    consumed = set()
    for i in as_positions:
        if i in consumed:
            continue
        partner = None
        for j in as_positions:
            if j > i and (j - i) <= 8 and j not in consumed:
                partner = j; break
        if partner is not None:
            spans.append((i, partner, "as_as"))
            consumed.update({i, partner})
    used.update(consumed)

    for i,w in enumerate(lower):
        if w == "like":
            spans.append((i,i,"like"))

    for i,l in enumerate(lemmas):
        if l.startswith("resembl") or l.startswith("seem") or l.startswith("appear"):
            spans.append((i,i,l.split("-")[0] if "-" in l else l))

    for i,w in enumerate([tt(k) for k in range(n)]):
        if w in {":",";","—","–","-","...","…"}:
            spans.append((i,i,f"punc_{w}"))

    spans = sorted(spans, key=lambda x: (x[0], x[1]))
    coalesced, last = [], None
    for s in spans:
        if last is None:
            last = s
        else:
            if s[0] <= last[1] and s[1] >= last[0]:
                last = (min(last[0], s[0]), max(last[1], s[1]), last[2])
            else:
                coalesced.append(last); last = s
    if last is not None:
        coalesced.append(last)
    return coalesced

def _segments_from_spans(doc_tokens, spans, use_spacy=False):
    content_idx, doc_to_content = [], {}
    for i, tok in enumerate(doc_tokens):
        if use_spacy:
            if getattr(tok, "is_space", False) or getattr(tok, "is_punct", False): continue
        else:
            if re.fullmatch(r"[:;,.\-\(\)…]", str(tok)): continue
        doc_to_content[i] = len(content_idx)
        content_idx.append(i)

    def nearest(i):
        if i in doc_to_content: return doc_to_content[i]
        L, R = i-1, i+1
        while L >= 0 or R < len(doc_tokens):
            if L >= 0 and L in doc_to_content: return doc_to_content[L]
            if R < len(doc_tokens) and R in doc_to_content: return doc_to_content[R]
            L -= 1; R += 1
        return 0

    spans_np = []
    for a,b,_ in spans:
        a_np, b_np = nearest(a), nearest(b)
        spans_np.append((min(a_np,b_np), max(a_np,b_np)))
    spans_np = sorted(spans_np)

    total = len(content_idx)
    if total == 0:
        return {"Pre_Tokens":0,"Post_Tokens":0,"Between_Tokens_List":[],
                "Between_Tokens_Total":0,"Between_Segments":0,"Between_Max":0,
                "Between_Mean":0.0,"Between_Has_Verb":False,"Between_Has_Subordinator":False}

    if not spans_np:
        pre = total // 2; post = total - pre
        return {"Pre_Tokens":pre,"Post_Tokens":post,"Between_Tokens_List":[],
                "Between_Tokens_Total":0,"Between_Segments":0,"Between_Max":0,
                "Between_Mean":0.0,"Between_Has_Verb":False,"Between_Has_Subordinator":False}

    first_start, last_end = spans_np[0][0], spans_np[-1][1]
    pre = first_start
    post = max(total - (last_end + 1), 0)

    between = []
    for i in range(len(spans_np)-1):
        gap = max(spans_np[i+1][0] - (spans_np[i][1] + 1), 0)
        between.append(gap)

    has_verb = False; has_sub = False
    if len(spans) >= 2:
        for i in range(len(spans)-1):
            a_end, b_start = spans[i][1], spans[i+1][0]
            chunk = doc_tokens[a_end+1:b_start]
            for tok in chunk:
                if use_spacy:  # follow the caller's tokenisation mode, not the global flag
                    if getattr(tok,"pos_","") == "VERB": has_verb = True
                    if tok.text.lower() in _SUBORDINATORS: has_sub = True
                else:
                    if re.fullmatch(r"[A-Za-z]+", str(tok)) and str(tok).lower() in _SUBORDINATORS: has_sub = True
                    if re.fullmatch(r".*(ed|ing)$", str(tok).lower()) or str(tok).lower() in {"be","is","are","was","were","been","am","have","has","had","do","did","does"}:
                        has_verb = True

    bt_total = int(np.sum(between)) if between else 0
    bt_max   = int(np.max(between)) if between else 0
    bt_mean  = float(np.mean(between)) if between else 0.0

    return {"Pre_Tokens":int(pre),"Post_Tokens":int(post),"Between_Tokens_List":between,
            "Between_Tokens_Total":bt_total,"Between_Segments":int(len(between)),
            "Between_Max":bt_max,"Between_Mean":bt_mean,
            "Between_Has_Verb":bool(has_verb),"Between_Has_Subordinator":bool(has_sub)}

def _pattern_label(labels):
    if not labels: return "none"
    if labels == ["as_as"]: return "as_as"
    if labels.count("like") >= 2: return "like_like"
    if any(l.startswith("punc_") for l in labels): return "framed_multi" if len(labels) >= 2 else "framed_single"
    return "single" if len(labels) == 1 else "multi"

# ---------- main ----------
df0 = _pick_df_for_segments()

# Sentence column
if "Sentence_Context" not in df0.columns:
    for c in ["Sentence Context","sentence","text","context","Text"]:
        if c in df0.columns:
            df0 = df0.rename(columns={c: "Sentence_Context"})
            break
if "Sentence_Context" not in df0.columns:
    raise RuntimeError("No 'Sentence_Context' column found.")

# Drop old metric columns to avoid duplicates
metric_cols = [
    "Comp_Count","Comp_Labels","Pre_Tokens","Between_Tokens_List","Between_Tokens_Total",
    "Between_Segments","Between_Max","Between_Mean","Post_Tokens",
    "Between_Has_Verb","Between_Has_Subordinator","Pattern_Label",
    "Between_Tokens_List_Raw","Pre_Share","Between_Share","Post_Share"
]
existing = [c for c in metric_cols if c in df0.columns]
if existing:
    df0 = df0.drop(columns=existing)

rows = []
use_spacy = _HAS_SPACY
print("spaCy available → using spaCy tokens." if use_spacy else "spaCy unavailable → using simplified tokenization.")

for _, r in df0.iterrows():
    sent = str(r.get("Sentence_Context","") or "").strip()
    if not sent:
        rows.append({"Comp_Count":0,"Comp_Labels":json.dumps([]),
                     "Pre_Tokens":0,"Between_Tokens_List":[],"Between_Tokens_Total":0,
                     "Between_Segments":0,"Between_Max":0,"Between_Mean":0.0,
                     "Post_Tokens":0,"Between_Has_Verb":False,"Between_Has_Subordinator":False,
                     "Pattern_Label":"none"})
        continue

    if use_spacy:
        doc = nlp(sent); toks = list(doc)
        spans = _detect_comparator_spans(toks, use_spacy=True, doc_obj=doc)
    else:
        toks = _simple_tokens(sent)
        spans = _detect_comparator_spans(toks, use_spacy=False)

    segs = _segments_from_spans(toks, spans, use_spacy=use_spacy)
    labels = [lab for *_ab, lab in spans]
    rows.append({"Comp_Count":len(spans),"Comp_Labels":json.dumps(labels),**segs,"Pattern_Label":_pattern_label(labels)})

seg_df = pd.DataFrame(rows, index=df0.index)

# Merge and ensure no duplicate column names remain
aug = pd.concat([df0, seg_df], axis=1)
aug = aug.loc[:, ~aug.columns.duplicated(keep="last")]  # keep new metrics if any name collides

# Serialise Between_Tokens_List to "10, 10" (or "0"), keep raw list
def _serialise_between(x):
    if isinstance(x, str):
        s = x.strip()
        if s.startswith("[") and s.endswith("]"):
            try: x = ast.literal_eval(s)
            except Exception: return "0" if s in ("[]","") else s
        else:
            return s if s else "0"
    if isinstance(x, (list, tuple, np.ndarray, pd.Series)):
        vals = [str(int(v)) for v in x if pd.notna(v)]
        return ", ".join(vals) if vals else "0"
    if pd.isna(x) or x == "": return "0"
    try: return str(int(x))
    except Exception: return "0"

aug["Between_Tokens_List_Raw"] = aug["Between_Tokens_List"]
aug["Between_Tokens_List"] = aug["Between_Tokens_List"].apply(_serialise_between)

# Normalized shares
pre  = pd.to_numeric(aug["Pre_Tokens"], errors="coerce").fillna(0)
btw  = pd.to_numeric(aug["Between_Tokens_Total"], errors="coerce").fillna(0)
post = pd.to_numeric(aug["Post_Tokens"], errors="coerce").fillna(0)
den  = (pre + btw + post).replace(0, np.nan)
aug["Pre_Share"]     = pre / den
aug["Between_Share"] = btw / den
aug["Post_Share"]    = post / den

# Save & expose
out_csv = os.path.join(out_dir, f"comparator_segments_{ts}.csv")
aug.to_csv(out_csv, index=False)
latest_csv = "comparator_segments_latest.csv"
aug.to_csv(latest_csv, index=False)

print(f"✓ Saved comparator segment metrics → {out_csv}")
print(f"✓ Also saved as → {latest_csv}")

try:
    from google.colab import files
    files.download(out_csv)
    files.download(latest_csv)
    print("Downloads triggered.")
except Exception:
    print("Colab download unavailable. Files are on disk.")

GLOBAL_DF = aug.copy()
df_vis = aug.copy()
print("GLOBAL_DF/df_vis updated with segment metrics.")

print("\nSegment metrics summary (Joyce vs BNC if available):")
if "Original_Dataset" in aug.columns:
    grp = aug.groupby(aug["Original_Dataset"].astype(str).str.contains("BNC", case=False, na=False).map({True:"BNC", False:"Joyce"}))
    print(grp[["Comp_Count","Between_Tokens_Total","Between_Segments","Between_Has_Verb","Between_Has_Subordinator"]]
          .agg({"Comp_Count":"mean","Between_Tokens_Total":"mean","Between_Segments":"mean",
                "Between_Has_Verb":"mean","Between_Has_Subordinator":"mean"}).round(2))
else:
    print(aug[["Comp_Count","Between_Tokens_Total","Between_Segments"]].describe().round(2))


Using in-memory dataset: GLOBAL_DF (705 rows)
spaCy available → using spaCy tokens.
✓ Saved comparator segment metrics → analysis_outputs/comparator_segments_20250825_105404.csv
✓ Also saved as → comparator_segments_latest.csv


Downloads triggered.
GLOBAL_DF/df_vis updated with segment metrics.

Segment metrics summary (Joyce vs BNC if available):
                  Comp_Count  Between_Tokens_Total  Between_Segments  \
Original_Dataset                                                       
BNC                     1.48                  2.50              0.48   
Joyce                   1.42                  4.76              0.50   

                  Between_Has_Verb  Between_Has_Subordinator  
Original_Dataset                                              
BNC                           0.16                      0.06  
Joyce                         0.26                      0.12  


# 7. Statistical Significance Testing
# 7.1 Multi-Group Comparative Analysis
The statistical analysis treats Joyce Manual, Joyce Restrictive, Joyce Less-Restrictive, and BNC as four separate groups, so that differences between extraction methods can be assessed individually rather than pooled into a single Joyce sample.

# 7.2 Robust Statistical Framework
Implementation includes:

- Four-way chi-square analysis for categorical distribution testing
- Newcombe-Wilson confidence intervals for two-proportion comparisons
- Binomial testing against BNC reference proportions
- Welch t-tests and Mann-Whitney U tests for continuous feature assessment
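
To make the two-proportion step concrete, the following is a minimal standard-library sketch of a Newcombe hybrid-score interval for the difference between two proportions, with Cohen's *h* as the effect size. The function names `wilson_interval` and `newcombe_diff_interval` and the counts in the example call are assumptions chosen for illustration; they are not the notebook's data, and the notebook itself may rely on `statsmodels` for this step.

```python
from math import asin, sqrt
from statistics import NormalDist


def wilson_interval(count, n, alpha=0.05):
    """Wilson score interval for a single proportion count/n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = count / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def newcombe_diff_interval(c1, n1, c2, n2, alpha=0.05):
    """Difference p1 - p2 with a Newcombe hybrid-score interval built from two Wilson intervals."""
    p1, p2 = c1 / n1, c2 / n2
    l1, u1 = wilson_interval(c1, n1, alpha)
    l2, u2 = wilson_interval(c2, n2, alpha)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))


# Hypothetical counts, for illustration only (not taken from the corpora).
diff, ci_low, ci_up = newcombe_diff_interval(30, 100, 45, 100)
print(f"diff={diff:+.3f}  95% CI=[{ci_low:+.3f}, {ci_up:+.3f}]  h={cohens_h(0.30, 0.45):+.3f}")
```

An interval that excludes zero points the same way as a significant two-proportion test, while *h* indicates how large the difference is independently of sample size.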

# 7.3 Topic Modeling Integration
Latent Dirichlet Allocation provides thematic analysis across all dataset subsets, revealing content-based distinctions complementing statistical findings.

In [15]:
# =============================================================================
# ROBUST STATISTICAL SIGNIFICANCE + TOPIC MODELLING (Joyce subsets vs BNC)
# =============================================================================

import os, json, time
import numpy as np
import pandas as pd

from math import asin, sqrt
from scipy.stats import chi2_contingency, mannwhitneyu, ttest_ind, binomtest, norm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

try:
    from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep
    _HAS_STATSMODELS = True
except Exception:
    _HAS_STATSMODELS = False

# ---------- Setup ----------
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

ts = time.strftime("%Y%m%d_%H%M%S")
out_dir = os.path.join("analysis_outputs")
os.makedirs(out_dir, exist_ok=True)

print("\nROBUST STATISTICAL ANALYSIS (Joyce subsets vs BNC)")
print("=" * 75)

# --- Sanity: results_df must exist from Cell 1 ---
if 'results_df' not in globals() or results_df is None or results_df.empty:
    raise RuntimeError("results_df not found or empty. Run Cell 1 first.")

# --- Group labels (from Cell 1 'Original_Dataset') ---
LABELS = {
    "Manual_CloseReading":      "Joyce_Manual",
    "Restrictive_Dubliners":    "Joyce_Restrictive",
    "NLP_LessRestrictive_PG":   "Joyce_LessRestrictive",
    "BNC_Baseline":             "BNC"
}

df = results_df.copy()
if "Original_Dataset" not in df.columns:
    raise RuntimeError("results_df is missing 'Original_Dataset'.")
if "Category_Framework" not in df.columns:
    raise RuntimeError("results_df is missing 'Category_Framework'.")

df["__Group__"] = df["Original_Dataset"].map(LABELS).fillna(df["Original_Dataset"])

# --- Category whitelist (Joycean_Quasi was merged → keep unified 'Quasi_Similes') ---
KNOWN_CATEGORIES = {
    'Standard',
    'Quasi_Similes',          # unified quasi-simile label
    'Joycean_Quasi_Fuzzy',
    'Joycean_Framed',
    'Joycean_Silent',
    'Uncategorized'
}

bad_mask = ~df["Category_Framework"].isin(KNOWN_CATEGORIES)
if bad_mask.any():
    dropped = int(bad_mask.sum())
    print(f"[WARN] Dropping {dropped} rows with unexpected Category_Framework values: "
          f"{sorted(df.loc[bad_mask, 'Category_Framework'].astype(str).unique())}")
    df = df.loc[~bad_mask].copy()

# --- Quick distribution sanity table (post-harmonisation) ---
dist = (df.pivot_table(index="Category_Framework",
                       columns="Original_Dataset",
                       values="Instance_ID",
                       aggfunc="count", fill_value=0)
          .assign(Total=lambda x: x.sum(1))
          .sort_values("Total", ascending=False))
print("\nCategory distribution after harmonisation (counts):")
print(dist)

# --- Split groups ---
groups = {
    "Joyce_Manual":          df[df["__Group__"]=="Joyce_Manual"],
    "Joyce_Restrictive":     df[df["__Group__"]=="Joyce_Restrictive"],
    "Joyce_LessRestrictive": df[df["__Group__"]=="Joyce_LessRestrictive"],
    "BNC":                   df[df["__Group__"]=="BNC"]
}
for gname, gdf in groups.items():
    print(f"{gname:22s}: {len(gdf)} rows")

# ---------- Helpers ----------
def p_adjust_bh(p):
    """Benjamini–Hochberg FDR for a 1D array-like of p-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = np.empty(n, dtype=float)
    cummin = 1.0
    for i, idx in enumerate(order[::-1], start=1):
        rank = n - i + 1
        val = p[idx] * n / rank
        cummin = min(cummin, val)
        ranked[idx] = cummin
    return np.minimum(ranked, 1.0)

def cramers_v(chi2, n, r, c):
    """Cramér's V for r x c table."""
    if n <= 0 or min(r, c) <= 1:
        return np.nan
    return sqrt(chi2 / (n * (min(r, c) - 1)))

def cohens_h(p1, p2):
    """Cohen's h for proportions."""
    p1 = min(max(float(p1), 0.0), 1.0)
    p2 = min(max(float(p2), 0.0), 1.0)
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))

def hedges_g(a, b):
    """Hedges' g (small-sample corrected Cohen's d)."""
    a = np.asarray(a, float); b = np.asarray(b, float)
    a = a[~np.isnan(a)]; b = b[~np.isnan(b)]  # ignore missing values, as cliffs_delta_from_u does
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        return np.nan
    s1 = np.var(a, ddof=1); s2 = np.var(b, ddof=1)
    pooled = ((na-1)*s1 + (nb-1)*s2) / (na+nb-2) if (na+nb-2) > 0 else np.nan
    if not np.isfinite(pooled) or pooled <= 0:
        return np.nan
    d = (np.mean(a) - np.mean(b)) / np.sqrt(pooled)
    J = 1 - (3 / (4*(na+nb) - 9)) if (na+nb) > 2 else 1.0
    return d * J

def cliffs_delta_from_u(a, b):
    """Cliff's delta via Mann–Whitney U (orientation A > B)."""
    a = pd.Series(a).dropna().to_numpy()
    b = pd.Series(b).dropna().to_numpy()
    if len(a) == 0 or len(b) == 0:
        return np.nan
    u_ab, _ = mannwhitneyu(a, b, alternative="greater")
    return (2 * u_ab) / (len(a)*len(b)) - 1

# ---------- 1) 4-way Chi-square on Category_Framework ----------
cats = sorted(KNOWN_CATEGORIES)
contingency_4way = pd.DataFrame(
    {
        "Joyce_Manual":          [groups["Joyce_Manual"]["Category_Framework"].value_counts().get(cat,0) for cat in cats],
        "Joyce_Restrictive":     [groups["Joyce_Restrictive"]["Category_Framework"].value_counts().get(cat,0) for cat in cats],
        "Joyce_LessRestrictive": [groups["Joyce_LessRestrictive"]["Category_Framework"].value_counts().get(cat,0) for cat in cats],
        "BNC":                   [groups["BNC"]["Category_Framework"].value_counts().get(cat,0) for cat in cats],
    },
    index=cats
)

sim_used = False
try:
    chi2_4, p_4, dof_4, exp_4 = chi2_contingency(contingency_4way, correction=False)
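    if (np.asarray(exp_4) < 5).any():
        # Sparse cells: attempt a Monte-Carlo p-value. NOTE: `simulate_pval` and
        # `num_simulation` are not parameters of SciPy's chi2_contingency, so this
        # call raises TypeError and the asymptotic result above is kept
        # (sim_used stays False).
        try:
            chi2_4, p_4, dof_4, exp_4 = chi2_contingency(
                contingency_4way, correction=False, simulate_pval=True, num_simulation=5000
            )
            sim_used = True
        except TypeError:
            print("[WARN] Some expected counts < 5; Monte-Carlo p-value unavailable, "
                  "reporting the asymptotic chi-square result.")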
except Exception as e:
    raise RuntimeError(f"chi2_contingency failed: {e}")

N_total = contingency_4way.values.sum()
V_4 = cramers_v(chi2_4, N_total, *contingency_4way.shape)

print("\n4-way Chi-square on Category_Framework (Joyce subsets vs BNC):")
print(f"χ² = {chi2_4:.4f} | df = {dof_4} | p = {p_4:.6f} | Cramér’s V = {V_4:.3f} | Monte-Carlo={sim_used}")

# Save contingency + expected + standardized residuals + per-cell p-values (FDR)
path_cont_4 = os.path.join(out_dir, f"chi2_contingency_by_subset_{ts}.csv")
path_exp_4  = os.path.join(out_dir, f"chi2_expected_by_subset_{ts}.csv")
contingency_4way.to_csv(path_cont_4)
exp_df_4 = pd.DataFrame(exp_4, index=cats, columns=contingency_4way.columns)
exp_df_4.to_csv(path_exp_4)

pearson_z = (contingency_4way - exp_df_4) / np.sqrt(exp_df_4.replace(0, np.nan))
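# Two-sided normal p-values per cell (vectorised; NaN residuals stay NaN).
# Avoids DataFrame.applymap, which is deprecated in recent pandas releases.
cell_pvals = pd.DataFrame(2 * norm.sf(np.abs(pearson_z.to_numpy(dtype=float))),
                          index=pearson_z.index, columns=pearson_z.columns)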

flat = cell_pvals.values.flatten()
mask = np.isfinite(flat)
adj = np.full_like(flat, np.nan, dtype=float)
if mask.any():
    adj[mask] = p_adjust_bh(flat[mask])
cell_pvals_adj = pd.DataFrame(adj.reshape(cell_pvals.shape), index=cell_pvals.index, columns=cell_pvals.columns)

path_resid_z = os.path.join(out_dir, f"chi2_pearson_z_by_subset_{ts}.csv")
path_resid_p = os.path.join(out_dir, f"chi2_cell_p_by_subset_{ts}.csv")
path_resid_padj = os.path.join(out_dir, f"chi2_cell_padj_BH_by_subset_{ts}.csv")
pearson_z.to_csv(path_resid_z)
cell_pvals.to_csv(path_resid_p)
cell_pvals_adj.to_csv(path_resid_padj)

# Print the strongest drivers: ranked by |z|, reported with sign (over- vs under-representation)
absz = pearson_z.abs().stack().sort_values(ascending=False)
print("\nTop 10 Pearson residuals (ranked by |z|):")
for (cat, grp), _ in absz.head(10).items():
    z = pearson_z.loc[cat, grp]
    obs = contingency_4way.loc[cat, grp]
    exp = exp_df_4.loc[cat, grp]
    print(f"  {grp:22s} | {cat:20s}  z={z:6.2f}  obs={obs} exp={exp:.1f}")

# ---------- 2) Two-proportion tests (each Joyce subset vs BNC) ----------
print("\nTwo-proportion tests (Newcombe–Wilson) for each Joyce subset vs BNC:")
two_prop_rows = []
bnc_total = len(groups["BNC"])
bnc_counts = groups["BNC"]["Category_Framework"].value_counts()

for subset in ["Joyce_Manual","Joyce_Restrictive","Joyce_LessRestrictive"]:
    subset_total = len(groups[subset])
    subset_counts = groups[subset]["Category_Framework"].value_counts()
    for cat in cats:
        cA = subset_counts.get(cat,0); nA = subset_total
        cB = bnc_counts.get(cat,0);    nB = bnc_total
        pA = cA/nA if nA>0 else 0.0
        pB = cB/nB if nB>0 else 0.0
        row = {"Comparison":f"{subset}_vs_BNC", "Subset":subset, "Category":cat,
               "count_A":cA, "n_A":nA, "prop_A":pA, "count_B":cB, "n_B":nB, "prop_B":pB,
               "cohens_h": float(cohens_h(pA, pB))}
        # Skip boundary cases where both groups are 0 or 1 for this category
        if (cA == 0 and cB == 0) or (cA == nA and cB == nB) or (nA == 0 or nB == 0):
            row.update({"z": np.nan, "p_value": 1.0, "CI_low": np.nan, "CI_up": np.nan})
            print(f"  {subset:22s} | {cat:20s} (both groups at boundary → skip) h={row['cohens_h']:.3f}")
            two_prop_rows.append(row); continue
        if _HAS_STATSMODELS:
            try:
                z, pz = proportions_ztest(np.array([cA,cB]), np.array([nA,nB]))
            except Exception:
                # Haldane–Anscombe continuity correction fallback
                pA_ha = (cA + 0.5) / (nA + 1)
                pB_ha = (cB + 0.5) / (nB + 1)
                se = np.sqrt(pA_ha*(1-pA_ha)/(nA+1) + pB_ha*(1-pB_ha)/(nB+1))
                z = (pA_ha - pB_ha) / se if se>0 else np.nan
                pz = 2*norm.sf(abs(z)) if np.isfinite(z) else np.nan
            ci_low, ci_up = (np.nan, np.nan)
            try:
                ci_low, ci_up = confint_proportions_2indep(cA, nA, cB, nB, method="newcombe")
            except Exception:
                pass
            row.update({"z":float(z), "p_value":float(pz), "CI_low":float(ci_low), "CI_up":float(ci_up)})
            print(f"  {subset:22s} | {cat:20s} z={z:6.3f} p={pz:.6g} CI[{ci_low:.3f},{ci_up:.3f}] h={row['cohens_h']:.3f}")
        else:
            row.update({"z":np.nan, "p_value":np.nan, "CI_low":np.nan, "CI_up":np.nan})
            print(f"  {subset:22s} | {cat:20s} (statsmodels unavailable → skipping z/CI) h={row['cohens_h']:.3f}")
        two_prop_rows.append(row)

two_prop_df = pd.DataFrame(two_prop_rows)
if "p_value" in two_prop_df.columns:
    two_prop_df["p_adj_BH"] = p_adjust_bh(two_prop_df["p_value"].fillna(1.0).to_numpy())
path_two_prop = os.path.join(out_dir, f"two_prop_newcombe_by_subset_{ts}.csv")
two_prop_df.to_csv(path_two_prop, index=False)

# ---------- 3) Binomial tests (subset vs BNC reference proportion) ----------
print("\nBinomial tests (each Joyce subset vs BNC category proportion):")
binom_rows = []
for subset in ["Joyce_Manual","Joyce_Restrictive","Joyce_LessRestrictive"]:
    nA = len(groups[subset])
    subset_counts = groups[subset]["Category_Framework"].value_counts()
    for cat in cats:
        cA = subset_counts.get(cat,0)
        cB = bnc_counts.get(cat,0); nB = bnc_total
        p_ref = (cB/nB) if nB>0 else 0.0
        if nA>0 and 0 < p_ref < 1:
            bt = binomtest(cA, n=nA, p=p_ref)
            pv = bt.pvalue
            binom_rows.append({"Comparison":f"{subset}_vs_BNC", "Subset":subset, "Category":cat,
                               "count_A":cA, "n_A":nA, "p_ref_BNC":p_ref, "p_value":pv})
            print(f"  {subset:22s} | {cat:20s} {cA}/{nA} vs p_ref={p_ref:.4f} p={pv:.6g}")
        else:
            binom_rows.append({"Comparison":f"{subset}_vs_BNC", "Subset":subset, "Category":cat,
                               "count_A":cA, "n_A":nA, "p_ref_BNC":p_ref, "p_value":np.nan})

binom_df = pd.DataFrame(binom_rows)
if "p_value" in binom_df.columns:
    binom_df["p_adj_BH"] = p_adjust_bh(binom_df["p_value"].fillna(1.0).to_numpy())
path_binom = os.path.join(out_dir, f"binomial_tests_by_subset_{ts}.csv")
binom_df.to_csv(path_binom, index=False)

# ---------- 4) Continuous features (subset vs BNC) ----------
print("\nContinuous features (Welch t + Mann–Whitney U) each Joyce subset vs BNC:")
continuous_feats = ["Sentence_Length","Pre_Post_Ratio","Sentiment_Polarity","Sentiment_Subjectivity"]
cont_rows = []

for feat in continuous_feats:
    for subset in ["Joyce_Manual","Joyce_Restrictive","Joyce_LessRestrictive"]:
        A = pd.to_numeric(groups[subset].get(feat, pd.Series(dtype=float)), errors="coerce").dropna()
        B = pd.to_numeric(groups["BNC"].get(feat, pd.Series(dtype=float)), errors="coerce").dropna()
        row = {"Feature":feat, "Comparison":f"{subset}_vs_BNC", "Subset":subset,
               "A_n":int(A.shape[0]), "B_n":int(B.shape[0])}
        if len(A)>10 and len(B)>10:
            t,p_t = ttest_ind(A,B,equal_var=False)
            u,p_u = mannwhitneyu(A,B,alternative="two-sided")
            row.update({
                "A_mean":float(A.mean()), "A_median":float(A.median()),
                "B_mean":float(B.mean()), "B_median":float(B.median()),
                "t_stat":float(t), "t_pvalue":float(p_t),
                "U_stat":float(u), "U_pvalue":float(p_u),
                "hedges_g": float(hedges_g(A, B)),
                "cliffs_delta": float(cliffs_delta_from_u(A, B))
            })
            print(f"  {feat:22s} | {subset:22s} t={t:7.3f} p={p_t:.6g} | U={u:9.1f} p={p_u:.6g} | g={row['hedges_g']:.3f} δ={row['cliffs_delta']:.3f}")
        cont_rows.append(row)

cont_df = pd.DataFrame(cont_rows)
if "t_pvalue" in cont_df.columns:
    cont_df["t_padj_BH"] = p_adjust_bh(cont_df["t_pvalue"].fillna(1.0).to_numpy())
if "U_pvalue" in cont_df.columns:
    cont_df["U_padj_BH"] = p_adjust_bh(cont_df["U_pvalue"].fillna(1.0).to_numpy())

path_cont = os.path.join(out_dir, f"continuous_tests_by_subset_{ts}.csv")
cont_df.to_csv(path_cont, index=False)

# ---------- 4B) Comparator-segment metrics (Pre / Between / Post) ----------
# Tries comparator_segments_latest.csv first, then the most recent timestamped file in analysis_outputs/
import glob

print("\nComparator-segment statistics (Pre / Between / Post)")
seg_candidates = ["comparator_segments_latest.csv"] + sorted(
    glob.glob(os.path.join("analysis_outputs", "comparator_segments_*.csv")),
    key=os.path.getctime, reverse=True
)
seg_path = next((p for p in seg_candidates if os.path.exists(p)), None)

if seg_path is None:
    print("  [WARN] No comparator segments file found — skipping section 4B.")
else:
    print(f"  Using segments file: {seg_path}")
    seg_raw = pd.read_csv(seg_path)

    # --- flexible column detection (now includes new names from the enrichment cell) ---
    def _pick(cols, cand):
        for c in cand:
            if c in cols: return c
        return None

    id_col   = _pick(seg_raw.columns, ["Instance_ID","instance_id","ID","Id"])
    pre_col  = _pick(seg_raw.columns, ["Pre_Tokens","Pre_Comparator_Tokens","pre_tokens","pre_len","Pre_Length"])
    # NOTE: include the new name 'Between_Tokens_Total'
    bet_col  = _pick(seg_raw.columns, ["Between_Tokens_Total","Between_Tokens","Between_Comparator_Tokens",
                                       "between_tokens","between_len","Between_Length","InBetween_Tokens"])
    post_col = _pick(seg_raw.columns, ["Post_Tokens","Post_Comparator_Tokens","post_tokens","post_len","Post_Length"])
    ccount   = _pick(seg_raw.columns, ["Comp_Count","Comparator_Count","comparators","n_comparators"])

    if id_col is None or (pre_col is None and post_col is None and bet_col is None):
        print("  [WARN] Required columns not found in segments CSV — skipping 4B.")
    else:
        keep_cols = [id_col] + [c for c in [pre_col, bet_col, post_col, ccount] if c is not None]
        seg = seg_raw[keep_cols].copy()

        rename_map = {}
        if pre_col:  rename_map[pre_col]  = "Seg_Pre"
        if bet_col:  rename_map[bet_col]  = "Seg_Between"
        if post_col: rename_map[post_col] = "Seg_Post"
        if ccount:   rename_map[ccount]   = "Seg_Comp_Count"
        seg = seg.rename(columns=rename_map)

        # Ensure all expected columns exist (fill missing with 0)
        for c in ["Seg_Pre","Seg_Between","Seg_Post","Seg_Comp_Count"]:
            if c not in seg.columns:
                seg[c] = 0

        # Numeric coercion
        for c in ["Seg_Pre","Seg_Between","Seg_Post","Seg_Comp_Count"]:
            seg[c] = pd.to_numeric(seg[c], errors="coerce").fillna(0)

        # Totals + ratios (safe divide)
        seg["Seg_Total"] = seg[["Seg_Pre","Seg_Between","Seg_Post"]].sum(axis=1, min_count=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            seg["Seg_Pre_Ratio"]     = seg["Seg_Pre"]     / seg["Seg_Total"].replace(0, np.nan)
            seg["Seg_Between_Ratio"] = seg["Seg_Between"] / seg["Seg_Total"].replace(0, np.nan)
            seg["Seg_Post_Ratio"]    = seg["Seg_Post"]    / seg["Seg_Total"].replace(0, np.nan)
        seg["Seg_Between_gt0"] = (seg["Seg_Between"] > 0).astype(int)

        # Merge onto working df (which already has __Group__)
        merged_seg = df.merge(seg, left_on="Instance_ID", right_on=id_col, how="left")

        # Build grouped view (mirror of 'groups')
        groups_seg = {
            "Joyce_Manual":          merged_seg[merged_seg["__Group__"]=="Joyce_Manual"],
            "Joyce_Restrictive":     merged_seg[merged_seg["__Group__"]=="Joyce_Restrictive"],
            "Joyce_LessRestrictive": merged_seg[merged_seg["__Group__"]=="Joyce_LessRestrictive"],
            "BNC":                   merged_seg[merged_seg["__Group__"]=="BNC"],
        }

        seg_features = [c for c in [
            "Seg_Pre","Seg_Between","Seg_Post","Seg_Total",
            "Seg_Pre_Ratio","Seg_Between_Ratio","Seg_Post_Ratio",
            "Seg_Comp_Count"
        ] if c in merged_seg.columns]

        print("  Features available:", seg_features if seg_features else "None")

        # ---- (i) Continuous tests for segment metrics (subset vs BNC) ----
        seg_rows = []
        for feat in seg_features:
            for subset in ["Joyce_Manual","Joyce_Restrictive","Joyce_LessRestrictive"]:
                A = pd.to_numeric(groups_seg[subset][feat], errors="coerce").dropna()
                B = pd.to_numeric(groups_seg["BNC"][feat], errors="coerce").dropna()
                row = {"Feature": feat, "Comparison": f"{subset}_vs_BNC", "Subset": subset,
                       "A_n": int(A.shape[0]), "B_n": int(B.shape[0])}
                if len(A) > 10 and len(B) > 10 and np.isfinite(A).any() and np.isfinite(B).any():
                    t, p_t = ttest_ind(A, B, equal_var=False)
                    u, p_u = mannwhitneyu(A, B, alternative="two-sided")
                    row.update({
                        "A_mean": float(A.mean()), "A_median": float(A.median()),
                        "B_mean": float(B.mean()), "B_median": float(B.median()),
                        "t_stat": float(t), "t_pvalue": float(p_t),
                        "U_stat": float(u), "U_pvalue": float(p_u),
                        "hedges_g": float(hedges_g(A, B)),
                        "cliffs_delta": float(cliffs_delta_from_u(A, B))
                    })
                    print(f"  {feat:20s} | {subset:22s} t={t:7.3f} p={p_t:.3g} | U={u:9.1f} p={p_u:.3g} | g={row['hedges_g']:.3f} δ={row['cliffs_delta']:.3f}")
                seg_rows.append(row)

        seg_cont_df = pd.DataFrame(seg_rows)
        if "t_pvalue" in seg_cont_df.columns:
            seg_cont_df["t_padj_BH"] = p_adjust_bh(seg_cont_df["t_pvalue"].fillna(1.0).to_numpy())
        if "U_pvalue" in seg_cont_df.columns:
            seg_cont_df["U_padj_BH"] = p_adjust_bh(seg_cont_df["U_pvalue"].fillna(1.0).to_numpy())

        path_seg_cont = os.path.join(out_dir, f"continuous_segment_tests_by_subset_{ts}.csv")
        seg_cont_df.to_csv(path_seg_cont, index=False)

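# ---- (ii) Does 'Between > 0' differ by group? (χ² with robust fallbacks) ----
from scipy.stats import fisher_exact

# This block runs at top level, so merged_seg is undefined whenever section 4B was skipped
# (no segments file or no usable columns). Fall back to df; the test below then skips itself.
try:
    merged_seg
except NameError:
    merged_seg = df

# fill NaN as 0 so groups without segment rows still contribute
col_between = merged_seg.get("Seg_Between_gt0", pd.Series(index=merged_seg.index, dtype=float)).fillna(0).astype(int)
tmp = merged_seg.assign(_between=col_between)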

pres = tmp.groupby("__Group__")["_between"].agg(["sum","count"])
# keep a stable order if present
order = [g for g in ["Joyce_Manual","Joyce_Restrictive","Joyce_LessRestrictive","BNC"] if g in pres.index]
pres = pres.loc[order]

presence_tbl = pd.DataFrame({
    "Between>0": pres["sum"].astype(int),
    "Between=0": (pres["count"] - pres["sum"]).astype(int)
})

# drop rows/cols with zero total to avoid zero expected cells
presence_tbl = presence_tbl[presence_tbl.sum(axis=1) > 0]
presence_tbl = presence_tbl.loc[:, presence_tbl.sum(axis=0) > 0]

if presence_tbl.shape[0] < 2 or presence_tbl.shape[1] != 2:
    print("\nBetween-clause presence × Group (χ²):")
    print(presence_tbl if not presence_tbl.empty else "(no usable data)")
    print("  Not enough non-zero groups/columns for χ²; skipping.")
    path_presence_csv = path_presence_exp = path_presence_resid = None
else:
    print("\nBetween-clause presence × Group (χ²):")
    print(presence_tbl)

    exp_seg = None
    try:
        chi2_seg, p_seg, dof_seg, exp_seg = chi2_contingency(presence_tbl, correction=False)
        V_seg = cramers_v(chi2_seg, presence_tbl.values.sum(), *presence_tbl.shape)
        print(f"  χ² = {chi2_seg:.3f} | df = {dof_seg} | p = {p_seg:.6f} | Cramér’s V = {V_seg:.3f}")
    except ValueError:
        # Try a Monte-Carlo χ² if this SciPy build accepts these keywords (TypeError otherwise);
        # if it does not, fall back to pairwise Fisher exact tests vs BNC.
        try:
            chi2_seg, p_seg, dof_seg, exp_seg = chi2_contingency(
                presence_tbl, correction=False, simulate_pval=True, num_simulation=5000
            )
            V_seg = cramers_v(chi2_seg, presence_tbl.values.sum(), *presence_tbl.shape)
            print(f"  Monte-Carlo χ² = {chi2_seg:.3f} | df = {dof_seg} | p = {p_seg:.6f} | V = {V_seg:.3f}")
        except TypeError:
            print("  χ² not valid (zero expected). Pairwise Fisher exact p-values vs BNC:")
            for subset in ["Joyce_Manual","Joyce_Restrictive","Joyce_LessRestrictive"]:
                if subset in presence_tbl.index and "BNC" in presence_tbl.index:
                    tbl2 = presence_tbl.loc[[subset, "BNC"]].to_numpy()
                    _, p_f = fisher_exact(tbl2)  # two-sided
                    print(f"   - {subset:22s}: p = {p_f:.6g}")
            p_seg = np.nan; chi2_seg = np.nan; dof_seg = 1

    # Save presence table + expected + residuals if we have them
    path_presence_csv = os.path.join(out_dir, f"between_presence_by_group_{ts}.csv")
    presence_tbl.to_csv(path_presence_csv)

    if exp_seg is not None:
        path_presence_exp = os.path.join(out_dir, f"between_presence_expected_by_group_{ts}.csv")
        path_presence_resid = os.path.join(out_dir, f"between_presence_residuals_by_group_{ts}.csv")
        exp_df = pd.DataFrame(exp_seg, index=presence_tbl.index, columns=presence_tbl.columns)
        exp_df.to_csv(path_presence_exp)
        resid = (presence_tbl - exp_df) / np.sqrt(exp_df.replace(0, np.nan))
        resid.to_csv(path_presence_resid)
    else:
        path_presence_exp = path_presence_resid = None

# ---------- 5) Topic modelling (per subset + BNC) ----------
print("\nTOPIC MODELLING (per subset + BNC)")

def prepare_corpus(gdf):
    """Prefer Lemmatized_Text from Cell 1; fallback to Sentence_Context."""
    if "Lemmatized_Text" in gdf.columns and gdf["Lemmatized_Text"].notna().any():
        texts = gdf["Lemmatized_Text"].fillna("").astype(str).tolist()
    else:
        texts = gdf["Sentence_Context"].fillna("").astype(str).tolist() if "Sentence_Context" in gdf.columns else []
    return [t for t in texts if t.strip()]

def lda_topics(corpus, n_topics=5, n_top_words=10, max_df=0.85, min_df=2, max_features=5000, random_state=RANDOM_STATE):
    if not corpus:
        return None, None, None
    vectorizer = CountVectorizer(
        max_df=max_df, min_df=min_df, max_features=max_features,
        stop_words="english",
        token_pattern=r"(?u)\b[^\W\d_][^\W\d_]+\b"
    )
    X = vectorizer.fit_transform(corpus)
    lda = LatentDirichletAllocation(
        n_components=n_topics, random_state=random_state, learning_method="batch"
    )
    lda.fit(X)
    terms = vectorizer.get_feature_names_out()
    topics = []
    for comp in lda.components_:
        top_idx = comp.argsort()[:-n_top_words-1:-1]
        topics.append([terms[i] for i in top_idx])
    doc_topic = lda.transform(X)  # rows sum ~1
    return topics, doc_topic, terms

topics_summary = {"params":{"n_topics":5,"n_top_words":10,"vectorizer":"CountVectorizer"}, "groups":{}}
topic_rows = []
topicmix_rows = []

for subset in ["Joyce_Manual","Joyce_Restrictive","Joyce_LessRestrictive","BNC"]:
    corpus = prepare_corpus(groups[subset])
    if corpus:
        tpcs, doc_topic, terms = lda_topics(corpus, n_topics=5, n_top_words=10)
        topics_summary["groups"][subset] = tpcs if tpcs is not None else []
        if tpcs:
            for i, words in enumerate(tpcs, 1):
                topic_rows.append({"Group":subset, "Topic":i, "Top_Words":", ".join(words)})
            topic_means = doc_topic.mean(axis=0) if doc_topic is not None else None
            if topic_means is not None:
                for i, val in enumerate(topic_means, 1):
                    topicmix_rows.append({"Group":subset, "Topic":i, "Mean_Weight":float(val)})
        print(f"  Topics generated for {subset}: {len(tpcs) if tpcs else 0}")
    else:
        topics_summary["groups"][subset] = []
        print(f"  Not enough text for {subset}")

topics_json_path = os.path.join(out_dir, f"lda_topics_by_subset_{ts}.json")
with open(topics_json_path, "w", encoding="utf-8") as f:
    json.dump(topics_summary, f, ensure_ascii=False, indent=2)

topics_csv = pd.DataFrame(topic_rows, columns=["Group","Topic","Top_Words"])
topics_csv_path = os.path.join(out_dir, f"lda_topics_by_subset_{ts}.csv")
topics_csv.to_csv(topics_csv_path, index=False)

topicmix_csv = pd.DataFrame(topicmix_rows, columns=["Group","Topic","Mean_Weight"])
topicmix_csv_path = os.path.join(out_dir, f"lda_topic_mix_by_subset_{ts}.csv")
topicmix_csv.to_csv(topicmix_csv_path, index=False)

# ---------- 6) Optional robustness: 3-way χ² (drop NLP group) ----------
try:
    cont_3 = contingency_4way.drop(columns=["Joyce_LessRestrictive"])
    chi2_3, p_3, dof_3, _ = chi2_contingency(cont_3, correction=False)
    V_3 = cramers_v(chi2_3, cont_3.values.sum(), *cont_3.shape)
    print(f"\n3-way χ² (Manual/Restrictive/BNC): χ²={chi2_3:.2f} df={dof_3} p={p_3:.3g} V={V_3:.3f}")
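except Exception as e:
    # Robustness check only: report the failure instead of silently swallowing it
    print(f"[WARN] 3-way χ² skipped: {e}")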

# ---------- 7) Master summary JSON ----------
master = {
    "generated_at": ts,
    "random_state": RANDOM_STATE,
    "note": "By-subset outputs (Manual / Restrictive / Less-Restrictive vs BNC) with unified Quasi_Similes label and FDR corrections.",
    "chi_square_4way": {
        "chi2": float(chi2_4), "dof": int(dof_4), "p_value": float(p_4),
        "cramers_v": float(V_4), "monte_carlo": bool(sim_used),
        "N": int(N_total)
    },
    "files": {
        "chi2_contingency_by_subset_csv": path_cont_4,
        "chi2_expected_by_subset_csv": path_exp_4,
        "chi2_pearson_z_by_subset_csv": path_resid_z,
        "chi2_cell_p_by_subset_csv": path_resid_p,
        "chi2_cell_padj_BH_by_subset_csv": path_resid_padj,
        "two_prop_newcombe_by_subset_csv": path_two_prop,
        "binomial_tests_by_subset_csv": path_binom,
        "continuous_tests_by_subset_csv": path_cont,
        "lda_topics_by_subset_json": topics_json_path,
        "lda_topics_by_subset_csv": topics_csv_path,
        "lda_topic_mix_by_subset_csv": topicmix_csv_path
    }
}
master_path = os.path.join(out_dir, f"stats_and_topics_summary_by_subset_{ts}.json")
with open(master_path, "w", encoding="utf-8") as f:
    json.dump(master, f, ensure_ascii=False, indent=2)

print("\nSAVED OUTPUTS (by-subset)")
print(" - 4-way contingency:", master["files"]["chi2_contingency_by_subset_csv"])
print(" - 4-way expected:", master["files"]["chi2_expected_by_subset_csv"])
print(" - 4-way Pearson z:", master["files"]["chi2_pearson_z_by_subset_csv"])
print(" - 4-way cell p-values:", master["files"]["chi2_cell_p_by_subset_csv"])
print(" - 4-way cell p-values (BH):", master["files"]["chi2_cell_padj_BH_by_subset_csv"])
print(" - Two-proportion (subset vs BNC):", master["files"]["two_prop_newcombe_by_subset_csv"])
print(" - Binomial (subset vs BNC):", master["files"]["binomial_tests_by_subset_csv"])
print(" - Continuous tests (subset vs BNC):", master["files"]["continuous_tests_by_subset_csv"])
print(" - Topics JSON (per subset):", master["files"]["lda_topics_by_subset_json"])
print(" - Topics CSV (per subset):", master["files"]["lda_topics_by_subset_csv"])
print(" - Topic mix CSV (per subset):", master["files"]["lda_topic_mix_by_subset_csv"])
print(" - Master summary JSON:", master_path)
print("\nDONE.")



ROBUST STATISTICAL ANALYSIS (Joyce subsets vs BNC)

Category distribution after harmonisation (counts):
Original_Dataset     BNC_Baseline  Manual_CloseReading  NLP_LessRestrictive_PG  Restrictive_Dubliners  Total
Category_Framework
Standard                      118                   93                     330                    150    691
Quasi_Similes                  82                   53                       0                     47    182
Joycean_Quasi_Fuzzy             0                   13                       0                     14     27
Joycean_Framed                  0                   18                       0                      …      …
Joycean_Silent                  0                    6                       …                      …      …
Uncategorized                   0                    1                       …                      …      …

# 8. Academic Reporting and Documentation

## 8.1 Professional Report Generation
The HTML report generator produces a polished, self-contained document suitable for:
- **Peer review** and journal **supplementary materials**
- Internal/external **research documentation** and **reproducibility** records
- **Academic presentations** (slides/posters) and dissemination to non-specialists

It compiles results directly from the timestamped CSV/JSON artifacts written to `analysis_outputs/`, ensuring the report is traceable to concrete files.
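
The traceability claim above depends on resolving each artifact pattern to one concrete file. A minimal sketch of one way to do that, assuming the timestamp embedded in each filename is zero-padded so that names sort chronologically (the helper name `latest_by_name` and the example timestamps are hypothetical, not part of the notebook):

```python
import glob
import os
import tempfile

def latest_by_name(pattern, base):
    """Return the lexicographically last match, or None when nothing matches."""
    files = sorted(glob.glob(os.path.join(base, pattern)))
    return files[-1] if files else None

# Demonstration in a throwaway directory with two made-up timestamps
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20250101_090000", "20250102_120000"):
        open(os.path.join(tmp, f"binomial_tests_by_subset_{stamp}.csv"), "w").close()
    picked = os.path.basename(latest_by_name("binomial_tests_by_subset_*.csv", tmp))
    missing = latest_by_name("no_such_file_*.csv", tmp)

print(picked, missing)
```

Selecting by filename rather than by filesystem time keeps the choice stable when the output directory is copied or restored from an archive, since those operations can reset file times.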

## 8.2 Results Integration
The report synthesizes all major components of the analysis, including:
- **Instance-aligned F1** summaries (exact + fuzzy sentence matching) for extractor vs manual annotations
- **4-way χ²** contingency results with **expected counts**, **Pearson residuals (z)**, and **BH-FDR**-adjusted per-cell p-values
- **Two-proportion tests** (Newcombe CIs) with **Cohen’s h** effect sizes, BH-FDR corrected
- **Binomial tests** vs BNC reference proportions, BH-FDR corrected
- **Continuous feature** comparisons (Welch *t*, Mann–Whitney *U*) with **Hedges’ g**, **Cliff’s δ**, and optional **Hodges–Lehmann** shifts (bootstrap CIs)
- **Topic modelling** per subset (LDA top words) and **mean topic-mix** weights
- **Dataset and category overviews**, reflecting the **harmonised taxonomy** with **Quasi_Similes** as the unified label
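
Two of the quantities listed above are simple enough to show directly. The following is a minimal NumPy sketch of the Benjamini-Hochberg adjustment and of Cohen's *h*; the function names `bh_adjust` and `cohens_h_sketch` are hypothetical and are not the notebook's own `p_adjust_bh` and `cohens_h` helpers, which may differ in detail. The proportions come from the harmonised counts (53 of 184 manual instances and 82 of 200 BNC instances labelled Quasi_Similes); the p-values are made up for illustration.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks by ascending p-value
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity from the largest rank downwards, then cap at 1
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m, dtype=float)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

def cohens_h_sketch(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Quasi_Similes share: Joyce manual set vs BNC
h = cohens_h_sketch(53 / 184, 82 / 200)

# Illustrative (made-up) raw p-values
adj = bh_adjust([0.01, 0.04, 0.03, 0.20])

print(f"h = {h:.3f}")
print("BH-adjusted:", np.round(adj, 4))
```

A negative *h* here means the category is less frequent in the Joyce manual set than in the BNC, which matches the direction reported in the abstract.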

## 8.3 Academic Standards and Transparency
The report follows scholarly communication norms:
- Clear sectioning, professional typography, descriptive figure/table captions
- Explicit **statistical choices** (e.g., BH-FDR; Monte-Carlo χ² fallback when expected counts are low)
- **Environment stamping** (Python/library versions), deterministic seeds, and a **file manifest** with timestamps for reproducibility
- **Taxonomy harmonisation** note (Joycean_Quasi → **Quasi_Similes**) to prevent label inflation
- Methodological caveats (e.g., **TextBlob sentiment** reported as exploratory; the **less-restrictive NLP** subset being **all Standard** is treated as a control and excluded in robustness checks when appropriate)
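
Environment stamping can be done with the standard library alone. The sketch below is one possible approach and is not the implementation behind `comparator.env_info`; the function name `env_stamp` and the package list are assumptions chosen for illustration.

```python
import platform
import sys
from importlib import metadata

def env_stamp(packages=("numpy", "pandas", "scipy", "scikit-learn", "statsmodels")):
    """Collect interpreter and library versions; packages that are not installed map to None."""
    info = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None
    return info

stamp = env_stamp()
for key, value in stamp.items():
    print(f"{key}: {value}")
```

Recording `None` for a missing package, rather than raising, keeps the stamp usable in sessions where optional dependencies such as statsmodels are unavailable.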

## 8.4 Reproducibility & Data Availability (recommended)
- Archive the full `analysis_outputs/` directory, `comprehensive_linguistic_analysis_corrected.csv`, and this notebook.
- Include licensing/usage notes for external corpora (e.g., **BNC** access terms) and cite the **Project Gutenberg** source for *Dubliners*.
- Provide a short **README** describing how to re-run Cells 1–3 and regenerate the identical HTML using the saved artifacts.
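
When archiving the outputs, a checksum manifest makes it possible to confirm later that the files behind a report are unchanged. A minimal sketch, assuming SHA-256 digests are acceptable for this purpose (the function name `build_manifest` is hypothetical, and the demonstration uses a temporary directory rather than the real `analysis_outputs/`):

```python
import hashlib
import os
import tempfile

def build_manifest(directory):
    """List every file under directory with its size in bytes and SHA-256 digest."""
    rows = []
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            rows.append({
                "file": os.path.relpath(path, directory),
                "bytes": os.path.getsize(path),
                "sha256": digest,
            })
    return sorted(rows, key=lambda r: r["file"])

# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.csv"), "wb") as fh:
        fh.write(b"a,b\n1,2\n")
    manifest = build_manifest(tmp)

print(manifest)
```

The resulting list can be written next to the report with `json.dump`, so that a re-run can be compared against the archived digests.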


In [16]:
# =============================================================================
# ACADEMIC HTML REPORT GENERATOR — UPDATED (data-driven, harmonised labels)
# Now also includes comparator-segment metrics (Pre / Between / Post)
# =============================================================================

import os
import glob
import json
import pandas as pd
from datetime import datetime
import numpy as np

# for presence χ² (Between>0) if available
try:
    from scipy.stats import chi2_contingency
    _HAS_SCIPY = True
except Exception:
    _HAS_SCIPY = False

print("GENERATING ACADEMIC HTML REPORT")
print("=" * 50)

# Timestamp for report
now = datetime.now()
report_timestamp = now.strftime("%Y-%m-%d %H:%M:%S")
report_date = now.strftime("%Y%m%d_%H%M%S")

# ---------- Helpers ----------
def latest_file(pattern, base="analysis_outputs"):
    files = glob.glob(os.path.join(base, pattern))
    return max(files, key=os.path.getctime) if files else None

def safe_read_csv(path):
    if not path or not os.path.exists(path):
        return pd.DataFrame()
    try:
        return pd.read_csv(path)
    except Exception as e:
        print(f"[WARN] Could not load CSV {path}: {e}")
        return pd.DataFrame()

def safe_read_json(path):
    if not path or not os.path.exists(path):
        return {}
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except Exception as e:
        print(f"[WARN] Could not load JSON {path}: {e}")
        return {}

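def df_for_html(df, index=False, max_rows=None):
    """Return a copy of df limited to max_rows; truncation is flagged in d.attrs['truncated']."""
    if df is None or df.empty:
        return pd.DataFrame()
    d = df.copy()
    if max_rows is not None and len(d) > max_rows:
        d = d.head(max_rows)
        # DataFrame.attrs is the supported place for metadata; ad-hoc attributes are lost on copy
        d.attrs["truncated"] = True
    return d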

def create_table_html(df, title="", max_rows=20, index=False):
    """Create HTML table with styling; auto-handle empty and truncation note."""
    if df is None or df.empty:
        return f"<p><em>No data available for {title}</em></p>"
    d = df.copy()
    trunc = False
    if max_rows and len(d) > max_rows:
        d = d.head(max_rows)
        trunc = True
    # When index=True, promote the index to a regular column so it is rendered in the table
    if index:
        d = d.reset_index()
    table_html = d.to_html(classes='analysis-table', escape=False, index=False)
    note = f"<p class='truncated-note'><em>Showing first {max_rows} of {len(df)} rows</em></p>" if trunc else ""
    return f"""
    <div class="table-container">
        <h4>{title}</h4>
        <div class="table-wrapper">{table_html}</div>
        {note}
    </div>
    """

# ---------- Pull in analysis outputs from Cell 2 ----------
out_dir = "analysis_outputs"
os.makedirs(out_dir, exist_ok=True)

paths = {
    "chi2_cont": latest_file("chi2_contingency_by_subset_*.csv", out_dir),
    "chi2_exp": latest_file("chi2_expected_by_subset_*.csv", out_dir),
    "chi2_z": latest_file("chi2_pearson_z_by_subset_*.csv", out_dir),
    "chi2_p": latest_file("chi2_cell_p_by_subset_*.csv", out_dir),
    "chi2_padj": latest_file("chi2_cell_padj_BH_by_subset_*.csv", out_dir),
    "two_prop": latest_file("two_prop_newcombe_by_subset_*.csv", out_dir),
    "binom": latest_file("binomial_tests_by_subset_*.csv", out_dir),
    "cont": latest_file("continuous_tests_by_subset_*.csv", out_dir),
    "topics": latest_file("lda_topics_by_subset_*.csv", out_dir),
    "topicmix": latest_file("lda_topic_mix_by_subset_*.csv", out_dir),
    "hl": latest_file("continuous_HL_shifts_*.csv", out_dir),
    "master": latest_file("stats_and_topics_summary_by_subset_*.json", out_dir),
    # NEW: comparator-segment outputs
    "seg_cont": latest_file("continuous_segment_tests_by_subset_*.csv", out_dir),
    "seg_presence": latest_file("between_presence_by_group_*.csv", out_dir),
    "seg_presence_exp": latest_file("between_presence_expected_by_group_*.csv", out_dir),
    "seg_presence_resid": latest_file("between_presence_residuals_by_group_*.csv", out_dir),
    # convenience (optional merge for group means)
    "seg_latest": "comparator_segments_latest.csv"
}

chi2_cont_df = safe_read_csv(paths["chi2_cont"])
chi2_exp_df  = safe_read_csv(paths["chi2_exp"])
chi2_z_df    = safe_read_csv(paths["chi2_z"])
chi2_p_df    = safe_read_csv(paths["chi2_p"])
chi2_padj_df = safe_read_csv(paths["chi2_padj"])
two_prop_df  = safe_read_csv(paths["two_prop"])
binom_df     = safe_read_csv(paths["binom"])
cont_df      = safe_read_csv(paths["cont"])
topics_df    = safe_read_csv(paths["topics"])
topicmix_df  = safe_read_csv(paths["topicmix"])
hl_df        = safe_read_csv(paths["hl"])
master_json  = safe_read_json(paths["master"])

# NEW: segment metrics
seg_cont_df       = safe_read_csv(paths["seg_cont"])
seg_presence_df   = safe_read_csv(paths["seg_presence"])
seg_presence_exp  = safe_read_csv(paths["seg_presence_exp"])
seg_presence_res  = safe_read_csv(paths["seg_presence_resid"])
seg_latest_df     = safe_read_csv(paths["seg_latest"])

# ---------- Summaries from Cell 1 (results_df, f1_analysis, comparator.env_info) ----------
def create_summary_stats_html():
    if 'results_df' not in globals() or results_df is None or results_df.empty:
        return "<p><em>No results data available</em></p>"
    dcounts = results_df['Original_Dataset'].value_counts()
    ccounts = results_df['Category_Framework'].value_counts()

    items_d = "".join([f"<li><strong>{k}:</strong> {int(v):,} instances</li>" for k,v in dcounts.items()])
    items_c = "".join([f"<li><strong>{k}:</strong> {int(v):,} ({(v/len(results_df))*100:.1f}%)</li>" for k,v in ccounts.items()])

    return f"""
    <div class="summary-stats">
        <div class="stat-group">
            <h4>Dataset Distribution</h4>
            <ul>{items_d}</ul>
            <p><strong>Total Instances:</strong> {len(results_df):,}</p>
        </div>
        <div class="stat-group">
            <h4>Category Distribution</h4>
            <ul>{items_c}</ul>
        </div>
    </div>
    """

def f1_summary_html():
    # Try multiple sources
    micro_rb = macro_rb = micro_nlp = macro_nlp = None
    pairs_rb = pairs_nlp = None
    if 'f1_analysis' in globals() and isinstance(f1_analysis, dict):
        rb = f1_analysis.get('rule_based_vs_manual') or f1_analysis.get('Rule-Based vs Manual')
        nl = f1_analysis.get('nlp_vs_manual') or f1_analysis.get('NLP vs Manual')
        if rb and rb.get("overall"):
            micro_rb = rb["overall"].get("micro_f1")
            macro_rb = rb["overall"].get("macro_f1")
            pairs_rb = rb.get("pairs")
        if nl and nl.get("overall"):
            micro_nlp = nl["overall"].get("micro_f1")
            macro_nlp = nl["overall"].get("macro_f1")
            pairs_nlp = nl.get("pairs")
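    # Fall back to the comparator object when f1_analysis is missing or lacks usable scores.
    if micro_rb is None and micro_nlp is None and 'comparator' in globals():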
        try:
            fa = comparator.comparison_results.get('f1_analysis', {})
            rb = fa.get('rule_based_vs_manual')
            nl = fa.get('nlp_vs_manual')
            if rb and rb.get("overall"):
                micro_rb = rb["overall"].get("micro_f1")
                macro_rb = rb["overall"].get("macro_f1")
                pairs_rb = rb.get("pairs")
            if nl and nl.get("overall"):
                micro_nlp = nl["overall"].get("micro_f1")
                macro_nlp = nl["overall"].get("macro_f1")
                pairs_nlp = nl.get("pairs")
        except Exception:
            pass

    if micro_rb is None and micro_nlp is None:
        return "<p><em>F1 metrics unavailable (run Cell 1 in this session).</em></p>"

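    def fmt(v):
        # Format a score to three decimals; tolerate a missing value.
        return f"{v:.3f}" if v is not None else "n/a"

    def pairs_note(n):
        return f" (pairs={n})" if n else ""

    lines = []
    if micro_rb is not None:
        lines.append(f"• Rule-Based vs Manual: micro-F1 = {fmt(micro_rb)}, macro-F1 = {fmt(macro_rb)}{pairs_note(pairs_rb)}")
    if micro_nlp is not None:
        lines.append(f"• NLP vs Manual: micro-F1 = {fmt(micro_nlp)}, macro-F1 = {fmt(macro_nlp)}{pairs_note(pairs_nlp)}")

    # results_df comes from Cell 1 and may be absent in a fresh session.
    has_results = 'results_df' in globals() and results_df is not None
    total_line = f"<br>• Total instances processed: {len(results_df):,} (across all datasets)" if has_results else ""

    return f"""
    <div class="highlight">
      <strong>F1 Summary:</strong><br>
      {'<br>'.join(lines)}
      {total_line}
    </div>
    """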

def env_html():
    info = {}
    if 'comparator' in globals():
        try:
            info = comparator.env_info
        except Exception:
            info = {}
    if not info:
        return ""
    items = "".join([f"<li><strong>{k}:</strong> {v}</li>" for k,v in info.items()])
    return f"""
    <div class="methodology">
      <strong>Environment:</strong>
      <ul>{items}</ul>
    </div>
    """

# ---------- Analytic highlight snippets ----------
def chi2_summary_html():
    if not master_json:
        return ""
    chi = master_json.get("chi_square_4way", {})
    chi_line = (f"χ² = {chi.get('chi2', float('nan')):.3f} | df = {chi.get('dof', 0)} | "
                f"p = {chi.get('p_value', float('nan')):.3g} | Cramér’s V = {chi.get('cramers_v', float('nan')):.3f} "
                f"| Monte-Carlo={chi.get('monte_carlo', False)} | N={chi.get('N', 0)}")
    return f"<p>{chi_line}</p>"

def top_residuals_html(k=10):
    if chi2_z_df.empty or chi2_cont_df.empty or chi2_exp_df.empty:
        return ""
    # Melt z
    z_long = chi2_z_df.copy()
    z_long = z_long.rename(columns={"Unnamed: 0":"Category_Framework"}) if "Unnamed: 0" in z_long.columns else z_long
    z_long = z_long.melt(id_vars=[c for c in ["Category_Framework"] if c in z_long.columns],
                         var_name="Group", value_name="z")
    z_long["abs_z"] = z_long["z"].abs()
    # Get obs/exp
    cont = chi2_cont_df.copy()
    cont = cont.rename(columns={"Unnamed: 0":"Category_Framework"}) if "Unnamed: 0" in cont.columns else cont
    exp = chi2_exp_df.copy()
    exp = exp.rename(columns={"Unnamed: 0":"Category_Framework"}) if "Unnamed: 0" in exp.columns else exp
    # Align to long
    obs_long = cont.melt(id_vars=[c for c in ["Category_Framework"] if c in cont.columns],
                         var_name="Group", value_name="obs")
    exp_long = exp.melt(id_vars=[c for c in ["Category_Framework"] if c in exp.columns],
                        var_name="Group", value_name="exp")
    m = (z_long.merge(obs_long, on=["Category_Framework","Group"], how="left")
               .merge(exp_long, on=["Category_Framework","Group"], how="left"))
    m = m.sort_values("abs_z", ascending=False).head(k)
    return create_table_html(m, f"Top {k} standardized residuals (|z|)", max_rows=k)

def sig_two_prop_html(alpha=0.05, top=15):
    if two_prop_df.empty or "p_adj_BH" not in two_prop_df.columns:
        return ""
    df = two_prop_df.copy()
    # Keep only significant rows; order by adjusted p, then by descending |h|
    df["|h|"] = df["cohens_h"].abs() if "cohens_h" in df.columns else 0.0
    sig = df[df["p_adj_BH"] < alpha].sort_values(["p_adj_BH", "|h|"], ascending=[True, False])
    cols = [c for c in ["Subset","Category","prop_A","prop_B","z","p_value","p_adj_BH","cohens_h"] if c in sig.columns]
    return create_table_html(sig[cols], f"Two-proportion (BH<{alpha}) — top {top}", max_rows=top)

def sig_binom_html(alpha=0.05, top=15):
    if binom_df.empty or "p_adj_BH" not in binom_df.columns:
        return ""
    df = binom_df.copy()
    sig = df[df["p_adj_BH"] < alpha].sort_values("p_adj_BH")
    cols = [c for c in ["Subset","Category","count_A","n_A","p_ref_BNC","p_value","p_adj_BH"] if c in sig.columns]
    return create_table_html(sig[cols], f"Binomial vs BNC (BH<{alpha}) — top {top}", max_rows=top)

def sig_continuous_html(alpha=0.05):
    if cont_df.empty:
        return ""
    d = cont_df.copy()
    # Prefer Mann–Whitney U adjusted p; fall back to Welch t
    if "U_padj_BH" in d.columns:
        sig = d[d["U_padj_BH"] < alpha].sort_values("U_padj_BH")
        label = "Mann–Whitney U (BH)"
        padj_col = "U_padj_BH"
        p_col = "U_pvalue"
    elif "t_padj_BH" in d.columns:
        sig = d[d["t_padj_BH"] < alpha].sort_values("t_padj_BH")
        label = "Welch t (BH)"
        padj_col = "t_padj_BH"
        p_col = "t_pvalue"
    else:
        return ""
    if sig.empty:
        return "<p><em>No continuous feature differences were significant after FDR correction.</em></p>"
    cols = [c for c in ["Feature","Subset","A_n","B_n","A_mean","B_mean","A_median","B_median",
                        "t_stat","t_pvalue","U_stat","U_pvalue","hedges_g","cliffs_delta", padj_col] if c in sig.columns]
    return create_table_html(sig[cols], f"Continuous features — significant by {label}", max_rows=15)

# NEW: comparator-segment highlights for Exec Summary (optional)
def seg_exec_highlights_html(alpha=0.05):
    if seg_cont_df.empty:
        return ""
    d = seg_cont_df.copy()
    # Prefer BH on U, else BH on t, else raw p on U.
    if "U_padj_BH" in d.columns:
        d["sig"] = d["U_padj_BH"] < alpha
        pcol = "U_padj_BH"
    elif "t_padj_BH" in d.columns:
        d["sig"] = d["t_padj_BH"] < alpha
        pcol = "t_padj_BH"
    else:
        if "U_pvalue" in d.columns:
            d["sig"] = d["U_pvalue"] < alpha
            pcol = "U_pvalue"
        else:
            return ""
    lines = []
    # Focus on Manual vs BNC for readability
    dd = d[(d["Subset"] == "Joyce_Manual") & d["sig"]].copy()
    for feat in ["Seg_Between","Seg_Post_Ratio","Seg_Total","Seg_Pre"]:
        if feat in dd["Feature"].unique():
            row = dd[dd["Feature"]==feat].sort_values(pcol).head(1)
            if not row.empty:
                r = row.iloc[0]
                a_mean = r.get("A_mean", np.nan)
                b_mean = r.get("B_mean", np.nan)
                g = r.get("hedges_g", np.nan)
                lines.append(f"• <strong>{feat}</strong>: Manual vs BNC differs (p≈{r.get(pcol, np.nan):.3g}); "
                             f"means {a_mean:.2f} vs {b_mean:.2f}, Hedges’ g≈{g:.2f}.")
    if not lines:
        return ""
    return f"""
    <div class="highlight">
      <strong>Comparator-Segment Highlights:</strong><br>
      {'<br>'.join(lines)}
    </div>
    """

# NEW: tables for segment metrics
def seg_cont_tables_html():
    if seg_cont_df.empty:
        return ""
    blocks = []
    blocks.append("<h3>Comparator-Segment Metrics (Pre / Between / Post)</h3>")
    blocks.append("<p>Welch t and Mann–Whitney U on segment token counts and ratios (BH correction when available).</p>")
    blocks.append(create_table_html(seg_cont_df, "All comparator-segment tests (subset vs BNC)", max_rows=20))
    # Significant only
    d = seg_cont_df.copy()
    label = None
    if "U_padj_BH" in d.columns:
        sig = d[d["U_padj_BH"] < 0.05].sort_values("U_padj_BH")
        label = "Mann–Whitney U (BH)"
    elif "t_padj_BH" in d.columns:
        sig = d[d["t_padj_BH"] < 0.05].sort_values("t_padj_BH")
        label = "Welch t (BH)"
    else:
        sig = pd.DataFrame()
    if not sig.empty and label:
        cols = [c for c in ["Feature","Subset","A_n","B_n","A_mean","B_mean","A_median","B_median",
                            "t_stat","t_pvalue","U_stat","U_pvalue","hedges_g","cliffs_delta",
                            "U_padj_BH" if "U_padj_BH" in sig.columns else "t_padj_BH"] if c in sig.columns]
        blocks.append(create_table_html(sig[cols], f"Significant comparator-segment tests — {label}", max_rows=20))
    return "\n".join(blocks)

def seg_presence_html():
    if seg_presence_df.empty:
        return ""
    blocks = []
    blocks.append("<h3>Between-Clause Presence × Group (χ²)</h3>")
    blocks.append(create_table_html(seg_presence_df, "Observed counts: presence vs absence by group", max_rows=10, index=True))
    # If we have expected/residuals, include them
    if not seg_presence_exp.empty:
        blocks.append(create_table_html(seg_presence_exp, "Expected counts under independence", max_rows=10, index=True))
    if not seg_presence_res.empty:
        blocks.append(create_table_html(seg_presence_res, "Pearson residuals", max_rows=10, index=True))
    # Compute χ² here as well (defensive)
    if _HAS_SCIPY:
        try:
            tbl = seg_presence_df.copy()
            if "Unnamed: 0" in tbl.columns:
                tbl = tbl.rename(columns={"Unnamed: 0":"__Group__"})
                tbl = tbl.set_index("__Group__")
            chi2, p, dof, exp = chi2_contingency(tbl, correction=False)
            blocks.append(f"<p>χ² = {chi2:.3f} | df = {dof} | p = {p:.6g}</p>")
        except Exception:
            pass
    return "\n".join(blocks)

def seg_group_means_html():
    """
    If we can merge comparator_segments_latest.csv to results_df via Instance_ID,
    show group means of segment metrics (Joyce_Manual / Restrictive / Less-Restrictive / BNC).
    """
    if seg_latest_df.empty or 'results_df' not in globals() or results_df is None or results_df.empty:
        return ""
    seg = seg_latest_df.copy()
    # try to detect join key and segment columns
    id_col = None
    for c in ["Instance_ID","instance_id","ID","Id"]:
        if c in seg.columns:
            id_col = c
            break
    if id_col is None or "Instance_ID" not in results_df.columns:
        return ""
    # build __Group__ from Original_Dataset like in stats cell
    LABELS = {
        "Manual_CloseReading":      "Joyce_Manual",
        "Restrictive_Dubliners":    "Joyce_Restrictive",
        "NLP_LessRestrictive_PG":   "Joyce_LessRestrictive",
        "BNC_Baseline":             "BNC"
    }
    base = results_df[["Instance_ID","Original_Dataset"]].copy()
    base["__Group__"] = base["Original_Dataset"].map(LABELS).fillna(base["Original_Dataset"])
    merged = base.merge(seg, left_on="Instance_ID", right_on=id_col, how="left")
    seg_cols = [c for c in merged.columns if c.startswith("Seg_")]  # from the stats cell
    if not seg_cols:
        # fallback: common names from latest CSV
        for c_old, c_new in [("Pre_Tokens","Seg_Pre"),("Between_Tokens_Total","Seg_Between"),
                             ("Post_Tokens","Seg_Post")]:
            if c_old in merged.columns and c_new not in merged.columns:
                merged[c_new] = pd.to_numeric(merged[c_old], errors="coerce")
        seg_cols = [c for c in merged.columns if c.startswith("Seg_")]
    if not seg_cols or "__Group__" not in merged.columns:
        return ""
    means = merged.groupby("__Group__")[seg_cols].mean(numeric_only=True).round(3)
    return create_table_html(means, "Comparator-segment group means (merged via Instance_ID)", max_rows=10, index=True)

# ---------- Build the HTML ----------
html_content = f"""
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<title>Joyce Simile Research: Comprehensive Linguistic Analysis Report</title>
<style>
    body {{
        font-family: 'Times New Roman', serif;
        line-height: 1.6;
        margin: 0;
        padding: 20px;
        background-color: #f9f9f9;
        color: #333;
    }}
    .container {{
        max-width: 1200px;
        margin: 0 auto;
        background: white;
        padding: 30px;
        border-radius: 8px;
        box-shadow: 0 2px 10px rgba(0,0,0,0.1);
    }}
    .header {{
        text-align: center;
        border-bottom: 3px solid #2c3e50;
        padding-bottom: 20px;
        margin-bottom: 30px;
    }}
    .header h1 {{ color: #2c3e50; margin: 0; font-size: 2.2em; font-weight: bold; }}
    .header .subtitle {{ color: #7f8c8d; font-size: 1.1em; margin: 10px 0 5px 0; font-style: italic; }}
    .header .timestamp {{ color: #95a5a6; font-size: 0.9em; }}
    .section {{
        margin: 30px 0;
        padding: 20px;
        border-left: 4px solid #3498db;
        background-color: #f8f9fa;
    }}
    .section h2 {{
        color: #2c3e50;
        margin-top: 0; border-bottom: 2px solid #ecf0f1; padding-bottom: 10px;
    }}
    .section h3 {{ color: #34495e; margin-top: 25px; }}
    .section h4 {{ color: #5d6d7e; margin-top: 20px; margin-bottom: 10px; }}
    .analysis-table {{ width: 100%; border-collapse: collapse; margin: 15px 0; font-size: 0.9em; }}
    .analysis-table th {{ background:#34495e; color:#fff; padding:12px 8px; text-align:left; font-weight:bold; }}
    .analysis-table td {{ padding:10px 8px; border-bottom:1px solid #ddd; }}
    .analysis-table tr:nth-child(even) {{ background-color:#f2f2f2; }}
    .analysis-table tr:hover {{ background-color:#e8f4fd; }}
    .summary-stats {{ display:grid; grid-template-columns:1fr 1fr; gap:30px; margin:20px 0; }}
    .stat-group {{ background:#fff; padding:20px; border-radius:6px; border:1px solid #e1e8ed; }}
    .stat-group h4 {{ margin-top:0; color:#2c3e50; border-bottom:1px solid #ecf0f1; padding-bottom:8px; }}
    .stat-group ul {{ list-style-type:none; padding:0; }}
    .stat-group li {{ padding:5px 0; border-bottom:1px solid #f8f9fa; }}
    .highlight {{ background:#d1ecf1; padding:15px; border-left:4px solid #17a2b8; margin:15px 0; }}
    .methodology {{ background:#f8f9fa; padding:15px; border-radius:5px; margin:15px 0; font-style:italic; }}
    .table-container {{ margin:20px 0; }}
    .table-wrapper {{ overflow-x:auto; }}
    .truncated-note {{ color:#6c757d; font-size:0.9em; margin-top:5px; }}
    .footer {{ text-align:center; margin-top:40px; padding-top:20px; border-top:2px solid #ecf0f1; color:#7f8c8d; font-size:0.9em; }}
    @media (max-width: 768px) {{
        .summary-stats {{ grid-template-columns:1fr; }}
        .container {{ padding:15px; }}
        .analysis-table {{ font-size:0.85em; }}
    }}
</style>
</head>
<body>
<div class="container">
    <div class="header">
        <h1>Joyce Simile Research</h1>
        <div class="subtitle">Comprehensive Linguistic Analysis Report</div>
        <div class="subtitle">Manual Annotations vs Extraction Methods vs BNC Baseline</div>
        <div class="timestamp">Generated on {report_timestamp}</div>
    </div>

    <div class="section">
        <h2>Executive Summary</h2>
        <p>This report compiles the results of a comparative analysis of simile usage in <em>Dubliners</em>, spanning manual expert annotations, a restrictive rule-based extractor, a less-restrictive NLP extractor, and a BNC baseline. Categories were harmonised with <strong>Quasi_Similes</strong> as the unified label for the Joyce/BNC quasi-simile phenomenon.</p>
        {chi2_summary_html()}
        {seg_exec_highlights_html(alpha=0.05)}
    </div>

    <div class="section">
        <h2>Dataset Overview</h2>
        <p>Four datasets were analysed, representing complementary identification approaches and a baseline:</p>
        {create_summary_stats_html()}
        {env_html()}
        <div class="methodology">
            <strong>Methodology:</strong> Features include token counts, pre/post-comparator ratio, POS distribution, syntactic complexity, TextBlob sentiment (exploratory), and comparative structure flags.
            <br><em>New:</em> Comparator-segment metrics quantify token distributions <strong>before</strong> comparators (Pre), <strong>between</strong> multiple comparators (Between), and <strong>after</strong> comparators (Post). Between spans are computed across detected comparator anchors (e.g., <code>as … as</code>, <code>like</code>, framing punctuation); we report totals and ratios, and assess group differences with Welch t and Mann–Whitney U tests, plus χ² for presence (Between&gt;0).
        </div>
    </div>

    <div class="section">
        <h2>Performance Metrics</h2>
        <h3>Instance-Aligned F1 Scores</h3>
        {f1_summary_html()}
    </div>

    <div class="section">
        <h2>Statistical Results</h2>
        <h3>4-way Category Distribution</h3>
"""

# 4-way chi-square tables
if not chi2_cont_df.empty:
    html_content += create_table_html(chi2_cont_df, "Observed counts by group × category", max_rows=10, index=True)
if not chi2_exp_df.empty:
    html_content += create_table_html(chi2_exp_df, "Expected counts under independence", max_rows=10, index=True)
if not chi2_z_df.empty:
    html_content += create_table_html(chi2_z_df, "Pearson standardized residuals (z)", max_rows=10, index=True)
    html_content += top_residuals_html(k=10)
if not chi2_padj_df.empty:
    html_content += create_table_html(chi2_padj_df, "Per-cell p-values (BH-adjusted)", max_rows=10, index=True)

# Two-proportion
if not two_prop_df.empty:
    html_content += """
        <h3>Two-Proportion Tests (Joyce subsets vs BNC)</h3>
        <p>Newcombe confidence intervals and Cohen’s h. BH correction applied across tests.</p>
    """
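    # Defensive: if the CSV lacks BH-adjusted p-values, compute them here.
    # Step-up procedure: walk the p-values from largest to smallest, scale each
    # by n / rank, and carry a running minimum so adjusted values stay monotone.
    # Missing p-values are treated as 1.0, which is conservative.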
    if "p_adj_BH" not in two_prop_df.columns and "p_value" in two_prop_df.columns:
        pv = two_prop_df["p_value"].fillna(1.0).to_numpy()
        order = np.argsort(pv)
        ranked = np.empty_like(pv, dtype=float)
        cummin = 1.0
        n = pv.size
        for i, idx in enumerate(order[::-1], start=1):
            rank = n - i + 1
            val = pv[idx] * n / rank
            cummin = min(cummin, val)
            ranked[idx] = min(cummin, 1.0)
        two_prop_df["p_adj_BH"] = ranked
    html_content += create_table_html(two_prop_df, "All two-proportion results", max_rows=15)
    html_content += sig_two_prop_html(alpha=0.05, top=15)

# Binomial
if not binom_df.empty:
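    # Defensive: compute BH-adjusted p-values (step-up, missing p treated as 1.0) if the CSV lacks them.
    if "p_adj_BH" not in binom_df.columns and "p_value" in binom_df.columns: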
        pv = binom_df["p_value"].fillna(1.0).to_numpy()
        order = np.argsort(pv)
        ranked = np.empty_like(pv, dtype=float)
        cummin = 1.0
        n = pv.size
        for i, idx in enumerate(order[::-1], start=1):
            rank = n - i + 1
            val = pv[idx] * n / rank
            cummin = min(cummin, val)
            ranked[idx] = min(cummin, 1.0)
        binom_df["p_adj_BH"] = ranked
    html_content += """
        <h3>Binomial Tests (Joyce subsets vs BNC reference proportion)</h3>
        <p>One-sample tests per category per subset; BH correction across tests.</p>
    """
    html_content += create_table_html(binom_df, "All binomial results", max_rows=12)
    html_content += sig_binom_html(alpha=0.05, top=12)

# Continuous
if not cont_df.empty:
    html_content += """
        <h3>Continuous Features</h3>
        <p>Welch t and Mann–Whitney U with effect sizes (Hedges’ g, Cliff’s δ). BH correction applied.</p>
    """
    html_content += create_table_html(cont_df, "Continuous feature comparisons", max_rows=12)
    html_content += sig_continuous_html(alpha=0.05)
if not hl_df.empty:
    html_content += create_table_html(hl_df, "Hodges–Lehmann location shifts (A − B) with bootstrap CIs", max_rows=12)

# NEW: Comparator-segment section
if not seg_cont_df.empty or not seg_presence_df.empty:
    html_content += seg_cont_tables_html()
    html_content += seg_presence_html()

# Optional: group means from latest segments merged to results_df
html_content += seg_group_means_html()

# Topics
html_content += """
    </div>
    <div class="section">
        <h2>Topic Modelling</h2>
        <p>Per-subset LDA topics (CountVectorizer; 5 topics × 10 words). Topic mixture weights summarised by mean.</p>
"""
if not topics_df.empty:
    html_content += create_table_html(topics_df, "Top words per topic × group", max_rows=20)
if not topicmix_df.empty:
    html_content += create_table_html(topicmix_df, "Mean topic weights per group", max_rows=20)

# Close the Topic Modelling section unconditionally so the markup stays
# well-formed even when results_df is unavailable.
html_content += """
    </div>
"""

# Sample of comprehensive results
if 'results_df' in globals() and isinstance(results_df, pd.DataFrame) and not results_df.empty:
    sample_cols = [c for c in ['Instance_ID','Original_Dataset','Category_Framework','Comparator_Type',
                               'Sentence_Length','Sentiment_Polarity','Pre_Post_Ratio'] if c in results_df.columns]
    sample_results = results_df.head(25)[sample_cols].round(3)
    html_content += f"""
    <div class="section">
        <h2>Comprehensive Results — Sample</h2>
        <p>First 25 rows of the full dataset (not a random sample):</p>
        {create_table_html(sample_results, "Sample of comprehensive analysis", max_rows=25)}
        <div class="methodology">
            The full dataset includes lemmatisation, POS distributions, syntactic complexity, and comparative structure features.
        </div>
    </div>
    """

# Files list
files_list = []
files_list.append("comprehensive_linguistic_analysis_corrected.csv")
for k, p in paths.items():
    if p:
        files_list.append(os.path.relpath(p))
files_items = "".join([f"<li><code>{f}</code></li>" for f in sorted(set(files_list))])

html_content += f"""
    <div class="section">
        <h2>Files Generated</h2>
        <ul>{files_items}</ul>
    </div>

    <div class="footer">
        <p>Generated by Comprehensive Linguistic Analysis Pipeline</p>
        <p>Joyce Simile Research Project • {report_timestamp}</p>
        <p><em>Categories harmonised with unified <strong>Quasi_Similes</strong>; results computed using FDR corrections where applicable.</em></p>
    </div>
</div>
</body>
</html>
"""

# Save HTML report
report_filename = f"joyce_simile_analysis_report_{report_date}.html"
with open(report_filename, 'w', encoding='utf-8') as f:
    f.write(html_content)

print(f"✓ Academic HTML report generated: {report_filename}")
print(f"✓ File size: {os.path.getsize(report_filename):,} bytes")
print(f"✓ Report length: {len(html_content):,} characters")
print("\nREPORT READY — open in a browser to view.")


GENERATING ACADEMIC HTML REPORT
✓ Academic HTML report generated: joyce_simile_analysis_report_20250825_113445.html
✓ File size: 60,206 bytes
✓ Report length: 60,166 characters

REPORT READY — open in a browser to view.

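The report cell above recomputes Benjamini–Hochberg adjusted p-values inline, once for the two-proportion table and once for the binomial table. A minimal vectorised sketch of the same step-up adjustment follows. The function name `bh_adjust` is hypothetical and not part of the notebook, and missing p-values are assumed to have been replaced by 1.0 beforehand, as the cell does with `fillna(1.0)`.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper, not in the notebook)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    if n == 0:
        return p.copy()
    order = np.argsort(p)                          # indices of p-values, ascending
    scaled = p[order] * n / np.arange(1, n + 1)    # p_(i) * n / i
    # Running minimum from the largest p-value downwards keeps the result monotone.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n, dtype=float)
    adj[order] = np.clip(adj_sorted, 0.0, 1.0)     # restore the original order
    return adj

# Usage with illustrative p-values (not taken from the analysis outputs)
example = bh_adjust([0.01, 0.04, 0.03, 0.5])
```

A single helper of this kind could replace both inline loops, so the two tables cannot drift apart if one loop is edited later.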

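The methodology note in the report describes comparator-segment metrics as token counts before, between, and after comparator anchors. The sketch below illustrates that idea under simplifying assumptions: the anchor list, the regex tokeniser, and the function name `segment_counts` are all hypothetical, every occurrence of "as" or "like" is treated as an anchor, and framing punctuation is ignored. It is not the notebook's detector. The output keys reuse the `Seg_Pre`, `Seg_Between`, and `Seg_Post` column names that the report cell reads.

```python
import re

# Hypothetical anchor list for illustration only; the notebook's detector also
# handles framing punctuation and is not reproduced here.
COMPARATORS = {"like", "as"}

def segment_counts(sentence):
    """Count tokens before, between, and after comparator anchors.

    Returns None when no anchor is found.
    """
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    anchors = [i for i, tok in enumerate(tokens) if tok in COMPARATORS]
    if not anchors:
        return None
    first, last = anchors[0], anchors[-1]
    anchor_set = set(anchors)
    # Tokens strictly between the first and last anchor, excluding anchors themselves.
    between = sum(1 for i in range(first + 1, last) if i not in anchor_set)
    return {
        "Seg_Pre": first,                       # tokens before the first anchor
        "Seg_Between": between,
        "Seg_Post": len(tokens) - last - 1,     # tokens after the last anchor
    }

# Usage with a generic example sentence (not a quotation from Dubliners)
counts = segment_counts("She was as pale as a ghost.")
```

Because a bare word match also catches non-comparative uses of "as", a real detector would need syntactic filtering before these counts are interpreted.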
In [None]:
# =============================================================================
# ACADEMIC HTML REPORT GENERATOR — UPDATED (data-driven, harmonised labels)
# Generates a comprehensive, reproducible HTML report from Cells 1–2 outputs
# =============================================================================

import os
import glob
import json
import pandas as pd
from datetime import datetime
import numpy as np

print("GENERATING ACADEMIC HTML REPORT")
print("=" * 50)

# Timestamp for report
now = datetime.now()
report_timestamp = now.strftime("%Y-%m-%d %H:%M:%S")
report_date = now.strftime("%Y%m%d_%H%M%S")

# ---------- Helpers ----------
def latest_file(pattern, base="analysis_outputs"):
    files = glob.glob(os.path.join(base, pattern))
    # Use modification time: on Linux, ctime tracks metadata changes rather than creation.
    return max(files, key=os.path.getmtime) if files else None

def safe_read_csv(path):
    if not path or not os.path.exists(path):
        return pd.DataFrame()
    try:
        return pd.read_csv(path)
    except Exception as e:
        print(f"[WARN] Could not load CSV {path}: {e}")
        return pd.DataFrame()

def safe_read_json(path):
    if not path or not os.path.exists(path):
        return {}
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except Exception as e:
        print(f"[WARN] Could not load JSON {path}: {e}")
        return {}

def df_for_html(df, index=False, max_rows=None):
    if df is None or df.empty:
        return pd.DataFrame()
    d = df.copy()
    if max_rows is not None and len(d) > max_rows:
        d = d.head(max_rows)
        d.attrs["truncated"] = True  # record truncation in DataFrame.attrs metadata
    return d

def create_table_html(df, title="", max_rows=20, index=False):
    """Create HTML table with styling; auto-handle empty and truncation note."""
    if df is None or df.empty:
        return f"<p><em>No data available for {title}</em></p>"
    d = df.copy()
    trunc = False
    if max_rows and len(d) > max_rows:
        d = d.head(max_rows)
        trunc = True
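    # Show the index as a column only when requested and meaningful
    # (named, or not the default RangeIndex that pd.read_csv produces).
    if index and (d.index.name is not None or not isinstance(d.index, pd.RangeIndex)):
        d = d.reset_index()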
    table_html = d.to_html(classes='analysis-table', escape=False, index=False)
    note = f"<p class='truncated-note'><em>Showing first {max_rows} of {len(df)} rows</em></p>" if trunc else ""
    return f"""
    <div class="table-container">
        <h4>{title}</h4>
        <div class="table-wrapper">{table_html}</div>
        {note}
    </div>
    """

# ---------- Pull in analysis outputs from Cell 2 ----------
out_dir = "analysis_outputs"
os.makedirs(out_dir, exist_ok=True)

paths = {
    "chi2_cont": latest_file("chi2_contingency_by_subset_*.csv", out_dir),
    "chi2_exp": latest_file("chi2_expected_by_subset_*.csv", out_dir),
    "chi2_z": latest_file("chi2_pearson_z_by_subset_*.csv", out_dir),
    "chi2_p": latest_file("chi2_cell_p_by_subset_*.csv", out_dir),
    "chi2_padj": latest_file("chi2_cell_padj_BH_by_subset_*.csv", out_dir),
    "two_prop": latest_file("two_prop_newcombe_by_subset_*.csv", out_dir),
    "binom": latest_file("binomial_tests_by_subset_*.csv", out_dir),
    "cont": latest_file("continuous_tests_by_subset_*.csv", out_dir),
    "topics": latest_file("lda_topics_by_subset_*.csv", out_dir),
    "topicmix": latest_file("lda_topic_mix_by_subset_*.csv", out_dir),
    "hl": latest_file("continuous_HL_shifts_*.csv", out_dir),
    "master": latest_file("stats_and_topics_summary_by_subset_*.json", out_dir),
}

chi2_cont_df = safe_read_csv(paths["chi2_cont"])
chi2_exp_df  = safe_read_csv(paths["chi2_exp"])
chi2_z_df    = safe_read_csv(paths["chi2_z"])
chi2_p_df    = safe_read_csv(paths["chi2_p"])
chi2_padj_df = safe_read_csv(paths["chi2_padj"])
two_prop_df  = safe_read_csv(paths["two_prop"])
binom_df     = safe_read_csv(paths["binom"])
cont_df      = safe_read_csv(paths["cont"])
topics_df    = safe_read_csv(paths["topics"])
topicmix_df  = safe_read_csv(paths["topicmix"])
hl_df        = safe_read_csv(paths["hl"])
master_json  = safe_read_json(paths["master"])

# ---------- Summaries from Cell 1 (results_df, f1_analysis, comparator.env_info) ----------
def create_summary_stats_html():
    if 'results_df' not in globals() or results_df is None or results_df.empty:
        return "<p><em>No results data available</em></p>"
    dcounts = results_df['Original_Dataset'].value_counts()
    ccounts = results_df['Category_Framework'].value_counts()

    items_d = "".join([f"<li><strong>{k}:</strong> {int(v):,} instances</li>" for k,v in dcounts.items()])
    items_c = "".join([f"<li><strong>{k}:</strong> {int(v):,} ({(v/len(results_df))*100:.1f}%)</li>" for k,v in ccounts.items()])

    return f"""
    <div class="summary-stats">
        <div class="stat-group">
            <h4>Dataset Distribution</h4>
            <ul>{items_d}</ul>
            <p><strong>Total Instances:</strong> {len(results_df):,}</p>
        </div>
        <div class="stat-group">
            <h4>Category Distribution</h4>
            <ul>{items_c}</ul>
        </div>
    </div>
    """

def f1_summary_html():
    # Try multiple sources
    micro_rb = macro_rb = micro_nlp = macro_nlp = None
    pairs_rb = pairs_nlp = None
    if 'f1_analysis' in globals() and isinstance(f1_analysis, dict):
        rb = f1_analysis.get('rule_based_vs_manual') or f1_analysis.get('Rule-Based vs Manual')
        nl = f1_analysis.get('nlp_vs_manual') or f1_analysis.get('NLP vs Manual')
        if rb and rb.get("overall"):
            micro_rb = rb["overall"].get("micro_f1")
            macro_rb = rb["overall"].get("macro_f1")
            pairs_rb = rb.get("pairs")
        if nl and nl.get("overall"):
            micro_nlp = nl["overall"].get("micro_f1")
            macro_nlp = nl["overall"].get("macro_f1")
            pairs_nlp = nl.get("pairs")
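    # Fall back to the comparator object when f1_analysis is missing or lacks usable scores.
    if micro_rb is None and micro_nlp is None and 'comparator' in globals():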
        try:
            fa = comparator.comparison_results.get('f1_analysis', {})
            rb = fa.get('rule_based_vs_manual')
            nl = fa.get('nlp_vs_manual')
            if rb and rb.get("overall"):
                micro_rb = rb["overall"].get("micro_f1")
                macro_rb = rb["overall"].get("macro_f1")
                pairs_rb = rb.get("pairs")
            if nl and nl.get("overall"):
                micro_nlp = nl["overall"].get("micro_f1")
                macro_nlp = nl["overall"].get("macro_f1")
                pairs_nlp = nl.get("pairs")
        except Exception:
            pass

    if micro_rb is None and micro_nlp is None:
        return "<p><em>F1 metrics unavailable (run Cell 1 in this session).</em></p>"

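    def fmt(v):
        # Format a score to three decimals; tolerate a missing value.
        return f"{v:.3f}" if v is not None else "n/a"

    def pairs_note(n):
        return f" (pairs={n})" if n else ""

    lines = []
    if micro_rb is not None:
        lines.append(f"• Rule-Based vs Manual: micro-F1 = {fmt(micro_rb)}, macro-F1 = {fmt(macro_rb)}{pairs_note(pairs_rb)}")
    if micro_nlp is not None:
        lines.append(f"• NLP vs Manual: micro-F1 = {fmt(micro_nlp)}, macro-F1 = {fmt(macro_nlp)}{pairs_note(pairs_nlp)}")

    # results_df comes from Cell 1 and may be absent in a fresh session.
    has_results = 'results_df' in globals() and results_df is not None
    total_line = f"<br>• Total instances processed: {len(results_df):,} (across all datasets)" if has_results else ""

    return f"""
    <div class="highlight">
      <strong>F1 Summary:</strong><br>
      {'<br>'.join(lines)}
      {total_line}
    </div>
    """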

def env_html():
    info = {}
    if 'comparator' in globals():
        try:
            info = comparator.env_info
        except Exception:
            info = {}
    if not info:
        return ""
    items = "".join([f"<li><strong>{k}:</strong> {v}</li>" for k,v in info.items()])
    return f"""
    <div class="methodology">
      <strong>Environment:</strong>
      <ul>{items}</ul>
    </div>
    """

# ---------- Build analytic highlight snippets ----------
def chi2_summary_html():
    if not master_json:
        return ""
    chi = master_json.get("chi_square_4way", {})
    chi_line = (f"χ² = {chi.get('chi2', float('nan')):.3f} | df = {chi.get('dof', 0)} | "
                f"p = {chi.get('p_value', float('nan')):.3g} | Cramér’s V = {chi.get('cramers_v', float('nan')):.3f} "
                f"| Monte-Carlo={chi.get('monte_carlo', False)} | N={chi.get('N', 0)}")
    return f"<p>{chi_line}</p>"

def top_residuals_html(k=10):
    if chi2_z_df.empty or chi2_cont_df.empty or chi2_exp_df.empty:
        return ""
    # Melt z
    z_long = chi2_z_df.copy()
    z_long = z_long.rename(columns={"Unnamed: 0":"Category_Framework"}) if "Unnamed: 0" in z_long.columns else z_long
    z_long = z_long.melt(id_vars=[c for c in ["Category_Framework"] if c in z_long.columns],
                         var_name="Group", value_name="z")
    z_long["abs_z"] = z_long["z"].abs()
    # Get obs/exp
    cont = chi2_cont_df.copy()
    cont = cont.rename(columns={"Unnamed: 0":"Category_Framework"}) if "Unnamed: 0" in cont.columns else cont
    exp = chi2_exp_df.copy()
    exp = exp.rename(columns={"Unnamed: 0":"Category_Framework"}) if "Unnamed: 0" in exp.columns else exp
    # Align to long
    obs_long = cont.melt(id_vars=[c for c in ["Category_Framework"] if c in cont.columns],
                         var_name="Group", value_name="obs")
    exp_long = exp.melt(id_vars=[c for c in ["Category_Framework"] if c in exp.columns],
                        var_name="Group", value_name="exp")
    m = (z_long.merge(obs_long, on=["Category_Framework","Group"], how="left")
               .merge(exp_long, on=["Category_Framework","Group"], how="left"))
    m = m.sort_values("abs_z", ascending=False).head(k)
    return create_table_html(m, f"Top {k} standardized residuals (|z|)", max_rows=k)

def sig_two_prop_html(alpha=0.05, top=15):
    if two_prop_df.empty or "p_adj_BH" not in two_prop_df.columns:
        return ""
    df = two_prop_df.copy()
    # Keep only significant rows; order by adjusted p, then by largest |h| first
    h = df["cohens_h"] if "cohens_h" in df.columns else pd.Series(0.0, index=df.index)
    df["|h|"] = h.abs()
    sig = df[df["p_adj_BH"] < alpha].sort_values(["p_adj_BH", "|h|"], ascending=[True, False])
    cols = [c for c in ["Subset","Category","prop_A","prop_B","z","p_value","p_adj_BH","cohens_h"] if c in sig.columns]
    return create_table_html(sig[cols], f"Two-proportion (BH<{alpha}) — top {top}", max_rows=top)

def sig_binom_html(alpha=0.05, top=15):
    if binom_df.empty or "p_adj_BH" not in binom_df.columns:
        return ""
    df = binom_df.copy()
    sig = df[df["p_adj_BH"] < alpha].sort_values("p_adj_BH")
    cols = [c for c in ["Subset","Category","count_A","n_A","p_ref_BNC","p_value","p_adj_BH"] if c in sig.columns]
    return create_table_html(sig[cols], f"Binomial vs BNC (BH<{alpha}) — top {top}", max_rows=top)

def sig_continuous_html(alpha=0.05):
    if cont_df.empty:
        return ""
    d = cont_df.copy()
    # Prefer Mann–Whitney U adjusted p; fall back to Welch t
    if "U_padj_BH" in d.columns:
        sig = d[d["U_padj_BH"] < alpha].sort_values("U_padj_BH")
        label = "Mann–Whitney U (BH)"
        padj_col = "U_padj_BH"
        p_col = "U_pvalue"
    elif "t_padj_BH" in d.columns:
        sig = d[d["t_padj_BH"] < alpha].sort_values("t_padj_BH")
        label = "Welch t (BH)"
        padj_col = "t_padj_BH"
        p_col = "t_pvalue"
    else:
        return ""
    if sig.empty:
        return "<p><em>No continuous feature differences were significant after FDR correction.</em></p>"
    cols = [c for c in ["Feature","Subset","A_n","B_n","A_mean","B_mean","A_median","B_median",
                        "t_stat","t_pvalue","U_stat","U_pvalue","hedges_g","cliffs_delta", padj_col] if c in sig.columns]
    return create_table_html(sig[cols], f"Continuous features — significant by {label}", max_rows=15)

# ---------- Build the HTML ----------
html_content = f"""
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<title>Joyce Simile Research: Comprehensive Linguistic Analysis Report</title>
<style>
    body {{
        font-family: 'Times New Roman', serif;
        line-height: 1.6;
        margin: 0;
        padding: 20px;
        background-color: #f9f9f9;
        color: #333;
    }}
    .container {{
        max-width: 1200px;
        margin: 0 auto;
        background: white;
        padding: 30px;
        border-radius: 8px;
        box-shadow: 0 2px 10px rgba(0,0,0,0.1);
    }}
    .header {{
        text-align: center;
        border-bottom: 3px solid #2c3e50;
        padding-bottom: 20px;
        margin-bottom: 30px;
    }}
    .header h1 {{ color: #2c3e50; margin: 0; font-size: 2.2em; font-weight: bold; }}
    .header .subtitle {{ color: #7f8c8d; font-size: 1.1em; margin: 10px 0 5px 0; font-style: italic; }}
    .header .timestamp {{ color: #95a5a6; font-size: 0.9em; }}
    .section {{
        margin: 30px 0;
        padding: 20px;
        border-left: 4px solid #3498db;
        background-color: #f8f9fa;
    }}
    .section h2 {{
        color: #2c3e50;
        margin-top: 0; border-bottom: 2px solid #ecf0f1; padding-bottom: 10px;
    }}
    .section h3 {{ color: #34495e; margin-top: 25px; }}
    .section h4 {{ color: #5d6d7e; margin-top: 20px; margin-bottom: 10px; }}
    .analysis-table {{ width: 100%; border-collapse: collapse; margin: 15px 0; font-size: 0.9em; }}
    .analysis-table th {{ background:#34495e; color:#fff; padding:12px 8px; text-align:left; font-weight:bold; }}
    .analysis-table td {{ padding:10px 8px; border-bottom:1px solid #ddd; }}
    .analysis-table tr:nth-child(even) {{ background-color:#f2f2f2; }}
    .analysis-table tr:hover {{ background-color:#e8f4fd; }}
    .summary-stats {{ display:grid; grid-template-columns:1fr 1fr; gap:30px; margin:20px 0; }}
    .stat-group {{ background:#fff; padding:20px; border-radius:6px; border:1px solid #e1e8ed; }}
    .stat-group h4 {{ margin-top:0; color:#2c3e50; border-bottom:1px solid #ecf0f1; padding-bottom:8px; }}
    .stat-group ul {{ list-style-type:none; padding:0; }}
    .stat-group li {{ padding:5px 0; border-bottom:1px solid #f8f9fa; }}
    .highlight {{ background:#d1ecf1; padding:15px; border-left:4px solid #17a2b8; margin:15px 0; }}
    .methodology {{ background:#f8f9fa; padding:15px; border-radius:5px; margin:15px 0; font-style:italic; }}
    .table-container {{ margin:20px 0; }}
    .table-wrapper {{ overflow-x:auto; }}
    .truncated-note {{ color:#6c757d; font-size:0.9em; margin-top:5px; }}
    .footer {{ text-align:center; margin-top:40px; padding-top:20px; border-top:2px solid #ecf0f1; color:#7f8c8d; font-size:0.9em; }}
    @media (max-width: 768px) {{
        .summary-stats {{ grid-template-columns:1fr; }}
        .container {{ padding:15px; }}
        .analysis-table {{ font-size:0.85em; }}
    }}
</style>
</head>
<body>
<div class="container">
    <div class="header">
        <h1>Joyce Simile Research</h1>
        <div class="subtitle">Comprehensive Linguistic Analysis Report</div>
        <div class="subtitle">Manual Annotations vs Extraction Methods vs BNC Baseline</div>
        <div class="timestamp">Generated on {report_timestamp}</div>
    </div>

    <div class="section">
        <h2>Executive Summary</h2>
        <p>This report compiles the results of a comparative analysis of simile usage in <em>Dubliners</em>, spanning manual expert annotations, a restrictive rule-based extractor, a less-restrictive NLP extractor, and a BNC baseline. Categories were harmonised with <strong>Quasi_Similes</strong> as the unified label for the Joyce/BNC quasi-simile phenomenon.</p>
        {chi2_summary_html()}
    </div>

    <div class="section">
        <h2>Dataset Overview</h2>
        <p>Four datasets were analysed, representing complementary identification approaches and a baseline:</p>
        {create_summary_stats_html()}
        {env_html()}
        <div class="methodology">
            <strong>Methodology:</strong> Features include token counts, pre/post-comparator ratio, POS distribution, syntactic complexity, TextBlob sentiment (exploratory), and comparative structure flags. Instance-aligned F1 (exact+fuzzy sentence matching) was used to evaluate extractors against manual ground truth.
        </div>
    </div>

    <div class="section">
        <h2>Performance Metrics</h2>
        <h3>Instance-Aligned F1 Scores</h3>
        {f1_summary_html()}
    </div>

    <div class="section">
        <h2>Statistical Results</h2>
        <h3>4-way Category Distribution</h3>
"""

# 4-way chi-square tables
if not chi2_cont_df.empty:
    html_content += create_table_html(chi2_cont_df, "Observed counts by group × category", max_rows=10, index=True)
if not chi2_exp_df.empty:
    html_content += create_table_html(chi2_exp_df, "Expected counts under independence", max_rows=10, index=True)
if not chi2_z_df.empty:
    html_content += create_table_html(chi2_z_df, "Pearson standardized residuals (z)", max_rows=10, index=True)
    html_content += top_residuals_html(k=10)
if not chi2_padj_df.empty:
    html_content += create_table_html(chi2_padj_df, "Per-cell p-values (BH-adjusted)", max_rows=10, index=True)

# Two-proportion
if not two_prop_df.empty:
    html_content += """
        <h3>Two-Proportion Tests (Joyce subsets vs BNC)</h3>
        <p>Newcombe confidence intervals and Cohen’s h. BH correction applied across tests.</p>
    """
    # Ensure adj col exists (defensive)
    if "p_adj_BH" not in two_prop_df.columns and "p_value" in two_prop_df.columns:
        pv = two_prop_df["p_value"].fillna(1.0).to_numpy()
        # simple BH
        order = np.argsort(pv)
        ranked = np.empty_like(pv, dtype=float)
        cummin = 1.0
        n = pv.size
        for i, idx in enumerate(order[::-1], start=1):
            rank = n - i + 1
            val = pv[idx] * n / rank
            cummin = min(cummin, val)
            ranked[idx] = min(cummin, 1.0)
        two_prop_df["p_adj_BH"] = ranked
    html_content += create_table_html(two_prop_df, "All two-proportion results", max_rows=15)
    html_content += sig_two_prop_html(alpha=0.05, top=15)

# Binomial
if not binom_df.empty:
    if "p_adj_BH" not in binom_df.columns and "p_value" in binom_df.columns:
        pv = binom_df["p_value"].fillna(1.0).to_numpy()
        order = np.argsort(pv)
        ranked = np.empty_like(pv, dtype=float)
        cummin = 1.0
        n = pv.size
        for i, idx in enumerate(order[::-1], start=1):
            rank = n - i + 1
            val = pv[idx] * n / rank
            cummin = min(cummin, val)
            ranked[idx] = min(cummin, 1.0)
        binom_df["p_adj_BH"] = ranked
    html_content += """
        <h3>Binomial Tests (Joyce subsets vs BNC reference proportion)</h3>
        <p>One-sample tests per category per subset; BH correction across tests.</p>
    """
    html_content += create_table_html(binom_df, "All binomial results", max_rows=12)
    html_content += sig_binom_html(alpha=0.05, top=12)

# Continuous
if not cont_df.empty:
    html_content += """
        <h3>Continuous Features</h3>
        <p>Welch t and Mann–Whitney U with effect sizes (Hedges’ g, Cliff’s δ). BH correction applied.</p>
    """
    html_content += create_table_html(cont_df, "Continuous feature comparisons", max_rows=12)
    html_content += sig_continuous_html(alpha=0.05)
if not hl_df.empty:
    html_content += create_table_html(hl_df, "Hodges–Lehmann location shifts (A − B) with bootstrap CIs", max_rows=12)

# Topics
html_content += """
    </div>
    <div class="section">
        <h2>Topic Modelling</h2>
        <p>Per-subset LDA topics (CountVectorizer; 5 topics × 10 words). Topic mixture weights summarised by mean.</p>
"""
if not topics_df.empty:
    html_content += create_table_html(topics_df, "Top words per topic × group", max_rows=20)
if not topicmix_df.empty:
    html_content += create_table_html(topicmix_df, "Mean topic weights per group", max_rows=20)

# Close the Topic Modelling section here so the HTML stays balanced even when
# the results sample below is skipped
html_content += """
    </div>
"""

# Sample of comprehensive results
if 'results_df' in globals() and isinstance(results_df, pd.DataFrame) and not results_df.empty:
    sample_cols = [c for c in ['Instance_ID','Original_Dataset','Category_Framework','Comparator_Type',
                               'Sentence_Length','Sentiment_Polarity','Pre_Post_Ratio'] if c in results_df.columns]
    sample_results = results_df.head(25)[sample_cols].round(3)
    html_content += f"""
    <div class="section">
        <h2>Comprehensive Results — Sample</h2>
        <p>Representative snippet of the full dataset (first 25 rows):</p>
        {create_table_html(sample_results, "Sample of comprehensive analysis", max_rows=25)}
        <div class="methodology">
            The full dataset includes lemmatisation, POS distributions, syntactic complexity, and comparative structure features.
        </div>
    </div>
    """

# Files list
files_list = []
files_list.append("comprehensive_linguistic_analysis_corrected.csv")
for k, p in paths.items():
    if p:
        files_list.append(os.path.relpath(p))
files_items = "".join([f"<li><code>{f}</code></li>" for f in sorted(set(files_list))])

html_content += f"""
    <div class="section">
        <h2>Files Generated</h2>
        <ul>{files_items}</ul>
    </div>

    <div class="footer">
        <p>Generated by Comprehensive Linguistic Analysis Pipeline</p>
        <p>Joyce Simile Research Project • {report_timestamp}</p>
        <p><em>Categories harmonised with unified <strong>Quasi_Similes</strong>; results computed using FDR corrections where applicable.</em></p>
    </div>
</div>
</body>
</html>
"""

# Save HTML report
report_filename = f"joyce_simile_analysis_report_{report_date}.html"
with open(report_filename, 'w', encoding='utf-8') as f:
    f.write(html_content)

print(f"✓ Academic HTML report generated: {report_filename}")
print(f"✓ File size: {os.path.getsize(report_filename):,} bytes")
print(f"✓ Report length: {len(html_content):,} characters")
print("\nREPORT READY — open in a browser to view.")


GENERATING ACADEMIC HTML REPORT
✓ Academic HTML report generated: joyce_simile_analysis_report_20250824_210058.html
✓ File size: 46,522 bytes
✓ Report length: 46,491 characters

REPORT READY — open in a browser to view.


# 9. Research Implications and Future Directions

## 9.1 Computational Literary Analysis
Using sentence-level **instance-aligned evaluation** (exact match, then fuzzy ≥ 0.92), the extractors partially recover the manual categories:
- **Rule-Based vs Manual:** micro-F1 ≈ **0.343**, macro-F1 ≈ **0.178** (pairs matched: see Cell 1 output)
- **NLP (Less-Restrictive) vs Manual:** micro-F1 ≈ **0.292**, macro-F1 ≈ **0.059**

These scores indicate **partial recovery** of expert judgments, which is expected for heuristic extractors with no supervised training, and they motivate targeted improvements (Section 9.4). Categorical distributions nonetheless differ strongly across corpora (**χ²≈281.88**, **df = 15**, **Cramér’s V≈0.318**, *p*≪.001), so the contrast between corpora lies **primarily in category proportions** rather than in the continuous features.
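
The exact-then-fuzzy pairing described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own matcher: the function names `normalise` and `pair_sentences`, the whitespace/case normalisation, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions; only the two-pass order and the 0.92 threshold come from the text.

```python
from difflib import SequenceMatcher

def normalise(s):
    """Lower-case and collapse whitespace (an assumed normalisation)."""
    return " ".join(s.lower().split())

def pair_sentences(manual, extracted, threshold=0.92):
    """Pair extractor sentences with manual ones: exact pass, then fuzzy pass.

    Returns (manual_index, extracted_index, similarity) tuples; each manual
    sentence is used at most once.
    """
    man = [normalise(s) for s in manual]
    ext = [normalise(s) for s in extracted]
    free = set(range(len(man)))          # manual indices not yet paired
    pairs, leftover = [], []

    # Pass 1: exact string equality after normalisation
    for j, e in enumerate(ext):
        hit = next((i for i in sorted(free) if man[i] == e), None)
        if hit is None:
            leftover.append(j)
        else:
            free.discard(hit)
            pairs.append((hit, j, 1.0))

    # Pass 2: best fuzzy candidate at or above the threshold
    for j in leftover:
        best_i, best = None, threshold
        for i in sorted(free):
            r = SequenceMatcher(None, man[i], ext[j]).ratio()
            if r >= best:
                best_i, best = i, r
        if best_i is not None:
            free.discard(best_i)
            pairs.append((best_i, j, best))
    return pairs

# Invented sentences, for illustration only
manual = ["He was like a ghost in the hallway.",
          "Her eyes shone as bright as stars."]
extracted = ["Her eyes shone as bright as stars.",
             "He was like a ghost in the hallway",   # trailing period lost
             "Something unrelated entirely."]
pairs = pair_sentences(manual, extracted)
```

Category-level F1 is then computed only over the returned pairs, so a sentence that fails both passes never contributes a label comparison.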

**Key distributional contrasts (harmonised labels):**
- **BNC**: *Quasi_Similes* ≈ **41%** (82/200) vs **Joyce-Manual** ≈ **29%** (53/184).  
- **Joyce-Manual** shows enrichments in **Joycean_Framed**, **Joycean_Quasi_Fuzzy**, and **Joycean_Silent**.
- **NLP (Less-Restrictive)** is **all Standard** (330/330), by design—useful as a control but excluded in robustness checks where appropriate.
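
To make the test statistics concrete, the sketch below computes Pearson's χ², Cramér's V, and Pearson residuals for a contingency table in plain Python. It is applied to a 2×2 collapse (Quasi_Similes vs all other categories, Joyce-Manual vs BNC) built from the counts quoted above. This collapse is an illustration only: it does not reproduce the 4-way figures, which depend on the full group × category table, and the helper name `chi2_summary` is hypothetical.

```python
import math

def chi2_summary(table):
    """Pearson chi-square, dof, Cramér's V and Pearson residuals (O-E)/sqrt(E)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    # Expected counts under independence of row and column
    expected = [[ri * cj / n for cj in cols] for ri in rows]
    resid = [[(o - e) / math.sqrt(e) for o, e in zip(orow, erow)]
             for orow, erow in zip(table, expected)]
    chi2 = sum(z * z for row in resid for z in row)
    dof = (len(rows) - 1) * (len(cols) - 1)
    v = math.sqrt(chi2 / (n * min(len(rows) - 1, len(cols) - 1)))
    return chi2, dof, v, resid

# Rows: Joyce-Manual, BNC. Columns: Quasi_Similes, all other categories.
table = [[53, 184 - 53],
         [82, 200 - 82]]
chi2, dof, v, resid = chi2_summary(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, Cramér's V = {v:.3f}")
# chi2 is about 6.25 on 1 dof (no continuity correction), V about 0.13
```

The residual signs carry the direction of the contrast: the Quasi_Similes cell is negative for Joyce-Manual and positive for BNC.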

## 9.2 Innovation Quantification
Under the unified scheme:
- **Joycean-specific categories** (Framed + Quasi_Fuzzy + Silent) account for **~20%** of **Joyce-Manual** (37/184).
- **Quasi_Similes** are **less frequent** in **Joyce-Manual** (~29%) than in **BNC** (~41%), with significant two-proportion differences (BH-corrected; small-to-moderate effects by Cohen’s conventions of 0.2 = small and 0.5 = medium, e.g., |*h*|≈0.26–0.42 for Manual/Restrictive vs BNC).

This suggests Joyce’s corpus **rebalances** the space of similes: fewer conventional quasi-similes than BNC and **more Joycean-specific forms**, a pattern consistent with stylistic innovation relative to the BNC baseline.
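
As a worked check on the effect size, the sketch below computes Cohen's *h* and a Newcombe (Wilson-score hybrid) interval for the Quasi_Similes contrast, using the counts reported in Section 9.1 (82/200 for BNC, 53/184 for Joyce-Manual). It is a stdlib-only illustration with hypothetical function names; the notebook itself may rely on statsmodels for these quantities.

```python
import math

Z95 = 1.959964  # two-sided 95% standard-normal quantile

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def wilson(k, n, z=Z95):
    """Wilson score interval for a single proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def newcombe_diff(k1, n1, k2, n2, z=Z95):
    """Newcombe hybrid-score interval for the difference p1 - p2."""
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson(k1, n1, z)
    l2, u2 = wilson(k2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper

h = cohens_h(82 / 200, 53 / 184)           # BNC minus Joyce-Manual
d, lower, upper = newcombe_diff(82, 200, 53, 184)
print(f"h = {h:.3f}, diff = {d:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

For this pair *h* comes out near 0.26, the lower end of the range quoted above, and the interval for the difference excludes zero.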

## 9.3 Methodological Contributions
- **Label harmonisation**: *Joycean_Quasi* merged into **Quasi_Similes** to align BNC and Joyce phenomena, avoiding double-counting and artificial χ² inflation.
- **Robust inference**: 4-way χ² with per-cell residuals and **BH-FDR**; two-proportion tests with **Newcombe CIs** + **Cohen’s h**; binomial tests vs BNC reference; Welch-*t* & Mann–Whitney-*U* with **Hedges’ g** and **Cliff’s δ**; optional 3-way χ² excluding the degenerate NLP group.
- **Instance-aligned evaluation**: sentence-pairing (exact+fuzzy) for extractor vs manual categories, a more faithful alternative to bag-of-counts F1.
- **Reproducibility**: environment stamping; deterministic random seeds; saved CSV/JSON artifacts.
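
The gap between micro- and macro-F1 for the NLP extractor follows from its single-label output. The sketch below shows the mechanism on invented labels: when every prediction is **Standard**, micro-F1 reflects the share of paired instances that really are Standard, while macro-F1 averages in a zero for every other category. The function `f1_scores` and the toy data are assumptions for illustration; the notebook's exact averaging (for example, which label set the macro average runs over) may differ.

```python
from collections import Counter

def f1_scores(gold, pred):
    """Micro-F1, macro-F1 and per-label F1 for single-label predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label was wrong
            fn[g] += 1   # gold label was missed

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    per_label = {lab: f1(tp[lab], fp[lab], fn[lab]) for lab in labels}
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_label.values()) / len(labels)
    return micro, macro, per_label

# Invented paired labels: the extractor only ever predicts "Standard"
gold = ["Standard"] * 3 + ["Quasi_Similes"] * 2 + ["Joycean_Framed"]
pred = ["Standard"] * 6
micro, macro, per_label = f1_scores(gold, pred)
print(micro, round(macro, 3), per_label)
```

Here micro-F1 is 0.5 while macro-F1 falls to about 0.22, because two of the three labels score zero.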

## 9.4 Future Directions
1. **Improve extractor recall** for Joycean categories  
   - Add patterns for **punctuation-mediated** comparators (colon/semicolon/ellipsis) and **‘as … as’** spans.  
   - Incorporate dependency features (governor-comparator-target) and discourse cues.

2. **Supervised modelling**  
   - Train a classifier on paired manual/extractor sentences (use current pairs as silver labels), with features from syntax, lemmata, and comparator spans.

3. **Error analysis & ablations**  
   - Per-category confusion and residuals to identify systematic misses; ablate comparator detection vs category mapping.

4. **Topic–category linkage**  
   - Relate LDA topic mixtures to categories (e.g., Framed vs Standard) and test with permutation or regression.

5. **Generalisation**  
   - Validate on other Joyce texts or contemporaries; check stability of the Joycean category enrichments.
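
As a starting point for the **‘as … as’** patterns proposed in item 1, a surface-level regular expression could flag candidate spans before any dependency analysis. Everything in this sketch is hypothetical: the pattern, the three-word cap on the ground, the stoplist of fixed expressions, and the function name `find_as_as` are illustrative choices, not part of the current extractors.

```python
import re

# "as <ground> as <vehicle>": ground is 1-3 words, vehicle runs to the next
# punctuation mark. Both limits are illustrative assumptions.
AS_AS = re.compile(
    r"\bas\s+(?P<ground>\w+(?:\s+\w+){0,2}?)\s+as\s+(?P<vehicle>[^,.;:!?]+)",
    re.IGNORECASE,
)

# Fixed expressions that share the surface form but are not comparisons
NON_SIMILE_GROUNDS = {"soon", "well", "long", "far", "much", "many"}

def find_as_as(sentence):
    """Return (start, end, ground, vehicle) for each candidate 'as ... as' span."""
    spans = []
    for m in AS_AS.finditer(sentence):
        ground = m.group("ground").lower()
        if ground.split()[0] in NON_SIMILE_GROUNDS:
            continue  # e.g. "as soon as", "as well as"
        spans.append((m.start(), m.end(), ground, m.group("vehicle").strip()))
    return spans

# Invented sentences, for illustration only
hits = find_as_as("Her face was as white as a sheet.")
misses = find_as_as("He left as soon as he could.")
```

A pattern like this only proposes candidates; deciding the category of each hit would still need the syntactic and discourse features listed above.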

---

## References and Data Sources

**Primary Text**  
- Joyce, James. *Dubliners*. Project Gutenberg.

**Baseline Corpus**  
- British National Corpus (BNC), used as a **standard English** reference. *Ensure appropriate licensing/access for the specific BNC source employed.*

---

## Computational Tools
- **spaCy** (`en_core_web_sm`): tokenisation, POS, dependencies  
- **scikit-learn**: CountVectorizer, LDA, utilities  
- **SciPy**: χ², *t*-tests, Mann–Whitney U, binomial  
- **statsmodels** (if available): two-proportion z-tests, Newcombe CIs  
- **TextBlob**: exploratory sentiment  
- **pandas / numpy**: data handling and numerics

---

## Research Framework (Summary)
- **Evaluation**: instance-aligned F1 (micro/macro) via sentence pairing (exact + fuzzy ≥ 0.92).  
- **Categorical inference**: 4-way χ² with per-cell residual z and BH-FDR; two-proportion tests (Newcombe CIs, Cohen’s h); binomial tests vs BNC proportions.  
- **Continuous inference**: Welch-*t* and Mann–Whitney-*U* with **Hedges’ g**, **Cliff’s δ**, and **Hodges–Lehmann** shifts (bootstrap CIs).  
- **Topics**: LDA per subset using **CountVectorizer** (5 topics × 10 words), with mean topic-mix summaries.  
- **Harmonised taxonomy**: **Quasi_Similes** (unified), **Joycean_Framed**, **Joycean_Quasi_Fuzzy**, **Joycean_Silent**, **Standard**, **Uncategorized**.

> **Terminology note:** *Quasi_Similes* is the unified tag covering both the BNC’s “Quasi_Similes” and Joyce’s former “Joycean_Quasi,” representing the same phenomenon.
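
The two effect sizes listed under continuous inference can be sketched in a few lines. This is a stdlib-only illustration with hypothetical function names and invented sentence-length values; it shows the standard definitions (Hedges' *g* as a small-sample-corrected standardised mean difference, Cliff's δ as the difference between the proportions of pairs where one group exceeds the other) rather than the notebook's own implementation.

```python
import math
from statistics import mean, variance

def hedges_g(a, b):
    """Standardised mean difference with the usual small-sample correction."""
    n1, n2 = len(a), len(b)
    pooled_sd = math.sqrt(((n1 - 1) * variance(a) + (n2 - 1) * variance(b))
                          / (n1 + n2 - 2))
    d = (mean(a) - mean(b)) / pooled_sd        # Cohen's d
    correction = 1 - 3 / (4 * (n1 + n2) - 9)   # approximate bias correction J
    return d * correction

def cliffs_delta(a, b):
    """P(a > b) - P(a < b) over all cross-group pairs; ranges from -1 to 1."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

# Invented sentence lengths (tokens) for two groups, for illustration only
group_a = [12, 15, 18, 21, 24]
group_b = [10, 11, 13, 14, 16]
g = hedges_g(group_a, group_b)
delta = cliffs_delta(group_a, group_b)
print(f"Hedges' g = {g:.2f}, Cliff's delta = {delta:.2f}")
```

Reporting both is useful because *g* assumes roughly comparable variances, whereas δ depends only on the ordering of values and so pairs naturally with Mann–Whitney *U*.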
