<a href="https://colab.research.google.com/github/mahb97/yes-i-said-yes-i-will-yes/blob/main/corpus_extraction_pass1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*This notebook performs the first pass of corpus extraction, flagging candidate passages for manual annotation. It accounts for free indirect discourse, mediated speech, and complex attribution challenges in Joyce's text.*

In [None]:
# Notebook overview and summary

print("""
           ♥♥♥♥♥♥♥           ♥♥♥♥♥♥♥
         ♥♥♥♥♥♥♥♥♥♥♥       ♥♥♥♥♥♥♥♥♥♥♥
       ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥   ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
      ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥ ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
     ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
     ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
     ♥  MOLLY BLOOM CANDIDATE EXTRACTION  ♥
     ♥                                    ♥
     ♥      yes-i-said-yes-i-will-yes     ♥
     ♥                                    ♥
      ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
       ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
         ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
           ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
             ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
               ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
                 ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
                   ♥♥♥♥♥♥♥♥♥♥♥
                     ♥♥♥♥♥♥♥
                       ♥♥♥
                        ♥

PROJECT GOAL:
Training a language model on Molly Bloom's voice from Joyce's Ulysses to
center female-coded language and stream-of-consciousness as primary rather
than peripheral.

THIS NOTEBOOK:
First pass corpus extraction using rule-based NLP to identify candidate
passages for manual annotation.

PIPELINE STAGES:

1. Setup & Dependencies
   - Load required libraries
   - Define data structures and enumerations

2. Text Loading
   - Upload cleaned Ulysses text
   - Verify file integrity

3. Pattern Initialization
   - Define regex patterns for:
     * Mediation markers (male-encoded speech)
     * Direct speech indicators
     * Molly-specific vocabulary
     * Female embodiment language

4. Episode Identification
   - Locate Penelope episode (definite Molly text)
   - Mark episode boundaries for context

5. Mediation Detection
   - Check for Bloom's memories vs. direct speech
   - Flag male-encoded passages for exclusion
   - Calculate Molly-likelihood scores

6. Candidate Extraction
   - Extract Penelope (100% confidence)
   - Find dialogue candidates throughout text
   - Apply automatic classification

7. Sample Review
   - Display samples from each confidence level
   - Inspect automatic flagging quality

8. Report Generation
   - Create structured markdown annotation file
   - Include context, reasoning, and flags
   - Organize by confidence level

9. Download & Manual Annotation
   - Download generated markdown file
   - Begin manual review process

CRITICAL PRINCIPLE:
Bloom's memory of Molly's words ≠ Molly's words

Even direct quotes filtered through male recollection are encoded in male
language. This is the computational equivalent of the male gaze. We exclude
all male-mediated passages to preserve primary female voice.

EXPECTED OUTPUT:
A markdown file containing:
- Penelope episode (definite inclusion)
- ~100 dialogue candidates with confidence scores
- Context windows for each passage
- Automatic flags and reasoning
- Space for manual annotation decisions

NEXT STEPS AFTER THIS NOTEBOOK:
1. Manual annotation of candidate passages
2. Create stage-specific corpus files
3. Implement overlap-cluster melting for Stage 3
4. Begin three-stage training pipeline

════════════════════════════════════════════════════════════════════════════

Ready to begin extraction.
For Molly.

════════════════════════════════════════════════════════════════════════════
""")

print("\nNotebook cells:")
print("Cell 1:  This overview")
print("Cell 2:  Dependencies and imports")
print("Cell 3:  Data structure definitions")
print("Cell 4:  File upload and text loading")
print("Cell 5:  MollyCandidateFinder class initialization")
print("Cell 6:  Helper methods (context, episodes)")
print("Cell 7:  Mediation detection methods")
print("Cell 8:  Penelope extraction")
print("Cell 9:  Dialogue candidate extraction")
print("Cell 10: Sample candidate review")
print("Cell 11: Markdown generation functions (header)")
print("Cell 12: Markdown generation functions (sections)")
print("Cell 13: Complete report generation")
print("Cell 14: Download annotation file")

print("\n" + "=" * 60)
print("Run cells sequentially from Cell 2 onwards.")
print("=" * 60)



In [None]:
# Molly Bloom Candidate Extraction System - Setup

import re
from typing import List, Tuple, Dict, Optional
from dataclasses import dataclass
from enum import Enum
import json
from pathlib import Path

# lib for text processing
import unicodedata
from IPython.display import display, Markdown, HTML
import warnings
warnings.filterwarnings('ignore')

Dependencies loaded successfully.
Python environment ready for Molly Bloom extraction.


In [None]:
# data structures and enumerations
# classification for candidate passages

class PassageType(Enum):
    """Classification of passage types for annotation."""
    DIRECT_DIALOGUE = "direct_dialogue"
    INTERIOR_MONOLOGUE = "interior_monologue"
    THEATRICAL = "theatrical"
    MEDIATED_MEMORY = "mediated_memory"
    AMBIGUOUS = "ambiguous"


class ConfidenceLevel(Enum):
    """Confidence levels for automatic classification."""
    DEFINITE = "definite"
    PROBABLE = "probable"
    REQUIRES_REVIEW = "requires_review"
    EXCLUDE = "exclude"


@dataclass
class PassageCandidate:
    """
    Data structure for a candidate Molly Bloom passage.

    Attributes:
        text: The passage text
        location: Character offset in source text
        episode: Episode name if available
        passage_type: Automatic classification of passage type
        confidence: Confidence level for this classification
        context_before: Preceding text for disambiguation
        context_after: Following text for disambiguation
        flags: List of automatic detection flags
        reasoning: Explanation for classification
    """
    text: str
    location: int
    episode: Optional[str]
    passage_type: PassageType
    confidence: ConfidenceLevel
    context_before: str
    context_after: str
    flags: List[str]
    reasoning: str

    def __repr__(self):
        return f"PassageCandidate(location={self.location}, confidence={self.confidence.value}, type={self.passage_type.value})"

print(f"Available passage types: {[pt.value for pt in PassageType]}")
print(f"Available confidence levels: {[cl.value for cl in ConfidenceLevel]}")
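
Since candidates are meant to be reviewed outside the notebook, it may help to see how a record of this shape could be serialised. The following is a minimal sketch under stated assumptions: `MiniCandidate`, `Confidence`, and `to_json` are hypothetical stand-ins for illustration, not part of the notebook, and the passage text is a placeholder rather than a quotation.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import List


class Confidence(Enum):
    # Stand-in for ConfidenceLevel; the values mirror two of the notebook's levels
    PROBABLE = "probable"
    EXCLUDE = "exclude"


@dataclass
class MiniCandidate:
    # Hypothetical cut-down version of PassageCandidate
    text: str
    location: int
    confidence: Confidence
    flags: List[str]


def to_json(candidate: MiniCandidate) -> str:
    """Serialise a candidate, replacing the Enum member with its string value."""
    record = asdict(candidate)
    record["confidence"] = candidate.confidence.value
    return json.dumps(record, ensure_ascii=False)


example = MiniCandidate(
    text="placeholder passage",  # invented text, not a quotation
    location=1234,
    confidence=Confidence.PROBABLE,
    flags=["direct_speech"],
)
encoded = to_json(example)
print(encoded)
```

Enum members are not JSON-serialisable by default, which is why the sketch swaps in `.value` before calling `json.dumps`.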

In [None]:
# files and text loading (pre-clean txt manually)

from google.colab import files
import io

print("=" * 60)

uploaded = files.upload()

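# expects a single pre-cleaned file, e.g. ulysses.txt
if not uploaded:
    raise ValueError("No file uploaded. Re-run this cell and select the cleaned Ulysses text.")
filename = next(iter(uploaded))

# 'utf-8-sig' also strips a leading byte-order mark if the file has one
ulysses_text = uploaded[filename].decode('utf-8-sig')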

# statistics
char_count = len(ulysses_text)
word_count = len(ulysses_text.split())
line_count = ulysses_text.count('\n')

print("\n" + "=" * 60)
print(f"Filename: {filename}")
print(f"Characters: {char_count:,}")
print(f"Words (approximate): {word_count:,}")
print(f"Lines: {line_count:,}")
print("=" * 60)

# first 500 chars as sanity check
print("\nFirst 500 characters:")
print("-" * 60)
print(ulysses_text[:500])
print("-" * 60)

In [None]:
# MollyCandidateFinder Class - Initialization and Pattern Definitions (core class for identifying potential passages)

class MollyCandidateFinder:
    """
    Rule-based system for identifying candidate Molly Bloom passages.

    This system does not attempt full automatic extraction, but rather
    flags passages that require human judgment. Inspired by challenges
    encountered in rule-based approaches to Dubliners stylistic analysis.

    Parameters:
        ulysses_text: Full text of Ulysses
        context_window: Characters of context to extract (default: 200)
    """

    def __init__(self, ulysses_text: str, context_window: int = 200):
        self.text = ulysses_text
        self.context_window = context_window
        self.episodes = {}  # populated later by identify_episode_boundaries()

        # pattern definitions
        self.patterns = self._initialize_patterns()

        print("MollyCandidateFinder initialized.")
        print(f"Text length: {len(self.text):,} characters")
        print(f"Context window: {self.context_window} characters")
        print(f"Pattern categories loaded: {list(self.patterns.keys())}")

    def _initialize_patterns(self) -> Dict[str, List[str]]:
        """
        Initialize regex patterns for passage identification.

        Returns:
            Dictionary mapping pattern categories to regex lists
        """
        patterns = {
            'mediation_markers': [
                r'he remembered.*?(?:Molly|she) (?:said|saying)',
                r'(?:Molly|she) had said',
                r'(?:Molly|she) told him',
                r'he recalled.*?her words?',
                r'(?:she|Molly) would say',
                r'(?:she|Molly) might say',
                r'(?:she|Molly) used to say',
                r'he thought of.*?(?:her|Molly)',
                r'remembered.*?(?:her|Molly).*?voice',
                r'imagined.*?(?:her|Molly)',
            ],
            'direct_speech': [
                r'—\s*[A-Z]',  # Em dash
                r'"[^"]+"\s*(?:she says?|Molly says?)',
                r':\s*—',  # Colon / dialogue dash
            ],
            'molly_vocabulary': [
                # vocab
                r'\bGibraltar\b',
                r'\bMulvey\b',
                r'\bBoylan\b',
                r'\bHester\b',
                r'\bpoldy\b',
                # to be extended after another manual pass through Ulysses
            ],
            'female_embodiment': [
                # vocab
                r'\bbreast',
                r'\bmenstr',
                r'\bpregnan',
                r'\bwomb\b',
                r'\bflesh\b',
                r'\bbody\b',
            ],
            'domestic_vocabulary': [
                r'\bbed\b',
                r'\bsheet',
                r'\bpillow',
                r'\bwashing\b',
                r'\bcooking\b',
                r'\bkitchen\b',
            ],
        }

        return patterns


# use finder
finder = MollyCandidateFinder(ulysses_text)

print("\n" + "=" * 60)
print(f"Mediation markers: {len(finder.patterns['mediation_markers'])} patterns")
print(f"Direct speech markers: {len(finder.patterns['direct_speech'])} patterns")
print(f"Molly vocabulary markers: {len(finder.patterns['molly_vocabulary'])} patterns")
print("=" * 60)
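
The mediation patterns rely on `.*?`, which by default does not match a newline. Whether this matters depends on how the uploaded text is wrapped, which is an assumption here. The sketch below uses one of the notebook's patterns verbatim on two invented sentences (not quotations from Ulysses) to show the difference `re.DOTALL` makes.

```python
import re

# One of the notebook's mediation patterns, copied verbatim
pattern = r'he remembered.*?(?:Molly|she) (?:said|saying)'

# Invented sample sentences (not quotations from Ulysses)
same_line = "He remembered how she said it that morning."
wrapped = "He remembered the morning and how\nshe said it."

# Default behaviour: '.' stops at a newline
same_line_hit = re.search(pattern, same_line, re.IGNORECASE) is not None
wrapped_hit = re.search(pattern, wrapped, re.IGNORECASE) is not None

# With DOTALL, '.' also matches the newline, so the wrapped sentence is caught
wrapped_dotall = re.search(pattern, wrapped, re.IGNORECASE | re.DOTALL) is not None

print(same_line_hit, wrapped_hit, wrapped_dotall)
```

If the source text is hard-wrapped, adding `re.DOTALL` would catch more mediated passages, at the cost of letting a lazy `.*?` span paragraph breaks within the context window. Which trade-off is right is a judgement for the annotator.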

In [None]:
# helpers - context extraction and episode identification

def get_context(self, position: int) -> Tuple[str, str]:
    """
    Extract context before and after a given position.

    Parameters:
        position: Character position in text

    Returns:
        Tuple of (context_before, context_after)
    """
    start = max(0, position - self.context_window)
    end = min(len(self.text), position + self.context_window)

    context_before = self.text[start:position]
    context_after = self.text[position:end]

    return context_before, context_after


def identify_episode_boundaries(self) -> Dict[str, Tuple[int, int]]:
    """
    Identify episode boundaries in Ulysses text.

    Penelope is the most critical episode for this project.
    Uses known starting phrase to locate it.

    Returns:
        Dictionary mapping episode names to (start, end) character positions
    """
    episodes = {}

    # start phrase
    penelope_starts = [
        "Yes because he never did a thing like that before",
        "Yes because he never did",
    ]

    for phrase in penelope_starts:
        # rfind returns the last occurrence of the phrase (-1 if absent)
        penelope_start = self.text.rfind(phrase)
        if penelope_start != -1:
            episodes['Penelope'] = (penelope_start, len(self.text))
            print(f"Penelope episode located at position {penelope_start:,}")
            break

    if 'Penelope' not in episodes:
        print("WARNING: Penelope episode not automatically located.")
        print("Manual identification may be required.")

    # to be extended

    return episodes


def get_episode_name(self, position: int) -> Optional[str]:
    """
    Determine which episode a position falls within.

    Parameters:
        position: Character position in text

    Returns:
        Episode name or None if not found
    """
    for episode, (start, end) in self.episodes.items():
        if start <= position < end:
            return episode
    return None


# methods for finder class
MollyCandidateFinder.get_context = get_context
MollyCandidateFinder.identify_episode_boundaries = identify_episode_boundaries
MollyCandidateFinder.get_episode_name = get_episode_name

# find episodes
finder.episodes = finder.identify_episode_boundaries()

print("\n" + "=" * 60)
print("Episode identification complete.")
print(f"Episodes found: {list(finder.episodes.keys())}")
if 'Penelope' in finder.episodes:
    start, end = finder.episodes['Penelope']
    length = end - start
    print(f"Penelope length: {length:,} characters (~{length//5:,} words)")
print("=" * 60)
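
To make the windowing in `get_context` concrete, here is a small standalone sketch on a toy string. `context_around` is a hypothetical free-function version of the method, written only for illustration.

```python
from typing import Tuple


def context_around(text: str, position: int, window: int) -> Tuple[str, str]:
    """Return (text before position, text from position onward), clamped to the string."""
    start = max(0, position - window)
    end = min(len(text), position + window)
    return text[start:position], text[position:end]


sample = "abcdefghij"

# Near the start of the string the 'before' window is clamped at index 0
before, after = context_around(sample, 2, 4)
print(repr(before), repr(after))

# Near the end of the string the 'after' window is clamped at len(text)
print(context_around(sample, 9, 4))
```

Note that the second element starts at `position` itself, so a caller who wants the text following a match would pass the match's end offset rather than its start.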

In [None]:
# Mediation Detection Methods (critical methods for identifying male-encoded speech vs. primary Molly voice)

def check_mediation(self, passage: str, context: str) -> Tuple[bool, List[str]]:
    """
    Check if passage contains markers of male-encoded mediation.

    Critical: Bloom's memory of Molly's words is encoded in male language.
    Even direct quotes filtered through his recollection are not primary sources.
    This is the computational equivalent of the male gaze for language.

    Parameters:
        passage: The text to check
        context: Surrounding context

    Returns:
        Tuple of (is_mediated, list of matched patterns)
    """
    matched_patterns = []
    combined_text = context + " " + passage

    for pattern in self.patterns['mediation_markers']:
        if re.search(pattern, combined_text, re.IGNORECASE):
            matched_patterns.append(pattern)

    return len(matched_patterns) > 0, matched_patterns


def check_direct_speech(self, passage: str) -> Tuple[bool, List[str]]:
    """
    Check if passage contains markers of direct, real-time speech.

    Direct speech is unmediated - Molly speaking in the present moment
    of the narrative, not filtered through memory or imagination.

    Parameters:
        passage: The text to check

    Returns:
        Tuple of (is_direct, list of matched patterns)
    """
    matched_patterns = []

    for pattern in self.patterns['direct_speech']:
        if re.search(pattern, passage):
            matched_patterns.append(pattern)

    return len(matched_patterns) > 0, matched_patterns


def calculate_molly_likelihood(self, passage: str) -> Tuple[float, List[str]]:
    """
    Calculate likelihood that passage is Molly based on vocabulary markers.

    This is a simple heuristic - not definitive. Think of it as a prior
    before manual annotation. Like a really naive Bayes classifier,
    but we're honest about its limitations. No overconfident posteriors here.

    Parameters:
        passage: The text to analyze

    Returns:
        Tuple of (likelihood_score, list of matched markers)
    """
    matched_markers = []
    score = 0.0

    # Check for distinctive Molly vocabulary
    for pattern in self.patterns['molly_vocabulary']:
        matches = re.findall(pattern, passage, re.IGNORECASE)
        if matches:
            matched_markers.append(f"molly_vocab: {pattern}")
            score += 0.3 * len(matches)

    # Check for female embodiment vocabulary
    for pattern in self.patterns['female_embodiment']:
        matches = re.findall(pattern, passage, re.IGNORECASE)
        if matches:
            matched_markers.append(f"embodiment: {pattern}")
            score += 0.2 * len(matches)

    # Check for domestic vocabulary
    for pattern in self.patterns['domestic_vocabulary']:
        matches = re.findall(pattern, passage, re.IGNORECASE)
        if matches:
            matched_markers.append(f"domestic: {pattern}")
            score += 0.15 * len(matches)

    # Cap at 1.0
    return min(score, 1.0), matched_markers


# Add methods to finder
MollyCandidateFinder.check_mediation = check_mediation
MollyCandidateFinder.check_direct_speech = check_direct_speech
MollyCandidateFinder.calculate_molly_likelihood = calculate_molly_likelihood

print("\n" + "=" * 60)
print("Method summary:")
print("- check_mediation: Identifies male-encoded memories/imagination")
print("- check_direct_speech: Identifies real-time, unmediated speech")
print("- calculate_molly_likelihood: Heuristic scoring based on vocabulary")
print("=" * 60)
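
The scoring arithmetic in `calculate_molly_likelihood` can be checked by hand on short inputs. The sketch below copies the three weights (0.3, 0.2, 0.15) and the cap of 1.0 from the method, but uses abbreviated pattern lists and invented test sentences; `score` and `WEIGHTED_PATTERNS` are hypothetical names.

```python
import re
from typing import Dict, List, Tuple

# Weights copied from calculate_molly_likelihood; pattern lists abbreviated
WEIGHTED_PATTERNS: Dict[str, Tuple[float, List[str]]] = {
    "molly_vocab": (0.3, [r'\bGibraltar\b', r'\bBoylan\b']),
    "embodiment": (0.2, [r'\bbreast', r'\bbody\b']),
    "domestic": (0.15, [r'\bbed\b', r'\bkitchen\b']),
}


def score(passage: str) -> float:
    """Sum weight * match count over all patterns, capped at 1.0."""
    total = 0.0
    for weight, patterns in WEIGHTED_PATTERNS.values():
        for pattern in patterns:
            total += weight * len(re.findall(pattern, passage, re.IGNORECASE))
    return min(total, 1.0)


# Invented test sentences (not quotations)
one_hit = score("She spoke of Gibraltar.")                    # one 0.3 match
two_hits = score("Gibraltar, and the bed by the window.")     # 0.3 + 0.15
saturated = score("Gibraltar Gibraltar Gibraltar Gibraltar")  # 4 * 0.3, then capped

print(one_hit, two_hits, saturated)
```

One consequence worth knowing when reading the later classification step: a passage with a single Molly-specific term scores exactly 0.3, which does not clear a strict `> 0.3` threshold.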

In [None]:
# Penelope episode extraction (takes the definite Molly passage)

def extract_penelope(self) -> PassageCandidate:
    """
    Extract the Penelope episode as a definite Molly passage.

    This is the only passage we can extract with complete confidence.
    24,000+ words of pure Molly, unmediated by any other consciousness.
    The famous unpunctuated stream-of-consciousness monologue.

    Returns:
        PassageCandidate for the entire Penelope episode
    """
    if 'Penelope' not in self.episodes:
        raise ValueError("Penelope episode not found in text. Manual location required.")

    start, end = self.episodes['Penelope']
    text = self.text[start:end]

    # minimal context (text immediately before Penelope)
    context_before = self.text[max(0, start - self.context_window):start]

    return PassageCandidate(
        text=text,
        location=start,
        episode='Penelope',
        passage_type=PassageType.INTERIOR_MONOLOGUE,
        confidence=ConfidenceLevel.DEFINITE,
        context_before=context_before,
        context_after="[End of text]",
        flags=['penelope_episode', 'stream_of_consciousness', 'unpunctuated'],
        reasoning="Penelope episode: established critical consensus as Molly's interior monologue. "
                  "Pure stream-of-consciousness, unmediated by any other character's perspective."
    )


# method for finder
MollyCandidateFinder.extract_penelope = extract_penelope

# find Pen
print("=" * 60)

penelope_candidate = finder.extract_penelope()

print(f"Location: Character position {penelope_candidate.location:,}")
print(f"Length: {len(penelope_candidate.text):,} characters")
print(f"Approximate word count: {len(penelope_candidate.text.split()):,}")
print(f"Confidence: {penelope_candidate.confidence.value}")
print(f"Flags: {', '.join(penelope_candidate.flags)}")

print("\n" + "-" * 60)
print("First 500 characters of Penelope:")
print("-" * 60)
print(penelope_candidate.text[:500])
print("-" * 60)

print("\n" + "-" * 60)
print("Last 500 characters of Penelope:")
print("-" * 60)
print(penelope_candidate.text[-500:])
print("-" * 60)

In [None]:
# Dialogue Candidate Extraction (flag potential Molly dialogue for manual review)

def find_dialogue_candidates(self, max_candidates: int = 100) -> List[PassageCandidate]:
    """
    Find potential Molly dialogue throughout Ulysses.

    This method identifies passages that MIGHT be Molly speaking,
    but flags them appropriately for human review based on mediation markers.

    The challenge: Joyce doesn't always cleanly mark speakers, and we must
    distinguish between Molly speaking vs. Bloom remembering her words.

    Parameters:
        max_candidates: Maximum number of candidates to extract (prevents overflow)

    Returns:
        List of PassageCandidate objects for manual review
    """
    candidates = []

    # dialogue attribution patterns (conservative approach)
    dialogue_patterns = [
        r'(?:Molly|Mrs\s+Bloom)\s+said[:\.]?\s*[—"\']([^"\'—]{10,300})["\']?',
        r'—\s*([^—\n]{20,300})\s*(?:Molly|she)\s+said',
        r'she\s+said[:\.]?\s*[—"\']([^"\'—]{10,300})["\']?',
    ]

    for pattern in dialogue_patterns:
        matches = list(re.finditer(pattern, self.text, re.IGNORECASE))

        print(f"Pattern '{pattern[:50]}...' found {len(matches)} matches")

        for match in matches:
            # cap the total across all patterns, not per pattern
            if len(candidates) >= max_candidates:
                break

            location = match.start()
            passage = match.group(0)

            # skip if in pen
            if 'Penelope' in self.episodes:
                pen_start, pen_end = self.episodes['Penelope']
                if pen_start <= location < pen_end:
                    continue

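            # context: text before the match start and text after the match end,
            # so that context_after does not repeat the passage itself
            ctx_before, _ = self.get_context(location)
            _, ctx_after = self.get_context(match.end())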

            # mediation
            is_mediated, mediation_patterns = self.check_mediation(passage, ctx_before)
            is_direct, direct_patterns = self.check_direct_speech(passage)

            # calc Molly likelihood based on vocabulary
            molly_score, vocab_markers = self.calculate_molly_likelihood(passage)

            # determine confidence and passage type based on checks
            flags = []

            if is_mediated:
                confidence = ConfidenceLevel.EXCLUDE
                passage_type = PassageType.MEDIATED_MEMORY
                flags.extend(['mediated_speech'] + [f"mediation: {p[:30]}" for p in mediation_patterns[:2]])
                reasoning = "Contains mediation markers - likely Bloom's encoding of her words. " \
                           "Male-mediated memory, not primary source."
            elif is_direct and molly_score > 0.3:
                confidence = ConfidenceLevel.PROBABLE
                passage_type = PassageType.DIRECT_DIALOGUE
                flags.extend(['direct_speech'] + vocab_markers[:3])
                reasoning = f"Direct speech markers present with Molly vocabulary (score: {molly_score:.2f}). " \
                           "Likely real-time dialogue, but requires verification."
            elif is_direct:
                confidence = ConfidenceLevel.REQUIRES_REVIEW
                passage_type = PassageType.DIRECT_DIALOGUE
                flags.append('direct_speech_low_confidence')
                reasoning = "Direct speech markers present, but limited Molly-specific vocabulary. " \
                           "Could be another female character. Requires close reading."
            else:
                confidence = ConfidenceLevel.REQUIRES_REVIEW
                passage_type = PassageType.AMBIGUOUS
                flags.append('ambiguous_attribution')
                reasoning = "Unclear attribution and no strong markers. Requires close reading for context."

            candidates.append(PassageCandidate(
                text=passage,
                location=location,
                episode=self.get_episode_name(location),
                passage_type=passage_type,
                confidence=confidence,
                context_before=ctx_before,
                context_after=ctx_after,
                flags=flags,
                reasoning=reasoning
            ))

    return candidates


# method for finder
MollyCandidateFinder.find_dialogue_candidates = find_dialogue_candidates

# dialogue candidates
print("Extracting dialogue candidates...")
print("=" * 60)

dialogue_candidates = finder.find_dialogue_candidates(max_candidates=100)

print("\n" + "=" * 60)
print(f"Total dialogue candidates found: {len(dialogue_candidates)}")

# summary by confidence level
confidence_summary = {}
for candidate in dialogue_candidates:
    conf = candidate.confidence.value
    confidence_summary[conf] = confidence_summary.get(conf, 0) + 1

print("\nBreakdown by confidence level:")
for conf, count in sorted(confidence_summary.items()):
    print(f"  {conf}: {count}")

print("=" * 60)
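
The branching inside `find_dialogue_candidates` can be read as a small decision table in which mediation always takes precedence. A minimal sketch of that ordering, assuming the same `> 0.3` threshold as the method; `classify` is a hypothetical helper that returns the confidence value as a plain string:

```python
def classify(is_mediated: bool, is_direct: bool, molly_score: float) -> str:
    """Mirror the confidence branching: mediation first, then direct speech plus score."""
    if is_mediated:
        # Any mediation marker excludes the passage, whatever else matched
        return "exclude"
    if is_direct and molly_score > 0.3:
        return "probable"
    # Direct speech with a low score and unattributed passages both go to review
    return "requires_review"


for case in [(True, True, 0.9), (False, True, 0.45), (False, True, 0.3), (False, False, 0.9)]:
    print(case, "->", classify(*case))
```

Because the mediation check comes first, a passage with strong Molly vocabulary and direct speech markers is still excluded when its context matches a mediation pattern, which is the notebook's stated principle applied in code.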

In [None]:
# Sample Candidate Review (sample candidates from each confidence category for manual review)

def display_candidate_sample(candidates: List[PassageCandidate],
                            confidence_level: ConfidenceLevel,
                            n_samples: int = 3):
    """
    Display sample candidates for a given confidence level.

    Parameters:
        candidates: List of all candidates
        confidence_level: Which confidence level to display
        n_samples: Number of samples to show
    """
    filtered = [c for c in candidates if c.confidence == confidence_level]

    if not filtered:
        print(f"No candidates found for confidence level: {confidence_level.value}")
        return

    print(f"\n{'=' * 60}")
    print(f"CONFIDENCE LEVEL: {confidence_level.value.upper()}")
    print(f"Total candidates: {len(filtered)}")
    print(f"Showing {min(n_samples, len(filtered))} samples")
    print('=' * 60)

    for i, candidate in enumerate(filtered[:n_samples], 1):
        print(f"\n{'-' * 60}")
        print(f"Sample {i}/{min(n_samples, len(filtered))}")
        print(f"{'-' * 60}")
        print(f"Location: Position {candidate.location:,}")
        print(f"Episode: {candidate.episode or 'Unknown'}")
        print(f"Type: {candidate.passage_type.value}")
        print(f"\nReasoning: {candidate.reasoning}")
        print(f"\nFlags: {', '.join(candidate.flags[:5])}")

        print(f"\nContext (before, last 100 chars):")
        print(f"...{candidate.context_before[-100:]}")

        print(f"\nPassage (first 300 chars):")
        print(candidate.text[:300] + ('...' if len(candidate.text) > 300 else ''))

        print(f"\nContext (after, first 100 chars):")
        print(f"{candidate.context_after[:100]}...")


# samples from each confidence category
print("Displaying sample candidates for manual inspection...")

# definite candidates (none expected: only Penelope is assigned DEFINITE)
display_candidate_sample(dialogue_candidates, ConfidenceLevel.DEFINITE, n_samples=2)

# probable candidates
display_candidate_sample(dialogue_candidates, ConfidenceLevel.PROBABLE, n_samples=3)

# candidates requiring review
display_candidate_sample(dialogue_candidates, ConfidenceLevel.REQUIRES_REVIEW, n_samples=3)

# excluded candidates (flagged as mediated)
display_candidate_sample(dialogue_candidates, ConfidenceLevel.EXCLUDE, n_samples=3)

print("\n" + "=" * 60)
print("Sample review complete.")
print("These samples illustrate the kinds of decisions required in manual annotation.")
print("=" * 60)

In [None]:
# markdown report generation

def generate_header() -> str:
    """Generate markdown header for annotation file."""
    return """# Molly Bloom Text Extraction: Manual Annotation

## Purpose

This document presents candidate passages for inclusion in the Molly Bloom corpus.
Automatic classification has been applied, but all passages require human judgment
due to the complex attribution challenges in Joyce's text.

## Decision Framework

For each passage, determine:

### 1. Attribution
Is this definitively Molly's voice?

**Exclude:**
- Passages mediated through Bloom's memory or imagination
- Reported speech or paraphrased content
- Male-encoded translations of her words

**Include:**
- Direct speech in real-time narrative moments
- Unmediated interior monologue
- Theatrical/performed speech (flagged for Stage 2)

### 2. Stage Assignment
If included, which training stage?

- **Stage 1**: Vocabulary foundation (all Molly text, used for frequency analysis)
- **Stage 2**: Public voice (dialogue, theatrical, conversational register)
- **Stage 3**: Interior consciousness (stream-of-consciousness, unpunctuated flow)

### 3. Notes
Record reasoning for ambiguous cases, especially:
- Free indirect discourse boundaries
- Circe episode theatrical passages
- Passages where attribution is unclear

## Annotation Format

For each passage, mark:
- **Decision**: INCLUDE / EXCLUDE / UNCERTAIN
- **Stage**: 1 / 2 / 3 / N/A
- **Notes**: Your reasoning

---

"""


def generate_penelope_section(penelope: PassageCandidate) -> str:
    """Generate section for Penelope episode."""
    word_count = len(penelope.text.split())

    return f"""## Section 1: Penelope Episode (Definite Inclusion)

**Status**: DEFINITE INCLUSION
**Recommended Stage**: 3 (Interior Consciousness)
**Location**: Character position {penelope.location:,}
**Length**: {len(penelope.text):,} characters (~{word_count:,} words)

**Reasoning**: {penelope.reasoning}

**Flags**: {', '.join(penelope.flags)}

### First 1000 characters:
```
{penelope.text[:1000]}
```

### Last 500 characters:
```
{penelope.text[-500:]}
```

**Decision**: INCLUDE
**Stage**: 3
**Notes**:


---

"""


def format_flags(flags: List[str]) -> str:
    """Format flags as markdown list."""
    if not flags:
        return "- None"
    return "\n".join(f"- `{flag}`" for flag in flags)

print("=" * 60)
print("Functions available:")
print("- generate_header(): Creates annotation file header with instructions")
print("- generate_penelope_section(): Formats Penelope episode entry")
print("- format_flags(): Formats flag lists for display")
print("=" * 60)

In [None]:
# markdown report generation / candidate sections

def generate_candidate_section(section_title: str,
                               candidates: List[PassageCandidate],
                               section_number: int) -> str:
    """
    Generate section for a group of candidates.

    Parameters:
        section_title: Title for this confidence category
        candidates: List of candidates to include
        section_number: Section number for organization

    Returns:
        Formatted markdown string
    """
    section = f"""## Section {section_number}: {section_title} ({len(candidates)} passages)

"""

    if not candidates:
        section += "*No candidates in this category.*\n\n---\n\n"
        return section

    for i, candidate in enumerate(candidates, 1):
        # truncate passage for display
        passage_display = candidate.text[:500]
        if len(candidate.text) > 500:
            passage_display += "\n[... passage continues ...]"

        section += f"""### Passage {section_number}.{i:03d}

**Location**: Position {candidate.location:,}
**Episode**: {candidate.episode or "Unknown"}
**Type**: {candidate.passage_type.value}
**Automatic Confidence**: {candidate.confidence.value}

**Reasoning**: {candidate.reasoning}

**Automatic Flags**:
{format_flags(candidate.flags)}

**Context (preceding 150 characters)**:
```
...{candidate.context_before[-150:]}
```

**Passage**:
```
{passage_display}
```

**Context (following 150 characters)**:
```
{candidate.context_after[:150]}...
```

**Decision**: [ INCLUDE / EXCLUDE / UNCERTAIN ]
**Stage**: [ 1 / 2 / 3 / N/A ]
**Notes**:


---

"""

    return section


def generate_statistics_section(dialogue_candidates: List[PassageCandidate],
                               penelope: PassageCandidate) -> str:
    """Generate summary statistics section."""

    total_candidates = len(dialogue_candidates) + 1  # +1 for Pen

    # count by confidence
    confidence_counts = {}
    for candidate in dialogue_candidates:
        conf = candidate.confidence.value
        confidence_counts[conf] = confidence_counts.get(conf, 0) + 1

    # count by episode
    episode_counts = {}
    for candidate in dialogue_candidates:
        ep = candidate.episode or "Unknown"
        episode_counts[ep] = episode_counts.get(ep, 0) + 1

    stats = f"""## Extraction Statistics

**Total candidates identified**: {total_candidates}

### By Confidence Level:
- Definite: {confidence_counts.get('definite', 0) + 1} (including Penelope)
- Probable: {confidence_counts.get('probable', 0)}
- Requires Review: {confidence_counts.get('requires_review', 0)}
- Flagged for Exclusion: {confidence_counts.get('exclude', 0)}

### By Episode:
"""

    for episode, count in sorted(episode_counts.items()):
        stats += f"- {episode}: {count}\n"

    stats += "- Penelope: 1 (complete episode)\n"

    stats += """
### Penelope Statistics:
- Character count: {:,}
- Approximate word count: {:,}
- Status: Definite inclusion (Stage 3)

---

""".format(len(penelope.text), len(penelope.text.split()))

    return stats

print("=" * 60)
print("Functions available:")
print("- generate_candidate_section(): Formats groups of candidates")
print("- generate_statistics_section(): Creates summary statistics")
print("=" * 60)
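
*The two tallies in `generate_statistics_section()` (by confidence level and by episode) can also be expressed with `collections.Counter`. The following is a minimal sketch, not the notebook's implementation: `DemoConfidence` and `DemoCandidate` are hypothetical stand-ins that mirror only the fields being counted, and the enum string values are assumed to match the keys looked up in the statistics section.*

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class DemoConfidence(Enum):
    # Hypothetical stand-in for ConfidenceLevel; string values are assumed.
    DEFINITE = "definite"
    PROBABLE = "probable"


@dataclass
class DemoCandidate:
    # Hypothetical stand-in for PassageCandidate (only the tallied fields).
    confidence: DemoConfidence
    episode: Optional[str] = None


def tally(candidates: List[DemoCandidate]) -> Tuple[Counter, Counter]:
    """Return (confidence_counts, episode_counts) for the given candidates."""
    confidence_counts = Counter(c.confidence.value for c in candidates)
    # Candidates without an episode are grouped under "Unknown".
    episode_counts = Counter(c.episode or "Unknown" for c in candidates)
    return confidence_counts, episode_counts


demo = [
    DemoCandidate(DemoConfidence.DEFINITE, "Calypso"),
    DemoCandidate(DemoConfidence.PROBABLE, "Calypso"),
    DemoCandidate(DemoConfidence.PROBABLE),
]
conf_counts, ep_counts = tally(demo)

# A Counter returns 0 for keys it has never seen, so no .get(key, 0) is needed.
print(conf_counts["definite"], conf_counts["probable"], conf_counts["exclude"])
print(sorted(ep_counts.items()))
```

*A `Counter` behaves like the plain dictionaries used above, so the report template could read from it without other changes.*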

In [None]:
# complete annotation report generation

def generate_complete_annotation_report(penelope: PassageCandidate,
                                        dialogue_candidates: List[PassageCandidate],
                                        output_filename: str = 'molly_candidates_annotation.md') -> str:
    """
    Generate complete annotation report combining all sections.

    Parameters:
        penelope: The Penelope episode candidate
        dialogue_candidates: All dialogue candidates
        output_filename: Name for output file

    Returns:
        Path to generated file
    """

    print("Generating complete annotation report...")
    print("=" * 60)

    # group candidates by confidence level (original order is kept within each group)
    definite = [c for c in dialogue_candidates if c.confidence == ConfidenceLevel.DEFINITE]
    probable = [c for c in dialogue_candidates if c.confidence == ConfidenceLevel.PROBABLE]
    review = [c for c in dialogue_candidates if c.confidence == ConfidenceLevel.REQUIRES_REVIEW]
    exclude = [c for c in dialogue_candidates if c.confidence == ConfidenceLevel.EXCLUDE]

    print(f"Organizing {len(dialogue_candidates)} dialogue candidates:")
    print(f"  Definite: {len(definite)}")
    print(f"  Probable: {len(probable)}")
    print(f"  Requires Review: {len(review)}")
    print(f"  Flagged for Exclusion: {len(exclude)}")

    # assemble complete report
    report_sections = []

    # header with instructions
    report_sections.append(generate_header())

    # statistics summary
    report_sections.append(generate_statistics_section(dialogue_candidates, penelope))

    # Penelope episode (definite inclusion)
    report_sections.append(generate_penelope_section(penelope))

    # definite dialogue candidates
    if definite:
        report_sections.append(generate_candidate_section(
            "Definite Dialogue Candidates", definite, 2
        ))

    # probable candidates
    if probable:
        report_sections.append(generate_candidate_section(
            "Probable Candidates (Verification Required)", probable, 3
        ))

    # requires review
    if review:
        report_sections.append(generate_candidate_section(
            "Ambiguous Candidates (Close Reading Required)", review, 4
        ))

    # excluded (showing reasoning)
    if exclude:
        report_sections.append(generate_candidate_section(
            "Flagged for Exclusion (Male-Mediated or Non-Molly)", exclude, 5
        ))

    # footer
    report_sections.append("""
## Next Steps

1. Review each passage in context
2. Make inclusion/exclusion decisions
3. Assign stage numbers for included passages
4. Document reasoning for ambiguous cases
5. Use annotations to generate final corpus files

## Notes on Ambiguous Cases

Record patterns you notice during annotation:
- Common mediation structures not caught by automatic detection
- Distinctive Molly vocabulary to add to patterns
- Episode-specific attribution challenges
- Free indirect discourse boundaries

---

*Report generated by MollyCandidateFinder*
*Project: yes-i-said-yes-i-will-yes*
*For Molly.*
""")

    # write
    complete_report = "".join(report_sections)

    with open(output_filename, 'w', encoding='utf-8') as f:
        f.write(complete_report)

    print("\n" + "=" * 60)
    print("Annotation report generated successfully!")
    print(f"Filename: {output_filename}")
    print(f"Size: {len(complete_report):,} characters")
    print("=" * 60)

    return output_filename


# complete report
output_file = generate_complete_annotation_report(
    penelope=penelope_candidate,
    dialogue_candidates=dialogue_candidates,
    output_filename='molly_candidates_annotation.md'
)

In [None]:
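try:
    from google.colab import files  # only available inside Google Colab
except ImportError:
    files = None

print("Preparing annotation file for download...")
print("=" * 60)

# trigger a browser download in Colab; elsewhere the file is already on disk
if files is not None:
    files.download(output_file)
    status = "ANNOTATION FILE DOWNLOADED"
else:
    status = f"ANNOTATION FILE SAVED LOCALLY: {output_file}"

print("\n" + "=" * 60)
print(status)
print("=" * 60)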
print("\nNext steps:")
print("1. Open the markdown file in your preferred editor")
print("2. Review each candidate passage carefully")
print("3. Make INCLUDE/EXCLUDE/UNCERTAIN decisions")
print("4. Assign training stages (1, 2, or 3) for included passages")
print("5. Document your reasoning for ambiguous cases")
print("\n" + "=" * 60)
print("Key considerations during annotation:")
print("- Bloom's memories of Molly's words are male-encoded (EXCLUDE)")
print("- Direct speech in real-time scenes is primary source (INCLUDE)")
print("- Free indirect discourse requires close reading")
print("- Circe theatrical passages need special attention")
print("=" * 60)
print("\nRemember: This is computational feminism.")
print("We're building a corpus where female interiority is foundational,")
print("not filtered through male consciousness.")
print("=" * 60)
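
*Once the report is annotated, the decisions have to be read back before the final corpus files can be built. The following is a rough sketch of one way to do that, not part of this notebook's pipeline. It assumes a convention the report does not enforce: the annotator replaces the bracketed placeholder with a single word, for example `**Decision**: INCLUDE`. The helper name `read_decisions` is hypothetical.*

```python
import re
from typing import Dict, Optional

# Matches the passage headings written by generate_candidate_section().
PASSAGE_RE = re.compile(r"^### Passage (\d+\.\d+)\s*$")
# Assumed convention: the placeholder has been replaced by exactly one word.
DECISION_RE = re.compile(r"^\*\*Decision\*\*:\s*(INCLUDE|EXCLUDE|UNCERTAIN)\s*$")


def read_decisions(markdown_text: str) -> Dict[str, str]:
    """Map passage IDs (e.g. '2.001') to the decision recorded under them."""
    decisions: Dict[str, str] = {}
    current: Optional[str] = None
    for line in markdown_text.splitlines():
        heading = PASSAGE_RE.match(line)
        if heading:
            current = heading.group(1)
            continue
        decision = DECISION_RE.match(line)
        if decision and current is not None:
            decisions[current] = decision.group(1)
    return decisions


sample = """### Passage 2.001

**Decision**: INCLUDE
**Stage**: 1

### Passage 2.002

**Decision**: [ INCLUDE / EXCLUDE / UNCERTAIN ]
"""

decisions = read_decisions(sample)
print(decisions)
```

*Passages whose placeholder was left untouched, like 2.002 in the sample, do not appear in the result, which gives a quick way to spot unannotated entries.*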