# Story Analysis - COMPLETE & SIMPLIFIED ✨

**RESTART INSTRUCTIONS:**
1. **Restart the kernel** 
2. **Run cells 2-9 in order** (setup + test)
3. **Then use cells 10-12 for full corpus processing & HTML output**

## What this notebook does:
- ✅ **One simple processor** (cell 8; the older processor in cell 7 is kept only for compatibility)
- ✅ **Scene segmentation** that actually works
- ✅ **Full corpus processing** - analyze all books
- ✅ **HTML visualization output** - JSON data for dashboards
- ✅ **Clear, understandable code**
- ✅ **No hanging cells**

## The cells:
1. **Header** (this cell)
2. **LLM Provider Setup** - Choose your LLM
3. **Environment** - Load settings  
4. **Connection** - Connect to LLM
5. **Data Classes** - Scene and Goal definitions
6. **Connection Test** - Make sure it works
7. **Working Processor** - Basic processor (keep for compatibility)
8. **✨ Simple Processor** - The new, clean solution!
9. **🧪 Test** - Test the simple processor
10. **📚 Corpus Processing** - Run on entire corpus
11. **🎨 HTML Visualization** - Format data for HTML dashboard
12. **📋 Workflow Guide** - Complete usage examples
13. **Footer**

## Quick Workflow:
1. **Setup**: Run cells 2-9 
2. **Test**: See cell 9 results
3. **Full corpus**: `corpus_results = process_entire_corpus()`
4. **HTML data**: `viz_data, json_file = create_html_visualization(corpus_results)`

**Ready to start fresh? Restart the kernel and run cells 2-9!**

In [2]:
# LLM Provider Configuration
import ipywidgets as widgets
from IPython.display import display

# Create provider selection widget
provider_widget = widgets.RadioButtons(
    options=['anthropic', 'openai', 'ollama'],
    value='ollama',
    description='LLM Provider:',
    disabled=False
)

# Create model configuration widgets
model_widgets = {
    'anthropic': widgets.Dropdown(
        options=['claude-opus-4-1-20250805', 'claude-sonnet-4-20250514', 'claude-3-5-haiku-latest','claude-3-5-sonnet-20241022', 'claude-3-opus-20240229', 'claude-3-sonnet-20240229'],
        value='claude-3-5-sonnet-20241022',
        description='Anthropic Model:'
    ),
    'openai': widgets.Dropdown(
        options=['gpt-5', 'gpt-5-mini', 'gpt-5-nano', 'gpt-4', 'gpt-4-turbo', 'gpt-3.5-turbo'],
        value='gpt-4',
        description='OpenAI Model:'
    ),
    'ollama': widgets.Dropdown(
        options=['gpt-oss-32k', 'gpt-oss:latest', 'llama3:8b', 'mistral', 'codellama'],
        value='gpt-oss-32k',
        description='Ollama Model:'
    )
}

# Configuration display
config_output = widgets.Output()

def update_config(change=None):
    with config_output:
        config_output.clear_output(wait=True)
        provider = provider_widget.value
        model = model_widgets[provider].value
        
        print(f"🔧 Configuration:")
        print(f"   Provider: {provider.upper()}")
        print(f"   Model: {model}")
        
        if provider == 'anthropic':
            print(f"   Context: 200k tokens | Max Output: 4k tokens")
            print(f"   📝 Note: Requires ANTHROPIC_API_KEY environment variable")
        elif provider == 'openai':
            print(f"   Context: up to 128k tokens (model-dependent) | Max Output: 4k tokens")
            print(f"   📝 Note: Requires OPENAI_API_KEY environment variable")
        elif provider == 'ollama':
            if model == 'gpt-oss-32k':
                print(f"   Context: 32k tokens | Max Output: 4k tokens")
                print(f"   📝 Note: Custom model with enhanced context size")
            else:
                print(f"   Context: 8k tokens | Max Output: 1.5k tokens")
                print(f"   📝 Note: Requires Ollama running locally with model downloaded")

# Setup widgets
provider_widget.observe(update_config, names='value')
for widget in model_widgets.values():
    widget.observe(update_config, names='value')

print("🎛️ LLM Provider Configuration")
print("Choose your preferred LLM provider and model:")
display(widgets.VBox([
    provider_widget,
    widgets.HBox([model_widgets['anthropic'], model_widgets['openai'], model_widgets['ollama']]),
    config_output
]))

# Initialize display
update_config()

print("\n⚠️ Remember to run the next cell after changing provider settings!")

🎛️ LLM Provider Configuration
Choose your preferred LLM provider and model:



⚠️ Remember to run the next cell after changing provider settings!


In [3]:
import os
import json
import pandas as pd
import numpy as np
from pathlib import Path
import re
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, asdict
import time
from datetime import datetime
import ipywidgets as widgets
from IPython.display import display, clear_output
from tqdm import tqdm

# Load environment variables from .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
    print("✅ Environment variables loaded from .env file")
except ImportError:
    print("⚠️ python-dotenv not installed. Install with: pip install python-dotenv")
    print("📝 Falling back to system environment variables")
except Exception as e:
    print(f"⚠️ Failed to load .env file: {e}")
    print("📝 Falling back to system environment variables")

✅ Environment variables loaded from .env file
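
The API-key lookups in the next cell follow one pattern: read the variable, warn if it is missing, and build a client only when a key exists. Below is a minimal stdlib-only sketch of that check. `read_api_key` and the `DEMO_*` variable names are hypothetical and not part of the notebook, which calls `os.getenv` directly.

```python
import os
from typing import Optional


def read_api_key(var_name: str) -> Optional[str]:
    """Return the value of an environment variable, or None if unset or blank.

    Hypothetical helper for illustration only.
    """
    value = os.getenv(var_name, "").strip()
    if not value:
        print(f"⚠️ {var_name} not found in environment variables")
        return None
    return value


# Demo with stand-in variables (assumed names, not real credentials).
os.environ["DEMO_API_KEY"] = "abc123"
os.environ.pop("DEMO_KEY_THAT_IS_NOT_SET", None)

demo_key = read_api_key("DEMO_API_KEY")
missing_key = read_api_key("DEMO_KEY_THAT_IS_NOT_SET")
```

Returning `None` for a missing key lets the caller skip client construction instead of failing later on the first request.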


In [4]:
# Get LLM provider configuration from widgets (if available)
try:
    LLM_PROVIDER = provider_widget.value
    if LLM_PROVIDER == 'anthropic':
        MODEL_NAME = model_widgets['anthropic'].value
    elif LLM_PROVIDER == 'openai':
        MODEL_NAME = model_widgets['openai'].value
    elif LLM_PROVIDER == 'ollama':
        MODEL_NAME = model_widgets['ollama'].value
except NameError:
    # Fallback if widgets not defined
    LLM_PROVIDER = "ollama"
    MODEL_NAME = "gpt-oss-32k"

print(f"🎯 Selected Provider: {LLM_PROVIDER.upper()}")
print(f"🤖 Selected Model: {MODEL_NAME}")

# Initialize LLM clients based on provider selection
anthropic_client = None
openai_client = None
ollama_client = None

if LLM_PROVIDER == "anthropic":
    try:
        import anthropic
        api_key = os.getenv('ANTHROPIC_API_KEY')
        if not api_key:
            print("⚠️ Warning: ANTHROPIC_API_KEY not found in environment variables")
            print("📝 Make sure your .env file contains: ANTHROPIC_API_KEY='your_key_here'")
        else:
            # Only report success when a client was actually created
            anthropic_client = anthropic.Anthropic(api_key=api_key)
            print("✅ Anthropic client initialized!")
    except ImportError:
        print("❌ anthropic package not installed. Run: pip install anthropic")
    except Exception as e:
        print(f"❌ Failed to initialize Anthropic: {e}")

elif LLM_PROVIDER == "openai":
    try:
        import openai
        api_key = os.getenv('OPENAI_API_KEY')
        if not api_key:
            print("⚠️ Warning: OPENAI_API_KEY not found in environment variables")
            print("📝 Make sure your .env file contains: OPENAI_API_KEY='your_key_here'")
        else:
            # Only report success when a client was actually created
            openai_client = openai.OpenAI(api_key=api_key)
            print("✅ OpenAI client initialized!")
    except ImportError:
        print("❌ openai package not installed. Run: pip install openai")
    except Exception as e:
        print(f"❌ Failed to initialize OpenAI: {e}")

elif LLM_PROVIDER == "ollama":
    try:
        import ollama
        # Set up working Ollama client for Windows/WSL
        ollama_client = ollama.Client(host='http://172.21.144.1:11434')
        
        # Test connection with a simple request
        test_response = ollama_client.generate(
            model=MODEL_NAME,
            prompt='Test connection',
            options={
                'num_predict': 10,
                'temperature': 0.1
            }
        )
        print("✅ Ollama connection established!")
        
    except ImportError:
        print("❌ ollama package not installed. Run: pip install ollama")
    except Exception as e:
        print(f"❌ Failed to connect to Ollama: {e}")
        print("Please ensure Ollama is running with the selected model")
        ollama_client = None

# Configuration based on provider
if LLM_PROVIDER == "anthropic":
    MAX_TOKENS = 4000
    CONTEXT_SIZE = 200000
    TEMPERATURE = 0.1
elif LLM_PROVIDER == "openai":
    MAX_TOKENS = 4000
    CONTEXT_SIZE = 128000
    TEMPERATURE = 0.1
elif LLM_PROVIDER == "ollama":
    MAX_TOKENS = 1500
    CONTEXT_SIZE = 15192
    TEMPERATURE = 0.1

# Data directories
DATA_DIR = Path('/home/lucid/story_analysis/corpus_clean/clean corpus no paratext')
RESULTS_DIR = Path('/home/lucid/story_analysis/results')
RESULTS_DIR.mkdir(exist_ok=True)

print(f"✅ Imports and {LLM_PROVIDER.upper()} LLM setup complete")
print(f"📋 Configuration: {MODEL_NAME} | Max Tokens: {MAX_TOKENS} | Context: {CONTEXT_SIZE}")

# Verify client is ready
client_ready = False
if LLM_PROVIDER == "anthropic" and anthropic_client:
    client_ready = True
elif LLM_PROVIDER == "openai" and openai_client:
    client_ready = True
elif LLM_PROVIDER == "ollama" and ollama_client:
    client_ready = True

if client_ready:
    print("🚀 Ready to process!")
else:
    print("❌ Client not ready. Please check configuration and credentials.")
    print("💡 For API providers, ensure your .env file contains the correct API keys")

🎯 Selected Provider: OLLAMA
🤖 Selected Model: gpt-oss-32k
✅ Ollama connection established!
✅ Imports and OLLAMA LLM setup complete
📋 Configuration: gpt-oss-32k | Max Tokens: 1500 | Context: 15192
🚀 Ready to process!

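The per-provider limits set by the `if`/`elif` chain above could also be kept in a lookup table, which makes an unknown provider fail loudly instead of leaving `MAX_TOKENS` undefined. This is a sketch of that alternative, not the notebook's approach. `ProviderDefaults`, `PROVIDER_DEFAULTS`, and `defaults_for` are hypothetical names, and the numbers are copied from the cell above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderDefaults:
    """Generation limits for one provider (hypothetical container)."""
    max_tokens: int
    context_size: int
    temperature: float


# Values copied from the configuration cell above.
PROVIDER_DEFAULTS = {
    "anthropic": ProviderDefaults(max_tokens=4000, context_size=200000, temperature=0.1),
    "openai": ProviderDefaults(max_tokens=4000, context_size=128000, temperature=0.1),
    "ollama": ProviderDefaults(max_tokens=1500, context_size=15192, temperature=0.1),
}


def defaults_for(provider: str) -> ProviderDefaults:
    """Look up defaults case-insensitively; raise ValueError for unknown providers."""
    try:
        return PROVIDER_DEFAULTS[provider.lower()]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None


ollama_defaults = defaults_for("OLLAMA")
print(ollama_defaults.max_tokens, ollama_defaults.context_size)
```
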

In [5]:
@dataclass
class Scene:
    """Represents a scene within a chapter"""
    scene_id: str  # Format: "book_X_chapter_Y_scene_Z"
    book_id: str
    chapter_num: int
    scene_num: int
    text: str
    narrator: Optional[str] = None
    start_paragraph: Optional[int] = None
    end_paragraph: Optional[int] = None

@dataclass
class Goal:
    """Represents a character goal with dual categorization"""
    goal_id: str
    scene_id: str
    character: str
    goal_text: str
    motivation_type: str  # "internal" or "external"
    category: str  # "academic", "family", "personal", "social", "work", etc.
    evidence: str  # Text evidence supporting this goal
    confidence: float  # LLM confidence score 0-1

@dataclass
class ProcessingProgress:
    """Tracks processing progress for resumable operations"""
    books_segmented: List[str]
    books_narrators_identified: List[str]
    books_goals_analyzed: List[str]
    last_updated: str
    total_books: int
    
    def save(self, filepath: Path):
        with open(filepath, 'w') as f:
            json.dump(asdict(self), f, indent=2)
    
    @classmethod
    def load(cls, filepath: Path):
        if filepath.exists():
            with open(filepath, 'r') as f:
                data = json.load(f)
            return cls(**data)
        return cls([], [], [], datetime.now().isoformat(), 0)

print("✅ Data structures defined")

✅ Data structures defined
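
`ProcessingProgress` is what makes a long corpus run resumable: `load` returns an empty record when no file exists, and `save` writes the current state as JSON. The sketch below shows that round trip with a trimmed stand-in class so it runs on its own. `ProgressSketch` and the book id `book_001` are hypothetical.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from datetime import datetime
from pathlib import Path
from typing import List


@dataclass
class ProgressSketch:
    """Trimmed stand-in for ProcessingProgress (hypothetical name)."""
    books_segmented: List[str]
    last_updated: str
    total_books: int

    def save(self, filepath: Path) -> None:
        with open(filepath, "w") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, filepath: Path) -> "ProgressSketch":
        if filepath.exists():
            with open(filepath, "r") as f:
                return cls(**json.load(f))
        # No file yet: start with an empty progress record
        return cls([], datetime.now().isoformat(), 0)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "progress.json"
    fresh = ProgressSketch.load(path)          # file absent, so progress is empty
    fresh.books_segmented.append("book_001")   # hypothetical book id
    fresh.total_books = 1
    fresh.save(path)
    resumed = ProgressSketch.load(path)        # file present, so state is restored
```

Because every field is a plain list, string, or integer, `asdict` output is directly JSON-serializable and `cls(**data)` restores it without custom decoding.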


In [6]:
# Helper functions for provider management and dynamic context sizing
import tiktoken

def count_tokens(text: str, model_name: str = None) -> int:
    """Count tokens in text using tiktoken for OpenAI models or estimate for others"""
    try:
        if LLM_PROVIDER == "openai" and model_name:
            encoding = tiktoken.encoding_for_model(model_name)
            return len(encoding.encode(text))
        else:
            # Rough estimation: ~4 characters per token
            return len(text) // 4
    except Exception:
        # Fallback estimation
        return len(text) // 4

def get_ollama_model_info(model_name: str) -> dict:
    """Get model information from Ollama including context size"""
    try:
        if ollama_client:
            # Try to get model info
            response = ollama_client.show(model_name)
            
            # Extract context size from model info
            context_size = 2048  # Default fallback
            
            # Look for context size in various places
            if 'modelinfo' in response:
                modelinfo = response['modelinfo']
                # Common parameter names for context size
                for param in ['num_ctx', 'context_length', 'n_ctx', 'max_context']:
                    if param in modelinfo:
                        context_size = int(modelinfo[param])
                        break
            
            # Handle our custom models and known models
            if 'gpt-oss-32k' in model_name.lower():
                # Our custom model with 32k context
                context_size = 32768
            elif 'gpt-oss' in model_name.lower():
                # Original gpt-oss model
                context_size = max(context_size, 32768)  # At least 32k
            elif 'llama3:8b' in model_name.lower():
                context_size = max(context_size, 8192)
            elif 'llama2' in model_name.lower():
                context_size = max(context_size, 4096)
            elif 'mistral' in model_name.lower():
                context_size = max(context_size, 32768)
            elif 'codellama' in model_name.lower():
                context_size = max(context_size, 16384)
            
            return {
                'context_size': context_size,
                'model_info': response.get('modelinfo', {}),
                'parameters': response.get('parameters', {})
            }
    except Exception as e:
        print(f"⚠️ Could not get model info: {e}")
    
    # Fallback defaults
    if 'gpt-oss-32k' in model_name.lower():
        return {'context_size': 32768, 'model_info': {}, 'parameters': {}}
    elif 'gpt-oss' in model_name.lower():
        return {'context_size': 32768, 'model_info': {}, 'parameters': {}}
    elif 'llama3:8b' in model_name.lower():
        return {'context_size': 8192, 'model_info': {}, 'parameters': {}}
    elif 'llama2' in model_name.lower():
        return {'context_size': 4096, 'model_info': {}, 'parameters': {}}
    elif 'mistral' in model_name.lower():
        return {'context_size': 32768, 'model_info': {}, 'parameters': {}}
    elif 'codellama' in model_name.lower():
        return {'context_size': 16384, 'model_info': {}, 'parameters': {}}
    else:
        return {'context_size': 8192, 'model_info': {}, 'parameters': {}}

def get_optimal_context_size(provider: str, model: str) -> dict:
    """Get optimal context sizes for different models"""
    if provider == 'ollama':
        # Get actual model info for Ollama
        model_info = get_ollama_model_info(model)
        max_context = model_info['context_size']
        print(f"🔍 Detected Ollama model context size: {max_context:,} tokens")
    else:
        # Static limits for other providers
        context_limits = {
            'anthropic': {
                'claude-3-5-sonnet-20241022': 200000,
                'claude-3-opus-20240229': 200000,
                'claude-3-sonnet-20240229': 200000,
                'default': 200000
            },
            'openai': {
                'gpt-4': 8192,  # base gpt-4 has an 8k window; 128k applies to gpt-4-turbo and gpt-4o
                'gpt-4-turbo': 128000,
                'gpt-4o': 128000,
                'gpt-3.5-turbo': 16385,
                'default': 128000
            }
        }
        
        provider_limits = context_limits.get(provider, {})
        max_context = provider_limits.get(model, provider_limits.get('default', 8192))
    
    # Reserve 20% for output and system overhead
    usable_context = int(max_context * 0.8)
    max_output = min(4000, int(max_context * 0.2))
    
    return {
        'max_context': max_context,
        'usable_context': usable_context,
        'max_output': max_output
    }

def chunk_text_for_context(text: str, max_tokens: int, overlap: int = 200) -> List[str]:
    """Split text into chunks that fit within context limits with overlap"""
    if count_tokens(text, MODEL_NAME) <= max_tokens:
        return [text]
    
    # Split by paragraphs first
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    current_tokens = 0
    
    for para in paragraphs:
        para_tokens = count_tokens(para, MODEL_NAME)
        
        # If single paragraph is too large, split it further
        if para_tokens > max_tokens:
            # Split by sentences
            sentences = para.split('. ')
            for sentence in sentences:
                sentence_tokens = count_tokens(sentence, MODEL_NAME)
                if current_tokens + sentence_tokens > max_tokens and current_chunk:
                    chunks.append(current_chunk.strip())
                    # Keep overlap
                    overlap_text = '. '.join(current_chunk.split('. ')[-overlap//20:])
                    current_chunk = overlap_text + ". " + sentence
                    current_tokens = count_tokens(current_chunk, MODEL_NAME)
                else:
                    current_chunk += ". " + sentence if current_chunk else sentence
                    current_tokens += sentence_tokens
        else:
            # Check if adding this paragraph exceeds limit
            if current_tokens + para_tokens > max_tokens and current_chunk:
                chunks.append(current_chunk.strip())
                # Keep overlap
                overlap_paras = current_chunk.split('\n\n')[-2:] if len(current_chunk.split('\n\n')) > 1 else []
                current_chunk = '\n\n'.join(overlap_paras) + '\n\n' + para if overlap_paras else para
                current_tokens = count_tokens(current_chunk, MODEL_NAME)
            else:
                current_chunk += '\n\n' + para if current_chunk else para
                current_tokens += para_tokens
    
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    
    return chunks

def test_ollama_context_systematically():
    """Test the Ollama model with realistic context sizes for the new gpt-oss-32k model"""
    if LLM_PROVIDER != "ollama" or not ollama_client:
        print("❌ This function only works with Ollama")
        return
        
    print(f"🔬 Systematic context testing for {MODEL_NAME}")
    print("=" * 60)
    
    # Get model info first
    model_info = get_ollama_model_info(MODEL_NAME)
    print(f"📊 Model claims context size: {model_info['context_size']:,} tokens")
    
    # Test with realistic incremental sizes
    if 'gpt-oss-32k' in MODEL_NAME:
        test_contexts = [4096, 8192, 16384, 24576, 32768]
    else:
        test_contexts = [2048, 4096, 8192, 16384]
    
    working_context = 2048  # Conservative default
    max_successful_tokens = 0
    
    for ctx_size in test_contexts:
        print(f"\n🧪 Testing with num_ctx={ctx_size:,}...")
        
        # Create a test that uses about 60% of the context for safety
        target_tokens = int(ctx_size * 0.6)
        
        # Create story-like content similar to what we'll actually process
        base_story = """
        Kristy Thomas sat in her room thinking about the Baby-Sitters Club meeting that was about to start. 
        As club president, she always made sure everything was organized and running smoothly. The other members 
        would arrive soon: Claudia Kishi with her artistic flair, Mary Anne Spier with her careful scheduling, 
        and Stacey McGill with her math skills for handling the money.
        """
        
        # Repeat the story to reach target tokens
        repetitions = max(1, target_tokens // count_tokens(base_story, MODEL_NAME))
        test_text = base_story * repetitions
        actual_tokens = count_tokens(test_text, MODEL_NAME)
        
        test_prompt = f"Please analyze the main themes in this text and respond in 2-3 sentences:\n\n{test_text}"
        prompt_tokens = count_tokens(test_prompt, MODEL_NAME)
        
        print(f"  📝 Prompt tokens: {prompt_tokens:,}")
        print(f"  🎯 Target context: {ctx_size:,}")
        
        try:
            response = ollama_client.generate(
                model=MODEL_NAME,
                prompt=test_prompt,
                options={
                    'num_predict': 200,
                    'temperature': 0.3,
                    'num_ctx': ctx_size,
                    'top_p': 0.9
                }
            )
            
            if response and response.get('response') and len(response['response'].strip()) > 20:
                working_context = ctx_size
                max_successful_tokens = prompt_tokens
                print(f"✅ Success with num_ctx={ctx_size:,} (input: {prompt_tokens:,} tokens)")
                print(f"   Response preview: {response['response'][:150]}...")
            else:
                print(f"❌ Empty/minimal response with num_ctx={ctx_size:,}")
                print(f"   Response: '{response.get('response', 'None')[:50]}...'")
                break
                
        except Exception as e:
            print(f"❌ Error with num_ctx={ctx_size:,}: {str(e)[:100]}...")
            # If we get a memory error, stop testing larger sizes
            if any(word in str(e).lower() for word in ["memory", "cuda", "out of memory", "allocation"]):
                print("   💾 Memory limitation reached")
                break
            # For other errors, continue testing
            continue
            
        time.sleep(2)  # Give GPU time to clear memory between tests
    
    print(f"\n🎯 Test Results:")
    print(f"   Maximum working context: {working_context:,} tokens")
    print(f"   Maximum successful input: {max_successful_tokens:,} tokens")
    print(f"   Recommended working size: {int(working_context * 0.8):,} tokens")
    
    # Update global setting
    global CONTEXT_SIZE
    old_size = CONTEXT_SIZE
    CONTEXT_SIZE = int(working_context * 0.8)  # Use 80% for safety
    
    print(f"\n🔧 Updated CONTEXT_SIZE: {old_size:,} → {CONTEXT_SIZE:,} tokens")
    
    # Calculate what this means for story analysis tasks
    print(f"\n📚 Story Analysis Capacity:")
    print(f"   Narrator identification: ✅ (needs ~500-1,500 tokens)")
    print(f"   Scene segmentation: {'✅' if CONTEXT_SIZE >= 4000 else '⚠️'} (needs ~2,000-8,000 tokens)")
    print(f"   Goal analysis: {'✅' if CONTEXT_SIZE >= 2000 else '⚠️'} (needs ~1,000-4,000 tokens)")
    print(f"   Large chapters: {'✅' if CONTEXT_SIZE >= 15000 else '⚠️'} (some chapters are 10k+ tokens)")
    
    return working_context

def test_context_limits():
    """Test and determine optimal context size for current provider"""
    print(f"🧪 Testing context limits for {LLM_PROVIDER.upper()} ({MODEL_NAME})...")
    
    # Get theoretical limits
    limits = get_optimal_context_size(LLM_PROVIDER, MODEL_NAME)
    print(f"📊 Theoretical limits:")
    print(f"   Max context: {limits['max_context']:,} tokens")
    print(f"   Usable context: {limits['usable_context']:,} tokens")
    print(f"   Max output: {limits['max_output']:,} tokens")
    
    # For Ollama, also show model-specific info
    if LLM_PROVIDER == "ollama":
        model_info = get_ollama_model_info(MODEL_NAME)
        if model_info.get('parameters'):
            print(f"📋 Model parameters: {model_info['parameters']}")
    
    # Test with progressively larger inputs
    test_sizes = [1000, 5000, 10000, 20000]
    
    # Adjust test sizes based on detected limits
    if limits['usable_context'] > 20000:
        test_sizes.extend([50000, 100000])
    if limits['usable_context'] > 100000:
        test_sizes.extend([150000, 200000])
    
    working_size = 1000
    
    for size in test_sizes:
        if size > limits['usable_context']:
            break
            
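        # Create test text of roughly `size` tokens, using the same
        # ~4 characters per token estimate as count_tokens ("Test sentence. " is 15 chars)
        test_text = "Test sentence. " * (size * 4 // 15)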
        actual_tokens = count_tokens(test_text, MODEL_NAME)
        
        print(f"\n🔍 Testing {actual_tokens:,} tokens...")
        
        try:
            if LLM_PROVIDER == "anthropic" and anthropic_client:
                response = anthropic_client.messages.create(
                    model=MODEL_NAME,
                    max_tokens=100,
                    temperature=0,
                    messages=[{"role": "user", "content": f"Summarize this in one word: {test_text}"}]
                )
                if response.content[0].text.strip():
                    working_size = actual_tokens
                    print(f"✅ Success at {actual_tokens:,} tokens")
                else:
                    print(f"❌ Empty response at {actual_tokens:,} tokens")
                    break
                    
            elif LLM_PROVIDER == "openai" and openai_client:
                response = openai_client.chat.completions.create(
                    model=MODEL_NAME,
                    max_tokens=100,
                    temperature=0,
                    messages=[{"role": "user", "content": f"Summarize this in one word: {test_text}"}]
                )
                if response.choices[0].message.content and response.choices[0].message.content.strip():
                    working_size = actual_tokens
                    print(f"✅ Success at {actual_tokens:,} tokens")
                else:
                    print(f"❌ Empty response at {actual_tokens:,} tokens")
                    break
                    
            elif LLM_PROVIDER == "ollama" and ollama_client:
                # For Ollama, use the working parameters we discovered
                required_context = actual_tokens + 500  # Add buffer for output
                context_to_use = min(required_context, limits['max_context'])
                
                print(f"  🔧 Setting num_ctx to {context_to_use:,} tokens")
                
                response = ollama_client.generate(
                    model=MODEL_NAME,
                    prompt=f"Summarize this in one word: {test_text}",
                    options={
                        'num_predict': 100, 
                        'temperature': 0.3,
                        'num_ctx': context_to_use,
                        'top_p': 0.9
                    }
                )
                
                if response.get('response', '').strip():
                    working_size = actual_tokens
                    print(f"✅ Success at {actual_tokens:,} tokens")
                else:
                    print(f"❌ Empty response at {actual_tokens:,} tokens")
                    break
            
            time.sleep(1)  # Rate limiting
            
        except Exception as e:
            print(f"❌ Failed at {actual_tokens:,} tokens: {e}")
            break
    
    print(f"\n🎯 Recommended working context size: {working_size:,} tokens")
    return working_size

def update_context_settings():
    """Update global context settings based on testing"""
    global CONTEXT_SIZE, MAX_TOKENS
    
    # Use the systematic testing for Ollama
    if LLM_PROVIDER == "ollama":
        optimal_size = test_ollama_context_systematically()
    else:
        optimal_size = test_context_limits()
        
    limits = get_optimal_context_size(LLM_PROVIDER, MODEL_NAME)
    
    # Use conservative size (80% of tested working size)
    CONTEXT_SIZE = int(optimal_size * 0.8)
    MAX_TOKENS = min(limits['max_output'], 4000)
    
    print(f"\n⚙️ Final Updated Settings:")
    print(f"   Context size: {CONTEXT_SIZE:,} tokens")
    print(f"   Max output: {MAX_TOKENS:,} tokens")
    
    return CONTEXT_SIZE, MAX_TOKENS

def configure_ollama_context(context_size: int = None):
    """Configure Ollama with a specific context size"""
    if LLM_PROVIDER != "ollama" or not ollama_client:
        print("❌ This function only works with Ollama")
        return
    
    if not context_size:
        # Auto-detect optimal size
        limits = get_optimal_context_size(LLM_PROVIDER, MODEL_NAME)
        context_size = limits['usable_context']
    
    global CONTEXT_SIZE
    CONTEXT_SIZE = context_size
    
    print(f"🔧 Configured Ollama context size: {CONTEXT_SIZE:,} tokens")
    print(f"💡 This will be used in the num_ctx parameter for requests")

def test_llm_connection():
    """Test connection to the currently selected LLM provider"""
    print(f"🧪 Testing {LLM_PROVIDER.upper()} connection...")
    
    test_prompt = "Respond with exactly: {'test': 'success'}"
    
    try:
        if LLM_PROVIDER == "anthropic" and anthropic_client:
            response = anthropic_client.messages.create(
                model=MODEL_NAME,
                max_tokens=100,
                temperature=0,
                messages=[{"role": "user", "content": test_prompt}]
            )
            print(f"✅ Anthropic response: {response.content[0].text}")
            
        elif LLM_PROVIDER == "openai" and openai_client:
            response = openai_client.chat.completions.create(
                model=MODEL_NAME,
                max_tokens=100,
                temperature=0,
                messages=[{"role": "user", "content": test_prompt}]
            )
            print(f"✅ OpenAI response: {response.choices[0].message.content}")
            
        elif LLM_PROVIDER == "ollama" and ollama_client:
            response = ollama_client.generate(
                model=MODEL_NAME,
                prompt=test_prompt,
                options={
                    'num_predict': 100, 
                    'temperature': 0.3, 
                    'num_ctx': CONTEXT_SIZE,
                    'top_p': 0.9
                }
            )
            print(f"✅ Ollama response: {response['response']}")
            
        else:
            print(f"❌ No valid {LLM_PROVIDER} client available")
            return False
            
        return True
        
    except Exception as e:
        print(f"❌ Connection test failed: {e}")
        return False

def show_provider_status():
    """Show the status of all LLM providers"""
    print("📊 LLM Provider Status:")
    print(f"   Current: {LLM_PROVIDER.upper()} ({MODEL_NAME})")
    print(f"   Context: {CONTEXT_SIZE:,} tokens | Max Output: {MAX_TOKENS:,} tokens")
    print(f"   Anthropic: {'✅' if anthropic_client else '❌'}")
    print(f"   OpenAI: {'✅' if openai_client else '❌'}")
    print(f"   Ollama: {'✅' if ollama_client else '❌'}")

def quick_switch_provider(provider: str, model: str = None):
    """Quick function to switch providers programmatically"""
    global LLM_PROVIDER, MODEL_NAME
    
    if provider.lower() not in ['anthropic', 'openai', 'ollama']:
        print("❌ Invalid provider. Choose: anthropic, openai, or ollama")
        return
    
    # Update widget values
    provider_widget.value = provider.lower()
    
    if model:
        if provider.lower() in model_widgets:
            model_widgets[provider.lower()].value = model
    
    print(f"🔄 Switched to {provider.upper()}")
    print("💡 Run the imports cell again to initialize the new provider")

# Test current connection and get model info
show_provider_status()

if LLM_PROVIDER == "ollama" and ollama_client:
    print(f"\n🔍 Getting {MODEL_NAME} model information...")
    model_info = get_ollama_model_info(MODEL_NAME)
    print(f"📊 Detected context size: {model_info['context_size']:,} tokens")

if any([anthropic_client, openai_client, ollama_client]):
    print("\n🔧 Testing basic connection...")
    test_llm_connection()

print("\n💡 Quick Commands:")
print("   show_provider_status() - Show all provider statuses")
print("   test_llm_connection() - Test current provider")
print("   test_ollama_context_systematically() - Systematic Ollama context testing")
print("   update_context_settings() - Auto-optimize context settings")
print("   configure_ollama_context(size) - Set specific Ollama context size")
print("   quick_switch_provider('provider', 'model') - Switch providers")

📊 LLM Provider Status:
   Current: OLLAMA (gpt-oss-32k)
   Context: 15,192 tokens | Max Output: 1,500 tokens
   Anthropic: ❌
   OpenAI: ❌
   Ollama: ✅

🔍 Getting gpt-oss-32k model information...
📊 Detected context size: 32,768 tokens

🔧 Testing basic connection...
🧪 Testing OLLAMA connection...
✅ Ollama response: {'test': 'success'}

💡 Quick Commands:
   show_provider_status() - Show all provider statuses
   test_llm_connection() - Test current provider
   test_ollama_context_systematically() - Systematic Ollama context testing
   update_context_settings() - Auto-optimize context settings
   configure_ollama_context(size) - Set specific Ollama context size
   quick_switch_provider('provider', 'model') - Switch providers

In [7]:
# Robust JSON parsing function
import re

def robust_json_parse(response_text: str):
    """Parse JSON from LLM response with multiple fallback strategies"""
    if not response_text:
        return None
    
    # Strategy 1: Try to parse as-is
    try:
        return json.loads(response_text.strip())
    except json.JSONDecodeError:
        pass
    
    # Strategy 2: Look for a JSON object (greedy: first '{' to last '}')
    json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
    if json_match:
        try:
            return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass
    
    # Strategy 3: Look for a JSON array (greedy: first '[' to last ']')
    array_match = re.search(r'\[.*\]', response_text, re.DOTALL)
    if array_match:
        try:
            return json.loads(array_match.group())
        except json.JSONDecodeError:
            pass
    
    print(f"⚠️ Could not parse JSON from: {response_text[:100]}...")
    return None

class WorkingStoryProcessor:
    """A properly working story processor with correct token settings"""
    
    def __init__(self, data_dir, results_dir):
        self.data_dir = Path(data_dir)
        self.results_dir = Path(results_dir)
        self.results_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent folders
        
        # Conservative chunk sizes that work reliably
        self.max_scene_chunk_chars = 3000
        self.max_goal_chunk_chars = 4000
    
    def call_llm(self, prompt: str, max_retries: int = 3) -> Optional[Dict]:
        """Call LLM API with unified interface for all providers"""
        for attempt in range(max_retries):
            try:
                response_text = None
                
                if LLM_PROVIDER == "anthropic" and anthropic_client:
                    response = anthropic_client.messages.create(
                        model=MODEL_NAME,
                        max_tokens=MAX_TOKENS,
                        temperature=TEMPERATURE,
                        messages=[{"role": "user", "content": prompt}]
                    )
                    response_text = response.content[0].text
                
                elif LLM_PROVIDER == "openai" and openai_client:
                    response = openai_client.chat.completions.create(
                        model=MODEL_NAME,
                        max_tokens=MAX_TOKENS,
                        temperature=TEMPERATURE,
                        messages=[{"role": "user", "content": prompt}]
                    )
                    response_text = response.choices[0].message.content
                
                elif LLM_PROVIDER == "ollama" and ollama_client:
                    # Dynamic context sizing for Ollama
                    prompt_tokens = count_tokens(prompt, MODEL_NAME)
                    dynamic_context = min(prompt_tokens + MAX_TOKENS + 500, CONTEXT_SIZE)
                    
                    response = ollama_client.generate(
                        model=MODEL_NAME,
                        prompt=prompt,
                        options={
                            'num_predict': MAX_TOKENS,
                            'temperature': TEMPERATURE,
                            'num_ctx': dynamic_context
                        }
                    )
                    response_text = response['response']
                
                else:
                    raise Exception(f"No valid {LLM_PROVIDER} client available")
                
                if not response_text:
                    raise Exception("Empty response from LLM")
                
                # Parse JSON response
                return robust_json_parse(response_text)
                    
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff
                else:
                    print(f"Failed to get LLM response after {max_retries} attempts")
                    return None
    
    def segment_scenes(self, book_id: str, text: str):
        """Segment scenes with proper token settings"""
        
        if len(text) <= self.max_scene_chunk_chars:
            return self._segment_single_chunk(book_id, text)
        
        # Split into chunks
        chunks = self._split_text(text, self.max_scene_chunk_chars)
        all_scenes = []
        
        for i, chunk in enumerate(chunks):
            print(f"   📄 Processing scene chunk {i+1}/{len(chunks)} ({len(chunk):,} chars)")
            scenes = self._segment_single_chunk(book_id, chunk, scene_offset=len(all_scenes))
            all_scenes.extend(scenes)
            print(f"      Found {len(scenes)} scenes")
        
        return all_scenes
    
    def _segment_single_chunk(self, book_id: str, text: str, scene_offset: int = 0):
        """Segment scenes in a single chunk with proper settings"""
        
        prompt = f'''Identify natural scene breaks in this Baby-sitters Club book text.

A scene is a continuous sequence in the same location/time. Look for:
- Location changes
- Time jumps  
- Clear narrative breaks

Text:
{text}

Respond in JSON:
{{
  "scenes": [
    {{
      "scene_num": 1,
      "description": "Brief scene description"
    }}
  ]
}}'''
        
        try:
            result = self.call_llm(prompt)
            if result and 'scenes' in result:
                scenes = result['scenes']
                # Add metadata and adjust numbers
                for scene in scenes:
                    scene['scene_num'] += scene_offset
                    scene['book_id'] = book_id
                    scene['text'] = text  # Store the chunk text
                    scene['scene_id'] = f"{book_id}_scene_{scene['scene_num']}"
                return scenes
        except Exception as e:
            print(f"         Error in scene segmentation: {e}")
        
        return []
    
    def analyze_goals(self, scene_data):
        """Analyze goals with proper token settings"""
        
        text = scene_data.get('text', '')
        if len(text) > self.max_goal_chunk_chars:
            text = text[:self.max_goal_chunk_chars]  # Truncate if too long
        
        prompt = f'''Analyze character goals in this Baby-sitters Club scene:

Scene: {scene_data.get('description', 'Scene from story')}

Text:
{text}

What do characters want or try to achieve? Respond in JSON:
{{
  "goals": [
    {{
      "character": "Character Name",
      "goal": "What they want to achieve",
      "evidence": "Quote or description from text",
      "category": "social/family/personal/academic/other"
    }}
  ]
}}'''
        
        try:
            result = self.call_llm(prompt)
            if result and 'goals' in result:
                goals = result['goals']
                # Add metadata
                for goal in goals:
                    goal['scene_id'] = scene_data.get('scene_id', 'unknown')
                    goal['book_id'] = scene_data.get('book_id', 'unknown')
                return goals
        except Exception as e:
            print(f"         Error in goal analysis: {e}")
        
        return []
    
    def _split_text(self, text: str, max_chars: int):
        """Split text into chunks by paragraphs"""
        paragraphs = text.split('\n\n')
        chunks = []
        current_chunk = ""
        
        for para in paragraphs:
            if len(current_chunk) + len(para) > max_chars and current_chunk:
                chunks.append(current_chunk.strip())
                current_chunk = para
            else:
                current_chunk += "\n\n" + para if current_chunk else para
        
        if current_chunk.strip():
            chunks.append(current_chunk.strip())
        
        return chunks

print("✅ WorkingStoryProcessor class defined with complete LLM integration")

✅ WorkingStoryProcessor class defined with complete LLM integration
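
Strategy 2 in `robust_json_parse` uses a greedy pattern, so it captures everything from the first `{` to the last `}`. If the model adds text containing a stray brace after the JSON, that match no longer parses as the intended object. One possible alternative is sketched below using the standard library's `json.JSONDecoder.raw_decode`, which decodes a value starting at a given position and ignores whatever follows it. The function name `extract_first_json` is hypothetical and is not part of this notebook.

```python
import json
from typing import Any, Optional


def extract_first_json(text: str) -> Optional[Any]:
    """Return the first JSON object or array embedded in text, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start`
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
    return None


reply = 'Sure! {"scenes": [{"scene_num": 1}]} Hope this {helps}'
parsed = extract_first_json(reply)
print(parsed)
```

This is only a sketch; it returns the first value that decodes, which may not be the one you want if the reply contains several JSON fragments.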


In [None]:
# ====================================================================
# SIMPLE STORY PROCESSOR - ONE CLEAR SOLUTION
# ====================================================================

class SimpleStoryProcessor:
    """
    A single, clear processor for story analysis.
    No confusing multiple versions - just one that works!
    """
    
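    def __init__(self):
        # Note: this processor talks to Ollama directly; it does not use the
        # multi-provider call_llm() path in WorkingStoryProcessor.
        self.client = ollama_client
        self.model = MODEL_NAME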
        
    def analyze_story(self, story_text, story_id="story"):
        """
        Segment a story into scenes.
        
        Goal analysis is not performed here; only scene segmentation.
        
        Args:
            story_text: The text to analyze
            story_id: ID for the story
            
        Returns:
            List of Scene objects (empty if the LLM call or JSON parsing fails)
        """
        print(f"📖 Analyzing story: {len(story_text):,} characters")
        
        # Create prompt for scene segmentation
        prompt = f'''Analyze this Baby-sitters Club story text and identify natural scene breaks.

A scene is a continuous sequence in the same location/time. Look for:
- Chapter breaks
- Location changes  
- Time jumps
- Character perspective shifts

Text to analyze:
{story_text}

Return JSON with this structure:
{{
  "scenes": [
    {{
      "scene_id": "scene_1",
      "description": "Brief description of what happens",
      "text": "The actual scene text"
    }}
  ]
}}'''

        try:
            # Call LLM
            print("🧠 Calling LLM...")
            response = self.client.generate(
                model=self.model,
                prompt=prompt,
                options={
                    'num_predict': 2000,  # Enough tokens for response
                    'temperature': 0.3,   # Consistent results
                    'num_ctx': 16000      # Good context size
                }
            )
            
            response_text = response['response']
            print(f"✅ Got response: {len(response_text)} characters")
            
            # Parse JSON
            import json
            import re
            
            # Find JSON in response
            json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
            if json_match:
                json_str = json_match.group()
                data = json.loads(json_str)
                
                scenes = []
                for i, scene_data in enumerate(data.get('scenes', []), 1):
                    scene = Scene(
                        scene_id=f"{story_id}_scene_{i}",
                        book_id=story_id,
                        chapter_num=1,
                        scene_num=i,
                        text=scene_data.get('text', '')
                    )
                    scenes.append(scene)
                
                print(f"🎬 Created {len(scenes)} scenes")
                return scenes
                
            else:
                print("❌ No valid JSON found in response")
                return []
                
        except Exception as e:
            print(f"❌ Error: {e}")
            return []

# Create the processor
processor = SimpleStoryProcessor()
print("✅ Simple story processor created!")
print("   📋 Use: processor.analyze_story(text, 'story_id')")
print("   🎯 Returns: List of Scene objects")
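
`SimpleStoryProcessor` hard-codes `num_ctx` to 16000, while `WorkingStoryProcessor.call_llm` sizes the context from the prompt. The sketch below shows that same formula in isolation. It assumes a rough estimate of 4 characters per token (real tokenizers differ), and the default limit of 32,768 is taken from the model info printed earlier. The helper names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return int(len(text) / chars_per_token) + 1


def pick_num_ctx(prompt: str, max_output_tokens: int = 2000,
                 margin: int = 500, context_limit: int = 32768) -> int:
    """Prompt tokens + output budget + safety margin, capped at the limit."""
    needed = estimate_tokens(prompt) + max_output_tokens + margin
    return min(needed, context_limit)


def prompt_fits(prompt: str, max_output_tokens: int = 2000,
                margin: int = 500, context_limit: int = 32768) -> bool:
    """False means the prompt should be split into chunks before sending."""
    needed = estimate_tokens(prompt) + max_output_tokens + margin
    return needed <= context_limit


short_prompt = "x" * 4_000
long_prompt = "x" * 200_000
print(pick_num_ctx(short_prompt), prompt_fits(short_prompt))
print(pick_num_ctx(long_prompt), prompt_fits(long_prompt))
```

When `prompt_fits` returns False, capping `num_ctx` does not help: the model would see a truncated prompt, so the text needs to be chunked first.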

In [None]:
# ====================================================================
# SIMPLE TEST - DOES IT WORK?
# ====================================================================

print("🧪 TESTING THE SIMPLE PROCESSOR")
print("=" * 50)

# Test with sample text
test_story = '''Chapter 1: The New Girl

Shannon Kilbourne walked nervously into Stoneybrook Middle School. She had just moved from New York City, and everything felt different here.

"Hi there!" called a friendly voice. A girl with brown hair approached her. "I'm Mary Anne Spier. Are you new?"

"Yes, I'm Shannon," she replied with relief. "This place is so different from my old school."

Chapter 2: Meeting the Club  

At lunch, Mary Anne introduced Shannon to her friends.

"This is Kristy Thomas," Mary Anne said, pointing to a confident-looking girl. "She started our babysitters club."

"Nice to meet you!" Kristy said enthusiastically. "We're always looking for good babysitters. Do you like kids?"

Shannon's eyes lit up. "I love babysitting! Tell me more about this club."'''

print(f"📝 Test story: {len(test_story)} characters")

# Test the processor
try:
    scenes = processor.analyze_story(test_story, "test_story")
    
    if scenes:
        print(f"\n🎉 SUCCESS!")
        print(f"✅ Found {len(scenes)} scenes:")
        
        for i, scene in enumerate(scenes, 1):
            print(f"\n   Scene {i}: {scene.scene_id}")
            print(f"   Preview: {scene.text[:80]}...")
            
        # Store results for further use
        story_scenes = {"test_story": scenes}
        print(f"\n📋 Results stored in 'story_scenes' variable")
        
    else:
        print(f"\n❌ No scenes found")
        story_scenes = {}
        
except Exception as e:
    print(f"\n❌ Test failed: {e}")
    story_scenes = {}

print("=" * 50)

In [None]:
# ====================================================================
# CORPUS PROCESSING - RUN ON ENTIRE CORPUS
# ====================================================================

def process_entire_corpus():
    """Process all books in the corpus"""
    print("🚀 PROCESSING ENTIRE CORPUS")
    print("=" * 60)
    
    # Find all text files in the corpus
    corpus_path = DATA_DIR / "clean corpus no paratext"
    
    if not corpus_path.exists():
        print(f"❌ Corpus directory not found: {corpus_path}")
        return {}
    
    txt_files = list(corpus_path.glob("*.txt"))
    print(f"📚 Found {len(txt_files)} books to process")
    
    all_results = {}
    
    for i, book_file in enumerate(txt_files, 1):
        book_id = book_file.stem  # filename without extension
        print(f"\n📖 Processing {i}/{len(txt_files)}: {book_id}")
        
        try:
            # Read the book
            with open(book_file, 'r', encoding='utf-8') as f:
                text = f.read().strip()
            
            print(f"   📝 Text length: {len(text):,} characters")
            
            # Limit text size for processing (adjust as needed)
            if len(text) > 15000:
                text = text[:15000]
                print(f"   ✂️ Truncated to {len(text):,} characters")
            
            # Process with our simple processor
            scenes = processor.analyze_story(text, book_id)
            
            if scenes:
                all_results[book_id] = {
                    'scenes': scenes,
                    'book_title': book_id.replace('_', ' ').title(),
                    'scene_count': len(scenes)
                }
                print(f"   ✅ Success: {len(scenes)} scenes")
            else:
                print(f"   ❌ No scenes found")
                
        except Exception as e:
            print(f"   ❌ Error processing {book_id}: {e}")
            continue
    
    print(f"\n🎊 CORPUS PROCESSING COMPLETE!")
    print(f"   📊 Successfully processed: {len(all_results)} books")
    print(f"   📋 Total scenes: {sum(r['scene_count'] for r in all_results.values())}")
    
    return all_results

# Uncomment the line below to run corpus processing
# corpus_results = process_entire_corpus()

print("✅ Corpus processing function ready!")
print("   🚀 Run: corpus_results = process_entire_corpus()")
print("   ⚠️  Note: This will take several minutes for the full corpus")
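
`process_entire_corpus()` truncates each book to 15,000 characters, so anything after that point is never analyzed. A possible alternative is to split the book on paragraph boundaries, as `WorkingStoryProcessor._split_text` does, and analyze each chunk in turn. The sketch below shows only the splitting step. `split_into_chunks` is a hypothetical name, and unlike `_split_text` it counts the two-character paragraph separator toward the limit.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars still becomes its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        # +2 accounts for the "\n\n" that would join the paragraph on
        if current and len(current) + 2 + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current.strip():
        chunks.append(current)
    return chunks


# Six paragraphs of 53 characters each; two fit within a 120-character chunk
book = "\n\n".join(f"Paragraph {n}. " + "x" * 40 for n in range(1, 7))
chunks = split_into_chunks(book, max_chars=120)
for i, chunk in enumerate(chunks, 1):
    print(f"chunk {i}: {len(chunk)} chars")
```

Each chunk could then be passed to `processor.analyze_story(chunk, book_id)`, with scene numbers offset by the number of scenes already collected.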

In [None]:
# ====================================================================
# HTML VISUALIZATION OUTPUT - FORMAT DATA FOR HTML FILE
# ====================================================================

def prepare_visualization_data(results_dict):
    """
    Convert processing results to format needed for HTML visualization
    
    Args:
        results_dict: Dictionary from corpus processing
        
    Returns:
        Dictionary ready for JSON export to HTML file
    """
    print("🎨 PREPARING VISUALIZATION DATA")
    print("=" * 50)
    
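    from datetime import date  # local import keeps this cell self-contained
    
    visualization_data = {
        "metadata": {
            "generated_date": date.today().isoformat(),  # was a hard-coded date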
            "total_books": len(results_dict),
            "total_scenes": sum(r['scene_count'] for r in results_dict.values()),
            "processor": "SimpleStoryProcessor"
        },
        "books": []
    }
    
    for book_id, book_data in results_dict.items():
        book_viz = {
            "book_id": book_id,
            "title": book_data['book_title'],
            "scene_count": book_data['scene_count'],
            "scenes": []
        }
        
        # Convert scenes to visualization format
        for scene in book_data['scenes']:
            scene_viz = {
                "scene_id": scene.scene_id,
                "scene_num": scene.scene_num,
                "text_preview": (scene.text[:200] + "...") if len(scene.text) > 200 else scene.text,
                "text_length": len(scene.text),
                "chapter": scene.chapter_num
            }
            book_viz["scenes"].append(scene_viz)
        
        visualization_data["books"].append(book_viz)
    
    print(f"✅ Prepared data for {len(results_dict)} books")
    return visualization_data

def export_for_html_visualization(visualization_data, filename="scene_analysis_visualization.json"):
    """Export data to JSON file for HTML visualization"""
    import json
    
    output_file = Path(filename)
    
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(visualization_data, f, indent=2, ensure_ascii=False)
    
    print(f"💾 Exported to: {output_file}")
    print(f"📁 File size: {output_file.stat().st_size:,} bytes")
    return output_file

# Example usage function
def create_html_visualization(results_dict):
    """Complete pipeline: results -> visualization data -> JSON file"""
    print("🔄 CREATING HTML VISUALIZATION DATA")
    print("=" * 50)
    
    # Step 1: Prepare data
    viz_data = prepare_visualization_data(results_dict)
    
    # Step 2: Export to JSON
    json_file = export_for_html_visualization(viz_data)
    
    print(f"\n🎉 VISUALIZATION READY!")
    print(f"   📊 Data: {viz_data['metadata']['total_books']} books, {viz_data['metadata']['total_scenes']} scenes")
    print(f"   📁 File: {json_file}")
    print(f"   🌐 Ready for HTML dashboard!")
    
    return viz_data, json_file

print("✅ HTML visualization functions ready!")
print("   📊 Use: viz_data, json_file = create_html_visualization(corpus_results)")
print("   💡 This will create the JSON file needed for the HTML dashboard")
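
Before pointing a dashboard at the JSON written by `export_for_html_visualization`, it can help to read the file back and confirm that the metadata totals match the book and scene lists. The sketch below does this for a small hand-written sample in the same shape as the output of `prepare_visualization_data`. `summarize_export` is a hypothetical helper, not part of this notebook.

```python
import json
import tempfile
from pathlib import Path


def summarize_export(path):
    """Load an exported visualization file and cross-check its metadata."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    books = data["books"]
    scene_total = sum(len(book["scenes"]) for book in books)
    meta = data["metadata"]
    return {
        "books": len(books),
        "scenes": scene_total,
        "consistent": (
            meta["total_books"] == len(books)
            and meta["total_scenes"] == scene_total
        ),
    }


# Hand-written sample mirroring the structure built above (not real results)
sample = {
    "metadata": {"total_books": 1, "total_scenes": 2},
    "books": [
        {
            "book_id": "test_story",
            "scenes": [
                {"scene_id": "test_story_scene_1", "scene_num": 1},
                {"scene_id": "test_story_scene_2", "scene_num": 2},
            ],
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_visualization.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_export(sample_path)

print(summary)
```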

In [None]:
# ====================================================================
# COMPLETE WORKFLOW EXAMPLE
# ====================================================================

print("📋 COMPLETE WORKFLOW GUIDE")
print("=" * 50)

print("""
🔄 **STEP-BY-STEP WORKFLOW:**

1️⃣ **Test with sample** (already done in previous cell):
   ✅ Small test to verify everything works

2️⃣ **Process entire corpus**:
   📚 corpus_results = process_entire_corpus()
   ⏱️  Takes several minutes for full corpus

3️⃣ **Create visualization data**:
   🎨 viz_data, json_file = create_html_visualization(corpus_results)
   📁 Creates JSON file for HTML dashboard

4️⃣ **Use the HTML file**:
   🌐 Open your HTML dashboard file
   📊 It will load the JSON data automatically

**Quick commands:**
""")

print("# For full corpus processing:")
print("corpus_results = process_entire_corpus()")
print()
print("# For HTML visualization:")
print("viz_data, json_file = create_html_visualization(corpus_results)")
print()
print("# Or do both in one go:")
print("corpus_results = process_entire_corpus()")
print("viz_data, json_file = create_html_visualization(corpus_results)")

print("\n💡 **Tips:**")
print("   • Start with the test cell to make sure everything works")
print("   • Full corpus processing takes time - be patient!")
print("   • The JSON file will be created in the current directory")
print("   • Your HTML dashboard should automatically load the new data")

## Summary of Features

This comprehensive story analysis system provides:

### 🤖 **Multi-Provider LLM Support**
- **Anthropic Claude**: Production-ready with large context (200k tokens)
- **OpenAI GPT**: Industry standard with excellent performance (128k tokens)  
- **Ollama Local**: Privacy-focused local processing (configurable models)
- **Easy Switching**: Interactive widgets for seamless provider changes
- **Unified Interface**: Same API calls work across all providers

### 🔄 **Multi-pass LLM Processing**
- **Pass 1**: Scene segmentation using LLM to identify natural scene breaks
- **Pass 2**: Narrator identification at the chapter level
- **Pass 3**: Goal analysis with dual categorization and evidence

### 📊 **Scalable Processing**
- Run on sample data for testing or full corpus for production
- Incremental progress tracking with resume capability
- Real-time progress reports during processing
- Robust error handling and retry logic
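
The cells shown above do not include the resume logic itself. One way it could work, sketched under the assumption of a small JSON checkpoint file listing finished book IDs, is to skip any book already recorded and rewrite the checkpoint after each book. All names here (`load_completed`, `process_with_resume`, `progress.json`) are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_completed(checkpoint: Path) -> set:
    """Return the set of book IDs recorded in the checkpoint file."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text(encoding="utf-8")))
    return set()


def process_with_resume(book_ids, process_one, checkpoint: Path):
    """Process each book once, saving progress after every book."""
    completed = load_completed(checkpoint)
    newly_done = []
    for book_id in book_ids:
        if book_id in completed:
            continue  # finished in an earlier run
        process_one(book_id)
        completed.add(book_id)
        newly_done.append(book_id)
        checkpoint.write_text(json.dumps(sorted(completed)), encoding="utf-8")
    return newly_done


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "progress.json"
    # Stand-in for the real per-book work
    first_run = process_with_resume(["book_a", "book_b"], lambda b: None, ckpt)
    second_run = process_with_resume(["book_a", "book_b", "book_c"], lambda b: None, ckpt)

print(first_run, second_run)
```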

### 🎛️ **Interactive GUI**
- Browse books, chapters, and scenes
- View scene text and associated goals
- Edit and save changes to analysis results
- Export data for visualization

### 🎯 **Enhanced Goal Analysis**
- **Dual categorization**: Internal/External + Academic/Family/Personal/etc.
- **Evidence-based**: LLM provides text evidence for each goal
- **Chapter-aware scene IDs**: Retains chapter membership
- **Confidence scoring**: LLM confidence for each analysis

### 📈 **Data Export & Visualization**
- JSON export optimized for website visualization
- Summary statistics and cross-tabulations
- Individual book files for lazy loading
- Compressed formats for web deployment
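
As an illustration of per-book files and compression, the sketch below writes each entry of a `books` list to its own gzip-compressed JSON file, so a web page could fetch one book at a time. It assumes the structure produced by `prepare_visualization_data`. The function names and the `.json.gz` naming are hypothetical choices, not the notebook's actual export code.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_book_files(viz_data, out_dir: Path):
    """Write one gzip-compressed JSON file per book; return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for book in viz_data["books"]:
        target = out_dir / f"{book['book_id']}.json.gz"
        with gzip.open(target, "wt", encoding="utf-8") as f:
            json.dump(book, f, ensure_ascii=False)
        written.append(target)
    return written


def read_book_file(path: Path):
    """Load a single compressed book file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Hand-written sample data (not real results)
sample = {
    "books": [
        {"book_id": "book_a", "scenes": []},
        {"book_id": "book_b", "scenes": [{"scene_id": "book_b_scene_1"}]},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_book_files(sample, Path(tmp) / "books")
    file_names = [p.name for p in paths]
    restored = read_book_file(paths[1])

print(file_names)
```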

### 🔧 **Technical Features**
- Multiple LLM provider support with unified interface
- Provider-specific optimizations (token limits, context sizes)
- API key management and connection testing
- Multiple narrator support per book
- Automatic progress saving and resumption
- Clean, modular architecture

### 🚀 **Getting Started**
1. **Choose Provider**: Use the configuration widget to select Anthropic, OpenAI, or Ollama
2. **Set Credentials**: Configure API keys or local Ollama setup
3. **Test Connection**: Verify your provider is working
4. **Run Analysis**: Process your corpus with `corpus_results = process_entire_corpus()`
5. **Review Results**: Use the interactive GUI to browse and edit findings

### 💡 **Provider Recommendations**
- **Anthropic Claude**: Best for large books, complex analysis, production use
- **OpenAI GPT**: Excellent balance of performance and cost
- **Ollama Local**: Perfect for privacy, experimentation, and cost-free processing

In [None]:
# Download a set of plain-text files from Project Gutenberg
import requests
from pathlib import Path


def download_gutenberg_books(book_ids, save_dir="gutenberg_books"):
    """Download plain text files for a list of Gutenberg book IDs."""
    save_path = Path(save_dir)
    save_path.mkdir(parents=True, exist_ok=True)
    for book_id in book_ids:
        # Assumes a "-0.txt" file exists; not every book is published under that name
        url = f"https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt"
        try:
            response = requests.get(url, timeout=30)  # avoid hanging the cell
        except requests.RequestException as e:
            print(f"❌ Failed to download book {book_id}: {e}")
            continue
        if response.status_code == 200:
            with open(save_path / f"{book_id}.txt", "w", encoding="utf-8") as f:
                f.write(response.text)
            print(f"✅ Downloaded book {book_id}")
        else:
            print(f"❌ Failed to download book {book_id} (status {response.status_code})")


# Example usage: download_gutenberg_books([1342, 1661, 2701])  # Pride and Prejudice, Sherlock Holmes, Moby Dick
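
Downloaded Gutenberg files include a license header and footer around the book text, which would otherwise be sent to the LLM as if it were story content. The sketch below trims them by looking for the `*** START OF ... PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines. The exact marker wording varies between files, so the patterns are an assumption and the text is returned unchanged (apart from stripping whitespace) when no marker is found. `strip_gutenberg_boilerplate` is a hypothetical helper.

```python
import re

# Marker wording is an assumption; older files say "THIS" instead of "THE"
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the START and END marker lines."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


raw = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n\n"
    "Chapter 1\n\nBody text.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License footer"
)
body = strip_gutenberg_boilerplate(raw)
print(body)
```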